NetBSD Problem Report #46896
From mhitch@net5.msu.montana.edu Mon Sep 3 20:36:30 2012
Return-Path: <mhitch@net5.msu.montana.edu>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id A088D63B86D
for <gnats-bugs@gnats.NetBSD.org>; Mon, 3 Sep 2012 20:36:30 +0000 (UTC)
Message-Id: <20120903190351.81288B434@net5.msu.montana.edu>
Date: Mon, 3 Sep 2012 13:03:51 -0600 (MDT)
From: mhitch@lightning.msu.montana.edu
Reply-To: mhitch@lightning.msu.montana.edu
To: gnats-bugs@gnats.NetBSD.org
Subject: iSCSI initiator ccb_pool gets corrupted
X-Send-Pr-Version: 3.95
>Number: 46896
>Category: kern
>Synopsis: iSCSI initiator ccb_pool gets corrupted
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: mhitch
>State: analyzed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Sep 03 20:40:00 +0000 2012
>Closed-Date:
>Last-Modified: Tue Sep 11 14:55:01 +0000 2012
>Originator: Michael L. Hitch
>Release: NetBSD 6.0_RC1 as of 19-Aug-2012
>Organization:
Montana State University
>Environment:
System: NetBSD net5.msu.montana.edu 6.0_RC1 NetBSD 6.0_RC1 (XEN3_DOM0) #43: Sun Sep 2 20:19:33 MDT 2012 mhitch@net8.msu.montana.edu:/home/mhitch/NetBSD-6/OBJ/amd64/home/mhitch/NetBSD-6/src/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
After updating to 6.0_RC1, I started a XEN DOMU kernel using an iSCSI
disk. I'm fairly certain that I had been able to run this for some time
previously (netbsd-6 tree as of 24-May). Shortly after starting the DOMU
kernel, the iSCSI initiator started reporting no ccbs:
Aug 30 00:20:11 net5 /netbsd: S2C1: No CCB in run_xfer
Aug 30 00:20:11 net5 /netbsd: sd1(iscsi0:0:0:0): adapter resource shortage
Aug 30 00:20:12 net5 /netbsd: S2C1: No CCB in run_xfer
Aug 30 00:20:12 net5 /netbsd: sd1(iscsi0:0:0:0): adapter resource shortage
I'm running a 6.0_RC1 XEN3_DOM0 kernel (with the iscsi initiator added
to the kernel config, since xen kernels won't load modules), and an i386
XEN3 DOMU running cacti (lots and lots of disk updates).
After writing a quick kernel groveler to extract information from the
various iSCSI initiator tables, I found that, indeed, the ccb_pool
head for the session showed it was empty. Dumping out the contents of
all the ccbs seemed to indicate they were all free, just no longer on the
free list.
Session 0xffffa00002945000: id=2
ccb_pool 0x0000000000000000:0xffffa0000294c588 ccb_throttled 0x0000000000000000
ccb[ 0] 0xffffa00002945208 next 0xffffa0000294d3f8 status 0 disp 0 ITT 80000200
...
ccb[55] 0xffffa0000294c378 next 0xffffa0000294c168 status 0 disp 0 ITT 49000237
ccb[56] 0xffffa0000294c588 next 0x0000000000000000 status 0 disp 0 ITT 89000238
ccb[57] 0xffffa0000294c798 next 0xffffa0000294c588 status 0 disp 0 ITT 87000239
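[For illustration: the dump above came from reading the session and ccb
structures out of kernel memory. A minimal sketch of that kind of
groveler using kvm(3) is below; the symbol name "_iscsi_sessions" and
the single-word read are assumptions for the sketch, not the real
driver layout - a real groveler would include the driver's headers and
walk every ccb the way the dump does.]

#include <sys/types.h>
#include <fcntl.h>
#include <kvm.h>
#include <limits.h>
#include <nlist.h>
#include <stdio.h>

int
main(void)
{
        char errbuf[_POSIX2_LINE_MAX];
        /* "_iscsi_sessions" is a hypothetical symbol for this sketch */
        struct nlist nl[] = { { "_iscsi_sessions" }, { NULL } };
        kvm_t *kd;
        u_long head;

        /* open the running kernel and /dev/mem; compile with -lkvm */
        kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
        if (kd == NULL) {
                fprintf(stderr, "kvm_openfiles: %s\n", errbuf);
                return 1;
        }
        if (kvm_nlist(kd, nl) != 0 || nl[0].n_value == 0) {
                fprintf(stderr, "symbol not found\n");
                return 1;
        }
        /* read the list head; a real groveler would chase the session
         * pointer and kvm_read() each ccb, printing it as above */
        if (kvm_read(kd, nl[0].n_value, &head, sizeof(head)) !=
            sizeof(head)) {
                fprintf(stderr, "kvm_read: %s\n", kvm_geterr(kd));
                return 1;
        }
        printf("session list head: %#lx\n", head);
        kvm_close(kd);
        return 0;
}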
I was not able to see anything obvious in the changes to the sys/dev/iscsi
source that might have caused this. I then added the ccbs_waiting queue
header to the dump, and noted that when this condition occurs, the tail
entry of that header points into the ccb_pool - certainly not correct.
This leads me to suspect that removing ccbs from ccbs_waiting and
adding them to the free pool has a race. From reading the code, it looks
to me like a ccb on the ccbs_waiting queue is passed to wake_ccb(), which
removes it from the ccbs_waiting queue. However, there appears to be
nothing preventing something else from getting the same ccb off the
ccbs_waiting queue and calling wake_ccb() on it as well. The first caller
wins, removing the ccb from ccbs_waiting and adding it to ccb_pool. The
second caller then tries to remove the same ccb from ccbs_waiting and add
it to ccb_pool again, with nasty results. I'm now working on seeing if
this is indeed the case (adding some debug code to check and print
information if it detects this occurring).
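[To make the suspected failure mode concrete: the plain queue(3)
TAILQ_REMOVE trusts the element's own link pointers and never checks
list membership. This small standalone sketch of a double wake on one
ccb reproduces both symptoms from the dump - the pool head reads empty
and the waiting queue's tail pointer ends up aimed at the pool:]

#include <sys/queue.h>
#include <stdio.h>

struct ccb {
        TAILQ_ENTRY(ccb) chain;
        int id;
};
TAILQ_HEAD(ccb_list, ccb);

int
main(void)
{
        struct ccb_list waiting = TAILQ_HEAD_INITIALIZER(waiting);
        struct ccb_list pool = TAILQ_HEAD_INITIALIZER(pool);
        struct ccb a = { .id = 1 };

        TAILQ_INSERT_TAIL(&waiting, &a, chain);

        /* first wake: the legitimate move from waiting to the pool */
        TAILQ_REMOVE(&waiting, &a, chain);
        TAILQ_INSERT_TAIL(&pool, &a, chain);

        /* second wake on the same ccb: this unlinks it from 'pool'
         * (clearing the pool head) and leaves waiting's tail pointer
         * aimed at the pool head - the corruption reported above */
        TAILQ_REMOVE(&waiting, &a, chain);

        printf("pool head: %p (ccb lost)\n", (void *)TAILQ_FIRST(&pool));
        printf("waiting tail ptr: %p, pool head addr: %p\n",
            (void *)waiting.tqh_last, (void *)&pool.tqh_first);
        return 0;
}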
>How-To-Repeat:
I suspect this problem is relatively rare, and that duplicating it needs
a setup similar to the one described above to generate enough random
activity against the iSCSI target.
>Fix:
If the problem is multiple processing of a ccb on the ccbs_waiting queue,
try to prevent that from happening, or at least prevent it from clobbering
the ccb_pool and ccbs_waiting queues.
>Release-Note:
>Audit-Trail:
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46896: iSCSI initiator ccb_pool gets corrupted
Date: Sat, 8 Sep 2012 20:17:29 -0600 (MDT)
>> How-To-Repeat:
> I suspect this problem is relatively rare, and needs something similar
> to my above described setup to get enough random activity with the iSCSI
> target to duplicate.
I had added tagged queueing support for iscsi in my netbsd-6 tree while
debugging the problem, and wanted to test tagged queueing in -current
before committing that change. With tagged queueing, things very quickly
went awry and hung when I tried to start the DOMU, so adding tagged
queueing should make this problem much more repeatable.
>> Fix:
> If the problem is multiple processing of a ccb on the ccbs_waiting queue,
> try to prevent that from happening, or at least prevent it from clobbering
> the ccb_pool and ccbs_waiting queues.
What I've discovered so far: I do indeed see ccbs passed to wake_ccb()
which are not on the ccbs_waiting queue. If I check the ccbs_waiting
queue and only remove the ccb if it is actually on that queue, I do not
see the ccb corruption. Also, I noticed that all the ccbs passed to
wake_ccb() have a disposition of CCBDISP_SCSIPI, which presumably means
they are the result of requests from the scsipi layer and will report
their status back to that layer. I don't yet know whether any ccbs with
CCBDISP_SCSIPI get added to the ccbs_waiting queue, and my cursory
attempt to follow the logic of the scsipi requests got me rather lost.
I think that the iscsi initiator working as well as it has for several
people is pure luck and happenstance.
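[A sketch of the membership check described above; the function and
type names here are illustrative stand-ins, not the actual debug
patch:]

/*
 * Only unlink the ccb if it really is on the connection's
 * ccbs_waiting queue; TAILQ_REMOVE on an element that is on some
 * other list corrupts that other list instead.
 */
static void
wake_ccb_guarded(connection_t *conn, ccb_t *ccb)
{
        ccb_t *w;

        TAILQ_FOREACH(w, &conn->ccbs_waiting, chain) {
                if (w == ccb) {
                        TAILQ_REMOVE(&conn->ccbs_waiting, ccb, chain);
                        break;
                }
        }
        /* ...then complete the wakeup and return the ccb to the pool */
}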
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46896: iSCSI initiator ccb_pool gets corrupted
Date: Sat, 8 Sep 2012 23:58:28 -0600 (MDT)
Found it! And it was a bug introduced when I updated.
An update by mlelstv had included an extraneous change left over from an
earlier attempt of mine to fix a system freeze when writing > 64K. I had
done my fix differently when I committed my changes, but mlelstv's change
still contained that earlier attempt. That error was causing the ccb to
not be added to the ccbs_waiting queue properly.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
Responsible-Changed-From-To: kern-bug-people->mhitch
Responsible-Changed-By: mhitch@NetBSD.org
Responsible-Changed-When: Sun, 09 Sep 2012 06:01:43 +0000
Responsible-Changed-Why:
I'm taking it.
State-Changed-From-To: open->analyzed
State-Changed-By: mhitch@NetBSD.org
State-Changed-When: Sun, 09 Sep 2012 06:01:43 +0000
State-Changed-Why:
I've analyzed the problem.
From: "Michael L. Hitch" <mhitch@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46896 CVS commit: src/sys/dev/iscsi
Date: Sun, 9 Sep 2012 06:06:30 +0000
Module Name: src
Committed By: mhitch
Date: Sun Sep 9 06:06:30 UTC 2012
Modified Files:
src/sys/dev/iscsi: iscsi_send.c
Log Message:
Remove an erroneous change that crept in. It was part of an earlier,
incorrect fix I had done for writing > 64K bytes. It was still in the
set of changes mlelstv had committed. I noticed it at the time, but did
not think it was problematic - until now. This fixes the ccb corruption
I was seeing and fixes PR kern/46896.
To generate a diff of this commit:
cvs rdiff -u -r1.6 -r1.7 src/sys/dev/iscsi/iscsi_send.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: mhitch@NetBSD.org
Subject: Re: PR/46896 CVS commit: src/sys/dev/iscsi
Date: Sun, 9 Sep 2012 09:24:04 +0200
With this patch applied to -6 I still see pretty much repeatable lockups
(endless loop) here:
(gdb) list *(iscsiioctl+0xb76)
0x1ca0 is in iscsiioctl (/usr/src-6/sys/dev/iscsi/iscsi_ioctl.c:782).
777 session->active_connections++;
778
779 TAILQ_INIT(&old_waiting);
780 while ((ccb = TAILQ_FIRST(&connection->ccbs_waiting)) != NULL) {
781 TAILQ_REMOVE(&connection->ccbs_waiting, ccb, chain);
782 TAILQ_INSERT_TAIL(&old_waiting, ccb, chain);
783 }
784
785 init_sernum(&connection->StatSN_buf);
786 wakeup(connection);
However, my randomly failing device discovery seems to be fixed.
Martin
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: PR/46896 CVS commit: src/sys/dev/iscsi
Date: Sun, 9 Sep 2012 01:37:30 -0600 (MDT)
On Sun, 9 Sep 2012, Martin Husemann wrote:
> With this patch applied to -6 I still see pretty much repeatable locks up
> (endless loop) here:
After I applied that to -6, I started having trouble as well, although
in my case it seems to be losing the session. I went back to my -current
kernel, and it also seems to have the same problem.
My -6 kernel has ISCSI_DEBUG set to 0 so I can easily patch the debug
level, and when I changed the level to 1, I got this:
S2C1: CheckCmdSN resending - CmdSN=578093, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578094, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578095, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578096, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578097, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578098, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578099, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578100, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578101, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578102, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578103, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578104, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578105, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578106, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578107, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578108, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578109, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578110, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578111, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578112, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578113, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578114, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578115, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578116, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578117, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578118, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578119, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578120, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578121, ExpCmdSn=578092
Received Reject PDU, reason = 4, data_len = 48
Received Asynch PDU, Event 1
ISCSI: kill_session 2, status 44, logout 0, recover 0
S2C1: Write failed sock 0xffffa00002514f00 (ret: 32, req: 48, resid: 48)
S2C1: *** Connection Error, status=18, logout=-1, state=3
S2C1: Kill_connection: terminating=0, status=18, logout=-1, state=3
Session 2 not found
[This is with tagged queueing enabled, so I've got lots more
transactions going on.]
I need to look at this some more; I'm not sure if I should back out my
last change, although setting ccb->disp is not the correct thing to do at
that point (later code relies on ccb->disp being <= CCBDISP_NOWAIT to add
the ccb to the ccbs_waiting queue).
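[For context, a sketch of the disposition constraint mentioned above.
The enum ordering is assumed for illustration only; the real
definitions are in the driver headers:]

/* assumed ordering, for illustration */
typedef enum {
        CCBDISP_UNUSED,
        CCBDISP_FREE,
        CCBDISP_NOWAIT,
        CCBDISP_SCSIPI,
        CCBDISP_WAIT
} ccb_disp_t;

/* ...later, in the send path: dispositions at or below
 * CCBDISP_NOWAIT are parked on ccbs_waiting until wake_ccb(),
 * so overwriting ccb->disp earlier changes whether the ccb
 * ever lands on that queue */
if (ccb->disp <= CCBDISP_NOWAIT)
        TAILQ_INSERT_TAIL(&conn->ccbs_waiting, ccb, chain);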
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46896: iSCSI initiator ccb_pool gets corrupted
Date: Tue, 11 Sep 2012 08:51:34 -0600 (MDT)
> Found it! And it was a bug introduced when I updated.
>
> An update by mlelstv had included an extraneous change that had been an
> earlier attempt by me to fix a system freeze when writing > 64K. I had
> done my fix differently when I commited my changes, but mlelstv's change
> still had that. That error was causing the ccb to not be added to the
> ccbs_waiting queue properly..
Unfortunately, this just changes the failure to something else. After
running for a while, I eventually get these messages (with ISCSI_DEBUG 1):
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676370, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676371, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676372, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676373, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676374, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676375, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676376, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676377, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676378, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676379, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676380, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676381, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676382, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676383, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676384, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676385, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676386, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676387, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676388, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676389, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676390, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676391, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676392, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676393, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676394, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676395, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676396, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676397, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676398, ExpCmdSn=676369
/netbsd: Received Reject PDU, reason = 4, data_len = 48
/netbsd: Received Asynch PDU, Event 1
/netbsd: ISCSI: kill_session 2, status 44, logout 0, recover 0
/netbsd: S2C1: Write failed sock 0xffffa00003471420 (ret: 32, req: 48, resid: 48)
/netbsd: S2C1: *** Connection Error, status=18, logout=-1, state=3
/netbsd: S2C1: Kill_connection: terminating=0, status=18, logout=-1, state=3
/netbsd: Session 2 not found
I'm looking at an alternative method of resolving this.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
>Unformatted: