NetBSD Problem Report #46896
From mhitch@net5.msu.montana.edu Mon Sep 3 20:36:30 2012
Return-Path: <mhitch@net5.msu.montana.edu>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id A088D63B86D
for <gnats-bugs@gnats.NetBSD.org>; Mon, 3 Sep 2012 20:36:30 +0000 (UTC)
Message-Id: <20120903190351.81288B434@net5.msu.montana.edu>
Date: Mon, 3 Sep 2012 13:03:51 -0600 (MDT)
From: mhitch@lightning.msu.montana.edu
Reply-To: mhitch@lightning.msu.montana.edu
To: gnats-bugs@gnats.NetBSD.org
Subject: iSCSI initiator ccb_pool gets corrupted
X-Send-Pr-Version: 3.95
>Number: 46896
>Category: kern
>Synopsis: iSCSI initiator ccb_pool gets corrupted
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: mhitch
>State: analyzed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Sep 03 20:40:00 +0000 2012
>Closed-Date:
>Last-Modified: Tue Sep 11 14:55:01 +0000 2012
>Originator: Michael L. Hitch
>Release: NetBSD 6.0_RC1 as of 19-Aug-2012
>Organization:
Montana State University
>Environment:
System: NetBSD net5.msu.montana.edu 6.0_RC1 NetBSD 6.0_RC1 (XEN3_DOM0) #43: Sun Sep 2 20:19:33 MDT 2012 mhitch@net8.msu.montana.edu:/home/mhitch/NetBSD-6/OBJ/amd64/home/mhitch/NetBSD-6/src/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
After updating to 6.0_RC1, I started a XEN DOMU kernel using an iSCSI
disk. I'm fairly certain that I had been able to run this for some time
previously (netbsd-6 tree as of 24-May). Shortly after starting the DOMU
kernel, the iSCSI initiator started reporting no ccbs:
Aug 30 00:20:11 net5 /netbsd: S2C1: No CCB in run_xfer
Aug 30 00:20:11 net5 /netbsd: sd1(iscsi0:0:0:0): adapter resource shortage
Aug 30 00:20:12 net5 /netbsd: S2C1: No CCB in run_xfer
Aug 30 00:20:12 net5 /netbsd: sd1(iscsi0:0:0:0): adapter resource shortage
I'm running a 6.0_RC1 XEN3_DOM0 kernel (with the iscsi initiator added
to the kernel config, since xen kernels won't load modules), and an i386
XEN3 DOMU running cacti (lots and lots of disk updates).
After writing a quick kernel groveler to extract information from the
various iSCSI initiator tables, I found that, indeed, the ccb_pool
head for the session showed it was empty. Dumping out the contents of
all the ccbs seemed to indicate they were all free, just no longer on the
free list.
Session 0xffffa00002945000: id=2
ccb_pool 0x0000000000000000:0xffffa0000294c588 ccb_throttled 0x0000000000000000
ccb[ 0] 0xffffa00002945208 next 0xffffa0000294d3f8 status 0 disp 0 ITT 80000200
...
ccb[55] 0xffffa0000294c378 next 0xffffa0000294c168 status 0 disp 0 ITT 49000237
ccb[56] 0xffffa0000294c588 next 0x0000000000000000 status 0 disp 0 ITT 89000238
ccb[57] 0xffffa0000294c798 next 0xffffa0000294c588 status 0 disp 0 ITT 87000239
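[For illustration: the dump above came from reading the session and ccb
structures out of kernel memory. A minimal sketch of that kind of
groveler using kvm(3) is below; the symbol name "_iscsi_sessions" and
the single-word read are assumptions for the sketch, not the real
driver layout - a real groveler would include the driver's headers and
walk every ccb the way the dump does.]

#include <sys/types.h>
#include <fcntl.h>
#include <kvm.h>
#include <limits.h>
#include <nlist.h>
#include <stdio.h>

int
main(void)
{
        char errbuf[_POSIX2_LINE_MAX];
        /* "_iscsi_sessions" is a hypothetical symbol for this sketch */
        struct nlist nl[] = { { "_iscsi_sessions" }, { NULL } };
        kvm_t *kd;
        u_long head;

        /* open the running kernel and /dev/mem; compile with -lkvm */
        kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
        if (kd == NULL) {
                fprintf(stderr, "kvm_openfiles: %s\n", errbuf);
                return 1;
        }
        if (kvm_nlist(kd, nl) != 0 || nl[0].n_value == 0) {
                fprintf(stderr, "symbol not found\n");
                return 1;
        }
        /* read the list head; a real groveler would chase the session
         * pointer and kvm_read() each ccb, printing it as above */
        if (kvm_read(kd, nl[0].n_value, &head, sizeof(head)) !=
            sizeof(head)) {
                fprintf(stderr, "kvm_read: %s\n", kvm_geterr(kd));
                return 1;
        }
        printf("session list head: %#lx\n", head);
        kvm_close(kd);
        return 0;
}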
I was not able to see anything obvious in the changes to the sys/dev/iscsi
source that might have caused this. I then added the ccbs_waiting queue
header to the dump, and noted that when this condition occurs, the tail
entry of that header points into the ccb_pool - certainly not correct.
This leads me to suspect that removing ccbs from ccbs_waiting and
adding them to the free pool has a race. From reading the code, it looks
to me like a ccb on the ccbs_waiting queue is passed to wake_ccb(), which
removes it from the ccbs_waiting queue. However, there appears to be
nothing preventing something else from getting the same ccb off the
ccbs_waiting queue and calling wake_ccb() on it as well. The first caller
wins, removing the ccb from ccbs_waiting and adding it to ccb_pool. The
second caller then tries to remove the same ccb from ccbs_waiting and add
it to ccb_pool again, with nasty results. I'm now working on seeing if
this is indeed the case (adding some debug code to check and print
information if it detects this occurring).
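[To make the suspected failure mode concrete: the plain queue(3)
TAILQ_REMOVE trusts the element's own link pointers and never checks
list membership. This small standalone sketch of a double wake on one
ccb reproduces both symptoms from the dump - the pool head reads empty
and the waiting queue's tail pointer ends up aimed at the pool:]

#include <sys/queue.h>
#include <stdio.h>

struct ccb {
        TAILQ_ENTRY(ccb) chain;
        int id;
};
TAILQ_HEAD(ccb_list, ccb);

int
main(void)
{
        struct ccb_list waiting = TAILQ_HEAD_INITIALIZER(waiting);
        struct ccb_list pool = TAILQ_HEAD_INITIALIZER(pool);
        struct ccb a = { .id = 1 };

        TAILQ_INSERT_TAIL(&waiting, &a, chain);

        /* first wake: the legitimate move from waiting to the pool */
        TAILQ_REMOVE(&waiting, &a, chain);
        TAILQ_INSERT_TAIL(&pool, &a, chain);

        /* second wake on the same ccb: this unlinks it from 'pool'
         * (clearing the pool head) and leaves waiting's tail pointer
         * aimed at the pool head - the corruption reported above */
        TAILQ_REMOVE(&waiting, &a, chain);

        printf("pool head: %p (ccb lost)\n", (void *)TAILQ_FIRST(&pool));
        printf("waiting tail ptr: %p, pool head addr: %p\n",
            (void *)waiting.tqh_last, (void *)&pool.tqh_first);
        return 0;
}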
>How-To-Repeat:
I suspect this problem is relatively rare, and that duplicating it needs
a setup similar to the one described above to generate enough random
activity against the iSCSI target.
>Fix:
If the problem is multiple processing of a ccb on the ccbs_waiting queue,
try to prevent that from happening, or at least prevent it from clobbering
the ccb_pool and ccbs_waiting queues.
>Release-Note:
>Audit-Trail:
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46896: iSCSI initiator ccb_pool gets corrupted
Date: Sat, 8 Sep 2012 20:17:29 -0600 (MDT)
>> How-To-Repeat:
> I suspect this problem is relatively rare, and needs something similar
> to my above described setup to get enough random activity with the iSCSI
> target to duplicate.
I had added tagged queueing support for iscsi in my netbsd-6 tree while
debugging the problem, and wanted to test tagged queueing in -current
before committing that change. With tagged queueing, things very quickly
went awry and hung when I tried to start the DOMU, so adding tagged
queueing should make this problem much more repeatable.
>> Fix:
> If the problem is multiple processing of a ccb on the ccbs_waiting queue,
> try to prevent that from happening, or at least prevent it from clobbering
> the ccb_pool and ccbs_waiting queues.
What I've discovered so far: I do indeed see ccbs passed to wake_ccb()
which are not on the ccbs_waiting queue. If I check the ccbs_waiting
queue and only remove the ccb if it is actually on that queue, I do not
see the ccb corruption. Also, I noticed that all the ccbs passed to
wake_ccb() have a disposition of CCBDISP_SCSIPI, which presumably means
they are the result of requests from the scsipi layer and will report
their status back to that layer. I don't yet know whether any ccbs with
CCBDISP_SCSIPI get added to the ccbs_waiting queue, and my cursory
attempt to follow the logic of the scsipi requests got me rather lost.
I think that the iscsi initiator working as well as it has for several
people is pure luck and happenstance.
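[A sketch of the membership check described above; the function and
type names here are illustrative stand-ins, not the actual debug
patch:]

/*
 * Only unlink the ccb if it really is on the connection's
 * ccbs_waiting queue; TAILQ_REMOVE on an element that is on some
 * other list corrupts that other list instead.
 */
static void
wake_ccb_guarded(connection_t *conn, ccb_t *ccb)
{
        ccb_t *w;

        TAILQ_FOREACH(w, &conn->ccbs_waiting, chain) {
                if (w == ccb) {
                        TAILQ_REMOVE(&conn->ccbs_waiting, ccb, chain);
                        break;
                }
        }
        /* ...then complete the wakeup and return the ccb to the pool */
}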
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46896: iSCSI initiator ccb_pool gets corrupted
Date: Sat, 8 Sep 2012 23:58:28 -0600 (MDT)
Found it! And it was a bug introduced when I updated.
An update by mlelstv had included an extraneous change left over from an
earlier attempt of mine to fix a system freeze when writing > 64K. I had
done my fix differently when I committed my changes, but mlelstv's change
still contained that earlier attempt. That error was causing the ccb to
not be added to the ccbs_waiting queue properly.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
Responsible-Changed-From-To: kern-bug-people->mhitch
Responsible-Changed-By: mhitch@NetBSD.org
Responsible-Changed-When: Sun, 09 Sep 2012 06:01:43 +0000
Responsible-Changed-Why:
I'm taking it.
State-Changed-From-To: open->analyzed
State-Changed-By: mhitch@NetBSD.org
State-Changed-When: Sun, 09 Sep 2012 06:01:43 +0000
State-Changed-Why:
I've analyzed the problem.
From: "Michael L. Hitch" <mhitch@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46896 CVS commit: src/sys/dev/iscsi
Date: Sun, 9 Sep 2012 06:06:30 +0000
Module Name: src
Committed By: mhitch
Date: Sun Sep 9 06:06:30 UTC 2012
Modified Files:
src/sys/dev/iscsi: iscsi_send.c
Log Message:
Remove an erroneous change that crept in. It was part of an earlier,
incorrect fix I had done for writing > 64K bytes. It was still in the
set of changes mlelstv had committed. I noticed it at the time, but did
not think it was problematic - until now. This fixes the ccb corruption
I was seeing and fixes PR kern/46896.
To generate a diff of this commit:
cvs rdiff -u -r1.6 -r1.7 src/sys/dev/iscsi/iscsi_send.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: mhitch@NetBSD.org
Subject: Re: PR/46896 CVS commit: src/sys/dev/iscsi
Date: Sun, 9 Sep 2012 09:24:04 +0200
With this patch applied to -6 I still see pretty much repeatable lockups
(endless loop) here:
(gdb) list *(iscsiioctl+0xb76)
0x1ca0 is in iscsiioctl (/usr/src-6/sys/dev/iscsi/iscsi_ioctl.c:782).
777 session->active_connections++;
778
779 TAILQ_INIT(&old_waiting);
780 while ((ccb = TAILQ_FIRST(&connection->ccbs_waiting)) != NULL) {
781 TAILQ_REMOVE(&connection->ccbs_waiting, ccb, chain);
782 TAILQ_INSERT_TAIL(&old_waiting, ccb, chain);
783 }
784
785 init_sernum(&connection->StatSN_buf);
786 wakeup(connection);
However, my randomly failing device discovery seems to be fixed.
Martin
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: PR/46896 CVS commit: src/sys/dev/iscsi
Date: Sun, 9 Sep 2012 01:37:30 -0600 (MDT)
On Sun, 9 Sep 2012, Martin Husemann wrote:
> With this patch applied to -6 I still see pretty much repeatable locks up
> (endless loop) here:
After I applied that to -6, I started having trouble as well, although
in my case it seems to be losing the session. I went back to my -current
kernel, and it also seems to have the same problem.
My -6 kernel has ISCSI_DEBUG set to 0 so I can easily patch the debug
level, and when I changed the level to 1, I got this:
S2C1: CheckCmdSN resending - CmdSN=578093, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578094, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578095, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578096, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578097, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578098, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578099, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578100, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578101, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578102, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578103, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578104, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578105, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578106, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578107, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578108, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578109, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578110, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578111, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578112, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578113, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578114, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578115, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578116, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578117, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578118, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578119, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578120, ExpCmdSn=578092
S2C1: CheckCmdSN resending - CmdSN=578121, ExpCmdSn=578092
Received Reject PDU, reason = 4, data_len = 48
Received Asynch PDU, Event 1
ISCSI: kill_session 2, status 44, logout 0, recover 0
S2C1: Write failed sock 0xffffa00002514f00 (ret: 32, req: 48, resid: 48)
S2C1: *** Connection Error, status=18, logout=-1, state=3
S2C1: Kill_connection: terminating=0, status=18, logout=-1, state=3
Session 2 not found
[This is with tagged queueing enabled, so I've got lots more
transactions going on.]
I need to look at this some more; I'm not sure if I should back out my
last change, although setting ccb->disp is not the correct thing to do at
that point (later code relies on ccb->disp being <= CCBDISP_NOWAIT to add
the ccb to the ccbs_waiting queue).
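[For context, a sketch of the disposition constraint mentioned above.
The enum ordering is assumed for illustration only; the real
definitions are in the driver headers:]

/* assumed ordering, for illustration */
typedef enum {
        CCBDISP_UNUSED,
        CCBDISP_FREE,
        CCBDISP_NOWAIT,
        CCBDISP_SCSIPI,
        CCBDISP_WAIT
} ccb_disp_t;

/* ...later, in the send path: dispositions at or below
 * CCBDISP_NOWAIT are parked on ccbs_waiting until wake_ccb(),
 * so overwriting ccb->disp earlier changes whether the ccb
 * ever lands on that queue */
if (ccb->disp <= CCBDISP_NOWAIT)
        TAILQ_INSERT_TAIL(&conn->ccbs_waiting, ccb, chain);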
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
From: "Michael L. Hitch" <mhitch@lightning.msu.montana.edu>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46896: iSCSI initiator ccb_pool gets corrupted
Date: Tue, 11 Sep 2012 08:51:34 -0600 (MDT)
> Found it! And it was a bug introduced when I updated.
>
> An update by mlelstv had included an extraneous change that had been an
> earlier attempt by me to fix a system freeze when writing > 64K. I had
> done my fix differently when I commited my changes, but mlelstv's change
> still had that. That error was causing the ccb to not be added to the
> ccbs_waiting queue properly..
Unfortunately, this just changes the failure to something else. After
running for a while, I eventually get these messages (with ISCSI_DEBUG 1):
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676370, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676371, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676372, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676373, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676374, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676375, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676376, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676377, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676378, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676379, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676380, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676381, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676382, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676383, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676384, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676385, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676386, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676387, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676388, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676389, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676390, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676391, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676392, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676393, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676394, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676395, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676396, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676397, ExpCmdSn=676369
/netbsd: S2C1: CheckCmdSN resending - CmdSN=676398, ExpCmdSn=676369
/netbsd: Received Reject PDU, reason = 4, data_len = 48
/netbsd: Received Asynch PDU, Event 1
/netbsd: ISCSI: kill_session 2, status 44, logout 0, recover 0
/netbsd: S2C1: Write failed sock 0xffffa00003471420 (ret: 32, req: 48, resid: 48)
/netbsd: S2C1: *** Connection Error, status=18, logout=-1, state=3
/netbsd: S2C1: Kill_connection: terminating=0, status=18, logout=-1, state=3
/netbsd: Session 2 not found
I'm looking at an alternative method of resolving this.
--
Michael L. Hitch mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA
>Unformatted: