NetBSD Problem Report #43318
From www@NetBSD.org Sun May 16 23:08:42 2010
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 095DF63C34A
for <gnats-bugs@gnats.NetBSD.org>; Sun, 16 May 2010 23:08:42 +0000 (UTC)
Message-Id: <20100516230841.9BAD363BC69@www.NetBSD.org>
Date: Sun, 16 May 2010 23:08:41 +0000 (UTC)
From: scotte@warped.com
Reply-To: scotte@warped.com
To: gnats-bugs@NetBSD.org
Subject: Recent -current asserts with USB activity
X-Send-Pr-Version: www-1.0
>Number: 43318
>Category: kern
>Synopsis: Recent -current asserts with USB activity
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: bouyer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun May 16 23:10:00 +0000 2010
>Closed-Date: Wed May 26 17:40:05 +0000 2010
>Last-Modified: Fri Jun 11 23:40:01 +0000 2010
>Originator: Scott Ellis
>Release: amd64-current (5.99.29)
>Organization:
None
>Environment:
NetBSD intrepid 5.99.29 NetBSD 5.99.29 (INTREPID.P5W.DEBUG) #1: Sun May 16 08:15:45 PDT 2010 scotte@intrepid:/nbu/source/netbsd/src/obj.amd64/nbu/source/netbsd/src/sys/arch/amd64/compile/INTREPID.P5W.DEBUG amd64
>Description:
USB activity eventually causes kernel to assert and panic. This is evident from reboots in my daily scripts (backing up via rsync to a umass disk), but also seen manifesting itself with uplcom not actually sending any data, and then eventually a kernel panic. Kernel reports:
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
panic: kernel diagnostic assertion "c->c_func != NULL" failed: file "/nbu/source/netbsd/src/sys/kern/kern_timeout.c", line 331
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff8014a8ad cs 8 rflags 246 cr2 7f7ffdb51020 cpl 8 rsp ffff8000534ba530
Stopped in pid 0.3 (system) at netbsd:breakpoint+0x5: leave
db{0}>
db{0}> bt
breakpoint() at netbsd:breakpoint+0x5
panic() at netbsd:panic+0x2ba
kern_assert() at netbsd:kern_assert+0x2d
callout_schedule_locked() at netbsd:callout_schedule_locked+0x168
syn_cache_add() at netbsd:syn_cache_add+0x4ca
tcp_input() at netbsd:tcp_input+0x910
ip_input() at netbsd:ip_input+0x5b0
ipintr() at netbsd:ipintr+0x90
softint_dispatch() at netbsd:softint_dispatch+0xcc
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xffff8000534bad70
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
db{0}> machine cpu 1
using CPU 1
db{0}> bt
x86_mwait() at netbsd:x86_mwait+0xd
idle_loop() at netbsd:idle_loop+0x199
?() at 0
db{0}>
Appears to have been introduced sometime after April 21st, and before May 14th (my gap in staying with -current). I don't have an exact workload to trigger it, but from the backtrace it appears that it's not "load" related, but more a timing problem.
>How-To-Repeat:
Generate USB traffic on uplcom or umass.
Wait.
(Sad steps, I know.)
>Fix:
>Release-Note:
>Audit-Trail:
From: "Scott Ellis" <scotte@warped.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Mon, 17 May 2010 09:27:04 -0700
Extra data-point: System rebooted w/o use of umass, during the daily run.
There is constant USB traffic (10's of bytes every second @ 9600 baud) on
uplcom, but otherwise USB isn't being used in the system.
Happened at 3am, so I wasn't able to capture the panic message, but I
suspect it's very similar to what was shown in the PR.
From: Scott Ellis <scotte@intrepid.warped.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Tue, 18 May 2010 07:52:37 -0700
Same behavior happens with a kernel from May 8, 2010, so it's some
change between April 22 and May 8. Continuing to do a binary search to
see if I can narrow down the window of breakage.
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Tue, 18 May 2010 21:31:06 +0200
On Sun, May 16, 2010 at 11:10:00PM +0000, scotte@warped.com wrote:
> panic: kernel diagnostic assertion "c->c_func != NULL" failed: file "/nbu/source/netbsd/src/sys/kern/kern_timeout.c", line 331
> fatal breakpoint trap in supervisor mode
> trap type 1 code 0 rip ffffffff8014a8ad cs 8 rflags 246 cr2 7f7ffdb51020 cpl 8 rsp ffff8000534ba530
> Stopped in pid 0.3 (system) at netbsd:breakpoint+0x5: leave
> db{0}>
> db{0}> bt
> breakpoint() at netbsd:breakpoint+0x5
> panic() at netbsd:panic+0x2ba
> kern_assert() at netbsd:kern_assert+0x2d
> callout_schedule_locked() at netbsd:callout_schedule_locked+0x168
> syn_cache_add() at netbsd:syn_cache_add+0x4ca
Can you see what syn_cache_add+0x4ca points to in kernel sources ?
Maybe try reverting tcp_input.c 1.304 and see if it helps ...
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Scott Ellis <scotte@intrepid.warped.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Wed, 19 May 2010 08:22:03 -0700
On 5/18/2010 12:31 PM, Manuel Bouyer wrote:
> Can you see what syn_cache_add+0x4ca points to in kernel sources ?
> Maybe try reverting tcp_input.c 1.304 and see if it helps ...
>
Reverting to tcp_input.c 1.303 seems to have solved my problem. ~5 hour
max uptime with little USB activity has turned into 16 hours with
significant umass activity (I did a dry-run of the system backup, which
is what repeatedly killed the system before).
Thanks for the clue. I'm not cluefull enough to know WHY the changes in
1.304 are a problem, but at least now it's clear that it is the problem. :-)
ScottE
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
scotte@warped.com
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Wed, 19 May 2010 18:32:26 +0200
On Wed, May 19, 2010 at 04:05:03PM +0000, Scott Ellis wrote:
> On 5/18/2010 12:31 PM, Manuel Bouyer wrote:
> > Can you see what syn_cache_add+0x4ca points to in kernel sources ?
> > Maybe try reverting tcp_input.c 1.304 and see if it helps ...
> >
>
> Reverting to tcp_input.c 1.303 seems to have solved my problem. ~5 hour
> max uptime with little USB activity has turned into 16 hours with
> significant umass activity (I did a dry-run of the system backup, which
> is what repeatedly killed the system before).
>
> Thanks for the clue. I'm not cluefull enough to know WHY the changes in
> 1.304 are a problem, but at least now it's clear that it is the problem. :-)
Then can you try the attached patch, and report if the
printf("syn dropped\n") does show up in the kernel log ?
Also, it should fix the assert problem.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
scotte@warped.com
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Wed, 19 May 2010 19:07:17 +0200
--y0ulUmNC+osPPQO6
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Wed, May 19, 2010 at 06:32:26PM +0200, Manuel Bouyer wrote:
> Then can you try the attached patch, and report if the
> printf("syn dropped\n") does show up in the kernel log ?
> Also, it should fix the assert problem.
With the patch, this time
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
--y0ulUmNC+osPPQO6
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff
Index: tcp_input.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_input.c,v
retrieving revision 1.304
diff -u -p -u -r1.304 tcp_input.c
--- tcp_input.c 21 Apr 2010 20:40:16 -0000 1.304
+++ tcp_input.c 19 May 2010 16:31:16 -0000
@@ -4207,6 +4207,12 @@ syn_cache_add(struct sockaddr *src, stru
syn_cache_insert(sc, tp);
} else {
s = splsoftnet();
+ printf("syn dropped\n");
+ /*
+ * syn_cache_put() will try to schedule the timer, so
+ * we need to initialize it
+ */
+ SYN_CACHE_TIMER_ARM(sc);
syn_cache_put(sc);
splx(s);
TCP_STATINC(TCP_STAT_SC_DROPPED);
--y0ulUmNC+osPPQO6--
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
scotte@warped.com
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Fri, 21 May 2010 10:48:00 +0200
On Wed, May 19, 2010 at 06:32:26PM +0200, Manuel Bouyer wrote:
> Then can you try the attached patch, and report if the
> printf("syn dropped\n") does show up in the kernel log ?
> Also, it should fix the assert problem.
Hello,
any news about testing this ? I'd really like to get this in 5.1 if
possible.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: "Scott Ellis" <scotte@warped.com>
To: "Manuel Bouyer" <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org,
kern-bug-people@netbsd.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Fri, 21 May 2010 09:35:10 -0700
> On Wed, May 19, 2010 at 06:32:26PM +0200, Manuel Bouyer wrote:
> any news about testing this ? I'd really like to get this in 5.1 if
> possible.
Installed the patch last night, but haven't done any significant testing
yet (read: Significant USB activity). That said, this morning the system
has remained up for 14 hours (which is about 3x what it did with 1.304 of
tcp_input.c), and I see two instances of "syn dropped" in dmesg.
I'd be happy declaring victory, and can work on stressing it more this
weekend.
From: Scott Ellis <scotte@intrepid.warped.com>
To: gnats-bugs@NetBSD.org
Cc: Scott Ellis <scotte@warped.com>, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Sat, 22 May 2010 10:37:22 -0700
On 5/21/2010 9:40 AM, Scott Ellis wrote:
> I'd be happy declaring victory, and can work on stressing it more this
> weekend.
Did my usual USB test which would tickle this issue, and had no crashes.
There are a number of "syn dropped" messages, however, so I think the
patch is touching the right area.
Recommend commit (w/o the printf, duh), and close this CR. Thanks for
the speedy fix Manuel!
Responsible-Changed-From-To: kern-bug-people->bouyer
Responsible-Changed-By: bouyer@NetBSD.org
Responsible-Changed-When: Wed, 26 May 2010 17:40:05 +0000
Responsible-Changed-Why:
Related to tcp_input.c 1.304
State-Changed-From-To: open->closed
State-Changed-By: bouyer@NetBSD.org
State-Changed-When: Wed, 26 May 2010 17:40:05 +0000
State-Changed-Why:
Proposed patch has been commited.
From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/43318 CVS commit: src/sys/netinet
Date: Wed, 26 May 2010 17:38:29 +0000
Module Name: src
Committed By: bouyer
Date: Wed May 26 17:38:29 UTC 2010
Modified Files:
src/sys/netinet: tcp_input.c
Log Message:
Make sure SYN_CACHE_TIMER_ARM() has been run before calling syn_cache_put()
as it will reschedule the timer. Fixes PR kern/43318.
To generate a diff of this commit:
cvs rdiff -u -r1.304 -r1.305 src/sys/netinet/tcp_input.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Jeff Rizzo <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/43318 CVS commit: [netbsd-5] src/sys/netinet
Date: Fri, 11 Jun 2010 23:36:08 +0000
Module Name: src
Committed By: riz
Date: Fri Jun 11 23:36:08 UTC 2010
Modified Files:
src/sys/netinet [netbsd-5]: tcp_input.c
Log Message:
Pull up following revision(s) (requested by bouyer in ticket #1382):
sys/netinet/tcp_input.c: revision 1.304
sys/netinet/tcp_input.c: revision 1.305
syn_cache_put(): defer all pool_put() to the callout. Reschedule
the callout if needed so frees are not delayed too much.
syn_cache_timer(): we can't call syn_cache_put() here any more,
so move code deleted from syn_cache_put() here.
Avoid KASSERT() in kern_timeout.c because pool_put() is called from
ipintr context, as reported in
http://mail-index.netbsd.org/tech-kern/2010/03/19/msg007762.html
Thanks to Andrew Doran and Mindaugas Rasiukevicius for help and review.
Make sure SYN_CACHE_TIMER_ARM() has been run before calling syn_cache_put()
as it will reschedule the timer. Fixes PR kern/43318.
To generate a diff of this commit:
cvs rdiff -u -r1.291.4.4 -r1.291.4.5 src/sys/netinet/tcp_input.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.