NetBSD Problem Report #43318

From www@NetBSD.org  Sun May 16 23:08:42 2010
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 095DF63C34A
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 16 May 2010 23:08:42 +0000 (UTC)
Message-Id: <20100516230841.9BAD363BC69@www.NetBSD.org>
Date: Sun, 16 May 2010 23:08:41 +0000 (UTC)
From: scotte@warped.com
Reply-To: scotte@warped.com
To: gnats-bugs@NetBSD.org
Subject: Recent -current asserts with USB activity
X-Send-Pr-Version: www-1.0

>Number:         43318
>Category:       kern
>Synopsis:       Recent -current asserts with USB activity
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    bouyer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun May 16 23:10:00 +0000 2010
>Closed-Date:    Wed May 26 17:40:05 +0000 2010
>Last-Modified:  Fri Jun 11 23:40:01 +0000 2010
>Originator:     Scott Ellis
>Release:        amd64-current (5.99.29)
>Organization:
None
>Environment:
NetBSD intrepid 5.99.29 NetBSD 5.99.29 (INTREPID.P5W.DEBUG) #1: Sun May 16 08:15:45 PDT 2010  scotte@intrepid:/nbu/source/netbsd/src/obj.amd64/nbu/source/netbsd/src/sys/arch/amd64/compile/INTREPID.P5W.DEBUG amd64

>Description:
USB activity eventually causes kernel to assert and panic.  This is evident from reboots in my daily scripts (backing up via rsync to a umass disk), but also seen manifesting itself with uplcom not actually sending any data, and then eventually a kernel panic.  Kernel reports:

usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
usbd_do_request: not in process context
panic: kernel diagnostic assertion "c->c_func != NULL" failed: file "/nbu/source/netbsd/src/sys/kern/kern_timeout.c", line 331
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff8014a8ad cs 8 rflags 246 cr2  7f7ffdb51020 cpl 8 rsp ffff8000534ba530
Stopped in pid 0.3 (system) at  netbsd:breakpoint+0x5:  leave
db{0}>
db{0}> bt
breakpoint() at netbsd:breakpoint+0x5
panic() at netbsd:panic+0x2ba
kern_assert() at netbsd:kern_assert+0x2d
callout_schedule_locked() at netbsd:callout_schedule_locked+0x168
syn_cache_add() at netbsd:syn_cache_add+0x4ca
tcp_input() at netbsd:tcp_input+0x910
ip_input() at netbsd:ip_input+0x5b0
ipintr() at netbsd:ipintr+0x90
softint_dispatch() at netbsd:softint_dispatch+0xcc
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xffff8000534bad70
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:

db{0}> machine cpu 1
using CPU 1
db{0}> bt
x86_mwait() at netbsd:x86_mwait+0xd
idle_loop() at netbsd:idle_loop+0x199
?() at 0
db{0}>

Appears to have been introduced sometime after April 21st, and before May 14th (my gap in staying with -current).  I don't have an exact workload to trigger it, but from the backtrace it appears that it's not "load" related, but more a timing problem.
>How-To-Repeat:
Generate USB traffic on uplcom or umass.
Wait.
(Sad steps, I know.)
>Fix:

>Release-Note:

>Audit-Trail:
From: "Scott Ellis" <scotte@warped.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Mon, 17 May 2010 09:27:04 -0700

 Extra data-point: System rebooted w/o use of umass, during the daily run. 
 There is constant USB traffic (10's of bytes every second @ 9600 baud) on
 uplcom, but otherwise USB isn't being used in the system.

 Happened at 3am, so I wasn't able to capture the panic message, but I
 suspect it's very similar to what was shown in the PR.

From: Scott Ellis <scotte@intrepid.warped.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Tue, 18 May 2010 07:52:37 -0700

 Same behavior happens with a kernel from May 8, 2010, so it's some 
 change between April 22 and May 8.  Continuing to do a binary search to 
 see if I can narrow down the window of breakage.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Tue, 18 May 2010 21:31:06 +0200

 On Sun, May 16, 2010 at 11:10:00PM +0000, scotte@warped.com wrote:
 > panic: kernel diagnostic assertion "c->c_func != NULL" failed: file "/nbu/source/netbsd/src/sys/kern/kern_timeout.c", line 331
 > fatal breakpoint trap in supervisor mode
 > trap type 1 code 0 rip ffffffff8014a8ad cs 8 rflags 246 cr2  7f7ffdb51020 cpl 8 rsp ffff8000534ba530
 > Stopped in pid 0.3 (system) at  netbsd:breakpoint+0x5:  leave
 > db{0}>
 > db{0}> bt
 > breakpoint() at netbsd:breakpoint+0x5
 > panic() at netbsd:panic+0x2ba
 > kern_assert() at netbsd:kern_assert+0x2d
 > callout_schedule_locked() at netbsd:callout_schedule_locked+0x168
 > syn_cache_add() at netbsd:syn_cache_add+0x4ca

 Can you see what syn_cache_add+0x4ca points to in kernel sources ?
 Maybe try reverting tcp_input.c 1.304 and see if it helps ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Scott Ellis <scotte@intrepid.warped.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, 
 gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Wed, 19 May 2010 08:22:03 -0700

 On 5/18/2010 12:31 PM, Manuel Bouyer wrote:
 > Can you see what syn_cache_add+0x4ca points to in kernel sources ?
 > Maybe try reverting tcp_input.c 1.304 and see if it helps ...
 >

 Reverting to tcp_input.c 1.303 seems to have solved my problem.  ~5 hour 
 max uptime with little USB activity has turned into 16 hours with 
 significant umass activity (I did a dry-run of the system backup, which 
 is what repeatedly killed the system before).

 Thanks for the clue.  I'm not cluefull enough to know WHY the changes in 
 1.304 are a problem, but at least now it's clear that it is the problem. :-)


 	ScottE

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
        scotte@warped.com
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Wed, 19 May 2010 18:32:26 +0200

 On Wed, May 19, 2010 at 04:05:03PM +0000, Scott Ellis wrote:
 >  On 5/18/2010 12:31 PM, Manuel Bouyer wrote:
 >  > Can you see what syn_cache_add+0x4ca points to in kernel sources ?
 >  > Maybe try reverting tcp_input.c 1.304 and see if it helps ...
 >  >
 >  
 >  Reverting to tcp_input.c 1.303 seems to have solved my problem.  ~5 hour 
 >  max uptime with little USB activity has turned into 16 hours with 
 >  significant umass activity (I did a dry-run of the system backup, which 
 >  is what repeatedly killed the system before).
 >  
 >  Thanks for the clue.  I'm not cluefull enough to know WHY the changes in 
 >  1.304 are a problem, but at least now it's clear that it is the problem. :-)

 Then can you try the attached patch, and report if the
 printf("syn dropped\n") does show up in the kernel log ?
 Also, it should fix the assert problem.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
        scotte@warped.com
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Wed, 19 May 2010 19:07:17 +0200

 --y0ulUmNC+osPPQO6
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Wed, May 19, 2010 at 06:32:26PM +0200, Manuel Bouyer wrote:
 > Then can you try the attached patch, and report if the
 > printf("syn dropped\n") does show up in the kernel log ?
 > Also, it should fix the assert problem.

 With the patch, this time

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --y0ulUmNC+osPPQO6
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: tcp_input.c
 ===================================================================
 RCS file: /cvsroot/src/sys/netinet/tcp_input.c,v
 retrieving revision 1.304
 diff -u -p -u -r1.304 tcp_input.c
 --- tcp_input.c	21 Apr 2010 20:40:16 -0000	1.304
 +++ tcp_input.c	19 May 2010 16:31:16 -0000
 @@ -4207,6 +4207,12 @@ syn_cache_add(struct sockaddr *src, stru
  		syn_cache_insert(sc, tp);
  	} else {
  		s = splsoftnet();
 +		printf("syn dropped\n");
 +		/*
 +		 * syn_cache_put() will try to schedule the timer, so
 +		 * we need to initialize it
 +		 */
 +		SYN_CACHE_TIMER_ARM(sc);
  		syn_cache_put(sc);
  		splx(s);
  		TCP_STATINC(TCP_STAT_SC_DROPPED);

 --y0ulUmNC+osPPQO6--

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
        scotte@warped.com
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Fri, 21 May 2010 10:48:00 +0200

 On Wed, May 19, 2010 at 06:32:26PM +0200, Manuel Bouyer wrote:
 > Then can you try the attached patch, and report if the
 > printf("syn dropped\n") does show up in the kernel log ?
 > Also, it should fix the assert problem.

 Hello,
 any news about testing this ? I'd really like to get this in 5.1 if
 possible.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: "Scott Ellis" <scotte@warped.com>
To: "Manuel Bouyer" <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org,
 kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Fri, 21 May 2010 09:35:10 -0700

 > On Wed, May 19, 2010 at 06:32:26PM +0200, Manuel Bouyer wrote:
 > any news about testing this ? I'd really like to get this in 5.1 if
 > possible.

 Installed the patch last night, but haven't done any significant testing
 yet (read: Significant USB activity).  That said, this morning the system
 has remained up for 14 hours (which is about 3x what it did with 1.304 of
 tcp_input.c), and I see two instances of "syn dropped" in dmesg.

 I'd be happy declaring victory, and can work on stressing it more this
 weekend.

From: Scott Ellis <scotte@intrepid.warped.com>
To: gnats-bugs@NetBSD.org
Cc: Scott Ellis <scotte@warped.com>, kern-bug-people@netbsd.org, 
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/43318: Recent -current asserts with USB activity
Date: Sat, 22 May 2010 10:37:22 -0700

 On 5/21/2010 9:40 AM, Scott Ellis wrote:
 >   I'd be happy declaring victory, and can work on stressing it more this
 >   weekend.

 Did my usual USB test which would tickle this issue, and had no crashes. 
   There are a number of "syn dropped" messages, however, so I think the 
 patch is touching the right area.

 Recommend commit (w/o the printf, duh), and close this CR.  Thanks for 
 the speedy fix Manuel!

Responsible-Changed-From-To: kern-bug-people->bouyer
Responsible-Changed-By: bouyer@NetBSD.org
Responsible-Changed-When: Wed, 26 May 2010 17:40:05 +0000
Responsible-Changed-Why:
Related to tcp_input.c 1.304


State-Changed-From-To: open->closed
State-Changed-By: bouyer@NetBSD.org
State-Changed-When: Wed, 26 May 2010 17:40:05 +0000
State-Changed-Why:
Proposed patch has been commited.


From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43318 CVS commit: src/sys/netinet
Date: Wed, 26 May 2010 17:38:29 +0000

 Module Name:	src
 Committed By:	bouyer
 Date:		Wed May 26 17:38:29 UTC 2010

 Modified Files:
 	src/sys/netinet: tcp_input.c

 Log Message:
 Make sure SYN_CACHE_TIMER_ARM() has been run before calling syn_cache_put()
 as it will reschedule the timer.  Fixes PR kern/43318.


 To generate a diff of this commit:
 cvs rdiff -u -r1.304 -r1.305 src/sys/netinet/tcp_input.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Jeff Rizzo <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43318 CVS commit: [netbsd-5] src/sys/netinet
Date: Fri, 11 Jun 2010 23:36:08 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Fri Jun 11 23:36:08 UTC 2010

 Modified Files:
 	src/sys/netinet [netbsd-5]: tcp_input.c

 Log Message:
 Pull up following revision(s) (requested by bouyer in ticket #1382):
 	sys/netinet/tcp_input.c: revision 1.304
 	sys/netinet/tcp_input.c: revision 1.305
 syn_cache_put(): defer all pool_put() to the callout. Reschedule
 the callout if needed so frees are not delayed too much.
 syn_cache_timer(): we can't call syn_cache_put() here any more,
 so move code deleted from syn_cache_put() here.
 Avoid KASSERT() in kern_timeout.c because pool_put() is called from
 ipintr context, as reported in
 http://mail-index.netbsd.org/tech-kern/2010/03/19/msg007762.html
 Thanks to Andrew Doran and Mindaugas Rasiukevicius for help and review.
 Make sure SYN_CACHE_TIMER_ARM() has been run before calling syn_cache_put()
 as it will reschedule the timer.  Fixes PR kern/43318.


 To generate a diff of this commit:
 cvs rdiff -u -r1.291.4.4 -r1.291.4.5 src/sys/netinet/tcp_input.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.