NetBSD Problem Report #53598

From martin@aprisoft.de  Wed Sep 12 11:37:25 2018
Return-Path: <martin@aprisoft.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 9665A7A111
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 12 Sep 2018 11:37:25 +0000 (UTC)
Message-Id: <20180912113715.03F7F5CC906@emmas.aprisoft.de>
Date: Wed, 12 Sep 2018 13:37:15 +0200 (CEST)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: assertion failure in usb detach
X-Send-Pr-Version: 3.95

>Number:         53598
>Category:       kern
>Synopsis:       assertion failure in usb detach
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    mrg
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Sep 12 11:40:00 +0000 2018
>Closed-Date:    Fri Aug 23 07:37:39 +0000 2019
>Last-Modified:  Fri Aug 23 07:37:39 +0000 2019
>Originator:     Martin Husemann
>Release:        NetBSD 8.99.25
>Organization:
The NetBSD Foundation, Inc.
>Environment:
(not this actual kernel...)
System: NetBSD seven-days-to-the-wolves.aprisoft.de 8.99.25 NetBSD 8.99.25 (GENERIC) #236: Tue Sep 4 08:48:16 CEST 2018 martin@seven-days-to-the-wolves.aprisoft.de:/work/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:

Trying to reboot a GENERIC kernel from the 201809111830Z daily build
reproducable crashes for me like this (manual transscript):

panic: kernel diagnostic assertion "cv_is_valid(cv)" failed: file ...kern/kern_condvar.c line 151
vpanic
ch_voltag_convert_in() at netbsd:ch_voltag_convert_in
cv_destroy() at netbsd:cv_destroy+0x55
usb_detach() at netbsd:usb_detach+0xf9
...

>How-To-Repeat:
s/a

>Fix:
n/a

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/53598: assertion failure in usb detach
Date: Wed, 12 Sep 2018 14:25:42 +0000

 Problem is that this (VirtualBox) xhci only partly attached, USB revision
 is 0 and we never make it to usb_doattach(), but set sc_dying instead.

 When we get to usb_detach, the softint never had been established
 nor the cv initialized.

 Maybe we should define several values for sc_dying, representing how
 far during initalization we got? Something like: sc_dying = 3 if we
 did not initialize anything, sc_dying = 2 if softint is valid but
 cv is not, sc_dying = 1 if we have been fully functional but are now
 winding down.

 Martin

From: "matthew green" <mrg@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53598 CVS commit: src/sys/dev
Date: Tue, 18 Sep 2018 05:24:10 +0000

 Module Name:	src
 Committed By:	mrg
 Date:		Tue Sep 18 05:24:10 UTC 2018

 Modified Files:
 	src/sys/dev/pci: xhci_pci.c
 	src/sys/dev/usb: usb.c

 Log Message:
 deal with partial attach failures in usb_attach vs usb_detach aka PR 53598.

 - make sure xhci's sc->sc_ios is NULL if failure happens.
 - rearrange usb_attach() / usb_doattach() to make it simpler to clean up.
 - move usb_async_intr softint into usb_once_init().  previously, each USB
   controller would start a new one, and leave the old one leaked.
 - handle controller interrupts without a bus attached


 To generate a diff of this commit:
 cvs rdiff -u -r1.13 -r1.14 src/sys/dev/pci/xhci_pci.c
 cvs rdiff -u -r1.172 -r1.173 src/sys/dev/usb/usb.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: kern-bug-people->mrg
Responsible-Changed-By: mrg@NetBSD.org
Responsible-Changed-When: Tue, 18 Sep 2018 05:39:11 +0000
Responsible-Changed-Why:
i wrote a patch.


State-Changed-From-To: open->feedback
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Tue, 18 Sep 2018 05:39:11 +0000
State-Changed-Why:
can you test -current?  i forced a detach failure at the top of
usb_attach() like you described, and i was able to boot, reboot
and it worked.  thanks.


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53598 CVS commit: [netbsd-8] src
Date: Thu, 27 Sep 2018 14:52:27 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu Sep 27 14:52:27 UTC 2018

 Modified Files:
 	src/share/man/man4 [netbsd-8]: usb.4
 	src/sys/ddb [netbsd-8]: db_command.c db_output.c
 	src/sys/dev/pci [netbsd-8]: xhci_pci.c
 	src/sys/dev/usb [netbsd-8]: ehci.c ohci.c uhci.c uhcivar.h uhub.c usb.c
 	    usb_subr.c usbdi.c xhci.c
 	src/sys/external/bsd/dwc2 [netbsd-8]: dwc2.c
 	src/sys/kern [netbsd-8]: subr_userconf.c

 Log Message:
 Pull up following revision(s) (requested by mrg in ticket #1037):

 	sys/dev/usb/uhub.c: revision 1.139
 	sys/external/bsd/dwc2/dwc2.c: revision 1.55
 	sys/ddb/db_output.c: revision 1.34
 	sys/ddb/db_command.c: revision 1.160
 	sys/dev/usb/ehci.c: revision 1.264
 	sys/dev/usb/xhci.c: revision 1.99
 	sys/dev/usb/ehci.c: revision 1.265
 	sys/kern/subr_userconf.c: revision 1.27
 	sys/dev/usb/ehcivar.h: revision 1.46
 	sys/dev/usb/ohci.c: revision 1.287
 	sys/dev/usb/uhci.c: revision 1.284
 	sys/dev/usb/usbdi.c: revision 1.178
 	sys/dev/usb/usb.c: revision 1.172
 	sys/dev/pci/xhci_pci.c: revision 1.14
 	sys/dev/usb/usb.c: revision 1.173
 	sys/dev/usb/usb.c: revision 1.174
 	share/man/man4/usb.4: revision 1.110
 	sys/ddb/db_command.c: revision 1.159
 	sys/dev/usb/usb_subr.c: revision 1.227
 	sys/dev/usb/uhcivar.h: revision 1.56
 	(all via patch)

 consolidate the handling of polling across HC drivers, and generic USB:
 - don't take mutexes if polling
 - normalise the code across all drivers
 - add some not yet code to block discovery to/from polling
 - minor CSE
 - adjust comment for usbd_set_polling() to reality now i properly
   understand what it is used for and why.

 this, with a hack to make RB_ASKNAME to wait 5 seconds allows boot -a
 work with USB keyboards.  there are still multiple issues remaining:
 - discovery and polling need to be mutually exclusive
 - attachment of ukbd and wskbd is not handled by config_pending, and
   the 5 second delay isn't going to always be enough.

 call cnpollc(1) and cnpollc(0) around cngetc().
 (christos has a good idea to add a function that does all 3,
 and we should switch all the callers in this sequence to use
 it (and fix the MD ones missing it still).  not all can, as
 eg, line-grabbing functions can use cngetsn(), which only
 calls cnpollc() twice.)

 When this file is used when not building the kernel (eg: /usr/sbin/crash)
 make cnpollc() go away.

 reorder some struct members to remove holes.

 add config_pending usage to uhub and general USB device attachment.
 - call config_pending_incr() and config_pending_decr() around attaching
   devices against "usbdevif" attribute.

 uhub:
 - convert sc_explorepending and sc_running to bool.  add new sc_first_explore.
 - call config_pending_incr() at the start of uhub_attach().  dropped in
   uhub_explore(), if this is the first explore.

 implement a gross hack to fix "boot -a" on systems with usb keyboards on
 systems with ehci handover to uhci (and maybe ohci), and fix a similar
 problem for "boot -s".

 there is effort to ensure that all devices attached via USB are probed
 before RB_ASKNAME or RB_SINGLE attempts to ask any questions on the console,
 and largely this works, often by chance, today, for USB disks and root.
 i've recently pushed this more into uhub and general USB device attachment
 as well, and kept a config_pending reference across the first explore of
 a bus.  these fix many issues with directly attached hubs.

 however, on systems where devices connected to ehci ports are handed over
 to a companion uhci or ohci port, it may not be the first, or even second,
 bus explore that finds the device finally before attachment, and at this
 point all config_pending references are dropped.

 there is no direct communication between drivers, the potentials are
 looked up but their device_t is only used for generic things like the name,
 so informing the correct companion to expect a device and deal with the
 config_pending references is not possible without some fairly ugly layer
 violations or multi-level callbacks (eg, we have "ehci0", and usually an
 the relevant companion, eg, "uhci2", but it is the uhub that uhci2 has
 attached that will deal with the device attachment.)

 with the above fixes to generic USB code, the disown happens during the
 first explore.  the hack works by, at this point, checking if (a) root
 is not mounted, (b) single user or ask name are set, and (c) if the hack
 as not been triggered already.  if all 3 conditions are true, then a
 config_pending_incr() is called and a callback is triggered for (default)
 5 seconds to call config_pending_decr().  ehci detach pauses waiting for
 this callback if scheduled.

 this allows enough time for the uhub and the ukbd/wskbd to attach before
 the RK_ASKROOT prompts appear.  testing shows it takes between 1.5 and
 2 seconds for the keyboard to appear after the disown occurs.

 Index: dev/usb/ehcivar.c
 - new sc_compcallout, sc_compcallout, sc_complock, and a state for th
   handover hack.

 Index: dev/usb/ehci.c
 ehci_init():
 - use aprint_normal_dev() instead of manual device_xname().
 - initialise sc_compcallout, sc_compcallout, sc_complock, and sc_comp_state.
 ehci_detach():
 - if there are companion controllers, tear own the above, including waiting
   if there is a callback scheduled.
 ehci_disown_callback():
 - new callout to call config_pending_decr() in the the future.
   schedule this ca
 ehci_disown_sched_callback():
 - if booting to single user or asking names, call config_pending_incr() and
   schedule the callout above, default 5 second delay.
 ehci_disown():
 - if disowning a port call ehci_disown_sched_callback().
 deal with partial attach failures in usb_attach vs usb_detach aka PR 53598.
 - make sure xhci's sc->sc_ios is NULL if failure happens.
 - rearrange usb_attach() / usb_doattach() to make it simpler to clean up.
 - move usb_async_intr softint into usb_once_init().  previously, each USB
   controller would start a new one, and leave the old one leaked.
 - handle controller interrupts without a bus attached

 remove usb(4)'s "flags 1" code.  it has been dead for a while,
 as it runs during the interrupts part of configuration now,
 and all the devices try attach as early as possible, including
 any root or boot required disk or keyboard device, which is
 what this flag was for.


 To generate a diff of this commit:
 cvs rdiff -u -r1.105 -r1.105.4.1 src/share/man/man4/usb.4
 cvs rdiff -u -r1.148.8.4 -r1.148.8.5 src/sys/ddb/db_command.c
 cvs rdiff -u -r1.33 -r1.33.32.1 src/sys/ddb/db_output.c
 cvs rdiff -u -r1.8 -r1.8.6.1 src/sys/dev/pci/xhci_pci.c
 cvs rdiff -u -r1.254.8.4 -r1.254.8.5 src/sys/dev/usb/ehci.c
 cvs rdiff -u -r1.273.6.3 -r1.273.6.4 src/sys/dev/usb/ohci.c
 cvs rdiff -u -r1.275.2.4 -r1.275.2.5 src/sys/dev/usb/uhci.c
 cvs rdiff -u -r1.53.10.1 -r1.53.10.2 src/sys/dev/usb/uhcivar.h
 cvs rdiff -u -r1.136.2.1 -r1.136.2.2 src/sys/dev/usb/uhub.c
 cvs rdiff -u -r1.165.6.3 -r1.165.6.4 src/sys/dev/usb/usb.c
 cvs rdiff -u -r1.220.2.4 -r1.220.2.5 src/sys/dev/usb/usb_subr.c
 cvs rdiff -u -r1.173.2.2 -r1.173.2.3 src/sys/dev/usb/usbdi.c
 cvs rdiff -u -r1.72.2.7 -r1.72.2.8 src/sys/dev/usb/xhci.c
 cvs rdiff -u -r1.46.2.2 -r1.46.2.3 src/sys/external/bsd/dwc2/dwc2.c
 cvs rdiff -u -r1.26 -r1.26.22.1 src/sys/kern/subr_userconf.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->closed
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Fri, 23 Aug 2019 07:37:39 +0000
State-Changed-Why:
feedback timeout.  submitter applied fixes to netbsd-8 branch 11 months ago.. ;)


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.