NetBSD Problem Report #54963

From gson@gson.org  Fri Feb 14 14:19:44 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 2D5081A9213
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 14 Feb 2020 14:19:44 +0000 (UTC)
Message-Id: <20200214141939.B1B89253FA3@guava.gson.org>
Date: Fri, 14 Feb 2020 16:19:39 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: UHCI panic on shutdown
X-Send-Pr-Version: 3.95

>Number:         54963
>Category:       kern
>Synopsis:       UHCI panic on shutdown
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 14 14:20:01 +0000 2020
>Closed-Date:    Mon Feb 17 07:45:43 +0000 2020
>Last-Modified:  Sun Mar 01 12:40:00 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

My amd64 bare metal testbed has recently started panicing in the
uhci(4) driver on shutdown:

  [ 7185.0687237] panic: kernel diagnostic assertion "sc->sc_intr_xfer == NULL" failed: file "/tmp/bracket/build/2020.02.13.00.42.59-amd64-baremetal/src/sys/dev/usb/uhci.c", line 3928

Full console log including boot messages and panic message:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.02.13.00.42.59/test.log

Source date 2020.02.11.07.03.16 shuts down cleanly, 2020.02.13.00.42.59 panics.

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: "John D. Baker" <jdbaker@consolidated.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/54963: UHCI panic on shutdown
Date: Fri, 14 Feb 2020 09:33:16 -0600 (CST)

 I've seen this too.

 While collecting data for kern/52440, I booted a test system with my
 live-USB flash drive (9.99.46 from 202002130200Z).  Upon rebooting, the
 machine panicked with:

 reboot: rebooted by root
 [ 147.4887060] syncing disks... done
 [ 147.5210129] cd0: detached
 [ 147.5584525] atapibus0: detached
 [ 147.5887468] atabus1: detached
 [ 147.6312528] atabus0: detached
 [ 147.6587763] makphy0: detached
 [ 147.6987924] msk0: detached
 [ 147.7288127] atabus3: detached
 [ 147.7688234] atabus2: detached
 [ 147.8049328] uhub6: detached
 [ 147.8382125] Skipping crash dump on recursive panic
 [ 147.8954128] panic: kernel diagnostic assertion "sc->sc_intr_xfer == NULL" failed: file "/x/current/src/sys/dev/usb/uhci.c", line 3928 
 [ 148.0399730] cpu0: Begin traceback...
 [ 148.0826129] vpanic() at netbsd:vpanic+0x178
 [ 148.1304527] kern_assert() at netbsd:kern_assert+0x48
 [ 148.1897324] uhci_root_intr_close() at netbsd:uhci_root_intr_close+0x96
 [ 148.2698124] usbd_close_pipe() at netbsd:usbd_close_pipe+0xca
 [ 148.3301332] uhub_detach() at netbsd:uhub_detach+0xa8
 [ 148.3894123] config_detach() at netbsd:config_detach+0xe5
 [ 148.4601319] config_detach_all() at netbsd:config_detach_all+0x97
 [ 148.5298127] cpu_reboot() at netbsd:cpu_reboot+0x18f
 [ 148.5901326] sys_reboot() at netbsd:sys_reboot
 [ 148.6400525] sys_reboot() at netbsd:sys_reboot+0x75
 [ 148.6993331] syscall() at netbsd:syscall+0x299
 [ 148.7492532] --- syscall (number 208) ---
 [ 148.7981326] 7d651d24317a:
 [ 148.8293325] cpu0: End traceback...
 [ 148.8698927] fatal breakpoint trap in supervisor mode
 [ 148.9291720] trap type 1 code 0 rip 0xffffffff8021f55d cs 0x8 rflags 0x202 cr2 0x75e474c075b8 ilevel 0 rsp 0xffffc380c1197cd0
 [ 149.0633322] curlwp 0xffff925692009a40 pid 11.1 lowest kstack 0xffffc380c11942c0
 Stopped in pid 11.1 (reboot) at netbsd:breakpoint+0x5:  leave
 db{0}>

 It didn't unmount the USB flash drive (sd0a) and it doesn't appear to
 have been detached before the panic, so the file system wasn't clean.

 Seems to happen only if a umass device is mounted.  Following operation
 on unmounted umass (fsck), rebooting goes normally.

 -- 
 |/"\ John D. Baker, KN5UKS               NetBSD     Darwin/MacOS X
 |\ / jdbaker[snail]consolidated[flyspeck]net  OpenBSD            FreeBSD
 | X  No HTML/proprietary data in email.   BSD just sits there and works!
 |/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54963 CVS commit: src/sys/dev/usb
Date: Fri, 14 Feb 2020 16:47:11 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Fri Feb 14 16:47:11 UTC 2020

 Modified Files:
 	src/sys/dev/usb: uhci.c

 Log Message:
 Initialize xfer->ux_status in uhci_root_intr_start.

 Otherwise, it will be USBD_NOT_STARTED, so usbd_ar_pipe will skip
 calling upm_abort.

 Candidate fix for PR kern/54963, same problem as reported at:
 https://mail-index.NetBSD.org/current-users/2020/02/13/msg037740.html


 To generate a diff of this commit:
 cvs rdiff -u -r1.290 -r1.291 src/sys/dev/usb/uhci.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54963 CVS commit: src/sys
Date: Sat, 15 Feb 2020 01:21:57 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sat Feb 15 01:21:56 UTC 2020

 Modified Files:
 	src/sys/arch/mips/adm5120/dev: ahci.c
 	src/sys/dev/ic: sl811hs.c
 	src/sys/dev/usb: ehci.c motg.c ohci.c uhci.c vhci.c xhci.c
 	src/sys/external/bsd/dwc2: dwc2.c

 Log Message:
 Fix mistakes in previous sloppy change with root intr xfers.

 - Make sure ux_status is set to USBD_IN_PROGRESS when started.
   Otherwise, if it is still in flight when we abort the pipe,
   usbd_ar_pipe will skip calling upm_abort.

 - Initialize ux_status under the lock; in principle a completion
   interrupt (or a delay) could race with the initialization.

 - KASSERT that the xfer is in progress when we're about to complete
   it.

 Candidate fix for PR kern/54963 for other HCI drivers than uhci.

 ok nick
 ok phone

 (This is the change that nick evidently MEANT to ok when he ok'd the
 previous one!)


 To generate a diff of this commit:
 cvs rdiff -u -r1.19 -r1.20 src/sys/arch/mips/adm5120/dev/ahci.c
 cvs rdiff -u -r1.102 -r1.103 src/sys/dev/ic/sl811hs.c
 cvs rdiff -u -r1.272 -r1.273 src/sys/dev/usb/ehci.c
 cvs rdiff -u -r1.26 -r1.27 src/sys/dev/usb/motg.c
 cvs rdiff -u -r1.294 -r1.295 src/sys/dev/usb/ohci.c
 cvs rdiff -u -r1.292 -r1.293 src/sys/dev/usb/uhci.c
 cvs rdiff -u -r1.5 -r1.6 src/sys/dev/usb/vhci.c
 cvs rdiff -u -r1.117 -r1.118 src/sys/dev/usb/xhci.c
 cvs rdiff -u -r1.68 -r1.69 src/sys/external/bsd/dwc2/dwc2.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Mon, 17 Feb 2020 07:45:43 +0000
State-Changed-Why:
Panic no longer happens as of source date 2020.02.14.22.04.12.  Thanks.


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54963 CVS commit: [netbsd-9] src/sys
Date: Sun, 1 Mar 2020 12:35:17 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Sun Mar  1 12:35:16 UTC 2020

 Modified Files:
 	src/sys/arch/mips/adm5120/dev [netbsd-9]: ahci.c
 	src/sys/dev/ic [netbsd-9]: sl811hs.c
 	src/sys/dev/usb [netbsd-9]: ehci.c motg.c ohci.c uhci.c usb.c usbdi.c
 	    usbdi.h usbdivar.h xhci.c
 	src/sys/external/bsd/dwc2 [netbsd-9]: dwc2.c dwc2var.h

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #744):

 	sys/dev/usb/uhci.c: revision 1.292
 	sys/dev/usb/uhci.c: revision 1.293
 	sys/dev/usb/usbdi.h: revision 1.99
 	sys/dev/usb/motg.c: revision 1.26
 	sys/dev/usb/motg.c: revision 1.27
 	sys/dev/usb/motg.c: revision 1.28
 	sys/dev/usb/motg.c: revision 1.29
 	sys/external/bsd/dwc2/dwc2.c: revision 1.70
 	sys/external/bsd/dwc2/dwc2.c: revision 1.71
 	sys/dev/usb/usb.c: revision 1.181
 	sys/arch/mips/adm5120/dev/ahci.c: revision 1.20
 	sys/dev/usb/usb.c: revision 1.182
 	sys/dev/usb/xhci.c: revision 1.116
 	sys/dev/usb/xhci.c: revision 1.117
 	sys/dev/usb/xhci.c: revision 1.118
 	sys/dev/usb/uhci.c: revision 1.289
 	sys/dev/usb/usbdivar.h: revision 1.121
 	sys/dev/usb/usbdi.c: revision 1.190
 	sys/dev/usb/usbdivar.h: revision 1.122
 	sys/dev/usb/usbdi.c: revision 1.191
 	sys/dev/usb/usbdi.c: revision 1.192
 	sys/external/bsd/dwc2/dwc2var.h: revision 1.7
 	sys/dev/usb/motg.c: revision 1.30
 	sys/dev/usb/motg.c: revision 1.31
 	sys/dev/usb/motg.c: revision 1.32
 	sys/dev/usb/motg.c: revision 1.33
 	sys/external/bsd/dwc2/dwc2.c: revision 1.67
 	sys/external/bsd/dwc2/dwc2.c: revision 1.68
 	sys/dev/usb/ehci.c: revision 1.270
 	sys/external/bsd/dwc2/dwc2.c: revision 1.69
 	sys/dev/usb/usbdi.h: revision 1.100
 	sys/dev/usb/ehci.c: revision 1.271
 	sys/arch/mips/adm5120/dev/ahci.c: revision 1.18
 	sys/dev/usb/usbdi.h: revision 1.101
 	sys/dev/usb/ehci.c: revision 1.272
 	sys/dev/ic/sl811hs.c: revision 1.103
 	sys/arch/mips/adm5120/dev/ahci.c: revision 1.19
 	sys/dev/usb/ehci.c: revision 1.273
 	sys/dev/usb/ohci.c: revision 1.293
 	sys/dev/usb/uhci.c: revision 1.290
 	sys/dev/usb/ohci.c: revision 1.294
 	sys/dev/usb/uhci.c: revision 1.291
 	sys/dev/usb/ohci.c: revision 1.295

 Teach usb_rem_task to return whether removed from queue or not.

 New function usb_task_pending for diagnostic assertions.
 Usable only for negative diagnostic assertions:

         KASSERT(!usb_task_pending(dev, task))

 If you can think of a better name for this than !usb_task_pending,
 I'm all ears.

  -

 Nothing guarantees xfer's timeout has completed.

 Wait for it when we free the xfer.

  -

 New xfer state variables ux_timeout_set and ux_timeout_reset.

 These are needed because:
 - The host controller interrupt cannot wait for the callout or task
   to finish running.
 - Nothing in the USBD API as is waits for the callout or task to
   finish running.
 - Callers expect to be able to resubmit USB xfers from xfer callbacks
   without waiting for anything to finish running.

 The variable ux_timeout_set can be used by a host controller to
 decide on submission whether to schedule the callout or to ask an
 already-scheduled callout or already-queued task to reschedule the
 callout, by setting the variable ux_timeout_reset to true.

 When the callout or task runs and sees that ux_timeout_reset is true,
 rather than queue the task or abort the xfer, it can instead just
 schedule the callout anew.

  -

 Fix steady state of timeouts in ehci.

 This is complicated because:
 1. There are three ways that an xfer can be completed:
    (a) hardware interrupt completes xfer
    (b) software decision aborts xfer with USBD_CANCELLED
    (c) timeout aborts xfer with USBD_TIMEOUT
 2. The timeout abort can't be done in callout because ehci_sync_hc,
    called unconditionally by ehci_abort_xfer to wait until the device
    has finished using any references to the xfer, may sleep.  So we
    have to schedule a callout that, when run, will schedule a usb_task.
 3. The hardware completion interrupt can't sleep waiting for a callout
    or task to finish -- can't use callout_halt or usb_rem_task_wait.
    So the callout and usb_task must be able to run _after_ the hardware
    completion interrupt, and recognize that they're late to the party.
    (Note, though, that usbd_free_xfer does wait for the callout and
    task to complete, so there's no danger they may use themselves after
    free.)
 4. The xfer may resubmitted -- and the timeout may be rescheduled --
    immediately after the hardware completion interrupt, _while_ the
    callout and/or usb_task may still be scheduled.  Specifically, we
    may have the following sequence of events:
    (a) hardware completion interrupt
    (b) callout or usb_task fires
    (c) driver resubmits xfer
    (d) callout or usb_task acquires lock and looks around dazed and
        bewildered at the firehose of events like reading the news in 2019

 The mechanism for sorting this out is that we have two bits of state:
 - xfer->ux_timeout_set informs the driver, when submitting an xfer and
   setting up its timeout, whether either the callout or usb_task is
   already scheduled or not.
 - xfer->ux_timeout_reset informs the callout or usb_task whether it
   should reschedule the callout, because the xfer got resubmitted, or
   not.

  -

 Factor out HCI-independent xfer completion logic.

 New API for HCI drivers to synchronize hardware completion
 interrupts, synchronous aborts, and asynchronous timeouts:
 - When submitting an xfer to hardware, call
   usbd_xfer_schedule_timeout(xfer).
 - On HCI completion interrupt for xfer completion:
         if (!usbd_xfer_trycomplete(xfer))
                 return;         /* timed out or aborted, ignore it */
 - In upm_abort methods, call usbd_xfer_abort(xfer).

 For HCI drivers that use this API (not needed in drivers that don't,
 or for xfers like root intr xfers that don't use it):
 - New ubm_abortx method serves role of former *hci_abort_xfer, but
   without any logic for wrangling timeouts/callouts/tasks -- caller
   in usbd_xfer_abort has already handled them.
 - New ubm_dying method, returns true if the device is in the process
   of detaching, used by the timeout logic.

 Converted and tested:
 - ehci
 - ohci

 Converted and compile-tested:
 - ahci (XXX did this ever work?)
 - dwc2
 - motg (XXX missing usbd_xfer_schedule_timeout in motg_*_start?)
 - uhci
 - xhci

 Not changed:
 - slhci (sys/dev/ic/sl811hs.c) -- doesn't use a separate per-xfer
   callout for timeouts (XXX but maybe should?)
 - ugenhc (sys/rump/dev/lib/libugenhc/ugenhc.c) -- doesn't manage its
   own transfer timeouts

  -

 Fix steady state of root intr xfers.

 Why?
 - Avoid completing a root intr xfer multiple times in races.
 - Avoid potential use-after-free in poll_hub callouts (uhci, ahci).

 How?
 - Use sc->sc_intr_xfer or equivalent to store only a pending xfer
   that has not yet completed -- whether successfully, by timeout, or
   by synchronous abort.  When any of those happens, set it to null
   under the lock, so the xfer is completed only once.
 - For hci drivers that use a callout to poll the root hub (uhci, ahci):
   . Pass the softc pointer, not the xfer, to the callout, so the
     callout is not even tempted to use xfer after free -- if the
     callout fires, but the xfer is synchronously aborted before the
     callout can do anything, the xfer might be freed by the time the
     callout starts to examine it.
   . Teach the callout to do nothing if it is callout_pending after it
     has fired.  This way:
     1. completion or synchronous abort can just callout_stop
     2. start can just callout_schedule
     If the callout had already fired before (1), and doesn't acquire
     the bus lock until after (2), it may be tempted to abort the new
     root intr xfer just after submission, which would be wrong -- so
     instead we just have the callout do nothing if it notices it has
     been rescheduled, since it will fire again after the appropriate
     time has elapsed.

  -

 Initialize xfer->ux_status in uhci_root_intr_start.

 Otherwise, it will be USBD_NOT_STARTED, so usbd_ar_pipe will skip
 calling upm_abort.
 Candidate fix for PR kern/54963, same problem as reported at:
 href="https://mail-index.NetBSD.org/current-users/2020/02/13/msg037740.html

  -

 Set ux_isdone in uhci_poll_hub for DIAGNOSTIC.

  -

 Fix mistakes in previous sloppy change with root intr xfers.
 - Make sure ux_status is set to USBD_IN_PROGRESS when started.
   Otherwise, if it is still in flight when we abort the pipe,
   usbd_ar_pipe will skip calling upm_abort.
 - Initialize ux_status under the lock; in principle a completion
   interrupt (or a delay) could race with the initialization.
 - KASSERT that the xfer is in progress when we're about to complete
   it.

 Candidate fix for PR kern/54963 for other HCI drivers than uhci.
 ok nick
 ok phone
 (This is the change that nick evidently MEANT to ok when he ok'd the
 previous one!)

  -

 Fix build

  -

 Fix non-DIAGNOSTIC builds.

  -

 Fix wrong KASSERT in motg abort.
 This has been wrong since last summer when we did the transition to
 xfer->ux_status = USBD_CANCELLED earlier.
 XXX pullup-9

  -

 Fix mistakes in timeout/abort/completion changes in motg(4).
 - Call usbd_xfer_schedule_timeout so we actually do time out.
 - Don't call usbd_xfer_trycomplete until all the data have been
   transferred -- it commits to completion, not timeout.
 - Use xfer->ux_status != USBD_IN_PROGRESS to test whether, after a
   partial write, an xfer has been interrupted or timed out and need
   not be continued.
 - Remove wrong assertion.

  -

 Fix mistake in use of usbd_xfer_schedule_timeout in motg.

 This code path is used both for xfers that are new, and xfers that
 are being done piece by piece and are partway done.  For the latter
 case, skip usbd_xfer_schedule_timeout so we schedule it only once per
 xfer.

  -

 Simplify some branches and kassert some redundant assignments.


 To generate a diff of this commit:
 cvs rdiff -u -r1.17.4.1 -r1.17.4.2 src/sys/arch/mips/adm5120/dev/ahci.c
 cvs rdiff -u -r1.101 -r1.101.4.1 src/sys/dev/ic/sl811hs.c
 cvs rdiff -u -r1.267.2.2 -r1.267.2.3 src/sys/dev/usb/ehci.c
 cvs rdiff -u -r1.25 -r1.25.4.1 src/sys/dev/usb/motg.c
 cvs rdiff -u -r1.289.4.4 -r1.289.4.5 src/sys/dev/usb/ohci.c
 cvs rdiff -u -r1.288.4.1 -r1.288.4.2 src/sys/dev/usb/uhci.c
 cvs rdiff -u -r1.179.2.1 -r1.179.2.2 src/sys/dev/usb/usb.c
 cvs rdiff -u -r1.182.4.2 -r1.182.4.3 src/sys/dev/usb/usbdi.c
 cvs rdiff -u -r1.96.4.1 -r1.96.4.2 src/sys/dev/usb/usbdi.h
 cvs rdiff -u -r1.118 -r1.118.4.1 src/sys/dev/usb/usbdivar.h
 cvs rdiff -u -r1.107.2.4 -r1.107.2.5 src/sys/dev/usb/xhci.c
 cvs rdiff -u -r1.59.4.2 -r1.59.4.3 src/sys/external/bsd/dwc2/dwc2.c
 cvs rdiff -u -r1.6 -r1.6.4.1 src/sys/external/bsd/dwc2/dwc2var.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.