NetBSD Problem Report #54495
From sc.dying@gmail.com Wed Aug 28 06:13:36 2019
Return-Path: <sc.dying@gmail.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 336D87A14B
for <gnats-bugs@gnats.NetBSD.org>; Wed, 28 Aug 2019 06:13:36 +0000 (UTC)
Message-Id: <614e1030-3f60-0956-40e7-e39d72b2b6e5@gmail.com>
Date: Wed, 28 Aug 2019 06:13:27 +0000
From: sc.dying@gmail.com
Reply-To:
To: gnats-bugs@NetBSD.org
Subject: stopping axe(4) may locks up if NET_MPSAFE is enabled
>Number: 54495
>Category: kern
>Synopsis: stopping axe(4) may locks up if NET_MPSAFE is enabled
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Aug 28 06:15:00 +0000 2019
>Closed-Date: Tue Apr 21 19:35:07 +0000 2020
>Last-Modified: Tue Apr 21 19:35:07 +0000 2020
>Originator: sc.dying@gmail.com
>Release: NetBSD 9.99.10
>Organization:
>Environment:
Boardtype: Raspberry Pi 3 Model B Rev 1.2
System: NetBSD-earmv7hf/9.99.10 (GENERIC) with NET_MPSAFE enabled
Architecture: earmv7hf
Machine: evbarm
>Description:
The command `ifconfig axe0 down' often locks up on NET_MPSAFE kernel.
Suspicious processes are ifconfig and softclock. It looks like that
ifconfig thread holds unp_lock and waits for kpause is waked up,
and usbnet_tick waits for unp_lock released to hold it.
usbnet_stop() sets unp_stopping, but usbnet_tick() checks it after
mutex_enter(unp_lock), so softclock thread is stalled.
The problem is easily reproduced because my axe(4) uses AX88772B so axe
kpauses 600ms at least in axe_reset if NET_MPSAFE is defined.
crash> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
15 > 1 7 0 0 911be120 crash
13 1 3 0 0 90fe9700 ifconfig usbdly
6 1 3 0 80 90e0b3c0 sh wait
1 1 3 0 80 90fab420 init wait
0 92 3 3 200 90fe9440 pooldrain pooldrain
0 91 3 3 200 90fe9180 aiodoned aiodoned
0 90 3 0 200 90fe8ec0 ioflush syncer
0 89 3 3 200 90fe8c00 pgdaemon pgdaemon
0 88 3 1 280 90e50620 VCHIQka-0 lnxcmplt
0 87 3 3 280 90fe8940 vcaudio data
0 86 3 0 280 90fe8680 VCHIQs-0 semacv
0 85 3 3 280 90fe83c0 VCHIQr-0 semacv
0 84 3 0 280 90fe8100 VCHIQ-0 semacv
0 83 5 0 200 90fabc60 (zombie)
0 81 3 3 200 90fab9a0 usb0 usbevt
0 80 3 2 200 90fab6e0 npfgc-0 npfgccv
0 79 3 0 200 90fab160 rt_free rt_free
0 78 3 0 200 90faaea0 unpgc unpgc
0 77 3 0 200 90faabe0 key_timehandler key_timehandler
0 76 3 3 200 90faa920 icmp6_wqinput/3 icmp6_wqinput
0 75 3 2 200 90faa660 icmp6_wqinput/2 icmp6_wqinput
0 74 3 1 200 90faa3a0 icmp6_wqinput/1 icmp6_wqinput
0 73 3 0 200 90e500a0 icmp6_wqinput/0 icmp6_wqinput
0 72 3 0 200 90faa0e0 nd6_timer nd6_timer
0 71 3 3 200 90e73c40 carp6_wqinput/3 carp6_wqinput
0 70 3 2 200 90e73980 carp6_wqinput/2 carp6_wqinput
0 69 3 1 200 90e736c0 carp6_wqinput/1 carp6_wqinput
0 68 3 0 200 90e73400 carp6_wqinput/0 carp6_wqinput
0 67 3 3 200 90e73140 carp_wqinput/3 carp_wqinput
0 66 3 2 200 90e72e80 carp_wqinput/2 carp_wqinput
0 65 3 1 200 90e72bc0 carp_wqinput/1 carp_wqinput
0 64 3 0 200 90e72900 carp_wqinput/0 carp_wqinput
0 63 3 0 200 90e72640 tcp_slowtimo tcp_slowtimo
0 62 3 3 200 90e72380 icmp_wqinput/3 icmp_wqinput
0 61 3 2 200 90e720c0 icmp_wqinput/2 icmp_wqinput
0 60 3 1 200 90e51c20 icmp_wqinput/1 icmp_wqinput
0 59 3 0 200 90e51960 icmp_wqinput/0 icmp_wqinput
0 58 3 0 200 90e516a0 rt_timer rt_timer
0 56 3 2 200 90e508e0 vmem_rehash vmem_rehash
0 55 3 1 200 90e50360 vcmbox0 vcmbox0
0 54 3 0 200 90e513e0 usbtask-dr usbtsk
0 53 3 0 200 90e51120 usbtask-hc usbtsk
0 52 3 0 200 90e50e60 dwc2 dwc2
0 51 3 0 200 90e50ba0 sdmmc1 mmctaskq
0 50 3 0 200 90e0b100 sdmmc0 mmctaskq
0 40 3 0 200 90e0ae40 iic2 iicintr
0 39 3 0 200 90e0ab80 iic1 iicintr
0 38 3 0 200 90e0a8c0 iic0 iicintr
0 37 3 3 200 90e0a600 xcall/3 xcall
0 36 1 3 200 90e0a340 softser/3
0 35 1 3 200 90e0a080 softclk/3
0 34 1 3 200 90b85be0 softbio/3
0 33 1 3 200 90b85920 softnet/3
0 > 32 7 3 201 90b85660 idle/3
0 31 3 2 200 90b853a0 xcall/2 xcall
0 30 1 2 200 90b850e0 softser/2
0 29 1 2 200 90b84e20 softclk/2
0 28 1 2 200 90b84b60 softbio/2
0 27 1 2 200 90b848a0 softnet/2
0 > 26 7 2 201 90b845e0 idle/2
0 25 3 1 200 90b84320 xcall/1 xcall
0 24 1 1 200 90b84060 softser/1
0 23 1 1 200 90b6bbc0 softclk/1
0 22 1 1 200 90b6b900 softbio/1
0 21 1 1 200 90b6b640 softnet/1
0 > 20 7 1 201 90b6b380 idle/1
0 19 3 2 200 90b6b0c0 lnxpwrwq lnxpwrwq
0 18 3 1 200 90b6ae00 lnxlngwq lnxlngwq
0 17 3 0 200 90b6ab40 lnxsyswq lnxsyswq
0 16 3 0 200 90b6a880 lnxrcugc lnxrcugc
0 15 3 0 200 90b6a5c0 sysmon smtaskq
0 14 3 2 200 90b6a300 pmfsuspend pmfsuspend
0 13 3 3 200 90b6a040 pmfevent pmfevent
0 12 3 0 200 90b57ba0 sopendfree sopendfr
0 11 3 0 200 90b578e0 nfssilly nfssilly
0 10 3 0 200 90b57620 cachegc cachegc
0 9 3 0 200 90b57360 vdrain vdrain
0 8 3 0 200 90b570a0 modunload mod_unld
0 7 3 0 200 90b56de0 xcall/0 xcall
0 6 1 0 200 90b56b20 softser/0
0 5 3 0 200 90b56860 softclk/0 tstile
0 4 1 0 200 90b565a0 softbio/0
0 3 1 0 200 90b562e0 softnet/0
0 2 1 0 201 90b56020 idle/0
0 1 3 3 200 807cad80 swapper uvm
crash> bt/a 90fe9700
trace: pid 13 lid 1 at 0x9ae7fbb4
0x9ae7fbb4: mi_switch+0x10
0x9ae7fbe4: sleepq_block+0x110
0x9ae7fc24: kpause+0x138
0x9ae7fc64: axe_ax_init+0x79c
0x9ae7fc7c: axe_stop+0x20
0x9ae7fcac: usbnet_stop+0x7c
0x9ae7fccc: usbnet_stop_ifp+0x34
0x9ae7fce4: ether_ioctl_reinit+0x60
0x9ae7fd04: usbnet_ioctl+0x3c
0x9ae7fe24: doifioctl+0x8c8
0x9ae7feec: sys_ioctl+0x424
0x9ae7ffac: syscall+0x12c
--- tf 0x9ae7ffb0 ---
crash> bt/a 90b56860
trace: pid 0 lid 5 at 0x99de3df4
0x99de3df4: mi_switch+0x10
0x99de3e24: sleepq_block+0xb4
0x99de3e74: turnstile_block+0x4c4
0x99de3ee4: mutex_enter+0x234
0x99de3efc: usbnet_tick+0x1c
0x99de3f44: callout_softclock+0x134
0x99de3fac: softint_dispatch+0x120
Bad frame pointer: 0x80b41f04
>How-To-Repeat:
Build the kernel with NET_MPSAFE enabled, connect axe(4), and
repeat ifconfig it up and down.
>Fix:
>Release-Note:
>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org, skrll@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: kern/54495: stopping axe(4) may locks up if NET_MPSAFE is enabled
Date: Thu, 29 Aug 2019 08:23:47 +1000
> The command `ifconfig axe0 down' often locks up on NET_MPSAFE kernel.
skrll added this in rev 1.76, and it comes with this comment:
/*
* softnet_lock can be taken when NET_MPAFE is not defined when ca=
lling
* if_addr_init -> if_init. This doesn't mix well with the
* usbd_delay_ms calls in the init routines as things like nd6_slo=
wtimo
* can fire during the wait and attempt to take softnet_lock and t=
hen
* block the softclk thread meaning the wait never ends.
*/
#ifndef NET_MPSAFE
/* XXX What to reset? */
/* Wait a little while for the chip to get its brains in order. */
DELAY(1000);
#else
axe_ax_init(un);
#endif
i think this is two problems. (1) the above issue probably
occurs with NET_MPSAFE as well, just less often, and is what
you see here, and (2) we shouldn't reset so completely for
stop, just just disable/turn off stuff.
(fortunately, usbnet appears to be unrelated to this :-)
.mrg.
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org, sc.dying@gmail.com
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: kern/54495: stopping axe(4) may locks up if NET_MPSAFE is enabled
Date: Thu, 29 Aug 2019 08:45:05 +1000
actually, i see how usbnet_tick triggers this. some of
the driver did this, and i copied it into all, but i see
how it is espcially bad for axe(4) -- it has many many
more usbd_delay_us() calls compared to most of the other
usbnet converted drivers.
i can see a few ideas for fixes.
- remove any access to adaptive locks in softint context,
which comes down to don't take the lock in usbnet_tick().
- arrange to have the un_lock passed to kpause so it is
dropped and re-taken during kpause.
- push *all* the work in usbnet_tick() into the
usbnet_tick_task() (hm, this may also be #1.)
i think the first idea is best and simplest. i suspect
that it needs to be written like;
struct usbnet_private * const unp =3D un->un_pri;
if (unp !=3D NULL && !unp->unp_stopping && !unp->unp_dying) {
/* Perform periodic stuff in process context */
usb_add_task(un->un_udev, &unp->unp_ticktask, USB_TASKQ_DR=
IVER);
}
can you test? my axe(4) is broken.
(i've got a few more fixes to check unp for NULL in a
tree i'm testing, thanks to a report from maxv about
upl(4) crash during detach, which is the only fixed
part so far, though the rest of them are more safety
than actually fixing bugs.)
thanks.
.mrg.
From: sc.dying@gmail.com
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/54495: stopping axe(4) may locks up if NET_MPSAFE is enabled
Date: Thu, 29 Aug 2019 08:14:09 +0000
On Wed, Aug 28, 2019 at 10:45 PM matthew green <mrg@eterna.com.au> wrote:
>
> actually, i see how usbnet_tick triggers this. some of
> the driver did this, and i copied it into all, but i see
> how it is espcially bad for axe(4) -- it has many many
> more usbd_delay_us() calls compared to most of the other
> usbnet converted drivers.
>
> i can see a few ideas for fixes.
>
> - remove any access to adaptive locks in softint context,
> which comes down to don't take the lock in usbnet_tick().
>
> - arrange to have the un_lock passed to kpause so it is
> dropped and re-taken during kpause.
>
> - push *all* the work in usbnet_tick() into the
> usbnet_tick_task() (hm, this may also be #1.)
My idea was
- call callout_halt instead of callout_stop to wait for its completion
before calling unp_stop.
It does not solve problem.
> i think the first idea is best and simplest. i suspect
> that it needs to be written like;
>
> struct usbnet_private * const unp = un->un_pri;
>
> if (unp != NULL && !unp->unp_stopping && !unp->unp_dying) {
> /* Perform periodic stuff in process context */
> usb_add_task(un->un_udev, &unp->unp_ticktask, USB_TASKQ_DRIVER);
> }
>
> can you test? my axe(4) is broken.
It works. I repeated 10 times ifconfig up and down but it does not lock up.
>
> (i've got a few more fixes to check unp for NULL in a
> tree i'm testing, thanks to a report from maxv about
> upl(4) crash during detach, which is the only fixed
> part so far, though the rest of them are more safety
> than actually fixing bugs.)
>
> thanks.
>
>
> .mrg.
Thanks.
From: "matthew green" <mrg@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54495 CVS commit: src/sys/dev/usb
Date: Thu, 29 Aug 2019 09:17:51 +0000
Module Name: src
Committed By: mrg
Date: Thu Aug 29 09:17:51 UTC 2019
Modified Files:
src/sys/dev/usb: usbnet.c
Log Message:
fix a lock hang reported by sc.dying in PR#54495.
remove locking in usbnet_tick(). assume that all locking
needs are handled inside usbnet_tick_task(), which runs in
the usbtask thread. ensure that usbnet private is valid
before using it.
also check NULL private pointer in usbnet_isdying().
all the other cases should never happen.
To generate a diff of this commit:
cvs rdiff -u -r1.24 -r1.25 src/sys/dev/usb/usbnet.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->pending-pullups
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Thu, 29 Aug 2019 09:47:04 +0000
State-Changed-Why:
supplied fix works; pullups along with usbnet are pending yet.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54495 CVS commit: [netbsd-9] src/sys/dev/usb
Date: Fri, 13 Sep 2019 06:51:59 +0000
Module Name: src
Committed By: martin
Date: Fri Sep 13 06:51:59 UTC 2019
Modified Files:
src/sys/dev/usb [netbsd-9]: if_upl.c if_url.c if_urndis.c usb_mem.c
usbnet.c usbnet.h
Log Message:
Pull up following revision(s) (requested by mrg in ticket #197):
sys/dev/usb/if_url.c: revision 1.71
sys/dev/usb/usbnet.h: revision 1.15
sys/dev/usb/usb_mem.c: revision 1.72
sys/dev/usb/if_urndis.c: revision 1.24
sys/dev/usb/if_upl.c: revision 1.67
sys/dev/usb/usbnet.c: revision 1.25
sys/dev/usb/usbnet.c: revision 1.26
sys/dev/usb/usbnet.c: revision 1.27
sys/dev/usb/usbnet.c: revision 1.28
Teach urndis to handle some REMOTE_NDIS_INDICATE_STATUS_MSG. If the status
is reasonable, don't tell userland we got an error. Stops spurious EIO.
From openbsd.
Fix bug, remove {0,0} because we switched to usb_lookup().
s/no free/no freelist entry/ in a debug message.
fix a lock hang reported by sc.dying in PR#54495.
remove locking in usbnet_tick(). assume that all locking
needs are handled inside usbnet_tick_task(), which runs in
the usbtask thread. ensure that usbnet private is valid
before using it.
also check NULL private pointer in usbnet_isdying().
all the other cases should never happen.
don't try to set dying when we haven't usbnet_attach()d yet.
reported by maxv.
- use CALLARGS vs CALLED for better usbhist
- turn off usbnetdebug default
- log for all entry/exit points of usbnet_pipe_intr()
- in usbnet_start_locked() track whether any packet has been
transmitted for setting the timer. avoids spurious
"watchdog timeouts"
- in usbnet_stop() use callout_halt() vs callout_halt, and
also stop the usb task. fixes crash of usbtask after the
phy has detached.
- add a little more defensive checking in the tick task, and
add some high-log-level logs.
- in usbnet_detach() move the call to usbnet_stop_ifp() above
the calls to callout/usbtask stopping.
- set ec_mii and unp_pri to NULL when freeing their data
normalise an error message.
document usbnet_private locking. minor knf.
To generate a diff of this commit:
cvs rdiff -u -r1.64.2.1 -r1.64.2.2 src/sys/dev/usb/if_upl.c
cvs rdiff -u -r1.66.2.1 -r1.66.2.2 src/sys/dev/usb/if_url.c
cvs rdiff -u -r1.21.4.1 -r1.21.4.2 src/sys/dev/usb/if_urndis.c
cvs rdiff -u -r1.70.10.1 -r1.70.10.2 src/sys/dev/usb/usb_mem.c
cvs rdiff -u -r1.25.2.2 -r1.25.2.3 src/sys/dev/usb/usbnet.c
cvs rdiff -u -r1.14.2.2 -r1.14.2.3 src/sys/dev/usb/usbnet.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: pending-pullups->closed
State-Changed-By: maya@NetBSD.org
State-Changed-When: Tue, 21 Apr 2020 19:35:07 +0000
State-Changed-Why:
Closed as it was reported fixed and pulled up. Please do not hesitate to suggest any additional changes (although, it might be better received in a separate bug report). THanks for the report!
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.