NetBSD Problem Report #54495

From sc.dying@gmail.com  Wed Aug 28 06:13:36 2019
Return-Path: <sc.dying@gmail.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 336D87A14B
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 28 Aug 2019 06:13:36 +0000 (UTC)
Message-Id: <614e1030-3f60-0956-40e7-e39d72b2b6e5@gmail.com>
Date: Wed, 28 Aug 2019 06:13:27 +0000
From: sc.dying@gmail.com
Reply-To:
To: gnats-bugs@NetBSD.org
Subject: stopping axe(4) may locks up if NET_MPSAFE is enabled

>Number:         54495
>Category:       kern
>Synopsis:       stopping axe(4) may locks up if NET_MPSAFE is enabled
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Aug 28 06:15:00 +0000 2019
>Closed-Date:    Tue Apr 21 19:35:07 +0000 2020
>Last-Modified:  Tue Apr 21 19:35:07 +0000 2020
>Originator:     sc.dying@gmail.com
>Release:        NetBSD 9.99.10
>Organization:
>Environment:
	Boardtype: Raspberry Pi 3 Model B Rev 1.2
	System: NetBSD-earmv7hf/9.99.10 (GENERIC) with NET_MPSAFE enabled
	Architecture: earmv7hf
	Machine: evbarm
>Description:
The command `ifconfig axe0 down' often locks up on NET_MPSAFE kernel.

Suspicious processes are ifconfig and softclock.  It looks like that
ifconfig thread holds unp_lock and waits for kpause is waked up,
and usbnet_tick waits for unp_lock released to hold it.
usbnet_stop() sets unp_stopping, but usbnet_tick() checks it after
mutex_enter(unp_lock), so softclock thread is stalled.

The problem is easily reproduced because my axe(4) uses AX88772B so axe
kpauses 600ms at least in axe_reset if NET_MPSAFE is defined.


crash> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
15   >   1 7   0         0           911be120              crash
13       1 3   0         0           90fe9700           ifconfig usbdly
6        1 3   0        80           90e0b3c0                 sh wait
1        1 3   0        80           90fab420               init wait
0       92 3   3       200           90fe9440          pooldrain pooldrain
0       91 3   3       200           90fe9180           aiodoned aiodoned
0       90 3   0       200           90fe8ec0            ioflush syncer
0       89 3   3       200           90fe8c00           pgdaemon pgdaemon
0       88 3   1       280           90e50620          VCHIQka-0 lnxcmplt
0       87 3   3       280           90fe8940            vcaudio data
0       86 3   0       280           90fe8680           VCHIQs-0 semacv
0       85 3   3       280           90fe83c0           VCHIQr-0 semacv
0       84 3   0       280           90fe8100            VCHIQ-0 semacv
0       83 5   0       200           90fabc60           (zombie)
0       81 3   3       200           90fab9a0               usb0 usbevt
0       80 3   2       200           90fab6e0            npfgc-0 npfgccv
0       79 3   0       200           90fab160            rt_free rt_free
0       78 3   0       200           90faaea0              unpgc unpgc
0       77 3   0       200           90faabe0    key_timehandler key_timehandler
0       76 3   3       200           90faa920    icmp6_wqinput/3 icmp6_wqinput
0       75 3   2       200           90faa660    icmp6_wqinput/2 icmp6_wqinput
0       74 3   1       200           90faa3a0    icmp6_wqinput/1 icmp6_wqinput
0       73 3   0       200           90e500a0    icmp6_wqinput/0 icmp6_wqinput
0       72 3   0       200           90faa0e0          nd6_timer nd6_timer
0       71 3   3       200           90e73c40    carp6_wqinput/3 carp6_wqinput
0       70 3   2       200           90e73980    carp6_wqinput/2 carp6_wqinput
0       69 3   1       200           90e736c0    carp6_wqinput/1 carp6_wqinput
0       68 3   0       200           90e73400    carp6_wqinput/0 carp6_wqinput
0       67 3   3       200           90e73140     carp_wqinput/3 carp_wqinput
0       66 3   2       200           90e72e80     carp_wqinput/2 carp_wqinput
0       65 3   1       200           90e72bc0     carp_wqinput/1 carp_wqinput
0       64 3   0       200           90e72900     carp_wqinput/0 carp_wqinput
0       63 3   0       200           90e72640       tcp_slowtimo tcp_slowtimo
0       62 3   3       200           90e72380     icmp_wqinput/3 icmp_wqinput
0       61 3   2       200           90e720c0     icmp_wqinput/2 icmp_wqinput
0       60 3   1       200           90e51c20     icmp_wqinput/1 icmp_wqinput
0       59 3   0       200           90e51960     icmp_wqinput/0 icmp_wqinput
0       58 3   0       200           90e516a0           rt_timer rt_timer
0       56 3   2       200           90e508e0        vmem_rehash vmem_rehash
0       55 3   1       200           90e50360            vcmbox0 vcmbox0
0       54 3   0       200           90e513e0         usbtask-dr usbtsk
0       53 3   0       200           90e51120         usbtask-hc usbtsk
0       52 3   0       200           90e50e60               dwc2 dwc2
0       51 3   0       200           90e50ba0             sdmmc1 mmctaskq
0       50 3   0       200           90e0b100             sdmmc0 mmctaskq
0       40 3   0       200           90e0ae40               iic2 iicintr
0       39 3   0       200           90e0ab80               iic1 iicintr
0       38 3   0       200           90e0a8c0               iic0 iicintr
0       37 3   3       200           90e0a600            xcall/3 xcall
0       36 1   3       200           90e0a340          softser/3
0       35 1   3       200           90e0a080          softclk/3
0       34 1   3       200           90b85be0          softbio/3
0       33 1   3       200           90b85920          softnet/3
0    >  32 7   3       201           90b85660             idle/3
0       31 3   2       200           90b853a0            xcall/2 xcall
0       30 1   2       200           90b850e0          softser/2
0       29 1   2       200           90b84e20          softclk/2
0       28 1   2       200           90b84b60          softbio/2
0       27 1   2       200           90b848a0          softnet/2
0    >  26 7   2       201           90b845e0             idle/2
0       25 3   1       200           90b84320            xcall/1 xcall
0       24 1   1       200           90b84060          softser/1
0       23 1   1       200           90b6bbc0          softclk/1
0       22 1   1       200           90b6b900          softbio/1
0       21 1   1       200           90b6b640          softnet/1
0    >  20 7   1       201           90b6b380             idle/1
0       19 3   2       200           90b6b0c0           lnxpwrwq lnxpwrwq
0       18 3   1       200           90b6ae00           lnxlngwq lnxlngwq
0       17 3   0       200           90b6ab40           lnxsyswq lnxsyswq
0       16 3   0       200           90b6a880           lnxrcugc lnxrcugc
0       15 3   0       200           90b6a5c0             sysmon smtaskq
0       14 3   2       200           90b6a300         pmfsuspend pmfsuspend
0       13 3   3       200           90b6a040           pmfevent pmfevent
0       12 3   0       200           90b57ba0         sopendfree sopendfr
0       11 3   0       200           90b578e0           nfssilly nfssilly
0       10 3   0       200           90b57620            cachegc cachegc
0        9 3   0       200           90b57360             vdrain vdrain
0        8 3   0       200           90b570a0          modunload mod_unld
0        7 3   0       200           90b56de0            xcall/0 xcall
0        6 1   0       200           90b56b20          softser/0
0        5 3   0       200           90b56860          softclk/0 tstile
0        4 1   0       200           90b565a0          softbio/0
0        3 1   0       200           90b562e0          softnet/0
0        2 1   0       201           90b56020             idle/0
0        1 3   3       200           807cad80            swapper uvm

crash> bt/a 90fe9700
trace: pid 13 lid 1 at 0x9ae7fbb4
0x9ae7fbb4: mi_switch+0x10
0x9ae7fbe4: sleepq_block+0x110
0x9ae7fc24: kpause+0x138
0x9ae7fc64: axe_ax_init+0x79c
0x9ae7fc7c: axe_stop+0x20
0x9ae7fcac: usbnet_stop+0x7c
0x9ae7fccc: usbnet_stop_ifp+0x34
0x9ae7fce4: ether_ioctl_reinit+0x60
0x9ae7fd04: usbnet_ioctl+0x3c
0x9ae7fe24: doifioctl+0x8c8
0x9ae7feec: sys_ioctl+0x424
0x9ae7ffac: syscall+0x12c
--- tf 0x9ae7ffb0 ---

crash> bt/a 90b56860
trace: pid 0 lid 5 at 0x99de3df4
0x99de3df4: mi_switch+0x10
0x99de3e24: sleepq_block+0xb4
0x99de3e74: turnstile_block+0x4c4
0x99de3ee4: mutex_enter+0x234
0x99de3efc: usbnet_tick+0x1c
0x99de3f44: callout_softclock+0x134
0x99de3fac: softint_dispatch+0x120
Bad frame pointer: 0x80b41f04


>How-To-Repeat:
Build the kernel with NET_MPSAFE enabled, connect axe(4), and
repeat ifconfig it up and down.

>Fix:

>Release-Note:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org, skrll@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/54495: stopping axe(4) may locks up if NET_MPSAFE is enabled
Date: Thu, 29 Aug 2019 08:23:47 +1000

 > The command `ifconfig axe0 down' often locks up on NET_MPSAFE kernel.

 skrll added this in rev 1.76, and it comes with this comment:

         /*
          * softnet_lock can be taken when NET_MPAFE is not defined when ca=
 lling
          * if_addr_init -> if_init.  This doesn't mix well with the
          * usbd_delay_ms calls in the init routines as things like nd6_slo=
 wtimo
          * can fire during the wait and attempt to take softnet_lock and t=
 hen
          * block the softclk thread meaning the wait never ends.
          */
 #ifndef NET_MPSAFE
         /* XXX What to reset? */

         /* Wait a little while for the chip to get its brains in order. */
         DELAY(1000);
 #else
         axe_ax_init(un);
 #endif


 i think this is two problems.  (1) the above issue probably
 occurs with NET_MPSAFE as well, just less often, and is what
 you see here, and (2) we shouldn't reset so completely for
 stop, just just disable/turn off stuff.


 (fortunately, usbnet appears to be unrelated to this :-)


 .mrg.

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org, sc.dying@gmail.com
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/54495: stopping axe(4) may locks up if NET_MPSAFE is enabled
Date: Thu, 29 Aug 2019 08:45:05 +1000

 actually, i see how usbnet_tick triggers this.  some of
 the driver did this, and i copied it into all, but i see
 how it is espcially bad for axe(4) -- it has many many
 more usbd_delay_us() calls compared to most of the other
 usbnet converted drivers.

 i can see a few ideas for fixes.

 - remove any access to adaptive locks in softint context,
   which comes down to don't take the lock in usbnet_tick().

 - arrange to have the un_lock passed to kpause so it is
   dropped and re-taken during kpause.

 - push *all* the work in usbnet_tick() into the
   usbnet_tick_task() (hm, this may also be #1.)


 i think the first idea is best and simplest.  i suspect
 that it needs to be written like;

         struct usbnet_private * const unp =3D un->un_pri;

         if (unp !=3D NULL && !unp->unp_stopping && !unp->unp_dying) {
                 /* Perform periodic stuff in process context */
                 usb_add_task(un->un_udev, &unp->unp_ticktask, USB_TASKQ_DR=
 IVER);
         }

 can you test?  my axe(4) is broken.

 (i've got a few more fixes to check unp for NULL in a
 tree i'm testing, thanks to a report from maxv about
 upl(4) crash during detach, which is the only fixed
 part so far, though the rest of them are more safety
 than actually fixing bugs.)

 thanks.


 .mrg.

From: sc.dying@gmail.com
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/54495: stopping axe(4) may locks up if NET_MPSAFE is enabled
Date: Thu, 29 Aug 2019 08:14:09 +0000

 On Wed, Aug 28, 2019 at 10:45 PM matthew green <mrg@eterna.com.au> wrote:
 >
 > actually, i see how usbnet_tick triggers this.  some of
 > the driver did this, and i copied it into all, but i see
 > how it is espcially bad for axe(4) -- it has many many
 > more usbd_delay_us() calls compared to most of the other
 > usbnet converted drivers.
 >
 > i can see a few ideas for fixes.
 >
 > - remove any access to adaptive locks in softint context,
 >   which comes down to don't take the lock in usbnet_tick().
 >
 > - arrange to have the un_lock passed to kpause so it is
 >   dropped and re-taken during kpause.
 >
 > - push *all* the work in usbnet_tick() into the
 >   usbnet_tick_task() (hm, this may also be #1.)

 My idea was
 - call callout_halt instead of callout_stop to wait for its completion
   before calling unp_stop.

 It does not solve problem.

 > i think the first idea is best and simplest.  i suspect
 > that it needs to be written like;
 >
 >         struct usbnet_private * const unp = un->un_pri;
 >
 >         if (unp != NULL && !unp->unp_stopping && !unp->unp_dying) {
 >                 /* Perform periodic stuff in process context */
 >                 usb_add_task(un->un_udev, &unp->unp_ticktask, USB_TASKQ_DRIVER);
 >         }
 >
 > can you test?  my axe(4) is broken.

 It works. I repeated 10 times ifconfig up and down but it does not lock up.

 >
 > (i've got a few more fixes to check unp for NULL in a
 > tree i'm testing, thanks to a report from maxv about
 > upl(4) crash during detach, which is the only fixed
 > part so far, though the rest of them are more safety
 > than actually fixing bugs.)
 >
 > thanks.
 >
 >
 > .mrg.

 Thanks.

From: "matthew green" <mrg@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54495 CVS commit: src/sys/dev/usb
Date: Thu, 29 Aug 2019 09:17:51 +0000

 Module Name:	src
 Committed By:	mrg
 Date:		Thu Aug 29 09:17:51 UTC 2019

 Modified Files:
 	src/sys/dev/usb: usbnet.c

 Log Message:
 fix a lock hang reported by sc.dying in PR#54495.

 remove locking in usbnet_tick().  assume that all locking
 needs are handled inside usbnet_tick_task(), which runs in
 the usbtask thread.  ensure that usbnet private is valid
 before using it.

 also check NULL private pointer in usbnet_isdying().
 all the other cases should never happen.


 To generate a diff of this commit:
 cvs rdiff -u -r1.24 -r1.25 src/sys/dev/usb/usbnet.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->pending-pullups
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Thu, 29 Aug 2019 09:47:04 +0000
State-Changed-Why:
supplied fix works; pullups along with usbnet are pending yet.


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54495 CVS commit: [netbsd-9] src/sys/dev/usb
Date: Fri, 13 Sep 2019 06:51:59 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Fri Sep 13 06:51:59 UTC 2019

 Modified Files:
 	src/sys/dev/usb [netbsd-9]: if_upl.c if_url.c if_urndis.c usb_mem.c
 	    usbnet.c usbnet.h

 Log Message:
 Pull up following revision(s) (requested by mrg in ticket #197):

 	sys/dev/usb/if_url.c: revision 1.71
 	sys/dev/usb/usbnet.h: revision 1.15
 	sys/dev/usb/usb_mem.c: revision 1.72
 	sys/dev/usb/if_urndis.c: revision 1.24
 	sys/dev/usb/if_upl.c: revision 1.67
 	sys/dev/usb/usbnet.c: revision 1.25
 	sys/dev/usb/usbnet.c: revision 1.26
 	sys/dev/usb/usbnet.c: revision 1.27
 	sys/dev/usb/usbnet.c: revision 1.28

 Teach urndis to handle some REMOTE_NDIS_INDICATE_STATUS_MSG. If the status
 is reasonable, don't tell userland we got an error. Stops spurious EIO.
 From openbsd.

 Fix bug, remove {0,0} because we switched to usb_lookup().

 s/no free/no freelist entry/ in a debug message.

 fix a lock hang reported by sc.dying in PR#54495.

 remove locking in usbnet_tick().  assume that all locking
 needs are handled inside usbnet_tick_task(), which runs in
 the usbtask thread.  ensure that usbnet private is valid
 before using it.

 also check NULL private pointer in usbnet_isdying().
 all the other cases should never happen.

 don't try to set dying when we haven't usbnet_attach()d yet.
 reported by maxv.

 - use CALLARGS vs CALLED for better usbhist
 - turn off usbnetdebug default
 - log for all entry/exit points of usbnet_pipe_intr()
 - in usbnet_start_locked() track whether any packet has been
   transmitted for setting the timer.  avoids spurious
   "watchdog timeouts"
 - in usbnet_stop() use callout_halt() vs callout_halt, and
   also stop the usb task.  fixes crash of usbtask after the
   phy has detached.
 - add a little more defensive checking in the tick task, and
   add some high-log-level logs.
 - in usbnet_detach() move the call to usbnet_stop_ifp() above
   the calls to callout/usbtask stopping.
 - set ec_mii and unp_pri to NULL when freeing their data

 normalise an error message.

 document usbnet_private locking.  minor knf.


 To generate a diff of this commit:
 cvs rdiff -u -r1.64.2.1 -r1.64.2.2 src/sys/dev/usb/if_upl.c
 cvs rdiff -u -r1.66.2.1 -r1.66.2.2 src/sys/dev/usb/if_url.c
 cvs rdiff -u -r1.21.4.1 -r1.21.4.2 src/sys/dev/usb/if_urndis.c
 cvs rdiff -u -r1.70.10.1 -r1.70.10.2 src/sys/dev/usb/usb_mem.c
 cvs rdiff -u -r1.25.2.2 -r1.25.2.3 src/sys/dev/usb/usbnet.c
 cvs rdiff -u -r1.14.2.2 -r1.14.2.3 src/sys/dev/usb/usbnet.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: maya@NetBSD.org
State-Changed-When: Tue, 21 Apr 2020 19:35:07 +0000
State-Changed-Why:
Closed as it was reported fixed and pulled up. Please do not hesitate to suggest any additional changes (although, it might be better received in a separate bug report). THanks for the report!


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.