NetBSD Problem Report #59497

From stix@stix.id.au  Tue Jul  1 09:19:42 2025
Return-Path: <stix@stix.id.au>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 22F561A923A
	for <gnats-bugs@gnats.NetBSD.org>; Tue,  1 Jul 2025 09:19:42 +0000 (UTC)
Message-Id: <20250701091227.833741A010@stix.id.au>
Date: Tue,  1 Jul 2025 19:12:27 +1000 (AEST)
From: stix@stix.id.au
Reply-To: stix@stix.id.au
To: gnats-bugs@NetBSD.org
Subject: Panic in ucompoll
X-Send-Pr-Version: 3.95

>Number:         59497
>Notify-List:    bad@bsd.de
>Category:       kern
>Synopsis:       Panic in ucompoll
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jul 01 09:20:00 +0000 2025
>Last-Modified:  Sun Jul 20 13:25:01 +0000 2025
>Originator:     Paul Ripke
>Release:        NetBSD 10.1_STABLE
>Organization:
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.
>Environment:
System: NetBSD slave 10.1_STABLE NetBSD 10.1_STABLE (SLAVE) #17: Fri Apr 18 13:51:35 AEST 2025 stix@slave:/home/netbsd/netbsd-10/obj.amd64/home/netbsd/netbsd-10/src/sys/arch/amd64/compile/SLAVE amd64
Architecture: x86_64
Machine: amd64
>Description:
Crash appears due to intermittent disconnect/reconnect of a uplcom device while open.

Device is:
addr 54: full speed, power 100 mA, config 1, USB-Serial Controller D(0x2303), Prolific Technology Inc.(0x067b), rev 4.00(0x0400)

uplcom0 at uhub9 port 1
uplcom0: Prolific Technology Inc. (0x067b) USB-Serial Controller D (0x2303), rev 1.10/4.00, addr 5
ucom0 at uplcom0

Periodically:
Jun 28 14:27:01 slave /netbsd: [ 2332157.9354694] ucom2: detached
Jun 28 14:27:01 slave /netbsd: [ 2332157.9354694] uplcom1: detached
Jun 28 14:27:01 slave /netbsd: [ 2332157.9354694] uplcom1: at uhub1 port 8 (addr 52) disconnected
Jun 28 14:27:10 slave /netbsd: [ 2332166.7886134] uplcom1 at uhub1 port 8
Jun 28 14:27:10 slave /netbsd: [ 2332166.7886134] uplcom1: Prolific Technology Inc. (0x067b) USB-Serial Controller D (0x2303), rev 1.10/4.00, addr 53
Jun 28 14:27:10 slave /netbsd: [ 2332166.8096137] ucom2 at uplcom1
Jun 28 14:27:10 slave /netbsd: [ 2332166.8246139] ucom2: detached
Jun 28 14:27:10 slave /netbsd: [ 2332166.8246139] uplcom1: detached
Jun 28 14:27:10 slave /netbsd: [ 2332166.8246139] uplcom1: at uhub1 port 8 (addr 53) disconnected
Jun 28 14:27:11 slave /netbsd: [ 2332167.3396223] uplcom1 at uhub1 port 8
Jun 28 14:27:11 slave /netbsd: [ 2332167.3396223] uplcom1: Prolific Technology Inc. (0x067b) USB-Serial Controller D (0x2303), rev 1.10/4.00, addr 54
Jun 28 14:27:11 slave /netbsd: [ 2332167.3606226] ucom2 at uplcom1

crash> bt
__kernel_end() at 0
kern_reboot() at sys_reboot
vpanic() at vpanic+0x18d
panic() at vprintf
trap() at startlwp
--- trap (number 6) ---
ucompoll() at ucompoll+0x2a
cdev_poll() at cdev_poll+0x87
spec_poll() at spec_poll+0x6a
VOP_POLL() at VOP_POLL+0x5d
sel_do_scan() at sel_do_scan+0x3ba
selcommon() at selcommon+0x309
sys___select50() at sys___select50+0x75
syscall() at syscall+0x1fc
--- syscall (number 417) ---
syscall+0x1fc:

Have core and kernel with symbols.

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: Christoph Badura <bad@bsd.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/59497: Panic in ucompoll
Date: Wed, 2 Jul 2025 00:47:37 +0200

 On Tue, Jul 01, 2025 at 09:20:00AM +0000, stix@stix.id.au wrote:
 > Crash appears due to intermittent disconnect/reconnect of a uplcom device while open.

 Are you sure this is a genuine Prolific device?  I've tried to get some
 Prolific USB serial fobs at the start of the year and found that the market
 is swamped with buggy fake prolific chips.  Even supposedly reputable
 manufacturers had fake chips on the fobs that claimed to be PL2303HX /
 PL2303HXD.  In the end I managed to get some fobs with genuine Prolific
 chips for some USD 20 per fob.  The fake ones all sold for about USD 3-4 and
 were easily identifiable by the missing part number and Prolific logo on the
 SSOP chip.

 The real ones also don't periodically disconnect/reconnect. :-)

 Of course, using the fake chips shouldn't crash the system.

 Obviously you were running a process that had the corresponding ttyUX open
 when the crash happened.  Otherwise it wouldn't have been triggered from
 the select(2) code.  Can you please describe what command exactly you were
 running and what its command line options and other configuration settings
 were.  I'd like to try to reproduce this locally.

 > crash> bt
 > __kernel_end() at 0
 > kern_reboot() at sys_reboot
 > vpanic() at vpanic+0x18d
 > panic() at vprintf
 > trap() at startlwp
 > --- trap (number 6) ---
 > ucompoll() at ucompoll+0x2a
 > cdev_poll() at cdev_poll+0x87
 > spec_poll() at spec_poll+0x6a
 > VOP_POLL() at VOP_POLL+0x5d
 > sel_do_scan() at sel_do_scan+0x3ba
 > selcommon() at selcommon+0x309
 > sys___select50() at sys___select50+0x75
 > syscall() at syscall+0x1fc
 > --- syscall (number 417) ---
 > syscall+0x1fc:
 > 
 > Have core and kernel with symbols.

 Could you try to disassemble the ucompoll() until the offending
 instruction?

 Could you try to find out if TS_CANCEL is set in tp->t_state?

 > >How-To-Repeat:
 > 
 > >Fix:

 This might be relatively easy to work around.

 ucycom(4) has (https://nxr.netbsd.org/xref/src/sys/dev/usb/ucycom.c#897):

 	if (sc->sc_dying)
 		return EIO;

 of course, it should return POLLHUP.

 uhso has (https://nxr.netbsd.org/xref/src/sys/dev/usb/uhso.c#1791):

 	if (!device_is_active(sc->sc_dev))
 		return POLLHUP;

 So apparently there is no agreement how this should be handled.
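
 For illustration: a d_poll routine returns a mask of poll(2) event bits,
 not an errno, so a returned EIO would be read by the caller as event bits.
 The following is a minimal userland sketch, not kernel code.  has_event()
 and describe_mask() are hypothetical helpers, and the numeric values of EIO
 and the POLL* constants are defined by the platform.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdio.h>

/* Hypothetical helper: is the given poll(2) event bit set in a mask? */
int
has_event(int mask, int bit)
{
	assert(bit != 0);
	return (mask & bit) != 0;
}

/* Print how a caller that expects an event mask would read `value`. */
void
describe_mask(const char *label, int value)
{
	printf("%s: POLLIN=%d POLLOUT=%d POLLERR=%d POLLHUP=%d\n", label,
	    has_event(value, POLLIN), has_event(value, POLLOUT),
	    has_event(value, POLLERR), has_event(value, POLLHUP));
}
```

 Calling describe_mask("EIO", EIO) next to describe_mask("POLLHUP", POLLHUP)
 shows the difference on a given system; which bits are reported for EIO
 depends on that system's constants.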

 Could you try adding

 	if (sc->sc_dying)
 		return POLLHUP;

 before line 853 in ucom.c and see if that makes the symptoms go away?

 But maybe the right fix would be to make ttycancel() deal with any pending
 select()s too?  Or something similar that ties in with the d_cancel
 framework?

 --chris

From: Paul Ripke <stix@stix.id.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, stix@stix.id.au
Subject: Re: kern/59497: Panic in ucompoll
Date: Thu, 3 Jul 2025 21:25:18 +1000

 On Tue, Jul 01, 2025 at 11:00:02PM +0000, Christoph Badura via gnats wrote:
 >  On Tue, Jul 01, 2025 at 09:20:00AM +0000, stix@stix.id.au wrote:
 >  > Crash appears due to intermittent disconnect/reconnect of a uplcom device while open.
 >  
 >  Are you sure this is a genuine Prolific device?  I've tried to get some
 >  Prolific USB serial fobs at the start of the year and found that the market
 >  is swamped with buggy fake prolific chips.  Even supposedly reputable
 >  manufacturers had fake chips on the fobs that claimed to be PL2303HX /
 >  PL2303HXD.  In the end I managed to get some fobs with genuine Prolific
 >  chips for some USD 20 per fob.  The fake ones all sold for about USD 3-4 and
 >  were easily identifiable by the missing part number and Prolific logo on the
 >  SSOP chip.

 I'm really not sure - it's old, and it was cheap. I have used it for the
 serial console on an old Sun SPARCserver 5, but that system now has dodgy RAM
 that needs replacing.

 >  The real ones also don't periodically disconnect/reconnect. :-)

 I should hope not :)
 I was considering shopping around for a USB FTDI-based serial adapter -
 but I wonder if there are also fakes of those on the market...

 >  Of course, using the fake chips shouldn't crash the system.

 Indeed.

 >  Obviously you were running a process that had the corresponding ttyUX open
 >  when the crash happened.  Otherwise it wouldn't have been triggered from
 >  the select(2) code.  Can you please describe what command exactly you were
 >  running and what its command line options and other configuration settings
 >  were.  I'd like to try to reproduce this locally.

 That could be challenging. I had it hooked up to a Tandy Color Computer (coco1)
 at 38400 baud, via alligator clips, and the software was drivewire.py:

 https://github.com/n6il/pyDriveWire

 Basically doing remote floppy disk access over the serial port.

 >  > crash> bt
 >  > __kernel_end() at 0
 >  > kern_reboot() at sys_reboot
 >  > vpanic() at vpanic+0x18d
 >  > panic() at vprintf
 >  > trap() at startlwp
 >  > --- trap (number 6) ---
 >  > ucompoll() at ucompoll+0x2a
 >  > cdev_poll() at cdev_poll+0x87
 >  > spec_poll() at spec_poll+0x6a
 >  > VOP_POLL() at VOP_POLL+0x5d
 >  > sel_do_scan() at sel_do_scan+0x3ba
 >  > selcommon() at selcommon+0x309
 >  > sys___select50() at sys___select50+0x75
 >  > syscall() at syscall+0x1fc
 >  > --- syscall (number 417) ---
 >  > syscall+0x1fc:
 >  > 
 >  > Have core and kernel with symbols.
 >  
 >  Could you try to disassemble the ucompoll() until the offending
 >  instruction?

 That's easy, it's a tiny function:

 (gdb) x/20i ucompoll
    0xffffffff804960a5 <ucompoll>:       push   %rbp
    0xffffffff804960a6 <ucompoll+1>:     mov    %rsp,%rbp
    0xffffffff804960a9 <ucompoll+4>:     push   %r13
    0xffffffff804960ab <ucompoll+6>:     push   %r12
    0xffffffff804960ad <ucompoll+8>:     mov    %esi,%r12d
    0xffffffff804960b0 <ucompoll+11>:    mov    %rdx,%r13
    0xffffffff804960b3 <ucompoll+14>:    mov    %edi,%eax
    0xffffffff804960b5 <ucompoll+16>:    shr    $0xc,%eax
    0xffffffff804960b8 <ucompoll+19>:    movzbl %dil,%esi
    0xffffffff804960bc <ucompoll+23>:    and    $0x3ff00,%eax
    0xffffffff804960c1 <ucompoll+28>:    or     %eax,%esi
    0xffffffff804960c3 <ucompoll+30>:    mov    $0xffffffff81896660,%rdi
    0xffffffff804960ca <ucompoll+37>:    call   0xffffffff80e42be0 <device_lookup_private>
    0xffffffff804960cf <ucompoll+42>:    mov    0xe8(%rax),%rdi		<------
    0xffffffff804960d6 <ucompoll+49>:    mov    0x168(%rdi),%rax
    0xffffffff804960dd <ucompoll+56>:    mov    0x60(%rax),%rax
    0xffffffff804960e1 <ucompoll+60>:    mov    %r13,%rdx
    0xffffffff804960e4 <ucompoll+63>:    mov    %r12d,%esi
    0xffffffff804960e7 <ucompoll+66>:    pop    %r12
    0xffffffff804960e9 <ucompoll+68>:    pop    %r13

 >  Could you try to find out if TS_CANCEL is set in tp->t_state?

 Yeah, I was actually wondering how to do that. I can't figure out for the
 life of me how to switch between cpu stacks in gdb. I realize most of the
 kernel debugging I've done has been on single cpu machines...

 However, doesn't this imply sc is null?

 (gdb) p ucom_cd
 $9 = {
   cd_list = {
     le_next = 0xffffffff818966a0 <umidi_cd>,
     le_prev = 0xffffffff81896620 <ugen_cd>
   },
   cd_attach = {
     lh_first = 0xffffffff81815260 <ucom_ca>
   },
   cd_devs = 0x0,
   cd_name = 0xffffffff813e59e8 "ucom",
   cd_class = DV_DULL,
   cd_ndevs = 0,
   cd_attrs = 0x0
 }

 >  This might be relatively easy to work around.
 >  
 >  ucycom(4) has (https://nxr.netbsd.org/xref/src/sys/dev/usb/ucycom.c#897):
 >  
 >  	if (sc->sc_dying)
 >  		return EIO;
 >  
 >  of course, it should return POLLHUP.
 >  
 >  uhso has (https://nxr.netbsd.org/xref/src/sys/dev/usb/uhso.c#1791):
 >  
 >  	if (!device_is_active(sc->sc_dev))
 >  		return POLLHUP;
 >  
 >  So apparently there is no agreement how this should be handled.
 >  
 >  Could you try adding
 >  
 >  	if (sc->sc_dying)
 >  		return POLLHUP;
 >  
 >  before line 853 in ucom.c and see if that makes the symptoms go away?

 or perhaps:

   if (sc == NULL)
     return POLLHUP;

 ?
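
 The difference between the two proposed checks can be modelled in plain C.
 In this sketch, struct model_softc, model_attach(), model_lookup() and
 model_poll() are hypothetical stand-ins, the last two for
 device_lookup_private() and ucompoll().  It assumes, as the ucom_cd dump
 above suggests, that the lookup returns NULL once the device has detached.
 Testing sc->sc_dying alone would still dereference the NULL pointer, so the
 NULL check has to come first.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver softc. */
struct model_softc {
	bool sc_dying;
};

#define MODEL_NUNITS 4
static struct model_softc *model_devs[MODEL_NUNITS];

/* Register (or, with sc == NULL, remove) the softc for a unit. */
void
model_attach(int unit, struct model_softc *sc)
{
	assert(unit >= 0 && unit < MODEL_NUNITS);
	model_devs[unit] = sc;
}

/* Stand-in for device_lookup_private(): NULL when nothing is attached. */
struct model_softc *
model_lookup(int unit)
{
	if (unit < 0 || unit >= MODEL_NUNITS)
		return NULL;
	return model_devs[unit];
}

/* Returns a poll(2) event mask, like a d_poll routine. */
int
model_poll(int unit, int events)
{
	struct model_softc *sc = model_lookup(unit);

	if (sc == NULL)		/* detached: no softc left to inspect */
		return POLLHUP;
	if (sc->sc_dying)	/* detach in progress */
		return POLLHUP;
	return events & (POLLIN | POLLOUT);	/* placeholder for the tty poll */
}
```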

 >  But maybe the right fix would be to make ttycancel() deal with any pending
 >  select()s too?  Or something similar that ties in with the d_cancel
 >  framework?

 Yeah, I haven't studied the code that much as yet.

 -- 
 Paul Ripke
 "Great minds discuss ideas, average minds discuss events, small minds
  discuss people."
 -- Disputed: Often attributed to Eleanor Roosevelt. 1948.

From: Christoph Badura <bad@bsd.de>
To: 
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
	gnats-admin@netbsd.org
Subject: Re: kern/59497: Panic in ucompoll
Date: Fri, 4 Jul 2025 00:13:27 +0200

 On Thu, Jul 03, 2025 at 09:25:18PM +1000, Paul Ripke wrote:
 > On Tue, Jul 01, 2025 at 11:00:02PM +0000, Christoph Badura via gnats wrote:
 > >  On Tue, Jul 01, 2025 at 09:20:00AM +0000, stix@stix.id.au wrote:
 > I'm really not sure - it's old, and it was cheap. I have used it for the
 > serial console on an old Sun SPARCserver 5, but that system now has dodgy RAM
 > that needs replacing.

 The photo of the chip that you sent me privately makes it clear that it is a
 genuine PL-2303HX.  Good for you, I guess.  Bad for us as it suggests we
 have a bug in our driver that causes the disconnects.

 > I was considering shopping around for a USB FTDI-based serial adapter -
 > but I wonder if there are also fakes of those on the market...

 I think there are also fakes on the market.  Genuine FTDI fobs seem to be
 available mostly via Mouser, Farnell, etc.  I ended up buying a couple at
 ~USD25 from Farnell earlier this year, before I could hunt down a source
 for genuine Prolific fobs -- which cost basically the same.

 > >  [...] I'd like to try to reproduce this locally.
 > 
 > That could be challenging. I had it hooked up to a Tandy Color Computer (coco1)
 > at 38400 baud, via alligator clips, and the software was drivewire.py:
 > 
 > https://github.com/n6il/pyDriveWire
 > 
 > Basically doing remote floppy disk access over the serial port.

 Well, I could just try out pyDriveWire without a CoCo (or anything else)
 connected and see if that provokes the crash, too.

 > >  Could you try to disassemble the ucompoll() until the offending
 > >  instruction?
 > 
 > That's easy, it's a tiny function:
 > 
 > (gdb) x/20i ucompoll
 >    0xffffffff804960a5 <ucompoll>:       push   %rbp
 >    0xffffffff804960a6 <ucompoll+1>:     mov    %rsp,%rbp
 >    0xffffffff804960a9 <ucompoll+4>:     push   %r13
 >    0xffffffff804960ab <ucompoll+6>:     push   %r12
 >    0xffffffff804960ad <ucompoll+8>:     mov    %esi,%r12d
 >    0xffffffff804960b0 <ucompoll+11>:    mov    %rdx,%r13
 >    0xffffffff804960b3 <ucompoll+14>:    mov    %edi,%eax
 >    0xffffffff804960b5 <ucompoll+16>:    shr    $0xc,%eax
 >    0xffffffff804960b8 <ucompoll+19>:    movzbl %dil,%esi
 >    0xffffffff804960bc <ucompoll+23>:    and    $0x3ff00,%eax
 >    0xffffffff804960c1 <ucompoll+28>:    or     %eax,%esi
 >    0xffffffff804960c3 <ucompoll+30>:    mov    $0xffffffff81896660,%rdi
 >    0xffffffff804960ca <ucompoll+37>:    call   0xffffffff80e42be0 <device_lookup_private>
 >    0xffffffff804960cf <ucompoll+42>:    mov    0xe8(%rax),%rdi		<------
 >    0xffffffff804960d6 <ucompoll+49>:    mov    0x168(%rdi),%rax
 >    0xffffffff804960dd <ucompoll+56>:    mov    0x60(%rax),%rax
 >    0xffffffff804960e1 <ucompoll+60>:    mov    %r13,%rdx
 >    0xffffffff804960e4 <ucompoll+63>:    mov    %r12d,%esi
 >    0xffffffff804960e7 <ucompoll+66>:    pop    %r12
 >    0xffffffff804960e9 <ucompoll+68>:    pop    %r13
 > 
 > >  Could you try to find out if TS_CANCEL is set in tp->t_state?
 > 
 > Yeah, I was actually wondering how to do that. I can't figure out for the
 > life of me how to switch between cpu stacks in gdb. I realize most of the
 > kernel debugging I've done has been on single cpu machines...
 > 
 > However, doesn't this imply sc is null?

 Yes, that has to be the ``tp = sc->sc_tty'' assignment.

 Do you have the kernel messages right before the panic?  I.e. print the
 contents of msgbuf.  Your original mail only showed what was syslogged,
 didn't it?

 What I'm wondering is if the panic happened between a "ucom2:
 detached\nuplcom1: detached" and a subsequent "uplcom1 at uhub1 port 8".

 sc being null implies the device being detached, if I remember things
 correctly.  Which makes the situation somewhat worse, because detaching
 the device should revoke the open vnode for the device.

 Maybe spec_poll() needs to check if sn->sn_gone is set after calling
 spec_io_enter()?

 https://nxr.netbsd.org/xref/src/sys/miscfs/specfs/spec_vnops.c#1378
 https://nxr.netbsd.org/xref/src/sys/miscfs/specfs/spec_vnops.c#618?

 But maybe that is papering over the symptoms.  I haven't stared long
 enough at the code.

 > >  This might be relatively easy to work around.
 > >  
 > >  ucycom(4) has (https://nxr.netbsd.org/xref/src/sys/dev/usb/ucycom.c#897):
 > >  
 > >  	if (sc->sc_dying)
 > >  		return EIO;
 > >  
 > >  of course, it should return POLLHUP.
 > >  
 > >  uhso has (https://nxr.netbsd.org/xref/src/sys/dev/usb/uhso.c#1791):
 > >  
 > >  	if (!device_is_active(sc->sc_dev))
 > >  		return POLLHUP;
 > >  
 > >  So apparently there is no agreement how this should be handled.
 > >  
 > >  Could you try adding
 > >  
 > >  	if (sc->sc_dying)
 > >  		return POLLHUP;
 > >  
 > >  before line 853 in ucom.c and see if that makes the symptoms go away?
 > 
 > or perhaps:
 > 
 >   if (sc == NULL)
 >     return POLLHUP;
 > 
 > ?

 That certainly would avoid the crash.  But I think it is just papering
 over the symptoms.

 Or maybe it and the other two places should return POLLERR like spec_poll()
 does?

 > >  But maybe the right fix would be to make ttycancel() deal with any pending
 > >  select()s too?  Or something similar that ties in with the d_cancel
 > >  framework?
 > 
 > Yeah, I haven't studied the code that much as yet.

 What a rabbit hole!

 I'm sorry, I don't have time right now and the next 2 weeks to dive down
 into it.  But you do have a local workaround, I think.  And if you can
 debug this further, we would greatly appreciate it.

 --chris

From: Christoph Badura <bad@bsd.de>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
	gnats-admin@netbsd.org
Cc: 
Subject: Re: kern/59497: Panic in ucompoll
Date: Fri, 4 Jul 2025 00:43:52 +0200

 Actually, could you test a -current kernel?

 I missed that you are reporting this against 10.1_STABLE.

 --chris

From: Paul Ripke <stix@stix.id.au>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/59497: Panic in ucompoll
Date: Sat, 5 Jul 2025 17:44:42 +1000

 Re msgbuf, after dumping it out, I realise this crash was actually due to
 a failing USB hub flaking out intermittently. I have seen this device
 intermittently disconnect/reconnect without that hub, so there's still
 something going on.

 [ 3804001.1947412] ukbd0: was console keyboard
 [ 3804001.1947412] wskbd0: detached
 [ 3804001.1987412] ukbd0: detached
 [ 3804001.1987412] uhidev0: detached
 [ 3804001.1987412] uhidev0: at uhub8 port 1 (addr 2) disconnected
 [ 3804001.1987412] wskbd2: disconnecting from wsdisplay0
 [ 3804001.1987412] wskbd2: detached
 [ 3804001.1987412] ukbd3: detached
 [ 3804001.2027413] wskbd1: disconnecting from wsdisplay0
 [ 3804001.2027413] wskbd1: detached
 [ 3804001.2027413] ukbd2: detached
 [ 3804001.2027413] ukbd1: detached
 [ 3804001.2027413] uhid2: detached
 [ 3804001.2027413] uhid1: detached
 [ 3804001.2027413] uhid0: detached
 [ 3804001.2027413] uhidev1: detached
 [ 3804001.2027413] uhidev1: at uhub8 port 1 (addr 2) disconnected
 [ 3804001.2117415] wsmouse0: detached
 [ 3804001.2117415] ums0: detached
 [ 3804001.2117415] uhidev2: detached
 [ 3804001.2117415] uhidev2: at uhub8 port 2 (addr 12) disconnected
 [ 3804001.2127414] uhid7: detached
 [ 3804001.2127414] uhid6: detached
 [ 3804001.2127414] uhid5: detached
 [ 3804001.2157415] wskbd3: disconnecting from wsdisplay0
 [ 3804001.2157415] wskbd3: detached
 [ 3804001.2157415] ukbd4: detached
 [ 3804001.2157415] uhidev4: detached
 [ 3804001.2157415] uhidev4: at uhub8 port 2 (addr 12) disconnected
 [ 3804001.2157415] uhidev5: detached
 [ 3804001.2157415] uhidev5: at uhub8 port 2 (addr 12) disconnected
 [ 3804001.2307417] ucom1: detached
 [ 3804001.2307417] uplcom0: detached
 [ 3804001.2307417] uplcom0: at uhub9 port 1 (addr 13) disconnected
 [ 3804001.2407419] ucom0: detached
 [ 3804001.2407419] umodem0: detached
 [ 3804001.2407419] umodem0: at uhub9 port 2 (addr 11) disconnected
 [ 3804001.2477419] uhub9: detached
 [ 3804001.2477419] uhub9: at uhub8 port 3 (addr 4) disconnected
 [ 3804001.2537420] uhub8: detached
 [ 3804001.2537420] uhub8: at uhub1 port 5 (addr 1) disconnected
 [ 3804001.7737508] uhub8 at uhub1 port 5: GenesysLogic (0x05e3) USB2.0 Hub (0x0610), class 9/0, rev 2.10/92.26, addr 14
 [ 3804001.7737508] uhub8: multiple transaction translators
 [ 3804001.7887511] uhub8: 4 ports with 1 removable, self powered
 [ 3804002.1207568] uvm_fault(0xffffb1c7ac104780, 0x0, 1) -> e
 [ 3804002.1207568] fatal page fault in supervisor mode
 [ 3804002.1207568] trap type 6 code 0 rip 0xffffffff804960cf cs 0x8 rflags 0x10246 cr2 0xe8 ilevel 0 rsp 0xffffb41236bc5bf0
 [ 3804002.1207568] curlwp 0xffffb1ca652e1340 pid 23833.26753 lowest kstack 0xffffb41236bc12c0
 [ 3804002.1207568] panic: trap
 [ 3804002.1207568] cpu1: Begin traceback...
 [ 3804002.1207568] vpanic() at netbsd:vpanic+0x183
 [ 3804002.1217568] panic() at netbsd:panic+0x3c
 [ 3804002.1227568] trap() at netbsd:trap+0xbaf
 [ 3804002.1227568] --- trap (number 6) ---
 [ 3804002.1227568] ucompoll() at netbsd:ucompoll+0x2a
 [ 3804002.1227568] cdev_poll() at netbsd:cdev_poll+0x87
 [ 3804002.1237565] spec_poll() at netbsd:spec_poll+0x6a
 [ 3804002.1237565] VOP_POLL() at netbsd:VOP_POLL+0x5d
 [ 3804002.1247569] sel_do_scan() at netbsd:sel_do_scan+0x3ba
 [ 3804002.1247569] selcommon() at netbsd:selcommon+0x309
 [ 3804002.1247569] sys___select50() at netbsd:sys___select50+0x75
 [ 3804002.1257569] syscall() at netbsd:syscall+0x1fc
 [ 3804002.1257569] --- syscall (number 417) ---
 [ 3804002.1257569] netbsd:syscall+0x1fc:
 [ 3804002.1257569] cpu1: End traceback...
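
 The fault address in the trap line (cr2 0xe8) matches the displacement in
 the faulting instruction, mov 0xe8(%rax),%rdi, which is what a field load
 through a NULL base pointer looks like.  A small sketch of that arithmetic
 follows.  struct fault_softc and its padding are hypothetical, chosen only
 so that the field lands at offset 0xe8; this is not the real ucom_softc
 layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the padding only places sc_tty at offset 0xe8. */
struct fault_softc {
	unsigned char pad[0xe8];
	void *sc_tty;
};

static_assert(offsetof(struct fault_softc, sc_tty) == 0xe8,
    "model field must sit at the displacement seen in the disassembly");

/* Address the CPU would access for base->sc_tty, without dereferencing. */
uintptr_t
field_address(uintptr_t base)
{
	return base + offsetof(struct fault_softc, sc_tty);
}
```

 With a base of 0 the computed address is 0xe8, the value reported in cr2.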

 -- 
 Paul Ripke
 "Great minds discuss ideas, average minds discuss events, small minds
  discuss people."
 -- Disputed: Often attributed to Eleanor Roosevelt. 1948.

From: Christoph Badura <bad@bsd.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/59497: Panic in ucompoll
Date: Sat, 5 Jul 2025 11:17:29 +0200

 On Sat, Jul 05, 2025 at 07:45:02AM +0000, Paul Ripke via gnats wrote:
 >  Re msgbuf, after dumping it out, I realise this crash was actually due to
 >  a failing USB hub flaking out intermittently. I have seen this device
 >  intermittently disconnect/reconnect without that hub, so there's still
 >  something going on.

 I'm confused.  Does the device also intermittently disconnect/reconnect
 without the hub?

 Anyway, even a flaky USB hub shouldn't cause a panic.

 >  [ 3804001.2307417] ucom1: detached
 >  [ 3804001.2307417] uplcom0: detached
 >  [ 3804001.2307417] uplcom0: at uhub9 port 1 (addr 13) disconnected
 >  [ 3804001.2407419] ucom0: detached
 >  [ 3804001.2537420] uhub8: detached
 >  [ 3804001.2537420] uhub8: at uhub1 port 5 (addr 1) disconnected
 >  [ 3804001.7737508] uhub8 at uhub1 port 5: GenesysLogic (0x05e3) USB2.0 Hub (0x0610), class 9/0, rev 2.10/92.26, addr 14
 >  [ 3804001.7737508] uhub8: multiple transaction translators
 >  [ 3804001.7887511] uhub8: 4 ports with 1 removable, self powered
 >  [ 3804002.1207568] uvm_fault(0xffffb1c7ac104780, 0x0, 1) -> e
 >  [ 3804002.1207568] fatal page fault in supervisor mode

 This looks to me like the trap happens while uplcom0 (did it move from
 uplcom1?) was disconnected/detached.

 If you are testing the suggested changes (check for sc != NULL and/or the
 change for spec_poll()) could you add a printf when it triggers so that we
 can verify that this happens while the uplcom/ucom is disconnected?

 --chris

From: Paul Ripke <stix@stix.id.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, stix@stix.id.au, bad@bsd.de
Subject: Re: kern/59497: Panic in ucompoll
Date: Sun, 20 Jul 2025 23:24:39 +1000

 On Sat, Jul 05, 2025 at 09:20:02AM +0000, Christoph Badura via gnats wrote:
 > 
 >  On Sat, Jul 05, 2025 at 07:45:02AM +0000, Paul Ripke via gnats wrote:
 >  >  Re msgbuf, after dumping it out, I realise this crash was actually due to
 >  >  a failing USB hub flaking out intermittently. I have seen this device
 >  >  intermittently disconnect/reconnect without that hub, so there's still
 >  >  something going on.
 >  
 >  I'm confused.  Does the device also intermittently disconnect/reconnect
 >  without the hub?

 Yes, it does - but it seems the only crash dump I had was one due to the
 failing hub.

 >  Anyway, even a flaky USB hub shouldn't cause a panic.

 Indeed.

 >  >  [ 3804001.2307417] ucom1: detached
 >  >  [ 3804001.2307417] uplcom0: detached
 >  >  [ 3804001.2307417] uplcom0: at uhub9 port 1 (addr 13) disconnected
 >  >  [ 3804001.2407419] ucom0: detached
 >  >  [ 3804001.2537420] uhub8: detached
 >  >  [ 3804001.2537420] uhub8: at uhub1 port 5 (addr 1) disconnected
 >  >  [ 3804001.7737508] uhub8 at uhub1 port 5: GenesysLogic (0x05e3) USB2.0 Hub (0x0610), class 9/0, rev 2.10/92.26, addr 14
 >  >  [ 3804001.7737508] uhub8: multiple transaction translators
 >  >  [ 3804001.7887511] uhub8: 4 ports with 1 removable, self powered
 >  >  [ 3804002.1207568] uvm_fault(0xffffb1c7ac104780, 0x0, 1) -> e
 >  >  [ 3804002.1207568] fatal page fault in supervisor mode
 >  
 >  This looks to me like the trap happens while uplcom0 (did it move from
 >  uplcom1?) was disconnected/detached.

 It may have moved - I do have two of them, and have used both on occasion.

 >  If you are testing the suggested changes (check for sc != NULL and/or the
 >  change for spec_poll()) could you add a printf when it triggers so that we
 >  can verify that this happens while the uplcom/ucom is disconnected?

 I ran a test - with the 'if (sc == NULL) return POLLHUP' patch, with drivewire
 running on /dev/dtyU0, and pulling the USB:

 [ 2724.7698958] xhci0: xhci_reset_endpoint: endpoint 0x0: timed out
 [ 2724.7738960] WARNING: pipe closed with active xfers on addr 4
 [ 2724.7808961] ucom0: detached
 [ 2724.7808961] uplcom0: detached
 [ 2724.7808961] uplcom0: at uhub8 port 3 (addr 4) disconnected
 [ 2725.6829076] ucompoll: sc == NULL
 [ 2725.6829076] uvm_fault(0xffff80e5b04f1848, 0x0, 1) -> e
 [ 2725.6829076] fatal page fault in supervisor mode
 [ 2725.6829076] trap type 6 code 0 rip 0xffffffff80497ab7 cs 0x8 rflags 0x10246 cr2 0xe8 ilevel 0 rsp 0xffff881237c78cd0
 [ 2725.6829076] curlwp 0xffff80e6239de680 pid 6220.6226 lowest kstack 0xffff881237c742c0
 [ 2725.6829076] panic: trap
 [ 2725.6829076] cpu1: Begin traceback...
 [ 2725.6829076] vpanic() at netbsd:vpanic+0x183
 [ 2725.6839076] panic() at netbsd:panic+0x3c
 [ 2725.6849078] trap() at netbsd:trap+0xbaf
 [ 2725.6849078] --- trap (number 6) ---
 [ 2725.6849078] ucomread() at netbsd:ucomread+0x2a
 [ 2725.6849078] cdev_read() at netbsd:cdev_read+0x87
 [ 2725.6859078] spec_read() at netbsd:spec_read+0x2d3
 [ 2725.6859078] VOP_READ() at netbsd:VOP_READ+0x42
 [ 2725.6869079] vn_read() at netbsd:vn_read+0x18e
 [ 2725.6869079] dofileread() at netbsd:dofileread+0x79
 [ 2725.6869079] sys_read() at netbsd:sys_read+0x49
 [ 2725.6879078] syscall() at netbsd:syscall+0x1fc
 [ 2725.6879078] --- syscall (number 3) ---
 [ 2725.6879078] netbsd:syscall+0x1fc:
 [ 2725.6879078] cpu1: End traceback...

 So, this exited the poll, and died in read, which I guess is an
 improvement? If I get the chance, I'll try to figure out how this
 is supposed to work.
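
 The second trace faults at ucomread+0x2a with the same cr2 0xe8, which
 suggests the read path does the same unchecked lookup.  A sketch of sharing
 one guard between entry points follows.  Every name in it is hypothetical,
 and returning ENXIO from the read path is an assumption, not something
 settled in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver softc. */
struct model_softc {
	int sc_unit;
};

#define MODEL_NUNITS 4
static struct model_softc *model_devs[MODEL_NUNITS];

/* Register (or, with sc == NULL, remove) the softc for a unit. */
void
model_attach(int unit, struct model_softc *sc)
{
	assert(unit >= 0 && unit < MODEL_NUNITS);
	model_devs[unit] = sc;
}

/* One lookup shared by every entry point; NULL means "device is gone". */
static struct model_softc *
model_get(int unit)
{
	if (unit < 0 || unit >= MODEL_NUNITS)
		return NULL;
	return model_devs[unit];
}

/* Poll path: returns an event mask. */
int
model_poll(int unit, int events)
{
	if (model_get(unit) == NULL)
		return POLLHUP;
	return events & (POLLIN | POLLOUT);	/* placeholder */
}

/* Read path: returns an errno value, 0 on success. */
int
model_read(int unit)
{
	if (model_get(unit) == NULL)
		return ENXIO;	/* assumed error code */
	return 0;		/* placeholder for the real read */
}
```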

 btw: this is still on the 10.1 branch.

 Cheers,
 -- 
 Paul Ripke
 "Great minds discuss ideas, average minds discuss events, small minds
  discuss people."
 -- Disputed: Often attributed to Eleanor Roosevelt. 1948.

>Unformatted:

Copyright © 1994-2025 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.