NetBSD Problem Report #58043

From paul@whooppee.com  Sat Mar 16 15:01:50 2024
Return-Path: <paul@whooppee.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id B87FB1A924F
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 16 Mar 2024 15:01:50 +0000 (UTC)
Message-Id: <20240316150148.8DD545E33C5@speedy.whooppee.com>
Date: Sat, 16 Mar 2024 08:01:48 -0700 (PDT)
From: paul@whooppee.com
Reply-To: paul@whooppee.com
To: gnats-bugs@NetBSD.org
Subject: kernel crash in -current
X-Send-Pr-Version: 3.95

>Number:         58043
>Category:       kern
>Synopsis:       kernel crash in assert_sleepable() in -current, dk(4) driver?
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    hannken
>State:          needs-pullups
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 16 15:05:00 +0000 2024
>Closed-Date:    
>Last-Modified:  Sun Aug 18 18:00:02 +0000 2024
>Originator:     Paul Goyette
>Release:        NetBSD 10.99.10
>Organization:
+---------------------+--------------------------+----------------------+
| Paul Goyette (.sig) | PGP Key fingerprint:     | E-mail addresses:    |
| (Retired)           | 1B11 1849 721C 56C8 F63A | paul@whooppee.com    |
| Software Developer  | 6E2E 05FD 15CE 9F2D 5102 | pgoyette@netbsd.org  |
| & Network Engineer  |                          | pgoyette99@gmail.com |
+---------------------+--------------------------+----------------------+
>Environment:


System: NetBSD speedy.whooppee.com 10.99.10 NetBSD 10.99.10 (SPEEDY 2024-03-13 18:25:47 UTC) #0: Wed Mar 13 20:05:25 UTC 2024 paul@speedy.whooppee.com:/build/netbsd-local/obj/amd64/sys/arch/amd64/compile/SPEEDY amd64
Architecture: x86_64
Machine: amd64
>Description:
	At unpredictable times, but always under heavy disk load (i.e.,
	build.sh running with -j30), I am seeing random crashes.  I
	have a crash dump from one of these crashes, and the stack trace
	seems to implicate the disk driver:

	Crash version 10.99.10, image version 10.99.10.
	crash: _kvm_kvatop(0)
	Kernel compiled without options LOCKDEBUG.
	System panicked: dump forced via kernel debugger
	Backtrace from time of crash is available.
	crash> bt
	end() at 0
	kern_reboot() at kern_reboot+0x87
	db_sync_cmd() at db_sifting_cmd
	db_command() at db_command+0x123
	db_command_loop() at db_command_loop+0x1c7
	db_trap() at db_trap+0xcc
	kdb_trap() at kdb_trap+0x106
	trap() at trap+0x2de
	--- trap (number 1) ---
	breakpoint() at breakpoint+0x5
	vpanic() at vpanic+0x173
	panic() at printf_nostamp
	assert_sleepable() at assert_sleepable+0x99
	pool_cache_get_paddr() at pool_cache_get_paddr+0x13c
	end() at ffffffff813ad275
	bdev_strategy() at bdev_strategy+0x81
	spec_strategy() at spec_strategy+0x6e
	VOP_STRATEGY() at VOP_STRATEGY+0x3c
	dkstart() at dkstart+0x13e
	dkiodone() at dkiodone+0xa6
	lddone() at lddone+0x10
	nvme_q_complete() at nvme_q_complete+0xff
	softint_dispatch() at softint_dispatch+0x112
	DDB lost frame for Xsoftintr+0x4c, trying 0xffffd220dfd9d0f0
	Xsoftintr() at Xsoftintr+0x4c
	--- interrupt ---
	0:

	I've had several other similar crashes, although I haven't
	saved dump details.  All stack traces seem to have pointed
	in the same area, and all fail at the assert_sleepable().

	Config and/or dmesg are available.  One item of note is that
	this machine contains multiple SSDs, and in one case I have
	a ccd(4) of two 2-TB CCD partitions (each of which occupies
	a complete SSD device).

>How-To-Repeat:
	No specific recipe to reproduce; it seems random when
	under high disk activity.
>Fix:
	Please.  In fact, pretty-please.

>Release-Note:

>Audit-Trail:
From: "J. Hannken-Illjes" <hannken@mailbox.org>
To: NetBSD GNATS <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/58043: kernel crash in -current
Date: Sat, 16 Mar 2024 17:25:27 +0100

 This looks like a GPT labeled ccd: "dkX at ccdX"

 Here we get softint_dispatch -> dkstart -> ccdstart -> ccdbuffer ->
 pool_cache_get via

 sys/dev/ccd.c:844 and sys/dev/ccd.c:932

 Trying to allocate memory here panics as allocation from softint is
 forbidden.

 --
 J. Hannken-Illjes (hannken@mailbox.org <mailto:hannken@mailbox.org>)
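
The rule described in this message can be illustrated with a minimal
userland sketch.  This is not kernel code: in_softint, model_alloc() and
model_assert_sleepable() are hypothetical stand-ins, and where the real
assert_sleepable() panics, this model only counts violations so that it
can be exercised safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for PR_WAITOK / PR_NOWAIT; not kernel values. */
enum alloc_flag { ALLOC_WAITOK, ALLOC_NOWAIT };

/* True while the simulated CPU is running a soft interrupt handler. */
static bool in_softint = false;

/* Counts violations instead of panicking, so the model stays testable. */
static int sleepable_violations = 0;

static void
model_assert_sleepable(void)
{
	if (in_softint)
		sleepable_violations++;
}

/* A WAITOK allocation may sleep, so the caller must be a thread. */
static void *
model_alloc(size_t size, enum alloc_flag flag)
{
	if (flag == ALLOC_WAITOK)
		model_assert_sleepable();
	return malloc(size);
}

/* Run fn as if it had been dispatched as a soft interrupt. */
static void
run_in_softint(void (*fn)(void))
{
	in_softint = true;
	fn();
	in_softint = false;
}

/* Handler that wrongly asks for a sleeping allocation. */
static void
bad_handler(void)
{
	free(model_alloc(64, ALLOC_WAITOK));
}

/* Handler that asks for a non-sleeping allocation. */
static void
good_handler(void)
{
	free(model_alloc(64, ALLOC_NOWAIT));
}
```

Calling run_in_softint(bad_handler) records one violation, while
run_in_softint(good_handler) and a WAITOK allocation made outside
run_in_softint() record none.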

From: Taylor R Campbell <riastradh@NetBSD.org>
To: Paul Goyette <paul@whooppee.com>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current, dk(4) driver?
Date: Sat, 16 Mar 2024 16:27:25 +0000

 Annoyingly, the part of the stack trace we really want here -- the
 part which would tell us where something called pool_cache_get(_paddr)
 -- has been obscured:

 > 	assert_sleepable() at assert_sleepable+0x99
 > 	pool_cache_get_paddr() at pool_cache_get_paddr+0x13c
 > 	end() at ffffffff813ad275
 > 	bdev_strategy() at bdev_strategy+0x81

 My best guess from the rest of the stack trace:

 > 	spec_strategy() at spec_strategy+0x6e
 > 	VOP_STRATEGY() at VOP_STRATEGY+0x3c
 > 	dkstart() at dkstart+0x13e
 > 	dkiodone() at dkiodone+0xa6
 > 	lddone() at lddone+0x10
 > 	nvme_q_complete() at nvme_q_complete+0xff

 is that the missing part looks something like this:

 nvme_ns_dobio
 ld_nvme_start
 ld_diskstart
 dk_start (note: not dkstart)
 dk_strategy
 ldstrategy

 There's a call to bus_dmamap_load here which looks like, in this stack
 trace, it will pass BUS_DMA_WAITOK because ld_nvme_start doesn't pass
 NVME_NS_CTX_F_POLL.  I wonder whether this should unconditionally pass
 BUS_DMA_NOWAIT instead?  After all, the dmamap is created with
 BUS_DMA_ALLOCNOW so maybe there should be no need for allocation here.

 (And I wonder whether maybe bus_dmamap_load should assert_sleepable if
 you pass BUS_DMA_WAITOK, to shake out more of these paths early.)

 Can you try the attached patch?

 diff --git a/sys/dev/ic/nvme.c b/sys/dev/ic/nvme.c
 index d41c88296dbd..29f877c01031 100644
 --- a/sys/dev/ic/nvme.c
 +++ b/sys/dev/ic/nvme.c
 @@ -786,8 +786,7 @@ nvme_ns_dobio(struct nvme_softc *sc, uint16_t nsid, void *cookie,
  	dmap = ccb->ccb_dmamap;
  	error = bus_dmamap_load(sc->sc_dmat, dmap, data,
  	    datasize, NULL,
 -	    (ISSET(flags, NVME_NS_CTX_F_POLL) ?
 -	      BUS_DMA_NOWAIT : BUS_DMA_WAITOK) |
 +	    BUS_DMA_NOWAIT |
  	    (ISSET(flags, NVME_NS_CTX_F_READ) ?
  	      BUS_DMA_READ : BUS_DMA_WRITE));
  	if (error) {

From: Paul Goyette <paul@whooppee.com>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current,
 dk(4) driver?
Date: Sat, 16 Mar 2024 10:18:36 -0700 (PDT)

 On Sat, 16 Mar 2024, Taylor R Campbell wrote:

 > Annoyingly, the part of the stack trace we really want here -- the
 > part which would tell us where something called pool_cache_get(_paddr)
 > -- has been obscured:
 >
 >> 	assert_sleepable() at assert_sleepable+0x99
 >> 	pool_cache_get_paddr() at pool_cache_get_paddr+0x13c
 >> 	end() at ffffffff813ad275
 >> 	bdev_strategy() at bdev_strategy+0x81
 >
 > My best guess from the rest of the stack trace:
 >
 >> 	spec_strategy() at spec_strategy+0x6e
 >> 	VOP_STRATEGY() at VOP_STRATEGY+0x3c
 >> 	dkstart() at dkstart+0x13e
 >> 	dkiodone() at dkiodone+0xa6
 >> 	lddone() at lddone+0x10
 >> 	nvme_q_complete() at nvme_q_complete+0xff
 >
 > is that the missing part looks something like this:
 >
 > nvme_ns_dobio
 > ld_nvme_start
 > ld_diskstart
 > dk_start (note: not dkstart)
 > dk_strategy
 > ldstrategy
 >
 > There's a call to bus_dmamap_load here which looks like, in this stack
 > trace, it will pass BUS_DMA_WAITOK because ld_nvme_start doesn't pass
 > NVME_NS_CTX_F_POLL.  I wonder whether this should unconditionally pass
 > BUS_DMA_NOWAIT instead?  After all, the dmamap is created with
 > BUS_DMA_ALLOCNOW so maybe there should be no need for allocation here.
 >
 > (And I wonder whether maybe bus_dmamap_load should assert_sleepable if
 > you pass BUS_DMA_WAITOK, to shake out more of these paths early.)

 I thought the stack trace looked like it wasn't complete!  Here is the
 backtrace using gdb - perhaps more useful?

 #0  0xffffffff80239b95 in cpu_reboot (howto=howto@entry=256,
      bootstr=bootstr@entry=0x0)
      at /build/netbsd-local/src_ro/sys/arch/amd64/amd64/machdep.c:708
 #1  0xffffffff806a84f5 in kern_reboot (howto=howto@entry=256,
      bootstr=bootstr@entry=0x0)
      at /build/netbsd-local/src_ro/sys/kern/kern_reboot.c:91
 #2  0xffffffff80588d23 in db_sync_cmd (addr=<optimized out>,
      have_addr=<optimized out>, count=<optimized out>, modif=<optimized out>)
      at /build/netbsd-local/src_ro/sys/ddb/db_command.c:1651
 #3  0xffffffff805894ca in db_command (
      last_cmdp=last_cmdp@entry=0xffffd220dfd9c958)
      at /build/netbsd-local/src_ro/sys/ddb/db_command.c:970
 #4  0xffffffff80589abf in db_execute_commandlist (
      cmdlist=0xffffffff80e353e0 <db_cmd_on_enter> "bt; show reg; sync")
      at /build/netbsd-local/src_ro/sys/ddb/db_command.c:466
 #5  db_command_loop () at /build/netbsd-local/src_ro/sys/ddb/db_command.c:618
 #6  0xffffffff8058dc98 in db_trap (type=type@entry=1, code=code@entry=0)
      at /build/netbsd-local/src_ro/sys/ddb/db_trap.c:91
 #7  0xffffffff80236a54 in kdb_trap (type=type@entry=1, code=code@entry=0,
      regs=regs@entry=0xffffd220dfd9cc10)
      at /build/netbsd-local/src_ro/sys/arch/amd64/amd64/db_interface.c:251
 #8  0xffffffff8023c066 in trap (frame=0xffffd220dfd9cc10)
      at /build/netbsd-local/src_ro/sys/arch/amd64/amd64/trap.c:314
 #9  0xffffffff80234a24 in alltraps ()
 #10 0xffffffff80235365 in breakpoint ()
 #11 0xffffffff806ef1be in vpanic (
      fmt=fmt@entry=0xffffffff80b34a1b "%s: %s caller=%p",
      ap=ap@entry=0xffffd220dfd9cd48)
      at /build/netbsd-local/src_ro/sys/kern/subr_prf.c:286
 #12 0xffffffff806ef29d in panic (
      fmt=fmt@entry=0xffffffff80b34a1b "%s: %s caller=%p")
      at /build/netbsd-local/src_ro/sys/kern/subr_prf.c:209
 #13 0xffffffff8069349d in assert_sleepable ()
      at /build/netbsd-local/src_ro/sys/kern/kern_lock.c:109
 #14 0xffffffff806ec0e7 in pool_cache_get_paddr (pc=0xfffff7cf1a829540,
      flags=1, pap=0x0)
      at /build/netbsd-local/src_ro/sys/kern/subr_pool.c:2721
 #15 0xffffffff813ad275 in ?? ()
 #16 0x000000000000003a in ?? ()
 #17 0x000000009662dc80 in ?? ()
 #18 0xfffff7cf1a6644e8 in ?? ()
 #19 0x000000009662dcba in ?? ()
 #20 0xffffd220bf420000 in ?? ()
 #21 0x0000000000001000 in ?? ()
 #22 0xffffd220dfd9ce70 in ?? ()
 #23 0x0000000000000100 in ?? ()
 #24 0xfffff7cf1b4a85c0 in ?? ()
 #25 0xffffffff813a84e0 in ?? ()
 #26 0xfffff7cf1a35b478 in ?? ()
 #27 0xfffff7cf1a35b360 in ?? ()
 #28 0xffffd220dfd9ced0 in ?? ()
 #29 0xffffffff806d8331 in bdev_strategy (bp=0xfffff7cf1b04be80)
      at /build/netbsd-local/src_ro/sys/kern/subr_devsw.c:1267
 Backtrace stopped: frame did not save the PC

 > Can you try the attached patch?

 I will test this out later today and report back.

From: Taylor R Campbell <riastradh@NetBSD.org>
To: Paul Goyette <paul@whooppee.com>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current,
	dk(4) driver?
Date: Sat, 16 Mar 2024 17:52:17 +0000

 > Date: Sat, 16 Mar 2024 10:18:36 -0700 (PDT)
 > From: Paul Goyette <paul@whooppee.com>
 > 
 > I thought the stack trace looked like it wasn't complete!  Here is the
 > backtrace using gdb - perhaps more useful?
 > 
 > #14 0xffffffff806ec0e7 in pool_cache_get_paddr (pc=0xfffff7cf1a829540,
 >      flags=1, pap=0x0) at 
 > /build/netbsd-local/src_ro/sys/kern/subr_pool.c:2721
 > #15 0xffffffff813ad275 in ?? ()
 > ...
 > #28 0xffffd220dfd9ced0 in ?? ()
 > #29 0xffffffff806d8331 in bdev_strategy (bp=0xfffff7cf1b04be80)
 >      at /build/netbsd-local/src_ro/sys/kern/subr_devsw.c:1267

 Nope, this one isn't much help either...

 Are you loading drivers from modules?  Maybe it would help if you used
 src/sys/gdbscripts/modload to load debug data from the modules?

 (gdb) source /path/to/src/sys/gdbscripts/modload
 (gdb) modload

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current,
 dk(4) driver?
Date: Sat, 16 Mar 2024 10:59:36 -0700 (PDT)

 On Sat, 16 Mar 2024, Taylor R Campbell wrote:

 > >
 > > #14 0xffffffff806ec0e7 in pool_cache_get_paddr (pc=0xfffff7cf1a829540,
 > >      flags=1, pap=0x0) at
 > > /build/netbsd-local/src_ro/sys/kern/subr_pool.c:2721
 > > #15 0xffffffff813ad275 in ?? ()
 > > ...
 > > #28 0xffffd220dfd9ced0 in ?? ()
 > > #29 0xffffffff806d8331 in bdev_strategy (bp=0xfffff7cf1b04be80)
 > >      at /build/netbsd-local/src_ro/sys/kern/subr_devsw.c:1267
 >
 > Nope, this one isn't much help either...
 >
 > Are you loading drivers from modules?  Maybe it would help if you used
 > src/sys/gdbscripts/modload to load debug data from the modules?
 >
 > (gdb) source /path/to/src/sys/gdbscripts/modload
 > (gdb) modload

 Nope, this is from a stock GENERIC kernel, all modules built-in.  At
 least, I think it is!

 There are a couple of local patches, but nothing in this vicinity.


From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current,
 dk(4) driver?
Date: Sat, 16 Mar 2024 11:13:55 -0700 (PDT)

 Hmm, looks like this wasn't GENERIC after all!

 I ran modload, and then did another backtrace.  Looks a bit better...

 #0  0xffffffff80239b95 in cpu_reboot (howto=howto@entry=256, bootstr=bootstr@entry=0x0) at /build/netbsd-local/src_ro/sys/arch/amd64/amd64/machdep.c:708
 #1  0xffffffff806a84f5 in kern_reboot (howto=howto@entry=256, bootstr=bootstr@entry=0x0) at /build/netbsd-local/src_ro/sys/kern/kern_reboot.c:91
 #2  0xffffffff80588d23 in db_sync_cmd (addr=<optimized out>, have_addr=<optimized out>, count=<optimized out>, modif=<optimized out>) at /build/netbsd-local/src_ro/sys/ddb/db_command.c:1651
 #3  0xffffffff805894ca in db_command (last_cmdp=last_cmdp@entry=0xffffd220dfd9c958) at /build/netbsd-local/src_ro/sys/ddb/db_command.c:970
 #4  0xffffffff80589abf in db_execute_commandlist (cmdlist=0xffffffff80e353e0 <db_cmd_on_enter> "bt; show reg; sync") at /build/netbsd-local/src_ro/sys/ddb/db_command.c:466
 #5  db_command_loop () at /build/netbsd-local/src_ro/sys/ddb/db_command.c:618
 #6  0xffffffff8058dc98 in db_trap (type=type@entry=1, code=code@entry=0) at /build/netbsd-local/src_ro/sys/ddb/db_trap.c:91
 #7  0xffffffff80236a54 in kdb_trap (type=type@entry=1, code=code@entry=0, regs=regs@entry=0xffffd220dfd9cc10) at /build/netbsd-local/src_ro/sys/arch/amd64/amd64/db_interface.c:251
 #8  0xffffffff8023c066 in trap (frame=0xffffd220dfd9cc10) at /build/netbsd-local/src_ro/sys/arch/amd64/amd64/trap.c:314
 #9  0xffffffff80234a24 in alltraps ()
 #10 0xffffffff80235365 in breakpoint ()
 #11 0xffffffff806ef1be in vpanic (fmt=fmt@entry=0xffffffff80b34a1b "%s: %s caller=%p", ap=ap@entry=0xffffd220dfd9cd48) at /build/netbsd-local/src_ro/sys/kern/subr_prf.c:286
 #12 0xffffffff806ef29d in panic (fmt=fmt@entry=0xffffffff80b34a1b "%s: %s caller=%p") at /build/netbsd-local/src_ro/sys/kern/subr_prf.c:209
 #13 0xffffffff8069349d in assert_sleepable () at /build/netbsd-local/src_ro/sys/kern/kern_lock.c:109
 #14 0xffffffff806ec0e7 in pool_cache_get_paddr (pc=0xfffff7cf1a829540, flags=flags@entry=1, pap=pap@entry=0x0) at /build/netbsd-local/src_ro/sys/kern/subr_pool.c:2721
 #15 0xffffffff813ad275 in ccdbuffer (bcount=4096, addr=0xffffd220bf420000, bn=5046122874, bp=0xfffff7cf1b4a85c0, cs=0xfffff7cf1b04be40) at /build/netbsd-local/src_ro/sys/dev/ccd.c:932
 #16 ccdstart (cs=0xfffff7cf1b04be40) at /build/netbsd-local/src_ro/sys/dev/ccd.c:844
 #17 0xffffffff806d8331 in bdev_strategy (bp=0xfffff7cf1b4a85c0) at /build/netbsd-local/src_ro/sys/kern/subr_devsw.c:1267
 #18 0xffffffff8076f142 in spec_strategy (v=<optimized out>) at /build/netbsd-local/src_ro/sys/miscfs/specfs/spec_vnops.c:1508
 #19 0xffffffff80762459 in VOP_STRATEGY (vp=vp@entry=0xfffff7cf1c61cb00, bp=bp@entry=0xfffff7cf1b4a85c0) at /build/netbsd-local/src_ro/sys/kern/vnode_if.c:1733
 #20 0xffffffff8077226d in dkstart (sc=0xfffff7cf1a35b248) at /build/netbsd-local/src_ro/sys/dev/dkwedge/dk.c:1626
 #21 0xffffffff80772f69 in dkiodone (bp=<optimized out>) at /build/netbsd-local/src_ro/sys/dev/dkwedge/dk.c:1658
 #22 0xffffffff802e186a in lddone (sc=0xfffff7cf188dcb40, bp=<optimized out>) at /build/netbsd-local/src_ro/sys/dev/ld.c:527
 #23 0xffffffff802f0930 in nvme_q_complete (sc=0xffffd200fac10000, q=0xfffff7cf1717a600) at /build/netbsd-local/src_ro/sys/dev/ic/nvme.c:1541
 #24 0xffffffff806b6bb1 in softint_execute (s=3, l=0xfffff7de5cf6b800) at /build/netbsd-local/src_ro/sys/kern/kern_softint.c:599
 #25 softint_dispatch (pinned=<optimized out>, s=3) at /build/netbsd-local/src_ro/sys/kern/kern_softint.c:848
 #26 0xffffffff8023475c in Xsoftintr ()

From: "J. Hannken-Illjes" <hannken@mailbox.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current, dk(4)
 driver?
Date: Mon, 18 Mar 2024 10:52:09 +0100

 Paul,

 please try the attached patch -- if it prevents assert_sleepable() from
 firing, it is the call to CCD_GETBUF(), a.k.a. pool_cache_get(ccd_cache,
 PR_WAITOK), from softint context that has to be fixed.

 --
 J. Hannken-Illjes - hannken@mailbox.org <mailto:hannken@mailbox.org>

 ccd_defer

 Always defer requests to the helper thread so ccdstart() doesn't
 get called from softint anymore.

 diff -r 8b8d2498ffd9 -r 2ec4e85f1120 sys/dev/ccd.c
 --- sys/dev/ccd.c
 +++ sys/dev/ccd.c
 @@ -777,17 +777,10 @@ ccdstrategy(struct buf *bp)
  		return;
  	}

 -	/* Defer to thread if system is low on memory. */
 +	/* Always defer to thread. */
  	bufq_put(cs->sc_bufq, bp);
 -	if (__predict_false(ccdbackoff(cs))) {
 -		mutex_exit(cs->sc_iolock);
 -#ifdef DEBUG
 - 		if (ccddebug & CCDB_FOLLOW)
 - 			printf("ccdstrategy: holding off on I/O\n");
 -#endif
 -		return;
 -	}
 -	ccdstart(cs);
 +	cv_broadcast(&cs->sc_push);
 +	mutex_exit(cs->sc_iolock);
  }

  static void

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current,
 dk(4) driver?
Date: Mon, 18 Mar 2024 05:03:17 -0700 (PDT)

 I've already converted my config to raid0 instead of ccd, so I am sorry
 that I am unable to test the patch.  :-(

From: triaxx@NetBSD.org
To: Paul Goyette <paul@whooppee.com>, gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current, dk(4)
 driver?
Date: Wed, 27 Mar 2024 11:55:11 +0100

 I have a system with:

      $ dmesg -t | grep ccd
      dk7 at wd1: "ccd0p0", 488397088 blocks at 40, type: ccd
      dk8 at wd2: "ccd0p1", 488397088 blocks at 40, type: ccd
      ccd0: Interleaving 2 components (63 block interleave)
      ccd0: /dev/dk7 (488397042 blocks)
      ccd0: /dev/dk8 (488397042 blocks)
      ccd0: total 976794084 blocks
      ccd0: GPT GUID: 546b7b1d-bf71-46a2-9788-226ebaf7ae2d
      dk12 at ccd0: "ccd0", 976794008 blocks at 40, type: ffs

 The patch fixes the issue with the kernel that crashes on this system.

From: triaxx@NetBSD.org
To: Paul Goyette <paul@whooppee.com>, gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current, dk(4)
 driver?
Date: Wed, 27 Mar 2024 14:01:19 +0100

 I have two kernels on this system: netbsd-GENERIC and 
 netbsd-GOLIATH-noamdgpu.

 Surprisingly, GOLIATH-noamdgpu crashes when GENERIC doesn't. The diff 
 between both configurations is:

 1c1
 < ### START CONFIG FILE "/usr/src/sys/arch/amd64/conf/GENERIC"
 ---
 > ### START CONFIG FILE "/home/triaxx/NetBSD/src/sys/arch/amd64/conf/GOLIATH-noamdgpu"
 118,119c118,119
 < #options      DEBUG           # expensive debugging checks/support
 < #options      LOCKDEBUG       # expensive locking checks/support
 ---
 > options       DEBUG           # expensive debugging checks/support
 > options       LOCKDEBUG       # expensive locking checks/support
 130,131c130,131
 < #options      KGDB            # remote debugger
 < #options      KGDB_DEVNAME="\"com\"",KGDB_DEVADDR=0x3f8,KGDB_DEVRATE=9600
 ---
 > options       KGDB            # remote debugger
 > options       KGDB_DEVNAME="\"com\"",KGDB_DEVADDR=0x3f8,KGDB_DEVRATE=9600
 255c255
 < #options      ACPIVERBOSE     # verbose ACPI configuration messages
 ---
 > options       ACPIVERBOSE     # verbose ACPI configuration messages
 461,462c461,462
 < i915drmkms*   at pci? dev ? function ?
 < intelfb*      at intelfbbus?
 ---
 > #i915drmkms*  at pci? dev ? function ?
 > #intelfb*     at intelfbbus?
 464,465c464,465
 < radeon*       at pci? dev ? function ?
 < radeondrmkmsfb* at radeonfbbus?
 ---
 > #radeon*      at pci? dev ? function ?
 > #radeondrmkmsfb* at radeonfbbus?
 470,471c470,471
 < nouveau*      at pci? dev ? function ?
 < nouveaufb*    at nouveaufbbus?
 ---
 > #nouveau*     at pci? dev ? function ?
 > #nouveaufb*   at nouveaufbbus?
 1245c1245
 < ### END CONFIG FILE "/usr/src/sys/arch/amd64/conf/GENERIC"
 ---
 > ### END CONFIG FILE "/home/triaxx/NetBSD/src/sys/arch/amd64/conf/GOLIATH-noamdgpu"

From: "J. Hannken-Illjes" <hannken@mailbox.org>
To: NetBSD GNATS <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/58043: kernel crash in assert_sleepable() in -current, dk(4)
 driver?
Date: Wed, 27 Mar 2024 14:21:38 +0100

 The crash is from ASSERT_SLEEPABLE(), which is enabled only in kernels
 built with options DEBUG.

 --
 J. Hannken-Illjes - hannken@mailbox.org
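
The effect of a DEBUG-only assertion can be sketched as follows.  The
names are hypothetical stand-ins, and MODEL_DEBUG plays the role that
options DEBUG plays in a kernel config; this illustrates compile-time
gating in general and is not a copy of the kernel headers.

```c
#include <assert.h>

/* Number of times the (hypothetical) sleepability check actually ran. */
static int dbg_checks_run = 0;

/* Stand-in for the real check; here it only counts invocations. */
static void
dbg_assert_sleepable(void)
{
	dbg_checks_run++;
}

/*
 * With MODEL_DEBUG defined the macro calls the check; without it the
 * macro expands to an empty statement and the check never runs.
 */
#ifdef MODEL_DEBUG
#define	DBG_ASSERT_SLEEPABLE()	dbg_assert_sleepable()
#else
#define	DBG_ASSERT_SLEEPABLE()	do { } while (0)
#endif

/* A routine that may sleep and therefore guards itself with the macro. */
static void
dbg_may_sleep(void)
{
	DBG_ASSERT_SLEEPABLE();
}
```

Built without -DMODEL_DEBUG, dbg_may_sleep() never reaches the check,
which matches the observation that the same call path is silent on one
kernel configuration and fatal on another.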

From: "Juergen Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/58043 CVS commit: src/sys/dev
Date: Sun, 31 Mar 2024 14:56:41 +0000

 Module Name:	src
 Committed By:	hannken
 Date:		Sun Mar 31 14:56:41 UTC 2024

 Modified Files:
 	src/sys/dev: ccd.c

 Log Message:
 Using a ccd(4) with GPT (dk* at ccd*) the disk framework will call
 ccdstrategy() -> ccdstart() -> ccdbuffer()  from softint context.
 Allocating the buffer with PR_WAITOK here is forbidden.

 Change ccdstart() / ccdbuffer() to report failure back to caller and
 pass PR_WAITOK / PR_NOWAIT as an additional argument.

 Call ccdstart() with PR_NOWAIT from ccdstrategy() and on error defer
 to the kthread.  Call ccdstart() with PR_WAITOK from kthread so requests
 from kthread always succeed in allocating the buffers.

 Remove the (non working) throttling on low memory as it is no longer needed.

 Fixes PR kern/58043 "kernel crash in assert_sleepable() in -current,
 dk(4) driver?"


 To generate a diff of this commit:
 cvs rdiff -u -r1.189 -r1.190 src/sys/dev/ccd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
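
The approach in the log message above (try without waiting, defer to a
thread on failure) can be modelled in a few lines of userland C.
Everything here is a hypothetical, single-threaded stand-in: struct
model_dev, model_getbuf(), model_start(), model_strategy() and
model_worker() are invented for illustration and are not the functions
in sys/dev/ccd.c.  The worker_wakeup flag stands in for waking the
kthread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for PR_WAITOK / PR_NOWAIT. */
enum wait_flag { MODEL_WAITOK, MODEL_NOWAIT };

#define	QLEN	8

struct model_dev {
	int	queue[QLEN];	/* pending request ids, FIFO */
	size_t	head, tail;	/* head == tail means empty */
	int	free_bufs;	/* buffers available without sleeping */
	int	started;	/* requests handed to the components */
	int	last_started;	/* id of the most recently started request */
	bool	worker_wakeup;	/* stands in for waking the kthread */
};

/* Take a buffer; only the WAITOK path is allowed to "sleep" for one. */
static bool
model_getbuf(struct model_dev *d, enum wait_flag flag)
{
	if (d->free_bufs == 0) {
		if (flag == MODEL_NOWAIT)
			return false;
		d->free_bufs++;	/* pretend we slept until memory appeared */
	}
	d->free_bufs--;
	return true;
}

/* Start queued requests; report failure if an allocation failed. */
static bool
model_start(struct model_dev *d, enum wait_flag flag)
{
	while (d->head != d->tail) {
		if (!model_getbuf(d, flag))
			return false;	/* the request stays queued */
		d->last_started = d->queue[d->head++ % QLEN];
		d->started++;
	}
	return true;
}

/* May run in softint context: never waits, defers on failure. */
static void
model_strategy(struct model_dev *d, int req)
{
	assert(d->tail - d->head < QLEN);	/* sketch: no overflow handling */
	d->queue[d->tail++ % QLEN] = req;
	if (!model_start(d, MODEL_NOWAIT))
		d->worker_wakeup = true;
}

/* Runs in thread context, where waiting for memory is allowed. */
static void
model_worker(struct model_dev *d)
{
	if (d->worker_wakeup) {
		d->worker_wakeup = false;
		(void)model_start(d, MODEL_WAITOK);
	}
}
```

With one free buffer, a first model_strategy() call starts its request
immediately; a second call finds no buffer, leaves the request queued
and sets worker_wakeup; a later model_worker() call then starts it with
MODEL_WAITOK.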

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/58043 CVS commit: [netbsd-10] src/sys/dev
Date: Thu, 18 Apr 2024 18:24:31 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Thu Apr 18 18:24:31 UTC 2024

 Modified Files:
 	src/sys/dev [netbsd-10]: ccd.c

 Log Message:
 Pull up following revision(s) (requested by hannken in ticket #669):

 	sys/dev/ccd.c: revision 1.190

 Using a ccd(4) with GPT (dk* at ccd*) the disk framework will call
 ccdstrategy() -> ccdstart() -> ccdbuffer()  from softint context.

 Allocating the buffer with PR_WAITOK here is forbidden.

 Change ccdstart() / ccdbuffer() to report failure back to caller and
 pass PR_WAITOK / PR_NOWAIT as an additional argument.

 Call ccdstart() with PR_NOWAIT from ccdstrategy() and on error defer
 to the kthread.  Call ccdstart() with PR_WAITOK from kthread so requests
 from kthread always succeed in allocating the buffers.

 Remove the (non working) throttling on low memory as it is no longer needed.

 Fixes PR kern/58043 "kernel crash in assert_sleepable() in -current,
 dk(4) driver?"


 To generate a diff of this commit:
 cvs rdiff -u -r1.189 -r1.189.4.1 src/sys/dev/ccd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->needs-pullups
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Tue, 23 Jul 2024 22:20:23 +0000
State-Changed-Why:
Fixed in HEAD and pulled up to 10; does this need pullup-9 too?


Responsible-Changed-From-To: kern-bug-people->hannken
Responsible-Changed-By: riastradh@NetBSD.org
Responsible-Changed-When: Sun, 18 Aug 2024 16:55:29 +0000
Responsible-Changed-Why:
Can you assess whether this is reasonable to pull up to 9?


From: "J. Hannken-Illjes" <hannken@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: riastradh@NetBSD.org
Subject: Re: kern/58043 (kernel crash in assert_sleepable() in -current,
 dk(4) driver?)
Date: Sun, 18 Aug 2024 17:55:25 +0000

 On Sun, Aug 18, 2024 at 04:55:30PM +0000, riastradh@NetBSD.org wrote:
 > Synopsis: kernel crash in assert_sleepable() in -current, dk(4) driver?
 > 
 > Responsible-Changed-From-To: kern-bug-people->hannken
 > Responsible-Changed-By: riastradh@NetBSD.org
 > Responsible-Changed-When: Sun, 18 Aug 2024 16:55:29 +0000
 > Responsible-Changed-Why:
 > Can you assess whether this is reasonable to pull up to 9?

 As the assertion fires only on DEBUG kernels, and the diff does not
 apply cleanly to -9 and would therefore need testing, we should wait
 until someone hits this assertion on -9 and then prepare, test, and
 pull up the change.

 -- 
 J. Hannken-Illjes - hannken@netbsd.org

>Unformatted:

Copyright © 1994-2024 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.