NetBSD Problem Report #55397

From kardel@Kardel.name  Fri Jun 19 08:47:47 2020
Return-Path: <kardel@Kardel.name>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 3DFBB1A9217
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 19 Jun 2020 08:47:47 +0000 (UTC)
Message-Id: <20200619084743.E55E744B33@Andromeda.Kardel.name>
Date: Fri, 19 Jun 2020 10:47:43 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: Xen pvh instance: zpool create on xbd* devices panics
X-Send-Pr-Version: 3.95

>Number:         55397
>Category:       kern
>Synopsis:       Xen pvh instance: zpool create on xbd* devices panics
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jun 19 08:50:00 +0000 2020
>Closed-Date:    Sat Jun 20 12:16:36 +0000 2020
>Last-Modified:  Sat Jun 20 12:16:36 +0000 2020
>Originator:     Frank Kardel
>Release:        NetBSD 9.99.65
>Organization:

>Environment:


System: NetBSD abstest2 9.99.65 NetBSD 9.99.65 (GENERIC) #2: Tue Jun 9 01:58:38 CEST 2020 kardel@Andromeda:/src/NetBSD/cur/src/obj.amd64/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
	Attempting to
		zpool create data mirror xbd1 xbd2
	on a Xen PVH instance panics like this:
	  [ 451.7783806] panic: kernel diagnostic assertion "(req->req_bp->b_flags & B_PHYS) != 0" failed: file "/src/NetBSD/cur/src/sys/arch/xen/xen/xbd_xenbus.c", line 1373
	  [ 451.7783806] cpu0: Begin traceback...
	  [ 451.7783806] vpanic() at netbsd:vpanic+0x152
	  [ 451.7783806] __x86_indirect_thunk_rax() at netbsd:__x86_indirect_thunk_rax
	  [ 451.7783806] xbd_diskstart() at netbsd:xbd_diskstart+0x7c2
	  [ 451.7783806] dk_start() at netbsd:dk_start+0xef
	  [ 451.7783806] spec_strategy() at netbsd:spec_strategy+0x9f
	  [ 451.7783806] VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x3c
	  [ 451.7783806] zio_vdev_io_start() at zfs:zio_vdev_io_start+0x192
	  [ 451.7783806] zio_execute() at zfs:zio_execute+0xe3
	  [ 451.7783806] zio_nowait() at zfs:zio_nowait+0x5c
	  [ 451.7783806] vdev_mirror_io_start() at zfs:vdev_mirror_io_start+0x157
	  [ 451.7783806] zio_vdev_io_start() at zfs:zio_vdev_io_start+0x192
	  [ 451.7783806] zio_execute() at zfs:zio_execute+0xe3
	  [ 451.7783806] zio_nowait() at zfs:zio_nowait+0x5c
	  [ 451.7783806] vdev_mirror_io_start() at zfs:vdev_mirror_io_start+0x157
	  [ 451.7783806] zio_vdev_io_start() at zfs:zio_vdev_io_start+0x33f
	  [ 451.7783806] zio_execute() at zfs:zio_execute+0xe3
	  [ 451.7783806] task_executor() at solaris:task_executor+0x67
	  [ 451.7783806] threadpool_thread() at netbsd:threadpool_thread+0x19e
	  [ 451.7783806] cpu0: End traceback...
	  [ 451.7783806] rebooting..

>How-To-Repeat:
	perform "zpool create pool mirror xbd? xbd?" on a pvh xen instance with
	appropriate disks.
>Fix:
	?

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Fri, 19 Jun 2020 09:10:34 +0000
Responsible-Changed-Why:
Mine. The assertion is there to catch cases which trigger non-optimal
xbd I/O, to weed out those from kernel code. I'll check how it is
triggered by zfs.
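
For illustration, a minimal C sketch of the kind of check involved (kernel
context, simplified; the structure and function names are placeholders, not
the actual xbd_xenbus.c source):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/buf.h>

    /*
     * Placeholder request wrapper; the real driver keeps the buf pointer
     * inside its own per-request structure.
     */
    struct example_req {
            struct buf *req_bp;
    };

    static void
    example_diskstart(struct example_req *req)
    {
            /*
             * Diagnostic-only check: buffers reaching the driver are
             * expected to carry B_PHYS, since anything else falls back to
             * the slow unaligned-buffer handling; the assertion panics so
             * the offending kernel caller can be identified and fixed.
             */
            KASSERT((req->req_bp->b_flags & B_PHYS) != 0);

            /* ... set up and submit the Xen block I/O request ... */
    }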


From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, jdolecek@netbsd.org, kern-bug-people@netbsd.org,
 netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Cc: 
Subject: Re: kern/55397 (Xen pvh instance: zpool create on xbd* devices
 panics)
Date: Fri, 19 Jun 2020 13:40:26 +0200

 Good.

 I found another subsystem triggering the assertion.

 Try "fdisk raidX" to panic like this:

 [ 300.6631030] panic: kernel diagnostic assertion "(req->req_bp->b_flags 
 & B_PHYS) != 0" failed: file 
 "/src/NetBSD/cur/src/sys/arch/xen/xen/xbd_xenbus.c", line 1373
 [ 300.6631030] cpu2: Begin traceback...
 [ 300.6631030] vpanic() at netbsd:vpanic+0x152
 [ 300.6631030] __x86_indirect_thunk_rax() at netbsd:__x86_indirect_thunk_rax
 [ 300.6631030] xbd_diskstart() at netbsd:xbd_diskstart+0x7c2
 [ 300.6631030] dk_start() at netbsd:dk_start+0xef
 [ 300.6730966] rf_DispatchKernelIO() at netbsd:rf_DispatchKernelIO+0x1a8
 [ 300.6730966] rf_DiskIOEnqueue() at netbsd:rf_DiskIOEnqueue+0x103
 [ 300.6730966] FireNodeList() at netbsd:FireNodeList+0x67
 [ 300.6730966] rf_DispatchDAG() at netbsd:rf_DispatchDAG+0x12e
 [ 300.6730966] rf_State_ExecuteDAG() at netbsd:rf_State_ExecuteDAG+0xcb
 [ 300.6730966] rf_ContinueRaidAccess() at netbsd:rf_ContinueRaidAccess+0xb4
 [ 300.6730966] rf_DoAccess() at netbsd:rf_DoAccess+0x10c
 [ 300.6830965] raid_diskstart() at netbsd:raid_diskstart+0x100
 [ 300.6830965] dk_start() at netbsd:dk_start+0xef
 [ 300.6830965] rf_RaidIOThread() at netbsd:rf_RaidIOThread+0x142
 [ 300.6830965] cpu2: End traceback...
 [ 300.6830965] rebooting...

 Frank


 On 06/19/20 11:10, jdolecek@NetBSD.org wrote:
 > Synopsis: Xen pvh instance: zpool create on xbd* devices panics
 >
 > Responsible-Changed-From-To: kern-bug-people->jdolecek
 > Responsible-Changed-By: jdolecek@NetBSD.org
 > Responsible-Changed-When: Fri, 19 Jun 2020 09:10:34 +0000
 > Responsible-Changed-Why:
 > Mine. The assertion is there to catch cases which trigger non-optimal
 > xbd I/O, to weed out those from kernel code. I'll check how it is
 > triggered by zfs.
 >
 >
 >

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55397 CVS commit: src/sys/kern
Date: Fri, 19 Jun 2020 13:49:38 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Fri Jun 19 13:49:38 UTC 2020

 Modified Files:
 	src/sys/kern: subr_pool.c

 Log Message:
 bump the limit on max item size for pool_init()/pool_cache_init() up
 to 1 << 24, so that the pools can be used for ZFS block allocations, which
 are up to SPA_MAXBLOCKSHIFT (1 << 24)

 part of PR kern/55397 by Frank Kardel


 To generate a diff of this commit:
 cvs rdiff -u -r1.272 -r1.273 src/sys/kern/subr_pool.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
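
 For context, a hedged C sketch (kernel context, illustrative only, not the
 committed change; all names are placeholders) of creating a pool cache whose
 item size sits at the new 1 << 24 byte ceiling:

    #include <sys/param.h>
    #include <sys/intr.h>
    #include <sys/pool.h>

    #define EXAMPLE_BLKSIZE     (1 << 24)   /* largest ZFS block size */

    static pool_cache_t example_block_cache;

    static void
    example_block_cache_create(void)
    {
            /*
             * Items of this size exceeded the previous limit in
             * pool_init()/pool_cache_init(); after the bump they can be
             * served from a pool cache directly.
             */
            example_block_cache = pool_cache_init(EXAMPLE_BLKSIZE,
                0, 0, 0, "exblk", NULL, IPL_NONE, NULL, NULL, NULL);
    }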

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55397 CVS commit: src/sys/sys
Date: Fri, 19 Jun 2020 13:52:40 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Fri Jun 19 13:52:40 UTC 2020

 Modified Files:
 	src/sys/sys: param.h

 Log Message:
 bump version - maximum item size for pool_init()/pool_cache_init() changed

 PR kern/55397


 To generate a diff of this commit:
 cvs rdiff -u -r1.670 -r1.671 src/sys/sys/param.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55397 CVS commit: src/external/cddl/osnet/dist/uts/common/fs/zfs
Date: Fri, 19 Jun 2020 14:13:23 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Fri Jun 19 14:13:23 UTC 2020

 Modified Files:
 	src/external/cddl/osnet/dist/uts/common/fs/zfs: zio.c

 Log Message:
 use pool_cache for (meta)data buffers also on NetBSD

 this should generally slightly improve performance on MP systems, and
 specifically for xbd(4) storage avoids slow unaligned I/O buffer handling

 this change requires updated kernel, to allow up to SPA_MAXBLOCKSHIFT item
 size for pools

 fixes PR kern/55397 by Frank Kardel


 To generate a diff of this commit:
 cvs rdiff -u -r1.6 -r1.7 src/external/cddl/osnet/dist/uts/common/fs/zfs/zio.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
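
 Again purely illustrative (placeholder names, not the actual zio.c diff):
 with such a cache in place, a (meta)data buffer of the matching block size
 is taken from and returned to it roughly like this:

    #include <sys/pool.h>

    static void *
    example_data_alloc(pool_cache_t pc)
    {
            /* Get a cached buffer; PR_WAITOK means this may sleep. */
            return pool_cache_get(pc, PR_WAITOK);
    }

    static void
    example_data_free(pool_cache_t pc, void *buf)
    {
            /* Return the buffer to the cache for reuse. */
            pool_cache_put(pc, buf);
    }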

From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: Frank Kardel <kardel@netbsd.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>, Jaromir Dolecek <jdolecek@netbsd.org>, 
	kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55397 (Xen pvh instance: zpool create on xbd* devices panics)
Date: Fri, 19 Jun 2020 16:26:18 +0200

 Hi,

 I've fixed the problem for zfs, you'll need an updated kernel
 (src/sys/kern/subr_pool.c) and the zfs module sources
 (src/external/....). Can you confirm it works for you?

 I can repeat the raidframe path too, I'm working on a fix there.

 Jaromir

 On Fri, 19 Jun 2020 at 13:40, Frank Kardel <kardel@netbsd.org> wrote:
 >
 > Good.
 >
 > I found another subsystem triggering the assertion.
 >
 > Try "fdisk raidX" to panic like this:
 >
 > [ 300.6631030] panic: kernel diagnostic assertion "(req->req_bp->b_flags
 > & B_PHYS) != 0" failed: file
 > "/src/NetBSD/cur/src/sys/arch/xen/xen/xbd_xenbus.c", line 1373
 > [ 300.6631030] cpu2: Begin traceback...
 > [ 300.6631030] vpanic() at netbsd:vpanic+0x152
 > [ 300.6631030] __x86_indirect_thunk_rax() at netbsd:__x86_indirect_thunk_rax
 > [ 300.6631030] xbd_diskstart() at netbsd:xbd_diskstart+0x7c2
 > [ 300.6631030] dk_start() at netbsd:dk_start+0xef
 > [ 300.6730966] rf_DispatchKernelIO() at netbsd:rf_DispatchKernelIO+0x1a8
 > [ 300.6730966] rf_DiskIOEnqueue() at netbsd:rf_DiskIOEnqueue+0x103
 > [ 300.6730966] FireNodeList() at netbsd:FireNodeList+0x67
 > [ 300.6730966] rf_DispatchDAG() at netbsd:rf_DispatchDAG+0x12e
 > [ 300.6730966] rf_State_ExecuteDAG() at netbsd:rf_State_ExecuteDAG+0xcb
 > [ 300.6730966] rf_ContinueRaidAccess() at netbsd:rf_ContinueRaidAccess+0xb4
 > [ 300.6730966] rf_DoAccess() at netbsd:rf_DoAccess+0x10c
 > [ 300.6830965] raid_diskstart() at netbsd:raid_diskstart+0x100
 > [ 300.6830965] dk_start() at netbsd:dk_start+0xef
 > [ 300.6830965] rf_RaidIOThread() at netbsd:rf_RaidIOThread+0x142
 > [ 300.6830965] cpu2: End traceback...
 > [ 300.6830965] rebooting...
 >
 > Frank
 >
 >
 > On 06/19/20 11:10, jdolecek@NetBSD.org wrote:
 > > Synopsis: Xen pvh instance: zpool create on xbd* devices panics
 > >
 > > Responsible-Changed-From-To: kern-bug-people->jdolecek
 > > Responsible-Changed-By: jdolecek@NetBSD.org
 > > Responsible-Changed-When: Fri, 19 Jun 2020 09:10:34 +0000
 > > Responsible-Changed-Why:
 > > Mine. The assertion is there to catch cases which trigger non-optimal
 > > xbd I/O, to weed out those from kernel code. I'll check how it is
 > > triggered by zfs.
 > >
 > >
 > >
 >

From: Frank Kardel <kardel@netbsd.org>
To: Jaromír Doleček <jaromir.dolecek@gmail.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
 Jaromir Dolecek <jdolecek@netbsd.org>, kern-bug-people@netbsd.org,
 netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/55397 (Xen pvh instance: zpool create on xbd* devices
 panics)
Date: Fri, 19 Jun 2020 17:24:39 +0200

 Looks promising:

     zpool create: success

     zfs create (several): success

      initial writes didn't panic.

 Thanks for the quick fix!

 Frank

 On 06/19/20 16:26, Jaromír Doleček wrote:
 > Hi,
 >
 > I've fixed the problem for zfs, you'll need an updated kernel
 > (src/sys/kern/subr_pool.c) and the zfs module sources
 > (src/external/....). Can you confirm it works for you?
 >
 > I can repeat the raidframe path too, I'm working on a fix there.
 >
 > Jaromir
 >
 > On Fri, 19 Jun 2020 at 13:40, Frank Kardel <kardel@netbsd.org> wrote:
 >> Good.
 >>
 >> I found another subsystem triggering the assertion.
 >>
 >> Try "fdisk raidX" to panic like this:
 >>
 >> [ 300.6631030] panic: kernel diagnostic assertion "(req->req_bp->b_flags
 >> & B_PHYS) != 0" failed: file
 >> "/src/NetBSD/cur/src/sys/arch/xen/xen/xbd_xenbus.c", line 1373
 >> [ 300.6631030] cpu2: Begin traceback...
 >> [ 300.6631030] vpanic() at netbsd:vpanic+0x152
 >> [ 300.6631030] __x86_indirect_thunk_rax() at netbsd:__x86_indirect_thunk_rax
 >> [ 300.6631030] xbd_diskstart() at netbsd:xbd_diskstart+0x7c2
 >> [ 300.6631030] dk_start() at netbsd:dk_start+0xef
 >> [ 300.6730966] rf_DispatchKernelIO() at netbsd:rf_DispatchKernelIO+0x1a8
 >> [ 300.6730966] rf_DiskIOEnqueue() at netbsd:rf_DiskIOEnqueue+0x103
 >> [ 300.6730966] FireNodeList() at netbsd:FireNodeList+0x67
 >> [ 300.6730966] rf_DispatchDAG() at netbsd:rf_DispatchDAG+0x12e
 >> [ 300.6730966] rf_State_ExecuteDAG() at netbsd:rf_State_ExecuteDAG+0xcb
 >> [ 300.6730966] rf_ContinueRaidAccess() at netbsd:rf_ContinueRaidAccess+0xb4
 >> [ 300.6730966] rf_DoAccess() at netbsd:rf_DoAccess+0x10c
 >> [ 300.6830965] raid_diskstart() at netbsd:raid_diskstart+0x100
 >> [ 300.6830965] dk_start() at netbsd:dk_start+0xef
 >> [ 300.6830965] rf_RaidIOThread() at netbsd:rf_RaidIOThread+0x142
 >> [ 300.6830965] cpu2: End traceback...
 >> [ 300.6830965] rebooting...
 >>
 >> Frank
 >>
 >>
 >> On 06/19/20 11:10, jdolecek@NetBSD.org wrote:
 >>> Synopsis: Xen pvh instance: zpool create on xbd* devices panics
 >>>
 >>> Responsible-Changed-From-To: kern-bug-people->jdolecek
 >>> Responsible-Changed-By: jdolecek@NetBSD.org
 >>> Responsible-Changed-When: Fri, 19 Jun 2020 09:10:34 +0000
 >>> Responsible-Changed-Why:
 >>> Mine. The assertion is there to catch cases which trigger non-optimal
 >>> xbd I/O, to weed out those from kernel code. I'll check how it is
 >>> triggered by zfs.
 >>>
 >>>
 >>>

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55397 CVS commit: src/sys/dev/raidframe
Date: Fri, 19 Jun 2020 19:29:40 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Fri Jun 19 19:29:39 UTC 2020

 Modified Files:
 	src/sys/dev/raidframe: rf_dag.h rf_dagfuncs.c rf_diskqueue.c
 	    rf_diskqueue.h rf_netbsd.h rf_netbsdkintf.c

 Log Message:
 pass down b_flags B_PHYS|B_RAW|B_MEDIA_FLAGS from bio subsystem
 to component I/O

 fixes the xbd(4) KASSERT() triggered by raidframe, noted in PR kern/55397
 by Frank Kardel


 To generate a diff of this commit:
 cvs rdiff -u -r1.20 -r1.21 src/sys/dev/raidframe/rf_dag.h
 cvs rdiff -u -r1.31 -r1.32 src/sys/dev/raidframe/rf_dagfuncs.c
 cvs rdiff -u -r1.56 -r1.57 src/sys/dev/raidframe/rf_diskqueue.c
 cvs rdiff -u -r1.25 -r1.26 src/sys/dev/raidframe/rf_diskqueue.h
 cvs rdiff -u -r1.34 -r1.35 src/sys/dev/raidframe/rf_netbsd.h
 cvs rdiff -u -r1.383 -r1.384 src/sys/dev/raidframe/rf_netbsdkintf.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
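
 A hedged C sketch of the idea (placeholder function, not the actual raidframe
 diff): when raidframe sets up a buf for a component disk, the flags named in
 the log message are copied over from the originating buf, so xbd(4) sees the
 same B_PHYS marking on component I/O and the KASSERT() noted above no longer
 fires:

    #include <sys/buf.h>

    /* Flag mask as named in the commit message above. */
    #define EXAMPLE_PASSDOWN_FLAGS      (B_PHYS | B_RAW | B_MEDIA_FLAGS)

    static void
    example_setup_component_buf(struct buf *cbp, const struct buf *obp)
    {
            /*
             * Propagate the originating buffer's media-related flags so
             * the component driver treats this I/O the same way it would
             * treat the original request.
             */
            cbp->b_flags |= obp->b_flags & EXAMPLE_PASSDOWN_FLAGS;
    }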

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Fri, 19 Jun 2020 19:34:22 +0000
State-Changed-Why:
Both zfs and raidframe should be fixed with up-to-date -current; can you
please confirm?


From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, jdolecek@netbsd.org, netbsd-bugs@netbsd.org,
 gnats-admin@netbsd.org
Cc: 
Subject: Re: kern/55397 (Xen pvh instance: zpool create on xbd* devices
 panics)
Date: Sat, 20 Jun 2020 13:06:00 +0200

 raidframe fdisk manipulations now work - good.

 zfs/zpool initialization works - good.

 But zfs scrub runs into a "double fault" - see PR/55402.

 The assertion-caused issues are fixed - the PR can be closed.


 On 06/19/20 21:34, jdolecek@NetBSD.org wrote:
 > Synopsis: Xen pvh instance: zpool create on xbd* devices panics
 >
 > State-Changed-From-To: open->feedback
 > State-Changed-By: jdolecek@NetBSD.org
 > State-Changed-When: Fri, 19 Jun 2020 19:34:22 +0000
 > State-Changed-Why:
 > Both zfs and raidframe should be fixed with up-to-date -current; can you
 > please confirm?
 >
 >
 >

State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 20 Jun 2020 12:16:36 +0000
State-Changed-Why:
Problem fixed. Thanks for report and testing.


>Unformatted:
