NetBSD Problem Report #56328

From oster@gonzo.fween.ca  Sat Jul 24 22:54:33 2021
Return-Path: <oster@gonzo.fween.ca>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 203B01A921F
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 24 Jul 2021 22:54:33 +0000 (UTC)
Message-Id: <20210724225430.F14A41047FB@gonzo.fween.ca>
Date: Sat, 24 Jul 2021 16:54:30 -0600 (CST)
From: oster@netbsd.org
Reply-To: oster@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: XEN3_DOM0 bcount <= MAXPHYS panic...
X-Send-Pr-Version: 3.95

>Number:         56328
>Category:       port-xen
>Synopsis:       XEN3 DOM0 panic with bcount <= MAXPHYS in xbdback_xenbus.c
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jul 24 22:55:00 +0000 2021
>Closed-Date:    Thu Jul 29 06:42:20 +0000 2021
>Last-Modified:  Thu Jul 29 06:42:20 +0000 2021
>Originator:     Greg Oster
>Release:        NetBSD 9.99.87
>Organization:
>Environment:


System: NetBSD merlin 9.99.87 NetBSD 9.99.87 (XEN3_DOM0) #0: Fri Jul 23 22:39:05 CST 2021  oster@gonzo:/u1/builds/build350/src/obj/amd64/u1/builds/build350/src/sys/arch/amd64/compile/XEN3_DOM0 amd64

Architecture: x86_64
Machine: amd64
>Description:

When an Ubuntu 20.04.2 DOMU (with a 4.6.7 Linux kernel) is running on a
NetBSD 9.99.87 DOM0, the following panic is encountered:

[50.8700852] panic: kernel diagnostic assertion "bcount <= MAXPHYS" failed: 
file "/u1/builds/build358/src/sys/arch/xen/xen/xbdback_xenbus.c", line 1320
[50.8700852] cpu0: Begin traceback...
[50.8700852] vpanic() at netbsd:vpanic+0x14a
[50.8700852] kern_assert at netbsd:kern_assert+0x4b
[50.8700852] xbdback_co_io_gotio() at netbsd:xbdback_co_io_gotio+0x202
[50.8700852] xbdback_thread() at netbsd:xbdback_thread+0x9b
[50.8700852] cpu0: End traceback...

when the domain is started.

>How-To-Repeat:

Upgrade a DOM0 kernel from NetBSD 9.2 to 9.99.87.  Watch the machine panic 
after starting DOMUs.  Hunt for a fix, and discover that bcount is 
MAXPHYS+PAGE_SIZE (i.e. 69632), and that there are 16 segments in the IO
that is failing.  Wonder how a DOMU could have issued an IO larger than
MAXPHYS, and then realize the Ubuntu DOMU might be doing that.  Test booting
without the Ubuntu DOMU, and it works fine.  Add Ubuntu DOMU back, and it fails. 

Ubuntu DOMU is:
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 4.6.7-040607-generic x86_64)

>Fix:
Please. :)
(I'm happy to help test any proposed fix, as it's easy to replicate the issue
on my Xen machine.)

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: port-xen-maintainer->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sun, 25 Jul 2021 06:18:53 +0000
Responsible-Changed-Why:
I'll check this. It's either Linux kernel not honoring max number of segments,
or an off-by-one on our side setting it.


From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-xen/56328 (XEN3 DOM0 panic with bcount <= MAXPHYS in xbdback_xenbus.c)
Date: Sun, 25 Jul 2021 07:07:09 -0000 (UTC)

 jdolecek@NetBSD.org writes:

 >I'll check this. It's either Linux kernel not honoring max number of segments,
 >or an off-by-one on our side setting it.

 I'm not sure whether there is a size limit on a segment; if not, one segment
 could transfer up to 256 sectors (128kB, or even more with sectors larger
 than 512 bytes). Linux uses a segment per 4k page, though.

 But there can also be more than BLKIF_MAX_SEGMENTS_PER_REQUEST=11 segments in
 an indirect page. With 4k page size, it fits 256 descriptors (probably 255
 since nr_segments is uint8_t).
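To put rough numbers on the limits discussed above, here is a small C sketch. It assumes 4 KiB pages, 512-byte sectors and a 64 KiB MAXPHYS on NetBSD/amd64 (consistent with the 69632 = MAXPHYS + PAGE_SIZE figure in the report), and takes the 11-segment direct limit and the 255-segment indirect estimate from the text. The EX_* names and helper functions are hypothetical and not taken from xbdback_xenbus.c.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values (assumptions, see lead-in). */
#define EX_MAXPHYS            (64u * 1024u) /* NetBSD/amd64 MAXPHYS */
#define EX_PAGE_SIZE          4096u
#define EX_SECTOR_SIZE        512u
#define EX_DIRECT_MAX_SEGS    11u   /* BLKIF_MAX_SEGMENTS_PER_REQUEST */
#define EX_INDIRECT_MAX_SEGS  255u  /* estimate for an indirect page */

/* Largest transfer when every segment covers one full page, as Linux does. */
static uint32_t max_bytes_page_segments(uint32_t nsegs)
{
	return nsegs * EX_PAGE_SIZE;
}

/* Largest single segment if it were limited only by an 8-bit sector index. */
static uint32_t max_bytes_one_segment(uint32_t sector_size)
{
	return 256u * sector_size;
}

/* Would nsegs full-page segments still fit in one MAXPHYS-sized buffer? */
static int fits_in_maxphys(uint32_t nsegs)
{
	return max_bytes_page_segments(nsegs) <= EX_MAXPHYS;
}
```

For example, max_bytes_page_segments(EX_INDIRECT_MAX_SEGS) is 1044480 bytes, roughly 16 times MAXPHYS, which is why accumulating a whole indirect request into a single buffer cannot work in general.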

 We still accumulate the whole request into a single buffer.

 The driver needs to learn to split requests into multiple I/O operations
 and it probably shouldn't panic when it gets a request it doesn't support
 but return an error to the frontend instead.
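A minimal sketch of the two suggestions above (split an oversized request, and reject unsupported ones with an error rather than a panic), written in C. Everything here is hypothetical: ex_split_request(), struct ex_chunk and the EX_MAX_SEGS bound are illustrative names, not xbdback code. For simplicity it splits by byte count only; a real driver would also have to cut along segment (page) boundaries and keep each piece sector-aligned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAXPHYS   (64u * 1024u) /* assumed NetBSD/amd64 MAXPHYS */
#define EX_MAX_SEGS  255u          /* hypothetical per-request segment bound */

/* One backend I/O: byte offset into the request and its length. */
struct ex_chunk {
	uint32_t off;
	uint32_t len;
};

/*
 * Split a request of `total` bytes spread over `nsegs` segments into
 * pieces of at most EX_MAXPHYS bytes.  Returns the number of chunks
 * written to `out`, or a negative errno value that the caller would
 * report to the frontend as an I/O error instead of panicking.
 */
static int ex_split_request(uint32_t nsegs, uint32_t total,
    struct ex_chunk *out, size_t outlen)
{
	uint32_t off = 0;
	size_t n = 0;

	if (nsegs == 0 || nsegs > EX_MAX_SEGS || total == 0)
		return -EINVAL;		/* malformed or unsupported request */

	while (off < total) {
		uint32_t len = total - off;

		if (len > EX_MAXPHYS)
			len = EX_MAXPHYS;	/* never exceed one MAXPHYS I/O */
		if (n == outlen)
			return -E2BIG;	/* caller's chunk array is too small */
		out[n].off = off;
		out[n].len = len;
		n++;
		off += len;
	}
	return (int)n;
}
```

With the 69632-byte request from this report, the sketch produces two pieces, 65536 and 4096 bytes, instead of tripping the assertion.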

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56328 CVS commit: src/sys/arch/xen/xen
Date: Wed, 28 Jul 2021 21:38:50 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Wed Jul 28 21:38:50 UTC 2021

 Modified Files:
 	src/sys/arch/xen/xen: xbdback_xenbus.c

 Log Message:
 fix intentional, but eventually faulty off-by-one for the maximum number
 of segments for I/O - this was supposed to allow MAXPHYS-size I/O even
 with page offset, but actually ended up letting through I/O up to
 MAXPHYS+PAGE_SIZE

 the KASSERT(bcount < MAXPHYS) is kept as-is, since at that place the number
 of segments should already be validated, so it's a kernel bug if the size
 is still too big there

 fixes PR port-xen/56328 by Greg Oster


 To generate a diff of this commit:
 cvs rdiff -u -r1.97 -r1.98 src/sys/arch/xen/xen/xbdback_xenbus.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
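The off-by-one described in the log message can be reproduced with simple arithmetic. The C sketch below is a hypothetical reconstruction, not the actual xbdback_xenbus.c code: it assumes a 64 KiB MAXPHYS, 4 KiB pages, and that each segment may cover a full page. Under those assumptions the extra segment meant to absorb a page offset instead admits one more full page, giving exactly the 69632 bytes (MAXPHYS + PAGE_SIZE) seen in the report.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MAXPHYS   (64u * 1024u) /* assumed NetBSD/amd64 MAXPHYS */
#define EX_PAGE_SIZE 4096u

/* Buggy limit: one extra segment "to allow for a page offset" (17). */
static uint32_t ex_max_segs_buggy(void)
{
	return EX_MAXPHYS / EX_PAGE_SIZE + 1;
}

/* Fixed limit: no more segments than MAXPHYS worth of pages (16). */
static uint32_t ex_max_segs_fixed(void)
{
	return EX_MAXPHYS / EX_PAGE_SIZE;
}

/* Worst case: every permitted segment is a full page. */
static uint32_t ex_worst_case_bcount(uint32_t max_segs)
{
	return max_segs * EX_PAGE_SIZE;
}
```

ex_worst_case_bcount(ex_max_segs_buggy()) evaluates to 69632, which fails "bcount <= MAXPHYS", while the fixed limit tops out at exactly 65536.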

From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: PR/56328 CVS commit: src/sys/arch/xen/xen
Date: Wed, 28 Jul 2021 15:57:33 -0600

 It gets further now... but now dies at:
 [  52.3001210] panic: kernel diagnostic assertion "bcount + 
 xbd_io->xio_start_offset < VBD_VBA_SIZE" failed: file 
 "/u1/builds/build350/src/sys/arch/xen/xen/xbdback_xenbus.c", line 1314
 [  52.3001210] cpu0: Begin traceback
 [  52.3001210] vpanic() at netbsd:vpanic+0x14a
 [  52.3001210] kern_assert() at netbsd:kern_assert+0x4b
 [  52.3001210] xbdback_co_io_gotio() at netbsd:xbdback_co_io_gotio+0x3cb
 [  52.3001210] xbdback_thread() at netbsd:xbdback_thread+0x9b
 [  52.3001210] cpu0: End traceback...

 Thanks!

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 28 Jul 2021 22:14:25 +0000
State-Changed-Why:
Can you check if the committed change fixes the issue, please?


State-Changed-From-To: feedback->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 28 Jul 2021 22:15:17 +0000
State-Changed-Why:
OK, actually got the feedback. More work is needed to fix this completely.


State-Changed-From-To: analyzed->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 28 Jul 2021 22:18:24 +0000
State-Changed-Why:
OK, KASSERT() was wrong. Fixed now, can you check?


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56328 CVS commit: src/sys/arch/xen/xen
Date: Wed, 28 Jul 2021 22:17:49 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Wed Jul 28 22:17:49 UTC 2021

 Modified Files:
 	src/sys/arch/xen/xen: xbdback_xenbus.c

 Log Message:
 fix off-by-one check in another KASSERT() for bcount

 still related to PR port-xen/56328


 To generate a diff of this commit:
 cvs rdiff -u -r1.98 -r1.99 src/sys/arch/xen/xen/xbdback_xenbus.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
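The second panic is a classic exact-fit off-by-one: an I/O whose last byte lands on the last byte of the available area gives bcount + offset == size, which a strict `<` rejects. The C sketch below illustrates this. The real value of VBD_VBA_SIZE is not given in this report; the 64 KiB EX_WINDOW_SIZE used here is an assumption chosen only for illustration, and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration only: a mapping area of 64 KiB. */
#define EX_WINDOW_SIZE (64u * 1024u)

/* Off-by-one: rejects an I/O that exactly fills the area. */
static bool ex_fits_buggy(uint32_t start_offset, uint32_t bcount)
{
	return bcount + start_offset < EX_WINDOW_SIZE;
}

/* Correct bound: the I/O may end exactly at the end of the area. */
static bool ex_fits_fixed(uint32_t start_offset, uint32_t bcount)
{
	return bcount + start_offset <= EX_WINDOW_SIZE;
}
```

For a full-size I/O starting at offset 0, ex_fits_buggy(0, 65536) is false while ex_fits_fixed(0, 65536) is true; both still reject anything that would overrun the area.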

From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-xen/56328 (XEN3 DOM0 panic with bcount <= MAXPHYS in
 xbdback_xenbus.c)
Date: Wed, 28 Jul 2021 16:36:18 -0600

 Yes, all working great now!  Thanks for the quick fix!


State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Thu, 29 Jul 2021 06:42:20 +0000
State-Changed-Why:
Confirmed fixed. Thanks for the report.
Nothing to pull up; the related changes are in -current only.


>Unformatted:

Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.