NetBSD Problem Report #56434

From gson@gson.org  Sat Oct  2 11:18:28 2021
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 87CCA1A921F
	for <gnats-bugs@gnats.NetBSD.org>; Sat,  2 Oct 2021 11:18:28 +0000 (UTC)
Message-Id: <20211002111817.E6EF525417E@guava.gson.org>
Date: Sat,  2 Oct 2021 14:18:17 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: alpha testbed installs fail with cmdide errors
X-Send-Pr-Version: 3.95

>Number:         56434
>Category:       port-alpha
>Synopsis:       alpha testbed installs fail with cmdide errors
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    thorpej
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Oct 02 11:20:00 +0000 2021
>Closed-Date:    Sun Nov 26 13:17:16 +0000 2023
>Last-Modified:  Sun Nov 26 13:17:16 +0000 2023
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2021.09.25.08.54.31
>Organization:

>Environment:
System: NetBSD
Architecture: alpha
Machine: alpha
>Description:

The TNF alpha testbed is now consistently failing at the install stage,
with errors like this one:

  cmdide0:0:0: lost interrupt
  [  21.1241301]  type: ata tc_bcount: 8192 tc_skip: 0
  cmdide0:0:0: bus-master DMA error: missing interrupt, status=0x60
  [  21.1241301] wd0: excessive DMA errors - 4 in last 11 transfers
  [  21.1241301] wd0c: DMA error reading fsbn 128 of 128-143 (wd0 bn 128; cn 0 tn 2 sn 2), xfer 38, retry 0

My own testbed behaves the same way.  Logs are at:

  http://releng.netbsd.org/b5reports/alpha/commits-2021.09.html#2021.09.25.08.54.31
  https://www.gson.org/netbsd/bugs/build/alpha/commits-2021.09.html#2021.09.25.08.54.31

On both testbeds, the problem appears to have started with maya's
commit introducing the gpufw set, but I don't see any obvious
connection between the commit and the error.

>How-To-Repeat:

Install misc/py-anita and emulators/qemu from pkgsrc and run

  anita install http://nycdn.netbsd.org/pub/NetBSD-daily/HEAD/latest/alpha/

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: port-alpha-maintainer->thorpej
Responsible-Changed-By: thorpej@NetBSD.org
Responsible-Changed-When: Sun, 24 Oct 2021 23:47:47 +0000
Responsible-Changed-Why:
I just booted a -current kernel in qemu and cannot reproduce the
problem. Is it still failing for you?


State-Changed-From-To: open->feedback
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Sun, 24 Oct 2021 23:47:47 +0000
State-Changed-Why:
I just booted a -current kernel in qemu and cannot reproduce the
problem. Is it still failing for you?


From: Andreas Gustafsson <gson@gson.org>
To: thorpej@NetBSD.org
Cc: gnats-bugs@netbsd.org
Subject: Re: port-alpha/56434 (alpha testbed installs fail with cmdide errors)
Date: Mon, 25 Oct 2021 09:29:32 +0300

 thorpej@NetBSD.org wrote:
 > I just booted a -current kernel in qemu and cannot reproduce the
 > problem.

 Booting the kernel does work - the error only occurs when sysinst
 reaches the point of displaying a list of available disks.

 > Is it still failing for you?

 Yes, on both the TNF testbed and my own:

   http://releng.netbsd.org/b5reports/alpha/2021/2021.10.24.11.58.23/install.log
   https://www.gson.org/netbsd/bugs/build/alpha/2021/2021.10.21.21.35.02/install.log

 Did you try the command in the "How-To-Repeat:" section of the PR?
 It also still fails for me as of today.
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: feedback->open
State-Changed-By: gson@NetBSD.org
State-Changed-When: Fri, 05 Nov 2021 07:38:12 +0000
State-Changed-Why:
Feedback was provided.


From: Andreas Gustafsson <gson@NetBSD.org>
To: gnats-bugs@netbsd.org
Cc: thorpej@netbsd.org
Subject: Re: port-alpha/56434 (alpha testbed installs fail with cmdide errors)
Date: Sun, 27 Feb 2022 16:35:31 +0200

 I looked into this a bit by tracing the disk I/O operations in qemu
 using the qemu -trace option, and by running qemu under gdb.

 The failing disk read (of 8192 bytes at sector 128) is hitting this
 error condition at qemu-6.2.0/hw/ide/core.c line 917:

     if (prep_size < n * 512) {
         /*
          * The PRDs are too short for this request. Error condition!
          * Reset the Active bit and don't raise the interrupt.
          */
         s->status = READY_STAT | SEEK_STAT;
         dma_buf_commit(s, 0);
         goto eot;
     }

 where prep_size is 8190 (!) and n * 512 is 8192.  The prep_size
 value comes from this call a few lines earlier:

     prep_size = s->bus->dma->ops->prepare_buf(s->bus->dma, s->io_buffer_size);

 I don't really know anything about IDE DMA, but it looks like this
 function is dealing with some kind of DMA segments, and their sizes
 look odd (literally, as in not divisible by two):

   (gdb) print prd
   $19 = {addr = 2209223277, size = 3475}
   [...]
   (gdb) print prd
   $20 = {addr = 2210054144, size = 2147488365}

 where the latter size is 4717 ORed with a flag bit in the MSB.

 The segments do sum to the right size (3475 + 4717 = 8192), but
 since their sizes are odd and rounded down to even in

             len = prd.size & 0xfffe;

 the resulting total size is 2 bytes short.
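
The arithmetic above can be checked with a small standalone sketch.
This is not qemu's code: struct prd_entry and prd_usable_bytes() are
hypothetical names introduced for illustration, and only the 0xfffe
mask is taken from the line quoted above (qemu's special handling of a
zero length is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one bus-master IDE PRD entry (not qemu's type). */
struct prd_entry {
    uint32_t addr;  /* physical base address of the segment */
    uint32_t size;  /* byte count in the low 16 bits, flag bit in the MSB */
};

/*
 * Sum the segment lengths the way the quoted mask does: keep the low
 * 16 bits of each size and force the result to be even.
 */
static uint32_t
prd_usable_bytes(const struct prd_entry *prd, size_t count)
{
    uint32_t total = 0;

    assert(prd != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        total += prd[i].size & 0xfffe;
    return total;
}
```

Fed the two entries from the gdb session, the helper yields
3474 + 4716 = 8190, which is the prep_size value seen in the failing
comparison against 8192.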

 Perhaps this will provide enough clues to someone who understands
 IDE DMA better than I do to figure out where the bug lies.
 -- 
 Andreas Gustafsson, gson@NetBSD.org

From: Andreas Gustafsson <gson@NetBSD.org>
To: Jason Thorpe <thorpej@me.com>, Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@netbsd.org,
    port-alpha@netbsd.org
Subject: Re: port-alpha/56434 (alpha testbed installs fail with cmdide errors)
Date: Mon, 20 Nov 2023 16:47:36 +0200

 I have adjusted the Subject: line in an attempt to get gnats to file
 this with the PR.

 Jason Thorpe wrote:
 > What this is a DMA segment that starts at an odd boundary
 > (0x83ae126d) and has an odd number of bytes (0xd93) to reach the end
 > of the page (0x83ae2000).  The remainder of the 8K buffer (in
 > virtual space) starts on a different physical page (0x83bac000).

 Makes sense.

 > Whether or not Qemu is correct in forcing the length to be even,

 Not sure if this is the right chip, but Table III in
 http://ftp.parisc-linux.org/docs/chips/PC87415.pdf suggests that
 qemu correctly emulates that chip at least.

 Martin Husemann wrote:
 > That part is pretty easy, I can do it if emulation in QEMU works good enough

 I successfully reproduced the issue today using the command from the
 How-To-Repeat section:

   anita install http://nycdn.netbsd.org/pub/NetBSD-daily/HEAD/latest/alpha/

 -- 
 Andreas Gustafsson, gson@NetBSD.org

From: Martin Husemann <martin@duskware.de>
To: Andreas Gustafsson <gson@NetBSD.org>
Cc: Jason Thorpe <thorpej@me.com>, gnats-bugs@netbsd.org,
	port-alpha@netbsd.org
Subject: Re: port-alpha/56434 (alpha testbed installs fail with cmdide errors)
Date: Mon, 20 Nov 2023 15:57:07 +0100

 On Mon, Nov 20, 2023 at 04:47:36PM +0200, Andreas Gustafsson wrote:
 > I successfully reproduced the issue today using the command from the
 > How-To-Repeat section:
 > 
 >   anita install http://nycdn.netbsd.org/pub/NetBSD-daily/HEAD/latest/alpha/

 Yes, but first we need a non-crashing kernel, then we can fix the userland
 stuff that triggers it (which is by then supposedly getting an EINVAL back).

 Martin

State-Changed-From-To: open->analyzed
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Mon, 20 Nov 2023 15:18:12 +0000
State-Changed-Why:
We understand how this happens and the driver should now catch and
return an error when it does, but now we need a fix from sysinst to
stay inside the lines of physio.


From: "Jason R Thorpe" <thorpej@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56434 CVS commit: src/sys/dev/pci
Date: Mon, 20 Nov 2023 15:16:46 +0000

 Module Name:	src
 Committed By:	thorpej
 Date:		Mon Nov 20 15:16:46 UTC 2023

 Modified Files:
 	src/sys/dev/pci: pciide_common.c

 Log Message:
 pciide_dma_dmamap_setup(): If we end up with a DMA segment with an odd
 length, unload the map and return EIO.  Some controllers get really upset
 if a DMA segment has an odd length.  This can happen if a physio user
 performs a virtually-contiguous I/O that starts at an odd address and spans
 a page boundary where the resulting physical pages are discontiguous.

 Ultimately, it's up to the physio user to paint inside the lines, but this
 will prevent the disk controller from wandering off into the weeds, at least.

 PR port-alpha/56434
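
How such odd segments can arise may be easier to see in a sketch.  The
code below is not the NetBSD bus_dma implementation; struct seg and
split_at_pages() are hypothetical names, and the sketch assumes the
worst case where every page of the buffer maps to a physically
discontiguous page, so each page crossing starts a new segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment descriptor, for illustration only. */
struct seg {
    uint64_t offset;  /* byte offset of the segment from the buffer start */
    uint64_t len;     /* segment length in bytes */
};

/*
 * Split the range [vaddr, vaddr + len) at every page boundary.
 * Returns the number of segments written to out[], or 0 if out[] is
 * too small (a zero-length request also yields 0 segments).
 */
static size_t
split_at_pages(uint64_t vaddr, uint64_t len, uint64_t page_size,
    struct seg *out, size_t max)
{
    size_t n = 0;
    uint64_t done = 0;

    /* The page size must be a nonzero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    while (done < len) {
        uint64_t cur = vaddr + done;
        uint64_t room = page_size - (cur & (page_size - 1));
        uint64_t chunk = (len - done < room) ? len - done : room;

        if (n == max)
            return 0;
        out[n].offset = done;
        out[n].len = chunk;
        n++;
        done += chunk;
    }
    return n;
}
```

With an 8192-byte page, an 8192-byte transfer, and a start address
whose page offset is 0x126d (the offset seen in this PR), the split
gives two segments of 3475 and 4717 bytes, both odd.  Starting at an
even offset gives only even segment lengths.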


 To generate a diff of this commit:
 cvs rdiff -u -r1.67 -r1.68 src/sys/dev/pci/pciide_common.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: thorpej->martin
Responsible-Changed-By: martin@NetBSD.org
Responsible-Changed-When: Mon, 20 Nov 2023 15:38:01 +0000
Responsible-Changed-Why:
I'll deal with the userland side


From: Andreas Gustafsson <gson@gson.org>
To: thorpej@netbsd.org
Cc: gnats-bugs@netbsd.org
Subject: Re: PR/56434 CVS commit: src/sys/dev/pci
Date: Mon, 20 Nov 2023 17:38:07 +0200

 Jason R Thorpe wrote:
 >  pciide_dma_dmamap_setup(): If we end up with a DMA segment with an odd
 >  length, unload the map and return EIO.

 Shouldn't this also check for segments starting at an odd address?
 -- 
 Andreas Gustafsson, gson@gson.org

From: Jason Thorpe <thorpej@me.com>
To: Andreas Gustafsson <gson@gson.org>
Cc: Jason Thorpe <thorpej@netbsd.org>,
 gnats-bugs@netbsd.org
Subject: Re: PR/56434 CVS commit: src/sys/dev/pci
Date: Mon, 20 Nov 2023 07:40:59 -0800

 > On Nov 20, 2023, at 7:38 AM, Andreas Gustafsson <gson@gson.org> wrote:
 >
 > Jason R Thorpe wrote:
 >> pciide_dma_dmamap_setup(): If we end up with a DMA segment with an odd
 >> length, unload the map and return EIO.
 >
 > Shouldn't this also check for segments starting at an odd address?

 There is apparently no constraint on the address, only the length.

 This kind of makes sense… the original “AT attached” drives did 16-bit
 transfers, thus required an even byte length.

 -- thorpej

From: Jason Thorpe <thorpej@me.com>
To: Andreas Gustafsson <gson@gson.org>
Cc: Jason Thorpe <thorpej@netbsd.org>,
 gnats-bugs@netbsd.org
Subject: Re: PR/56434 CVS commit: src/sys/dev/pci
Date: Mon, 20 Nov 2023 08:05:36 -0800

 > On Nov 20, 2023, at 7:59 AM, Andreas Gustafsson <gson@gson.org> wrote:
 >
 > Jason Thorpe wrote:
 >> There is apparently no constraint on the address, only the length.
 >
 > http://ftp.parisc-linux.org/docs/chips/PC87415.pdf shows a 31-bit
 > address and a hardcoded LSB of 0 in Table III, and the first
 > paragraph on page 25 says "The Memory Region Physical Base Address
 > is aligned on a 2 byte boundary".
 >
 > https://pdos.csail.mit.edu/6.828/2018/readings/hardware/IDE-BusMaster.pdf
 > also has a hardcoded LSB of 0 in Figure 1.

 Ok, well, Qemu doesn’t appear to enforce that, so… (maybe I missed it?)

 -- thorpej

From: Andreas Gustafsson <gson@gson.org>
To: Jason Thorpe <thorpej@me.com>
Cc: Jason Thorpe <thorpej@netbsd.org>,
    martin@NetBSD.org,
    gnats-bugs@netbsd.org
Subject: Re: PR/56434 CVS commit: src/sys/dev/pci
Date: Mon, 20 Nov 2023 19:17:35 +0200

 Jason Thorpe wrote:
 > Ok, well, Qemu doesn't appear to enforce that, so... (maybe I missed it?)

 I don't think qemu enforces it, but we should still check for the
 benefit of real hardware.

 There is now new log output from b5, with your commit:

   http://releng.netbsd.org/b5reports/alpha/2023/2023.11.20.15.16.46/install.log

 Pay no attention to the fact that qemu exits on a signal rather than
 anita timing out; this is just because I didn't want to wait for the
 timeout and killed it manually after it had stopped making progress.

 There are fewer kernel messages than before (for example, the "lost
 interrupt" message is gone), but the kernel is still reporting a DMA
 error, and sysinst still appears to silently hang once the disk error
 has occurred.

 The disk error occurs after anita has responded to sysinst's question
 "On which disk do you want to install" with "a\n".
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: Jason Thorpe <thorpej@me.com>
Cc: Jason Thorpe <thorpej@netbsd.org>,
    gnats-bugs@netbsd.org
Subject: Re: PR/56434 CVS commit: src/sys/dev/pci
Date: Mon, 20 Nov 2023 17:59:46 +0200

 Jason Thorpe wrote:
 > There is apparently no constraint on the address, only the length.

 http://ftp.parisc-linux.org/docs/chips/PC87415.pdf shows a 31-bit
 address and a hardcoded LSB of 0 in Table III, and the first
 paragraph on page 25 says "The Memory Region Physical Base Address
 is aligned on a 2 byte boundary".

 https://pdos.csail.mit.edu/6.828/2018/readings/hardware/IDE-BusMaster.pdf
 also has a hardcoded LSB of 0 in Figure 1.
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56434 CVS commit: src/usr.sbin/sysinst
Date: Mon, 20 Nov 2023 18:03:55 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon Nov 20 18:03:55 UTC 2023

 Modified Files:
 	src/usr.sbin/sysinst: label.c util.c

 Log Message:
 Force alignment of disk buffers to at least 8 byte.
 Fixes PR 56434.
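
For readers unfamiliar with the technique, one way to obtain a disk
buffer with at least 8-byte alignment is sketched below.  This is an
assumption-laden illustration and not the code from this commit:
DISK_BUF_ALIGN and alloc_disk_buf() are hypothetical names, and the
sketch uses C11 aligned_alloc(), which sysinst may or may not use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed alignment, taken from the wording of the log message. */
#define DISK_BUF_ALIGN 8

static_assert((DISK_BUF_ALIGN & (DISK_BUF_ALIGN - 1)) == 0,
    "alignment must be a power of two");

/*
 * Allocate a buffer of at least `size` bytes whose address is a
 * multiple of DISK_BUF_ALIGN.  Returns NULL on failure or if size is 0.
 * The caller releases the buffer with free().
 */
static void *
alloc_disk_buf(size_t size)
{
    size_t rounded;

    if (size == 0)
        return NULL;
    /* C11 wants the size to be a multiple of the alignment. */
    rounded = (size + DISK_BUF_ALIGN - 1) / DISK_BUF_ALIGN * DISK_BUF_ALIGN;
    if (rounded < size)
        return NULL;    /* size_t overflow while rounding up */
    return aligned_alloc(DISK_BUF_ALIGN, rounded);
}
```

A buffer that starts on an even address and is read in whole sectors
cannot produce an odd-length DMA segment at a page crossing, which is
the condition the kernel side of this PR rejects.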


 To generate a diff of this commit:
 cvs rdiff -u -r1.49 -r1.50 src/usr.sbin/sysinst/label.c
 cvs rdiff -u -r1.73 -r1.74 src/usr.sbin/sysinst/util.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: PR/56434 CVS commit: src/usr.sbin/sysinst
Date: Mon, 20 Nov 2023 19:12:02 +0100

 On Mon, Nov 20, 2023 at 06:05:01PM +0000, Martin Husemann wrote:
 >  Modified Files:
 >  	src/usr.sbin/sysinst: label.c util.c
 >  
 >  Log Message:
 >  Force alignment of disk buffers to at least 8 byte.
 >  Fixes PR 56434.

 ... but unfortunately not the whole QEMU ATF setup - installation works now,
 but booting the fresh installation dies:

 [..]
 Building databases: dev, utmp, utmpx, services.
 Starting syslogd.
 Mounting all file systems...
 Clearing temporary files.
 Checking quotas: done.
 Setting securelevel: kern.securelevel: 0 -> 1
 swapctl: setting dump device to /dev/wd0b
 Starting virecover.
 Checking for core dump...
 [  11.6556444] panic: kernel diagnostic assertion "timo != 0 || intr" failed: file "/usr/src/sys/kern/kern_synch.c", line 249 
 [  11.6556444] cpu0: Begin traceback...
 [  11.6556444] alpha trace requires known PC =eject=
 [  11.6556444] cpu0: End traceback...
 [  11.6556444] dumping to dev 4,1 offset 196607
 [  11.6556444] dump 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 succeeded
 [  11.6556444] rebooting...


 Martin

From: Andreas Gustafsson <gson@gson.org>
To: martin@netbsd.org
Cc: gnats-bugs@netbsd.org
Subject: Re: PR/56434 CVS commit: src/usr.sbin/sysinst
Date: Mon, 20 Nov 2023 21:44:32 +0200

 Martin Husemann wrote:
 >  >  Fixes PR 56434.
 >
 >  ... but unfortunately not the whole QEMU ATF setup - installation works now,
 >  but booting the fresh installation dies:

 On b5, it has successfully booted and started running the ATF tests.
 This is using qemu-6.0.0nb2.

 Please file a separate PR about the boot failure.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Jason Thorpe <thorpej@me.com>
To: gnats-bugs@netbsd.org
Cc: martin@netbsd.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org,
 Andreas Gustafsson <gson@gson.org>
Subject: Re: PR/56434 CVS commit: src/sys/dev/pci
Date: Mon, 20 Nov 2023 13:19:16 -0800

 > On Nov 20, 2023, at 9:20 AM, Andreas Gustafsson <gson@gson.org> wrote:
 >
 > There are fewer kernel messages than before (for example, the "lost
 > interrupt" message is gone), but the kernel is still reporting a DMA
 > error, and sysinst still appears to silently hang once the disk error
 > has occurred.

 I expected it to report a DMA error, but I did not expect it to hang.
 Turns out, the “wd” code just retries the request on a DMA error.  That
 seems to be not a very smart idea, but maybe it’s a good thing to do on
 some systems?  Anyway, I will add some additional error codes so that
 “wd” can correctly determine that this error is fatal and a retry should
 not occur.

 -- thorpej

From: "Jason R Thorpe" <thorpej@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56434 CVS commit: src/sys/dev/pci
Date: Mon, 20 Nov 2023 21:59:38 +0000

 Module Name:	src
 Committed By:	thorpej
 Date:		Mon Nov 20 21:59:38 UTC 2023

 Modified Files:
 	src/sys/dev/pci: pciide_common.c

 Log Message:
 pciide_dma_dmamap_setup(): If we end up with a DMA segment with an odd
 length or odd starting address, unload the map and return EINVAL.  Some
 controllers get really upset if a DMA segment has an odd address or length.
 This can happen if a physio user performs a virtually-contiguous I/O that
 starts at an odd address and spans a page boundary where the resulting
 physical pages are discontiguous.  The EINVAL return will cause the upper
 layers in the ATA code to re-try the I/O using PIO, which should (will
 in all of my tests) succeed.

 PR port-alpha/56434
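
The kind of check described in this log message could look roughly
like the sketch below.  It is not the code from pciide_common.c:
struct dma_seg and check_dma_segs() are hypothetical names that do not
use NetBSD's bus_dma types, and the unloading of the DMA map and the
PIO retry in the upper layers are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DMA segment descriptor, for illustration only. */
struct dma_seg {
    uint64_t addr;  /* physical start address of the segment */
    uint64_t len;   /* segment length in bytes */
};

/*
 * Return EINVAL if any segment has an odd start address or an odd
 * length, and 0 if every segment is acceptable.
 */
static int
check_dma_segs(const struct dma_seg *segs, size_t nsegs)
{
    assert(segs != NULL || nsegs == 0);
    for (size_t i = 0; i < nsegs; i++) {
        if ((segs[i].addr & 1) != 0 || (segs[i].len & 1) != 0)
            return EINVAL;
    }
    return 0;
}
```

Applied to the two segments seen earlier in this PR (0xd93 bytes at
0x83ae126d and 0x126d bytes at 0x83bac000), the check returns EINVAL:
the first segment is odd in both address and length, and the second is
odd in length only.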


 To generate a diff of this commit:
 cvs rdiff -u -r1.69 -r1.70 src/sys/dev/pci/pciide_common.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

Responsible-Changed-From-To: martin->thorpej
Responsible-Changed-By: thorpej@NetBSD.org
Responsible-Changed-When: Mon, 20 Nov 2023 22:01:43 +0000
Responsible-Changed-Why:
I made a change to the PCI IDE code that can be safely pulled into
netbsd-10.


State-Changed-From-To: analyzed->pending-pullups
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Mon, 20 Nov 2023 22:01:43 +0000
State-Changed-Why:
Safe for netbsd-10 change pending.


From: "Manuel Bouyer" <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56434 CVS commit: [netbsd-10] src/sys/dev/pci
Date: Sun, 26 Nov 2023 12:38:56 +0000

 Module Name:	src
 Committed By:	bouyer
 Date:		Sun Nov 26 12:38:56 UTC 2023

 Modified Files:
 	src/sys/dev/pci [netbsd-10]: pciide_common.c

 Log Message:
 Pull up following revision(s) (requested by thorpej in ticket #470):
 	sys/dev/pci/pciide_common.c: revision 1.70
 pciide_dma_dmamap_setup(): If we end up with a DMA segment with an odd
 length or odd starting address, unload the map and return EINVAL.  Some
 controllers get really upset if a DMA segment has an odd address or length.
 This can happen if a physio user performs a virtually-contiguous I/O that
 starts at an odd address and spans a page boundary where the resulting
 physical pages are discontiguous.  The EINVAL return will cause the upper
 layers in the ATA code to re-try the I/O using PIO, which should (will
 in all of my tests) succeed.
 PR port-alpha/56434


 To generate a diff of this commit:
 cvs rdiff -u -r1.67 -r1.67.20.1 src/sys/dev/pci/pciide_common.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Manuel Bouyer" <bouyer@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56434 CVS commit: [netbsd-10] src/usr.sbin/sysinst
Date: Sun, 26 Nov 2023 12:40:51 +0000

 Module Name:	src
 Committed By:	bouyer
 Date:		Sun Nov 26 12:40:50 UTC 2023

 Modified Files:
 	src/usr.sbin/sysinst [netbsd-10]: label.c util.c

 Log Message:
 Pull up following revision(s) (requested by martin in ticket #471):
 	usr.sbin/sysinst/label.c: revision 1.50
 	usr.sbin/sysinst/util.c: revision 1.74
 Force alignment of disk buffers to at least 8 byte.
 Fixes PR 56434.


 To generate a diff of this commit:
 cvs rdiff -u -r1.46.2.1 -r1.46.2.2 src/usr.sbin/sysinst/label.c
 cvs rdiff -u -r1.71.2.1 -r1.71.2.2 src/usr.sbin/sysinst/util.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: thorpej@NetBSD.org
State-Changed-When: Sun, 26 Nov 2023 13:17:16 +0000
State-Changed-Why:
Pulled up to netbsd-10, fix already confirmed with test case and in ATF runs.


>Unformatted:
