NetBSD Problem Report #52614

From gson@gson.org  Thu Oct 12 19:45:29 2017
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 633B57A16F
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 12 Oct 2017 19:45:29 +0000 (UTC)
Message-Id: <20171012194522.C5939989E68@guava.gson.org>
Date: Thu, 12 Oct 2017 22:45:22 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: qemu virtual CD-ROM report read errors since recent wdc changes
X-Send-Pr-Version: 3.95

>Number:         52614
>Category:       kern
>Synopsis:       qemu virtual CD-ROM reports read errors since recent wdc changes
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Oct 12 19:50:00 +0000 2017
>Closed-Date:    Wed Oct 24 06:56:05 +0000 2018
>Last-Modified:  Wed Oct 24 09:20:00 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2017.10.07.20.02.07
>Organization:

>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

The testbed on babylon5.netbsd.org has done more than 16,000
i386 installs in qemu over the last six years without printing
the string "cd0a: error reading" to the console even once.

This changed on source date 2017.10.07.20.02.07, which happens
to be immediately after a bunch of wdc-related commits.  Since
then, the string "cd0a: error reading" has been printed many
times:

  babylon5.netbsd.org$ zgrep -c 'cd0a: error reading' ./2017.10*/install.log.gz
  [44 lines showing zero matches omitted]
  ./2017.10.07.20.02.07/install.log.gz:5
  ./2017.10.07.20.32.20/install.log.gz:2
  ./2017.10.07.21.53.16/install.log.gz:1
  ./2017.10.08.00.45.25/install.log.gz:3
  ./2017.10.08.01.05.13/install.log.gz:8
  ./2017.10.08.03.39.50/install.log.gz:2
  ./2017.10.08.08.29.57/install.log.gz:5
  ./2017.10.08.09.10.11/install.log.gz:4
  ./2017.10.08.14.03.46/install.log.gz:0
  ./2017.10.08.15.00.40/install.log.gz:0
  ./2017.10.08.15.29.33/install.log.gz:2
  ./2017.10.08.18.46.10/install.log.gz:0
  ./2017.10.08.20.44.19/install.log.gz:1
  ./2017.10.08.21.18.14/install.log.gz:3
  ./2017.10.08.21.33.38/install.log.gz:1
  ./2017.10.09.05.24.26/install.log.gz:2
  ./2017.10.09.10.31.50/install.log.gz:0
  ./2017.10.09.12.07.03/install.log.gz:0
  ./2017.10.09.14.28.01/install.log.gz:6
  ./2017.10.09.17.49.28/install.log.gz:2
  ./2017.10.09.23.42.40/install.log.gz:2
  ./2017.10.10.03.11.01/install.log.gz:1
  ./2017.10.10.05.35.15/install.log.gz:3
  ./2017.10.10.09.29.14/install.log.gz:2
  ./2017.10.10.11.52.51/install.log.gz:3
  ./2017.10.10.13.47.27/install.log.gz:4
  ./2017.10.10.16.04.59/install.log.gz:2
  ./2017.10.10.16.44.24/install.log.gz:0
  ./2017.10.10.17.20.42/install.log.gz:1
  ./2017.10.10.19.31.57/install.log.gz:3
  ./2017.10.10.21.37.49/install.log.gz:1
  ./2017.10.11.00.17.03/install.log.gz:1
  ./2017.10.11.06.49.03/install.log.gz:9
  ./2017.10.11.08.29.17/install.log.gz:7
  ./2017.10.11.10.53.25/install.log.gz:1
  ./2017.10.11.12.27.49/install.log.gz:4
  ./2017.10.11.17.08.32/install.log.gz:0
  ./2017.10.12.09.53.55/install.log.gz:0

>How-To-Repeat:

Install NetBSD-current/i386 in qemu using a virtual CD-ROM.
Observe the console output.

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Thu, 12 Oct 2017 19:53:50 +0000
Responsible-Changed-Why:
Over to committer.


State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 22 Oct 2017 13:23:53 +0000
State-Changed-Why:
Does this still happen after the recent atapi fixes?


From: Andreas Gustafsson <gson@gson.org>
To: jdolecek@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/52614 (qemu virtual CD-ROM reports read errors since recent wdc changes)
Date: Sun, 22 Oct 2017 23:29:41 +0300

 On Sun, 22 Oct 2017 13:23:53 +0000 (UTC), jdolecek@NetBSD.org wrote:
 > Does this still happen after the recent atapi fixes?

 Yes - the install attempt from 2017.10.22.14.25.33 sources failed
 after multiple errors reading from cd0a:

   http://releng.netbsd.org/b5reports/i386/build/2017.10.22.14.25.33/install.log

 -- 
 Andreas Gustafsson, gson@gson.org

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org, Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/52614 (qemu virtual CD-ROM reports read errors since recent
 wdc changes)
Date: Mon, 23 Oct 2017 05:20:44 +0800 (+08)

 On Sun, 22 Oct 2017, jdolecek@NetBSD.org wrote:

 > Does this still happen after the recent atapi fixes?

 In addition to the read errors, I am regularly seeing reports of "driver
 resource shortage" from qemu's cd0@piixide.  In some cases it is able
 to recover and continue; in others it seems fatal.  About 50% of the
 time I have to repeat the qemu install procedure.

 This is with sources updated on 2017-10-22 at 1:11:56 UTC.

 +------------------+--------------------------+----------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
 +------------------+--------------------------+----------------------------+

State-Changed-From-To: feedback->open
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 23 Oct 2017 07:34:07 +0000
State-Changed-Why:
Still happens, need to investigate.


From: Andreas Gustafsson <gson@gson.org>
To: jdolecek@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/52614 (qemu virtual CD-ROM reports read errors since recent wdc changes)
Date: Mon, 18 Dec 2017 20:25:11 +0200

 On October 23, jdolecek@NetBSD.org wrote:
 > Still happens, need to investigate.

 Any progress on this?  It has now been causing random installation
 failures on the testbed for almost two months.   Here's the log
 output from a recent one:

   http://releng.netbsd.org/b5reports/i386/build/2017.12.18.05.35.36/install.log

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org, jaromir.dolecek@gmail.com
Cc: 
Subject: Re: kern/52614: qemu virtual CD-ROM report read errors since recent wdc changes
Date: Fri, 22 Jun 2018 17:10:06 +0300

 In private email, Jaromir suggested I try a qemu configuration with
 ahcisata instead of wdc.

 I tried this by adding "-machine q35" to the qemu command line, which
 should cause qemu to emulate a more modern PC.  The hard disk and
 CD-ROM(s) were now detected as SATA devices:

 [   1.0205182] ahcisata0 at pci0 dev 31 function 2: vendor 8086 product 2922 (rev. 0x02)
 [   1.0205182] ahcisata0: interrupting at ioapic0 pin 16
 [   1.0205182] ahcisata0: AHCI revision 1.0, 6 ports, 32 slots, CAP 0xc0141f05<SAM,ISS=0x1=Gen1,SNCQ,S64A>
 [   1.0205182] atabus0 at ahcisata0 channel 0
 [   1.0205182] atabus1 at ahcisata0 channel 1
 [   1.0205182] atabus2 at ahcisata0 channel 2
 [   1.0205182] atabus3 at ahcisata0 channel 3
 [   1.0205182] atabus4 at ahcisata0 channel 4
 [   1.0205182] atabus5 at ahcisata0 channel 5
 (...)
 [   1.3749109] ahcisata0 port 0: device present, speed: 1.5Gb/s
 [   1.3749109] ahcisata0 port 1: device present, speed: 1.5Gb/s
 [   1.3749109] ahcisata0 port 2: device present, speed: 1.5Gb/s
 [   4.3760485] wd0 at atabus0 drive 0
 [   4.3810162] wd0: <QEMU HARDDISK>
 [   4.3810162] wd0: 1536 MB, 3120 cyl, 16 head, 63 sec, 512 bytes/sect x 3145728 sectors
 [   4.4214675] atapibus0 at atabus1: 1 targets
 [   4.4340258] cd0 at atapibus0 drive 0: <QEMU DVD-ROM, QM00003, 2.5+> cdrom removable
 [   4.4340258] atapibus1 at atabus2: 1 targets
 [   4.4463895] cd1 at atapibus1 drive 0: <QEMU DVD-ROM, QM00005, 2.5+> cdrom removable
 [   4.4711559] WARNING: 2 errors while detecting hardware; check system log.

 but when sysinst tried to mount the CD, it failed with the following errors:

 [  39.5120853] cd0(ahcisata0:1:0): request sense for a request sense ?
 [  39.5120853] cd0(ahcisata0:1:0): request sense failed with error 22
 [  39.5120853] cd0(ahcisata0:1:0): generic HBA error
 [  39.5120853] cd0: secperunit and ncylinders are zero
 [  39.5211147] cd0(ahcisata0:1:0): request sense for a request sense ?
 [  39.5211147] cd0(ahcisata0:1:0): request sense failed with error 22
 [  39.5211147] cd0(ahcisata0:1:0): generic HBA error
 [  39.5211147] WARNING: cd0: total sector size in disklabel (536870911) != the size of cd0 (0)
 [  39.5211147] WARNING: cd0: end of partition `a' exceeds the size of cd0 (0)
 [  39.5211147] WARNING: cd0: end of partition `d' exceeds the size of cd0 (0)
 [  39.5286895] cd0(ahcisata0:1:0): request sense for a request sense ?
 [  39.5286895] cd0(ahcisata0:1:0): request sense failed with error 22
 [  39.5286895] cd0(ahcisata0:1:0): generic HBA error
 [  39.5286895] cd1(ahcisata0:2:0): request sense for a request sense ?
 [  39.5286895] cd1(ahcisata0:2:0): request sense failed with error 22
 [  39.5286895] cd1(ahcisata0:2:0): generic HBA error
 [  39.5286895] cd1(ahcisata0:2:0): request sense for a request sense ?
 [  39.5286895] cd1(ahcisata0:2:0): request sense failed with error 22
 [  39.5286895] cd1(ahcisata0:2:0): generic HBA error
 [  39.5286895] cd1: secperunit and ncylinders are zero
 [  39.5286895] cd1(ahcisata0:2:0): request sense for a request sense ?
 [  39.5286895] cd1(ahcisata0:2:0): request sense failed with error 22
 [  39.5286895] cd1(ahcisata0:2:0): generic HBA error
 [  39.5286895] WARNING: cd1: total sector size in disklabel (536870911) != the size of cd1 (0)
 [  39.5286895] WARNING: cd1: end of partition `a' exceeds the size of cd1 (0)
 [  39.5494644] WARNING: cd1: end of partition `d' exceeds the size of cd1 (0)
 [  39.5494644] cd1(ahcisata0:2:0): request sense for a request sense ?
 [  39.5494644] cd1(ahcisata0:2:0): request sense failed with error 22
 [  39.5494644] cd1(ahcisata0:2:0): generic HBA error

 This is with qemu 2.12.0. The full console log from the install
 attempt, including qemu command line, is available at

   http://www.gson.org/netbsd/bugs/build/i386/2018/2018.06.22.10.17.04/install.log

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org, jaromir.dolecek@gmail.com
Cc: 
Subject: Re: kern/52614: qemu virtual CD-ROM report read errors since recent wdc changes
Date: Fri, 22 Jun 2018 20:01:19 +0300

 I repeated the "-machine q35" test with sources from the beginning of October,
 before the SATA-NCQ merge, and it failed the same way as with today's sources:

   http://www.gson.org/netbsd/bugs/build/i386/2017/2017.10.01.01.45.02/install.log

 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: jdolecek@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/52614: qemu virtual CD-ROM report read errors since recent wdc changes
Date: Tue, 21 Aug 2018 11:29:54 +0300

 Jaromir,

 This bug is still causing large numbers of random installation
 failures on the testbed.  For example, the i386 install has failed
 more than 200 times this year.

 I added a bunch of debug printfs to the kernel to try to figure out
 what's happening.  Here's a summary of what I have found so far.
 If there are other tests I can run to help debug this, please let me
 know.

 During a typical failed sysinst run, the "if (avail == 0)" branch
 in ata_get_xfer_ext() was entered more than 9000 times, always with
 flags == 0.  From reading the code, this condition results in
 ata_get_xfer_ext() returning NULL.

 53 of the NULL-returning ata_get_xfer_ext() calls were from
 wdc_atapi_scsipi_request(), causing sc_xfer->error to be
 set to XS_RESOURCE_SHORTAGE.

 One read from cd0a failed, with bp->b_error == 16 (EBUSY).

 I'm not sure how to interpret these results - are these frequent NULL
 returns from ata_get_xfer_ext() themselves the problem, or are they
 expected and the bug is the scsipi code not recovering from them?
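
 As a rough illustration of the behaviour described above, here is a
 minimal userland sketch of a bitmask slot allocator that returns NULL
 when no slot is free.  It models only the flags == 0 case (no
 sleeping).  All slotq_* names are hypothetical stand-ins, not the
 real ata(4) structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the NetBSD ata(4) types. */
struct slotq_xfer {
	int slot;
};

struct slotq {
	uint32_t avail_mask;		/* bit n set: slot n is free */
	struct slotq_xfer xfers[32];
};

/* 1-based index of the lowest set bit, or 0 if none (same idea as ffs32). */
static int
slotq_ffs32(uint32_t v)
{
	int n = 1;

	if (v == 0)
		return 0;
	while ((v & 1u) == 0) {
		v >>= 1;
		n++;
	}
	return n;
}

/* Non-waiting allocation: fails immediately when every slot is taken. */
static struct slotq_xfer *
slotq_get(struct slotq *q)
{
	int avail = slotq_ffs32(q->avail_mask);

	if (avail == 0)
		return NULL;	/* caller has to cope with the shortage */
	q->avail_mask &= ~(UINT32_C(1) << (avail - 1));
	q->xfers[avail - 1].slot = avail - 1;
	return &q->xfers[avail - 1];
}

/* Return a slot to the free mask. */
static void
slotq_put(struct slotq *q, struct slotq_xfer *x)
{
	assert(x->slot >= 0 && x->slot < 32);
	q->avail_mask |= UINT32_C(1) << x->slot;
}
```

 With a queue that has a single free slot, the second slotq_get()
 call returns NULL until slotq_put() has been called on the first
 xfer.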

 Here are the kernel log messages from the last few seconds leading up
 to the cd0a read error, with the debug printfs in place:

 [ 4066.9440596] ata_get_xfer_ext avail 0, flags 00000000
 [ 4066.9440596] ata_get_xfer_ext() returned NULL
 [ 4066.9440596] cd0(piixide0:0:1): adapter resource shortage
 [ 4067.1440971] ata_get_xfer_ext avail 0, flags 00000000
 [ 4068.9066925] ata_get_xfer_ext avail 0, flags 00000000
 [ 4068.9066925] ata_get_xfer_ext() returned NULL
 [ 4068.9066925] cd0(piixide0:0:1): adapter resource shortage
 [ 4070.8755630] ata_get_xfer_ext avail 0, flags 00000000
 [ 4070.8755630] ata_get_xfer_ext() returned NULL
 [ 4070.8755630] cd0(piixide0:0:1): adapter resource shortage
 [ 4071.0664209] ata_get_xfer_ext avail 0, flags 00000000
 [ 4071.0664209] ata_get_xfer_ext avail 0, flags 00000000
 [ 4071.0664209] ata_get_xfer_ext avail 0, flags 00000000
 [ 4071.0664209] ata_get_xfer_ext avail 0, flags 00000000
 [ 4072.8642938] ata_get_xfer_ext avail 0, flags 00000000
 [ 4072.8642938] ata_get_xfer_ext() returned NULL
 [ 4072.8642938] cd0(piixide0:0:1): adapter resource shortage
 [ 4072.8642938] cddone error=16
 [ 4072.8642938] cd0a: error (errno=16) reading fsbn 3565164 of 3565164-3565179 (cd0 bn 3565164; cn 35651 tn 0 sn 64)

 This was produced with the following patches applied.  The first
 patch is to stop sysinst from intercepting the console output,
 which will otherwise cause parts of the kernel messages to be lost.

 Index: src/usr.sbin/sysinst/run.c
 ===================================================================
 RCS file: /cvsroot/src/usr.sbin/sysinst/run.c,v
 retrieving revision 1.5
 diff -u -r1.5 run.c
 --- src/usr.sbin/sysinst/run.c	30 Dec 2014 10:10:22 -0000	1.5
 +++ src/usr.sbin/sysinst/run.c	15 Aug 2018 12:36:42 -0000
 @@ -387,7 +387,9 @@
  	char *cp, *ncp;
  	struct termios rtt, tt;
  	struct timeval tmo;
 +#if 0
  	static int do_tioccons = 2;
 +#endif

  	(void)tcgetattr(STDIN_FILENO, &tt);
  	if (openpty(&master, &slave, NULL, &tt, win) == -1) {
 @@ -401,6 +403,7 @@
  	ttysig_ignore = 1;
  	ioctl(master, TIOCPKT, &ttysig_ignore);

 +#if 0
  	/* Try to get console output into our pipe */
  	if (do_tioccons) {
  		if (ioctl(slave, TIOCCONS, &do_tioccons) == 0
 @@ -415,6 +418,7 @@
  				do_tioccons = 1;
  		}
  	}
 +#endif

  	if (logfp)
  		fflush(logfp);
 Index: src/sys/dev/ata/ata_subr.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ata/ata_subr.c,v
 retrieving revision 1.4
 diff -u -r1.4 ata_subr.c
 --- src/sys/dev/ata/ata_subr.c	20 Oct 2017 07:06:07 -0000	1.4
 +++ src/sys/dev/ata/ata_subr.c	15 Aug 2018 15:31:48 -0000
 @@ -288,6 +288,7 @@
  retry:
  	avail = ffs32(chq->queue_xfers_avail & mask);
  	if (avail == 0) {
 +                printf("ata_get_xfer_ext avail 0, flags %08x\n", flags);
  		/*
  		 * Catch code which tries to get another recovery xfer while
  		 * already holding one (wrong recursion).
 @@ -299,6 +300,7 @@
  		if (flags & C_WAIT) {
  			chq->queue_flags |= QF_NEED_XFER;
  			error = cv_wait_sig(&chq->queue_busy, &chp->ch_lock);
 +                        printf("ata_get_xfer_ext cv_wait_sig error=%d\n", error);
  			if (error == 0)
  				goto retry;
  		}
 Index: src/sys/dev/scsipi/atapi_wdc.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/scsipi/atapi_wdc.c,v
 retrieving revision 1.129
 diff -u -r1.129 atapi_wdc.c
 --- src/sys/dev/scsipi/atapi_wdc.c	17 Oct 2017 18:52:51 -0000	1.129
 +++ src/sys/dev/scsipi/atapi_wdc.c	15 Aug 2018 15:26:24 -0000
 @@ -389,6 +389,7 @@

  		xfer = ata_get_xfer_ext(atac->atac_channels[channel], false, 0);
  		if (xfer == NULL) {
 +                        printf("ata_get_xfer_ext() returned NULL\n");
  			sc_xfer->error = XS_RESOURCE_SHORTAGE;
  			scsipi_done(sc_xfer);
  			return;
 Index: src/sys/dev/scsipi/cd.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/scsipi/cd.c,v
 retrieving revision 1.341
 diff -u -r1.341 cd.c
 --- src/sys/dev/scsipi/cd.c	17 Jun 2017 22:35:50 -0000	1.341
 +++ src/sys/dev/scsipi/cd.c	14 Aug 2018 17:47:47 -0000
 @@ -591,6 +591,9 @@
  	if (obp->b_error)
  		obp->b_resid = obp->b_bcount;

 +        if (obp->b_error)
 +                printf("cd bounce error=%d\n", obp->b_error);
 +        
  	free(bounce, M_DEVBUF);
  	biodone(obp);
  }
 @@ -743,6 +746,7 @@
  	return;

  bad:
 +        printf("cdstrategy bad error=%d\n", error);
  	bp->b_error = error;
  	bp->b_resid = bp->b_bcount;
  	biodone(bp);
 @@ -915,6 +919,8 @@
  	struct buf *bp = xs->bp;

  	if (bp) {
 +                if (error)
 +                        printf("cddone error=%d\n", error);
  		bp->b_error = error;
  		bp->b_resid = xs->resid;
  		if (error) {
 Index: src/sys/kern/subr_disk.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/subr_disk.c,v
 retrieving revision 1.122
 diff -u -r1.122 subr_disk.c
 --- src/sys/kern/subr_disk.c	7 Mar 2018 21:13:24 -0000	1.122
 +++ src/sys/kern/subr_disk.c	14 Aug 2018 10:00:34 -0000
 @@ -139,7 +139,7 @@
  		pr = addlog;
  	} else
  		pr = printf;
 -	(*pr)("%s%d%c: %s %sing fsbn ", dname, unit, partname, what,
 +	(*pr)("%s%d%c: %s (errno=%d) %sing fsbn ", dname, unit, partname, what, bp->b_error,
  	    bp->b_flags & B_READ ? "read" : "writ");
  	sn = bp->b_blkno;
  	if (bp->b_bcount <= DEV_BSIZE)

 -- 
 Andreas Gustafsson, gson@gson.org

From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: Andreas Gustafsson <gson@gson.org>
Cc: Jaromir Dolecek <jdolecek@netbsd.org>, "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/52614: qemu virtual CD-ROM report read errors since recent
 wdc changes
Date: Tue, 21 Aug 2018 11:21:26 +0200

 2018-08-21 10:29 GMT+02:00 Andreas Gustafsson <gson@gson.org>:
 > During a typical failed sysinst run, the "if (avail == 0)" branch
 > in ata_get_xfer_ext() was entered more than 9000 times, always with
 > flags == 0.  From reading the code, this condition results in
 > ata_get_xfer_ext() returning NULL.

 Thanks, this does help.

 It shows there is definitely a codepath where the new code is wrong: if
 SCSIPI fails to get the xfer and returns with EAGAIN, nothing ever
 re-triggers the SCSI code to retry that transfer, and it
 eventually times out. I'll look at this in more detail; maybe I can
 fix this separately.

 The EAGAIN is more likely to happen on the legacy IDE interfaces,
 where the controller now has just one preallocated xfer for all I/O
 (disk and ATAPI). I plan to rewrite the xfer handling code to
 avoid this and not artificially limit the number of middle layer
 CCBs, e.g. switch to pool allocation or just have more of them. Stay
 tuned, I should have something soon.
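
 To make the missing retry concrete, here is a small userland sketch
 of a channel with a single xfer in which a request that cannot get
 the xfer is remembered and restarted when the xfer is released.
 This is only one possible way to re-trigger a deferred request, not
 the actual fix; the retryq_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RETRYQ_MAXPENDING 8

/* Hypothetical model of a channel with one xfer, as on legacy IDE. */
struct retryq_chan {
	bool xfer_busy;			/* the single xfer is in use */
	int running;			/* id of request holding it, -1 if none */
	int pending[RETRYQ_MAXPENDING];	/* ids of deferred requests, oldest first */
	size_t npending;
};

static void
retryq_init(struct retryq_chan *c)
{
	c->xfer_busy = false;
	c->running = -1;
	c->npending = 0;
}

/* Returns true if the request started, false if it was deferred. */
static bool
retryq_submit(struct retryq_chan *c, int id)
{
	if (!c->xfer_busy) {
		c->xfer_busy = true;
		c->running = id;
		return true;
	}
	assert(c->npending < RETRYQ_MAXPENDING);
	c->pending[c->npending++] = id;
	return false;
}

/* Completion: release the xfer and restart the oldest deferred request. */
static void
retryq_done(struct retryq_chan *c)
{
	assert(c->xfer_busy);
	c->xfer_busy = false;
	c->running = -1;
	if (c->npending > 0) {
		int id = c->pending[0];

		for (size_t i = 1; i < c->npending; i++)
			c->pending[i - 1] = c->pending[i];
		c->npending--;
		(void)retryq_submit(c, id);	/* the re-trigger step */
	}
}
```

 Without the re-trigger step in retryq_done(), a deferred request
 would sit in the pending list until something else timed it out.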

 Jaromir

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52614 CVS commit: [jdolecek-ncqfixes] src/sys/dev
Date: Sat, 22 Sep 2018 09:23:00 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sat Sep 22 09:23:00 UTC 2018

 Modified Files:
 	src/sys/dev/ata [jdolecek-ncqfixes]: TODO.ncq ata.c ata_subr.c atavar.h
 	    satapmp_subr.c wd.c wdvar.h
 	src/sys/dev/ic [jdolecek-ncqfixes]: ahcisata_core.c mvsata.c siisata.c
 	src/sys/dev/scsipi [jdolecek-ncqfixes]: atapi_wdc.c
 	src/sys/dev/usb [jdolecek-ncqfixes]: umass_isdata.c

 Log Message:
 separate ata_xfer slot allocation and the memory allocation, so that
 there can be more queued xfers than number of supported slots by controller,
 and use a pool instead of custom pre-allocation

 primarily to help PR kern/52614

 remove no longer needed custom wd(4) logic for flush cache

 switch also wd(4) trim/suspend/setcache/wdioctlstrategy to sleep waiting
 for the memory, they are all called from process context and this
 avoids spurious failures
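
 The idea of separating the two allocations can be sketched in
 userland C as follows.  This is an illustrative model only, with
 malloc() standing in for a kernel pool and hypothetical twostep_*
 names; it is not the code from this commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TWOSTEP_NOSLOT (-1)

/* Hypothetical model; not the real struct ata_xfer. */
struct twostep_xfer {
	int slot;		/* hardware slot, or TWOSTEP_NOSLOT while queued */
};

struct twostep_ctrl {
	uint32_t free_slots;	/* bit n set: controller slot n is free */
};

/* Step 1: memory allocation, limited by memory rather than slot count. */
static struct twostep_xfer *
twostep_alloc(void)
{
	struct twostep_xfer *x = malloc(sizeof(*x));

	if (x != NULL)
		x->slot = TWOSTEP_NOSLOT;
	return x;
}

/*
 * Step 2: slot allocation when the xfer is about to be started.
 * Returns 0 on success, -1 if the xfer has to stay queued for now.
 */
static int
twostep_start(struct twostep_ctrl *c, struct twostep_xfer *x)
{
	for (int n = 0; n < 32; n++) {
		uint32_t bit = UINT32_C(1) << n;

		if (c->free_slots & bit) {
			c->free_slots &= ~bit;
			x->slot = n;
			return 0;
		}
	}
	return -1;
}

/* Completion: give the slot back and free the memory. */
static void
twostep_finish(struct twostep_ctrl *c, struct twostep_xfer *x)
{
	assert(x->slot != TWOSTEP_NOSLOT);
	c->free_slots |= UINT32_C(1) << x->slot;
	free(x);
}
```

 With a one-slot controller, two xfers can exist at the same time;
 the second simply stays unstarted until the first finishes.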


 To generate a diff of this commit:
 cvs rdiff -u -r1.4.2.3 -r1.4.2.4 src/sys/dev/ata/TODO.ncq
 cvs rdiff -u -r1.141.6.5 -r1.141.6.6 src/sys/dev/ata/ata.c
 cvs rdiff -u -r1.6.2.4 -r1.6.2.5 src/sys/dev/ata/ata_subr.c
 cvs rdiff -u -r1.99.2.4 -r1.99.2.5 src/sys/dev/ata/atavar.h
 cvs rdiff -u -r1.14 -r1.14.2.1 src/sys/dev/ata/satapmp_subr.c
 cvs rdiff -u -r1.441.2.3 -r1.441.2.4 src/sys/dev/ata/wd.c
 cvs rdiff -u -r1.46.6.1 -r1.46.6.2 src/sys/dev/ata/wdvar.h
 cvs rdiff -u -r1.62.2.4 -r1.62.2.5 src/sys/dev/ic/ahcisata_core.c
 cvs rdiff -u -r1.41.2.3 -r1.41.2.4 src/sys/dev/ic/mvsata.c
 cvs rdiff -u -r1.35.6.4 -r1.35.6.5 src/sys/dev/ic/siisata.c
 cvs rdiff -u -r1.129.6.3 -r1.129.6.4 src/sys/dev/scsipi/atapi_wdc.c
 cvs rdiff -u -r1.36 -r1.36.6.1 src/sys/dev/usb/umass_isdata.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 22 Oct 2018 20:24:44 +0000
State-Changed-Why:
Can you test after the jdolecek-ncqfixes branch merge?


From: Andreas Gustafsson <gson@gson.org>
To: jdolecek@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/52614 (qemu virtual CD-ROM reports read errors since recent wdc changes)
Date: Wed, 24 Oct 2018 09:50:00 +0300

 jdolecek@NetBSD.org wrote:
 > Can you test after the jdolecek-ncqfixes branch merge?

 Looks like the bug is fixed.  The b5 i386 testbed has now done 13
 successful installs in a row since the jdolecek-ncqfixes merge, when
 the previous record since the sata-ncq merge was four in a row.  Also,
 the "cd0a: error reading" message is absent from the install logs.
 Thank you!
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 24 Oct 2018 06:56:05 +0000
State-Changed-Why:
Reported fixed. Thanks for the report.


From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/52614 (qemu virtual CD-ROM reports read errors since recent wdc changes)
Date: Wed, 24 Oct 2018 15:15:34 +0700

     Date:        Wed, 24 Oct 2018 06:55:01 +0000 (UTC)
     From:        Andreas Gustafsson <gson@gson.org>
     Message-ID:  <20181024065501.083BA7A237@mollari.NetBSD.org>

   |  the previous record since the sata-ncq merge was four in a row.

 That's just this month, unless you're counting the read error messages.

 At the end of Aug there was a sequence of 12 in a row with no install
 failure, which continued for 8 more at the beginning of Sep, for a run of
 20 build/install/test cycles with no build or install failures.

 Just before that run in Aug there was another longish sequence of
 successful installs (though this one broken by a lot of build failures
 as well - which are irrelevant for this purpose.)   I'm sure there have
 been others.

 So while things are certainly looking promising, I would not claim
 success quite yet (though this PR need not be reopened, or not
 unless another failure of the same kind occurs).

   |  Also,
   |  the "cd0a: error reading" message is absent from the install logs.

 That, if anything, is a more promising sign I think.

 kre

From: Andreas Gustafsson <gson@gson.org>
To: Robert Elz <kre@munnari.OZ.AU>
Cc: jdolecek@NetBSD.org, gnats-bugs@NetBSD.org
Subject: Re: kern/52614 (qemu virtual CD-ROM reports read errors since recent wdc changes)
Date: Wed, 24 Oct 2018 11:35:35 +0300

 Robert Elz wrote:
 >    |  the previous record since the sata-ncq merge was four in a row.
 >  
 >  That's just this month, unless you're counting the read error messages.

 Mea culpa.  I was in fact counting installs with no "cd0a: error
 reading" message, not successful installs.  But in any case, going
 from at most 4 in a row of those to at least 13 does look promising.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
    Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/52614 (qemu virtual CD-ROM reports read errors since recent
 wdc changes)
Date: Wed, 24 Oct 2018 17:17:49 +0800 (+08)

 >   |  Also,
 >   |  the "cd0a: error reading" message is absent from the install logs.
 >
 > That, if anything, is a more promising sign I think.

 Yes, this is, I think, the most important item.  There were many times
 when "only a few" (or even only 1) read errors occurred and the driver
 was able to retry and recover.

 Prior to this issue arising, I had never seen the cd0a: error message,
 so now that it has disappeared I would strongly feel that we've "fixed
 the bug".


 +------------------+--------------------------+----------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
 +------------------+--------------------------+----------------------------+

>Unformatted:
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.