NetBSD Problem Report #52126

From www@NetBSD.org  Fri Mar 31 15:30:43 2017
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 67E107A0F3
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 31 Mar 2017 15:30:43 +0000 (UTC)
Message-Id: <20170331153042.17C0D7A2AC@mollari.NetBSD.org>
Date: Fri, 31 Mar 2017 15:30:42 +0000 (UTC)
From: prlw1@cam.ac.uk
Reply-To: prlw1@cam.ac.uk
To: gnats-bugs@NetBSD.org
Subject: mvsata Marvell 88SX6081 panics on boot
X-Send-Pr-Version: www-1.0

>Number:         52126
>Category:       kern
>Synopsis:       mvsata Marvell 88SX6081 panics on boot
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Mar 31 15:35:00 +0000 2017
>Closed-Date:    
>Last-Modified:  Wed May 05 19:16:31 +0000 2021
>Originator:     Patrick Welche
>Release:        NetBSD-7.99.66/amd64
>Organization:
>Environment:
>Description:
First reported in

http://mail-index.netbsd.org/current-users/2017/02/20/msg031172.html

quoting:

mvsata0 at pci5 dev 1 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
mvsata0: interrupting at ioapic1 pin 1
mvsata0: GenII, 2hc, 4port/hc

mvsata0:0:0: unable to stop EDMA
panic: EDMA enabled


but it took me a while to realise that kgdb works much better if you
boot -1

(gdb) bt
#0  0xffffffff80215455 in breakpoint ()
#1  0xffffffff8082c0c8 in vpanic (
    fmt=fmt@entry=0xffffffff80edac77 "EDMA enabled", 
    ap=ap@entry=0xffffffff81645818) at ../../../../kern/subr_prf.c:340
#2  0xffffffff8082c185 in panic (
    fmt=fmt@entry=0xffffffff80edac77 "EDMA enabled")
    at ../../../../kern/subr_prf.c:258
#3  0xffffffff804b6d21 in mvsata_pmp_select (mvport=0xfffffe811d78d810, 
    pmpport=0) at ../../../../dev/ic/mvsata.c:241
#4  0xffffffff804b80e1 in mvsata_wdc_cmd_start (chp=0xfffffe811d78d810, 
    xfer=0xfffffe811dc60f60) at ../../../../dev/ic/mvsata.c:1586
#5  0xffffffff804b7100 in mvsata_exec_command (drvp=<optimized out>, 
    ata_c=0xffffffff81645978) at ../../../../dev/ic/mvsata.c:677
#6  0xffffffff8026e356 in ata_get_params (drvp=0xfffffe811dcb2e80, 
    flags=<optimized out>, prms=0xfffffe811dcb358c)
    at ../../../../dev/ata/ata.c:762
#7  0xffffffff8026a5a1 in wd_get_params (wd=0xfffffe811dcb3008, 
    flags=<optimized out>, params=0xfffffe811dcb358c)
    at ../../../../dev/ata/wd.c:1776
#8  0xffffffff8026ae08 in wdopen (dev=<optimized out>, flag=<optimized out>, 
    fmt=24576, l=<optimized out>) at ../../../../dev/ata/wd.c:990
#9  0xffffffff808194b4 in bdev_open (dev=3, flag=1073741825, devtype=24576, 
    l=0xffffffff812764e0 <lwp0>) at ../../../../kern/subr_devsw.c:716
#10 0xffffffff8089b03d in spec_open (v=0xffffffff81645b20)
    at ../../../../miscfs/specfs/spec_vnops.c:609
#11 0xffffffff8088f25c in VOP_OPEN (vp=0xfffffe811dc5fd20, 
    mode=mode@entry=1073741825, cred=cred@entry=0xffffffffffffffff)
    at ../../../../kern/vnode_if.c:238
#12 0xffffffff80668c71 in rf_find_raid_components ()
    at ../../../../dev/raidframe/rf_netbsdkintf.c:2874
#13 0xffffffff8066a1ec in rf_autoconfig (self=<optimized out>)
    at ../../../../dev/raidframe/rf_netbsdkintf.c:445
#14 0xffffffff808137d4 in config_finalize ()
    at ../../../../kern/subr_autoconf.c:2166
#15 0xffffffff80b7d9a1 in main () at ../../../../kern/init_main.c:613
(gdb) frame 4
#4  0xffffffff804b80e1 in mvsata_wdc_cmd_start (chp=0xfffffe811d78d810, 
    xfer=0xfffffe811dc60f60) at ../../../../dev/ic/mvsata.c:1586
1586            mvsata_pmp_select(mvport, drive);
(gdb) print *xfer
$1 = {c_flags = 16, c_chp = 0xfffffe811d78d810, c_drive = 0, 
  c_cmd = 0xffffffff81645978, c_databuf = 0xfffffe8125788c88, c_bcount = 512, 
  c_skip = 0, c_dscpoll = 0, c_lenoff = 0, c_xferchain = {tqe_next = 0x0, 
    tqe_prev = 0xfffffe811d78de80}, 
  c_start = 0xffffffff804b806b <mvsata_wdc_cmd_start>, 
  c_intr = 0xffffffff804b7d9f <mvsata_wdc_cmd_intr>, 
  c_kill_xfer = 0xffffffff804b795a <mvsata_wdc_cmd_kill_xfer>}
(gdb) print {struct ata_command}xfer->c_cmd
$3 = {r_lba = 0, r_count = 0, {r_features = 0, r_error = 0 '\000'}, {
    r_command = 236 '\354', r_status = 236 '\354'}, r_device = 0 '\000', 
  r_st_bmask = 64 '@', r_st_pmask = 8 '\b', flags = 9, timeout = 3000, 
  data = 0xfffffe8125788c88, bcount = 512, callback = 0x0, callback_arg = 0x0}

flags = 9 = AT_READ | AT_WAIT

r_command = 236 = 0xEC = WDCC_IDENTIFY

so mvsata just seems to be trying to identify the drives on boot. What is EDMA?
>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.orgu
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Fri, 31 Mar 2017 17:38:30 +0200

 On Fri, Mar 31, 2017 at 03:35:00PM +0000, prlw1@cam.ac.uk wrote:
 > mvsata0:0:0: unable to stop EDMA
 > panic: EDMA enabled

 I have seen that when the machine had trouble talking to the disk, a
 full reset usualy fixed that (but later I replaced the drive due to SMART
 failurs and haven't seen it again since then).

 Is this reprducable for you? Does it happen always on boot? Did the device/
 disk combo work before?

 Martin

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Fri, 31 Mar 2017 16:46:21 +0100

 On Fri, Mar 31, 2017 at 03:40:01PM +0000, Martin Husemann wrote:
 >  Is this reprducable for you? Does it happen always on boot? Did the device/
 >  disk combo work before?

 When I received that old server, I successfully booted it with the ubuntu
 that was installed. I have always had the netbsd netboot panic. It is true
 that given its age, the chance of one of the 8 disks on each mvsata board
 being bad is quite high.

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Fri, 31 Mar 2017 17:17:05 +0100

 Just booted the already installed Ubuntu 14.04.1.

   for i in a b c d e f g h i j k l m n o p; do smartctl -a /dev/sd$i | grep overall-health; done

 gets me

   SMART overall-health self-assessment test result: PASSED

 for all 16 disks, even though there is plenty of "Pre-fail" and "Old_age"

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Mon, 3 Apr 2017 12:01:33 +0100

 Removed 15 of the 16 disks from the 2 mvsata cards, and get same panic.

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Tue, 4 Apr 2017 13:54:23 +0100

 I took a leaf out of the ARMADAXP (evbarm) and added

   options MVSATA_WITHOUTDMA

 to my amd64 kernel configuration. The panic is now worked-around, I just
 need to create more wd devices(!)

 Anyway, this suggests a software rather than hardware bug.

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Tue, 25 Apr 2017 22:00:02 +0000
Responsible-Changed-Why:
Collecting mvsata(4) issues, soon to do changes on ncq branch.


From: Matt Brocklehurst <matt@mattbrocklehurst.co.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Subject: Re: kern/52126
Date: Fri, 18 Aug 2017 13:06:43 +0100

 --94eb2c0a956e6ce423055705f6e6
 Content-Type: text/plain; charset="UTF-8"

 I'm also having this problem, about to build a new kernel with the DMA
 disabled in the hope to get it running properly.

 I've got a Thecus 5200Pro NAS running NetBSD 7.1
  Controller is 88SX6081


 I've tried a mixture of different drives, they all give the "unable to stop
 EDMA" error on boot.

 --94eb2c0a956e6ce423055705f6e6
 Content-Type: text/html; charset="UTF-8"
 Content-Transfer-Encoding: quoted-printable

 <div dir=3D"ltr"><div><div>I&#39;m also having this problem, about to build=
  a new kernel with the DMA disabled in the hope to get it running properly.=
 <br><br></div>I&#39;ve got a Thecus 5200Pro NAS running NetBSD 7.1 <br>=C2=
 =A0Controller is 88SX6081 <br><br><br></div>I&#39;ve tried a mixture of dif=
 ferent drives, they all give the &quot;unable to stop EDMA&quot; error on b=
 oot.<br></div>

 --94eb2c0a956e6ce423055705f6e6--

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: Subject: Re: kern/52126
Date: Sat, 19 Aug 2017 08:41:28 -0000 (UTC)

 matt@mattbrocklehurst.co.uk (Matt Brocklehurst) writes:

 > I've tried a mixture of different drives, they all give the "unable to stop
 > EDMA" error on boot.

 The corresponding code in Linux does almost the same. Differences are:

 - it not only waits for the IDLE flag, but also for the CACHEEMPTY flag
   before disabling DMA.
 - it waits up to 15ms (instead of 10ms) before disabling DMA
 - it disables DMA even when the idle timeout is reached.
 - it waits up to 100ms (instead of the remainder of 10ms) after disabling DMA
   before giving up.

 On the other hand, our timeouts are very imprecise as the driver tries to
 sleep for 1 millisecond, which is impossible with HZ=100. A sleep for
 mstohz(1) will sleep 0 ticks, i.e. it will only yield the CPU to other
 LWPs. It is likely that the required delays are not met.


 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org, Michael van Elst <mlelstv@serpens.de>
Cc: Jaromir Dolecek <jdolecek@netbsd.org>, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
	Patrick Welche <prlw1@cam.ac.uk>
Subject: Re: Subject: Re: kern/52126
Date: Sat, 19 Aug 2017 21:31:26 +0200

 --001a1148f876afc9050557204acf
 Content-Type: text/plain; charset="UTF-8"

 I've changed the sleep/delay code on jdolecek-ncq branch so that the sleep
 is always at least 1 tick. You might give it a try - if that is really the
 only issue, this might help. Increasing the timeouts however did not fix
 the issue for me on 60xx, at least not with the brief testing I did so far.

 I've experienced the EDMA timeout on 6048 just while booting or some light
 use, while it almost never happened on 70xxx. It usually happened when the
 chip was wedged for other reasons, like unrecovered NCQ error or some such.
 So I think that there might be something else we do wrong for 60xx, and the
 EDMA disable problem is just consequence.

 60xx was however not really focus of the NCQ branch, and I won't do
 anything further for it on the branch. My focus is now on completing my
 branch and merge, and some RL stuff. Then I may start looking on the 60xx
 issue, but that won't be before late September.

 Jaromir

 2017-08-19 10:45 GMT+02:00 Michael van Elst <mlelstv@serpens.de>:

 > The following reply was made to PR kern/52126; it has been noted by GNATS.
 >
 > From: mlelstv@serpens.de (Michael van Elst)
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: Subject: Re: kern/52126
 > Date: Sat, 19 Aug 2017 08:41:28 -0000 (UTC)
 >
 >  matt@mattbrocklehurst.co.uk (Matt Brocklehurst) writes:
 >
 >  > I've tried a mixture of different drives, they all give the "unable to
 > stop
 >  > EDMA" error on boot.
 >
 >  The corresponding code in Linux does almost the same. Differences are:
 >
 >  - it not only waits for the IDLE flag, but also for the CACHEEMPTY flag
 >    before disabling DMA.
 >  - it waits up to 15ms (instead of 10ms) before disabling DMA
 >  - it disables DMA even when the idle timeout is reached.
 >  - it waits up to 100ms (instead of the remainder of 10ms) after disabling
 > DMA
 >    before giving up.
 >
 >  On the other hand, our timeouts are very imprecise as the driver tries to
 >  sleep for 1 millisecond, which is impossible with HZ=100. A sleep for
 >  mstohz(1) will sleep 0 ticks, i.e. it will only yield the CPU to other
 >  LWPs. It is likely that the required delays are not met.
 >
 >
 >  --
 >  --
 >                                  Michael van Elst
 >  Internet: mlelstv@serpens.de
 >                                  "A potential Snark may lurk in every
 > tree."
 >
 >

 --001a1148f876afc9050557204acf
 Content-Type: text/html; charset="UTF-8"
 Content-Transfer-Encoding: quoted-printable

 <div dir=3D"ltr">I&#39;ve changed the sleep/delay code on jdolecek-ncq bran=
 ch so that the sleep is always at least 1 tick. You might give it a try - i=
 f that is really the only issue, this might help. Increasing the timeouts h=
 owever did not fix the issue for me on 60xx, at least not with the brief te=
 sting I did so far.<div><br></div><div>I&#39;ve experienced the EDMA timeou=
 t on 6048 just while booting or some light use, while it almost never happe=
 ned on 70xxx. It usually happened when the chip was wedged for other reason=
 s, like unrecovered NCQ error or some such. So I think that there might be =
 something else we do wrong for 60xx, and the EDMA disable problem is just c=
 onsequence.=C2=A0<br></div><div><br></div><div>60xx was however not really =
 focus of the NCQ branch, and I won&#39;t do anything further for it on the =
 branch. My focus is now on completing my branch and merge, and some RL stuf=
 f. Then I may start looking on the 60xx issue, but that won&#39;t be before=
  late September.</div><div><br></div><div>Jaromir<br></div></div><div class=
 =3D"gmail_extra"><br><div class=3D"gmail_quote">2017-08-19 10:45 GMT+02:00 =
 Michael van Elst <span dir=3D"ltr">&lt;<a href=3D"mailto:mlelstv@serpens.de=
 " target=3D"_blank">mlelstv@serpens.de</a>&gt;</span>:<br><blockquote class=
 =3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padd=
 ing-left:1ex"><span class=3D"">The following reply was made to PR kern/5212=
 6; it has been noted by GNATS.<br>
 <br>
 </span>From: <a href=3D"mailto:mlelstv@serpens.de">mlelstv@serpens.de</a> (=
 Michael van Elst)<br>
 To: <a href=3D"mailto:gnats-bugs@netbsd.org">gnats-bugs@netbsd.org</a><br>
 Cc:<br>
 Subject: Re: Subject: Re: kern/52126<br>
 Date: Sat, 19 Aug 2017 08:41:28 -0000 (UTC)<br>
 <span class=3D""><br>
 =C2=A0<a href=3D"mailto:matt@mattbrocklehurst.co.uk">matt@mattbrocklehurst.=
 co.uk</a> (Matt Brocklehurst) writes:<br>
 <br>
 =C2=A0&gt; I&#39;ve tried a mixture of different drives, they all give the =
 &quot;unable to stop<br>
 =C2=A0&gt; EDMA&quot; error on boot.<br>
 <br>
 </span>=C2=A0The corresponding code in Linux does almost the same. Differen=
 ces are:<br>
 <br>
 =C2=A0- it not only waits for the IDLE flag, but also for the CACHEEMPTY fl=
 ag<br>
 =C2=A0 =C2=A0before disabling DMA.<br>
 =C2=A0- it waits up to 15ms (instead of 10ms) before disabling DMA<br>
 =C2=A0- it disables DMA even when the idle timeout is reached.<br>
 =C2=A0- it waits up to 100ms (instead of the remainder of 10ms) after disab=
 ling DMA<br>
 =C2=A0 =C2=A0before giving up.<br>
 <br>
 =C2=A0On the other hand, our timeouts are very imprecise as the driver trie=
 s to<br>
 =C2=A0sleep for 1 millisecond, which is impossible with HZ=3D100. A sleep f=
 or<br>
 =C2=A0mstohz(1) will sleep 0 ticks, i.e. it will only yield the CPU to othe=
 r<br>
 =C2=A0LWPs. It is likely that the required delays are not met.<br>
 <br>
 <br>
 =C2=A0--<br>
 =C2=A0--<br>
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
 =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Michael van Elst<br>
 =C2=A0Internet: <a href=3D"mailto:mlelstv@serpens.de">mlelstv@serpens.de</a=
 ><br>
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
 =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0&quot;A potential Snark may lu=
 rk in every tree.&quot;<br>
 <br>
 </blockquote></div><br></div>

 --001a1148f876afc9050557204acf--

From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Fwd: Subject: Re: kern/52126
Date: Sun, 20 Aug 2017 00:29:29 +0200

 --001a11471ce06c41e2055722c70f
 Content-Type: text/plain; charset="UTF-8"

 (once more in plain text for record)

 I've changed the sleep/delay code on jdolecek-ncq branch so that the sleep
 is always at least 1 tick. You might give it a try - if that is really the
 only issue, this might help. Increasing the timeouts however did not fix
 the issue for me on 60xx, at least not with the brief testing I did so far.

 I've experienced the EDMA timeout on 6048 just while booting or some light
 use, while it almost never happened on 70xxx. It usually happened when the
 chip was wedged for other reasons, like unrecovered NCQ error or some such.
 So I think that there might be something else we do wrong for 60xx, and the
 EDMA disable problem is just consequence.

 60xx was however not really focus of the NCQ branch, and I won't do
 anything further for it on the branch. My focus is now on completing my
 branch and merge, and some RL stuff. Then I may start looking on the 60xx
 issue, but that won't be before late September.

 Jaromir

 --001a11471ce06c41e2055722c70f
 Content-Type: text/html; charset="UTF-8"
 Content-Transfer-Encoding: quoted-printable

 <div dir=3D"ltr">(once more in plain text for record)<br><br>I&#39;ve chang=
 ed the sleep/delay code on jdolecek-ncq branch so that the sleep is always =
 at least 1 tick. You might give it a try - if that is really the only issue=
 , this might help. Increasing the timeouts however did not fix the issue fo=
 r me on 60xx, at least not with the brief testing I did so far.<br><br>I&#3=
 9;ve experienced the EDMA timeout on 6048 just while booting or some light =
 use, while it almost never happened on 70xxx. It usually happened when the =
 chip was wedged for other reasons, like unrecovered NCQ error or some such.=
  So I think that there might be something else we do wrong for 60xx, and th=
 e EDMA disable problem is just consequence.<br><br>60xx was however not rea=
 lly focus of the NCQ branch, and I won&#39;t do anything further for it on =
 the branch. My focus is now on completing my branch and merge, and some RL =
 stuff. Then I may start looking on the 60xx issue, but that won&#39;t be be=
 fore late September.<br><br>Jaromir</div>

 --001a11471ce06c41e2055722c70f--

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Wed, 8 Nov 2017 13:08:20 +0000

 Should I be looking at jdolecek-ncq? I just checked today's -current, and:


 Standard GENERIC-8.99.5/amd64:
 mvsata1 at pci5 dev 2 function 0: vendor 11ab product 6081 (rev. 0x09)
 mvsata1: interrupting at ioapic1 pin 2
 mvsata1: GenII, 2hc, 4port/hc
 mvsata1 port 4: device present, speed: 3.0Gb/s
 mvsata1:4:0: lost interrupt
         type: ata tc_bcount: 16384 tc_skip: 0

 and then machine apparently hangs

 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 0       94 3   2       200   ffffe4011dd43ae0         mvsata1cnf biowait


 kernel with MVSATA_WITHOUTDMA etc.

 mvsata1 at pci5 dev 2 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
 mvsata1: interrupting at ioapic1 pin 2
 mvsata1: GenII, 2hc, 4port/hc
 mvsata1 port 4: device present, speed: 3.0Gb/s
 mvsata1:4:0: lost interrupt
         type: ata tc_bcount: 16384 tc_skip: 0
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), slot 0, retry 1
 mvsata1:4:0: lost interrupt
         type: ata tc_bcount: 16384 tc_skip: 0
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), slot 0, retry 2
 mvsata1:4:0: lost interrupt
         type: ata tc_bcount: 16384 tc_skip: 0
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), slot 0, retry 3
 mvsata1:4:0: lost interrupt
         type: ata tc_bcount: 16384 tc_skip: 0
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), slot 0, retry 4
 mvsata1:4:0: lost interrupt
         type: ata tc_bcount: 15872 tc_skip: 512
 wd0d: device timeout reading fsbn 1 of 0-31 (wd0 bn 1; cn 0 tn 0 sn 1), slot 0, retry 5
 fatal page fault in supervisor mode
 trap type 6 code 0x2 rip 0xffffffff8021c9e3 cs 0x8 rflags 0x10202 cr2 0xffff800020991000 ilevel 0x8 rsp 0xffffe4011d271be0
 curlwp 0xffffe4045ef0b840 pid 0.2 lowest kstack 0xffffe4011d26e2c0
 kernel: page fault trap, code=0
 Stopped in pid 0.2 (system) at  netbsd:bus_space_read_multi_2+0x23:     movw    %
 ax,0(%rcx)
 bus_space_read_multi_2() at netbsd:bus_space_read_multi_2+0x23
 wdc_ata_bio_intr() at netbsd:wdc_ata_bio_intr+0x2e7
 mvsata_intr() at netbsd:mvsata_intr+0x107
 mvsata_pci_intr() at netbsd:mvsata_pci_intr+0x12e
 intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
 Xintr_ioapic_level4() at netbsd:Xintr_ioapic_level4+0xf7
 --- interrupt ---
 x86_stihlt() at netbsd:x86_stihlt+0x6
 acpicpu_cstate_idle_enter() at netbsd:acpicpu_cstate_idle_enter+0xdb
 acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xb6
 idle_loop() at netbsd:idle_loop+0x19c


 There is no guarantee that the disks are OK, but I have tried several...

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 20 May 2019 10:45:39 +0100

 An update - in the meantime jdolecek-ncq was merged, and I'm sure I saw
 something about DMA...

 >  and then machine apparently hangs

 I still see the hang

 >  kernel with MVSATA_WITHOUTDMA etc.

 but don't need the MVSATA_WITHOUTDMA. No more panics!

 GENERIC 8.9941/amd64 kernel, with LOCKDEBUG, DIAGNOSTIC and KUBSAN,
 booted with -1 just in case (no difference in multiprocessor), and:

 cpu0: package 0, core 0, smt 0
 cpu1 at mainbus0 apid 1: multiprocessor boot disabled
 ...
 mvsata0 at pci5 dev 1 function 0: vendor 11ab product 6081 (rev. 0x09)
 mvsata0: interrupting at ioapic1 pin 1
 mvsata0: GenII, 2hc, 4port/hc
 atabus0 at mvsata0 channel 0
 atabus1 at mvsata0 channel 1
 ...
 mvsata0 port 0: device present, speed: 1.5Gb/s
 mvsata0 port 6: device present, speed: 1.5Gb/s
 ahcisata0 port 0: device present, speed: 1.5Gb/s
 ahcisata0 port 1: device present, speed: 1.5Gb/s
 wd0 at atabus0 drive 0
 wd0: <ST3200822AS>
 wd0: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
 mvsata0:0:0: lost interrupt
  type: ata tc_bcount: 16384 tc_skip: 0
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 0
 mvsata0:0:0: lost interrupt
  type: ata tc_bcount: 16384 tc_skip: 0
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 1
 ipmi0: ID 32.1 IPMI 2.0 Available
 mvsata0: channel 0: drive 0 not ready, st=0xd0, err=0x00
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 2
 mvsata0: channel 0: drive 0 not ready, st=0xd0, err=0x00
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 3
 mvsata0: channel 0: drive 0 not ready, st=0xd0, err=0x00
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 4
 mvsata0: channel 0: drive 0 not ready, st=0xd0, err=0x00
 wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0)
 wd0d: error reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0)

 hang

 break into ddb:

 --- interrupt ---
 x86_stihlt() at netbsd:x86_stihlt+0x6
 acpicpu_cstate_idle_enter() at netbsd:acpicpu_cstate_idle_enter+0x18d
 acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0x1b3
 idle_loop() at netbsd:idle_loop+0x19a

 Lost in idle_loop?

 Extracts from ps:

 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 0       65 3   0       204   ffffbddd0279f200         mvsata0cnf biowait
 0       21 3   0       204   ffffbddd0248b540            atabus0 atath
 0    >   2 7   0       201   ffffbddf4b803040             idle/0

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 20 May 2019 12:34:51 +0100

 On Mon, May 20, 2019 at 11:00:01AM +0000, Patrick Welche wrote:
 >  An update - in the meantime jdolecek-ncq was merged, and I'm sure I saw
 >  something about DMA...

 Checking: the bit about DMA was in fact the patch in PR kern/54205, which
 AFAICT hasn't been committed. I just tried it, and it didn't help with
 this (different) bug.

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Thu, 17 Oct 2019 14:24:53 +0100

 Just tried a today's NetBSD-9.99.17/amd64 kernel with:

 include "arch/amd64/conf/GENERIC"
 options         DEBUG
 options         LOCKDEBUG
 options         MVSATA_DEBUG
 options         MVSATA_WITHOUTDMA

 (I think the withoutdma is not longer necessary) and on boot, get

 Mutex error: assert_sleepable,70: spin lock held

 lock address : 0xfffff91bd38a4020 type     :               spin
 initialized  : 0xffffffff80284dd9
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  1 last held:                  1
 current lwp  : 0xfffff91bd3934ac0 last held: 0xfffff91bd3934ac0
 last locked* : 0xffffffff809c910f unlocked : 0xffffffff809c9070
 owner field  : 0x0000000000010600 wait/spin:                0/1

 panic: LOCKDEBUG: Mutex error: assert_sleepable,70: spin lock held
 cpu1: Begin traceback...
 vpanic() at netbsd:vpanic+0x160
 snprintf() at netbsd:snprintf
 lockdebug_more() at netbsd:lockdebug_more
 assert_sleepable() at netbsd:assert_sleepable+0x3d
 pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x18e
 kmem_intr_alloc() at netbsd:kmem_intr_alloc+0x5b
 kmem_intr_zalloc() at netbsd:kmem_intr_zalloc+kmem_zalloc() at netbsd:kmem_zalloc+0x4a
 atabus_alloc_drives() at netbsd:atabus_alloc_drives+0x70
 sata_interpret_sig() at netbsd:sata_interpret_sig+0x1e5
 mvsata_probe_drive() at netbsd:mvsata_probe_drive+0x234
 atabusconfig() at netbsd:atabusconfig+0x65
 atabus_thread() at netbsd:atabus_thread+0x7e
 cpu1: End traceback...
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip 0xffffffff8021ddcd cs 0x8 rflags 0x202 cr2 0 ilevel 0x8 rsp 0xffffd5811bbfcce0
 curlwp 0xfffff91bd3934ac0 pid 0.52 lowest kstack 0xffffd5811bbf92c0
 0x11

 db{1}> show locks
 [Locks tracked through LWPs]
 Locks held by an LWP (ipmi0):
 Lock 0 (initialized at ipmi_attach)
 lock address : 0xfffff91e95a333c0 type     :     sleep/adaptive
 initialized  : 0xffffffff80a92655
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  1 last held:                  0
 current lwp  : 0xfffff91bd3934ac0 last held: 0xfffff91bd34f7600
 last locked* : 0xffffffff80a934f5 unlocked : 0xffffffff80a9354c
 owner field  : 0xfffff91bd34f7600 wait/spin:                0/0

 Turnstile chain at 0xffffffff816a0200.
 => No active turnstile for this lock.


 [Locks tracked through CPUs]
 Locks held on CPU 1:
 Lock 0 (initialized at main)
 lock address : 0xffffffff8169e800 type     :               spin
 initialized  : 0xffffffff80e3be69
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  1 last held:                  1
 current lwp  : 0xfffff91bd3934ac0 last held: 0xfffff91bd3934ac0
 last locked* : 0xffffffff809c60ac unlocked : 0xffffffff809c90dd
 curcpu holds :                  1 wanted by: 000000000000000000

 Lock 1 (initialized at ata_channel_init)
 lock address : 0xfffff91bd38a4020 type     :               spin
 initialized  : 0xffffffff80284dd9
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  1 last held:                  1
 current lwp  : 0xfffff91bd3934ac0 last held: 0xfffff91bd3934ac0
 last locked* : 0xffffffff809c910f unlocked : 0xffffffff809c9070
 	owner field  : 0x0000000000010600 wait/spin:                ine cpu 0 
 using CPU 0
 db{1}> trace
 _kernel_lock() at netbsd:_kernel_lock+0x1d7
 sleepq_block() at netbsd:sleepq_block+0x93
 kpause() at netbsd:kpause+0x11f
 sata_reset_interface() at netbsd:sata_reset_interface+0xa8
 mvsata_probe_drive() at netbsd:mvsata_probe_drive+0x36
 atabusconfig() at netbsd:atabusconfig+0x65
 atabus_thread() at netbsd:atabus_thread+0x7e
 db{1}> machine cpu 1
 using CPU 1
 db{1}> trace
 breakpoint() at netbsd:breakpoint+0x5
 vpanic() at netbsd:vpanic+0x160
 snprintf() at netbsd:snprintf
 lockdebug_more() at netbsd:lockdebug_more
 assert_sleepable() at netbsd:assert_sleepable+0x3d
 pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x18e
 kmem_intr_alloc() at netbsd:kmem_intr_alloc+0x5b
 kmem_intr_zalloc() at netbsd:kmem_intr_zalloc+0x11
 kmem_zalloc() at netbsd:kmem_zalloc+0x4a
 atabus_alloc_drives() at netbsd:atabus_alloc_drives+0x70
 sata_interpret_sig() at netbsd:sata_interpret_sig+0x1e5
 mvsata_probe_drive() at netbsd:mvsata_probe_drive+0x234
 atabusconfig() at netbsd:atabusconfig+0x65
 atabus_thread() at netbsd:atabus_thread+0x7e
 db{1}> ps
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 ...
 0    >  52 7   1       200   fffff91bd3934ac0           atabus12

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Tue, 22 Oct 2019 17:07:34 +0100

 The locking error has gone away - guessing this has to do with the ata
 insomniac fixes...

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 22 Oct 2019 16:36:49 +0000
State-Changed-Why:
Does this mean this PR can be closed?


From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Wed, 23 Oct 2019 11:33:49 +0100

 Just tried again with this morning's source and see the locking panic

 Kernel lock error: _kernel_lock,220: spinout

 lock address : 0xffffffff8169e800 type     :               spin
 initialized  : 0xffffffff80e3a8a9
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  3
 current cpu  :                  3 last held:                  1
 current lwp  : 0xffffd36185f81600 last held: 0xffffd3618633eac0
 last locked* : 0xffffffff809c58db unlocked : 0xffffffff8098aee2
 curcpu holds :                  0 wanted by: 0xffffd36185f81600

 panic: LOCKDEBUG: Kernel lock error: _kernel_lock,220: spinout
 cpu3: Begin traceback...
 vpanic() at netbsd:vpanic+0x178
 snprintf() at netbsd:snprintf
 lockdebug_more() at netbsd:lockdebug_more
 _kernel_lock() at netbsd:_kernel_lock+0x14a
 sleepq_block() at netbsd:sleepq_block+0x1f8
 cv_timedwait() at netbsd:cv_timedwait+0x130
 ipmi_thread() at netbsd:ipmi_thread+0x881
 cpu3: End traceback...
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip 0xffffffff8021ddcd cs 0x8 rflags 0x202 cr2 0 ilevel 0 rsp 0xffffab013d2ccd20
 curlwp 0xffffd36185f81600 pid 0.39 lowest kstack 0xffffab013d2c92c0
 	Stopped in pid 0.39 (system) at netbsd:breakpoint+0x5: how locks
 [Locks tracked through LWPs]
 Locks held by an LWP (npfgc-0):
 Lock 0 (initialized at npf_config_init)
 lock address : 0xffffd36186396cf8 type     :     sleep/adaptive
 initialized  : 0xffffffff8079aa4b
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  3 last held:                  2
 current lwp  : 0xffffd36185f81600 last held: 0xffffd361868d29a0
 last locked* : 0xffffffff8079aae7 unlocked : 0xffffffff807aa8db
 owner field  : 0xffffd361868d29a0 wait/spin:                0/0

 Turnstile chain at 0xffffffff816a0070.
 => No active turnstile for this lock.

 Locks held by an LWP (nd6_timer):
 Lock 0 (initialized at soinit)
 lock address : 0xffffd364c8274080 type     :     sleep/adaptive
 initialized  : 0xffffffff80a31cad
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  1
 current cpu  :                  3 last held:                  2
 current lwp  : 0xffffd36185f81600 last held: 0xffffd36186725940
 last locked* : 0xffffffff80760675 unlocked : 0xffffffff80733115
 owner field  : 0xffffd36186725940 wait/spin:                1/0

 Turnstile chain at 0xffffffff8169ff80.
 => Turnstile at 0xffffd364c79e6b98 (wrq=0xffffd364c79e6bb8, rdq=0xffffd364c79e6b
 c8).
 => 0 waiting readers:
 => 1 waiting writers: 0xffffd36185ef95a0

 Locks held by an LWP (swapper):
 Lock 0 (initialized at disk_init)
 lock address : 0xffffd361866ff5b0 type     :     sleep/adaptive
 initialized  : 0xffffffff809e3de9
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  3 last held:                  0
 current lwp  : 0xffffd36185f81600 last held: 0xffffffff8165cbe0
 last locked* : 0xffffffff80a7f7b6 unlocked : 0xffffffff80a7e49c
 owner field  : 0xffffffff8165cbe0 wait/spin:                0/0

 Turnstile chain at 0xffffffff816a01e0.
 => No active turnstile for this lock.


 [Locks tracked through CPUs]
 Locks held on CPU 1:
 Lock 0 (initialized at main)
 lock address : 0xffffffff8169e800 type     :               spin
 initialized  : 0xffffffff80e3a8a9
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  3
 current cpu  :                  3 last held:                  1
 current lwp  : 0xffffd36185f81600 last held: 0xffffd3618633eac0
 last locked* : 0xffffffff809c58db unlocked : 0xffffffff8098aee2
 curcpu holds :                  0 wanted by: 0xffffd36185f81600

 Lock 1 (initialized at ata_channel_init)
 lock address : 0xffffd361862ef020 type     :               spin
 initialized  : 0xffffffff8028428c
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  3 last held:                  1
 current lwp  : 0xffffd36185f81600 last held: 0xffffd3618633eac0
 last locked* : 0xffffffff802829f1 unlocked : 0xffffffff80645f69
 owner field  : 0x0000000000010600 wait/spin:                0/1



 so presumably a race, and I was lucky last time.

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Wed, 23 Oct 2019 11:45:10 +0100

 A bit more:

 db{3}> machine cpu 0
 using CPU 0
 db{3}> bt
 _kernel_lock() at netbsd:_kernel_lock+0x118
 sleepq_block() at netbsd:sleepq_block+0x1f8
 iic_smbus_intr_thread() at netbsd:iic_smbus_intr_thread+0x52
 db{3}> machine cpu 2
 using CPU 2
 db{3}> bt
 _kernel_lock() at netbsd:_kernel_lock+0xfd
 nd6_timer_work() at netbsd:nd6_timer_work+0x49
 workqueue_worker() at netbsd:workqueue_worker+0xe5
 db{3}> machine cpu 3
 using CPU 3
 db{3}> bt
 breakpoint() at netbsd:breakpoint+0x5
 vpanic() at netbsd:vpanic+0x178
 snprintf() at netbsd:snprintf
 lockdebug_more() at netbsd:lockdebug_more
 _kernel_lock() at netbsd:_kernel_lock+0x14a
 sleepq_block() at netbsd:sleepq_block+0x1f8
 cv_timedwait() at netbsd:cv_timedwait+0x130
 ipmi_thread() at netbsd:ipmi_thread+0x881
 db{3}> machine cpu 1 
 using CPU 1
 db{3}> bt
 lapic_delay() at netbsd:lapic_delay+0x6e
 __wdcwait() at netbsd:__wdcwait+0x91
 mvsata_bio_start() at netbsd:mvsata_bio_start+0xa9c
 ata_xfer_start() at netbsd:ata_xfer_start+0x4f
 atabus_thread() at netbsd:atabus_thread+0x279

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Thu, 24 Oct 2019 13:47:47 +0100

 (Still get locking panic with wdc.c 1.294)

State-Changed-From-To: feedback->open
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Thu, 24 Oct 2019 13:18:23 +0000
State-Changed-Why:
Thanks for feedback. Needs review and fix of locking protocol for calls at least


From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Thu, 31 Oct 2019 15:22:54 +0000

 Tinkering with today's -current/amd64, a probably not relevant

 Index: sys/dev/ic/wdc.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/wdc.c,v
 retrieving revision 1.294
 diff -u -p -r1.294 wdc.c
 --- sys/dev/ic/wdc.c    23 Oct 2019 14:44:41 -0000      1.294
 +++ sys/dev/ic/wdc.c    31 Oct 2019 15:17:20 -0000
 @@ -1210,6 +1210,7 @@ __wdcwait(struct ata_channel *chp, int m
         *tfd = 0;

         timeout = timeout * 1000 / WDCDELAY; /* delay uses microseconds */
 +       timeout *= 100; /* XXXPW */

         for (;;) {
                 status =

 and a

 include "arch/amd64/conf/GENERIC"
 options         DEBUG
 options         LOCKDEBUG
 options         ATADEBUG
 options         ATADEBUG_MASK=0x10
 options         MVSATA_DEBUG
 options         WDCNDELAY_DEBUG

 kernel, booting in uniprocessor mode, i.e., boot netbsd -1, I also hit
 the locking problem:

 panic: LOCKDEBUG: Mutex error: mutex_vector_enter,477: locking against myself

 db{0}> bt
 breakpoint() at netbsd:breakpoint+0x5
 vpanic() at netbsd:vpanic+0x178
 snprintf() at netbsd:snprintf
 lockdebug_more() at netbsd:lockdebug_more
 mutex_enter() at netbsd:mutex_enter+0x518
 ata_queue_get_active_xfer() at netbsd:ata_queue_get_active_xfer+0x16
 __wdcwait() at netbsd:__wdcwait+0x116
 wdcwait() at netbsd:wdcwait+0x99
 __wdccommand_intr() at netbsd:__wdccommand_intr+0xc4
 ata_xfer_start() at netbsd:ata_xfer_start+0x136
 atastart() at netbsd:atastart+0x20e
 wdc_exec_command() at netbsd:wdc_exec_command+0x9a
 ata_get_params() at netbsd:ata_get_params+0x11e
 wdc_drvprobe() at netbsd:wdc_drvprobe+0x3cb
 atabusconfig() at netbsd:atabusconfig+0x65
 atabus_thread() at netbsd:atabus_thread+0x81
 db{0}> show locks
 [Locks tracked through LWPs]
 Locks held by an LWP (ipmi0):
 Lock 0 (initialized at ipmi_attach)
 lock address : 0xfffffe226d5333c0 type     :     sleep/adaptive
 initialized  : 0xffffffff80a9285c
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  0 last held:                  0
 current lwp  : 0xfffffe1f2f053a40 last held: 0xfffffe226d53e540
 last locked* : 0xffffffff80a93498 unlocked : 000000000000000000
 owner field  : 0xfffffe226d53e540 wait/spin:                0/0

 Turnstile chain at 0xffffffff816a0480.
 => No active turnstile for this lock.


 [Locks tracked through CPUs]
 Locks held on CPU 0:
 Lock 0 (initialized at main)
 lock address : 0xffffffff8169ea80 type     :               spin
 initialized  : 0xffffffff80e3cca9
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  0
 current cpu  :                  0 last held:                  0
 current lwp  : 0xfffffe1f2f053a40 last held: 0xfffffe1f2f053a40
 last locked* : 0xffffffff80209781 unlocked : 0xffffffff809ca113
 curcpu holds :                  1 wanted by: 000000000000000000

 Lock 1 (initialized at ata_channel_init)
 lock address : 0xffffad801da5f1f0 type     :               spin
 initialized  : 0xffffffff8028424c
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  1
 current cpu  :                  0 last held:                  0
 current lwp  : 0xfffffe1f2f053a40 last held: 0xfffffe1f2f053a40
 last locked* : 0xffffffff8063b9d8 unlocked : 0xffffffff8028070c
 owner field  : 0x0000000000010600 wait/spin:                0/1

 db{0}> ps
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 0    >  40 7   0       200   fffffe1f2f053a40           atabus16
 0       21 3   0       200   fffffe226d53e540              ipmi0 ipmicmd

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 1 Nov 2019 17:59:50 +0000

 On Thu, Oct 31, 2019 at 03:25:01PM +0000, Patrick Welche wrote:
 >  include "arch/amd64/conf/GENERIC"
 >  options         DEBUG
 >  options         LOCKDEBUG
 >  options         ATADEBUG
 >  options         ATADEBUG_MASK=0x10
 >  options         MVSATA_DEBUG
 >  options         WDCNDELAY_DEBUG <--------------- !!!!!

 The locking issue is caused by WDCNDELAY_DEBUG.

 wdcwait() asserts that the channel is locked, and calls __wdcwait.
 Only ifdef WDCNDELAY_DEBUG does __wdcwait call ata_queue_get_active_xfer
 which tries to lock the already locked channel.

 https://nxr.netbsd.org/xref/src/sys/dev/ic/wdc.c#1243

 Removing WDCNDELAY_DEBUG removes the locking issue.

 Any idea where the status values in

   __wdcwait: timeout (time=1001), status d0 error 0 (mask 0x40 bits 0x40)
                                   ^^^^^^^^^

 are defined?

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 1 Nov 2019 18:44:03 -0000 (UTC)

 prlw1@cam.ac.uk (Patrick Welche) writes:

 > Any idea where the status values in
 >   __wdcwait: timeout (time=1001), status d0 error 0 (mask 0x40 bits 0x40)
 > 
 > are defined?

 That's the hex value of the status register. See sys/dev/ata/atareg.h.

 0xd0 = WDCS_BSY | WDCS_DRDY | WDCS_DSC;

 The timeout happens because the drive never cleared the busy flag,
 the code waits for (mask | WDCS_BSY) == bits.

 The names are ancient, in ATA speak WDCS_DSC is the DF (Device Fault) flag
 that says that a command couldn't be completed (in contrast to a command
 that is completed with or without an error). There are special modes
 where the DF flag and others get repurposed to signal other conditions,
 but that's not relevant here.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: "Patrick Welche" <prlw1@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52126 CVS commit: src/sys/dev/ic
Date: Mon, 4 Nov 2019 15:12:10 +0000

 Module Name:	src
 Committed By:	prlw1
 Date:		Mon Nov  4 15:12:10 UTC 2019

 Modified Files:
 	src/sys/dev/ic: wdc.c

 Log Message:
 Fix locking bug seen when using WDCNDELAY_DEBUG in PR kern/52126
 (Observed
   piixide0:0:0: warning: busy-wait took 200us
 with no panic)


 To generate a diff of this commit:
 cvs rdiff -u -r1.294 -r1.295 src/sys/dev/ic/wdc.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 4 Nov 2019 16:02:38 +0000

 Next locking issue:

 mvsata1:4: mvsata_bio: drive=0, blkno=0, bcount=512
 mvsata1:4: mvsata_bio_start: drive=0
 mvsata1:4: mvsata_bio_start: drive=0
 Kernel lock error: _kernel_lock,220: spinout

 lock address : 0xffffffff8169eb80 type     :               spin
 initialized  : 0xffffffff80e3dc09
 shared holds :                  0 exclusive:                  1
 shares wanted:                  0 exclusive:                  2
 current cpu  :                  0 last held:                  2
 current lwp  : 0xffffcaa088781600 last held: 0xffffcaa088b3eac0
 last locked* : 0xffffffff809c7c9b unlocked : 0xffffffff8098d2a2
 curcpu holds :                  0 wanted by: 0xffffcaa088781600

 panic: LOCKDEBUG: Kernel lock error: _kernel_lock,220: spinout
 cpu0: Begin traceback...
 vpanic() at netbsd:vpanic+0x178
 snprintf() at netbsd:snprintf
 lockdebug_more() at netbsd:lockdebug_more
 _kernel_lock() at netbsd:_kernel_lock+0x14a
 sleepq_block() at netbsd:sleepq_block+0x1f8
 cv_timedwait() at netbsd:cv_timedwait+0x130
 ipmi_thread() at netbsd:ipmi_thread+0x881
 cpu0: End traceback...
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip 0xffffffff8021ddcd cs 0x8 rflags 0x202 cr2 0 ilevel 0 rsp 0xffffde813d2ccd20
 curlwp 0xffffcaa088781600 pid 0.39 lowest kstack 0xffffde813d2c92c0
 Stopped in pid 0.39 (system) at netbsd:breakpoint+0x5:  leave

 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 0    >  39 7   0       200   ffffcaa088781600              ipmi0
 0    >  52 7   2       e00   ffffcaa088b3eac0           

 db{0}> bt
 breakpoint() at netbsd:breakpoint+0x5
 vpanic() at netbsd:vpanic+0x178
 snprintf() at netbsd:snprintf
 lockdebug_more() at netbsd:lockdebug_more
 _kernel_lock() at netbsd:_kernel_lock+0x14a
 sleepq_block() at netbsd:sleepq_block+0x1f8
 cv_timedwait() at netbsd:cv_timedwait+0x130
 ipmi_thread() at netbsd:ipmi_thread+0x881
 db{0}> machine cpu 1
 using CPU 1
 db{0}> bt
 x86_stihlt() at netbsd:x86_stihlt+0x6
 acpicpu_cstate_idle_enter() at netbsd:acpicpu_cstate_idle_enter+0xd1
 acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xba
 idle_loop() at netbsd:idle_loop+0x14b
 cpu_hatch() at netbsd:cpu_hatch+0x17f
 db{0}> machine cpu 2 
 using CPU 2
 db{0}> bt
 lapic_delay() at netbsd:lapic_delay+0x6e
 __wdcwait() at netbsd:__wdcwait+0x99
 mvsata_bio_start() at netbsd:mvsata_bio_start+0xa9c
 ata_xfer_start() at netbsd:ata_xfer_start+0x4f
 atabus_thread() at netbsd:atabus_thread+0x279
 db{0}> machine cpu 3
 using CPU 3
 db{0}> bt
 ld_rbto_compare_key() at netbsd:ld_rbto_compare_key+0x4
 lockdebug_abort() at netbsd:lockdebug_abort+0x5a
 _kernel_lock() at netbsd:_kernel_lock+0x14a
 ip_slowtimo() at netbsd:ip_slowtimo+0x1a
 pfslowtimo() at netbsd:pfslowtimo+0x47
 callout_softclock() at netbsd:callout_softclock+0x121
 softint_dispatch() at netbsd:softint_dispatch+0xf2
 DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xffffde813d2b60f0
 Xsoftintr() at netbsd:Xsoftintr+0x4f
 --- interrupt ---
 0:

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 4 Nov 2019 16:21:04 +0000

 (no locking issue if boot netbsd -c, and disable mvsata)

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 20 Dec 2019 17:01:50 +0000

 Just for the record, the run up to the spinout is:

 ata_exec_xfer 0xfffff2bbff940038 channel 1 drive 0
 atastart from ata_exec_xfer, flags 0x800
 ata_queue_alloc_slot: channel 1 qavail 0xffffffff qact 0atastart(chp=0xffffb0001edffa20): xfer 0xfffff2bbff940038 channel 1 drive 0
 atastart(chp=0xffffb0001edffa20): channel 1 queue_xfer is empty
 atastart(chp=0xffffb0001edffa20): channel 1 queue_xfer is empty
 mvsata1:4: mvsata_bio: drive=0, blkno=0, bcount=512
 ata_exec_xfer 0xfffff2bbff940038 channel 4 drive 0
 atastart from ata_exec_xfer, flags 0x0
 ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0atastart(chp=0xfffff2bbff46e010): xfer 0xfffff2bbff940038 channel 4 drive 0
 mvsata1:4: mvsata_bio_start: drive=0
 ata_channel_freeze_locked(chp=0xfffff2bbff46e010) -> 1
 ata_channel_thaw_locked(chp=0xfffff2bbff46e010) -> 0
 mvsata1:4: mvsata_bio_start: drive=0
 Kernel lock error: _kernel_lock,222: spinout

From: Frank Kardel <kardel@kardel.name>
To: gnats-bugs@netbsd.org, jdolecek@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org, prlw1@cam.ac.uk
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 20 Dec 2019 23:08:33 +0100

 The _kernel_lock() code looks very suspicious.

 spinouts on kernel_lock only happen while the root is not mounted 
 (start_init_exec == 0, see init_main.c):

     215          spins = 0;
     216          do {
     217                  splx(s);
     218                  while (__SIMPLELOCK_LOCKED_P(kernel_lock)) {
     219                          if (SPINLOCK_SPINOUT(spins)) {
     220                                  extern int start_init_exec;
     221                                  if 
 (!start_init_exec)                      <===+ looks fishy to me
     222 _KERNEL_LOCK_ABORT("spinout");   <===+
     223                          }
     224                          SPINLOCK_BACKOFF_HOOK;
     225                          SPINLOCK_SPIN_HOOK;
     226                  }
     227                  s = splvm();
     228          } while (!__cpu_simple_lock_try(kernel_lock));
     229
     230          ci->ci_biglock_count = nlocks;

 I have seen spinout on my machine when booting via EFI and kernel 
 messages slowly crawling up the screen

 causing kernel lock contention while initializing the system. After the 
 flag is set, the spinout logic is ineffective.

 I am missing to see the reasoning here to panic at all during startup 
 and disable the panic after initialization.

 IMHO the panic should be removed as it can be randomly triggered and 
 prohibit boot. And the code serves no purpose

 after mounting the root-fs.

 Frank



 On 12/20/19 18:05, Patrick Welche wrote:
 > The following reply was made to PR kern/52126; it has been noted by GNATS.
 >
 > From: Patrick Welche <prlw1@cam.ac.uk>
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
 > Date: Fri, 20 Dec 2019 17:01:50 +0000
 >
 >   Just for the record, the run up to the spinout is:
 >   
 >   ata_exec_xfer 0xfffff2bbff940038 channel 1 drive 0
 >   atastart from ata_exec_xfer, flags 0x800
 >   ata_queue_alloc_slot: channel 1 qavail 0xffffffff qact 0atastart(chp=0xffffb0001edffa20): xfer 0xfffff2bbff940038 channel 1 drive 0
 >   atastart(chp=0xffffb0001edffa20): channel 1 queue_xfer is empty
 >   atastart(chp=0xffffb0001edffa20): channel 1 queue_xfer is empty
 >   mvsata1:4: mvsata_bio: drive=0, blkno=0, bcount=512
 >   ata_exec_xfer 0xfffff2bbff940038 channel 4 drive 0
 >   atastart from ata_exec_xfer, flags 0x0
 >   ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0atastart(chp=0xfffff2bbff46e010): xfer 0xfffff2bbff940038 channel 4 drive 0
 >   mvsata1:4: mvsata_bio_start: drive=0
 >   ata_channel_freeze_locked(chp=0xfffff2bbff46e010) -> 1
 >   ata_channel_thaw_locked(chp=0xfffff2bbff46e010) -> 0
 >   mvsata1:4: mvsata_bio_start: drive=0
 >   Kernel lock error: _kernel_lock,222: spinout
 >   

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 20 Dec 2019 22:34:41 -0000 (UTC)

 kardel@kardel.name (Frank Kardel) writes:

 >The _kernel_lock() code looks very suspicious.

 >spinouts on kernel_lock only happen while the root is not mounted 
 >(start_init_exec == 0, see init_main.c):

 start_init_exec is not related to the root mount but is set before the
 first user code is executed.

 The spinout check is a crude deadlock detection. It's not reasonable
 to do that for user code that is waiting to enter the kernel.

 If you want to remove the panic you may want to replace it with a (rate
 limited) console message.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 07:56:19 +0100

 I was just referring to place where the flag is set, not that it is 
 related to mounting the root although attempting to exec init implies 
 the availability of the root.

 Crude indeed it is, es. It panics only during kernel startup when no 
 user code can be running,

 Maybe I am missing omething, but the prerequisite condition for panic is 
 "!start_init_exec". Maybe the test was intended to be inverted so that 
 the crude

 deadlock check is enabled while user level is running. Right now it only 
 checks during the initialization phase *before*

 user-level is running and in some circumstance even prohibits a 
 successful boot (slow raster console+debug+verbose boot).

 Frank

 On 12/20/19 23:40, Michael van Elst wrote:
 > The following reply was made to PR kern/52126; it has been noted by GNATS.
 >
 > From: mlelstv@serpens.de (Michael van Elst)
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
 > Date: Fri, 20 Dec 2019 22:34:41 -0000 (UTC)
 >
 >   kardel@kardel.name (Frank Kardel) writes:
 >   
 >   >The _kernel_lock() code looks very suspicious.
 >   
 >   >spinouts on kernel_lock only happen while the root is not mounted
 >   >(start_init_exec == 0, see init_main.c):
 >   
 >   start_init_exec is not related to the root mount but is set before the
 >   first user code is executed.
 >   
 >   The spinout check is a crude deadlock detection. It's not reasonable
 >   to do that for user code that is waiting to enter the kernel.
 >   
 >   If you want to remove the panic you may want to replace it with a (rate
 >   limited) console message.
 >   
 >   --
 >   --
 >                                   Michael van Elst
 >   Internet: mlelstv@serpens.de
 >                                   "A potential Snark may lurk in every tree."
 >   

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 07:47:12 -0000 (UTC)

 kardel@netbsd.org (Frank Kardel) writes:

 > Maybe I am missing omething, but the prerequisite condition for panic is 
 > "!start_init_exec". Maybe the test was intended to be inverted so that 
 > the crude
 > deadlock check is enabled while user level is running. Right now it only 
 > checks during the initialization phase *before*

 Excessive waiting for the kernel lock is always a problem, functional
 (i.e. delayed interrupt processing and then lost interrupts) or just
 performance. But after bootstrap a panic is probably too much, anything
 before can be handled by just fixing the code that holds the kernel
 lock for too long.

 > user-level is running and in some circumstance even prohibits a 
 > successful boot (slow raster console+debug+verbose boot).

 This just shows that slow raster consoles (kernel messages, not tty) have
 their limitations. Still a good thing to get this information because
 it points to a design problem (kernel lock held too long) usually somewhere
 else. E.g. sdmmc(4) now sleeps in some places instead of spinning in delay()
 because of this.

 There are places in our ata code (and probably mvsata too) that need
 similar attention.

 -- 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 11:05:55 +0100

 On 12/21/19 08:50, Michael van Elst wrote:
 > The following reply was made to PR kern/52126; it has been noted by GNATS.
 >
 > From: mlelstv@serpens.de (Michael van Elst)
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
 > Date: Sat, 21 Dec 2019 07:47:12 -0000 (UTC)
 >
 >   kardel@netbsd.org (Frank Kardel) writes:
 >   
 >   > Maybe I am missing omething, but the prerequisite condition for panic is
 >   > "!start_init_exec". Maybe the test was intended to be inverted so that
 >   > the crude
 >   > deadlock check is enabled while user level is running. Right now it only
 >   > checks during the initialization phase *before*
 >   
 >   Excessive waiting for the kernel lock is always a problem, functional
 >   (i.e. delayed interrupt processing and then lost interrupts) or just
 >   performance.
 I am not questioning that.Ideally we can manage to make the kernel lock 
 disappear some time.
 >   But after bootstrap a panic is probably too much, anything
 >   before can be handled by just fixing the code that holds the kernel
 >   lock for too long.
 During bootstrap a panic is even prohibitive. The long holding may not 
 even be caused by wrong code but
 by debug checks and or kernel output to a slow console and on (larger) 
 MP systems even caused by starvation due to multiple
 threads doing output to slow consoles. Especially in debug mode we have 
 even more unpredictable timing (memory size/resource dependent).
 So we are looking here at performance bugs and delays related to 
 starvation and consistency check code.
 My boot panics where caused when chasing the EFI device initialization 
 failures (which are almost gone - need to re-check).
 The kernel panicked all over the place but never reached user mode 
 because of that. The workaround was/is to disable DEBUG and pciverbose 
 or the spinout code.

 As the spinout code sets an absolute retry barrier any printf/debug 
 analysis within a kernel_lock section is risky. So while trying to catch 
 'long locks' we may be tripping also over false positives due to slow 
 output/bad design/starvation. I have no idea what the signal/noise ratio 
 is here.
 >   
 >   > user-level is running and in some circumstance even prohibits a
 >   > successful boot (slow raster console+debug+verbose boot).
 >   
 >   This just shows that slow raster consoles (kernel messages, not tty) have
 >   their limitations. Still a good thing to get this information because
 >   it points to a design problem (kernel lock held too long)
 Starvation may also be a cause for that. We could add a lock counter to 
 reset the spin backoff to avoid triggering
 on high contention. But this looks still like a bandaid.
 >   usually somewhere
 >   else. E.g. sdmmc(4) now sleeps in some places instead of spinning in delay()
 >   because of this.
 There are certainly many places where kernel_lock is held needlessly 
 long. Which need to be addressed while
 working on removing the big lock.
 >   
 >   There are places in our ata code (and probably mvsata too) that need
 >   similar attention.
 Definitely. This bug looks like there is some unexpected interaction 
 going on with the device. Also ipmi printfs are often a victim here 
 going by the bug report. Would be nice to get a stack trace or at least 
 the location where the lock way taken of the kernel lock owner at this time.

 But I still wonder whether this panic is a good idea due to signal/noise 
 ratio especially as is is active only during boot.

 Frank

From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: Patrick Welche <prlw1@cam.ac.uk>
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 23:06:11 +0100

 --0000000000003786cf059a3e0164
 Content-Type: text/plain; charset="UTF-8"
 Content-Transfer-Encoding: quoted-printable

 Le ven. 20 d=C3=A9c. 2019 =C3=A0 18:05, Patrick Welche <prlw1@cam.ac.uk> a =
 =C3=A9crit :

 >  mvsata1:4: mvsata_bio_start: drive=3D0
 >  Kernel lock error: _kernel_lock,222: spinout
 >

 Can you please modify mvsata_bio_start() to log the wait_flags? It seems
 the code path eventually triggers delay() calls, I want to confirm it's
 doing polled I/O there.

 Jaromir

 --0000000000003786cf059a3e0164
 Content-Type: text/html; charset="UTF-8"
 Content-Transfer-Encoding: quoted-printable

 <div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote">=
 <div dir=3D"ltr" class=3D"gmail_attr">Le=C2=A0ven. 20 d=C3=A9c. 2019 =C3=A0=
 =C2=A018:05, Patrick Welche &lt;<a href=3D"mailto:prlw1@cam.ac.uk">prlw1@ca=
 m.ac.uk</a>&gt; a =C3=A9crit=C2=A0:</div><blockquote class=3D"gmail_quote" =
 style=3D"margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:s=
 olid;border-left-color:rgb(204,204,204);padding-left:1ex">
 =C2=A0mvsata1:4: mvsata_bio_start: drive=3D0<br>
 =C2=A0Kernel lock error: _kernel_lock,222: spinout<br></blockquote><div><br=
 ></div><div>Can you please modify mvsata_bio_start() to log the wait_flags?=
  It seems the code path eventually triggers delay() calls, I want to confir=
 m it&#39;s doing polled I/O there.</div><div><br></div><div>Jaromir=C2=A0</=
 div></div></div>

 --0000000000003786cf059a3e0164--

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/52126 CVS commit: src/sys/dev/ic
Date: Sun, 22 Dec 2019 20:54:00 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sun Dec 22 20:54:00 UTC 2019

 Modified Files:
 	src/sys/dev/ic: mvsata.c

 Log Message:
 remove check for idle status when disabling EDMA, and always toggle
 the disable regardless of current state - this particularly seems
 to fail during error recovery, and on my system this also fails
 during regular boot

 this matches both FreeBSD and Linux drivers - neither of those checks
 the idle status

 should help with PR kern/52419 and maybe also the lock spinout part
 of PR kern/52126


 To generate a diff of this commit:
 cvs rdiff -u -r1.49 -r1.50 src/sys/dev/ic/mvsata.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 22 Dec 2019 21:13:21 +0000
State-Changed-Why:
Does this still happen with rev. 1.50 of sys/dev/ic/mvsata.c?


From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 23 Dec 2019 12:06:21 +0000

 From Frank & Michael's comments it sounds as though the spinout is a
 secondary symptom of the "lost interrupt"s? This is with mvsata.c v1.51

 - example of early lost interrupt:

 [   3.6618488] ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0atastart(chp=0xffff893d7986e010): xfer 0xffff893d79c2c038 channel 4 drive 0
 [   3.6618488] mvsata1:4: mvsata_bio_start: drive=0 wait_flags=0
 [   3.6618488] wdccommand mvsata1:4:0: command=0xc4 cylin=0 head=64 sector=0 count=32 features=0
 [   3.6618488] atastart(chp=0xffff893d7986e010): channel 4 queue_xfer is empty
 [  13.7050587] wdctimeout
 mvsata1:4:0: lost interrupt
 [  13.7050587]  type: ata tc_bcount: 16384 tc_skip: 0


 - the spinout with wait_flags (apparently not polling):

 [ 135.6760828] ata_exec_xfer 0xffff893d79c2c038 channel 1 drive 0
 [ 135.6760828] atastart from ata_exec_xfer, flags 0x800
 [ 135.6760828] ata_queue_alloc_slot: channel 1 qavail 0xffffffff qact 0atastart(chp=0xffffa0801edffa20): xfer 0xffff893d79c2c038 channel 1 drive 0
 [ 135.6760828] atastart(chp=0xffffa0801edffa20): channel 1 queue_xfer is empty
 [ 135.7134179] atastart(chp=0xffffa0801edffa20): channel 1 queue_xfer is empty
 [ 135.7134179] mvsata1:4: mvsata_bio: drive=0, blkno=0, bcount=512
 [ 135.7134179] ata_exec_xfer 0xffff893d79c2c038 channel 4 drive 0
 [ 135.7134179] atastart from ata_exec_xfer, flags 0x0
 [ 135.7134179] ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0atastart(chp=0xffff893d7986e010): xfer 0xffff893d79c2c038 channel 4 drive 0
 [ 135.7134179] mvsata1:4: mvsata_bio_start: drive=0 wait_flags=0
 [ 135.7134179] ata_channel_freeze_locked(chp=0xffff893d7986e010) -> 1
 [ 135.7633699] ata_channel_thaw_locked(chp=0xffff893d7986e010) -> 0
 [ 135.7634379] mvsata1:4: mvsata_bio_start: drive=0 wait_flags=0
 [ 137.5141317] Kernel lock error: _kernel_lock,222: spinout

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 23 Dec 2019 12:41:01 +0000

 Not all those ata commands are necessarily for wd0 on mvsata - edited recap:

 [   1.0429517] mvsata0 at pci5 dev 1 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
 [   1.0429517] mvsata0: interrupting at ioapic1 pin 1
 [   1.0429517] mvsata0: GenII, 2hc, 4port/hc
 [   1.0429517] atabus0 at mvsata0 channel 0
 [   1.0429517] atabus1 at mvsata0 channel 1
 [   1.0429517] atabus2 at mvsata0 channel 2
 [   1.0429517] atabus3 at mvsata0 channel 3
 [   1.0429517] atabus4 at mvsata0 channel 4
 [   1.0429517] atabus5 at mvsata0 channel 5
 [   1.0429517] atabus6 at mvsata0 channel 6
 [   1.0429517] atabus7 at mvsata0 channel 7
 [   1.0429517] mvsata1 at pci5 dev 2 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
 [   1.0429517] mvsata1: interrupting at ioapic1 pin 2
 [   1.0429517] mvsata1: GenII, 2hc, 4port/hc
 [   1.0429517] atabus8 at mvsata1 channel 0
 [   1.0429517] atabus9 at mvsata1 channel 1
 [   1.0429517] atabus10 at mvsata1 channel 2
 [   1.0429517] atabus11 at mvsata1 channel 3
 [   1.0429517] atabus12 at mvsata1 channel 4
 [   1.0429517] atabus13 at mvsata1 channel 5
 [   1.0429517] atabus14 at mvsata1 channel 6
 [   1.0429517] atabus15 at mvsata1 channel 7
 [  55.8441307] mvsata1: channel 4: drive 0 not ready, st=0xd0, err=0x00
 [  55.8551652] wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 4
 [can't find piixide part which is where wd1 and wd2 are attached as a working
  raidframe mirror]
 [ 132.4877096] atapibus0 at atabus16: 2 targets
 [ 132.6211624] cd0 at atapibus0 drive 0: <TSSTcorpCD/DVDW SH-S182D, , SB04> cdrom removable
 [ 132.8502886] wd1 at atabus18 drive 0
 [ 133.3554994] wd2 at atabus19 drive 0

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 4 Sep 2020 16:07:12 +0100

 Retrying with NetBSD-9.99.72/amd64, I now see no panics nor lost interrupts,
 but simple a hang on boot. Sending break down the serial line doesn't
 get me into ddb. Apart from GENERIC, I also tried:

   include "arch/amd64/conf/GENERIC"
   options         DEBUG
   options         LOCKDEBUG
   options         ATADEBUG
   options         ATADEBUG_MASK=0x58
   options         MVSATA_DEBUG
   options         WDCNDELAY_DEBUG=50
   #options        MVSATA_WITHOUTDMA

 highlights of which are:

 [   1.0405006] ppb4 at pci1 dev 0 function 3: Intel 63xxESB PCI Express to PCI-X Bridge (rev. 0x01)
 [   1.0405006] ppb4: PCI Express capability version 1 <PCI-E to PCI/PCI-X Bridge>
 [   1.0405006] pci5 at ppb4 bus 5
 [   1.0405006] mvsata0 at pci5 dev 1 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
 [   1.0405006] mvsata0: interrupting at ioapic1 pin 1
 [   1.0405006] mvsata0: GenII, 2hc, 4port/hc
 [   1.0405006] atabus0 at mvsata0 channel 0
 [   1.0405006] atabus1 at mvsata0 channel 1
 [   1.0405006] atabus2 at mvsata0 channel 2
 [   1.0405006] atabus3 at mvsata0 channel 3
 [   1.0405006] atabus4 at mvsata0 channel 4
 [   1.0405006] atabus5 at mvsata0 channel 5
 [   1.0405006] atabus6 at mvsata0 channel 6
 [   1.0405006] atabus7 at mvsata0 channel 7
 [   1.0405006] mvsata1 at pci5 dev 2 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
 [   1.0405006] mvsata1: interrupting at ioapic1 pin 2
 [   1.0405006] mvsata1: GenII, 2hc, 4port/hc
 [   1.0405006] atabus8 at mvsata1 channel 0
 [   1.0405006] atabus9 at mvsata1 channel 1
 [   1.0405006] atabus10 at mvsata1 channel 2
 [   1.0405006] atabus11 at mvsata1 channel 3
 [   1.0405006] atabus12 at mvsata1 channel 4
 [   1.0405006] atabus13 at mvsata1 channel 5
 [   1.0405006] atabus14 at mvsata1 channel 6
 [   1.0405006] atabus15 at mvsata1 channel 7
 ...
 [   1.0405006] piixide0 at pci0 dev 31 function 1: Intel 631xESB/632xESB IDE Controller (rev. 0x09)
 [   1.0405006] piixide0: primary channel interrupting at ioapic0 pin 14
 [   1.0405006] atabus16 at piixide0 channel 0
 [   1.0405006] piixide0: secondary channel interrupting at ioapic0 pin 15
 [   1.0405006] atabus17 at piixide0 channel 1
 [   1.0405006] ahcisata0 at pci0 dev 31 function 2: Intel 63xxESB AHCI Serial ATA Controller (rev. 0x09)
 [   1.0405006] ahcisata0: AHCI revision 1.10, 6 ports, 32 slots, CAP 0xc6227f05<PSC,SSC,SPM,ISS=0x2=Gen2,SAL,SALP,SNCQ,S64A>
 [   1.0405006] ahcisata0: interrupting at ioapic0 pin 20
 [   1.0405006] atabus18 at ahcisata0 channel 0
 [   1.0405006] atabus19 at ahcisata0 channel 1
 [   1.0405006] atabus20 at ahcisata0 channel 2
 [   1.0405006] atabus21 at ahcisata0 channel 3
 [   1.0405006] atabus22 at ahcisata0 channel 4
 [   1.0405006] atabus23 at ahcisata0 channel 5
 ...
 [   2.6746267] wdcprobe1:0: before reset, st0=0x50, st1=0x1
 [   2.6746267] wdcprobe1:0 drive 1 wd_cyl_lo: got 0x0 != 0x02
 [   2.6746267] wdcprobe1:0 drive 1 wd_cyl_lo: got 0x0 != 0x01
 [   2.6746267] wdcprobe1:0 drive 1 wd_sector: got 0x41 != 0x01
 [   2.6746267] wdcprobe1:0 drive 1 wd_sector: got 0x42 != 0x02
 [   2.6746267] wdcprobe1:0 drive 1 wd_cyl_lo(2): got 0x0 != 0x01
 ...
 [   2.7646198] wdcprobe1:1: before reset, st0=0x7f, st1=0x7f
 [   2.7646198] wdcprobe1:1 drive 0 wd_cyl_lo: got 0x7f != 0x02
 [   2.7646198] wdcprobe1:1 drive 0 wd_cyl_lo: got 0x7f != 0x01
 [   2.7646198] wdcprobe1:1 drive 0 wd_sector: got 0x7f != 0x01
 [   2.7646198] wdcprobe1:1 drive 0 wd_sector: got 0x7f != 0x02
 [   2.7646198] wdcprobe1:1 drive 0 wd_cyl_lo(2): got 0x7f != 0x01
 [   2.7646198] wdcprobe1:1 drive 1 wd_cyl_lo: got 0x7f != 0x02
 [   2.7646198] wdcprobe1:1 drive 1 wd_cyl_lo: got 0x7f != 0x01
 [   2.7646198] wdcprobe1:1 drive 1 wd_sector: got 0x7f != 0x01
 [   2.7646198] wdcprobe1:1 drive 1 wd_sector: got 0x7f != 0x02
 [   2.7646198] wdcprobe1:1 drive 1 wd_cyl_lo(2): got 0x7f != 0x01
 [   2.8259235] ata_queue_alloc_slot: channel 0 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 1 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 2 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 3 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 5 qavail 0xffffffff qact 0piixide0:0:0: after reset, sc=0x1 sn=0x1 cl=0x14 ch=0xeb
 [   2.8646197] piixide0:0:1: after reset, sc=0x0 sn=0x0 cl=0x0 ch=0x0
 [   2.8746188] piixide0:0: wdcwait_reset() end, st0=0x10 st1=0x0
 [   2.8846202] wdcprobe1:0: after reset, ret_value=0x1
 [   2.8846202] wdcprobe1:0:0: after reset, sc=0x1 sn=0x1 cl=0x14 ch=0xeb
 [   2.8946223] piixide0:0: wait DRDY st0 0x0 st1 0x0
 [   2.9046198] ata_get_params
 [   2.9046198] wdc_exec_command piixide0:0:0
 [   2.9046198] ata_exec_xfer 0xffffd69369140038 channel 0 drive 0
 [   2.9146204] atastart from ata_exec_xfer, flags 0x0
 [   2.9146204] ata_queue_alloc_slot: channel 0 qavail 0x1 qact 0atastart(chp=0xffffdb801f8421e0): xfer 0xffffd69369140038 channel 0 drive 0
 [   2.9346199] __wdccommand_start piixidead=160 sector=0 count=0 features=0
 [   2.9446209] __wdcwait: timeout (time=11), status 90 error 1 (mask 0x8 bits 0x8)
 [   2.9546198] __wdccommand_done piixide0:0:0 flags 0x59
 [   2.9546198] atastart(chp=0xffffdb801f8421e0): channel 0 queue_xfer is empty
 [6182] ahcisata0 port 0: device present, speed: 1.5Gb/s
 [   3.9846181] wd0 at atabus12 drive 0
 [   3.9846181] ata_get_params
 [   3.9846181] ata_exec_xfer 0xffffd69369140038 channel 4 drive 0
 [   3.9946226] atastart from ata_exec_xfer, flags 0x0
 [   3.99462_xfer is empty
 [   4.0246200] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [   4.0377748] wd0: <ST3500630AS>
 [   4.0377748] wd0: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
 [   4.0446210] ata_get_params
 [   4hannel 4 drive 0
 [   4.0746204] wdccommand mvsata1:4:0: command=0xec cylin=0 head=0 sector=0 count=0 features=0
 [   4.0746204] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [   4.0846209] atastart(chp=0xffffd6902bc37008): channel 4 queue channel 4 drive 0
 [   4.1146206] ata_channel_freeze_locked(chp=0xffffd6902bc37008) -> 1
 [   4.1246206] ata_channel_thaw_locked(chp=0xffffd6902bc37008) -> 0
 [   4.1246206] wdccommandshort mvsata1:4:0 command 0x10
 [   4.1246206] __wdcwait: did busy-wait, ts=0
 [   4.1636341] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [   4.1705662] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [   4.1775598] ata_exec_xfer 0xffffd69369140038 channel 4 drive 0
 [   4.1775598] atastart frotor=0 count=3 features=0
 [   4.1881314] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [   4.2162904] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [   4.2162904] ata_exec_xfer 0xffffd69369140038 channel 4 drive 0
 [   4. sector=64 count=1 features=0
 [   4.2349051] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [   4.2631482] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [   4.2701319] ata_exec_xfer 0xffffd69369140038 channel 4 drive 0
 [and=0xc4 cylin=0 head=64 sector=0 count=32 features=0
 [   4.2759362] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
 [  25.8046209] ipmi0: ID 32.1 IPMI 2.0 Available
 hang and no ddb

State-Changed-From-To: feedback->open
State-Changed-By: prlw1@NetBSD.org
State-Changed-When: Sun, 20 Sep 2020 12:57:41 +0000
State-Changed-Why:
feedback given


Responsible-Changed-From-To: jdolecek->kern-bug-people
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Wed, 05 May 2021 19:16:31 +0000
Responsible-Changed-Why:
I won't work on mvsata(4) anymore.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.