NetBSD Problem Report #52126
From www@NetBSD.org Fri Mar 31 15:30:43 2017
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 67E107A0F3
for <gnats-bugs@gnats.NetBSD.org>; Fri, 31 Mar 2017 15:30:43 +0000 (UTC)
Message-Id: <20170331153042.17C0D7A2AC@mollari.NetBSD.org>
Date: Fri, 31 Mar 2017 15:30:42 +0000 (UTC)
From: prlw1@cam.ac.uk
Reply-To: prlw1@cam.ac.uk
To: gnats-bugs@NetBSD.org
Subject: mvsata Marvell 88SX6081 panics on boot
X-Send-Pr-Version: www-1.0
>Number: 52126
>Category: kern
>Synopsis: mvsata Marvell 88SX6081 panics on boot
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Mar 31 15:35:00 +0000 2017
>Closed-Date:
>Last-Modified: Wed May 05 19:16:31 +0000 2021
>Originator: Patrick Welche
>Release: NetBSD-7.99.66/amd64
>Organization:
>Environment:
>Description:
First reported in
http://mail-index.netbsd.org/current-users/2017/02/20/msg031172.html
quoting:
mvsata0 at pci5 dev 1 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
mvsata0: interrupting at ioapic1 pin 1
mvsata0: GenII, 2hc, 4port/hc
mvsata0:0:0: unable to stop EDMA
panic: EDMA enabled
but it took me a while to realise that kgdb works much better if you
boot -1
(gdb) bt
#0 0xffffffff80215455 in breakpoint ()
#1 0xffffffff8082c0c8 in vpanic (
fmt=fmt@entry=0xffffffff80edac77 "EDMA enabled",
ap=ap@entry=0xffffffff81645818) at ../../../../kern/subr_prf.c:340
#2 0xffffffff8082c185 in panic (
fmt=fmt@entry=0xffffffff80edac77 "EDMA enabled")
at ../../../../kern/subr_prf.c:258
#3 0xffffffff804b6d21 in mvsata_pmp_select (mvport=0xfffffe811d78d810,
pmpport=0) at ../../../../dev/ic/mvsata.c:241
#4 0xffffffff804b80e1 in mvsata_wdc_cmd_start (chp=0xfffffe811d78d810,
xfer=0xfffffe811dc60f60) at ../../../../dev/ic/mvsata.c:1586
#5 0xffffffff804b7100 in mvsata_exec_command (drvp=<optimized out>,
ata_c=0xffffffff81645978) at ../../../../dev/ic/mvsata.c:677
#6 0xffffffff8026e356 in ata_get_params (drvp=0xfffffe811dcb2e80,
flags=<optimized out>, prms=0xfffffe811dcb358c)
at ../../../../dev/ata/ata.c:762
#7 0xffffffff8026a5a1 in wd_get_params (wd=0xfffffe811dcb3008,
flags=<optimized out>, params=0xfffffe811dcb358c)
at ../../../../dev/ata/wd.c:1776
#8 0xffffffff8026ae08 in wdopen (dev=<optimized out>, flag=<optimized out>,
fmt=24576, l=<optimized out>) at ../../../../dev/ata/wd.c:990
#9 0xffffffff808194b4 in bdev_open (dev=3, flag=1073741825, devtype=24576,
l=0xffffffff812764e0 <lwp0>) at ../../../../kern/subr_devsw.c:716
#10 0xffffffff8089b03d in spec_open (v=0xffffffff81645b20)
at ../../../../miscfs/specfs/spec_vnops.c:609
#11 0xffffffff8088f25c in VOP_OPEN (vp=0xfffffe811dc5fd20,
mode=mode@entry=1073741825, cred=cred@entry=0xffffffffffffffff)
at ../../../../kern/vnode_if.c:238
#12 0xffffffff80668c71 in rf_find_raid_components ()
at ../../../../dev/raidframe/rf_netbsdkintf.c:2874
#13 0xffffffff8066a1ec in rf_autoconfig (self=<optimized out>)
at ../../../../dev/raidframe/rf_netbsdkintf.c:445
#14 0xffffffff808137d4 in config_finalize ()
at ../../../../kern/subr_autoconf.c:2166
#15 0xffffffff80b7d9a1 in main () at ../../../../kern/init_main.c:613
(gdb) frame 4
#4 0xffffffff804b80e1 in mvsata_wdc_cmd_start (chp=0xfffffe811d78d810,
xfer=0xfffffe811dc60f60) at ../../../../dev/ic/mvsata.c:1586
1586 mvsata_pmp_select(mvport, drive);
(gdb) print *xfer
$1 = {c_flags = 16, c_chp = 0xfffffe811d78d810, c_drive = 0,
c_cmd = 0xffffffff81645978, c_databuf = 0xfffffe8125788c88, c_bcount = 512,
c_skip = 0, c_dscpoll = 0, c_lenoff = 0, c_xferchain = {tqe_next = 0x0,
tqe_prev = 0xfffffe811d78de80},
c_start = 0xffffffff804b806b <mvsata_wdc_cmd_start>,
c_intr = 0xffffffff804b7d9f <mvsata_wdc_cmd_intr>,
c_kill_xfer = 0xffffffff804b795a <mvsata_wdc_cmd_kill_xfer>}
(gdb) print {struct ata_command}xfer->c_cmd
$3 = {r_lba = 0, r_count = 0, {r_features = 0, r_error = 0 '\000'}, {
r_command = 236 '\354', r_status = 236 '\354'}, r_device = 0 '\000',
r_st_bmask = 64 '@', r_st_pmask = 8 '\b', flags = 9, timeout = 3000,
data = 0xfffffe8125788c88, bcount = 512, callback = 0x0, callback_arg = 0x0}
flags = 9 = AT_READ | AT_WAIT
r_command = 236 = 0xEC = WDCC_IDENTIFY
so mvsata just seems to be trying to identify the drives on boot. What is EDMA?
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.orgu
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Fri, 31 Mar 2017 17:38:30 +0200
On Fri, Mar 31, 2017 at 03:35:00PM +0000, prlw1@cam.ac.uk wrote:
> mvsata0:0:0: unable to stop EDMA
> panic: EDMA enabled
I have seen that when the machine had trouble talking to the disk, a
full reset usualy fixed that (but later I replaced the drive due to SMART
failurs and haven't seen it again since then).
Is this reprducable for you? Does it happen always on boot? Did the device/
disk combo work before?
Martin
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Fri, 31 Mar 2017 16:46:21 +0100
On Fri, Mar 31, 2017 at 03:40:01PM +0000, Martin Husemann wrote:
> Is this reprducable for you? Does it happen always on boot? Did the device/
> disk combo work before?
When I received that old server, I successfully booted it with the ubuntu
that was installed. I have always had the netbsd netboot panic. It is true
that given its age, the chance of one of the 8 disks on each mvsata board
being bad is quite high.
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Fri, 31 Mar 2017 17:17:05 +0100
Just booted the already installed Ubuntu 14.04.1.
for i in a b c d e f g h i j k l m n o p; do smartctl -a /dev/sd$i | grep overall-health; done
gets me
SMART overall-health self-assessment test result: PASSED
for all 16 disks, even though there is plenty of "Pre-fail" and "Old_age"
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Mon, 3 Apr 2017 12:01:33 +0100
Removed 15 of the 16 disks from the 2 mvsata cards, and get same panic.
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126: mvsata Marvell 88SX6081 panics on boot
Date: Tue, 4 Apr 2017 13:54:23 +0100
I took a leaf out of the ARMADAXP (evbarm) and added
options MVSATA_WITHOUTDMA
to my amd64 kernel configuration. The panic is now worked-around, I just
need to create more wd devices(!)
Anyway, this suggests a software rather than hardware bug.
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Tue, 25 Apr 2017 22:00:02 +0000
Responsible-Changed-Why:
Collecting mvsata(4) issues, soon to do changes on ncq branch.
From: Matt Brocklehurst <matt@mattbrocklehurst.co.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Subject: Re: kern/52126
Date: Fri, 18 Aug 2017 13:06:43 +0100
--94eb2c0a956e6ce423055705f6e6
Content-Type: text/plain; charset="UTF-8"
I'm also having this problem, about to build a new kernel with the DMA
disabled in the hope to get it running properly.
I've got a Thecus 5200Pro NAS running NetBSD 7.1
Controller is 88SX6081
I've tried a mixture of different drives, they all give the "unable to stop
EDMA" error on boot.
--94eb2c0a956e6ce423055705f6e6
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr"><div><div>I'm also having this problem, about to build=
a new kernel with the DMA disabled in the hope to get it running properly.=
<br><br></div>I've got a Thecus 5200Pro NAS running NetBSD 7.1 <br>=C2=
=A0Controller is 88SX6081 <br><br><br></div>I've tried a mixture of dif=
ferent drives, they all give the "unable to stop EDMA" error on b=
oot.<br></div>
--94eb2c0a956e6ce423055705f6e6--
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: Subject: Re: kern/52126
Date: Sat, 19 Aug 2017 08:41:28 -0000 (UTC)
matt@mattbrocklehurst.co.uk (Matt Brocklehurst) writes:
> I've tried a mixture of different drives, they all give the "unable to stop
> EDMA" error on boot.
The corresponding code in Linux does almost the same. Differences are:
- it not only waits for the IDLE flag, but also for the CACHEEMPTY flag
before disabling DMA.
- it waits up to 15ms (instead of 10ms) before disabling DMA
- it disables DMA even when the idle timeout is reached.
- it waits up to 100ms (instead of the remainder of 10ms) after disabling DMA
before giving up.
On the other hand, our timeouts are very imprecise as the driver tries to
sleep for 1 millisecond, which is impossible with HZ=100. A sleep for
mstohz(1) will sleep 0 ticks, i.e. it will only yield the CPU to other
LWPs. It is likely that the required delays are not met.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org, Michael van Elst <mlelstv@serpens.de>
Cc: Jaromir Dolecek <jdolecek@netbsd.org>, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Patrick Welche <prlw1@cam.ac.uk>
Subject: Re: Subject: Re: kern/52126
Date: Sat, 19 Aug 2017 21:31:26 +0200
--001a1148f876afc9050557204acf
Content-Type: text/plain; charset="UTF-8"
I've changed the sleep/delay code on jdolecek-ncq branch so that the sleep
is always at least 1 tick. You might give it a try - if that is really the
only issue, this might help. Increasing the timeouts however did not fix
the issue for me on 60xx, at least not with the brief testing I did so far.
I've experienced the EDMA timeout on 6048 just while booting or some light
use, while it almost never happened on 70xxx. It usually happened when the
chip was wedged for other reasons, like unrecovered NCQ error or some such.
So I think that there might be something else we do wrong for 60xx, and the
EDMA disable problem is just consequence.
60xx was however not really focus of the NCQ branch, and I won't do
anything further for it on the branch. My focus is now on completing my
branch and merge, and some RL stuff. Then I may start looking on the 60xx
issue, but that won't be before late September.
Jaromir
2017-08-19 10:45 GMT+02:00 Michael van Elst <mlelstv@serpens.de>:
> The following reply was made to PR kern/52126; it has been noted by GNATS.
>
> From: mlelstv@serpens.de (Michael van Elst)
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: Subject: Re: kern/52126
> Date: Sat, 19 Aug 2017 08:41:28 -0000 (UTC)
>
> matt@mattbrocklehurst.co.uk (Matt Brocklehurst) writes:
>
> > I've tried a mixture of different drives, they all give the "unable to
> stop
> > EDMA" error on boot.
>
> The corresponding code in Linux does almost the same. Differences are:
>
> - it not only waits for the IDLE flag, but also for the CACHEEMPTY flag
> before disabling DMA.
> - it waits up to 15ms (instead of 10ms) before disabling DMA
> - it disables DMA even when the idle timeout is reached.
> - it waits up to 100ms (instead of the remainder of 10ms) after disabling
> DMA
> before giving up.
>
> On the other hand, our timeouts are very imprecise as the driver tries to
> sleep for 1 millisecond, which is impossible with HZ=100. A sleep for
> mstohz(1) will sleep 0 ticks, i.e. it will only yield the CPU to other
> LWPs. It is likely that the required delays are not met.
>
>
> --
> --
> Michael van Elst
> Internet: mlelstv@serpens.de
> "A potential Snark may lurk in every
> tree."
>
>
--001a1148f876afc9050557204acf
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr">I've changed the sleep/delay code on jdolecek-ncq bran=
ch so that the sleep is always at least 1 tick. You might give it a try - i=
f that is really the only issue, this might help. Increasing the timeouts h=
owever did not fix the issue for me on 60xx, at least not with the brief te=
sting I did so far.<div><br></div><div>I've experienced the EDMA timeou=
t on 6048 just while booting or some light use, while it almost never happe=
ned on 70xxx. It usually happened when the chip was wedged for other reason=
s, like unrecovered NCQ error or some such. So I think that there might be =
something else we do wrong for 60xx, and the EDMA disable problem is just c=
onsequence.=C2=A0<br></div><div><br></div><div>60xx was however not really =
focus of the NCQ branch, and I won't do anything further for it on the =
branch. My focus is now on completing my branch and merge, and some RL stuf=
f. Then I may start looking on the 60xx issue, but that won't be before=
late September.</div><div><br></div><div>Jaromir<br></div></div><div class=
=3D"gmail_extra"><br><div class=3D"gmail_quote">2017-08-19 10:45 GMT+02:00 =
Michael van Elst <span dir=3D"ltr"><<a href=3D"mailto:mlelstv@serpens.de=
" target=3D"_blank">mlelstv@serpens.de</a>></span>:<br><blockquote class=
=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padd=
ing-left:1ex"><span class=3D"">The following reply was made to PR kern/5212=
6; it has been noted by GNATS.<br>
<br>
</span>From: <a href=3D"mailto:mlelstv@serpens.de">mlelstv@serpens.de</a> (=
Michael van Elst)<br>
To: <a href=3D"mailto:gnats-bugs@netbsd.org">gnats-bugs@netbsd.org</a><br>
Cc:<br>
Subject: Re: Subject: Re: kern/52126<br>
Date: Sat, 19 Aug 2017 08:41:28 -0000 (UTC)<br>
<span class=3D""><br>
=C2=A0<a href=3D"mailto:matt@mattbrocklehurst.co.uk">matt@mattbrocklehurst.=
co.uk</a> (Matt Brocklehurst) writes:<br>
<br>
=C2=A0> I've tried a mixture of different drives, they all give the =
"unable to stop<br>
=C2=A0> EDMA" error on boot.<br>
<br>
</span>=C2=A0The corresponding code in Linux does almost the same. Differen=
ces are:<br>
<br>
=C2=A0- it not only waits for the IDLE flag, but also for the CACHEEMPTY fl=
ag<br>
=C2=A0 =C2=A0before disabling DMA.<br>
=C2=A0- it waits up to 15ms (instead of 10ms) before disabling DMA<br>
=C2=A0- it disables DMA even when the idle timeout is reached.<br>
=C2=A0- it waits up to 100ms (instead of the remainder of 10ms) after disab=
ling DMA<br>
=C2=A0 =C2=A0before giving up.<br>
<br>
=C2=A0On the other hand, our timeouts are very imprecise as the driver trie=
s to<br>
=C2=A0sleep for 1 millisecond, which is impossible with HZ=3D100. A sleep f=
or<br>
=C2=A0mstohz(1) will sleep 0 ticks, i.e. it will only yield the CPU to othe=
r<br>
=C2=A0LWPs. It is likely that the required delays are not met.<br>
<br>
<br>
=C2=A0--<br>
=C2=A0--<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Michael van Elst<br>
=C2=A0Internet: <a href=3D"mailto:mlelstv@serpens.de">mlelstv@serpens.de</a=
><br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0"A potential Snark may lu=
rk in every tree."<br>
<br>
</blockquote></div><br></div>
--001a1148f876afc9050557204acf--
From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Fwd: Subject: Re: kern/52126
Date: Sun, 20 Aug 2017 00:29:29 +0200
--001a11471ce06c41e2055722c70f
Content-Type: text/plain; charset="UTF-8"
(once more in plain text for record)
I've changed the sleep/delay code on jdolecek-ncq branch so that the sleep
is always at least 1 tick. You might give it a try - if that is really the
only issue, this might help. Increasing the timeouts however did not fix
the issue for me on 60xx, at least not with the brief testing I did so far.
I've experienced the EDMA timeout on 6048 just while booting or some light
use, while it almost never happened on 70xxx. It usually happened when the
chip was wedged for other reasons, like unrecovered NCQ error or some such.
So I think that there might be something else we do wrong for 60xx, and the
EDMA disable problem is just consequence.
60xx was however not really focus of the NCQ branch, and I won't do
anything further for it on the branch. My focus is now on completing my
branch and merge, and some RL stuff. Then I may start looking on the 60xx
issue, but that won't be before late September.
Jaromir
--001a11471ce06c41e2055722c70f
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr">(once more in plain text for record)<br><br>I've chang=
ed the sleep/delay code on jdolecek-ncq branch so that the sleep is always =
at least 1 tick. You might give it a try - if that is really the only issue=
, this might help. Increasing the timeouts however did not fix the issue fo=
r me on 60xx, at least not with the brief testing I did so far.<br><br>I=
9;ve experienced the EDMA timeout on 6048 just while booting or some light =
use, while it almost never happened on 70xxx. It usually happened when the =
chip was wedged for other reasons, like unrecovered NCQ error or some such.=
So I think that there might be something else we do wrong for 60xx, and th=
e EDMA disable problem is just consequence.<br><br>60xx was however not rea=
lly focus of the NCQ branch, and I won't do anything further for it on =
the branch. My focus is now on completing my branch and merge, and some RL =
stuff. Then I may start looking on the 60xx issue, but that won't be be=
fore late September.<br><br>Jaromir</div>
--001a11471ce06c41e2055722c70f--
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Wed, 8 Nov 2017 13:08:20 +0000
Should I be looking at jdolecek-ncq? I just checked today's -current, and:
Standard GENERIC-8.99.5/amd64:
mvsata1 at pci5 dev 2 function 0: vendor 11ab product 6081 (rev. 0x09)
mvsata1: interrupting at ioapic1 pin 2
mvsata1: GenII, 2hc, 4port/hc
mvsata1 port 4: device present, speed: 3.0Gb/s
mvsata1:4:0: lost interrupt
type: ata tc_bcount: 16384 tc_skip: 0
and then machine apparently hangs
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
0 94 3 2 200 ffffe4011dd43ae0 mvsata1cnf biowait
kernel with MVSATA_WITHOUTDMA etc.
mvsata1 at pci5 dev 2 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
mvsata1: interrupting at ioapic1 pin 2
mvsata1: GenII, 2hc, 4port/hc
mvsata1 port 4: device present, speed: 3.0Gb/s
mvsata1:4:0: lost interrupt
type: ata tc_bcount: 16384 tc_skip: 0
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), slot 0, retry 1
mvsata1:4:0: lost interrupt
type: ata tc_bcount: 16384 tc_skip: 0
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), slot 0, retry 2
mvsata1:4:0: lost interrupt
type: ata tc_bcount: 16384 tc_skip: 0
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), slot 0, retry 3
mvsata1:4:0: lost interrupt
type: ata tc_bcount: 16384 tc_skip: 0
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), slot 0, retry 4
mvsata1:4:0: lost interrupt
type: ata tc_bcount: 15872 tc_skip: 512
wd0d: device timeout reading fsbn 1 of 0-31 (wd0 bn 1; cn 0 tn 0 sn 1), slot 0, retry 5
fatal page fault in supervisor mode
trap type 6 code 0x2 rip 0xffffffff8021c9e3 cs 0x8 rflags 0x10202 cr2 0xffff800020991000 ilevel 0x8 rsp 0xffffe4011d271be0
curlwp 0xffffe4045ef0b840 pid 0.2 lowest kstack 0xffffe4011d26e2c0
kernel: page fault trap, code=0
Stopped in pid 0.2 (system) at netbsd:bus_space_read_multi_2+0x23: movw %
ax,0(%rcx)
bus_space_read_multi_2() at netbsd:bus_space_read_multi_2+0x23
wdc_ata_bio_intr() at netbsd:wdc_ata_bio_intr+0x2e7
mvsata_intr() at netbsd:mvsata_intr+0x107
mvsata_pci_intr() at netbsd:mvsata_pci_intr+0x12e
intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
Xintr_ioapic_level4() at netbsd:Xintr_ioapic_level4+0xf7
--- interrupt ---
x86_stihlt() at netbsd:x86_stihlt+0x6
acpicpu_cstate_idle_enter() at netbsd:acpicpu_cstate_idle_enter+0xdb
acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xb6
idle_loop() at netbsd:idle_loop+0x19c
There is no guarantee that the disks are OK, but I have tried several...
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 20 May 2019 10:45:39 +0100
An update - in the meantime jdolecek-ncq was merged, and I'm sure I saw
something about DMA...
> and then machine apparently hangs
I still see the hang
> kernel with MVSATA_WITHOUTDMA etc.
but don't need the MVSATA_WITHOUTDMA. No more panics!
GENERIC 8.9941/amd64 kernel, with LOCKDEBUG, DIAGNOSTIC and KUBSAN,
booted with -1 just in case (no difference in multiprocessor), and:
cpu0: package 0, core 0, smt 0
cpu1 at mainbus0 apid 1: multiprocessor boot disabled
...
mvsata0 at pci5 dev 1 function 0: vendor 11ab product 6081 (rev. 0x09)
mvsata0: interrupting at ioapic1 pin 1
mvsata0: GenII, 2hc, 4port/hc
atabus0 at mvsata0 channel 0
atabus1 at mvsata0 channel 1
...
mvsata0 port 0: device present, speed: 1.5Gb/s
mvsata0 port 6: device present, speed: 1.5Gb/s
ahcisata0 port 0: device present, speed: 1.5Gb/s
ahcisata0 port 1: device present, speed: 1.5Gb/s
wd0 at atabus0 drive 0
wd0: <ST3200822AS>
wd0: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
mvsata0:0:0: lost interrupt
type: ata tc_bcount: 16384 tc_skip: 0
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 0
mvsata0:0:0: lost interrupt
type: ata tc_bcount: 16384 tc_skip: 0
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 1
ipmi0: ID 32.1 IPMI 2.0 Available
mvsata0: channel 0: drive 0 not ready, st=0xd0, err=0x00
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 2
mvsata0: channel 0: drive 0 not ready, st=0xd0, err=0x00
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 3
mvsata0: channel 0: drive 0 not ready, st=0xd0, err=0x00
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 4
mvsata0: channel 0: drive 0 not ready, st=0xd0, err=0x00
wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0)
wd0d: error reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0)
hang
break into ddb:
--- interrupt ---
x86_stihlt() at netbsd:x86_stihlt+0x6
acpicpu_cstate_idle_enter() at netbsd:acpicpu_cstate_idle_enter+0x18d
acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0x1b3
idle_loop() at netbsd:idle_loop+0x19a
Lost in idle_loop?
Extracts from ps:
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
0 65 3 0 204 ffffbddd0279f200 mvsata0cnf biowait
0 21 3 0 204 ffffbddd0248b540 atabus0 atath
0 > 2 7 0 201 ffffbddf4b803040 idle/0
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 20 May 2019 12:34:51 +0100
On Mon, May 20, 2019 at 11:00:01AM +0000, Patrick Welche wrote:
> An update - in the meantime jdolecek-ncq was merged, and I'm sure I saw
> something about DMA...
Checking: the bit about DMA was in fact the patch in PR kern/54205, which
AFAICT hasn't been committed. I just tried it, and it didn't help with
this (different) bug.
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Thu, 17 Oct 2019 14:24:53 +0100
Just tried a today's NetBSD-9.99.17/amd64 kernel with:
include "arch/amd64/conf/GENERIC"
options DEBUG
options LOCKDEBUG
options MVSATA_DEBUG
options MVSATA_WITHOUTDMA
(I think the withoutdma is not longer necessary) and on boot, get
Mutex error: assert_sleepable,70: spin lock held
lock address : 0xfffff91bd38a4020 type : spin
initialized : 0xffffffff80284dd9
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 1 last held: 1
current lwp : 0xfffff91bd3934ac0 last held: 0xfffff91bd3934ac0
last locked* : 0xffffffff809c910f unlocked : 0xffffffff809c9070
owner field : 0x0000000000010600 wait/spin: 0/1
panic: LOCKDEBUG: Mutex error: assert_sleepable,70: spin lock held
cpu1: Begin traceback...
vpanic() at netbsd:vpanic+0x160
snprintf() at netbsd:snprintf
lockdebug_more() at netbsd:lockdebug_more
assert_sleepable() at netbsd:assert_sleepable+0x3d
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x18e
kmem_intr_alloc() at netbsd:kmem_intr_alloc+0x5b
kmem_intr_zalloc() at netbsd:kmem_intr_zalloc+kmem_zalloc() at netbsd:kmem_zalloc+0x4a
atabus_alloc_drives() at netbsd:atabus_alloc_drives+0x70
sata_interpret_sig() at netbsd:sata_interpret_sig+0x1e5
mvsata_probe_drive() at netbsd:mvsata_probe_drive+0x234
atabusconfig() at netbsd:atabusconfig+0x65
atabus_thread() at netbsd:atabus_thread+0x7e
cpu1: End traceback...
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff8021ddcd cs 0x8 rflags 0x202 cr2 0 ilevel 0x8 rsp 0xffffd5811bbfcce0
curlwp 0xfffff91bd3934ac0 pid 0.52 lowest kstack 0xffffd5811bbf92c0
0x11
db{1}> show locks
[Locks tracked through LWPs]
Locks held by an LWP (ipmi0):
Lock 0 (initialized at ipmi_attach)
lock address : 0xfffff91e95a333c0 type : sleep/adaptive
initialized : 0xffffffff80a92655
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 1 last held: 0
current lwp : 0xfffff91bd3934ac0 last held: 0xfffff91bd34f7600
last locked* : 0xffffffff80a934f5 unlocked : 0xffffffff80a9354c
owner field : 0xfffff91bd34f7600 wait/spin: 0/0
Turnstile chain at 0xffffffff816a0200.
=> No active turnstile for this lock.
[Locks tracked through CPUs]
Locks held on CPU 1:
Lock 0 (initialized at main)
lock address : 0xffffffff8169e800 type : spin
initialized : 0xffffffff80e3be69
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 1 last held: 1
current lwp : 0xfffff91bd3934ac0 last held: 0xfffff91bd3934ac0
last locked* : 0xffffffff809c60ac unlocked : 0xffffffff809c90dd
curcpu holds : 1 wanted by: 000000000000000000
Lock 1 (initialized at ata_channel_init)
lock address : 0xfffff91bd38a4020 type : spin
initialized : 0xffffffff80284dd9
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 1 last held: 1
current lwp : 0xfffff91bd3934ac0 last held: 0xfffff91bd3934ac0
last locked* : 0xffffffff809c910f unlocked : 0xffffffff809c9070
owner field : 0x0000000000010600 wait/spin: ine cpu 0
using CPU 0
db{1}> trace
_kernel_lock() at netbsd:_kernel_lock+0x1d7
sleepq_block() at netbsd:sleepq_block+0x93
kpause() at netbsd:kpause+0x11f
sata_reset_interface() at netbsd:sata_reset_interface+0xa8
mvsata_probe_drive() at netbsd:mvsata_probe_drive+0x36
atabusconfig() at netbsd:atabusconfig+0x65
atabus_thread() at netbsd:atabus_thread+0x7e
db{1}> machine cpu 1
using CPU 1
db{1}> trace
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x160
snprintf() at netbsd:snprintf
lockdebug_more() at netbsd:lockdebug_more
assert_sleepable() at netbsd:assert_sleepable+0x3d
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x18e
kmem_intr_alloc() at netbsd:kmem_intr_alloc+0x5b
kmem_intr_zalloc() at netbsd:kmem_intr_zalloc+0x11
kmem_zalloc() at netbsd:kmem_zalloc+0x4a
atabus_alloc_drives() at netbsd:atabus_alloc_drives+0x70
sata_interpret_sig() at netbsd:sata_interpret_sig+0x1e5
mvsata_probe_drive() at netbsd:mvsata_probe_drive+0x234
atabusconfig() at netbsd:atabusconfig+0x65
atabus_thread() at netbsd:atabus_thread+0x7e
db{1}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
...
0 > 52 7 1 200 fffff91bd3934ac0 atabus12
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Tue, 22 Oct 2019 17:07:34 +0100
The locking error has gone away - guessing this has to do with the ata
insomniac fixes...
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 22 Oct 2019 16:36:49 +0000
State-Changed-Why:
Does this mean this PR can be closed?
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Wed, 23 Oct 2019 11:33:49 +0100
Just tried again with this morning's source and see the locking panic
Kernel lock error: _kernel_lock,220: spinout
lock address : 0xffffffff8169e800 type : spin
initialized : 0xffffffff80e3a8a9
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 3
current cpu : 3 last held: 1
current lwp : 0xffffd36185f81600 last held: 0xffffd3618633eac0
last locked* : 0xffffffff809c58db unlocked : 0xffffffff8098aee2
curcpu holds : 0 wanted by: 0xffffd36185f81600
panic: LOCKDEBUG: Kernel lock error: _kernel_lock,220: spinout
cpu3: Begin traceback...
vpanic() at netbsd:vpanic+0x178
snprintf() at netbsd:snprintf
lockdebug_more() at netbsd:lockdebug_more
_kernel_lock() at netbsd:_kernel_lock+0x14a
sleepq_block() at netbsd:sleepq_block+0x1f8
cv_timedwait() at netbsd:cv_timedwait+0x130
ipmi_thread() at netbsd:ipmi_thread+0x881
cpu3: End traceback...
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff8021ddcd cs 0x8 rflags 0x202 cr2 0 ilevel 0 rsp 0xffffab013d2ccd20
curlwp 0xffffd36185f81600 pid 0.39 lowest kstack 0xffffab013d2c92c0
Stopped in pid 0.39 (system) at netbsd:breakpoint+0x5: how locks
[Locks tracked through LWPs]
Locks held by an LWP (npfgc-0):
Lock 0 (initialized at npf_config_init)
lock address : 0xffffd36186396cf8 type : sleep/adaptive
initialized : 0xffffffff8079aa4b
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 3 last held: 2
current lwp : 0xffffd36185f81600 last held: 0xffffd361868d29a0
last locked* : 0xffffffff8079aae7 unlocked : 0xffffffff807aa8db
owner field : 0xffffd361868d29a0 wait/spin: 0/0
Turnstile chain at 0xffffffff816a0070.
=> No active turnstile for this lock.
Locks held by an LWP (nd6_timer):
Lock 0 (initialized at soinit)
lock address : 0xffffd364c8274080 type : sleep/adaptive
initialized : 0xffffffff80a31cad
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 1
current cpu : 3 last held: 2
current lwp : 0xffffd36185f81600 last held: 0xffffd36186725940
last locked* : 0xffffffff80760675 unlocked : 0xffffffff80733115
owner field : 0xffffd36186725940 wait/spin: 1/0
Turnstile chain at 0xffffffff8169ff80.
=> Turnstile at 0xffffd364c79e6b98 (wrq=0xffffd364c79e6bb8, rdq=0xffffd364c79e6b
c8).
=> 0 waiting readers:
=> 1 waiting writers: 0xffffd36185ef95a0
Locks held by an LWP (swapper):
Lock 0 (initialized at disk_init)
lock address : 0xffffd361866ff5b0 type : sleep/adaptive
initialized : 0xffffffff809e3de9
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 3 last held: 0
current lwp : 0xffffd36185f81600 last held: 0xffffffff8165cbe0
last locked* : 0xffffffff80a7f7b6 unlocked : 0xffffffff80a7e49c
owner field : 0xffffffff8165cbe0 wait/spin: 0/0
Turnstile chain at 0xffffffff816a01e0.
=> No active turnstile for this lock.
[Locks tracked through CPUs]
Locks held on CPU 1:
Lock 0 (initialized at main)
lock address : 0xffffffff8169e800 type : spin
initialized : 0xffffffff80e3a8a9
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 3
current cpu : 3 last held: 1
current lwp : 0xffffd36185f81600 last held: 0xffffd3618633eac0
last locked* : 0xffffffff809c58db unlocked : 0xffffffff8098aee2
curcpu holds : 0 wanted by: 0xffffd36185f81600
Lock 1 (initialized at ata_channel_init)
lock address : 0xffffd361862ef020 type : spin
initialized : 0xffffffff8028428c
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 3 last held: 1
current lwp : 0xffffd36185f81600 last held: 0xffffd3618633eac0
last locked* : 0xffffffff802829f1 unlocked : 0xffffffff80645f69
owner field : 0x0000000000010600 wait/spin: 0/1
so presumably a race, and I was lucky last time.
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Wed, 23 Oct 2019 11:45:10 +0100
A bit more:
db{3}> machine cpu 0
using CPU 0
db{3}> bt
_kernel_lock() at netbsd:_kernel_lock+0x118
sleepq_block() at netbsd:sleepq_block+0x1f8
iic_smbus_intr_thread() at netbsd:iic_smbus_intr_thread+0x52
db{3}> machine cpu 2
using CPU 2
db{3}> bt
_kernel_lock() at netbsd:_kernel_lock+0xfd
nd6_timer_work() at netbsd:nd6_timer_work+0x49
workqueue_worker() at netbsd:workqueue_worker+0xe5
db{3}> machine cpu 3
using CPU 3
db{3}> bt
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x178
snprintf() at netbsd:snprintf
lockdebug_more() at netbsd:lockdebug_more
_kernel_lock() at netbsd:_kernel_lock+0x14a
sleepq_block() at netbsd:sleepq_block+0x1f8
cv_timedwait() at netbsd:cv_timedwait+0x130
ipmi_thread() at netbsd:ipmi_thread+0x881
db{3}> machine cpu 1
using CPU 1
db{3}> bt
lapic_delay() at netbsd:lapic_delay+0x6e
__wdcwait() at netbsd:__wdcwait+0x91
mvsata_bio_start() at netbsd:mvsata_bio_start+0xa9c
ata_xfer_start() at netbsd:ata_xfer_start+0x4f
atabus_thread() at netbsd:atabus_thread+0x279
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Thu, 24 Oct 2019 13:47:47 +0100
(Still get locking panic with wdc.c 1.294)
State-Changed-From-To: feedback->open
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Thu, 24 Oct 2019 13:18:23 +0000
State-Changed-Why:
Thanks for feedback. Needs review and fix of locking protocol for calls at least
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Thu, 31 Oct 2019 15:22:54 +0000
Tinkering with today's -current/amd64, a probably not relevant
Index: sys/dev/ic/wdc.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/wdc.c,v
retrieving revision 1.294
diff -u -p -r1.294 wdc.c
--- sys/dev/ic/wdc.c 23 Oct 2019 14:44:41 -0000 1.294
+++ sys/dev/ic/wdc.c 31 Oct 2019 15:17:20 -0000
@@ -1210,6 +1210,7 @@ __wdcwait(struct ata_channel *chp, int m
*tfd = 0;
timeout = timeout * 1000 / WDCDELAY; /* delay uses microseconds */
+ timeout *= 100; /* XXXPW */
for (;;) {
status =
and a
include "arch/amd64/conf/GENERIC"
options DEBUG
options LOCKDEBUG
options ATADEBUG
options ATADEBUG_MASK=0x10
options MVSATA_DEBUG
options WDCNDELAY_DEBUG
kernel, booting in uniprocessor mode, i.e., boot netbsd -1, I also hit
the locking problem:
panic: LOCKDEBUG: Mutex error: mutex_vector_enter,477: locking against myself
db{0}> bt
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x178
snprintf() at netbsd:snprintf
lockdebug_more() at netbsd:lockdebug_more
mutex_enter() at netbsd:mutex_enter+0x518
ata_queue_get_active_xfer() at netbsd:ata_queue_get_active_xfer+0x16
__wdcwait() at netbsd:__wdcwait+0x116
wdcwait() at netbsd:wdcwait+0x99
__wdccommand_intr() at netbsd:__wdccommand_intr+0xc4
ata_xfer_start() at netbsd:ata_xfer_start+0x136
atastart() at netbsd:atastart+0x20e
wdc_exec_command() at netbsd:wdc_exec_command+0x9a
ata_get_params() at netbsd:ata_get_params+0x11e
wdc_drvprobe() at netbsd:wdc_drvprobe+0x3cb
atabusconfig() at netbsd:atabusconfig+0x65
atabus_thread() at netbsd:atabus_thread+0x81
db{0}> show locks
[Locks tracked through LWPs]
Locks held by an LWP (ipmi0):
Lock 0 (initialized at ipmi_attach)
lock address : 0xfffffe226d5333c0 type : sleep/adaptive
initialized : 0xffffffff80a9285c
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 0 last held: 0
current lwp : 0xfffffe1f2f053a40 last held: 0xfffffe226d53e540
last locked* : 0xffffffff80a93498 unlocked : 000000000000000000
owner field : 0xfffffe226d53e540 wait/spin: 0/0
Turnstile chain at 0xffffffff816a0480.
=> No active turnstile for this lock.
[Locks tracked through CPUs]
Locks held on CPU 0:
Lock 0 (initialized at main)
lock address : 0xffffffff8169ea80 type : spin
initialized : 0xffffffff80e3cca9
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
current cpu : 0 last held: 0
current lwp : 0xfffffe1f2f053a40 last held: 0xfffffe1f2f053a40
last locked* : 0xffffffff80209781 unlocked : 0xffffffff809ca113
curcpu holds : 1 wanted by: 000000000000000000
Lock 1 (initialized at ata_channel_init)
lock address : 0xffffad801da5f1f0 type : spin
initialized : 0xffffffff8028424c
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 1
current cpu : 0 last held: 0
current lwp : 0xfffffe1f2f053a40 last held: 0xfffffe1f2f053a40
last locked* : 0xffffffff8063b9d8 unlocked : 0xffffffff8028070c
owner field : 0x0000000000010600 wait/spin: 0/1
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
0 > 40 7 0 200 fffffe1f2f053a40 atabus16
0 21 3 0 200 fffffe226d53e540 ipmi0 ipmicmd
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 1 Nov 2019 17:59:50 +0000
On Thu, Oct 31, 2019 at 03:25:01PM +0000, Patrick Welche wrote:
> include "arch/amd64/conf/GENERIC"
> options DEBUG
> options LOCKDEBUG
> options ATADEBUG
> options ATADEBUG_MASK=0x10
> options MVSATA_DEBUG
> options WDCNDELAY_DEBUG <--------------- !!!!!
The locking issue is caused by WDCNDELAY_DEBUG.
wdcwait() asserts that the channel is locked, and calls __wdcwait.
Only ifdef WDCNDELAY_DEBUG does __wdcwait call ata_queue_get_active_xfer
which tries to lock the already locked channel.
https://nxr.netbsd.org/xref/src/sys/dev/ic/wdc.c#1243
Removing WDCNDELAY_DEBUG removes the locking issue.
Any idea where the status values in
__wdcwait: timeout (time=1001), status d0 error 0 (mask 0x40 bits 0x40)
^^^^^^^^^
are defined?
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 1 Nov 2019 18:44:03 -0000 (UTC)
prlw1@cam.ac.uk (Patrick Welche) writes:
> Any idea where the status values in
> __wdcwait: timeout (time=1001), status d0 error 0 (mask 0x40 bits 0x40)
>
> are defined?
That's the hex value of the status register. See sys/dev/ata/atareg.h.
0xd0 = WDCS_BSY | WDCS_DRDY | WDCS_DSC;
The timeout happens because the drive never cleared the busy flag,
the code waits for (mask | WDCS_BSY) == bits.
The names are ancient, in ATA speak WDCS_DSC is the DF (Device Fault) flag
that says that a command couldn't be completed (in contrast to a command
that is completed with or without an error). There are special modes
where the DF flag and others get repurposed to signal other conditions,
but that's not relevant here.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: "Patrick Welche" <prlw1@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/52126 CVS commit: src/sys/dev/ic
Date: Mon, 4 Nov 2019 15:12:10 +0000
Module Name: src
Committed By: prlw1
Date: Mon Nov 4 15:12:10 UTC 2019
Modified Files:
src/sys/dev/ic: wdc.c
Log Message:
Fix locking bug seen when using WDCNDELAY_DEBUG in PR kern/52126
(Observed
piixide0:0:0: warning: busy-wait took 200us
with no panic)
To generate a diff of this commit:
cvs rdiff -u -r1.294 -r1.295 src/sys/dev/ic/wdc.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 4 Nov 2019 16:02:38 +0000
Next locking issue:
mvsata1:4: mvsata_bio: drive=0, blkno=0, bcount=512
mvsata1:4: mvsata_bio_start: drive=0
mvsata1:4: mvsata_bio_start: drive=0
Kernel lock error: _kernel_lock,220: spinout
lock address : 0xffffffff8169eb80 type : spin
initialized : 0xffffffff80e3dc09
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 2
current cpu : 0 last held: 2
current lwp : 0xffffcaa088781600 last held: 0xffffcaa088b3eac0
last locked* : 0xffffffff809c7c9b unlocked : 0xffffffff8098d2a2
curcpu holds : 0 wanted by: 0xffffcaa088781600
panic: LOCKDEBUG: Kernel lock error: _kernel_lock,220: spinout
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x178
snprintf() at netbsd:snprintf
lockdebug_more() at netbsd:lockdebug_more
_kernel_lock() at netbsd:_kernel_lock+0x14a
sleepq_block() at netbsd:sleepq_block+0x1f8
cv_timedwait() at netbsd:cv_timedwait+0x130
ipmi_thread() at netbsd:ipmi_thread+0x881
cpu0: End traceback...
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff8021ddcd cs 0x8 rflags 0x202 cr2 0 ilevel 0 rsp 0xffffde813d2ccd20
curlwp 0xffffcaa088781600 pid 0.39 lowest kstack 0xffffde813d2c92c0
Stopped in pid 0.39 (system) at netbsd:breakpoint+0x5: leave
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
0 > 39 7 0 200 ffffcaa088781600 ipmi0
0 > 52 7 2 e00 ffffcaa088b3eac0
db{0}> bt
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x178
snprintf() at netbsd:snprintf
lockdebug_more() at netbsd:lockdebug_more
_kernel_lock() at netbsd:_kernel_lock+0x14a
sleepq_block() at netbsd:sleepq_block+0x1f8
cv_timedwait() at netbsd:cv_timedwait+0x130
ipmi_thread() at netbsd:ipmi_thread+0x881
db{0}> machine cpu 1
using CPU 1
db{0}> bt
x86_stihlt() at netbsd:x86_stihlt+0x6
acpicpu_cstate_idle_enter() at netbsd:acpicpu_cstate_idle_enter+0xd1
acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xba
idle_loop() at netbsd:idle_loop+0x14b
cpu_hatch() at netbsd:cpu_hatch+0x17f
db{0}> machine cpu 2
using CPU 2
db{0}> bt
lapic_delay() at netbsd:lapic_delay+0x6e
__wdcwait() at netbsd:__wdcwait+0x99
mvsata_bio_start() at netbsd:mvsata_bio_start+0xa9c
ata_xfer_start() at netbsd:ata_xfer_start+0x4f
atabus_thread() at netbsd:atabus_thread+0x279
db{0}> machine cpu 3
using CPU 3
db{0}> bt
ld_rbto_compare_key() at netbsd:ld_rbto_compare_key+0x4
lockdebug_abort() at netbsd:lockdebug_abort+0x5a
_kernel_lock() at netbsd:_kernel_lock+0x14a
ip_slowtimo() at netbsd:ip_slowtimo+0x1a
pfslowtimo() at netbsd:pfslowtimo+0x47
callout_softclock() at netbsd:callout_softclock+0x121
softint_dispatch() at netbsd:softint_dispatch+0xf2
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xffffde813d2b60f0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 4 Nov 2019 16:21:04 +0000
(no locking issue if boot netbsd -c, and disable mvsata)
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 20 Dec 2019 17:01:50 +0000
Just for the record, the run up to the spinout is:
ata_exec_xfer 0xfffff2bbff940038 channel 1 drive 0
atastart from ata_exec_xfer, flags 0x800
ata_queue_alloc_slot: channel 1 qavail 0xffffffff qact 0atastart(chp=0xffffb0001edffa20): xfer 0xfffff2bbff940038 channel 1 drive 0
atastart(chp=0xffffb0001edffa20): channel 1 queue_xfer is empty
atastart(chp=0xffffb0001edffa20): channel 1 queue_xfer is empty
mvsata1:4: mvsata_bio: drive=0, blkno=0, bcount=512
ata_exec_xfer 0xfffff2bbff940038 channel 4 drive 0
atastart from ata_exec_xfer, flags 0x0
ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0atastart(chp=0xfffff2bbff46e010): xfer 0xfffff2bbff940038 channel 4 drive 0
mvsata1:4: mvsata_bio_start: drive=0
ata_channel_freeze_locked(chp=0xfffff2bbff46e010) -> 1
ata_channel_thaw_locked(chp=0xfffff2bbff46e010) -> 0
mvsata1:4: mvsata_bio_start: drive=0
Kernel lock error: _kernel_lock,222: spinout
From: Frank Kardel <kardel@kardel.name>
To: gnats-bugs@netbsd.org, jdolecek@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, prlw1@cam.ac.uk
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 20 Dec 2019 23:08:33 +0100
The _kernel_lock() code looks very suspicious.
spinouts on kernel_lock only happen while the root is not mounted
(start_init_exec == 0, see init_main.c):
215 spins = 0;
216 do {
217 splx(s);
218 while (__SIMPLELOCK_LOCKED_P(kernel_lock)) {
219 if (SPINLOCK_SPINOUT(spins)) {
220 extern int start_init_exec;
221 if
(!start_init_exec) <===+ looks fishy to me
222 _KERNEL_LOCK_ABORT("spinout"); <===+
223 }
224 SPINLOCK_BACKOFF_HOOK;
225 SPINLOCK_SPIN_HOOK;
226 }
227 s = splvm();
228 } while (!__cpu_simple_lock_try(kernel_lock));
229
230 ci->ci_biglock_count = nlocks;
I have seen spinout on my machine when booting via EFI and kernel
messages slowly crawling up the screen
causing kernel lock contention while initializing the system. After the
flag is set, the spinout logic is ineffective.
I am missing to see the reasoning here to panic at all during startup
and disable the panic after initialization.
IMHO the panic should be removed as it can be randomly triggered and
prohibit boot. And the code serves no purpose
after mounting the root-fs.
Frank
On 12/20/19 18:05, Patrick Welche wrote:
> The following reply was made to PR kern/52126; it has been noted by GNATS.
>
> From: Patrick Welche <prlw1@cam.ac.uk>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
> Date: Fri, 20 Dec 2019 17:01:50 +0000
>
> Just for the record, the run up to the spinout is:
>
> ata_exec_xfer 0xfffff2bbff940038 channel 1 drive 0
> atastart from ata_exec_xfer, flags 0x800
> ata_queue_alloc_slot: channel 1 qavail 0xffffffff qact 0atastart(chp=0xffffb0001edffa20): xfer 0xfffff2bbff940038 channel 1 drive 0
> atastart(chp=0xffffb0001edffa20): channel 1 queue_xfer is empty
> atastart(chp=0xffffb0001edffa20): channel 1 queue_xfer is empty
> mvsata1:4: mvsata_bio: drive=0, blkno=0, bcount=512
> ata_exec_xfer 0xfffff2bbff940038 channel 4 drive 0
> atastart from ata_exec_xfer, flags 0x0
> ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0atastart(chp=0xfffff2bbff46e010): xfer 0xfffff2bbff940038 channel 4 drive 0
> mvsata1:4: mvsata_bio_start: drive=0
> ata_channel_freeze_locked(chp=0xfffff2bbff46e010) -> 1
> ata_channel_thaw_locked(chp=0xfffff2bbff46e010) -> 0
> mvsata1:4: mvsata_bio_start: drive=0
> Kernel lock error: _kernel_lock,222: spinout
>
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 20 Dec 2019 22:34:41 -0000 (UTC)
kardel@kardel.name (Frank Kardel) writes:
>The _kernel_lock() code looks very suspicious.
>spinouts on kernel_lock only happen while the root is not mounted
>(start_init_exec == 0, see init_main.c):
start_init_exec is not related to the root mount but is set before the
first user code is executed.
The spinout check is a crude deadlock detection. It's not reasonable
to do that for user code that is waiting to enter the kernel.
If you want to remove the panic you may want to replace it with a (rate
limited) console message.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 07:56:19 +0100
I was just referring to place where the flag is set, not that it is
related to mounting the root although attempting to exec init implies
the availability of the root.
Crude indeed it is, es. It panics only during kernel startup when no
user code can be running,
Maybe I am missing omething, but the prerequisite condition for panic is
"!start_init_exec". Maybe the test was intended to be inverted so that
the crude
deadlock check is enabled while user level is running. Right now it only
checks during the initialization phase *before*
user-level is running and in some circumstance even prohibits a
successful boot (slow raster console+debug+verbose boot).
Frank
On 12/20/19 23:40, Michael van Elst wrote:
> The following reply was made to PR kern/52126; it has been noted by GNATS.
>
> From: mlelstv@serpens.de (Michael van Elst)
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
> Date: Fri, 20 Dec 2019 22:34:41 -0000 (UTC)
>
> kardel@kardel.name (Frank Kardel) writes:
>
> >The _kernel_lock() code looks very suspicious.
>
> >spinouts on kernel_lock only happen while the root is not mounted
> >(start_init_exec == 0, see init_main.c):
>
> start_init_exec is not related to the root mount but is set before the
> first user code is executed.
>
> The spinout check is a crude deadlock detection. It's not reasonable
> to do that for user code that is waiting to enter the kernel.
>
> If you want to remove the panic you may want to replace it with a (rate
> limited) console message.
>
> --
> --
> Michael van Elst
> Internet: mlelstv@serpens.de
> "A potential Snark may lurk in every tree."
>
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 07:47:12 -0000 (UTC)
kardel@netbsd.org (Frank Kardel) writes:
> Maybe I am missing omething, but the prerequisite condition for panic is
> "!start_init_exec". Maybe the test was intended to be inverted so that
> the crude
> deadlock check is enabled while user level is running. Right now it only
> checks during the initialization phase *before*
Excessive waiting for the kernel lock is always a problem, functional
(i.e. delayed interrupt processing and then lost interrupts) or just
performance. But after bootstrap a panic is probably too much, anything
before can be handled by just fixing the code that holds the kernel
lock for too long.
> user-level is running and in some circumstance even prohibits a
> successful boot (slow raster console+debug+verbose boot).
This just shows that slow raster consoles (kernel messages, not tty) have
their limitations. Still a good thing to get this information because
it points to a design problem (kernel lock held too long) usually somewhere
else. E.g. sdmmc(4) now sleeps in some places instead of spinning in delay()
because of this.
There are places in our ata code (and probably mvsata too) that need
similar attention.
--
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 11:05:55 +0100
On 12/21/19 08:50, Michael van Elst wrote:
> The following reply was made to PR kern/52126; it has been noted by GNATS.
>
> From: mlelstv@serpens.de (Michael van Elst)
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
> Date: Sat, 21 Dec 2019 07:47:12 -0000 (UTC)
>
> kardel@netbsd.org (Frank Kardel) writes:
>
> > Maybe I am missing omething, but the prerequisite condition for panic is
> > "!start_init_exec". Maybe the test was intended to be inverted so that
> > the crude
> > deadlock check is enabled while user level is running. Right now it only
> > checks during the initialization phase *before*
>
> Excessive waiting for the kernel lock is always a problem, functional
> (i.e. delayed interrupt processing and then lost interrupts) or just
> performance.
I am not questioning that.Ideally we can manage to make the kernel lock
disappear some time.
> But after bootstrap a panic is probably too much, anything
> before can be handled by just fixing the code that holds the kernel
> lock for too long.
During bootstrap a panic is even prohibitive. The long holding may not
even be caused by wrong code but
by debug checks and or kernel output to a slow console and on (larger)
MP systems even caused by starvation due to multiple
threads doing output to slow consoles. Especially in debug mode we have
even more unpredictable timing (memory size/resource dependent).
So we are looking here at performance bugs and delays related to
starvation and consistency check code.
My boot panics where caused when chasing the EFI device initialization
failures (which are almost gone - need to re-check).
The kernel panicked all over the place but never reached user mode
because of that. The workaround was/is to disable DEBUG and pciverbose
or the spinout code.
As the spinout code sets an absolute retry barrier any printf/debug
analysis within a kernel_lock section is risky. So while trying to catch
'long locks' we may be tripping also over false positives due to slow
output/bad design/starvation. I have no idea what the signal/noise ratio
is here.
>
> > user-level is running and in some circumstance even prohibits a
> > successful boot (slow raster console+debug+verbose boot).
>
> This just shows that slow raster consoles (kernel messages, not tty) have
> their limitations. Still a good thing to get this information because
> it points to a design problem (kernel lock held too long)
Starvation may also be a cause for that. We could add a lock counter to
reset the spin backoff to avoid triggering
on high contention. But this looks still like a bandaid.
> usually somewhere
> else. E.g. sdmmc(4) now sleeps in some places instead of spinning in delay()
> because of this.
There are certainly many places where kernel_lock is held needlessly
long. Which need to be addressed while
working on removing the big lock.
>
> There are places in our ata code (and probably mvsata too) that need
> similar attention.
Definitely. This bug looks like there is some unexpected interaction
going on with the device. Also ipmi printfs are often a victim here
going by the bug report. Would be nice to get a stack trace or at least
the location where the lock way taken of the kernel lock owner at this time.
But I still wonder whether this panic is a good idea due to signal/noise
ratio especially as is is active only during boot.
Frank
From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: Patrick Welche <prlw1@cam.ac.uk>
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 23:06:11 +0100
--0000000000003786cf059a3e0164
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Le ven. 20 d=C3=A9c. 2019 =C3=A0 18:05, Patrick Welche <prlw1@cam.ac.uk> a =
=C3=A9crit :
> mvsata1:4: mvsata_bio_start: drive=3D0
> Kernel lock error: _kernel_lock,222: spinout
>
Can you please modify mvsata_bio_start() to log the wait_flags? It seems
the code path eventually triggers delay() calls, I want to confirm it's
doing polled I/O there.
Jaromir
--0000000000003786cf059a3e0164
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote">=
<div dir=3D"ltr" class=3D"gmail_attr">Le=C2=A0ven. 20 d=C3=A9c. 2019 =C3=A0=
=C2=A018:05, Patrick Welche <<a href=3D"mailto:prlw1@cam.ac.uk">prlw1@ca=
m.ac.uk</a>> a =C3=A9crit=C2=A0:</div><blockquote class=3D"gmail_quote" =
style=3D"margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:s=
olid;border-left-color:rgb(204,204,204);padding-left:1ex">
=C2=A0mvsata1:4: mvsata_bio_start: drive=3D0<br>
=C2=A0Kernel lock error: _kernel_lock,222: spinout<br></blockquote><div><br=
></div><div>Can you please modify mvsata_bio_start() to log the wait_flags?=
It seems the code path eventually triggers delay() calls, I want to confir=
m it's doing polled I/O there.</div><div><br></div><div>Jaromir=C2=A0</=
div></div></div>
--0000000000003786cf059a3e0164--
From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/52126 CVS commit: src/sys/dev/ic
Date: Sun, 22 Dec 2019 20:54:00 +0000
Module Name: src
Committed By: jdolecek
Date: Sun Dec 22 20:54:00 UTC 2019
Modified Files:
src/sys/dev/ic: mvsata.c
Log Message:
remove check for idle status when disabling EDMA, and always toggle
the disable regardless of current state - this particularly seems
to fail during error recovery, and on my system this also fails
during regular boot
this matches both FreeBSD and Linux drivers - neither of those checks
the idle status
should help with PR kern/52419 and maybe also the lock spinout part
of PR kern/52126
To generate a diff of this commit:
cvs rdiff -u -r1.49 -r1.50 src/sys/dev/ic/mvsata.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 22 Dec 2019 21:13:21 +0000
State-Changed-Why:
Does this still happen with rev. 1.50 of sys/dev/ic/mvsata.c?
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 23 Dec 2019 12:06:21 +0000
From Frank & Michael's comments it sounds as though the spinout is a
secondary symptom of the "lost interrupt"s? This is with mvsata.c v1.51
- example of early lost interrupt:
[ 3.6618488] ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0atastart(chp=0xffff893d7986e010): xfer 0xffff893d79c2c038 channel 4 drive 0
[ 3.6618488] mvsata1:4: mvsata_bio_start: drive=0 wait_flags=0
[ 3.6618488] wdccommand mvsata1:4:0: command=0xc4 cylin=0 head=64 sector=0 count=32 features=0
[ 3.6618488] atastart(chp=0xffff893d7986e010): channel 4 queue_xfer is empty
[ 13.7050587] wdctimeout
mvsata1:4:0: lost interrupt
[ 13.7050587] type: ata tc_bcount: 16384 tc_skip: 0
- the spinout with wait_flags (apparently not polling):
[ 135.6760828] ata_exec_xfer 0xffff893d79c2c038 channel 1 drive 0
[ 135.6760828] atastart from ata_exec_xfer, flags 0x800
[ 135.6760828] ata_queue_alloc_slot: channel 1 qavail 0xffffffff qact 0atastart(chp=0xffffa0801edffa20): xfer 0xffff893d79c2c038 channel 1 drive 0
[ 135.6760828] atastart(chp=0xffffa0801edffa20): channel 1 queue_xfer is empty
[ 135.7134179] atastart(chp=0xffffa0801edffa20): channel 1 queue_xfer is empty
[ 135.7134179] mvsata1:4: mvsata_bio: drive=0, blkno=0, bcount=512
[ 135.7134179] ata_exec_xfer 0xffff893d79c2c038 channel 4 drive 0
[ 135.7134179] atastart from ata_exec_xfer, flags 0x0
[ 135.7134179] ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0atastart(chp=0xffff893d7986e010): xfer 0xffff893d79c2c038 channel 4 drive 0
[ 135.7134179] mvsata1:4: mvsata_bio_start: drive=0 wait_flags=0
[ 135.7134179] ata_channel_freeze_locked(chp=0xffff893d7986e010) -> 1
[ 135.7633699] ata_channel_thaw_locked(chp=0xffff893d7986e010) -> 0
[ 135.7634379] mvsata1:4: mvsata_bio_start: drive=0 wait_flags=0
[ 137.5141317] Kernel lock error: _kernel_lock,222: spinout
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Mon, 23 Dec 2019 12:41:01 +0000
Not all those ata commands are necessarily for wd0 on mvsata - edited recap:
[ 1.0429517] mvsata0 at pci5 dev 1 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
[ 1.0429517] mvsata0: interrupting at ioapic1 pin 1
[ 1.0429517] mvsata0: GenII, 2hc, 4port/hc
[ 1.0429517] atabus0 at mvsata0 channel 0
[ 1.0429517] atabus1 at mvsata0 channel 1
[ 1.0429517] atabus2 at mvsata0 channel 2
[ 1.0429517] atabus3 at mvsata0 channel 3
[ 1.0429517] atabus4 at mvsata0 channel 4
[ 1.0429517] atabus5 at mvsata0 channel 5
[ 1.0429517] atabus6 at mvsata0 channel 6
[ 1.0429517] atabus7 at mvsata0 channel 7
[ 1.0429517] mvsata1 at pci5 dev 2 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
[ 1.0429517] mvsata1: interrupting at ioapic1 pin 2
[ 1.0429517] mvsata1: GenII, 2hc, 4port/hc
[ 1.0429517] atabus8 at mvsata1 channel 0
[ 1.0429517] atabus9 at mvsata1 channel 1
[ 1.0429517] atabus10 at mvsata1 channel 2
[ 1.0429517] atabus11 at mvsata1 channel 3
[ 1.0429517] atabus12 at mvsata1 channel 4
[ 1.0429517] atabus13 at mvsata1 channel 5
[ 1.0429517] atabus14 at mvsata1 channel 6
[ 1.0429517] atabus15 at mvsata1 channel 7
[ 55.8441307] mvsata1: channel 4: drive 0 not ready, st=0xd0, err=0x00
[ 55.8551652] wd0d: device timeout reading fsbn 0 of 0-31 (wd0 bn 0; cn 0 tn 0 sn 0), xfer 38, retry 4
[can't find piixide part which is where wd1 and wd2 are attached as a working
raidframe mirror]
[ 132.4877096] atapibus0 at atabus16: 2 targets
[ 132.6211624] cd0 at atapibus0 drive 0: <TSSTcorpCD/DVDW SH-S182D, , SB04> cdrom removable
[ 132.8502886] wd1 at atabus18 drive 0
[ 133.3554994] wd2 at atabus19 drive 0
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Fri, 4 Sep 2020 16:07:12 +0100
Retrying with NetBSD-9.99.72/amd64, I now see no panics nor lost interrupts,
but simple a hang on boot. Sending break down the serial line doesn't
get me into ddb. Apart from GENERIC, I also tried:
include "arch/amd64/conf/GENERIC"
options DEBUG
options LOCKDEBUG
options ATADEBUG
options ATADEBUG_MASK=0x58
options MVSATA_DEBUG
options WDCNDELAY_DEBUG=50
#options MVSATA_WITHOUTDMA
highlights of which are:
[ 1.0405006] ppb4 at pci1 dev 0 function 3: Intel 63xxESB PCI Express to PCI-X Bridge (rev. 0x01)
[ 1.0405006] ppb4: PCI Express capability version 1 <PCI-E to PCI/PCI-X Bridge>
[ 1.0405006] pci5 at ppb4 bus 5
[ 1.0405006] mvsata0 at pci5 dev 1 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
[ 1.0405006] mvsata0: interrupting at ioapic1 pin 1
[ 1.0405006] mvsata0: GenII, 2hc, 4port/hc
[ 1.0405006] atabus0 at mvsata0 channel 0
[ 1.0405006] atabus1 at mvsata0 channel 1
[ 1.0405006] atabus2 at mvsata0 channel 2
[ 1.0405006] atabus3 at mvsata0 channel 3
[ 1.0405006] atabus4 at mvsata0 channel 4
[ 1.0405006] atabus5 at mvsata0 channel 5
[ 1.0405006] atabus6 at mvsata0 channel 6
[ 1.0405006] atabus7 at mvsata0 channel 7
[ 1.0405006] mvsata1 at pci5 dev 2 function 0: Marvell 88SX6081 SATA II (rev. 0x09)
[ 1.0405006] mvsata1: interrupting at ioapic1 pin 2
[ 1.0405006] mvsata1: GenII, 2hc, 4port/hc
[ 1.0405006] atabus8 at mvsata1 channel 0
[ 1.0405006] atabus9 at mvsata1 channel 1
[ 1.0405006] atabus10 at mvsata1 channel 2
[ 1.0405006] atabus11 at mvsata1 channel 3
[ 1.0405006] atabus12 at mvsata1 channel 4
[ 1.0405006] atabus13 at mvsata1 channel 5
[ 1.0405006] atabus14 at mvsata1 channel 6
[ 1.0405006] atabus15 at mvsata1 channel 7
...
[ 1.0405006] piixide0 at pci0 dev 31 function 1: Intel 631xESB/632xESB IDE Controller (rev. 0x09)
[ 1.0405006] piixide0: primary channel interrupting at ioapic0 pin 14
[ 1.0405006] atabus16 at piixide0 channel 0
[ 1.0405006] piixide0: secondary channel interrupting at ioapic0 pin 15
[ 1.0405006] atabus17 at piixide0 channel 1
[ 1.0405006] ahcisata0 at pci0 dev 31 function 2: Intel 63xxESB AHCI Serial ATA Controller (rev. 0x09)
[ 1.0405006] ahcisata0: AHCI revision 1.10, 6 ports, 32 slots, CAP 0xc6227f05<PSC,SSC,SPM,ISS=0x2=Gen2,SAL,SALP,SNCQ,S64A>
[ 1.0405006] ahcisata0: interrupting at ioapic0 pin 20
[ 1.0405006] atabus18 at ahcisata0 channel 0
[ 1.0405006] atabus19 at ahcisata0 channel 1
[ 1.0405006] atabus20 at ahcisata0 channel 2
[ 1.0405006] atabus21 at ahcisata0 channel 3
[ 1.0405006] atabus22 at ahcisata0 channel 4
[ 1.0405006] atabus23 at ahcisata0 channel 5
...
[ 2.6746267] wdcprobe1:0: before reset, st0=0x50, st1=0x1
[ 2.6746267] wdcprobe1:0 drive 1 wd_cyl_lo: got 0x0 != 0x02
[ 2.6746267] wdcprobe1:0 drive 1 wd_cyl_lo: got 0x0 != 0x01
[ 2.6746267] wdcprobe1:0 drive 1 wd_sector: got 0x41 != 0x01
[ 2.6746267] wdcprobe1:0 drive 1 wd_sector: got 0x42 != 0x02
[ 2.6746267] wdcprobe1:0 drive 1 wd_cyl_lo(2): got 0x0 != 0x01
...
[ 2.7646198] wdcprobe1:1: before reset, st0=0x7f, st1=0x7f
[ 2.7646198] wdcprobe1:1 drive 0 wd_cyl_lo: got 0x7f != 0x02
[ 2.7646198] wdcprobe1:1 drive 0 wd_cyl_lo: got 0x7f != 0x01
[ 2.7646198] wdcprobe1:1 drive 0 wd_sector: got 0x7f != 0x01
[ 2.7646198] wdcprobe1:1 drive 0 wd_sector: got 0x7f != 0x02
[ 2.7646198] wdcprobe1:1 drive 0 wd_cyl_lo(2): got 0x7f != 0x01
[ 2.7646198] wdcprobe1:1 drive 1 wd_cyl_lo: got 0x7f != 0x02
[ 2.7646198] wdcprobe1:1 drive 1 wd_cyl_lo: got 0x7f != 0x01
[ 2.7646198] wdcprobe1:1 drive 1 wd_sector: got 0x7f != 0x01
[ 2.7646198] wdcprobe1:1 drive 1 wd_sector: got 0x7f != 0x02
[ 2.7646198] wdcprobe1:1 drive 1 wd_cyl_lo(2): got 0x7f != 0x01
[ 2.8259235] ata_queue_alloc_slot: channel 0 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 1 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 2 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 3 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 4 qavail 0xffffffff qact 0ata_queue_alloc_slot: channel 5 qavail 0xffffffff qact 0piixide0:0:0: after reset, sc=0x1 sn=0x1 cl=0x14 ch=0xeb
[ 2.8646197] piixide0:0:1: after reset, sc=0x0 sn=0x0 cl=0x0 ch=0x0
[ 2.8746188] piixide0:0: wdcwait_reset() end, st0=0x10 st1=0x0
[ 2.8846202] wdcprobe1:0: after reset, ret_value=0x1
[ 2.8846202] wdcprobe1:0:0: after reset, sc=0x1 sn=0x1 cl=0x14 ch=0xeb
[ 2.8946223] piixide0:0: wait DRDY st0 0x0 st1 0x0
[ 2.9046198] ata_get_params
[ 2.9046198] wdc_exec_command piixide0:0:0
[ 2.9046198] ata_exec_xfer 0xffffd69369140038 channel 0 drive 0
[ 2.9146204] atastart from ata_exec_xfer, flags 0x0
[ 2.9146204] ata_queue_alloc_slot: channel 0 qavail 0x1 qact 0atastart(chp=0xffffdb801f8421e0): xfer 0xffffd69369140038 channel 0 drive 0
[ 2.9346199] __wdccommand_start piixidead=160 sector=0 count=0 features=0
[ 2.9446209] __wdcwait: timeout (time=11), status 90 error 1 (mask 0x8 bits 0x8)
[ 2.9546198] __wdccommand_done piixide0:0:0 flags 0x59
[ 2.9546198] atastart(chp=0xffffdb801f8421e0): channel 0 queue_xfer is empty
[6182] ahcisata0 port 0: device present, speed: 1.5Gb/s
[ 3.9846181] wd0 at atabus12 drive 0
[ 3.9846181] ata_get_params
[ 3.9846181] ata_exec_xfer 0xffffd69369140038 channel 4 drive 0
[ 3.9946226] atastart from ata_exec_xfer, flags 0x0
[ 3.99462_xfer is empty
[ 4.0246200] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 4.0377748] wd0: <ST3500630AS>
[ 4.0377748] wd0: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
[ 4.0446210] ata_get_params
[ 4hannel 4 drive 0
[ 4.0746204] wdccommand mvsata1:4:0: command=0xec cylin=0 head=0 sector=0 count=0 features=0
[ 4.0746204] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 4.0846209] atastart(chp=0xffffd6902bc37008): channel 4 queue channel 4 drive 0
[ 4.1146206] ata_channel_freeze_locked(chp=0xffffd6902bc37008) -> 1
[ 4.1246206] ata_channel_thaw_locked(chp=0xffffd6902bc37008) -> 0
[ 4.1246206] wdccommandshort mvsata1:4:0 command 0x10
[ 4.1246206] __wdcwait: did busy-wait, ts=0
[ 4.1636341] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 4.1705662] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 4.1775598] ata_exec_xfer 0xffffd69369140038 channel 4 drive 0
[ 4.1775598] atastart frotor=0 count=3 features=0
[ 4.1881314] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 4.2162904] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 4.2162904] ata_exec_xfer 0xffffd69369140038 channel 4 drive 0
[ 4. sector=64 count=1 features=0
[ 4.2349051] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 4.2631482] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 4.2701319] ata_exec_xfer 0xffffd69369140038 channel 4 drive 0
[and=0xc4 cylin=0 head=64 sector=0 count=32 features=0
[ 4.2759362] atastart(chp=0xffffd6902bc37008): channel 4 queue_xfer is empty
[ 25.8046209] ipmi0: ID 32.1 IPMI 2.0 Available
hang and no ddb
State-Changed-From-To: feedback->open
State-Changed-By: prlw1@NetBSD.org
State-Changed-When: Sun, 20 Sep 2020 12:57:41 +0000
State-Changed-Why:
feedback given
Responsible-Changed-From-To: jdolecek->kern-bug-people
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Wed, 05 May 2021 19:16:31 +0000
Responsible-Changed-Why:
I won't work on mvsata(4) anymore.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.