NetBSD Problem Report #58776
From Emmanuel.Dreyfus@espci.fr Fri Oct 25 08:09:20 2024
Return-Path: <Emmanuel.Dreyfus@espci.fr>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
client-signature RSA-PSS (2048 bits) client-digest SHA256)
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 9620E1A9238
for <gnats-bugs@gnats.NetBSD.org>; Fri, 25 Oct 2024 08:09:20 +0000 (UTC)
Message-Id: <202410250809.49P89Et1022382@volanges.net.espci.fr>
Date: Fri, 25 Oct 2024 10:09:14 +0200 (CEST)
From: manu@netbsd.org
Reply-To: manu@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: RAIDframe panic on I/O error during reconstruction
X-Send-Pr-Version: 3.95
>Number: 58776
>Category: kern
>Synopsis: RAIDframe panic on I/O error during reconstruction
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: oster
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Oct 25 08:10:00 +0000 2024
>Last-Modified: Wed Oct 30 20:15:00 +0000 2024
>Originator: Emmanuel Dreyfus
>Release: NetBSD 10.0
>Organization:
NetBSD
>Environment:
NetBSD-10.0/i386
>Description:
On NetBSD-10, a RAID 1 with two disks get un unrecoverable I/O error
during reconstruction, which sometime happens. But then it panics,
which is worse that just failing a RAID reconstruction.
panic: trap
cpu0: Begin traceback...
vpanic(c0d6708c,df0fcdd0,df0fce8c,c012fd98,c0d6708c,df0fce98,df0fce98,ea,df0fb2c0,210206) at netbsd:vpanic+0x196
panic(c0d6708c,df0fce98,df0fce98,ea,df0fb2c0,210206,fffffff0,0,0,c42716a0) at netbsd:panic+0x18
trap() at netbsd:trap+0xd51
--- trap (number 6) ---
mutex_vector_enter(c6f02c7c,1,1,c6dd6f44,c6dd6f44,c670316c,c6dd7308,df0fcf78,c078dc03,c670316c) at netbsd:mutex_vector_enter+0x91
rf_CauseReconEvent(c670316c,1,c6dd6f44,9,c6703244,df0fcf9c,c077e310,c6dd6f44,5,5) at netbsd:rf_CauseReconEvent+0x58
ReconWriteDoneProc(c6dd6f44,5,5,c6703248,c670316c,c077e271,c6c19c00,0,c0102011,c670316c) at netbsd:ReconWriteDoneProc+0x79
rf_RaidIOThread(c670316c,43e6000,43fb000,0,c01005a8,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x9f
>How-To-Repeat:
raidctl -R /dev/dk3 raid1
But it is probably very specific to the hardware state.
>Fix:
None known yet
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->oster
Responsible-Changed-By: oster@NetBSD.org
Responsible-Changed-When: Fri, 25 Oct 2024 23:15:00 +0000
Responsible-Changed-Why:
RAIDframe PRs are likely for me.
From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Sun, 27 Oct 2024 09:58:05 -0600
On 2024-10-25 02.10, manu@netbsd.org wrote:
>> Number: 58776
>> Category: kern
>> Synopsis: RAIDframe panic on I/O error during reconstruction
>> Confidential: no
>> Severity: serious
>> Priority: low
>> Responsible: kern-bug-people
>> State: open
>> Class: sw-bug
>> Submitter-Id: net
>> Arrival-Date: Fri Oct 25 08:10:00 +0000 2024
>> Originator: Emmanuel Dreyfus
>> Release: NetBSD 10.0
>> Organization:
> NetBSD
>> Environment:
> NetBSD-10.0/i386
>> Description:
> On NetBSD-10, a RAID 1 with two disks get un unrecoverable I/O error
> during reconstruction, which sometime happens. But then it panics,
> which is worse that just failing a RAID reconstruction.
>
> panic: trap
> cpu0: Begin traceback...
> vpanic(c0d6708c,df0fcdd0,df0fce8c,c012fd98,c0d6708c,df0fce98,df0fce98,ea,df0fb2c0,210206) at netbsd:vpanic+0x196
> panic(c0d6708c,df0fce98,df0fce98,ea,df0fb2c0,210206,fffffff0,0,0,c42716a0) at netbsd:panic+0x18
> trap() at netbsd:trap+0xd51
> --- trap (number 6) ---
>
> mutex_vector_enter(c6f02c7c,1,1,c6dd6f44,c6dd6f44,c670316c,c6dd7308,df0fcf78,c078dc03,c670316c) at netbsd:mutex_vector_enter+0x91
> rf_CauseReconEvent(c670316c,1,c6dd6f44,9,c6703244,df0fcf9c,c077e310,c6dd6f44,5,5) at netbsd:rf_CauseReconEvent+0x58
> ReconWriteDoneProc(c6dd6f44,5,5,c6703248,c670316c,c077e271,c6c19c00,0,c0102011,c670316c) at netbsd:ReconWriteDoneProc+0x79
> rf_RaidIOThread(c670316c,43e6000,43fb000,0,c01005a8,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x9f
>
>> How-To-Repeat:
> raidctl -R /dev/dk3 raid1
> But it is probably very specific to the hardware state.
Do you know if the I/O error was on /dev/dk3 or on the other component
of raid1? Did you catch anything else printed from the kernel? What
does the RAID config look like for this setup? (In particular, is
/dev/dk3 component 0 or component 1?)
Just trying to figure out if the I/O error has happened for an I/O
before this Recon Event, or for this Recon Event.
Thanks.
Later...
Greg Oster
From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Sun, 27 Oct 2024 16:09:24 +0000
On Sun, Oct 27, 2024 at 04:00:03PM +0000, Greg Oster wrote:
> Do you know if the I/O error was on /dev/dk3 or on the other component
> of raid1? Did you catch anything else printed from the kernel? What
> does the RAID config look like for this setup? (In particular, is
> /dev/dk3 component 0 or component 1?)
The I/O error happened on dk3. At some time I tried to swap the
disks, and the I/O error followed the SATA port, hence the disk
getting the failure has alwas been wd3 and the RAID component dk3.
Before the swap, I got the panic because of a write error on dk3.
After the swap, it was a read error on dk3.
$ raidctl -s raid1
Components:
/dev/dk3: optimal
/dev/dk2: failed
No spares.
Component label for /dev/dk3:
Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
Version: 2, Serial Number: 20240729, Mod Counter: 25860
Clean: No, Status: 0
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 19532873568
RAID Level: 1
Autoconfig: Yes
Root partition: No
Last configured as: raid1
/dev/dk2 status is: failed. Skipping label.
Parity status: DIRTY
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
I initiated another raid1 reconstruction, I will provide you the
exact kernel output.
Note that I plan to move the offending disk to another unused
SATA port in 14 hours in order to fix the setup. I hope I will
not be able to reproduce the problem anymore.
--
Emmanuel Dreyfus
manu@netbsd.org
From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Mon, 28 Oct 2024 13:28:27 +0000
On Sun, Oct 27, 2024 at 04:00:03PM +0000, Greg Oster wrote:
> Did you catch anything else printed from the kernel?
This time it was wd2 write error
[ 144555.7191324] raid1: initiating in-place reconstruction on column 1
[ 191191.9781226] wd2d: device timeout writing fsbn 15953755106 of 15953755106-15953755137 (wd2 bn 15953755106; cn 15827138 tn 0 sn 2), xfer d68, retry 0
[ 191191.9914893] wd2d: device timeout writing fsbn 15953755138 of 15953755138-15953755169 (wd2 bn 15953755138; cn 15827138 tn 0 sn 34), xfer ef8, retry 0
[ 191192.0050090] wd2d: device timeout writing fsbn 15953755170 of 15953755170-15953755201 (wd2 bn 15953755170; cn 15827138 tn 1 sn 3), xfer e94, retry 0
[ 191192.0184433] wd2d: device timeout writing fsbn 15953755202 of 15953755202-15953755233 (wd2 bn 15953755202; cn 15827138 tn 1 sn 35), xfer b74, retry 0
[ 191192.7281226] wd2: soft error (corrected) xfer d68
[ 191192.7396135] wd2: soft error (corrected) xfer b74
[ 191192.7396135] wd2: soft error (corrected) xfer e94
[ 191192.7493198] wd2: soft error (corrected) xfer ef8
[ 191241.6781226] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 0
[ 191241.6916686] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 0
[ 191241.7052746] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 0
[ 191241.7188813] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 0
[ 191252.2181226] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 1
[ 191252.2316606] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 1
[ 191252.2452678] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 1
[ 191252.2588739] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 1
[ 191262.7581228] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 2
[ 191262.7716625] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 2
[ 191262.7852692] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 2
[ 191262.7988751] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 2
[ 191273.2981226] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 3
[ 191273.3116644] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 3
[ 191273.3252700] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 3
[ 191273.3388764] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 3
[ 191680.6601668] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 4
[ 191690.6751717] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 4
[ 191700.6901767] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 4
[ 191710.7051816] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 4
[ 191991.0752951] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44)
[ 191991.0871980] wd2d: error writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44)
[ 191991.0970825] raid1: Recon write failed (status 5(0x5))!
[ 191991.0970825] raid1: reconstruction failed.
[ 192001.1003051] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13)
[ 192001.1122002] wd2d: error writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13)
[ 192001.1220854] raid1: Recon write failed (status 5(0x5))!
[ 192001.1220854] raid1: 566502314 recon event waits, 11 recon delays
[ 192001.1303189] raid1: 2821808363 max exec ticks
[ 192011.1253150] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45)
[ 192011.1372114] wd2d: error writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45)
[ 192011.1470958] raid1: Recon write failed (status 5(0x5))!
[ 192011.1470958] uvm_fault(0xc420f140, 0xfffff000, 1) -> 0xe
[ 192011.1592364] fatal page fault in supervisor mode
[ 192011.1592364] trap type 6 code 0 eip 0xc0900a08 cs 0x8 eflags 0x210206 cr2 0xfffffff0 ilevel 0 esp 0
[ 192011.1731966] curlwp 0xc6dde3c0 pid 0 lid 234 lowest kstack 0xdf0f72c0
[ 192011.1797854] panic: trap
[ 192011.1797854] cpu0: Begin traceback...
[ 192011.1862893] vpanic(c0d6708c,df0f8dd0,df0f8e8c,c012fd98,c0d6708c,df0f8e98,df0f8e98,ea,df0f72c0,210206) at netbsd:vpanic+0x196
[ 192011.1957062] panic(c0d6708c,df0f8e98,df0f8e98,ea,df0f72c0,210206,fffffff0,0,0,c42716a0) at netbsd:panic+0x18
[ 192011.2055306] trap() at netbsd:trap+0xd51
[ 192011.2055306] --- trap (number 6) ---
[ 192011.2173051] mutex_vector_enter(c75b507c,1,1,c6dd7f44,c6dd7f44,c670916c,c6dd8308,df0f8f78,c078dc03,c670916c) at netbsd:mutex_vector_enter+0x91
[ 192011.2259154] rf_CauseReconEvent(c670916c,1,c6dd7f44,9,c6709244,df0f8f9c,c077e310,c6dd7f44,5,5) at netbsd:rf_CauseReconEvent+0x58
[ 192011.2354273] ReconWriteDoneProc(c6dd7f44,5,5,c6709248,c670916c,c077e271,c6dde3c0,0,c0102011,c670916c) at netbsd:ReconWriteDoneProc+0x79
[ 192011.2557962] rf_RaidIOThread(c670916c,43e6000,43fb000,0,c01005a8,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x9f
[ 192011.2661329] cpu0: End traceback...
--
Emmanuel Dreyfus
manu@netbsd.org
From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Tue, 29 Oct 2024 00:40:17 +0000
On Sun, Oct 27, 2024 at 04:10:01PM +0000, Emmanuel Dreyfus via gnats wrote:
> Note that I plan to move the offending disk to another unused
> SATA port in 14 hours in order to fix the setup. I hope I will
> not be able to reproduce the problem anymore.
It did not help. Te setup is still available for investigation.
--
Emmanuel Dreyfus
manu@netbsd.org
From: oster@netbsd.org
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Wed, 30 Oct 2024 19:28:52 +0000
October 28, 2024 at 7:30 AM, "Emmanuel Dreyfus via gnats" <gnats-admin@ne=
tbsd.org> wrote:
>=20=20
>=20 This time it was wd2 write error=20
[snip]
>=20 [ 191991.0970825] raid1: Recon write failed (status 5(0x5))!
> [ 191991.0970825] raid1: reconstruction failed.
> [ 192001.1003051] wd2d: device timeout writing fsbn 15966377986 of 159=
66377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13)
> [ 192001.1122002] wd2d: error writing fsbn 15966377986 of 15966377986-=
15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13)
> [ 192001.1220854] raid1: Recon write failed (status 5(0x5))!
> [ 192001.1220854] raid1: 566502314 recon event waits, 11 recon delays
> [ 192001.1303189] raid1: 2821808363 max exec ticks
> [ 192011.1253150] wd2d: device timeout writing fsbn 15966378018 of 159=
66378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45)
> [ 192011.1372114] wd2d: error writing fsbn 15966378018 of 15966378018-=
15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45)
> [ 192011.1470958] raid1: Recon write failed (status 5(0x5))!
Thanks for this.
So this "Recon write failed" is showing up *after* raid1 thinks that=20
the=20reconstruction has failed and is already done... The code in=20
rf_reconstruct.c:ProcessReconEvent()=20in the RF_REVENT_WRITE_FAILED=20
case=20may not be sufficient. More likely, it's the code in=20
rf_reconstruct.c:rf_ContinueReconstructFailedDisk()=20in the
'if (recon_error) { /* we've encountered an error in reconstructing. */` =
case,=20
where=20perhaps we're not waiting for enough writes to complete?=20=20
(I=20need to look at how the IOs are scheduled again to figure out=20
if/how=20those 'extra writes' are getting generated, and then exactly=20
how=20to account for them..)
In any event, this error path is not as well tested as it could/should be=
.
Later...
Greg Oster
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Wed, 30 Oct 2024 20:13:49 -0000 (UTC)
gnats-admin@NetBSD.org ("oster@netbsd.org via gnats") writes:
> So this "Recon write failed" is showing up *after* raid1 thinks that=20
> the=20reconstruction has failed and is already done... The code in=20
> rf_reconstruct.c:ProcessReconEvent()=20in the RF_REVENT_WRITE_FAILED=20
> case=20may not be sufficient. More likely, it's the code in=20
> rf_reconstruct.c:rf_ContinueReconstructFailedDisk()=20in the
> 'if (recon_error) { /* we've encountered an error in reconstructing. */` =
> case,=20
> where=20perhaps we're not waiting for enough writes to complete?=20=20
> (I=20need to look at how the IOs are scheduled again to figure out=20
> if/how=20those 'extra writes' are getting generated, and then exactly=20
> how=20to account for them..)
It's a use-after free of the rctrl->eq_mutex. No idea how that happens.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.49 2026/05/14 01:52:41 riastradh Exp $
$NetBSD: gnats_config.sh,v 1.10 2026/05/13 22:00:09 riastradh Exp $
Copyright © 1994-2026
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.