NetBSD Problem Report #58776

From Emmanuel.Dreyfus@espci.fr  Fri Oct 25 08:09:20 2024
Return-Path: <Emmanuel.Dreyfus@espci.fr>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 9620E1A9238
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 25 Oct 2024 08:09:20 +0000 (UTC)
Message-Id: <202410250809.49P89Et1022382@volanges.net.espci.fr>
Date: Fri, 25 Oct 2024 10:09:14 +0200 (CEST)
From: manu@netbsd.org
Reply-To: manu@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: RAIDframe panic on I/O error during reconstruction
X-Send-Pr-Version: 3.95

>Number:         58776
>Category:       kern
>Synopsis:       RAIDframe panic on I/O error during reconstruction
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    oster
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Oct 25 08:10:00 +0000 2024
>Last-Modified:  Wed Oct 30 20:15:00 +0000 2024
>Originator:     Emmanuel Dreyfus
>Release:        NetBSD 10.0
>Organization:
NetBSD
>Environment:
NetBSD-10.0/i386
>Description:
On NetBSD-10, a RAID 1 with two disks get un unrecoverable I/O error
during reconstruction, which sometime happens. But then it panics,
which is worse that just failing a RAID reconstruction.

panic: trap     
cpu0: Begin traceback...
vpanic(c0d6708c,df0fcdd0,df0fce8c,c012fd98,c0d6708c,df0fce98,df0fce98,ea,df0fb2c0,210206) at netbsd:vpanic+0x196        
panic(c0d6708c,df0fce98,df0fce98,ea,df0fb2c0,210206,fffffff0,0,0,c42716a0) at netbsd:panic+0x18
trap() at netbsd:trap+0xd51
--- trap (number 6) ---

mutex_vector_enter(c6f02c7c,1,1,c6dd6f44,c6dd6f44,c670316c,c6dd7308,df0fcf78,c078dc03,c670316c) at netbsd:mutex_vector_enter+0x91
rf_CauseReconEvent(c670316c,1,c6dd6f44,9,c6703244,df0fcf9c,c077e310,c6dd6f44,5,5) at netbsd:rf_CauseReconEvent+0x58
ReconWriteDoneProc(c6dd6f44,5,5,c6703248,c670316c,c077e271,c6c19c00,0,c0102011,c670316c) at netbsd:ReconWriteDoneProc+0x79
rf_RaidIOThread(c670316c,43e6000,43fb000,0,c01005a8,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x9f

>How-To-Repeat:
	raidctl -R /dev/dk3 raid1
	But it is probably very specific to the hardware state.
>Fix:
	None known yet

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->oster
Responsible-Changed-By: oster@NetBSD.org
Responsible-Changed-When: Fri, 25 Oct 2024 23:15:00 +0000
Responsible-Changed-Why:
RAIDframe PRs are likely for me.


From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Sun, 27 Oct 2024 09:58:05 -0600

 On 2024-10-25 02.10, manu@netbsd.org wrote:
 >> Number:         58776
 >> Category:       kern
 >> Synopsis:       RAIDframe panic on I/O error during reconstruction
 >> Confidential:   no
 >> Severity:       serious
 >> Priority:       low
 >> Responsible:    kern-bug-people
 >> State:          open
 >> Class:          sw-bug
 >> Submitter-Id:   net
 >> Arrival-Date:   Fri Oct 25 08:10:00 +0000 2024
 >> Originator:     Emmanuel Dreyfus
 >> Release:        NetBSD 10.0
 >> Organization:
 > NetBSD
 >> Environment:
 > NetBSD-10.0/i386
 >> Description:
 > On NetBSD-10, a RAID 1 with two disks get un unrecoverable I/O error
 > during reconstruction, which sometime happens. But then it panics,
 > which is worse that just failing a RAID reconstruction.
 > 
 > panic: trap
 > cpu0: Begin traceback...
 > vpanic(c0d6708c,df0fcdd0,df0fce8c,c012fd98,c0d6708c,df0fce98,df0fce98,ea,df0fb2c0,210206) at netbsd:vpanic+0x196
 > panic(c0d6708c,df0fce98,df0fce98,ea,df0fb2c0,210206,fffffff0,0,0,c42716a0) at netbsd:panic+0x18
 > trap() at netbsd:trap+0xd51
 > --- trap (number 6) ---
 >                                                                
 > mutex_vector_enter(c6f02c7c,1,1,c6dd6f44,c6dd6f44,c670316c,c6dd7308,df0fcf78,c078dc03,c670316c) at netbsd:mutex_vector_enter+0x91
 > rf_CauseReconEvent(c670316c,1,c6dd6f44,9,c6703244,df0fcf9c,c077e310,c6dd6f44,5,5) at netbsd:rf_CauseReconEvent+0x58
 > ReconWriteDoneProc(c6dd6f44,5,5,c6703248,c670316c,c077e271,c6c19c00,0,c0102011,c670316c) at netbsd:ReconWriteDoneProc+0x79
 > rf_RaidIOThread(c670316c,43e6000,43fb000,0,c01005a8,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x9f
 > 
 >> How-To-Repeat:
 > 	raidctl -R /dev/dk3 raid1
 > 	But it is probably very specific to the hardware state.

 Do you know if the I/O error was on /dev/dk3 or on the other component 
 of raid1?  Did you catch anything else printed from the kernel?  What 
 does the RAID config look like for this setup?  (In particular, is 
 /dev/dk3 component 0 or component 1?)

 Just trying to figure out if the I/O error has happened for an I/O 
 before this Recon Event, or for this Recon Event.

 Thanks.

 Later...

 Greg Oster


From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Sun, 27 Oct 2024 16:09:24 +0000

 On Sun, Oct 27, 2024 at 04:00:03PM +0000, Greg Oster wrote:
 >  Do you know if the I/O error was on /dev/dk3 or on the other component 
 >  of raid1?  Did you catch anything else printed from the kernel?  What 
 >  does the RAID config look like for this setup?  (In particular, is 
 >  /dev/dk3 component 0 or component 1?)

 The I/O error happened on dk3. At some time I tried to swap the
 disks, and the I/O error followed the SATA port, hence the disk
 getting the failure has alwas been wd3 and the RAID component dk3. 

 Before the swap, I got the panic because of a write error on dk3. 
 After the swap, it was a read error on dk3.


 $ raidctl -s raid1
 Components:
             /dev/dk3: optimal
             /dev/dk2: failed
 No spares.
 Component label for /dev/dk3:
    Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
    Version: 2, Serial Number: 20240729, Mod Counter: 25860
    Clean: No, Status: 0
    sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
    Queue size: 100, blocksize: 512, numBlocks: 19532873568
    RAID Level: 1
    Autoconfig: Yes
    Root partition: No
    Last configured as: raid1
 /dev/dk2 status is: failed.  Skipping label.
 Parity status: DIRTY
 Reconstruction is 100% complete.
 Parity Re-write is 100% complete.
 Copyback is 100% complete.

 I initiated another raid1 reconstruction, I will provide you the
 exact kernel output. 

 Note that I plan to move the offending disk to another unused 
 SATA port in 14 hours in order to fix the setup. I hope I will
 not be able to reproduce the problem anymore.


 -- 
 Emmanuel Dreyfus
 manu@netbsd.org

From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Mon, 28 Oct 2024 13:28:27 +0000

 On Sun, Oct 27, 2024 at 04:00:03PM +0000, Greg Oster wrote:
 >  Did you catch anything else printed from the kernel?  

 This time it was wd2 write error 

 [ 144555.7191324] raid1: initiating in-place reconstruction on column 1
 [ 191191.9781226] wd2d: device timeout writing fsbn 15953755106 of 15953755106-15953755137 (wd2 bn 15953755106; cn 15827138 tn 0 sn 2), xfer d68, retry 0
 [ 191191.9914893] wd2d: device timeout writing fsbn 15953755138 of 15953755138-15953755169 (wd2 bn 15953755138; cn 15827138 tn 0 sn 34), xfer ef8, retry 0
 [ 191192.0050090] wd2d: device timeout writing fsbn 15953755170 of 15953755170-15953755201 (wd2 bn 15953755170; cn 15827138 tn 1 sn 3), xfer e94, retry 0
 [ 191192.0184433] wd2d: device timeout writing fsbn 15953755202 of 15953755202-15953755233 (wd2 bn 15953755202; cn 15827138 tn 1 sn 35), xfer b74, retry 0
 [ 191192.7281226] wd2: soft error (corrected) xfer d68
 [ 191192.7396135] wd2: soft error (corrected) xfer b74
 [ 191192.7396135] wd2: soft error (corrected) xfer e94
 [ 191192.7493198] wd2: soft error (corrected) xfer ef8
 [ 191241.6781226] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 0
 [ 191241.6916686] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 0
 [ 191241.7052746] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 0
 [ 191241.7188813] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 0
 [ 191252.2181226] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 1
 [ 191252.2316606] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 1
 [ 191252.2452678] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 1
 [ 191252.2588739] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 1
 [ 191262.7581228] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 2
 [ 191262.7716625] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 2
 [ 191262.7852692] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 2
 [ 191262.7988751] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 2
 [ 191273.2981226] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 3
 [ 191273.3116644] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 3
 [ 191273.3252700] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 3
 [ 191273.3388764] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 3
 [ 191680.6601668] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44), xfer f5c, retry 4
 [ 191690.6751717] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13), xfer bd8, retry 4
 [ 191700.6901767] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45), xfer e94, retry 4
 [ 191710.7051816] wd2d: device timeout writing fsbn 15966378050 of 15966378050-15966378081 (wd2 bn 15966378050; cn 15839660 tn 12 sn 14), xfer d04, retry 4
 [ 191991.0752951] wd2d: device timeout writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44)
 [ 191991.0871980] wd2d: error writing fsbn 15966377954 of 15966377954-15966377985 (wd2 bn 15966377954; cn 15839660 tn 10 sn 44)
 [ 191991.0970825] raid1: Recon write failed (status 5(0x5))!
 [ 191991.0970825] raid1: reconstruction failed.
 [ 192001.1003051] wd2d: device timeout writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13)
 [ 192001.1122002] wd2d: error writing fsbn 15966377986 of 15966377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13)
 [ 192001.1220854] raid1: Recon write failed (status 5(0x5))!
 [ 192001.1220854] raid1: 566502314 recon event waits, 11 recon delays
 [ 192001.1303189] raid1: 2821808363 max exec ticks
 [ 192011.1253150] wd2d: device timeout writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45)
 [ 192011.1372114] wd2d: error writing fsbn 15966378018 of 15966378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45)
 [ 192011.1470958] raid1: Recon write failed (status 5(0x5))!
 [ 192011.1470958] uvm_fault(0xc420f140, 0xfffff000, 1) -> 0xe
 [ 192011.1592364] fatal page fault in supervisor mode
 [ 192011.1592364] trap type 6 code 0 eip 0xc0900a08 cs 0x8 eflags 0x210206 cr2 0xfffffff0 ilevel 0 esp 0
 [ 192011.1731966] curlwp 0xc6dde3c0 pid 0 lid 234 lowest kstack 0xdf0f72c0
 [ 192011.1797854] panic: trap
 [ 192011.1797854] cpu0: Begin traceback...
 [ 192011.1862893] vpanic(c0d6708c,df0f8dd0,df0f8e8c,c012fd98,c0d6708c,df0f8e98,df0f8e98,ea,df0f72c0,210206) at netbsd:vpanic+0x196
 [ 192011.1957062] panic(c0d6708c,df0f8e98,df0f8e98,ea,df0f72c0,210206,fffffff0,0,0,c42716a0) at netbsd:panic+0x18
 [ 192011.2055306] trap() at netbsd:trap+0xd51
 [ 192011.2055306] --- trap (number 6) ---
 [ 192011.2173051] mutex_vector_enter(c75b507c,1,1,c6dd7f44,c6dd7f44,c670916c,c6dd8308,df0f8f78,c078dc03,c670916c) at netbsd:mutex_vector_enter+0x91
 [ 192011.2259154] rf_CauseReconEvent(c670916c,1,c6dd7f44,9,c6709244,df0f8f9c,c077e310,c6dd7f44,5,5) at netbsd:rf_CauseReconEvent+0x58
 [ 192011.2354273] ReconWriteDoneProc(c6dd7f44,5,5,c6709248,c670916c,c077e271,c6dde3c0,0,c0102011,c670916c) at netbsd:ReconWriteDoneProc+0x79
 [ 192011.2557962] rf_RaidIOThread(c670916c,43e6000,43fb000,0,c01005a8,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x9f
 [ 192011.2661329] cpu0: End traceback...

 -- 
 Emmanuel Dreyfus
 manu@netbsd.org

From: Emmanuel Dreyfus <manu@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Tue, 29 Oct 2024 00:40:17 +0000

 On Sun, Oct 27, 2024 at 04:10:01PM +0000, Emmanuel Dreyfus via gnats wrote:
 >  Note that I plan to move the offending disk to another unused 
 >  SATA port in 14 hours in order to fix the setup. I hope I will
 >  not be able to reproduce the problem anymore.

 It did not help. Te setup is still available for investigation.

 -- 
 Emmanuel Dreyfus
 manu@netbsd.org

From: oster@netbsd.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Wed, 30 Oct 2024 19:28:52 +0000

 October 28, 2024 at 7:30 AM, "Emmanuel Dreyfus via gnats" <gnats-admin@ne=
 tbsd.org> wrote:

 >=20=20
 >=20 This time it was wd2 write error=20
 [snip]
 >=20 [ 191991.0970825] raid1: Recon write failed (status 5(0x5))!
 >  [ 191991.0970825] raid1: reconstruction failed.
 >  [ 192001.1003051] wd2d: device timeout writing fsbn 15966377986 of 159=
 66377986-15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13)
 >  [ 192001.1122002] wd2d: error writing fsbn 15966377986 of 15966377986-=
 15966378017 (wd2 bn 15966377986; cn 15839660 tn 11 sn 13)
 >  [ 192001.1220854] raid1: Recon write failed (status 5(0x5))!
 >  [ 192001.1220854] raid1: 566502314 recon event waits, 11 recon delays
 >  [ 192001.1303189] raid1: 2821808363 max exec ticks
 >  [ 192011.1253150] wd2d: device timeout writing fsbn 15966378018 of 159=
 66378018-15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45)
 >  [ 192011.1372114] wd2d: error writing fsbn 15966378018 of 15966378018-=
 15966378049 (wd2 bn 15966378018; cn 15839660 tn 11 sn 45)
 >  [ 192011.1470958] raid1: Recon write failed (status 5(0x5))!

 Thanks for this.

 So this "Recon write failed" is showing up *after* raid1 thinks that=20
 the=20reconstruction has failed and is already done...  The code in=20
 rf_reconstruct.c:ProcessReconEvent()=20in the RF_REVENT_WRITE_FAILED=20
 case=20may not be sufficient.  More likely, it's the code in=20
 rf_reconstruct.c:rf_ContinueReconstructFailedDisk()=20in the
 'if (recon_error) { /* we've encountered an error in reconstructing. */` =
 case,=20
 where=20perhaps we're not waiting for enough writes to complete?=20=20
 (I=20need to look at how the IOs are scheduled again to figure out=20
 if/how=20those 'extra writes' are getting generated, and then exactly=20
 how=20to account for them..)

 In any event, this error path is not as well tested as it could/should be=
 .

 Later...

 Greg Oster

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58776: RAIDframe panic on I/O error during reconstruction
Date: Wed, 30 Oct 2024 20:13:49 -0000 (UTC)

 gnats-admin@NetBSD.org ("oster@netbsd.org via gnats") writes:

 > So this "Recon write failed" is showing up *after* raid1 thinks that=20
 > the=20reconstruction has failed and is already done...  The code in=20
 > rf_reconstruct.c:ProcessReconEvent()=20in the RF_REVENT_WRITE_FAILED=20
 > case=20may not be sufficient.  More likely, it's the code in=20
 > rf_reconstruct.c:rf_ContinueReconstructFailedDisk()=20in the
 > 'if (recon_error) { /* we've encountered an error in reconstructing. */` =
 > case,=20
 > where=20perhaps we're not waiting for enough writes to complete?=20=20
 > (I=20need to look at how the IOs are scheduled again to figure out=20
 > if/how=20those 'extra writes' are getting generated, and then exactly=20
 > how=20to account for them..)


 It's a use-after free of the rctrl->eq_mutex. No idea how that happens.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.49 2026/05/14 01:52:41 riastradh Exp $
$NetBSD: gnats_config.sh,v 1.10 2026/05/13 22:00:09 riastradh Exp $
Copyright © 1994-2026 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.