NetBSD Problem Report #43986
From woods@once.weird.com Tue Oct 19 19:41:20 2010
Return-Path: <woods@once.weird.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id 0A33C63B11D
for <gnats-bugs@gnats.NetBSD.org>; Tue, 19 Oct 2010 19:41:20 +0000 (UTC)
Message-Id: <m1P8H4t-001ZAkC@once.weird.com>
Date: Tue, 19 Oct 2010 14:38:27 -0400 (EDT)
From: "Greg A. Woods" <woods@planix.com>
Sender: "Greg A. Woods" <woods@once.weird.com>
Reply-To: "Greg A. Woods" <woods@planix.com>
To: gnats-bugs@gnats.NetBSD.org
Subject: ataraid(4) doesn't seem to handle weird array configs very gracefully
X-Send-Pr-Version: 3.95
>Number: 43986
>Category: kern
>Synopsis: ataraid(4) doesn't seem to handle weird array configs very gracefully
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: analyzed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Oct 19 19:45:00 +0000 2010
>Closed-Date:
>Last-Modified: Tue Mar 26 02:16:44 +0000 2024
>Originator: Greg A. Woods
>Release: NetBSD 4.0_STABLE 2010/10/15
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:
System: NetBSD historically 4.0_STABLE NetBSD 4.0_STABLE (GENERIC.MP) #1: Fri Oct 15 13:14:43 EDT 2010 woods@once:/rest/build/woods/once/netbsd-4-i386-i386-ppro-obj/rest/work/woods/m-NetBSD-4/sys/arch/i386/compile/GENERIC.MP i386
Architecture: i386
Machine: i386
>Description:
ataraid(4) doesn't seem to handle weird array configs very gracefully:
wd1 at atabus2 drive 0: <WDC WD2000JD-00HBB0>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
rnd: wd1 attached as an entropy source (collecting)
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(piixide1:0:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
MAGIC == 0x00
wd2 at atabus3 drive 0: <WDC WD2000JD-00HBB0>
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
rnd: wd2 attached as an entropy source (collecting)
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2(piixide1:1:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
ataraid0: found 1 RAID volume
ld0 at ataraid0 vendtype 1 unit 0: Adaptec ATA RAID-1 array
uvm_fault(0xc0ab2c40, 0, 1) -> 0xe
kernel: supervisor trap page fault, code=0
Stopped in pid 0.1 (swapper) at netbsd:strncmp+0x23: movzbl 0(%ebx),%eax
db{0}> trace
strncmp(c095397a,1c,6,282,c0a9564c) at netbsd:strncmp+0x23
devsw_name2blk(1c,0,0,0,0) at netbsd:devsw_name2blk+0x7e
ld_ataraid_attach(c3d9c940,c3dede00,c3da0600,c37d4000,c37f4000) at netbsd:ld_ataraid_attach+0x1a8
config_attach_loc(c3d9c940,c0a0bb40,c0b96b54,c3da0600,c053c470) at netbsd:config_attach_loc+0x3b0
ataraid_attach(0,c3d9c940,0,c04547fd,c095304c) at netbsd:ataraid_attach+0x87
config_attach_pseudo(c0a20e28,c0a20e00,c09c4c6c,241,7) at netbsd:config_attach_pseudo+0x213
ata_raid_finalize(0,c0a1cfe0,0,20,20) at netbsd:ata_raid_finalize+0x4d
config_finalize(c0ab4114,20,c0967934,0,0) at netbsd:config_finalize+0x29
main(fbff,c01002d2,0,0,0) at netbsd:main+0x297
db{0}>
>How-To-Repeat:
1. set up a RAID-1 mirror in the hardware RAID controller config
2. disconnect one drive
3. delete the array
4. connect the drive
5. boot
I.e., I think the problem was that the disconnected drive was
still identified as an array member, so the array re-appeared,
but the second drive was in JBOD mode because it had been the
only one connected when the array was initially deleted.
>Fix:
>Release-Note:
>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: "Greg A. Woods" <woods@planix.com>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/43986: ataraid(4) doesn't seem to handle weird array configs very gracefully
Date: Tue, 26 Mar 2024 02:12:00 +0000
> uvm_fault(0xc0ab2c40, 0, 1) -> 0xe
> kernel: supervisor trap page fault, code=0
> Stopped in pid 0.1 (swapper) at netbsd:strncmp+0x23: movzbl 0(%ebx),%eax
> db{0}> trace
> strncmp(c095397a,1c,6,282,c0a9564c) at netbsd:strncmp+0x23
This is an attempt to access page zero. Not exactly a null
pointer dereference, but a near-null pointer dereference -- the second
argument is 0x1c.
> devsw_name2blk(1c,0,0,0,0) at netbsd:devsw_name2blk+0x7e
It probably happened here:
535 if (strncmp(conv->d_name, name, len) != 0)
https://nxr.netbsd.org/xref/src/sys/kern/subr_devsw.c?r=1.28#535
> ld_ataraid_attach(c3d9c940,c3dede00,c3da0600,c37d4000,c37f4000) at netbsd:ld_ataraid_attach+0x1a8
The devsw_name2blk call is probably this one:
78 bmajor = devsw_name2blk(device_xname(adi->adi_dev), NULL, 0);
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_subr.c?r=1.2#78
0x1c is offsetof(struct device, dv_xname) on i386 as of around 2010:
https://nxr.netbsd.org/xref/src/sys/sys/device.h?r=1.137#141
So adi->adi_dev is probably null here.
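To illustrate why a null device pointer shows up as address 0x1c rather
than 0, here is a minimal sketch in C. The struct below is a hypothetical
stand-in, not the real struct device layout: it only assumes that 0x1c
bytes of other fields precede the name buffer, as the offsetof value
quoted above suggests for i386 around 2010.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct device: 0x1c bytes of other
 * fields, then the name buffer.  Not the real NetBSD layout.
 */
struct mock_device {
	char	dv_other[0x1c];
	char	dv_xname[16];
};

/*
 * Address that &dev->dv_xname would evaluate to for a given base
 * address.  Computed with integers, so nothing is dereferenced.
 */
static uintptr_t
mock_xname_addr(uintptr_t base)
{
	return base + offsetof(struct mock_device, dv_xname);
}
```

With a base of 0 this yields 0x1c, which matches the pointer passed down
to strncmp in the trace.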
ata_raid_disk_vnode_find assumes that adi->adi_dev is nonnull. This
is probably a bug on its own: the caller handles a null return, so
either ata_raid_disk_vnode_find should return null when its input is
null, or the caller should avoid calling it with a null input.
225 adi = &aai->aai_disks[i];
226 vp = ata_raid_disk_vnode_find(adi);
227 if (vp == NULL) {
228 /*
229 * XXX This is bogus. We should just mark the
230 * XXX component as FAILED, and write-back new
231 * XXX config blocks.
232 */
233 break;
234 }
https://nxr.netbsd.org/xref/src/sys/dev/ata/ld_ataraid.c?r=1.37#225
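The first option (null in, null out) could look roughly like the sketch
below. All types and the function name are hypothetical, simplified
stand-ins rather than the kernel's own definitions; the only point is the
guard at the top, which turns a missing component into the null return
the caller already handles.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct sketch_vnode {
	int	v_id;
};
struct sketch_device {
	const char		*dv_xname;
	struct sketch_vnode	*dv_vnode;
};
struct sketch_disk_info {
	struct sketch_device	*adi_dev;	/* NULL if component missing */
};

/*
 * Return the component's vnode, or NULL if the component is missing
 * (adi_dev == NULL) or has no vnode attached.
 */
static struct sketch_vnode *
sketch_disk_vnode_find(const struct sketch_disk_info *adi)
{
	if (adi == NULL || adi->adi_dev == NULL)
		return NULL;
	return adi->adi_dev->dv_vnode;
}
```

Putting the guard in the callee means every caller gets it for free; the
second option, checking in the caller, keeps the callee's contract
unchanged but has to be repeated at each call site.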
How did adi->adi_dev come to be null?
Well, the struct ataraid_disk_info *adi structure is one of the
aai->aai_ndisks entries in a struct ataraid_array_info *aai structure
which is created by ataraid_get_array_info:
270 aai = malloc(sizeof(*aai), M_DEVBUF, M_WAITOK | M_ZERO);
...
289 TAILQ_INSERT_TAIL(&ataraid_array_info_list, aai, aai_list);
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid.c?r=1.34#258
This is called by the various ata_raid_*.c drivers. From the dmesg
you shared, it looks like this was an Adaptec adapter, so aai_ndisks
is filled in from the first component here:
156 aai->aai_ndisks = be16toh(info->configs[0].total_disks);
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_adaptec.c?r=1.9#156
And aai->aai_disks[drive].adi_dev is filled in for each component
found here:
182 adi =3D &aai->aai_disks[drive];
183 adi->adi_dev = sc->sc_dev;
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_adaptec.c?r=1.9#182
If any components are missing, there will be some number `drive' for
which aai->aai_disks[drive].adi_dev remains null, and the logic will
crash where you saw this. That seems to be consistent with your
explanation of the state of the disks.
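The sequence described above can be modelled in userland with a small
sketch. Everything here is hypothetical (the mock_ names, the fixed-size
disk table, calloc standing in for malloc with M_ZERO); it only shows how
a zeroed array record plus a disk count taken from the first component
leaves null adi_dev slots for components that never attach.

```c
#include <stdlib.h>

#define MOCK_MAX_DISKS	16	/* hypothetical table size */

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mock_softc {
	const char	*sc_name;
};
struct mock_disk {
	struct mock_softc	*adi_dev;	/* stays NULL until found */
};
struct mock_array {
	unsigned		aai_ndisks;
	struct mock_disk	aai_disks[MOCK_MAX_DISKS];
};

/*
 * Allocate a zeroed array record (calloc plays the role of M_ZERO;
 * this assumes all-zero bytes represent a null pointer, as on i386)
 * and record the disk count claimed by the first component's metadata.
 */
static struct mock_array *
mock_array_create(unsigned ndisks)
{
	struct mock_array *aai;

	if (ndisks > MOCK_MAX_DISKS)
		return NULL;
	aai = calloc(1, sizeof(*aai));
	if (aai != NULL)
		aai->aai_ndisks = ndisks;
	return aai;
}

/* Record a component that was actually found on the bus. */
static void
mock_array_add(struct mock_array *aai, unsigned drive, struct mock_softc *sc)
{
	if (drive < aai->aai_ndisks)
		aai->aai_disks[drive].adi_dev = sc;
}

/* Count the slots whose component never showed up. */
static unsigned
mock_array_missing(const struct mock_array *aai)
{
	unsigned i, n = 0;

	for (i = 0; i < aai->aai_ndisks; i++) {
		if (aai->aai_disks[i].adi_dev == NULL)
			n++;
	}
	return n;
}
```

For a two-disk mirror where only one member carries array metadata,
creating the record with ndisks = 2 and adding only drive 0 leaves one
slot with a null adi_dev, which is the slot the attach loop would trip
over.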
None of this code has changed much since this PR was filed, so I
suspect the bug is still there. I'm inclined to say that
ld_ataraid_attach should just check adi->adi_dev == NULL and handle it
like when ata_raid_disk_vnode_find returns null -- in both cases, it
needs to deal with a missing component.
--- ld_ataraid.c
+++ ld_ataraid.c
@@ -226,8 +226,8 @@
*/
for (i = 0; i < aai->aai_ndisks; i++) {
adi = &aai->aai_disks[i];
- vp = ata_raid_disk_vnode_find(adi);
- if (vp == NULL) {
+ if (adi->adi_dev == NULL ||
+ (vp = ata_raid_disk_vnode_find(adi)) == NULL) {
/*
* XXX This is bogus. We should just mark the
* XXX component as FAILED, and write-back new
Now, the failure branch logic here may be wrong -- it just gives up
instead of trying to deal with a missing component -- but that's a
separate issue. With any luck, this check should at least lead to a
more graceful failure than a crash, even if the more graceful failure
isn't ideal.
State-Changed-From-To: open->analyzed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Tue, 26 Mar 2024 02:16:44 +0000
State-Changed-Why:
problem analyzed, band-aid proposed
>Unformatted: