NetBSD Problem Report #43986

From woods@once.weird.com  Tue Oct 19 19:41:20 2010
Return-Path: <woods@once.weird.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 0A33C63B11D
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 19 Oct 2010 19:41:20 +0000 (UTC)
Message-Id: <m1P8H4t-001ZAkC@once.weird.com>
Date: Tue, 19 Oct 2010 14:38:27 -0400 (EDT)
From: "Greg A. Woods" <woods@planix.com>
Sender: "Greg A. Woods" <woods@once.weird.com>
Reply-To: "Greg A. Woods" <woods@planix.com>
To: gnats-bugs@gnats.NetBSD.org
Subject: ataraid(4) doesn't seem to handle weird array configs very gracefully
X-Send-Pr-Version: 3.95

>Number:         43986
>Category:       kern
>Synopsis:       ataraid(4) doesn't seem to handle weird array configs very gracefully
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          analyzed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Oct 19 19:45:00 +0000 2010
>Closed-Date:    
>Last-Modified:  Tue Mar 26 02:16:44 +0000 2024
>Originator:     Greg A. Woods
>Release:        NetBSD 4.0_STABLE 2010/10/15
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:
System: NetBSD historically 4.0_STABLE NetBSD 4.0_STABLE (GENERIC.MP) #1: Fri Oct 15 13:14:43 EDT 2010  woods@once:/rest/build/woods/once/netbsd-4-i386-i386-ppro-obj/rest/work/woods/m-NetBSD-4/sys/arch/i386/compile/GENERIC.MP i386
Architecture: i386
Machine: i386
>Description:

	ataraid(4) doesn't seem to handle weird array configs very gracefully:

wd1 at atabus2 drive 0: <WDC WD2000JD-00HBB0>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
rnd: wd1 attached as an entropy source (collecting)
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(piixide1:0:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
MAGIC == 0x00
wd2 at atabus3 drive 0: <WDC WD2000JD-00HBB0>
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
rnd: wd2 attached as an entropy source (collecting)
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2(piixide1:1:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
ataraid0: found 1 RAID volume
ld0 at ataraid0 vendtype 1 unit 0: Adaptec ATA RAID-1 array
uvm_fault(0xc0ab2c40, 0, 1) -> 0xe
kernel: supervisor trap page fault, code=0
Stopped in pid 0.1 (swapper) at netbsd:strncmp+0x23:    movzbl  0(%ebx),%eax
db{0}> trace
strncmp(c095397a,1c,6,282,c0a9564c) at netbsd:strncmp+0x23
devsw_name2blk(1c,0,0,0,0) at netbsd:devsw_name2blk+0x7e
ld_ataraid_attach(c3d9c940,c3dede00,c3da0600,c37d4000,c37f4000) at netbsd:ld_ataraid_attach+0x1a8
config_attach_loc(c3d9c940,c0a0bb40,c0b96b54,c3da0600,c053c470) at netbsd:config_attach_loc+0x3b0
ataraid_attach(0,c3d9c940,0,c04547fd,c095304c) at netbsd:ataraid_attach+0x87
config_attach_pseudo(c0a20e28,c0a20e00,c09c4c6c,241,7) at netbsd:config_attach_pseudo+0x213
ata_raid_finalize(0,c0a1cfe0,0,20,20) at netbsd:ata_raid_finalize+0x4d
config_finalize(c0ab4114,20,c0967934,0,0) at netbsd:config_finalize+0x29
main(fbff,c01002d2,0,0,0) at netbsd:main+0x297
db{0}> 


>How-To-Repeat:

	1. set up a RAID-1 mirror in the hardware RAID controller config
	2. disconnect one drive
	3. delete the array
	4. connect the drive
	5. boot

	i.e. I think the problem was that the disconnected drive was
	still identified as an array member, so the array re-appeared,
	but the second drive was in JBOD mode because it had been the
	only one connected when the array was deleted.

>Fix:


>Release-Note:

>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: "Greg A. Woods" <woods@planix.com>
Cc: gnats-bugs@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/43986: ataraid(4) doesn't seem to handle weird array configs very gracefully
Date: Tue, 26 Mar 2024 02:12:00 +0000

 > uvm_fault(0xc0ab2c40, 0, 1) -> 0xe
 > kernel: supervisor trap page fault, code=0
 > Stopped in pid 0.1 (swapper) at netbsd:strncmp+0x23:    movzbl  0(%ebx),%eax
 > db{0}> trace
 > strncmp(c095397a,1c,6,282,c0a9564c) at netbsd:strncmp+0x23

 This is an attempt to access page zero.  Not exactly a null
 pointer dereference, but a near-null pointer dereference -- the second
 argument to strncmp is 0x1c.

 > devsw_name2blk(1c,0,0,0,0) at netbsd:devsw_name2blk+0x7e

 It probably happened here:

     535 		if (strncmp(conv->d_name, name, len) != 0)

 https://nxr.netbsd.org/xref/src/sys/kern/subr_devsw.c?r=1.28#535

 > ld_ataraid_attach(c3d9c940,c3dede00,c3da0600,c37d4000,c37f4000) at netbsd:ld_ataraid_attach+0x1a8

 The devsw_name2blk call is probably this one:

      78 	bmajor = devsw_name2blk(device_xname(adi->adi_dev), NULL, 0);

 https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_subr.c?r=1.2#78

 0x1c is offsetof(struct device, dv_xname) on i386 as of around 2010:

 https://nxr.netbsd.org/xref/src/sys/sys/device.h?r=1.137#141

 So adi->adi_dev is probably null here.
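
 The arithmetic behind that inference can be sketched in userland C.
 This is only an illustration: struct mock_device and xname_addr are
 hypothetical names, and the padding is chosen purely so that dv_xname
 sits at offset 0x1c as described above; it is not the real struct
 device layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct device.  The padding is an assumption
 * made only so that dv_xname lands at offset 0x1c, as on i386 around 2010.
 */
struct mock_device {
	char	dv_pad[0x1c];	/* stands in for the fields before dv_xname */
	char	dv_xname[16];
};

/* Compile-time check that the mock layout matches the offset in the trace. */
static_assert(offsetof(struct mock_device, dv_xname) == 0x1c,
    "dv_xname must sit at offset 0x1c in this sketch");

/*
 * Address that reading dv_xname would touch for a device located at `base'.
 * Integer arithmetic is used so that no invalid pointer is ever formed.
 */
static uintptr_t
xname_addr(uintptr_t base)
{
	return base + offsetof(struct mock_device, dv_xname);
}
```

 With a base of 0, xname_addr returns 0x1c, which matches the first
 argument to devsw_name2blk in the backtrace.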

 ata_raid_disk_vnode_find assumes that adi->adi_dev is nonnull.  This
 is probably a bug on its own: the caller handles a null return, so
 either ata_raid_disk_vnode_find should return null when adi->adi_dev
 is null, or the caller should avoid calling it in that case.
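
 The first of those two options might look roughly like the sketch
 below.  Everything in it is a hypothetical stand-in (the mock_ types,
 the cached adi_vp field, and mock_disk_vnode_find itself); the real
 function looks the vnode up by device name, which is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mock_device {
	char	dv_xname[16];
};

struct mock_vnode {
	int	v_unused;
};

struct mock_disk_info {
	struct mock_device	*adi_dev;	/* NULL if the component was never found */
	struct mock_vnode	*adi_vp;	/* assumption: vnode cached for the sketch */
};

/*
 * Callee-side guard: a component with no device yields a null vnode, so
 * the caller's existing null-return handling covers the missing component.
 */
static struct mock_vnode *
mock_disk_vnode_find(struct mock_disk_info *adi)
{
	/* The caller passes the address of an array element, never null. */
	assert(adi != NULL);

	if (adi->adi_dev == NULL)
		return NULL;
	return adi->adi_vp;
}
```

 The appeal of guarding in the callee is that every caller gets the
 check for free; the cost is that a missing device and a failed lookup
 become indistinguishable to the caller.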

     225 		adi = &aai->aai_disks[i];
     226 		vp = ata_raid_disk_vnode_find(adi);
     227 		if (vp == NULL) {
     228 			/*
     229 			 * XXX This is bogus.  We should just mark the
     230 			 * XXX component as FAILED, and write-back new
     231 			 * XXX config blocks.
     232 			 */
     233 			break;
     234 		}

 https://nxr.netbsd.org/xref/src/sys/dev/ata/ld_ataraid.c?r=1.37#225

 How did adi->adi_dev come to be null?

 Well, the struct ataraid_disk_info *adi structure is one of the
 aai->aai_ndisks entries in a struct ataraid_array_info *aai structure
 which is created by ataraid_get_array_info:

     270 	aai = malloc(sizeof(*aai), M_DEVBUF, M_WAITOK | M_ZERO);
 ...
     289 	TAILQ_INSERT_TAIL(&ataraid_array_info_list, aai, aai_list);

 https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid.c?r=1.34#258

 This is called by the various ata_raid_*.c drivers.  From the dmesg
 you shared, it looks like this was an Adaptec adapter, so aai_ndisks
 is filled in from the first component here:

     156 		aai->aai_ndisks = be16toh(info->configs[0].total_disks);

 https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_adaptec.c?r=1.9#156

 And aai->aai_disks[drive].adi_dev is filled in for each component
 found here:

     182 	adi = &aai->aai_disks[drive];
     183 	adi->adi_dev = sc->sc_dev;

 https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_adaptec.c?r=1.9#182

 If any components are missing, there will be some number `drive' for
 which aai->aai_disks[drive].adi_dev remains null, and the logic will
 crash where you saw this.  That seems to be consistent with your
 explanation of the state of the disks.
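
 That sequence can be modelled in userland as a rough sketch.  The
 mock_ names and MOCK_MAX_DISKS are hypothetical, and calloc stands in
 for the zeroing kernel allocation; the point is only that a slot whose
 component never shows up keeps its zeroed (null) adi_dev.

```c
#include <assert.h>
#include <stdlib.h>

#define MOCK_MAX_DISKS	8	/* assumption: arbitrary bound for the sketch */

struct mock_device {
	char	dv_xname[16];
};

struct mock_disk_info {
	struct mock_device	*adi_dev;
};

struct mock_array_info {
	unsigned		aai_ndisks;
	struct mock_disk_info	aai_disks[MOCK_MAX_DISKS];
};

/*
 * Zeroed allocation, standing in for malloc(..., M_WAITOK | M_ZERO).
 * On the platforms of interest, all-bits-zero is a null pointer.
 */
static struct mock_array_info *
mock_array_alloc(unsigned ndisks)
{
	struct mock_array_info *aai;

	assert(ndisks <= MOCK_MAX_DISKS);
	aai = calloc(1, sizeof(*aai));
	if (aai != NULL)
		aai->aai_ndisks = ndisks;	/* taken from the on-disk config */
	return aai;
}

/* Record a component that was actually found during autoconfiguration. */
static void
mock_component_found(struct mock_array_info *aai, unsigned drive,
    struct mock_device *dev)
{
	assert(drive < aai->aai_ndisks);
	aai->aai_disks[drive].adi_dev = dev;
}

/* Count the slots that were never filled in. */
static unsigned
mock_missing_components(const struct mock_array_info *aai)
{
	unsigned i, n = 0;

	for (i = 0; i < aai->aai_ndisks; i++) {
		if (aai->aai_disks[i].adi_dev == NULL)
			n++;
	}
	return n;
}
```

 For a two-disk mirror where only drive 0 reports itself as a member,
 mock_missing_components returns 1 and slot 1 still holds a null
 adi_dev.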

 None of this code has changed much since this PR was filed, so I
 suspect the bug is still there.  I'm inclined to say that
 ld_ataraid_attach should just check adi->adi_dev == NULL and handle it
 like when ata_raid_disk_vnode_find returns null -- in both cases, it
 needs to deal with a missing component.

 --- ld_ataraid.c
 +++ ld_ataraid.c
 @@ -226,8 +226,8 @@
  	 */
  	for (i = 0; i < aai->aai_ndisks; i++) {
  		adi = &aai->aai_disks[i];
 -		vp = ata_raid_disk_vnode_find(adi);
 -		if (vp == NULL) {
 +		if (adi->adi_dev == NULL ||
 +		    (vp = ata_raid_disk_vnode_find(adi)) == NULL) {
  			/*
  			 * XXX This is bogus.  We should just mark the
  			 * XXX component as FAILED, and write-back new

 Now, the failure branch logic here may be wrong -- it just gives up
 instead of trying to deal with a missing component -- but that's a
 separate issue.  With any luck, the check above should at least lead
 to a more graceful failure than a crash, even if that failure isn't
 ideal.
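
 As a rough userland model of that behaviour, the sketch below checks
 adi_dev in the caller before the lookup and stops at the first missing
 component.  All mock_ names are hypothetical, the lookup is reduced to
 returning a cached pointer, and none of this is the actual
 ld_ataraid.c code.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_DISKS	8	/* assumption: arbitrary bound for the sketch */

struct mock_device {
	char	dv_xname[16];
};

struct mock_vnode {
	int	v_unused;
};

struct mock_disk_info {
	struct mock_device	*adi_dev;
	struct mock_vnode	*adi_vp;	/* assumption: cached for the sketch */
};

struct mock_array_info {
	unsigned		aai_ndisks;
	struct mock_disk_info	aai_disks[MOCK_MAX_DISKS];
};

/* Like the lookup discussed above, this requires a nonnull adi_dev. */
static struct mock_vnode *
mock_vnode_find(struct mock_disk_info *adi)
{
	assert(adi->adi_dev != NULL);
	return adi->adi_vp;
}

/*
 * Returns the number of components handled before the loop stopped.
 * A result smaller than aai_ndisks means a component was missing.
 */
static unsigned
mock_attach_components(struct mock_array_info *aai)
{
	struct mock_disk_info *adi;
	unsigned i;

	for (i = 0; i < aai->aai_ndisks; i++) {
		adi = &aai->aai_disks[i];
		/* The null check short-circuits, so the lookup never sees NULL. */
		if (adi->adi_dev == NULL || mock_vnode_find(adi) == NULL)
			break;	/* give up, as the existing failure branch does */
	}
	return i;
}
```

 A caller would compare the result against aai_ndisks and treat a short
 count as a failed attach rather than faulting inside the lookup.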

State-Changed-From-To: open->analyzed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Tue, 26 Mar 2024 02:16:44 +0000
State-Changed-Why:
problem analyzed, band-aid proposed


>Unformatted:
