NetBSD Problem Report #34461
From kiers@raidtest2.xs4all.nl Mon Sep 4 00:31:02 2006
Return-Path: <kiers@raidtest2.xs4all.nl>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 14AC663B89F
for <gnats-bugs@gnats.NetBSD.org>; Mon, 4 Sep 2006 00:31:02 +0000 (UTC)
Message-Id: <20060903232449.6F4597FF2A@raidtest2.xs4all.nl>
Date: Mon, 4 Sep 2006 01:24:49 +0200 (CEST)
From: kiersb@xs4all.net
Reply-To: kiersb@xs4all.net
To: gnats-bugs@NetBSD.org
Subject: multiple problems; ioapic related?
X-Send-Pr-Version: 3.95
>Number: 34461
>Category: kern
>Synopsis: multiple problems; lfs-related
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Sep 04 00:35:00 +0000 2006
>Closed-Date:
>Last-Modified: Mon Aug 11 15:40:01 +0000 2014
>Originator: kiersb@xs4all.net
>Release: NetBSD 4.99.1
>Organization:
XS4All
>Environment:
System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0: Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/i386/compile/GENERIC.MPACPI i386
Architecture: i386
Machine: i386
>Description:
While rsyncing 1.3 TB from a UFS file system to an LFS file system on the same box, the kernel dropped into ddb after around 200 MB:
db{0}> bt
Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
--- interrupt ---
Bad frame pointer: 0xc3077318
0x30:
then, after rebooting and issuing the same rsync command, after 392 GB:
sd5(isp0:0:21:0): unable to allocate scsipi_xfer
and nothing special in kernel messages; after this sd5 message, the computer is *very* slow,
root@raidtest2:~# uptime
12:58AM up 9:11, 2 users, load averages: 8.66, 8.59, 8.52
took around two minutes. Even typing 'uptime' on the serial console echoes around 1 char in two seconds.
dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt
Not production system, so I can test stuff.
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
From: Rui Paulo <rpaulo@fnop.net>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/34461: multiple problems; ioapic related?
Date: Mon, 4 Sep 2006 13:16:53 +0100
On Sep 4, 2006, at 1:35 AM, kiersb@xs4all.net wrote:
>> Number: 34461
>> Category: kern
>> Synopsis: multiple problems; ioapic related?
>> Confidential: no
>> Severity: critical
>> Priority: high
>> Responsible: kern-bug-people
>> State: open
>> Class: sw-bug
>> Submitter-Id: net
>> Arrival-Date: Mon Sep 04 00:35:00 +0000 2006
>> Originator: kiersb@xs4all.net
>> Release: NetBSD 4.99.1
>> Organization:
> XS4All
>> Environment:
>
>
> System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0:
> Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/
> i386/compile/GENERIC.MPACPI i386
> Architecture: i386
> Machine: i386
>> Description:
>
> While rsyncing 1.3 TB from an ufs file system to an lfs file system
> on same box, after around 200 MB:
>
> db{0}> bt
> Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
> --- interrupt ---
> Bad frame pointer: 0xc3077318
> 0x30:
>
> then, after rebooting and issuing the same rsync command, after 392 GB:
>
> sd5(isp0:0:21:0): unable to allocate scsipi_xfer
>
> and nothing special in kernel messages; after this sd5 message, the
> computer is *very* slow,
>
> root@raidtest2:~# uptime
> 12:58AM up 9:11, 2 users, load averages: 8.66, 8.59, 8.52
> took around two minutes. Even typing 'uptime' on the serial console
> echoes around 1 char in two seconds.
Can you monitor disk I/O with iostat at the same time you are doing a
copy ?
>
> dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt
>
> Not production system, so I can test stuff.
>> How-To-Repeat:
>
>> Fix:
>
>
>> Unformatted:
>
>
-- Rui Paulo
From: Bert Kiers <kiersb@xs4all.net>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
netbsd-bugs@NetBSD.org, kiersb@xs4all.net
Subject: Re: kern/34461: multiple problems; ioapic related?
Date: Mon, 4 Sep 2006 16:05:17 +0200
On Mon, Sep 04, 2006 at 12:20:02PM +0000, Rui Paulo wrote:
> The following reply was made to PR kern/34461; it has been noted by GNATS.
>
> From: Rui Paulo <rpaulo@fnop.net>
> To: gnats-bugs@NetBSD.org
> Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
> netbsd-bugs@netbsd.org
> Subject: Re: kern/34461: multiple problems; ioapic related?
> Date: Mon, 4 Sep 2006 13:16:53 +0100
>
> On Sep 4, 2006, at 1:35 AM, kiersb@xs4all.net wrote:
>
> >> Number: 34461
> >> Category: kern
> >> Synopsis: multiple problems; ioapic related?
> >> Confidential: no
> >> Severity: critical
> >> Priority: high
> >> Responsible: kern-bug-people
> >> State: open
> >> Class: sw-bug
> >> Submitter-Id: net
> >> Arrival-Date: Mon Sep 04 00:35:00 +0000 2006
> >> Originator: kiersb@xs4all.net
> >> Release: NetBSD 4.99.1
> >> Organization:
> > XS4All
> >> Environment:
> >
> >
> > System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0:
> > Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/
> > i386/compile/GENERIC.MPACPI i386
> > Architecture: i386
> > Machine: i386
> >> Description:
> >
> > While rsyncing 1.3 TB from an ufs file system to an lfs file system
> > on same box, after around 200 MB:
> >
> > db{0}> bt
> > Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
> > --- interrupt ---
> > Bad frame pointer: 0xc3077318
> > 0x30:
> >
> > then, after rebooting and issuing the same rsync command, after 392 GB:
> >
> > sd5(isp0:0:21:0): unable to allocate scsipi_xfer
> >
> > and nothing special in kernel messages; after this sd5 message, the
> > computer is *very* slow,
> >
> > root@raidtest2:~# uptime
> > 12:58AM up 9:11, 2 users, load averages: 8.66, 8.59, 8.52
> > took around two minutes. Even typing 'uptime' on the serial console
> > echoes around 1 char in two seconds.
>
> Can you monitor disk I/O with iostat at the same time you are doing a
> copy ?
before starting rsync, parity reconstruction is working on raid1:
(37% complete)
device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd1          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd2          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd3          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd4          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd5          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd7         16.00    674   0.61    10.54       0.00      0   0.61     0.00
sd8         16.00    675   0.61    10.55       0.00      0   0.61     0.00
sd9         16.00    674   0.56    10.54       0.00      0   0.56     0.00
sd10        16.00    674   0.59    10.54       0.00      0   0.59     0.00
sd11        16.00    673   0.56    10.52       0.00      0   0.56     0.00
sd12        16.00    674   0.55    10.54       0.00      0   0.55     0.00
sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid0        0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid1        0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00
during root@raidtest2:/disk0/pub# rsync --size-only -r --progress /disk1/pub/* .:
(i/o to disks sd0..sd5 is very bursty)
device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd0         10.69    339   0.85     3.53      12.25    479   0.85     5.73
sd1         12.37    259   0.75     3.13      13.78    422   0.75     5.68
sd2         10.80    340   0.88     3.58      12.27    476   0.88     5.71
sd3         13.20    249   0.71     3.20      14.31    407   0.71     5.69
sd4         10.69    374   0.88     3.91      11.94    491   0.88     5.72
sd5         13.20    241   0.63     3.10      14.35    408   0.63     5.72
sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd7         16.00    593   0.61     9.27       0.00      0   0.61     0.00
sd8         16.00    593   0.58     9.27       0.00      0   0.58     0.00
sd9         16.00    593   0.57     9.27       0.00      0   0.57     0.00
sd10        16.00    593   0.61     9.27       0.00      0   0.61     0.00
sd11        16.00    591   0.61     9.24       0.00      0   0.61     0.00
sd12        16.00    592   0.59     9.25       0.00      0   0.59     0.00
sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid0        0.00      0   1.01     0.00      50.19    489   1.01    23.97
raid1       64.00      4   0.01     0.25       0.00      0   0.01     0.00
raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00
now, after 20 GB done, system is very unresponsive, but nothing in messages or
dmesg:
(iostat -w 1, rsync and systat vmstat: no output, ssh to box time-out, but
serial console is 'only' very very slow)
device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd0         11.37    277   0.74     3.08      12.57    374   0.74     4.59
sd1         11.88    257   0.65     2.99      13.07    362   0.65     4.63
sd2         11.80    258   0.67     2.98      13.03    365   0.67     4.65
sd3         11.24    271   0.81     2.98      12.58    377   0.81     4.63
sd4         11.82    247   0.69     2.85      13.14    362   0.69     4.65
sd5         11.74    256   0.70     2.94      12.98    362   0.70     4.59
sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd7         16.00    530   0.50     8.28       0.00      0   0.50     0.00
sd8         16.00    530   0.57     8.28       0.00      0   0.57     0.00
sd9         16.00    530   0.57     8.28       0.00      0   0.57     0.00
sd10        16.00    530   0.56     8.28       0.00      0   0.56     0.00
sd11        16.00    530   0.51     8.28       0.00      0   0.51     0.00
sd12        16.00    530   0.51     8.28       0.00      0   0.51     0.00
sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid0        0.00      0   1.00     0.00      36.31    535   1.00    18.96
raid1        0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00
systat vmstat at this moment:
4 users Load 9.34 9.10 8.04 Mon Sep 4 15:39:01
Proc:r d s w Csw Trp Sys Int Sof Flt PAGING SWAPPING
2 10 6435 23 117 13250 119 11 in out in out
ops
20.5% Sy 0.0% Us 0.0% Ni 3.3% In 76.2% Id pages
| | | | | | | | | | |
==========%% forks
fkppw
memory totals (in kB) 23237 Interrupts fksvm
real virtual free 100 cpu0 softclock pwait
Active 1227868 1227868 1620 13 cpu0 softnet relck
All 2052948 2052948 134672 cpu0 softserial rlkok
100 cpu0 timer noram
Namei Sys-cache Proc-cache FPU flush IPI ndcpy
Calls hits % hits % FPU synch IPI fltcp
8 8 100 2146 TLB shootdown I zfod
cpu1 softnet cow
Disks: seeks xfers bytes %busy 100 cpu1 timer 64 fmin
fd0 FPU flush IPI 85 ftarg
ld0 FPU synch IPI 150021 itarg
md0 2512 TLB shootdown I 1038 wired
cd0 cpu2 softnet pdfre
sd0 685 8390K 76.9 100 cpu2 timer pdscn
sd1 669 8330K 72.6 FPU flush IPI
sd2 671 8349K 73.4 FPU synch IPI
sd3 679 8339K 73.8 2403 TLB shootdown I
sd4 669 8296K 71.6 cpu3 softnet
sd5 669 8301K 72.2 100 cpu3 timer
sd6 FPU flush IPI
sd7 556 8896K 57.6 FPU synch IPI
sd8 556 8893K 58.0 2412 TLB shootdown I
sd9 556 8893K 58.2 ioapic0 pin 4
sd10 556 8890K 58.6 ioapic0 pin 6
sd11 556 8893K 59.8 13238 ioapic2 pin 0
sd12 556 8893K 61.4 13 ioapic2 pin 6
sd13 ioapic1 pin 6
sd14 ioapic0 pin 15
sd15
raid0 575 20M 100.0
raid1
raid2
root@raidtest2:~# date
Mon Sep 4 15:58:02 CEST 2006
root@raidtest2:~#
took 1 minute; note that the systat vmstat display above has been frozen for 20 minutes already.
root@raidtest2:~# netstat -anfinet|grep EST
tcp 0 0 194.109.0.22.22 194.109.0.97.63281 ESTABLISHED
tcp 0 0 194.109.0.22.22 194.109.0.97.63327 ESTABLISHED
tcp 0 80 194.109.0.22.22 194.109.0.97.63328 ESTABLISHED
on the console took 30 seconds, just to see the connections are still there and now
(16:00) suddenly the system starts going again; rsync, iostat and systat just continue
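For readers who want to compare per-disk traffic with what raid0 itself reports, the iostat sample taken during the rsync can be totalled as in the Python sketch below. This is illustrative only: it assumes sd0..sd5 are the components of raid0 (the report suggests this but does not state it), the helper name `parse` is hypothetical, and the rows are copied from the sample earlier in this message.

```python
# Rows copied from the iostat sample taken during the rsync.
# Columns: device, read KB/t, r/s, time, MB/s, write KB/t, w/s, time, MB/s
SAMPLE = """\
sd0 10.69 339 0.85 3.53 12.25 479 0.85 5.73
sd1 12.37 259 0.75 3.13 13.78 422 0.75 5.68
sd2 10.80 340 0.88 3.58 12.27 476 0.88 5.71
sd3 13.20 249 0.71 3.20 14.31 407 0.71 5.69
sd4 10.69 374 0.88 3.91 11.94 491 0.88 5.72
sd5 13.20 241 0.63 3.10 14.35 408 0.63 5.72
raid0 0.00 0 1.01 0.00 50.19 489 1.01 23.97
"""


def parse(text):
    """Map device name -> (read MB/s, write MB/s); skip malformed rows."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue
        stats[fields[0]] = (float(fields[4]), float(fields[8]))
    return stats


stats = parse(SAMPLE)

# ASSUMPTION: sd0..sd5 are the disks backing raid0 (not stated in the PR).
components = [f"sd{i}" for i in range(6)]
disk_read = sum(stats[d][0] for d in components)
disk_write = sum(stats[d][1] for d in components)
raid_read, raid_write = stats["raid0"]

print(f"component reads : {disk_read:.2f} MB/s (raid0 reports {raid_read:.2f})")
print(f"component writes: {disk_write:.2f} MB/s (raid0 reports {raid_write:.2f})")
```

Under that assumption the six disks together write more than raid0 reports, and they also read steadily while raid0 shows no reads. That would be consistent with parity updates, but the RAID level is not given in this report, so treat it as a guess.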
>
> >
> > dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt
> >
> > Not production system, so I can test stuff.
> >> How-To-Repeat:
> >
> >> Fix:
> >
> >
> >> Unformatted:
> >
> >
>
>
>
> -- Rui Paulo
>
>
>
--
Bert Kiers
XS4All UNIX system administrator, lockpicker & techno anarchist
From: Bert Kiers <kiers@original.xs4all.nl>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
netbsd-bugs@NetBSD.org, kiersb@xs4all.net
Subject: Re: kern/34461: multiple problems; ioapic related?
Date: Wed, 13 Sep 2006 13:41:07 +0200
Well, I reformatted the destination disk with UFS and everything is fine now.
So, it is an LFS problem.
--
Bert Kiers, !MCSE && 0xFF, frique d'ordinateur
State-Changed-From-To: open->closed
State-Changed-By: ad@NetBSD.org
State-Changed-When: Wed, 30 Apr 2008 14:33:59 +0000
State-Changed-Why:
appears not to be an ioapic problem
State-Changed-From-To: closed->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 30 Apr 2008 18:26:35 +0000
State-Changed-Why:
Just because it's an lfs problem doesn't mean the PR should be closed...
I've updated the synopsis accordingly.
From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Wed, 30 Apr 2008 19:36:01 +0100
On Wed, Apr 30, 2008 at 06:26:37PM +0000, dholland@NetBSD.org wrote:
> Synopsis: multiple problems; lfs-related
>
> State-Changed-From-To: closed->open
> State-Changed-By: dholland@NetBSD.org
> State-Changed-When: Wed, 30 Apr 2008 18:26:35 +0000
> State-Changed-Why:
> Just because it's an lfs problem doesn't mean the PR should be closed...
> I've updated the synopsis accordingly.
1. Re-opening the PR without even a note was rude.
2. There is scant diagnostic information in this PR, other than
evidence of KVM starvation and a statement indicating that LFS is
broken and not suitable for use on production systems. Any dog on
the street could tell you that.
Andrew
From: David Holland <dholland-bugs@netbsd.org>
To: Andrew Doran <ad@netbsd.org>, gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, kiersb@xs4all.net
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Wed, 30 Apr 2008 21:53:47 +0000
On Wed, Apr 30, 2008 at 06:40:02PM +0000, Andrew Doran wrote:
>> Just because it's an lfs problem doesn't mean the PR should be closed...
>> I've updated the synopsis accordingly.
>
> 1. Re-opening the PR without even a note was rude.
Well, it *is* a note, but I take your point. It is unnecessarily
abrupt. I apologize.
> 2. There is scant diagnostic information in this PR, other than
> evidence of KVM starvation and a statement indicating that LFS is
> broken and not suitable for use on production systems. Any dog on
> the street could tell you that.
Well, yes and no. It contains a method for breaking LFS. When/if
someone sits down to tackle the various outstanding problems in LFS,
this is something that they'll want to try. So the PR ought to be left
open so that person (which stands a good chance of being me) will be
able to find it.
If at that point it doesn't apply any more or turns out to depend on
setup details that aren't available, we can always close it then.
As much as it's nice to reduce the total PR count, it doesn't do us
any good in the long run to be too aggressive about it.
--
David A. Holland
dholland@netbsd.org
State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 12 Jul 2014 17:20:30 +0000
State-Changed-Why:
If there's anyone still on the other end of this... is there any reason to
believe this wasn't a driver or hardware issue? The fact that the problem
starts with a driver-level error makes me think it probably isn't LFS.
Reformatting with FFS and no longer seeing the issue doesn't really prove
anything as FFS has much different I/O patterns, and in particular when
rsyncing LFS will be ramming a lot more data down the disk system's throat.
(Or at least, a lot more at once.) If there was a load- or timing-dependent
problem at the disk level it's quite possible that FFS wouldn't trigger it,
especially if you weren't using softupdates.
From: Bert Kiers <kiersb@xs4all.net>
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, netbsd-bugs@NetBSD.org,
gnats-admin@NetBSD.org, dholland@NetBSD.org
Cc:
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Sun, 13 Jul 2014 14:55:39 +0200
On 7/12/14 7:20 PM, dholland@NetBSD.org wrote:
> Synopsis: multiple problems; lfs-related
>
> State-Changed-From-To: open->feedback
> State-Changed-By: dholland@NetBSD.org
> State-Changed-When: Sat, 12 Jul 2014 17:20:30 +0000
> State-Changed-Why:
> If there's anyone still on the other end of this... is there any reason to
> believe this wasn't a driver or hardware issue? The fact that the problem
> starts with a driver-level error makes me think it probably isn't LFS.
>
> Reformatting with FFS and no longer seeing the issue doesn't really prove
> anything as FFS has much different I/O patterns, and in particular when
> rsyncing LFS will be ramming a lot more data down the disk system's throat.
> (Or at least, a lot more at once.) If there was a load- or timing-dependent
> problem at the disk level it's quite possible that FFS wouldn't trigger it,
> especially if you weren't using softupdates.
System is scrapped. If somebody is really interested I could retry with
a newer computer, NetBSD-current and the same box of disks.
--
Bert Kiers
XS4ALL UNIX system administrator, suspected terrorist
1984 was not meant as a manual
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Sun, 13 Jul 2014 18:42:21 +0000
On Sun, Jul 13, 2014 at 02:35:00PM +0000, Bert Kiers wrote:
> On 7/12/14 7:20 PM, dholland@NetBSD.org wrote:
>> If there's anyone still on the other end of this... is there any
>> reason to believe this wasn't a driver or hardware issue? The fact
>> that the problem starts with a driver-level error makes me think
>> it probably isn't LFS.
>>
>> Reformatting with FFS and no longer seeing the issue doesn't
>> really prove anything as FFS has much different I/O patterns, and
>> in particular when rsyncing LFS will be ramming a lot more data
>> down the disk system's throat. (Or at least, a lot more at once.)
>> If there was a load- or timing-dependent problem at the disk level
>> it's quite possible that FFS wouldn't trigger it, especially if
>> you weren't using softupdates.
>
> System is scrapped. If somebody is really interested I could retry with
> newer computer, NetBSD-current and same box with disks.
Looking at it some more, I have the following conjecture:
- Back in 2009 when WAPBL was new, it sometimes under load exhibited
  a dysfunctional operating state where it would be madly writing the
  same blocks over and over again and making very little real progress.
- This turned out to be not WAPBL-specific but also possible (just
  much harder to get into) with regular FFS.
- It was caused by bad dynamic behavior logic in the syncer that was
  triggered by the disks getting behind on the pending I/O.
- It got fixed; some of the fix was FS-independent, but it isn't
  clear to me (without digging a lot deeper) how much.
I think it's possible that you were seeing an LFS version of this same
behavior. With the size of the RAID you had/have, it's quite plausible
that flooding it with writes, as this problem resulted in, would
render the system as slow as described.
If so, it might now be fixed... or it might not. It would probably be
interesting to find out, but see below.
The allocation failure message that started it is scsipi-level; it
means that the allocation pool for xs structures ran out. In the
current code (this doesn't seem to have changed) this causes a
half-second delay. It is quite likely that a sudden half-second delay
while running at peak throughput would be enough to trigger the
dysfunctional state described above, if it existed in LFS at the time.
(Given that the message appeared only once, the half-second delay
itself can't be the performance problem.)
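To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes writes keep arriving at about the raid0 rate seen in the iostat samples earlier in this PR (23.97 MB/s) and that nothing is issued during the stall. The helpers `backlog_mb` and `drain_seconds` and the service rates are hypothetical, chosen only for illustration; none of this is kernel code.

```python
def backlog_mb(arrival_mbs, stall_s):
    """Data queued while no I/O is issued for stall_s seconds."""
    return arrival_mbs * stall_s


def drain_seconds(backlog, arrival_mbs, service_mbs):
    """Time to clear the backlog once I/O resumes, or None if it never clears."""
    headroom = service_mbs - arrival_mbs
    if headroom <= 0:
        return None  # disks have no spare capacity, so the backlog persists
    return backlog / headroom


ARRIVAL = 23.97  # MB/s, raid0 write rate from the iostat sample (assumed steady)
STALL = 0.5      # seconds, the delay on xfer allocation failure

queued = backlog_mb(ARRIVAL, STALL)
print(f"queued during the stall: {queued:.2f} MB")

# Hypothetical disk service rates, from "no headroom" to "some headroom".
for service in (23.97, 25.0, 30.0):
    t = drain_seconds(queued, ARRIVAL, service)
    shown = "never" if t is None else f"{t:.1f} s"
    print(f"service {service:5.2f} MB/s -> backlog clears in {shown}")
```

The point of the sketch is only that at peak throughput there is little or no headroom, so even a one-off stall leaves the disks behind on pending I/O for a long time, which is the condition the conjecture above depends on.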
The problem could be something else entirely, though; e.g. something
in the way LFS prepares segments, or something silly in lfs_putpages.
Or this could be the same as PR 35187. Or a raidframe issue. Also, the
allocation failure might conceivably be a red herring and not actually
related at all; or the trigger (or even the problem) might be some
other allocation failure that doesn't print anything.
Trying to replicate this on a modern machine (with much more RAM and
faster disks) might be much harder... or much easier. Even if my
conjecture's correct, it's hard to guess. It will, probably, be harder
to get the triggering allocation failure; you might have to insert
fault injection code for that. If my conjecture's correct, without the
half-second delay the problem might well not appear. There's some
chance (especially if my conjecture's wrong) that it'll turn out to be
easy to reproduce, but this doesn't seem too likely.
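A minimal sketch of what such fault injection could look like, written in Python as a model rather than as kernel code. Everything here is hypothetical (the `XferPool` class, its `fail_on` parameter, the pool size and the request count); it only illustrates forcing one allocation failure at a chosen point so that the half-second back-off path is exercised deterministically instead of waiting for the pool to run dry.

```python
class XferPool:
    """Toy bounded pool with an optional injected failure (hypothetical model)."""

    def __init__(self, size, fail_on=None):
        self.free = size
        self.calls = 0
        self.fail_on = fail_on  # 1-based call number that is forced to fail

    def get(self):
        """Return True if an entry was allocated, False on (injected) failure."""
        self.calls += 1
        if self.calls == self.fail_on or self.free == 0:
            return False
        self.free -= 1
        return True

    def put(self):
        """Return an entry to the pool."""
        self.free += 1


def run(pool, requests, backoff=0.5):
    """Issue requests one at a time; return simulated seconds lost to back-off."""
    lost = 0.0
    for _ in range(requests):
        while not pool.get():
            lost += backoff  # models the delay; no real sleep happens here
        pool.put()           # the request completes immediately in this model
    return lost


print("no injection  :", run(XferPool(size=4), 1000), "s lost")
print("fail call #500:", run(XferPool(size=4, fail_on=500), 1000), "s lost")
```

In a real test the interesting part is not the lost half second itself but what the file system and syncer do afterwards, so the injected failure would need to fire while the copy is running at full throughput.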
So I would say: unless you're interested in working on LFS, trying to
replicate it probably isn't worthwhile; it will take a fair amount of
effort and isn't that likely to produce conclusive results.
If you *are* interested in working on LFS, by all means go ahead
though :-)
(There's another unrelated bug: it seems that hitting that half-second
delay causes mishandling of the iostat counters. However, that
shouldn't matter much.)
--
David A. Holland
dholland@netbsd.org
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 13 Jul 2014 21:23:29 +0000
State-Changed-Why:
Feedback received, plus I came up with a theory.
From: Bert Kiers <kiers@original.xs4all.nl>
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
netbsd-bugs@NetBSD.org, gnats-admin@NetBSD.org, dholland@NetBSD.org
Cc:
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Mon, 11 Aug 2014 17:37:17 +0200
On Sun, Jul 13, 2014 at 02:55:39PM +0200, Bert Kiers wrote:
> On 7/12/14 7:20 PM, dholland@NetBSD.org wrote:
> > Synopsis: multiple problems; lfs-related
> >
> > State-Changed-From-To: open->feedback
> > State-Changed-By: dholland@NetBSD.org
> > State-Changed-When: Sat, 12 Jul 2014 17:20:30 +0000
> > State-Changed-Why:
> > If there's anyone still on the other end of this... is there any reason to
> > believe this wasn't a driver or hardware issue? The fact that the problem
> > starts with a driver-level error makes me think it probably isn't LFS.
> >
> > Reformatting with FFS and no longer seeing the issue doesn't really prove
> > anything as FFS has much different I/O patterns, and in particular when
> > rsyncing LFS will be ramming a lot more data down the disk system's throat.
> > (Or at least, a lot more at once.) If there was a load- or timing-dependent
> > problem at the disk level it's quite possible that FFS wouldn't trigger it,
> > especially if you weren't using softupdates.
>
> System is scrapped. If somebody is really interested I could retry with
> newer computer, NetBSD-current and same box with disks.
I cannot reproduce this problem (same box of disks, new NetBSD, new
other hardware).
Grtnx,
--
B*E*R*T
>Unformatted:
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.