NetBSD Problem Report #34461

From kiers@raidtest2.xs4all.nl  Mon Sep  4 00:31:02 2006
Return-Path: <kiers@raidtest2.xs4all.nl>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 14AC663B89F
	for <gnats-bugs@gnats.NetBSD.org>; Mon,  4 Sep 2006 00:31:02 +0000 (UTC)
Message-Id: <20060903232449.6F4597FF2A@raidtest2.xs4all.nl>
Date: Mon,  4 Sep 2006 01:24:49 +0200 (CEST)
From: kiersb@xs4all.net
Reply-To: kiersb@xs4all.net
To: gnats-bugs@NetBSD.org
Subject: multiple problems; ioapic related?
X-Send-Pr-Version: 3.95

>Number:         34461
>Category:       kern
>Synopsis:       multiple problems; lfs-related
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Sep 04 00:35:00 +0000 2006
>Closed-Date:    
>Last-Modified:  Mon Aug 11 15:40:01 +0000 2014
>Originator:     kiersb@xs4all.net
>Release:        NetBSD 4.99.1
>Organization:
	XS4All
>Environment:


System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0: Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/i386/compile/GENERIC.MPACPI i386
Architecture: i386
Machine: i386
>Description:

While rsyncing 1.3 TB from a UFS file system to an LFS file system on the same box, after around 200 MB:

db{0}> bt
Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
--- interrupt ---
Bad frame pointer: 0xc3077318
0x30:

then, after rebooting and issuing the same rsync command, after 392 GB:

sd5(isp0:0:21:0): unable to allocate scsipi_xfer

and nothing special in kernel messages. After this sd5 message, the computer is *very* slow:

root@raidtest2:~# uptime
12:58AM  up  9:11, 2 users, load averages: 8.66, 8.59, 8.52
This took around two minutes. Even typing 'uptime' on the serial console echoes at around 1 char every two seconds.

dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt

Not production system, so I can test stuff. 
>How-To-Repeat:

>Fix:


>Release-Note:

>Audit-Trail:
From: Rui Paulo <rpaulo@fnop.net>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/34461: multiple problems; ioapic related?
Date: Mon, 4 Sep 2006 13:16:53 +0100

 On Sep 4, 2006, at 1:35 AM, kiersb@xs4all.net wrote:

 >> Number:         34461
 >> Category:       kern
 >> Synopsis:       multiple problems; ioapic related?
 >> Confidential:   no
 >> Severity:       critical
 >> Priority:       high
 >> Responsible:    kern-bug-people
 >> State:          open
 >> Class:          sw-bug
 >> Submitter-Id:   net
 >> Arrival-Date:   Mon Sep 04 00:35:00 +0000 2006
 >> Originator:     kiersb@xs4all.net
 >> Release:        NetBSD 4.99.1
 >> Organization:
 > 	XS4All
 >> Environment:
 > 	
 > 	
 > System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0:  
 > Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/ 
 > i386/compile/GENERIC.MPACPI i386
 > Architecture: i386
 > Machine: i386
 >> Description:
 > 	
 > While rsyncing 1.3 TB from an ufs file system to an lfs file system  
 > on same box, after around 200 MB:
 >
 > db{0}> bt
 > Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
 > --- interrupt ---
 > Bad frame pointer: 0xc3077318
 > 0x30:
 >
 > then, after rebooting and issuing the same rsync command, after 392 GB:
 >
 > sd5(isp0:0:21:0): unable to allocate scsipi_xfer
 >
 > and nothing special in kernel messages; after this sd5 message, the  
 > computer is *very* slow,
 >
 > root@raidtest2:~# uptime
 > 12:58AM  up  9:11, 2 users, load averages: 8.66, 8.59, 8.52
 > took around two minutes. Even typing 'uptime' on the serial console  
 > echoes around 1 char in two seconds.

 Can you monitor disk I/O with iostat at the same time you are doing a  
 copy ?

 >
 > dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt
 >
 > Not production system, so I can test stuff.
 >> How-To-Repeat:
 > 	
 >> Fix:
 > 	
 >
 >> Unformatted:
 >  	
 >  	



 	-- Rui Paulo


From: Bert Kiers <kiersb@xs4all.net>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org, kiersb@xs4all.net
Subject: Re: kern/34461: multiple problems; ioapic related?
Date: Mon, 4 Sep 2006 16:05:17 +0200

 On Mon, Sep 04, 2006 at 12:20:02PM +0000, Rui Paulo wrote:
 > The following reply was made to PR kern/34461; it has been noted by GNATS.
 > 
 > From: Rui Paulo <rpaulo@fnop.net>
 > To: gnats-bugs@NetBSD.org
 > Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
 > 	netbsd-bugs@netbsd.org
 > Subject: Re: kern/34461: multiple problems; ioapic related?
 > Date: Mon, 4 Sep 2006 13:16:53 +0100
 > 
 >  On Sep 4, 2006, at 1:35 AM, kiersb@xs4all.net wrote:
 >  
 >  >> Number:         34461
 >  >> Category:       kern
 >  >> Synopsis:       multiple problems; ioapic related?
 >  >> Confidential:   no
 >  >> Severity:       critical
 >  >> Priority:       high
 >  >> Responsible:    kern-bug-people
 >  >> State:          open
 >  >> Class:          sw-bug
 >  >> Submitter-Id:   net
 >  >> Arrival-Date:   Mon Sep 04 00:35:00 +0000 2006
 >  >> Originator:     kiersb@xs4all.net
 >  >> Release:        NetBSD 4.99.1
 >  >> Organization:
 >  > 	XS4All
 >  >> Environment:
 >  > 	
 >  > 	
 >  > System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0:  
 >  > Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/ 
 >  > i386/compile/GENERIC.MPACPI i386
 >  > Architecture: i386
 >  > Machine: i386
 >  >> Description:
 >  > 	
 >  > While rsyncing 1.3 TB from an ufs file system to an lfs file system  
 >  > on same box, after around 200 MB:
 >  >
 >  > db{0}> bt
 >  > Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
 >  > --- interrupt ---
 >  > Bad frame pointer: 0xc3077318
 >  > 0x30:
 >  >
 >  > then, after rebooting and issuing the same rsync command, after 392 GB:
 >  >
 >  > sd5(isp0:0:21:0): unable to allocate scsipi_xfer
 >  >
 >  > and nothing special in kernel messages; after this sd5 message, the  
 >  > computer is *very* slow,
 >  >
 >  > root@raidtest2:~# uptime
 >  > 12:58AM  up  9:11, 2 users, load averages: 8.66, 8.59, 8.52
 >  > took around two minutes. Even typing 'uptime' on the serial console  
 >  > echoes around 1 char in two seconds.
 >  
 >  Can you monitor disk I/O with iostat at the same time you are doing a  
 >  copy ?


 before starting rsync, parity reconstruction is working on raid1:
 (37% complete)

 device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
 fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd1          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd2          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd3          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd4          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd5          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd7         16.00    674   0.61    10.54       0.00      0   0.61     0.00
 sd8         16.00    675   0.61    10.55       0.00      0   0.61     0.00
 sd9         16.00    674   0.56    10.54       0.00      0   0.56     0.00
 sd10        16.00    674   0.59    10.54       0.00      0   0.59     0.00
 sd11        16.00    673   0.56    10.52       0.00      0   0.56     0.00
 sd12        16.00    674   0.55    10.54       0.00      0   0.55     0.00
 sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid0        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid1        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00

 during root@raidtest2:/disk0/pub# rsync --size-only -r --progress /disk1/pub/* .:
 (i/o to disks sd0..sd5 is very bursty)

 device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
 fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd0         10.69    339   0.85     3.53      12.25    479   0.85     5.73
 sd1         12.37    259   0.75     3.13      13.78    422   0.75     5.68
 sd2         10.80    340   0.88     3.58      12.27    476   0.88     5.71
 sd3         13.20    249   0.71     3.20      14.31    407   0.71     5.69
 sd4         10.69    374   0.88     3.91      11.94    491   0.88     5.72
 sd5         13.20    241   0.63     3.10      14.35    408   0.63     5.72
 sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd7         16.00    593   0.61     9.27       0.00      0   0.61     0.00
 sd8         16.00    593   0.58     9.27       0.00      0   0.58     0.00
 sd9         16.00    593   0.57     9.27       0.00      0   0.57     0.00
 sd10        16.00    593   0.61     9.27       0.00      0   0.61     0.00
 sd11        16.00    591   0.61     9.24       0.00      0   0.61     0.00
 sd12        16.00    592   0.59     9.25       0.00      0   0.59     0.00
 sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid0        0.00      0   1.01     0.00      50.19    489   1.01    23.97
 raid1       64.00      4   0.01     0.25       0.00      0   0.01     0.00
 raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00
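
 A rough cross-check of the write columns in the sample above, as a
 minimal Python sketch. It assumes that MB/s is KB/t times w/s divided
 by 1024, and that sd0..sd5 are the components of raid0; neither is
 stated in this report. Under those assumptions the derived rates agree
 with the reported ones to within rounding, and the six disks together
 write about 34.25 MB/s for 23.97 MB/s of raid0 writes.

```python
# Rows copied from the iostat sample taken during the rsync.
# Each entry: (write KB/t, w/s, reported write MB/s).
KB_PER_MB = 1024.0  # assumption: iostat reports binary megabytes

rows = {
    "sd0": (12.25, 479, 5.73),
    "sd1": (13.78, 422, 5.68),
    "sd2": (12.27, 476, 5.71),
    "sd3": (14.31, 407, 5.69),
    "sd4": (11.94, 491, 5.72),
    "sd5": (14.35, 408, 5.72),
    "raid0": (50.19, 489, 23.97),
}


def derived_mb_per_s(kb_per_xfer, xfers_per_s):
    """Throughput implied by average transfer size and transfer rate."""
    return kb_per_xfer * xfers_per_s / KB_PER_MB


for name, (kbt, ops, reported) in rows.items():
    derived = derived_mb_per_s(kbt, ops)
    print(f"{name}: derived {derived:.2f} MB/s, reported {reported:.2f} MB/s")

# Sum of the reported write rates of the assumed raid0 components.
component_write = sum(rows[f"sd{i}"][2] for i in range(6))
ratio = component_write / rows["raid0"][2]
print(f"components {component_write:.2f} MB/s, ratio to raid0 {ratio:.2f}")
```

 The ratio above 1 would be consistent with parity overhead, but that
 reading depends on the RAID layout, which only the dmesg would confirm.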


 now, after 20 GB done, system is very unresponsive, but nothing in messages or
 dmesg:
 (iostat -w 1, rsync and systat vmstat: no output, ssh to box time-out, but
 serial console is 'only' very very slow)

 device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
 fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd0         11.37    277   0.74     3.08      12.57    374   0.74     4.59
 sd1         11.88    257   0.65     2.99      13.07    362   0.65     4.63
 sd2         11.80    258   0.67     2.98      13.03    365   0.67     4.65
 sd3         11.24    271   0.81     2.98      12.58    377   0.81     4.63
 sd4         11.82    247   0.69     2.85      13.14    362   0.69     4.65
 sd5         11.74    256   0.70     2.94      12.98    362   0.70     4.59
 sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd7         16.00    530   0.50     8.28       0.00      0   0.50     0.00
 sd8         16.00    530   0.57     8.28       0.00      0   0.57     0.00
 sd9         16.00    530   0.57     8.28       0.00      0   0.57     0.00
 sd10        16.00    530   0.56     8.28       0.00      0   0.56     0.00
 sd11        16.00    530   0.51     8.28       0.00      0   0.51     0.00
 sd12        16.00    530   0.51     8.28       0.00      0   0.51     0.00
 sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid0        0.00      0   1.00     0.00      36.31    535   1.00    18.96
 raid1        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00

 systat vmstat at this moment:

     4 users    Load  9.34  9.10  8.04                  Mon Sep  4 15:39:01

 Proc:r  d  s  w     Csw    Trp    Sys   Int   Sof    Flt      PAGING   SWAPPING
         2 10       6435     23    117 13250   119     11      in  out   in  out
                                                         ops
   20.5% Sy   0.0% Us   0.0% Ni   3.3% In  76.2% Id    pages
 |    |    |    |    |    |    |    |    |    |    |
 ==========%%                                                              forks
                                                                           fkppw
            memory totals (in kB)           23237 Interrupts               fksvm
           real  virtual     free             100 cpu0 softclock           pwait
 Active 1227868  1227868     1620              13 cpu0 softnet             relck
 All    2052948  2052948   134672                 cpu0 softserial          rlkok
                                              100 cpu0 timer               noram
 Namei         Sys-cache     Proc-cache           FPU flush IPI            ndcpy
     Calls     hits    %     hits     %           FPU synch IPI            fltcp
         8        8  100                     2146 TLB shootdown I          zfod
                                                  cpu1 softnet             cow
 Disks: seeks xfers bytes %busy               100 cpu1 timer            64 fmin
    fd0                                           FPU flush IPI         85 ftarg
    ld0                                           FPU synch IPI     150021 itarg
    md0                                      2512 TLB shootdown I     1038 wired
    cd0                                           cpu2 softnet             pdfre
    sd0         685 8390K  76.9               100 cpu2 timer               pdscn
    sd1         669 8330K  72.6                   FPU flush IPI
    sd2         671 8349K  73.4                   FPU synch IPI
    sd3         679 8339K  73.8              2403 TLB shootdown I
    sd4         669 8296K  71.6                   cpu3 softnet
    sd5         669 8301K  72.2               100 cpu3 timer
    sd6                                           FPU flush IPI
    sd7         556 8896K  57.6                   FPU synch IPI
    sd8         556 8893K  58.0              2412 TLB shootdown I
    sd9         556 8893K  58.2                   ioapic0 pin 4
   sd10         556 8890K  58.6                   ioapic0 pin 6
   sd11         556 8893K  59.8             13238 ioapic2 pin 0
   sd12         556 8893K  61.4                13 ioapic2 pin 6
   sd13                                           ioapic1 pin 6
   sd14                                           ioapic0 pin 15
   sd15
  raid0         575   20M 100.0
  raid1
  raid2

 root@raidtest2:~# date
 Mon Sep  4 15:58:02 CEST 2006
 root@raidtest2:~# 

 took 1 minute; note that the systat vmstat display above has been frozen for 20 minutes already.

 root@raidtest2:~# netstat -anfinet|grep EST
 tcp        0      0  194.109.0.22.22        194.109.0.97.63281     ESTABLISHED
 tcp        0      0  194.109.0.22.22        194.109.0.97.63327     ESTABLISHED
 tcp        0     80  194.109.0.22.22        194.109.0.97.63328     ESTABLISHED

 on the console took 30 seconds, just to see that the connections are still there. And now
 (16:00) the system suddenly starts going again; rsync, iostat and systat just continue

 >  
 >  >
 >  > dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt
 >  >
 >  > Not production system, so I can test stuff.
 >  >> How-To-Repeat:
 >  > 	
 >  >> Fix:
 >  > 	
 >  >
 >  >> Unformatted:
 >  >  	
 >  >  	
 >  
 >  
 >  
 >  	-- Rui Paulo
 >  
 >  
 > 

 -- 
 Bert Kiers
 XS4All UNIX systeembeheerder, lockpicker & techno anarchist

From: Bert Kiers <kiers@original.xs4all.nl>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org, kiersb@xs4all.net
Subject: Re: kern/34461: multiple problems; ioapic related?
Date: Wed, 13 Sep 2006 13:41:07 +0200

 Well, I reformatted the destination disk with UFS and everything is fine now.
 So, it is an LFS problem.

 -- 
 Bert Kiers, !MCSE && 0xFF, frique d'ordinateur

State-Changed-From-To: open->closed
State-Changed-By: ad@NetBSD.org
State-Changed-When: Wed, 30 Apr 2008 14:33:59 +0000
State-Changed-Why:
appears not to be an ioapic problem


State-Changed-From-To: closed->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 30 Apr 2008 18:26:35 +0000
State-Changed-Why:
Just because it's an lfs problem doesn't mean the PR should be closed...
I've updated the synopsis accordingly.


From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Wed, 30 Apr 2008 19:36:01 +0100

 On Wed, Apr 30, 2008 at 06:26:37PM +0000, dholland@NetBSD.org wrote:

 > Synopsis: multiple problems; lfs-related
 > 
 > State-Changed-From-To: closed->open
 > State-Changed-By: dholland@NetBSD.org
 > State-Changed-When: Wed, 30 Apr 2008 18:26:35 +0000
 > State-Changed-Why:
 > Just because it's an lfs problem doesn't mean the PR should be closed...
 > I've updated the synopsis accordingly.

 1. Re-opening the PR without even a note was rude.

 2. There is scant diagnostic information in this PR, other than
    evidence of KVM starvation and a statement indicating that LFS is
    broken and not suitable for use on production systems. Any dog on
    the street could tell you that.

 Andrew

From: David Holland <dholland-bugs@netbsd.org>
To: Andrew Doran <ad@netbsd.org>, gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, kiersb@xs4all.net
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Wed, 30 Apr 2008 21:53:47 +0000

 On Wed, Apr 30, 2008 at 06:40:02PM +0000, Andrew Doran wrote:
  >> Just because it's an lfs problem doesn't mean the PR should be closed...
  >> I've updated the synopsis accordingly.
  >  
  >  1. Re-opening the PR without even a note was rude.

 Well, it *is* a note, but I take your point. It is unnecessarily
 abrupt. I apologize.

  >  2. There is scant diagnostic information in this PR, other than
  >     evidence of KVM starvation and a statement indicating that LFS is
  >     broken and not suitable for use on production systems. Any dog on
  >     the street could tell you that.

 Well, yes and no. It contains a method for breaking LFS. When/if
 someone sits down to tackle the various outstanding problems in LFS,
 this is something that they'll want to try. So the PR ought to be left
 open so that person (which stands a good chance of being me) will be
 able to find it.

 If at that point it doesn't apply any more or turns out to depend on
 setup details that aren't available, we can always close it then.

 As much as it's nice to reduce the total PR count, it doesn't do us
 any good in the long run to be too aggressive about it.

 -- 
 David A. Holland
 dholland@netbsd.org

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 12 Jul 2014 17:20:30 +0000
State-Changed-Why:
If there's anyone still on the other end of this... is there any reason to
believe this wasn't a driver or hardware issue? The fact that the problem
starts with a driver-level error makes me think it probably isn't LFS.

Reformatting with FFS and no longer seeing the issue doesn't really prove
anything as FFS has much different I/O patterns, and in particular when
rsyncing LFS will be ramming a lot more data down the disk system's throat.
(Or at least, a lot more at once.) If there was a load- or timing-dependent
problem at the disk level it's quite possible that FFS wouldn't trigger it,
especially if you weren't using softupdates.


From: Bert Kiers <kiersb@xs4all.net>
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, netbsd-bugs@NetBSD.org,
        gnats-admin@NetBSD.org, dholland@NetBSD.org
Cc: 
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Sun, 13 Jul 2014 14:55:39 +0200

 On 7/12/14 7:20 PM, dholland@NetBSD.org wrote:
 > Synopsis: multiple problems; lfs-related
 > 
 > State-Changed-From-To: open->feedback
 > State-Changed-By: dholland@NetBSD.org
 > State-Changed-When: Sat, 12 Jul 2014 17:20:30 +0000
 > State-Changed-Why:
 > If there's anyone still on the other end of this... is there any reason to
 > believe this wasn't a driver or hardware issue? The fact that the problem
 > starts with a driver-level error makes me think it probably isn't LFS.
 > 
 > Reformatting with FFS and no longer seeing the issue doesn't really prove
 > anything as FFS has much different I/O patterns, and in particular when
 > rsyncing LFS will be ramming a lot more data down the disk system's throat.
 > (Or at least, a lot more at once.) If there was a load- or timing-dependent
 > problem at the disk level it's quite possible that FFS wouldn't trigger it,
 > especially if you weren't using softupdates.

 System is scrapped. If somebody is really interested I could retry with
 a newer computer, NetBSD-current and the same box of disks.


 -- 
 Bert Kiers
 XS4ALL UNIX systeembeheerder, suspected terrorist
 1984 was not meant as a manual

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Sun, 13 Jul 2014 18:42:21 +0000

 On Sun, Jul 13, 2014 at 02:35:00PM +0000, Bert Kiers wrote:
  >  On 7/12/14 7:20 PM, dholland@NetBSD.org wrote:
  >> If there's anyone still on the other end of this... is there any
  >> reason to believe this wasn't a driver or hardware issue? The fact
  >> that the problem starts with a driver-level error makes me think
  >> it probably isn't LFS.
  >> 
  >> Reformatting with FFS and no longer seeing the issue doesn't
  >> really prove anything as FFS has much different I/O patterns, and
  >> in particular when rsyncing LFS will be ramming a lot more data
  >> down the disk system's throat.  (Or at least, a lot more at once.)
  >> If there was a load- or timing-dependent problem at the disk level
  >> it's quite possible that FFS wouldn't trigger it, especially if
  >> you weren't using softupdates.
  >  
  >  System is scrapped. If somebody is really interested I could retry with
  >  newer computer, NetBSD-current and same box with disks.

 Looking at it some more, I have the following conjecture:

  - Back in 2009 when WAPBL was new, it sometimes under load exhibited
 a dysfunctional operating state where it would be madly writing the
 same blocks over and over again and making very little real progress.

  - This turned out to be not WAPBL-specific but also possible (just
 much harder to get into) with regular FFS.

  - It was caused by bad dynamic behavior logic in the syncer that was
 triggered by the disks getting behind on the pending I/O.

  - It got fixed; some of the fix was FS-independent, but it isn't
 clear to me (without digging a lot deeper) how much.

 I think it's possible that you were seeing an LFS version of this same
 behavior. With the size of the RAID you had/have, it's quite plausible
 that the flood of writes this problem produces would render the
 system as slow as described.

 If so, it might now be fixed... or it might not. It would probably be
 interesting to find out, but see below.

 The allocation failure message that started it is scsipi-level; it
 means that the allocation pool for xs structures ran out. In the
 current code (this doesn't seem to have changed) this causes a
 half-second delay. It is quite likely that a sudden half-second delay
 while running at peak throughput would be enough to trigger the
 dysfunctional state described above, if it existed in LFS at the time.
 (Given that the message appeared only once, the half-second delay
 itself can't be the performance problem.)
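
 The back-off described above can be modelled with a toy simulation.
 This is only a sketch in Python, not the scsipi code: the function
 name, the pool size and the service time are invented for
 illustration, and only the half-second retry delay comes from the
 text. A single submitter issues requests back to back; when every
 xfer in the pool is in flight it waits for the retry delay before
 trying again.

```python
def simulate(n_requests, pool_size, service_time, retry_delay=0.5):
    """Return (finish_time, stalls) for one submitter using simulated time.

    Hypothetical model: each request holds one pool entry for
    service_time seconds; if the pool is exhausted the submitter backs
    off for retry_delay seconds before retrying.
    """
    now = 0.0
    in_flight = []  # completion times of entries currently held
    stalls = 0
    for _ in range(n_requests):
        # Release entries whose I/O has completed by now.
        in_flight = [t for t in in_flight if t > now]
        while len(in_flight) >= pool_size:
            # Pool exhausted: fixed back-off, then look again.
            stalls += 1
            now += retry_delay
            in_flight = [t for t in in_flight if t > now]
        in_flight.append(now + service_time)
    return max(in_flight, default=now), stalls


print(simulate(10, pool_size=4, service_time=0.1))   # pool too small
print(simulate(10, pool_size=10, service_time=0.1))  # pool large enough
```

 In this model each exhaustion costs a full retry delay even when the
 outstanding I/O completes much sooner, which is the kind of sudden
 pause at peak throughput the paragraph above refers to.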

 The problem could be something else entirely, though; e.g. something
 in the way LFS prepares segments, or something silly in lfs_putpages.
 Or this could be the same as PR 35187. Or a raidframe issue. Also, the
 allocation failure might conceivably be a red herring and not actually
 related at all; or the trigger (or even the problem) might be some
 other allocation failure that doesn't print anything.

 Trying to replicate this on a modern machine (with much more RAM and
 faster disks) might be much harder... or much easier. Even if my
 conjecture's correct, it's hard to guess. It will, probably, be harder
 to get the triggering allocation failure; you might have to insert
 fault injection code for that. If my conjecture's correct, without the
 half-second delay the problem might well not appear. There's some
 chance (especially if my conjecture's wrong) that it'll turn out to be
 easy to reproduce, but this doesn't seem too likely.
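
 As an illustration of what fault injection for the allocation failure
 could look like, here is a minimal Python sketch. It is a model of the
 idea only: the class and function names are hypothetical and nothing
 here corresponds to an existing NetBSD interface. A counter forces the
 stand-in allocator to fail on chosen calls, so the failure path can be
 exercised on demand instead of waiting for real pool exhaustion.

```python
class FailNth:
    """Hypothetical fault injector: force a failure on chosen call numbers."""

    def __init__(self, fail_on):
        self.fail_on = set(fail_on)
        self.calls = 0

    def should_fail(self):
        # Count every call (1-based) and report whether this one must fail.
        self.calls += 1
        return self.calls in self.fail_on


def get_xfer(injector):
    """Stand-in allocator: returns None when the injector forces a failure."""
    if injector.should_fail():
        return None
    return object()


injector = FailNth(fail_on={3})
results = [get_xfer(injector) is not None for _ in range(5)]
print(results)  # [True, True, False, True, True]
```

 A deterministic trigger like this makes a run repeatable; a random
 failure rate would be an alternative if the exact timing matters less.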

 So I would say: unless you're interested in working on LFS, trying to
 replicate it probably isn't worthwhile; it will take a fair amount of
 effort and isn't that likely to produce conclusive results.

 If you *are* interested in working on LFS, by all means go ahead
 though :-)


 (There's another unrelated bug: it seems that hitting that half-second
 delay causes mishandling of the iostat counters. However, that
 shouldn't matter much.)

 -- 
 David A. Holland
 dholland@netbsd.org

State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 13 Jul 2014 21:23:29 +0000
State-Changed-Why:
Feedback received, plus I came up with a theory.


From: Bert Kiers <kiers@original.xs4all.nl>
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	netbsd-bugs@NetBSD.org, gnats-admin@NetBSD.org, dholland@NetBSD.org
Cc: 
Subject: Re: kern/34461 (multiple problems; lfs-related)
Date: Mon, 11 Aug 2014 17:37:17 +0200

 On Sun, Jul 13, 2014 at 02:55:39PM +0200, Bert Kiers wrote:
 > On 7/12/14 7:20 PM, dholland@NetBSD.org wrote:
 > > Synopsis: multiple problems; lfs-related
 > > 
 > > State-Changed-From-To: open->feedback
 > > State-Changed-By: dholland@NetBSD.org
 > > State-Changed-When: Sat, 12 Jul 2014 17:20:30 +0000
 > > State-Changed-Why:
 > > If there's anyone still on the other end of this... is there any reason to
 > > believe this wasn't a driver or hardware issue? The fact that the problem
 > > starts with a driver-level error makes me think it probably isn't LFS.
 > > 
 > > Reformatting with FFS and no longer seeing the issue doesn't really prove
 > > anything as FFS has much different I/O patterns, and in particular when
 > > rsyncing LFS will be ramming a lot more data down the disk system's throat.
 > > (Or at least, a lot more at once.) If there was a load- or timing-dependent
 > > problem at the disk level it's quite possible that FFS wouldn't trigger it,
 > > especially if you weren't using softupdates.
 > 
 > System is scrapped. If somebody is really interested I could retry with
 > newer computer, NetBSD-current and same box with disks.

 I cannot reproduce this problem (same box of disks, new NetBSD, new
 other hardware).

 Grtnx,
 -- 
 B*E*R*T

>Unformatted:
