NetBSD Problem Report #39242

From Wolfgang.Stukenbrock@nagler-company.com  Mon Jul 28 21:08:48 2008
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 515D063B91E
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 28 Jul 2008 21:08:48 +0000 (UTC)
Message-Id: <20080728180121.6784479B8D@s040.nagler-company.com>
Date: Mon, 28 Jul 2008 20:01:21 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
X-Send-Pr-Version: 3.95

>Number:         39242
>Category:       kern
>Synopsis:       NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Jul 28 21:10:00 +0000 2008
>Last-Modified:  Thu Jul 31 11:00:02 +0000 2008
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 4.0_STABLE
>Organization:
Dr. Nagler & Company GmbH

>Environment:


System: NetBSD s040 4.0_STABLE NetBSD 4.0_STABLE (NSW-S040) #7: Fri Jul 25 16:02:03 CEST 2008 root@s040:/usr/src/sys/arch/amd64/compile/NSW-S040 amd64
Architecture: x86_64
Machine: amd64
>Description:
	After 4 GB of main memory is in use, allocating another physical page fails and
	the pagedaemon is kicked. But according to the statistics there is still about 3.9 GB of free memory, so
	the pagedaemon does nothing.
	The system comes to a sudden stop at this point and nothing works anymore.
>How-To-Repeat:
	Set up a machine with e.g. 8 GB RAM.
	Then either start some large processes (e.g. "dd if=.. of=... bs=1024000k") until they need more than
	4 GB of memory, or unpack large archives into the filesystem so that the filesystem cache eats up
	4 GB of memory.
	Run vmstat or top in parallel and you will see that there is around 3.9 GB of free memory, but
	it is not allocated, for unknown reasons.
	The pagedaemon will show 100% activity in top (on all CPUs in the system after a while), if top
	still gets updates, until everything freezes.
	I've found no way to get the system out of this state without pressing the reset button.
>Fix:
	Not known to me so far.
	I've tried VMHIST, but that does not really help ...
	Some additional printouts in the pagedaemon routines show that the pagedaemon is doing nothing,
	because there is still enough free physical memory.
	The problem seems to be related to the 4 GB boundary: other systems with identical hardware but only 4 GB of RAM
	don't show this behaviour.

>Audit-Trail:
From: Simon Burge <simonb@NetBSD.org>
To: Wolfgang.Stukenbrock@nagler-company.com, gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory 
Date: Tue, 29 Jul 2008 14:19:17 +1000

 Wolfgang.Stukenbrock@nagler-company.com wrote:

 > >Synopsis:       NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory

 What does "vmstat -s | grep colors" show on this machine?  If it's a
 reasonably recent Intel CPU with 6MB or 12MB of L2 cache, this will
 probably not be a power of two and would explain what you're seeing.

 In -current this was fixed by both fixing the cache detection stuff
 and rev 1.32 of sys/arch/x86/x86/cpu.c.

 Cheers,
 Simon.

From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
Date: Tue, 29 Jul 2008 09:07:09 +0200

 Hi,

 yes, it is an E3110 CPU with 6MB cache.
 Which files do I need to fetch from -current to integrate the changes 
 into my 4.0 version of NetBSD?
 You mention the "cache detection stuff" AND the file named below.

 I think it would be a really great idea to bring this fix into the 
 releases as soon as possible.
 This CPU is "very" cheap compared to the other ones (at least in 
 Germany) and is therefore very attractive for new systems.

 W. Stukenbrock

 Simon Burge wrote:

 > The following reply was made to PR kern/39242; it has been noted by GNATS.
 > 
 > From: Simon Burge <simonb@NetBSD.org>
 > To: Wolfgang.Stukenbrock@nagler-company.com, gnats-bugs@NetBSD.org
 > Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
 >     netbsd-bugs@netbsd.org
 > Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory 
 > Date: Tue, 29 Jul 2008 14:19:17 +1000
 > 
 >  Wolfgang.Stukenbrock@nagler-company.com wrote:
 >  
 >  > >Synopsis:       NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
 >  
 >  What does "vmstat -s | grep colors" show on this machine?  If it's a
 >  reasonably recent Intel CPU with 6MB or 12MB of L2 cache, this will
 >  probably not be a power of two and would explain what you're seeing.
 >  
 >  In -current this was fixed by both fixing the cache detection stuff
 >  and rev 1.32 of sys/arch/x86/x86/cpu.c.
 >  
 >  Cheers,
 >  Simon.
 >  
 > 


From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
Date: Tue, 29 Jul 2008 09:14:27 +0200

 Hi again,

 Sorry, I forgot to include the requested output of vmstat ...

 here it is:


 s040# vmstat -s | grep colors

         96 page colors

 s040#



From: Simon Burge <simonb@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, Wolfgang.Stukenbrock@nagler-company.com
Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory 
Date: Tue, 29 Jul 2008 23:21:47 +1000

 Wolfgang Stukenbrock wrote:

 > The following reply was made to PR kern/39242; it has been noted by GNATS.
 > 
 > From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
 > Date: Tue, 29 Jul 2008 09:07:09 +0200
 > 
 >  Hi,
 >  
 >  yes, it is an E3110 CPU with 6MB cache.
 >  Which files do I need to fetch from -current to integrate the changes 
 >  into my 4.0 version of NetBSD?
 >  You mention the "cache detection stuff" AND the file named below.
 >  
 >  I think it would be a really great idea to bring this fix into the 
 >  releases as soon as possible.
 >  This CPU is "very" cheap compared to the other ones (at least in 
 >  Germany) and is therefore very attractive for new systems.

 Does rev 1.32 of sys/arch/x86/x86/cpu.c apply cleanly to netbsd-4 ?  I
 don't recall which bits of which files moved around with x86 recently.
 If so, that at least guarantees that a bogus number doesn't get passed
 deeper into UVM, and that'll be enough for netbsd-4.

 Note also the problem was to do with "half memory used", not "4G of
 memory used".

 Cheers,
 Simon.

From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: Simon Burge <simonb@NetBSD.org>, kern-bug-people@NetBSD.org
Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
Date: Wed, 30 Jul 2008 20:13:32 +0200

 Hi again,

 I've seen response to my last mail directly to Simon, but I continue 
 testing my system.

 Here is the patch I've added to /usr/src/sys/amd64/amd64/cpu.c:

 s012# rcsdiff -c -r1.1 c*
 ===================================================================
 RCS file: RCS/cpu.c,v
 retrieving revision 1.1
 diff -c -r1.1 cpu.c
 *** cpu.c       2008/07/29 07:28:03     1.1
 --- cpu.c       2008/07/29 08:05:21
 ***************
 *** 217,222 ****
 --- 217,239 ----
                          tcolors /= cai->cai_associativity;
                  }
                  ncolors = max(ncolors, tcolors);
 +               /*
 +                * If the desired number of colors is not a power of
 +                * two, it won't be good.  Find the greatest power of
 +                * two which is an even divisor of the number of colors,
 +                * to preserve even coloring of pages.
 +                */
 +               if (ncolors & (ncolors - 1) ) {
 +                       int try, picked = 1;
 +                       for (try = 1; try < ncolors; try *= 2) {
 +                               if (ncolors % try == 0) picked = try;
 +                       }
 +                       if (picked == 1) {
 +                               panic("desired number of cache colors %d is "
 +                               " > 1, but not even!", ncolors);
 +                       }
 +                       ncolors = picked;
 +               }
          }

          /*


 Just a few minutes ago, I got two new kernel crashes.

 1. The kernel process [scsibus0] starts looping and sometimes sleeps in 
 pglalloc. After reinstalling the system with the patch above, there was 
 no DDB in the kernel, so I could not get any further information.

 2. Now I have DDB in the kernel and tried to reproduce the problem, but it 
 crashed in the pagedaemon before reaching this state ...
 Some output from the console is below:


 uvm_fault(0xffffffff80628800, 0x0, 1) -> e
 kernel: page fault trap, code=0
 Stopped in pid 26.1 (pagedaemon) at     netbsd:uvm_rb_insert+0x37:  movq    0x40(%rax),%rax
 db{0}> trace
 uvm_rb_insert() at netbsd:uvm_rb_insert+0x37
 uvm_map_enter() at netbsd:uvm_map_enter+0x290
 uvm_map() at netbsd:uvm_map+0xfe
 uvm_pagermapin() at netbsd:uvm_pagermapin+0x92
 uvm_swap_io() at netbsd:uvm_swap_io+0x3c
 swapcluster_flush() at netbsd:swapcluster_flush+0x55
 uvm_pageout() at netbsd:uvm_pageout+0x42b
 db{0}> show uvmexp
 Current UVM status:
    pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
    2039030 VM pages: 1333964 active, 651446 inactive, 2248 wired, 5 free
    pages  1674503 anon, 311641 file, 1580 exec
    freemin=64, free-target=85, wired-max=679676
    faults=3420439, traps=3975116, intrs=8916183, ctxswitch=14697719
    softint=710475, syscalls=12629494, swapins=177, swapouts=210
    fault counts:
      noram=5, noanon=0, pgwait=0, pgrele=0
      ok relocks(total)=1068(1069), anget(retrys)=73451(286), amapcopy=31017
      neighbor anon/obj pg=65282/474220, gets(lock/unlock)=116248/783
      cases: anon=51370, anoncow=21184, obj=96126, prcopy=20121, przero=63798
    daemon and swap counts:
      woke=32081, revs=4228, scans=1204885, obscans=1161315, anscans=2987
      busy=0, freed=1164012, reactivate=128, deactivate=1857899
      pageouts=75209, pending=1088875, nswget=78493
      nswapdev=1, swpgavail=6291455
      swpages=6291455, swpginuse=1164021, swpgonly=1085489, paging=66
 db{0}>


 This patch seems to enable the system to work with the 6MB cache of the 
 E3110 CPU, but the kernel is not really stable at all.
 Any ideas? What should I try next?

 By the way, "vmstat -s | grep colo" reports 32 colors now.
 The system was busy again when the crash happened: a raidframe sync on one 
 SATA RAID and one SCSI RAID, and something around 12 GB transferred into /tmp 
 (tmpfs), so nearly 5 GB of the 24 GB swap was used.
 The kernel crashed at the moment I tried to copy one of the 
 archives from /tmp to a filesystem on the RAID that was just syncing.

 OK, that may be a lot of work for the system, and it may get slow, but it 
 must not crash!
 I've failed to get a core image this time - sorry.
 "continue" does not work and the system freezes in sync from DDB ...

 By the way: I've saved the 8 GB core image from the crash below, but 
 that one still has 96 page colors active - the patch was missing. I think 
 it makes no sense to look at it at all.
 If nobody comes to me in the next few days with a request for it, I will 
 remove it.


 W. Stukenbrock

 Wolfgang Stukenbrock wrote:

 > Hi,
 > 
 > I took the diffs from x86/x86/cpu.c rev 1.32 and merged them into 
 > amd64/amd64/cpu.c - there is no x86/x86/cpu.c in 4.0 ...
 > 
 > It looks like it will solve the problem.
 > Thanks
 > 
 > The system no longer freezes after using something around 4 GB of 
 > memory (of the 8 GB installed ...).
 > 
 > But I've noticed that under (very) heavy load the system will panic 
 > with "out of memory" in the pagedaemon.
 > I have an 8 GB core file here - is anybody interested in analysing it? I 
 > think it will not pass through any mail system ... (the bzip2 
 > compressed version is still larger than 2 GB, and compression is still running ...)
 > 
 > In DDB exactly one page was reported as free in "show uvmexp" - 
 > the output follows:
 > 
 > db{1}> show uvmexp
 > Current UVM status:
 >   pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
 >   2039036 VM pages: 1119946 active, 547095 inactive, 4031 wired, 1 free
 >   pages  1328086 anon, 341555 file, 1764 exec
 >   freemin=64, free-target=85, wired-max=679678
 >   faults=13246446, traps=18516116, intrs=17316633, ctxswitch=64031864
 >   softint=6934242, syscalls=65761334, swapins=431, swapouts=1372
 >   fault counts:
 >     noram=2008, noanon=0, pgwait=0, pgrele=0
 >     ok relocks(total)=2409(2418), anget(retrys)=12931894(1119), amapcopy=365963
 >     neighbor anon/obj pg=702333/4863720, gets(lock/unlock)=1135392/1299
 >     cases: anon=6302100, anoncow=256103, obj=947341, prcopy=188007, przero=2437247
 >   daemon and swap counts:
 >     woke=9819, revs=6729, scans=2176414, obscans=1733276, anscans=143716
 >     busy=0, freed=1876669, reactivate=14780, deactivate=3760043
 >     pageouts=9330, pending=134669, nswget=1111
 >     nswapdev=1, swpgavail=6291455
 >     swpages=6291455, swpginuse=143978, swpgonly=142546, paging=323
 > db{1}> trace
 > cpu_Debugger() at netbsd:cpu_Debugger+0x5
 > panic() at netbsd:panic+0x1f5
 > pmap_growkernel() at netbsd:pmap_growkernel+0x446
 > uvm_map_prepare() at netbsd:uvm_map_prepare+0x371
 > uvm_map() at netbsd:uvm_map+0xae
 > uvm_km_alloc() at netbsd:uvm_km_alloc+0x73
 > vmem_xalloc() at netbsd:vmem_xalloc+0x130
 > vmem_alloc() at netbsd:vmem_alloc+0x86
 > amap_alloc() at netbsd:amap_alloc+0xdb
 > uvm_map_enter() at netbsd:uvm_map_enter+0x24b
 > uvm_map() at netbsd:uvm_map+0xfe
 > sys_obreak() at netbsd:sys_obreak+0x106
 > syscall_plain() at netbsd:syscall_plain+0x1fc
 > uvm_fault(0xffff8000a666f440, 0x6907000, 1) -> e
 > kernel: page fault trap, code=0
 > Faulted in DDB; continuing...
 > db{1}> continue
 > syncing disks... 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 giving up
 > 
 > dumping to dev 18,1 offset 33560487
 > dump 8189 8188 8187 8186 8185 8184 8183 8182 8181 8180 8179 8178 8177 .....
 > 
 > I've started more than 1000 processes to see what happens if "some" 
 > memory is needed for processes. (cat ... | dd obs=... | dd obs=... | ... 
 >  >/dev/null)
 > The system reduces the amount of memory used by the file cache to 
 > about 1.2 GB of the 8 GB main memory - as expected.
 > At the time of the crash around 800 MB of the 24 GB swap 
 > space was in use.
 > 
 > I know that this is not related to the previous problem. Does it make 
 > sense to create another bug report for that? I'm not sure about it.
 > 
 > W. Stukenbrock
 > 


From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
Date: Thu, 31 Jul 2008 10:24:28 +0200

 Hi,

 I think I've located the problem with the looping scsibus0 kernel process!

 I've added some printf()s to the uvm_pglistalloc_simple() routine that 
 start printing once the routine is about to sleep waiting for more memory.
 The problem happened after unpacking a tar file into the filesystem on the 
 SCSI disks and then calling sync to flush the cache to disk.
 I've got the following output:


 plistalloc - waiting orig num 1 - num 1 low 0x1000000 high 0x100000000 - free 245 pd_res 1 kres 5
 plistalloc - loop fl 0 psi 3 ps-fl 1 num 1
 plistalloc - loop fl 0 psi 2 ps-fl 1 num 1
 plistalloc - loop fl 0 psi 1 ps-fl 0 num 1
 plistalloc - loop fl 0 psi 0 ps-fl 0 num 1
 plistalloc - loop fl 1 psi 3 ps-fl 1 num 1
 plistalloc - loop fl 1 psi 2 ps-fl 1 num 1
 plistalloc - loop fl 1 psi 1 ps-fl 0 num 1
 plistalloc - loop fl 1 psi 0 ps-fl 0 num 1
 plistalloc - waiting orig num 1 - num 1 low 0x1000000 high 0x100000000 - free 245 pd_res 1 kres 5
 plistalloc - loop fl 0 psi 3 ps-fl 1 num 1
 plistalloc - loop fl 0 psi 2 ps-fl 1 num 1
 plistalloc - loop fl 0 psi 1 ps-fl 0 num 1
 plistalloc - loop fl 0 psi 0 ps-fl 0 num 1
 plistalloc - loop fl 1 psi 3 ps-fl 1 num 1
 plistalloc - loop fl 1 psi 2 ps-fl 1 num 1
 plistalloc - loop fl 1 psi 1 ps-fl 0 num 1
 plistalloc - loop fl 1 psi 0 ps-fl 0 num 1
 plistalloc - waiting orig num 1 - num 1 low 0x1000000 high 0x100000000 - free 245 pd_res 1 kres 5
 plistalloc - loop fl 0 psi 3 ps-fl 1 num 1
 plistalloc - loop fl 0 psi 2 ps-fl 1 num 1
 plistalloc - loop fl 0 psi 1 ps-fl 0 num 1
 plistalloc - loop fl 0 psi 0 ps-fl 0 num 1
 plistalloc - loop fl 1 psi 3 ps-fl 1 num 1
 plistalloc - loop fl 1 psi 2 ps-fl 1 num 1
 plistalloc - loop fl 1 psi 1 ps-fl 0 num 1
 plistalloc - loop fl 1 psi 0 ps-fl 0 num 1
 ...

 and so on, endlessly ....

 The controller seems to request one additional page in the range from 
 0x1000000 to 0x100000000, i.e. at an address below 4 GB.
 The pagedaemon is kicked but does nothing, because it does not 
 know anything about the range in which the required memory must reside.
 This looks like a conceptual problem in 4.0 to me.

 Some additional information from DDB, after I got the system into the 
 debugger ...

 Stopped in pid 28.1 (aiodoned) at       netbsd:cpu_Debugger+0x5:  leave
 db{0}> trace
 cpu_Debugger() at netbsd:cpu_Debugger+0x5
 comintr() at netbsd:comintr+0x6e0
 Xintr_ioapic_edge4() at netbsd:Xintr_ioapic_edge4+0xd4
 --- interrupt ---
 _kernel_lock() at netbsd:_kernel_lock+0xad
 intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x16
 Xintr_ioapic_level10() at netbsd:Xintr_ioapic_level10+0xd8
 --- interrupt ---
 Xspllower() at netbsd:Xspllower+0xe
 DDB lost frame for netbsd:Xsoftclock+0x1a, trying 0xffff800056ff4db8
 Xsoftclock() at netbsd:Xsoftclock+0x1a
 --- interrupt ---
 0x56ff4e30:
 db{0}> show uvmexp
 Current UVM status:
    pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
    2039030 VM pages: 1276109 active, 623406 inactive, 2284 wired, 372 free
    pages  1470432 anon, 429766 file, 1604 exec
    freemin=64, free-target=85, wired-max=679676
    faults=6949833, traps=7652305, intrs=16596032, ctxswitch=30047566
    softint=6211694, syscalls=17975420, swapins=287, swapouts=306
    fault counts:
      noram=112, noanon=0, pgwait=11, pgrele=0
       ok relocks(total)=1590(1591), anget(retrys)=1256742(505), amapcopy=452013
       neighbor anon/obj pg=900772/6909020, gets(lock/unlock)=1606322/1086
       cases: anon=966859, anoncow=288741, obj=1336907, prcopy=269411, przero=865860
    daemon and swap counts:
      woke=67411, revs=5048, scans=1391672, obscans=1386891, anscans=2492
      busy=0, freed=1389383, reactivate=103, deactivate=2015339
      pageouts=89708, pending=1299675, nswget=99734
      nswapdev=1, swpgavail=6291455
      swpages=6291455, swpginuse=1389021, swpgonly=1289440, paging=0
 db{0}>

 I failed to get a stack-listing of the scsibus0 process

 ps output:
 9                0        0          0 2 0x20200    1         scsibus0

 but "trace/t 9" hangs up and I'm not able to get back into DDB


 For better understanding, here are the printf() statements I've inserted into 
 uvm_pglistalloc_simple():

 XXX - start of modified routine ...

 static int
 uvm_pglistalloc_simple(int num, paddr_t low, paddr_t high,
      struct pglist *rlist, int waitok)
 {
          int fl, psi, s, error;
          struct vm_physseg *ps;
 int o_num = num;
 int xx = 0;

          /* Default to "lose". */
          error = ENOMEM;

 again:
          /*
           * Block all memory allocation and lock the free list.
           */
          s = uvm_lock_fpageq();

          /* Are there even any free pages? */
          if (uvmexp.free <= (uvmexp.reserve_pagedaemon +
              uvmexp.reserve_kernel))
                  goto out;

          for (fl = 0; fl < VM_NFREELIST; fl++) {
 #if (VM_PHYSSEG_STRAT == VM_PSTRAT_BIGFIRST)
                  for (psi = vm_nphysseg - 1 ; psi >= 0 ; psi--)
 #else
                  for (psi = 0 ; psi < vm_nphysseg ; psi++)
 #endif
                  {
                          ps = &vm_physmem[psi];
 if (xx != 0) printf("plistalloc - loop fl %d psi %d ps-fl %d num %d\n",
     fl, psi, ps->free_list, num);

                          if (ps->free_list != fl)
                                  continue;

                          num -= uvm_pglistalloc_s_ps(ps, num, low, high,
                              rlist);
                          if (num == 0) {
                                  error = 0;
                                  goto out;
                          }
                  }

          }

 out:
          /*
           * check to see if we need to generate some free pages waking
           * the pagedaemon.
           */

          uvm_kick_pdaemon();
          uvm_unlock_fpageq(s);
          if (error) {
                  if (waitok) {
                          /* XXX perhaps some time limitation? */
 #ifdef DEBUG
                          printf("pglistalloc waiting\n");
 #endif
 printf("plistalloc - waiting orig num %d - num %d low 0x%lx high 0x%lx - "
     "free %d pd_res %d kres %d\n",
     o_num, num, low, high, uvmexp.free, uvmexp.reserve_pagedaemon,
     uvmexp.reserve_kernel);
 xx = 1;
                          uvm_wait("pglalloc");
                          goto again;
                  } else
                          uvm_pglistfree(rlist);
          }
 #ifdef PGALLOC_VERBOSE
          if (!error)
                  printf("pgalloc: %lx..%lx\n",
                         VM_PAGE_TO_PHYS(TAILQ_FIRST(rlist)),
                         VM_PAGE_TO_PHYS(TAILQ_LAST(rlist, pglist)));
 #endif
          return (error);
 }

 XXX - end of modified routine ...




 I think we need something like a list of ranges where memory is needed, 
 so that the pagedaemon can free some memory in those ranges.

 At the moment, any system with more than 4 GB of RAM may run into this 
 problem - the SCSI controller is an Adaptec 29160A-R.

 I will remove 4 GB of memory from the system in order to get it stable, 
 but that cannot be the final solution.


 The problem switches back from the "CPU cache problem" to my initial subject 
 for some supported controllers.

 best regards

 W. Stukenbrock

 Wolfgang Stukenbrock wrote:

 > The following reply was made to PR kern/39242; it has been noted by GNATS.
 > 
 > From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
 > To: gnats-bugs@NetBSD.org
 > Cc: Simon Burge <simonb@NetBSD.org>, kern-bug-people@NetBSD.org
 > Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
 > Date: Wed, 30 Jul 2008 20:13:32 +0200
 > 
 >  Hi again,
 >  
 >  I've seen responce to my last mail directly to Simon, but I continue 
 >  testing my system.
 >  
 >  here the patch I've added to /usr7src/sys/amd64/amd64/cpu.c:
 >  
 >  s012# rcsdiff -c -r1.1 c*
 >  ===================================================================
 >  RCS file: RCS/cpu.c,v
 >  retrieving revision 1.1
 >  diff -c -r1.1 cpu.c
 >  *** cpu.c       2008/07/29 07:28:03     1.1
 >  --- cpu.c       2008/07/29 08:05:21
 >  ***************
 >  *** 217,222 ****
 >  --- 217,239 ----
 >                           tcolors /= cai->cai_associativity;
 >                   }
 >                   ncolors = max(ncolors, tcolors);
 >  +               /*
 >  +                * If the desired number of colors is not a power of
 >  +                * two, it won't be good.  Find the greatest power of
 >  +                * two which is an even divisor of the number of colors,
 >  +                * to preserve even coloring of pages.
 >  +                */
 >  +               if (ncolors & (ncolors - 1) ) {
 >  +                       int try, picked = 1;
 >  +                       for (try = 1; try < ncolors; try *= 2) {
 >  +                               if (ncolors % try == 0) picked = try;
 >  +                       }
 >  +                       if (picked == 1) {
 >  +                               panic("desired number of cache colors %d 
 >  is "
 >  +                               " > 1, but not even!", ncolors);
 >  +                       }
 >  +                       ncolors = picked;
 >  +               }
 >           }
 >  
 >           /*
 >  
 >  
 >  Just some minutes ago, I've got two new kernel crashes.
 >  
 >  1. the kernel process [scsibus0] starts looping and sleeps sometimes in 
 >  pglalloc. After reinstalling the system with the patch above, there was 
 >  no DDB in the kernel, so I could not get any other information.
 >  
 >  2. now I've DDB in the kernel and tried to reproduce the problem. But it 
 >  crashes prior reaching this state in pagedaemon ...
 >  Some output from the console below:
 >  
 >  
 >  uvm_fault(0xffffffff80628800, 0x0, 1) -> e
 >  kernel: page fault trap, code=0
 >  Stopped in pid 26.1 (pagedaemon) at     netbsd:uvm_rb_insert+0x37: 
 >  movq    0
 >  x40(%rax),%rax
 >  db{0}> trace
 >  uvm_rb_insert() at netbsd:uvm_rb_insert+0x37
 >  uvm_map_enter() at netbsd:uvm_map_enter+0x290
 >  uvm_map() at netbsd:uvm_map+0xfe
 >  uvm_pagermapin() at netbsd:uvm_pagermapin+0x92
 >  uvm_swap_io() at netbsd:uvm_swap_io+0x3c
 >  swapcluster_flush() at netbsd:swapcluster_flush+0x55
 >  uvm_pageout() at netbsd:uvm_pageout+0x42b
 >  db{0}> show uvmexp
 >  Current UVM status:
 >     pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
 >     2039030 VM pages: 1333964 active, 651446 inactive, 2248 wired, 5 free
 >     pages  1674503 anon, 311641 file, 1580 exec
 >     freemin=64, free-target=85, wired-max=679676
 >     faults=3420439, traps=3975116, intrs=8916183, ctxswitch=14697719
 >     softint=710475, syscalls=12629494, swapins=177, swapouts=210
 >     fault counts:
 >       noram=5, noanon=0, pgwait=0, pgrele=0
 >       ok relocks(total)=1068(1069), anget(retrys)=73451(286), amapcopy=31017
 >       neighbor anon/obj pg=65282/474220, gets(lock/unlock)=116248/783
 >       cases: anon=51370, anoncow=21184, obj=96126, prcopy=20121, przero=63798
 >     daemon and swap counts:
 >       woke=32081, revs=4228, scans=1204885, obscans=1161315, anscans=2987
 >       busy=0, freed=1164012, reactivate=128, deactivate=1857899
 >       pageouts=75209, pending=1088875, nswget=78493
 >       nswapdev=1, swpgavail=6291455
 >       swpages=6291455, swpginuse=1164021, swpgonly=1085489, paging=66
 >  db{0}>
 >  
 >  
 >  This patch seems to enable the system to work with the 6MB cache of the 
 >  E3110 CPU, but the kernel is not realy stable at all.
 >  Any idea? What should I try next?
 >  
 >  By the way "vmstat -s | grep colo" reports 32 colors now.
 >  The system was busy again when the crash happens. raidframe sync on one 
 >  SATA-raid and one SCSI-raid, transfered something arund 12 GB int /tmp 
 >  (tmpfs) so nearly 5 GB of the 24 GB swap was used.
 >  The kernel crashes at the moment where I've tried to copy on oth the 
 >  archives from /tmp to a filesystem on the raid just syncing.
 >  
 >  OK that may be a lot of work for the system, and it may get slow, but it 
 >  may not crash!
 >  I've failed to get a core image this time - sorry.
 >  continue does not work and the system freezes in sync from DDB ...
 >  
 >  Bx the way: I've saved the 8 GB core-image from the crash below, but 
 >  that one has still 96 page color active - the patch was missing. I think 
 >  it makes no sence to look at it at all.
 >  If nobody came to me in the next few day with a request for it, I will 
 >  remove it.
 >  
 >  
 >  W. Stukenbrock
 >  
 >  Wolfgang Stukenbrock wrote:
 >  
 >  > Hi,
 >  > 
 >  > I've took the diff's from x86/x86/cpu.c rev 1.32 and merged them into 
 >  > amd64/amd64/cpu.c - there is no x86/x86/cpu.c in 4.0 ...
 >  > 
 >  > It looks like it will solve the problem.
 >  > Thanks
 >  > 
 >  > The system will no longer freeze after using something around 4 GB 
 >  > memory (of the 8 GB installed ...).
 >  > 
 >  > But I've recognized, that under (very) heavy load the system will panic 
 >  > with "out of memory" in the pagedaemon.
 >  > I have a 8GB core file here - anybody interested in analysing ???. I 
 >  > think it will not pass through any mailing system ... (the bzip2 
 >  > compresseed version is still larger 2 GB (compression still running ...)
 >  > 
 >  > In DDB there was exactly on page stated to be free in "show uvmexp" - 
 >  > the output follows:
 >  > 
 >  > db{1}> show uvmexp
 >  > Current UVM status:
 >  >   pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
 >  >   2039036 VM pages: 1119946 active, 547095 inactive, 4031 wired, 1 free
 >  >   pages  1328086 anon, 341555 file, 1764 exec
 >  >   freemin=64, free-target=85, wired-max=679678
 >  >   faults=13246446, traps=18516116, intrs=17316633, ctxswitch=64031864
 >  >   softint=6934242, syscalls=65761334, swapins=431, swapouts=1372
 >  >   fault counts:
 >  >     noram=2008, noanon=0, pgwait=0, pgrele=0
 >  >     ok relocks(total)=2409(2418), anget(retrys)=12931894(1119), amapcopy=365963
 >  >     neighbor anon/obj pg=702333/4863720, gets(lock/unlock)=1135392/1299
 >  >     cases: anon=6302100, anoncow=256103, obj=947341, prcopy=188007, przero=2437247
 >  >   daemon and swap counts:
 >  >     woke=9819, revs=6729, scans=2176414, obscans=1733276, anscans=143716
 >  >     busy=0, freed=1876669, reactivate=14780, deactivate=3760043
 >  >     pageouts=9330, pending=134669, nswget=1111
 >  >     nswapdev=1, swpgavail=6291455
 >  >     swpages=6291455, swpginuse=143978, swpgonly=142546, paging=323
 >  > db{1}> trace
 >  > cpu_Debugger() at netbsd:cpu_Debugger+0x5
 >  > panic() at netbsd:panic+0x1f5
 >  > pmap_growkernel() at netbsd:pmap_growkernel+0x446
 >  > uvm_map_prepare() at netbsd:uvm_map_prepare+0x371
 >  > uvm_map() at netbsd:uvm_map+0xae
 >  > uvm_km_alloc() at netbsd:uvm_km_alloc+0x73
 >  > vmem_xalloc() at netbsd:vmem_xalloc+0x130
 >  > vmem_alloc() at netbsd:vmem_alloc+0x86
 >  > amap_alloc() at netbsd:amap_alloc+0xdb
 >  > uvm_map_enter() at netbsd:uvm_map_enter+0x24b
 >  > uvm_map() at netbsd:uvm_map+0xfe
 >  > sys_obreak() at netbsd:sys_obreak+0x106
 >  > syscall_plain() at netbsd:syscall_plain+0x1fc
 >  > uvm_fault(0xffff8000a666f440, 0x6907000, 1) -> e
 >  > kernel: page fault trap, code=0
 >  > Faulted in DDB; continuing...
 >  > db{1}> continue
 >  > syncing disks... 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 giving up
 >  > 
 >  > dumping to dev 18,1 offset 33560487
 >  > dump 8189 8188 8187 8186 8185 8184 8183 8182 8181 8180 8179 8178 8177 .....
 >  > 
 >  > I started more than 1000 processes to see what happens if "some" 
 >  > memory is needed for processes. (cat ... | dd obs=... | dd obs=... | ... 
 >  >  >/dev/null)
 >  > The system reduces the amount of memory used by the file cache to 
 >  > about 1.2 GB of the 8 GB main memory - as expected.
 >  > At the time of the crash about 800 MB of the 24 GB swap 
 >  > space was in use.
 >  > 
 >  > I know that this is not related to the previous problem. Does it make 
 >  > sense to create another bug report for that? I'm not sure about it.
 >  > 
 >  > W. Stukenbrock
 >  > 
 >  > Simon Burge wrote:
 >  > 
 >  >> Wolfgang Stukenbrock wrote:
 >  >>
 >  >>
 >  >>> The following reply was made to PR kern/39242; it has been noted by 
 >  >>> GNATS.
 >  >>>
 >  >>> From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
 >  >>> To: gnats-bugs@NetBSD.org
 >  >>> Cc: Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang 
 >  >>> on machines with more than 4 GB memory
 >  >>> Date: Tue, 29 Jul 2008 09:07:09 +0200
 >  >>>
 >  >>> Hi,
 >  >>>
 >  >>> yes it is an E3110 CPU with 6MB cache.
 >  >>> Which files do I need to take from -current to integrate the changes 
 >  >>> into my 4.0 version of NetBSD?
 >  >>> You talk about the "cache detection stuff" AND the file named below.
 >  >>>
 >  >>> I think it would be a really good idea to bring this fix into the 
 >  >>> releases as soon as possible.
 >  >>> This CPU is "very" cheap compared to the other ones (at least in 
 >  >>> Germany) and is therefore very attractive for new systems.
 >  >>>
 >  >>
 >  >> Does rev 1.32 of sys/arch/x86/x86/cpu.c apply cleanly to netbsd-4 ?  I
 >  >> don't recall which bits of which files moved around with x86 recently.
 >  >> If so, that at least guarantees that a bogus number doesn't get passed
 >  >> deeper into UVM, and that'll be enough for netbsd-4.
 >  >>
 >  >> Note also the problem was to do with "half memory used", not "4G of
 >  >> memory used".
 >  >>
 >  >> Cheers,
 >  >> Simon.
 >  >>
 >  > 
 >  > 
 >  
 >  
 > 


From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: kern/39242: NetBSD 4.0 will start busy-loop an hang on machines with more than 4 GB memory
Date: Thu, 31 Jul 2008 12:56:33 +0200

 Hi - once again ...

 I've removed 4 GB of memory now and the SCSI-controller seems to work 
 now, but ....

 kernel: protection fault trap, code=0
 Stopped in pid 28.1 (aiodoned) at       netbsd:uvm_tree_RB_REMOVE+0x50: movq    %r14,0x10(%r15)
 db{0}> trace
 uvm_tree_RB_REMOVE() at netbsd:uvm_tree_RB_REMOVE+0x50
 uvm_rb_remove() at netbsd:uvm_rb_remove+0x1c
 uvm_unmap_remove() at netbsd:uvm_unmap_remove+0x179
 uvm_pagermapout() at netbsd:uvm_pagermapout+0x110
 uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xc4
 uvm_aiodone_daemon() at netbsd:uvm_aiodone_daemon+0xd2
 db{0}>

 At the time of the crash top reports:

 load averages:  1.03,  0.47,  0.19              up 0 days,  0:38   12:27:08
 67 processes:  1 runnable, 64 sleeping, 2 on processor
 CPU0 states:  0.0% user,  0.0% nice, 11.9% system,  2.5% interrupt, 85.6% idle
 CPU1 states:  0.0% user,  0.0% nice,  3.0% system,  0.0% interrupt, 97.0% idle
 Memory: 2565M Act, 1253M Inact, 8924K Wired, 6296K Exec, 1316M File, 228K Free
 Swap: 24G Total, 9435M Used, 15G Free

    PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
  10722 root      -5    0   868K 1792K biowai/0   0:13 78.93%  7.52% tar
     23 root      -6    0     0K   28M RUN/1      0:10  5.37%  5.37% [raidio0]
     26 root     -18    0     0K   28M pgdaem/0   0:04  1.71%  1.71% [pagedaemon]
     28 root     -18    0     0K   28M aiodon/0   0:04  1.17%  1.17% [aiodoned]
     21 root      -6    0     0K   28M raidio/1   0:05  1.07%  1.07% [raidio1]
   9957 root      28    0   120K  820K CPU/1      0:00  0.00%  0.98% cp
     22 root      -6    0     0K   28M rfwcon/0   0:01  0.68%  0.68% [raid0]
     20 root      -6    0     0K   28M rfwcon/0   0:00  0.05%  0.05% [raid1]
      9 root      -6    0     0K   28M sccomp/1   0:02  0.00%  0.00% [scsibus0]
    198 ncadmin   28    0   572K 1588K CPU/0      0:00  0.00%  0.00% top
     27 root      18    0     0K   28M syncer/0   0:00  0.00%  0.00% [ioflush]
    816 root      18    0   988K 4188K pause/0    0:00  0.00%  0.00% ntpd
   9303 wgstuken  18    0   260K 1168K pause/1    0:00  0.00%  0.00% <csh>
  10024 root      18    0   236K 1144K pause/1    0:00  0.00%  0.00% <csh>
   1515 root      18    0   240K 1088K pause/1    0:00  0.00%  0.00% <csh>
     18 root      14    0     0K   28M crypto/0   0:00  0.00%  0.00% [cryptoret]
     13 root      10    0     0K   28M usbevt/0   0:00  0.00%  0.00% [usb7]
     12 root      10    0     0K   28M usbevt/0   0:00  0.00%  0.00% [usb6]
     11 root      10    0     0K   28M usbevt/0   0:00  0.00%  0.00% [usb5]
     10 root      10    0     0K   28M usbevt/0   0:00  0.00%  0.00% [usb4]
    571 root      10    0     0K   28M nfsidl/0   0:00  0.00%  0.00% [nfsio]
    570 root      10    0     0K   28M nfsidl/0   0:00  0.00%  0.00% [nfsio]
    545 root      10    0     0K   28M nfsidl/0   0:00  0.00%  0.00% [nfsio]
    499 root      10    0     0K   28M nfsidl/0   0:00  0.00%  0.00% [nfsio]
      3 root      10    0     0K   28M usbevt/0   0:00  0.00%  0.00% [usb0]
      4 root      10    0     0K   28M usbtsk/0   0:00  0.00%  0.00% [usbtask-hc]
      5 root      10    0     0K   28M usbtsk/0   0:00  0.00%  0.00% [usbtask-dr]
      6 root      10    0     0K   28M usbevt/0   0:00  0.00%  0.00% [usb1]
      7 root      10    0     0K   28M usbevt/0   0:00  0.00%  0.00% [usb2]
      8 root      10    0     0K   28M usbevt/0   0:00  0.00%  0.00% [usb3]
  10303 root      10    0   484K 2768K wait/0     0:00  0.00%  0.00% <login>
    195 root      10    0   620K 1944K wait/1     0:00  0.00%  0.00% <login>
   1566 root      10    0   476K 1936K wait/0     0:00  0.00%  0.00% <login>
    196 ncadmin   10    0   284K 1048K wait/0     0:00  0.00%  0.00% <sh>
   1443 root      10    0   268K  880K nanosl/1   0:00  0.00%  0.00% <cron>
      1 root      10    0   100K  812K wait/0     0:00  0.00%  0.00% <init>
    557 root       2    0  1060K 3988K select/0   0:00  0.00%  0.00% <amd>
  10740 root       2    0   492K 1536K poll/0     0:00  0.00%  0.00% <rlogind>
    872 root       2    0   344K 1456K select/0   0:00  0.00%  0.00% <sshd>
   1385 postfix    2    0   628K 1216K kqread/0   0:00  0.00%  0.00% <qmgr>


 There seems to be a bigger problem in UVM than I expected before!


 db{0}> show uvmexp
 Current UVM status:
    pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
    1018418 VM pages: 656891 active, 320807 inactive, 2230 wired, 5 free
    pages  634067 anon, 344367 file, 1574 exec
    freemin=64, free-target=85, wired-max=339472
    faults=3609406, traps=3722187, intrs=7567575, ctxswitch=9890365
    softint=380503, syscalls=14268860, swapins=245, swapouts=270
    fault counts:
      noram=24, noanon=0, pgwait=24, pgrele=0
      ok relocks(total)=1148(1150), anget(retrys)=114762(394), amapcopy=55274
      neighbor anon/obj pg=115362/837538, gets(lock/unlock)=198035/756
      cases: anon=77487, anoncow=37248, obj=164994, prcopy=33039, przero=98761
    daemon and swap counts:
      woke=15991, revs=10979, scans=2430998, obscans=2420929, anscans=2524
      busy=0, freed=2423039, reactivate=730, deactivate=2791838
      pageouts=156825, pending=2266288, nswget=297605
      nswapdev=1, swpgavail=6291455
      swpages=6291455, swpginuse=2422999, swpgonly=2125395, paging=78
 db{0}>  ps
   PID           PPID     PGRP        UID S   FLAGS LWPS          COMMAND    WAIT
   9957          1515     9957          0 2  0x4002    1               cp uvn_fp1
   10722        10024    10722          0 2  0x4002    1              tar
   10024         9303    10024          0 2  0x4002    1              csh   pause
   9303         10303     9303       1002 2  0x4002    1              csh   pause
   10303        10740    10303          0 2  0x4103    1            login    wait
   10740         1262     1262          0 2  0x4100    1          rlogind    poll
   198            196      198        500 2  0x4002    1              top    poll
   196            195      196        500 2  0x4002    1               sh    wait
   195            194      195          0 2  0x4103    1            login    wait
   194           1262     1262          0 2  0x4100    1          rlogind    poll
   1515          1566     1515          0 2  0x4002    1              csh   pause
   1566             1     1566          0 2  0x4102    1            login    wait
   1443             1     1443          0 2       0    1             cron nanosle
   1262             1     1262          0 2       0    1            inetd  kqread
   1385          1263     1263         12 2  0x4108    1             qmgr  kqread
   950           1263     1263         12 2  0x4108    1           pickup  kqread
   1263             1     1263          0 2  0x4108    1           master  kqread
   872              1      872          0 2       0    1             sshd  select
   816              1      816          0 2       0    1             ntpd   pause
   757              1      757          0 2       0    1              lpd    poll
   631              1      631          0 2       0    1        rpc.lockd  select
   672              1      672          0 2 0xa0008    1        rpc.statd  select
   682            614      614          0 2       0    1             nfsd    nfsd
   677            614      614          0 2       0    1             nfsd    nfsd
   637            614      614          0 2       0    1             nfsd    nfsd
   671            614      614          0 2       0    1             nfsd    nfsd
   614              1      614          0 2       0    1             nfsd    poll
   615              1      615          0 2       0    1           mountd  select
   571              0        0          0 2 0x20200    1            nfsio  nfsidl
   570              0        0          0 2 0x20200    1            nfsio  nfsidl
   545              0        0          0 2 0x20200    1            nfsio  nfsidl
   499              0        0          0 2 0x20200    1            nfsio  nfsidl
   557              1      557          0 2       0    1              amd  select
   505              1      505          0 2       0    1           ypbind  select
   497              1      497          0 2       0    1          rpcbind    poll
   574              1      574          0 2       0    1          syslogd  kqread
   248              1      248          0 2       0    1           routed  select
   108              0        0          0 2 0x20200    1          physiod physiod
  >28               0        0          0 2 0x20200    1         aiodoned
   27               0        0          0 2 0x20200    1          ioflush  syncer
   26               0        0          0 2 0x20200    1       pagedaemon pgdaemo
   25               0        0          0 2 0x20200    1          raidio2 raidiow
   24               0        0          0 2 0x20200    1            raid2 rfwcond
   23               0        0          0 2 0x20200    1          raidio0
   22               0        0          0 2 0x20200    1            raid0
   21               0        0          0 2 0x20200    1          raidio1
   20               0        0          0 2 0x20200    1            raid1 rfwcond
   19               0        0          0 2 0x20200    1        atapibus0  sccomp
   18               0        0          0 2 0x20200    1        cryptoret crypto_
   17               0        0          0 2 0x20200    1          atabus3   atath
   16               0        0          0 2 0x20200    1          atabus2   atath
   15               0        0          0 2 0x20200    1          atabus1   atath
   14               0        0          0 2 0x20200    1          atabus0   atath
   13               0        0          0 2 0x20200    1             usb7  usbevt
   12               0        0          0 2 0x20200    1             usb6  usbevt
   11               0        0          0 2 0x20200    1             usb5  usbevt
   10               0        0          0 2 0x20200    1             usb4  usbevt
   9                0        0          0 2 0x20200    1         scsibus0  sccomp
   8                0        0          0 2 0x20200    1             usb3  usbevt
   7                0        0          0 2 0x20200    1             usb2  usbevt
   6                0        0          0 2 0x20200    1             usb1  usbevt
   5                0        0          0 2 0x20200    1       usbtask-dr  usbtsk
   4                0        0          0 2 0x20200    1       usbtask-hc  usbtsk
   3                0        0          0 2 0x20200    1             usb0  usbevt
   2                0        0          0 2 0x20200    1           sysmon smtaskq
   1                0        1          0 2  0x4001    1             init    wait
   0               -1        0          0 2 0x20200    1          swapper schedul
 db{0}>
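
 As a reading aid for the "show uvmexp" output above: the counters are
 given in pages of 4096 bytes, so they can be converted into the units
 that top prints. The following is only a minimal sketch in C; the helper
 names pages_to_kib() and pages_to_mib() are made up for this
 illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a page count to KiB (rounded down). */
static uint64_t
pages_to_kib(uint64_t pages, uint64_t pagesize)
{
	assert(pagesize > 0);
	return pages * pagesize / 1024;
}

/* Hypothetical helper: convert a page count to MiB (rounded down). */
static uint64_t
pages_to_mib(uint64_t pages, uint64_t pagesize)
{
	assert(pagesize > 0);
	return pages * pagesize / (1024 * 1024);
}
```

 With the values above, the 5 free pages are 20 KiB, well below
 freemin=64 pages (256 KiB), and swpginuse=2422999 pages is about
 9464 MiB, which is in line with the 9435M of used swap reported by top.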

 At the time of the crash "tar" is extracting an archive into one 
 filesystem and "cp" is copying a large file into another filesystem - 
 both on the same raid-device (raid1 on SATA) - but that has caused no 
 problem in the past.
 The source of both commands is in /tmp (tmpfs) - ca. 11 GB resides in /tmp.

 Is tmpfs known to be stable? Or may the problem be there?
 Any idea how to debug this further?

 I will see if I'm able to replace tmpfs by a real filesystem and try 
 again ...




 By the way: while thinking about the strategy used in 
 uvm_plistalloc_simple(), I've noticed another problem there.
 If several processes try to allocate more memory in total than the 
 system has, the strategy used there may deadlock the whole system.
 If e.g. 30% of memory is given to each of three processes that each 
 request 50% of the whole memory, it will be impossible for any of them to 
 complete, because the required memory is held by the other processes 
 and cannot be taken back from them (there is no way to manipulate the local 
 variable "num" from outside ...). It would be possible to satisfy the 
 request of the first one and, after it has freed the memory again, the next 
 one and so on. Therefore a way is needed to "steal" the memory back from 
 waiting processes.
 This would be a very rare and strange situation, but it is not 
 detected by the system at the moment. I'm not sure if there is an 
 inexpensive and easy way to detect such a situation, but this "possible" 
 problem should at least be documented in a comment in the source file.
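
 The scenario described above can be sketched as a small user-space model
 in C. This is not the kernel code: struct req, struct pool,
 alloc_greedy() and alloc_with_stealing() are hypothetical names invented
 for this illustration, and the "stealing" variant is just one possible
 way to break the stall, assumed here for the sake of the example. Pages
 are handed out one at a time, so three requests of 50 pages against a
 pool of 100 pages each end up with about a third of the pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one allocation request (not a kernel structure). */
struct req {
	int want;	/* pages requested in total */
	int held;	/* pages obtained so far and kept while waiting */
	int done;	/* nonzero once the request was satisfied and freed */
};

struct pool {
	int free_pages;
};

/*
 * Greedy strategy: free pages are handed out one at a time, round robin,
 * and a waiting request never gives back what it already holds.
 * Returns the number of requests completed before progress stopped.
 */
static int
alloc_greedy(struct pool *p, struct req *r, int n)
{
	int completed = 0, progress = 1;

	assert(p != NULL && r != NULL && n > 0);
	while (progress) {
		progress = 0;
		for (int i = 0; i < n; i++) {
			if (r[i].done)
				continue;
			if (r[i].held == r[i].want) {
				/* Satisfied: use the memory, then free it. */
				p->free_pages += r[i].held;
				r[i].held = 0;
				r[i].done = 1;
				completed++;
				progress = 1;
			} else if (p->free_pages > 0) {
				p->free_pages--;
				r[i].held++;
				progress = 1;
			}
		}
	}
	return completed;
}

/*
 * Same as above, but once everything is stuck the pages held by the
 * other waiters are taken back and the first waiter is served alone.
 */
static int
alloc_with_stealing(struct pool *p, struct req *r, int n)
{
	int completed = alloc_greedy(p, r, n);

	for (;;) {
		int w = -1;

		for (int i = 0; i < n; i++) {
			if (r[i].done)
				continue;
			if (w < 0) {
				w = i;		/* first waiter wins */
				continue;
			}
			p->free_pages += r[i].held;	/* steal back */
			r[i].held = 0;
		}
		if (w < 0 || r[w].want - r[w].held > p->free_pages)
			break;	/* nobody left, or request can never fit */
		/* Winner gets the rest, uses the memory, frees all of it. */
		p->free_pages += r[w].held;
		r[w].held = 0;
		r[w].done = 1;
		completed++;
	}
	return completed;
}
```

 In this model, three requests of 50 pages against a pool of 100 pages
 leave alloc_greedy() with no completed request and no free page, while
 alloc_with_stealing() completes all three, one after the other.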

 best regards

 W. Stukenbrock

 PS. I've removed the previous mail content - it is in the gnats system 
 and can be reviewed there.

>Unformatted:
