NetBSD Problem Report #46136
From hf@bounce.nt.e-technik.tu-darmstadt.de Sat Mar 3 15:05:16 2012
Return-Path: <hf@bounce.nt.e-technik.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id C3C9A63DFA0
for <gnats-bugs@gnats.NetBSD.org>; Sat, 3 Mar 2012 15:05:16 +0000 (UTC)
Message-Id: <201203031404.q23E4ih9026963@bounce.nt.e-technik.tu-darmstadt.de>
Date: Sat, 3 Mar 2012 15:04:44 +0100 (CET)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@gnats.NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: processes get stuck in D under high I/O load
X-Send-Pr-Version: 3.95
>Number: 46136
>Category: kern
>Synopsis: processes get stuck in D under high I/O load - vmem/kmem
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Mar 03 15:10:00 +0000 2012
>Closed-Date:
>Last-Modified: Mon Oct 05 09:49:02 +0000 2015
>Originator: Hauke Fath
>Release: NetBSD 6.0_BETA
>Organization:
TU Darmstadt
>Environment:
System: NetBSD venediger 6.0_BETA NetBSD 6.0_BETA (VENEDIGER) #0: Thu Mar 1 18:10:56 CET 2012 hf@Hochstuhl:/var/obj/netbsd-builds/6/i386/sys/arch/i386/compile/VENEDIGER i386
Architecture: i386
Machine: i386
>Description:
We run an i386 machine equipped with a Super Micro X7SBE (4
core Xeon) and a SCSI MegaRAID 320-4X as a file server - mainly
NFS.
When we switched the RAID controller from a 320-2 to said
320-4X under netbsd-5, the nfsd developed a tendency to get
stuck in 'D' state every other day, making a reboot necessary.
After upgrading to netbsd-6, and tuning buffer and pool sizes,
the nfsd problem is somewhat mitigated, although there is
still a string-and-duct-tape script in place, which checks if
nfsd is stuck in 'D' for an extended period of time, and
reboots the machine.
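For illustration only, here is a minimal Python sketch of the kind of check such a watchdog performs. It is not the script actually in use on this machine. It assumes the input is the text output of something like `ps -ax -o pid,state,comm` (the exact ps invocation is an assumption), and the names `pids_in_disk_wait` and `StuckTracker` are hypothetical. The reboot action itself is left out on purpose.

```python
def pids_in_disk_wait(ps_text, command="nfsd"):
    """Return PIDs of `command` whose process state starts with 'D'.

    Assumes three whitespace-separated columns: PID, STATE, COMMAND.
    """
    pids = []
    for line in ps_text.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header and malformed lines
        pid, state, comm = fields
        if comm.strip() == command and state.startswith("D"):
            pids.append(int(pid))
    return pids


class StuckTracker:
    """Remember when each PID was first seen in 'D' state."""

    def __init__(self, limit_seconds):
        self.limit = limit_seconds
        self.first_seen = {}

    def update(self, stuck_pids, now):
        """Record one observation; return PIDs stuck for >= limit seconds."""
        # Forget PIDs that have left 'D' state since the last observation.
        self.first_seen = {
            pid: t for pid, t in self.first_seen.items() if pid in stuck_pids
        }
        for pid in stuck_pids:
            self.first_seen.setdefault(pid, now)
        return sorted(
            pid for pid, t in self.first_seen.items() if now - t >= self.limit
        )
```

A cron job could call `update()` every few minutes with the current time and act only when the returned list is non-empty, so that a short-lived 'D' state during normal disk I/O does not trigger a reboot.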
Unfortunately, the jobs started from /etc/daily get stuck,
too, and wedge the machine such that even a 'reboot 0x04' from
the debugger does not work, and a hard reset is needed.
From the debugger 'ps' output:
[...]
About to run shutdown hooks...
Stopping cron.
Waiting for PIDS: 826.
Stopping inetd.
Waiting for PIDS: 302.
Saved entropy to disk.
Turning off accounting.
Removing block-type swap devices
swapctl: removing /dev/ld0b as swap device
Sat Mar 3 10:50:53 CET 2012
Done running shutdown hooks.
Mar 3 10:50:59 venediger syslogd[184]: Exiting on signal 15
syncing disks... 3 done
[-- break #0(1) sent -- `\z' -- Sat Mar 3 10:53:16 2012]
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c0183c64 cs 8 eflags 200286 cr2 bb688b04 ilevel 8
Stopped in pid 0.7 (system) at netbsd:breakpoint+0x4: popl %ebp
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
9408 1 3 0 9020000 c5ded800 amd tstile
16127 1 3 3 9020000 c5dedd40 amd tstile
545 1 3 1 9020000 c80a5000 amd tstile
17941 1 3 2 0 cc47bd40 reboot tstile
29808 1 3 3 9020000 cd494d20 find vmem
28944 1 3 0 9020000 c8a86560 find vmem
1 1 3 3 8020080 c5d78aa0 init wait
0 78 3 3 200 c538e020 nfsio nfsiod
0 77 3 2 200 c538e2c0 nfsio nfsiod
0 76 3 1 200 c538e560 nfsio nfsiod
0 75 3 2 200 c5ded560 nfsio nfsiod
0 74 5 3 200 c5e34000 (zombie)
0 73 3 3 200 c5ded020 physiod physiod
0 72 3 3 200 c5dc5d20 aiodoned aiodoned
0 71 3 2 200 c5d782c0 ioflush vmem
0 70 3 1 200 c5d78020 pgdaemon xclocv
0 67 3 3 200 c5d3b800 cryptoret crypto_w
0 66 3 3 200 c5d78560 atapibus0 sccomp
0 64 3 2 200 c5d25540 usb4 usbevt
0 63 3 0 200 c5d3b2c0 usb7 usbevt
0 62 3 3 200 c5d3b560 usb6 usbevt
0 61 3 1 200 c5d3baa0 usb5 usbevt
0 60 3 3 200 c5d78800 usb3 usbevt
0 59 3 3 200 c5d252a0 unpgc unpgc
0 58 3 0 200 c5d3bd40 usb0 usbevt
0 57 3 0 200 c5d25000 usb2 usbevt
0 56 3 2 200 c5d78d40 usbtask-dr usbtsk
0 55 3 3 200 c5d3c000 usbtask-hc usbtsk
0 54 3 3 200 c5d3c2a0 usb1 usbevt
0 53 3 0 200 c5d3c540 vmem_rehash vmem_rehash
0 52 3 0 200 c5d3c7e0 coretemp3 coretemp3
0 51 3 3 200 c5d3ca80 coretemp2 coretemp2
0 50 3 1 200 c5d3cd20 coretemp1 coretemp1
0 49 3 2 200 c5d3b020 coretemp0 coretemp0
0 40 3 2 200 c5d257e0 atabus3 atath
0 39 3 0 200 c5d25a80 atabus2 atath
0 38 3 3 200 c5d25d20 iic0 iicintr
0 37 3 2 200 c5b29020 atabus1 atath
0 36 3 0 200 c5b292c0 atabus0 atath
0 35 3 0 200 c5b29560 apm0 apmev
0 34 3 3 200 c5b29800 xcall/3 xcall
0 33 1 3 200 c5b29aa0 softser/3
0 32 1 3 200 c5b29d40 softclk/3
0 31 1 3 200 c5b1e000 softbio/3
0 30 1 3 200 c5b1e2a0 softnet/3
0 > 29 7 3 201 c5b1e540 idle/3
0 28 3 2 200 c5b1e7e0 xcall/2 xcall
0 27 1 2 200 c5b1ea80 softser/2
0 26 1 2 200 c5b1ed20 softclk/2
0 25 1 2 200 c5b1a020 softbio/2
0 24 1 2 200 c5b1a2c0 softnet/2
0 > 23 7 2 201 c5b1a560 idle/2
0 22 3 1 200 c5b1a800 xcall/1 xcall
0 21 1 1 200 c5b1aaa0 softser/1
0 20 1 1 200 c5b1ad40 softclk/1
0 19 1 1 200 c4ffb000 softbio/1
0 18 1 1 200 c4ffb2a0 softnet/1
0 > 17 7 1 201 c4ffb540 idle/1
0 16 3 0 200 c4ffb7e0 sysmon smtaskq
0 15 3 0 200 c4ffba80 pmfsuspend pmfsuspend
0 14 3 0 200 c4ffbd20 pmfevent pmfevent
0 13 3 3 200 c4ff5020 sopendfree sopendfr
0 12 3 0 200 c4ff52c0 nfssilly nfssilly
0 11 3 0 200 c4ff5560 cachegc cachegc
0 10 3 3 200 c4ff5800 vrele vrele
0 9 3 2 200 c4ff5aa0 vdrain vdrain
0 8 3 0 200 c4ff5d40 modunload mod_unld
0 > 7 7 0 200 c4fed000 xcall/0
0 6 1 0 200 c4fed2a0 softser/0
0 5 1 0 200 c4fed540 softclk/0
0 4 1 0 200 c4fed7e0 softbio/0
0 3 1 0 200 c4feda80 softnet/0
0 2 1 0 201 c4fedd20 idle/0
0 1 3 3 200 c0652400 swapper uvm
db{0}> reboot 0x04
[-- break #0(1) sent -- `\z' -- Sat Mar 3 10:56:58 2012]
[-- break #0(1) sent -- `\z' -- Sat Mar 3 10:57:02 2012]
[machine completely stuck]
Note the reboot(8) in 'tstile', and the find(1) processes (the
original culprits) in 'vmem'.
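To make this kind of reading less error-prone on long listings, the ddb 'ps' output can be grouped by wait channel. The following Python sketch is illustrative only. It assumes the column layout shown above (PID, LID, S, CPU, FLAGS, LWP address, NAME, optional WAIT), that names contain no spaces, and that a '>' marks a running LWP. The function name `group_by_wait` is hypothetical.

```python
from collections import defaultdict


def group_by_wait(ddb_ps_text):
    """Map each wait channel to a list of (pid, lid, name) tuples.

    LWPs without a wait channel (idle, soft interrupts, zombies)
    are skipped.
    """
    groups = defaultdict(list)
    for line in ddb_ps_text.splitlines():
        # Drop the '>' marker that flags a currently running LWP.
        tokens = line.replace(">", " ").split()
        if len(tokens) < 8:
            continue  # no WAIT column on this line
        if not (tokens[0].isdigit() and tokens[1].isdigit()):
            continue  # header or unrelated console output
        pid, lid, name, wait = int(tokens[0]), int(tokens[1]), tokens[6], tokens[7]
        groups[wait].append((pid, lid, name))
    return dict(groups)
```

Applied to the listing above, the entries under "vmem" and "tstile" are the ones of interest: the former are waiting for kernel memory, the latter are queued behind a lock holder.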
>How-To-Repeat:
Run netbsd-6 on a busy, SCSI-RAID-based NFS file server.
>Fix:
None I can see.
The machine is easy to upset, so I can quickly provide any
details someone knowledgeable might be interested in, including
ddb dances.
(Re-sent because of botched sender mail address)
>Release-Note:
>Audit-Trail:
From: Lars Heidieker <lars@heidieker.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/46136: processes get stuck in D under high I/O load
Date: Sun, 04 Mar 2012 19:24:13 +0100
Can you reproduce the problem with src/sys/kern/subr_vmem.c updated
to 1.73 ?
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/46136: processes get stuck in D under high I/O load
Date: Sun, 4 Mar 2012 21:01:11 +0100
At 18:25 +0000 on 04.03.2012, Lars Heidieker wrote:
Thanks for looking at this.
> Can you reproduce the problem with src/sys/kern/subr_vmem.c updated
> to 1.73 ?
Kernel built, installed, re-booted, activated /etc/daily cron job. I'll
know tomorrow morning.
hauke
--
The ASCII Ribbon Campaign Hauke Fath
() No HTML/RTF in email Institut für Nachrichtentechnik
/\ No Word docs in email TU Darmstadt
Respect for open standards Ruf +49-6151-16-3281
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/46136: processes get stuck in D under high I/O load
Date: Mon, 5 Mar 2012 07:09:24 +0100
At 18:25 +0000 on 04.03.2012, Lars Heidieker wrote:
> Can you reproduce the problem with src/sys/kern/subr_vmem.c updated
> to 1.73 ?
Yes.
<snip>
root 2225 7901 8306 392f58 0 D ? 0:06.40 find / ( ! -fstype
local -o -fstype rdonly -o -fstype
root 7787 345 345 392a08 0 I ? 0:00.00 cron: running job
root 7901 8306 8306 392f58 0 I ? 0:00.00 /bin/sh /etc/daily
root 7995 8306 8306 392f58 0 I ? 0:00.00 tee /var/log/daily.out
smmsp 8133 8306 8306 392f58 0 I ? 0:00.00 sendmail -t
root 8306 7787 8306 392f58 0 Is ? 0:00.00 /bin/sh -c /bin/sh
/etc/daily 2>&1 | tee /var/log/dail
</snip>
I'll wait till I am at my workplace before trying to reboot the machine,
since I expect the console to hang, but so far, the behaviour is unchanged.
hauke
State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Fri, 04 Jan 2013 03:53:36 +0000
State-Changed-Why:
Any luck trying the suggested changes?
From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
dholland@NetBSD.org
Cc:
Subject: Re: kern/46136 (processes get stuck in D under high I/O load - vmem/kmem)
Date: Tue, 20 Jan 2015 14:40:04 +0100
On 01/04/13 04:53, dholland@NetBSD.org wrote:
> Any luck trying the suggested changes?
Lars fixed the vmem starvation, but the supposedly NFS-related issue of
processes hanging in the "uvn_fp2" wait remained.
We eventually replaced the machine in question with different hardware
and OS (Illumos), and repurposed it for light duty with its original,
more benign SCSI MegaRAID 320-2 controller.
The 320-4X with four spare drives is still in the machine, so I could
try out any suggestions.
hauke
State-Changed-From-To: feedback->open
State-Changed-By: hauke@NetBSD.org
State-Changed-When: Mon, 05 Oct 2015 09:49:02 +0000
State-Changed-Why:
I provided feedback a while ago.
>Unformatted: