NetBSD Problem Report #46136

From hf@bounce.nt.e-technik.tu-darmstadt.de  Sat Mar  3 15:05:16 2012
Return-Path: <hf@bounce.nt.e-technik.tu-darmstadt.de>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id C3C9A63DFA0
	for <gnats-bugs@gnats.NetBSD.org>; Sat,  3 Mar 2012 15:05:16 +0000 (UTC)
Message-Id: <201203031404.q23E4ih9026963@bounce.nt.e-technik.tu-darmstadt.de>
Date: Sat, 3 Mar 2012 15:04:44 +0100 (CET)
From: Hauke Fath <hf@spg.tu-darmstadt.de>
Reply-To: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@gnats.NetBSD.org
Cc: Hauke Fath <hf@spg.tu-darmstadt.de>
Subject: processes get stuck in D under high I/O load
X-Send-Pr-Version: 3.95

>Number:         46136
>Category:       kern
>Synopsis:       processes get stuck in D under high I/O load - vmem/kmem
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 03 15:10:00 +0000 2012
>Closed-Date:    
>Last-Modified:  Mon Oct 05 09:49:02 +0000 2015
>Originator:     Hauke Fath
>Release:        NetBSD 6.0_BETA
>Organization:
TU Darmstadt
>Environment:
System: NetBSD venediger 6.0_BETA NetBSD 6.0_BETA (VENEDIGER) #0: Thu Mar 1 18:10:56 CET 2012 hf@Hochstuhl:/var/obj/netbsd-builds/6/i386/sys/arch/i386/compile/VENEDIGER i386
Architecture: i386
Machine: i386
>Description:

	We run an i386 machine equipped with a Super Micro X7SBE (4
	core Xeon) and a SCSI MegaRAID 320-4X as file server - mainly
	NFS.

	When we switched the RAID controller from a 320-2 to said
	320-4X under netbsd-5, the nfsd developed a tendency to get
	stuck in 'D' state every other day, making a reboot necessary.

	After upgrading to netbsd-6, and tuning buffer and pool sizes,
	the nfsd problem is somewhat mitigated, although there is
	still a string-and-ducttape script in place, which checks if
	nfsd is stuck in 'D' for an extended period of time, and
	reboots the machine.
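
A minimal sketch of what such a watchdog check might look like; this is not the script actually in use on the machine. It assumes that ps(1) accepts `-o pid,state,comm`, and the names `stuck_pids`, `StuckTracker` and `watch`, as well as the polling interval and the 30-minute limit, are hypothetical.

```python
import subprocess
import time


def stuck_pids(ps_text, name="nfsd"):
    """Return PIDs of processes called `name` whose state starts with 'D'."""
    pids = []
    for line in ps_text.splitlines():
        fields = line.split(None, 2)
        # Skip the header line and anything that is not "pid state comm".
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        pid, state, comm = fields
        if comm.strip() == name and state.startswith("D"):
            pids.append(int(pid))
    return pids


class StuckTracker:
    """Track how long each PID has been seen continuously in 'D' state."""

    def __init__(self, limit_seconds):
        self.limit = limit_seconds
        self.first_seen = {}  # pid -> time of first consecutive 'D' sighting

    def update(self, pids, now):
        # Forget PIDs that left 'D' state since the last poll.
        self.first_seen = {p: t for p, t in self.first_seen.items() if p in pids}
        for p in pids:
            self.first_seen.setdefault(p, now)
        # True once any PID has been stuck for at least `limit` seconds.
        return any(now - t >= self.limit for t in self.first_seen.values())


def watch(poll_seconds=60, limit_seconds=1800):
    """Poll ps(1) and return once nfsd has been stuck for too long."""
    tracker = StuckTracker(limit_seconds)
    while True:
        out = subprocess.run(
            ["ps", "-ax", "-o", "pid,state,comm"],  # assumed ps(1) keywords
            capture_output=True, text=True, check=True,
        ).stdout
        if tracker.update(stuck_pids(out), time.time()):
            return  # the caller decides what to do, e.g. run reboot(8)
        time.sleep(poll_seconds)
```

Keeping the parsing and the timing logic separate from the polling loop means both can be exercised on canned ps output without touching a live system.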

	Unfortunately, the jobs started from /etc/daily get stuck,
	too, and wedge the machine such that even a 'reboot 0x04' from
	the debugger does not work, and a hard reset is needed.

	From the debugger 'ps' output:

[...]
About to run shutdown hooks...
Stopping cron.
Waiting for PIDS: 826.
Stopping inetd.
Waiting for PIDS: 302.
Saved entropy to disk.
Turning off accounting.
Removing block-type swap devices
swapctl: removing /dev/ld0b as swap device
Sat Mar  3 10:50:53 CET 2012

Done running shutdown hooks.
Mar  3 10:50:59 venediger syslogd[184]: Exiting on signal 15
syncing disks... 3 done
[-- break #0(1) sent -- `\z' -- Sat Mar  3 10:53:16 2012]
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c0183c64 cs 8 eflags 200286 cr2 bb688b04 ilevel 8
Stopped in pid 0.7 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
db{0}> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
9408     1 3   0   9020000           c5ded800                amd tstile
16127    1 3   3   9020000           c5dedd40                amd tstile
545      1 3   1   9020000           c80a5000                amd tstile
17941    1 3   2         0           cc47bd40             reboot tstile
29808    1 3   3   9020000           cd494d20               find vmem
28944    1 3   0   9020000           c8a86560               find vmem
1        1 3   3   8020080           c5d78aa0               init wait
0       78 3   3       200           c538e020              nfsio nfsiod
0       77 3   2       200           c538e2c0              nfsio nfsiod
0       76 3   1       200           c538e560              nfsio nfsiod
0       75 3   2       200           c5ded560              nfsio nfsiod
0       74 5   3       200           c5e34000           (zombie)
0       73 3   3       200           c5ded020            physiod physiod
0       72 3   3       200           c5dc5d20           aiodoned aiodoned
0       71 3   2       200           c5d782c0            ioflush vmem
0       70 3   1       200           c5d78020           pgdaemon xclocv
0       67 3   3       200           c5d3b800          cryptoret crypto_w
0       66 3   3       200           c5d78560          atapibus0 sccomp
0       64 3   2       200           c5d25540               usb4 usbevt
0       63 3   0       200           c5d3b2c0               usb7 usbevt
0       62 3   3       200           c5d3b560               usb6 usbevt
0       61 3   1       200           c5d3baa0               usb5 usbevt
0       60 3   3       200           c5d78800               usb3 usbevt
0       59 3   3       200           c5d252a0              unpgc unpgc
0       58 3   0       200           c5d3bd40               usb0 usbevt
0       57 3   0       200           c5d25000               usb2 usbevt
0       56 3   2       200           c5d78d40         usbtask-dr usbtsk
0       55 3   3       200           c5d3c000         usbtask-hc usbtsk
0       54 3   3       200           c5d3c2a0               usb1 usbevt
0       53 3   0       200           c5d3c540        vmem_rehash vmem_rehash
0       52 3   0       200           c5d3c7e0          coretemp3 coretemp3
0       51 3   3       200           c5d3ca80          coretemp2 coretemp2
0       50 3   1       200           c5d3cd20          coretemp1 coretemp1
0       49 3   2       200           c5d3b020          coretemp0 coretemp0
0       40 3   2       200           c5d257e0            atabus3 atath
0       39 3   0       200           c5d25a80            atabus2 atath
0       38 3   3       200           c5d25d20               iic0 iicintr
0       37 3   2       200           c5b29020            atabus1 atath
0       36 3   0       200           c5b292c0            atabus0 atath
0       35 3   0       200           c5b29560               apm0 apmev
0       34 3   3       200           c5b29800            xcall/3 xcall
0       33 1   3       200           c5b29aa0          softser/3
0       32 1   3       200           c5b29d40          softclk/3
0       31 1   3       200           c5b1e000          softbio/3
0       30 1   3       200           c5b1e2a0          softnet/3
0    >  29 7   3       201           c5b1e540             idle/3
0       28 3   2       200           c5b1e7e0            xcall/2 xcall
0       27 1   2       200           c5b1ea80          softser/2
0       26 1   2       200           c5b1ed20          softclk/2
0       25 1   2       200           c5b1a020          softbio/2
0       24 1   2       200           c5b1a2c0          softnet/2
0    >  23 7   2       201           c5b1a560             idle/2
0       22 3   1       200           c5b1a800            xcall/1 xcall
0       21 1   1       200           c5b1aaa0          softser/1
0       20 1   1       200           c5b1ad40          softclk/1
0       19 1   1       200           c4ffb000          softbio/1
0       18 1   1       200           c4ffb2a0          softnet/1
0    >  17 7   1       201           c4ffb540             idle/1
0       16 3   0       200           c4ffb7e0             sysmon smtaskq
0       15 3   0       200           c4ffba80         pmfsuspend pmfsuspend
0       14 3   0       200           c4ffbd20           pmfevent pmfevent
0       13 3   3       200           c4ff5020         sopendfree sopendfr
0       12 3   0       200           c4ff52c0           nfssilly nfssilly
0       11 3   0       200           c4ff5560            cachegc cachegc
0       10 3   3       200           c4ff5800              vrele vrele
0        9 3   2       200           c4ff5aa0             vdrain vdrain
0        8 3   0       200           c4ff5d40          modunload mod_unld
0    >   7 7   0       200           c4fed000            xcall/0
0        6 1   0       200           c4fed2a0          softser/0
0        5 1   0       200           c4fed540          softclk/0
0        4 1   0       200           c4fed7e0          softbio/0
0        3 1   0       200           c4feda80          softnet/0
0        2 1   0       201           c4fedd20             idle/0
0        1 3   3       200           c0652400            swapper uvm
db{0}> reboot 0x04
[-- break #0(1) sent -- `\z' -- Sat Mar  3 10:56:58 2012]
[-- break #0(1) sent -- `\z' -- Sat Mar  3 10:57:02 2012]

	[machine completely stuck]


	Note the reboot(8) in 'tstile', and the find(1) processes (the
	original culprits) in 'vmem'.
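
For long listings, grouping the LWPs by wait channel makes the blocked ones easier to spot. The following is a minimal sketch that assumes the column layout of the ddb 'ps' output shown above (PID, LID, S, CPU, FLAGS, STRUCT LWP *, NAME, WAIT); the function name `by_wait_channel` is hypothetical.

```python
from collections import defaultdict


def by_wait_channel(ddb_ps_text):
    """Group (pid, lid, name) tuples from ddb 'ps' output by wait channel."""
    groups = defaultdict(list)
    for line in ddb_ps_text.splitlines():
        # Drop the '>' marker that flags LWPs currently on a CPU.
        tokens = [t for t in line.split() if t != ">"]
        # Skip the header, and LWPs without a wait channel (idle, soft
        # interrupt threads, zombies), which have fewer than 8 columns.
        if len(tokens) < 8 or not tokens[0].isdigit():
            continue
        pid, lid, name, wait = int(tokens[0]), int(tokens[1]), tokens[6], tokens[7]
        groups[wait].append((pid, lid, name))
    return dict(groups)
```

Applied to the listing above, the 'vmem' group holds the two find(1) processes and ioflush, and the 'tstile' group holds the three amd processes and reboot(8).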

>How-To-Repeat:

	Run netbsd-6 on a busy, SCSI RAID based NFS file server.

>Fix:

	None I can see. 

	The machine is easy to upset, so I can quickly provide any
	details someone knowledgeable might be interested in, including
	ddb dances.

	(Re-sent because of botched sender mail address)

>Release-Note:

>Audit-Trail:
From: Lars Heidieker <lars@heidieker.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/46136: processes get stuck in D under high I/O load
Date: Sun, 04 Mar 2012 19:24:13 +0100

 Can you reproduce the problem with src/sys/kern/subr_vmem.c updated
 to 1.73 ?

From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/46136: processes get stuck in D under high I/O load
Date: Sun, 4 Mar 2012 21:01:11 +0100

 At 18:25 +0000 on 04.03.2012, Lars Heidieker wrote:

 Thanks for looking at this.

 > Can you reproduce the problem with src/sys/kern/subr_vmem.c updated
 > to 1.73 ?

 Kernel built, installed, re-booted, activated /etc/daily cron job. I'll
 know tomorrow morning.

 	hauke

 -- 
      The ASCII Ribbon Campaign                    Hauke Fath
 ()     No HTML/RTF in email            Institut für Nachrichtentechnik
 /\     No Word docs in email                     TU Darmstadt
      Respect for open standards              Ruf +49-6151-16-3281

From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/46136: processes get stuck in D under high I/O load
Date: Mon, 5 Mar 2012 07:09:24 +0100

 At 18:25 +0000 on 04.03.2012, Lars Heidieker wrote:
 > Can you reproduce the problem with src/sys/kern/subr_vmem.c updated
 > to 1.73 ?

 Yes.

 <snip>
 root   2225  7901  8306 392f58    0 D    ?       0:06.40 find / ( ! -fstype
 local -o -fstype rdonly -o -fstype
 root   7787   345   345 392a08    0 I    ?       0:00.00 cron: running job
 root   7901  8306  8306 392f58    0 I    ?       0:00.00 /bin/sh /etc/daily
 root   7995  8306  8306 392f58    0 I    ?       0:00.00 tee /var/log/daily.out
 smmsp  8133  8306  8306 392f58    0 I    ?       0:00.00 sendmail -t
 root   8306  7787  8306 392f58    0 Is   ?       0:00.00 /bin/sh -c /bin/sh
 /etc/daily 2>&1 | tee /var/log/dail
 </snip>

 I'll wait until I am at my workplace before trying to reboot the machine,
 since I expect the console to hang, but so far, the behaviour is unchanged.

 	hauke

 -- 
      The ASCII Ribbon Campaign                    Hauke Fath
 ()     No HTML/RTF in email            Institut für Nachrichtentechnik
 /\     No Word docs in email                     TU Darmstadt
      Respect for open standards              Ruf +49-6151-16-3281

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Fri, 04 Jan 2013 03:53:36 +0000
State-Changed-Why:
Any luck trying the suggested changes?


From: Hauke Fath <hf@spg.tu-darmstadt.de>
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
        dholland@NetBSD.org
Cc: 
Subject: Re: kern/46136 (processes get stuck in D under high I/O load - vmem/kmem)
Date: Tue, 20 Jan 2015 14:40:04 +0100

 On 01/04/13 04:53, dholland@NetBSD.org wrote:
 > Any luck trying the suggested changes?

 Lars fixed the vmem starvation, but the issue of processes hanging in 
 "uvn_fp2" wait, supposedly NFS-related, remained.

 We eventually replaced the machine in question with different hardware 
 and OS (Illumos), and repurposed it for light duty with its original, 
 more benign SCSI MegaRAID 320-2 controller.

 The 320-4X with four spare drives is still in the machine, so I could 
 try out any suggestions.

 hauke

State-Changed-From-To: feedback->open
State-Changed-By: hauke@NetBSD.org
State-Changed-When: Mon, 05 Oct 2015 09:49:02 +0000
State-Changed-Why:
I provided feedback a while ago.


>Unformatted:
