NetBSD Problem Report #50313
From he@smistad.uninett.no Thu Oct 8 10:02:04 2015
Return-Path: <he@smistad.uninett.no>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id AA1B0A654F
for <gnats-bugs@gnats.NetBSD.org>; Thu, 8 Oct 2015 10:02:04 +0000 (UTC)
Message-Id: <20151008100200.DCE373D0B8@smistad.uninett.no>
Date: Thu, 8 Oct 2015 12:02:00 +0200 (CEST)
From: he@uninett.no
Reply-To: he@uninett.no
To: gnats-bugs@gnats.NetBSD.org
Subject: processes get stuck doing exec
X-Send-Pr-Version: 3.95
>Number: 50313
>Category: kern
>Synopsis: processes get stuck doing exec
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Oct 08 10:05:00 +0000 2015
>Last-Modified: Sat Oct 10 23:25:00 +0000 2015
>Originator: Havard Eidnes
>Release: NetBSD 6.1_STABLE
>Organization:
UNINETT AS
>Environment:
System: NetBSD smistad.uninett.no 6.1_STABLE NetBSD 6.1_STABLE (MAANEN) #1: Wed Oct 29 11:27:25 CET 2014 he@smistad.uninett.no:/usr/obj/sys/arch/i386/compile/MAANEN i386
Architecture: i386
Machine: i386
>Description:
All too often processes get stuck on this machine. According
to ps the processes are stuck in "R" state. Running crash on
the live system shows this call graph
syscall -> sys_execve -> execve1 -> execve_loadvm -> pool_get ->
pool_grow -> exec_pool_alloc -> uvm_km_alloc -> uvm_map ->
uvm_map_prepare -> cv_timedwait -> sleepq_block
as shown in this session with "crash":
smistad# crash
Crash version 6.1_STABLE, image version 6.1_STABLE.
Output from a running system is unreliable.
crash>
crash> t/t 0t2052
trace: pid 2052 lid 1 at 0xddb69938
sleepq_block(64,0,c0c54ee0,c0cda590,0,c4677c00,c4676680,c0d1dc00,c0d1dc10,c0d1dc14) at sleepq_block+0xa3
cv_timedwait(c0d1dc14,c0d1dc10,64,ddb699bc,c0d1da20,ffffffff,ffffffff,0,801727,0) at cv_timedwait+0x126
uvm_map_prepare(c0d1dc00,c0000000,40000,c0d1da20,ffffffff,ffffffff,0,801727,ddb699e8,0) at uvm_map_prepare+0x160
uvm_map(c0d1dc00,ddb69a58,40000,c0d1da20,ffffffff,ffffffff,0,801727,800002,c0d0aa94) at uvm_map+0x7f
uvm_km_alloc(c0d1dc00,40000,0,800002,ddb69ac0,c07daf4a,c0d0aa20,1,0,0) at uvm_km_alloc+0xe6
exec_pool_alloc(c0d0aa20,1,0,0,0,c055a02e,0,0,0,c0cda6e0) at exec_pool_alloc+0x2b
pool_grow(c0d0aa94,0,c054,c76e9180,ce7b5c09,4,c0d0aa98,c0d0aa94,1,0) at pool_grow+0x2a
pool_get(c0d0aa20,1,c8d99948,cea7cf80,c08d0731,0,ddb69b80,ddb69bec,ddb69be8,c7731c14) at pool_get+0x79
execve_loadvm(bb9123cc,c054f830,ddb69b68,bb912404,c4eedc00,c5327c00,cf1d7008,404,404,9) at execve_loadvm+0x1b0
execve1(cf0dfd40,bb912404,bb9123bc,bb9123cc,c054f830,ddb69d3c,c080312a,cf0dfd40,ddb69d00,ddb69d28) at execve1+0x32
sys_execve(cf0dfd40,ddb69d00,ddb69d28,c0803389,c7834cc0,3b,bb90e000,bb90e020,ddb69d00,c7731c14) at sys_execve+0x30
syscall(ddb69d48,bbb900b3,ab,bfbf001f,bbb9001f,bb912404,bb912404,bfbfeb20,bb9123bc,0) at syscall+0xaa
crash> t/t 0t498
trace: pid 498 lid 1 at 0xdd1e5938
sleepq_block(64,0,c0c54ee0,c0cda590,0,c4677d40,c4676680,c0d1dc00,c0d1dc10,c0d1dc14) at sleepq_block+0xa3
cv_timedwait(c0d1dc14,c0d1dc10,64,dd1e59bc,c0d1da20,ffffffff,ffffffff,0,801727,0) at cv_timedwait+0x126
uvm_map_prepare(c0d1dc00,c0000000,40000,c0d1da20,ffffffff,ffffffff,0,801727,dd1e59e8,0) at uvm_map_prepare+0x160
uvm_map(c0d1dc00,dd1e5a58,40000,c0d1da20,ffffffff,ffffffff,0,801727,800002,c0d0aa94) at uvm_map+0x7f
uvm_km_alloc(c0d1dc00,40000,0,800002,dd1e5ac0,c07daf4a,c0d0aa20,1,0,0) at uvm_km_alloc+0xe6
exec_pool_alloc(c0d0aa20,1,0,0,0,c055a02e,0,0,0,c0cda6e0) at exec_pool_alloc+0x2b
pool_grow(c0d0aa94,0,c054,c76e9b40,ce7b5009,5,c0d0aa98,c0d0aa94,1,0) at pool_grow+0x2a
pool_get(c0d0aa20,1,ce9a3d28,cac41240,c08d0ae3,0,dd1e5b80,dd1e5bec,dd1e5be8,c4c721d4) at pool_get+0x79
execve_loadvm(bb907c04,c054f830,dd1e5b68,bb907ae4,c4ef7c00,c4ef7000,cf87a008,404,404,9) at execve_loadvm+0x1b0
execve1(cf8d9aa0,bb907ae4,bb907b84,bb907c04,c054f830,dd1e5d3c,c080312a,cf8d9aa0,dd1e5d00,dd1e5d28) at execve1+0x32
sys_execve(cf8d9aa0,dd1e5d00,dd1e5d28,c0803389,c52ed364,3b,bba9b000,bba9b320,dd1e5d00,c4c721d4) at sys_execve+0x30
syscall(dd1e5d48,bbb900b3,ab,bfbf001f,bbb9001f,bb907ae4,bb907ae4,bfbfd6d8,bb907b84,7d7b7cff) at syscall+0xaa
crash> t/t 0t22404
trace: pid 22404 lid 1 at 0xdd0ee938
sleepq_block(64,0,c0c54ee0,c0cda590,0,c4677d40,c4676680,c0d1dc00,c0d1dc10,c0d1dc14) at sleepq_block+0xa3
cv_timedwait(c0d1dc14,c0d1dc10,64,dd0ee9bc,c0d1da20,ffffffff,ffffffff,0,801727,0) at cv_timedwait+0x126
uvm_map_prepare(c0d1dc00,c0000000,40000,c0d1da20,ffffffff,ffffffff,0,801727,dd0ee9e8,0) at uvm_map_prepare+0x160
uvm_map(c0d1dc00,dd0eea58,40000,c0d1da20,ffffffff,ffffffff,0,801727,800002,c0d0aa94) at uvm_map+0x7f
uvm_km_alloc(c0d1dc00,40000,0,800002,dd0eeac0,c07daf4a,c0d0aa20,1,0,0) at uvm_km_alloc+0xe6
exec_pool_alloc(c0d0aa20,1,0,0,0,c055a02e,0,0,0,c0cda6e0) at exec_pool_alloc+0x2b
pool_grow(c0d0aa94,0,c054,c76e9b40,c8511c05,3,c0d0aa98,c0d0aa94,1,0) at pool_grow+0x2a
pool_get(c0d0aa20,1,ce9a3c08,c8091e80,c08d0ae3,0,dd0eeb80,dd0eebec,dd0eebe8,cf7fac30) at pool_get+0x79
execve_loadvm(bb907b04,c054f830,dd0eeb68,bb907ae4,c4ebec00,c5327400,cf8c7808,404,404,9) at execve_loadvm+0x1b0
execve1(cf8d9800,bb907ae4,bb907afc,bb907b04,c054f830,dd0eed3c,c080312a,cf8d9800,dd0eed00,dd0eed28) at execve1+0x32
sys_execve(cf8d9800,dd0eed00,dd0eed28,c0803389,cf88ba44,3b,bba9b000,bba9b320,dd0eed00,cf7fac30) at sys_execve+0x30
syscall(dd0eed48,bbb900b3,ab,bfbf001f,bbb9001f,bb907ae4,bb907ae4,bfbfd6d8,bb907afc,6d6d6bff) at syscall+0xaa
crash>
execve_loadvm() contains only one pool_get, this one:
/* allocate an argument buffer */
data->ed_argp = pool_get(&exec_pool, PR_WAITOK);
and "vmstat -m" says:
Memory resource pool statistics
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
...
execargs 262144 40498 0 40498 9 7 2 4 0 16 2
...
I had at that time started a pkgsrc compile, and one of
the stuck processes was a child inside that tree, but
aborting the build changed very little in the above
stats:
execargs 262144 40559 0 40559 9 7 2 4 0 16 2
At this point experience tells me that I need to reboot
to regain full use of the system. I wish I didn't have
to.
exec_pool_alloc in its turn runs uvm_km_alloc(kernel_map, ...)
but I don't know if I have a way to inspect the state of
kernel_map, i.e. if there's any free space left (probably not?)
>How-To-Repeat:
Sorry, I don't have an exact recipe for reproducing the problem.
>Fix:
I wish I knew.
>Audit-Trail:
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/50313: processes get stuck doing exec
Date: Sat, 10 Oct 2015 22:59:27 +0000
On Thu, Oct 08, 2015 at 10:05:00AM +0000, he@uninett.no wrote:
> execve_loadvm() contains only one pool_get, this one:
>
> /* allocate an argument buffer */
> data->ed_argp = pool_get(&exec_pool, PR_WAITOK);
>
> and "vmstat -m" says:
>
> Memory resource pool statistics
> Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
> ...
> execargs 262144 40498 0 40498 9 7 2 4 0 16 2
> ...
Out of curiosity: how much RAM is in the system? Because exec args are
big, unless there's a throttle (I don't think we have one) it's fairly
easy for transient demand to take up a lot of space, which can fill up
the kernel map or plain run out of memory.
--
David A. Holland
dholland@netbsd.org
From: Havard Eidnes <he@uninett.no>
To: gnats-bugs@NetBSD.org, dholland-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/50313: processes get stuck doing exec
Date: Sun, 11 Oct 2015 01:20:39 +0200 (CEST)
> > Memory resource pool statistics
> > Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat M=
inpg Maxpg Idle
> > ...
> > execargs 262144 40498 0 40498 9 7 2 4 =
0 16 2
> > ...
>
> Out of curiosity: how much RAM is in the system?
Physically 4G, but since this is i386, my kernel sees
total memory =3D 3316 MB
avail memory =3D 3248 MB
> Because exec args are big, unless there's a throttle (I don't
> think we have one) it's fairly easy for transient demand to
> take up a lot of space, which can fill up the kernel map or
> plain run out of memory.
Do we have a meter for the kernel map, inspectable from user-land
in some way?
When this problem strikes, it's usually going downhill.
Presently my host has a load average of 10, mostly caused by:
PID TTY STAT TIME COMMAND
...
1311 ? D 0:00.00 cron: running job =
2828 ? Rs 0:00.00 (local)
3101 ? RVs 0:00.00 (cron)
5471 ? D 0:00.00 cron: running job =
7074 ? D 0:00.00 cron: running job =
23096 ? RVs 0:00.00 (cron)
23277 ? RVs 0:00.00 (cron)
23347 ? RVs 0:00.00 (cron)
25722 ? RVs 0:00.00 (cron)
26808 ? D 0:00.00 cron: running job =
29057 ? R 0:00.00 (sh)
and some mail delivery appears to get stuck:
Oct 11 01:17:03 smistad postfix/qmgr[14168]: D81A13D0B8: skipped, still=
being delivered
Some of these are stuck doing uvm_km_alloc(), some of them stuck
in fork:
crash> t/t 0t5471
trace: pid 5471 lid 1 at 0xdd07ebe4
sleepq_block(0,0,c0be3a51,c0cda590,8,c4677c00,c4675280,c9002034,0,cf7fa=
dfc) at sleepq_block+0xa3
cv_wait(cf7fadfc,c4614f00,ddbb3000,80000000,0,0,c0803314,c8c2ad40,dd07e=
c88,0) at cv_wait+0x126
fork1(c4c90560,3,14,0,0,c0803314,0,dd07ed28,0,dd07ed3c) at fork1+0x6a0
sys___vfork14(c4c90560,dd07ed00,dd07ed28,c0803389,c77357ac,11a,bbab9000=
,bbab9d30,c77357ac,cf7fade4) at sys___vfork14+0x50
syscall(dd07ed48,bbbb00b3,ab,bfbf001f,bbba001f,1c20,0,bfbfed98,0,bb91e1=
53) at syscall+0xaa
crash> =
Regards,
- H=E5vard
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.