NetBSD Problem Report #48586
From kivinen@fireball.kivinen.iki.fi Mon Feb 10 16:02:29 2014
Return-Path: <kivinen@fireball.kivinen.iki.fi>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 74D51A6494
for <gnats-bugs@gnats.NetBSD.org>; Mon, 10 Feb 2014 16:02:29 +0000 (UTC)
Message-Id: <201402101602.s1AG2Mb5016766@fireball.kivinen.iki.fi>
Date: Mon, 10 Feb 2014 18:02:22 +0200 (EET)
From: kivinen@iki.fi
Reply-To: kivinen@iki.fi
To: gnats-bugs@gnats.NetBSD.org
Subject: Kern complains proc table full even when it is not
X-Send-Pr-Version: 3.95
>Number: 48586
>Category: kern
>Synopsis: The kernel complains that the proc table is full even when it is not
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Feb 10 16:05:00 +0000 2014
>Last-Modified: Thu May 08 12:45:01 +0000 2014
>Originator: Tero Kivinen
>Release: NetBSD 6.1_STABLE
>Organization:
>Environment:
System: NetBSD 8 12:56:43 EET 2014 root@haste.i.kivinen.iki.fi:/usr/obj/sys/arch/amd64/compile/HASTE amd64
Architecture: x86_64
Machine: amd64
The HASTE kernel is GENERIC with a larger MAXDSIZ:
include "arch/amd64/conf/GENERIC"
options MAXDSIZ=34359738368
The problem occurred with the GENERIC kernel too, so that change
should not be the cause.
>Description:
I am creating Garmin maps using the Java tools (osmosis,
splitter, mkgmap). The Java tools use sun-jre7-7.0.45 from
pkgsrc. After 15 or so hours of continuously running scripts,
the scripts start to complain:
...
Map: -110.0..-150.0 -30..0 1300 sa-w-c-ele.img South America
West Central
/m/smbkivinen/garmin/bin/gen-el-map.sh: Cannot fork
/m/smbkivinen/garmin/bin/gen-el-map.sh: Cannot fork
/m/smbkivinen/garmin/bin/gen-el-map.sh: Cannot fork
/m/smbkivinen/garmin/bin/gen-el-map.sh: Cannot fork
...
and when I check the syslog there are messages saying:
haste (17:32) /m/smbkivinen/garmin>tail /var/log/messages
...
Feb 10 17:24:37 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC
Feb 10 17:25:17 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC
Feb 10 17:27:19 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC
Then when I check how many processes are running, ps claims:
haste (17:41) /m/smbkivinen/garmin>ps agxu | wc
52 581 4056
The system is configured to have kern.maxproc of 8000, but it
is complaining that the proc table is full, even when it only
has 52 processes running:
haste (17:42) /m/smbkivinen/garmin>sysctl -a | fgrep maxproc
kern.maxproc = 8000
proc.curproc.rlimit.maxproc.soft = 160
proc.curproc.rlimit.maxproc.hard = 1044
I can still run a few processes, but if I try to run more than
a few processes the fork fails:
haste (17:42) /m/smbkivinen/garmin>(sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 &)
zsh: fork failed: resource temporarily unavailable
haste (17:42) /m/smbkivinen/garmin>ps agxu | wc
52 581 4056
haste (17:42) /m/smbkivinen/garmin>
There does not seem to be any way to recover from this
situation other than a reboot. My guess is that there is
something wrong with the Linux emulation in the kernel, which
leaks processes in the proc table or something. During the 16
hours since the last reboot, I have run osmosis (the Java
program) around 6500 times, and Java has crashed around 200
times. Java under Linux emulation seems to crash randomly quite
often (usually with an out of memory error or similar), and
rerunning the program usually works after a few tries. Those
Java crashes might be related to the fact that I am using quite
large Java memory limits: 4 GB, 11 GB or 18 GB depending on the
program. Osmosis uses 4 GB, splitter uses 18 GB and mkgmap uses
11 GB. The splitter was not able to process my maps with the
default max datasize limit (8 GB), so that's why I had to
compile a special kernel.
Looking at my kern.maxproc (8000) and the number of times I have
run those Linux emulation Java programs, it might be that
every single Linux emulation Java program actually leaks one
kernel proc table entry.
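As a rough sanity check of that estimate, the sketch below divides
the unaccounted-for entries by the number of runs. The function name
leak_per_run is hypothetical, and the inputs are the figures quoted
above (kern.maxproc of 8000, 52 lines of ps output, about 6500
osmosis runs). The run count covers osmosis only, so the result
likely overestimates the leak per run.

```sh
#!/bin/sh
# leak_per_run MAXPROC VISIBLE RUNS
# Hypothetical helper: assumes the table really is full (MAXPROC entries
# counted) while only VISIBLE processes show up in ps, and spreads the
# difference evenly over RUNS program invocations.
leak_per_run() {
    awk -v max="$1" -v visible="$2" -v runs="$3" \
        'BEGIN { printf "%.2f\n", (max - visible) / runs }'
}
```

With the figures from this report, leak_per_run 8000 52 6500 prints
1.22, which is at least consistent with roughly one leaked entry per
run.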
>How-To-Repeat:
Try running osmosis using sun-jre7-7.0.45 in a loop and see if
that uses up the proc table. I have not tried this, but with my
current workload the problem repeats daily, so it is quite
fast for me to test fixes. If you set kern.maxproc to a much
lower value, the problem will most likely repeat much faster.
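A minimal sketch of such a loop, assuming a POSIX shell: run_many is
a hypothetical helper that runs any command a given number of times
and counts the failures. The actual osmosis command line is not given
in this report, so the caller has to supply it.

```sh
#!/bin/sh
# run_many N command [args...]
# Hypothetical helper: runs the command N times in sequence and reports
# how many of those runs exited with a non-zero status.
run_many() {
    total=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$total" ]; do
        # Discard the command's output; only its exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "runs=$total failures=$failures"
}
```

Comparing the output of ps agxu | wc and the tail of
/var/log/messages before and after a batch of runs should show
whether fork starts failing while the visible process count stays
flat.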
>Fix:
Not known.
Raising kern.maxproc to a much bigger value (65k or
something) would most likely move the limit further away.
>Audit-Trail:
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/48586: Kern complains proc table full even when it is not
Date: Mon, 10 Feb 2014 18:56:58 +0000
On Mon, Feb 10, 2014 at 04:05:00PM +0000, kivinen@iki.fi wrote:
> >Number: 48586
> >Category: kern
> >Synopsis: The kernel complains that the proc table is full even when it is not
>
> The problem occurred with GENERIC kernel too, so that change should
> not cause it.
>
> >Description:
>
> I am creating garmin maps using the java tools (osmosis,
> splitter, mkgmap). The java tools use sun-jre7-7.0.45 from
> pkgsrc. After 15 or so hours of continuously running scripts,
> the scripts start to complain:
...
> Feb 10 17:24:37 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC
Hmmm.. confusing error message.
There is nothing wrong with the 'proc table'; the count of active
processes (aka nproc) had exceeded kern.maxproc.
(The table itself also holds entries for process groups (etc), and is
fully dynamically sized.)
> Looking at my kern.maxproc (8000) and number of times I have
> run those linux emulation java programs, it might be that
> actually every single linux emulation java program leaks one
> kernel proc table entry.
A quick bit of analysis:
The linux_clone_nptl() code allocates a 'pid' for the lwp's 'lid'
(netbsd usually uses per-process lid values).
This is counted against nproc.
The flag LWP_PIDLID is passed to lwp_create() which then allocates
a pid and sets LP_PIDLID.
When the lwp exits proc_free_pid() is called to free the pid slot
which then decrements nprocs.
Somewhere that must be going wrong.
It isn't clear whether a pid number actually gets allocated
and then isn't freed.
I think some values in the linux /proc show the value of nproc (but
we know that goes up).
If a lot of pid numbers are actually allocated then values larger than
30000 will start being issued.
Might be worth a kernel with a few printfs...
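One check that needs no kernel changes can be sketched as follows,
under the assumption that ps -ax -o pid lists one pid per line after
a header: max_pid is a hypothetical helper that reports the largest
pid in such a listing, to see whether values above 30000 are being
handed out to newly started processes.

```sh
#!/bin/sh
# max_pid
# Hypothetical helper: reads a pid column on stdin, ignores any line whose
# first field is not a plain number (such as the "PID" header), and prints
# the largest pid seen, or 0 if there was none.
max_pid() {
    awk '$1 ~ /^[0-9]+$/ && $1 + 0 > max { max = $1 + 0 }
         END { print max + 0 }'
}
```

Running ps -ax -o pid | max_pid a few times during the workload
would show whether the highest pid keeps climbing.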
David
--
David Laight: david@l8s.co.uk
From: "D'Arcy J.M. Cain" <darcy@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/48586: Kern complains proc table full even when it is
not
Date: Thu, 8 May 2014 08:44:41 -0400
Not sure if this is related but I recently rebuilt my kernel to pick up
any fixes. I now run NetBSD 6.1.4_PATCH slightly modified as follows:
include "arch/amd64/conf/GENERIC"
#ident "INSTALL-$Revision: 7922 $"
maxusers 1024
options SEMMNI=20000
options SEMMNS=5000
options SEMUME=200
options SEMMNU=500
options SHMMAXPGS=16384
pseudo-device pf # PF packet filter
pseudo-device pflog # PF log if
This is the same as I have been running for years. Suddenly Postfix is
failing with the same error message as in this PR. I have tried
restarting every service on that system but the only thing that works
is a reboot. I am not running any Linux programs so I don't think that
Linux emulation is the problem.
I do not get the "proc: table is full" message in /var/log/messages.
--
D'Arcy J.M. Cain <darcy@NetBSD.org>
http://www.NetBSD.org/ IM:darcy@Vex.Net