NetBSD Problem Report #48586

From kivinen@fireball.kivinen.iki.fi  Mon Feb 10 16:02:29 2014
Return-Path: <kivinen@fireball.kivinen.iki.fi>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 74D51A6494
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 10 Feb 2014 16:02:29 +0000 (UTC)
Message-Id: <201402101602.s1AG2Mb5016766@fireball.kivinen.iki.fi>
Date: Mon, 10 Feb 2014 18:02:22 +0200 (EET)
From: kivinen@iki.fi
Reply-To: kivinen@iki.fi
To: gnats-bugs@gnats.NetBSD.org
Subject: Kern complains proc table full even when it is not
X-Send-Pr-Version: 3.95

>Number:         48586
>Category:       kern
>Synopsis:       The kernel complains that the proc table is full even when it is not
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb 10 16:05:00 +0000 2014
>Last-Modified:  Thu May 08 12:45:01 +0000 2014
>Originator:     Tero Kivinen
>Release:        NetBSD 6.1_STABLE
>Organization:
>Environment:
System: NetBSD 8 12:56:43 EET 2014  root@haste.i.kivinen.iki.fi:/usr/obj/sys/arch/amd64/compile/HASTE amd64
Architecture: x86_64
Machine: amd64

The HASTE kernel is GENERIC with a larger MAXDSIZ:

	include "arch/amd64/conf/GENERIC"
	options MAXDSIZ=34359738368

The problem occurred with the GENERIC kernel too, so that change should
not be the cause.

>Description:

	I am creating Garmin maps using Java tools (osmosis,
	splitter, mkgmap). The tools run under sun-jre7-7.0.45 from
	pkgsrc. After 15 or so hours of continuously running scripts,
	the scripts start to complain:

	...
	Map: -110.0..-150.0 -30..0 1300 sa-w-c-ele.img South America
	West Central
	/m/smbkivinen/garmin/bin/gen-el-map.sh: Cannot fork
	/m/smbkivinen/garmin/bin/gen-el-map.sh: Cannot fork
	/m/smbkivinen/garmin/bin/gen-el-map.sh: Cannot fork
	/m/smbkivinen/garmin/bin/gen-el-map.sh: Cannot fork
	...

	and when I check the syslog there are messages saying:

	haste (17:32) /m/smbkivinen/garmin>tail /var/log/messages
	...
	Feb 10 17:24:37 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC
	Feb 10 17:25:17 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC
	Feb 10 17:27:19 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC

	Then, when I check how many processes are running, ps claims:

	haste (17:41) /m/smbkivinen/garmin>ps agxu | wc
	      52     581    4056

	The system is configured with a kern.maxproc of 8000, but it
	complains that the proc table is full even though only 52
	processes are running:

	haste (17:42) /m/smbkivinen/garmin>sysctl -a | fgrep maxproc
	kern.maxproc = 8000
	proc.curproc.rlimit.maxproc.soft = 160
	proc.curproc.rlimit.maxproc.hard = 1044

	I can still run a few processes, but if I try to run more than
	a few at once the fork fails:

	haste (17:42) /m/smbkivinen/garmin>(sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 & sleep 5 &)
	zsh: fork failed: resource temporarily unavailable
	haste (17:42) /m/smbkivinen/garmin>ps agxu | wc
	      52     581    4056
	haste (17:42) /m/smbkivinen/garmin>
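
The burst of background sleeps in the transcript above can be written as a small loop, which makes it easier to vary the number of forks. This is only a sketch: the function name `burst` is made up here, and how a failed fork is reported (and whether the loop continues after it) depends on the shell in use.

```shell
# burst N SECS: start N background "sleep SECS" jobs, then wait for them.
# Prints the number of launches attempted. A failed fork is reported by
# the shell itself on stderr, in whatever wording that shell uses.
burst() {
    n=$1
    secs=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        sleep "$secs" &
        i=$((i + 1))
    done
    wait
    echo "$i"
}
```

For example, `burst 15 5` issues roughly the same load as the one-liner in the transcript.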

	There does not seem to be any way to recover from this
	situation other than a reboot. My guess is that something is
	wrong with the Linux emulation in the kernel, which leaks
	entries in the proc table. During the 16 hours since the last
	reboot, I have run osmosis (the Java program) around 6500
	times, and Java has crashed around 200 times. Java under
	Linux emulation seems to crash randomly quite often (usually
	with an out-of-memory error or similar), and rerunning the
	program usually works after a few tries. Those crashes might
	be related to the fact that I am using quite large Java memory
	limits: 4 GB for osmosis, 18 GB for splitter and 11 GB for
	mkgmap. Splitter was not able to process my maps with the
	default maximum data size limit (8 GB), which is why I had to
	compile a special kernel.

	Looking at my kern.maxproc (8000) and the number of times I
	have run those Linux emulation Java programs, it might be that
	every single Linux emulation Java program leaks one kernel
	proc table entry.
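
As a rough check of that guess, the figures in this report can be turned into an estimate of leaked entries per run. The sketch below assumes that the kernel's count had reached kern.maxproc when forks started failing and that only the roughly 6500 osmosis runs leaked; the splitter and mkgmap runs are not counted in this report, so the real per-run figure is probably lower. The function name is made up for illustration.

```shell
# leak_per_run_x100 MAXPROC LIVE RUNS
# Prints (MAXPROC - LIVE) / RUNS scaled by 100, using integer arithmetic,
# i.e. the estimated number of leaked entries per 100 runs.
leak_per_run_x100() {
    echo $(( ($1 - $2) * 100 / $3 ))
}

# Figures from this report: kern.maxproc 8000, 52 lines from ps, 6500 runs.
leak_per_run_x100 8000 52 6500
```

With these inputs the estimate is a little over one entry per run, which is at least consistent with a leak of about one entry per program start.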

>How-To-Repeat:
	Try running osmosis using sun-jre7-7.0.45 in a loop and see
	whether that uses up the proc table. I have not tried this, but
	with my current workload the problem repeats daily, so it is
	quite fast for me to test fixes. If you set kern.maxproc to a
	much lower value, the problem will most likely repeat much
	faster.
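
A loop of that kind could look like the sketch below. The osmosis command line is not given in this report, so the command to repeat is left as an argument; `repeat_cmd` is a made-up helper name, not an existing tool.

```shell
# repeat_cmd N CMD [ARGS...]: run CMD N times, discarding its output,
# and print how many runs were made and how many exited non-zero.
repeat_cmd() {
    n=$1
    shift
    i=0
    failed=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "runs=$i failed=$failed"
}
```

Run it with the real osmosis invocation in place of CMD, then watch /var/log/messages for the "proc: table is full" line. To shorten the test, kern.maxproc can be lowered beforehand as root with `sysctl -w kern.maxproc=<value>`; which value to pick is left open here.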
>Fix:
	Not known.
	Raising kern.maxproc to something much bigger (65k or so)
	would most likely just move the limit further away.

>Audit-Trail:
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/48586: Kern complains proc table full even when it is not
Date: Mon, 10 Feb 2014 18:56:58 +0000

 On Mon, Feb 10, 2014 at 04:05:00PM +0000, kivinen@iki.fi wrote:
 > >Number:         48586
 > >Category:       kern
 > >Synopsis:       The kernel complains that the proc table is full even when it is not
 > 
 > The problem occurred with GENERIC kernel too, so that change should
 > not cause it.
 > 
 > >Description:
 > 
 > 	I am creating garmin maps using the java tools (osmosis,
 > 	splitter, mkgmap). The java tools use sun-jre7-7.0.45 from
 > 	pkgsrc. After 15 or so hours of continuously running scripts
 > 	the scripts start to complain:
 ...
 > 	Feb 10 17:24:37 haste /netbsd: proc: table is full - increase kern.maxproc or NPROC

 Hmmm... a confusing error message.
 There is nothing wrong with the 'proc table'; the count of active
 processes (aka nprocs) had exceeded kern.maxproc.

 (The table itself also holds entries for process groups (etc), and is
 fully dynamically sized.)

 > 	Looking at my kern.maxproc (8000) and number of times I have
 > 	run those linux emulation java programs, it might be that
 > 	actually every single linux emulation java program leaks one
 > 	kernel proc table entry.

 A quick bit of analysis:
 The linux_clone_nptl() code allocates a 'pid' for the lwp's 'lid'
 (NetBSD usually uses per-process lid values).
 This is counted against nprocs.
 The flag LWP_PIDLID is passed to lwp_create(), which then allocates
   a pid and sets LP_PIDLID.
 When the lwp exits, proc_free_pid() is called to free the pid slot,
   which then decrements nprocs.
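
 The accounting described above can be pictured with a toy model. This is
 not kernel code: the function names are invented for illustration, and the
 only thing it shows is that a counter which is incremented for every thread
 but not decremented for some of them never comes back down.

```shell
# Toy model of the accounting described above; not kernel code.
# All names below are made up for illustration.
nprocs=0

thread_start() { nprocs=$((nprocs + 1)); }   # pid slot taken for the lwp
thread_exit()  { nprocs=$((nprocs - 1)); }   # pid slot released on exit

# run_threads N LEAKY: start N threads; the first LEAKY of them never
# release their slot. Prints the counter once all threads are done.
run_threads() {
    n=$1
    leaky=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        thread_start
        [ "$i" -lt "$leaky" ] || thread_exit
        i=$((i + 1))
    done
    echo "$nprocs"
}
```

 With no leaky threads the counter returns to its starting value; with any
 leaky threads it stays raised by that number even though nothing is running.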

 Somewhere that must be going wrong.
 It isn't clear whether a pid number actually gets allocated
 and then isn't freed.
 I think some values in the linux /proc show the value of nprocs (but
 we know that goes up).
 If a lot of pid numbers are actually allocated, then values larger than
 30000 will start being issued.

 Might be worth a kernel with a few printfs...

 	David

 -- 
 David Laight: david@l8s.co.uk

From: "D'Arcy J.M. Cain" <darcy@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/48586: Kern complains proc table full even when it is
 not
Date: Thu, 8 May 2014 08:44:41 -0400

 Not sure if this is related but I recently rebuilt my kernel to pick up
 any fixes.  I now run NetBSD 6.1.4_PATCH slightly modified as follows:

 include "arch/amd64/conf/GENERIC"

 #ident          "INSTALL-$Revision: 7922 $"

 maxusers 1024

 options SEMMNI=20000
 options SEMMNS=5000
 options SEMUME=200
 options SEMMNU=500
 options SHMMAXPGS=16384

 pseudo-device   pf                      # PF packet filter
 pseudo-device   pflog                   # PF log if

 This is the same as I have been running for years.  Suddenly Postfix is
 failing with the same error message as in this PR.  I have tried
 restarting every service on that system but the only thing that works
 is a reboot.  I am not running any Linux programs so I don't think that
 Linux emulation is the problem.

 I do not get the "proc: table is full" message in /var/log/messages.

 -- 
 D'Arcy J.M. Cain <darcy@NetBSD.org>
 http://www.NetBSD.org/ IM:darcy@Vex.Net
