NetBSD Problem Report #52679
From dholland@macaran.eecs.harvard.edu Tue Oct 31 08:26:22 2017
Return-Path: <dholland@macaran.eecs.harvard.edu>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 649AE7A1E7
for <gnats-bugs@gnats.NetBSD.org>; Tue, 31 Oct 2017 08:26:22 +0000 (UTC)
Message-Id: <20171031082615.045356E28A@macaran.eecs.harvard.edu>
Date: Tue, 31 Oct 2017 04:26:14 -0400 (EDT)
From: dholland@eecs.harvard.edu
Reply-To: dholland@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: amd64 pmap page leak?
X-Send-Pr-Version: 3.95
>Number: 52679
>Category: port-amd64
>Synopsis: amd64 pmap page leak?
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: port-amd64-maintainer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Oct 31 08:30:00 +0000 2017
>Closed-Date: Wed May 17 09:59:56 +0000 2023
>Last-Modified: Wed May 17 09:59:56 +0000 2023
>Originator: David A. Holland
>Release: NetBSD 8.99.1 (20170809)
>Organization:
>Environment:
System: NetBSD macaran 8.99.1 NetBSD 8.99.1 (MACARAN) #42: Wed Aug 9 22:31:11 EDT 2017 dholland@macaran:/usr/src/sys/arch/amd64/compile/MACARAN amd64
Architecture: x86_64
Machine: amd64
>Description:
Today one of my machines deadlocked due to what turned out to be
garden-variety kva exhaustion: the X server went into D state with
wchan "vmem" and backtrace from crash(8) showed pool_grow and
vmem_alloc.
The proximate cause was a 5GB browser process but in the course of
investigating it looked substantially like a lot of system memory had
gone missing.
The first few lines of vmstat -s output:
     4096 bytes per page
        8 page colors
  1523017 pages managed
     5290 pages free
   500583 pages active
   248837 pages inactive
        0 pages paging
      763 pages wired
     4365 zero pages
        1 reserve pagedaemon pages
       20 reserve kernel pages
   429542 anonymous pages
   274459 cached file pages
    46182 cached executable pages
     1024 minimum free pages
     1365 target free pages
   507672 maximum wired pages
        1 swap devices
  1587221 swap pages
  1137567 swap pages in use
  1593324 swap allocations
Since free + inactive + active + wired + zero should add to roughly
managed, it looks like half the system memory's disappeared somewhere.
vmstat -m said the kernel was using roughly 1.5G, but even if that
isn't counted above there's still 1.5G missing. Is there some other
category not displayed that managed pages can be in?
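The sum described above can be checked mechanically. The following is a minimal Python sketch using only the counters quoted in this report; the variable names are illustrative, and it follows the reporter's formula (zero pages included) rather than any documented kernel invariant.

```python
PAGE_SIZE = 4096  # "4096 bytes per page" in the output above

managed = 1523017

# Counters the report expects to add up to roughly "pages managed".
# Dropping "zero" changes the result by only 4365 pages.
counted = {
    "free": 5290,
    "active": 500583,
    "inactive": 248837,
    "wired": 763,
    "zero": 4365,
}

accounted = sum(counted.values())
missing = managed - accounted
missing_gib = missing * PAGE_SIZE / 2**30

print(f"accounted: {accounted} pages ({accounted / managed:.1%} of managed)")
print(f"missing:   {missing} pages (~{missing_gib:.2f} GiB)")
```

With these figures the accounted total is 759838 pages, leaving 763179 pages (about 2.9 GiB) unexplained, which matches the "half the system memory" estimate.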
(The machine has 6G of ram and 6G of swap, and it ought to be able to
handle a 5G browser process without going 4G into swap, since there
wasn't anything else large running. For a while a few days ago I was
running a second not-small browser process as well, but it was shut
down ~36 hours before the events today.)
It's odd that this should have so many approximate halves in it (6G
total -> 3G reported above -> 1.5G used by the kernel) but maybe
that's just the condition required for it to splode.
>How-To-Repeat:
Thrash memory on and off for several days, I guess...
>Fix:
Dunno. Confirmation that this does actually reflect a problem would be
a helpful first step.
It would also be useful if vmstat -s output came as groups of page
counts that were specifically supposed to add up, to make these
diagnoses easier.
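Until vmstat offers such grouping itself, something similar can be approximated outside the kernel. This is only a sketch: the `SUM_GROUP` tuple and both helper functions are hypothetical, the grouping follows the formula used in this report, and the parser assumes every line has the "count label" shape shown above.

```python
import re

# Hypothetical grouping: counters assumed to sum to "pages managed".
# This mirrors the report's formula, not a documented kernel invariant.
SUM_GROUP = ("pages free", "pages active", "pages inactive", "pages wired")


def parse_vmstat_s(text):
    """Map each 'N label' line of vmstat -s style output to {label: N}."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
        if m:
            counters[m.group(2)] = int(m.group(1))
    return counters


def unaccounted_pages(counters, group=SUM_GROUP):
    """Return managed pages minus the sum of the counters in `group`."""
    return counters["pages managed"] - sum(counters.get(k, 0) for k in group)


# A few lines copied from the output in this report.
sample = """\
     4096 bytes per page
  1523017 pages managed
     5290 pages free
   500583 pages active
   248837 pages inactive
      763 pages wired
"""

counters = parse_vmstat_s(sample)
print(unaccounted_pages(counters))
```

On a live system the text could instead come from `subprocess.run(["vmstat", "-s"], capture_output=True, text=True).stdout`; the set of labels that belong in the group is the part that would need confirming against the UVM code.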
I'm filing this in port-amd64 because it's presumptively a pmap-level
issue until proven otherwise... unless it's a false alarm and
something else entirely was going on.
>Release-Note:
>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 04 Nov 2017 12:19:22 +1100
> vmstat -m said the kernel was using roughly 1.5G, but even if that
> isn't counted above there's still 1.5G missing. Is there some other
> category not displayed that managed pages can be in?
FWIW, for my systems, active+inactive+free+vmstat -m final line
comes in very close to managed - mostly within 1-2% or less,
though my erlite that has been up 147 days is at 7%.
yes, i think something is leaking somehow..
.mrg.
From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: port-amd64-maintainer@netbsd.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 4 Nov 2017 09:26:25 +0800 (+08)
Just one more data-point...
On my amd64 8.99.3 system (with 128GB RAM), my discrepancy is > 8%
     4096 bytes per page
        8 page colors
 32572896 pages managed
 20457781 pages free
  9365903 pages active
    34169 pages inactive
        0 pages paging
    19572 pages wired
 12383934 zero pages
free+active+inactive+paging+wired = 29877425, which is only 91.7% of
managed pages...
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, dholland@NetBSD.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 04 Nov 2017 14:02:57 +1100
> From: Paul Goyette <paul@whooppee.com>
> To: gnats-bugs@NetBSD.org
> Cc: port-amd64-maintainer@netbsd.org
> Subject: re: port-amd64/52679: amd64 pmap page leak?
> Date: Sat, 4 Nov 2017 09:26:25 +0800 (+08)
>
> Just one more data-point...
>
> On my amd64 8.99.3 system (with 128GB RAM), my discrepancy is > 8%
>
> 4096 bytes per page
> 8 page colors
> 32572896 pages managed
> 20457781 pages free
> 9365903 pages active
> 34169 pages inactive
> 0 pages paging
> 19572 pages wired
> 12383934 zero pages
>
> free+active+inactive+paging+wired = 29877425, which is only 91.7% of
> managed pages...
how much does vmstat -m say is used total? does that account for
most of the remaining?
.mrg.
From: Paul Goyette <paul@whooppee.com>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@NetBSD.org, port-amd64-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, dholland@NetBSD.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 4 Nov 2017 11:35:34 +0800 (+08)
On Sat, 4 Nov 2017, matthew green wrote:
>> Just one more data-point...
>>
>> On my amd64 8.99.3 system (with 128GB RAM), my discrepancy is > 8%
>>
>> 4096 bytes per page
>> 8 page colors
>> 32572896 pages managed
>> 20457781 pages free
>> 9365903 pages active
>> 34169 pages inactive
>> 0 pages paging
>> 19572 pages wired
>> 12383934 zero pages
>>
>> free+active+inactive+paging+wired = 29877425, which is only 91.7% of
>> managed pages...
>
> how much does vmstat -m say is used total? does that account for
> most of the remaining?
vmstat -m reported 9GB, which is more than what's missing above.
+------------------+--------------------------+----------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+------------------+--------------------------+----------------------------+
From: Paul Goyette <paul@whooppee.com>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@NetBSD.org, port-amd64-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, dholland@NetBSD.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 4 Nov 2017 11:54:34 +0800 (+08)
On Sat, 4 Nov 2017, Paul Goyette wrote:
>>> On my amd64 8.99.3 system (with 128GB RAM), my discrepancy is > 8%
>>>
>>> 4096 bytes per page
>>> 8 page colors
>>> 32572896 pages managed
>>> 20457781 pages free
>>> 9365903 pages active
>>> 34169 pages inactive
>>> 0 pages paging
>>> 19572 pages wired
>>> 12383934 zero pages
>>> free+active+inactive+paging+wired = 29,877,425, which is only 91.7%
>>> of managed pages...
That leaves a "missing" count of 2,695,471 ...
> vmstat -m reported 9GB, which is more than what's missing above.
The actual number from vmstat is
In use 9740352K, total allocated 9939724K; utilization 98.0%
That 9.9GB of pool from vmstat -m translates to about 2,426,690 pages,
which is _very_close_ to the missing 2,695,471 pages!
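For reference, the comparison can be redone from the raw numbers. This is a rough sketch under one assumption that the thread does not confirm: that the "K" in the vmstat -m summary means 1024 bytes. Treating it as 1000 bytes reproduces the 2,426,690 figure above; variable names are illustrative.

```python
PAGE_KIB = 4  # 4096 bytes per page

managed = 32572896
free, active, inactive, paging, wired = 20457781, 9365903, 34169, 0, 19572
pool_total_k = 9939724  # "total allocated 9939724K" from vmstat -m

accounted = free + active + inactive + paging + wired
missing = managed - accounted

# Assumption: K = 1024 bytes, so one page is exactly 4K.
pool_pages = pool_total_k // PAGE_KIB
# Alternative reading: K = 1000 bytes, which matches the figure in the message.
pool_pages_decimal = pool_total_k * 1000 // 4096

print(missing, pool_pages, pool_pages_decimal)
print(f"still unexplained: {missing - pool_pages} pages "
      f"({(missing - pool_pages) / managed:.2%} of managed)")
```

Under either reading the pool allocation covers most of the gap, leaving well under 1% of managed pages unexplained on this machine.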
+------------------+--------------------------+----------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+------------------+--------------------------+----------------------------+
State-Changed-From-To: open->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 17 May 2023 09:59:56 +0000
State-Changed-Why:
this problem took a good while to hunt down, but did eventually get
fixed... a good while ago
>Unformatted:
Copyright © 1994-2023
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.