NetBSD Problem Report #52679

From dholland@macaran.eecs.harvard.edu  Tue Oct 31 08:26:22 2017
Return-Path: <dholland@macaran.eecs.harvard.edu>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 649AE7A1E7
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 31 Oct 2017 08:26:22 +0000 (UTC)
Message-Id: <20171031082615.045356E28A@macaran.eecs.harvard.edu>
Date: Tue, 31 Oct 2017 04:26:14 -0400 (EDT)
From: dholland@eecs.harvard.edu
Reply-To: dholland@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: amd64 pmap page leak?
X-Send-Pr-Version: 3.95

>Number:         52679
>Category:       port-amd64
>Synopsis:       amd64 pmap page leak?
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-amd64-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Oct 31 08:30:00 +0000 2017
>Closed-Date:    Wed May 17 09:59:56 +0000 2023
>Last-Modified:  Wed May 17 09:59:56 +0000 2023
>Originator:     David A. Holland
>Release:        NetBSD 8.99.1 (20170809)
>Organization:
>Environment:
System: NetBSD macaran 8.99.1 NetBSD 8.99.1 (MACARAN) #42: Wed Aug 9 22:31:11 EDT 2017 dholland@macaran:/usr/src/sys/arch/amd64/compile/MACARAN amd64
Architecture: x86_64
Machine: amd64
>Description:

Today one of my machines deadlocked due to what turned out to be
garden-variety kva exhaustion: the X server went into D state with
wchan "vmem" and backtrace from crash(8) showed pool_grow and
vmem_alloc.

The proximate cause was a 5GB browser process, but in the course of
investigating, it looked very much as if a lot of system memory had
gone missing.

The first few lines of vmstat -s output:
     4096 bytes per page
        8 page colors
  1523017 pages managed
     5290 pages free
   500583 pages active
   248837 pages inactive
        0 pages paging
      763 pages wired
     4365 zero pages
        1 reserve pagedaemon pages
       20 reserve kernel pages
   429542 anonymous pages
   274459 cached file pages
    46182 cached executable pages
     1024 minimum free pages
     1365 target free pages
   507672 maximum wired pages
        1 swap devices
  1587221 swap pages
  1137567 swap pages in use
  1593324 swap allocations

Since free + inactive + active + wired + zero should add to roughly
managed, it looks like half the system memory's disappeared somewhere.
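
As a quick check of that arithmetic, here is a minimal Python sketch that sums the five counters named above, using the values copied from the vmstat -s output in this report. The variable names are illustrative. Treating the five counters as disjoint is this report's working assumption, not a confirmed statement of how the kernel accounts for pages.

```python
# Page counts copied from the vmstat -s output in this report.
managed = 1523017
counted = {
    "free": 5290,
    "active": 500583,
    "inactive": 248837,
    "wired": 763,
    "zero": 4365,
}

PAGE_SIZE = 4096  # "4096 bytes per page"

# Assumption: the five counters do not overlap.
accounted = sum(counted.values())
missing = managed - accounted

print(f"accounted:   {accounted} pages ({accounted / managed:.1%} of managed)")
print(f"unaccounted: {missing} pages ({missing * PAGE_SIZE / 2**30:.2f} GiB)")
```

The counted pages come to just under half of the managed pages, which is the "half the memory" figure referred to here.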

vmstat -m said the kernel was using roughly 1.5G, but even if that
isn't counted above there's still 1.5G missing. Is there some other
category not displayed that managed pages can be in?

(The machine has 6G of ram and 6G of swap, and it ought to be able to
handle a 5G browser process without going 4G into swap, since there
wasn't anything else large running. For a while a few days ago I was
running a second not-small browser process as well, but it was shut
down ~36 hours before the events today.)

It's odd that this should have so many approximate halves in it (6G
total -> 3G reported above -> 1.5G used by the kernel) but maybe
that's just the condition required for it to splode.

>How-To-Repeat:

Thrash memory on and off for several days, I guess...

>Fix:

Dunno. Confirmation that this does actually reflect a problem would be
a helpful first step.

It would also be useful if vmstat -s output came as groups of page
counts that were specifically supposed to add up, to make these
diagnoses easier.
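
Until vmstat does something like this itself, the check can be scripted. The following is a rough Python sketch, not an existing tool: the function names and the choice of which counters belong in the group are hypothetical, and the group simply repeats the list used in the description above.

```python
import re

# Hypothetical grouping: the counters this report expects to add up to
# "pages managed". Adjust it if some counters turn out to overlap.
SUM_GROUP = ["pages free", "pages active", "pages inactive",
             "pages wired", "zero pages"]

# First lines of the vmstat -s output quoted in this report.
SAMPLE = """\
     4096 bytes per page
        8 page colors
  1523017 pages managed
     5290 pages free
   500583 pages active
   248837 pages inactive
        0 pages paging
      763 pages wired
     4365 zero pages
"""

def parse_vmstat_s(text):
    """Turn each 'NUMBER label' line into an entry {label: NUMBER}."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
        if m:
            stats[m.group(2)] = int(m.group(1))
    return stats

def unaccounted(stats, group=SUM_GROUP, total="pages managed"):
    """Pages in `total` that no counter in `group` covers."""
    return stats[total] - sum(stats[name] for name in group)

stats = parse_vmstat_s(SAMPLE)
print(unaccounted(stats), "pages unaccounted for")
```

On a live system the text could come from `subprocess.run(["vmstat", "-s"], capture_output=True, text=True).stdout` instead of the sample string.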

I'm filing this in port-amd64 because it's presumptively a pmap-level
issue until proven otherwise... unless it's a false alarm and
something else entirely was going on.

>Release-Note:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 04 Nov 2017 12:19:22 +1100

 > vmstat -m said the kernel was using roughly 1.5G, but even if that
 > isn't counted above there's still 1.5G missing. Is there some other
 > category not displayed that managed pages can be in?

 FWIW, for my systems, active + inactive + free + the total on the final
 line of vmstat -m comes in very close to managed - mostly within 1-2% or
 less, though my erlite that has been up 147 days is at 7%.

 yes, i think something is leaking somehow.


 .mrg.

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@NetBSD.org
Cc: port-amd64-maintainer@netbsd.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 4 Nov 2017 09:26:25 +0800 (+08)

 Just one more data-point...

 On my amd64 8.99.3 system (with 128GB RAM), my discrepancy is > 8%

       4096 bytes per page
          8 page colors
   32572896 pages managed
   20457781 pages free
    9365903 pages active
      34169 pages inactive
          0 pages paging
      19572 pages wired
   12383934 zero pages

 free+active+inactive+paging+wired = 29877425, which is only 91.7% of 
 managed pages...

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@NetBSD.org
Cc: port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, dholland@NetBSD.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 04 Nov 2017 14:02:57 +1100

 > From: Paul Goyette <paul@whooppee.com>
 > To: gnats-bugs@NetBSD.org
 > Cc: port-amd64-maintainer@netbsd.org
 > Subject: re: port-amd64/52679: amd64 pmap page leak?
 > Date: Sat, 4 Nov 2017 09:26:25 +0800 (+08)
 > 
 >  Just one more data-point...
 >  
 >  On my amd64 8.99.3 system (with 128GB RAM), my discrepancy is > 8%
 >  
 >        4096 bytes per page
 >           8 page colors
 >    32572896 pages managed
 >    20457781 pages free
 >     9365903 pages active
 >       34169 pages inactive
 >           0 pages paging
 >       19572 pages wired
 >    12383934 zero pages
 >  
 >  free+active+inactive+paging+wired = 29877425, which is only 91.7% of 
 >  managed pages...

 how much does vmstat -m say is used total?  does that account for
 most of the remaining?


 .mrg.

From: Paul Goyette <paul@whooppee.com>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@NetBSD.org, port-amd64-maintainer@netbsd.org, 
    gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, dholland@NetBSD.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 4 Nov 2017 11:35:34 +0800 (+08)

 On Sat, 4 Nov 2017, matthew green wrote:

 >>  Just one more data-point...
 >>
 >>  On my amd64 8.99.3 system (with 128GB RAM), my discrepancy is > 8%
 >>
 >>        4096 bytes per page
 >>           8 page colors
 >>    32572896 pages managed
 >>    20457781 pages free
 >>     9365903 pages active
 >>       34169 pages inactive
 >>           0 pages paging
 >>       19572 pages wired
 >>    12383934 zero pages
 >>
 >>  free+active+inactive+paging+wired = 29877425, which is only 91.7% of
 >>  managed pages...
 >
 > how much does vmstat -m say is used total?  does that account for
 > most of the remaining?

 vmstat -m reported 9GB, which is more than what's missing above.


 +------------------+--------------------------+----------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
 +------------------+--------------------------+----------------------------+

From: Paul Goyette <paul@whooppee.com>
To: matthew green <mrg@eterna.com.au>
Cc: gnats-bugs@NetBSD.org, port-amd64-maintainer@netbsd.org, 
    gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, dholland@NetBSD.org
Subject: re: port-amd64/52679: amd64 pmap page leak?
Date: Sat, 4 Nov 2017 11:54:34 +0800 (+08)

 On Sat, 4 Nov 2017, Paul Goyette wrote:

 >>>  On my amd64 8.99.3 system (with 128GB RAM), my discrepancy is > 8%
 >>>
 >>>        4096 bytes per page
 >>>           8 page colors
 >>>    32572896 pages managed
 >>>    20457781 pages free
 >>>     9365903 pages active
 >>>       34169 pages inactive
 >>>           0 pages paging
 >>>       19572 pages wired
 >>>    12383934 zero pages
 >>> free+active+inactive+paging+wired = 29,877,425, which is only 91.7%
 >>> of managed pages...

 That leaves a "missing" count of 2,695,471 ...

 > vmstat -m reported 9GB, which is more than what's missing above.

 The actual number from vmstat is

 In use 9740352K, total allocated 9939724K; utilization 98.0%

 That 9.9GB of pool from vmstat -m translates to about 2,426,690 pages, 
 which is _very_close_ to the missing 2,695,471 pages!
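
The conversion can be reproduced with a short Python sketch using only the figures quoted in this thread. The result depends on whether the "K" in the vmstat -m summary is read as 1000 or 1024 bytes. Which one vmstat uses is treated as an open assumption here, so both are computed. Variable names are illustrative.

```python
PAGE_SIZE = 4096  # "4096 bytes per page"

# vmstat -s figures quoted in this thread.
managed = 32572896
free, active, inactive, paging, wired = 20457781, 9365903, 34169, 0, 19572

# "total allocated 9939724K" from the vmstat -m summary line.
pool_total_k = 9939724

missing = managed - (free + active + inactive + paging + wired)

# Assumption A: K means 1024 bytes.
pool_pages_1024 = pool_total_k * 1024 // PAGE_SIZE
# Assumption B: K means 1000 bytes.
pool_pages_1000 = pool_total_k * 1000 // PAGE_SIZE

print(f"missing pages:            {missing}")
print(f"pool pages (K = 1024 B):  {pool_pages_1024}")
print(f"pool pages (K = 1000 B):  {pool_pages_1000}")

# The earlier check in this thread: active + inactive + free + pool vs. managed.
covered = free + active + inactive + pool_pages_1024
print(f"active+inactive+free+pool = {covered / managed:.1%} of managed")
```

The 1000-byte reading gives the 2,426,690 figure, and the 1024-byte reading gives 2,484,931 pages. Either way the pool total is in the same range as the 2,695,471 missing pages.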



 +------------------+--------------------------+----------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
 +------------------+--------------------------+----------------------------+

State-Changed-From-To: open->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 17 May 2023 09:59:56 +0000
State-Changed-Why:
this problem took a good while to hunt down, but did eventually get
fixed... a good while ago


>Unformatted:
