NetBSD Problem Report #56849

From www@netbsd.org  Sun May 22 18:14:26 2022
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 541B51A9242
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 22 May 2022 18:14:26 +0000 (UTC)
Message-Id: <20220522181424.AF07C1A9243@mollari.NetBSD.org>
Date: Sun, 22 May 2022 18:14:24 +0000 (UTC)
From: tgl@sss.pgh.pa.us
Reply-To: tgl@sss.pgh.pa.us
To: gnats-bugs@NetBSD.org
Subject: Wacko kernel memory accounting in current/hppa
X-Send-Pr-Version: www-1.0

>Number:         56849
>Category:       port-hppa
>Synopsis:       Wacko kernel memory accounting in current/hppa
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-hppa-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun May 22 18:15:00 +0000 2022
>Closed-Date:    Sun May 29 06:28:33 +0000 2022
>Last-Modified:  Sun May 29 06:28:33 +0000 2022
>Originator:     Tom Lane
>Release:        HEAD/202205210400Z
>Organization:
PostgreSQL Global Development Group
>Environment:
NetBSD sss2.sss.pgh.pa.us 9.99.96 NetBSD 9.99.96 (TGL) #0: Sat May 21 16:50:57 EDT 2022  tgl@nuc1.sss.pgh.pa.us:/home/tgl/netbsd-H-202205210400Z/obj.hppa/sys/arch/hppa/compile/TGL hppa

This kernel is custom only to the extent of having a particular root device selected in its config, so that I don't have to intervene manually when netbooting (the root filesystem is on a USB drive that the boot ROM doesn't know how to boot from).
>Description:
During and after a fairly long, disk-access-intensive test suite run, I observe completely wacko reports from top(1) and ps(1) about the kernel's memory consumption.  With the machine sitting idle post-run, top says:

load averages:  0.00,  0.00,  0.00;               up 0+19:01:17        12:37:47
19 processes: 18 sleeping, 1 on CPU
CPU states:  0.0% user,  0.0% nice,  1.5% system,  0.0% interrupt, 98.5% idle
Memory: 38M Act, 4852K Inact, 26M Wired, 8456K Exec, 32M File, 296M Free
Swap: 1024M Total, 9068K Used, 1015M Free

  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
    0 root     125    0     0K   79G vdrain     60:25  0.00%  0.00% [system]
 ...

"79G"?  This machine only has 512M RAM, plus the 1024M swap partition.
The "Memory:" line is less obviously implausible, but it's still not right, because those numbers only sum to about 405M.

"ps auxww" seems to be looking at the same incorrect estimate, though
it presents it much differently:

USER        PID %CPU   %MEM   VSZ     RSS TTY   STAT STARTED     TIME COMMAND
root          0  0.0 15781.9     0 3050660 ?     DKl   5:37PM 60:26.58 [system]
...

... or wait, that %MEM estimate tracks pretty closely with 79G, but
that RSS value only means 3G, doesn't it?
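
As a cross-check, if ps computes %MEM in the conventional way, as
100 * RSS / physical memory (an assumption about ps's internals, not
something verified here), then both tools are working from the same
inflated page count and only the printed RSS column disagrees:

#include <stdio.h>

int main(void)
{
	const double physmem_mb = 512.0;	/* RAM in this machine */
	const double pmem_pct = 15781.9;	/* %MEM reported by ps */

	/* %MEM = 100 * RSS / physmem  =>  RSS = %MEM / 100 * physmem */
	printf("implied RSS = %.1f GB\n",
	    pmem_pct / 100.0 * physmem_mb / 1024.0);
	/* prints 78.9 GB -- consistent with top's 79G RES figure */
	return 0;
}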

These numbers appear to have crept up slowly since boot and then
stabilized at what I'm showing here.  I'm guessing this is some sort of
pseudo-leak in memory accounting rather than actual memory consumption.
This test suite is extremely file-access-heavy, so something wrong in
vnode accounting could perhaps fit the facts.  The suite also ran a
good deal slower than I'd hoped, almost double the time it takes under
HP-UX on the same hardware, so I'm wondering whether this bogus
accounting is having real performance effects somewhere.

For comparison's sake I looked at a nearby machine running 9.2/amd64,
and there top and ps agree that the kernel is using about 26M, which
seems generally sane.

>How-To-Repeat:
I was running Postgres "make check-world", but I doubt it's specific to this exact workload.
>Fix:
Nick Hudson speculates there's something wrong in sys/arch/hppa/hppa/pmap.c.
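
The candidate patch in the audit trail below (subsequently committed)
bears that out: pmap_kenter_pa() bumped pmap_kernel()'s wired and
resident counters for every mapping, including the direct-mapped
bootstrap pages, while pmap_kremove() never decremented them.  A
minimal sketch of that imbalance, borrowing names from the patch (an
illustration only, not actual kernel code):

#include <stdio.h>

#define PMAP_DIRECTMAP	0x10000000	/* flag introduced by the patch */

static struct { long wired_count, resident_count; } pm_stats;

static void kenter_pa(int flags)
{
	/* Post-patch behaviour: direct-mapped pages are not counted. */
	if ((flags & PMAP_DIRECTMAP) == 0) {
		pm_stats.wired_count++;
		pm_stats.resident_count++;
	}
}

static void kremove(void)
{
	/* These decrements were missing entirely before the patch,
	 * so the counters could only ever grow. */
	pm_stats.wired_count--;
	pm_stats.resident_count--;
}

int main(void)
{
	kenter_pa(0);			/* ordinary mapping: counted */
	kremove();			/* ...and now uncounted again */
	printf("resident_count = %ld\n", pm_stats.resident_count); /* 0 */
	return 0;
}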

>Release-Note:

>Audit-Trail:
From: Tom Lane <tgl@sss.pgh.pa.us>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-hppa/56849: Wacko kernel memory accounting in current/hppa
Date: Sun, 22 May 2022 20:15:58 -0400

 I got a chance to reboot the machine, and I can now say that the
 numbers are weird even very shortly after boot:

 load averages:  0.13,  0.16,  0.08;               up 0+00:03:58        18:47:42
 20 processes: 19 sleeping, 1 on CPU
 CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
 Memory: 57M Act, 26M Wired, 16M Exec, 21M File, 380M Free
 Swap: 1024M Total, 1024M Free

   PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
     0 root      96    0     0K  656M usbevt      0:02  0.00%  0.00% [system]

 USER     PID %CPU %MEM   VSZ    RSS TTY   STAT STARTED    TIME COMMAND
 root       0  0.0 128.2     0 671900 ?     DKl   6:44PM 0:02.75 [system]

 ... and it goes up from there.

 			regards, tom lane

From: Nick Hudson <nick.hudson@gmx.co.uk>
To: gnats-bugs@netbsd.org, port-hppa-maintainer@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: port-hppa/56849: Wacko kernel memory accounting in current/hppa
Date: Tue, 24 May 2022 06:50:08 +0100

 This is a multi-part message in MIME format.
 --------------WIqpHylpkgtuYOD7Yk76jtfL
 Content-Type: text/plain; charset=UTF-8; format=flowed
 Content-Transfer-Encoding: 7bit

 Candidate fix...

 Nick
 --------------WIqpHylpkgtuYOD7Yk76jtfL
 Content-Type: text/x-patch; charset=UTF-8; name="pr56849.diff"
 Content-Disposition: attachment; filename="pr56849.diff"
 Content-Transfer-Encoding: 7bit

 Index: sys/arch/hppa/hppa/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/hppa/hppa/pmap.c,v
 retrieving revision 1.116
 diff -u -p -r1.116 pmap.c
 --- sys/arch/hppa/hppa/pmap.c	9 Apr 2022 23:38:32 -0000	1.116
 +++ sys/arch/hppa/hppa/pmap.c	24 May 2022 05:49:05 -0000
 @@ -1109,7 +1109,7 @@ pmap_bootstrap(vaddr_t vstart)
  		else if (va == uvm_lwp_getuarea(&lwp0) + USPACE - PAGE_SIZE)
  			prot = UVM_PROT_NONE;
  #endif
 -		pmap_kenter_pa(va, va, prot, 0);
 +		pmap_kenter_pa(va, va, prot, PMAP_DIRECTMAP);
  	}
 
  	/* XXXNH update */
 @@ -1990,8 +1993,11 @@ pmap_kenter_pa(vaddr_t va, paddr_t pa, v
  	    pmap_prot(pmap_kernel(), prot & VM_PROT_ALL));
  	if (IS_IOPAGE_P(pa) || (flags & PMAP_NOCACHE))
  		pte |= PTE_PROT(TLB_UNCACHEABLE);
 -	pmap_kernel()->pm_stats.wired_count++;
 -	pmap_kernel()->pm_stats.resident_count++;
 +
 +	if ((flags & PMAP_DIRECTMAP) == 0) {
 +		pmap_kernel()->pm_stats.wired_count++;
 +		pmap_kernel()->pm_stats.resident_count++;
 +	}
  	if (opte)
  		pmap_pte_flush(pmap_kernel(), va, opte);
 
 @@ -2071,6 +2077,9 @@ pmap_kremove(vaddr_t va, vsize_t size)
  		pmap_pte_flush(pmap, va, pte);
  		pmap_pte_set(pde, va, 0);
 
 +		pmap->pm_stats.wired_count--;
 +		pmap->pm_stats.resident_count--;
 +
  		pg = pmap_initialized ? PHYS_TO_VM_PAGE(PTE_PAGE(pte)) : NULL;
  		if (pg != NULL) {
  			struct vm_page_md * const md = VM_PAGE_TO_MD(pg);
 Index: sys/arch/hppa/include/pmap.h
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/hppa/include/pmap.h,v
 retrieving revision 1.40
 diff -u -p -r1.40 pmap.h
 --- sys/arch/hppa/include/pmap.h	14 Mar 2020 14:05:42 -0000	1.40
 +++ sys/arch/hppa/include/pmap.h	24 May 2022 05:49:05 -0000
 @@ -72,6 +72,8 @@ struct pmap {
  #define	HPPA_MAX_PID	0xfffa
  #define	HPPA_SID_MAX	0x7ffd
 
 +#define	PMAP_DIRECTMAP	0x10000000 /* kenter_pa */
 +
  /*
   * DON'T CHANGE THIS - this is assumed in lots of places.
   */

 --------------WIqpHylpkgtuYOD7Yk76jtfL--

From: Tom Lane <tgl@sss.pgh.pa.us>
To: gnats-bugs@netbsd.org, Nick Hudson <nick.hudson@gmx.co.uk>
Cc: port-hppa-maintainer@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: port-hppa/56849: Wacko kernel memory accounting in current/hppa
Date: Tue, 24 May 2022 16:55:11 -0400

 Nick Hudson <nick.hudson@gmx.co.uk> writes:
 >  Candidate fix...

 I can confirm that with this patch installed, top and ps report a
 reasonably stable kernel size, ranging from ~80M just after boot
 to ~95M at peak load.  This is a bit more than I was expecting, but
 it's probably fine given that I kicked some things up after hitting
 ENFILE errors:

 $ cat /etc/sysctl.conf
 ...
 # tgl additions
 kern.ipc.semmni=100
 kern.ipc.semmns=1000
 kern.maxvnodes=50000
 kern.maxproc=1000
 kern.maxfiles=5000


 I poked around with some of the other monitoring tools mentioned in
 the NetBSD Guide, and I noticed that "systat bufcache" is a tad
 confused about metadata buffers:

                     /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
      Load Average   |||||||||||||||||||||||||||||||||||||||||||||||

      2144 metadata buffers using             98681168592896 kBytes of memory (19777529200%).
     55451 pages for cached file data using   221804 kBytes of memory (44%).
     15502 pages for executables using         62008 kBytes of memory (12%).
     35216 pages for anon (non-file) data     140864 kBytes of memory (28%).
       495 free pages                           1980 kBytes of memory ( 0%).

 File System          Bufs used   %   kB in use   %  Bufsize kB   %  Util %
 /home                     1227  57       12852   0       12852   0     100
 /usr                       524  24        4460   0        4498   0      99
 /var                        36   1         226   0         226   0     100
 /                           19   0         186   0         186   0     100

 Total:                    1806  84       17724   0       17762   0     100


 error reading kmem for vaddr at 0xda030 (Bad address)


 That error message at the bottom of the screen might or might not
 be related to the silly stat.  I'm not sure to what extent these
 numbers should be expected to match what "top" shows, but they
 don't seem crazy on their own with the exception of that one.

 			regards, tom lane
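
A plausible reading of those systat numbers, assuming the percentage is
derived from the same byte count it is printed next to (an assumption,
not something stated in this PR): dividing the printed kBytes by the
printed percentage recovers a sane memory total, which suggests only
the metadata byte count itself is corrupt.

#include <stdio.h>

int main(void)
{
	const double kbytes = 98681168592896.0;	/* reported metadata kB */
	const double pct = 19777529200.0;	/* reported percentage */

	/* if pct = 100 * kbytes / total, then total = 100 * kbytes / pct */
	double total_kb = 100.0 * kbytes / pct;

	printf("implied total = %.0f kB (~%.0f MB)\n",
	    total_kb, total_kb / 1024.0);
	/* ~498957 kB, roughly this 512M machine's usable RAM, so the
	 * percentage is at least self-consistent with the bogus count */
	return 0;
}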

State-Changed-From-To: open->feedback
State-Changed-By: skrll@NetBSD.org
State-Changed-When: Sat, 28 May 2022 06:39:43 +0000
State-Changed-Why:
Changes committed. Other statistics problems exist elsewhere too, I think.
OK to close?


From: Tom Lane <tgl@sss.pgh.pa.us>
To: gnats-bugs@netbsd.org, skrll@NetBSD.org
Cc: port-hppa-maintainer@netbsd.org, netbsd-bugs@netbsd.org,
        gnats-admin@netbsd.org
Subject: Re: port-hppa/56849 (Wacko kernel memory accounting in current/hppa)
Date: Sat, 28 May 2022 12:57:49 -0400

 skrll@NetBSD.org writes:
 > Changes committed. Other statistics problems exist elsewhere too, I think.
 > OK to close?

 OK by me; this is an improvement, even if it doesn't fix every
 problem in the area.

 			regards, tom lane

State-Changed-From-To: feedback->closed
State-Changed-By: skrll@NetBSD.org
State-Changed-When: Sun, 29 May 2022 06:28:33 +0000
State-Changed-Why:
Submitter agreed to close


>Unformatted:
