NetBSD Problem Report #57558
From Frank.Kardel@Acrys.com Thu Aug 3 08:44:21 2023
Return-Path: <Frank.Kardel@Acrys.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 1BB351A9238
for <gnats-bugs@gnats.NetBSD.org>; Thu, 3 Aug 2023 08:44:21 +0000 (UTC)
Message-Id: <20230803084410.0E6E16019@gaia.acrys.com>
Date: Thu, 3 Aug 2023 10:44:10 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: pgdaemon 100% busy - no scanning (ZFS case)
X-Send-Pr-Version: 3.95
>Number: 57558
>Category: kern
>Synopsis: pgdaemon 100% busy - no scanning (ZFS case)
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Aug 03 08:45:00 +0000 2023
>Last-Modified: Fri Apr 26 15:35:01 +0000 2024
>Originator: Frank Kardel
>Release: NetBSD 10.0_BETA / current
>Organization:
>Environment:
System: NetBSD Marmolata 10.0_BETA NetBSD 10.0_BETA (XEN3_DOM0) #1: Thu Jul 27 18:30:30 CEST 2023 kardel@gaia:/src/NetBSD/n10/src/obj.amd64/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
It has been observed that pgdaemon can get into a tight loop
consuming 100% CPU, not yielding, and blocking other runnable
threads on the CPU. (PRs kern/56516 (for effects - may have another cause), kern/55707)
This analysis and proposed fix relate to the pgdaemon loop caused by KVA exhaustion
by ZFS.
Observed and analyzed in the following environment (should be reproducible in simpler
environments):
XEN3_DOM0 providing vnd devices based on files on ZFS.
GENERIC(pvh) DOMU using an ffs file system based on the vnd in XEN3_DOM0.
Observed actions/effects:
1) running a database on the ffs file system in the DOMU
2) load a larger database
3) XEN3_DOM0 is fine until ZFS has allocated 90% of KVA;
at this point pgdaemon kicks in and enters a tight loop.
4) pgdaemon does not do page scans (enough memory is available)
5) pgdaemon loops as uvm_km_va_starved_p() returns true
6) pool_drain is unable to reclaim any idle pages from the pools
7) uvm_km_va_starved_p() thus keeps returning true - pgdaemon keeps looping
Analyzed causes:
- pool_drain causes upcalls to the ZFS reclaim logic
- the ZFS reclaim logic does not reclaim anything, as the current code
looks at uvm_availmem(false), which reports 'plenty' of memory;
thus no attempt is made to free memory in ZFS and no KVA is reclaimed.
Conclusion:
- using uvm_availmem(false) for ZFS memory throttling is wrong, as
ZFS memory is allocated from kmem KVA pools.
- the ZFS ARC must use KVA-based figures for its memory checks
>How-To-Repeat:
run a DB load on an FFS on a vnd backed by a file on ZFS.
>Fix:
Patch 1:
let ZFS use a correct view of KVA memory:
With this patch ARC reclaim now detects the memory shortage and
frees pages. Also the KVA used by ZFS is limited to
75% of KVA - this could be made tunable.
Patch 1 is not sufficient, though. The ARC reclaim thread kicks in at 75%
correctly, but pages are not fully reclaimed, and ZFS depletes its cache
fully, as the freed and now idle pages are not yet reclaimed from the pools.
pgdaemon will now not trigger pool_drain, as uvm_km_va_starved_p() returns false
at this point.
To reclaim the freed pages directly we need
Patch 2:
force page reclaim
that will perform the reclaim.
With both fixes the ARC reclaim thread kicks in at 75% KVA usage and
reclaims only enough memory not to exceed 75% KVA usage.
Any comments?
OK to commit? (happens automatically on no feedback)
Index: external/cddl/osnet/dist/uts/common/fs/zfs/arc.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c,v
retrieving revision 1.22
diff -c -u -r1.22 arc.c
--- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c 3 Aug 2022 01:53:06 -0000 1.22
+++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c 3 Aug 2023 08:19:11 -0000
@@ -276,6 +276,7 @@
#endif /* illumos */
#ifdef __NetBSD__
+#include <sys/vmem.h>
#include <uvm/uvm.h>
#ifndef btop
#define btop(x) ((x) / PAGE_SIZE)
@@ -285,9 +286,9 @@
#endif
//#define needfree (uvm_availmem() < uvmexp.freetarg ? uvmexp.freetarg : 0)
#define buf_init arc_buf_init
-#define freemem uvm_availmem(false)
+#define freemem btop(vmem_size(kmem_arena, VMEM_FREE))
#define minfree uvmexp.freemin
-#define desfree uvmexp.freetarg
+#define desfree (btop(vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE)) / 4)
#define zfs_arc_free_target desfree
#define lotsfree (desfree * 2)
#define availrmem desfree
Patch 2:
force reclaiming of pages on affected pools
Index: external/cddl/osnet/sys/kern/kmem.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/sys/kern/kmem.c,v
retrieving revision 1.3
diff -c -u -r1.3 kmem.c
--- external/cddl/osnet/sys/kern/kmem.c 11 Nov 2020 03:31:04 -0000 1.3
+++ external/cddl/osnet/sys/kern/kmem.c 3 Aug 2023 08:19:11 -0000
@@ -124,6 +124,7 @@
{
pool_cache_invalidate(km->km_pool);
+ pool_cache_reclaim(km->km_pool);
}
#undef kmem_alloc
>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: kardel@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 10:23:19 +0000
Cool, thanks for looking into this! I was planning to investigate at
some point soon, starting by adding dtrace probes (and maybe wiring up
the sysctl knobs) so we can reproduce the analysis of the issue in the
field. Your analysis sounds plausible, but I'd like to make sure we
have the visibility to verify the behaviour -- and the change in
behaviour -- first!
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 13:06:33 +0200
Hi Taylor,
I used the existing dtrace knobs in arc.c (adding one for available
memory, though fbt:return on
arc_available_memory would suffice) and counter-based
event debugging (additional code) to track the tight loop in uvm_pdaemon.c.
A sysctl for the KVA percentage would be useful. The existing fbt and sdt probes
already help a lot to track the pattern.
In my setup (soon to be used for actual work) the loops could be reproduced.
With patch 1 the pgdaemon loops went away; the KVA used for ZFS was limited
to 75%, but the
cache was depleted because of the missing patch 2. And patch 2 is needed, as
there is no chance
the pgdaemon will trigger a pool_drain unless we reach KVA starvation on
a non-ZFS path.
So what would the next steps be?
Frank
From: Taylor R Campbell <riastradh@NetBSD.org>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 12:22:58 +0000
Can you share the dtrace scripts you used for reference and how you
set up the experiment?
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:11:51 +0200
Sure
Setup:
- all userlands NetBSD-10.0_BETA
- NetBSD 10.0_BETA (2023-07-26) (-current should also work) XEN3_DOM0
(pagedaemon patched, see pd.diff attachment)
- xen-4.15.1
- NetBSD 10.0_BETA GENERIC as DOMU
- on DOM0 a ZFS file system providing a file for the FFS file system
in the DOMU
- DOMU has a postgresql 14.8 installation
- testcase is loading a significant database (~200 GB) into the
postgres DB.
This seems complicated to set up (but I am preparing this kind of VM for
our purposes).
Going by the errors detected it should all be possible (not tested) to:
- create a ZFS file system on a plain GENERIC system
- create a file system file in ZFS
- vnconfig vndX <path to the file system file>
- disklabel vndX
- newfs vndXa
- mount /dev/vndXa /mnt
- do lots of fs traffic writing, deleting, rewriting on the mounted fs
Part 1 - current situation:
Use
sdt:::arc-available_memory
{
printf("mem = %d, reason = %d", arg0, arg1);
}
to track what ZFS thinks it has as memory - positive values mean there is
enough memory, negative values ask the ZFS ARC to free that much memory.
Use vmstat -m to track pool usage - you should see that ZFS takes
more and more memory until 90% of kmem is used in the pools.
At that point you should see a ~100% busy pgdaemon in top, and
the pagedaemon patch should list high counts for loops, cnt_starved and
cnt_avail, as uvm_availmem(false) still reports many free pages.
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9813.2250709] pagedaemon: loops=16023729, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16023729, cnt_starved=16023729, cnt_avail=16023729, fpages=336349
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9819.2252810] pagedaemon: loops=16018349, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16018349, cnt_starved=16018349, cnt_avail=16018349, fpages=336542
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9825.2255258] pagedaemon: loops=16025793, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16025793, cnt_starved=16025793, cnt_avail=16025793, fpages=336516
...
These document the tight loop with no progress. The pgdaemon will not
recover - see my analysis.
Observe that arc_reclaim is not freeing anything (and collects no CPU
time, see top) because arc_available_memory claims that there is enough
free memory (it looks at uvm_availmem(false)).
The dtrace probe documents that.
Part 2 - get the arc_reclaim thread to actually be triggered before kmem
is starving.
Install Patch 1 from the bug report. This lets ZFS look at the
kmem_arena space situation, which is also what
uvm_km.c:uvm_km_va_starved_p() looks at.
Now ZFS has a chance to start reclaiming memory.
Run the load test again.
The dtrace probe should now show decreasing memory until it gets
negative. And it will stay negative by a certain amount.
vmstat -m should show that ZFS now only hogs ~75% of kmem.
Also there should be a significant number of idle pages, as the
arc_reclaim thread did give up memory.
As the idle pages are not yet reclaimed from the pools, ZFS is
continually asked to free memory (dtrace probe) and vmstat -m will
show the non-zero idle page counts. Thus ZFS now has 75% of kmem
allocated but utilizes only a small part. Thus the cache
is allocated but not used anymore.
We need to get the idle pages actually reclaimed from the pools. This is
done by Patch 2 from the bug report.
There is no way to pass this task to the pgdaemon, as it looks only at
uvm_availmem(false), which does not consider kmem unless starving. Also
the pool drain thread drains only one pool per invocation, and it
is not even triggered here.
So Patch 2 directly reclaims from the pool_cache_invalidate()d pool.
With this strategy ZFS keeps the kmem usage around 75%, as idle pages
are now reclaimed and ZFS only gets negative arc_available_memory
values when called for.
vmstat will show that ZFS now stays within the 75% kmem limit. arc_reclaim will
run at a suitable rate when needed. The ZFS pools should not show too many
idle pages (idle pages
are removed after some cool-down time to reduce xcall activity, if I read
the code right).
dtrace should show positive and negative arc_available_memory figures.
I did not keep the vmstat, dtrace and top outputs. But from a busy
DOMU loading a database (databases > 350 GB)
I see a vmstat -m of
Memory resource pool statistics
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
...
zfs_znode_cache 248 215697 0 0 13482 0 13482 13482 0 inf 0
zil_lwb_cache 208 84 0 0 5 0 5 5 0 inf 0
zio_buf_1024 1536 11248 0 7612 3278 1460 1818 1818 0 inf 0
zio_buf_10240 10240 1130 0 723 973 566 407 407 0 inf 0
zio_buf_114688 114688 351 0 200 339 188 151 151 0 inf 0
zio_buf_12288 12288 1006 0 714 721 429 292 305 0 inf 0
zio_buf_131072 131072 3150 89 2176 1841 867 974 974 0 inf 0
zio_buf_14336 14336 473 0 308 432 267 165 166 0 inf 0
zio_buf_1536 2048 2060 0 1065 549 51 498 498 0 inf 0
zio_buf_16384 16384 9672 0 481 9318 127 9191 9191 0 inf 0
zio_buf_2048 2048 2001 0 826 682 94 588 588 0 inf 0
zio_buf_20480 20480 461 0 301 428 268 160 160 0 inf 0
zio_buf_24576 24576 448 0 293 404 249 155 155 0 inf 0
zio_buf_2560 2560 2319 1 490 1948 119 1829 1829 0 inf 0
zio_buf_28672 28672 369 0 221 345 197 148 152 0 inf 0
zio_buf_3072 3072 4163 2 422 3861 120 3741 3741 0 inf 0
...
zio_buf_7168 7168 506 0 292 465 251 214 214 0 inf 0
zio_buf_8192 8192 724 0 329 635 240 395 395 0 inf 0
zio_buf_81920 81920 379 0 229 371 221 150 161 0 inf 0
zio_buf_98304 98304 580 0 421 442 283 159 163 0 inf 0
zio_cache 992 4707 0 0 1177 0 1177 1177 0 inf 0
zio_data_buf_10 1536 39 0 33 20 17 3 12 0 inf 0
zio_data_buf_10 10240 2 0 2 2 2 0 2 0 inf 0
zio_data_buf_13 131072 488674 0 323782 274996 110104 164892 191800 0 inf 0
zio_data_buf_15 2048 25 0 19 13 10 3 7 0 inf 0
zio_data_buf_20 2048 17 0 13 9 7 2 4 0 inf 0
zio_data_buf_20 20480 1 0 1 1 1 0 1 0 inf 0
zio_data_buf_25 2560 7 0 6 7 6 1 5 0 inf 0
...
Totals 222323337 98 210229180 1033080 125800 907280
In use 24951773K, total allocated 25255540K; utilization 98.8%
In the unpatched case all 32 GB were allocated.
The arc_reclaim_thread clocked in 20 CPU seconds - that is ok.
Current dtrace output is:
dtrace: script 'zfsmem.d' matched 1 probe
CPU ID FUNCTION:NAME
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
The page daemon was never woken up and has 0 CPU seconds in 2 days.
This all looks very much as desired.
Hope this helps.
Best regards,
Frank
[Attachment: pd.diff]
--- /src/NetBSD/n10/src/sys/uvm/uvm_pdaemon.c 2023-07-29 17:52:46.392362932 +0200
+++ /src/NetBSD/n10/src/sys/uvm/.#uvm_pdaemon.c.1.133 2023-07-29 14:18:05.000000000 +0200
@@ -270,11 +270,15 @@
/*
* main loop
*/
-
+/*XXXkd*/ unsigned long cnt_needsfree = 0L, cnt_needsscan = 0, cnt_drain = 0, cnt_starved = 0, cnt_avail = 0, cnt_loops = 0;
+/*XXXkd*/ time_t ts, last_ts = time_second;
for (;;) {
bool needsscan, needsfree, kmem_va_starved;
+/*XXXkd*/ cnt_loops++;
+
kmem_va_starved = uvm_km_va_starved_p();
+/*XXXkd*/ if (kmem_va_starved) cnt_starved++;
mutex_spin_enter(&uvmpd_lock);
if ((uvm_pagedaemon_waiters == 0 || uvmexp.paging > 0) &&
@@ -311,6 +315,8 @@
needsfree = fpages + uvmexp.paging < uvmexp.freetarg;
needsscan = needsfree || uvmpdpol_needsscan_p();
+/*XXXkd*/ if (needsfree) cnt_needsfree++;
+/*XXXkd*/ if (needsscan) cnt_needsscan++;
/*
* scan if needed
*/
@@ -328,8 +334,18 @@
wakeup(&uvmexp.free);
uvm_pagedaemon_waiters = 0;
mutex_spin_exit(&uvmpd_lock);
+/*XXXkd*/ cnt_avail++;
}
+/*XXXkd*/ if (needsfree || kmem_va_starved) cnt_drain++;
+/*XXXkd*/ ts = time_second;
+/*XXXkd*/ if (ts > last_ts + 5 && cnt_loops > 5 * 10000) {
+/*XXXkd*/ printf("pagedaemon: loops=%ld, cnt_needsfree=%ld, cnt_needsscan=%ld, cnt_drain=%ld, cnt_starved=%ld, cnt_avail=%ld, fpages=%d\n",
+/*XXXkd*/ cnt_loops, cnt_needsfree, cnt_needsscan, cnt_drain, cnt_starved, cnt_avail, fpages);
+/*XXXkd*/ cnt_needsfree = cnt_needsscan = cnt_drain = cnt_starved = cnt_avail = cnt_loops = 0;
+/*XXXkd*/ last_ts = ts;
+/*XXXkd*/ }
+
/*
* scan done. if we don't need free memory, we're done.
*/
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:19:57 +0200
Just a note: bug patch 1 affects the zfs module, patch 2 affects the
solaris module, so only module builds are needed.
From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 09:27:50 -0700
On Thu, Aug 03, 2023 at 08:45:01AM +0000, kardel@netbsd.org wrote:
> Patch 1:
> let ZFS use a correct view on KVA memory:
> With this patch arc reclaim now detects memory shortage and
> frees pages. Also the ZFS KVA used by ZFS is limited to
> 75% KVA - could be made tunable
>
> Patch 1 is not sufficient though. arc reclaim thread kicks in at 75%
> correctly, but pages are not fully reclaimed and ZFS depletes its cache
> fully as the freed and now idle page are not reclaimed from the pools yet.
> pgdaemon will now not trigger pool_drain, as uvm_km_va_starved_p() returns false
> at this point.
this patch is not correct. it does not do the right thing when there
is plenty of KVA but a shortage of physical pages. the goal with
previous fixes for ZFS ARC memory management problems was to prevent
KVA shortages by making KVA big enough to map all of RAM, and thus
avoid the need to consider KVA because we would always run low on
physical pages before we would run low on KVA. but apparently in your
environment that is not working. maybe we do something differently in
a XEN kernel that we need to account for?
> To reclaim the pages freed directly we need
> Patch 2:
> force page reclaim
> that will perform the reclaim.
this second patch is fine.
-Chuck
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 20:22:10 +0200
Hi Chuck!
Thanks for looking into that.
I came up with the first patch due to the pgdaemon looping because
uvm_km_va_starved_p() was true.
vmstat -m shows that the pool statistics in summary are close to the
32 GB my DOM0 has.
Counting the conditions while the pgdaemon is looping gives
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9789.2242179] pagedaemon: loops=16026699, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16026699, cnt_starved=16026699, cnt_avail=16026699, fpages=337385
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9795.2244437] pagedaemon: loops=16024007, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16024007, cnt_starved=16024007, cnt_avail=16024007, fpages=335307
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9801.2246381] pagedaemon: loops=16031141, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16031141, cnt_starved=16031141, cnt_avail=16031141, fpages=335331
bool
uvm_km_va_starved_p(void)
{
	vmem_size_t total;
	vmem_size_t free;

	if (kmem_arena == NULL)
		return false;

	total = vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE);
	free = vmem_size(kmem_arena, VMEM_FREE);

	return (free < (total / 10));
}

int
uvm_availmem(bool cached)
{
	int64_t fp;

	cpu_count_sync(cached);
	if ((fp = cpu_count_get(CPU_COUNT_FREEPAGES)) < 0) {
		/*
		 * XXXAD could briefly go negative because it's impossible
		 * to get a clean snapshot.  address this for other counters
		 * used as running totals before NetBSD 10 although less
		 * important for those.
		 */
		fp = 0;
	}
	return (int)fp;
}
So, while uvm_km_va_starved_p() considers almost all memory used up,
uvm_availmem(false) returns 337385 free pages (~1.28 GB), well above
uvmexp.freetarg.
So, why do we count so many free pages when the free vmem for kmem_arena
is less than 10% of the total kmem_arena?
Maybe the pool pages have been allocated but not yet referenced - I
didn't look that deep into the vmem/ZFS interaction.
I understand the reasoning why kmem size = physmem size should have worked.
There are still inconsistencies, though.
Even if uvm_availmem(false) accounted for all pages
allocated/reserved in the kmem_arena vmem, on the 32 GB system the actual
free target is 2730 free pages (~10.7 MB).
10% of 32 GB would be 3.2 GB, which is a large multiple of the free page
target. So even then we would be stuck with a looping page daemon.
I think we need to find a better way of coping with the accounting
differences between vmem and uvm free pages. Looking at the vmem statistics
seemed logical to me, as ZFS allocates almost everything from kmem_arena
via pools.
I don't know what vmem does when there are fewer physical pages available
than the vmem allocation would require. This was the case you tried to
avoid.
So, looking at the vmem statistics seems to be consistent with the
starved-flag logic - that is why it does not trigger the looping pgdaemon.
What isn't covered is the case of fewer physical pages than the pool
allocation requires.
I think we have yet to find a correct, robust solution that does not
trigger the pgdaemon's almost-infinite loop.
Frank
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sat, 5 Aug 2023 13:24:55 -0000 (UTC)
chuq@chuq.com (Chuck Silvers) writes:
> this patch is not correct. it does not do the right thing when there
> is plenty of KVA but a shortage of physical pages.
I doubt that this is handled well enough outside of ZFS.
The result of using ZFS (removing a pkgsrc tree, unpacking it again) is
about 3.2 GB of kernel pools used (out of 8 GB total RAM).
Trying to reduce ZFS usage again by tuning ZFS, shrinking maxvnodes and
then allocating user pages ends in the following (only pools >10M shown):
arc_buf_hdr_t_f 0.018G
buf16k 0.014G
dmu_buf_impl_t 0.107G
dnode_t 0.173G
kmem-00064 0.037G
kmem-00128 0.089G
kmem-00192 0.182G
kmem-00256 0.069G
kmem-00384 0.161G
kmem-01024 0.020G
kmem-02048 0.033G
mutex 0.020G
namecache 0.020G
pcglarge 0.042G
pcgnormal 0.107G
phpool-64 0.014G
rwlock 0.020G
sa_cache 0.032G
vcachepl 0.206G
zfs_znode_cache 0.072G
zio_buf_131072 0.015G
zio_buf_16384 0.075G
zio_buf_4096 0.019G
zio_buf_512 0.170G
zio_cache 0.161G
That's about 2GB left. Things like dnodes or the zio_cache are never
flushed, the 512 byte zio buffer pool is still huge because it is
totally fragmented, but also vcachepl is never drained.
ZFS tries to drop referenced metadata in Solaris or FreeBSD, but
for NetBSD that's still a nop.
The OpenZFS code in current FreeBSD looks quite different too in
that area.
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57558 CVS commit: src/external/cddl/osnet/sys/kern
Date: Sat, 9 Sep 2023 00:14:16 +0000
Module Name: src
Committed By: riastradh
Date: Sat Sep 9 00:14:16 UTC 2023
Modified Files:
src/external/cddl/osnet/sys/kern: kmem.c
Log Message:
solaris: Use pool_cache_reclaim, not pool_cache_invalidate.
pool_cache_invalidate invalidates cached objects, but doesn't return
any backing pages to the underlying page allocator.
pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
pages to the underlying page allocator, so it is actually useful for
the page daemon to do when trying to free memory.
PR kern/57558
XXX pullup-10
XXX pullup-9
XXX pullup-8 (by patch to kmem.h instead of kmem.c)
To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.4 src/external/cddl/osnet/sys/kern/kmem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57558 CVS commit: [netbsd-9] src/external/cddl/osnet/sys/kern
Date: Mon, 2 Oct 2023 13:31:14 +0000
Module Name: src
Committed By: martin
Date: Mon Oct 2 13:31:14 UTC 2023
Modified Files:
src/external/cddl/osnet/sys/kern [netbsd-9]: kmem.c
Log Message:
Pull up following revision(s) (requested by riastradh in ticket #1735):
external/cddl/osnet/sys/kern/kmem.c: revision 1.4
solaris: Use pool_cache_reclaim, not pool_cache_invalidate.
pool_cache_invalidate invalidates cached objects, but doesn't return
any backing pages to the underlying page allocator.
pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
pages to the underlying page allocator, so it is actually useful for
the page daemon to do when trying to free memory.
PR kern/57558
To generate a diff of this commit:
cvs rdiff -u -r1.2.2.1 -r1.2.2.2 src/external/cddl/osnet/sys/kern/kmem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57558 CVS commit: [netbsd-10] src/external/cddl/osnet/sys/kern
Date: Mon, 2 Oct 2023 13:29:59 +0000
Module Name: src
Committed By: martin
Date: Mon Oct 2 13:29:59 UTC 2023
Modified Files:
src/external/cddl/osnet/sys/kern [netbsd-10]: kmem.c
Log Message:
Pull up following revision(s) (requested by riastradh in ticket #383):
external/cddl/osnet/sys/kern/kmem.c: revision 1.4
solaris: Use pool_cache_reclaim, not pool_cache_invalidate.
pool_cache_invalidate invalidates cached objects, but doesn't return
any backing pages to the underlying page allocator.
pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
pages to the underlying page allocator, so it is actually useful for
the page daemon to do when trying to free memory.
PR kern/57558
To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.3.6.1 src/external/cddl/osnet/sys/kern/kmem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Fri, 26 Apr 2024 14:19:37 -0000 (UTC)
kardel@netbsd.org (Frank Kardel) writes:
>Observed behavior:
>pagedaemon runs at 100% (already at 550 minutes CPU and counting)
I have local patches to avoid spinning. As the page daemon keeps
data structures locked and runs at maximum priority, it prevents
other tasks from releasing resources. That helps only a little, though:
if the page daemon really cannot free anything, the system is still
locked up to some degree.
The main reason the page daemon cannot free memory is
that vnodes are not drained. This keeps the associated pools
busy and the buffers allocated by the file cache.
Of course, if ZFS isn't throttled (and it would be less so, if
others make room), it would just expunge the rest of the system
data, so any improvement here just shifts the problem.
>Once uvm_availmem() falls below uvmexp.freetarg the pagedaemon unwedges
>from its tight loop as ZFS finally gives up its stranglehold on pool
>memory..
I locally added some of the ARC tunables to experiment with the
free_target value. The calculation of arc_c_max in arc.c also
doesn't agree with the comments.
Later ZFS versions changed a lot in this area. Anything we
correct might need to be redone when we move to a newer ZFS
code base.
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Fri, 26 Apr 2024 17:34:12 +0200
see below.
On 04/26/24 16:20, Michael van Elst wrote:
> The following reply was made to PR kern/57558; it has been noted by GNATS.
>
> From: mlelstv@serpens.de (Michael van Elst)
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
> Date: Fri, 26 Apr 2024 14:19:37 -0000 (UTC)
>
> kardel@netbsd.org (Frank Kardel) writes:
>
> >Observed behavior:
> >pagedaemon runs at 100% (alreads at 550 minutes CPU and counting)
>
> I have local patches to avoid spinning. As the page daemon keeps
> data structures locked and runs at maximum priority, it prevents
> other tasks to release resources. That's only a little help,
> if the page daemon really cannot free anything, the system is still
> locked up to some degree.
In this situation it spins only because the KVA starvation condition is
met. That triggers
the pooldrain thread, which in turn attempts to drain the pools. The ZFS
pools
call into arc.c:hdr_recl(), which triggers the arc_reclaim_thread. In
this scenario
arc_available_memory() returns a positive value, and thus the reclaim
thread does not
reclaim anything.
So the pagedaemon loops, but nothing improves as long
as arc_available_memory() returns positive values. At this point the ZFS
statistics list
large amounts of evictable data.
>
> The main reason, that the page daemon cannot free memory is
> that vnodes are not drained. This keeps the associated pools
> busy and buffers allocated by the file cache.
Well, this may not be the reason in this case - no attempt is even made
to let the reclaim thread evict data and drain pools. It doesn't
get that far.
>
> Of course, if ZFS isn't throttled (and it would be less so, if
> others make room), it would just expunge the rest of the system
> data, so any improvement here just shifts the problem.
Well, ZFS currently likes to eat up all pool memory in this situation.
>
> >Once uvm_availmem() falls below uvmexp.freetarg the pagedaemon unwedges
> >from its tight loop as ZFS finally gives up its stranglehold on pool
> >memory..
>
> I locally added some of the arc tunables to experiment with the
> free_target value. The calculation of arc_c_max in arc.c also
> doesn't agree with the comments.
>
> Later ZFS versions did change a lot in this area. Anything we
> correct, might need to be redone, when we move to a newer ZFS
> code base.
>
Yes, but currently we seem to have a broken ZFS, at least in large-memory
environments. Effects I observed
are:
- this bug: the famous looping pagedaemon
- PR kern/58198: ZFS can lead to UVM kills (no swap, out of swap)
- I am still trying to find out what causes the whole system to
slow down to a crawl, but at that point the system is unusable for
gathering any information.
So something needs to be done.
>Unformatted: