NetBSD Problem Report #57558
From Frank.Kardel@Acrys.com Thu Aug 3 08:44:21 2023
Return-Path: <Frank.Kardel@Acrys.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 1BB351A9238
for <gnats-bugs@gnats.NetBSD.org>; Thu, 3 Aug 2023 08:44:21 +0000 (UTC)
Message-Id: <20230803084410.0E6E16019@gaia.acrys.com>
Date: Thu, 3 Aug 2023 10:44:10 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: pgdaemon 100% busy - no scanning (ZFS case)
X-Send-Pr-Version: 3.95
>Number: 57558
>Category: kern
>Synopsis: pgdaemon 100% busy - no scanning (ZFS case)
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Aug 03 08:45:00 +0000 2023
>Last-Modified: Thu May 09 19:10:01 +0000 2024
>Originator: Frank Kardel
>Release: NetBSD 10.0_BETA / current
>Organization:
>Environment:
System: NetBSD Marmolata 10.0_BETA NetBSD 10.0_BETA (XEN3_DOM0) #1: Thu Jul 27 18:30:30 CEST 2023 kardel@gaia:/src/NetBSD/n10/src/obj.amd64/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
It has been observed that pgdaemon can get into a tight loop,
consuming 100% CPU, not yielding, and blocking other RUNable
threads on the CPU. (PRs kern/56516 (for effects - may have other causes), kern/55707)
This analysis and proposed fix relate to the pgdaemon loop caused by KVA
exhaustion by ZFS.
Observed and analyzed in the following environment (should be reproducible
in simpler environments):
XEN3_DOM0 providing vnd devices based on files on ZFS.
GENERIC(pvh) DOMU using an ffs filesystem based on the vnd in XEN3_DOM0.
Observed actions/effects:
1) running a database on the ffs file system in XEN3_DOM0
2) load a larger database
3) XEN3_DOM0 is fine until ZFS allocated 90% of KVA
at this point pgdaemon kicks in and enters a tight loop.
4) pgdaemon does not do page scans (enough memory is available)
5) pgdaemon loops as uvm_km_va_starved_p() returns true
6) pool_drain is unable to reclaim any idle pages from the pools
7) uvm_km_va_starved_p() thus keeps returning true - pgdaemon keeps looping
Analyzed causes:
- pool_drain causes upcalls to the ZFS reclaim logic
- the ZFS reclaim logic does not reclaim anything, as the current code
looks at uvm_availmem(false) and that returns 'plenty' of memory;
thus no attempt is made to free memory on the ZFS side and no KVA is reclaimed.
Conclusion:
- using uvm_availmem(false) for ZFS memory throttling is wrong, as
ZFS memory is allocated from kmem KVA pools.
- the ZFS ARC must use KVA memory for its memory checks
>How-To-Repeat:
Run a DB load in an FFS on a vnd backed by a file on ZFS.
>Fix:
Patch 1:
let ZFS use a correct view of KVA memory:
With this patch arc reclaim now detects the memory shortage and
frees pages. Also the KVA used by ZFS is limited to
75% of KVA - this could be made tunable.
Patch 1 is not sufficient though. arc reclaim thread kicks in at 75%
correctly, but pages are not fully reclaimed and ZFS fully depletes its
cache, as the freed and now idle pages are not yet reclaimed from the pools.
pgdaemon will now not trigger pool_drain, as uvm_km_va_starved_p() returns false
at this point.
To reclaim the pages freed directly we need
Patch 2:
force page reclaim
that will perform the reclaim.
With both fixes the arc reclaim thread kicks in at 75% KVA usage and
reclaims only enough memory not to exceed 75% KVA usage.
Any comments?
OK to commit? (happens automatically on no feedback)
Index: external/cddl/osnet/dist/uts/common/fs/zfs/arc.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c,v
retrieving revision 1.22
diff -c -u -r1.22 arc.c
--- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c 3 Aug 2022 01:53:06 -0000 1.22
+++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c 3 Aug 2023 08:19:11 -0000
@@ -276,6 +276,7 @@
#endif /* illumos */
#ifdef __NetBSD__
+#include <sys/vmem.h>
#include <uvm/uvm.h>
#ifndef btop
#define btop(x) ((x) / PAGE_SIZE)
@@ -285,9 +286,9 @@
#endif
//#define needfree (uvm_availmem() < uvmexp.freetarg ? uvmexp.freetarg : 0)
#define buf_init arc_buf_init
-#define freemem uvm_availmem(false)
+#define freemem btop(vmem_size(kmem_arena, VMEM_FREE))
#define minfree uvmexp.freemin
-#define desfree uvmexp.freetarg
+#define desfree (btop(vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE)) / 4)
#define zfs_arc_free_target desfree
#define lotsfree (desfree * 2)
#define availrmem desfree
Patch 2:
force reclaiming of pages on affected pools
Index: external/cddl/osnet/sys/kern/kmem.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/sys/kern/kmem.c,v
retrieving revision 1.3
diff -c -u -r1.3 kmem.c
--- external/cddl/osnet/sys/kern/kmem.c 11 Nov 2020 03:31:04 -0000 1.3
+++ external/cddl/osnet/sys/kern/kmem.c 3 Aug 2023 08:19:11 -0000
@@ -124,6 +124,7 @@
{
pool_cache_invalidate(km->km_pool);
+ pool_cache_reclaim(km->km_pool);
}
#undef kmem_alloc
>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: kardel@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 10:23:19 +0000
Cool, thanks for looking into this! I was planning to investigate at
some point soon, starting by adding dtrace probes (and maybe wiring up
the sysctl knobs) so we can reproduce the analysis of the issue in the
field. Your analysis sounds plausible, but I'd like to make sure we
have the visibility to verify the behaviour -- and the change in
behaviour -- first!
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 13:06:33 +0200
Hi Taylor,
I used the existing DTrace probes in arc.c (adding one for
arc_available_memory(), though an fbt:return probe on
arc_available_memory would suffice) and counter-based
event debugging (additional code) to track the tight loop in uvm_pdaemon.c.
A sysctl for the KVA percentage would be useful. The existing fbt and sdt
probes already help a lot to track the pattern.
In my setup (soon to be used for actual work) the loops could be reproduced.
With patch 1 the pgdaemon loops went away and the KVA used by ZFS was
limited to 75%, but the cache was depleted because of the missing
patch 2. And patch 2 is needed, as there is no chance
the pgdaemon will trigger a pool_drain unless we reach KVA starvation on
a non-ZFS path.
So what would the next steps be?
Frank
On 08/03/23 12:25, Taylor R Campbell wrote:
> The following reply was made to PR kern/57558; it has been noted by GNATS.
>
> From: Taylor R Campbell <riastradh@NetBSD.org>
> To: kardel@NetBSD.org
> Cc: gnats-bugs@NetBSD.org
> Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
> Date: Thu, 3 Aug 2023 10:23:19 +0000
>
> Cool, thanks for looking into this! I was planning to investigate at
> some point soon, starting by adding dtrace probes (and maybe wiring up
> the sysctl knobs) so we can reproduce the analysis of the issue in the
> field. Your analysis sounds plausible, but I'd like to make sure we
> have the visibility to verify the behaviour -- and the change in
> behaviour -- first!
>
From: Taylor R Campbell <riastradh@NetBSD.org>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 12:22:58 +0000
Can you share the dtrace scripts you used for reference and how you
set up the experiment?
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:11:51 +0200
This is a multi-part message in MIME format.
--------------100111F35787E6220A86651D
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Sure
Setup:
- all userlands NetBSD-10.0_BETA
- NetBSD 10.0_BETA (2023-07-26) (-current should also work) XEN3_DOM0
(pagedaemon patched see pd.diff attachment)
- xen-4.15.1
- NetBSD 10.0_BETA GENERIC as DOMU
- on DOM0 zfs file system providing a file for the FFS file system
in the DOMU
- DOMU has a postgresql 14.8 installation
- test case is loading a sizable database (~200 GB) into the
postgres DB.
This seems complicated to set up (but I am preparing this kind of VM for
our purposes).
Going by the errors detected it should all be possible (not tested) to:
- create a ZFS file system on a plain GENERIC system
- create a file system file in ZFS
- vnconfig vndX <path to the file system file>
- disklabel vndX
- newfs vndXa
- mount /dev/vndXa /mnt
- do lots of fs traffic writing, deleting, rewriting on the mounted fs
Part 1 - current situation:
Use
sdt:::arc-available_memory
{
printf("mem = %d, reason = %d", arg0, arg1);
}
to track what ZFS thinks it has as memory - positive values mean there is
enough memory, negative values ask the ZFS ARC to free that much memory.
Use vmstat -m to track pool usage - you should see that ZFS will take
more and more memory until 90% of kmem is used in the pools.
At that point you should see a ~100% busy pgdaemon in top, and
the pagedaemon patch should list high counts for loops, cnt_starved and
cnt_avail, as uvm_availmem(false) still reports many free pages.
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9813.2250709] pagedaemon: loops=16023729, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16023729, cnt_starved=16023729, cnt_avail=16023729, fpages=336349
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9819.2252810] pagedaemon: loops=16018349, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16018349, cnt_starved=16018349, cnt_avail=16018349, fpages=336542
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9825.2255258] pagedaemon: loops=16025793, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16025793, cnt_starved=16025793, cnt_avail=16025793, fpages=336516
...
That documents the tight loop with no progress. The pgdaemon will not
recover - see my analysis.
Observe that arc_reclaim is not freeing anything (and collects no CPU
time, see top) because arc_available_memory() claims that there is enough
free memory (it looks at uvm_availmem(false)).
The dtrace probe documents that.
Part 2 - get the arc_reclaim thread to actually be triggered before kmem
is starving.
Install Patch 1 from the bug report. This lets ZFS look at the
kmem_arena space situation, which is also what
uvm_km.c:uvm_km_va_starved_p() looks at.
Now ZFS has a chance to start reclaiming memory.
Run the load test again.
The dtrace probe should now show decreasing memory until it gets
negative, and it will stay negative by a certain amount.
vmstat -m should show that ZFS now only hogs ~75% of kmem.
Also there should be significant Idle page counts, as the
arc_reclaim thread did give up memory.
As the idle pages are not yet reclaimed from the pools, ZFS is constantly
asked to free memory (dtrace probe) and vmstat -m will
show the non-zero Idle page counts. Thus ZFS now has 75% of kmem
allocated but utilizes only a small part of it: the cache
is allocated but not used anymore.
We need to get the Idle pages actually reclaimed from the pools. This is
done by Patch 2 from the bug.
There is no way to pass this task to the pgdaemon, as it only looks at
uvm_availmem(false), which does not consider kmem unless it is starving.
Also, the pool drain thread drains only one pool per invocation, and it
is not even triggered here.
So Patch 2 directly reclaims from the pool_cache_invalidate()ed pool.
With this strategy ZFS keeps the kmem usage around 75% as now Idle pages
are reclaimed and ZFS only gets negative arc_available_memory
values when called for.
vmstat will show that ZFS now stays within the 75% kmem limit. arc_reclaim
will run at a suitable rate when needed. ZFS pools should not show too many
idle pages (idle pages are removed after some cool-down time to reduce
xcall activity, if I read the code right).
dtrace should show positive and negative arc_available_memory figures.
I did not keep the vmstat, dtrace and top outputs. But from a busy
DB-loading DOMU (databases > 350 GB)
I see a vmstat -m of
Memory resource pool statistics
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
...
zfs_znode_cache 248 215697 0 0 13482 0 13482 13482 0 inf 0
zil_lwb_cache 208 84 0 0 5 0 5 5 0 inf 0
zio_buf_1024 1536 11248 0 7612 3278 1460 1818 1818 0 inf 0
zio_buf_10240 10240 1130 0 723 973 566 407 407 0 inf 0
zio_buf_114688 114688 351 0 200 339 188 151 151 0 inf 0
zio_buf_12288 12288 1006 0 714 721 429 292 305 0 inf 0
zio_buf_131072 131072 3150 89 2176 1841 867 974 974 0 inf 0
zio_buf_14336 14336 473 0 308 432 267 165 166 0 inf 0
zio_buf_1536 2048 2060 0 1065 549 51 498 498 0 inf 0
zio_buf_16384 16384 9672 0 481 9318 127 9191 9191 0 inf 0
zio_buf_2048 2048 2001 0 826 682 94 588 588 0 inf 0
zio_buf_20480 20480 461 0 301 428 268 160 160 0 inf 0
zio_buf_24576 24576 448 0 293 404 249 155 155 0 inf 0
zio_buf_2560 2560 2319 1 490 1948 119 1829 1829 0 inf 0
zio_buf_28672 28672 369 0 221 345 197 148 152 0 inf 0
zio_buf_3072 3072 4163 2 422 3861 120 3741 3741 0 inf 0
...
zio_buf_7168 7168 506 0 292 465 251 214 214 0 inf 0
zio_buf_8192 8192 724 0 329 635 240 395 395 0 inf 0
zio_buf_81920 81920 379 0 229 371 221 150 161 0 inf 0
zio_buf_98304 98304 580 0 421 442 283 159 163 0 inf 0
zio_cache 992 4707 0 0 1177 0 1177 1177 0 inf 0
zio_data_buf_10 1536 39 0 33 20 17 3 12 0 inf 0
zio_data_buf_10 10240 2 0 2 2 2 0 2 0 inf 0
zio_data_buf_13 131072 488674 0 323782 274996 110104 164892 191800 0 inf 0
zio_data_buf_15 2048 25 0 19 13 10 3 7 0 inf 0
zio_data_buf_20 2048 17 0 13 9 7 2 4 0 inf 0
zio_data_buf_20 20480 1 0 1 1 1 0 1 0 inf 0
zio_data_buf_25 2560 7 0 6 7 6 1 5 0 inf 0
...
Totals 222323337 98 210229180 1033080 125800 907280
In use 24951773K, total allocated 25255540K; utilization 98.8%
In the unpatched case all 32GB were allocated.
The arc_reclaim_thread clocked in at 20 CPU sec - that is OK.
Current dtrace output is:
dtrace: script 'zfsmem.d' matched 1 probe
CPU ID FUNCTION:NAME
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
The page daemon was never woken up and has 0 CPU seconds in 2 days.
This all looks very much as desired.
Hope this helps.
Best regards,
Frank
--------------100111F35787E6220A86651D
Content-Type: text/x-patch;
name="pd.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="pd.diff"
--- /src/NetBSD/n10/src/sys/uvm/uvm_pdaemon.c 2023-07-29 17:52:46.392362932 +0200
+++ /src/NetBSD/n10/src/sys/uvm/.#uvm_pdaemon.c.1.133 2023-07-29 14:18:05.000000000 +0200
@@ -270,11 +270,15 @@
/*
* main loop
*/
-
+/*XXXkd*/ unsigned long cnt_needsfree = 0L, cnt_needsscan = 0, cnt_drain = 0, cnt_starved = 0, cnt_avail = 0, cnt_loops = 0;
+/*XXXkd*/ time_t ts, last_ts = time_second;
for (;;) {
bool needsscan, needsfree, kmem_va_starved;
+/*XXXkd*/ cnt_loops++;
+
kmem_va_starved = uvm_km_va_starved_p();
+/*XXXkd*/ if (kmem_va_starved) cnt_starved++;
mutex_spin_enter(&uvmpd_lock);
if ((uvm_pagedaemon_waiters == 0 || uvmexp.paging > 0) &&
@@ -311,6 +315,8 @@
needsfree = fpages + uvmexp.paging < uvmexp.freetarg;
needsscan = needsfree || uvmpdpol_needsscan_p();
+/*XXXkd*/ if (needsfree) cnt_needsfree++;
+/*XXXkd*/ if (needsscan) cnt_needsscan++;
/*
* scan if needed
*/
@@ -328,8 +334,18 @@
wakeup(&uvmexp.free);
uvm_pagedaemon_waiters = 0;
mutex_spin_exit(&uvmpd_lock);
+/*XXXkd*/ cnt_avail++;
}
+/*XXXkd*/ if (needsfree || kmem_va_starved) cnt_drain++;
+/*XXXkd*/ ts = time_second;
+/*XXXkd*/ if (ts > last_ts + 5 && cnt_loops > 5 * 10000) {
+/*XXXkd*/ printf("pagedaemon: loops=%ld, cnt_needsfree=%ld, cnt_needsscan=%ld, cnt_drain=%ld, cnt_starved=%ld, cnt_avail=%ld, fpages=%d\n",
+/*XXXkd*/ cnt_loops, cnt_needsfree, cnt_needsscan, cnt_drain, cnt_starved, cnt_avail, fpages);
+/*XXXkd*/ cnt_needsfree = cnt_needsscan = cnt_drain = cnt_starved = cnt_avail = cnt_loops = 0;
+/*XXXkd*/ last_ts = ts;
+/*XXXkd*/ }
+
/*
* scan done. if we don't need free memory, we're done.
*/
--------------100111F35787E6220A86651D--
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:19:57 +0200
Just a note: bug patch 1 affects the zfs module, patch 2 affects the
solaris module.
so only module builds are needed.
From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 09:27:50 -0700
On Thu, Aug 03, 2023 at 08:45:01AM +0000, kardel@netbsd.org wrote:
> Patch 1:
> let ZFS use a correct view on KVA memory:
> With this patch arc reclaim now detects memory shortage and
> frees pages. Also the ZFS KVA used by ZFS is limited to
> 75% KVA - could be made tunable
>
> Patch 1 is not sufficient though. arc reclaim thread kicks in at 75%
> correctly, but pages are not fully reclaimed and ZFS depletes its cache
> fully as the freed and now idle page are not reclaimed from the pools yet.
> pgdaemon will now not trigger pool_drain, as uvm_km_va_starved_p() returns false
> at this point.
this patch is not correct. it does not do the right thing when there
is plenty of KVA but a shortage of physical pages. the goal with
previous fixes for ZFS ARC memory management problems was to prevent
KVA shortages by making KVA big enough to map all of RAM, and thus
avoid the need to consider KVA because we would always run low on
physical pages before we would run low on KVA. but apparently in your
environment that is not working. maybe we do something differently in
a XEN kernel that we need to account for?
> To reclaim the pages freed directly we need
> Patch 2:
> force page reclaim
> that will perform the reclaim.
this second patch is fine.
-Chuck
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 20:22:10 +0200
Hi Chuck !
Thanks for looking into that.
I came up with the first patch due to pgdaemon looping due to
uvm_km_va_starved_p() being true.
vmstat -m shows that the pool statistics sum up close to the
32GB my DOM0 has.
counting the conditions when the pgdaemon is looping gives
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9789.2242179] pagedaemon: loops=16026699, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16026699, cnt_starved=16026699, cnt_avail=16026699, fpages=337385
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9795.2244437] pagedaemon: loops=16024007, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16024007, cnt_starved=16024007, cnt_avail=16024007, fpages=335307
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9801.2246381] pagedaemon: loops=16031141, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16031141, cnt_starved=16031141, cnt_avail=16031141, fpages=335331
bool
uvm_km_va_starved_p(void)
{
	vmem_size_t total;
	vmem_size_t free;

	if (kmem_arena == NULL)
		return false;

	total = vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE);
	free = vmem_size(kmem_arena, VMEM_FREE);

	return (free < (total / 10));
}

int
uvm_availmem(bool cached)
{
	int64_t fp;

	cpu_count_sync(cached);
	if ((fp = cpu_count_get(CPU_COUNT_FREEPAGES)) < 0) {
		/*
		 * XXXAD could briefly go negative because it's impossible
		 * to get a clean snapshot.  address this for other counters
		 * used as running totals before NetBSD 10 although less
		 * important for those.
		 */
		fp = 0;
	}
	return (int)fp;
}
So, while uvm_km_va_starved_p() considers almost all memory used up,
uvm_availmem(false) returns 337385 free pages (~1.28 GB), well above
uvmexp.freetarg.
So, why do we count so many free pages when the free vmem for kmem_arena
is less than 10% of the total kmem_arena?
Maybe the pool pages have been allocated but not yet referenced - I
didn't look that deep into the vmem/ZFS interaction.
I understand the reasoning why kmem size = physmem size should have worked.
There are still inconsistencies, though.
Even if uvm_availmem(false) accounted for all pages
allocated/reserved in the kmem_arena vmem, on the 32GB system the actual
free target is 2730 free pages (~10.7 MB).
10% of 32GB would be 3.2GB, which is a large multiple of the free pages
target.
So even then we would be stuck with a looping page daemon.
I think we need to find a better way of coping with the accounting
differences between vmem and uvm free pages. Looking at the vmem
statistics seemed logical to me, as ZFS allocates almost everything from
kmem_arena via pools.
I don't know what vmem does when there are fewer physical pages available
than the vmem allocation would require. This is the case you tried to
avoid.
So, looking at the vmem statistics seems to be consistent with the starved
flag logic - that is why it does not trigger the looping pgdaemon. What
isn't covered is the case of fewer physical pages than the pool allocation
requires.
I think we have yet to find a correct, robust solution that does not
trigger the pgdaemon almost infinite loop.
Frank
On 08/03/23 18:30, Chuck Silvers wrote:
> The following reply was made to PR kern/57558; it has been noted by GNATS.
>
> From: Chuck Silvers <chuq@chuq.com>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
> Date: Thu, 3 Aug 2023 09:27:50 -0700
>
> On Thu, Aug 03, 2023 at 08:45:01AM +0000, kardel@netbsd.org wrote:
> > Patch 1:
> > let ZFS use a correct view on KVA memory:
> > With this patch arc reclaim now detects memory shortage and
> > frees pages. Also the ZFS KVA used by ZFS is limited to
> > 75% KVA - could be made tunable
> >
> > Patch 1 is not sufficient though. arc reclaim thread kicks in at 75%
> > correctly, but pages are not fully reclaimed and ZFS depletes its cache
> > fully as the freed and now idle page are not reclaimed from the pools yet.
> > pgdaemon will now not trigger pool_drain, as uvm_km_va_starved_p() returns false
> > at this point.
>
> this patch is not correct. it does not do the right thing when there
> is plenty of KVA but a shortage of physical pages. the goal with
> previous fixes for ZFS ARC memory management problems was to prevent
> KVA shortages by making KVA big enough to map all of RAM, and thus
> avoid the need to consider KVA because we would always run low on
> physical pages before we would run low on KVA. but apparently in your
> environment that is not working. maybe we do something differently in
> a XEN kernel that we need to account for?
>
>
> > To reclaim the pages freed directly we need
> > Patch 2:
> > force page reclaim
> > that will perform the reclaim.
>
> this second patch is fine.
>
> -Chuck
>
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sat, 5 Aug 2023 13:24:55 -0000 (UTC)
chuq@chuq.com (Chuck Silvers) writes:
> this patch is not correct. it does not do the right thing when there
> is plenty of KVA but a shortage of physical pages.
I doubt that this is handled well enough outside of zfs.
The result of using ZFS (removing pkgsrc tree, unpacking it again) is
about 3.2GB of kernel pools used (from 8GB total RAM).
Trying to reduce ZFS again by tuning ZFS, shrinking maxvnodes and
then allocating user pages ends in the following (only pools >10G shown):
anonpl 0.078G
arc_buf_hdr_t_f 0.018G
buf16k 0.014G
dmu_buf_impl_t 0.107G
dnode_t 0.173G
kmem-00064 0.037G
kmem-00128 0.089G
kmem-00192 0.182G
kmem-00256 0.069G
kmem-00384 0.161G
kmem-01024 0.020G
kmem-02048 0.033G
mutex 0.020G
namecache 0.020G
pcglarge 0.042G
pcgnormal 0.107G
phpool-64 0.014G
rwlock 0.020G
sa_cache 0.032G
vcachepl 0.206G
zfs_znode_cache 0.072G
zio_buf_131072 0.015G
zio_buf_16384 0.075G
zio_buf_4096 0.019G
zio_buf_512 0.170G
zio_cache 0.161G
That's about 2GB left. Things like dnodes or the zio_cache are never
flushed, the 512 byte zio buffer pool is still huge because it is
totally fragmented, but also vcachepl is never drained.
ZFS tries to drop referenced metadata in Solaris or FreeBSD, but
for NetBSD that's still a nop.
The OpenZFS code in current FreeBSD looks quite different too in
that area.
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57558 CVS commit: src/external/cddl/osnet/sys/kern
Date: Sat, 9 Sep 2023 00:14:16 +0000
Module Name: src
Committed By: riastradh
Date: Sat Sep 9 00:14:16 UTC 2023
Modified Files:
src/external/cddl/osnet/sys/kern: kmem.c
Log Message:
solaris: Use pool_cache_reclaim, not pool_cache_invalidate.
pool_cache_invalidate invalidates cached objects, but doesn't return
any backing pages to the underlying page allocator.
pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
pages to the underlying page allocator, so it is actually useful for
the page daemon to do when trying to free memory.
PR kern/57558
XXX pullup-10
XXX pullup-9
XXX pullup-8 (by patch to kmem.h instead of kmem.c)
To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.4 src/external/cddl/osnet/sys/kern/kmem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57558 CVS commit: [netbsd-9] src/external/cddl/osnet/sys/kern
Date: Mon, 2 Oct 2023 13:31:14 +0000
Module Name: src
Committed By: martin
Date: Mon Oct 2 13:31:14 UTC 2023
Modified Files:
src/external/cddl/osnet/sys/kern [netbsd-9]: kmem.c
Log Message:
Pull up following revision(s) (requested by riastradh in ticket #1735):
external/cddl/osnet/sys/kern/kmem.c: revision 1.4
solaris: Use pool_cache_reclaim, not pool_cache_invalidate.
pool_cache_invalidate invalidates cached objects, but doesn't return
any backing pages to the underlying page allocator.
pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
pages to the underlying page allocator, so it is actually useful for
the page daemon to do when trying to free memory.
PR kern/57558
To generate a diff of this commit:
cvs rdiff -u -r1.2.2.1 -r1.2.2.2 src/external/cddl/osnet/sys/kern/kmem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/57558 CVS commit: [netbsd-10] src/external/cddl/osnet/sys/kern
Date: Mon, 2 Oct 2023 13:29:59 +0000
Module Name: src
Committed By: martin
Date: Mon Oct 2 13:29:59 UTC 2023
Modified Files:
src/external/cddl/osnet/sys/kern [netbsd-10]: kmem.c
Log Message:
Pull up following revision(s) (requested by riastradh in ticket #383):
external/cddl/osnet/sys/kern/kmem.c: revision 1.4
solaris: Use pool_cache_reclaim, not pool_cache_invalidate.
pool_cache_invalidate invalidates cached objects, but doesn't return
any backing pages to the underlying page allocator.
pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
pages to the underlying page allocator, so it is actually useful for
the page daemon to do when trying to free memory.
PR kern/57558
To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.3.6.1 src/external/cddl/osnet/sys/kern/kmem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Fri, 26 Apr 2024 14:19:37 -0000 (UTC)
kardel@netbsd.org (Frank Kardel) writes:
>Observed behavior:
>pagedaemon runs at 100% (already at 550 minutes CPU and counting)
I have local patches to avoid the spinning. As the page daemon keeps
data structures locked and runs at maximum priority, it prevents
other tasks from releasing resources. That helps only a little:
if the page daemon really cannot free anything, the system is still
locked up to some degree.
The main reason that the page daemon cannot free memory is
that vnodes are not drained. This keeps the associated pools
busy and the buffers allocated by the file cache.
Of course, if ZFS isn't throttled (and it would be less so, if
others make room), it would just expunge the rest of the system
data, so any improvement here just shifts the problem.
>Once uvm_availmem() falls below uvmexp.freetarg the pagedaemon unwedges
>from its tight loop as ZFS finally gives up its stranglehold on pool
>memory..
I locally added some of the arc tunables to experiment with the
free_target value. The calculation of arc_c_max in arc.c also
doesn't agree with the comments.
Later ZFS versions changed a lot in this area. Anything we
correct might need to be redone when we move to a newer ZFS
code base.
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Fri, 26 Apr 2024 17:34:12 +0200
see below.
On 04/26/24 16:20, Michael van Elst wrote:
> The following reply was made to PR kern/57558; it has been noted by GNATS.
>
> From: mlelstv@serpens.de (Michael van Elst)
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
> Date: Fri, 26 Apr 2024 14:19:37 -0000 (UTC)
>
> kardel@netbsd.org (Frank Kardel) writes:
>
> >Observed behavior:
> >pagedaemon runs at 100% (alreads at 550 minutes CPU and counting)
>
> I have local patches to avoid spinning. As the page daemon keeps
> data structures locked and runs at maximum priority, it prevents
> other tasks to release resources. That's only a little help,
> if the page daemon really cannot free anything, the system is still
> locked up to some degree.
In this situation it spins only because the KVA starvation condition is
met. That triggers the pooldrain thread, which in turn attempts to drain
the pools. The ZFS pools call into arc.c:hdr_recl(), which triggers the
arc_reclaim_thread. In this scenario arc_available_memory() returns a
positive value and thus the reclaim thread does not reclaim anything.
So the pagedaemon loops but nothing improves as long as
arc_available_memory() returns positive values. At this point the ZFS
statistics list large amounts of evictable data.
>
> The main reason, that the page daemon cannot free memory is
> that vnodes are not drained. This keeps the associated pools
> busy and buffers allocated by the file cache.
Well, this may not be the reason in this case - there is not even an attempt
made to let the reclaim thread evict data and drain pools. It doesn't
get that far.
>
> Of course, if ZFS isn't throttled (and it would be less so, if
> others make room), it would just expunge the rest of the system
> data, so any improvement here just shifts the problem.
Well ZFS currently likes to eat up all pool memory in this situation.
>
> >Once uvm_availmem() falls below uvmexp.freetarg the pagedaemon unwedges
> >from its tight loop as ZFS finally gives up its stranglehold on pool
> >memory..
>
> I locally added some of the arc tunables to experiment with the
> free_target value. The calculation of arc_c_max in arc.c also
> doesn't agree with the comments.
>
> Later ZFS versions did change a lot in this area. Anything we
> correct, might need to be redone, when we move to a newer ZFS
> code base.
>
Yes, but currently we seem to have a broken ZFS, at least in large-memory
environments. Effects I observed are:
- this bug: the famous looping pagedaemon
- PR kern/58198: ZFS can lead to UVM kills (no swap, out of swap)
- I am still trying to find out what causes the whole system to
slow down to a crawl, but at that point the system is too unusable to
gather any information.
So something needs to be done.
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 2 May 2024 10:49:17 +0200
This is a multi-part message in MIME format.
--------------AA25A9A6F0B7E95C7A6B2F0F
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
In order to prevent the pagedaemon looping and KVA space being
exhausted, ZFS must not only ensure that uvmexp.freetarg pages are free
but also respect the pagedaemon's 10%-free-KVA threshold.
Thus arc.c needs an additional check that determines needfree when the
free KVA space falls below 10%.
This was tested with the db load scenario on a Xen DOMU with a GENERIC
kernel and 360 GB memory. The system stayed responsive and used around
270-319 GB of pool space; ZFS cut back on pool memory when needed.
In another test, a runaway memory consumer caused the pool space in use
to shrink from 319 GB to 51 GB before hitting "out of swap space".
--------------AA25A9A6F0B7E95C7A6B2F0F
Content-Type: text/x-patch;
name="arc.c.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="arc.c.diff"
Index: arc.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c,v
retrieving revision 1.22
diff -u -r1.22 arc.c
--- arc.c 3 Aug 2022 01:53:06 -0000 1.22
+++ arc.c 2 May 2024 08:38:40 -0000
@@ -3903,6 +3903,31 @@
free_memory_reason_t r = FMR_UNKNOWN;
#ifdef _KERNEL
+#ifdef __NetBSD__
+ vmem_size_t totalpercent;
+ vmem_size_t free;
+
+ /*
+ * PR kern/57558:
+ *
+ * do not let pdaemon get stuck in the uvm_km_va_starved_p()
+ * state. it starts a tight loop when in uvm_km_va_starved state
+ * and ZFS is not freeing any pool pages as it started freeing
+ * only when falling below uvmexp.freetarg.
+ * now we start freeing when falling below 10% kva free or
+ * uvmexp.freetarg.
+ * the 10% magic is shamelessly copied from uvm_km_va_starved_p()
+ * The interface to the pagedaemon has room for improvement.
+ */
+
+ totalpercent = vmem_size(heap_arena, VMEM_ALLOC|VMEM_FREE) / 10;
+ free = vmem_size(heap_arena, VMEM_FREE);
+
+ if (free < totalpercent) {
+ needfree = btop(totalpercent - free);
+ }
+#endif
+
if (needfree > 0) {
n = PAGESIZE * (-needfree);
if (n < lowest) {
--------------AA25A9A6F0B7E95C7A6B2F0F--
From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org, kardel@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sat, 4 May 2024 16:07:27 -0700
On Thu, May 02, 2024 at 08:50:02AM +0000, Frank Kardel wrote:
> Thus arc.c needs an additional check to determine needfree if the kva
> free space falls below 10%.
The intention is that 64-bit kernels should configure enough KVA
to be able to map all of physical memory as kmem without running out of KVA.
I changed the general code to work that way a while back, do we do something
different on xen?
-Chuck
From: Frank Kardel <kardel@netbsd.org>
To: Chuck Silvers <chuq@chuq.com>, gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 08:33:03 +0200
My comments to your remark below - a longer (re-)explanation follows
after that.
On 05/05/24 01:07, Chuck Silvers wrote:
> On Thu, May 02, 2024 at 08:50:02AM +0000, Frank Kardel wrote:
> >> Thus arc.c needs an additional check to determine needfree if the kva
> >> free space falls below 10%.
> The intention is that 64-bit kernels should configure enough KVA
> to be able to map all of physical memory as kmem without running out of KVA.
We are not really running out of KVA; that is not violated. We are
dropping below 10% free KVA while ZFS keeps allocating KVA, and other
pools give up KVA which then gets allocated to ZFS, UNTIL we fall below
uvmexp.freetarg.
> I changed the general code to work that way a while back, do we do something
> different on xen?
The difference is large memory - see below for the reasoning (which can
be supported via measurements/traces)
> -Chuck
TLDR section at the end.
The issue is not having too little KVA configured. The issue is that on
large-memory systems (not XEN-specific) the pagedaemon attempts to
always keep at least 10% of KVA free. See
uvm_km.c:uvm_km_va_starved_p(void) and
uvm_pdaemon.c:uvm_pageout(void *arg).
With less than 10% KVA free, the local kmem_va_starved variable is true.
This leads to skipping the UVM_UNLOCK_AND_WAIT(). Further on, usually
no scan is done, as
- needsfree is false because there is enough free memory
(uvmexp.freetarg is 4096 in this case)
- needsscan is also false because uvmpdpol_needsscan_p() does not return
true at that time.
But even if we scanned, it wouldn't help: the target of the scan is to
get around uvmexp.freetarg free pages, and it would not drain any pools
where ZFS is hogging memory.
Further down, the pooldrain thread is kicked, since needsfree and
needsscan are false (they are mainly tied to uvmexp.freetarg).
Following the pooldrain path, an attempt is made to reclaim idle pages
from the pools. At this point most pools will give up idle pages, but
ZFS holds onto them. This is because ZFS determines that we are not
below uvmexp.freetarg and thus does not kick the arc_reclaim thread to
give up pages. So all the pooldrain thread accomplishes is that (most)
other pools give up their idle pages while ZFS holds onto its pool
allocations.
While the pooldrain thread may dig up some more free pages, ZFS keeps
allocating pool (KVA) memory as long as we are not below
uvmexp.freetarg. While this goes on, more and more pools get reduced
whenever possible (because some of their pages were currently free).
The effects are the system becoming very slow to respond, network
buffers failing to be allocated, dropped network connections, and more.
This relaxes a bit once free memory falls below uvmexp.freetarg, as at
that point ZFS starts giving up pool memory. By then we are far below
the 10% KVA starvation limit.
While below the 10% limit but above uvmexp.freetarg, the pagedaemon
happily spins while ZFS keeps allocating more and more KVA.
So the problem is not a lack of available KVA. It is that ZFS keeps
allocating KVA until we fall below uvmexp.freetarg. On larger-memory
systems the gap between uvmexp.freetarg and 10% of KVA grows, and the
problem becomes critical.
Given the current mechanics, the pool memory for all non-ZFS pools is
initially effectively limited to uvmexp.freetarg pages, which is not
enough for reliable system operation.
It is not a XEN issue.
TLDR:
- the pagedaemon aggressively starts pool draining once free KVA falls below 10%
- ZFS won't free pool pages until free memory falls below uvmexp.freetarg
- there is a huge gap between uvmexp.freetarg and 10% of KVA free,
  and it grows with memory size
- while below 10% KVA free, ZFS eventually depletes all other pools that
  cooperatively give up pages, causing all sorts of shortages in other
  areas (visible in e.g. network buffers)
Mitigation: let ZFS detect free KVA falling below 10% and start
reclaiming memory.
It is not related to XEN at all; ZFS plus large memory is sufficient
for the problems to occur.
The base issue is the big difference between the 10% free-KVA limit and
uvmexp.freetarg.
I seem to be explaining this mechanism over and over again, and so far
no one has verified the analysis.
-Frank
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 06:45:29 -0000 (UTC)
kardel@netbsd.org (Frank Kardel) writes:
>It is not related to XEN at all. Just ZFS + large memory is sufficient
>for the problems to occur.
>Base issue is the big difference between 10% free KVA memory limit and
>uvmexp.freetarg.
I've added some sysctls that FreeBSD supports and also changed
how some defaults are computed.
% vmstat -s | grep target
10922 target free pages
% sysctl vfs.zfs_arc
vfs.zfs_arc.meta_limit = 0
vfs.zfs_arc.meta_min = 0
vfs.zfs_arc.shrink_shift = 0
vfs.zfs_arc.max = 128770969600
vfs.zfs_arc.min = 16096371200
vfs.zfs_arc.compressed = 1
vfs.zfs_arc.free_target = 10922 <----------
I just have no reliable test case to verify that it changes
the behaviour.
I have uploaded my current patch (still with some debug printfs).
http://cdn.netbsd.org/pub/NetBSD/misc/mlelstv/zfs.diff
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 09:47:51 +0200
Good to have those knobs.
In your example you have 10922 target free pages, which is ~45 MB. On a
system with more than 450 MB of memory the problem would occur again.
On our system i see (vmstat -s)
4096 bytes per page
16 page colors
86872544 pages managed
4266905 pages free
18512074 pages active
4929394 pages inactive
0 pages paging
4429 pages wired
1 reserve pagedaemon pages
60 reserve kernel pages
2714934 boot kernel pages
58193232 kernel pool pages
6295559 anonymous pages
17136740 cached file pages
13700 cached executable pages
3072 minimum free pages
4096 target free pages
28957514 maximum wired pages
1 swap devices
1048575 swap pages
201179 swap pages in use
2458785 swap allocations
Thus:
356 GB managed memory
17 GB free
26 GB anonymous memory
70 GB file pages
16 MB free target
238 GB allocated to pools
This situation is with the 10% correction in ZFS and has survived
(without stalls or allocation failures) creating a 1.6 TB database in
ZFS plus two parallel, multi-hour, production-like runs on that
database. Without the fix the system would have stalled during the load.
Your setup would be safe with the current implementation only if
zfs_arc.free_target (= uvmexp.freetarg) were at 10% of KVA memory.
This is usually not the case.
I don't know how much memory your system has, and I did not check how
uvmexp.freetarg is calculated at startup or adjusted thereafter. Fact
is that even 16 MB seems sufficient on large-memory systems if ZFS is
kept from allocating pool memory beyond 90%. ZFS must start reclaiming
when less than 10% of KVA is free, because of the pagedaemon logic.
I see your patch also contains my proposed fix. So when testing this
patch we should be safe from the KVA starvation issue, and changing
zfs_arc_free_target would only have an additional effect when it is set
higher than 10% of KVA memory. Being able to set this value has the
benefit that we can limit ZFS pool usage even more. Maybe we should
provide a way to specify a percentage of KVA or an absolute allocation
value.
-Frank
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 10:03:38 +0200
On 05/05/24 08:50, Michael van Elst wrote:
>
> I just have no reliable test case to verify that it changes
> the behaviour.
>
>
As for testing you could use the D-Script
sdt:::arc-available_memory
{
printf("mem = %d, reason = %d", arg0, arg1);
}
mem is positive when arc_available_memory() thinks there is memory
(which prevents the reclaim thread from reclaiming even when
triggered).
mem is negative when memory should be freed, the value giving the
(negative) amount to be freed.
reason is the cause of the decision:
typedef enum free_memory_reason_t {
FMR_UNKNOWN,
FMR_NEEDFREE,
FMR_LOTSFREE,
FMR_SWAPFS_MINFREE,
FMR_PAGES_PP_MAXIMUM,
FMR_HEAP_ARENA,
FMR_ZIO_ARENA,
FMR_ZIO_FRAG,
} free_memory_reason_t;
A test case would be any large ZFS write/read operation.
If you see 1 (FMR_NEEDFREE) as the cause, it is the starvation-guard fix.
-Frank
From: matthew green <mrg@eterna23.net>
To: Frank Kardel <kardel@netbsd.org>
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
Chuck Silvers <chuq@chuq.com>, gnats-bugs@netbsd.org
Subject: re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 05 May 2024 18:13:40 +1000
just to clear up what i think is a confusion.
> My comments to your remark below - a longer (re-)explanation follows
> after that.
>
> On 05/05/24 01:07, Chuck Silvers wrote:
> > On Thu, May 02, 2024 at 08:50:02AM +0000, Frank Kardel wrote:
> >> Thus arc.c needs an additional check to determine needfree if the kva
> >> free space falls below 10%.
> > The intention is that 64-bit kernels should configure enough KVA
> > to be able to map all of physical memory as kmem without running out of KVA.
> We are not really running out of KVA, this is not violated. We are
> running into less than 10% free KVA available
> and ZFS still allocating KVA while other pools give up KVA which gets
> allocated to ZFS UNTIL
> we fall below uvmexp.freetarg.
freetarg has nothing to do with KVA. that's about free memory
(actual physical pages). KVA is a space that gets allocated out
of, and sometimes that space has real pages backing it but not
always (and sometimes the same page may be mapped more than once.)
on 64-bit platforms, KVA is generally *huge*.
KVA starvation shouldn't happen -- we have many terabytes
available for the KVA on amd64, and pool reclaim only happens
when we run low on free pages. KVA on amd64 already is large
enough to map all of physical memory in the 'direct map' region,
as well as other places as needed.
(it sounds like zfs needs to be able to reclaim pages like other
consumers?)
.mrg.
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 10:03:46 -0000 (UTC)
kardel@netbsd.org (Frank Kardel) writes:
> benefit that we can limit ZFS pool usage even more. Maybe we should provide
> a way to specify %KVA or an absolute allocation value.
There are lots of changes in that area in more recent ZFS versions.
I don't think we should create too sophisticated changes if a later
update will make them obsolete.
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 12:50:31 +0200
On 05/05/24 10:13, matthew green wrote:
> just to clear up what i think is a confusion.
>
> freetarg has nothing to do with KVA.
Correct - that is why we are running into the issue as ZFS currently
looks only on freetarg.
> that's about free memory
> (actual physical pages). KVA is a space that gets allocated out
> of, and sometimes that space has real pages backing it but not
> always (and sometimes the same page may be mapped more the once.)
> on 64-bit platforms, KVA is generally *huge*.
Yes.
>
> KVA starvation shouldn't happen -- we have many terabytes
> available for the KVA on amd64, and pool reclaim only happens
> when we run low on free pages.
Well, according to the code, starvation happens, as in uvm/uvm_km.c
bool
uvm_km_va_starved_p(void)
{
	vmem_size_t total;
	vmem_size_t free;

	if (kmem_arena == NULL)
		return false;

	total = vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE);
	free = vmem_size(kmem_arena, VMEM_FREE);

	return (free < (total / 10));
}
returns true. It may not be the starvation you have in mind (out of
vmem address space), but as it returns true it affects the uvm_pageout
pagedaemon process.
The difference in semantics here may be between the available KVA
address space being HUGE and the presumably much smaller
vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE) value, which is the basis
for the uvm_km_va_starved_p() predicate.
> KVA on amd64 already is large
> enough to map all of physical memory in the 'direct map' region,
> as well as other places as needed.
>
> (it sounds like zfs needs to be able to reclaim pages like other
> consumers?)
Yes, many other consumers (maybe even all non-ZFS consumers) give up
idle pages (and maybe even more) when asked to.
ZFS pool memory is currently only reclaimed when we fall below
uvmexp.freetarg; starvation is signaled long before that.
I think the ZFS reclaim strategy is not in line with the general
pool-reclaim expectations. It is also not synchronous: when arc_reclaim
is triggered, only a thread is started that evicts pages. With swap
space available this is not overly critical.
> .mrg.
With this issue we need to look past design ideas and known invariants,
look at the implementation, and find where the design ideas and
invariants do not match the implementation.
-Frank
From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
kardel@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 05 May 2024 07:37:20 -0400
Frank Kardel <kardel@netbsd.org> writes:
[snip]
> TLDR:
> - pagedaemon aggressively starts pool draining once KVA free falls below 10%
> - ZFS won't free pool pages until free memory falls below uvmexp.freetarg.
> - there is a huge gap between uvmexp.freetarg and 10% KVA free
> increasing with larger memory(10%)
> - while below 10% KVA free ZFS eventually depletes all other pools that
> are cooperatively giving up pages
> causing all sorts of shortages in other areas (visible in e.g.
> network buffers)
This is a pretty good description of a problem I am/was seeing with the
daily cron checking for core files. On a DOMU with not a lot of memory,
12GB - 16GB and a WHOLE lot of ZFS filesets, this job would never
complete and the guest would appear to lock up (actually it may be any
job that did "find" that crossed into a ZFS fileset). To work around it
I ended up commenting out the daily job. The guest is my build system
for the OS and it would also start to bog down and would eventually hang
up after a few OS builds, but that was a more manageable situation.
With the simple kardel patch that was provided, the daily job could run
to completion and the system appears to be responsive after a couple of
days. I have not had time to run builds to see how that affects the
matter. The guest has 2 vcpus and I sometimes would abuse it pretty
hard by running 3 builds with -j2 on the build.sh line at the same time.
Very often the system would hang up at some point if I did this and I
had to back off and only run 1 or 2 at the same time.
> Mitigation: allow ZFS to detect free KVA memory falling below 10% to
> start reclaiming memory.
>
> It is not related to XEN at all. Just ZFS + large memory is sufficient
> for the problems to occur.
> Base issue is the big difference between 10% free KVA memory limit and
> uvmexp.freetarg.
I am not sure that "large memory" needs to be all that large to prompt
the problem. The description of what happens when ZFS gobbles
everything up is pretty close to what I am seeing...
> I seem to explain the mechanism over and over again. And so far no one
> has verified this analysis.
>
> -Frank
>
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 14:10:50 +0200
On 05/05/24 13:40, Brad Spencer wrote:
> The following reply was made to PR kern/57558; it has been noted by GNATS.
>
> From: Brad Spencer <brad@anduin.eldar.org>
> To: gnats-bugs@netbsd.org
> Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
> kardel@netbsd.org
> Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
> Date: Sun, 05 May 2024 07:37:20 -0400
>
> Frank Kardel <kardel@netbsd.org> writes:
>
> [snip]
>
> > TLDR:
> > - pagedaemon aggressively starts pool draining once KVA free falls below 10%
> > - ZFS won't free pool pages until free memory falls below uvmexp.freetarg.
> > - there is a huge gap between uvmexp.freetarg and 10% KVA free
> > increasing with larger memory(10%)
> > - while below 10% KVA free ZFS eventually depletes all other pools that
> > are cooperatively giving up pages
> > causing all sorts of shortages in other areas (visible in e.g.
> > network buffers)
>
> This is a pretty good description of a problem I am/was seeing with the
> daily cron checking for core files. On a DOMU with not a lot of memory,
> 12GB - 16GB and a WHOLE lot of ZFS filesets, this job would never
> complete and the guest would appear to lock up (actually it may be any
> job that did "find" that crossed into a ZFS fileset). To work around it
> I ended up commenting out the daily job. The guest is my build system
> for the OS and it would also start to bog down and would eventually hang
> up after a few OS builds, but that was a more manageable situation.
[snip]
>
> I am not sure that "large memory" needs to be all that large to prompt
> the problem. The description of what happens when ZFS gobbles
> everything up is pretty close to what I am seeing...
>
Thanks for your observation. Actually, "large memory" could be defined
more precisely as any system where
vmem_size(kernel_arena, VMEM_ALLOC|VMEM_FREE) / 10, in pages, is
significantly larger than uvmexp.freetarg.
As you have observed, this can already happen on smaller systems.
-Frank
From: Brad Spencer <brad@anduin.eldar.org>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 05 May 2024 15:56:13 -0400
Frank Kardel <kardel@netbsd.org> writes:
> Thanks for your observation. Actually "large memory" could be seen more
> like where
> vmem_size(kernel_arena, VMEM_ALLOC|VMEM_FREE) / 10 in pages being
> significantly larger than uvmexp.freetarg.
> As you have observed this can already happen on smaller systems.
>
> -Frank
Sure...
I was able to perform the abusive build operation and was able to make
the system fall over. The abuse is the following:
Have a 10.0 PVH guest with 16GB and 2vcpus. Run the following builds at
the same time:
build.sh -j2 <- for amd64
build.sh -j2 <- for i386
build.sh -j2 <- for earmv7hf
The source tree is in a ZFS fileset and is used by all of the builds.
The artifacts (obj, dist, release, etc.) are all in their own ZFS
filesets for each of the arch types (that is, /artifacts/amd64 would be
its own ZFS filesystem and contain object, release and dist
subdirectories, using the -O, -R and -D flags to build.sh to point to
/artifacts/amd64/OBJ and so on. There would also be a /artifacts/i386
and /artifacts/earmv7hf, which are also their own filesets).
Everything will be humming along just fine until the earmv7hf build
nears the end and runs /usr/src/distrib/utils/embedded/mkimage, which
does "dd bs=1 count=4456448 if=/dev/zero" ... that dd will run with
high CPU for a little bit and then cause all active reads and writes
going on in the other builds, and in itself, to more or less deadlock.
CPU utilization will fall to zero and disk utilization on the zpool
will fall to zero. The system will be responsive, but if you try
touching any of the files being used, the command (ls, or whatever)
will hang up.
As far as I can tell what was going on in the system was two objcopy and
two rm along with the dd. One objcopy was stuck in tstile and the other
in &zilog. The dd was stuck in &tx->t and both rm were stuck in &zio->
... all according to top.
I can almost reproduce this on demand, as long as the amd64 and i386
builds are actually building something and the earmv7hf build hits the
mkimage call at the same time. A clean build of all three will probably
provoke it and update builds (-u flag to build.sh) may as well.
This is all probably unrelated to the patch that was provided and the
problem being reported. The patch does appear to make the situation
better.
Might want to consider switching the arguments to that "dd" to be "dd
count=1 bs=4456448 if=/dev/zero", i.e. write one block of 4456448 bytes
instead of 4456448 one-byte blocks. Might be less stressful.
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 22:21:09 +0200
Nice data point. It might not be related.
Could you check the stack traces of the hung processes with either
crash or DDB, to verify that we are not subject to a resource shortage
but rather to a locking issue?
Frank
From: Chuck Silvers <chuq@chuq.com>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 23:05:35 -0700
--4Yo0DCM283VB5+BQ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Sun, May 05, 2024 at 08:33:03AM +0200, Frank Kardel wrote:
> The issue is not having enough KVA configured The issue is that with
> large (not XEN specific) memory systems the page daemon attempts to keep
> always at least 10% KVA free. See uvm_km.c:uvm_km_va_starved_p(void) and
> uvm_pdaemon.c:uvm_pageout(void *arg).
ah yes, that is the problem. this is really a mismatch between how much kmem space
is allocated vs. how much kmem space is allowed to be used before the pagedaemon
tries to reclaim some kmem space. ZFS is just a victim in this because it happens
to use a lot of kmem space.
I think the right fix for this is to increase the amount of kmem space that we
allocate such that all of physical memory can be allocated as kmem without
the pagedaemon considering the system to be starved for kmem virtual space.
this means allocating enough kmem space for 10/9 of physical memory,
so that even though only 9/10 of kmem virtual space can be used, there is still
enough kmem virtual space available for all of physical memory.
please try the attached patch.
-Chuck
--4Yo0DCM283VB5+BQ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="diff.nkmempages-10-9ths.1"
Index: src/sys/uvm/uvm_km.c
===================================================================
RCS file: /home/chs/netbsd/cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.165
diff -u -p -r1.165 uvm_km.c
--- src/sys/uvm/uvm_km.c 9 Apr 2023 09:00:56 -0000 1.165
+++ src/sys/uvm/uvm_km.c 5 May 2024 17:02:49 -0000
@@ -227,7 +227,14 @@ kmeminit_nkmempages(void)
}
#if defined(NKMEMPAGES_MAX_UNLIMITED) && !defined(KMSAN)
- npages = physmem;
+ /*
+ * The extra 1/9 here is to account for uvm_km_va_starved_p()
+ * wanting to keep 10% of kmem virtual space free.
+ * The intent is that on "unlimited" platforms we should be able
+ * to allocate all of physical memory as kmem without running short
+ * of kmem virtual space.
+ */
+ npages = (physmem * 10) / 9;
#else
#if defined(KMSAN)
--4Yo0DCM283VB5+BQ--
From: Frank Kardel <kardel@netbsd.org>
To: Chuck Silvers <chuq@chuq.com>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 9 May 2024 19:26:01 +0200
Thanks for that simpler fix to avoid the starvation state. Tests have
shown that it works in the test environment.
I am wondering, though, whether something needs to be done for the other
cases.
-Frank
On 05/06/24 08:05, Chuck Silvers wrote:
> [snip]
> I think the right fix for this is to increase the amount of kmem space that we
> allocate such that all of physical memory can be allocated as kmem without
> the pagedaemon considering the system to be starved for kmem virtual space.
> this means allocating enough kmem space for 10/9 of physical memory,
> so that even though only 9/10 of kmem virtual space can be used, there is still
> enough kmem virtual space available for all of physical memory.
>
> please try the attached patch.
>
> -Chuck
From: Chuck Silvers <chuq@chuq.com>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 9 May 2024 10:48:24 -0700
great, thanks for testing.
which other cases are you talking about?
-Chuck
On Thu, May 09, 2024 at 07:26:01PM +0200, Frank Kardel wrote:
> Thanks for that simpler fix to avoid the starvation state. Tests have shown
> that it works in the test environment.
>
> I am wondering, though, whether something needs to be done for the other
> cases.
>
> -Frank
>
>
> On 05/06/24 08:05, Chuck Silvers wrote:
> > [snip]
> > I think the right fix for this is to increase the amount of kmem space that we
> > allocate such that all of physical memory can be allocated as kmem without
> > the pagedaemon considering the system to be starved for kmem virtual space.
> > this means allocating enough kmem space for 10/9 of physical memory,
> > so that even though only 9/10 of kmem virtual space can be used, there is still
> > enough kmem virtual space available for all of physical memory.
> >
> > please try the attached patch.
> >
> > -Chuck
From: Frank Kardel <kardel@netbsd.org>
To: Chuck Silvers <chuq@chuq.com>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 9 May 2024 21:06:43 +0200
On 05/09/24 19:48, Chuck Silvers wrote:
> great, thanks for testing.
>
> which other cases are you talking about?
>
> -Chuck
The cases in the KVA nkmempages determination.
#if defined(NKMEMPAGES_MAX_UNLIMITED) && !defined(KMSAN)
	/*
	 * The extra 1/9 here is to account for uvm_km_va_starved_p()
	 * wanting to keep 10% of kmem virtual space free.
	 * The intent is that on "unlimited" platforms we should be able
	 * to allocate all of physical memory as kmem without running short
	 * of kmem virtual space.
	 */
	npages = (physmem * 10) / 9;
#else
#if defined(KMSAN)
	npages = (physmem / 4);
#elif defined(PMAP_MAP_POOLPAGE)
	npages = (physmem / 4);
#else
	npages = (physmem / 3) * 2;
#endif /* defined(PMAP_MAP_POOLPAGE) */
#if !defined(NKMEMPAGES_MAX_UNLIMITED)
	if (npages > NKMEMPAGES_MAX)
		npages = NKMEMPAGES_MAX;
#endif
#endif

	if (npages < NKMEMPAGES_MIN)
		npages = NKMEMPAGES_MIN;

	nkmempages = npages;
The defined(NKMEMPAGES_MAX_UNLIMITED) && !defined(KMSAN) case is
running fine now. I wonder whether we could still hit the starvation
scenario, where ZFS does not reclaim memory in time, in the other
cases, since there nkmempages is set to a value less than physmem.
Just wondering.
-Frank
>Unformatted: