NetBSD Problem Report #57558

From Frank.Kardel@Acrys.com  Thu Aug  3 08:44:21 2023
Return-Path: <Frank.Kardel@Acrys.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 1BB351A9238
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  3 Aug 2023 08:44:21 +0000 (UTC)
Message-Id: <20230803084410.0E6E16019@gaia.acrys.com>
Date: Thu,  3 Aug 2023 10:44:10 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: pgdaemon 100% busy - no scanning (ZFS case)
X-Send-Pr-Version: 3.95

>Number:         57558
>Category:       kern
>Synopsis:       pgdaemon 100% busy - no scanning (ZFS case)
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Aug 03 08:45:00 +0000 2023
>Last-Modified:  Thu May 09 19:10:01 +0000 2024
>Originator:     Frank Kardel
>Release:        NetBSD 10.0_BETA / current
>Organization:

>Environment:


System: NetBSD Marmolata 10.0_BETA NetBSD 10.0_BETA (XEN3_DOM0) #1: Thu Jul 27 18:30:30 CEST 2023 kardel@gaia:/src/NetBSD/n10/src/obj.amd64/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
	It has been observed that pgdaemon can get into a tight loop
	consuming 100% cpu, not yielding and blocking other RUNable
	threads on the cpu. (PRs kern/56516 (for effects - may have other cause), kern/55707)
	This analysis and proposed fix relate to the pgdaemon loop caused by KVA exhaustion
	by ZFS.

	Observed and analyzed in the following environment (should be reproducible in simpler
	environments):
		XEN3_DOM0 providing vnd devices based on files on ZFS.
		GENERIC(pvh) DOMU using an ffs filesystem based on the vnd in XEN3_DOM0.

	Observed actions/effects:
		1) running a database on the ffs file system in the DOMU
		2) load a larger database
		3) XEN3_DOM0 is fine until ZFS has allocated 90% of KVA;
		   at this point pgdaemon kicks in and enters a tight loop.
		4) pgdaemon does not do page scans (enough memory is available)
		5) pgdaemon loops as uvm_km_va_starved_p() returns true
		6) pool_drain is unable to reclaim any idle pages from the pools
		7) uvm_km_va_starved_p() thus keeps returning true - pgdaemon keeps looping

	Analyzed causes:
		- pool_drain causes upcalls to ZFS reclaim logic
		- ZFS reclaim logic does not reclaim anything, as the current code
		  looks at uvm_availmem(false) and that reports 'plenty' of memory;
		  thus no attempt is made to free memory on the ZFS side and no KVA is reclaimed.

	Conclusion:
		- using uvm_availmem(false) for ZFS memory throttling is wrong, as
		  ZFS memory is allocated from kmem KVA pools.
		- the ZFS arc must use KVA memory for its memory checks

>How-To-Repeat:
	Run a DB load into an FFS on a vnd backed by a file on ZFS.
>Fix:
	Patch 1:
		let ZFS use a correct view of KVA memory:
		With this patch arc reclaim now detects memory shortage and
		frees pages. The KVA used by ZFS is also limited to
		75% of KVA - could be made tunable.

	Patch 1 is not sufficient on its own, though. The arc reclaim thread kicks in at 75%
	correctly, but pages are not fully reclaimed and ZFS depletes its cache
	fully, as the freed and now idle pages are not yet reclaimed from the pools.
	pgdaemon will not trigger pool_drain, as uvm_km_va_starved_p() returns false
	at this point.

	To reclaim the freed pages directly we need
	Patch 2:
		force page reclaim
	that will perform the reclaim.

	With both fixes the arc reclaim thread kicks in at 75% KVA usage and
	reclaims only enough memory not to exceed 75% KVA.

	Any comments?

	OK to commit? (happens automatically on no feedback)

Index: external/cddl/osnet/dist/uts/common/fs/zfs/arc.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c,v
retrieving revision 1.22
diff -c -u -r1.22 arc.c
--- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	3 Aug 2022 01:53:06 -0000	1.22
+++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	3 Aug 2023 08:19:11 -0000
@@ -276,6 +276,7 @@
 #endif /* illumos */

 #ifdef __NetBSD__
+#include <sys/vmem.h>
 #include <uvm/uvm.h>
 #ifndef btop
 #define	btop(x)		((x) / PAGE_SIZE)
@@ -285,9 +286,9 @@
 #endif
 //#define	needfree	(uvm_availmem() < uvmexp.freetarg ? uvmexp.freetarg : 0)
 #define	buf_init	arc_buf_init
-#define	freemem		uvm_availmem(false)
+#define	freemem		btop(vmem_size(kmem_arena, VMEM_FREE))
 #define	minfree		uvmexp.freemin
-#define	desfree		uvmexp.freetarg
+#define	desfree		(btop(vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE)) / 4)
 #define	zfs_arc_free_target desfree
 #define	lotsfree	(desfree * 2)
 #define	availrmem	desfree


	Patch 2:
		force reclaiming of pages on affected pools

Index: external/cddl/osnet/sys/kern/kmem.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/sys/kern/kmem.c,v
retrieving revision 1.3
diff -c -u -r1.3 kmem.c
--- external/cddl/osnet/sys/kern/kmem.c	11 Nov 2020 03:31:04 -0000	1.3
+++ external/cddl/osnet/sys/kern/kmem.c	3 Aug 2023 08:19:11 -0000
@@ -124,6 +124,7 @@
 {

 	pool_cache_invalidate(km->km_pool);
+	pool_cache_reclaim(km->km_pool);
 }

 #undef kmem_alloc

>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: kardel@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 10:23:19 +0000

 Cool, thanks for looking into this!  I was planning to investigate at
 some point soon, starting by adding dtrace probes (and maybe wiring up
 the sysctl knobs) so we can reproduce the analysis of the issue in the
 field.  Your analysis sounds plausible, but I'd like to make sure we
 have the visibility to verify the behaviour -- and the change in
 behaviour -- first!

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 13:06:33 +0200

 Hi Taylor,

 I used the existing DTRACE knobs in arc.c (adding one for available
 memory, though fbt:return on arc_available_memory would suffice) and
 counter-based event debugging (additional code) to track the tight loop
 in uvm_pdaemon.c.

 A sysctl for % KVA would be useful. The existing fbt and sdt probes
 already help a lot to track the pattern.

 In my setup (soon to be used for actual work) the loops could be
 reproduced.

 With patch 1 the pgdaemon loops went away; KVA used for ZFS was limited
 to 75%, but the cache was depleted because of missing patch 2. And patch 2
 is needed as there is no chance the pgdaemon will trigger a pool_drain
 unless we reach kva_starvation on a non-ZFS path.

 So what would the next steps be?

 Frank




From: Taylor R Campbell <riastradh@NetBSD.org>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 12:22:58 +0000

 Can you share the dtrace scripts you used for reference and how you
 set up the experiment?

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:11:51 +0200


 Sure

 Setup:
      - all userlans NetBSD-10.0_BETA
      - NetBSD 10.0_BETA (2023-07-26) (-current should al work) XEN3_DOM0 
 (pagedaemon patched see pd.diff attachment)
      - xen-4.15.1
      - NetBSD 10.0_BETA GENERIC as DOMU
      - on DOM0 zfs file system providing a file for the FFS file system 
 in the DOMU
      - DOMU has a posgresql 14.8 installation
      - testcase is load a significant database (~200 Gb) into the 
 postgres DB.

 This seems complicated to set up (but I am preparing this kind of VM for
 our purposes).
 Going by the errors detected it should all be possible (not tested) with:
      - create a ZFS file system on a plain GENERIC system
      - create a file system file in ZFS
      - vnconfig vndX <path to the file system file>
      - disklabel vndX
      - newfs vndXa
      - mount /dev/vndXa /mnt
      - do lots of fs traffic writing, deleting, rewriting on the mounted fs

 Part 1 - current situation:

 Use
 sdt:::arc-available_memory
 {
          printf("mem = %d, reason = %d", arg0, arg1);
 }

 to track what ZFS thinks it has as memory - positive values mean enough
 memory is there, negative values ask the ZFS ARC to free that much memory.

 Use vmstat -m to track pool usage - you should see that ZFS takes
 more and more memory until 90% of kmem is used in the pools.
 At that point you should see a ~100% busy pgdaemon in top, and
 the pagedaemon patch should log high counts for loops, cnt_starved and
 cnt_avail, as uvm_availmem(false) still reports many free pages.

 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9813.2250709] pagedaemon: loops=16023729, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16023729, cnt_starved=16023729, cnt_avail=16023729, fpages=336349
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9819.2252810] pagedaemon: loops=16018349, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16018349, cnt_starved=16018349, cnt_avail=16018349, fpages=336542
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9825.2255258] pagedaemon: loops=16025793, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16025793, cnt_starved=16025793, cnt_avail=16025793, fpages=336516
 ...

 That documents the tight loop with no progress. The pgdaemon will not
 recover - see my analysis.
 Observe that arc_reclaim is not freeing anything (and collects no cpu
 time, see top) because arc_available_memory claims that there is enough
 free memory (it looks at uvm_availmem(false)).
 The dtrace probe documents that.

 Part 2 - get the arc_reclaim thread to actually be triggered before kmem
 is starving.
 Install Patch 1 from the bug report. This lets ZFS look at the
 kmem_arena space situation, which is also what
 uvm_km.c:uvm_km_va_starved_p(void) looks at.
 Now ZFS has a chance to start reclaiming memory.
 Run the load test again.
 The dtrace probe should now show decreasing memory until it gets
 negative. And it will stay negative by a certain amount.
 vmstat -m should show that ZFS now only hogs ~75% of kmem.
 There should also be a significant count in the Idle page column, as the
 arc_reclaim thread did give up memory.
 As the idle pages are not yet reclaimed from the pools, ZFS is asked to
 always free memory (dtrace probe) and vmstat -m will
 show the non-zero Idle page counts. Thus ZFS now has 75% of kmem
 allocated but utilizes only a small part. The cache
 is allocated but not used anymore.

 We need to get the Idle pages actually reclaimed from the pools. This is
 done by Patch 2 from the bug.
 There is no way to pass this task to the pgdaemon, which looks only at
 uvm_availmem(false) and does not consider kmem unless starving. Also,
 the pool drain thread drains one pool at a time per invocation, and that
 is not even triggered.
 So Patch 2 directly reclaims from the pool_cache_invalidate()ed pool.

 With this strategy ZFS keeps the kmem usage around 75%, as Idle pages
 are now reclaimed and ZFS only gets negative arc_available_memory
 values when called for.
 vmstat will show that ZFS now stays within the 75% kmem limit. arc_reclaim
 will run at a suitable rate when needed. ZFS pools should not show too
 many idle pages (idle pages are removed after some cool-down time to
 reduce xcall activity, if I read the code right).
 dtrace should show positive and negative arc_available_memory figures.

 I did not keep the vmstat, dtrace and top outputs. But from a busy
 db-loading DOMU (databases > 350 GB)
 I see a vmstat -m of:

 Memory resource pool statistics
 Name        Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
 ...
 zfs_znode_cache 248 215697   0        0 13482     0 13482 13482 0   inf    0
 zil_lwb_cache 208      84    0        0     5     0     5     5 0   inf    0
 zio_buf_1024 1536   11248    0     7612  3278  1460  1818  1818 0   inf    0
 zio_buf_10240 10240  1130    0      723   973   566   407   407 0   inf    0
 zio_buf_114688 114688 351    0      200   339   188   151   151 0   inf    0
 zio_buf_12288 12288  1006    0      714   721   429   292   305 0   inf    0
 zio_buf_131072 131072 3150  89     2176  1841   867   974   974 0   inf    0
 zio_buf_14336 14336   473    0      308   432   267   165   166 0   inf    0
 zio_buf_1536 2048    2060    0     1065   549    51   498   498 0   inf    0
 zio_buf_16384 16384  9672    0      481  9318   127  9191  9191 0   inf    0
 zio_buf_2048 2048    2001    0      826   682    94   588   588 0   inf    0
 zio_buf_20480 20480   461    0      301   428   268   160   160 0   inf    0
 zio_buf_24576 24576   448    0      293   404   249   155   155 0   inf    0
 zio_buf_2560 2560    2319    1      490  1948   119  1829  1829 0   inf    0
 zio_buf_28672 28672   369    0      221   345   197   148   152 0   inf    0
 zio_buf_3072 3072    4163    2      422  3861   120  3741  3741 0   inf    0
 ...
 zio_buf_7168 7168     506    0      292   465   251   214   214 0   inf    0
 zio_buf_8192 8192     724    0      329   635   240   395   395 0   inf    0
 zio_buf_81920 81920   379    0      229   371   221   150   161 0   inf    0
 zio_buf_98304 98304   580    0      421   442   283   159   163 0   inf    0
 zio_cache    992     4707    0        0  1177     0  1177  1177 0   inf    0
 zio_data_buf_10 1536   39    0       33    20    17     3    12 0   inf    0
 zio_data_buf_10 10240   2    0        2     2     2     0     2 0   inf    0
 zio_data_buf_13 131072 488674 0  323782 274996 110104 164892 191800 0   inf    0
 zio_data_buf_15 2048   25    0       19    13    10     3     7 0   inf    0
 zio_data_buf_20 2048   17    0       13     9     7     2     4 0   inf    0
 zio_data_buf_20 20480   1    0        1     1     1     0     1 0   inf    0
 zio_data_buf_25 2560    7    0        6     7     6     1     5 0   inf    0
 ...
 Totals           222323337  98 210229180 1033080 125800 907280

 In use 24951773K, total allocated 25255540K; utilization 98.8%

 In the unpatched case all 32GB were allocated.

 The arc_reclaim_thread clocked in 20 CPU sec - that is ok.

 Current dtrace output is:
 dtrace: script 'zfsmem.d' matched 1 probe
 CPU     ID                    FUNCTION:NAME
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2

 The page daemon was never woken up and has 0 CPU seconds in 2 days.

 This all looks very much as desired.

 Hope this helps.

 Best regards,
    Frank


 [attachment: pd.diff]

 --- /src/NetBSD/n10/src/sys/uvm/uvm_pdaemon.c	2023-07-29 17:52:46.392362932 +0200
 +++ /src/NetBSD/n10/src/sys/uvm/.#uvm_pdaemon.c.1.133	2023-07-29 14:18:05.000000000 +0200
 @@ -270,11 +270,15 @@
  	/*
  	 * main loop
  	 */
 -
 +/*XXXkd*/ unsigned long cnt_needsfree = 0L, cnt_needsscan = 0, cnt_drain = 0, cnt_starved = 0, cnt_avail = 0, cnt_loops = 0;
 +/*XXXkd*/ time_t ts, last_ts = time_second;
  	for (;;) {
  		bool needsscan, needsfree, kmem_va_starved;

 +/*XXXkd*/ cnt_loops++;
 +
  		kmem_va_starved = uvm_km_va_starved_p();
 +/*XXXkd*/ if (kmem_va_starved) cnt_starved++;

  		mutex_spin_enter(&uvmpd_lock);
  		if ((uvm_pagedaemon_waiters == 0 || uvmexp.paging > 0) &&
 @@ -311,6 +315,8 @@
  		needsfree = fpages + uvmexp.paging < uvmexp.freetarg;
  		needsscan = needsfree || uvmpdpol_needsscan_p();

 +/*XXXkd*/ if (needsfree) cnt_needsfree++;
 +/*XXXkd*/ if (needsscan) cnt_needsscan++;
  		/*
  		 * scan if needed
  		 */
 @@ -328,8 +334,18 @@
  			wakeup(&uvmexp.free);
  			uvm_pagedaemon_waiters = 0;
  			mutex_spin_exit(&uvmpd_lock);
 +/*XXXkd*/		cnt_avail++;
  		}

 +/*XXXkd*/	if (needsfree || kmem_va_starved) cnt_drain++;
 +/*XXXkd*/	ts = time_second;
 +/*XXXkd*/	if (ts > last_ts + 5 && cnt_loops > 5 * 10000) {
 +/*XXXkd*/		printf("pagedaemon: loops=%ld, cnt_needsfree=%ld, cnt_needsscan=%ld, cnt_drain=%ld, cnt_starved=%ld, cnt_avail=%ld, fpages=%d\n",
 +/*XXXkd*/		       cnt_loops, cnt_needsfree, cnt_needsscan, cnt_drain, cnt_starved, cnt_avail, fpages);
 +/*XXXkd*/ 		cnt_needsfree = cnt_needsscan = cnt_drain = cnt_starved = cnt_avail = cnt_loops = 0;
 +/*XXXkd*/		last_ts = ts;
 +/*XXXkd*/	}
 +
  		/*
  		 * scan done.  if we don't need free memory, we're done.
  		 */


From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:19:57 +0200

 Just a note: bug patch 1 affects the zfs module, patch 2 affects the
 solaris module.
 So only module builds are needed.

From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 09:27:50 -0700

 On Thu, Aug 03, 2023 at 08:45:01AM +0000, kardel@netbsd.org wrote:
 > 	Patch 1:
 > 		let ZFS use a correct view on KVA memory:
 > 		With this patch arc reclaim now detects memory shortage and
 > 		frees pages. Also the ZFS KVA used by ZFS is limited to
 > 		75% KVA - could be made tunable
 > 
 > 	Patch 1 is not sufficient though. arc reclaim thread kicks in at 75%
 > 	correctly, but pages are not fully reclaimed and ZFS depletes its cache
 > 	fully as the freed and now idle page are not reclaimed from the pools yet.
 > 	pgdaemon will now not trigger pool_drain, as uvm_km_va_starved_p() returns false
 > 	at this point.

 this patch is not correct.  it does not do the right thing when there
 is plenty of KVA but a shortage of physical pages.  the goal with
 previous fixes for ZFS ARC memory management problems was to prevent
 KVA shortages by making KVA big enough to map all of RAM, and thus
 avoid the need to consider KVA because we would always run low on
 physical pages before we would run low on KVA.  but apparently in your
 environment that is not working.  maybe we do something differently in
 a XEN kernel that we need to account for?


 > 	To reclaim the pages freed directly we need
 > 	Patch 2:
 > 		force page reclaim
 > 	that will perform the reclaim.

 this second patch is fine.

 -Chuck

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 20:22:10 +0200

 Hi Chuck !

 Thanks for looking into that.

 I came up with the first patch due to the pgdaemon looping because of
 uvm_km_va_starved_p() being true.

 vmstat -m shows that the pool statistics sum up close to the
 32Gb my DOM0 has.

 Counting the conditions while the pgdaemon is looping gives:

 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9789.2242179] pagedaemon: loops=16026699, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16026699, cnt_starved=16026699, cnt_avail=16026699, fpages=337385
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9795.2244437] pagedaemon: loops=16024007, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16024007, cnt_starved=16024007, cnt_avail=16024007, fpages=335307
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9801.2246381] pagedaemon: loops=16031141, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16031141, cnt_starved=16031141, cnt_avail=16031141, fpages=335331

 uvm_km_va_starved_p(void)
 {
          vmem_size_t total;
          vmem_size_t free;

          if (kmem_arena == NULL)
                  return false;

          total = vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE);
          free = vmem_size(kmem_arena, VMEM_FREE);

          return (free < (total / 10));
 }

 int
 uvm_availmem(bool cached)
 {
          int64_t fp;

          cpu_count_sync(cached);
          if ((fp = cpu_count_get(CPU_COUNT_FREEPAGES)) < 0) {
                  /*
                   * XXXAD could briefly go negative because it's impossible
                   * to get a clean snapshot.  address this for other 
 counters
                   * used as running totals before NetBSD 10 although less
                   * important for those.
                   */
                  fp = 0;
          }
          return (int)fp;
 }

 So, while uvm_km_va_starved_p() considers almost all memory used up,
 uvm_availmem(false) returns 337385 free pages (~1.28 Gb), well above
 uvmexp.freetarg.

 So, why do we count so many free pages when the free vmem for kmem_arena
 is less than 10% of the total kmem_arena?
 Maybe the pool pages have been allocated but not yet been referenced - I
 didn't look that deep into the vmem/ZFS interaction.

 I understand the reasoning why kmem size = physmem size should have worked.

 There are still inconsistencies, though.
 Even if uvm_availmem(false) accounted for all pages
 allocated/reserved in the kmem_arena vmem, on the 32Gb system the actual
 free target is 2730 free pages (~10.7 Mb).
 10% of 32Gb would be 3.2Gb, which is a multiple of the free page target.
 So even then we would be stuck with a looping page daemon.

 I think we need to find a better way of coping with the accounting
 differences between vmem and uvm free pages. Looking at the vmem
 statistics seemed logical to me, as ZFS allocates almost everything from
 kmem_arena via pools.
 I don't know what vmem does when there are fewer physical pages available
 than the vmem allocation would require. This was the case you tried to
 avoid.

 So, looking at the vmem statistics seems to be consistent with the starved
 flag logic - that is why it does not trigger the looping pgdaemon. What
 isn't covered is the case of fewer physical pages than the pool
 allocation requires.

 I think we have yet to find a correct, robust solution that does not
 trigger the pgdaemon's almost-infinite loop.

 Frank



From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sat, 5 Aug 2023 13:24:55 -0000 (UTC)

 chuq@chuq.com (Chuck Silvers) writes:

 > this patch is not correct.  it does not do the right thing when there
 > is plenty of KVA but a shortage of physical pages.

 I doubt that this is handled well enough outside of zfs.

 The result of using ZFS (removing pkgsrc tree, unpacking it again) is
 about 3.2GB of kernel pools used (from 8GB total RAM).

 Trying to reduce ZFS again by tuning ZFS, shrinking maxvnodes and
 then allocating user pages ends in the following (only pools >10M shown):

 anonpl               0.078G
 arc_buf_hdr_t_f      0.018G
 buf16k               0.014G
 dmu_buf_impl_t       0.107G
 dnode_t              0.173G
 kmem-00064           0.037G
 kmem-00128           0.089G
 kmem-00192           0.182G
 kmem-00256           0.069G
 kmem-00384           0.161G
 kmem-01024           0.020G
 kmem-02048           0.033G
 mutex                0.020G
 namecache            0.020G
 pcglarge             0.042G
 pcgnormal            0.107G
 phpool-64            0.014G
 rwlock               0.020G
 sa_cache             0.032G
 vcachepl             0.206G
 zfs_znode_cache      0.072G
 zio_buf_131072       0.015G
 zio_buf_16384        0.075G
 zio_buf_4096         0.019G
 zio_buf_512          0.170G
 zio_cache            0.161G

 That's about 2GB left. Things like dnodes or the zio_cache are never
 flushed, the 512 byte zio buffer pool is still huge because it is
 totally fragmented, but also vcachepl is never drained.

 ZFS tries to drop referenced metadata in Solaris or FreeBSD, but
 for NetBSD that's still a nop.

 The OpenZFS code in current FreeBSD looks quite different too in
 that area.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57558 CVS commit: src/external/cddl/osnet/sys/kern
Date: Sat, 9 Sep 2023 00:14:16 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sat Sep  9 00:14:16 UTC 2023

 Modified Files:
 	src/external/cddl/osnet/sys/kern: kmem.c

 Log Message:
 solaris: Use pool_cache_reclaim, not pool_cache_invalidate.

 pool_cache_invalidate invalidates cached objects, but doesn't return
 any backing pages to the underlying page allocator.

 pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
 pages to the underlying page allocator, so it is actually useful for
 the page daemon to do when trying to free memory.

 PR kern/57558

 XXX pullup-10
 XXX pullup-9
 XXX pullup-8 (by patch to kmem.h instead of kmem.c)


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/external/cddl/osnet/sys/kern/kmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57558 CVS commit: [netbsd-9] src/external/cddl/osnet/sys/kern
Date: Mon, 2 Oct 2023 13:31:14 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon Oct  2 13:31:14 UTC 2023

 Modified Files:
 	src/external/cddl/osnet/sys/kern [netbsd-9]: kmem.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #1735):

 	external/cddl/osnet/sys/kern/kmem.c: revision 1.4

 solaris: Use pool_cache_reclaim, not pool_cache_invalidate.

 pool_cache_invalidate invalidates cached objects, but doesn't return
 any backing pages to the underlying page allocator.
 pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
 pages to the underlying page allocator, so it is actually useful for
 the page daemon to do when trying to free memory.

 PR kern/57558


 To generate a diff of this commit:
 cvs rdiff -u -r1.2.2.1 -r1.2.2.2 src/external/cddl/osnet/sys/kern/kmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57558 CVS commit: [netbsd-10] src/external/cddl/osnet/sys/kern
Date: Mon, 2 Oct 2023 13:29:59 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon Oct  2 13:29:59 UTC 2023

 Modified Files:
 	src/external/cddl/osnet/sys/kern [netbsd-10]: kmem.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #383):

 	external/cddl/osnet/sys/kern/kmem.c: revision 1.4

 solaris: Use pool_cache_reclaim, not pool_cache_invalidate.

 pool_cache_invalidate invalidates cached objects, but doesn't return
 any backing pages to the underlying page allocator.
 pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
 pages to the underlying page allocator, so it is actually useful for
 the page daemon to do when trying to free memory.

 PR kern/57558


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.3.6.1 src/external/cddl/osnet/sys/kern/kmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Fri, 26 Apr 2024 14:19:37 -0000 (UTC)

 kardel@netbsd.org (Frank Kardel) writes:

 >Observed behavior:
 >pagedaemon runs at 100% (already at 550 minutes CPU and counting)

 I have local patches to avoid the spinning. As the page daemon keeps
 data structures locked and runs at maximum priority, it prevents
 other tasks from releasing resources. That's only of little help:
 if the page daemon really cannot free anything, the system is still
 locked up to some degree.

 The main reason that the page daemon cannot free memory is
 that vnodes are not drained. This keeps the associated pools
 busy and the buffers allocated by the file cache.

 Of course, if ZFS isn't throttled (and it would be less so, if
 others make room), it would just expunge the rest of the system
 data, so any improvement here just shifts the problem.


 >Once uvm_availmem() falls below uvmexp.freetarg the pagedaemon unwedges 
 >from its tight loop as ZFS finally gives up its stranglehold on pool 
 >memory.

 I locally added some of the arc tunables to experiment with the
 free_target value. The calculation of arc_c_max in arc.c also
 doesn't agree with the comments.

 Later ZFS versions did change a lot in this area. Anything we
 correct might need to be redone when we move to a newer ZFS
 code base.

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Fri, 26 Apr 2024 17:34:12 +0200

 see below.

 On 04/26/24 16:20, Michael van Elst wrote:
 > The following reply was made to PR kern/57558; it has been noted by GNATS.
 >
 > From: mlelstv@serpens.de (Michael van Elst)
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
 > Date: Fri, 26 Apr 2024 14:19:37 -0000 (UTC)
 >
 >   kardel@netbsd.org (Frank Kardel) writes:
 >   
 >   >Observed behavior:
 >   >pagedaemon runs at 100% (alreads at 550 minutes CPU and counting)
 >   
 >   I have local patches to avoid spinning. As the page daemon keeps
 >   data structures locked and runs at maximum priority, it prevents
 >   other tasks to release resources. That's only a little help,
 >   if the page daemon really cannot free anything, the system is still
 >   locked up to some degree.
 In this situation it spins only because the KVA starvation condition is
 met. That triggers the pooldrain thread, which in turn attempts to
 drain the pools. The ZFS pools call into arc.c:hdr_recl(), which
 triggers the arc_reclaim_thread. In this scenario
 arc_available_memory() returns a positive value, so the reclaim thread
 does not reclaim anything.
 So the pagedaemon loops, but nothing improves as long as
 arc_available_memory() returns positive values. At this point the ZFS
 statistics list large amounts of evictable data.
 >   
 >   The main reason, that the page daemon cannot free memory is
 >   that vnodes are not drained. This keeps the associated pools
 >   busy and buffers allocated by the file cache.
 Well, this may not be the reason in this case - there is not even an attempt
 made to let the reclaim thread evict data and drain pools. It doesn't 
 get that far.
 >   
 >   Of course, if ZFS isn't throttled (and it would be less so, if
 >   others make room), it would just expunge the rest of the system
 >   data, so any improvement here just shifts the problem.
 Well ZFS currently likes to eat up all pool memory in this situation.
 >   
 >   >Once uvm_availmem() falls below uvmexp.freetarg the pagedaemon unwedges
 >   >from its tight loop as ZFS finally gives up its stranglehold on pool
 >   >memory..
 >   
 >   I locally added some of the arc tunables to experiment with the
 >   free_target value. The calculation of arc_c_max in arc.c also
 >   doesn't agree with the comments.
 >   
 >   Later ZFS versions did change a lot in this area. Anything we
 >   correct, might need to be redone, when we move to a newer ZFS
 >   code base.
 >   
 Yes, but currently we seem to have a broken ZFS, at least in large-memory
 environments. Effects I observed are:
      - this bug: the famous looping pagedaemon
      - PR kern/58198: ZFS can lead to UVM kills (no swap, out of swap)
      - I am still trying to find out what causes the whole system to
 slow down to a crawl, but at that point the system is unusable for
 gathering any information.

 So something needs to be done.

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 2 May 2024 10:49:17 +0200

 This is a multi-part message in MIME format.
 --------------AA25A9A6F0B7E95C7A6B2F0F
 Content-Type: text/plain; charset=utf-8; format=flowed
 Content-Transfer-Encoding: 7bit

 In order to prevent the pagedaemon from looping and KVA space from being
 exhausted, ZFS must not only ensure that uvmexp.freetarg pages are free,
 but also honor the 10%-free-KVA limit for the sake of the
 pagedaemon's KVA free-space check.

 Thus arc.c needs an additional check that sets needfree if the free KVA
 space falls below 10%.

 This was tested with the db load scenario on a Xen DOMU with a GENERIC
 kernel and 360GB memory. The system stayed responsive and used around
 270-319GB of pool space; ZFS cut back on pool memory when needed.
 In another test, a runaway memory consumer caused the pool space used
 to be cut back from 319GB to 51GB before hitting "out of swap space".

 --------------AA25A9A6F0B7E95C7A6B2F0F
 Content-Type: text/x-patch;
  name="arc.c.diff"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
  filename="arc.c.diff"

 Index: arc.c
 ===================================================================
 RCS file: /cvsroot/src/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c,v
 retrieving revision 1.22
 diff -u -r1.22 arc.c
 --- arc.c	3 Aug 2022 01:53:06 -0000	1.22
 +++ arc.c	2 May 2024 08:38:40 -0000
 @@ -3903,6 +3903,31 @@
  	free_memory_reason_t r = FMR_UNKNOWN;

  #ifdef _KERNEL
 +#ifdef __NetBSD__
 +	vmem_size_t totalpercent;
 +	vmem_size_t free;
 +
 +	/*
 +	 * PR kern/57558:
 +	 *
 +	 * do not let pdaemon get stuck in the uvm_km_va_starved_p()
 +	 * state. it starts a tight loop when in uvm_km_va_starved state
 +	 * and ZFS is not freeing any pool pages as it started freeing
 +	 * only when falling below uvmexp.freetarg.
 +	 * now we start freeing when falling below 10% kva free or
 +	 * uvmexp.freetarg.
 +	 * the 10% magic is shamelessly copied from uvm_km_va_starved_p()
 +	 * The interface to the pagedaemon has room for improvement.
 +	 */
 +
 +	totalpercent = vmem_size(heap_arena, VMEM_ALLOC|VMEM_FREE) / 10;
 +	free = vmem_size(heap_arena, VMEM_FREE);
 +
 +	if (free < totalpercent) {
 +		needfree = btop(totalpercent - free);
 +	}
 +#endif
 +
  	if (needfree > 0) {
  		n = PAGESIZE * (-needfree);
  		if (n < lowest) {

 --------------AA25A9A6F0B7E95C7A6B2F0F--

From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org, kardel@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sat, 4 May 2024 16:07:27 -0700

 On Thu, May 02, 2024 at 08:50:02AM +0000, Frank Kardel wrote:
 >  Thus arc.c needs an additional check to determine needfree if the kva 
 >  free space falls below 10%.

 The intention is that 64-bit kernels should configure enough KVA
 to be able to map all of physical memory as kmem without running out of KVA.
 I changed the general code to work that way a while back; do we do
 something different on xen?

 -Chuck

From: Frank Kardel <kardel@netbsd.org>
To: Chuck Silvers <chuq@chuq.com>, gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 08:33:03 +0200

 My comments to your remark below - a longer (re-)explanation follows
 after that.

 On 05/05/24 01:07, Chuck Silvers wrote:
 > On Thu, May 02, 2024 at 08:50:02AM +0000, Frank Kardel wrote:
 >>   Thus arc.c needs an additional check to determine needfree if the kva
 >>   free space falls below 10%.
 > The intention is that 64-bit kernels should configure enough KVA
 > to be able to map all of physical memory as kmem without running out of KVA.
 We are not really running out of KVA; that is not violated. We are
 running into less than 10% free KVA available, with ZFS still
 allocating KVA while other pools give up KVA, which then gets
 allocated to ZFS UNTIL we fall below uvmexp.freetarg.
 > I changed the general code to work that way a while back, do we do something
 > different on xen?
 The difference is large memory - see below for the reasoning (which can 
 be supported via measurements/traces)
 > -Chuck

 TLDR section at the end.

 The issue is not having too little KVA configured. The issue is that on
 large-memory (not XEN-specific) systems the page daemon attempts to
 always keep at least 10% of KVA free. See uvm_km.c:uvm_km_va_starved_p()
 and uvm_pdaemon.c:uvm_pageout().

 With less than 10% KVA free, the local kmem_va_starved variable is true.
 This leads to skipping the UVM_UNLOCK_AND_WAIT(). Further on,
 usually no scan is done, as:
      - needsfree is false, as there is enough free memory
 (uvmexp.freetarg is 4096 in this case)
      - needsscan is also false, as uvmpdpol_needsscan_p() does not
 return true at that time.
 But even if we did scan, it would not help, as the target of the scan
 is to get around uvmexp.freetarg free pages, and it would not drain any
 pools where ZFS is hogging memory.

 Further down, the pool_drainer thread is kicked, as needsfree and
 needsscan are false, since they are mainly bound to uvmexp.freetarg.

 Following the pooldrain path, an attempt is made to reclaim idle pages
 from the pools. At this time most pools will give up idle pages, but
 ZFS will hold onto them. This is because ZFS determines that we are not
 falling below uvmexp.freetarg and thus does not kick the arc_reclaim
 thread to give up pages. So all the pool thread accomplishes is that
 (most) other pools give up their idle pages while ZFS holds onto its
 pool allocations.

 While the pool thread may dig up some more free pages, ZFS will keep
 allocating pool (KVA) memory as long as we are not below
 uvmexp.freetarg. While this goes on, more and more pools get reduced
 whenever possible (because some pages were currently free). The effects
 are the system becoming very slow to respond, network buffers failing
 to be allocated, dropped network connections, and more.

 This relaxes a bit when free memory falls below uvmexp.freetarg, as at
 that time ZFS starts giving up pool memory. By then we are far below
 the 10% KVA starvation limit.

 While below the 10% limit but above uvmexp.freetarg, the pagedaemon
 happily spins while ZFS keeps allocating more and more KVA.

 So it is not a lack of available KVA. It is that ZFS keeps allocating
 KVA until we fall below uvmexp.freetarg. On larger-memory systems the
 gap between uvmexp.freetarg and 10% of KVA increases, and the problem
 becomes critical.

 Given the current mechanics, the pool memory for all non-ZFS pools is
 initially effectively limited to uvmexp.freetarg pages, which is not
 enough for reliable system operation.

 It is not a XEN issue.

 TLDR:
 - the pagedaemon aggressively starts pool draining once KVA free falls below 10%
 - ZFS won't free pool pages until free memory falls below uvmexp.freetarg
 - there is a huge gap between uvmexp.freetarg and 10% KVA free,
 increasing with larger memory
 - while below 10% KVA free, ZFS eventually depletes all other pools that
 are cooperatively giving up pages, causing all sorts of shortages in
 other areas (visible e.g. in network buffers)

 Mitigation: let ZFS detect free KVA falling below 10% and start
 reclaiming memory.

 It is not related to XEN at all. Just ZFS + large memory is sufficient
 for the problems to occur. The base issue is the big difference between
 the 10%-free-KVA limit and uvmexp.freetarg.

 I seem to explain the mechanism over and over again. And so far no one 
 has verified this analysis.

 -Frank

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 06:45:29 -0000 (UTC)

 kardel@netbsd.org (Frank Kardel) writes:

 >It is not related to XEN at all. Just ZFS + large memory is sufficient 
 >for the problems to occur.
 >Base issue is the big difference between 10% free KVA memory limit and 
 >uvmexp.freetarg.


 I've added some sysctls that FreeBSD supports and also changed
 how some defaults are computed.

 % vmstat -s | grep target
     10922 target free pages

 % sysctl vfs.zfs_arc
 vfs.zfs_arc.meta_limit = 0
 vfs.zfs_arc.meta_min = 0
 vfs.zfs_arc.shrink_shift = 0
 vfs.zfs_arc.max = 128770969600
 vfs.zfs_arc.min = 16096371200
 vfs.zfs_arc.compressed = 1
 vfs.zfs_arc.free_target = 10922         <----------


 I just have no reliable test case to verify that it changes
 the behaviour.

 I have uploaded my current patch (still with some debug printfs).

 http://cdn.netbsd.org/pub/NetBSD/misc/mlelstv/zfs.diff


From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 09:47:51 +0200

 Good to have those knobs.

 In your example you have 10922 target free pages, which is ~45 MB. On a
 system with more than 450 MB of memory the problem would occur again.

 On our system i see (vmstat -s)

       4096 bytes per page
         16 page colors
   86872544 pages managed
    4266905 pages free
   18512074 pages active
    4929394 pages inactive
          0 pages paging
       4429 pages wired
          1 reserve pagedaemon pages
         60 reserve kernel pages
    2714934 boot kernel pages
   58193232 kernel pool pages
    6295559 anonymous pages
   17136740 cached file pages
      13700 cached executable pages
       3072 minimum free pages
       4096 target free pages
   28957514 maximum wired pages
          1 swap devices
    1048575 swap pages
     201179 swap pages in use
    2458785 swap allocations

 Thus:

 356 GB managed memory
 17 GB free
 26 GB anonymous memory
 70 GB file pages
 16 MB free target
 238 GB allocated to pools

 This situation is with the 10% correction in ZFS and has survived
 (without stalls or allocation failures) creating a 1.6 TB database in
 ZFS plus two parallel, multiple-hour, production-like runs on that
 database.

 Without the fix the system would have stalled during the load.

 Your setup would be safe with the current implementation if
 zfs_arc.free_target (= uvmexp.freetarg) were at 10% of KVA memory.
 This is usually not the case.

 I don't know how much memory your system has, and I did not check how
 uvmexp.freetarg is calculated at startup or adjusted thereafter. The
 fact is that even 16 MB seems sufficient on large-memory systems if ZFS
 is kept from allocating pool memory beyond 90%. ZFS must start
 reclaiming when there is less than 10% free KVA available, due to the
 page daemon logic.

 I see your patch also contains my proposed fix. So when testing this
 patch we should be safe from the KVA starvation issue, and changing
 zfs_arc_free_target would only have an additional effect when it is set
 higher than 10% of KVA memory. Being able to set this value has the
 benefit that we can limit ZFS pool usage even more. Maybe we should
 provide a way to specify a %KVA or an absolute allocation value.

 -Frank

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 10:03:38 +0200

 On 05/05/24 08:50, Michael van Elst wrote:
 >   
 >   I just have no reliable test case to verify that it changes
 >   the behaviour.
 >   
 >   
 As for testing you could use the following D script:
 sdt:::arc-available_memory
 {
            printf("mem = %d, reason = %d", arg0, arg1);
 }

 mem is positive when arc_available_memory() thinks there is memory
 (which prevents the reclaim thread from running even when triggered).
 mem is negative when memory should be freed, giving the negative amount
 to be freed.
 reason is the cause for the memory decision:
 typedef enum free_memory_reason_t {
          FMR_UNKNOWN,
          FMR_NEEDFREE,
          FMR_LOTSFREE,
          FMR_SWAPFS_MINFREE,
          FMR_PAGES_PP_MAXIMUM,
          FMR_HEAP_ARENA,
          FMR_ZIO_ARENA,
          FMR_ZIO_FRAG,
 } free_memory_reason_t;

 A test case would be any large ZFS write/read operation.
 If you see 1 as the cause, it is the starvation-guard fix (FMR_NEEDFREE).

 -Frank

From: matthew green <mrg@eterna23.net>
To: Frank Kardel <kardel@netbsd.org>
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org,
    Chuck Silvers <chuq@chuq.com>, gnats-bugs@netbsd.org
Subject: re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 05 May 2024 18:13:40 +1000

 just to clear up what i think is a confusion.

 > My comments to your remark below - a longer (re-)explanation follows
 > after that.
 >
 > On 05/05/24 01:07, Chuck Silvers wrote:
 > > On Thu, May 02, 2024 at 08:50:02AM +0000, Frank Kardel wrote:
 > >>   Thus arc.c needs an additional check to determine needfree if the kva
 > >>   free space falls below 10%.
 > > The intention is that 64-bit kernels should configure enough KVA
 > > to be able to map all of physical memory as kmem without running out of KVA.
 > We are not really running out of KVA, this is not violated. We are
 > running into less than 10% free KVA available
 > and ZFS still allocating KVA while other pools give up KVA which gets
 > allocated to ZFS UNTIL
 > we fall below uvmexp.freetarg.

 freetarg has nothing to do with KVA.  that's about free memory
 (actual physical pages).  KVA is a space that gets allocated out
 of, and sometimes that space has real pages backing it but not
 always (and sometimes the same page may be mapped more than once.)
 on 64-bit platforms, KVA is generally *huge*.

 KVA starvation shouldn't happen -- we have many terabytes
 available for the KVA on amd64, and pool reclaim only happens
 when we run low on free pages.  KVA on amd64 already is large
 enough to map all of physical memory in the 'direct map' region,
 as well as other places as needed.

 (it sounds like zfs needs to be able to reclaim pages like other
 consumers?)


 .mrg.

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 10:03:46 -0000 (UTC)

 kardel@netbsd.org (Frank Kardel) writes:

 > benefit that we can limit ZFS pool usage even more. Maybe we should provide
 > a way to specify %KVA or an absolute allocation value.

 There are lots of changes in that area in more recent ZFS versions.
 I don't think we should create too sophisticated changes if a later
 update will make them obsolete.

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs <gnats-bugs@NetBSD.org>
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 12:50:31 +0200

 On 05/05/24 10:13, matthew green wrote:

 > just to clear up what i think is a confusion.
 >
 > freetarg has nothing to do with KVA.
 Correct - that is why we are running into the issue, as ZFS currently
 looks only at freetarg.
 >    that's about free memory
 > (actual physical pages).  KVA is a space that gets allocated out
 > of, and sometimes that space has real pages backing it but not
 > always (and sometimes the same page may be mapped more the once.)
 > on 64-bit platforms, KVA is generally *huge*.
 Yes.
 >
 > KVA starvation shouldn't happen -- we have many terabytes
 > available for the KVA on amd64, and pool reclaim only happens
 > when we run low on free pages.
 Well, according to the code, starvation happens, as in uvm/uvm_km.c
 bool
 uvm_km_va_starved_p(void)
 {
          vmem_size_t total;
          vmem_size_t free;

          if (kmem_arena == NULL)
                  return false;

          total = vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE);
          free = vmem_size(kmem_arena, VMEM_FREE);

          return (free < (total / 10));
 }

 returns true. It may not be the starvation you have in mind (out of vmem 
 address space), but as it returns true
 it affects the uvm_pageout pagedaemon process.

 The difference in semantics here may be the available KVA address space 
 being HUGE and
 the presumably much smaller vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE) 
 value which is
 the basis for the uvm_km_va_starved_p() predicate.
 >    KVA on amd64 already is large
 > enough to map all of physical memory in the 'direct map' region,
 > as well as other places as needed.
 >
 > (it sounds like zfs needs to be able to reclaim pages like other
 > consumers?)
 Yes, many other consumers (maybe even all non-ZFS consumers) give up
 idle pages (and maybe more) when asked to.
 ZFS pool memory is currently only reclaimed when we fall below
 uvmexp.freetarg; starvation is signaled long before that.
 I think the ZFS reclaim strategy is not in line with the general pool
 reclaim expectations.
 It is also not synchronous: when arc_reclaim is triggered, only a
 thread is started that evicts pages. With swap space this is not overly
 critical.
 > .mrg.
 With this issue we need to look past design ideas and known invariants,
 look at the implementation, and find where the design ideas and
 invariants do not match the implementation.

 -Frank

From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        kardel@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 05 May 2024 07:37:20 -0400

 Frank Kardel <kardel@netbsd.org> writes:

 [snip]

 >  TLDR:
 >  - pagedaemon aggressively starts pool draining once KVA free falls below 10%
 >  - ZFS won't free pool pages until free memory falls below uvmexp.freetarg.
 >  - there is a huge gap between uvmexp.freetarg and 10% KVA free 
 >  increasing with larger memory
 >  - while below 10% KVA free ZFS eventually depletes all other pools that 
 >  are cooperatively giving up pages
 >     causing all sorts of shortages in other areas (visible in e.g. 
 >  network buffers)

 This is a pretty good description of a problem I am/was seeing with the
 daily cron checking for core files.  On a DOMU with not a lot of memory,
 12GB - 16GB and a WHOLE lot of ZFS filesets, this job would never
 complete and the guest would appear to lock up (actually it may be any
 job that did "find" that crossed into a ZFS fileset).  To work around it
 I ended up commenting out the daily job.  The guest is my build system
 for the OS and it would also start to bog down and would eventually hang
 up after a few OS builds, but that was a more manageable situation.

 With the simple kardel patch that was provided, the daily job could run
 to completion and the system appears to be responsive after a couple of
 days.  I have not had time to run builds to see how that affects the
 matter.  The guest has 2 vcpus and I sometimes would abuse it pretty
 hard by running 3 builds with -j2 on the build.sh line at the same time.
 Very often the system would hang up at some point if I did this and I
 had to back off and only run 1 or 2 at the same time.

 >  Mitigation: allow ZFS to detect free KVA memory falling below 10% to 
 >  start reclaiming memory.
 >  
 >  It is not related to XEN at all. Just ZFS + large memory is sufficient 
 >  for the problems to occur.
 >  Base issue is the big difference between 10% free KVA memory limit and 
 >  uvmexp.freetarg.

 I am not sure that "large memory" needs to be all that large to prompt
 the problem.  The description of what happens when ZFS gobbles
 everything up is pretty close to what I am seeing...

 >  I seem to explain the mechanism over and over again. And so far no one 
 >  has verified this analysis.
 >  
 >  -Frank
 >  




 -- 
 Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 14:10:50 +0200

 On 05/05/24 13:40, Brad Spencer wrote:
 > The following reply was made to PR kern/57558; it has been noted by GNATS.
 >
 > From: Brad Spencer <brad@anduin.eldar.org>
 > To: gnats-bugs@netbsd.org
 > Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
 >          kardel@netbsd.org
 > Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
 > Date: Sun, 05 May 2024 07:37:20 -0400
 >
 >   Frank Kardel <kardel@netbsd.org> writes:
 >   
 >   [snip]
 >   
 >   >  TLDR:
 >   >  - pagedaemon aggressively starts pool draining once KVA free falls below 10%
 >   >  - ZFS won't free pool pages until free memory falls below uvmexp.freetarg.
 >   >  - there is a huge gap between uvmexp.freetarg and 10% KVA free
 >   >  increasing with larger memory
 >   >  - while below 10% KVA free ZFS eventually depletes all other pools that
 >   >  are cooperatively giving up pages
 >   >     causing all sorts of shortages in other areas (visible in e.g.
 >   >  network buffers)
 >   
 >   This is a pretty good description of a problem I am/was seeing with the
 >   daily cron checking for core files.  On a DOMU with not a lot of memory,
 >   12GB - 16GB and a WHOLE lot of ZFS filesets, this job would never
 >   complete and the guest would appear to lock up (actually it may be any
 >   job that did "find" that crossed into a ZFS fileset).  To work around it
 >   I ended up commenting out the daily job.  The guest is my build system
 >   for the OS and it would also start to bog down and would eventually hang
 >   up after a few OS builds, but that was a more manageable situation.
 [snip]
 >   
 >   I am not sure that "large memory" needs to be all that large to prompt
 >   the problem.  The description of what happens when ZFS gobbles
 >   everything up is pretty close to what I am seeing...
 >   
 Thanks for your observation. Actually "large memory" could be seen more
 like the point where vmem_size(kernel_arena, VMEM_ALLOC|VMEM_FREE) / 10
 in pages is significantly larger than uvmexp.freetarg.
 As you have observed, this can already happen on smaller systems.

 -Frank

From: Brad Spencer <brad@anduin.eldar.org>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 05 May 2024 15:56:13 -0400

 Frank Kardel <kardel@netbsd.org> writes:


 > Thanks for your observation. Actually "large memory" could be seen more 
 > like where
 > vmem_size(kernel_arena, VMEM_ALLOC|VMEM_FREE) / 10 in pages being 
 > significantly larger than uvmexp.freetarg.
 > As you have observed this can already happen on smaller systems.
 >
 > -Frank


 Sure...

 I was able to perform the abusive build operation and was able to make
 the system fall over.  The abuse is the following:

 Have a 10.0 PVH guest with 16GB and 2 vcpus.  Run the following builds
 at the same time:

 build.sh -j2 <- for amd64
 build.sh -j2 <- for i386
 build.sh -j2 <- for earmv7hf

 The source tree is in a ZFS fileset and is used by all of the builds.
 The artifacts (obj, dist, release, etc.) are all in their own ZFS
 filesets for each of the arch types (that is, /artifacts/amd64 would be
 its own ZFS filesystem and contain object, release and dist
 subdirectories, using the -O, -R and -D flags to build.sh to point to
 /artifacts/amd64/OBJ etc.  There would also be /artifacts/i386 and
 /artifacts/earmv7hf, which are also their own filesets).

 Everything will be humming along just fine until the earmv7hf build
 nears the end and runs /usr/src/distrib/utils/embedded/mkimage, which
 does "dd bs=1 count=4456448 if=/dev/zero" ... that dd will run with
 high CPU for a little bit and then cause all active reads and writes
 going on in the other builds and itself to more or less deadlock.  The
 CPU utilization will fall to zero and disk utilization on the zpool
 will fall to zero.  The system will be responsive, but if you try
 hitting any of the files being used, the command (ls, or whatever) will
 hang.

 As far as I can tell what was going on in the system was two objcopy and
 two rm along with the dd.  One objcopy was stuck in tstile and the other
 in &zilog.  The dd was stuck in &tx->t and both rm were stuck in &zio->
 ... all according to top.

 I can almost reproduce this on demand, as long as the amd64 and i386
 builds are actually building something and the earmv7hf build hits the
 mkimage call at the same time.  A clean build of all three will probably
 provoke it and update builds (-u flag to build.sh) may as well.

 This is all probably unrelated to the patch that was provided and the
 problem being reported.  The patch does appear to make the situation
 better.

 Might want to consider switching the arguments to that "dd" to "dd
 count=1 bs=4456448 if=/dev/zero", i.e. write one block of 4456448 bytes
 instead of 4456448 one-byte blocks.  Might be less stressful.





 -- 
 Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 22:21:09 +0200

 Nice data point. It might not be related.

 Could you check the stack traces of the hung processes with either
 crash or DDB, to verify that we are not subject to a resource shortage
 but rather to a locking issue?

 Frank

From: Chuck Silvers <chuq@chuq.com>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sun, 5 May 2024 23:05:35 -0700

 --4Yo0DCM283VB5+BQ
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline

 On Sun, May 05, 2024 at 08:33:03AM +0200, Frank Kardel wrote:
 > The issue is not having enough KVA configured  The issue is that with
 > large (not XEN specific) memory systems the page daemon attempts to keep
 > always at least 10% KVA free. See uvm_km.c:uvm_km_va_starved_p(void) and
 > uvm_pdaemon.c:uvm_pageout(void *arg).

 ah yes, that is the problem.  this is really a mismatch between how much kmem space
 is allocated vs. how much kmem space is allowed to be used before the pagedaemon
 tries to reclaim some kmem space.  ZFS is just a victim in this because it happens
 to use a lot of kmem space.

 I think the right fix for this is to increase the amount of kmem space that we
 allocate such that all of physical memory can be allocated as kmem without
 the pagedaemon considering the system to be starved for kmem virtual space.
 this means allocating enough kmem space for 10/9 of physical memory,
 so that even though only 9/10 of kmem virtual space can be used, there is still
 enough kmem virtual space available for all of physical memory.

 please try the attached patch.

 -Chuck

 --4Yo0DCM283VB5+BQ
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename="diff.nkmempages-10-9ths.1"

 Index: src/sys/uvm/uvm_km.c
 ===================================================================
 RCS file: /home/chs/netbsd/cvs/src/sys/uvm/uvm_km.c,v
 retrieving revision 1.165
 diff -u -p -r1.165 uvm_km.c
 --- src/sys/uvm/uvm_km.c	9 Apr 2023 09:00:56 -0000	1.165
 +++ src/sys/uvm/uvm_km.c	5 May 2024 17:02:49 -0000
 @@ -227,7 +227,14 @@ kmeminit_nkmempages(void)
  	}

  #if defined(NKMEMPAGES_MAX_UNLIMITED) && !defined(KMSAN)
 -	npages = physmem;
 +	/*
 +	 * The extra 1/9 here is to account for uvm_km_va_starved_p()
 +	 * wanting to keep 10% of kmem virtual space free.
 +	 * The intent is that on "unlimited" platforms we should be able
 +	 * to allocate all of physical memory as kmem without running short
 +	 * of kmem virtual space.
 +	 */
 +	npages = (physmem * 10) / 9;
  #else

  #if defined(KMSAN)

 --4Yo0DCM283VB5+BQ--

From: Frank Kardel <kardel@netbsd.org>
To: Chuck Silvers <chuq@chuq.com>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 9 May 2024 19:26:01 +0200

 Thanks for that simpler fix to avoid the starvation state. Tests have
 shown that it works in the environment.

 I am wondering, though, whether something needs to be done for the other 
 cases.

 -Frank


 On 05/06/24 08:05, Chuck Silvers wrote:
 > [snip]
 > I think the right fix for this is to increase the amount of kmem space that we
 > allocate such that all of physical memory can be allocated as kmem without
 > the pagedaemon considering the system to be starved for kmem virtual space.
 > this means allocating an enough kmem space for 10/9 of physical memory,
 > so that even though only 9/10 of kmem virtual space can be used, there is still
 > enough kmem virtual space available for all of physical memory.
 >
 > please try the attached patch.
 >
 > -Chuck

From: Chuck Silvers <chuq@chuq.com>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 9 May 2024 10:48:24 -0700

 great, thanks for testing.

 which other cases are you talking about?

 -Chuck


 On Thu, May 09, 2024 at 07:26:01PM +0200, Frank Kardel wrote:
 > Thanks for that simpler fix to avoid the starvation state. Tests have shown
 > that it works in the environment.
 > 
 > I am wondering, though, whether something needs to be done for the other
 > cases.
 > 
 > -Frank
 > 
 > 
 > On 05/06/24 08:05, Chuck Silvers wrote:
 > > [snip]
 > > I think the right fix for this is to increase the amount of kmem space that we
 > > allocate such that all of physical memory can be allocated as kmem without
 > > the pagedaemon considering the system to be starved for kmem virtual space.
 > > this means allocating enough kmem space for 10/9 of physical memory,
 > > so that even though only 9/10 of kmem virtual space can be used, there is still
 > > enough kmem virtual space available for all of physical memory.
 > > 
 > > please try the attached patch.
 > > 
 > > -Chuck

From: Frank Kardel <kardel@netbsd.org>
To: Chuck Silvers <chuq@chuq.com>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 9 May 2024 21:06:43 +0200

 On 05/09/24 19:48, Chuck Silvers wrote:

 > great, thanks for testing.
 >
 > which other cases are you talking about?
 >
 > -Chuck

 The cases in the KVA nkmempages determination.

 #if defined(NKMEMPAGES_MAX_UNLIMITED) && !defined(KMSAN)
          /*
           * The extra 1/9 here is to account for uvm_km_va_starved_p()
           * wanting to keep 10% of kmem virtual space free.
           * The intent is that on "unlimited" platforms we should be able
           * to allocate all of physical memory as kmem without running short
           * of kmem virtual space.
           */
          npages = (physmem * 10) / 9;
 #else

 #if defined(KMSAN)
          npages = (physmem / 4);
 #elif defined(PMAP_MAP_POOLPAGE)
          npages = (physmem / 4);
 #else
          npages = (physmem / 3) * 2;
 #endif /* defined(PMAP_MAP_POOLPAGE) */

 #if !defined(NKMEMPAGES_MAX_UNLIMITED)
          if (npages > NKMEMPAGES_MAX)
                  npages = NKMEMPAGES_MAX;
 #endif

 #endif

          if (npages < NKMEMPAGES_MIN)
                  npages = NKMEMPAGES_MIN;

          nkmempages = npages;

 The defined(NKMEMPAGES_MAX_UNLIMITED) && !defined(KMSAN) case is running
 fine now.

 I wonder whether we could hit the starvation scenario, where ZFS does not
 reclaim memory in time, in the other cases, since there nkmempages is set
 to a value less than physmem.

 Just wondering.

 -Frank


>Unformatted:
