NetBSD Problem Report #57558

From Frank.Kardel@Acrys.com  Thu Aug  3 08:44:21 2023
Return-Path: <Frank.Kardel@Acrys.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 1BB351A9238
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  3 Aug 2023 08:44:21 +0000 (UTC)
Message-Id: <20230803084410.0E6E16019@gaia.acrys.com>
Date: Thu,  3 Aug 2023 10:44:10 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: pgdaemon 100% busy - no scanning (ZFS case)
X-Send-Pr-Version: 3.95

>Number:         57558
>Category:       kern
>Synopsis:       pgdaemon 100% busy - no scanning (ZFS case)
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Aug 03 08:45:00 +0000 2023
>Last-Modified:  Fri Apr 26 15:35:01 +0000 2024
>Originator:     Frank Kardel
>Release:        NetBSD 10.0_BETA / current
>Organization:

>Environment:


System: NetBSD Marmolata 10.0_BETA NetBSD 10.0_BETA (XEN3_DOM0) #1: Thu Jul 27 18:30:30 CEST 2023 kardel@gaia:/src/NetBSD/n10/src/obj.amd64/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
	It has been observed that pgdaemon can get into a tight loop,
	consuming 100% cpu, not yielding, and blocking other RUNable
	threads on the cpu (see PRs kern/56516 (for the effects - it may have
	another cause) and kern/55707).
	This analysis and the proposed fix relate to the pgdaemon loop caused
	by KVA exhaustion through ZFS.

	Observed and analyzed in the following environment (it should be
	reproducible in simpler environments):
		XEN3_DOM0 providing vnd devices backed by files on ZFS.
		GENERIC(pvh) DOMU using an ffs filesystem on the vnd provided by XEN3_DOM0.

	Observed actions/effects:
		1) a database is run on the ffs file system (backed by the vnd in XEN3_DOM0)
		2) a larger database is loaded
		3) XEN3_DOM0 is fine until ZFS has allocated 90% of KVA;
		   at this point pgdaemon kicks in and enters a tight loop
		4) pgdaemon does not do page scans (enough memory is available)
		5) pgdaemon loops as uvm_km_va_starved_p() returns true
		6) pool_drain is unable to reclaim any idle pages from the pools
		7) uvm_km_va_starved_p() thus keeps returning true and pgdaemon
		   keeps looping (a sketch of the loop follows below)
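
	For reference, a simplified sketch of the uvm_pdaemon.c main loop as
	it behaves in this scenario (paraphrased for illustration, not the
	literal source; the real loop and uvm_km_va_starved_p() are quoted
	further down in the audit trail):

		for (;;) {
			bool kmem_va_starved, needsfree, needsscan;
			int fpages;

			/* true while less than 10% of kmem_arena KVA is free */
			kmem_va_starved = uvm_km_va_starved_p();

			/* the daemon only sleeps when KVA is not starved,
			   so it never blocks here in this scenario */

			fpages    = uvm_availmem(false);  /* still reports plenty of free pages */
			needsfree = fpages + uvmexp.paging < uvmexp.freetarg;   /* false */
			needsscan = needsfree || uvmpdpol_needsscan_p();        /* false: no scan */

			if (needsfree || kmem_va_starved) {
				/* wake the pool drain thread; the ZFS pools
				   hand back no idle pages */
			}

			/* kmem_va_starved stays true, so the loop repeats
			   immediately: 100% cpu, no progress */
		}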

	Analyzed causes:
		- pool_drain causes upcalls to the ZFS reclaim logic
		- the ZFS reclaim logic does not reclaim anything, as the current code
		  looks at uvm_availmem(false) and that returns 'plenty' of memory;
		  thus no attempt is made to free memory in ZFS and no KVA is reclaimed.

	Conclusion:
		- using uvm_availmem(false) for ZFS memory throttling is wrong, as
		  ZFS memory is allocated from kmem KVA pools.
		- the ZFS arc must use KVA memory for its memory checks.

>How-To-Repeat:
	Run a DB load into an FFS on a vnd backed by a file on ZFS.
>Fix:
	Patch 1:
		let ZFS use a correct view of KVA memory:
		With this patch arc reclaim now detects the memory shortage and
		frees pages. The KVA used by ZFS is also limited to 75% of
		kmem KVA - this could be made tunable.

	Patch 1 is not sufficient though. The arc reclaim thread now kicks in at 75%
	correctly, but pages are not fully reclaimed and ZFS fully depletes its cache,
	as the freed and now idle pages are not yet reclaimed from the pools.
	pgdaemon will not trigger pool_drain either, as uvm_km_va_starved_p() returns
	false at this point.

	To reclaim the freed pages directly we need
	Patch 2:
		force page reclaim
	which actually performs the reclaim.

	With both fixes the arc reclaim thread kicks in at 75% KVA usage and
	reclaims only enough memory to not exceed 75% KVA.
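
	As an illustration only (numbers for the 32 GB DOM0 used here,
	assuming kmem_arena is sized to physical RAM; the names are the
	macros from the Patch 1 diff below):

		freemem = btop(vmem_size(kmem_arena, VMEM_FREE))
		desfree = btop(vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE)) / 4   /* pages worth ~8 GB */

		/* freemem drops below desfree once about 24 GB (75%) of kmem KVA
		   is allocated; that is where arc_reclaim starts evicting, and
		   with Patch 2 the freed pool pages are actually handed back,
		   keeping usage around that mark. */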

	Any comments?

	OK to commit? (this happens automatically if there is no feedback)

Index: external/cddl/osnet/dist/uts/common/fs/zfs/arc.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c,v
retrieving revision 1.22
diff -c -u -r1.22 arc.c
--- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	3 Aug 2022 01:53:06 -0000	1.22
+++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	3 Aug 2023 08:19:11 -0000
@@ -276,6 +276,7 @@
 #endif /* illumos */

 #ifdef __NetBSD__
+#include <sys/vmem.h>
 #include <uvm/uvm.h>
 #ifndef btop
 #define	btop(x)		((x) / PAGE_SIZE)
@@ -285,9 +286,9 @@
 #endif
 //#define	needfree	(uvm_availmem() < uvmexp.freetarg ? uvmexp.freetarg : 0)
 #define	buf_init	arc_buf_init
-#define	freemem		uvm_availmem(false)
+#define	freemem		btop(vmem_size(kmem_arena, VMEM_FREE))
 #define	minfree		uvmexp.freemin
-#define	desfree		uvmexp.freetarg
+#define	desfree		(btop(vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE)) / 4)
 #define	zfs_arc_free_target desfree
 #define	lotsfree	(desfree * 2)
 #define	availrmem	desfree


	Patch 2:
		force reclaiming of pages on affected pools
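
	(For context: pool_cache_invalidate() only invalidates the cached
	objects, while pool_cache_reclaim() additionally returns the now idle
	backing pages to the underlying page allocator - which is what
	actually frees the KVA here.)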

Index: external/cddl/osnet/sys/kern/kmem.c
===================================================================
RCS file: /cvsroot/src/external/cddl/osnet/sys/kern/kmem.c,v
retrieving revision 1.3
diff -c -u -r1.3 kmem.c
--- external/cddl/osnet/sys/kern/kmem.c	11 Nov 2020 03:31:04 -0000	1.3
+++ external/cddl/osnet/sys/kern/kmem.c	3 Aug 2023 08:19:11 -0000
@@ -124,6 +124,7 @@
 {

 	pool_cache_invalidate(km->km_pool);
+	pool_cache_reclaim(km->km_pool);
 }

 #undef kmem_alloc

>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: kardel@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 10:23:19 +0000

 Cool, thanks for looking into this!  I was planning to investigate at
 some point soon, starting by adding dtrace probes (and maybe wiring up
 the sysctl knobs) so we can reproduce the analysis of the issue in the
 field.  Your analysis sounds plausible, but I'd like to make sure we
 have the visibility to verify the behaviour -- and the change in
 behaviour -- first!

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 13:06:33 +0200

 Hi Taylor,

 I used the existing DTRACE knobs in arc.c (adding one for
 arc_available_memory(), though an fbt:return probe on
 arc_available_memory would suffice) and counter-based event debugging
 (additional code) to track the tight loop in uvm_pdaemon.c.

 A sysctl for the KVA percentage would be useful; the existing fbt and
 sdt probes already help a lot to track the pattern.

 In my setup (soon to be used for actual work) the loops could be reproduced.

 With patch 1 the pgdaemon loops went away and the KVA used for ZFS was
 limited to 75%, but the cache was depleted because of the missing patch 2.
 And patch 2 is needed as there is no chance the pgdaemon will trigger a
 pool_drain unless we reach KVA starvation on a non-ZFS path.

 So what would the next steps be?

 Frank

From: Taylor R Campbell <riastradh@NetBSD.org>
To: Frank Kardel <kardel@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 12:22:58 +0000

 Can you share the dtrace scripts you used for reference and how you
 set up the experiment?

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:11:51 +0200

 Sure

 Setup:
      - all userlands NetBSD 10.0_BETA
      - NetBSD 10.0_BETA (2023-07-26) (-current should also work) XEN3_DOM0
        (pagedaemon patched, see pd.diff attachment)
      - xen-4.15.1
      - NetBSD 10.0_BETA GENERIC as DOMU
      - on the DOM0 a zfs file system provides a file for the FFS file system
        in the DOMU
      - the DOMU has a postgresql 14.8 installation
      - the test case is loading a significant database (~200 GB) into the
        postgres DB.

 This seems complicated to set up (but I am preparing this kind of VM for
 our purposes anyway).
 Going by the errors detected it should also be possible like this (not tested):
      - create a ZFS file system on a plain GENERIC system
      - create a file system file on ZFS
      - vnconfig vndX <path to the file system file>
      - disklabel vndX
      - newfs vndXa
      - mount /dev/vndXa /mnt
      - do lots of fs traffic: writing, deleting and rewriting on the mounted fs

 Part 1 - current situation:

 Use
 sdt:::arc-available_memory
 {
          printf("mem = %d, reason = %d", arg0, arg1);
 }

 to track what ZFS thinks it has as memory - positive values mean there is
 enough memory, negative values ask the ZFS ARC to free that much memory.
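 (Save this as e.g. zfsmem.d and run it with "dtrace -s zfsmem.d"; the
 dtrace output quoted further below comes from this script.)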

 Use vmstat -m to track pool usage - you should see ZFS taking more and
 more memory until 90% of kmem is used in the pools.
 At that point you should see a ~100% busy pgdaemon in top, and the
 pagedaemon patch should log high counts for loops, cnt_starved and
 cnt_avail while uvm_availmem(false) still reports many free pages.

 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9813.2250709] pagedaemon: loops=16023729, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16023729, cnt_starved=16023729, cnt_avail=16023729, fpages=336349
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9819.2252810] pagedaemon: loops=16018349, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16018349, cnt_starved=16018349, cnt_avail=16018349, fpages=336542
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9825.2255258] pagedaemon: loops=16025793, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16025793, cnt_starved=16025793, cnt_avail=16025793, fpages=336516
 ...

 That documents the tight loop with no progress; the pgdaemon will not
 recover - see my analysis.
 Observe that arc_reclaim is not freeing anything (and collects no cpu
 time, see top) because arc_available_memory claims that there is enough
 free memory (it looks at uvm_availmem(false)).
 The dtrace probe documents that.

 Part 2 - get the arc_reclaim thread to actually be triggered before kmem
 is starving.
 Install Patch 1 from the bug report. This lets ZFS look at the kmem_arena
 space situation, which is also what uvm_km.c:uvm_km_va_starved_p(void)
 looks at.
 Now ZFS has a chance to start reclaiming memory.
 Run the load test again.
 The dtrace probe should now show decreasing memory until it gets negative,
 and it will stay negative by a certain amount.
 vmstat -m should show that ZFS now only hogs ~75% of kmem.
 There should also be significant counts in the Idle page column, as the
 arc_reclaim thread did give up memory.
 As the idle pages are not yet reclaimed from the pools, ZFS keeps being
 asked to free memory (dtrace probe) and vmstat -m keeps showing the
 non-zero Idle page counts. So ZFS now has 75% of kmem allocated but
 utilizes only a small part of it: the cache is allocated but no longer
 used.

 We need to get the Idle pages actually reclaimed from the pools. This is
 done by Patch 2 from the bug report.
 There is no way to pass this task to the pgdaemon, which only looks at
 uvm_availmem(false) and does not consider kmem unless it is starved. Also,
 the pool drain thread drains only one pool per invocation, and it is not
 even triggered here.
 So Patch 2 directly reclaims from the pool_cache_invalidate()ed pool.

 With this strategy ZFS keeps the kmem usage around 75%, as Idle pages are
 now reclaimed and ZFS only gets negative arc_available_memory values when
 called for.
 vmstat will show that ZFS now stays within the 75% kmem limit. arc_reclaim
 will run at a suitable rate when needed. The ZFS pools should not show too
 many idle pages (idle pages are removed after some cool-down time to
 reduce xcall activity, if I read the code right).
 dtrace should show both positive and negative arc_available_memory figures.

 I did not keep the vmstat, dtrace and top outputs. But from a DOMU busy
 loading databases (> 350 GB) I see a vmstat -m of:

 Memory resource pool statistics
 Name        Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
 ...
 zfs_znode_cache 248 215697   0        0 13482     0 13482 13482 0   inf    0
 zil_lwb_cache 208      84    0        0     5     0     5     5 0   inf    0
 zio_buf_1024 1536   11248    0     7612  3278  1460  1818  1818 0   inf    0
 zio_buf_10240 10240  1130    0      723   973   566   407   407 0   inf    0
 zio_buf_114688 114688 351    0      200   339   188   151   151 0   inf    0
 zio_buf_12288 12288  1006    0      714   721   429   292   305 0   inf    0
 zio_buf_131072 131072 3150  89     2176  1841   867   974   974 0   inf    0
 zio_buf_14336 14336   473    0      308   432   267   165   166 0   inf    0
 zio_buf_1536 2048    2060    0     1065   549    51   498   498 0   inf    0
 zio_buf_16384 16384  9672    0      481  9318   127  9191  9191 0   inf    0
 zio_buf_2048 2048    2001    0      826   682    94   588   588 0   inf    0
 zio_buf_20480 20480   461    0      301   428   268   160   160 0   inf    0
 zio_buf_24576 24576   448    0      293   404   249   155   155 0   inf    0
 zio_buf_2560 2560    2319    1      490  1948   119  1829  1829 0   inf    0
 zio_buf_28672 28672   369    0      221   345   197   148   152 0   inf    0
 zio_buf_3072 3072    4163    2      422  3861   120  3741  3741 0   inf    0
 ...
 zio_buf_7168 7168     506    0      292   465   251   214   214 0   inf    0
 zio_buf_8192 8192     724    0      329   635   240   395   395 0   inf    0
 zio_buf_81920 81920   379    0      229   371   221   150   161 0   inf    0
 zio_buf_98304 98304   580    0      421   442   283   159   163 0   inf    0
 zio_cache    992     4707    0        0  1177     0  1177  1177 0   inf    0
 zio_data_buf_10 1536   39    0       33    20    17     3    12 0   inf    0
 zio_data_buf_10 10240   2    0        2     2     2     0     2 0   inf    0
 zio_data_buf_13 131072 488674 0  323782 274996 110104 164892 191800 0   inf    0
 zio_data_buf_15 2048   25    0       19    13    10     3     7 0   inf    0
 zio_data_buf_20 2048   17    0       13     9     7     2     4 0   inf    0
 zio_data_buf_20 20480   1    0        1     1     1     0     1 0   inf    0
 zio_data_buf_25 2560    7    0        6     7     6     1     5 0   inf    0
 ...
 Totals           222323337  98 210229180 1033080 125800 907280

 In use 24951773K, total allocated 25255540K; utilization 98.8%

 In the unpatched case all 32 GB were allocated.

 The arc_reclaim_thread clocked in at 20 CPU seconds - that is ok.

 Current dtrace output is:
 dtrace: script 'zfsmem.d' matched 1 probe
 CPU     ID                    FUNCTION:NAME
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2

 The page daemon was never woken up and has 0 CPU seconds in 2 days.

 This all looks very much as desired.

 Hope this helps.

 Best regards,
    Frank


 [Attachment: pd.diff]

 --- /src/NetBSD/n10/src/sys/uvm/uvm_pdaemon.c	2023-07-29 17:52:46.392362932 +0200
 +++ /src/NetBSD/n10/src/sys/uvm/.#uvm_pdaemon.c.1.133	2023-07-29 14:18:05.000000000 +0200
 @@ -270,11 +270,15 @@
  	/*
  	 * main loop
  	 */
 -
 +/*XXXkd*/ unsigned long cnt_needsfree = 0L, cnt_needsscan = 0, cnt_drain = 0, cnt_starved = 0, cnt_avail = 0, cnt_loops = 0;
 +/*XXXkd*/ time_t ts, last_ts = time_second;
  	for (;;) {
  		bool needsscan, needsfree, kmem_va_starved;

 +/*XXXkd*/ cnt_loops++;
 +
  		kmem_va_starved = uvm_km_va_starved_p();
 +/*XXXkd*/ if (kmem_va_starved) cnt_starved++;

  		mutex_spin_enter(&uvmpd_lock);
  		if ((uvm_pagedaemon_waiters == 0 || uvmexp.paging > 0) &&
 @@ -311,6 +315,8 @@
  		needsfree = fpages + uvmexp.paging < uvmexp.freetarg;
  		needsscan = needsfree || uvmpdpol_needsscan_p();

 +/*XXXkd*/ if (needsfree) cnt_needsfree++;
 +/*XXXkd*/ if (needsscan) cnt_needsscan++;
  		/*
  		 * scan if needed
  		 */
 @@ -328,8 +334,18 @@
  			wakeup(&uvmexp.free);
  			uvm_pagedaemon_waiters = 0;
  			mutex_spin_exit(&uvmpd_lock);
 +/*XXXkd*/		cnt_avail++;
  		}

 +/*XXXkd*/	if (needsfree || kmem_va_starved) cnt_drain++;
 +/*XXXkd*/	ts = time_second;
 +/*XXXkd*/	if (ts > last_ts + 5 && cnt_loops > 5 * 10000) {
 +/*XXXkd*/		printf("pagedaemon: loops=%ld, cnt_needsfree=%ld, cnt_needsscan=%ld, cnt_drain=%ld, cnt_starved=%ld, cnt_avail=%ld, fpages=%d\n",
 +/*XXXkd*/		       cnt_loops, cnt_needsfree, cnt_needsscan, cnt_drain, cnt_starved, cnt_avail, fpages);
 +/*XXXkd*/ 		cnt_needsfree = cnt_needsscan = cnt_drain = cnt_starved = cnt_avail = cnt_loops = 0;
 +/*XXXkd*/		last_ts = ts;
 +/*XXXkd*/	}
 +
  		/*
  		 * scan done.  if we don't need free memory, we're done.
  		 */


From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:19:57 +0200

 Just a note: bug patch 1 affects the zfs module, patch 2 affects the
 solaris module, so only module builds are needed.

From: Chuck Silvers <chuq@chuq.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 09:27:50 -0700

 On Thu, Aug 03, 2023 at 08:45:01AM +0000, kardel@netbsd.org wrote:
 > 	Patch 1:
 > 		let ZFS use a correct view on KVA memory:
 > 		With this patch arc reclaim now detects memory shortage and
 > 		frees pages. Also the ZFS KVA used by ZFS is limited to
 > 		75% KVA - could be made tunable
 > 
 > 	Patch 1 is not sufficient though. arc reclaim thread kicks in at 75%
 > 	correctly, but pages are not fully reclaimed and ZFS depletes its cache
 > 	fully as the freed and now idle page are not reclaimed from the pools yet.
 > 	pgdaemon will now not trigger pool_drain, as uvm_km_va_starved_p() returns false
 > 	at this point.

 this patch is not correct.  it does not do the right thing when there
 is plenty of KVA but a shortage of physical pages.  the goal with
 previous fixes for ZFS ARC memory management problems was to prevent
 KVA shortages by making KVA big enough to map all of RAM, and thus
 avoid the need to consider KVA because we would always run low on
 physical pages before we would run low on KVA.  but apparently in your
 environment that is not working.  maybe we do something differently in
 a XEN kernel that we need to account for?


 > 	To reclaim the pages freed directly we need
 > 	Patch 2:
 > 		force page reclaim
 > 	that will perform the reclaim.

 this second patch is fine.

 -Chuck

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 20:22:10 +0200

 Hi Chuck !

 Thanks for looking into that.

 I came up with the first patch because the pgdaemon loops due to
 uvm_km_va_starved_p() being true.

 vmstat -m shows that the pool statistics sum up to close to the 32 GB my
 DOM0 has.

 Counting the conditions while the pgdaemon is looping gives:

 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9789.2242179] pagedaemon: loops=16026699, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16026699, cnt_starved=16026699, cnt_avail=16026699, fpages=337385
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9795.2244437] pagedaemon: loops=16024007, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16024007, cnt_starved=16024007, cnt_avail=16024007, fpages=335307
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9801.2246381] pagedaemon: loops=16031141, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16031141, cnt_starved=16031141, cnt_avail=16031141, fpages=335331

 uvm_km_va_starved_p(void)
 {
          vmem_size_t total;
          vmem_size_t free;

          if (kmem_arena == NULL)
                  return false;

          total = vmem_size(kmem_arena, VMEM_ALLOC|VMEM_FREE);
          free = vmem_size(kmem_arena, VMEM_FREE);

          return (free < (total / 10));
 }

 int
 uvm_availmem(bool cached)
 {
          int64_t fp;

          cpu_count_sync(cached);
          if ((fp = cpu_count_get(CPU_COUNT_FREEPAGES)) < 0) {
                  /*
                   * XXXAD could briefly go negative because it's impossible
                   * to get a clean snapshot.  address this for other counters
                   * used as running totals before NetBSD 10 although less
                   * important for those.
                   */
                  fp = 0;
          }
          return (int)fp;
 }

 So, while uvm_km_va_starved_p() considers almost all memory used up,
 uvm_availmem(false) returns 337385 free pages (~1.28 GB), well above
 uvmexp.freetarg.

 So why do we count so many free pages when the free vmem of kmem_arena is
 less than 10% of the total kmem_arena?
 Maybe the pool pages have been allocated but not yet referenced - I didn't
 look that deep into the vmem/ZFS interaction.

 I understand the reasoning why kmem size = physmem size should have worked.

 There are still inconsistencies, though.
 Even if uvm_availmem(false) accounted for all pages allocated/reserved in
 the kmem_arena vmem, on the 32 GB system the actual free target is only
 2730 free pages (~10.7 MB).
 10% of 32 GB would be 3.2 GB, which is many times the free page target.
 So even then we would be stuck with a looping page daemon.

 I think we need to find a better way of coping with the accounting
 differences between vmem and uvm free pages. Looking at the vmem
 statistics seemed logical to me, as ZFS allocates almost everything from
 kmem_arena via pools.
 I don't know what vmem does when fewer physical pages are available than
 a vmem allocation would require. That is the case you tried to avoid.

 So looking at the vmem statistics is consistent with the starved-flag
 logic - that is why it does not trigger the looping pgdaemon. What isn't
 covered is the case of fewer physical pages than the pool allocations
 require.

 I think we have yet to find a correct, robust solution that does not
 trigger the pgdaemon's near-infinite loop.

 Frank

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Sat, 5 Aug 2023 13:24:55 -0000 (UTC)

 chuq@chuq.com (Chuck Silvers) writes:

 > this patch is not correct.  it does not do the right thing when there
 > is plenty of KVA but a shortage of physical pages.

 I doubt that this is handled well enough outside of zfs.

 The result of using ZFS (removing pkgsrc tree, unpacking it again) is
 about 3.2GB of kernel pools used (from 8GB total RAM).

 Trying to reduce ZFS again by tuning ZFS, shrinking maxvnodes and
 then allocating user pages ends in the following (only pools >10M shown):

 anonpl               0.078G
 arc_buf_hdr_t_f      0.018G
 buf16k               0.014G
 dmu_buf_impl_t       0.107G
 dnode_t              0.173G
 kmem-00064           0.037G
 kmem-00128           0.089G
 kmem-00192           0.182G
 kmem-00256           0.069G
 kmem-00384           0.161G
 kmem-01024           0.020G
 kmem-02048           0.033G
 mutex                0.020G
 namecache            0.020G
 pcglarge             0.042G
 pcgnormal            0.107G
 phpool-64            0.014G
 rwlock               0.020G
 sa_cache             0.032G
 vcachepl             0.206G
 zfs_znode_cache      0.072G
 zio_buf_131072       0.015G
 zio_buf_16384        0.075G
 zio_buf_4096         0.019G
 zio_buf_512          0.170G
 zio_cache            0.161G

 That's about 2GB left. Things like dnodes or the zio_cache are never
 flushed, the 512 byte zio buffer pool is still huge because it is
 totally fragmented, but also vcachepl is never drained.

 ZFS tries to drop referenced metadata in Solaris or FreeBSD, but
 for NetBSD that's still a nop.

 The OpenZFS code in current FreeBSD looks quite different too in
 that area.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57558 CVS commit: src/external/cddl/osnet/sys/kern
Date: Sat, 9 Sep 2023 00:14:16 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sat Sep  9 00:14:16 UTC 2023

 Modified Files:
 	src/external/cddl/osnet/sys/kern: kmem.c

 Log Message:
 solaris: Use pool_cache_reclaim, not pool_cache_invalidate.

 pool_cache_invalidate invalidates cached objects, but doesn't return
 any backing pages to the underlying page allocator.

 pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
 pages to the underlying page allocator, so it is actually useful for
 the page daemon to do when trying to free memory.

 PR kern/57558

 XXX pullup-10
 XXX pullup-9
 XXX pullup-8 (by patch to kmem.h instead of kmem.c)


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/external/cddl/osnet/sys/kern/kmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57558 CVS commit: [netbsd-9] src/external/cddl/osnet/sys/kern
Date: Mon, 2 Oct 2023 13:31:14 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon Oct  2 13:31:14 UTC 2023

 Modified Files:
 	src/external/cddl/osnet/sys/kern [netbsd-9]: kmem.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #1735):

 	external/cddl/osnet/sys/kern/kmem.c: revision 1.4

 solaris: Use pool_cache_reclaim, not pool_cache_invalidate.

 pool_cache_invalidate invalidates cached objects, but doesn't return
 any backing pages to the underlying page allocator.
 pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
 pages to the underlying page allocator, so it is actually useful for
 the page daemon to do when trying to free memory.

 PR kern/57558


 To generate a diff of this commit:
 cvs rdiff -u -r1.2.2.1 -r1.2.2.2 src/external/cddl/osnet/sys/kern/kmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57558 CVS commit: [netbsd-10] src/external/cddl/osnet/sys/kern
Date: Mon, 2 Oct 2023 13:29:59 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon Oct  2 13:29:59 UTC 2023

 Modified Files:
 	src/external/cddl/osnet/sys/kern [netbsd-10]: kmem.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #383):

 	external/cddl/osnet/sys/kern/kmem.c: revision 1.4

 solaris: Use pool_cache_reclaim, not pool_cache_invalidate.

 pool_cache_invalidate invalidates cached objects, but doesn't return
 any backing pages to the underlying page allocator.
 pool_cache_reclaim does pool_cache_invalidate _and_ returns backing
 pages to the underlying page allocator, so it is actually useful for
 the page daemon to do when trying to free memory.

 PR kern/57558


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.3.6.1 src/external/cddl/osnet/sys/kern/kmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Fri, 26 Apr 2024 14:19:37 -0000 (UTC)

 kardel@netbsd.org (Frank Kardel) writes:

 >Observed behavior:
 >pagedaemon runs at 100% (alreads at 550 minutes CPU and counting)

 I have local patches to avoid spinning. As the page daemon keeps
 data structures locked and runs at maximum priority, it prevents
 other tasks from releasing resources. That's only a little help;
 if the page daemon really cannot free anything, the system is still
 locked up to some degree.

 The main reason that the page daemon cannot free memory is
 that vnodes are not drained. This keeps the associated pools
 busy and the buffers allocated by the file cache.

 Of course, if ZFS isn't throttled (and it would be less so, if
 others make room), it would just expunge the rest of the system
 data, so any improvement here just shifts the problem.


 >Once uvm_availmem() falls below uvmexp.freetarg the pagedaemon unwedges 
 >from its tight loop as ZFS finally gives up its stranglehold on pool 
 >memory..

 I locally added some of the arc tunables to experiment with the
 free_target value. The calculation of arc_c_max in arc.c also
 doesn't agree with the comments.

 Later ZFS versions did change a lot in this area. Anything we
 correct might need to be redone when we move to a newer ZFS
 code base.

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Fri, 26 Apr 2024 17:34:12 +0200

 see below.

 On 04/26/24 16:20, Michael van Elst wrote:
 > The following reply was made to PR kern/57558; it has been noted by GNATS.
 >
 > From: mlelstv@serpens.de (Michael van Elst)
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
 > Date: Fri, 26 Apr 2024 14:19:37 -0000 (UTC)
 >
 >   kardel@netbsd.org (Frank Kardel) writes:
 >   
 >   >Observed behavior:
 >   >pagedaemon runs at 100% (alreads at 550 minutes CPU and counting)
 >   
 >   I have local patches to avoid spinning. As the page daemon keeps
 >   data structures locked and runs at maximum priority, it prevents
 >   other tasks to release resources. That's only a little help,
 >   if the page daemon really cannot free anything, the system is still
 >   locked up to some degree.
 In this situation it spins only because the KVA starvation condition is
 met. That triggers the pooldrain thread, which in turn attempts to drain
 the pools. The ZFS pools call into arc.c:hdr_recl(), which triggers the
 arc_reclaim_thread. In this scenario arc_available_memory() returns a
 positive value and thus the reclaim thread does not reclaim anything.
 So the pagedaemon loops but nothing improves as long as
 arc_available_memory() keeps returning positive values. At this point the
 ZFS statistics list large amounts of evictable data.
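 In short, the call chain in the broken case is: pagedaemon (KVA starved)
 -> pooldrain thread -> ZFS pool callback arc.c:hdr_recl() ->
 arc_reclaim_thread -> arc_available_memory() > 0 -> nothing reclaimed,
 and the pagedaemon immediately loops again.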
 >   
 >   The main reason, that the page daemon cannot free memory is
 >   that vnodes are not drained. This keeps the associated pools
 >   busy and buffers allocated by the file cache.
 Well, this may not be the reason in this case - there is not even an attempt
 made to let the reclaim thread evict data and drain pools. It doesn't 
 get that far.
 >   
 >   Of course, if ZFS isn't throttled (and it would be less so, if
 >   others make room), it would just expunge the rest of the system
 >   data, so any improvement here just shifts the problem.
 Well ZFS currently likes to eat up all pool memory in this situation.
 >   
 >   >Once uvm_availmem() falls below uvmexp.freetarg the pagedaemon unwedges
 >   >from its tight loop as ZFS finally gives up its stranglehold on pool
 >   >memory..
 >   
 >   I locally added some of the arc tunables to experiment with the
 >   free_target value. The calculation of arc_c_max in arc.c also
 >   doesn't agree with the comments.
 >   
 >   Later ZFS versions did change a lot in this area. Anything we
 >   correct, might need to be redone, when we move to a newer ZFS
 >   code base.
 >   
 Yes, but currently we seem to have a broken ZFS, at least in large-memory
 environments. Effects I observed are:
      - this bug: the famous looping pagedaemon
      - PR kern/58198: ZFS can lead to UVM kills (no swap, out of swap)
      - I am still trying to find out what causes the whole system to slow
        down to a crawl, but at that point the system is too unusable to
        gather any information.

 So something needs to be done.

>Unformatted:
