NetBSD Problem Report #56309

From kardel@kardel.name  Wed Jul 14 08:13:14 2021
Return-Path: <kardel@kardel.name>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 802431A921F
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 14 Jul 2021 08:13:14 +0000 (UTC)
Message-Id: <20210714081200.56A15AAAC818@pip.kardel.name>
Date: Wed, 14 Jul 2021 10:12:00 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: swapctl -U is very inefficient (takes ages to eternities)
X-Send-Pr-Version: 3.95

>Number:         56309
>Notify-List:    wiz@NetBSD.org
>Category:       kern
>Synopsis:       swapctl -U is very inefficient (takes ages to eternities)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jdolecek
>State:          analyzed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jul 14 08:15:00 +0000 2021
>Closed-Date:    
>Last-Modified:  Sat Aug 03 21:00:02 +0000 2024
>Originator:     Frank Kardel
>Release:        NetBSD 9.99.80 (also - .85 and likely beyond)
>Organization:

>Environment:


System: NetBSD pip 9.99.80 NetBSD 9.99.80 (PIPGEN) #0: Thu Feb 11 20:11:26 CET 2021 kardel@pip:/src/NetBSD/cur/src/obj.amd64/sys/arch/amd64/compile/PIPGEN amd64
Architecture: x86_64
Machine: amd64
>Description:
	On a system with swap configured and in use, shutdown can take a
	very long time in the "Removing block-type swap devices" phase.
	The read rate from the swap devices is very low on rust media and
	slow on SSD media, and swapctl -U also consumes significant CPU
	time. The whole process takes eternities on rust media and ages
	on SSDs.

Example (about 7 minutes into the shutdown):
   0 12876 23709 25309 117 -20    19376     1656 -       O<   ?       6:18.30 swapctl -U -t blk
   0   188 22490    84 117   0    17820     1400 tstile  D+   pts/2   0:00.00 swapctl -l

	On rust media the shutdown can take several hours; even SSDs (NVMe)
	can take several tens of minutes.

>How-To-Repeat:
	Use a system with swap enabled. Force the system to use swap space. Watch
	a very long shutdown.
>Fix:
	Workaround: set swapoff=NO in /etc/rc.conf.

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sun, 16 Jun 2024 20:05:28 +0000
Responsible-Changed-Why:
I have some ideas to try.


State-Changed-From-To: open->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 03 Aug 2024 17:48:44 +0000
State-Changed-Why:
The swapoff code reuses the fault mechanics. That code path is
synchronous and reads one memory page at a time, so swapoff
effectively reads data from the drive with one synchronous request
per 4 KB page.

A secondary problem is that the swslots for each amap are processed
in increasing order, which happens to access disk blocks in
decreasing order. This can defeat the drive's read cache if the
drive caches the blocks after the requested block, but not the ones
before it.

The main problem is the synchronous reads. I'll check whether that
path can be changed to do the amap reads asynchronously.
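
For scale, a rough cost model (a userland sketch; the per-request
latencies below are illustrative assumptions, not measurements) shows
why one synchronous request per 4 KiB page dominates the swapoff time:

```c
/*
 * Rough cost model for swapoff issuing one synchronous 4 KiB read
 * per page.  The latency figures used below are illustrative
 * assumptions, not measurements.
 */

/* number of synchronous requests needed to read `bytes` of swap */
static unsigned long long
requests_for(unsigned long long bytes, unsigned long long req_size)
{
	return (bytes + req_size - 1) / req_size;
}

/* total seconds if each request costs `latency_us` microseconds */
static double
seconds_for(unsigned long long nreq, double latency_us)
{
	return (double)nreq * latency_us / 1e6;
}
/*
 * With 8 GiB of swap in use that is 2097152 requests.  At an assumed
 * ~8 ms per random 4 KiB read on rust that is on the order of hours,
 * and even at an assumed ~100 us per read on NVMe it is still minutes
 * of pure, fully serialized I/O latency.
 */
```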


From: Frank Kardel <kardel@kardel.name>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56309 (swapctl -U is very inefficient (takes ages to
 eternities))
Date: Sat, 3 Aug 2024 20:55:38 +0200

 Thanks for looking into it.

 I think there may be additional factors. Synchronous reads do not
 explain the high CPU usage combined with low transfer rates.

 Somewhere, I believe, there is a computationally expensive path
 (maybe due to reusing parts of the fault logic).

 If it were only synchronous reads and backwards reading, I would
 expect low CPU usage and almost 100% disk utilization, but that is
 not what I have seen.

 Frank

From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56309 (swapctl -U is very inefficient (takes ages to eternities))
Date: Sat, 3 Aug 2024 22:01:24 +0200

 For amap processing at least, it rescans the whole amap for slots
 every time it does an I/O, searching for another swap slot to free.
 It also does quite a lot of rwlock manipulation there in order to
 reuse the fault function.

 I'll change this to avoid the relocking and scan each amap just once.
 Hopefully that will help with the CPU usage too.

 uobj swap off code does something very similar. Once the async code
 works for amap, I'll adapt that too.
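
 A userland sketch (purely to illustrate the complexity argument, not
 the actual kernel code) of the difference between restarting the slot
 scan after every I/O and scanning each amap just once:

```c
/*
 * Illustration of the rescan cost: restarting the scan of an n-slot
 * amap after every freed slot touches O(n^2) entries, while a single
 * pass touches O(n).  Userland model only, not the kernel code.
 */

/* entries examined when the scan restarts from slot 0 after each I/O */
static unsigned long long
scan_restarting(int *slots, int n)
{
	unsigned long long touched = 0;

	for (;;) {
		int i, found = -1;

		for (i = 0; i < n; i++) {
			touched++;
			if (slots[i] != 0) {
				found = i;
				break;
			}
		}
		if (found < 0)
			break;		/* no used slots left */
		slots[found] = 0;	/* "free" the slot after its I/O */
	}
	return touched;
}

/* entries examined with a single forward pass over the amap */
static unsigned long long
scan_once(int *slots, int n)
{
	unsigned long long touched = 0;
	int i;

	for (i = 0; i < n; i++) {
		touched++;
		if (slots[i] != 0)
			slots[i] = 0;
	}
	return touched;
}
```

 With 1024 used slots the restarting scan examines 525824 slot entries
 versus 1024 for the single pass.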

From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56309 (swapctl -U is very inefficient (takes ages to
 eternities))
Date: Sat, 3 Aug 2024 22:12:28 +0200

 That seems to explain some (maybe even a major part) of the cpu time.
 The re-scans for every I/O increase the computational complexity
 significantly.

 Frank

From: matthew green <mrg@eterna23.net>
To: gnats-bugs@netbsd.org
Cc: wiz@NetBSD.org, jdolecek@netbsd.org, netbsd-bugs@netbsd.org,
    gnats-admin@netbsd.org, kardel@netbsd.org
Subject: re: kern/56309 (swapctl -U is very inefficient (takes ages to eternities))
Date: Sun, 04 Aug 2024 06:59:27 +1000

 AFAIK, the problem is that everything is a 4KB read, via the trap path.
 it has nothing to do with swapctl -U itself, but just pagein.

 i've spent decades trying to figure out a better way to deal with this
 but so far everything i thought of was either obviously bad, or a basic
 impl was not helpful in enough cases.

 the only method i've come up with that actually helps the swapin path
 requires more setup and handles some pageout uses less well.  the idea
 is that when doing swapout, instead of just finding up to 64KiB worth
 of data to pageout at once (ie, 16 pages on x86), we really try hard
 to pageout 64KiB that is contiguous in a process, so that when pagein
 occurs, we can do 64KiB reads.

 it works sort-of-OK, but tends to cause more swapping out and my tests
 were inconclusive about the benefit.

 ---

 hmm, i just had a new idea.  instead of doing the setup in the pageout,
 when handling a pagein fault, check the surrounding pages for being in
 swap, and pre-fault page them in.  one might still have the 4KB IO
 issue, but triggering these reads async and potentially in parallel
 (eg, on nvme that has up to ncpu workers available) should actually
 allow at least some benefit.  probably want to combine this with the
 larger-pageout method (even if limiting that to, say, 4 pages?) for
 best results.
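
 a minimal userland sketch of that neighbour check (the `in_swap`
 predicate array and the cluster radius are stand-ins for illustration,
 not a real uvm interface):

```c
/*
 * Sketch of the pre-fault idea: on a pagein fault at page `fault`,
 * collect the surrounding pages that are also in swap as candidates
 * for async pagein.  `in_swap` and `radius` are illustrative
 * stand-ins, not an actual uvm interface.
 */
static int
prefault_cluster(const int *in_swap, int npages, int fault,
    int radius, int *out)
{
	int p, n = 0;

	for (p = fault - radius; p <= fault + radius; p++) {
		if (p < 0 || p >= npages || p == fault)
			continue;
		if (in_swap[p])
			out[n++] = p;	/* queue as an async 4KB read */
	}
	return n;
}
```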


 .mrg.

>Unformatted:
