NetBSD Problem Report #56309
From kardel@kardel.name Wed Jul 14 08:13:14 2021
Return-Path: <kardel@kardel.name>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 802431A921F
for <gnats-bugs@gnats.NetBSD.org>; Wed, 14 Jul 2021 08:13:14 +0000 (UTC)
Message-Id: <20210714081200.56A15AAAC818@pip.kardel.name>
Date: Wed, 14 Jul 2021 10:12:00 +0200 (CEST)
From: kardel@netbsd.org
Reply-To: kardel@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: swapctl -U is very inefficient (takes ages to eternities)
X-Send-Pr-Version: 3.95
>Number: 56309
>Notify-List: wiz@NetBSD.org
>Category: kern
>Synopsis: swapctl -U is very inefficient (takes ages to eternities)
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: jdolecek
>State: analyzed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jul 14 08:15:00 +0000 2021
>Closed-Date:
>Last-Modified: Sat Aug 03 21:00:02 +0000 2024
>Originator: Frank Kardel
>Release: NetBSD 9.99.80 (also 9.99.85 and likely beyond)
>Organization:
>Environment:
System: NetBSD pip 9.99.80 NetBSD 9.99.80 (PIPGEN) #0: Thu Feb 11 20:11:26 CET 2021 kardel@pip:/src/NetBSD/cur/src/obj.amd64/sys/arch/amd64/compile/PIPGEN amd64
Architecture: x86_64
Machine: amd64
>Description:
On a system with swap configured, and after swap has actually been used,
shutdown can take a very long time in the "Removing block-type swap devices"
phase. The read rate from the swap devices is very low on rust media and
still slow on SSD media. swapctl -U also consumes significant CPU time.
The whole process takes eternities on rust media and ages on SSDs.
Example (about 7 minutes into the shutdown):
0 12876 23709 25309 117 -20 19376 1656 - O< ? 6:18.30 swapctl -U -t blk
0 188 22490 84 117 0 17820 1400 tstile D+ pts/2 0:00.00 swapctl -l
On rust media the shutdown can take several hours. Even SSDs (NVMe) can take
several tens of minutes.
>How-To-Repeat:
Use a system with swap enabled. Force the system to use swap space. Then
shut down and watch the very long "Removing block-type swap devices" phase.
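For illustration, a minimal memory hog (a hypothetical helper, not part of
the original report) that should push the system into swap; size NCHUNKS so
the total comfortably exceeds physical RAM:

    /*
     * memhog.c - allocate and touch more anonymous memory than the
     * machine has RAM, so the pagedaemon pushes pages out to swap.
     * Hypothetical test helper; adjust NCHUNKS for the target machine.
     */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK   (64UL * 1024 * 1024)  /* 64 MB per allocation */
    #define NCHUNKS 256                   /* 16 GB total; make this > RAM */

    int
    main(void)
    {
            static char *p[NCHUNKS];
            size_t i;

            for (i = 0; i < NCHUNKS; i++) {
                    if ((p[i] = malloc(CHUNK)) == NULL)
                            break;
                    memset(p[i], 0xa5, CHUNK);  /* dirty every page */
            }
            sleep(60);  /* hold the memory so it stays in swap */
            return 0;
    }

Run it until swapctl -l shows swap in use, then shut down.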
>Fix:
Workaround: set swapoff=NO in /etc/rc.conf.
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Sun, 16 Jun 2024 20:05:28 +0000
Responsible-Changed-Why:
I have some ideas to try.
State-Changed-From-To: open->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sat, 03 Aug 2024 17:48:44 +0000
State-Changed-Why:
The swapoff code reuses the fault mechanics. That code path is synchronous
and reads one memory page at a time. This effectively means swapoff is
synchronously reading data from the drive with one request per 4 KB page.
A secondary problem is that the swslots for each amap are processed
in increasing order, which happens to map to decreasing disk blocks.
This might defeat the drive's disk cache if the drive caches the blocks
after the requested block, but not the ones before it.
The main problem is the synchronous reads. I'll check if that path can
be changed to read the amaps asynchronously.
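(Illustrative C sketch of the pattern described above, with hypothetical
names; not the actual uvm code:)

    /*
     * Each in-swap slot of the amap is brought back with one
     * synchronous 4 KB read issued through the fault machinery;
     * the thread sleeps on every single page.
     */
    for (slot = 0; slot < amap->nslots; slot++) {
            if (!slot_is_in_swap(amap, slot))
                    continue;
            error = pagein_one_page(amap, slot);  /* blocks on the I/O */
            if (error)
                    return error;
    }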
From: Frank Kardel <kardel@kardel.name>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56309 (swapctl -U is very inefficient (takes ages to
eternities))
Date: Sat, 3 Aug 2024 20:55:38 +0200
Thanks for looking into it.
I think there may be additional factors. Synchronous reads alone do not
explain high CPU usage combined with low transfer rates.
Somewhere, I believe, there is a computationally expensive path
(maybe due to using parts of the fault logic).
If it were only synchronous reads and backwards reading, I would expect
low CPU usage and almost 100% disk utilization, but that is not what I
have seen.
Frank
From: Jaromír Doleček <jaromir.dolecek@gmail.com>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56309 (swapctl -U is very inefficient (takes ages to eternities))
Date: Sat, 3 Aug 2024 22:01:24 +0200
For amap processing at least, it rescans the whole amap for slots
every time it does an I/O, searching for the next swap slot to free.
It also does quite a lot of rwlock manipulation there in order to reuse
the fault function.
I'll change this to avoid the relocking and to scan each amap just once.
Hopefully that will help with the CPU usage too.
The uobj swap-off code does something very similar. Once the async code
works for amaps, I'll adapt that too.
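(Illustrative sketch of the difference, hypothetical names:)

    /*
     * Current pattern: after every I/O the amap lock is dropped for
     * the fault function and the scan restarts from slot 0, so n
     * in-swap slots cost on the order of n*n scan steps plus n
     * lock/unlock cycles.
     */
    restart:
            for (slot = 0; slot < amap->nslots; slot++) {
                    if (!slot_is_in_swap(amap, slot))
                            continue;
                    rw_exit(amap->lock);
                    pagein_one_page(amap, slot);    /* synchronous I/O */
                    rw_enter(amap->lock, RW_WRITER);
                    goto restart;                   /* rescan from slot 0 */
            }

    /*
     * Planned pattern: one pass per amap, no restart and no per-page
     * relocking; the scan itself becomes linear in nslots.
     */
            for (slot = 0; slot < amap->nslots; slot++)
                    if (slot_is_in_swap(amap, slot))
                            pagein_one_page(amap, slot);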
From: Frank Kardel <kardel@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56309 (swapctl -U is very inefficient (takes ages to
eternities))
Date: Sat, 3 Aug 2024 22:12:28 +0200
That seems to explain some (maybe even a major part) of the CPU time.
Re-scanning the whole amap for every I/O increases the computational
complexity significantly: with n in-swap slots the scanning work is
roughly quadratic rather than linear.
Frank
From: matthew green <mrg@eterna23.net>
To: gnats-bugs@netbsd.org
Cc: wiz@NetBSD.org, jdolecek@netbsd.org, netbsd-bugs@netbsd.org,
gnats-admin@netbsd.org, kardel@netbsd.org
Subject: re: kern/56309 (swapctl -U is very inefficient (takes ages to eternities))
Date: Sun, 04 Aug 2024 06:59:27 +1000
AFAIK, the problem is that everything is a 4KB read, via the trap path.
it has nothing to do with swapctl -U itself, but just pagein.
i've spent decades trying to figure out a better way to deal with this
but so far everything i thought of was obviously bad, or a basic impl
was not helpful in enough cases.
the only method i've come up with that actually helps the swapin path
requires more setup and handles some pageout uses less well. the idea
is that when doing swapout, instead of just finding up to 64KiB worth
of data to pageout at once (ie, 16 pages on x86), really try hard to
pageout 64KiB that is contiguous within a process, so that when pagein
occurs, we can do 64K reads.
it works sort-of-OK, but tends to cause more swapping out, and my tests
were inconclusive about the benefit.
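(illustrative sketch of that swapout-side clustering, hypothetical names:)

    #define SWAP_CLUSTER    16      /* 64KiB / 4KiB pages on x86 */

    /*
     * Prefer a virtually contiguous 64KiB run within one process and
     * give it contiguous swap slots, so a later pagein of the region
     * can be issued as a single 64K read.
     */
    if (find_contiguous_inactive_run(p, &va, SWAP_CLUSTER)) {
            slot = swap_alloc_contig(SWAP_CLUSTER); /* adjacent slots */
            pageout_range(p, va, SWAP_CLUSTER, slot);
    } else {
            pageout_any_pages(SWAP_CLUSTER);        /* old behaviour */
    }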
---
hmm, i just had a new idea. instead of doing the setup in the pageout,
when handling a pagein fault, check the surrounding pages for being in
swap, and pre-fault page them in. one might still have the 4KB IO
issue, but triggering these async and potentially in parallel (eg, on
nvme that has up to ncpu workers available) should actually allow at
least some benefits. probably want to combine this with the
larger-pageout (even if limiting that to say 4 pages?) method for best
results.
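(illustrative sketch of the pre-fault idea, hypothetical names:)

    #define PREFAULT_WINDOW 8       /* pages on each side; tunable */

    /*
     * After satisfying the faulting page from swap, probe the
     * neighbouring virtual pages and start asynchronous pageins for
     * those also swapped out, so the device can work in parallel.
     */
    va = trunc_page(fault_addr);
    pagein_sync(map, va);                   /* the faulting page */
    for (i = 1; i <= PREFAULT_WINDOW; i++) {
            if (page_in_swap(map, va + i * PAGE_SIZE))
                    pagein_async(map, va + i * PAGE_SIZE);
            if (page_in_swap(map, va - i * PAGE_SIZE))
                    pagein_async(map, va - i * PAGE_SIZE);
    }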
.mrg.
>Unformatted: