NetBSD Problem Report #53124

From mlelstv@serpens.de  Sat Mar 24 08:54:07 2018
Return-Path: <mlelstv@serpens.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 4EC387A10D
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 24 Mar 2018 08:54:07 +0000 (UTC)
Message-Id: <201803240853.w2O8riXv025381@serpens.de>
Date: Sat, 24 Mar 2018 09:53:45 +0100 (MET)
From: mlelstv@serpens.de
Reply-To: mlelstv@serpens.de
To: gnats-bugs@NetBSD.org
Subject: FFS is slow
X-Send-Pr-Version: 3.95

>Number:         53124
>Category:       kern
>Synopsis:       FFS is slow because pmap_update doesn't scale
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jdolecek
>State:          analyzed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 24 08:55:01 +0000 2018
>Closed-Date:    
>Last-Modified:  Mon Jan 13 19:35:01 +0000 2020
>Originator:     mlelstv@serpens.de
>Release:        NetBSD 8.99.12
>Organization:
	NetBSD
>Environment:


System: NetBSD slowpoke 8.99.12 NetBSD 8.99.12 (SLOWPOKE) #27: Tue Mar 20 02:21:41 CET 2018 mlelstv@gossam:/home/netbsd-current/obj.amd64/home/netbsd-current/src/sys/arch/amd64/compile/SLOWPOKE amd64
Architecture: x86_64
Machine: amd64
>Description:
Filesystem I/O is slowed down significantly on systems with many cores.

Setup: Ryzen with 32 logical CPUs (16 cores + HT), 64GB RAM, NVMe disk.

You can read from the raw NVMe disk at about 3GByte/s with 'dd',
as in 'dd if=/dev/rdk0 of=/dev/null bs=1024k'.

However, reading from an FFS filesystem (well aligned, etc.) on
that disk is limited to about 140MB/s.

Even reading the file a second time, when everything is cached
in memory, isn't any faster. The problem is therefore not related
to the physical disk system.

For comparison, a Haswell system with only 4 cores running netbsd-7
with a backported NVMe driver yields 2.2GB/s raw, 1.2GB/s through
the filesystem and 1.7GB/s from cache.

A laptop with an old i5 (dual core + HT) reads a cached file at
about 550MB/s.

Disabling HT on the Ryzen system doubles the speed (filesystem
or cache) to about 330MB/s.


I've started crash(8) to sample kernel stack traces of the dd
process; at least 90% of the time it is executing in pmap_update().

trace: pid 2073 lid 1 at 0xffff8004953cfbc0
pmap_update() at pmap_update+0x26
ubc_alloc() at ubc_alloc+0x51e
ubc_uiomove() at ubc_uiomove+0x8e
ffs_read() at ffs_read+0xd3
VOP_READ() at VOP_READ+0x37
vn_read() at vn_read+0x94
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x1bc
--- syscall (number 3) ---
7bd2cfa3e5fa:


From systat(1) you can see that reading from the filesystem causes
about one TLB shootdown per 8kB read. Broadcasting the shootdown
to all cores is apparently slow, and it takes more time the more
cores you have.

The TLB shootdown code is optimized to skip CPUs that have a
different address space mapped. This can easily be verified by
running N-1 infinite loops while doing the I/O test.

The result is that reading from cache speeds up to 1.8GB/s.

But when those N-1 cores are idle instead, they are all waiting in
the idle loop, which has the kernel address space mapped, so none
of them can be skipped.

>How-To-Repeat:
Do filesystem I/O on a machine with many cores.

>Fix:


>Release-Note:

>Audit-Trail:

State-Changed-From-To: open->analyzed
State-Changed-By: maxv@NetBSD.org
State-Changed-When: Sat, 24 Mar 2018 19:10:03 +0000
State-Changed-Why:
I understand your point about the N-1 user threads. But as far as I
can tell/remember, that's not how things work.

The kernel is mapped in each pmap. When a kernel page is shot down, an
IPI is sent to _all_ CPUs, because the kernel is always mapped
everywhere, and as a result the TLB must be flushed everywhere too.

My guess is that the 'pmap_update' you're talking about actually
touches a user pmap, and not pmap_kernel. User pmaps are used 'lazily':
when a user-lwp -> kern-lwp context switch occurs, the user pmap remains
loaded on the CPU. The reason is that, since the kernel is mapped in
this pmap, we don't need to reload the page tables. Given this, my guess
is that your I/O program gets context-switched on several cores, and
since these cores then switch to the idle thread when your program
leaves, the pmap of your program remains loaded on them. As a result
each page modification in this pmap needs to be synchronized on each
core the program has executed on in the past; hence the IPIs, and the
slowdown.

By using N-1 user threads, you are forcing a kern-lwp -> user-lwp
transition on each core, and after that your pmap does not need to be
synchronized there anymore; so the latency disappears.

But this guess would have to be verified. You should probably try to
assign your program to a given core - and do this early, _before_ your
program starts doing heavy stuff - with schedctl; a pset would be even
better. If I'm right, it should "fix" the slowdown.
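
For instance (a rough sketch only, assuming root privileges and that CPU 0
can be dedicated; error handling omitted), the binding could be done with
pset(3):

	#include <sys/types.h>
	#include <sys/pset.h>
	#include <unistd.h>

	int
	main(void)
	{
		psetid_t psid;

		pset_create(&psid);                     /* create a processor set */
		pset_assign(psid, 0, NULL);             /* move CPU 0 into it */
		pset_bind(psid, P_PID, getpid(), NULL); /* bind this process to it */
		/* run the I/O test from here, confined to CPU 0 */
		return 0;
	}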

(Please CC me in the answer if any)


From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
        maxv@NetBSD.org
Subject: Re: kern/53124 (FFS is slow because pmap_update doesn't scale)
Date: Sat, 24 Mar 2018 23:07:14 +0100

 On Sat, Mar 24, 2018 at 07:10:03PM +0000, maxv@NetBSD.org wrote:

 > My guess is that the 'pmap_update' you're talking about actually
 > touches a user pmap, and not pmap_kernel.

 ubc_alloc touches pmap_kernel when deleting old mappings in the UBC
 window. It is this call that is slow:

 	/*
 	 * Mapping must be removed before the list entry,
 	 * since there is a race with ubc_purge().
 	 */
 	if (umap->flags & UMAP_MAPPING_CACHED) {
 		umap->flags &= ~UMAP_MAPPING_CACHED;
 		mutex_enter(oobj->vmobjlock);
 		pmap_remove(pmap_kernel(), va,
 		    va + ubc_winsize);
 		pmap_update(pmap_kernel());
 		mutex_exit(oobj->vmobjlock);
 	}


 You can easily mitigate the effect by increasing UBC_WINSHIFT, i.e.
 increasing ubc_winsize, for sequential reads.
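
 To put numbers on that (a back-of-the-envelope sketch, not from the
 original mail; it assumes the default UBC_WINSHIFT of 13, i.e. 8kB
 windows, which matches the one-shootdown-per-8kB observation above):

 	#include <stdio.h>

 	int
 	main(void)
 	{
 		const unsigned long gb = 1UL << 30;

 		/* kernel pmap_remove()/pmap_update() rounds per GB read,
 		   roughly one per recycled UBC window of 1 << winshift bytes */
 		for (int winshift = 13; winshift <= 18; winshift++)
 			printf("UBC_WINSHIFT=%d: %3lukB window, %6lu updates/GB\n",
 			    winshift, (1UL << winshift) >> 10, gb >> winshift);
 		return 0;
 	}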


 > By using N-1 user threads, you are forcing a kern-lwp -> user-lwp
 > transition on each core, and after that your pmap does not need to be
 > synchronized there anymore; so the latency disappears.

 Apparently the kernel pmap update isn't synchronized either.


 > But this guess would have to be verified. You should probably try to
 > assign your program to a given core - and this, early, _before_ your
 > program starts doing heavy stuff. schedctl, or pset would be even
 > better. If I'm right, it should "fix" the slowdown.

 Why would the synchronization with other CPUs go away?

 But no, binding the dd process to a single cpu doesn't change anything.

 N.B. even putting cpus offline doesn't change anything.


 Greetings,
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Maxime Villard <max@m00nbsd.net>
To: Michael van Elst <mlelstv@serpens.de>, gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/53124 (FFS is slow because pmap_update doesn't scale)
Date: Sun, 25 Mar 2018 10:00:21 +0200

 Le 24/03/2018 à 23:07, Michael van Elst a écrit :
 > On Sat, Mar 24, 2018 at 07:10:03PM +0000, maxv@NetBSD.org wrote:
 > 
 >> My guess is that the 'pmap_update' you're talking about actually
 >> touches a user pmap, and not pmap_kernel.
 > 
 > ubc_alloc touches pmap_kernel when deleting old mappings in the UBC
 > window. It is this call that is slow:
 > 
 > 	/*
 > 	 * Mapping must be removed before the list entry,
 > 	 * since there is a race with ubc_purge().
 > 	 */
 > 	if (umap->flags & UMAP_MAPPING_CACHED) {
 > 		umap->flags &= ~UMAP_MAPPING_CACHED;
 > 		mutex_enter(oobj->vmobjlock);
 > 		pmap_remove(pmap_kernel(), va,
 > 		    va + ubc_winsize);
 > 		pmap_update(pmap_kernel());
 > 		mutex_exit(oobj->vmobjlock);
 > 	}

 Alright, so my initial guess was wrong.

 > [...]
 >> But this guess would have to be verified. You should probably try to
 >> assign your program to a given core - and this, early, _before_ your
 >> program starts doing heavy stuff. schedctl, or pset would be even
 >> better. If I'm right, it should "fix" the slowdown.
 > 
 > Why would the synchronization with other CPUs go away?

 Because if my initial guess was correct, by assigning your program to a cpu
 you would have guaranteed that your pmap was not loaded on other cpus; and
 as a result you wouldn't have had to send IPIs to them when changing a page
 in your pmap.

 But I may have another guess now. Here the call path is: pmap_remove ->
 pmap_remove_ptes -> pmap_remove_pte -> pmap_tlb_shootdown. You can see that
 the last one gets called only when ((opte & PG_U) != 0). That is, only when
 the page has already been used (read/write) by a core on the machine.

 It would be nice if you could verify, somehow, two things:

   o Does the 'UMAP_MAPPING_CACHED' branch get taken the same way with and
     without your N-1 user threads. If there is a clear difference here, then
     it means the problem is that UBC does not scale. Otherwise:

   o When the 'UMAP_MAPPING_CACHED' branch gets taken, does the 'PG_U' one get
     taken too afterwards with and without your N-1 user threads. If there is a
     clear difference here, then it means that for some reason an idle kernel
     thread on a remote cpu decided to touch the page before you did yourself,
     which resulted in PG_U being set, and the IPIs being sent to everybody.
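
 One way to check both points (a sketch only, not a tested patch) would be
 to hang evcnt(9) counters off the two branches and compare the counts in
 vmstat -e with and without the N-1 busy loops:

 	#include <sys/evcnt.h>

 	static struct evcnt ubc_cached_ev = EVCNT_INITIALIZER(EVCNT_TYPE_MISC,
 	    NULL, "ubc", "cached win unmap");
 	static struct evcnt pgu_shoot_ev = EVCNT_INITIALIZER(EVCNT_TYPE_MISC,
 	    NULL, "pmap", "PG_U shootdown");
 	EVCNT_ATTACH_STATIC(ubc_cached_ev);
 	EVCNT_ATTACH_STATIC(pgu_shoot_ev);

 		/* in ubc_alloc(), inside the UMAP_MAPPING_CACHED branch */
 		ubc_cached_ev.ev_count++;

 		/* in pmap_remove_pte(), next to the (opte & PG_U) shootdown */
 		pgu_shoot_ev.ev_count++;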

 Maxime

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53124 (FFS is slow because pmap_update doesn't scale)
Date: Sun, 25 Mar 2018 10:35:52 -0000 (UTC)

 max@m00nbsd.net (Maxime Villard) writes:

 >  o Does the 'UMAP_MAPPING_CACHED' branch get taken the same way with and
 >    without your N-1 user threads. If there is a clear difference here, then
 >    it means the problem is that UBC does not scale. Otherwise:

 The branch is always taken the same way. It's the same number of TLB
 shootdowns. pmap_update just completes faster when the other CPUs
 are running in userland.

 A TLB shootdown for pmap_kernel is sent to:

                 kcpuset_copy(tp->tp_cpumask, kcpuset_running);

 kcpuset_running is filled by a CPU reaching the idle loop and
 apparently is not cleared (putting a CPU offline should probably
 clear it, but that only happens during ACPI sleep).

 So that should be independent of what pmap is used by the other
 CPUs.

 But maybe it depends on whether a CPU is idle or not. For a test
 I disabled the acpicpu module; this changes machdep.idle-mechanism
 from acpi to halt (why not mwait?).

 As a result, reading from cache sped up from ~145MB/s to 600MB/s
 (16 core with HT) and from ~300MB/s to ~1GB/s (16 core without HT).

 In the first case, running up to 15 infinite loops didn't change
 anything. With 24 infinite loops, we are at ~1GB/s. With 31 loops,
 we are at 1.5GB/s.

 So for one, the acpi idle mechanism has a larger wakeup latency
 for handling the IPI than halt, and the latency for a CPU that is
 running a process is low enough that you don't see the bad scaling.

 One optimization would therefore be to skip idle CPUs when flushing
 the TLB and to catch up when leaving the idle loop. This is not
 trivial, as "leaving" also includes interrupt handlers.
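
 Roughly, the idea would look like this (an invented sketch, not working
 code; the ci_is_idle / ci_tlb_defer fields don't exist, and the race
 between testing for idle and the CPU leaving idle is exactly the
 non-trivial part):

 	/* sender side, instead of sending the IPI to an idle CPU: */
 	if (ci->ci_is_idle) {				/* invented flag */
 		atomic_or_uint(&ci->ci_tlb_defer, 1);	/* note a pending flush */
 		continue;				/* skip the shootdown IPI */
 	}

 	/* target side, on leaving the idle loop - including interrupt entry: */
 	if (atomic_swap_uint(&ci->ci_tlb_defer, 0) != 0)
 		tlbflushg();	/* catch up on everything skipped while idle */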

 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Mon, 02 Apr 2018 19:16:15 +0000
Responsible-Changed-Why:
I'm looking into this.


State-Changed-From-To: analyzed->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Mon, 02 Apr 2018 19:16:15 +0000
State-Changed-Why:
Can you try this patch with the default UBC_WINSHIFT and see if it
makes any difference?
http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff

I'm leaning towards just using the direct map for this, but would like
to try this one first. I was able to boot a system with this patch,
so it seems to work to some limited extent, but of course it might need
further improvements.


State-Changed-From-To: feedback->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 08 Apr 2018 19:14:10 +0000
State-Changed-Why:
With the patch, the system eventually gets stuck in genfs_do_putpages(),
seemingly never getting a page for write. Needs further investigation.


State-Changed-From-To: analyzed->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 08 Apr 2018 21:25:11 +0000
State-Changed-Why:
I have put a revised patch at http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff.
This one seems to work without the system getting stuck. Can you try it
out, please?


From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: 
Subject: Re: kern/53124 (FFS is slow because pmap_update doesn't scale)
Date: Mon, 9 Apr 2018 23:43:43 +0200

 I've updated the patch once more; it now uses emap also for the
 read (fault) path and removes the rest of the pmap_remove() calls.
 Can you please re-check with the updated patch?

 This is still very likely not complete and correct; the faults violate
 the emap produce/consume pattern. Nevertheless, I'd like to test
 whether/how exactly it fails for you, and whether it has the desired
 performance effect.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53124 CVS commit: src/sys
Date: Sat, 19 May 2018 15:03:26 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sat May 19 15:03:26 UTC 2018

 Modified Files:
 	src/sys/arch/amd64/include: pmap.h
 	src/sys/uvm: uvm_page.c uvm_page.h uvm_pmap.h

 Log Message:
 add experimental new function uvm_direct_process(), to allow reads/writes
 of the contents of uvm pages without mapping them into the kernel, using
 the direct map or a moral equivalent; pmaps supporting the interface need
 to provide pmap_direct_process() and define PMAP_DIRECT

 implement the new interface for amd64; I hear alpha and mips might be
 relatively easy to add too, but I lack the knowledge

 part of resolution for PR kern/53124


 To generate a diff of this commit:
 cvs rdiff -u -r1.45 -r1.46 src/sys/arch/amd64/include/pmap.h
 cvs rdiff -u -r1.197 -r1.198 src/sys/uvm/uvm_page.c
 cvs rdiff -u -r1.82 -r1.83 src/sys/uvm/uvm_page.h
 cvs rdiff -u -r1.38 -r1.39 src/sys/uvm/uvm_pmap.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
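
 As an illustration of the interface (a conceptual sketch; the exact
 signatures are assumptions here, not taken from the commit), a read path
 could hand a uiomove-style callback to uvm_direct_process() so the data
 is copied through the direct map without ever touching pmap_kernel():

 	/* hypothetical callback: 'win' is the direct-map view of the page */
 	static int
 	direct_copy_cb(void *win, size_t len, void *arg)
 	{
 		struct uio *uio = arg;

 		/* no kernel mapping is set up, so no TLB shootdown later */
 		return uiomove(win, len, uio);
 	}

 	/* caller side, roughly: copy 'bytes' at 'off' from the cached pages */
 	static int
 	read_via_direct_map(struct vm_page **pgs, u_int npages, voff_t off,
 	    vsize_t bytes, struct uio *uio)
 	{
 		return uvm_direct_process(pgs, npages, off, bytes,
 		    direct_copy_cb, uio);
 	}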

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53124 CVS commit: src/sys/uvm
Date: Sat, 19 May 2018 15:13:26 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sat May 19 15:13:26 UTC 2018

 Modified Files:
 	src/sys/uvm: uvm_bio.c

 Log Message:
 change code to take advantage of the direct map when available, avoiding
 the need to map pages into the kernel

 this improves performance of UBC-based (read(2)/write(2)) I/O, especially
 for cached block I/O - sequential read on my NVMe goes from 1.7 GB/s to
 1.9 GB/s for non-cached, and from 2.2 GB/s to 5.6 GB/s for cached reads

 the new code is conditional and off for now, so that it can be tested
 further; it can be turned on by setting the ubc_direct variable to true

 part of fix for PR kern/53124


 To generate a diff of this commit:
 cvs rdiff -u -r1.94 -r1.95 src/sys/uvm/uvm_bio.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53124 CVS commit: src/sys/uvm
Date: Sat, 19 May 2018 15:18:02 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sat May 19 15:18:02 UTC 2018

 Modified Files:
 	src/sys/uvm: uvm_readahead.c

 Log Message:
 adjust heuristics for read-ahead to skip the full read-ahead when the last
 page of the range is already cached; this speeds up I/O from cache, since
 it avoids the lookup and allocation overhead

 on my system I observed a 4.5% - 15% improvement for cached I/O - from
 2.2 GB/s to 2.3 GB/s for cached reads using non-direct UBC, and from
 5.6 GB/s to 6.5 GB/s for UBC using the direct map

 part of PR kern/53124


 To generate a diff of this commit:
 cvs rdiff -u -r1.9 -r1.10 src/sys/uvm/uvm_readahead.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->analyzed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 22 Aug 2018 06:14:15 +0000
State-Changed-Why:
Needs further stabilization work.


From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53124 CVS commit: src/sys/uvm
Date: Tue, 20 Nov 2018 20:07:20 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Tue Nov 20 20:07:20 UTC 2018

 Modified Files:
 	src/sys/uvm: uvm_bio.c

 Log Message:
 need to use PGO_NOBLOCKALLOC also in the ubc_alloc_direct() case, same
 as the non-direct code - otherwise the code tries to acquire the wapbl
 lock again in genfs_getpages(), and panics due to locking against itself

 towards PR kern/53124


 To generate a diff of this commit:
 cvs rdiff -u -r1.97 -r1.98 src/sys/uvm/uvm_bio.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53124 CVS commit: src/sys/arch/aarch64/include
Date: Tue, 20 Nov 2018 20:53:50 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Tue Nov 20 20:53:50 UTC 2018

 Modified Files:
 	src/sys/arch/aarch64/include: pmap.h

 Log Message:
 Implement PMAP_DIRECT / pmap_direct_process() in support of experimental
 UBC optimizations (compile-tested only for now)

 PR kern/53124


 To generate a diff of this commit:
 cvs rdiff -u -r1.17 -r1.18 src/sys/arch/aarch64/include/pmap.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53124 CVS commit: src/sys/uvm
Date: Sun, 9 Dec 2018 20:45:37 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Sun Dec  9 20:45:37 UTC 2018

 Modified Files:
 	src/sys/uvm: uvm_bio.c

 Log Message:
 for the direct map case, avoid PGO_NOBLOCKALLOC when writing: it makes
 genfs_getpages() return unallocated pages using the zero page and
 PG_RDONLY; the old code relied on the fault logic to get them allocated,
 which the direct case can't rely on

 instead just allocate the blocks right away; pass PGO_JOURNALLOCKED
 so that the code doesn't try to take the wapbl lock, since this code path
 is called with it already held

 this should fix the KASSERT() due to PG_RDONLY on write with wapbl

 towards resolution of PR kern/53124


 To generate a diff of this commit:
 cvs rdiff -u -r1.98 -r1.99 src/sys/uvm/uvm_bio.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53124 CVS commit: src/sys/arch/aarch64/include
Date: Fri, 4 Jan 2019 21:57:53 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Fri Jan  4 21:57:53 UTC 2019

 Modified Files:
 	src/sys/arch/aarch64/include: pmap.h

 Log Message:
 re-apply rev. 1.18, now tested by Jonathan Kollasch and Ryo Shimizu - no
 problems observed, and about 2x speedup for cached read

 Implement PMAP_DIRECT / pmap_direct_process() in support of experimental
 UBC optimization

 PR kern/53124


 To generate a diff of this commit:
 cvs rdiff -u -r1.19 -r1.20 src/sys/arch/aarch64/include/pmap.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/53124 CVS commit: src/sys/uvm
Date: Mon, 7 Jan 2019 22:48:01 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Mon Jan  7 22:48:01 UTC 2019

 Modified Files:
 	src/sys/uvm: uvm_meter.c uvm_page.h

 Log Message:
 add sysctl to easily set ubc_direct

 PR kern/53124


 To generate a diff of this commit:
 cvs rdiff -u -r1.68 -r1.69 src/sys/uvm/uvm_meter.c
 cvs rdiff -u -r1.83 -r1.84 src/sys/uvm/uvm_page.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
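
 (For testing, the switch can then presumably be flipped at run time with
 sysctl(8), e.g. 'sysctl -w vm.ubc_direct=1'; the vm.ubc_direct node name
 is an assumption here, based on uvm_meter.c handling the "vm" sysctl
 subtree.)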

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53124 (FFS is slow because pmap_update doesn't scale)
Date: Mon, 13 Jan 2020 19:30:27 +0000

 I'm looking into fixing the issues with ubc_direct, and trying to figure out
 what to do with ACPI low power idle.  That said, due to a recent reduction in
 the cost of TLB shootdowns on x86, this should be noticeably improved in
 -current, especially if you use drvctl to detach all "acpicpu" devices, which
 turns off ACPI idle.

 Andrew

>Unformatted:
