NetBSD Problem Report #56170
From woods@xentastic.weird.com Fri May 14 20:43:29 2021
Return-Path: <woods@xentastic.weird.com>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id BF3D71A9245
for <gnats-bugs@gnats.NetBSD.org>; Fri, 14 May 2021 20:43:29 +0000 (UTC)
Message-Id: <20210514204326.17C52511CA@xentastic.weird.com>
Date: Fri, 14 May 2021 13:43:25 -0700 (PDT)
From: "Greg A. Woods" <woods@planix.ca>
Reply-To: "Greg A. Woods" <woods@planix.ca>
To: gnats-bugs@NetBSD.org
Subject: NFS-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
X-Send-Pr-Version: 3.95
>Number: 56170
>Category: kern
>Synopsis: NFS+gcc-ASAN-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri May 14 20:45:00 +0000 2021
>Closed-Date:
>Last-Modified: Wed Jun 16 07:02:07 +0000 2021
>Originator: Greg A. Woods
>Release: NetBSD 9.99.81
>Organization:
Planix, Inc.; Kelowna, BC; Canada
>Environment:
System: NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #16: Thu May 6 13:40:07 PDT 2021 woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
I've been trying out the GCC sanitizers on one of my recently
favourite little projects, and I've found I can reliably crash
NetBSD with one of the tests, when it is compiled with
USE_ASAN=yes, at least when it is run with $PWD on an NFS mount.
Here is the console output from an example crash:
[ 663.0426878] Mutex error: mutex_vector_enter,543: locking against myself
[ 663.0426878] lock address : 0xffffc8800b962b00
[ 663.0426878] current cpu : 1
[ 663.0426878] current lwp : 0xffffc8800b9db1c0
[ 663.0426878] owner field : 0xffffc8800b9db1c0 wait/spin: 0/0
[ 663.0426878] panic: lock error: Mutex: mutex_vector_enter,543: locking against myself: lock 0xffffc8800b00b9db1c0
[ 663.0426878] cpu1: Begin traceback...
[ 663.0426878] vpanic() at netbsd:vpanic+0x14a
[ 663.0426878] snprintf() at netbsd:snprintf
[ 663.0426878] lockdebug_abort() at netbsd:lockdebug_abort+0xcd
[ 663.0426878] mutex_vector_enter() at netbsd:mutex_vector_enter+0x406
[ 663.0426878] sigpending1() at netbsd:sigpending1+0x24
[ 663.0527222] nfs_sigintr() at netbsd:nfs_sigintr+0x2c
[ 663.0527222] nfs_rcvlock() at netbsd:nfs_rcvlock+0xaf
[ 663.0527222] nfs_request() at netbsd:nfs_request+0x40d
[ 663.0527222] nfs_access() at netbsd:nfs_access+0x1d4
[ 663.0527222] VOP_ACCESS() at netbsd:VOP_ACCESS+0x55
[ 663.0527222] getcwd_common() at netbsd:getcwd_common+0x251
[ 663.0527222] vnode_to_path() at netbsd:vnode_to_path+0xbb
[ 663.0527222] sysctl_vmproc() at netbsd:sysctl_vmproc+0x6cd
[ 663.0527222] sysctl_dispatch() at netbsd:sysctl_dispatch+0xa5
[ 663.0527222] sys___sysctl() at netbsd:sys___sysctl+0xc5
[ 663.0527222] syscall() at netbsd:syscall+0x9c
[ 663.0527222] --- syscall (number 202) ---
[ 663.0527222] netbsd:syscall+0x9c:
[ 663.0527222] cpu1: End traceback...
[ 663.0527222] fatal breakpoint trap in supervisor mode
[ 663.0527222] trap type 1 code 0 rip 0xffffffff8023e93d cs 0xe030 rflags 0x202 cr2 0x7f7ff6892ce0 ilevel
[ 663.0527222] curlwp 0xffffc8800b9db1c0 pid 6987.6987 lowest kstack 0xffffc880ef49a2c0
Stopped in pid 6987.6987 (yajl_test) at netbsd:breakpoint+0x5: leave
ds e650
es e600
fs e640
gs 10
rdi 0
rsi 1
rbp ffffc880ef49e640
rbx ffffffff80ed2f50 mutex_adaptive_lockops
rdx 2
rcx 0
rax 0
r8 ffffffff80ed2f50 mutex_adaptive_lockops
r9 1
r10 0
r11 fffffffe
r12 104
r13 ffffffff80d43960 ostype+0xa6448
r14 ffffc880ef49e688
r15 ffffffff80d3c46b ostype+0x9ef53
rip ffffffff8023e93d breakpoint+0x5
cs e030
rflags 202
rsp ffffc880ef49e640
ss e02b
netbsd:breakpoint+0x5: leave
db{1}> (XEN) [2021-05-14 18:09:45.682] Watchdog timer fired for domain 0
(XEN) [2021-05-14 18:09:45.682] Hardware Dom0 shutdown: watchdog rebooting machine
(I guess ddb.onpanic=1 and the Xen watchdog aren't very useful
together!)
>How-To-Repeat:
I don't yet have an isolated example test, but running the
regression tests in my robohack/yajl project, and in particular
the "ap_eof_str" test, with USE_ASAN=yes and with the source and
build on an NFS mount (which I'm only guessing about because of
the nfs_*() calls in the kernel stack backtrace), has reliably
reproduced this crash for me:
$ cd /some/NFS/mountpoint
$ git clone https://github.com/robohack/yajl
$ cd yajl
$ mkdir build
$ MAKEOBJDIRPREFIX=$(/bin/pwd)/build make regress USE_ASAN=yes MKDOC=no
If I understand correctly the system call involved here is
sysctl(2), and that there's something to do with proc too, but
I'm quite unfamiliar with ASAN runtime internals so I don't know
what it's doing to cause this, especially since a couple of
other tests have already run when this one crashes. I do know
that ASAN will check to make sure ASLR is not enabled, and it
will also mmap() something somewhere really high up and it fails
unless you do "ulimit -v unlimited" first.
If necessary I can try in a domU, or disable the Xen watchdog
for the dom0 (as otherwise I only have 20 seconds before the
reboot!), and try the crash again and do more DDB digging if
someone can guide me along. And/Or I can change what's in
ddb.commandonenter too...
>Fix:
>Release-Note:
>Audit-Trail:
From: Christos Zoulas <christos@zoulas.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org,
gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: kern/56170: NFS-related: panic: lock error: Mutex:
mutex_vector_enter,543: locking against myself
Date: Fri, 14 May 2021 17:24:29 -0400
--Apple-Mail=_8E4D82D2-4B57-482B-9A3A-6F9F00CB463A
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=us-ascii
https://www.netbsd.org/~christos/nfs.diff for a disgusting hack I am =
using to avoid this.
christos
> On May 14, 2021, at 4:45 PM, Greg A. Woods <woods@planix.ca> wrote:
>=20
>> Number: 56170
>> Category: kern
>> Synopsis: NFS+gcc-ASAN-related: panic: lock error: Mutex: =
mutex_vector_enter,543: locking against myself
>> Confidential: no
>> Severity: serious
>> Priority: medium
>> Responsible: kern-bug-people
>> State: open
>> Class: sw-bug
>> Submitter-Id: net
>> Arrival-Date: Fri May 14 20:45:00 +0000 2021
>> Originator: Greg A. Woods
>> Release: NetBSD 9.99.81
>> Organization:
> Planix, Inc.; Kelowna, BC; Canada
>> Environment:
> System: NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #16: Thu =
May 6 13:40:07 PDT 2021 =
woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/s=
ys/arch/amd64/compile/XEN3_DOM0 amd64
> Architecture: x86_64
> Machine: amd64
>> Description:
>=20
> I've been trying out the GCC sanitizers on one of my recently
> favourite little projects, and I've found I can reliably crash
> NetBSD with one of the tests, when it is compiled with
> USE_ASAN=3Dyes, at least when it is run with $PWD on an NFS =
mount.
>=20
> Here is the console output from an example crash:
>=20
>=20
> [ 663.0426878] Mutex error: mutex_vector_enter,543: locking against =
myself
>=20
> [ 663.0426878] lock address : 0xffffc8800b962b00
> [ 663.0426878] current cpu : 1
> [ 663.0426878] current lwp : 0xffffc8800b9db1c0
> [ 663.0426878] owner field : 0xffffc8800b9db1c0 wait/spin: =
0/0
>=20
> [ 663.0426878] panic: lock error: Mutex: mutex_vector_enter,543: =
locking against myself: lock 0xffffc8800b00b9db1c0
> [ 663.0426878] cpu1: Begin traceback...
> [ 663.0426878] vpanic() at netbsd:vpanic+0x14a
> [ 663.0426878] snprintf() at netbsd:snprintf
> [ 663.0426878] lockdebug_abort() at netbsd:lockdebug_abort+0xcd
> [ 663.0426878] mutex_vector_enter() at netbsd:mutex_vector_enter+0x406
> [ 663.0426878] sigpending1() at netbsd:sigpending1+0x24
> [ 663.0527222] nfs_sigintr() at netbsd:nfs_sigintr+0x2c
> [ 663.0527222] nfs_rcvlock() at netbsd:nfs_rcvlock+0xaf
> [ 663.0527222] nfs_request() at netbsd:nfs_request+0x40d
> [ 663.0527222] nfs_access() at netbsd:nfs_access+0x1d4
> [ 663.0527222] VOP_ACCESS() at netbsd:VOP_ACCESS+0x55
> [ 663.0527222] getcwd_common() at netbsd:getcwd_common+0x251
> [ 663.0527222] vnode_to_path() at netbsd:vnode_to_path+0xbb
> [ 663.0527222] sysctl_vmproc() at netbsd:sysctl_vmproc+0x6cd
> [ 663.0527222] sysctl_dispatch() at netbsd:sysctl_dispatch+0xa5
> [ 663.0527222] sys___sysctl() at netbsd:sys___sysctl+0xc5
> [ 663.0527222] syscall() at netbsd:syscall+0x9c
> [ 663.0527222] --- syscall (number 202) ---
> [ 663.0527222] netbsd:syscall+0x9c:
> [ 663.0527222] cpu1: End traceback...
> [ 663.0527222] fatal breakpoint trap in supervisor mode
> [ 663.0527222] trap type 1 code 0 rip 0xffffffff8023e93d cs 0xe030 =
rflags 0x202 cr2 0x7f7ff6892ce0 ilevel
>=20
> [ 663.0527222] curlwp 0xffffc8800b9db1c0 pid 6987.6987 lowest kstack =
0xffffc880ef49a2c0
> Stopped in pid 6987.6987 (yajl_test) at netbsd:breakpoint+0x5: leave
> ds e650
> es e600
> fs e640
> gs 10
> rdi 0
> rsi 1
> rbp ffffc880ef49e640
> rbx ffffffff80ed2f50 mutex_adaptive_lockops
> rdx 2
> rcx 0
> rax 0
> r8 ffffffff80ed2f50 mutex_adaptive_lockops
> r9 1
> r10 0
> r11 fffffffe
> r12 104
> r13 ffffffff80d43960 ostype+0xa6448
> r14 ffffc880ef49e688
> r15 ffffffff80d3c46b ostype+0x9ef53
> rip ffffffff8023e93d breakpoint+0x5
> cs e030
> rflags 202
> rsp ffffc880ef49e640
> ss e02b
> netbsd:breakpoint+0x5: leave
> db{1}> (XEN) [2021-05-14 18:09:45.682] Watchdog timer fired for domain =
0
> (XEN) [2021-05-14 18:09:45.682] Hardware Dom0 shutdown: watchdog =
rebooting machine
>=20
> (I guess ddb.onpanic=3D1 and the Xen watchdog aren't very useful
> together!)
>=20
>=20
>> How-To-Repeat:
>=20
> I don't yet have an isolated example test, but running the
> regression tests in my robohack/yajl project, and in particular
> the "ap_eof_str" test, with USE_ASAN=3Dyes and with the source =
and
> build on an NFS mount (which I'm only guessing about because of
> the nfs_*() calls in the kernel stack backtrace), has reliably
> reproduced this crash for me:
>=20
> $ cd /some/NFS/mountpoint
> $ git clone https://github.com/robohack/yajl
> $ cd yajl
> $ mkdir build
> $ MAKEOBJDIRPREFIX=3D$(/bin/pwd)/build make regress USE_ASAN=3Dyes=
MKDOC=3Dno
>=20
> If I understand correctly the system call involved here is
> sysctl(2), and that there's something to do with proc too, but
> I'm quite unfamiliar with ASAN runtime internals so I don't know
> what it's doing to cause this, especially since a couple of
> other tests have already run when this one crashes. I do know
> that ASAN will check to make sure ASLR is not enabled, and it
> will also mmap() something somewhere really high up and it fails
> unless you do "ulimit -v unlimited" first.
>=20
> If necessary I can try in a domU, or disable the Xen watchdog
> for the dom0 (as otherwise I only have 20 seconds before the
> reboot!), and try the crash again and do more DDB digging if
> someone can guide me along. And/Or I can change what's in
> ddb.commandonenter too...
>=20
>> Fix:
>=20
>> Unformatted:
> 2021-03-10T23:08:13Z
--Apple-Mail=_8E4D82D2-4B57-482B-9A3A-6F9F00CB463A
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename=signature.asc
Content-Type: application/pgp-signature;
name=signature.asc
Content-Description: Message signed with OpenPGP
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org
iF0EARECAB0WIQS+BJlbqPkO0MDBdsRxESqxbLM7OgUCYJ7qjQAKCRBxESqxbLM7
Ot/uAJ9UtJFkEo+iV50fRvSqZLuVg1TJ+wCdHCLtwUtGSLYt/9ufHziixqEcBGc=
=Mxp+
-----END PGP SIGNATURE-----
--Apple-Mail=_8E4D82D2-4B57-482B-9A3A-6F9F00CB463A--
From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: woods@planix.ca
Subject: Re: kern/56170: NFS+gcc-ASAN-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
Date: Sat, 15 May 2021 09:05:13 +0000
This is a multi-part message in MIME format.
--=_eoavAYPm/lvdxE8JbrWxVr5OLxSgiJLT
The disgusting hack is for a different issue.
The problem here is that sysctl vm.proc.$pid.<VM_PROC_MAP> holds
p->p_lock over vnode_to_path, which is forbidden. It should hold a
reference that prevents the proc from going away, but not p_lock, like
procfs_rw does in procfs_subr.c.
The attached patch tries to address the issue, and should be
pullable-up to 9 if it works, but I haven't tested it.
(That said, this code is grody, and we should factor out the common
parts this less grodily. It looks like sysctl proc.$pid.cwd might
have the same issue -- fill_cwd in sys/kern/kern_proc.c.)
--=_eoavAYPm/lvdxE8JbrWxVr5OLxSgiJLT
Content-Type: text/plain; charset="ISO-8859-1"; name="sysctlvmmap"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="sysctlvmmap.patch"
diff -r 461e169ca34e sys/uvm/uvm_map.c
--- a/sys/uvm/uvm_map.c Thu Apr 29 09:27:29 2021 +0000
+++ b/sys/uvm/uvm_map.c Sat May 15 01:06:25 2021 +0000
@@ -5217,6 +5217,7 @@ fill_vmentries(struct lwp *l, pid_t pid,
size_t *oldlenp)
{
int error;
+ struct lwp *l1 =3D NULL;
struct proc *p;
struct kinfo_vmentry *vme;
struct vmspace *vm;
@@ -5224,6 +5225,7 @@ fill_vmentries(struct lwp *l, pid_t pid,
struct vm_map_entry *entry;
char *dp;
size_t count, vmesize;
+ bool unlock =3D false;
=20
if (elem_size =3D=3D 0 || elem_size > 2 * sizeof(*vme))
return EINVAL;
@@ -5238,15 +5240,47 @@ fill_vmentries(struct lwp *l, pid_t pid,
} else
vmesize =3D 0;
=20
+ /*
+ * Find the process by pid and lock it -- unless pid is -1, in
+ * which case we get curproc, unlocked.
+ */
if ((error =3D proc_find_locked(l, &p, pid)) !=3D 0)
return error;
-
+ KASSERT(pid =3D=3D -1 || mutex_owned(p->p_lock));
+ unlock =3D true;
+
+ /*
+ * Initialize the loop state now, so we can back out what we've
+ * done if need be.
+ */
+ vm =3D NULL;
vme =3D NULL;
count =3D 0;
=20
+ /* Get a reference to the VM space. */
if ((error =3D proc_vmspace_getref(p, &vm)) !=3D 0)
goto out;
=20
+ /*
+ * Find a non-zombie lwp in the process so we can grab a
+ * reference to it and release p->p_lock while we work -- that
+ * way it won't go away but we can reach into subsystems that
+ * need to take the lock (and we don't hoard the lock for long
+ * durations).
+ */
+ LIST_FOREACH(l1, &p->p_lwps, l_sibling) {
+ if (l1->l_stat !=3D LSZOMB)
+ break;
+ }
+ if (l1 =3D=3D NULL) {
+ error =3D ESRCH;
+ goto out;
+ }
+ lwp_addref(l1);
+ if (pid !=3D -1)
+ mutex_exit(p->p_lock);
+ unlock =3D false;
+
map =3D &vm->vm_map;
vm_map_lock_read(map);
=20
@@ -5263,11 +5297,15 @@ fill_vmentries(struct lwp *l, pid_t pid,
}
count++;
}
+
vm_map_unlock_read(map);
- uvmspace_free(vm);
=20
out:
- if (pid !=3D -1)
+ if (vm !=3D NULL)
+ uvmspace_free(vm);
+ if (l1 !=3D NULL)
+ lwp_delref(l1);
+ if (unlock && pid !=3D -1)
mutex_exit(p->p_lock);
if (error =3D=3D 0) {
const u_int esize =3D uimin(sizeof(*vme), elem_size);
--=_eoavAYPm/lvdxE8JbrWxVr5OLxSgiJLT--
State-Changed-From-To: open->feedback
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sat, 15 May 2021 09:13:39 +0000
State-Changed-Why:
patch provided
From: Rin Okuyama <rokuyama.rk@gmail.com>
To: gnats-bugs@netbsd.org, Christos Zoulas <christos@zoulas.com>
Cc:
Subject: Re: kern/56170: NFS+gcc-ASAN-related: panic: lock error: Mutex:
mutex_vector_enter,543: locking against myself
Date: Sat, 15 May 2021 21:02:25 +0900
On 2021/05/15 18:10, Taylor R Campbell wrote:
> The disgusting hack is for a different issue.
I just sent a new PR for this problem: kern/56177.
Thank you christos for your patch!
rin
From: "Greg A. Woods" <woods@planix.ca>
To: NetBSD GNATS <gnats-bugs@NetBSD.org>
Cc:
Subject: Re: kern/56170: NFS+gcc-ASAN-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
Date: Sun, 16 May 2021 22:50:32 -0700
--pgp-sign-Multipart_Sun_May_16_22:50:08_2021-1
Content-Type: text/plain; charset=US-ASCII
At Sat, 15 May 2021 09:05:13 +0000, Taylor R Campbell <riastradh@NetBSD.org> wrote:
Subject: Re: kern/56170: NFS+gcc-ASAN-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
>
> The problem here is that sysctl vm.proc.$pid.<VM_PROC_MAP> holds
> p->p_lock over vnode_to_path, which is forbidden. It should hold a
> reference that prevents the proc from going away, but not p_lock, like
> procfs_rw does in procfs_subr.c.
>
> The attached patch tries to address the issue, and should be
> pullable-up to 9 if it works, but I haven't tested it.
Yes! Thanks! This does indeed seem to fix the problem I experienced.
It took me all too much fiddly time to reproduce the bug in a Xen domU,
but after booting the patched kernel I was able to run my test in a loop
for nearly 24 hours without any crash or any other noticeable problem.
(there was not much other activity, except /etc/daily I guess)
> (That said, this code is grody, and we should factor out the common
> parts this less grodily. It looks like sysctl proc.$pid.cwd might
> have the same issue -- fill_cwd in sys/kern/kern_proc.c.)
Indeed.
The other hack Christos mentioned covers only one of what look like many
similar code paths in the NFS code, so either the fix is very incomplete
or else the problem is similarly elsewhere.
--
Greg A. Woods <gwoods@acm.org>
Kelowna, BC +1 250 762-7675 RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Avoncote Farms <woods@avoncote.ca>
--pgp-sign-Multipart_Sun_May_16_22:50:08_2021-1
Content-Type: application/pgp-signature
Content-Transfer-Encoding: 7bit
Content-Description: OpenPGP Digital Signature
-----BEGIN PGP SIGNATURE-----
iF0EABECAB0WIQRuK6dmwVAucmRxuh9mfXG3eL/0fwUCYKIEFAAKCRBmfXG3eL/0
f7/3AJ0anptkEwEpmkbHHEivsmZUbVRwCwCeJ4B3tkQcqFFL4Sx5cLx3nM9D75w=
=PZ/e
-----END PGP SIGNATURE-----
--pgp-sign-Multipart_Sun_May_16_22:50:08_2021-1--
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Wed, 16 Jun 2021 07:02:07 +0000
State-Changed-Why:
patch works, time to commit
>Unformatted:
2021-03-10T23:08:13Z
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.