NetBSD Problem Report #53817
From gson@gson.org Sat Dec 29 21:32:29 2018
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 93DF27A158
for <gnats-bugs@gnats.NetBSD.org>; Sat, 29 Dec 2018 21:32:29 +0000 (UTC)
Message-Id: <20181229213222.AA18698930C@guava.gson.org>
Date: Sat, 29 Dec 2018 23:32:22 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Random panics in vfs_mountroot()
X-Send-Pr-Version: 3.95
>Number: 53817
>Category: kern
>Synopsis: Random panics in vfs_mountroot()
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kamil
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Dec 29 21:35:00 +0000 2018
>Closed-Date: Wed Jan 23 13:39:55 +0000 2019
>Last-Modified: Wed Jan 23 13:40:00 +0000 2019
>Originator: Andreas Gustafsson
>Release: NetBSD-current
>Organization:
>Environment:
System: NetBSD
Architecture: aarch64
Machine: evbarm
>Description:
The automated evbarm-aarch64 test runs on lyta.netbsd.org have ended
with a panic in vfs_mountroot in 36 out of 98 runs:
lyta /bracket/evbarm-aarch64/results/2018 $ zgrep vfs_mountroot */test.log.gz
2018.10.26.22.10.15/test.log.gz:[ 2.1539321] fp ffffffc000979e00 vfs_mountroot() at ffffffc00043b524 netbsd:vfs_mountroot+0x214
2018.10.27.13.20.21/test.log.gz:[ 2.3229166] fp ffffffc000979e00 vfs_mountroot() at ffffffc00043b4f4 netbsd:vfs_mountroot+0x214
2018.10.28.00.44.37/test.log.gz:[ 2.1098840] fp ffffffc000979e00 vfs_mountroot() at ffffffc00043b4f4 netbsd:vfs_mountroot+0x214
[...]
2018.12.24.08.40.33/test.log.gz:[ 2.4557637] fp ffffffc000b8adf0 vfs_mountroot() at ffffffc0004581a4 netbsd:vfs_mountroot+0x214
2018.12.24.22.47.19/test.log.gz:[ 2.3375370] fp ffffffc000b8bdf0 vfs_mountroot() at ffffffc00045a0c4 netbsd:vfs_mountroot+0x214
2018.12.25.11.56.14/test.log.gz:[ 2.4641909] fp ffffffc000b8bdf0 vfs_mountroot() at ffffffc00045a0c4 netbsd:vfs_mountroot+0x214
2018.12.27.04.54.03/test.log.gz:[ 2.4717805] fp ffffffc000b8bdf0 vfs_mountroot() at ffffffc00045a054 netbsd:vfs_mountroot+0x214
Console output from the first recorded failure:
[ 2.0451998] mountroot: trying ffs...
[ 2.0561633] root file system type: ffs
[ 2.0561633] panic: Trap: Data Abort (EL1): Translation Fault L1 with read access for 0000000000000070: pc ffffffc00043b524: opcode f9403800: ldr x0, [x0,#112]
[ 2.0672672] cpu0: Begin traceback...
[ 2.0672672] trace fp ffffffc000979b50
[ 2.0773923] fp ffffffc000979b70 vpanic() at ffffffc0003efd00 netbsd:vpanic+0x190
[ 2.0773923] fp ffffffc000979bd0 panic() at ffffffc0003efdcc netbsd:panic+0x44
[ 2.0889589] fp ffffffc000979c60 data_abort_handler() at ffffffc00005fcf8 netbsd:data_abort_handler+0x480
[ 2.0889589] tf ffffffc000979cd0 el1_trap() at ffffffc00005d3f8 netbsd:el1_trap
[ 2.1014251] ---- trapframe 0xffffffc000979cd0 (304 bytes) ----
[ 2.1014251] pc=ffffffc00043b524, spsr=0000000060000005
[ 2.1014251] esr=0000000096000005, far=0000000000000070
[ 2.1114037] x0=0000000000000000, x1=ffff00005fab2bd0
[ 2.1114037] x2=0000000000000004, x3=00000000000000f0
[ 2.1114037] x4=ffffffc000820558, x5=0000000000000000
[ 2.1114037] x6=0000000000000004, x7=0000000000000001
[ 2.1225739] x8=0000000000000004, x9=ffffffc000979da0
[ 2.1225739] x10=ffff00005f9f9f58, x11=000000000000003f
[ 2.1225739] x12=fffffc00017e7e7c, x13=fffffc00017e7e80
[ 2.1225739] x14=ffffffffffffffe8, x15=ffff00005f9fa000
[ 2.1338631] x16=0000000000000021, x17=0000000000000083
[ 2.1338631] x18=0000000000001000, x19=0000000000000000
[ 2.1338631] x20=ffffffc000a1c000, x21=ffffffc000a8bce8
[ 2.1338631] x22=ffffffc000854a00, x23=ffffffc0006ef800
[ 2.1449961] x24=ffffffc000856000, x25=0000000000000000
[ 2.1449961] x26=ffffffc000a1c000, x27=ffffffc000971000
[ 2.1449961] x28=0000000000000000, fp=x29=ffffffc000979e00
[ 2.1449961] lr=x30=ffffffc00043b518, sp=ffffffc000979e00
[ 2.1449961] ------------------------------------------------
[ 2.1539321] fp ffffffc000979e00 vfs_mountroot() at ffffffc00043b524 netbsd:vfs_mountroot+0x214
[ 2.1539321] fp ffffffc000979e60 main() at ffffffc00050f3e4 netbsd:main+0x44c
[ 2.1539321] fp 0000000000000000 aarch64_start() at ffffffc00000183c netbsd:aarch64_start+0x103c
[ 2.1675298] cpu0: End traceback...
Stopped in pid 0.1 (system) at netbsd:cpu_Debugger+0x4: ret
Disassembling around the faulting instruction (0x214 is decimal 532):
0xffffffc000443824 <vfs_mountroot+508>: bl 0xffffffc000451210 <vref>
0xffffffc000443828 <vfs_mountroot+512>: ldr x0, [x21, #80]
0xffffffc00044382c <vfs_mountroot+516>: bl 0xffffffc00045d388 <VOP_UNLOCK>
0xffffffc000443830 <vfs_mountroot+520>: str xzr, [x22, #8]
0xffffffc000443834 <vfs_mountroot+524>: ldr x0, [x20, #2376]
0xffffffc000443838 <vfs_mountroot+528>: ldr x1, [x21, #80]
0xffffffc00044383c <vfs_mountroot+532>: ldr x0, [x0, #112]
0xffffffc000443840 <vfs_mountroot+536>: str x1, [x0]
0xffffffc000443844 <vfs_mountroot+540>: ldr x0, [x20, #2376]
0xffffffc000443848 <vfs_mountroot+544>: ldr x0, [x0, #112]
0xffffffc00044384c <vfs_mountroot+548>: ldr x0, [x0]
0xffffffc000443850 <vfs_mountroot+552>: bl 0xffffffc000451210 <vref>
This looks like the source line
/*
* Now that root is mounted, we can fixup initproc's CWD
* info. All other processes are kthreads, which merely
* share proc0's CWD info.
*/
initproc->p_cwdi->cwdi_cdir = rootvnode;
and indeed the offset #112 in the faulting instruction matches the offset
of the p_cwdi field in struct proc. Also, the faulting address is 0x70,
which is at a 112 bytes offset from a null pointer.
This looks like a race condition to me: main() calls fork1() to create
process 1 and then calls vfs_mountroot() which uses the global
variable initproc, but initproc is initialized in start_init() which
is executed by process 1, so it may or may not have been initialized
by the time it is used by vfs_mountroot().
I suspect the bug may have been introduced by:
revision 1.496
date: 2018-04-16 17:18:16 +0300; author: kamil; state: Exp; lines: +5 -3; commitid: YY8XhMArR8bEOFyA;
Set initproc inside start_init()
This allows us to stop using the rnewprocp argument in fork1(9).
The rnewprocp argument will be removed soon from the API, as it can cause
use-after-free scenarios.
No functional change intended.
Noted by <Mateusz Guzik>
Reference: http://mail-index.netbsd.org/tech-kern/2017/09/08/msg022267.html
Sponsored by <The NetBSD Foundation>
Although I have only seen this happen on evbarm-aarch64 under qemu,
the bug looks machine independent, so I'm filing it in category kern.
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->kamil
Responsible-Changed-By: kamil@NetBSD.org
Responsible-Changed-When: Wed, 23 Jan 2019 12:42:33 +0100
Responsible-Changed-Why:
Take.
State-Changed-From-To: open->closed
State-Changed-By: kamil@NetBSD.org
State-Changed-When: Wed, 23 Jan 2019 14:39:55 +0100
State-Changed-Why:
Fixed in src/sys/kern/init_main.c r1.502
From: "Kamil Rytarowski" <kamil@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/53817 CVS commit: src/sys/kern
Date: Wed, 23 Jan 2019 13:38:30 +0000
Module Name: src
Committed By: kamil
Date: Wed Jan 23 13:38:30 UTC 2019
Modified Files:
src/sys/kern: init_main.c
Log Message:
Change the place of initproc initialization
The initproc variable cannot be initialized in start_init as there
is a race between vfs_mountroot and start_init.
PR kern/53817 by Andreas Gustafsson
To generate a diff of this commit:
cvs rdiff -u -r1.501 -r1.502 src/sys/kern/init_main.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.