NetBSD Problem Report #53074
From brad@anduin.eldar.org Mon Mar 5 18:09:41 2018
Return-Path: <brad@anduin.eldar.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id C9F887A1DA
for <gnats-bugs@gnats.NetBSD.org>; Mon, 5 Mar 2018 18:09:41 +0000 (UTC)
Message-Id: <201803051809.w25I9cc3021569@anduin.eldar.org>
Date: Mon, 5 Mar 2018 13:09:38 -0500 (EST)
From: brad@anduin.eldar.org
Reply-To: brad@anduin.eldar.org
To: gnats-bugs@NetBSD.org
Subject: Daily 8.99.12 DOMU panic
X-Send-Pr-Version: 3.95
>Number: 53074
>Category: port-xen
>Synopsis: Daily 8.99.12 DOMU panic
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: jdolecek
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Mar 05 18:10:00 +0000 2018
>Closed-Date: Wed Aug 22 06:11:13 +0000 2018
>Last-Modified: Wed Aug 22 06:11:13 +0000 2018
>Originator: Brad Spencer
>Release: NetBSD 8.99.12
>Organization:
Eldar.org
>Environment:
System: NetBSD varda.nat.eldar.org 8.99.12 NetBSD 8.99.12 (XEN3_DOMU_ISCSI) #0: Thu Mar 1 22:43:01 EST 2018 brad@gimli.nat.eldar.org:/usr/src/sys/arch/amd64/compile/XEN3_DOMU_ISCSI amd64
Architecture: x86_64
Machine: amd64
>Description:
On a testing 64 bit DOMU running NetBSD 8.99.12 against a NetBSD 7.1_STABLE DOM0 I have a daily panic. The ddb traceback is as follows:
panic: kernel diagnostic assertion "entry > NR_RESERVED_ENTRIES" failed: file "../../../../arch/xen/xen/xengnt.c", line 264
cpu0: Begin traceback...
?() at ffffffff804f74b8
?() at ffffffff80600b95
?() at ffffffff802139eb
?() at ffffffff8022eb90
?() at ffffffff80212a48
?() at ffffffff80203119
?() at ffffffff804cd598
?() at ffffffff804b2ae6
?() at ffffffff8020f1a6
cpu0: End traceback...
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff802057a5 cs 0xe030 rflags 0x202 cr2 0x7eb70853b020 ilevel 0x8 rsp 0xffffa0002c28fae8
curlwp 0xffffa00000c681a0 pid 10930.1 lowest kstack 0xffffa0002c28c2c0
Stopped in pid 10930.1 (ssh) at ffffffff802057a5: leave
ds 0
es faf8
fs fa98
gs fae8
rdi 6
rsi a
rbp ffffa0002c28fae8
rbx 104
rdx 1
rcx 0
rax 1
r8 ffffffff807373c0
r9 0
r10 75
r11 ffffa00000740048
r12 ffffffff80648e28
r13 ffffa0002c28fb30
r14 ffffa00000960000
r15 ffffa00000a57008
rip ffffffff802057a5
cs e030
rflags 202
rsp ffffa0002c28fae8
ss e02b
ffffffff802057a5: leave
db{0}>
The kernel used by this particular DOMU is just XEN3_DOMU plus the
iscsi support compiled in because module loading does not appear to
work. The DOM0 is NetBSD 7.1_STABLE running Xen 4.5.1 in 64 bit.
>How-To-Repeat:
The DOMU lasts almost exactly 24 hours at a time. Just wait and it
will happen again. I can provide more information if it is desired.
>Fix:
No idea on the fix, just reporting.
>Release-Note:
>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Mon, 5 Mar 2018 19:59:38 +0100
On Mon, Mar 05, 2018 at 06:10:00PM +0000, brad@anduin.eldar.org wrote:
> panic: kernel diagnostic assertion "entry > NR_RESERVED_ENTRIES" failed: file "../../../../arch/xen/xen/xengnt.c", line 264
> cpu0: Begin traceback...
> ?() at ffffffff804f74b8
> ?() at ffffffff80600b95
> ?() at ffffffff802139eb
> ?() at ffffffff8022eb90
> ?() at ffffffff80212a48
> ?() at ffffffff80203119
> ?() at ffffffff804cd598
> ?() at ffffffff804b2ae6
> ?() at ffffffff8020f1a6
Could you check the value of entry in this case ?
Also could you match the addresses in the stack trace to functions ?
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: coypu@sdf.org
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Tue, 6 Mar 2018 01:49:26 +0000
You can parse the backtrace like so (normally, it would report the
functions, but that is a bug).
gdb /my-netbsd-kernel
info symbol 0xffffffff804f74b8
and so on.
To check the value of entry, I suspect this will do:
Index: xengnt.c
===================================================================
RCS file: /cvsroot/src/sys/arch/xen/xen/xengnt.c,v
retrieving revision 1.25
diff -u -r1.25 xengnt.c
--- xengnt.c 24 Oct 2012 13:07:46 -0000 1.25
+++ xengnt.c 6 Mar 2018 01:47:55 -0000
@@ -261,7 +261,8 @@
xengnt_free_entry(grant_ref_t entry)
{
mutex_enter(&grant_lock);
- KASSERT(entry > NR_RESERVED_ENTRIES);
+ KASSERTMSG(entry > NR_RESERVED_ENTRIES,
+ "entry > NR_RESERVED_ENTRIES and entry is %d", entry);
KASSERT(gnt_entries[last_gnt_entry] == XENGNT_NO_ENTRY);
KASSERT(last_gnt_entry >= 0);
KASSERT(last_gnt_entry <= gnt_max_grant_frames * NR_GRANT_ENTRIES_PER_PAGE);
Write the above to a file, and use patch -p0 < /path/to/patch-file
and rebuild a new kernel:
cd src; ./build.sh -U -u -m amd64 -O ~/xen tools kernel=XEN3_DOMU
From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Mon, 05 Mar 2018 21:30:03 -0500
coypu@sdf.org writes:
> You can parse the backtrace like so (normally, it would report the
> functions, but that is a bug).
>
> gdb /my-netbsd-kernel
> info symbol 0xffffffff804f74b8
>
> and so on.
I have compiled a new kernel with debugging symbols which may prove more
helpful. But in the meantime, using the original kernel I get the
following:
?() at ffffffff804f74b8
?() at ffffffff80600b95
?() at ffffffff802139eb
?() at ffffffff8022eb90
?() at ffffffff80212a48
?() at ffffffff80203119
?() at ffffffff804cd598
?() at ffffffff804b2ae6
?() at ffffffff8020f1a6
(gdb) info symbol 0xffffffff804f74b8
vpanic + 320 in section .text
(gdb) info symbol 0xffffffff80600b95
No symbol matches 0xffffffff80600b95.
(gdb) info symbol 0xffffffff802139eb
xengnt_free_entry + 155 in section .text
(gdb) info symbol 0xffffffff8022eb90
xbd_handler + 223 in section .text
(gdb) info symbol 0xffffffff80212a48
xen_intr_biglock_wrapper + 29 in section .text
(gdb) info symbol 0xffffffff80203119
Xresume_xenev6 + 73 in section .text
(gdb) info symbol 0xffffffff804cd598
softint_overlay + 1278 in section .text
(gdb) info symbol 0xffffffff804b2ae6
lwp_userret + 521 in section .text
(gdb) info symbol 0xffffffff8020f1a6
syscall + 406 in section .text
> To check the value of entry, I suspect this will do:
>
> Index: xengnt.c
> ===================================================================
> RCS file: /cvsroot/src/sys/arch/xen/xen/xengnt.c,v
> retrieving revision 1.25
> diff -u -r1.25 xengnt.c
> --- xengnt.c 24 Oct 2012 13:07:46 -0000 1.25
> +++ xengnt.c 6 Mar 2018 01:47:55 -0000
> @@ -261,7 +261,8 @@
> xengnt_free_entry(grant_ref_t entry)
> {
> mutex_enter(&grant_lock);
> - KASSERT(entry > NR_RESERVED_ENTRIES);
> + KASSERTMSG(entry > NR_RESERVED_ENTRIES,
> + "entry > NR_RESERVED_ENTRIES and entry is %d", entry);
> KASSERT(gnt_entries[last_gnt_entry] == XENGNT_NO_ENTRY);
> KASSERT(last_gnt_entry >= 0);
> KASSERT(last_gnt_entry <= gnt_max_grant_frames * NR_GRANT_ENTRIES_PER_PAGE);
>
>
> Write the above to a file, and use patch -p0 < /path/to/patch-file
> and rebuild a new kernel:
> cd src; ./build.sh -U -u -m amd64 -O ~/xen tools kernel=XEN3_DOMU
>
I can apply a patch in a day or so and recompile. If the panic happens
on schedule it should occur late afternoon tomorrow.
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS
http://anduin.eldar.org - & - http://anduin.ipv6.eldar.org [IPv6 only]
From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Wed, 07 Mar 2018 20:43:56 -0500
coypu@sdf.org writes:
> The following reply was made to PR port-xen/53074; it has been noted by GNATS.
>
> From: coypu@sdf.org
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
> Date: Tue, 6 Mar 2018 01:49:26 +0000
>
> You can parse the backtrace like so (normally, it would report the
> functions, but that is a bug).
>
> gdb /my-netbsd-kernel
> info symbol 0xffffffff804f74b8
>
> and so on.
>
[snip]
I noticed something odd about the daily panic I am having. It always
happened around 9:15 in the morning, not every 24 hours from a reboot.
At that time, the DOMU was performing a backup using a line much like
this:
/sbin/dump -0u -a -f - $fs | ssh backup@server.eldar.org "/usr/pkg/bin/buffer | /usr/bin/bzip2 -9c > some_backup.bz2"
The dump would proceed for a while and the at some point panic 100% of
the time. The latest kernel that I compiled up for debugging symbols
produced a slightly different panic then was reported initially in the
PR. I have attached the more recent panic to the end of this email along
with the decoding of the symbols. The original panic that was reported
does not happen with the new kernel.
I suspected some sort of file system damage, but that was ruled out.
Multiple fsck are clean, and even coping the data using pax to a new
volume produced the same panic on the new volume. The file system is a
FFSv2, fslevel 5. WAPBL was enabled on the file system, but that was
also removed, although the panic was slightly different when WAPBL was
present. The volumes are presented to the DOMU as raw lvm devices.
This DOMU was updated from a 6.99 era build to 8.99.12 recently and
performed this sort of backup every day in the past. The DOMU is has a
DOM0 which is a NetBSD/amd64 Xen 4.5.1 DOM0 running 7.1_STABLE, also
recent, and is given a sched_credit weight of 32 and CPU cap of 10.
I can help if given some guidance as to what may be needed.
panic: biodone2 already
cpu0: Begin traceback...
?() at ffffffff804f74b8
?() at ffffffff804f7575
?() at ffffffff80536841
?() at ffffffff8053689c
?() at ffffffff804cc257
cpu0: End traceback...
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff802057a5 cs 0xe030 rflags 0x202 cr2 0xffffa0002b9fe000 ilevel 0 rsp 0xffffa0002b353dc0
curlwp 0xffffa00000741000 pid 0.4 lowest kstack 0xffffa0002b3502c0
Stopped in pid 0.4 (system) at ffffffff802057a5: leave
ds ffff
es 0
fs 3dd0
gs 3d70
rdi 0
rsi a
rbp ffffa0002b353dc0
rbx 104
rdx 1
rcx 0
rax 0
r8 ffffffff807373c0
r9 0
r10 75
r11 e02b
r12 ffffffff8068dbcc
r13 ffffa0002b353e08
r14 ffffffff8068da62
r15 ffffa00000741000
rip ffffffff802057a5
cs e030
rflags 202
rsp ffffa0002b353dc0
ss e02b
ffffffff802057a5: leave
(gdb) l *(0xffffffff804f74b8)
0xffffffff804f74b8 is in vpanic (../../../../kern/subr_prf.c:342).
337 kdbpanic();
338 #endif
339 #ifdef DDB
340 db_panic();
341 #endif
342 cpu_reboot(bootopt, NULL);
343 }
344
345 /*
346 * kernel logging functions: log, logpri, addlog
(gdb) l *(0xffffffff804f7575)
0xffffffff804f7575 is in snprintf (../../../../kern/subr_prf.c:1075).
1070 /*
1071 * snprintf: print a message to a buffer
1072 */
1073 int
1074 snprintf(char *bf, size_t size, const char *fmt, ...)
1075 {
1076 int retval;
1077 va_list ap;
1078
1079 va_start(ap, fmt);
(gdb) l *(0xffffffff80536841)
0xffffffff80536841 is in biointr (../../../../kern/vfs_bio.c:1654).
1649 }
1650 }
1651
1652 static void
1653 biointr(void *cookie)
1654 {
1655 struct cpu_info *ci;
1656 buf_t *bp;
1657 int s;
1658
(gdb) l *(0xffffffff8053689c)
0xffffffff8053689c is in biointr (./x86/intr.h:187).
182
183 static inline int
184 splraiseipl(ipl_cookie_t icookie)
185 {
186
187 return splraise(icookie._ipl);
188 }
189
190 #include <sys/spl.h>
191
(gdb) l *(0xffffffff804cc257)
0xffffffff804cc257 is in softint_thread (/usr/src/sys/arch/amd64/compile/XEN3_DOMU_ISCSI/xen-ma/machine/cpu.h:55).
50 __inline static struct cpu_info * __unused
51 x86_curcpu(void)
52 {
53 struct cpu_info *ci;
54
55 __asm volatile("movq %%gs:%1, %0" :
56 "=r" (ci) :
57 "m"
58 (*(struct cpu_info * const *)offsetof(struct cpu_info, ci_self)));
59 return ci;
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS
http://anduin.eldar.org - & - http://anduin.ipv6.eldar.org [IPv6 only]
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 24 Jun 2018 17:32:03 +0000
State-Changed-Why:
Asked for feedback on recent change.
From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To:
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Sun, 24 Jun 2018 19:31:40 +0200
FYI - I've just fixed the ddb symbol issues, so if you'd investigate
this further, you'd be able to at least have proper stack trace now.
Can you recheck if that issue still happens with
arch/amd64/conf/kern.ldscript.Xen rev. 1.15 or later?
Jaromir
From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Fri, 10 Aug 2018 15:54:25 -0400
Sorry.. I had not noticed that there was a reply to this. I updated
the DOMU guest that has the panic to 8.99.23 and it still does it.
However, the backtrace may be more useful. I suspect that the bug I
found was noted again by someone else recently and may be the same as
the more recent kern/53506. In any case, here is a new backtrace:
[ 1085.6009218] panic: biodone2 already
[ 1085.6009218] cpu0: Begin traceback...
[ 1085.6009218] vpanic() at netbsd:vpanic+0x152
[ 1085.6009218] snprintf() at netbsd:snprintf
[ 1085.6009218] biointr() at netbsd:biointr
[ 1085.6009218] biointr() at netbsd:biointr+0x5b
[ 1085.6009218] softint_thread() at netbsd:softint_thread+0x97
[ 1085.6009218] cpu0: End traceback...
[ 1085.6009218] fatal breakpoint trap in supervisor mode
[ 1085.6009218] trap type 1 code 0 rip 0xffffffff80205855 cs 0xe030 rflags 0x202 cr2 0xffffa0002b88d267 ilevel 0 rsp 0xffffa0002b359dc0
[ 1085.6009218] curlwp 0xffffa00000743000 pid 0.4 lowest kstack 0xffffa0002b3552c0
Stopped in pid 0.4 (system) at netbsd:breakpoint+0x5: leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x152
snprintf() at netbsd:snprintf
biointr() at netbsd:biointr
biointr() at netbsd:biointr+0x5b
softint_thread() at netbsd:softint_thread+0x97
ds ffff
es 0
fs 9dd0
gs 9d70
rdi 0
rsi a
rbp ffffa0002b359dc0
rbx 104
rdx 1
rcx 0
rax 0
r8 ffffffff80704540 cpu_info_primary
r9 cccccccccccccccd
r10 75
r11 ffffffff805d97f8 ptp_shifts+0x18
r12 ffffffff8065dd8f ostype+0x463af
r13 ffffa0002b359e08
r14 ffffffff8065dc25 ostype+0x46245
r15 ffffa00000743000
rip ffffffff80205855 breakpoint+0x5
cs e030
rflags 202
rsp ffffa0002b359dc0
ss e02b
netbsd:breakpoint+0x5: leave
This panic is being tripped by a line much line this:
/sbin/dump -0u -a -f - / | ssh backup@backupsystem "/usr/pkg/bin/buffer | /usr/bin/bzip2 -9c > /sp/HOSTS/domu_guest_that_panics/x.bz2"
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
Responsible-Changed-From-To: port-xen-maintainer->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Tue, 14 Aug 2018 18:31:58 +0000
Responsible-Changed-Why:
I'm looking on port-xen/53506 which might be same problem.
State-Changed-From-To: feedback->open
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 14 Aug 2018 18:31:58 +0000
State-Changed-Why:
Feedback provided, still happens.
State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 21 Aug 2018 09:28:43 +0000
State-Changed-Why:
Can you try if this patch fixes the problem? It did for port-xen/53506
http://www.netbsd.org/~jdolecek/xbd_diskstart_splbio.diff
From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
jdolecek@NetBSD.org
Subject: Re: port-xen/53074 (Daily 8.99.12 DOMU panic)
Date: Tue, 21 Aug 2018 19:25:27 -0400
jdolecek@NetBSD.org writes:
> Synopsis: Daily 8.99.12 DOMU panic
>
> State-Changed-From-To: open->feedback
> State-Changed-By: jdolecek@NetBSD.org
> State-Changed-When: Tue, 21 Aug 2018 09:28:43 +0000
> State-Changed-Why:
> Can you try if this patch fixes the problem? It did for port-xen/53506
>
> http://www.netbsd.org/~jdolecek/xbd_diskstart_splbio.diff
This patch does seem to prevent the panic on 8.99.23. I tested a DOMU
with 1 and 2 vcpus. I think that this PR can be closed.
Thanks.
--
Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 22 Aug 2018 06:11:13 +0000
State-Changed-Why:
Confirmed fixed. Pullup is being tracked in PR port-xen/53506
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.