NetBSD Problem Report #53074

From brad@anduin.eldar.org  Mon Mar  5 18:09:41 2018
Return-Path: <brad@anduin.eldar.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id C9F887A1DA
	for <gnats-bugs@gnats.NetBSD.org>; Mon,  5 Mar 2018 18:09:41 +0000 (UTC)
Message-Id: <201803051809.w25I9cc3021569@anduin.eldar.org>
Date: Mon, 5 Mar 2018 13:09:38 -0500 (EST)
From: brad@anduin.eldar.org
Reply-To: brad@anduin.eldar.org
To: gnats-bugs@NetBSD.org
Subject: Daily 8.99.12 DOMU panic
X-Send-Pr-Version: 3.95

>Number:         53074
>Category:       port-xen
>Synopsis:       Daily 8.99.12 DOMU panic
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jdolecek
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Mar 05 18:10:00 +0000 2018
>Closed-Date:    Wed Aug 22 06:11:13 +0000 2018
>Last-Modified:  Wed Aug 22 06:11:13 +0000 2018
>Originator:     Brad Spencer
>Release:        NetBSD 8.99.12
>Organization:
	Eldar.org
>Environment:
System: NetBSD varda.nat.eldar.org 8.99.12 NetBSD 8.99.12 (XEN3_DOMU_ISCSI) #0: Thu Mar  1 22:43:01 EST 2018  brad@gimli.nat.eldar.org:/usr/src/sys/arch/amd64/compile/XEN3_DOMU_ISCSI amd64
Architecture: x86_64
Machine: amd64
>Description:
On a testing 64 bit DOMU running NetBSD 8.99.12 against a NetBSD 7.1_STABLE DOM0 I have a daily panic.  The ddb traceback is as follows:

panic: kernel diagnostic assertion "entry > NR_RESERVED_ENTRIES" failed: file "../../../../arch/xen/xen/xengnt.c", line 264 
cpu0: Begin traceback...
?() at ffffffff804f74b8
?() at ffffffff80600b95
?() at ffffffff802139eb
?() at ffffffff8022eb90
?() at ffffffff80212a48
?() at ffffffff80203119
?() at ffffffff804cd598
?() at ffffffff804b2ae6
?() at ffffffff8020f1a6
cpu0: End traceback...
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip 0xffffffff802057a5 cs 0xe030 rflags 0x202 cr2 0x7eb70853b020 ilevel 0x8 rsp 0xffffa0002c28fae8
curlwp 0xffffa00000c681a0 pid 10930.1 lowest kstack 0xffffa0002c28c2c0
Stopped in pid 10930.1 (ssh) at ffffffff802057a5:       leave
ds          0
es          faf8
fs          fa98
gs          fae8
rdi         6
rsi         a
rbp         ffffa0002c28fae8
rbx         104
rdx         1
rcx         0
rax         1
r8          ffffffff807373c0
r9          0
r10         75
r11         ffffa00000740048
r12         ffffffff80648e28
r13         ffffa0002c28fb30
r14         ffffa00000960000
r15         ffffa00000a57008
rip         ffffffff802057a5
cs          e030
rflags      202
rsp         ffffa0002c28fae8
ss          e02b
ffffffff802057a5:       leave
db{0}> 

The kernel used by this particular DOMU is just XEN3_DOMU plus the
iscsi support compiled in because module loading does not appear to
work.  The DOM0 is NetBSD 7.1_STABLE running Xen 4.5.1 in 64 bit.

>How-To-Repeat:
The DOMU lasts almost exactly 24 hours at a time.  Just wait and it
will happen again.  I can provide more information if it is desired.

>Fix:
No idea on the fix, just reporting.

>Release-Note:

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Mon, 5 Mar 2018 19:59:38 +0100

 On Mon, Mar 05, 2018 at 06:10:00PM +0000, brad@anduin.eldar.org wrote:
 > panic: kernel diagnostic assertion "entry > NR_RESERVED_ENTRIES" failed: file "../../../../arch/xen/xen/xengnt.c", line 264 
 > cpu0: Begin traceback...
 > ?() at ffffffff804f74b8
 > ?() at ffffffff80600b95
 > ?() at ffffffff802139eb
 > ?() at ffffffff8022eb90
 > ?() at ffffffff80212a48
 > ?() at ffffffff80203119
 > ?() at ffffffff804cd598
 > ?() at ffffffff804b2ae6
 > ?() at ffffffff8020f1a6

 Could you check the value of entry in this case ?
 Also could you match the addresses in the stack trace to functions ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: coypu@sdf.org
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Tue, 6 Mar 2018 01:49:26 +0000

 You can parse the backtrace like so (normally, it would report the
 functions, but that is a bug).

 gdb /my-netbsd-kernel
 info symbol 0xffffffff804f74b8

 and so on.

 To check the value of entry, I suspect this will do:

 Index: xengnt.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/xen/xen/xengnt.c,v
 retrieving revision 1.25
 diff -u -r1.25 xengnt.c
 --- xengnt.c	24 Oct 2012 13:07:46 -0000	1.25
 +++ xengnt.c	6 Mar 2018 01:47:55 -0000
 @@ -261,7 +261,8 @@
  xengnt_free_entry(grant_ref_t entry)
  {
  	mutex_enter(&grant_lock);
 -	KASSERT(entry > NR_RESERVED_ENTRIES);
 +	KASSERTMSG(entry > NR_RESERVED_ENTRIES,
 +	    "entry > NR_RESERVED_ENTRIES and entry is %d", entry);
  	KASSERT(gnt_entries[last_gnt_entry] == XENGNT_NO_ENTRY);
  	KASSERT(last_gnt_entry >= 0);
  	KASSERT(last_gnt_entry <= gnt_max_grant_frames * NR_GRANT_ENTRIES_PER_PAGE);


 Write the above to a file, and use patch -p0 < /path/to/patch-file
 and rebuild a new kernel:
 cd src; ./build.sh -U -u -m amd64 -O ~/xen tools kernel=XEN3_DOMU

From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Mon, 05 Mar 2018 21:30:03 -0500

 coypu@sdf.org writes:

 >  You can parse the backtrace like so (normally, it would report the
 >  functions, but that is a bug).
 >  
 >  gdb /my-netbsd-kernel
 >  info symbol 0xffffffff804f74b8
 >  
 >  and so on.

 I have compiled a new kernel with debugging symbols which may prove more
 helpful.  But in the meantime, using the original kernel I get the
 following:

 ?() at ffffffff804f74b8
 ?() at ffffffff80600b95
 ?() at ffffffff802139eb
 ?() at ffffffff8022eb90
 ?() at ffffffff80212a48
 ?() at ffffffff80203119
 ?() at ffffffff804cd598
 ?() at ffffffff804b2ae6
 ?() at ffffffff8020f1a6

 (gdb) info symbol 0xffffffff804f74b8
 vpanic + 320 in section .text
 (gdb) info symbol 0xffffffff80600b95
 No symbol matches 0xffffffff80600b95.
 (gdb) info symbol 0xffffffff802139eb
 xengnt_free_entry + 155 in section .text
 (gdb) info symbol 0xffffffff8022eb90
 xbd_handler + 223 in section .text
 (gdb) info symbol 0xffffffff80212a48
 xen_intr_biglock_wrapper + 29 in section .text
 (gdb) info symbol 0xffffffff80203119
 Xresume_xenev6 + 73 in section .text
 (gdb) info symbol 0xffffffff804cd598
 softint_overlay + 1278 in section .text
 (gdb) info symbol 0xffffffff804b2ae6
 lwp_userret + 521 in section .text
 (gdb) info symbol 0xffffffff8020f1a6
 syscall + 406 in section .text


 >  To check the value of entry, I suspect this will do:
 >  
 >  Index: xengnt.c
 >  ===================================================================
 >  RCS file: /cvsroot/src/sys/arch/xen/xen/xengnt.c,v
 >  retrieving revision 1.25
 >  diff -u -r1.25 xengnt.c
 >  --- xengnt.c	24 Oct 2012 13:07:46 -0000	1.25
 >  +++ xengnt.c	6 Mar 2018 01:47:55 -0000
 >  @@ -261,7 +261,8 @@
 >   xengnt_free_entry(grant_ref_t entry)
 >   {
 >   	mutex_enter(&grant_lock);
 >  -	KASSERT(entry > NR_RESERVED_ENTRIES);
 >  +	KASSERTMSG(entry > NR_RESERVED_ENTRIES,
 >  +	    "entry > NR_RESERVED_ENTRIES and entry is %d", entry);
 >   	KASSERT(gnt_entries[last_gnt_entry] == XENGNT_NO_ENTRY);
 >   	KASSERT(last_gnt_entry >= 0);
 >   	KASSERT(last_gnt_entry <= gnt_max_grant_frames * NR_GRANT_ENTRIES_PER_PAGE);
 >  
 >  
 >  Write the above to a file, and use patch -p0 < /path/to/patch-file
 >  and rebuild a new kernel:
 >  cd src; ./build.sh -U -u -m amd64 -O ~/xen tools kernel=XEN3_DOMU
 >  

 I can apply a patch in a day or so and recompile.  If the panic happens
 on schedule it should occur late afternoon tomorrow.





 -- 
 Brad Spencer - brad@anduin.eldar.org - KC8VKS
 http://anduin.eldar.org  - & -  http://anduin.ipv6.eldar.org [IPv6 only]

From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Wed, 07 Mar 2018 20:43:56 -0500

 coypu@sdf.org writes:

 > The following reply was made to PR port-xen/53074; it has been noted by GNATS.
 >
 > From: coypu@sdf.org
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
 > Date: Tue, 6 Mar 2018 01:49:26 +0000
 >
 >  You can parse the backtrace like so (normally, it would report the
 >  functions, but that is a bug).
 >  
 >  gdb /my-netbsd-kernel
 >  info symbol 0xffffffff804f74b8
 >  
 >  and so on.
 >  

 [snip]

 I noticed something odd about the daily panic I am having.  It always
 happened around 9:15 in the morning, not every 24 hours from a reboot.
 At that time, the DOMU was performing a backup using a line much like
 this:

 /sbin/dump -0u -a -f - $fs | ssh backup@server.eldar.org "/usr/pkg/bin/buffer | /usr/bin/bzip2 -9c > some_backup.bz2"

 The dump would proceed for a while and the at some point panic 100% of
 the time.  The latest kernel that I compiled up for debugging symbols
 produced a slightly different panic then was reported initially in the
 PR. I have attached the more recent panic to the end of this email along
 with the decoding of the symbols.  The original panic that was reported
 does not happen with the new kernel.

 I suspected some sort of file system damage, but that was ruled out.
 Multiple fsck are clean, and even coping the data using pax to a new
 volume produced the same panic on the new volume.  The file system is a
 FFSv2, fslevel 5.  WAPBL was enabled on the file system, but that was
 also removed, although the panic was slightly different when WAPBL was
 present.  The volumes are presented to the DOMU as raw lvm devices.

 This DOMU was updated from a 6.99 era build to 8.99.12 recently and
 performed this sort of backup every day in the past.  The DOMU is has a
 DOM0 which is a NetBSD/amd64 Xen 4.5.1 DOM0 running 7.1_STABLE, also
 recent, and is given a sched_credit weight of 32 and CPU cap of 10.

 I can help if given some guidance as to what may be needed.



 panic: biodone2 already
 cpu0: Begin traceback...
 ?() at ffffffff804f74b8
 ?() at ffffffff804f7575
 ?() at ffffffff80536841
 ?() at ffffffff8053689c
 ?() at ffffffff804cc257
 cpu0: End traceback...
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip 0xffffffff802057a5 cs 0xe030 rflags 0x202 cr2 0xffffa0002b9fe000 ilevel 0 rsp 0xffffa0002b353dc0
 curlwp 0xffffa00000741000 pid 0.4 lowest kstack 0xffffa0002b3502c0
 Stopped in pid 0.4 (system) at  ffffffff802057a5:       leave
 ds          ffff
 es          0
 fs          3dd0
 gs          3d70
 rdi         0
 rsi         a
 rbp         ffffa0002b353dc0
 rbx         104
 rdx         1
 rcx         0
 rax         0
 r8          ffffffff807373c0
 r9          0
 r10         75
 r11         e02b
 r12         ffffffff8068dbcc
 r13         ffffa0002b353e08
 r14         ffffffff8068da62
 r15         ffffa00000741000
 rip         ffffffff802057a5
 cs          e030
 rflags      202
 rsp         ffffa0002b353dc0
 ss          e02b
 ffffffff802057a5:       leave

 (gdb) l *(0xffffffff804f74b8)
 0xffffffff804f74b8 is in vpanic (../../../../kern/subr_prf.c:342).
 337                     kdbpanic();
 338     #endif
 339     #ifdef DDB
 340             db_panic();
 341     #endif
 342             cpu_reboot(bootopt, NULL);
 343     }
 344
 345     /*
 346      * kernel logging functions: log, logpri, addlog

 (gdb) l *(0xffffffff804f7575)
 0xffffffff804f7575 is in snprintf (../../../../kern/subr_prf.c:1075).
 1070    /*
 1071     * snprintf: print a message to a buffer
 1072     */
 1073    int
 1074    snprintf(char *bf, size_t size, const char *fmt, ...)
 1075    {
 1076            int retval;
 1077            va_list ap;
 1078
 1079            va_start(ap, fmt);

 (gdb) l *(0xffffffff80536841)
 0xffffffff80536841 is in biointr (../../../../kern/vfs_bio.c:1654).
 1649            }
 1650    }
 1651
 1652    static void
 1653    biointr(void *cookie)
 1654    {
 1655            struct cpu_info *ci;
 1656            buf_t *bp;
 1657            int s;
 1658

 (gdb) l *(0xffffffff8053689c)
 0xffffffff8053689c is in biointr (./x86/intr.h:187).
 182
 183     static inline int
 184     splraiseipl(ipl_cookie_t icookie)
 185     {
 186
 187             return splraise(icookie._ipl);
 188     }
 189
 190     #include <sys/spl.h>
 191

 (gdb) l *(0xffffffff804cc257)
 0xffffffff804cc257 is in softint_thread (/usr/src/sys/arch/amd64/compile/XEN3_DOMU_ISCSI/xen-ma/machine/cpu.h:55).
 50      __inline static struct cpu_info * __unused
 51      x86_curcpu(void)
 52      {
 53              struct cpu_info *ci;
 54
 55              __asm volatile("movq %%gs:%1, %0" :
 56                  "=r" (ci) :
 57                  "m"
 58                  (*(struct cpu_info * const *)offsetof(struct cpu_info, ci_self)));
 59              return ci;




 -- 
 Brad Spencer - brad@anduin.eldar.org - KC8VKS
 http://anduin.eldar.org  - & -  http://anduin.ipv6.eldar.org [IPv6 only]

State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Sun, 24 Jun 2018 17:32:03 +0000
State-Changed-Why:
Asked for feedback on recent change.


From: =?UTF-8?B?SmFyb23DrXIgRG9sZcSNZWs=?= <jaromir.dolecek@gmail.com>
To: 
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Sun, 24 Jun 2018 19:31:40 +0200

 FYI - I've just fixed the ddb symbol issues, so if you'd investigate
 this further, you'd be able to at least have proper stack trace now.

 Can you recheck if that issue still happens with
 arch/amd64/conf/kern.ldscript.Xen rev. 1.15 or later?

 Jaromir

From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org,
        netbsd-bugs@netbsd.org
Subject: Re: port-xen/53074: Daily 8.99.12 DOMU panic
Date: Fri, 10 Aug 2018 15:54:25 -0400

 Sorry..  I had not noticed that there was a reply to this.  I updated
 the DOMU guest that has the panic to 8.99.23 and it still does it.
 However, the backtrace may be more useful.  I suspect that the bug I
 found was noted again by someone else recently and may be the same as
 the more recent kern/53506.  In any case, here is a new backtrace:

 [ 1085.6009218] panic: biodone2 already
 [ 1085.6009218] cpu0: Begin traceback...
 [ 1085.6009218] vpanic() at netbsd:vpanic+0x152
 [ 1085.6009218] snprintf() at netbsd:snprintf
 [ 1085.6009218] biointr() at netbsd:biointr
 [ 1085.6009218] biointr() at netbsd:biointr+0x5b
 [ 1085.6009218] softint_thread() at netbsd:softint_thread+0x97
 [ 1085.6009218] cpu0: End traceback...
 [ 1085.6009218] fatal breakpoint trap in supervisor mode
 [ 1085.6009218] trap type 1 code 0 rip 0xffffffff80205855 cs 0xe030 rflags 0x202 cr2 0xffffa0002b88d267 ilevel 0 rsp 0xffffa0002b359dc0
 [ 1085.6009218] curlwp 0xffffa00000743000 pid 0.4 lowest kstack 0xffffa0002b3552c0
 Stopped in pid 0.4 (system) at  netbsd:breakpoint+0x5:  leave
 breakpoint() at netbsd:breakpoint+0x5
 vpanic() at netbsd:vpanic+0x152
 snprintf() at netbsd:snprintf
 biointr() at netbsd:biointr
 biointr() at netbsd:biointr+0x5b
 softint_thread() at netbsd:softint_thread+0x97
 ds          ffff
 es          0
 fs          9dd0
 gs          9d70
 rdi         0
 rsi         a
 rbp         ffffa0002b359dc0
 rbx         104
 rdx         1
 rcx         0
 rax         0
 r8          ffffffff80704540    cpu_info_primary
 r9          cccccccccccccccd
 r10         75
 r11         ffffffff805d97f8    ptp_shifts+0x18
 r12         ffffffff8065dd8f    ostype+0x463af
 r13         ffffa0002b359e08
 r14         ffffffff8065dc25    ostype+0x46245
 r15         ffffa00000743000
 rip         ffffffff80205855    breakpoint+0x5
 cs          e030
 rflags      202
 rsp         ffffa0002b359dc0
 ss          e02b
 netbsd:breakpoint+0x5:  leave



 This panic is being tripped by a line much line this:

 /sbin/dump -0u -a -f - / | ssh backup@backupsystem "/usr/pkg/bin/buffer | /usr/bin/bzip2 -9c > /sp/HOSTS/domu_guest_that_panics/x.bz2"




 -- 
 Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org

Responsible-Changed-From-To: port-xen-maintainer->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Tue, 14 Aug 2018 18:31:58 +0000
Responsible-Changed-Why:
I'm looking on port-xen/53506 which might be same problem.


State-Changed-From-To: feedback->open
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 14 Aug 2018 18:31:58 +0000
State-Changed-Why:
Feedback provided, still happens.


State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Tue, 21 Aug 2018 09:28:43 +0000
State-Changed-Why:
Can you try if this patch fixes the problem? It did for port-xen/53506

http://www.netbsd.org/~jdolecek/xbd_diskstart_splbio.diff


From: Brad Spencer <brad@anduin.eldar.org>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org, netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
        jdolecek@NetBSD.org
Subject: Re: port-xen/53074 (Daily 8.99.12 DOMU panic)
Date: Tue, 21 Aug 2018 19:25:27 -0400

 jdolecek@NetBSD.org writes:

 > Synopsis: Daily 8.99.12 DOMU panic
 >
 > State-Changed-From-To: open->feedback
 > State-Changed-By: jdolecek@NetBSD.org
 > State-Changed-When: Tue, 21 Aug 2018 09:28:43 +0000
 > State-Changed-Why:
 > Can you try if this patch fixes the problem? It did for port-xen/53506
 >
 > http://www.netbsd.org/~jdolecek/xbd_diskstart_splbio.diff

 This patch does seem to prevent the panic on 8.99.23.  I tested a DOMU
 with 1 and 2 vcpus.  I think that this PR can be closed.

 Thanks.


 -- 
 Brad Spencer - brad@anduin.eldar.org - KC8VKS - http://anduin.eldar.org

State-Changed-From-To: feedback->closed
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Wed, 22 Aug 2018 06:11:13 +0000
State-Changed-Why:
Confirmed fixed. Pullup is being tracked in PR port-xen/53506


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.