NetBSD Problem Report #56925
From www@netbsd.org Thu Jul 14 09:40:24 2022
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id F20E11A9239
for <gnats-bugs@gnats.NetBSD.org>; Thu, 14 Jul 2022 09:40:23 +0000 (UTC)
Message-Id: <20220714094022.5A5E71A923C@mollari.NetBSD.org>
Date: Thu, 14 Jul 2022 09:40:22 +0000 (UTC)
From: joel.bertrand@systella.fr
Reply-To: joel.bertrand@systella.fr
To: gnats-bugs@NetBSD.org
Subject: Amd64 server randomly panics
X-Send-Pr-Version: www-1.0
>Number: 56925
>Category: kern
>Synopsis: Amd64 server randomly panics
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Jul 14 09:45:00 +0000 2022
>Last-Modified: Thu Dec 08 12:35:01 +0000 2022
>Originator: BERTRAND Joël
>Release: NetBSD 9.2_STABLE
>Organization:
>Environment:
NetBSD legendre.systella.fr 9.2_STABLE NetBSD 9.2_STABLE (CUSTOM) #12: Mon Jun 20 08:30:25 CEST 2022 root@legendre.systella.fr:/usr/src/netbsd-9/obj/sys/arch/amd64/compile/CUSTOM amd64
>Description:
Hello,
My main server runs NetBSD 9.2 as rolling release. Since NetBSD 9.0, it randomly panics with traceback like :
[ 613880.906790] S-1C-1: ccb_timeout: num=1 total=1 disp=0
[ 613880.906790] uvm_fault(0xffffffff815862c0, 0x0, 2) -> e
[ 613880.906790] fatal page fault in supervisor mode
[ 613880.906790] trap type 6 code 0x2 rip 0xffffffff8022800c cs 0x8 rflags
0x10246 cr2 0x10 ilevel 0 rsp 0xffffdb813c0c9f48
[ 613880.906790] curlwp 0xfffffda14406ba60 pid 0.181 lowest kstack
0xffffdb813c0c72c0
[ 613880.906790] panic: trap
[ 613880.906790] cpu1: Begin traceback...
[ 613880.906790] vpanic() at netbsd:vpanic+0x160
[ 613880.906790] snprintf() at netbsd:snprintf
[ 613880.906790] startlwp() at netbsd:startlwp
[ 613880.906790] alltraps() at netbsd:alltraps+0xbb
[ 613880.916794] ccb_timeout() at netbsd:ccb_timeout+0xf0
[ 613880.916794] iscsi_cleanup_thread() at netbsd:iscsi_cleanup_thread+0x2b6
[ 613880.916794] cpu1: End traceback...
[ 613880.916794] uvm_fault(0xfffffda43aad15c0, 0x0, 2) -> e
[ 613880.916794] fatal page fault in supervisor mode
[ 613880.916794] trap type 6 code 0x2 rip 0xffffffff8097d518 cs
[ 613880.916794] 0x8 rflags 0x10246 cr2 0x84 ilevel 0x8 rsp
0xffffdb8141e16cd9u0m
[ 613880.916794] ping to dev 18,1 (offset=251919, size=4162814):
[ 613880.916794] curlwp 0xfffffda3854184c0 pid 25426.1 lowest kstacdku m0px
ffffdb8141e142c0
It doesn't write dump file in /var/crash or enter in ddb. It only reboots. I'm pretty sure I haven't seen this kind of panic with NetBSD 8.x.
This server acts as initiator and target iSCSI. Panic randomly occurs (last time was in may and I would believe this issue was fixed...) and can occur when server is idle.
>How-To-Repeat:
Issue seems to be related to iscsi initiator (target is provided by istgt daemon). But I don't know how trigger this bug.
>Fix:
No idea.
>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: joel.bertrand@systella.fr
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Thu, 14 Jul 2022 11:24:36 +0000
This is a null pointer dereference by the instruction at address
0xffffffff8022800c. I'm guessing it's in wake_ccb but I'm not sure.
Can you try to obtain the following information from your kernel?
1. The disassembly of the function where rip 0xffffffff8022800c lies,
and, if possible, the line number of 0xffffffff8022800c.
2. The disassembly of ccb_timeout, and, if possible, the line number
of ccb_timeout+0xf0.
If you have a netbsd.gdb file, you can get these with:
$ gdb ./netbsd.gdb
(gdb) info line *(0xffffffff8022800c)
(gdb) disas 0xffffffff8022800c
(gdb) info line *(ccb_timeout+0xf0)
(gdb) disas ccb_timeout
If not, you can at least use the `disas' commands in the netbsd
kernel.
It may help to set the sysctl variables ddb.onpanic=1 and
ddb.commandonenter="bt;show registers", or
ddb.commandonenter="bt;show registers;sync"
if you want ddb to force a crash dump (with the caveat that the latter
means manually entering ddb will immediately force a crash dump with
no interaction). It may also help to set `options ISCSI_DEBUG' and
set hw.iscsi.debug=1 or higher (with the caveat that it might produce
a lot of log spew).
From: =?UTF-8?Q?BERTRAND_Jo=c3=abl?= <joel.bertrand@systella.fr>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Thu, 14 Jul 2022 15:17:09 +0200
Taylor R Campbell a écrit :
> The following reply was made to PR kern/56925; it has been noted by GNATS.
>
> From: Taylor R Campbell <riastradh@NetBSD.org>
> To: joel.bertrand@systella.fr
> Cc: gnats-bugs@NetBSD.org
> Subject: Re: kern/56925: Amd64 server randomly panics
> Date: Thu, 14 Jul 2022 11:24:36 +0000
>
> This is a null pointer dereference by the instruction at address
> 0xffffffff8022800c. I'm guessing it's in wake_ccb but I'm not sure.
>
> Can you try to obtain the following information from your kernel?
>
> 1. The disassembly of the function where rip 0xffffffff8022800c lies,
> and, if possible, the line number of 0xffffffff8022800c.
>
> 2. The disassembly of ccb_timeout, and, if possible, the line number
> of ccb_timeout+0xf0.
>
> If you have a netbsd.gdb file, you can get these with:
>
I have ;-)
(gdb) info line *(0xffffffff8022800c)
No line number information available for address
0xffffffff8022800c <mutex_enter+12>
(gdb) disas 0xffffffff8022800c
Dump of assembler code for function mutex_enter:
0xffffffff80228000 <+0>: mov %gs:0x1d0,%rcx
0xffffffff80228009 <+9>: xor %rax,%rax
0xffffffff8022800c <+12>: lock cmpxchg %rcx,(%rdi)
0xffffffff80228011 <+17>: jne 0xffffffff80228017 <mutex_enter+23>
0xffffffff80228013 <+19>: retq
0xffffffff80228014 <+20>: nop
0xffffffff80228015 <+21>: nop
0xffffffff80228016 <+22>: retq
0xffffffff80228017 <+23>: jmpq 0xffffffff8096c7b4
<mutex_vector_enter>
End of assembler dump.
(gdb) info line *(ccb_timeout+0xf0)
Line 1674 of "/usr/src/netbsd-9/src/sys/dev/iscsi/iscsi_send.c"
starts at address 0xffffffff80a9b544 <ccb_timeout+240>
and ends at 0xffffffff80a9b551 <ccb_timeout+253>.
(gdb) disas ccb_timeout
Dump of assembler code for function ccb_timeout:
0xffffffff80a9b454 <+0>: push %rbp
0xffffffff80a9b455 <+1>: mov %rsp,%rbp
0xffffffff80a9b458 <+4>: push %r12
0xffffffff80a9b45a <+6>: push %rbx
0xffffffff80a9b45b <+7>: mov %rdi,%rbx
0xffffffff80a9b45e <+10>: mov 0x168(%rdi),%r12
0xffffffff80a9b465 <+17>: addl $0x1,0x80(%rdi)
0xffffffff80a9b46c <+24>: mov 0x9c8392(%rip),%edx #
0xffffffff81463804 <iscsi_debug_level>
0xffffffff80a9b472 <+30>: test %edx,%edx
0xffffffff80a9b474 <+32>: js 0xffffffff80a9b4c2 <ccb_timeout+110>
0xffffffff80a9b476 <+34>: test %r12,%r12
0xffffffff80a9b479 <+37>: je 0xffffffff80a9b55a <ccb_timeout+262>
0xffffffff80a9b47f <+43>: movzwl 0x128(%r12),%edx
0xffffffff80a9b488 <+52>: mov 0x140(%r12),%rax
0xffffffff80a9b490 <+60>: movzwl 0x224(%rax),%esi
0xffffffff80a9b497 <+67>: mov $0xffffffff811b8b59,%rdi
0xffffffff80a9b49e <+74>: xor %eax,%eax
0xffffffff80a9b4a0 <+76>: callq 0xffffffff809a88b7 <printf>
0xffffffff80a9b4a5 <+81>: mov 0x14(%rbx),%ecx
0xffffffff80a9b4a8 <+84>: mov 0x80(%rbx),%edx
0xffffffff80a9b4ae <+90>: mov 0x7c(%rbx),%eax
0xffffffff80a9b4b1 <+93>: lea 0x1(%rax),%esi
0xffffffff80a9b4b4 <+96>: mov $0xffffffff811ba8e0,%rdi
0xffffffff80a9b4bb <+103>: xor %eax,%eax
0xffffffff80a9b4bd <+105>: callq 0xffffffff809a88b7 <printf>
0xffffffff80a9b4c2 <+110>: mov 0x7c(%rbx),%eax
0xffffffff80a9b4c5 <+113>: add $0x1,%eax
0xffffffff80a9b4c8 <+116>: mov %eax,0x7c(%rbx)
0xffffffff80a9b4cb <+119>: cmp $0x3,%eax
0xffffffff80a9b4ce <+122>: jg 0xffffffff80a9b537 <ccb_timeout+227>
0xffffffff80a9b4d0 <+124>: cmpl $0x9,0x80(%rbx)
0xffffffff80a9b4d7 <+131>: jg 0xffffffff80a9b537 <ccb_timeout+227>
0xffffffff80a9b4d9 <+133>: cmpl $0x3,0x14(%rbx)
0xffffffff80a9b4dd <+137>: jbe 0xffffffff80a9b537 <ccb_timeout+227>
0xffffffff80a9b4df <+139>: mov 0x170(%rbx),%rax
0xffffffff80a9b4e6 <+146>: mov 0x234(%rax),%eax
0xffffffff80a9b4ec <+152>: test %eax,%eax
0xffffffff80a9b4ee <+154>: je 0xffffffff80a9b537 <ccb_timeout+227>
0xffffffff80a9b4f0 <+156>: xor %r8d,%r8d
0xffffffff80a9b4f3 <+159>: xor %ecx,%ecx
0xffffffff80a9b4f5 <+161>: cmpb $0x0,0x12a(%rbx)
0xffffffff80a9b4fc <+168>: je 0xffffffff80a9b50c <ccb_timeout+184>
0xffffffff80a9b4fe <+170>: mov 0x138(%rbx),%eax
0xffffffff80a9b504 <+176>: cmp %eax,0x13c(%rbx)
0xffffffff80a9b50a <+182>: jb 0xffffffff80a9b566 <ccb_timeout+274>
0xffffffff80a9b50c <+184>: mov $0x1,%edx
0xffffffff80a9b511 <+189>: xor %esi,%esi
0xffffffff80a9b513 <+191>: mov %r12,%rdi
0xffffffff80a9b516 <+194>: callq 0xffffffff80a9936c <snack_missing>
0xffffffff80a9b51b <+199>: mov 0x9bd2a3(%rip),%eax #
0xffffffff814587c4 <hz>
0xffffffff80a9b521 <+205>: mov %eax,%esi
0xffffffff80a9b523 <+207>: shl $0x4,%esi
0xffffffff80a9b526 <+210>: sub %eax,%esi
0xffffffff80a9b528 <+212>: shl $0x2,%esi
0xffffffff80a9b52b <+215>: mov %rbx,%rdi
0xffffffff80a9b52e <+218>: pop %rbx
0xffffffff80a9b52f <+219>: pop %r12
0xffffffff80a9b531 <+221>: pop %rbp
0xffffffff80a9b532 <+222>: jmpq 0xffffffff80a93e29
<ccb_timeout_start>
0xffffffff80a9b537 <+227>: mov $0x18,%esi
0xffffffff80a9b53c <+232>: mov %rbx,%rdi
0xffffffff80a9b53f <+235>: callq 0xffffffff80a9e21f <wake_ccb>
0xffffffff80a9b544 <+240>: mov $0x2,%edx
0xffffffff80a9b549 <+245>: mov $0x18,%esi
0xffffffff80a9b54e <+250>: mov %r12,%rdi
0xffffffff80a9b551 <+253>: pop %rbx
0xffffffff80a9b552 <+254>: pop %r12
0xffffffff80a9b554 <+256>: pop %rbp
0xffffffff80a9b555 <+257>: jmpq 0xffffffff80a93570
<handle_connection_error>
0xffffffff80a9b55a <+262>: mov $0xffffffff,%edx
0xffffffff80a9b55f <+267>: mov %edx,%esi
0xffffffff80a9b561 <+269>: jmpq 0xffffffff80a9b497 <ccb_timeout+67>
0xffffffff80a9b566 <+274>: xor %edx,%edx
0xffffffff80a9b568 <+276>: mov %rbx,%rsi
0xffffffff80a9b56b <+279>: mov %r12,%rdi
0xffffffff80a9b56e <+282>: callq 0xffffffff80a9936c <snack_missing>
0xffffffff80a9b573 <+287>: jmp 0xffffffff80a9b51b <ccb_timeout+199>
End of assembler dump.
Best regards,
JB
From: Taylor R Campbell <riastradh@NetBSD.org>
To: joel.bertrand@systella.fr
Cc: gnats-bugs@NetBSD.org, mlelstv@NetBSD.org
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Thu, 14 Jul 2022 18:26:39 +0000
So it is in wake_ccb, as I guessed, and it looks like conn is null
when wake_ccb tries to acquire conn->c_lock:
https://nxr.netbsd.org/xref/src/sys/dev/iscsi/iscsi_utils.c?r=1.27#400
This can only happen if free_ccb had run before (or concurrently with)
ccb_timeout, nulling out ccb->ccb_connection before ccb_timeout read
it:
https://nxr.netbsd.org/xref/src/sys/dev/iscsi/iscsi_send.c?r=1.37#1661
Which free_ccb did it is not clear to me (lots of calls to it). But
my guess is a race between successful command completion and the
callout firing. If successful command completion tries callout_stop
to stop the timeout, but the callout had already begun firing on
another CPU, then successful command completion might call free_ccb
and null out ccb->ccb_connection just before ccb_timeout grabs it.
mlelstv: You recently(ish) worked on MP iscsi -- can you take a look?
From: =?UTF-8?Q?BERTRAND_Jo=c3=abl?= <joel.bertrand@systella.fr>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: gnats-bugs@NetBSD.org, mlelstv@NetBSD.org
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Thu, 14 Jul 2022 20:48:56 +0200
Taylor R Campbell a écrit :
> So it is in wake_ccb, as I guessed, and it looks like conn is null
> when wake_ccb tries to acquire conn->c_lock:
>
> https://nxr.netbsd.org/xref/src/sys/dev/iscsi/iscsi_utils.c?r=1.27#400
>
> This can only happen if free_ccb had run before (or concurrently with)
> ccb_timeout, nulling out ccb->ccb_connection before ccb_timeout read
> it:
>
> https://nxr.netbsd.org/xref/src/sys/dev/iscsi/iscsi_send.c?r=1.37#1661
>
> Which free_ccb did it is not clear to me (lots of calls to it). But
> my guess is a race between successful command completion and the
> callout firing. If successful command completion tries callout_stop
> to stop the timeout, but the callout had already begun firing on
> another CPU, then successful command completion might call free_ccb
> and null out ccb->ccb_connection just before ccb_timeout grabs it.
>
> mlelstv: You recently(ish) worked on MP iscsi -- can you take a look?
For your information, this server acts as an initiator for two iscsi
targets (/opt/ngspice and /opt/bacula). I haven't tested with only one
target.
Best regards,
JB
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Thu, 14 Jul 2022 23:22:10 -0000 (UTC)
riastradh@NetBSD.org (Taylor R Campbell) writes:
>From: Taylor R Campbell <riastradh@NetBSD.org>
> So it is in wake_ccb, as I guessed, and it looks like conn is null
> when wake_ccb tries to acquire conn->c_lock:
The first thing should be a kernel built with DIAGNOSTIC. If conn
is NULL, the assertion will trigger.
ISCSI_DEBUG is enabled by default, you just have to configure
the level with sysctl. A level of 15 will trace CCB usage, but
that generates lots of output, but if the error still appears,
you can probably see where things go wrong.
From: Taylor R Campbell <riastradh@NetBSD.org>
To: joel.bertrand@systella.fr
Cc: gnats-bugs@NetBSD.org, mlelstv@NetBSD.org
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Fri, 15 Jul 2022 08:56:29 +0000
I see two problems in the ccb_timeout and wake_ccb logic:
1. ccb_timeout is called from softint context, and calls wake_ccb,
which calls ccb_timeout_stop, which in turn sometimes calls kpause.
Calling kpause is forbidden in softint context, so this has to
change -- either the logic is never reached, in which case it's
dead code and should be removed, or the logic is reachable, in
which case it's broken and needs to be fixed.
2. ccb_timeout_stop may fail to stop the callout from firing again, if
it had already started.
In particular, unless there are other non-obvious conditions making
this scenario impossible, the following sequence of events may
happen:
(a) cpu0: callout fires and enters ccb_timeout
(b) cpu1: ccb_timeout_stop calls callout_stop
(c) cpu0: ccb_timeout calls ccb_timeout_start, which
reschedules the callout
(d) cpu1: caller of ccb_timeout_stop assumes callout
will not fire again
(e) a minute passes (COMMAND_TIMEOUT ticks)
(f) cpuN: callout fires and enters ccb_timeout again
If ccb_timeout_stop is not allowed to block, fixing this requires
ccb_timeout_stop to issue callout_stop and set a flag under a lock,
under which lock ccb_timeout tests the flag to decide whether to
skip everything. Then either:
- if ccb_timeout_start is allowed to sleep, it must run
callout_halt to wait for any concurrent firing of the
callout to finish, clear the flag, and finally call
callout_schedule; or
- if ccb_timeout_start is forbidden from sleeping, it needs a
second flag to ask ccb_timeout to reschedule itself without
doing anything.
Fixing this might draw inspiration from how these problems were
addressed in the USB stack -- see sys/dev/usb/usbdi.c and the logic
around struct usbd_xfer::ux_callout, ux_timeout_set, ux_timeout_reset,
callout_ack/invoking, &c.
(The USB stack has one additional complication: the abort action on
timeout is has to be performed in thread context, so it's run in a USB
task scheduled from a callout, not directly in the callout. I don't
know whether the ccb abort action (handle_connection_error) has the
same constraint.
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Fri, 15 Jul 2022 13:18:16 -0000 (UTC)
riastradh@NetBSD.org (Taylor R Campbell) writes:
> I see two problems in the ccb_timeout and wake_ccb logic:
>
> 1. ccb_timeout is called from softint context, and calls wake_ccb,
ccb_timeout_co is the callout function (i.e. softint context) that queues
the ccb via ccb_tchain and signals the cleanup thread. The operation is
protected by a mutex at IPL_BIO.
ccb_timeout is then called later from the cleanup thread.
From: =?UTF-8?Q?BERTRAND_Jo=c3=abl?= <joel.bertrand@systella.fr>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Sat, 6 Aug 2022 09:32:17 +0200
Michael van Elst a écrit :
> The following reply was made to PR kern/56925; it has been noted by GNATS.
>
> From: mlelstv@serpens.de (Michael van Elst)
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: kern/56925: Amd64 server randomly panics
> Date: Fri, 15 Jul 2022 13:18:16 -0000 (UTC)
>
> riastradh@NetBSD.org (Taylor R Campbell) writes:
>
> > I see two problems in the ccb_timeout and wake_ccb logic:
> >
> > 1. ccb_timeout is called from softint context, and calls wake_ccb,
>
> ccb_timeout_co is the callout function (i.e. softint context) that queues
> the ccb via ccb_tchain and signals the cleanup thread. The operation is
> protected by a mutex at IPL_BIO.
>
> ccb_timeout is then called later from the cleanup thread.
Hello,
Do you have some news about this PR ? This server has crashed twice
last night (one panic and a deadlock that could be related). I can do
some tests but I have no idea to fix this issue.
Best regards,
JB
From: =?UTF-8?Q?BERTRAND_Jo=c3=abl?= <joel.bertrand@systella.fr>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Wed, 16 Nov 2022 09:36:18 +0100
Hello,
I have upgraded my server to 9.3_STABLE. In a first time, I tought that
this issue was solved, as this server ran without panic during more than
a month. But yesterday, kernel has crashed with the same panic (and
/home was destroyed, I have restored data from last daily backup).
I hope this PR can be solved as soon as possible...
Best regards,
JB
From: =?UTF-8?Q?BERTRAND_Jo=c3=abl?= <joel.bertrand@systella.fr>
To: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Cc:
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Thu, 8 Dec 2022 13:32:46 +0100
Hello,
I have applied this patch
http://ftp.netbsd.org/pub/NetBSD/misc/mlelstv/iscsi-9-current.diff
against netbsd-9 and it seems to fix this PR.
legendre:[~] > uptime
1:28PM up 18 days, 21:58, 5 users, load averages: 0.52, 0.84, 0.85
legendre:[~] > uname -a
NetBSD legendre.systella.fr 9.3_STABLE NetBSD 9.3_STABLE (CUSTOM) #19:
Sat Nov 19 15:12:20 CET 2022
root@legendre.systella.fr:/usr/src/netbsd-9/obj/sys/arch/amd64/compile/CUSTOM
amd64
Maybe this patch should be pulled in -9 branch.
Best regards,
JKB
(Contact us)
$NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2022
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.