NetBSD Problem Report #57476
From martin@duskware.de Tue Jun 20 17:31:44 2023
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id D40D01A923E
for <gnats-bugs@gnats.NetBSD.org>; Tue, 20 Jun 2023 17:31:43 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: dhcpd dumps core
X-Send-Pr-Version: 3.95
>Number: 57476
>Category: bin
>Synopsis: dhcpd dumps core
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Jun 20 17:35:00 +0000 2023
>Last-Modified: Sun Sep 10 09:10:00 +0000 2023
>Originator: Martin Husemann
>Release: NetBSD 10.0_BETA
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD plug.duskware.de 10.0_BETA NetBSD 10.0_BETA (GENERIC64) #11: Fri Mar 31 13:02:15 CEST 2023 martin@seven-days-to-the-wolves.aprisoft.de:/work/src-10/sys/arch/evbarm/compile/GENERIC64 evbarm
Architecture: aarch64
Machine: evbarm
>Description:
Every once in a while my dhcpd process dumps core:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000071ec68 in lookup_hashed_option (universe=<optimized out>,
options=<optimized out>, code=12)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2500
2500 /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c: No such file or directory.
[Current thread is 1 (process 9282)]
(gdb) bt
#0 0x000000000071ec68 in lookup_hashed_option (universe=<optimized out>,
options=<optimized out>, code=12)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2500
#1 0x00000000006d8a24 in dhcp (packet=0xfaab9f593100)
at /work/src-10/external/mpl/dhcp/bin/server/../../dist/server/dhcp.c:246
#2 0x0000000000724de0 in do_packet (interface=0xfaaba05a7500,
packet=0xfaab9a3dee20, len=290, from_port=17408, from=...,
hfrom=0xfaab9a3dee08)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:4078
(gdb) info reg
x0 0x0 0
x1 0xfaab9e0001d0 275614997152208
x2 0xc 12
x3 0xc 12
x4 0x71ec10 7466000
x5 0x0 0
x6 0x0 0
x7 0x0 0
x8 0xfaab9f22f7c6 275615016220614
x9 0xfe3 4067
x10 0x75b510 7714064
x11 0x506faa0 84343456
x12 0x506fad0 84343504
x13 0xfaab9f22d278 275615016211064
x14 0x2 2
x15 0xfaab9d25e510 275614982858000
x16 0x71ec10 7466000
x17 0xfaab9f77d800 275615021783040
x18 0x2 2
x19 0xfaab9f593100 275615019774208
x20 0x0 0
x21 0xfaab9a3dee08 275614934101512
x22 0x4400 17408
x23 0x122 290
x24 0x75b510 7714064
x25 0xfaab9a3ded50 275614934101328
x26 0x75b000 7712768
x27 0xfaaba05a7500 275615036634368
x28 0x1 1
x29 0xfaab9a3dec20 275614934101024
x30 0x6d8a24 7178788
sp 0xfaab9a3dec10 0xfaab9a3dec10
pc 0x71ec68 0x71ec68 <lookup_hashed_option+88>
cpsr 0x80000000 [ EL=0 N ]
fpsr 0x0 0
fpcr 0x0 0
(gdb) x/16i $pc-16
0x71ec58 <lookup_hashed_option+72>:
b 0x71ec7c <lookup_hashed_option+108>
0x71ec5c <lookup_hashed_option+76>: ldr x1, [x1, #8]
0x71ec60 <lookup_hashed_option+80>:
cbz x1, 0x71ec7c <lookup_hashed_option+108>
0x71ec64 <lookup_hashed_option+84>: ldr x0, [x1]
=> 0x71ec68 <lookup_hashed_option+88>: ldr x3, [x0, #24]
0x71ec6c <lookup_hashed_option+92>: ldr w3, [x3, #24]
0x71ec70 <lookup_hashed_option+96>: cmp w3, w2
0x71ec74 <lookup_hashed_option+100>:
b.ne 0x71ec5c <lookup_hashed_option+76> // b.any
0x71ec78 <lookup_hashed_option+104>: ret
0x71ec7c <lookup_hashed_option+108>:
mov x0, #0x0 // #0
0x71ec80 <lookup_hashed_option+112>: ret
>How-To-Repeat:
n/a
>Fix:
n/a
>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/57476: dhcpd dumps core
Date: Wed, 21 Jun 2023 02:55:39 +0700
Date: Tue, 20 Jun 2023 17:35:01 +0000 (UTC)
From: martin@NetBSD.org
Message-ID: <20230620173501.228E21A9242@mollari.NetBSD.org>
From your debug info it looks as if in this loop:
for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
if (((struct option_cache *)(bptr -> car)) -> option -> code ==
code)
return (struct option_cache *)(bptr -> car);
}
bptr->car == 0 (NULL). I can't tell on which iteration,
(in dist/common/options.c: lookup_hashed_option() .. starts line 2499
of the current sources)
The following patch checks for that (and for good measure, just in
case, whether (that)->option == 0 as well) and should avoid that
SEGV. Whether the code works properly or not with this in place
I'm not sure. The real problem may be in how bptr->car isn't being
set to something before bptr gets linked into the list.
--- options.c 2022-10-07 09:18:48.195826304 +0700
+++ options.c.patched 2023-06-21 02:52:25.324112692 +0700
@@ -2497,8 +2497,10 @@
hashix = compute_option_hash (code);
for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
- if (((struct option_cache *)(bptr -> car)) -> option -> code ==
- code)
+ if (bptr -> car &&
+ ((struct option_cache *)(bptr -> car)) -> option &&
+ ((struct option_cache *)(bptr -> car)) -> option -> code ==
+ code)
return (struct option_cache *)(bptr -> car);
}
return (struct option_cache *)0;
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/57476: dhcpd dumps core
Date: Wed, 21 Jun 2023 07:57:12 +0200
The code seems to iterate over that in a few places and never checks for
car == NULL anywhere. I'll instrument the option_*_reference call that
sets car to != NULL and see if I can catch the caller that does this.
Martin
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/57476: dhcpd dumps core
Date: Wed, 21 Jun 2023 13:26:29 +0700
Date: Wed, 21 Jun 2023 06:00:02 +0000 (UTC)
From: Martin Husemann <martin@duskware.de>
Message-ID: <20230621060003.0E2BF1A9241@mollari.NetBSD.org>
| The code seems to iterate over that in a few places and never checks for
| car == NULL anywhere. I'll instrument the option_*_reference call that
| sets car to != NULL and see if I can catch the caller that does this.
I took a bit more of a look after I sent the message. Given that you
said "sometimes" I'm suspecting a thread related race condition (somewhere).
I didn't see anywhere things are set up (or torn down) incorrectly,
but that doesn't mean a lot.
kre
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, martin@NetBSD.org
Subject: re: bin/57476: dhcpd dumps core
Date: Thu, 22 Jun 2023 06:17:27 +1000
> From your debug info it looks as if in this loop:
>
> for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
> if (((struct option_cache *)(bptr -> car)) -> option -> code ==
> code)
> return (struct option_cache *)(bptr -> car);
> }
>
> bptr->car == 0 (NULL). I can't tell on which iteration,
when i first saw this PR, it made me ponder, but having you
debug this far i'm convinced i have seen this problem on big
endian arm64 the last year and a half (ever since switching
my host over from a sparc64.)
IIRC, this system went live in late feb last year, and i've
seen this crash twice so far, so it does not happen often,
and i did not see any obvious errors in the code here, but
i suspect you looked further going on the next message. my
last crash was april, and i haven't seen the log trigger.
i still have the latest core file lying around, but the
binary is newer since i'm running with this patch now:
Index: common/options.c
===================================================================
RCS file: /cvsroot/src/external/mpl/dhcp/dist/common/options.c,v
retrieving revision 1.5
diff -p -r1.5 options.c
*** common/options.c 13 Jan 2021 17:01:31 -0000 1.5
--- common/options.c 21 Jun 2023 20:12:29 -0000
*************** struct option_cache *lookup_hashed_optio
*** 2497,2502 ****
--- 2497,2507 ----
hashix = compute_option_hash (code);
for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
+ if (bptr -> car == NULL) {
+ log_error ("Null pointer in bptr -> cdr: %s(%d)",
+ __func__, __LINE__);
+ continue;
+ }
if (((struct option_cache *)(bptr -> car)) -> option -> code ==
code)
return (struct option_cache *)(bptr -> car);
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/57476: dhcpd dumps core
Date: Sun, 10 Sep 2023 11:07:48 +0200
I saw my added printf triggering and got another crash in
lookup_hashed_option() :
hashix = compute_option_hash (code);
for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
>>>> if (((struct option_cache *)(bptr -> car)) -> option -> code ==
code)
return (struct option_cache *)(bptr -> car);
}
again with bptr -> car == NULL:
#1 0x000000000b960ddc in lookup_option (code=<optimized out>,
options=0xfb0d4de0d160, universe=0xba06e98 <dhcp_universe>)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2475
2475 in /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c
(gdb) list
2470 in /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c
(gdb) dow
#0 0x000000000b95ec68 in lookup_hashed_option (universe=<optimized out>,
options=<optimized out>, code=59)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2500
2500 in /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c
(gdb) p bptr
$1 = (pair) 0xfb0d4de011b0
(gdb) p *bptr
$2 = {car = 0x0, cdr = 0xfb0d4de01480}
full bt:
(gdb) bt
#0 0x000000000b95ec68 in lookup_hashed_option (universe=<optimized out>,
options=<optimized out>, code=59)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2500
#1 0x000000000b960ddc in lookup_option (code=<optimized out>,
options=0xfb0d4de0d160, universe=0xba06e98 <dhcp_universe>)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2475
#2 store_options (ocount=0xfb0d4b7e3874, ocount@entry=0xfb0d4b7e5204,
buffer=0xfb0d4b7e4144 "5\001\002\066\004\300\250\226\274\063\004",
buffer@entry=0xfb0d4b7e4140 "c\202Sc5\001\002\066\004\300\250\226\274\063\004", index=index@entry=4, buflen=1424, packet=packet@entry=0xfb0d4de0f000,
lease=lease@entry=0xfb0d4ff61430, client_state=client_state@entry=0x0,
in_options=in_options@entry=0xfb0d4de0d0f0, cfg_options=0xfb0d4de0d160,
scope=0xba07450 <global_scope>, scope@entry=0xfb0d4b7e5210,
priority_list=<optimized out>, priority_list@entry=0xfb0d4b7e3c90,
priority_len=<optimized out>, priority_len@entry=21,
first_cutoff=<optimized out>, second_cutoff=<optimized out>,
second_cutoff@entry=1364, terminate=0,
terminate@entry=<error reading variable: Cannot access memory at address 0xfb0d4b7e6ad8>, vuname=0x0,
vuname@entry=<error reading variable: Cannot access memory at address 0xfb0d4b7e6af0>)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:1364
#3 0x000000000b9619a8 in cons_options (inpacket=0xfb0d4de0f000,
outpacket=0xfb0d502ba16c <_vsprintf_l+108>,
outpacket@entry=0xfb0d4b7e5230, lease=lease@entry=0xfb0d4ff61430,
client_state=client_state@entry=0x0, mms=<optimized out>,
in_options=0xfb0d4de0d0f0, cfg_options=<optimized out>,
scope=0xfb0d4b7e5210, overload_avail=<optimized out>,
overload_avail@entry=3, terminate=<optimized out>, bootpp=<optimized out>,
prl=<optimized out>, prl@entry=0xfb0d4d3be028, vuname=<optimized out>,
vuname@entry=0x0)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:827
#4 0x000000000b9111a0 in dhcp_reply (lease=0xfb0d4ff61430)
at /work/src-10/external/mpl/dhcp/bin/server/../../dist/server/dhcp.c:3973
#5 0x000000000b94c060 in isclib_timer_callback (taskp=<optimized out>,
eventp=<optimized out>)
at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/dispatch.c:181
#6 0x0000fb0d51033330 in task_run (task=0xfb0d50daab40)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/task.c:861
#7 isc_task_run (task=0xfb0d50daab40)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/task.c:955
#8 0x0000fb0d5102a18c in isc__nm_async_task (worker=0xfb0d5066e168,
ev0=0xfb0d4d36ef20)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:873
#9 process_netievent (worker=worker@entry=0xfb0d5066e168,
ievent=0xfb0d4d36ef20)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:945
#10 0x0000fb0d5102a530 in process_queue (worker=worker@entry=0xfb0d5066e168,
type=type@entry=NETIEVENT_TASK)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:1011
#11 0x0000fb0d5102ae30 in process_all_queues (worker=0xfb0d5066e168)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:792
#12 async_cb (handle=0xfb0d5066e498)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:821
#13 0x0000fb0d5105646c in uv__async_io (loop=0xfb0d5066e178,
w=<optimized out>, events=<optimized out>)
at /work/src-10/external/mit/libuv/lib/../dist/src/unix/async.c:163
#14 0x0000fb0d5104b758 in uv__io_poll (loop=loop@entry=0xfb0d5066e178,
timeout=<optimized out>)
at /work/src-10/external/mit/libuv/lib/../dist/src/unix/kqueue.c:390
#15 0x0000fb0d51053928 in uv_run (loop=loop@entry=0xfb0d5066e178,
mode=mode@entry=UV_RUN_DEFAULT)
at /work/src-10/external/mit/libuv/lib/../dist/src/unix/core.c:406
#16 0x0000fb0d5102a7c4 in nm_thread (worker0=0xfb0d5066e168)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:723
#17 0x0000fb0d510463d0 in isc__trampoline_run (arg=0xfb0d50e2d940)
at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/trampoline.c:215
It seems we need to make the code deal with car=NULL everywhere, unless we
find the culprit that corrupts memory or explicitly sets this pointer to
NULL.
Martin
(Contact us)
$NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.