NetBSD Problem Report #57476

From martin@duskware.de  Tue Jun 20 17:31:44 2023
Return-Path: <martin@duskware.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id D40D01A923E
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 20 Jun 2023 17:31:43 +0000 (UTC)
From: martin@NetBSD.org
Reply-To: martin@NetBSD.org
To: gnats-bugs@NetBSD.org
Subject: dhcpd dumps core
X-Send-Pr-Version: 3.95

>Number:         57476
>Category:       bin
>Synopsis:       dhcpd dumps core
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jun 20 17:35:00 +0000 2023
>Last-Modified:  Sun Sep 10 09:10:00 +0000 2023
>Originator:     Martin Husemann
>Release:        NetBSD 10.0_BETA
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD plug.duskware.de 10.0_BETA NetBSD 10.0_BETA (GENERIC64) #11: Fri Mar 31 13:02:15 CEST 2023 martin@seven-days-to-the-wolves.aprisoft.de:/work/src-10/sys/arch/evbarm/compile/GENERIC64 evbarm
Architecture: aarch64
Machine: evbarm
>Description:

Every once in a while my dhcpd process dumps core:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000071ec68 in lookup_hashed_option (universe=<optimized out>, 
    options=<optimized out>, code=12)
    at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2500
2500    /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c: No such file or directory.
[Current thread is 1 (process 9282)]
(gdb) bt
#0  0x000000000071ec68 in lookup_hashed_option (universe=<optimized out>, 
    options=<optimized out>, code=12)
    at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2500
#1  0x00000000006d8a24 in dhcp (packet=0xfaab9f593100)
    at /work/src-10/external/mpl/dhcp/bin/server/../../dist/server/dhcp.c:246
#2  0x0000000000724de0 in do_packet (interface=0xfaaba05a7500, 
    packet=0xfaab9a3dee20, len=290, from_port=17408, from=..., 
    hfrom=0xfaab9a3dee08)
    at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:4078
(gdb) info reg
x0             0x0                 0
x1             0xfaab9e0001d0      275614997152208
x2             0xc                 12
x3             0xc                 12
x4             0x71ec10            7466000
x5             0x0                 0
x6             0x0                 0
x7             0x0                 0
x8             0xfaab9f22f7c6      275615016220614
x9             0xfe3               4067
x10            0x75b510            7714064
x11            0x506faa0           84343456
x12            0x506fad0           84343504
x13            0xfaab9f22d278      275615016211064
x14            0x2                 2
x15            0xfaab9d25e510      275614982858000
x16            0x71ec10            7466000
x17            0xfaab9f77d800      275615021783040
x18            0x2                 2
x19            0xfaab9f593100      275615019774208
x20            0x0                 0
x21            0xfaab9a3dee08      275614934101512
x22            0x4400              17408
x23            0x122               290
x24            0x75b510            7714064
x25            0xfaab9a3ded50      275614934101328
x26            0x75b000            7712768
x27            0xfaaba05a7500      275615036634368
x28            0x1                 1
x29            0xfaab9a3dec20      275614934101024
x30            0x6d8a24            7178788
sp             0xfaab9a3dec10      0xfaab9a3dec10
pc             0x71ec68            0x71ec68 <lookup_hashed_option+88>
cpsr           0x80000000          [ EL=0 N ]
fpsr           0x0                 0
fpcr           0x0                 0
(gdb) x/16i $pc-16
   0x71ec58 <lookup_hashed_option+72>:  
    b   0x71ec7c <lookup_hashed_option+108>
   0x71ec5c <lookup_hashed_option+76>:  ldr     x1, [x1, #8]
   0x71ec60 <lookup_hashed_option+80>:  
    cbz x1, 0x71ec7c <lookup_hashed_option+108>
   0x71ec64 <lookup_hashed_option+84>:  ldr     x0, [x1]
=> 0x71ec68 <lookup_hashed_option+88>:  ldr     x3, [x0, #24]
   0x71ec6c <lookup_hashed_option+92>:  ldr     w3, [x3, #24]
   0x71ec70 <lookup_hashed_option+96>:  cmp     w3, w2
   0x71ec74 <lookup_hashed_option+100>: 
    b.ne        0x71ec5c <lookup_hashed_option+76>  // b.any
   0x71ec78 <lookup_hashed_option+104>: ret
   0x71ec7c <lookup_hashed_option+108>: 
    mov x0, #0x0                        // #0
   0x71ec80 <lookup_hashed_option+112>: ret


>How-To-Repeat:
n/a

>Fix:
n/a

>Audit-Trail:
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/57476: dhcpd dumps core
Date: Wed, 21 Jun 2023 02:55:39 +0700

     Date:        Tue, 20 Jun 2023 17:35:01 +0000 (UTC)
     From:        martin@NetBSD.org
     Message-ID:  <20230620173501.228E21A9242@mollari.NetBSD.org>

 From your debug info it looks as if in this loop:

 	for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
 		if (((struct option_cache *)(bptr -> car)) -> option -> code ==
 		    code)
 			return (struct option_cache *)(bptr -> car);
 	}

 bptr->car == 0   (NULL).   I can't tell on which iteration,

 (in dist/common/options.c: lookup_hashed_option()  .. starts line 2499
 of the current sources)

 The following patch checks for that (and for good measure, just in
 case, whether (that)->option == 0 as well) and should avoid that
 SEGV.    Whether the code works properly or not with this in place
 I'm not sure.    The real problem may be in how bptr->car isn't being
 set to something before bptr gets linked into the list.

 --- options.c	2022-10-07 09:18:48.195826304 +0700
 +++ options.c.patched	2023-06-21 02:52:25.324112692 +0700
 @@ -2497,8 +2497,10 @@

  	hashix = compute_option_hash (code);
  	for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
 -		if (((struct option_cache *)(bptr -> car)) -> option -> code ==
 -		    code)
 +		if (bptr -> car &&
 +		    ((struct option_cache *)(bptr -> car)) -> option &&
 +		    ((struct option_cache *)(bptr -> car)) -> option -> code ==
 +		      code)
  			return (struct option_cache *)(bptr -> car);
  	}
  	return (struct option_cache *)0;


From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/57476: dhcpd dumps core
Date: Wed, 21 Jun 2023 07:57:12 +0200

 The code seems to iterate over that in a few places and never checks for
 car == NULL anywhere. I'll instrument the option_*_reference call that
 sets car to != NULL and see if I can catch the caller that does this.

 Martin

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/57476: dhcpd dumps core
Date: Wed, 21 Jun 2023 13:26:29 +0700

     Date:        Wed, 21 Jun 2023 06:00:02 +0000 (UTC)
     From:        Martin Husemann <martin@duskware.de>
     Message-ID:  <20230621060003.0E2BF1A9241@mollari.NetBSD.org>

   |  The code seems to iterate over that in a few places and never checks for
   |  car == NULL anywhere. I'll instrument the option_*_reference call that
   |  sets car to != NULL and see if I can catch the caller that does this.

 I took a bit more of a look after I sent the message.   Given that you
 said "sometimes" I'm suspecting a thread related race condition (somewhere).
 I didn't see anywhere things are set up (or torn down) incorrectly,
 but that doesn't mean a lot.

 kre

From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, martin@NetBSD.org
Subject: re: bin/57476: dhcpd dumps core
Date: Thu, 22 Jun 2023 06:17:27 +1000

 >  From your debug info it looks as if in this loop:
 >  
 >  	for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
 >  		if (((struct option_cache *)(bptr -> car)) -> option -> code ==
 >  		    code)
 >  			return (struct option_cache *)(bptr -> car);
 >  	}
 >  
 >  bptr->car == 0   (NULL).   I can't tell on which iteration,

 when i first saw this PR, it made me ponder, but having you
 debug this far i'm convinced i have seen this problem on big
 endian arm64 the last year and a half (ever since switching
 my host over from a sparc64.)

 IIRC, this system went live in late feb last year, and i've
 seen this crash twice so far, so it does not happen often,
 and i did not see any obvious errors in the code here, but
 i suspect you looked further going on the next message.  my
 last crash was april, and i haven't seen the log trigger.

 i still have the latest core file lying around, but the
 binary is newer since i'm running with this patch now:


 Index: common/options.c
 ===================================================================
 RCS file: /cvsroot/src/external/mpl/dhcp/dist/common/options.c,v
 retrieving revision 1.5
 diff -p -r1.5 options.c
 *** common/options.c	13 Jan 2021 17:01:31 -0000	1.5
 --- common/options.c	21 Jun 2023 20:12:29 -0000
 *************** struct option_cache *lookup_hashed_optio
 *** 2497,2502 ****
 --- 2497,2507 ----

   	hashix = compute_option_hash (code);
   	for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
 + 		if (bptr -> car == NULL) {
 + 			log_error ("Null pointer in bptr -> cdr: %s(%d)",
 + 			   __func__, __LINE__);
 + 			continue;
 + 		}
   		if (((struct option_cache *)(bptr -> car)) -> option -> code ==
   		    code)
   			return (struct option_cache *)(bptr -> car);

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/57476: dhcpd dumps core
Date: Sun, 10 Sep 2023 11:07:48 +0200

 I saw my added printf triggering and got another crash in
 lookup_hashed_option() : 

         hashix = compute_option_hash (code);
         for (bptr = hash [hashix]; bptr; bptr = bptr -> cdr) {
 >>>>           if (((struct option_cache *)(bptr -> car)) -> option -> code ==
                     code)
                         return (struct option_cache *)(bptr -> car);
         }

 again with bptr -> car == NULL:

 #1  0x000000000b960ddc in lookup_option (code=<optimized out>, 
     options=0xfb0d4de0d160, universe=0xba06e98 <dhcp_universe>)
     at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2475
 2475    in /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c
 (gdb) list
 2470    in /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c
 (gdb) dow
 #0  0x000000000b95ec68 in lookup_hashed_option (universe=<optimized out>, 
     options=<optimized out>, code=59)
     at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2500
 2500    in /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c
 (gdb) p bptr
 $1 = (pair) 0xfb0d4de011b0
 (gdb) p *bptr
 $2 = {car = 0x0, cdr = 0xfb0d4de01480}


 full bt:

 (gdb) bt
 #0  0x000000000b95ec68 in lookup_hashed_option (universe=<optimized out>, 
     options=<optimized out>, code=59)
     at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2500
 #1  0x000000000b960ddc in lookup_option (code=<optimized out>, 
     options=0xfb0d4de0d160, universe=0xba06e98 <dhcp_universe>)
     at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:2475
 #2  store_options (ocount=0xfb0d4b7e3874, ocount@entry=0xfb0d4b7e5204, 
     buffer=0xfb0d4b7e4144 "5\001\002\066\004\300\250\226\274\063\004", 
     buffer@entry=0xfb0d4b7e4140 "c\202Sc5\001\002\066\004\300\250\226\274\063\004", index=index@entry=4, buflen=1424, packet=packet@entry=0xfb0d4de0f000, 
     lease=lease@entry=0xfb0d4ff61430, client_state=client_state@entry=0x0, 
     in_options=in_options@entry=0xfb0d4de0d0f0, cfg_options=0xfb0d4de0d160, 
     scope=0xba07450 <global_scope>, scope@entry=0xfb0d4b7e5210, 
     priority_list=<optimized out>, priority_list@entry=0xfb0d4b7e3c90, 
     priority_len=<optimized out>, priority_len@entry=21, 
     first_cutoff=<optimized out>, second_cutoff=<optimized out>, 
     second_cutoff@entry=1364, terminate=0, 
     terminate@entry=<error reading variable: Cannot access memory at address 0xfb0d4b7e6ad8>, vuname=0x0, 
     vuname@entry=<error reading variable: Cannot access memory at address 0xfb0d4b7e6af0>)
     at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:1364
 #3  0x000000000b9619a8 in cons_options (inpacket=0xfb0d4de0f000, 
     outpacket=0xfb0d502ba16c <_vsprintf_l+108>, 
     outpacket@entry=0xfb0d4b7e5230, lease=lease@entry=0xfb0d4ff61430, 
     client_state=client_state@entry=0x0, mms=<optimized out>, 
     in_options=0xfb0d4de0d0f0, cfg_options=<optimized out>, 
     scope=0xfb0d4b7e5210, overload_avail=<optimized out>, 
     overload_avail@entry=3, terminate=<optimized out>, bootpp=<optimized out>, 
     prl=<optimized out>, prl@entry=0xfb0d4d3be028, vuname=<optimized out>, 
     vuname@entry=0x0)
     at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/options.c:827
 #4  0x000000000b9111a0 in dhcp_reply (lease=0xfb0d4ff61430)
     at /work/src-10/external/mpl/dhcp/bin/server/../../dist/server/dhcp.c:3973
 #5  0x000000000b94c060 in isclib_timer_callback (taskp=<optimized out>, 
     eventp=<optimized out>)
     at /work/src-10/external/mpl/dhcp/lib/common/../../dist/common/dispatch.c:181
 #6  0x0000fb0d51033330 in task_run (task=0xfb0d50daab40)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/task.c:861
 #7  isc_task_run (task=0xfb0d50daab40)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/task.c:955
 #8  0x0000fb0d5102a18c in isc__nm_async_task (worker=0xfb0d5066e168, 
     ev0=0xfb0d4d36ef20)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:873
 #9  process_netievent (worker=worker@entry=0xfb0d5066e168, 
     ievent=0xfb0d4d36ef20)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:945
 #10 0x0000fb0d5102a530 in process_queue (worker=worker@entry=0xfb0d5066e168, 
     type=type@entry=NETIEVENT_TASK)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:1011
 #11 0x0000fb0d5102ae30 in process_all_queues (worker=0xfb0d5066e168)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:792
 #12 async_cb (handle=0xfb0d5066e498)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:821
 #13 0x0000fb0d5105646c in uv__async_io (loop=0xfb0d5066e178, 
     w=<optimized out>, events=<optimized out>)
     at /work/src-10/external/mit/libuv/lib/../dist/src/unix/async.c:163
 #14 0x0000fb0d5104b758 in uv__io_poll (loop=loop@entry=0xfb0d5066e178, 
     timeout=<optimized out>)
     at /work/src-10/external/mit/libuv/lib/../dist/src/unix/kqueue.c:390
 #15 0x0000fb0d51053928 in uv_run (loop=loop@entry=0xfb0d5066e178, 
     mode=mode@entry=UV_RUN_DEFAULT)
     at /work/src-10/external/mit/libuv/lib/../dist/src/unix/core.c:406
 #16 0x0000fb0d5102a7c4 in nm_thread (worker0=0xfb0d5066e168)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/netmgr/netmgr.c:723
 #17 0x0000fb0d510463d0 in isc__trampoline_run (arg=0xfb0d50e2d940)
     at /work/src-10/external/mpl/bind/lib/libisc/../../dist/lib/isc/trampoline.c:215


 It seems we need to make the code deal with car=NULL everywhere, unless we
 find the culprit that corrupts memory or explicitly sets this pointer to
 NULL.

 Martin

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.47 2022/09/11 19:34:41 kim Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.