NetBSD Problem Report #55164

From gson@gson.org  Sat Apr 11 11:56:35 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 9A3251A921E
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 11 Apr 2020 11:56:35 +0000 (UTC)
Message-Id: <20200411115631.2214725445F@guava.gson.org>
Date: Sat, 11 Apr 2020 14:56:31 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: net/ndp/t_ndp:ndp_rtm test case fails again
X-Send-Pr-Version: 3.95

>Number:         55164
>Category:       kern
>Synopsis:       net/ndp/t_ndp:ndp_rtm test case fails again
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Apr 11 12:00:00 +0000 2020
>Closed-Date:    Mon Apr 13 11:45:40 +0000 2020
>Last-Modified:  Mon Apr 13 11:45:40 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2020.04.02.21.36.03
>Organization:

>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

The net/ndp/t_ndp:ndp_rtm test case is failing since these commits:

  2020.04.02.17.40.33 christos src/usr.bin/kdump/ioctlprint.1 1.3
  2020.04.02.17.40.33 christos src/usr.bin/kdump/kdump.c 1.134
  2020.04.02.17.40.33 christos src/usr.bin/kdump/mkioctls 1.54
  2020.04.02.17.41.33 christos src/usr.bin/ktrace/ktrace.h 1.21
  2020.04.02.17.41.34 christos src/usr.bin/kdump/ioctlprint.1 1.4
  2020.04.02.17.41.34 christos src/usr.bin/kdump/kdump.c 1.135
  2020.04.02.17.41.34 christos src/usr.bin/kdump/mkioctls 1.55
  2020.04.02.18.32.31 christos src/sbin/route/prog_ops.h 1.4
  2020.04.02.18.32.31 christos src/sbin/route/route.c 1.168
  2020.04.02.18.32.31 christos src/sbin/route/route_hostops.c 1.2
  2020.04.02.18.32.31 christos src/sbin/route/route_rumpops.c 1.2
  2020.04.02.20.57.20 roy src/share/man/man7/groups.7 1.8
  2020.04.02.20.57.20 roy src/share/man/man7/users.7 1.5
  2020.04.02.21.36.03 christos src/usr.bin/kdump/mkioctls 1.56

An earlier failure of the same test was reported in PR 55074; that one
was fixed, but it's now failing again.

Logs:

  http://releng.netbsd.org/b5reports/i386/commits-2020.04.html#2020.04.02.21.36.03
  http://releng.netbsd.org/b5reports/i386/2020/2020.04.09.18.20.40/test.html#net_ndp_t_ndp_ndp_rtm

It also fails on amd64, on real hardware:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.04.10.21.03.20/test.html#net_ndp_t_ndp_ndp_rtm

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->christos
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Sat, 11 Apr 2020 12:05:12 +0000
Responsible-Changed-Why:
Over to committer.


Responsible-Changed-From-To: christos->kre
Responsible-Changed-By: kre@NetBSD.org
Responsible-Changed-When: Sun, 12 Apr 2020 06:20:00 +0000
Responsible-Changed-Why:
I have been looking into this one...

One thing for certain is that none of the commits mentioned in
the PR is relevant - those were the ones that fixed the
"setsockopt on non-socket" issue, but in the process broke
twice as many tests as was fixed.

This test (net/t_ndp/ndp_rtm) was one of those that broke.

Those were subsequently fixed by these commits:

commit 2020.04.03.16.20.51 martin src/usr.bin/netstat/Attic/prog_ops.h 1.4
commit 2020.04.03.16.20.51 martin src/usr.bin/netstat/netstat_hostops.c 1.3
commit 2020.04.03.16.20.51 martin src/usr.bin/netstat/netstat_rumpops.c 1.3
commit 2020.04.03.16.20.51 martin src/usr.bin/netstat/prog_ops.h 1.4
commit 2020.04.03.16.20.52 martin src/sbin/route/prog_ops.h 1.5
commit 2020.04.03.16.20.52 martin src/sbin/route/route_hostops.c 1.3
commit 2020.04.03.16.20.52 martin src/sbin/route/route_rumpops.c 1.3

which unified netstat & route more than they had before, and in the
process seems to have lost printing the AUTHOR info (flag and addr)
when present.

I am still testing to confirm that the info is actually in the RTM_MISS
packet when it is sent to the route command, and is just not being
printed (rather than being missing completely in the msg from the kernel).

I have confirmed that (at least directly) rump is not involved in this
issue, when I do the equivalent commands on real hardware (actually
configuring the ULA, and pinging a non-existant ULA (which would be
on the same link as the source - just as the test does - and with
"route monitor" running) I don't see the author info in that case either.

Expect more soon...


State-Changed-From-To: open->analyzed
State-Changed-By: kre@NetBSD.org
State-Changed-When: Sun, 12 Apr 2020 07:31:02 +0000
State-Changed-Why:
I said:

  I am still testing to confirm that the info is actually in the RTM_MISS
  packet when it is sent to the route command, and is just not being
  printed (rather than being missing completely in the msg from the kernel).

Well, it turns out that it isn't.   The issue was caused by this commit
to src/sys/netinet6/nd6.c (cut/paste from cvsweb, with irrelevant stuff
deleted)

  Revision 1.268 / (download) - annotate - [select for diffs],
	Fri Apr 3 14:04:27 2020 UTC (8 days, 16 hours ago) by christos
  Diff to previous 1.267 (colored)

  PR/55030: Avoid locking against myself panic by moving the icmp error outside
  the lock. Thanks ozaki-r!

It turns out that the icmp error call (icmp_error2()) is what set the
address to be used in the RTM_MISS message, that is, the sequence of the
relevant address now is (from nd6_llinfo_timer() in netinet6/nd6.c):

        struct in6_addr mdaddr6 = zeroin6_addr;

                if (!IN6_IS_ADDR_UNSPECIFIED(&mdaddr6)) {
                        sockaddr_in6_init(&dsin6, &mdaddr6, 0, 0, 0);
                        sa = sin6tosa(&dsin6);
                } else
                        sa = NULL;
                rt_clonedmsg(RTM_MISS, sa, sin6tosa(&tsin6), NULL, ifp);

	if (m) {        
                icmp6_error2(m, ICMP6_DST_UNREACH,
                    ICMP6_DST_UNREACH_ADDR, 0, ifp, &mdaddr6);
        }       

It is easy to see that mdaddr6 is still zeroin6_addr which is
IN6_IS_ADDR_UNSPECIFIED() and hence sa == NULL when rt_clonedmsg()
is called.   That address is the "author" address for the RTM_MISS
message sent to the routing socket, and is omitted when it is NULL.

The icmp6_error2() call used to come before the rt_clonedmsg() sequence,
set mdaddr6 to be the source addr of the packet, which usually not have
been an unspecified address (and certainly not in the case in the t_ndp
ndp_rtm test) sa would thus not be null, and would be included in the
RTM_MISS message.

The easy for for the test (if the objective is simply to make the test
succeed) would be to do:

--- t_ndp.sh    15 Mar 2020 21:15:25 -0000      1.38
+++ t_ndp.sh    12 Apr 2020 07:03:15 -0000
@@ -471,8 +471,8 @@
        $DEBUG && cat $file

        hdr="RTM_MISS.+<DONE>"
-       what="<DST,GATEWAY,AUTHOR>"
-       addr="$IP6DST_FAIL1 link#2 $IP6SRC"
+       what="<DST,GATEWAY>"
+       addr="$IP6DST_FAIL1 link#2"
        atf_check -s exit:0 -o match:"$hdr" -o match:"$what" -o match:"$addr" \
                cat $file


and simply not expect the AUTHOR flag, or its address, to be present
(they certainly will not be with the current kernel).

I won't commit that, as I suspect that's not the correct resolution,
as at the very least netinet6/nd6.c needs more work - it is 100% pointless
to test an address we know is 0 to see if it is unspecified (it is).
So, if the eventual decision is to drop the author info from this particular
RTM_MISS (an ndp lookup failure) then the rt_clonedmsg() call should
simply have NULL instead of sa as its src parameter, and all the stuff
calculating sa can go away.

Otherwise, a reasonable method needs to be found to get the src addr
and set it in mdaddr6 without calling icmp_error2() and hence doing
the "locking against myself".

One way to do that might be to make

	struct mbuf *
	icmp6_error3(/*same params as icmp_error2)
		/* or some more intelligent name than that! */
	{
		/* everything that is in icmp6_error2()
		   except the final icmp6_error() call */

		return m;

	 out:
		m_freem(m);
		return NULL;
	}

and make icmp6_error2() become

	void
	icmp6_error2(/*same params as it has now*/)
	{
		if (m = icmp6_error3(/* pass down all params *.))
			icmp6_error(m, type, code, param);
	}

and then in nd6_llinfo_timer() where the rt_clonedmsg(RTM_MISS,...)
call is made, separately call icmp6_error3() early (without the
icmp6_error() call there is no locking issue, I believe) and then
later, icmp6_error() directly.   There's probably more work required
for any other cases where icmp6_error2() ends up being now called that
might need to be changed if that call turns into icmp6_error() instead.

But I am going to leave that to Roy, Christos, and Ozaki-R to sort
out, as they're the ones involved in this.   For now, I will return
ownership of this PR to kern-bug-people as I have no idea whether
it is possible to set it to 3 different developers at once (if that's
OK, someone can assign this PR to roy christos and ozaki-r

Rather than inventing a bool, the new function could just return m
(mbuf *) when it hasn't been freed, and NULL when it has - that would
save inventing a new variable to save he bool (the updated icmp6_error2()
wouldn't care either way).


Responsible-Changed-From-To: kre->kern-bug-people
Responsible-Changed-By: kre@NetBSD.org
Responsible-Changed-When: Sun, 12 Apr 2020 07:31:42 +0000
Responsible-Changed-Why:
This belongs to one of roy,christos,ozaki-r
returning responsibility for this to kern-bug-people


From: "Roy Marples" <roy@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55164 CVS commit: src/sys/netinet6
Date: Sun, 12 Apr 2020 12:13:52 +0000

 Module Name:	src
 Committed By:	roy
 Date:		Sun Apr 12 12:13:52 UTC 2020

 Modified Files:
 	src/sys/netinet6: nd6.c

 Log Message:
 nd6: RTM_MISS reports RTA_AUTHOR once more

 Just moves the logic to send RTM_MISS after the ICMP6 report as we
 rely on that function to extract the requesting address.

 Fixes PR kern/55164.


 To generate a diff of this commit:
 cvs rdiff -u -r1.268 -r1.269 src/sys/netinet6/nd6.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: "Roy Marples" <roy@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: PR/55164 CVS commit: src/sys/netinet6
Date: Sun, 12 Apr 2020 19:48:22 +0300

 Roy Marples wrote:
 >  Modified Files:
 >  	src/sys/netinet6: nd6.c
 >  
 >  Log Message:
 >  nd6: RTM_MISS reports RTA_AUTHOR once more
 >  
 >  Just moves the logic to send RTM_MISS after the ICMP6 report as we
 >  rely on that function to extract the requesting address.
 >  
 >  Fixes PR kern/55164.

 So far I have done only one test run after this commit,
 and there, net/ndp/t_ndp:ndp_rtm passed, but
 net/route/t_flags:route_flags_connected failed instead:

   http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.04.12.12.13.52/test.html#net_route_t_flags_route_flags_connected

 -- 
 Andreas Gustafsson, gson@gson.org

From: Roy Marples <roy@marples.name>
To: Andreas Gustafsson <gson@gson.org>, Roy Marples <roy@netbsd.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: PR/55164 CVS commit: src/sys/netinet6
Date: Sun, 12 Apr 2020 18:11:27 +0100

 On 12/04/2020 17:48, Andreas Gustafsson wrote:
 > Roy Marples wrote:
 >>   Modified Files:
 >>   	src/sys/netinet6: nd6.c
 >>   
 >>   Log Message:
 >>   nd6: RTM_MISS reports RTA_AUTHOR once more
 >>   
 >>   Just moves the logic to send RTM_MISS after the ICMP6 report as we
 >>   rely on that function to extract the requesting address.
 >>   
 >>   Fixes PR kern/55164.
 > 
 > So far I have done only one test run after this commit,
 > and there, net/ndp/t_ndp:ndp_rtm passed, but
 > net/route/t_flags:route_flags_connected failed instead:
 > 
 >    http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.04.12.12.13.52/test.html#net_route_t_flags_route_flags_connected

 Nothing in that test even uses IPv6 so I have no idea how it's even related!

 [   1.0000090] panic: rn_init 2

 Does that mean the rump kernel panic.
 Is there any way of getting a backtrace from that?

 Roy

From: Andreas Gustafsson <gson@gson.org>
To: Roy Marples <roy@marples.name>
Cc: Roy Marples <roy@netbsd.org>,
    gnats-bugs@netbsd.org
Subject: Re: PR/55164 CVS commit: src/sys/netinet6
Date: Sun, 12 Apr 2020 22:24:56 +0300

 Roy Marples wrote:
 > > So far I have done only one test run after this commit,
 > > and there, net/ndp/t_ndp:ndp_rtm passed, but
 > > net/route/t_flags:route_flags_connected failed instead:
 > > 
 > >    http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.04.12.12.13.52/test.html#net_route_t_flags_route_flags_connected
 > 
 > Nothing in that test even uses IPv6 so I have no idea how it's even related!

 Since this is just one test run, there is a possiblity that it was
 some kind of random failure, but since your commit and the failure
 both involved routing, I figured you might want to look into it.
 Whether it's random or related to your commit should become clearer as
 more test runs finish.

 > [   1.0000090] panic: rn_init 2
 >
 > Does that mean the rump kernel panic.

 I think so, yes.

 > Is there any way of getting a backtrace from that?

 I don't know.  Maybe someone who knows rump better than I do can chime in.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
    Andreas Gustafsson <gson@gson.org>
Subject: Re: PR/55164 CVS commit: src/sys/netinet6
Date: Sun, 12 Apr 2020 12:35:49 -0700 (PDT)

 On Sun, 12 Apr 2020, Andreas Gustafsson wrote:

 > > [   1.0000090] panic: rn_init 2
 > >
 > > Does that mean the rump kernel panic.
 >
 > I think so, yes.
 >
 > > Is there any way of getting a backtrace from that?
 >
 > I don't know.  Maybe someone who knows rump better than I do can chime in.

 You can manually execute the same sequence of commands that the test
 would run, and add the -s flag when starting the rump-server itself.
 It also helps if you have the debug set installed.  Oh, and don't
 forget to run rump.halt at the end of your testing otherwise I've seen
 the unix socket (pointed at by the RUMP_SERVER environment variable)
 hang around and get in the way of a subsequent test run.  :)



 +--------------------+--------------------------+-----------------------+
 | Paul Goyette       | PGP Key fingerprint:     | E-mail addresses:     |
 | (Retired)          | FA29 0E3B 35AF E8AE 6651 | paul@whooppee.com     |
 | Software Developer | 0786 F758 55DE 53BA 7731 | pgoyette@netbsd.org   |
 +--------------------+--------------------------+-----------------------+

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: PR/55164 CVS commit: src/sys/netinet6
Date: Mon, 13 Apr 2020 03:18:36 +0700

     Date:        Sun, 12 Apr 2020 16:50:02 +0000 (UTC)
     From:        Andreas Gustafsson <gson@gson.org>
     Message-ID:  <20200412165002.1D4701A9219@mollari.NetBSD.org>

   |  So far I have done only one test run after this commit,
   |  and there, net/ndp/t_ndp:ndp_rtm passed, but
   |  net/route/t_flags:route_flags_connected failed instead:

 All of the net/route tests pass for me.   That one included.

   |    http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.04.12.12.13.52/test.html#net_route_t_flags_route_flags_connected

 Whatever that is/was, looks to be unrelated.   What's the

 	cat: ./.__socks: No such file or directory

 that appeared on stderr?    I'd say something went badly awry during
 the setup, and panic "rn_init2" looks to be more related to the
 rn_init failed messages that appear a lot (as you said) rather than
 anything related to touting - from the messages in your log, the whole
 things collapses very very early.

 kre

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: PR/55164 CVS commit: src/sys/netinet6
Date: Mon, 13 Apr 2020 03:36:30 +0700

     Date:        Sun, 12 Apr 2020 19:30:02 +0000 (UTC)
     From:        Andreas Gustafsson <gson@gson.org>
     Message-ID:  <20200412193002.D64E21A9219@mollari.NetBSD.org>

   |  Since this is just one test run, there is a possiblity that it was
   |  some kind of random failure,

 It looks very peciuliar if that was a rump panic.

 As I suspected, the "rn_init 2" panic comes from rn_init() in net/radix.c
 which is the same thing that does the

 	rn_init: radix functions require max_keylen be set

 printf (log).     In fact. in rump I would expect that printf to occur
 every time (ever time when rn_init() is called, as max_keylen looks like
 it can never be anything but 0 in a rump kernel (unless there's something
 I'm getting incorrect).   If that happens, then rn_init() simply returns,
 and never gets further in where the "rn_init 2" panic appears...

         if (rn_inithead((void *)&mask_rnhead, 0) == 0)
                 panic("rn_init 2");

 Whatever is happening here, it doesn't look to be related to routing in
 any specific way, and certainly not to Roy's fix for this PR.

 It looks as if the only way that rn_inithead() can return 0 is if malloc()
 fails.   The malloc() call (R_Malloc() macro) there uses M_NOWAIT so
 failure really out to be expected, sometimes.

 kre

From: Andreas Gustafsson <gson@gson.org>
To: Roy Marples <roy@marples.name>
Cc: gnats-bugs@netbsd.org, Robert Elz <kre@munnari.OZ.AU>
Subject: Re: PR/55164 CVS commit: src/sys/netinet6
Date: Mon, 13 Apr 2020 14:19:26 +0300

 Roy,

 Three more test runs of versions newer than your commit now finished,
 and the net/route/t_flags:route_flags_connected test case passed in
 all of them, so the failure in the run immediately following your
 commit now looks like a random one unrelated to your commit.  Sorry
 about the false alarm.

 I will close this PR.  In principle, the route_flags_connected failure
 should get a PR of its own, but unless someone can reproduce it,
 there's probably not much point.
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: analyzed->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Mon, 13 Apr 2020 11:45:40 +0000
State-Changed-Why:
The ndp_rtm test failure is fixed, and the route_flags_connected
test failure appears to be unrelated.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.