NetBSD Problem Report #53329
From www@NetBSD.org Tue May 29 16:00:38 2018
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id B21E17A110
for <gnats-bugs@gnats.NetBSD.org>; Tue, 29 May 2018 16:00:38 +0000 (UTC)
Message-Id: <20180529160037.D12787A221@mollari.NetBSD.org>
Date: Tue, 29 May 2018 16:00:37 +0000 (UTC)
From: prlw1@cam.ac.uk
Reply-To: prlw1@cam.ac.uk
To: gnats-bugs@NetBSD.org
Subject: netboot over nfs broken
X-Send-Pr-Version: www-1.0
>Number: 53329
>Category: kern
>Synopsis: netboot over nfs broken
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue May 29 16:05:00 +0000 2018
>Last-Modified: Tue Aug 24 14:55:01 +0000 2021
>Originator: Patrick Welche
>Release: NetBSD-8.99.19/amd64
>Organization:
>Environment:
>Description:
Netbooting no longer works. Tried 2 new boxes (one last week with 8.99.17) and one today, so tried my known-to-have-netbooted-successfully-in-the-past laptop, and in all cases (alc0, wm0, wm0), "cannot mount root, error = 79".
The kernel is loaded and started over nfs, not tftp, so I believe that dhcpd and nfsd on an 8.99.19 server are working correctly.
What is new is:
nfs_boot: trying DHCP/BOOTP
nfs_boot: timeout...
nfs_boot: timeout...
nfs_boot: timeout...
The server sees at most 2 requests. One presumably is the successful dhcp request before tftp get pxeboot_ia32.bin.
>How-To-Repeat:
>Fix:
>Audit-Trail:
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 11:46:19 +0900
On Wed, May 30, 2018 at 1:05 AM <prlw1@cam.ac.uk> wrote:
>
> >Number: 53329
> >Category: kern
> >Synopsis: netboot over nfs broken
> >Confidential: no
> >Severity: serious
> >Priority: medium
> >Responsible: kern-bug-people
> >State: open
> >Class: sw-bug
> >Submitter-Id: net
> >Arrival-Date: Tue May 29 16:05:00 +0000 2018
> >Originator: Patrick Welche
> >Release: NetBSD-8.99.19/amd64
> >Organization:
> >Environment:
> >Description:
> Netbooting no longer works. Tried 2 new boxes (one last week with 8.99.17) and one today, so tried my known-to-have-netbooted-successfully-in-the-past laptop, and in all cases (alc0, wm0, wm0), "cannot mount root, error = 79".
>
> The kernel is loaded and started over nfs, not tftp, so I believe that dhcpd and nfsd on an 8.99.19 server are working correctly.
>
> What is new is:
>
> nfs_boot: trying DHCP/BOOTP
> nfs_boot: timeout...
> nfs_boot: timeout...
> nfs_boot: timeout...
>
> The server sees at most 2 requests. One presumably is the successful dhcp request before tftp get pxeboot_ia32.bin.
I suspect the recent switch of the default transport to TCP.
If so adding NFS_BOOT_UDP to your kernel would fix the issue.
ozaki-r
From: Jason Thorpe <thorpej@me.com>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/53329: netboot over nfs broken
Date: Wed, 30 May 2018 20:32:24 -0700
> On May 30, 2018, at 7:46 PM, Ryota Ozaki <ozaki-r@netbsd.org> wrote:
>> What is new is:
>>
>> nfs_boot: trying DHCP/BOOTP
>> nfs_boot: timeout...
>> nfs_boot: timeout...
>> nfs_boot: timeout...
>>
>> The server sees at most 2 requests. One presumably is the successful dhcp request before tftp get pxeboot_ia32.bin.
>
> I suspect the recent switch of the default transport to TCP.
> If so adding NFS_BOOT_UDP to your kernel would fix the issue.
Really? Looks to me like the DHCP doesn’t succeed, which has nothing
to do with transport change. If we were getting a DHCP response, we’d
see messages with the results, yes?
-- thorpej
From: Martin Husemann <martin@duskware.de>
To: Jason Thorpe <thorpej@me.com>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 10:05:18 +0200
On Wed, May 30, 2018 at 08:32:24PM -0700, Jason Thorpe wrote:
> Really? Looks to me like the DHCP doesn't succeed, which has nothing
> to do with transport change. If we were getting a DHCP response, we'd
> see messages with the results, yes?
Indeed, we should.
Something wrong with receive interrupts or media type negotiation?
Martin
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 09:14:45 +0100
On Thu, May 31, 2018 at 08:10:00AM +0000, Martin Husemann wrote:
> Something wrong with receive interrupts or media type negotiation?
First: does it work for you? The reason I am surprised is that the kernel
is correctly loaded over nfs by pxeboot => all should be well to find root
at exactly the same place the kernel is. The 2nd dhcp response must be
the kernel load. Question is why there isn't a 3rd or nth given the
"trying" messages emitted by the kernel.
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 10:56:21 +0100
Tried NFS_BOOT_UDP just in case, and no change. Checked and the dhcpd
server really only does see 2 requests (disc/offer/req/ack x 2).
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 12:13:39 +0200
On Thu, May 31, 2018 at 08:15:01AM +0000, Patrick Welche wrote:
> First: does it work for you?
I haven't tried with pxe on x86 lately, but it works for various arm boards
here.
Have you updated the pxeboot or only the kernel?
Martin
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 1 Jun 2018 10:00:06 +0100
On Thu, May 31, 2018 at 10:15:01AM +0000, Martin Husemann wrote:
> Have you updated the pxeboot or only the kernel?
Both. I thought pxeboot had successfully finished its job once it loaded
the kernel over nfs. The dhcpd/nfsd server is also updated. The hardware /
network hasn't changed since the last known working state.
Building a kernel with DEBUG_NFS_BOOT_DHCP hasn't changed the output:
7.2... nfs_boot: trying DHCP/BOOTP
25.9... nfs_boot: timeout...
30.9... nfs_boot: timeout...
35.9... nfs_boot: timeout...
40.9... nfs_boot: timeout...
Running tcpdump on the dhcpd server shows no requests at all corresponding
to those, so I don't think that changing the timeout would help.
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 1 Jun 2018 12:42:38 +0200
On Fri, Jun 01, 2018 at 09:05:00AM +0000, Patrick Welche wrote:
> On Thu, May 31, 2018 at 10:15:01AM +0000, Martin Husemann wrote:
> > Have you updated the pxeboot or only the kernel?
>
> Both. I thought pxeboot had successfully finished its job once it loaded
Can you try an old pxeboot? Maybe it leaves the network device in a state
the kernel does not properly recover.
Martin
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 1 Jun 2018 17:19:00 +0100
It seems to be a new local problem in the sense that:
- requests were emitted - I didn't see them as I was filtering on host
instead of port bootps or port bootpc
- I successfully booted after increasing timeouts in
sys/nfs/nfs_boot.c to
-#define MAX_RESEND_DELAY 5 /* seconds */
-#define TOTAL_TIMEOUT 30 /* seconds */
+#define MAX_RESEND_DELAY 50 /* seconds */
+#define TOTAL_TIMEOUT 300 /* seconds */
and observe:
7.2... nfs_boot: trying DHCP/BOOTP
46.9... nfs_boot: DHCP next-server: ...
as opposed to the post above:
7.2... nfs_boot: trying DHCP/BOOTP
25.9... nfs_boot: timeout...
30.9... nfs_boot: timeout...
35.9... nfs_boot: timeout...
40.9... nfs_boot: timeout...
(old cisco catalyst switch)
That last output doesn't really seem to match the comment:
/*
 * What is the longest we will wait before re-sending a request?
 * Note this is also the frequency of "timeout" messages.
 * The re-send loop counts up linearly to this maximum, so the
 * first complaint will happen after (1+2+3+4+5)=15 seconds.
 */
The frequency of "timeout" messages seems to be 5 rather than 15 seconds.
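For reference, the schedule that comment describes can be modelled in a
few lines of C. This is only a sketch under assumptions, not the code in
sys/nfs/nfs_boot.c: the names resend_summary and model_resend are made up
for illustration, and the loop shape is inferred from the comment. With
the stock values it puts the first complaint 15 seconds after the first
request and every later complaint MAX_RESEND_DELAY (5) seconds after the
previous one, which would agree with the 5 second spacing in the output
above.

```c
#include <assert.h>

/*
 * Hypothetical model of the re-send schedule, not the kernel code.
 * Each attempt waits one second longer than the previous one until
 * the wait reaches max_delay; every attempt made after that point
 * corresponds to one "timeout" complaint. The loop gives up once the
 * summed waits reach total.
 */
struct resend_summary {
	int first_complaint;	/* seconds waited at first complaint, -1 if none */
	int gave_up_at;		/* seconds waited when the loop gives up */
};

static struct resend_summary
model_resend(int max_delay, int total)
{
	struct resend_summary s = { -1, 0 };
	int waited = 0, timo = 0;

	assert(max_delay > 0 && total > 0);
	for (;;) {
		waited += timo;
		if (waited >= total)
			break;			/* give up */
		if (timo < max_delay)
			timo++;			/* still ramping up */
		else if (s.first_complaint < 0)
			s.first_complaint = waited;
	}
	s.gave_up_at = waited;
	return s;
}
```

In this model, model_resend(5, 30) gives first_complaint = 15 and
gave_up_at = 30, while model_resend(10, 60) gives first_complaint = 55
and gave_up_at = 65.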
From: Julian Coleman <jdc@coris.org.uk>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, prlw1@cam.ac.uk
Subject: Re: kern/53329: netboot over nfs broken
Date: Mon, 4 Jun 2018 15:21:43 +0200
Hi,
> - I successfully booted after increasing timeouts in
> sys/nfs/nfs_boot.c to
> 7.2... nfs_boot: trying DHCP/BOOTP
> 25.9... nfs_boot: timeout...
> 30.9... nfs_boot: timeout...
> 35.9... nfs_boot: timeout...
> 40.9... nfs_boot: timeout...
>
> (old cisco catalyst switch)
Can you configure portfast on the switch? I'm guessing that the switch does
not enable forwarding until spanning-tree negotiation has completed. See:
https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/10553-12.html
for more details.
If this worked previously, but fails now, then I also guess that we drop the
link now and did not in the past. (Each time the link comes up, the cisco will
renegotiate and not enable forwarding for 30s.)
Regards,
Julian
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 8 Jun 2018 11:38:00 +0100
I could work around this by changing the configuration of the switch,
but rather than having to put a note in the INSTALL files, maybe we
could just double the timeouts?
> (Each time the link comes up, the cisco will renegotiate and not enable
> forwarding for 30s.)
so 30s is too short. The following is sufficient for me:
RCS file: /cvsroot/src/sys/nfs/nfs_boot.c,v
retrieving revision 1.88
diff -u -r1.88 nfs_boot.c
--- sys/nfs/nfs_boot.c 17 May 2018 02:34:31 -0000 1.88
+++ sys/nfs/nfs_boot.c 8 Jun 2018 10:34:40 -0000
@@ -425,8 +425,8 @@
* The re-send loop counts up linearly to this maximum, so the
* first complaint will happen after (1+2+3+4+5)=15 seconds.
*/
-#define MAX_RESEND_DELAY 5 /* seconds */
-#define TOTAL_TIMEOUT 30 /* seconds */
+#define MAX_RESEND_DELAY 10 /* seconds */
+#define TOTAL_TIMEOUT 60 /* seconds */
int
nfs_boot_sendrecv(struct socket *so, struct sockaddr_in *nam,
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 8 Jun 2018 11:41:29 +0100
P.S., some timings with the doubling change:
[ 7.197495] nfs_boot: trying DHCP/BOOTP
[ 10.838201] entered nfs_boot_sendrecv()
[ 10.838201] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 11.838703] send_again: timo=1 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 13.839707] send_again: timo=2 waited=1 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 16.841213] send_again: timo=3 waited=3 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 20.843220] send_again: timo=4 waited=6 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 25.845730] send_again: timo=5 waited=10 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 31.308470] acpibat0: normal capacity on 'charge state'
[ 31.848741] send_again: timo=6 waited=15 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 38.852254] send_again: timo=7 waited=21 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 46.856269] send_again: timo=8 waited=28 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 46.856269] entered nfs_boot_sendrecv()
[ 46.856269] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 46.856269] nfs_boot: DHCP next-server: 10.1.1.68
[ 46.856269] nfs_boot: my_name=quark.flow.bpi.cam.ac.uk
[ 46.856269] nfs_boot: my_addr=10.1.1.65
[ 46.856269] nfs_boot: my_mask=255.255.255.0
[ 46.856269] nfs_boot: gateway=10.1.1.86
[ 49.857775] entered nfs_boot_sendrecv()
[ 49.857775] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 50.858277] send_again: timo=1 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 52.859280] send_again: timo=2 waited=1 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 52.859280] entered nfs_boot_sendrecv()
[ 52.859280] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 52.859280] entered nfs_boot_sendrecv()
[ 52.859280] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
[ 52.869287] root on 10.1.1.68:/usr/export/amd64
[ 52.869287] root file system type: nfs
[ 52.869287] kern.module.path=/stand/amd64/8.99.19/modules
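Reading those timings: each send_again line appears to be printed before
the current timo is added to waited, so the request that finally got an
answer ("timo=8 waited=28") went out about 36 seconds after the first
one. The sketch below models that, to show why 30 seconds in total
cannot cover a switch that forwards only after roughly 30 seconds. It
is an assumption-laden illustration, not the kernel code: the function
name first_answerable_send and the link_ready parameter are
hypothetical, and the loop shape is inferred from the debug lines.

```c
#include <assert.h>

/*
 * Hypothetical model inferred from the send_again debug output.
 * A request is sent, the loop waits timo seconds, and timo grows by
 * one per attempt up to max_delay. Returns the time, in seconds after
 * the first transmission, of the first request sent once link_ready
 * seconds have passed, or -1 if the loop gives up before that.
 */
static int
first_answerable_send(int max_delay, int total, int link_ready)
{
	int waited = 0, timo = 0;

	assert(max_delay > 0 && total > 0 && link_ready >= 0);
	for (;;) {
		waited += timo;
		if (waited >= total)
			return -1;		/* ETIMEDOUT in the model */
		if (timo < max_delay)
			timo++;
		if (waited >= link_ready)
			return waited;		/* this request can get through */
	}
}
```

With the stock values, first_answerable_send(5, 30, 30) returns -1: the
last request goes out at 25 seconds and the loop gives up at 30. With
the doubled values, first_answerable_send(10, 60, 30) returns 36, which
lines up with the reply arriving at 46.856269 after entering at
10.838201.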
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 8 Jun 2018 14:08:31 +0100
(Just netbooted ubuntu 18 successfully without needing to reconfigure
the switch.)
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 8 Jun 2018 17:21:18 +0200
We should find out what part on our side makes the link drop between
pxeboot and the kernel taking over (and try to avoid that).
Martin
From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/53329: netboot over nfs broken
Date: Tue, 24 Aug 2021 15:54:45 +0100
Just hit this bug again, this time with a much newer cisco switch, and
with Julian's suggestion I hope interpreted correctly, the port is set
to:
interface GigabitEthernet1/0/3
switchport access vlan 47
switchport trunk native vlan 4
spanning-tree portfast
Doubling the delays to
Index: sys/nfs/nfs_boot.c
===================================================================
RCS file: /cvsroot/src/sys/nfs/nfs_boot.c,v
retrieving revision 1.88
diff -r1.88 nfs_boot.c
428,429c428,429
< #define MAX_RESEND_DELAY 5 /* seconds */
< #define TOTAL_TIMEOUT 30 /* seconds */
---
> #define MAX_RESEND_DELAY 10 /* seconds */
> #define TOTAL_TIMEOUT 60 /* seconds */
got the system booting and installed.
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.