NetBSD Problem Report #53329

From www@NetBSD.org  Tue May 29 16:00:38 2018
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id B21E17A110
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 29 May 2018 16:00:38 +0000 (UTC)
Message-Id: <20180529160037.D12787A221@mollari.NetBSD.org>
Date: Tue, 29 May 2018 16:00:37 +0000 (UTC)
From: prlw1@cam.ac.uk
Reply-To: prlw1@cam.ac.uk
To: gnats-bugs@NetBSD.org
Subject: netboot over nfs broken
X-Send-Pr-Version: www-1.0

>Number:         53329
>Category:       kern
>Synopsis:       netboot over nfs broken
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue May 29 16:05:00 +0000 2018
>Last-Modified:  Fri Jun 08 15:25:00 +0000 2018
>Originator:     Patrick Welche
>Release:        NetBSD-8.99.19/amd64
>Organization:
>Environment:
>Description:
Netbooting no longer works. Tried 2 new boxes (one last week with 8.99.17 and one today), then tried my known-to-have-netbooted-successfully-in-the-past laptop, and in all cases (alc0, wm0, wm0) got "cannot mount root, error = 79".

The kernel is loaded and started over nfs, not tftp, so I believe that dhcpd and nfsd on a 8.99.19 server are working correctly.

What is new is:

nfs_boot: trying DHCP/BOOTP
nfs_boot: timeout...
nfs_boot: timeout...
nfs_boot: timeout...

The server sees at most 2 requests. One presumably is the successful dhcp request before tftp get pxeboot_ia32.bin.
>How-To-Repeat:

>Fix:

>Audit-Trail:
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 11:46:19 +0900

 On Wed, May 30, 2018 at 1:05 AM <prlw1@cam.ac.uk> wrote:
 >
 > >Number:         53329
 > >Category:       kern
 > >Synopsis:       netboot over nfs broken
 > >Confidential:   no
 > >Severity:       serious
 > >Priority:       medium
 > >Responsible:    kern-bug-people
 > >State:          open
 > >Class:          sw-bug
 > >Submitter-Id:   net
 > >Arrival-Date:   Tue May 29 16:05:00 +0000 2018
 > >Originator:     Patrick Welche
 > >Release:        NetBSD-8.99.19/amd64
 > >Organization:
 > >Environment:
 > >Description:
 > Netbooting no longer works. Tried 2 new boxes (one last week with 8.99.17) and one today, so tried my known-to-have-netbooted-successfully-in-the-past laptop, and in all cases (alc0, wm0, wm0), "cannot mount root, error = 79".
 >
 > The kernel is loaded and started over nfs, not tftp, so I believe that dhcpd and nfsd on a 8.99.19 server are working correctly.
 >
 > What is new is:
 >
 > nfs_boot: trying DHCP/BOOTP
 > nfs_boot: timeout...
 > nfs_boot: timeout...
 > nfs_boot: timeout...
 >
 > The server sees at most 2 requests. One presumably is the successful dhcp request before tftp get pxeboot_ia32.bin.

 I suspect the recent switch of the default transport to TCP.
 If so adding NFS_BOOT_UDP to your kernel would fix the issue.

   ozaki-r

From: Jason Thorpe <thorpej@me.com>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>,
 kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/53329: netboot over nfs broken
Date: Wed, 30 May 2018 20:32:24 -0700

 > On May 30, 2018, at 7:46 PM, Ryota Ozaki <ozaki-r@netbsd.org> wrote:

 >> What is new is:
 >>
 >> nfs_boot: trying DHCP/BOOTP
 >> nfs_boot: timeout...
 >> nfs_boot: timeout...
 >> nfs_boot: timeout...
 >>
 >> The server sees at most 2 requests. One presumably is the successful dhcp request before tftp get pxeboot_ia32.bin.
 >
 > I suspect the recent switch of the default transport to TCP.
 > If so adding NFS_BOOT_UDP to your kernel would fix the issue.

 Really?  Looks to me like the DHCP doesn’t succeed, which has
 nothing to do with transport change.  If we were getting a DHCP
 response, we’d see messages with the results, yes?

 -- thorpej

From: Martin Husemann <martin@duskware.de>
To: Jason Thorpe <thorpej@me.com>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 10:05:18 +0200

 On Wed, May 30, 2018 at 08:32:24PM -0700, Jason Thorpe wrote:
 > Really?  Looks to me like the DHCP doesn't succeed, which has nothing
 > to do with transport change.  If we were getting a DHCP response, we'd
 > see messages with the results, yes?

 Indeed, we should.

 Something wrong with receive interrupts or media type negotiation?

 Martin

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 09:14:45 +0100

 On Thu, May 31, 2018 at 08:10:00AM +0000, Martin Husemann wrote:
 >  Something wrong with receive interrupts or media type negotiation?

 First: does it work for you? The reason I am surprised is that the kernel
 is correctly loaded over nfs by pxeboot => all should be well to find root
 at exactly the same place the kernel is. The 2nd dhcp response must be
 the kernel load. Question is why there isn't a 3rd or nth given the
 "trying" messages emitted by the kernel.

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 10:56:21 +0100

 Tried NFS_BOOT_UDP just in case, and no change. Checked and the dhcpd
 server really only does see 2 requests (disc/offer/req/ack x 2).

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Thu, 31 May 2018 12:13:39 +0200

 On Thu, May 31, 2018 at 08:15:01AM +0000, Patrick Welche wrote:
 >  First: does it work for you?

 I haven't tried with pxe on x86 lately, but it works for various arm boards
 here.

 Have you updated the pxeboot or only the kernel?

 Martin

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 1 Jun 2018 10:00:06 +0100

 On Thu, May 31, 2018 at 10:15:01AM +0000, Martin Husemann wrote:
 >  Have you updated the pxeboot or only the kernel?

 Both. I thought pxeboot had successfully finished its job once it loaded
 the kernel over nfs. The dhcpd/nfsd server is also updated. The hardware /
 network hasn't changed since the last known working state.

 Building a kernel with DEBUG_NFS_BOOT_DHCP hasn't changed the output:

  7.2... nfs_boot: trying DHCP/BOOTP
 25.9... nfs_boot: timeout...
 30.9... nfs_boot: timeout...
 35.9... nfs_boot: timeout...
 40.9... nfs_boot: timeout...

 Running tcpdump on the dhcpd server shows no requests at all corresponding
 to those, so I don't think that changing the timeout would help.

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 1 Jun 2018 12:42:38 +0200

 On Fri, Jun 01, 2018 at 09:05:00AM +0000, Patrick Welche wrote:
 >  On Thu, May 31, 2018 at 10:15:01AM +0000, Martin Husemann wrote:
 >  >  Have you updated the pxeboot or only the kernel?
 >  
 >  Both. I thought pxeboot had successfully finished its job once it loaded

 Can you try an old pxeboot? Maybe it leaves the network device in a state
 the kernel does not properly recover.

 Martin

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 1 Jun 2018 17:19:00 +0100

 It seems to be a new local problem in the sense that:

 - requests were emitted - I didn't see them as I was filtering on host
   instead of port bootps or port bootpc

 - I successfully booted after increasing timeouts in
   sys/nfs/nfs_boot.c to

   -#define        MAX_RESEND_DELAY 5      /* seconds */
   -#define TOTAL_TIMEOUT   30     /* seconds */
   +#define        MAX_RESEND_DELAY 50     /* seconds */
   +#define TOTAL_TIMEOUT   300    /* seconds */

   and observe:

    7.2... nfs_boot: trying DHCP/BOOTP
   46.9... nfs_boot: DHCP next-server: ...

   as opposed to the post above:

     7.2... nfs_boot: trying DHCP/BOOTP
    25.9... nfs_boot: timeout...
    30.9... nfs_boot: timeout...
    35.9... nfs_boot: timeout...
    40.9... nfs_boot: timeout...

   (old cisco catalyst switch)

   That last output doesn't really seem to match the comment:

     /*
      * What is the longest we will wait before re-sending a request?
      * Note this is also the frequency of "timeout" messages.
      * The re-send loop counts up linearly to this maximum, so the
      * first complaint will happen after (1+2+3+4+5)=15 seconds.
      */

   The frequency of "timeout" messages seems to be 5 rather than 15 seconds.
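
The arithmetic in that comment can be checked with a small model. The
sketch below is a hypothetical stand-alone function, not code taken from
sys/nfs/nfs_boot.c. It assumes only what the comment states: the delay
before each re-send grows by one second until it reaches the maximum.
Under that assumption the first "timeout" message appears after
1+2+...+MAX_RESEND_DELAY seconds, and the delay then stays at the
maximum, which would be consistent with messages 5 seconds apart after a
longer initial wait.

```c
/*
 * Hypothetical model of a linear back-off re-send loop.
 * Not the kernel implementation; the function name is illustrative.
 *
 * Returns the number of seconds spent waiting before the delay first
 * reaches max_resend_delay, i.e. 1 + 2 + ... + max_resend_delay.
 */
static int
first_complaint_after(int max_resend_delay)
{
	int timo;		/* current delay in seconds */
	int waited = 0;		/* total seconds waited so far */

	for (timo = 1; timo <= max_resend_delay; timo++)
		waited += timo;
	return waited;
}
```

Called with the stock value 5 this returns 15, matching the comment;
called with 10 it returns 55.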

From: Julian Coleman <jdc@coris.org.uk>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, prlw1@cam.ac.uk
Subject: Re: kern/53329: netboot over nfs broken
Date: Mon, 4 Jun 2018 15:21:43 +0200

 Hi,

 >  - I successfully booted after increasing timeouts in
 >    sys/nfs/nfs_boot.c to

 >      7.2... nfs_boot: trying DHCP/BOOTP
 >     25.9... nfs_boot: timeout...
 >     30.9... nfs_boot: timeout...
 >     35.9... nfs_boot: timeout...
 >     40.9... nfs_boot: timeout...
 >  
 >    (old cisco catalyst switch)

 Can you configure portfast on the switch?  I'm guessing that the switch does
 not enable forwarding until spanning-tree negotiation has completed.  See:

   https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/10553-12.html

 for more details.

 If this worked previously, but fails now, then I also guess that we drop the
 link now and did not in the past.  (Each time the link comes up, the cisco will
 renegotiate and not enable forwarding for 30s.)

 Regards,

 Julian

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 8 Jun 2018 11:38:00 +0100

 I could work around this by changing the configuration of the switch,
 but rather than having to put a note in the INSTALL files, maybe we
 could just double the timeouts?

 >  (Each time the link comes up, the cisco will renegotiate and not enable
 >  forwarding for 30s.)

 so 30s is too short. The following is sufficient for me:

 RCS file: /cvsroot/src/sys/nfs/nfs_boot.c,v
 retrieving revision 1.88
 diff -u -r1.88 nfs_boot.c
 --- sys/nfs/nfs_boot.c  17 May 2018 02:34:31 -0000      1.88
 +++ sys/nfs/nfs_boot.c  8 Jun 2018 10:34:40 -0000
 @@ -425,8 +425,8 @@
   * The re-send loop counts up linearly to this maximum, so the
   * first complaint will happen after (1+2+3+4+5)=15 seconds.
   */
 -#define        MAX_RESEND_DELAY 5      /* seconds */
 -#define TOTAL_TIMEOUT   30     /* seconds */
 +#define        MAX_RESEND_DELAY 10     /* seconds */
 +#define TOTAL_TIMEOUT    60    /* seconds */

  int
  nfs_boot_sendrecv(struct socket *so, struct sockaddr_in *nam,



From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 8 Jun 2018 11:41:29 +0100

 P.S., some timings with the doubling change:

 [    7.197495] nfs_boot: trying DHCP/BOOTP
 [   10.838201] entered nfs_boot_sendrecv()
 [   10.838201] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   11.838703] send_again: timo=1 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   13.839707] send_again: timo=2 waited=1 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   16.841213] send_again: timo=3 waited=3 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   20.843220] send_again: timo=4 waited=6 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   25.845730] send_again: timo=5 waited=10 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   31.308470] acpibat0: normal capacity on 'charge state'
 [   31.848741] send_again: timo=6 waited=15 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   38.852254] send_again: timo=7 waited=21 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   46.856269] send_again: timo=8 waited=28 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   46.856269] entered nfs_boot_sendrecv()
 [   46.856269] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   46.856269] nfs_boot: DHCP next-server: 10.1.1.68
 [   46.856269] nfs_boot: my_name=quark.flow.bpi.cam.ac.uk
 [   46.856269] nfs_boot: my_addr=10.1.1.65
 [   46.856269] nfs_boot: my_mask=255.255.255.0
 [   46.856269] nfs_boot: gateway=10.1.1.86
 [   49.857775] entered nfs_boot_sendrecv()
 [   49.857775] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   50.858277] send_again: timo=1 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   52.859280] send_again: timo=2 waited=1 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   52.859280] entered nfs_boot_sendrecv()
 [   52.859280] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   52.859280] entered nfs_boot_sendrecv()
 [   52.859280] send_again: timo=0 waited=0 MAX_RESEND_DELAY=10 TOTAL_TIMEOUT=60
 [   52.869287] root on 10.1.1.68:/usr/export/amd64
 [   52.869287] root file system type: nfs
 [   52.869287] kern.module.path=/stand/amd64/8.99.19/modules
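
In the trace above, the first nfs_boot_sendrecv() call sends at roughly
0, 1, 3, 6, 10, 15, 21, 28 and 36 seconds after it is entered, and only
the request at 36 seconds is answered. The following sketch models that
schedule to show why a 30 second total cannot succeed here while a
60 second total can. It is a hypothetical model, not the kernel code:
the function names are illustrative, and it assumes the loop gives up
once the accumulated delay reaches the total timeout. The moment at which
the switch starts forwarding is not known exactly; the trace only shows
it lies after the 28 second send.

```c
/*
 * Hypothetical model of the re-send schedule seen in the trace.
 * Not taken from sys/nfs/nfs_boot.c.
 *
 * Elapsed seconds when send number n goes out (n = 0 is the first
 * send).  Each delay is one second longer than the previous one,
 * capped at max_resend_delay.
 */
static int
elapsed_at_send(int n, int max_resend_delay)
{
	int i, elapsed = 0;

	for (i = 1; i <= n; i++)
		elapsed += (i < max_resend_delay) ? i : max_resend_delay;
	return elapsed;
}

/*
 * Returns 1 if at least one request is sent at or after link_ready
 * seconds and before total_timeout seconds have accumulated,
 * otherwise 0.
 */
static int
request_sent_after(int link_ready, int max_resend_delay, int total_timeout)
{
	int n, t;

	for (n = 0; (t = elapsed_at_send(n, max_resend_delay)) < total_timeout;
	    n++) {
		if (t >= link_ready)
			return 1;
	}
	return 0;
}
```

With the stock values (5, 30) no request goes out later than 25 seconds
in this model, so request_sent_after(29, 5, 30) is 0. With the doubled
values (10, 60) the send at 36 seconds is reached, so
request_sent_after(29, 10, 60) is 1.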

From: Patrick Welche <prlw1@cam.ac.uk>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 8 Jun 2018 14:08:31 +0100

 (Just netbooted ubuntu 18 successfully without needing to reconfigure
 the switch.)

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/53329: netboot over nfs broken
Date: Fri, 8 Jun 2018 17:21:18 +0200

 We should find out what part on our side makes the link drop between
 pxeboot and the kernel taking over (and try to avoid that).

 Martin

Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.