NetBSD Problem Report #54897

From gson@gson.org  Sun Jan 26 16:34:31 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 5EB8F7A17C
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 26 Jan 2020 16:34:31 +0000 (UTC)
Message-Id: <20200126163425.EC269253F50@guava.gson.org>
Date: Sun, 26 Jan 2020 18:34:25 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: ipsec tests now fail randomly on real hardware
X-Send-Pr-Version: 3.95

>Number:         54897
>Category:       kern
>Synopsis:       ipsec tests now fail randomly on real hardware
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    thorpej
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jan 26 16:35:00 +0000 2020
>Closed-Date:    
>Last-Modified:  Tue Jul 27 03:35:40 +0000 2021
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2020.01.02.15.42.27
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

Various ipsec tests are now failing randomly on real hardware where
they used to reliably pass.  The problem started with these commits:

  2020.01.02.15.42.26 thorpej src/sys/compat/common/kern_info_43.c 1.38
  2020.01.02.15.42.26 thorpej src/sys/compat/common/kern_time_50.c 1.34
  2020.01.02.15.42.26 thorpej src/sys/compat/netbsd32/netbsd32_sysctl.c 1.41
  2020.01.02.15.42.26 thorpej src/sys/external/bsd/drm2/include/linux/ktime.h 1.8
  2020.01.02.15.42.26 thorpej src/sys/fs/nfs/common/nfs_lock.c 1.3
  2020.01.02.15.42.27 thorpej src/sys/kern/init_main.c 1.517
  2020.01.02.15.42.27 thorpej src/sys/kern/init_sysctl.c 1.223
  2020.01.02.15.42.27 thorpej src/sys/kern/kern_rndq.c 1.96
  2020.01.02.15.42.27 thorpej src/sys/kern/kern_tc.c 1.54
  2020.01.02.15.42.27 thorpej src/sys/kern/kern_time.c 1.203
  2020.01.02.15.42.27 thorpej src/sys/miscfs/fdesc/fdesc_vnops.c 1.131
  2020.01.02.15.42.27 thorpej src/sys/miscfs/kernfs/kernfs.h 1.41
  2020.01.02.15.42.27 thorpej src/sys/miscfs/kernfs/kernfs_vnops.c 1.162
  2020.01.02.15.42.27 thorpej src/sys/miscfs/procfs/procfs_linux.c 1.80
  2020.01.02.15.42.27 thorpej src/sys/nfs/nfs_serv.c 1.178
  2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/cons.c 1.9
  2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/emul.c 1.195
  2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/rump.c 1.339
  2020.01.02.15.42.27 thorpej src/sys/sys/kernel.h 1.32
  2020.01.02.15.42.27 thorpej src/sys/sys/timevar.h 1.40

At least amd64 and i386 are affected.  Log output from a
representative recent failure on amd64 is at:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.01.25.00.12.42/test.html#net_ipsec_t_ipsec_tcp_ipsec_tcp_ipv4mappedipv6_esp_rijndaelcbc

>How-To-Repeat:

I used the following script to reproduce the issue on i386:

  set -x
  cd /usr/tests/net/ipsec
  nfail=0
  for i in $(seq 100)
  do
      atf-run t_ipsec_tunnel >>log 2>&1 || nfail=$(expr $nfail + 1)
  done
  echo $nfail

This printed 0 on a system built from source date 2020.01.02.14.33.55
(immediately before the above commits), and 23 on a system built from
source date 2020.01.02.15.42.27 (immediately after the above commits).
In other words, before the commits, the t_ipsec_tunnel test program
failed 0 times out of 100, and after them, it failed 23 times out of
100.
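
The same measurement can be packaged as a reusable function.  This is
only a sketch: the name count_failures is hypothetical, and unlike the
script above it discards the test output instead of appending it to a
log file.

```shell
# Hypothetical helper: run a command N times and print how many runs failed.
# Usage: count_failures N command [args...]
# Note: plain sh functions have no local variables, so runs, nfail and i
# are visible to the caller after the function returns.
count_failures() {
    runs=$1
    shift
    nfail=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # A non-zero exit status counts as one failure.
        "$@" >/dev/null 2>&1 || nfail=$((nfail + 1))
        i=$((i + 1))
    done
    echo "$nfail"
}
```

Assuming the current directory is /usr/tests/net/ipsec, it would be
invoked as:

  count_failures 100 atf-run t_ipsec_tunnel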

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->thorpej
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Sun, 26 Jan 2020 16:36:45 +0000
Responsible-Changed-Why:
Over to committer.


From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54897 CVS commit: src/tests/net/ipsec
Date: Mon, 17 Feb 2020 08:46:10 +0000

 Module Name:	src
 Committed By:	ozaki-r
 Date:		Mon Feb 17 08:46:10 UTC 2020

 Modified Files:
 	src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh

 Log Message:
 tests: add missing ifconfig -w

 This change mitigates PR kern/54897.


 To generate a diff of this commit:
 cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
     src/tests/net/ipsec/t_ipsec_l2tp.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: gnats-bugs@netbsd.org, Jason Thorpe <thorpej@netbsd.org>, Paul Goyette <paul@whooppee.com>
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Mon, 17 Feb 2020 20:46:09 +0200

 Ryota Ozaki wrote:
 >  Module Name:	src
 >  Committed By:	ozaki-r
 >  Date:		Mon Feb 17 08:46:10 UTC 2020
 >  
 >  Modified Files:
 >  	src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh
 >  
 >  Log Message:
 >  tests: add missing ifconfig -w
 >  
 >  This change mitigates PR kern/54897.
 >  
 >  
 >  To generate a diff of this commit:
 >  cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
 >      src/tests/net/ipsec/t_ipsec_l2tp.sh

 This made no difference as far as I can see.  In the first test run
 of my testbed after the above commit, four ipsec test cases failed:

   http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.02.html#2020.02.17.08.46.10

 which is the same number as in the last run before the commit
 (2020.02.17.06.32.46).
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54897 CVS commit: src/sys/rump/net/lib/libshmif
Date: Thu, 20 Feb 2020 08:06:16 +0000

 Module Name:	src
 Committed By:	ozaki-r
 Date:		Thu Feb 20 08:06:15 UTC 2020

 Modified Files:
 	src/sys/rump/net/lib/libshmif: if_shmem.c

 Log Message:
 shmif: use cprng_strong32 to generate random bytes for a MAC address

 cprng_fast32 sometimes returns identical bytes, which appear as
 "20:0e:11:33" in a MAC address, on different rump_server instances.
 That leads to MAC address duplication, resulting in a test failure.

 Fix it by using cprng_strong32 instead of cprng_fast32.  However
 we should rather fix cprng_fast32 (or rump itself) somehow.

 The fix mitigates PR kern/54897 but test failures due to other
 causes still remain.


 To generate a diff of this commit:
 cvs rdiff -u -r1.77 -r1.78 src/sys/rump/net/lib/libshmif/if_shmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
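
When chasing this kind of failure, duplicated MAC addresses can be
spotted mechanically once the addresses have been collected.  The sketch
below is not part of the commit; the function name is hypothetical, and
how the addresses are gathered from each rump_server instance is left
open.  It only assumes one address per line on standard input.

```shell
# Hypothetical helper: read one MAC address per line on stdin and print
# every address that occurs more than once, compared case-insensitively.
find_duplicate_macs() {
    tr 'A-F' 'a-f' | sort | uniq -d
}
```

It prints nothing when all addresses are distinct, so an empty result
means no duplication was observed in that sample.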

From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Thu, 20 Feb 2020 17:30:24 +0900

 On Tue, Feb 18, 2020 at 3:46 AM Andreas Gustafsson <gson@gson.org> wrote:
 >
 > Ryota Ozaki wrote:
 > >  Module Name: src
 > >  Committed By:        ozaki-r
 > >  Date:                Mon Feb 17 08:46:10 UTC 2020
 > >
 > >  Modified Files:
 > >       src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh
 > >
 > >  Log Message:
 > >  tests: add missing ifconfig -w
 > >
 > >  This change mitigates PR kern/54897.
 > >
 > >
 > >  To generate a diff of this commit:
 > >  cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
 > >      src/tests/net/ipsec/t_ipsec_l2tp.sh
 >
 > This made no difference as far as I can see.  In the first test run
 > of my testbed after the above commit, four ipsec test cases failed:
 >
 >   http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.02.html#2020.02.17.08.46.10
 >
 > which is the same number as in the last run before the commit
 > (2020.02.17.06.32.46).

 Yes.  I was misguided.  The number of failures varies and is sometimes
 quite small.

 I found a real issue (hopefully correct this time): MAC address
 duplication.  shmif interfaces on rump kernels sometimes had an identical
 MAC address because cprng_fast32, which is used to generate MAC addresses,
 sometimes returned identical bytes.

 Fixing the issue should reduce the number of test failures but will not
 resolve them all.  Other cause(s) still remain.  (Maybe that is the real
 cause of the cprng_fast32 breakage on rump kernels.)

   ozaki-r

From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54897 CVS commit: src/sys/rump/net/lib/libshmif
Date: Tue, 25 Feb 2020 03:24:48 +0000

 Module Name:	src
 Committed By:	ozaki-r
 Date:		Tue Feb 25 03:24:48 UTC 2020

 Modified Files:
 	src/sys/rump/net/lib/libshmif: if_shmem.c

 Log Message:
 shmif: use cprng_strong64 instead of cprng_fast64 to generate a unique ID

 shmif uses random bytes generated by cprng(9) as a unique device ID
 between rump kernels to identify packets fed by itself and not receive
 them.  So if generated bytes are identical between shmif interfaces on
 different rump kernels, shmif may drop incoming packets unintentionally.
 This is one cause of recent ATF test failures of IPsec.

 Fix it by using cprng_strong64 instead of cprng_fast64.  This is a
 workaround, and we should also investigate why cprng_fast64 started
 failing on rump kernels, although using cprng_strong64 in initialization
 is itself feasible.

 Fix PR kern/54897


 To generate a diff of this commit:
 cvs rdiff -u -r1.78 -r1.79 src/sys/rump/net/lib/libshmif/if_shmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 25 Jul 2021 02:09:37 +0000
State-Changed-Why:
Is this fixed?


From: Andreas Gustafsson <gson@gson.org>
To: dholland@NetBSD.org, gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Sun, 25 Jul 2021 12:26:03 +0300

 dholland@NetBSD.org wrote:
 > Is this fixed?

 As the latest update said, "it's a workaround", not a fix.  Per the
 cprng_fast64 man page, the only difference between cprng_fast64 and
 cprng_strong64 is supposed to be that the latter provides forward
 secrecy.  That difference should not affect the probability of
 duplicate outputs, so if the test is failing with cprng_fast64
 but passing with cprng_strong64, that suggests there is something
 seriously wrong with cprng_fast64, and this still needs to be
 investigated.
 -- 
 Andreas Gustafsson, gson@gson.org
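
To put the point about duplicate outputs into numbers, here is a rough
sketch (not from the thread) of the union bound on a collision among n
independent, uniformly distributed values of a given bit width.  The
function name is hypothetical, and the inputs are assumptions: 32 or 64
bits to match cprng_fast32 and cprng_fast64, and n as the number of
shmif interfaces drawing a value.

```shell
# Upper bound on the probability that any two of n independent, uniform
# values of the given bit width coincide: n * (n - 1) / 2 / 2^bits.
# Usage: collision_bound n bits
collision_bound() {
    awk -v n="$1" -v bits="$2" \
        'BEGIN { printf "%.3g\n", n * (n - 1) / 2 / 2 ^ bits }'
}
```

For two interfaces, collision_bound 2 32 prints 2.33e-10 and
collision_bound 2 64 prints 5.42e-20.  A correctly working generator
would therefore almost never produce the duplicates described in this
PR, with or without forward secrecy.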

State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 27 Jul 2021 03:35:40 +0000
State-Changed-Why:
not fixed.


>Unformatted:
