NetBSD Problem Report #54897

From gson@gson.org  Sun Jan 26 16:34:31 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 5EB8F7A17C
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 26 Jan 2020 16:34:31 +0000 (UTC)
Message-Id: <20200126163425.EC269253F50@guava.gson.org>
Date: Sun, 26 Jan 2020 18:34:25 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: ipsec tests now fail randomly on real hardware
X-Send-Pr-Version: 3.95

>Number:         54897
>Category:       kern
>Synopsis:       ipsec tests now fail randomly on real hardware
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    thorpej
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jan 26 16:35:00 +0000 2020
>Closed-Date:    Thu Aug 15 17:58:40 +0000 2024
>Last-Modified:  Thu Aug 15 18:10:00 +0000 2024
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2020.01.02.15.42.27
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

Various ipsec tests are now failing randomly on real hardware where
they used to reliably pass.  The problem started with these commits:

  2020.01.02.15.42.26 thorpej src/sys/compat/common/kern_info_43.c 1.38
  2020.01.02.15.42.26 thorpej src/sys/compat/common/kern_time_50.c 1.34
  2020.01.02.15.42.26 thorpej src/sys/compat/netbsd32/netbsd32_sysctl.c 1.41
  2020.01.02.15.42.26 thorpej src/sys/external/bsd/drm2/include/linux/ktime.h 1.8
  2020.01.02.15.42.26 thorpej src/sys/fs/nfs/common/nfs_lock.c 1.3
  2020.01.02.15.42.27 thorpej src/sys/kern/init_main.c 1.517
  2020.01.02.15.42.27 thorpej src/sys/kern/init_sysctl.c 1.223
  2020.01.02.15.42.27 thorpej src/sys/kern/kern_rndq.c 1.96
  2020.01.02.15.42.27 thorpej src/sys/kern/kern_tc.c 1.54
  2020.01.02.15.42.27 thorpej src/sys/kern/kern_time.c 1.203
  2020.01.02.15.42.27 thorpej src/sys/miscfs/fdesc/fdesc_vnops.c 1.131
  2020.01.02.15.42.27 thorpej src/sys/miscfs/kernfs/kernfs.h 1.41
  2020.01.02.15.42.27 thorpej src/sys/miscfs/kernfs/kernfs_vnops.c 1.162
  2020.01.02.15.42.27 thorpej src/sys/miscfs/procfs/procfs_linux.c 1.80
  2020.01.02.15.42.27 thorpej src/sys/nfs/nfs_serv.c 1.178
  2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/cons.c 1.9
  2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/emul.c 1.195
  2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/rump.c 1.339
  2020.01.02.15.42.27 thorpej src/sys/sys/kernel.h 1.32
  2020.01.02.15.42.27 thorpej src/sys/sys/timevar.h 1.40

At least amd64 and i386 are affected.  Log output from a
representative recent failure on amd64 is at:

  http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.01.25.00.12.42/test.html#net_ipsec_t_ipsec_tcp_ipsec_tcp_ipv4mappedipv6_esp_rijndaelcbc

>How-To-Repeat:

I used the following script to reproduce the issue on i386:

  set -x
  cd /usr/tests/net/ipsec
  nfail=0
  for i in $(seq 100)
  do
      atf-run t_ipsec_tunnel >>log 2>&1 || nfail=$(expr $nfail + 1)
  done
  echo $nfail

This printed 0 on a system built from source date 2020.01.02.14.33.55
(immediately before the above commits), and 23 on a system built from
source date 2020.01.02.15.42.27 (immediately after the above commits).
In other words, before the commits, the t_ipsec_tunnel test program
failed 0 times out of 100, and after them, it failed 23 times out of
100.
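
As a rough sanity check on these counts (not part of the original
report), the Python sketch below estimates how likely 0 failures in 100
runs would be if the pre-commit build really had the same failure rate
as the post-commit one.  It assumes the runs are independent and takes
the observed 23/100 as the true per-run failure probability; both are
assumptions.

```python
# Illustrative only: assumes independent runs and that the observed
# post-commit rate (23 of 100) is the true per-run failure probability.
runs = 100
fail_rate = 23 / 100

# Probability that all 100 runs pass at that failure rate.
p_all_pass = (1 - fail_rate) ** runs
print(f"P(0 failures in {runs} runs) = {p_all_pass:.2e}")
```

Under those assumptions the probability is vanishingly small, so the
difference between the two builds is very unlikely to be noise.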

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->thorpej
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Sun, 26 Jan 2020 16:36:45 +0000
Responsible-Changed-Why:
Over to committer.


From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54897 CVS commit: src/tests/net/ipsec
Date: Mon, 17 Feb 2020 08:46:10 +0000

 Module Name:	src
 Committed By:	ozaki-r
 Date:		Mon Feb 17 08:46:10 UTC 2020

 Modified Files:
 	src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh

 Log Message:
 tests: add missing ifconfig -w

 This change mitigates PR kern/54897.


 To generate a diff of this commit:
 cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
     src/tests/net/ipsec/t_ipsec_l2tp.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: gnats-bugs@netbsd.org, Jason Thorpe <thorpej@netbsd.org>, Paul Goyette <paul@whooppee.com>
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Mon, 17 Feb 2020 20:46:09 +0200

 Ryota Ozaki wrote:
 >  Module Name:	src
 >  Committed By:	ozaki-r
 >  Date:		Mon Feb 17 08:46:10 UTC 2020
 >  
 >  Modified Files:
 >  	src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh
 >  
 >  Log Message:
 >  tests: add missing ifconfig -w
 >  
 >  This change mitigates PR kern/54897.
 >  
 >  
 >  To generate a diff of this commit:
 >  cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
 >      src/tests/net/ipsec/t_ipsec_l2tp.sh

 This made no difference as far as I can see.  In the first test run
 of my testbed after the above commit, four ipsec test cases failed:

   http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.02.html#2020.02.17.08.46.10

 which is the same number as in the last run before the commit
 (2020.02.17.06.32.46).
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54897 CVS commit: src/sys/rump/net/lib/libshmif
Date: Thu, 20 Feb 2020 08:06:16 +0000

 Module Name:	src
 Committed By:	ozaki-r
 Date:		Thu Feb 20 08:06:15 UTC 2020

 Modified Files:
 	src/sys/rump/net/lib/libshmif: if_shmem.c

 Log Message:
 shmif: use cprng_strong32 to generate random bytes for a MAC address

 cprng_fast32 sometimes returns identical bytes (seen as, e.g.,
 "20:0e:11:33" within a MAC address) on different rump_server
 instances.  That leads to MAC address duplication, resulting in a
 test failure.

 Fix it by using cprng_strong32 instead of cprng_fast32.  However
 we should rather fix cprng_fast32 (or rump itself) somehow.

 The fix mitigates PR kern/54897 but test failures due to other
 causes still remain.


 To generate a diff of this commit:
 cvs rdiff -u -r1.77 -r1.78 src/sys/rump/net/lib/libshmif/if_shmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Thu, 20 Feb 2020 17:30:24 +0900

 On Tue, Feb 18, 2020 at 3:46 AM Andreas Gustafsson <gson@gson.org> wrote:
 >
 > Ryota Ozaki wrote:
 > >  Module Name: src
 > >  Committed By:        ozaki-r
 > >  Date:                Mon Feb 17 08:46:10 UTC 2020
 > >
 > >  Modified Files:
 > >       src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh
 > >
 > >  Log Message:
 > >  tests: add missing ifconfig -w
 > >
 > >  This change mitigates PR kern/54897.
 > >
 > >
 > >  To generate a diff of this commit:
 > >  cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
 > >      src/tests/net/ipsec/t_ipsec_l2tp.sh
 >
 > This made no difference as far as I can see.  In the first test run
 > of my testbed after the above commit, four ipsec test cases failed:
 >
 >   http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.02.html#2020.02.17.08.46.10
 >
 > which is the same number as in the last run before the commit
 > (2020.02.17.06.32.46).

 Yes.  I was misguided.  The number of failures varies and is sometimes
 quite small.

 I found a real issue (hopefully correct this time): MAC address
 duplication.  shmif interfaces on rump kernels sometimes had an
 identical MAC address because cprng_fast32, which is used to generate
 the MAC addresses, sometimes returned identical bytes.

 Fixing the issue would reduce the number of test failures but not resolve them
 all yet.  There still remain other cause(s).  (Maybe that is a real cause of
 cprng_fast32 breakage on rump kernels.)

   ozaki-r

From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/54897 CVS commit: src/sys/rump/net/lib/libshmif
Date: Tue, 25 Feb 2020 03:24:48 +0000

 Module Name:	src
 Committed By:	ozaki-r
 Date:		Tue Feb 25 03:24:48 UTC 2020

 Modified Files:
 	src/sys/rump/net/lib/libshmif: if_shmem.c

 Log Message:
 shmif: use cprng_strong64 instead of cprng_fast64 to generate a unique ID

 shmif uses random bytes generated by cprng(9) as a unique device ID
 between rump kernels to identify packets fed by itself and not receive
 them.  So if generated bytes are identical between shmif interfaces on
 different rump kernels, shmif may drop incoming packets unintentionally.
 This is one cause of recent ATF test failures of IPsec.
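
The self-filtering described above can be sketched with a toy model.
This is only an illustration of the idea, not the if_shmem.c code; the
Bus and Iface classes and the demo() helper are hypothetical names.

```python
# Toy model of a shared bus where each interface tags frames with a
# device ID and skips frames carrying its own ID.  Names are hypothetical;
# this is not the if_shmem.c implementation.
from dataclasses import dataclass, field


@dataclass
class Bus:
    frames: list = field(default_factory=list)  # (sender_id, payload) pairs


@dataclass
class Iface:
    dev_id: int
    bus: Bus

    def send(self, payload):
        self.bus.frames.append((self.dev_id, payload))

    def receive(self):
        # Deliver every frame that was not tagged with this interface's ID.
        return [p for sender, p in self.bus.frames if sender != self.dev_id]


def demo(id_a, id_b):
    """Interface a sends one frame; return what interface b receives."""
    bus = Bus()
    a, b = Iface(id_a, bus), Iface(id_b, bus)
    a.send("ping from a")
    return b.receive()


print(demo(0x1111, 0x2222))  # distinct IDs: b sees a's frame
print(demo(0x1111, 0x1111))  # colliding IDs: b discards a's frame as its own
```

With colliding IDs the receiver cannot tell the peer's frames from its
own, so they are silently dropped.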

 Fix it by using cprng_strong64 instead of cprng_fast64.  This is a
 workaround and we should also investigate why cprng_fast64 starts
 failing on rump kernels, although using cprng_strong64 in initialization
 itself is feasible.

 Fix PR kern/54897


 To generate a diff of this commit:
 cvs rdiff -u -r1.78 -r1.79 src/sys/rump/net/lib/libshmif/if_shmem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 25 Jul 2021 02:09:37 +0000
State-Changed-Why:
Is this fixed?


From: Andreas Gustafsson <gson@gson.org>
To: dholland@NetBSD.org, gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Sun, 25 Jul 2021 12:26:03 +0300

 dholland@NetBSD.org wrote:
 > Is this fixed?

 As the latest update said, "it's a workaround", not a fix.  Per the
 cprng_fast64 man page, the only difference between cprng_fast64 and
 cprng_strong64 is supposed to be that the latter provides forward
 secrecy.  That difference should not affect the probability of
 duplicate outputs, so if the test is failing with cprng_fast64
 but passing with cprng_strong64, that suggests there is something
 seriously wrong with cprng_fast64, and this still needs to be
 investigated.
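
For scale, here is a small Python sketch of the usual birthday
approximation for uniformly random values.  The interface count of 10
is a made-up example, and the formula is only accurate while the
resulting probability is small.

```python
def birthday_collision(k, bits):
    """Approximate P(at least one duplicate) among k uniform bits-wide values.

    Uses the first-order estimate k*(k-1)/2 / 2**bits, which is only
    accurate while the result is much smaller than 1.
    """
    return k * (k - 1) / 2 / 2 ** bits


# Hypothetical example: 10 interfaces drawing IDs of each width.
print("64-bit IDs:", birthday_collision(10, 64))
print("32-bit IDs:", birthday_collision(10, 32))
```

If a generator is working as documented, duplicates at these widths
should be essentially unobservable in a test run, whichever of the two
generators is used.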
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 27 Jul 2021 03:35:40 +0000
State-Changed-Why:
not fixed.


State-Changed-From-To: open->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 15 Aug 2024 17:58:40 +0000
State-Changed-Why:
Here's a theory about what happened, in netbsd-9 and 9.99.x before the
entropy rework:

1. rnd_init
2. rump_hyperentropy_init -> rnd_attach_source
3. cprng_init
4. kern_cprng = cprng_strong_create
   (a) draws from entropy pool
   (b) requests samples from sources (rndsource callback)
5. cprng_fast_init
   -> draws from kern_cprng
6. rnd_init_softint
   -> enters samples from sources into pool
7. ifconfig shmifN create
   -> cprng_fast32
      (a) draws from cprng_fast
      (b) schedules softint to reseed cprng_fast from cprng_strong

The first call to gather a hyperentropy sample is at 4(b).  But, even
though netbsd-6 through netbsd-9 have a fancy notification system
(rndsinks) to actively trigger reseeding, cprng_fast doesn't use it and
the hyperentropy sample doesn't get incorporated into cprng_fast until
7(b).  So the only samples that affect shmif at 7(a) are weak timing
samples and similar.  Hence high collision probability.
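
To illustrate why a weakly seeded generator makes collisions likely,
here is a small Python simulation.  It is not the NetBSD cprng code:
random.Random merely stands in for a fast generator, and the seed
widths (8 and 128 bits) and the instance count of 4 are arbitrary
assumptions.

```python
# Illustrative simulation, not the NetBSD cprng code.  random.Random stands
# in for a fast generator; seed_bits models how much entropy the seed has.
import random
import secrets


def first_outputs(instances, seed_bits):
    """Seed each instance independently and return its first 32-bit draw."""
    outs = []
    for _ in range(instances):
        seed = secrets.randbits(seed_bits)
        outs.append(random.Random(seed).getrandbits(32))
    return outs


def collision_rate(instances, seed_bits, trials=2000):
    """Fraction of trials in which at least two instances agree."""
    hits = 0
    for _ in range(trials):
        outs = first_outputs(instances, seed_bits)
        if len(set(outs)) < len(outs):
            hits += 1
    return hits / trials


print("weak seed (8 bits):    ", collision_rate(4, 8))
print("strong seed (128 bits):", collision_rate(4, 128))
```

With four instances and only 8 bits of seed entropy, duplicates should
appear in on the order of a couple of percent of trials, while a
well-seeded generator should show essentially none.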

Here's how it works differently in netbsd-10:

1. rnd_init
2. rump_hyperentropy_init -> rnd_attach_source
   -> request samples from sources (rndsource callback)
   -> enters samples from sources into pool
3. cprng_init
   -> kern_cprng = cprng_strong_create
   -> draws from entropy pool
4. (no separate step 4, kern_cprng creation happens inside cprng_init)
5. cprng_fast_init (doesn't draw anything)
6. rnd_init_softint
7. ifconfig shmifN create
   -> cprng_fast32
      (a) seeds cprng_fast with draw from cprng_strong
      (b) draws from cprng_fast

Note that in netbsd-10, except for samples entered in hard interrupt
context, rndsource samples are synchronously added to the pool -- and
that includes the samples entered by hyperentropy in the rndsource
callback.  So the first call to cprng_fast at 7(a) is always seeded
with hyperentropy samples.
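
The two orderings can be compressed into a toy Python model.  The
function names and the "timing" / "hyperentropy" sample labels are
placeholders, and the model ignores everything except when samples
reach the pool and when the fast generator takes its seed.

```python
# Toy model of the two boot orderings; names and sample labels are
# placeholders, not kernel identifiers.

def boot_netbsd9_like():
    pool = ["timing"]             # only weak samples are in the pool so far
    pending = ["hyperentropy"]    # requested at step 4(b), not yet entered
    fast_seed = tuple(pool)       # step 5: fast generator is seeded at init
    pool.extend(pending)          # step 6: softint enters the samples
    return fast_seed              # step 7(a): first draw uses the old seed


def boot_netbsd10_like():
    # Step 2: samples are entered synchronously when the source attaches.
    pool = ["timing", "hyperentropy"]
    fast_seed = None              # step 5: init draws nothing
    if fast_seed is None:         # step 7(a): seed lazily on first use
        fast_seed = tuple(pool)
    return fast_seed


print(boot_netbsd9_like())   # ('timing',)
print(boot_netbsd10_like())  # ('timing', 'hyperentropy')
```

In the first ordering the seed is taken before the good samples arrive;
in the second, the seed taken on first use already reflects them.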

So I think the underlying cause of this bug has been resolved in
netbsd-10 and current, but I don't think anyone has the appetite to
pull it up to netbsd-9.  Maybe there is a simpler change to netbsd-9
that would be worth applying, but the dependencies between all the
seeding components are very tricky and often lead to hard-to-diagnose
bugs early at boot when tweaked or reordered.

In any case, I think it is better to use cprng_strong here.  The names
are not apt, and if I were choosing them today, I would choose `cprng'
and `weakcprng', the idea being that you only use weakcprng if there is
a performance constraint that overrides security for some reason so you
can tolerate disclosure of past outputs.  For use in device attach or
creation, there is no such performance constraint, so cprng_strong is
the right choice.  So if anyone wants pullups, just changing to
cprng_strong is a better fix.


From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: thorpej@NetBSD.org, netbsd-bugs@NetBSD.org,
	gnats-admin@NetBSD.org, gson@gson.org (Andreas Gustafsson)
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Thu, 15 Aug 2024 18:07:09 +0000

 > Date: Thu, 15 Aug 2024 17:58:41 +0000 (UTC)
 > From: riastradh@NetBSD.org
 > 
 > State-Changed-From-To: open->closed

 (But feel free to reopen or change to needs-pullups if you think there
 is still anything to address here -- I don't mean to shut off further
 discussion by closing this.)

>Unformatted:
