NetBSD Problem Report #54897
From gson@gson.org Sun Jan 26 16:34:31 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 5EB8F7A17C
for <gnats-bugs@gnats.NetBSD.org>; Sun, 26 Jan 2020 16:34:31 +0000 (UTC)
Message-Id: <20200126163425.EC269253F50@guava.gson.org>
Date: Sun, 26 Jan 2020 18:34:25 +0200 (EET)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: ipsec tests now fail randomly on real hardware
X-Send-Pr-Version: 3.95
>Number: 54897
>Category: kern
>Synopsis: ipsec tests now fail randomly on real hardware
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: thorpej
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Jan 26 16:35:00 +0000 2020
>Closed-Date: Thu Aug 15 17:58:40 +0000 2024
>Last-Modified: Thu Aug 15 18:10:00 +0000 2024
>Originator: Andreas Gustafsson
>Release: NetBSD-current, source date >= 2020.01.02.15.42.27
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
Various ipsec tests are now failing randomly on real hardware where
they used to reliably pass. The problem started with these commits:
2020.01.02.15.42.26 thorpej src/sys/compat/common/kern_info_43.c 1.38
2020.01.02.15.42.26 thorpej src/sys/compat/common/kern_time_50.c 1.34
2020.01.02.15.42.26 thorpej src/sys/compat/netbsd32/netbsd32_sysctl.c 1.41
2020.01.02.15.42.26 thorpej src/sys/external/bsd/drm2/include/linux/ktime.h 1.8
2020.01.02.15.42.26 thorpej src/sys/fs/nfs/common/nfs_lock.c 1.3
2020.01.02.15.42.27 thorpej src/sys/kern/init_main.c 1.517
2020.01.02.15.42.27 thorpej src/sys/kern/init_sysctl.c 1.223
2020.01.02.15.42.27 thorpej src/sys/kern/kern_rndq.c 1.96
2020.01.02.15.42.27 thorpej src/sys/kern/kern_tc.c 1.54
2020.01.02.15.42.27 thorpej src/sys/kern/kern_time.c 1.203
2020.01.02.15.42.27 thorpej src/sys/miscfs/fdesc/fdesc_vnops.c 1.131
2020.01.02.15.42.27 thorpej src/sys/miscfs/kernfs/kernfs.h 1.41
2020.01.02.15.42.27 thorpej src/sys/miscfs/kernfs/kernfs_vnops.c 1.162
2020.01.02.15.42.27 thorpej src/sys/miscfs/procfs/procfs_linux.c 1.80
2020.01.02.15.42.27 thorpej src/sys/nfs/nfs_serv.c 1.178
2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/cons.c 1.9
2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/emul.c 1.195
2020.01.02.15.42.27 thorpej src/sys/rump/librump/rumpkern/rump.c 1.339
2020.01.02.15.42.27 thorpej src/sys/sys/kernel.h 1.32
2020.01.02.15.42.27 thorpej src/sys/sys/timevar.h 1.40
At least amd64 and i386 are affected. Log output from a
representative recent failure on amd64 is at:
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.01.25.00.12.42/test.html#net_ipsec_t_ipsec_tcp_ipsec_tcp_ipv4mappedipv6_esp_rijndaelcbc
>How-To-Repeat:
I used the following script to reproduce the issue on i386:
set -x
cd /usr/tests/net/ipsec
nfail=0
for i in $(seq 100)
do
    atf-run t_ipsec_tunnel >>log 2>&1 || nfail=$(expr $nfail + 1)
done
echo $nfail
This printed 0 on a system built from source date 2020.01.02.14.33.55
(immediately before the above commits), and 23 on a system built from
source date 2020.01.02.15.42.27 (immediately after the above commits).
In other words, before the commits, the t_ipsec_tunnel test program
failed 0 times out of 100, and after them, it failed 23 times out of
100.
>Fix:
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->thorpej
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Sun, 26 Jan 2020 16:36:45 +0000
Responsible-Changed-Why:
Over to committer.
From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54897 CVS commit: src/tests/net/ipsec
Date: Mon, 17 Feb 2020 08:46:10 +0000
Module Name: src
Committed By: ozaki-r
Date: Mon Feb 17 08:46:10 UTC 2020
Modified Files:
src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh
Log Message:
tests: add missing ifconfig -w
This change mitigates PR kern/54897.
To generate a diff of this commit:
cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
src/tests/net/ipsec/t_ipsec_l2tp.sh
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Andreas Gustafsson <gson@gson.org>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: gnats-bugs@netbsd.org, Jason Thorpe <thorpej@netbsd.org>, Paul Goyette <paul@whooppee.com>
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Mon, 17 Feb 2020 20:46:09 +0200
Ryota Ozaki wrote:
> Module Name: src
> Committed By: ozaki-r
> Date: Mon Feb 17 08:46:10 UTC 2020
>
> Modified Files:
> src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh
>
> Log Message:
> tests: add missing ifconfig -w
>
> This change mitigates PR kern/54897.
>
>
> To generate a diff of this commit:
> cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
> src/tests/net/ipsec/t_ipsec_l2tp.sh
This made no difference as far as I can see. In the first test run
of my testbed after the above commit, four ipsec test cases failed:
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.02.html#2020.02.17.08.46.10
which is the same number as in the last run before the commit
(2020.02.17.06.32.46).
--
Andreas Gustafsson, gson@gson.org
From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54897 CVS commit: src/sys/rump/net/lib/libshmif
Date: Thu, 20 Feb 2020 08:06:16 +0000
Module Name: src
Committed By: ozaki-r
Date: Thu Feb 20 08:06:15 UTC 2020
Modified Files:
src/sys/rump/net/lib/libshmif: if_shmem.c
Log Message:
shmif: use cprng_strong32 to generate random bytes for a MAC address
cprng_fast32 sometimes returns identical bytes, which look like
"20:0e:11:33" in a MAC address, on different rump_server instances.
That leads to MAC address duplication, resulting in a test failure.
Fix it by using cprng_strong32 instead of cprng_fast32. However
we should rather fix cprng_fast32 (or rump itself) somehow.
The fix mitigates PR kern/54897 but test failures due to other
causes still remain.
To generate a diff of this commit:
cvs rdiff -u -r1.77 -r1.78 src/sys/rump/net/lib/libshmif/if_shmem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Thu, 20 Feb 2020 17:30:24 +0900
On Tue, Feb 18, 2020 at 3:46 AM Andreas Gustafsson <gson@gson.org> wrote:
>
> Ryota Ozaki wrote:
> > Module Name: src
> > Committed By: ozaki-r
> > Date: Mon Feb 17 08:46:10 UTC 2020
> >
> > Modified Files:
> > src/tests/net/ipsec: t_ipsec_gif.sh t_ipsec_l2tp.sh
> >
> > Log Message:
> > tests: add missing ifconfig -w
> >
> > This change mitigates PR kern/54897.
> >
> >
> > To generate a diff of this commit:
> > cvs rdiff -u -r1.8 -r1.9 src/tests/net/ipsec/t_ipsec_gif.sh \
> > src/tests/net/ipsec/t_ipsec_l2tp.sh
>
> This made no difference as far as I can see. In the first test run
> of my testbed after the above commit, four ipsec test cases failed:
>
> http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.02.html#2020.02.17.08.46.10
>
> which is the same number as in the last run before the commit
> (2020.02.17.06.32.46).
Yes. I was misguided. The number of failures varies and is sometimes
quite small.
I found a real issue (hopefully correct this time): MAC address
duplication.
shmif interfaces on rump kernels sometimes had an identical MAC address
because cprng_fast32, which is used for MAC address generation, sometimes
returned identical bytes.
Fixing the issue would reduce the number of test failures but not resolve them
all yet. There still remain other cause(s). (Maybe that is a real cause of
cprng_fast32 breakage on rump kernels.)
ozaki-r
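One way to check for the duplication described above is to collect the MAC
address of each shmif interface and look for repeats. The following is only a
sketch: find_dup_macs is a hypothetical helper, not part of the test suite,
and it assumes the addresses have already been gathered into
"<label> <mac>" lines (for example, one line per rump_server socket).

```shell
#!/bin/sh
# Hypothetical helper (not part of src/tests): reads "<label> <mac>" pairs on
# stdin, one per line, and prints each MAC address that occurs under more
# than one label, followed by the labels that share it.
find_dup_macs() {
    awk '
        {
            # Remember which labels reported this MAC, in input order.
            labels[$2] = (labels[$2] == "" ? $1 : labels[$2] " " $1)
            count[$2]++
        }
        END {
            for (m in count)
                if (count[m] > 1)
                    print m ": " labels[m]
        }
    ' | sort
}
```

Fed the made-up lines "sock1 b2:a0:20:0e:11:33", "sock2 b2:a0:5c:01:9f:42"
and "sock3 b2:a0:20:0e:11:33", it prints "b2:a0:20:0e:11:33: sock1 sock3".
An empty result means no two interfaces shared an address.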
From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/54897 CVS commit: src/sys/rump/net/lib/libshmif
Date: Tue, 25 Feb 2020 03:24:48 +0000
Module Name: src
Committed By: ozaki-r
Date: Tue Feb 25 03:24:48 UTC 2020
Modified Files:
src/sys/rump/net/lib/libshmif: if_shmem.c
Log Message:
shmif: use cprng_strong64 instead of cprng_fast64 to generate a unique ID
shmif uses random bytes generated by cprng(9) as a unique device ID
between rump kernels to identify packets fed by itself and not receive
them. So if generated bytes are identical between shmif interfaces on
different rump kernels, shmif may drop incoming packets unintentionally.
This is one cause of recent ATF test failures of IPsec.
Fix it by using cprng_strong64 instead of cprng_fast64. This is a
workaround and we should also investigate why cprng_fast64 starts
failing on rump kernels, although using cprng_strong64 in initialization
itself is feasible.
Fix PR kern/54897
To generate a diff of this commit:
cvs rdiff -u -r1.78 -r1.79 src/sys/rump/net/lib/libshmif/if_shmem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->feedback
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sun, 25 Jul 2021 02:09:37 +0000
State-Changed-Why:
Is this fixed?
From: Andreas Gustafsson <gson@gson.org>
To: dholland@NetBSD.org, gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Sun, 25 Jul 2021 12:26:03 +0300
dholland@NetBSD.org wrote:
> Is this fixed?
As the latest update said, "it's a workaround", not a fix. Per the
cprng_fast64 man page, the only difference between cprng_fast64 and
cprng_strong64 is supposed to be that the latter provides forward
secrecy. That difference should not affect the probability of
duplicate outputs, so if the test is failing with cprng_fast64
but passing with cprng_strong64, that suggests there is something
seriously wrong with cprng_fast64, and this still needs to be
investigated.
--
Andreas Gustafsson, gson@gson.org
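To put a rough number on how unlikely duplicates should be for a healthy
generator, the sketch below computes the union bound on the collision
probability among N independent, uniformly random values of a given width.
The helper name collision_bound is made up for this illustration, and the
choice of 32 bits (four random MAC bytes) and of a handful of interfaces per
test are assumptions.

```shell
#!/bin/sh
# collision_bound N BITS
# Upper bound (union bound) on the probability that at least two of N
# independent, uniformly random BITS-bit values coincide: there are
# N*(N-1)/2 pairs, and each pair collides with probability 2^-BITS.
collision_bound() {
    awk -v n="$1" -v bits="$2" \
        'BEGIN { printf "%.3e\n", n * (n - 1) / 2 / (2 ^ bits) }'
}

collision_bound 2 32    # prints 2.328e-10
collision_bound 4 32    # prints 1.397e-09
```

Even allowing for several interfaces per test case, a bound of this size is
many orders of magnitude below the 23 failures in 100 runs reported in the
original description, which is consistent with the generator's output not
being uniformly random across rump kernels (though not every failure need
have been a collision).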
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Tue, 27 Jul 2021 03:35:40 +0000
State-Changed-Why:
not fixed.
State-Changed-From-To: open->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 15 Aug 2024 17:58:40 +0000
State-Changed-Why:
Here's a theory about what happened, in netbsd-9 and 9.99.x before the
entropy rework:
1. rnd_init
2. rump_hyperentropy_init -> rnd_attach_source
3. cprng_init
4. kern_cprng = cprng_strong_create
   (a) draws from entropy pool
   (b) requests samples from sources (rndsource callback)
5. cprng_fast_init
   -> draws from kern_cprng
6. rnd_init_softint
   -> enters samples from sources into pool
7. ifconfig shmifN create
   -> cprng_fast32
      (a) draws from cprng_fast
      (b) schedules softint to reseed cprng_fast from cprng_strong
The first call to gather a hyperentropy sample is at 4(b). But, even
though netbsd-6 through netbsd-9 have a fancy notification system
(rndsinks) to actively trigger reseeding, cprng_fast doesn't use it and
the hyperentropy sample doesn't get incorporated into cprng_fast until
7(b). So the only samples that affect shmif at 7(a) are weak timing
samples and similar. Hence high collision probability.
Here's how it works differently in netbsd-10:
1. rnd_init
2. rump_hyperentropy_init -> rnd_attach_source
   -> request samples from sources (rndsource callback)
   -> enters samples from sources into pool
3. cprng_init
   -> kern_cprng = cprng_strong_create
   -> draws from entropy pool
4. (no separate step 4, kern_cprng creation happens inside cprng_init)
5. cprng_fast_init (doesn't draw anything)
6. rnd_init_softint
7. ifconfig shmifN create
   -> cprng_fast32
      (a) seeds cprng_fast with draw from cprng_strong
      (b) draws from cprng_fast
Note that in netbsd-10, except for samples entered in hard interrupt
context, rndsource samples are synchronously added to the pool -- and
that includes the samples entered by hyperentropy in the rndsource
callback. So the first call to cprng_fast at 7(a) is always seeded
with hyperentropy samples.
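The effect of the two orderings can be shown with a toy model. This is not
kernel code and does not use any cprng(9) interface: cksum merely stands in
for a deterministic generator, and the two function names and their arguments
are invented for the illustration. The point is only that output drawn before
the host entropy reaches the generator depends on the weak sample alone.

```shell
#!/bin/sh
# Toy model only. cksum stands in for a deterministic generator; its input
# stands in for the seed material that has reached the generator by the
# time of the first draw.

# netbsd-9-like order: the first draw happens before the reseed, so only the
# weak sample (e.g. a coarse timestamp) influences the output. The second
# argument (host entropy) is accepted but deliberately ignored.
first_draw_before_reseed() {
    printf '%s' "$1" | cksum | cut -d ' ' -f 1
}

# netbsd-10-like order: host entropy is already part of the seed when the
# first draw is made.
first_draw_after_seed() {
    printf '%s:%s' "$1" "$2" | cksum | cut -d ' ' -f 1
}
```

Two instances started with the same weak sample but different host entropy
give identical output from first_draw_before_reseed and different output from
first_draw_after_seed, which mirrors the collision behavior in the two lists
above.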
So I think the underlying cause of this bug has been resolved in
netbsd-10 and current, but I don't think anyone has the appetite to
pull it up to netbsd-9. Maybe there is a simpler change to netbsd-9
that would be worth applying, but the dependencies between all the
seeding components are very tricky and often lead to hard-to-diagnose
bugs early at boot when tweaked or reordered.
In any case, I think it is better to use cprng_strong here. The names
are not apt, and if I were choosing them today, I would choose `cprng'
and `weakcprng', the idea being that you only use weakcprng if there is
a performance constraint that overrides security for some reason so you
can tolerate disclosure of past outputs. For use in device attach or
creation, there is no such performance constraint, so cprng_strong is
the right choice. So if anyone wants pullups, just changing to
cprng_strong is a better fix.
From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: thorpej@NetBSD.org, netbsd-bugs@NetBSD.org,
gnats-admin@NetBSD.org, gson@gson.org (Andreas Gustafsson)
Subject: Re: kern/54897 (ipsec tests now fail randomly on real hardware)
Date: Thu, 15 Aug 2024 18:07:09 +0000
> Date: Thu, 15 Aug 2024 17:58:41 +0000 (UTC)
> From: riastradh@NetBSD.org
>
> State-Changed-From-To: open->closed
(But feel free to reopen or change to needs-pullups if you think there
is still anything to address here -- I don't mean to shut off further
discussion by closing this.)
>Unformatted:
Copyright © 1994-2024
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.