NetBSD Problem Report #55598

From gson@gson.org  Sat Aug 22 13:31:57 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 56E721A9249
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 22 Aug 2020 13:31:57 +0000 (UTC)
Message-Id: <20200822133152.21091253F75@guava.gson.org>
Date: Sat, 22 Aug 2020 16:31:52 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: ChaCha self-test sometimes fails on evbarm-earmv7hf testbed
X-Send-Pr-Version: 3.95

>Number:         55598
>Category:       port-arm
>Synopsis:       ChaCha self-test sometimes fails on evbarm-earmv7hf testbed
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    riastradh
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Aug 22 13:35:00 +0000 2020
>Closed-Date:    Sun Aug 23 16:41:43 +0000 2020
>Last-Modified:  Sun Aug 23 16:41:43 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:
>Environment:
System: NetBSD
Architecture: earmv7hf
Machine: evbarm
>Description:

The ChaCha self-test is randomly failing on the qemu-based TNF
evbarm-earmv7hf testbed, logging this message to the console:

  chacha: self-test failed: ARM NEON ChaCha

Logs:

  http://releng.netbsd.org/b5reports/evbarm-earmv7hf/2020/2020.08.04.04.22.02/test.log
  http://releng.netbsd.org/b5reports/evbarm-earmv7hf/2020/2020.08.19.22.47.09/test.log
  http://releng.netbsd.org/b5reports/evbarm-earmv7hf/2020/2020.08.20.11.09.56/test.log

>How-To-Repeat:

Boot NetBSD-current/evbarm-earmv7hf in qemu repeatedly.

>Fix:

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: port-arm-maintainer->riastradh
Responsible-Changed-By: riastradh@NetBSD.org
Responsible-Changed-When: Sat, 22 Aug 2020 17:25:10 +0000
Responsible-Changed-Why:
mine


From: Taylor R Campbell <campbell@mumble.net>
To: gnats-bugs@netbsd.org
Cc: port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: port-arm/55598: ChaCha self-test sometimes fails on evbarm-earmv7hf testbed
Date: Sat, 22 Aug 2020 17:35:15 +0000

 Are these the only instances where it failed?  Are there other
 instances where it succeeded, when running exactly the same software?
 (There have been some changes to sys/crypto/chacha over the time
 period covered by those three dates.)  Does tests/sys/crypto/chacha
 fail randomly if repeated?

From: Andreas Gustafsson <gson@gson.org>
To: riastradh@netbsd.org,
Cc: gnats-bugs@netbsd.org
Subject: Re: port-arm/55598: ChaCha self-test sometimes fails on evbarm-earmv7hf testbed
Date: Sat, 22 Aug 2020 22:50:41 +0300

 Taylor,

 You wrote:
 >  Are these the only instances where it failed?

 Almost - there have been four instances so far (counting a case where
 the system rebooted and then failed again as one, not two).

 > Are there other instances where it succeeded, when running exactly
 > the same software?

 That question is only meaningful if the exact same software has been
 tested more than once, but the testbed doesn't do that.

 It's probably more useful to consider the full set of outcomes from
 the last 25 test runs (these include all the failures).  A zero
 in the rightmost column means success, nonzero means failure:

   lyta /bracket/evbarm-earmv7hf/results $ zgrep -c 'chacha: self-test failed' 2020/*/test.log.gz | tail -25
   2020/2020.08.02.01.36.46/test.log.gz:0
   2020/2020.08.02.10.49.53/test.log.gz:0
   2020/2020.08.03.07.25.59/test.log.gz:0
   2020/2020.08.03.20.43.41/test.log.gz:0
   2020/2020.08.04.04.22.02/test.log.gz:1
   2020/2020.08.04.14.35.24/test.log.gz:0
   2020/2020.08.05.01.58.42/test.log.gz:0
   2020/2020.08.05.10.33.01/test.log.gz:0
   2020/2020.08.05.16.36.33/test.log.gz:0
   2020/2020.08.06.07.38.54/test.log.gz:0
   2020/2020.08.06.16.23.52/test.log.gz:0
   2020/2020.08.07.00.43.52/test.log.gz:0
   2020/2020.08.07.07.19.45/test.log.gz:0
   2020/2020.08.07.23.31.07/test.log.gz:0
   2020/2020.08.08.19.39.28/test.log.gz:1
   2020/2020.08.17.09.03.03/test.log.gz:0
   2020/2020.08.17.19.41.32/test.log.gz:0
   2020/2020.08.18.03.02.50/test.log.gz:0
   2020/2020.08.18.14.32.34/test.log.gz:0
   2020/2020.08.18.23.31.50/test.log.gz:0
   2020/2020.08.19.09.22.05/test.log.gz:0
   2020/2020.08.19.22.47.09/test.log.gz:2
   2020/2020.08.20.11.09.56/test.log.gz:1
   2020/2020.08.21.17.51.31/test.log.gz:0
   2020/2020.08.22.01.43.07/test.log.gz:0

 >  (There have been some changes to sys/crypto/chacha over the time
 >  period covered by those three dates.)  Does tests/sys/crypto/chacha
 >  fail randomly if repeated?

 That would have to be tested separately, but I'm not sure I see the
 point since the reported issue is not the failure of those tests but
 of the kernel's built in self test, and we already know the tests in
 sys/crypt/chacha passed in all 25 of the above runs.  If you think
 testing this will actually yield some useful information, which
 version would you like me to repeatedly test?
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: riastradh@netbsd.org
Cc: gnats-bugs@netbsd.org
Subject: Re: port-arm/55598: ChaCha self-test sometimes fails on evbarm-earmv7hf testbed
Date: Sat, 22 Aug 2020 23:05:59 +0300

 I wrote:
 > the tests in sys/crypt/chacha passed

 I meant "the tests in tests/sys/crypto/chacha passed".
 -- 
 Andreas Gustafsson, gson@gson.org

From: Taylor R Campbell <campbell@mumble.net>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-arm/55598: ChaCha self-test sometimes fails on evbarm-earmv7hf testbed
Date: Sat, 22 Aug 2020 23:35:33 +0000

 > Date: Sat, 22 Aug 2020 22:50:41 +0300
 > From: Andreas Gustafsson <gson@gson.org>
 >=20
 > It's probably more useful to consider the full set of outcomes from
 > the last 25 test runs (these include all the failures).  A zero
 > in the rightmost column means success, nonzero means failure:
 >=20
 >   lyta /bracket/evbarm-earmv7hf/results $ zgrep -c 'chacha: self-test fai=
 led' 2020/*/test.log.gz | tail -25
 >   2020/2020.08.02.01.36.46/test.log.gz:0
 >   [...]

 Cool, thanks.

 > >  (There have been some changes to sys/crypto/chacha over the time
 > >  period covered by those three dates.)  Does tests/sys/crypto/chacha
 > >  fail randomly if repeated?
 >=20
 > That would have to be tested separately, but I'm not sure I see the
 > point since the reported issue is not the failure of those tests but
 > of the kernel's built in self test, and we already know the tests in
 > [tests/sys/crypto/chacha] passed in all 25 of the above runs.

 tests/sys/crypto/chacha runs the kernel self-test code in userland.

 There are some small differences:

 - The kernel is built with -mfloat-abi=3Dsoft (and the ChaCha code with
   -mfloat-abi=3Dsoftfp), whereas in earmv7hf the userland is built with
   -mfloat-abi=3Dhard.

   However, this shouldn't make much of a difference for the ChaCha
   code, because there are differences in the first few blocks of
   output, which are produced without any vector parameter-passing,
   which is the only way that -mfloat-abi=3Dhard and -mfloat-abi=3Dsoftfp
   differ.

 - The kernel turns the fpu on and off around the crypto code, so the
   fpu state management is slightly different from userland.

 - The kernel code is built with a kludgey arm_neon.h NEON intrinsics
   header file (sys/crypto/chacha/arch/arm/arm_neon.h) while the
   userland code is built with the compiler's native arm_neon.h.

   However, this also shouldn't make much of a difference because the
   first few blocks are generated by a hand-written assembly routine
   rather than NEON intrinsics in C.

 So repeatedly running the userland t_chacha test may help to narrow
 down whether the problem might lie with these differences, if (say) a
 70 runs all succeed (if the failure rate is 5/26, the probability of
 no failures in 70 trials is below one in a million); or whether the
 problem lies with the ChaCha code itself, if the failure happens in
 userland too.

 Separately, I could add a path -- probably a sysctl knob -- by which
 to re-run the self-tests in the kernel without having to reboot.
 Would that be convenient for you to try via patch, or what would be
 the most convenient way to test this?

 > If you think testing this will actually yield some useful
 > information, which version would you like me to repeatedly test?

 What is currently in HEAD should be fine, since the last change to the
 ChaCha code and the fpu state management code was before the
 2020.08.19.22.47.09 and 2020.08.20.11.09.56 test runs that failed.  If
 you can conveniently run from those, though, that wouldn't hurt.

From: Taylor R Campbell <campbell@mumble.net>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-arm/55598: ChaCha self-test sometimes fails on evbarm-earmv7hf testbed
Date: Sun, 23 Aug 2020 01:02:11 +0000

 > Date: Sat, 22 Aug 2020 23:35:33 +0000
 > From: Taylor R Campbell <campbell@mumble.net>
 > 
 > Separately, I could add a path -- probably a sysctl knob -- by which
 > to re-run the self-tests in the kernel without having to reboot.
 > Would that be convenient for you to try via patch, or what would be
 > the most convenient way to test this?

 I did this and -- with low probability -- I'm able to reproduce it on
 my beaglebone!  I will investigate.

From: Andreas Gustafsson <gson@gson.org>
To: Taylor R Campbell <campbell@mumble.net>
Cc: gnats-bugs@netbsd.org
Subject: Re: port-arm/55598: ChaCha self-test sometimes fails on evbarm-earmv7hf testbed
Date: Sun, 23 Aug 2020 12:11:08 +0300

 Taylor R Campbell wrote:
 > > If you think testing this will actually yield some useful
 > > information, which version would you like me to repeatedly test?
 > 
 > What is currently in HEAD should be fine, since the last change to the
 > ChaCha code and the fpu state management code was before the
 > 2020.08.19.22.47.09 and 2020.08.20.11.09.56 test runs that failed.  If
 > you can conveniently run from those, though, that wouldn't hurt.

 I ran t_chacha in a loop overnight on a system built from
 2020.08.20.11.09.56 sources, under qemu on lyta.  When I stopped
 it, it had run 4,558 times and they all succeeded.
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55598 CVS commit: src/sys/crypto/chacha/arch/arm
Date: Sun, 23 Aug 2020 16:39:06 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sun Aug 23 16:39:06 UTC 2020

 Modified Files:
 	src/sys/crypto/chacha/arch/arm: chacha_neon_32.S

 Log Message:
 Adjust sp, not fp, to allocate a 32-byte temporary.

 Costs another couple MOV instructions, but we can't skimp on this --
 there's no red zone below sp for interrupts on arm, so we can't touch
 anything there.  So just use fp to save sp and then adjust sp itself,
 rather than using fp as a temporary register to point just below sp.

 Should fix PR port-arm/55598 -- previously the ChaCha self-test
 failed 33/10000 trials triggered by sysctl during running system;
 with the patch it has failed 0/10000 trials.

 (Presumably it happened more often at boot time, leading to 5/26
 failures in the test bed, because we just enabled interrupts and some
 devices are starting to deliver interrupts.)


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/sys/crypto/chacha/arch/arm/chacha_neon_32.S

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sun, 23 Aug 2020 16:41:43 +0000
State-Changed-Why:
Pretty sure this is fixed; feel free to re-open if you see it again.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.