NetBSD Problem Report #43997

From paul@whooppee.com  Sun Oct 24 13:04:24 2010
Return-Path: <paul@whooppee.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id B050763BA4B
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 24 Oct 2010 13:04:24 +0000 (UTC)
Message-Id: <20101024130413.B188C94A3F@quicky.whooppee.com>
Date: Sun, 24 Oct 2010 06:04:13 -0700 (PDT)
From: paul@whooppee.com
Reply-To: paul@whooppee.com
To: gnats-bugs@gnats.NetBSD.org
Subject: Kernel timer discrepancies
X-Send-Pr-Version: 3.95

>Number:         43997
>Notify-List:    gson@gson.org
>Category:       kern
>Synopsis:       Kernel timer discrepancies
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Oct 24 13:05:00 +0000 2010
>Last-Modified:  Sat Sep 29 13:30:00 +0000 2018
>Originator:     Paul Goyette
>Release:        NetBSD 5.99.39
>Organization:
-------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-------------------------------------------------------------------------
>Environment:


System: NetBSD quicky.whooppee.com 5.99.39 NetBSD 5.99.39 (WHOOPPEE 2010-09-12 22:12:41) #0: Thu Sep 30 21:01:33 PDT 2010 paul@speedy.whooppee.com:/build-out/netbsd-local/obj/amd64/sys/arch/amd64/compile/WHOOPPEE amd64
Architecture: x86_64
Machine: amd64
>Description:

The routines sleep(3), usleep(3), and nanosleep(2) wake-up based on the 
occurrence of clock ticks.  However, the timer interrupt routine determines
the actual absolute time.  Normally this isn't an issue, but on systems 
that can "miss" interrupts it can result in significant discrepancies.

One such system is qemu, frequently used for providing a system test-bed
environment.  On (at least) amd64 architecture, qemu cannot simulate clock
interrupts at 100Hz.  Therefore, a simple "date ; sleep 5; date" command
actually requires 10 seconds to complete!

This problem was first noticed while debugging atf tests for the swwdog(4)
driver.  The test program used sleep() to schedule tickling of the wdog,
while the kernel uses a callout() with an absolute time to expire the
wdog timer.  The test program attempted to wait to the middle of the timer
interval, but due to this time discrepancy the wait time was doubled and
the timer expired before the wdog could be tickled.
>How-To-Repeat:

See above.
>Fix:


>Release-Note:

>Audit-Trail:
From: "Paul Goyette" <pgoyette@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43997 CVS commit: src/tests/dev/sysmon
Date: Sun, 24 Oct 2010 13:16:13 +0000

 Module Name:	src
 Committed By:	pgoyette
 Date:		Sun Oct 24 13:16:12 UTC 2010

 Modified Files:
 	src/tests/dev/sysmon: t_swwdog.c

 Log Message:
 Due to timing discrepancies desribed in PR kern/43997, tickle the
 watchdog just a little bit sooner.  This maintains the spirit of the
 test while avoiding the actual bug.  This can be reverted once (if)
 the PR is fixed.


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/tests/dev/sysmon/t_swwdog.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Julio Merino" <jmmv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43997 CVS commit: src/tests/lib/libc/gen
Date: Sun, 17 Mar 2013 05:47:48 +0000

 Module Name:	src
 Committed By:	jmmv
 Date:		Sun Mar 17 05:47:48 UTC 2013

 Modified Files:
 	src/tests/lib/libc/gen: t_sleep.c

 Log Message:
 Do not special-case qemu when expecting the failure due to PR kern/43997.

 I am sporadically observing this in my real machine as well.  It's harder
 to trigger, but it happens.


 To generate a diff of this commit:
 cvs rdiff -u -r1.5 -r1.6 src/tests/lib/libc/gen/t_sleep.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Andreas Gustafsson" <gson@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43997 CVS commit: src/tests/net/mpls
Date: Mon, 1 Sep 2014 06:38:35 +0000

 Module Name:	src
 Committed By:	gson
 Date:		Mon Sep  1 06:38:35 UTC 2014

 Modified Files:
 	src/tests/net/mpls: t_ldp_regen.sh

 Log Message:
 The t_ldp_regen test frequently fails under qemu, but reliably passes
 on real hardware.  The failures are hardly surprising given that qemu
 timing is off by a is off by a factor of two as reported in
 PR kern/43997.  Disabling the test on qemu for now; it should be
 re-enabled once 43997 has been fixed to see if it still fails then.


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/tests/net/mpls/t_ldp_regen.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Robert Elz" <kre@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43997 CVS commit: src/tests/net/net
Date: Sun, 2 Oct 2016 15:27:33 +0000

 Module Name:	src
 Committed By:	kre
 Date:		Sun Oct  2 15:27:32 UTC 2016

 Modified Files:
 	src/tests/net/net: t_ipv6_lifetime.sh

 Log Message:
 This test works fine on real hardware, but due to PR kern/43997 (qemu
 timing problems) fails when run under qemu.   Attempt to compensate
 for that (by skipping the problematic test case) when running in qemu.

 This should be reverted when the PR gets fixed (either in qemu or in
 the NetBSD kernel).


 To generate a diff of this commit:
 cvs rdiff -u -r1.3 -r1.4 src/tests/net/net/t_ipv6_lifetime.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Ryota Ozaki" <ozaki-r@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43997 CVS commit: src/tests/net
Date: Mon, 24 Jul 2017 02:07:43 +0000

 Module Name:	src
 Committed By:	ozaki-r
 Date:		Mon Jul 24 02:07:43 UTC 2017

 Modified Files:
 	src/tests/net: net_common.sh
 	src/tests/net/ipsec: t_ipsec_misc.sh

 Log Message:
 Skip ipsec_spi_*_*_preferred_new_timeout when running on qemu

 Probably due to PR 43997


 To generate a diff of this commit:
 cvs rdiff -u -r1.19 -r1.20 src/tests/net/net_common.sh
 cvs rdiff -u -r1.15 -r1.16 src/tests/net/ipsec/t_ipsec_misc.sh

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Soren Jacobsen" <snj@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/43997 CVS commit: [netbsd-8] src
Date: Sat, 21 Oct 2017 19:43:57 +0000

 Module Name:	src
 Committed By:	snj
 Date:		Sat Oct 21 19:43:56 UTC 2017

 Modified Files:
 	src/crypto/dist/ipsec-tools/src/setkey [netbsd-8]: parse.y token.l
 	src/distrib/sets/lists/tests [netbsd-8]: mi
 	src/doc [netbsd-8]: TODO.smpnet
 	src/sys/net [netbsd-8]: pfkeyv2.h raw_cb.c raw_cb.h raw_usrreq.c
 	    rtsock.c
 	src/sys/netinet [netbsd-8]: in_proto.c ip_input.c tcp_input.c
 	    tcp_output.c tcp_var.h
 	src/sys/netinet6 [netbsd-8]: icmp6.c in6_proto.c ip6_forward.c
 	    ip6_input.c ip6_output.c ip6protosw.h
 	src/sys/netipsec [netbsd-8]: ipsec.c ipsec.h ipsec6.h ipsec_input.c
 	    ipsec_netbsd.c ipsec_output.c ipsec_private.h key.c key.h
 	    key_debug.c key_debug.h keydb.h keysock.c keysock.h xform.h
 	    xform_ah.c xform_esp.c xform_ipcomp.c xform_ipip.c xform_tcp.c
 	src/sys/rump/librump/rumpkern [netbsd-8]: Makefile.rumpkern
 	src/sys/rump/librump/rumpnet [netbsd-8]: net_stub.c
 	src/sys/sys [netbsd-8]: protosw.h
 	src/tests/net [netbsd-8]: net_common.sh
 	src/tests/net/carp [netbsd-8]: t_basic.sh
 	src/tests/net/if_gif [netbsd-8]: t_gif.sh
 	src/tests/net/if_l2tp [netbsd-8]: t_l2tp.sh
 	src/tests/net/ipsec [netbsd-8]: Makefile algorithms.sh common.sh
 	    t_ipsec_ah_keys.sh t_ipsec_esp_keys.sh t_ipsec_gif.sh
 	    t_ipsec_l2tp.sh t_ipsec_misc.sh t_ipsec_transport.sh
 	    t_ipsec_tunnel.sh t_ipsec_tunnel_odd.sh
 	src/tests/net/mcast [netbsd-8]: t_mcast.sh
 	src/tests/net/net [netbsd-8]: t_ipaddress.sh
 	src/tests/net/npf [netbsd-8]: t_npf.sh
 	src/tests/net/route [netbsd-8]: t_flags.sh t_flags6.sh
 	src/usr.bin/netstat [netbsd-8]: fast_ipsec.c
 Added Files:
 	src/tests/net/ipsec [netbsd-8]: t_ipsec_sockopt.sh t_ipsec_tcp.sh
 	    t_ipsec_tunnel_ipcomp.sh

 Log Message:
 Pull up following revision(s) (requested by ozaki-r in ticket #300):
 	crypto/dist/ipsec-tools/src/setkey/parse.y: 1.19
 	crypto/dist/ipsec-tools/src/setkey/token.l: 1.20
 	distrib/sets/lists/tests/mi: 1.754, 1.757, 1.759
 	doc/TODO.smpnet: 1.12-1.13
 	sys/net/pfkeyv2.h: 1.32
 	sys/net/raw_cb.c: 1.23-1.24, 1.28
 	sys/net/raw_cb.h: 1.28
 	sys/net/raw_usrreq.c: 1.57-1.58
 	sys/net/rtsock.c: 1.228-1.229
 	sys/netinet/in_proto.c: 1.125
 	sys/netinet/ip_input.c: 1.359-1.361
 	sys/netinet/tcp_input.c: 1.359-1.360
 	sys/netinet/tcp_output.c: 1.197
 	sys/netinet/tcp_var.h: 1.178
 	sys/netinet6/icmp6.c: 1.213
 	sys/netinet6/in6_proto.c: 1.119
 	sys/netinet6/ip6_forward.c: 1.88
 	sys/netinet6/ip6_input.c: 1.181-1.182
 	sys/netinet6/ip6_output.c: 1.193
 	sys/netinet6/ip6protosw.h: 1.26
 	sys/netipsec/ipsec.c: 1.100-1.122
 	sys/netipsec/ipsec.h: 1.51-1.61
 	sys/netipsec/ipsec6.h: 1.18-1.20
 	sys/netipsec/ipsec_input.c: 1.44-1.51
 	sys/netipsec/ipsec_netbsd.c: 1.41-1.45
 	sys/netipsec/ipsec_output.c: 1.49-1.64
 	sys/netipsec/ipsec_private.h: 1.5
 	sys/netipsec/key.c: 1.164-1.234
 	sys/netipsec/key.h: 1.20-1.32
 	sys/netipsec/key_debug.c: 1.18-1.21
 	sys/netipsec/key_debug.h: 1.9
 	sys/netipsec/keydb.h: 1.16-1.20
 	sys/netipsec/keysock.c: 1.59-1.62
 	sys/netipsec/keysock.h: 1.10
 	sys/netipsec/xform.h: 1.9-1.12
 	sys/netipsec/xform_ah.c: 1.55-1.74
 	sys/netipsec/xform_esp.c: 1.56-1.72
 	sys/netipsec/xform_ipcomp.c: 1.39-1.53
 	sys/netipsec/xform_ipip.c: 1.50-1.54
 	sys/netipsec/xform_tcp.c: 1.12-1.16
 	sys/rump/librump/rumpkern/Makefile.rumpkern: 1.170
 	sys/rump/librump/rumpnet/net_stub.c: 1.27
 	sys/sys/protosw.h: 1.67-1.68
 	tests/net/carp/t_basic.sh: 1.7
 	tests/net/if_gif/t_gif.sh: 1.11
 	tests/net/if_l2tp/t_l2tp.sh: 1.3
 	tests/net/ipsec/Makefile: 1.7-1.9
 	tests/net/ipsec/algorithms.sh: 1.5
 	tests/net/ipsec/common.sh: 1.4-1.6
 	tests/net/ipsec/t_ipsec_ah_keys.sh: 1.2
 	tests/net/ipsec/t_ipsec_esp_keys.sh: 1.2
 	tests/net/ipsec/t_ipsec_gif.sh: 1.6-1.7
 	tests/net/ipsec/t_ipsec_l2tp.sh: 1.6-1.7
 	tests/net/ipsec/t_ipsec_misc.sh: 1.8-1.18
 	tests/net/ipsec/t_ipsec_sockopt.sh: 1.1-1.2
 	tests/net/ipsec/t_ipsec_tcp.sh: 1.1-1.2
 	tests/net/ipsec/t_ipsec_transport.sh: 1.5-1.6
 	tests/net/ipsec/t_ipsec_tunnel.sh: 1.9
 	tests/net/ipsec/t_ipsec_tunnel_ipcomp.sh: 1.1-1.2
 	tests/net/ipsec/t_ipsec_tunnel_odd.sh: 1.3
 	tests/net/mcast/t_mcast.sh: 1.6
 	tests/net/net/t_ipaddress.sh: 1.11
 	tests/net/net_common.sh: 1.20
 	tests/net/npf/t_npf.sh: 1.3
 	tests/net/route/t_flags.sh: 1.20
 	tests/net/route/t_flags6.sh: 1.16
 	usr.bin/netstat/fast_ipsec.c: 1.22
 Do m_pullup before mtod

 It may fix panicks of some tests on anita/sparc and anita/GuruPlug.
 ---
 KNF
 ---
 Enable DEBUG for babylon5
 ---
 Apply C99-style struct initialization to xformsw
 ---
 Tweak outputs of netstat -s for IPsec

 - Get rid of "Fast"
 - Use ipsec and ipsec6 for titles to clarify protocol
 - Indent outputs of sub protocols

 Original outputs were organized like this:

 (Fast) IPsec:
 IPsec ah:
 IPsec esp:
 IPsec ipip:
 IPsec ipcomp:
 (Fast) IPsec:
 IPsec ah:
 IPsec esp:
 IPsec ipip:
 IPsec ipcomp:

 New outputs are organized like this:

 ipsec:
 	ah:
 	esp:
 	ipip:
 	ipcomp:
 ipsec6:
 	ah:
 	esp:
 	ipip:
 	ipcomp:
 ---
 Add test cases for IPComp
 ---
 Simplify IPSEC_OSTAT macro (NFC)
 ---
 KNF; replace leading whitespaces with hard tabs
 ---
 Introduce and use SADB_SASTATE_USABLE_P
 ---
 KNF
 ---
 Add update command for testing

 Updating an SA (SADB_UPDATE) requires that a process issuing
 SADB_UPDATE is the same as a process issued SADB_ADD (or SADB_GETSPI).
 This means that update command must be used with add command in a
 configuration of setkey. This usage is normally meaningless but
 useful for testing (and debugging) purposes.
 ---
 Add test cases for updating SA/SP

 The tests require newly-added udpate command of setkey.
 ---
 PR/52346: Frank Kardel: Fix checksumming for NAT-T
 See XXX for improvements.
 ---
 Remove codes for PACKET_TAG_IPSEC_IN_CRYPTO_DONE

 It seems that PACKET_TAG_IPSEC_IN_CRYPTO_DONE is for network adapters
 that have IPsec accelerators; a driver sets the mtag to a packet
 when its device has already encrypted the packet.

 Unfortunately no driver implements such offload features for long
 years and seems unlikely to implement them soon. (Note that neither
 FreeBSD nor Linux doesn't have such drivers.) Let's remove related
 (unused) codes and simplify the IPsec code.
 ---
 Fix usages of sadb_msg_errno
 ---
 Avoid updating sav directly

 On SADB_UPDATE a target sav was updated directly, which was unsafe.
 Instead allocate another sav, copy variables of the old sav to
 the new one and replace the old one with the new one.
 ---
 Simplify; we can assume sav->tdb_xform cannot be NULL while it's valid
 ---
 Rename key_alloc* functions (NFC)

 We shouldn't use the term "alloc" for functions that just look up
 data and actually don't allocate memory.
 ---
 Use explicit_memset to surely zero-clear key_auth and key_enc
 ---
 Make sure to clear keys on error paths of key_setsaval
 ---
 Add missing KEY_FREESAV
 ---
 Make sure a sav is inserted to a sah list after its initialization completes
 ---
 Remove unnecessary zero-clearing codes from key_setsaval

 key_setsaval is now used only for a newly-allocated sav. (It was
 used to reset variables of an existing sav.)
 ---
 Correct wrong assumption of sav->refcnt in key_delsah

 A sav in a list is basically not to be sav->refcnt == 0. And also
 KEY_FREESAV assumes sav->refcnt > 0.
 ---
 Let key_getsavbyspi take a reference of a returning sav
 ---
 Use time_mono_to_wall (NFC)
 ---
 Separate sending message routine (NFC)
 ---
 Simplify; remove unnecessary zero-clears

 key_freesaval is used only when a target sav is being destroyed.
 ---
 Omit NULL checks for sav->lft_c

 sav->lft_c can be NULL only when initializing or destroying sav.
 ---
 Omit unnecessary NULL checks for sav->sah
 ---
 Omit unnecessary check of sav->state

 key_allocsa_policy picks a sav of either MATURE or DYING so we
 don't need to check its state again.
 ---
 Simplify; omit unnecessary saidx passing

 - ipsec_nextisr returns a saidx but no caller uses it
 - key_checkrequest is passed a saidx but it can be gotton by
   another argument (isr)
 ---
 Fix splx isn't called on some error paths
 ---
 Fix header size calculation of esp where sav is NULL
 ---
 Fix header size calculation of ah in the case sav is NULL

 This fix was also needed for esp.
 ---
 Pass sav directly to opencrypto callback

 In a callback, use a passed sav as-is by default and look up a sav
 only if the passed sav is dead.
 ---
 Avoid examining freshness of sav on packet processing

 If a sav list is sorted (by lft_c->sadb_lifetime_addtime) in advance,
 we don't need to examine each sav and also don't need to delete one
 on the fly and send up a message. Fortunately every sav lists are sorted
 as we need.

 Added key_validate_savlist validates that each sav list is surely sorted
 (run only if DEBUG because it's not cheap).
 ---
 Add test cases for SAs with different SPIs
 ---
 Prepare to stop using isr->sav

 isr is a shared resource and using isr->sav as a temporal storage
 for each packet processing is racy. And also having a reference from
 isr to sav makes the lifetime of sav non-deterministic; such a reference
 is removed when a packet is processed and isr->sav is overwritten by
 new one. Let's have a sav locally for each packet processing instead of
 using shared isr->sav.

 However this change doesn't stop using isr->sav yet because there are
 some users of isr->sav. isr->sav will be removed after the users find
 a way to not use isr->sav.
 ---
 Fix wrong argument handling
 ---
 fix printf format.
 ---
 Don't validate sav lists of LARVAL or DEAD states

 We don't sort the lists so the validation will always fail.

 Fix PR kern/52405
 ---
 Make sure to sort the list when changing the state by key_sa_chgstate
 ---
 Rename key_allocsa_policy to key_lookup_sa_bysaidx
 ---
 Separate test files
 ---
 Calculate ah_max_authsize on initialization as well as esp_max_ivlen
 ---
 Remove m_tag_find(PACKET_TAG_IPSEC_PENDING_TDB) because nobody sets the tag
 ---
 Restore a comment removed in previous

 The comment is valid for the below code.
 ---
 Make tests more stable

 sleep command seems to wait longer than expected on anita so
 use polling to wait for a state change.
 ---
 Add tests that explicitly delete SAs instead of waiting for expirations
 ---
 Remove invalid M_AUTHIPDGM check on ESP isr->sav

 M_AUTHIPDGM flag is set to a mbuf in ah_input_cb. An sav of ESP can
 have AH authentication as sav->tdb_authalgxform. However, in that
 case esp_input and esp_input_cb are used to do ESP decryption and
 AH authentication and M_AUTHIPDGM never be set to a mbuf. So
 checking M_AUTHIPDGM of a mbuf on isr->sav of ESP is meaningless.
 ---
 Look up sav instead of relying on unstable sp->req->sav

 This code is executed only in an error path so an additional lookup
 doesn't matter.
 ---
 Correct a comment
 ---
 Don't release sav if calling crypto_dispatch again
 ---
 Remove extra KEY_FREESAV from ipsec_process_done

 It should be done by the caller.
 ---
 Don't bother the case of crp->crp_buf == NULL in callbacks
 ---
 Hold a reference to an SP during opencrypto processing

 An SP has a list of isr (ipsecrequest) that represents a sequence
 of IPsec encryption/authentication processing. One isr corresponds
 to one opencrypto processing. The lifetime of an isr follows its SP.

 We pass an isr to a callback function of opencrypto to continue
 to a next encryption/authentication processing. However nobody
 guaranteed that the isr wasn't freed, i.e., its SP wasn't destroyed.

 In order to avoid such unexpected destruction of isr, hold a reference
 to its SP during opencrypto processing.
 ---
 Don't make SAs expired on tests that delete SAs explicitly
 ---
 Fix a debug message
 ---
 Dedup error paths (NFC)
 ---
 Use pool to allocate tdb_crypto

 For ESP and AH, we need to allocate an extra variable space in addition
 to struct tdb_crypto. The fixed size of pool items may be larger than
 an actual requisite size of a buffer, but still the performance
 improvement by replacing malloc with pool wins.
 ---
 Don't use unstable isr->sav for header size calculations

 We may need to optimize to not look up sav here for users that
 don't need to know an exact size of headers (e.g., TCP segmemt size
 caclulation).
 ---
 Don't use sp->req->sav when handling NAT-T ESP fragmentation

 In order to do this we need to look up a sav however an additional
 look-up degrades performance. A sav is later looked up in
 ipsec4_process_packet so delay the fragmentation check until then
 to avoid an extra look-up.
 ---
 Don't use key_lookup_sp that depends on unstable sp->req->sav

 It provided a fast look-up of SP. We will provide an alternative
 method in the future (after basic MP-ification finishes).
 ---
 Stop setting isr->sav on looking up sav in key_checkrequest
 ---
 Remove ipsecrequest#sav
 ---
 Stop setting mtag of PACKET_TAG_IPSEC_IN_DONE because there is no users anymore
 ---
 Skip ipsec_spi_*_*_preferred_new_timeout when running on qemu

 Probably due to PR 43997
 ---
 Add localcount to rump kernels
 ---
 Remove unused macro
 ---
 Fix key_getcomb_setlifetime

 The fix adjusts a soft limit to be 80% of a corresponding hard limit.

 I'm not sure the fix is really correct though, at least the original
 code is wrong. A passed comb is zero-cleared before calling
 key_getcomb_setlifetime, so
   comb->sadb_comb_soft_addtime = comb->sadb_comb_soft_addtime * 80 / 100;
 is meaningless.
 ---
 Provide and apply key_sp_refcnt (NFC)

 It simplifies further changes.
 ---
 Fix indentation

 Pointed out by knakahara@
 ---
 Use pslist(9) for sptree
 ---
 Don't acquire global locks for IPsec if NET_MPSAFE

 Note that the change is just to make testing easy and IPsec isn't MP-safe yet.
 ---
 Let PF_KEY socks hold their own lock instead of softnet_lock

 Operations on SAD and SPD are executed via PF_KEY socks. The operations
 include deletions of SAs and SPs that will use synchronization mechanisms
 such as pserialize_perform to wait for references to SAs and SPs to be
 released. It is known that using such mechanisms with holding softnet_lock
 causes a dead lock. We should avoid the situation.
 ---
 Make IPsec SPD MP-safe

 We use localcount(9), not psref(9), to make the sptree and secpolicy (SP)
 entries MP-safe because SPs need to be referenced over opencrypto
 processing that executes a callback in a different context.

 SPs on sockets aren't managed by the sptree and can be destroyed in softint.
 localcount_drain cannot be used in softint so we delay the destruction of
 such SPs to a thread context. To do so, a list to manage such SPs is added
 (key_socksplist) and key_timehandler_spd deletes dead SPs in the list.

 For more details please read the locking notes in key.c.

 Proposed on tech-kern@ and tech-net@
 ---
 Fix updating ipsec_used

 - key_update_used wasn't called in key_api_spddelete2 and key_api_spdflush
 - key_update_used wasn't called if an SP had been added/deleted but
   a reply to userland failed
 ---
 Fix updating ipsec_used; turn on when SPs on sockets are added
 ---
 Add missing IPsec policy checks to icmp6_rip6_input

 icmp6_rip6_input is quite similar to rip6_input and the same checks exist
 in rip6_input.
 ---
 Add test cases for setsockopt(IP_IPSEC_POLICY)
 ---
 Don't use KEY_NEWSP for dummy SP entries

 By the change KEY_NEWSP is now not called from softint anymore
 and we can use kmem_zalloc with KM_SLEEP for KEY_NEWSP.
 ---
 Comment out unused functions
 ---
 Add test cases that there are SPs but no relevant SAs
 ---
 Don't allow sav->lft_c to be NULL

 lft_c of an sav that was created by SADB_GETSPI could be NULL.
 ---
 Clean up clunky eval strings

 - Remove unnecessary \ at EOL
   - This allows to omit ; too
 - Remove unnecessary quotes for arguments of atf_set
 - Don't expand $DEBUG in eval
   - We expect it's expanded on execution

 Suggested by kre@
 ---
 Remove unnecessary KEY_FREESAV in an error path

 sav should be freed (unreferenced) by the caller.
 ---
 Use pslist(9) for sahtree
 ---
 Use pslist(9) for sah->savtree
 ---
 Rename local variable newsah to sah

 It may not be new.
 ---
 MP-ify SAD slightly

 - Introduce key_sa_mtx and use it for some list operations
 - Use pserialize for some list iterations
 ---
 Introduce KEY_SA_UNREF and replace KEY_FREESAV with it where sav will never be actually freed in the future

 KEY_SA_UNREF is still key_freesav so no functional change for now.

 This change reduces diff of further changes.
 ---
 Remove out-of-date log output

 Pointed out by riastradh@
 ---
 Use KDASSERT instead of KASSERT for mutex_ownable

 Because mutex_ownable is too heavy to run in a fast path
 even for DIAGNOSTIC + LOCKDEBUG.

 Suggested by riastradh@
 ---
 Assemble global lists and related locks into cache lines (NFCI)

 Also rename variable names from *tree to *list because they are
 just lists, not trees.

 Suggested by riastradh@
 ---
 Move locking notes
 ---
 Update the locking notes

 - Add locking order
 - Add locking notes for misc lists such as reglist
 - Mention pserialize, key_sp_ref and key_sp_unref on SP operations

 Requested by riastradh@
 ---
 Describe constraints of key_sp_ref and key_sp_unref

 Requested by riastradh@
 ---
 Hold key_sad.lock on SAVLIST_WRITER_INSERT_TAIL
 ---
 Add __read_mostly to key_psz

 Suggested by riastradh@
 ---
 Tweak wording (pserialize critical section => pserialize read section)

 Suggested by riastradh@
 ---
 Add missing mutex_exit
 ---
 Fix setkey -D -P outputs

 The outputs were tweaked (by me), but I forgot updating libipsec
 in my local ATF environment...
 ---
 MP-ify SAD (key_sad.sahlist and sah entries)

 localcount(9) is used to protect key_sad.sahlist and sah entries
 as well as SPD (and will be used for SAD sav).

 Please read the locking notes of SAD for more details.
 ---
 Introduce key_sa_refcnt and replace sav->refcnt with it (NFC)
 ---
 Destroy sav only in the loop for DEAD sav
 ---
 Fix KASSERT(solocked(sb->sb_so)) failure in sbappendaddr that is called eventually from key_sendup_mbuf

 If key_sendup_mbuf isn't passed a socket, the assertion fails.
 Originally in this case sb->sb_so was softnet_lock and callers
 held softnet_lock so the assertion was magically satisfied.
 Now sb->sb_so is key_so_mtx and also softnet_lock isn't always
 held by callers so the assertion can fail.

 Fix it by holding key_so_mtx if key_sendup_mbuf isn't passed a socket.

 Reported by knakahara@
 Tested by knakahara@ and ozaki-r@
 ---
 Fix locking notes of SAD
 ---
 Fix deadlock between key_sendup_mbuf called from key_acquire and localcount_drain

 If we call key_sendup_mbuf from key_acquire that is called on packet
 processing, a deadlock can happen like this:
 - At key_acquire, a reference to an SP (and an SA) is held
 - key_sendup_mbuf will try to take key_so_mtx
 - Some other thread may try to localcount_drain to the SP with
   holding key_so_mtx in say key_api_spdflush
 - In this case localcount_drain never return because key_sendup_mbuf
   that has stuck on key_so_mtx never release a reference to the SP

 Fix the deadlock by deferring key_sendup_mbuf to the timer
 (key_timehandler).
 ---
 Fix that prev isn't cleared on retry
 ---
 Limit the number of mbufs queued for deferred key_sendup_mbuf

 It's easy to be queued hundreds of mbufs on the list under heavy
 network load.
 ---
 MP-ify SAD (savlist)

 localcount(9) is used to protect savlist of sah. The basic design is
 similar to MP-ifications of SPD and SAD sahlist. Please read the
 locking notes of SAD for more details.
 ---
 Simplify ipsec_reinject_ipstack (NFC)
 ---
 Add per-CPU rtcache to ipsec_reinject_ipstack

 It reduces route lookups and also reduces rtcache lock contentions
 when NET_MPSAFE is enabled.
 ---
 Use pool_cache(9) instead of pool(9) for tdb_crypto objects

 The change improves network throughput especially on multi-core systems.
 ---
 Update

 ipsec(4), opencrypto(9) and vlan(4) are now MP-safe.
 ---
 Write known issues on scalability
 ---
 Share a global dummy SP between PCBs

 It's never be changed so it can be pre-allocated and shared safely between PCBs.
 ---
 Fix race condition on the rawcb list shared by rtsock and keysock

 keysock now protects itself by its own mutex, which means that
 the rawcb list is protected by two different mutexes (keysock's one
 and softnet_lock for rtsock), of course it's useless.

 Fix the situation by having a discrete rawcb list for each.
 ---
 Use a dedicated mutex for rt_rawcb instead of softnet_lock if NET_MPSAFE
 ---
 fix localcount leak in sav. fixed by ozaki-r@n.o.

 I commit on behalf of him.
 ---
 remove unnecessary comment.
 ---
 Fix deadlock between pserialize_perform and localcount_drain

 A typical ussage of localcount_drain looks like this:

   mutex_enter(&mtx);
   item = remove_from_list();
   pserialize_perform(psz);
   localcount_drain(&item->localcount, &cv, &mtx);
   mutex_exit(&mtx);

 This sequence can cause a deadlock which happens for example on the following
 situation:

 - Thread A calls localcount_drain which calls xc_broadcast after releasing
   a specified mutex
 - Thread B enters the sequence and calls pserialize_perform with holding
   the mutex while pserialize_perform also calls xc_broadcast
 - Thread C (xc_thread) that calls an xcall callback of localcount_drain tries
   to hold the mutex

 xc_broadcast of thread B doesn't start until xc_broadcast of thread A
 finishes, which is a feature of xcall(9). This means that pserialize_perform
 never complete until xc_broadcast of thread A finishes. On the other hand,
 thread C that is a callee of xc_broadcast of thread A sticks on the mutex.
 Finally the threads block each other (A blocks B, B blocks C and C blocks A).

 A possible fix is to serialize executions of the above sequence by another
 mutex, but adding another mutex makes the code complex, so fix the deadlock
 by another way; the fix is to release the mutex before pserialize_perform
 and instead use a condvar to prevent pserialize_perform from being called
 simultaneously.

 Note that the deadlock has happened only if NET_MPSAFE is enabled.
 ---
 Add missing ifdef NET_MPSAFE
 ---
 Take softnet_lock on pr_input properly if NET_MPSAFE

 Currently softnet_lock is taken unnecessarily in some cases, e.g.,
 icmp_input and encap4_input from ip_input, or not taken even if needed,
 e.g., udp_input and tcp_input from ipsec4_common_input_cb. Fix them.

 NFC if NET_MPSAFE is disabled (default).
 ---
 - sanitize key debugging so that we don't print extra newlines or unassociated
   debugging messages.
 - remove unused functions and make internal ones static
 - print information in one line per message
 ---
 humanize printing of ip addresses
 ---
 cast reduction, NFC.
 ---
 Fix typo in comment
 ---
 Pull out ipsec_fill_saidx_bymbuf (NFC)
 ---
 Don't abuse key_checkrequest just for looking up sav

 It does more than expected for example key_acquire.
 ---
 Fix SP is broken on transport mode

 isr->saidx was modified accidentally in ipsec_nextisr.

 Reported by christos@
 Helped investigations by christos@ and knakahara@
 ---
 Constify isr at many places (NFC)
 ---
 Include socketvar.h for softnet_lock
 ---
 Fix buffer length for ipsec_logsastr


 To generate a diff of this commit:
 cvs rdiff -u -r1.18 -r1.18.4.1 src/crypto/dist/ipsec-tools/src/setkey/parse.y
 cvs rdiff -u -r1.19 -r1.19.8.1 src/crypto/dist/ipsec-tools/src/setkey/token.l
 cvs rdiff -u -r1.752.2.3 -r1.752.2.4 src/distrib/sets/lists/tests/mi
 cvs rdiff -u -r1.11 -r1.11.6.1 src/doc/TODO.smpnet
 cvs rdiff -u -r1.31 -r1.31.4.1 src/sys/net/pfkeyv2.h
 cvs rdiff -u -r1.22 -r1.22.20.1 src/sys/net/raw_cb.c
 cvs rdiff -u -r1.27 -r1.27.4.1 src/sys/net/raw_cb.h
 cvs rdiff -u -r1.56 -r1.56.4.1 src/sys/net/raw_usrreq.c
 cvs rdiff -u -r1.213.2.2 -r1.213.2.3 src/sys/net/rtsock.c
 cvs rdiff -u -r1.123 -r1.123.4.1 src/sys/netinet/in_proto.c
 cvs rdiff -u -r1.355 -r1.355.2.1 src/sys/netinet/ip_input.c
 cvs rdiff -u -r1.357.4.1 -r1.357.4.2 src/sys/netinet/tcp_input.c
 cvs rdiff -u -r1.196 -r1.196.2.1 src/sys/netinet/tcp_output.c
 cvs rdiff -u -r1.177 -r1.177.10.1 src/sys/netinet/tcp_var.h
 cvs rdiff -u -r1.211.6.1 -r1.211.6.2 src/sys/netinet6/icmp6.c
 cvs rdiff -u -r1.117 -r1.117.4.1 src/sys/netinet6/in6_proto.c
 cvs rdiff -u -r1.87 -r1.87.2.1 src/sys/netinet6/ip6_forward.c
 cvs rdiff -u -r1.178 -r1.178.2.1 src/sys/netinet6/ip6_input.c
 cvs rdiff -u -r1.191.6.1 -r1.191.6.2 src/sys/netinet6/ip6_output.c
 cvs rdiff -u -r1.25 -r1.25.10.1 src/sys/netinet6/ip6protosw.h
 cvs rdiff -u -r1.99 -r1.99.2.1 src/sys/netipsec/ipsec.c
 cvs rdiff -u -r1.50 -r1.50.2.1 src/sys/netipsec/ipsec.h
 cvs rdiff -u -r1.17 -r1.17.4.1 src/sys/netipsec/ipsec6.h \
     src/sys/netipsec/key_debug.c
 cvs rdiff -u -r1.43 -r1.43.2.1 src/sys/netipsec/ipsec_input.c
 cvs rdiff -u -r1.40 -r1.40.6.1 src/sys/netipsec/ipsec_netbsd.c
 cvs rdiff -u -r1.48 -r1.48.2.1 src/sys/netipsec/ipsec_output.c
 cvs rdiff -u -r1.4 -r1.4.4.1 src/sys/netipsec/ipsec_private.h
 cvs rdiff -u -r1.163 -r1.163.2.1 src/sys/netipsec/key.c
 cvs rdiff -u -r1.19 -r1.19.2.1 src/sys/netipsec/key.h
 cvs rdiff -u -r1.8 -r1.8.4.1 src/sys/netipsec/key_debug.h
 cvs rdiff -u -r1.15 -r1.15.2.1 src/sys/netipsec/keydb.h
 cvs rdiff -u -r1.58 -r1.58.2.1 src/sys/netipsec/keysock.c
 cvs rdiff -u -r1.9 -r1.9.2.1 src/sys/netipsec/keysock.h
 cvs rdiff -u -r1.8 -r1.8.10.1 src/sys/netipsec/xform.h
 cvs rdiff -u -r1.54 -r1.54.2.1 src/sys/netipsec/xform_ah.c
 cvs rdiff -u -r1.55 -r1.55.2.1 src/sys/netipsec/xform_esp.c
 cvs rdiff -u -r1.38 -r1.38.2.1 src/sys/netipsec/xform_ipcomp.c
 cvs rdiff -u -r1.49 -r1.49.2.1 src/sys/netipsec/xform_ipip.c
 cvs rdiff -u -r1.11 -r1.11.4.1 src/sys/netipsec/xform_tcp.c
 cvs rdiff -u -r1.169 -r1.169.6.1 \
     src/sys/rump/librump/rumpkern/Makefile.rumpkern
 cvs rdiff -u -r1.26 -r1.26.4.1 src/sys/rump/librump/rumpnet/net_stub.c
 cvs rdiff -u -r1.66 -r1.66.10.1 src/sys/sys/protosw.h
 cvs rdiff -u -r1.18.2.1 -r1.18.2.2 src/tests/net/net_common.sh
 cvs rdiff -u -r1.6 -r1.6.2.1 src/tests/net/carp/t_basic.sh
 cvs rdiff -u -r1.10 -r1.10.2.1 src/tests/net/if_gif/t_gif.sh
 cvs rdiff -u -r1.2 -r1.2.8.1 src/tests/net/if_l2tp/t_l2tp.sh
 cvs rdiff -u -r1.6 -r1.6.2.1 src/tests/net/ipsec/Makefile
 cvs rdiff -u -r1.4 -r1.4.2.1 src/tests/net/ipsec/algorithms.sh \
     src/tests/net/ipsec/t_ipsec_transport.sh
 cvs rdiff -u -r1.3 -r1.3.2.1 src/tests/net/ipsec/common.sh
 cvs rdiff -u -r1.1 -r1.1.8.1 src/tests/net/ipsec/t_ipsec_ah_keys.sh \
     src/tests/net/ipsec/t_ipsec_esp_keys.sh
 cvs rdiff -u -r1.5 -r1.5.2.1 src/tests/net/ipsec/t_ipsec_gif.sh \
     src/tests/net/ipsec/t_ipsec_l2tp.sh
 cvs rdiff -u -r1.6.2.1 -r1.6.2.2 src/tests/net/ipsec/t_ipsec_misc.sh
 cvs rdiff -u -r0 -r1.2.2.2 src/tests/net/ipsec/t_ipsec_sockopt.sh \
     src/tests/net/ipsec/t_ipsec_tcp.sh \
     src/tests/net/ipsec/t_ipsec_tunnel_ipcomp.sh
 cvs rdiff -u -r1.8 -r1.8.2.1 src/tests/net/ipsec/t_ipsec_tunnel.sh
 cvs rdiff -u -r1.2 -r1.2.2.1 src/tests/net/ipsec/t_ipsec_tunnel_odd.sh
 cvs rdiff -u -r1.5 -r1.5.4.1 src/tests/net/mcast/t_mcast.sh
 cvs rdiff -u -r1.10 -r1.10.4.1 src/tests/net/net/t_ipaddress.sh
 cvs rdiff -u -r1.2 -r1.2.28.1 src/tests/net/npf/t_npf.sh
 cvs rdiff -u -r1.15.6.1 -r1.15.6.2 src/tests/net/route/t_flags.sh
 cvs rdiff -u -r1.12.6.1 -r1.12.6.2 src/tests/net/route/t_flags6.sh
 cvs rdiff -u -r1.21 -r1.21.4.1 src/usr.bin/netstat/fast_ipsec.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/43997: Kernel timer discrepancies
Date: Sat, 29 Sep 2018 16:26:36 +0300

 Almost eight years ago, Paul Goyette wrote:
 > The routines sleep(3), usleep(3), and nanosleep(2) wake-up based on the 
 > occurrence of clock ticks.  However, the timer interrupt routine determines
 > the actual absolute time.  Normally this isn't an issue, but on systems 
 > that can "miss" interrupts it can result in significant discrepancies.
 > 
 > One such system is qemu, frequently used for providing a system test-bed
 > environment.  On (at least) amd64 architecture, qemu cannot simulate clock
 > interrupts at 100Hz.  Therefore, a simple "date ; sleep 5; date" command
 > actually requires 10 seconds to complete!

 This is still an issue.  Some observations:

 1. As a work-around, attempts are made to skip certain tests under
 qemu, but the method used to detect qemu does not work for sparc
 guests, so on sparc, these tests are still run and fail (e.g.,
 ipsec_spi_ah_hmacsha512_preferred_new_timeout).

 2. There are three parties to the problem: the host, qemu, and the
 guest.  Paul's original report frames it as a guest problem (the guest
 not dealing with lost clock interrupts), but IMO the real bug is that
 the interrupts are lost in the first place, and the root cause of that
 lies not with the guest nor with qemu, but with the host.

 Specifically, it's a problem of host timer granularity: on a NetBSD
 host with the default hz=100, calling poll() with a timeout of 5 ms
 (as qemu does) results in a sleep taking anywhere between 10 and 20
 ms, and repeated calls will return at regular 20 ms intervals.  It is
 hardly surprising that qemu can't schedule guest clock interrupts at
 10 ms intervals when the host can't wake qemu up at intervals less
 than 20 ms.

 IMO, the correct way to fix this is for NetBSD to go tickless, as
 Linux and FreeBSD already have.  Until that happens, possible
 work-arounds are to increase the hz parameter on the host, or to
 host qemu on an OS other than NetBSD.

 3. It should also be possible to work around the issue by making the
 guest have its own notion of time, independent of the host time and
 instead based on a count of emulated guest instructions.  This would
 also have other advantages: it would make test execution more
 deterministic, and it would allow speeding up the tests by "warping"
 the virtual guest time instantly to the next clock tick whenever the
 guest enters the idle loop, so that guest idle time would consume no
 real time.

 Qemu claims to already support this through the "--icount" option
 and its "sleep" keyword, but sadly, it doesn't actually work.  See
 https://bugs.launchpad.net/qemu/+bug/1774677 for the bug report.
 -- 
 Andreas Gustafsson, gson@gson.org

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.43 2018/01/16 07:36:43 maya Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2017 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.