NetBSD Problem Report #44473

From Wolfgang.Stukenbrock@nagler-company.com  Thu Jan 27 12:55:57 2011
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id 500B163B873
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 27 Jan 2011 12:55:57 +0000 (UTC)
Message-Id: <20110127125548.720B81E80CE@test-s0.nagler-company.com>
Date: Thu, 27 Jan 2011 13:55:48 +0100 (CET)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: (FAST-)IPSEC processing comsumes too much CPU in interrupt processing
X-Send-Pr-Version: 3.95

>Number:         44473
>Category:       kern
>Synopsis:       (FAST-)IPSEC processing comsumes too much CPU in interrupt processing
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Thu Jan 27 13:00:00 +0000 2011
>Last-Modified:  Thu Jan 27 16:00:03 +0000 2011
>Originator:     Dr. Wolfgang Stukenbrock
>Release:        NetBSD 5.1
>Organization:
Dr. Nagler & Company GmbH
>Environment:


System: NetBSD test-s0 4.0 NetBSD 4.0 (NSW-WS) #0: Tue Aug 17 17:28:09 CEST 2010 wgstuken@test-s0:/usr/src/sys/arch/amd64/compile/NSW-WS amd64
Architecture: x86_64
Machine: amd64
>Description:
	The IPSEC and FAST_IPSEC processing is done while processing the packet itself.
	After processing, the packet is placed in a queue in order to avoid stack problems and reinjected
	from there by a kernel-thread (cryptoret).
	This consumes a large amount of cpu-power during interrupt processing and that leads to bad response times
	to other IO-devices (or next packets on the network).
	I've meashured "only" a throughput of something about 6 MB/sec on a Xeon L3406.
	(In NetBSD 4.0 a Xeon E3110 has processed up to 12 MB/sec ...)
	The reason for this seems to be the KERNEL_LOCK and/or softnet_lock mutex that are hold during
	processing of the device interrupts and the network code.

	We run some systems as Firewalls that connects different locations via ESP/IPCOMP-tunnels to provide
	a transparent network for the locations. For this application 6 MB/sec is too slow ...

	While forwarding packets "top" shows something about 90% interrupt activity on CPU 0.

	In order to separate the interrupt processing from the crypto-stuff, I've create a new queue in cryptosoft.c
	and bypass the previous queue at end of processing. I also allow a dynamic number of kernel threads to 
	process requests in parallel tunable via sysctl at runtime.
	With this change, "top" now says something about 10% interrupt on CPU 0 and a large number in % for system
	from the new threads. I've reached 12 MB/sec throughput again - but I think it may be larger, the other side (the E3110 system)
	was not able to process more ...

	The patch below does not turn on this new kind of processing by default. If 0 threads are selected, the processing is
	done in the old style during interrupt processing.
	Netherless always one new thread is present all the time in order to simplyfy the implementation.
	The number of threads is adjusted dynamicaly to the requested value that may be changed via sysctl at any time.
	For now there is no uppler limit check for the number of threads - perhaps it would make sence to add one.
	I've selected the name "kern.swcr_num_thr" for the parameter - feel free to choose another one ...

	I've tested the troughput with different number of threads (my L3406 offers 4 CPU's) and found out that at least
	for my test-setup 2 threads lead to the best result.
	But for other setups other values may be best. Therefore I've left the number of threads tunable by the admin.
	When starting more threads, it looks like that the scheduling overhead gets to expensive. But that should depend on the number
	of concurrent available crypto request and that again depends on the amount of trafic ...

	What is still not perfect in this implementation:
	- no upper limit for the number of threads - an easy aproach may be the number of CPU's available
	- curently the threads are not bound to a CPU, they switch between them. But I simple do not know how to fix them to a 
	  CPU and I'm not shure how to decide to which CPU it should be bound.
	- I use printf() for kernel messages as done in the other parts of opencrypto too.
	- I'm not shure if I would overrun the console with error messages if the thread-creation does not work, because
	  the code will retry to create missing threads after processing each request again and again and again ....
	  But I don't think it would is a good idea to reduche the sysctl-configured value. That might surprise the admin ...
>How-To-Repeat:
	No relevant - new functionality
>Fix:
	The following patch to /usr/src/sys/opencrypto/cryptosoft.[ch] requires PR44470 to be applied.
	Otherwise the change to "const" for the swcr_data pointers in the worker-routines does not work for swcr_compdec().
	(sorry - the const should have been in PR 44470, but I've overlooked them there - sorry again ...)


--- cryptosoft.c	2011/01/27 11:42:23	1.2
+++ cryptosoft.c	2011/01/27 12:08:43
@@ -33,6 +33,8 @@
 #include <sys/sysctl.h>
 #include <sys/errno.h>

+#include <sys/kthread.h>
+
 #include "opt_ocf.h"
 #include <opencrypto/cryptodev.h>
 #include <opencrypto/cryptosoft.h>
@@ -53,6 +55,14 @@
 u_int32_t swcr_sesnum = 0;
 int32_t swcr_id = -1;

+static  void swcrthr(void);           /* kernel thread(s) for processing request in parallel */
+
+static kcondvar_t swcr_kcv;
+static kmutex_t   swcr_kmtx;
+static int        swcr_numthr = 0; /* default number of thread to use */
+static int        swcr_curthr = 0;
+static TAILQ_HEAD(,cryptop) swcr_q = TAILQ_HEAD_INITIALIZER(swcr_q);
+
 #define COPYBACK(x, a, b, c, d) \
 	(x) == CRYPTO_BUF_MBUF ? m_copyback((struct mbuf *)a,b,c,d) \
 	: cuio_copyback((struct uio *)a,b,c,d)
@@ -60,9 +70,10 @@
 	(x) == CRYPTO_BUF_MBUF ? m_copydata((struct mbuf *)a,b,c,d) \
 	: cuio_copydata((struct uio *)a,b,c,d)

-static	int swcr_encdec(struct cryptodesc *, struct swcr_data *, void *, int);
-static	int swcr_compdec(struct cryptodesc *, struct swcr_data *, void *, int, int *);
+static	int swcr_encdec(struct cryptodesc *, const struct swcr_data *, void *, int);
+static	int swcr_compdec(struct cryptodesc *, const struct swcr_data *, void *, int, int *);
 static	int swcr_process(void *, struct cryptop *, int);
+static	void swcr_process2(struct cryptop *);
 static	int swcr_newsession(void *, u_int32_t *, struct cryptoini *);
 static	int swcr_freesession(void *, u_int64_t);

@@ -70,7 +81,7 @@
  * Apply a symmetric encryption/decryption algorithm.
  */
 static int
-swcr_encdec(struct cryptodesc *crd, struct swcr_data *sw, void *bufv,
+swcr_encdec(struct cryptodesc *crd, const struct swcr_data *sw, void *bufv,
     int outtype)
 {
 	char *buf = bufv;
@@ -418,7 +429,7 @@
  */
 int
 swcr_authcompute(struct cryptop *crp, struct cryptodesc *crd,
-    struct swcr_data *sw, void *buf, int outtype)
+    const struct swcr_data *sw, void *buf, int outtype)
 {
 	unsigned char aalg[AALG_MAX_RESULT_LEN];
 	const struct swcr_auth_hash *axf;
@@ -512,7 +523,7 @@
  * Apply a compression/decompression algorithm
  */
 static int
-swcr_compdec(struct cryptodesc *crd, struct swcr_data *sw,
+swcr_compdec(struct cryptodesc *crd, const struct swcr_data *sw,
     void *buf, int outtype, int *res_size)
 {
 	u_int8_t *data, *out;
@@ -900,15 +911,39 @@
 static int
 swcr_process(void *arg, struct cryptop *crp, int hint)
 {
-	struct cryptodesc *crd;
-	struct swcr_data *sw;
-	u_int32_t lid;
-	int type;
+	int wasempty;

 	/* Sanity check */
 	if (crp == NULL)
 		return EINVAL;

+	if (swcr_numthr != 0) { 
+		mutex_spin_enter(&swcr_kmtx);  
+		if (swcr_curthr == 0 || swcr_numthr == 0) {
+			mutex_spin_exit(&swcr_kmtx);
+			goto direct_call;
+		}       
+		wasempty = TAILQ_EMPTY(&swcr_q);
+		TAILQ_INSERT_TAIL(&swcr_q, crp, crp_next);
+		if (wasempty) {
+			cv_signal(&swcr_kcv);
+		}       
+		mutex_spin_exit(&swcr_kmtx);
+	} else {        
+direct_call:    
+		swcr_process2(crp);
+	}               
+	return 0;       
+}
+
+static void
+swcr_process2(struct cryptop *crp)
+{
+	struct cryptodesc *crd;
+	struct swcr_data *sw;
+	u_int32_t lid;
+	int type;
+
 	if (crp->crp_desc == NULL || crp->crp_buf == NULL) {
 		crp->crp_etype = EINVAL;
 		goto done;
@@ -1000,12 +1035,13 @@
 done:
 	DPRINTF(("request %08x done\n", (uint32_t)crp));
 	crypto_done(crp);
-	return 0;
 }

 static void
 swcr_init(void)
 {
+	int error;
+
 	swcr_id = crypto_get_driverid(CRYPTOCAP_F_SOFTWARE);
 	if (swcr_id < 0) {
 		/* This should never happen */
@@ -1038,8 +1074,88 @@
 	REGISTER(CRYPTO_DEFLATE_COMP);
 	REGISTER(CRYPTO_GZIP_COMP);
 #undef REGISTER
+
+	mutex_init(&swcr_kmtx, MUTEX_DEFAULT, IPL_NET);
+	cv_init(&swcr_kcv, "swcr_wait");
+	error = kthread_create(PRI_COUNT - 1, KTHREAD_MPSAFE, NULL,
+				(void (*)(void*))swcrthr, NULL, NULL, "swcrthr");
+	if (error) {
+		printf("softcrypto_init: cannot start softcrypto thread; error %d", error);
+	} else swcr_curthr = 1;
 }

+/*
+ * Kernel thread to do callbacks.
+ */
+static void
+swcrthr(void)        
+{
+	struct cryptop *crp;
+	int f_loop;
+
+	mutex_spin_enter(&swcr_kmtx);
+	for (f_loop = 1;;) {
+		crp = TAILQ_FIRST(&swcr_q);
+		if (crp == NULL) {
+			cv_wait(&swcr_kcv, &swcr_kmtx);
+			f_loop = 1;
+			continue;
+		}
+		TAILQ_REMOVE(&swcr_q, crp, crp_next);
+		if (f_loop != 0 && !TAILQ_EMPTY(&swcr_q)) {
+			cv_signal(&swcr_kcv); /* wake up another thread on first loop it not empty */
+		}
+		f_loop = 0;
+		mutex_spin_exit(&swcr_kmtx);
+
+		crp->crp_flags |= CRYPTO_F_CBIMM; /* force direct return - avoid additional scheduling ...*/
+		swcr_process2(crp);
+
+		mutex_spin_enter(&swcr_kmtx);
+		if (swcr_numthr != swcr_curthr) { /* need to correct number of threads ... */
+			if (swcr_numthr < swcr_curthr && swcr_curthr > 1) {
+				swcr_curthr--;
+				mutex_spin_exit(&swcr_kmtx);
+				lwp_exit(curlwp);
+			} else if (swcr_numthr > swcr_curthr) {
+				int need_new = swcr_numthr - swcr_curthr;
+				int error;
+
+				swcr_curthr = swcr_numthr;
+				mutex_spin_exit(&swcr_kmtx);
+				while (need_new > 0) {
+					error = kthread_create(PRI_COUNT - 1, KTHREAD_MPSAFE, NULL,
+						(void (*)(void*))swcrthr, NULL, NULL, "swcrthr");
+					if (error) {
+						printf("softcrypto_init: cannot start softcrypto thread; error %d", error);
+						break;
+					}
+					need_new--;
+				}
+				mutex_spin_enter(&swcr_kmtx);
+				swcr_curthr -= need_new;
+			}
+		}
+	}
+}
+
+
+SYSCTL_SETUP(sysctl_softcrypto_setup, "sysctl softcrypto subtree setup")
+{
+	sysctl_createv(clog, 0, NULL, NULL,
+			CTLFLAG_PERMANENT,
+			CTLTYPE_NODE, "kern", NULL,
+			NULL, 0, NULL, 0,
+			CTL_KERN, CTL_EOL);
+	sysctl_createv(clog, 0, NULL, NULL,
+			CTLFLAG_PERMANENT|CTLFLAG_READWRITE,
+			CTLTYPE_INT, "swcr_num_thr",
+			SYSCTL_DESCR("number of kernel thread to use for softcrypto"),
+			NULL, 0, &swcr_numthr, 0,
+			CTL_KERN, CTL_CREATE, CTL_EOL);
+}
+
+

 /*
  * Pseudo-device init routine for software crypto.
--- cryptosoft.h	2011/01/27 11:42:23	1.2
+++ cryptosoft.h	2011/01/27 11:53:10
@@ -57,7 +57,7 @@

 #ifdef _KERNEL
 int swcr_authcompute(struct cryptop *crp, struct cryptodesc *crd,
-    struct swcr_data *sw, void *buf, int outtype);
+    const struct swcr_data *sw, void *buf, int outtype);
 #endif /* _KERNEL */

 #endif /* _CRYPTO_CRYPTO_H_ */

>Audit-Trail:
From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/44473: (FAST-)IPSEC processing comsumes too much CPU in
 interrupt processing
Date: Thu, 27 Jan 2011 10:58:55 -0500

 On Thu, 27 Jan 2011 13:00:01 +0000 (UTC)
 Wolfgang.Stukenbrock@nagler-company.com wrote:

 > 	- no upper limit for the number of threads - an easy aproach may be the number of CPU's available
 > 	- curently the threads are not bound to a CPU, they switch between them. But I simple do not know how to fix them to a 
 > 	  CPU and I'm not shure how to decide to which CPU it should be bound.

 I'm not sure if others will agree but perhaps that a default or limit
 of one or two  thread(s) per CPU would be allright.  For tieing those
 threads to specific CPUs, in userland there's AFFINITY(3) for that,
 which internally uses the _sched_setaffinity(2) syscall,
 sys__sched_setaffinity() in src/sys/kern/sys_sched.c which sets the
 flag and ties a cpuset with the lwp.  That syscall obviously doesn't
 allow to do this on system threads, but it could be from a security
 standpoint rather than a limit of the system threads capabilities.

 Thanks,
 -- 
 Matt

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.