NetBSD Problem Report #39769
From bouyer@antioche.eu.org Sun Oct 19 13:32:43 2008
Return-Path: <bouyer@antioche.eu.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 69C1763B88A
for <gnats-bugs@gnats.NetBSD.org>; Sun, 19 Oct 2008 13:32:43 +0000 (UTC)
Message-Id: <200810191332.m9JDWclU008385@rochebonne.antioche.eu.org>
Date: Sun, 19 Oct 2008 15:32:38 +0200 (CEST)
From: Manuel Bouyer <bouyer@antioche.eu.org>
Reply-To: bouyer@antioche.eu.org
To: gnats-bugs@gnats.NetBSD.org
Subject: race condition in TCP timers
X-Send-Pr-Version: 3.95
>Number: 39769
>Category: kern
>Synopsis: race condition in TCP timers
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: bouyer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Oct 19 13:35:00 +0000 2008
>Closed-Date: Sun Nov 16 20:41:35 +0000 2008
>Last-Modified: Sun Nov 16 20:41:35 +0000 2008
>Originator: Manuel Bouyer
>Release: NetBSD 3.1.1_PATCH
>Organization:
>Environment:
System: NetBSD ftp.lip6.fr 3.1.1_PATCH NetBSD 3.1.1_PATCH (FTP) #6: Fri Oct 10 16:47:47 CEST 2008 bouyer@rock:/dsk/l1/misc/bouyer/tmp/amd64/obj/dsk/l1/misc/bouyer/netbsd-3-1/src/sys/arch/amd64/compile/FTP amd64
Architecture: amd64
Machine: amd64
>Description:
I've got several instances of this panic:
panic: tcp_output REXMT
panic() at netbsd:panic+0x1c8
tcp_segsize() at netbsd:tcp_segsize
tcp_timer_persist() at netbsd:tcp_timer_persist+0x73
softclock() at netbsd:softclock+0x2c9
softintr_dispatch() at netbsd:softintr_dispatch+0x99
DDB lost frame for netbsd:Xsoftclock+0x2d, trying 0xffffffff8069bcd0
Xsoftclock() at netbsd:Xsoftclock+0x2d
The tcp timers are armed/disarmed, but we don't check if the callback is
being invoked. One of them at last can cause the panic I'm seeing in
tcp_output():
if (win == 0) {
TCP_TIMER_DISARM(tp, TCPT_REXMT);
tp->t_rxtshift = 0;
tp->snd_nxt = tp->snd_una;
if (TCP_TIMER_ISARMED(tp, TCPT_PERSIST) == 0)
tcp_setpersist(tp);
}
If TCPT_REXMT's callback is invoking at this point, it'll be run just
after tcp_output() exists. tcp_timer_rexmt() will restart the
TCPT_REXMT timer, so now we have both TCPT_REXMT and TCPT_PERSIST
pending. When TCPT_PERSIST expires, we get the panic above.
There are other places where we stop TCPT_REXMT, and potentially
arm another timer in the same splsoftnet() block (e.g. tcp_input
stops TCPT_REXMT and may call tcp_output, which may arm TCPT_PERSIST,
tcp_sack does it too).
There are also in tcp_output() possible paths where we disarm
TCPT_PERSIST, and later TCPT_REXMT while TCPT_PERSIST may still be
invoking. I think there could also cause the tcp_setpersist panic,
as tcp_timer_persist can call tcp_setpersist().
As TCP_TIMER_DISARM() will clear the pending and expirted states,
I think a possible way to close this race is to check callout_expired()
in the callout handler, and abort the handler if it's not expired.
See attached patch.
It looks like the same issue exists in netbsd-4 and current.
>How-To-Repeat:
Code inspection.
Run a system with TCP connections and be unlucky.
>Fix:
I think this patch fixes it. I've been running it on a 3.1.1_PATCH
server without ill effect.
Index: tcp_timer.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_timer.c,v
retrieving revision 1.71
diff -u -r1.71 tcp_timer.c
--- tcp_timer.c 2 Mar 2005 10:20:18 -0000 1.71
+++ tcp_timer.c 10 Oct 2008 16:16:15 -0000
@@ -232,6 +232,10 @@
splx(s);
return;
}
+ if (!callout_expired(&tp->t_delack_ch)) {
+ splx(s);
+ return;
+ }
tp->t_flags |= TF_ACKNOW;
(void) tcp_output(tp);
@@ -293,6 +297,10 @@
splx(s);
return;
}
+ if (!callout_expired(&tp->t_timer[TCPT_REXMT])) {
+ splx(s);
+ return;
+ }
#ifdef TCP_DEBUG
#ifdef INET
@@ -453,6 +461,10 @@
splx(s);
return;
}
+ if (!callout_expired(&tp->t_timer[TCPT_PERSIST])) {
+ splx(s);
+ return;
+ }
#ifdef TCP_DEBUG
#ifdef INET
@@ -520,6 +532,10 @@
splx(s);
return;
}
+ if (!callout_expired(&tp->t_timer[TCPT_KEEP])) {
+ splx(s);
+ return;
+ }
#ifdef TCP_DEBUG
ostate = tp->t_state;
@@ -607,6 +623,10 @@
splx(s);
return;
}
+ if (!callout_expired(&tp->t_timer[TCPT_2MSL])) {
+ splx(s);
+ return;
+ }
/*
* 2 MSL timeout went off, clear the SACK scoreboard, reset
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: kern-bug-people->bouyer
Responsible-Changed-By: bouyer@NetBSD.org
Responsible-Changed-When: Sun, 16 Nov 2008 20:41:35 +0000
Responsible-Changed-Why:
I handled it
State-Changed-From-To: open->closed
State-Changed-By: bouyer@NetBSD.org
State-Changed-When: Sun, 16 Nov 2008 20:41:35 +0000
State-Changed-Why:
patch commited, pullup requested.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.