NetBSD Problem Report #33986

From tls@netbsd.cs.columbia.edu  Wed Jul 12 17:58:25 2006
Return-Path: <tls@netbsd.cs.columbia.edu>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id A93AF63B896
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 12 Jul 2006 17:58:25 +0000 (UTC)
Message-Id: <200607121647.k6CGl6df027630@build.netbsd.org>
Date: Wed, 12 Jul 2006 16:47:06 GMT
From: tls@netbsd.org
Reply-To: tls@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: UFS_DIRHASH causes rampant kernel memory corruption
X-Send-Pr-Version: 3.95

>Number:         33986
>Category:       kern
>Synopsis:       Kernels with UFS_DIRHASH exhibit corruption of other kernel data structures, particularly mbufs and other structures managed with pool(9)
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jul 12 18:00:00 +0000 2006
>Last-Modified:  Sun Jul 16 12:10:01 +0000 2006
>Originator:     Thor Lancelot Simon
>Release:        NetBSD 3.0_STABLE
>Organization:
The NetBSD Foundation, Inc.
>Environment:


System: NetBSD b2.netbsd.org 3.0_STABLE NetBSD 3.0_STABLE (FAITH) #2: Tue Jul 11 20:19:46 UTC 2006 root@ADMIN:/usr/obj/sys/arch/i386/compile.i386/FAITH i386
Architecture: i386
Machine: i386
>Description:

We've just spent almost a month chasing kernel memory corruption problems
on one of the TNF build servers.  These manifested as panics all over the
kernel, particularly in the networking code, where a pointer within a
data structure (usually a pool-allocated structure) had been overwritten
with garbage -- upon examination, usually part of some other data structure.

A typical symptom is a panic due to junk pointer dereference in UDP
input or (in a kernel with FAST_IPSEC) a bad attempt to zeroize a
nonexistent ESP or AH key due to corruption of a security association
data structure.  Another common panic is in the ipfilter rule matching
code.  Many of these code paths share the property of invocation via
the soft network "interrupt".  However, we have observed panics throughout
the kernel, always due to uvm_fault (-> 0xe) on a kernel address.  Adjusting
other kernel options (e.g. removing FAST_IPSEC or substituting pf for ipf)
may make the problem occur less _often_, but it still occurs, and the
symptom is still the same: page fault in supervisor mode due to a corrupted
pointer in a kernel data structure, wherever in the kernel it may occur.

Removing UFS_DIRHASH from our kernel configuration made the problem go
away.  Though it is possible that there is an underlying problem of some
kind in one of the allocators that is simply particularly badly exposed
by UFS_DIRHASH, it seems more likely that there is a problem (which we
haven't found yet) in UFS_DIRHASH itself.  The code has a history of
similar problems on FreeBSD which seem to have ended only when the entire
kernel synchronization scheme in FreeBSD was reworked in FreeBSD 5.
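
The comparison described above comes down to a one-line difference in the
kernel configuration.  A minimal sketch follows, assuming a config file
named DIRHASHTEST (a hypothetical name used only for illustration) placed
alongside the stock i386 configs; this is not the FAITH configuration
itself.

```
# DIRHASHTEST: hypothetical config name, for illustration only.
# Pull in the stock multiprocessor/ACPI configuration unchanged.
include "arch/i386/conf/GENERIC.MPACPI"

# In-core directory hashing for UFS.  Comment this line out to build
# the control kernel without it.
options 	UFS_DIRHASH
```

Building both variants from the same source tree keeps everything else
identical, so any difference in behaviour can be attributed to the option.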

We have observed this problem with NetBSD 3.0-STABLE and with
NetBSD-current as of 7/6/2006.

>How-To-Repeat:

Adding UFS_DIRHASH to a GENERIC.MPACPI kernel (we have not tested with
uniprocessor kernels; our test system is a 4-core Opteron running a 32-bit
kernel) is sufficient to exercise the bug.  To see the problem, exercise
both the filesystem and networking code at the same time.  For example,
we do this by building NetBSD in a tight loop with a script that uses
tar | rsh | tar to distribute build jobs to slave hosts and collect
results, while running build jobs at the same time on the master host
itself.  With UFS_DIRHASH in the kernel, this reliably produces a panic
with clear evidence of this problem within 6-10 hours, and often much
faster than that.
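
For readers without a distributed build setup, the idea of loading the
filesystem and the network stack at the same time can be sketched with a
small stand-in load generator.  This is NOT the build loop described
above: the function names, the file and packet counts, and the use of
loopback UDP are all assumptions made for illustration, and there is no
evidence that this lighter workload triggers the bug.

```python
import os
import socket
import tempfile
import threading


def churn_directory(path, n_files, rounds):
    """Create, stat and unlink many entries in a single directory.

    Large directories with frequent name lookups are the case that
    directory hashing is meant to speed up.  Returns the number of
    successful stat() lookups.
    """
    lookups = 0
    names = [os.path.join(path, "f%06d" % i) for i in range(n_files)]
    for _ in range(rounds):
        for name in names:
            with open(name, "w") as fh:
                fh.write("x")
        for name in names:
            os.stat(name)
            lookups += 1
        for name in names:
            os.unlink(name)
    return lookups


def udp_loopback_traffic(n_packets, payload=b"x" * 512):
    """Send datagrams to a loopback socket and read each one back.

    Returns the number of datagrams received intact.
    """
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    received = 0
    try:
        rx.bind(("127.0.0.1", 0))      # port 0: let the kernel pick one
        rx.settimeout(5.0)
        addr = rx.getsockname()
        for _ in range(n_packets):
            tx.sendto(payload, addr)
            try:
                data, _peer = rx.recvfrom(2048)
            except socket.timeout:
                continue               # a lost datagram is not fatal here
            if data == payload:
                received += 1
    finally:
        tx.close()
        rx.close()
    return received


def run_stress(n_files=2000, rounds=5, n_packets=10000):
    """Run the directory churn and the UDP traffic concurrently."""
    results = {}
    with tempfile.TemporaryDirectory() as workdir:
        def fs_job():
            results["lookups"] = churn_directory(workdir, n_files, rounds)

        def net_job():
            results["packets"] = udp_loopback_traffic(n_packets)

        threads = [threading.Thread(target=fs_job),
                   threading.Thread(target=net_job)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results
```

Calling run_stress() in a loop from a shell keeps both subsystems busy.
Loopback traffic never touches a network driver, so a closer
approximation of the original workload would send the traffic to another
host.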

>Fix:

I recommend complete removal of the UFS_DIRHASH code if the problem
cannot be quickly identified.  Gordon Waldhoffer has recommended a change
to the on-disk directory format which is mostly backwards-compatible with
traditional UFS but which stores large directories as expanding hashes;
this may be a much better option in any case than constructing such hash
tables in-core.

>Audit-Trail:
From: David Brownlee <abs@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: kern/33986: UFS_DIRHASH causes rampant kernel memory corruption
Date: Sun, 16 Jul 2006 02:33:33 +0100 (BST)

 On Wed, 12 Jul 2006, tls@netbsd.org wrote:

 > A typical symptom is a panic due to junk pointer dereference in UDP
 > input or (in a kernel with FAST_IPSEC) a bad attempt to zeroize a
 > nonexistent ESP or AH key due to corruption of a security association
 > data structure.  Another common panic is in the ipfilter rule matching
 > code.  Many of these code paths share the property of invocation via
 > the soft network "interrupt".  However, we have observed panics throughout
 > the kernel always due to uvm_fault (-> 0xe) on a kernel address.  Adjusting
 > other kernel options (e.g. removing FAST_IPSEC or substituting pf for ip)
 > may make the problem occur less _often_, but it still occurs, and the
 > symptom is still the same: page fault in supervisor mode due to a corrupted
 > pointer in a kernel datastructure, wherever in the kernel it may occur.

  	This _could_ be unrelated, but I suspect it is not.

  	I have very similar symptoms (panic in uvm_fault (-> 0xe)
  	on a kernel address) on two systems without UFS_DIRHASH.
  	Both tend to panic overnight, sometimes in 'find'. One
  	tends to be running large-scale rsyncs and the other
  	postgres backups & rsyncs. Both pass memtester without issue.

  	Both were running relatively heavily tuned 3_STABLE kernels,
  	the same kernels being in use on around fifteen other boxes,
  	of which about eight were close to identical hardware.

  	On one machine I switched to a recent current GENERIC + PF
  	and it still happened. In no case has UFS_DIRHASH been in
  	any kernel. One common factor could be PF.

  	I would go along with the theory that UFS_DIRHASH is
  	triggering some extant issue elsewhere in the kernel.

  	I seem to recall all occurrences have been since 3.0, so
  	just for reference I'm going to switch the 'most commonly
  	failing' box to the GENERIC 3.0 release kernel to see if
  	it still happens.

 -- 
  		David/absolute       -- www.NetBSD.org: No hype required --

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/33986: UFS_DIRHASH causes rampant kernel memory corruption
Date: Sun, 16 Jul 2006 14:07:00 +0200

 On Sun, Jul 16, 2006 at 12:35:02AM +0000, David Brownlee wrote:
 >   	I would go along with the theory that UFS_DIRHASH is
 >   	triggering some extant issue elsewhere in the kernel.

 I'm thinking this too - PR kern/33630 (at least to me) seems to indicate
 that sometimes pool-related code corrupts memory, depending on the size
 of pool items. It is not clear, however, that this is related to this
 PR.

 Martin

>Unformatted:
