NetBSD Problem Report #41058

From www@NetBSD.org  Sun Mar 22 13:32:34 2009
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id E510063BAFE
	for <gnats-bugs@gnats.netbsd.org>; Sun, 22 Mar 2009 13:32:34 +0000 (UTC)
Message-Id: <20090322133234.8D0C163B9E6@www.NetBSD.org>
Date: Sun, 22 Mar 2009 13:32:34 +0000 (UTC)
From: don@donhayford.com
Reply-To: don@donhayford.com
To: gnats-bugs@NetBSD.org
Subject: General software failures with kernels built for arm/NSLU2 kernels
X-Send-Pr-Version: www-1.0

>Number:         41058
>Category:       kern
>Synopsis:       General software failures with kernels built for arm/NSLU2 kernels
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Mar 22 13:35:00 +0000 2009
>Closed-Date:    Sat Sep 05 20:24:14 +0000 2015
>Last-Modified:  Sat Sep 05 20:24:14 +0000 2015
>Originator:     Donald T Hayford
>Release:        -current from 4/26/2008 through 3/14/2009
>Organization:
self
>Environment:
NetBSD netbsd. 5.99.01 NetBSD 5.99.01 (NSLU2_ALL) #0: Sun Mar  8 22:42:32 EDT 2009  hayford@free-2k:/usr/home/hayford/netbsd-20081107/obj-armeb/sys/arch/evbarm/compile/NSLU2_ALL
>Description:
There have been recent reports of problems with NetBSD on the NSLU2, on both -current and older sources.  For example, see:
http://mail-index.netbsd.org/port-arm/2009/03/08/msg000693.html
and the associated thread.
The symptoms are random failures across unrelated software: cc reports internal errors and md5 returns inconsistent results.  The most reliable test I have found, though it is not a great one, is to run md5 repeatedly on a large file.  Older kernels occasionally return a different (and incorrect) result (1 or 2 times out of 30).  Newer kernels fail significantly more often.  For example, here are the results for a kernel built using a cvs date of 11/07/2008:

MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d (Note: this is the correct answer)
MD5 (pkgsrc-2008Q3.tar.gz) = 52925bf5588c9643ee58f4f24ab024b0
MD5 (pkgsrc-2008Q3.tar.gz) = b5af39add036cd54f78ff38384a60370
MD5 (pkgsrc-2008Q3.tar.gz) = f0d5f16bdc59d09f906557149bc9a326
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = 0f6c17c863a497165e78729c6c19f042
MD5 (pkgsrc-2008Q3.tar.gz) = 767c33a02ce12ef37b0cfed747586451
MD5 (pkgsrc-2008Q3.tar.gz) = 39ae3194a7306e5cafdd1767bd26d141
MD5 (pkgsrc-2008Q3.tar.gz) = 550be4e993b76ccd8c218c6b59e69f03
MD5 (pkgsrc-2008Q3.tar.gz) = 60beb07311a37b4076d39aa85eadf748
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = 680e4b443e0984efdc3f8f3ac6d67b62
MD5 (pkgsrc-2008Q3.tar.gz) = 273d8d3bf711ab59fb55bede9cb077d2
MD5 (pkgsrc-2008Q3.tar.gz) = 25edba177456c0059964ddff9371095d
MD5 (pkgsrc-2008Q3.tar.gz) = b0cc81b7fc8d66e8e65d174b432762d6
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = 0951fb446714bfc3842dd57891e7288a
MD5 (pkgsrc-2008Q3.tar.gz) = 7c7b8bb852e567dec180bee4fbb70888
MD5 (pkgsrc-2008Q3.tar.gz) = a426775dee944c97159098e0119efa26
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = bc60fe91f8409fda6fc8cbb834b45102
MD5 (pkgsrc-2008Q3.tar.gz) = a2725280cd57be4e388903e933708852
MD5 (pkgsrc-2008Q3.tar.gz) = 9902781605916302983831b48947592b
MD5 (pkgsrc-2008Q3.tar.gz) = 3adf6c0a2616b7f9720b9874e013b8b2
MD5 (pkgsrc-2008Q3.tar.gz) = 86c0e9d9ba56bde4587d01d5606e9a90
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = c442a730b262580d7f3ba6ef44622057
MD5 (pkgsrc-2008Q3.tar.gz) = 72d8b997c41de9da6d7fbe0202115ca6
MD5 (pkgsrc-2008Q3.tar.gz) = fc963410678d6b514580762e0feb2126
MD5 (pkgsrc-2008Q3.tar.gz) = 16209fe1355c850392dd46775cd8547e
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = 0710cd9aa73651cd00b274fed0399a3d
MD5 (pkgsrc-2008Q3.tar.gz) = 7c11cbe472c9e16fff77b20d9bede6b7
# uname -a
NetBSD netbsd. 5.99.01 NetBSD 5.99.01 (NSLU2_ALL) #0: Sun Mar  8 22:42:32 EDT 2009  hayford@free-2k:/usr/home/hayford/netbsd-20081107/obj-armeb/sys/arch/evbarm/compile/NSLU2_ALL

I have tested kernels back to 5/5/2008 and still see the problem, though the failure rates are pretty low (1 in 30 to 1 in 50).  I tried to build a kernel from 4/26/2008, but was unable to, for some reason.  However, I was able to substitute the following two directories from 4/26/2008 into later builds (5/12/2008, 5/16/2008 and 6/1/2008) and get the kernel and world to build correctly:
    src/sys/arch/arm
    src/sys/arch/evbarm
These bastardized kernels seem to run correctly - i.e., no failures in more than 100 trials.  I tried to be more selective with the files that were substituted (i.e., just the arm/xscale directory or just the evbarm/nslu2 directory), but unfortunately there are enough dependencies scattered about that the build fails.  Likewise, I can't substitute these directories into anything approaching -current and build correctly.
>How-To-Repeat:
Build any kernel from 5/5/2008 to present for the NSLU2.
Run md5 on any large file (I use pkgsrc-2008Q3.tar.gz).  Do it several times and you will get different (and incorrect) results.
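
The repeated-hash test can also be scripted.  What follows is only a sketch, not the procedure used for this report: it hashes the same file several times and counts the runs that disagree with a known-good reference.  FNV-1a is swapped in for MD5 purely to keep the example free of library dependencies, and the names hash_file() and count_mismatches() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * 64-bit FNV-1a hash of a whole file.  This is a stand-in for md5(1),
 * NOT MD5 itself.  Returns 0 and stores the hash in *out on success,
 * or -1 if the file cannot be opened or read.
 */
static int hash_file(const char *path, uint64_t *out)
{
    unsigned char buf[65536];
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    size_t n, i;
    FILE *fp;
    int err;

    assert(path != NULL && out != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;          /* FNV-1a prime */
        }
    }
    err = ferror(fp);
    fclose(fp);
    if (err)
        return -1;
    *out = h;
    return 0;
}

/*
 * Hash `path` `runs` times and count the runs whose result differs
 * from `reference`.  Returns -1 if any run fails to read the file.
 */
static int count_mismatches(const char *path, int runs, uint64_t reference)
{
    int i, bad = 0;
    uint64_t h;

    for (i = 0; i < runs; i++) {
        if (hash_file(path, &h) != 0)
            return -1;
        if (h != reference)
            bad++;
    }
    return bad;
}
```

The reference value should come from a machine known to be good.  With it in hand, count_mismatches("pkgsrc-2008Q3.tar.gz", 30, reference) should return 0 on a healthy system; a nonzero count would presumably correspond to the inconsistent md5 results described above.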
>Fix:
Using src/sys/arch/arm and src/sys/arch/evbarm from 4/26/2008 with src from 5/12/2008 - 6/1/2008 will build and apparently run correctly.

>Release-Note:

>Audit-Trail:
From: Steve Woodford <scw@NetBSD.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: kern/41058: General software failures with kernels built for arm/NSLU2 kernels
Date: Tue, 27 Oct 2009 20:22:02 +0000

 --Boundary-00=_qZ15KW6mTKXovGn
 Content-Type: text/plain;
   charset="iso-8859-1"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline

 I took a look at this bug last weekend after finally falling foul of it 
 when updating one of my NSLU2s to netbsd-5.

 It looks like the ARMv6 integration exposed a cache sync corner case in 
 pmap_activate(). The latter is supposed to flush the cache only when 
 switching between different userland processes, but I believe there's a 
 corner case related to process exit which also bypasses the flush when 
 it ought not to.

 I've attached a patch relative to the netbsd-5 branch, but which will 
 also apply to netbsd-current. The patch is intended to be temporary 
 until someone (maybe me) invests more time in tracking down the corner 
 case.

 Let me know if you're in a position to test the patch before I commit 
 it.

 Steve

 --Boundary-00=_qZ15KW6mTKXovGn
 Content-Type: text/x-diff;
   charset="iso 8859-15";
   name="pmap.c.diff"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename="pmap.c.diff"

 Index: pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/arm/arm32/pmap.c,v
 retrieving revision 1.187
 diff -u -r1.187 pmap.c
 --- pmap.c	28 Sep 2008 21:27:11 -0000	1.187
 +++ pmap.c	27 Oct 2009 20:09:51 -0000
 @@ -3645,7 +3645,6 @@
  	    pg, VM_PAGE_TO_PHYS(pg), prot));

  	switch(prot) {
 -		return;
  	case VM_PROT_READ|VM_PROT_WRITE:
  #if defined(PMAP_CHECK_VIPT) && defined(PMAP_APX)
  		pmap_clearbit(pg, PVF_EXEC);
 @@ -4076,6 +4075,15 @@
  	 * entire cache.
  	 */
  	rpm = pmap_recent_user;
 +
 +/*
 + * XXXSCW: There's a corner case here which can leave turds in the cache as
 + * reported in kern/41058. They're probably left over during tear-down and
 + * switching away from an exiting process. Until the root cause is identified
 + * and fixed, zap the cache when switching pmaps. This will result in a few
 + * unnecessary cache flushes, but that's better than silently corrupting data.
 + */
 +#if 0
  	if (npm != pmap_kernel() && rpm && npm != rpm &&
  	    rpm->pm_cstate.cs_cache) {
  		rpm->pm_cstate.cs_cache = 0;
 @@ -4083,6 +4091,16 @@
  		cpu_idcache_wbinv_all();
  #endif
  	}
 +#else
 +	if (rpm) {
 +		rpm->pm_cstate.cs_cache = 0;
 +		if (npm == pmap_kernel())
 +			pmap_recent_user = NULL;
 +#ifdef PMAP_CACHE_VIVT
 +		cpu_idcache_wbinv_all();
 +#endif
 +	}
 +#endif

  	/* No interrupts while we frob the TTB/DACR */
  	oldirqstate = disable_interrupts(IF32_bits);

 --Boundary-00=_qZ15KW6mTKXovGn--
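
The effect of the second and third hunks can be seen by comparing the two flush conditions side by side.  The sketch below is a user-space model, not kernel code: struct toy_pmap and both function names are hypothetical, the is_kernel flag stands in for the comparison against pmap_kernel(), and PMAP_CACHE_VIVT is assumed to be defined so that a true result means cpu_idcache_wbinv_all() would run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct pmap; only what the flush check reads. */
struct toy_pmap {
    bool is_kernel;   /* models "npm == pmap_kernel()" */
    bool cs_cache;    /* models pm_cstate.cs_cache */
};

/*
 * Condition before the patch: flush only when switching to a different,
 * non-kernel pmap and the recent user pmap has cache state recorded.
 */
static bool flush_before_patch(const struct toy_pmap *npm,
                               const struct toy_pmap *rpm)
{
    assert(npm != NULL);
    return !npm->is_kernel && rpm != NULL && npm != rpm && rpm->cs_cache;
}

/*
 * Condition after the patch: flush whenever a recent user pmap exists,
 * regardless of which pmap is being switched to.
 */
static bool flush_after_patch(const struct toy_pmap *npm,
                              const struct toy_pmap *rpm)
{
    assert(npm != NULL);
    return rpm != NULL;
}
```

In this model, switching to the kernel pmap, or re-activating the same user pmap, skips the flush before the patch and performs it afterwards.  That matches the "few unnecessary cache flushes" the XXXSCW comment accepts in exchange for not corrupting data.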

From: Donald T Hayford <don@donhayford.com>
To: gnats-bugs@NetBSD.org, Steve Woodford <scw@netbsd.org>
Cc: 
Subject: Re: kern/41058: General software failures with kernels built for
 arm/NSLU2 kernels
Date: Tue, 27 Oct 2009 18:39:33 -0400

 I'll try to look at it tonight.  Thanks.
 Don

From: Donald T Hayford <don@donhayford.com>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, 
 netbsd-bugs@netbsd.org
Subject: Re: kern/41058: General software failures with kernels built for
 arm/NSLU2 kernels
Date: Fri, 30 Oct 2009 18:56:37 -0400

 I built NetBSD-5-0-1 with the patch Steve provided; it built with no 
 issues.
 I booted the NSLU2 with the 5-0-1 kernel and, using the same world I 
 used previously, ran the same MD5 test as before.  For those 
 who don't remember, running md5 on large files would return 
 incorrect results roughly one time in 50.

 After the patch, I got 540 correct answers out of 540.  It appears that 
 the problem has been corrected.

 Thanks, Steve.  Great work.  And I, for one, believe you are the ideal 
 person to invest more time tracking this down.  ;-).

 Regards,
 Don
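
As a rough check on how convincing 540 clean runs are, the sketch below computes the chance of seeing that many consecutive passes if the old failure rate of roughly one in 50 still applied.  It assumes the runs are independent, which the report does not establish, and the function name prob_all_pass() is hypothetical.

```c
#include <assert.h>

/*
 * Probability that `runs` independent trials all pass when each trial
 * fails with probability p_fail.  Computed by repeated multiplication
 * so that no math library is needed.
 */
static double prob_all_pass(double p_fail, unsigned runs)
{
    double p = 1.0;
    unsigned i;

    assert(p_fail >= 0.0 && p_fail <= 1.0);
    for (i = 0; i < runs; i++)
        p *= 1.0 - p_fail;
    return p;
}
```

Under that assumption prob_all_pass(1.0 / 50.0, 540) is about 0.00002, so 540 passes in a row would be very unlikely if the failure rate were unchanged.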

From: Steve Woodford <scw@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/41058 CVS commit: src/sys/arch/arm/arm32
Date: Sat, 28 Nov 2009 11:44:45 +0000

 Module Name:	src
 Committed By:	scw
 Date:		Sat Nov 28 11:44:45 UTC 2009

 Modified Files:
 	src/sys/arch/arm/arm32: pmap.c

 Log Message:
 Apply some band-aid to pmap_activate() for PR kern/41058:

 There's a corner case here which can leave turds in the cache as
 reported in kern/41058. They're probably left over during tear-down and
 switching away from an exiting process. Until the root cause is identified
 and fixed, zap the cache when switching pmaps. This will result in a few
 unnecessary cache flushes, but that's better than silently corrupting data.

 Also remove an extraneous return statement in pmap_page_protect() which
 crept in during the matt-armv6 merge.


 To generate a diff of this commit:
 cvs rdiff -u -r1.202 -r1.203 src/sys/arch/arm/arm32/pmap.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Sat, 05 Sep 2015 20:24:14 +0000
State-Changed-Why:
Patch was committed in November 2009.


>Unformatted:

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.