NetBSD Problem Report #48142

From Wolfgang.Stukenbrock@nagler-company.com  Wed Aug 21 12:57:07 2013
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 8B7D8718A5
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 21 Aug 2013 12:57:07 +0000 (UTC)
Message-Id: <20130821125653.7D875123B93@test-s0.nagler-company.com>
Date: Wed, 21 Aug 2013 14:56:53 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: i8254 timer stop working during boot - system lockup during boot
X-Send-Pr-Version: 3.95

>Number:         48142
>Category:       port-amd64
>Synopsis:       i8254 timer stop working during boot - system lockup during boot
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-amd64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Aug 21 13:00:00 +0000 2013
>Last-Modified:  Thu Aug 22 09:55:00 +0000 2013
>Originator:     Dr. Wolfgang Stukenbrock
>Release:        NetBSD 6.1
>Organization:
Dr. Nagler & Company GmbH
>Environment:


System: NetBSD test-s0 5.1.2 NetBSD 5.1.2 (NSW-WS) #3: Fri Dec 21 15:15:43 CET 2012 wgstuken@test-s0:/usr/src/sys/arch/amd64/compile/NSW-WS amd64
Architecture: x86_64
Machine: amd64
>Description:
	While initializing the hardware on x86 systems, sys/arch/x86/x86/cpu.c calls i8254_delay() directly,
	bypassing the DELAY() macro, which may call this function or something else - e.g. lapic_delay() from
	sys/arch/x86/x86/lapic.c. I'm not sure whether this is by design or by mistake ...
	Once the lapic is set up, the i8254 timer counter frequency is set to '0' - a full cycle, as far as I understand.
	OK - the time source runs slower than before, but it is still running. All calls to i8254_delay() will wait
	longer than before, but that does not really hurt here, and nobody has noticed it as a problem so far.
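
	To illustrate the difference between the two call paths, here is a minimal user-space model. It is
	only a sketch: the delay_func pointer, the counters and boot_sequence_model() are assumptions made
	for the illustration and are not copied from the kernel sources.

```c
#include <assert.h>

/* Call counters, only here so the model can be observed. */
unsigned int i8254_calls, lapic_calls;

/* Stand-ins for the real delay routines; they just count their calls. */
void i8254_delay(unsigned int us) { (void)us; i8254_calls++; }
void lapic_delay(unsigned int us) { (void)us; lapic_calls++; }

/* Assumption: DELAY() dispatches through a pointer that starts at the i8254. */
void (*delay_func)(unsigned int) = i8254_delay;
#define DELAY(us) (*delay_func)(us)

/* Models the order of events described in this report. */
void boot_sequence_model(void)
{
	assert(delay_func == i8254_delay);
	DELAY(10);                 /* early boot: ends up in i8254_delay() */
	delay_func = lapic_delay;  /* assumed effect of the lapic setup */
	DELAY(10);                 /* now ends up in lapic_delay() */
	i8254_delay(10);           /* direct call: ignores the switch above */
}
```

	After boot_sequence_model() has run, the i8254 routine has been entered twice and the lapic routine
	once. The direct call keeps depending on the i8254 even after DELAY() has moved on.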

	On our Supermicro X8DAH (with only one CPU installed) the system hangs during startup with some kernel
	configurations, while starting the "other" CPUs.

	I've debugged it and found that the i8254 timer has stopped counting - for unknown reasons.
	This happens after the isa subsystem is initialized.
	If i8254_delay() is called after that point - e.g. at the end of isaattach() for debugging purposes - it never returns.
	At the start of that routine the timer is still working fine.
	No indication of the problem is reported - the user just sits there wondering ...

	The problem is triggered by the finsio driver on port 0x4e, but I'm not sure whether it is the fault of this driver or
	whether the timer registers on this board are also visible on ports other than 0x40 and 0x43.
	I have not tested other motherboards yet - perhaps others are affected too.
>How-To-Repeat:
	Configure "finsio0 at isa? port 0x4e" in a kernel configuration on a Supermicro X8DAH board.
	The kernel will freeze during startup.
>Fix:
	I'm not 100% sure, because my knowledge of the constraints during startup is too limited.
	Perhaps replacing the i8254_delay() calls in arch/x86/x86/cpu.c with DELAY() would be a good idea.
	It solves the problem for me, but I'm not sure whether there are other side effects.

	Another way to work around it is to catch the case where the timer stops counting in i8254_delay() in
	sys/arch/x86/isa/clock.c.
	If we assume that each loop iteration takes longer than one timer tick, we can decrement the remaining count by one
	each time we read the same tick value again, which avoids an endless loop here.
	This approach slows down bringing up the "other" CPUs somewhat, but it also works fine, and without
	accessing "other resources" as the first fix would do ...
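
	The idea can be tried out in user space with a simulated down-counter. This is only a sketch of the
	technique, not the kernel routine: delay_ticks(), struct sim_timer and sim_gettick() are names made up
	for the illustration, and the reload value stands in for rtclock_tval.

```c
#include <assert.h>

typedef int (*gettick_fn)(void *ctx);

/* Simulated down-counter: drops by `step` per read, wraps at `reload`. */
struct sim_timer { int value; int step; int reload; };

int sim_gettick(void *ctx)
{
	struct sim_timer *t = ctx;
	int v = t->value;

	t->value -= t->step;
	if (t->value < 0)
		t->value += t->reload;
	return v;
}

/*
 * Busy-wait until `ticks` timer ticks have been accounted for.
 * Returns the number of loop iterations that were needed.
 */
long delay_ticks(gettick_fn gettick, void *ctx, int reload, long ticks)
{
	long remaining = ticks, loops = 0;
	int initial_tick, cur_tick;

	assert(reload > 0);
	initial_tick = gettick(ctx);
	while (remaining > 0) {
		cur_tick = gettick(ctx);
		if (cur_tick > initial_tick)       /* counter wrapped around */
			remaining -= reload - (cur_tick - initial_tick);
		else if (cur_tick == initial_tick) /* no progress: charge one tick */
			remaining -= 1;
		else
			remaining -= initial_tick - cur_tick;
		initial_tick = cur_tick;
		loops++;
	}
	return loops;
}
```

	With a frozen timer (step 0) the loop ends after exactly `ticks` iterations instead of spinning forever.
	With a running timer the extra branch is only taken if two reads happen to return the same value.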

	Remark: if compiled for XEN, the DELAY() macro would use xen_delay(). I'm not sure whether this is OK,
		or whether sys/arch/x86/x86/cpu.c is part of a XEN kernel at all.

	Remark: i8254_delay() is not called directly again later, so the second approach only slows down the
		boot process. (At least I've found no other references to it in the sources.)

	Here is a patch for sys/arch/x86/isa/clock.c that uses the second approach.
	Feel free to use it, or to replace i8254_delay() with DELAY() in sys/arch/x86/x86/cpu.c.
	Perhaps it would make sense to add some additional code that reports the problem to the user the first time
	it happens, but on very, very fast systems in the future this message may be misleading ...

--- clock.c     2013/08/21 12:43:37     1.1
+++ clock.c     2013/08/21 12:48:26
@@ -482,6 +482,9 @@
                cur_tick = gettick();
                if (cur_tick > initial_tick)
                        delta = rtclock_tval - (cur_tick - initial_tick);
+// avoid looping forever if timer stops counting for any reason
+               else if (cur_tick == initial_tick)
+                       delta = 1;
                else
                        delta = initial_tick - cur_tick;
                if (delta < 0 || delta >= rtclock_tval / 2) {
@@ -500,6 +503,9 @@
                cur_tick = gettick();
                if (cur_tick > initial_tick)
                        remaining -= rtclock_tval - (cur_tick - initial_tick);
+// avoid looping forever if timer stops counting for any reason
+               else if (cur_tick == initial_tick)
+                       remaining -= 1;
                else
                        remaining -= initial_tick - cur_tick;
 #endif

>Audit-Trail:
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: port-amd64/48142: i8254 timer stop working during boot - system lockup during boot
Date: Thu, 22 Aug 2013 08:30:04 +0100

 On Wed, Aug 21, 2013 at 01:00:01PM +0000, Wolfgang.Stukenbrock@nagler-company.com wrote:
 > >Number:         48142
 > >Category:       port-amd64
 > >Synopsis:       i8254 timer stop working during boot - system lockup during boot
 ...
 > 	The problem is triggered by the finsio driver on port 0x4e,
 >       but I'm not shure if it is the fault of this driver or
 > 	if the timer registers are visible on other ports than 0x40 and
 >       0x43 on this board too.

 Why are you enabling the finsio driver?
 Enabling 'random' ISA drivers will always cause grief, the 'grope'
 code can easily destroy other hardware.
 The finsio grope is very nasty - it does io-writes - so you really shouldn't
 enable it unless you really need it and expect the grope to succeed.
 Restoring the original value might help.

 I'd certainly add some debug to read the 8254 timer ports during the
 finsio grope code - to work out exactly when they are clobbered.
 IIRC reads from the timer registers are always ok.
 Might be worth doing byte reads of ports 0x40 through 0x4f just to
 see if the timer registers are aliased.

 	David

 -- 
 David Laight: david@l8s.co.uk
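
 A read-only scan of that port range could look roughly like the sketch
 below. It rests on a heuristic that is an assumption here: a port that
 returns changing values on repeated reads is decoded by something
 live, while an unclaimed port keeps returning the same byte. The inb_fn
 callback, scan_live_ports() and the simulated bus are hypothetical
 names; a real test would call the kernel's port-read primitive instead.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t (*inb_fn)(void *ctx, unsigned port);

/*
 * Read every port in [first, last] `reads` times. Returns a bitmask
 * (bit 0 = first) of the ports whose value changed between reads.
 */
uint32_t scan_live_ports(inb_fn inb, void *ctx,
    unsigned first, unsigned last, int reads)
{
	uint32_t live = 0;
	unsigned port;
	int i;

	assert(first <= last && last - first < 32 && reads > 1);
	for (port = first; port <= last; port++) {
		uint8_t v0 = inb(ctx, port);

		for (i = 1; i < reads; i++) {
			if (inb(ctx, port) != v0) {
				live |= (uint32_t)1 << (port - first);
				break;
			}
		}
	}
	return live;
}

/*
 * Simulated bus for demonstration only: three counters that answer on
 * every port in 0x40-0x4f whose low two bits are not 3. Everything
 * else reads as 0xff.
 */
uint8_t sim_inb(void *ctx, unsigned port)
{
	unsigned *clock = ctx;

	if ((port & ~0xfu) == 0x40u && (port & 0x3u) != 0x3u)
		return (uint8_t)((*clock)++);
	return 0xff;
}
```

 On the simulated bus, scan_live_ports(sim_inb, &clock, 0x40, 0x4f, 4)
 flags the three counter ports and their three assumed aliases in each
 group of four. On real hardware the result would need careful reading,
 since other devices in that range may change between reads as well.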

From: Wolfgang Stukenbrock <wolfgang.stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: port-amd64-maintainer@NetBSD.org, gnats-admin@NetBSD.org,
        netbsd-bugs@NetBSD.org
Subject: Re: port-amd64/48142: i8254 timer stop working during boot - system lockup during boot
Date: Thu, 22 Aug 2013 11:52:25 +0200

 Hi,

 I've enabled it to see whether the device would be found.
 This is the only way I know to detect "all" supported ISA hardware
 (sensors in most cases) on a motherboard. Even the HW manuals for many
 motherboards say nothing about the available hardware or the
 addresses where it is located.

 Of course, the problem goes away if finsio is disabled.
 And of course it is not a good idea to have useless parts configured in a
 kernel - especially parts without any type identification, which have to be
 accessed in order to find out whether they are present at all (ISA,
 VMEbus devices, ...).

 But I think it would be very nice to either have a workaround for the case
 that the timer has died, or at least to run into a panic.
 It is not a good idea for the system to lock up silently during boot.
 I also can't see any warning about potential problems in the finsio man
 page.
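
 A check of that kind could be as small as the following sketch. It is
 not taken from the kernel: timer_is_running() and the two tick sources
 are hypothetical, and the caller would decide whether to print a
 warning or to panic when the check fails.

```c
#include <assert.h>

typedef int (*gettick_fn)(void *ctx);

/* Returns 1 if the counter value changes within max_reads reads, else 0. */
int timer_is_running(gettick_fn gettick, void *ctx, int max_reads)
{
	int first, i;

	assert(max_reads > 0);
	first = gettick(ctx);
	for (i = 0; i < max_reads; i++) {
		if (gettick(ctx) != first)
			return 1;
	}
	return 0;
}

/* Demonstration tick sources: one stuck, one counting down. */
int frozen_tick(void *ctx) { (void)ctx; return 1234; }
int counting_tick(void *ctx) { int *v = ctx; return (*v)--; }
```

 Calling timer_is_running() once after the isa devices have attached
 would turn the silent hang into something that can be reported. The
 bound max_reads is a guess and would have to be chosen large enough
 that a slow but working timer is not flagged.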

 I've just tested "the problem" on an Intel S3000 board, and it also
 locks up if finsio is enabled.
 So either the 8254 has 16 or more registers, or it is common for the chip
 to be accessible at multiple addresses - some address lines not
 decoded to keep the HW tiny... (I have no 8254 manual at hand at
 the moment to check this ...)
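
 To make the second guess concrete, here is a small model of incomplete
 address decoding. It is purely hypothetical: struct pit_model, the
 decode masks and the helper functions are invented for the
 illustration, and nothing here is confirmed for either board.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_BASE 0x40u

/* A chip with four registers and a configurable address decoder. */
struct pit_model {
	uint8_t reg[4];        /* counters 0-2 and the control word register */
	unsigned decode_mask;  /* address bits the chip select actually checks */
};

/* Returns the register index a port selects, or -1 if the chip ignores it. */
int pit_decode(const struct pit_model *p, unsigned port)
{
	assert(p != 0);
	if ((port & p->decode_mask) != (PIT_BASE & p->decode_mask))
		return -1;
	return (int)(port & 0x3u);
}

/* A port write lands in a register only if the decoder selects the chip. */
void pit_outb(struct pit_model *p, unsigned port, uint8_t val)
{
	int r = pit_decode(p, port);

	if (r >= 0)
		p->reg[r] = val;
}
```

 With decode_mask 0xfffc only ports 0x40-0x43 reach the chip. With
 0xfff0, i.e. two address lines left undecoded, port 0x4e selects
 register 2 and port 0x4f selects register 3, the control word register,
 so a probe write in that range would land in the timer.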

 If it were possible for the finsio driver to test for this situation
 first, that would be a very good idea - perhaps with a warning message
 during boot if the mapping problem is detected.

 A warning/comment in the kernel config file that the finsio driver
 may blow something up would also be nice. This would warn "normal" users that
 there are potential problems and that they should try deactivating the driver
 again if the system runs into problems.

 If the problem is fixed (or will be) in the next update and/or next release
 of the finsio driver, I think you can close this PR.

 best regards

 W. Stukenbrock



>Unformatted:

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.