NetBSD Problem Report #39526

From Wolfgang.Stukenbrock@nagler-company.com  Fri Sep 12 13:47:01 2008
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id A7DEC63B92A
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 12 Sep 2008 13:47:01 +0000 (UTC)
Message-Id: <20080912134653.8F1D84EAA0D@s012.nagler-company.com>
Date: Fri, 12 Sep 2008 15:46:53 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: ahd driver crashes system if it runs out of memory
X-Send-Pr-Version: 3.95

>Number:         39526
>Category:       kern
>Synopsis:       ahd driver crashes system if it runs out of memory
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Sep 12 13:50:00 +0000 2008
>Last-Modified:  Sun Feb 15 04:31:13 +0000 2009
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 4.0
>Organization:
Dr. Nagler  & Company GmbH

>Environment:


System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #1: Thu Sep 11 12:21:03 CEST 2008 root@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
	I'm running a 29320ALP-R in a PCI-X slot on an Intel motherboard with 8 GB RAM installed
	(Intel S3210SHLX with E3110, with the 6MB-cache fix - see PR kern/39242).
	When the system starts using memory beyond 4 GB, the ahd driver fails to allocate memory for some
	structures. The following messages appear:
	ahd0: failed to create DMA map for Sense Data structures, error = 12
	ahd_createdmamem error (2)
	ahd0: failed to create DMA map for SG data structures, error = 12
	ahd_createdmamem error (2)
	ahd0: failed to create DMA map for hardware SCB structures, error = 12
	ahd_createdmamem error (2)

	I've added a debug print to the uvm_pglistalloc_simple routine to find the cause of the problem. It
	reports the following (an example - the number of free pages varies over time):

	plistalloc - waiting orig num 1 - num 1 low 0x1000000 high 0x100000000 - free 162 pd_res 1 kres 5

	This means that there was a request for one page and that one page has still not been found.
	The page must lie between 0x1000000 and 0x100000000.
	The vm statistics report that 162 pages are still free, one page is reserved for the pagedaemon and
	5 pages are reserved for the kernel.

	So there is physical memory available, but it lies outside the requested range.
	Therefore the error is returned to the ahd driver - it always fails in dma-mem-alloc().
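
	To illustrate the point, here is a minimal sketch in plain C of a range-limited page
	search. The struct page model and the names count_free() and alloc_in_range() are made up
	for this example and are not the UVM code; the sketch only shows how a non-zero free
	count and a failed ranged request can coexist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one physical page; not a real UVM structure. */
struct page {
	uint64_t paddr;		/* physical address of the page */
	int	 is_free;	/* non-zero if the page is currently free */
};

/* Count all free pages, as a global "free" statistic would. */
static size_t
count_free(const struct page *pages, size_t npages)
{
	size_t i, n = 0;

	for (i = 0; i < npages; i++)
		if (pages[i].is_free)
			n++;
	return n;
}

/*
 * Take the first free page with low <= paddr < high.
 * Returns its index, or -1 if no free page lies inside the range.
 */
static int
alloc_in_range(struct page *pages, size_t npages, uint64_t low, uint64_t high)
{
	size_t i;

	assert(low < high);
	for (i = 0; i < npages; i++) {
		if (pages[i].is_free &&
		    pages[i].paddr >= low && pages[i].paddr < high) {
			pages[i].is_free = 0;
			return (int)i;
		}
	}
	return -1;
}
```

	If every free page sits at or above 0x100000000, count_free() still reports free pages
	while alloc_in_range() with low 0x1000000 and high 0x100000000 returns -1.
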
	I've tried to enable 64-bit access by setting AHD_64BIT_ADDRESSING, but that does not help.
	(As a first try, I set the flag when a PCI_CAP_PCIX capability is found.)
	For unknown reasons AHD_64BIT_ADDRESSING is defined and tested in several places, but the code
	never sets it. Is this a bug or a feature? Is setting the flag just missing somewhere, or is
	the flag obsolete?
	Without this flag there are many more memory allocation failures, so I think the code that sets
	the flag was lost through some other change in the past.
	Nevertheless, the crash described below happens either way.

	After the 6 error messages above have repeated several times, the following output appears on the console:
	ahd0: failed to create DMA map for SG data structures, error = 12
	ahd_createdmamem error (2)
	ahd0: failed to create DMA map for Sense Data structures, error = 12
	ahd_createdmamem error (2)
	uvm_fault(0xffffffff80606ee0, 0xffff80009b797000, 2) -> e
	kernel: page fault trap, code=0
	Stopped in pid 9.1 (scsibus0) at        netbsd:ahd_alloc_scbs+0x1ed:    repe stosq      %es:(%rdi)

	The system is just trying to set up an additional request.
	The number of times the 6 lines above repeat before the crash varies.

	Another point that I don't understand is that, even if 64-bit addressing is allowed, the
	requested range is below 4 GB. Is this a bug, or is it necessary that DMA memory is below 4 GB even
	with 64-bit addressing?
>How-To-Repeat:
	Install an ahd controller (e.g. 29320ALP-R) in a system with 8 GB memory and put it under heavy load.
	After some minutes you will see the crash.
>Fix:
	Not known to me so far - sorry.
	Nothing I've tried so far solves the problem completely - I've had only partial success.
	Setting AHD_64BIT_ADDRESSING in ahd_pci_attach() at line 403 in sys/dev/pci/ahd_pci.c, after
	"if (!pci_get_capability(pa->pa_pc, pa->pa_tag, PCI_CAP_PCIX, &bd->pcix_off, NULL))" has succeeded,
	removes many of the failed allocations for transfers, but the system still crashes after some failed
	DMA memory allocation attempts.

>Release-Note:

>Audit-Trail:
From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: Wolfgang.Stukenbrock@nagler-company.com
Subject: Re: kern/39526: ahd driver crashes system if it runs out of memory
Date: Thu, 25 Sep 2008 15:23:38 +0200

 Hi,

 after some additional testing, the following patch leads to a
 situation where the controller can be used in a 64-bit system with 8 GB
 of main memory without crashing the system shortly after the lower 4 GB
 of memory is used up by processes.
 It enables the 64-bit memory access modes already present in the driver.
 This removes the use of the upper nibble of the length field for
 additional address bits and uses a real 64-bit pointer instead.
 I don't understand why this makes a difference, because nothing
 changes in the problematic 32-bit DMA structure allocation.
 And that is where the allocation failures happen ...
 But with this fix there are far fewer allocation failures than without it.
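
 As an illustration of the two addressing modes, here is a small sketch in C
 of a compact scatter/gather element that hides extra address bits in the
 length word, next to one that carries a full 64-bit pointer. The structure
 and macro names are invented for this example, and the exact bit positions
 (7 extra address bits in bits 24..30 of the length word, giving 32 + 7 = 39
 bits) are an assumption of the sketch, not copied from the driver source.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; names and bit positions are assumptions. */
#define SG_LEN_MASK		0x00FFFFFFu	/* bits 0..23: byte count */
#define SG_HIGH_ADDR_MASK	0x7F000000u	/* bits 24..30: address bits 32..38 */

struct sg32 { uint32_t addr; uint32_t len; };			/* compact element */
struct sg64 { uint64_t addr; uint32_t len; uint32_t pad; };	/* full pointer */

/* Pack a physical address below 2^39 into the compact element. */
static struct sg32
sg32_pack(uint64_t paddr, uint32_t len)
{
	struct sg32 sg;

	assert(paddr < (1ULL << 39));
	assert((len & ~SG_LEN_MASK) == 0);
	sg.addr = (uint32_t)(paddr & 0xFFFFFFFFu);
	/* Address bits 32..38 move down by 8 into bits 24..30 of len. */
	sg.len = len | ((uint32_t)(paddr >> 8) & SG_HIGH_ADDR_MASK);
	return sg;
}

/* Recover the full physical address from the compact element. */
static uint64_t
sg32_addr(struct sg32 sg)
{
	return ((uint64_t)(sg.len & SG_HIGH_ADDR_MASK) << 8) | sg.addr;
}

/* Recover the byte count from the compact element. */
static uint32_t
sg32_len(struct sg32 sg)
{
	return sg.len & SG_LEN_MASK;
}

/* The wide element needs no bit tricks: address and length stay separate. */
static struct sg64
sg64_pack(uint64_t paddr, uint32_t len)
{
	struct sg64 sg;

	sg.addr = paddr;
	sg.len = len;
	sg.pad = 0;
	return sg;
}
```

 Note that both layouts only describe where the data buffers live. Neither
 changes where the element array itself must be allocated.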

 This fix does not solve the problem completely!
 If there is a lot of traffic on the SCSI bus before all of the lower 4 GB
 of memory is used up, no problem occurs anymore, but if I/O on the
 controller starts after the lower 4 GB of memory is used by processes, the
 system will crash after a short time.


 rcsdiff -r1.1 -u ahd_pci.c
 ===================================================================
 RCS file: RCS/ahd_pci.c,v
 retrieving revision 1.1
 diff -u -r1.1 ahd_pci.c
 --- ahd_pci.c   2008/09/12 14:37:11     1.1
 +++ ahd_pci.c   2008/09/25 13:11:06
 @@ -400,6 +400,12 @@
                  ahd->chip |= AHD_PCI;
                  ahd->bugs &= ~AHD_PCIX_BUG_MASK;
          }
 +       else if (devconfig & PCI64BIT) {
 +               ahd->flags |= AHD_64BIT_ADDRESSING;
 +               aprint_normal("\n%s: using 64 Bit addressing modes", ahd_name(ahd));
 +       } else {
 +               aprint_normal("\n%s: using normal addressing modes - not on 64 Bit bus", ahd_name(ahd));
 +       }

          /*
           * Map PCI Registers
 @@ -502,7 +508,8 @@
          if ((ahd->flags & (AHD_39BIT_ADDRESSING|AHD_64BIT_ADDRESSING)) != 0) {
                  uint32_t dvconfig;

 -               aprint_normal("%s: Enabling 39Bit Addressing\n", ahd_name(ahd));
 +               aprint_normal("%s: Enabling %sBit Addressing\n", ahd_name(ahd),
 +                 ((ahd->flags & AHD_64BIT_ADDRESSING) ? "64" : "39"));
                  dvconfig = pci_conf_read(pa->pa_pc, pa->pa_tag, DEVCONFIG);
                  dvconfig |= DACEN;
                  pci_conf_write(pa->pa_pc, pa->pa_tag, DEVCONFIG, dvconfig);




From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/39526: ahd driver crashes system if it runs out of memory
Date: Fri, 26 Sep 2008 13:48:36 +0200

 Hi again,

 I now have a workaround that keeps the ahd driver from crashing the system.

 The workaround allocates all SCB structures during startup if a 64-bit
 bus is detected. By doing this it avoids any later call to GROW the
 number of SCBs.
 That is the place where no memory was found and where, after some failed
 allocations, a kernel crash with "fatal page fault in supervisor mode"
 happens.
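
 The idea can be sketched in a few lines of C. This is only a model: the
 struct pool type, POOL_MAX, POOL_CHUNK and the convention that the grow
 function returns the number of entries it added (0 meaning "nothing more
 to add") are assumptions made for the example, not the driver's real
 interface.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MAX	512	/* hypothetical upper bound on entries */
#define POOL_CHUNK	32	/* hypothetical entries added per grow step */

struct pool {
	size_t	count;		/* entries allocated so far */
	int	can_grow;	/* non-zero if later grow requests are allowed */
};

/* Add up to one chunk; returns the number of entries added, 0 when full. */
static size_t
pool_grow(struct pool *p)
{
	size_t room, n;

	assert(p->count <= POOL_MAX);
	room = POOL_MAX - p->count;
	n = room < POOL_CHUNK ? room : POOL_CHUNK;
	p->count += n;
	return n;
}

/* Attach-time setup: grow until nothing more can be added. */
static void
pool_attach(struct pool *p)
{
	p->count = 0;
	while (pool_grow(p) != 0)
		continue;
	/* Advertise growth only if the pool is not already at its limit. */
	p->can_grow = (p->count < POOL_MAX);
}
```

 After pool_attach() the pool is at its limit, so no grow request can be
 issued later, when memory in the required range may be exhausted.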

 With this fix I no longer get the kernel panic.
 So this workaround leads to a stable driver, but it does not fix
 the main cause of the problem!
 There seems to be something wrong in either this driver or the uvm code!

 I assume that the problem is in the uvm code, because I have a similar
 problem with the ahc driver. It also crashes once more than 4 GB is
 used by the system and the ahc driver tries to allocate more SCBs.

 Output of "diff -u" for ic/aic79xx.c, ic/aic79xx_osm.c and pci/ahd_pci.c
 below.

 Please integrate this workaround into the next possible release. Thanks
 in advance.
 A similar fix should be added to the ahc driver to work around the
 problem there.


 W. Stukenbrock


 in /usr/src/sys/dev/ic:


 --- aic79xx.c   2008/09/26 11:19:07     1.1
 +++ aic79xx.c   2008/09/26 11:26:57
 @@ -5441,6 +5441,24 @@
                        ahd_name(ahd));
                 goto error_exit;
         }
 +/*
 + * Workaround for the "more than 4GB main memory" crash.
 + * We allocate all SCB's now - no additional request that may fail
 + * later.
 + * This will work around a bug in either this driver or the uvm part.
 + * Without this, the system will crash after some failed GROW requests with
 + * a "fatal page fault in supervisor mode".
 + *
 + * This fix requires setting the 64-Bit Flag in ahd_pci.c too.
 + * Perhaps the amount of main memory should be checked here too ..
 + *
 +  W. Stukenbrock (09.2008)
 + */
 +       if ((ahd->flags & AHD_64BIT_ADDRESSING) != 0) {
 +               printf("%s: allocating all SCB entries - 64 bit workaround for 4G memory bug\n",
 +                       ahd_name(ahd));
 +               while (ahd_alloc_scbs(ahd) != 0);
 +       }

         /*
          * Note that we were successfull

 ===================================================================

 --- aic79xx_osm.c       2008/09/26 11:19:07     1.1
 +++ aic79xx_osm.c       2008/09/26 10:26:51
 @@ -95,7 +95,9 @@
           ahd->sc_channel.chan_ntargets = AHD_NUM_TARGETS;
           ahd->sc_channel.chan_nluns = 8 /*AHD_NUM_LUNS*/;
           ahd->sc_channel.chan_id = ahd->our_id;
 -        ahd->sc_channel.chan_flags |= SCSIPI_CHAN_CANGROW;
 +/* do not waste time if everything is already allocated ... */
 +        if (ahd->scb_data.numscbs < AHD_SCB_MAX_ALLOC)
 +               ahd->sc_channel.chan_flags |= SCSIPI_CHAN_CANGROW;

          ahd->sc_child = config_found((void *)ahd, &ahd->sc_channel, scsiprint);

 ===================================================================

 in /usr/src/sys/dev/pci:

 --- ahd_pci.c   2008/09/12 14:37:11     1.1
 +++ ahd_pci.c   2008/09/26 11:17:42
 @@ -400,6 +400,12 @@
                  ahd->chip |= AHD_PCI;
                  ahd->bugs &= ~AHD_PCIX_BUG_MASK;
          }
 +       else if (devconfig & PCI64BIT) {
 +               ahd->flags |= AHD_64BIT_ADDRESSING;
 +       } else {
 +               aprint_normal("\n%s: WARNING: not on 64 bit Bus - may be a problem with more than 4GB main memory",
 +                       ahd_name(ahd));
 +       }

          /*
           * Map PCI Registers
 @@ -502,7 +508,8 @@
          if ((ahd->flags & (AHD_39BIT_ADDRESSING|AHD_64BIT_ADDRESSING)) != 0) {
                  uint32_t dvconfig;

 -               aprint_normal("%s: Enabling 39Bit Addressing\n", ahd_name(ahd));
 +               aprint_normal("%s: Enabling %sBit Addressing\n", ahd_name(ahd),
 +                 ((ahd->flags & AHD_64BIT_ADDRESSING) ? "64" : "39"));
                  dvconfig = pci_conf_read(pa->pa_pc, pa->pa_tag, DEVCONFIG);
                  dvconfig |= DACEN;
                  pci_conf_write(pa->pa_pc, pa->pa_tag, DEVCONFIG, dvconfig);
 ===================================================================





From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org,
        Wolfgang.Stukenbrock@nagler-company.com
Subject: Re: kern/39526: ahd driver crashes system if it runs out of memory
Date: Fri, 26 Sep 2008 22:32:07 +0200

 On Fri, Sep 26, 2008 at 11:50:04AM +0000, Wolfgang Stukenbrock wrote:
 > The following reply was made to PR kern/39526; it has been noted by GNATS.
 > 
 > From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
 > To: gnats-bugs@NetBSD.org
 > Cc: 
 > Subject: Re: kern/39526: ahd driver crashes system if it runs out of memory
 > Date: Fri, 26 Sep 2008 13:48:36 +0200
 > 
 >  Hi again,
 >  
 >  I now have a workaround that keeps the ahd driver from crashing the system.
 >  
 >  The workaround allocates all SCB structures during startup if a 64-bit
 >  bus is detected. By doing this it avoids any later call to GROW the
 >  number of SCBs.
 >  That is the place where no memory was found and where, after some failed
 >  allocations, a kernel crash with "fatal page fault in supervisor mode"
 >  happens.
 >  
 >  With this fix I no longer get the kernel panic.
 >  So this workaround leads to a stable driver, but it does not fix
 >  the main cause of the problem!
 >  There seems to be something wrong in either this driver or the uvm code!
 >  
 >  I assume that the problem is in the uvm code, because I have a similar
 >  problem with the ahc driver. It also crashes once more than 4 GB is
 >  used by the system and the ahc driver tries to allocate more SCBs.

 No, it's a bug in both ahc and ahd, which don't properly handle errors
 from bus_dma().
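
 For illustration, here is a sketch in C of what handling such an error
 usually involves: check every step, undo the steps that already succeeded,
 and leave the structure in a state where nothing later touches memory that
 was never allocated. The step functions are mocks with fault injection and
 struct dma_buf is hypothetical; none of this is the real bus_dma(9)
 interface or the drivers' code. The ENOMEM used here is the "error = 12"
 seen in the messages of the report.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a DMA-able buffer and its mapping state. */
struct dma_buf {
	void	*mem;		/* backing memory, NULL if not allocated */
	int	 mapped;	/* non-zero once the mock map step succeeded */
};

/* Mock allocation step; fail_step == 1 forces an ENOMEM failure. */
static int
step_alloc(struct dma_buf *b, size_t size, int fail_step)
{
	if (fail_step == 1)
		return ENOMEM;
	b->mem = calloc(1, size);
	return b->mem == NULL ? ENOMEM : 0;
}

/* Mock mapping step; fail_step == 2 forces an ENOMEM failure. */
static int
step_map(struct dma_buf *b, int fail_step)
{
	if (fail_step == 2)
		return ENOMEM;
	b->mapped = 1;
	return 0;
}

/*
 * Run both steps. On failure, undo what was done and return the error;
 * the caller must not use b->mem unless 0 is returned.
 */
static int
dma_buf_setup(struct dma_buf *b, size_t size, int fail_step)
{
	int error;

	assert(size > 0);
	b->mem = NULL;
	b->mapped = 0;

	if ((error = step_alloc(b, size, fail_step)) != 0)
		goto fail;
	if ((error = step_map(b, fail_step)) != 0)
		goto fail_free;
	return 0;

fail_free:
	free(b->mem);
	b->mem = NULL;
fail:
	return error;
}
```

 A caller that ignores the return value and goes on to clear or fill b->mem
 would write through a NULL or stale pointer, which is the kind of mistake
 this pattern is meant to rule out.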

 >  in /usr/src/sys/dev/ic:
 >  
 >  
 >  --- aic79xx.c   2008/09/26 11:19:07     1.1
 >  +++ aic79xx.c   2008/09/26 11:26:57
 >  @@ -5441,6 +5441,24 @@
 >                         ahd_name(ahd));
 >                  goto error_exit;
 >          }
 >  +/*
 >  + * Workaround for the "more than 4GB main memory" crash.
 >  + * We allocate all SCB's now - no additional request that may fail
 >  + * later.
 >  + * This will work around a bug in either this driver or the uvm part.
 >  + * Without this, the system will crash after some failed GROW requests with
 >  + * a "fatal page fault in supervisor mode".
 >  + *
 >  + * This fix requires setting the 64-Bit Flag in ahd_pci.c too.
 >  + * Perhaps the amount of main memory should be checked here too ..
 >  + *
 >  +  W. Stukenbrock (09.2008)
 >  + */
 >  +       if ((ahd->flags & AHD_64BIT_ADDRESSING) != 0) {
 >  +               printf("%s: allocating all SCB entries - 64 bit workaround for 4G memory bug\n",
 >  +                       ahd_name(ahd));
 >  +               while (ahd_alloc_scbs(ahd) != 0);
 >  +       }

 Doing this based on AHD_64BIT_ADDRESSING is wrong. The problem will occur
 just as much, if not more, if the adapter can't handle 64-bit addresses.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/39526: ahd driver crashes system if it runs out of memory
Date: Mon, 29 Sep 2008 13:57:33 +0200

 Hi again.

 OK, if this is a "known" bug in the ahd and ahc drivers, then the
 workaround of allocating all possible SCB structures during startup
 should be added unconditionally to both drivers - or the main cause of
 the problem should be fixed in them as soon as possible. (Sorry, I don't
 have enough time to go deeper into that, and my knowledge of the
 DMA subsystem (interaction with the CPU cache, PCI-related things, ...) is
 poor at this time ... From a short look at the code, I've seen no error
 in the calls that free the partially allocated DMA memory.)
 It is not a good idea to have an unstable driver in the kernel that
 may crash the system if SCSI traffic starts after some large processes
 have eaten up all memory.
 And the Adaptec cards are used by many people. I've come back to
 them myself, because e.g. the mpt driver does not support any of the
 SCSI cards currently sold by LSI ... And I've found no other
 SCSI PCI-X cards, available from any vendor in Europe, that
 are supported by NetBSD.

 The fastest way to work around the crashes temporarily would be to
 allocate all SCBs at startup.
 The workaround should contain just the while loop from my patch
 that allocates all SCBs,
 e.g. "while (ahd_alloc_scbs(ahd) != 0);".


 Nevertheless, there is a problem with the pagedaemon in the kernel, because
 the pagedaemon ignores any range information of the failed allocation
 request.
 I've had the system freeze up with the ahc driver when all of the lower
 4 GB of memory is allocated but some memory above 4 GB is free.
 Running "top" in another window shows 100% CPU for the pagedaemon on one
 CPU and 100% for the ahc driver on the second CPU, just before nothing
 works anymore.
 Some debugging output shows that the ahc driver tries to allocate some
 pages that must reside below 4 GB, but none are free anymore.
 It kicks the pagedaemon and blocks. The pagedaemon takes a short look at
 the system statistics and sees that there is plenty of memory available. It
 decides to do nothing, just wakes up the ahc driver and waits for the
 next request. This is an endless loop ....
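
 The loop can be modelled in a few lines of C. Everything here is a
 simplified assumption made for the illustration: struct memstate, the
 split into "low" and "high" free pages, the free_target threshold and the
 rule that one reclaim pass frees exactly one low page. The range-aware
 variant is hypothetical and only shows what a daemon that looked at the
 requested range could decide differently.

```c
#include <assert.h>
#include <stddef.h>

struct memstate {
	size_t	free_low;	/* free pages below the 4 GB boundary */
	size_t	free_high;	/* free pages at or above it */
	size_t	free_target;	/* daemon reclaims only if total free < target */
};

/* Range-blind decision: only the total number of free pages counts. */
static int
daemon_would_reclaim(const struct memstate *m)
{
	return (m->free_low + m->free_high) < m->free_target;
}

/* Hypothetical range-aware decision: also react to an empty low range. */
static int
daemon_would_reclaim_low(const struct memstate *m)
{
	return m->free_low == 0 || daemon_would_reclaim(m);
}

/*
 * Try to take one low page, kicking the daemon after each failure.
 * Returns the number of tries used, or -1 if max_tries ran out.
 */
static int
alloc_low_with_retry(struct memstate *m, int max_tries, int range_aware)
{
	int i, reclaim;

	assert(max_tries > 0);
	for (i = 1; i <= max_tries; i++) {
		if (m->free_low > 0) {
			m->free_low--;
			return i;
		}
		reclaim = range_aware ? daemon_would_reclaim_low(m)
				      : daemon_would_reclaim(m);
		if (reclaim)
			m->free_low++;	/* model: one pass frees one low page */
		/* otherwise the daemon just wakes the waiter up again */
	}
	return -1;
}
```

 With no free low pages and plenty of free high pages, the range-blind
 variant never makes progress no matter how many tries are allowed, while
 the range-aware variant succeeds on the second try.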


 Best regards

 W. Stukenbrock



>Unformatted:
