NetBSD Problem Report #41579

From Wolfgang.Stukenbrock@nagler-company.com  Fri Jun 12 10:42:32 2009
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by www.NetBSD.org (Postfix) with ESMTP id EB45D63BF62
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 12 Jun 2009 10:42:32 +0000 (UTC)
Message-Id: <20090612091155.B73244EA9FE@s012.nagler-company.com>
Date: Fri, 12 Jun 2009 11:11:55 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: siisata driver interrupt and command completion handling broken
X-Send-Pr-Version: 3.95

>Number:         41579
>Category:       kern
>Synopsis:       siisata driver interrupt and command completion handling broken
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    jakllsch
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jun 12 10:45:00 +0000 2009
>Closed-Date:    Fri Dec 25 18:47:03 +0000 2009
>Last-Modified:  Fri Dec 25 18:47:03 +0000 2009
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 5.0
>Organization:
Dr. Nagler & Company GmbH

>Environment:


System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #9: Fri Mar 13 12:31:52 CET 2009 wgstuken@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
	Our systems with io-cards that uses the siisata driver has locked up after a short time all the time.

	I've analysed the problem and located two errors in the driver.

	1. if a command completes very fast after it has been started while the interrupt handling is still running for that port, the driver
	   clears any pending interrupt on interrupt routine exit for the port. This may clear the next pending interrupt, if it is already present.
	2. if a bio-command runs into timeout, the completion routine does not check for the timeout condition and the failed transfer never gets
	   to the ata-routines. (the code in the atapi-completion routine is broken too.)

	The cause for the first problem is a general error in the interrupt code. You may never get some interrupt conditions, process them and
	clear it (or all interrupts as done by the driver) later.
	The driver must clear the interrupt conditions that it is gooing to process immidiatly after determining the set of conditions and before it
	starts processing of the conditions. Otherwise there is a risk of loosing some conditions.
	The patch below will correct this behavior. The automatic clear of the completion interrupt by reading the slot-status is disabled
	for security reasons by the patch. It makes no longer sence to use it anymore and makes debugging of the driver very hard ...

	The cause for the second problem is just a missing check in the routine for the timeout condition.
	The patch below will correct this in siisata_bio_complete().
	The siisata_atapi_complete() routine has been corrected too. But the patch will not change the processing order (timeout versus interrupt),
	because I'm not shure if this is just another error in the driver or if there is a special reason why the interrupts for atapi devices are
	not cleared if a command has timed out. Someone else should have a look at this, because I think the current version of the atapi completion
	routine is incorrect! At least it is inconsistent to the command and bio completion routines.

	While debugging the driver I've seen another programming error, that will not influence the operation as long as there is only one CPU
	nodifiying the status register at the time. Ths Stealvine chip has a Bit-set (PCS) and Bit-clear (PCC) semantic for the port configuration register.
	So it makes no sence - or may lead to unexpected results if another CPU is just clearing a bit - to read the old status, add some bits and
	then write the result into the bit-Set-register.
	The way that should be used is just to write the bits that should be set now to the PCS register (or PCC when clearing a bit).
	There is an own definition for the read-mode of the port status "PS"in the headerfile. That should be used in all read operations and not
	the PCS definition.
	The patch below will fix this too.

	With this patch, the drivers runs stable on our systems without locking up anymore.

	remark:
	We have only ata devices connected - so the atapi code does not run and we have never seen any real error of the drives up to now.
	The chip-manual accedently says nothing about the correct order reading the cause of the error and clearing the error-interrupt.
	And this patch will change the order from "read error then clear interrupt" to "clear interrupt prior read error status".
	As long as only one slot will ever used by the driver - this means only one command is on the chip - this should be no problem at all.
	Due to the fact that the chip stops processing after the first error and all pending commands need to be reissued again, the order should
	be not relevant. But I can be wrong here ..
	If desired (or required) it would be easy to delay clearing the interrupt condition until the error status has been read.
	(attention: the port-slot-status may not be read prior clearing the completion interrupt! Otherwise there will be the next whole in
	interrupt processing and some interrupts may be lost ... So the command-complete interrupt must be cleared prior processing!)
>How-To-Repeat:
	Setup a system with e.g. an SIL3124 controller and start IO in it. After some time you will loose the contact to a disk due to a missed
	interrupt and the broken timeout handling in siisata_bio_complete().
>Fix:
	The following patch to /usr/src/sys/dev/ic/siisata.c will fix the problems above. (output of "diff -c")

*** siisata.c-orig	Wed Jun 10 15:50:37 2009
--- siisata.c	Fri Jun 12 10:20:19 2009
***************
*** 233,238 ****
--- 233,240 ----
  	/* come out of reset, 64-bit activation */
  	PRWRITE(sc, PRX(chp->ch_channel, PRO_PCC),
  	    PR_PC_32BA | PR_PC_PORT_RESET);
+ 	/* disable clear interrupt on PSS read */
+ 	PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_INCOR);
  	/* initialize port */
  	siisata_reinit_port(chp);
  	/* clear any interrupts */
***************
*** 393,398 ****
--- 395,410 ----
  		for (i = 0; i < sc->sc_atac.atac_nchannels; i++)
  			if (is & GR_GIS_PXIS(i))
  				siisata_intr_port(sc, &sc->sc_channels[i]);
+ #ifdef DIAGNOSTIC
+ 		is &= ~GR_GIS_PXIS_MASK;
+ 		if (is) { /* reset any unexpected interrupts to avoid busy loops ... */
+ 			log(LOG_WARNING, "%s: resetting unexpected interrupt (GIS 0x%x)\n", SIISATANAME(sc), is);
+ 			GRWRITE(sc, GR_GIS, is);
+ 		}
+ #else
+ 		if (is & ~GR_GIS_PXIS_MASK)
+ 			panic("%s: %s: unexpected interrupt (GIS 0x%x)", SIISATANAME(sc), __func__, is);
+ #endif
  	}
  	return r;
  }
***************
*** 407,421 ****
  	SIISATA_DEBUG_PRINT(("%s: %s port %d\n",
  	    SIISATANAME(sc), __func__, chp->ch_channel), DEBUG_INTR);

  	if ((xfer != NULL) && (xfer->c_intr != NULL))
  		xfer->c_intr(chp, xfer, slot);
  #ifdef DIAGNOSTIC
- 	else
  		log(LOG_WARNING, "%s: unable to handle interrupt\n", __func__);
  #endif
! 
! 	/* clear some (ok, all) ints */
! 	PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), 0xffffffff);

  	return;
  }
--- 419,434 ----
  	SIISATA_DEBUG_PRINT(("%s: %s port %d\n",
  	    SIISATANAME(sc), __func__, chp->ch_channel), DEBUG_INTR);

+ /* remark: the called routine must clear all processed interrupt itself - otherwise we may loose some interrupts ... */
  	if ((xfer != NULL) && (xfer->c_intr != NULL))
  		xfer->c_intr(chp, xfer, slot);
+ 	else {
  #ifdef DIAGNOSTIC
  		log(LOG_WARNING, "%s: unable to handle interrupt\n", __func__);
  #endif
! 		/* clear some (ok, all) ints */
! 		PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), 0xffffffff);
! 	}

  	return;
  }
***************
*** 430,436 ****
  	int slot = SIISATA_NON_NCQ_SLOT;

  	/* wait for ready */
! 	while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
  		DELAY(10);

  	prb = schp->sch_prb[slot];
--- 443,449 ----
  	int slot = SIISATA_NON_NCQ_SLOT;

  	/* wait for ready */
! 	while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
  		DELAY(10);

  	prb = schp->sch_prb[slot];
***************
*** 490,496 ****
  		    SIISATANAME(sc), chp->ch_channel);
  		/* XXX and then ? */
  	}
! 	while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
  		DELAY(10);
  	PRWRITE(sc, PRX(chp->ch_channel, PRO_SERROR),
  	    PRREAD(sc, PRX(chp->ch_channel, PRO_SERROR)));
--- 503,509 ----
  		    SIISATANAME(sc), chp->ch_channel);
  		/* XXX and then ? */
  	}
! 	while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
  		DELAY(10);
  	PRWRITE(sc, PRX(chp->ch_channel, PRO_SERROR),
  	    PRREAD(sc, PRX(chp->ch_channel, PRO_SERROR)));
***************
*** 546,552 ****
  		schp->sch_sstatus)) {
  	case SStatus_DET_DEV:
  		/* wait for ready */
! 		while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS))
  		    & PR_PS_PORT_READY))
  			DELAY(10);

--- 559,565 ----
  		schp->sch_sstatus)) {
  	case SStatus_DET_DEV:
  		/* wait for ready */
! 		while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS))
  		    & PR_PS_PORT_READY))
  			DELAY(10);

***************
*** 781,792 ****
  	    ("%s: %s\n", SIISATANAME(sc), __func__), DEBUG_FUNCS);

  	pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));

  	if ((xfer->c_flags & C_TIMEOU) != 0)
  		goto command_done;

  	if (pis & PR_PIS_CMDCMPL) {
! 		/* get slot status, clearing completion interrupt */
  		pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
  		/* is this expected? */
  		if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 794,807 ----
  	    ("%s: %s\n", SIISATANAME(sc), __func__), DEBUG_FUNCS);

  	pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ 	/* clear only those interrupts, we are gooing to server now ... */
+ 	if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);

  	if ((xfer->c_flags & C_TIMEOU) != 0)
  		goto command_done;

  	if (pis & PR_PIS_CMDCMPL) {
! 		/* get slot status - interrupt cleared above - PR_PC_INCOR set ... */
  		pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
  		/* is this expected? */
  		if ((schp->sch_active_slots & __BIT(slot)) == 0) {
***************
*** 1062,1070 ****
  	uint32_t *prbfis = (void *)fis;

  	pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));

  	if (pis & PR_PIS_CMDCMPL) {
! 		/* get slot status, clearing completion interrupt */
  		pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
  		/* is this expected? */
  		if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 1077,1090 ----
  	uint32_t *prbfis = (void *)fis;

  	pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ 	/* clear only those interrupts, we are gooing to server now ... */
+ 	if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
+ 
+ 	if ((xfer->c_flags & C_TIMEOU) != 0)
+ 		goto command_done;

  	if (pis & PR_PIS_CMDCMPL) {
! 		/* get slot status - interrupt cleared above - PR_PC_INCOR set ... */
  		pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
  		/* is this expected? */
  		if ((schp->sch_active_slots & __BIT(slot)) == 0) {
***************
*** 1116,1122 ****
  command_done:
  	schp->sch_active_slots &= ~__BIT(slot);
  	chp->ch_flags &= ~ATACH_IRQ_WAIT;
! 	callout_stop(&chp->ch_callout);

  	chp->ch_queue->active_xfer = NULL;

--- 1136,1147 ----
  command_done:
  	schp->sch_active_slots &= ~__BIT(slot);
  	chp->ch_flags &= ~ATACH_IRQ_WAIT;
! 	if (xfer->c_flags & C_TIMEOU) {
! 		ata_bio->error = TIMEOUT;
! 	} else {
! 		callout_stop(&chp->ch_callout);
! 		ata_bio->error = NOERROR;
! 	}

  	chp->ch_queue->active_xfer = NULL;

***************
*** 1270,1278 ****
  {
  	struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;

! 	PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS),
! 	    PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) | PR_PC_PORT_INITIALIZE);
! 	while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
  		DELAY(10);
  }

--- 1295,1302 ----
  {
  	struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;

! 	PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_PORT_INITIALIZE);
! 	while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
  		DELAY(10);
  }

***************
*** 1281,1289 ****
  {
  	struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;

! 	PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS),
! 	    PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) | PR_PC_DEVICE_RESET);
! 	while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
  		DELAY(10);
  }

--- 1305,1312 ----
  {
  	struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;

! 	PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_DEVICE_RESET);
! 	while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
  		DELAY(10);
  }

***************
*** 1456,1464 ****
  			periph->periph_cap |= PERIPH_CAP_CMD16;

  			/* configure port for packet length */
! 			PRWRITE(siic, PRX(chp->ch_channel, PRO_PCS),
! 			    PRREAD(siic, PRX(chp->ch_channel, PRO_PCS)) |
! 			    PR_PC_PACKET_LENGTH);
  		}
  		/* XXX This is gross. */
  		periph->periph_cap |= (id->atap_config & ATAPI_CFG_DRQ_MASK);
--- 1479,1485 ----
  			periph->periph_cap |= PERIPH_CAP_CMD16;

  			/* configure port for packet length */
! 			PRWRITE(siic, PRX(chp->ch_channel, PRO_PCS), PR_PC_PACKET_LENGTH);
  		}
  		/* XXX This is gross. */
  		periph->periph_cap |= (id->atap_config & ATAPI_CFG_DRQ_MASK);
***************
*** 1649,1657 ****
  	}

  	pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));

  	if (pis & PR_PIS_CMDCMPL) {
! 		/* get slot status, clearing completion interrupt */
  		pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
  		/* is this expected? */
  		if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 1670,1680 ----
  	}

  	pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ 	/* clear only those interrupts, we are gooing to server now ... */
+ 	if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);

  	if (pis & PR_PIS_CMDCMPL) {
! 		/* get slot status - interrupt cleared above - PR_PC_INCOR set ... */
  		pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
  		/* is this expected? */
  		if ((schp->sch_active_slots & __BIT(slot)) == 0) {

>Release-Note:

>Audit-Trail:
From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/41579: siista driver interrupt and command completion handling broken
Date: Fri, 12 Jun 2009 13:10:32 +0200

 Hi - once again

 I've done a typo in the driver name - if possible please correct this.
 Otherwiese this report cannot be found by searching for the driver

 The driver is "siisata" and not "siista" as written in the report itself 
 - sorry

 W. Stukenbrock

 gnats-admin@NetBSD.org wrote:

 >Thank you very much for your problem report.
 >It has the internal identification `kern/41579'.
 >The individual assigned to look at your
 >report is: kern-bug-people. 
 >
 >>Category:       kern
 >>Responsible:    kern-bug-people
 >>Synopsis:       siista driver interrupt and command completion handling broken
 >>Arrival-Date:   Fri Jun 12 10:45:00 +0000 2009
 >>
 >


From: jnemeth@victoria.tc.ca (John Nemeth)
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
        netbsd-bugs@NetBSD.org, Wolfgang.Stukenbrock@nagler-company.com
Cc: 
Subject: Re: kern/41579: siista driver interrupt and command completion handling broken
Date: Fri, 12 Jun 2009 12:40:54 -0700

 On Sep 27, 11:22pm, Wolfgang Stukenbrock wrote:
 }
 } The following reply was made to PR kern/41579; it has been noted by GNATS.
 } 
 }  I've done a typo in the driver name - if possible please correct this.
 }  Otherwiese this report cannot be found by searching for the driver
 }  
 }  The driver is "siisata" and not "siista" as written in the report itself 

      Done.

 }-- End of excerpt from Wolfgang Stukenbrock

Responsible-Changed-From-To: kern-bug-people->jakllsch
Responsible-Changed-By: jakllsch@NetBSD.org
Responsible-Changed-When: Wed, 17 Jun 2009 18:16:08 +0000
Responsible-Changed-Why:
I'll take this.


From: "Jonathan A. Kollasch" <jakllsch@kollasch.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/41579: siista driver interrupt and command completion
	handling broken
Date: Wed, 17 Jun 2009 18:57:38 +0000

 I'm unable to reproduce this in either 5.0 or -current on my hardware
 (SiI3124 on 32-bit/33MHz PCI and SiI3132, WD5000AA[KJ]S and other
 older non-Flash drives).  So, testing a forward-port of your patch
 myself will be hard to do.

 That said, this mostly looks like stuff that could use fixing.
 Thanks for pointing it out.

 I'd be interested in knowing more about this system and/or why the
 commands complete so fast.

From: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@NetBSD.org
Cc: jakllsch@kollasch.net
Subject: Re: kern/41579: siista driver interrupt and command completion
	handling broken
Date: Mon, 22 Jun 2009 13:23:51 +0200

 Hi,

 some background information from our setup:
 (and some information related to the debugging process ...)

 We have had this problem on all server-systems using a PCI-X 133 MHz =20
 (4-port) or PCIe (2 port) controler with a chip from this family.
 We never try such an IO-card this on a 32-bit bus.
 An IO-Card with 2 or 4 3-GBit Ports makes no sence in a 32-bit slot, =20
 because you will never get the data stream though it ... (4 * 66MHz =3D =20
 264 MByte/sec - 3Gbit/10 =3D 300MByte/sec - so either 600MByte/sec or =20
 1,2GByte/sec disc rate versus 264MByte/sec (or only 132MByte/sec on =20
 33MHz) theoretical bus bandwith)
 For such a card you need at least PCI-X 133MHz -> 8 * 133 =3D 1,064 GByte/se=
 c.
 This sounds like the reason for me why you cannot reproduce the problem.

 Our systems:
 2x Intel S3000AHxx/S3200SHxx with Xeon 3xxx dual-Core-CPU (1x PCI-X, 1x PCIe=
 )
 1x Supermicro (type see bug-report) with Xeon 3xxx quad-core-CPU (PCI-X)
 An 4 port card resides on my desc, because it makes no sence to install it .=
 ..
 All systems running with serial console at 38400 baud.

 "Originally" these system run an 4.x here with some drivers imported =20
 from "-current" due to the need of the HW ...
 Before I've reported the problem, I've setup the Supermicro system =20
 with a 5.0 and run the test again - same problem. So it was not =20
 related to the adaption of the driver to 4.x.

 All system run fine for a wile and suddenly looses a disc on the =20
 SATA-Card. It was very hard to debug, because I've found no load =20
 scenario that will reproduce the problem after a short time reliable. =20
 And the select style of the disc looks like random ...
 We need more SATA-disc in the tree systems as the motherboards =20
 supports, so I've to find the problem in an other way.
 I've added some print statements to the driver to get the famous last =20
 words before looseing a disc. (not realy successfull ...)
 It looks like that the system was loosing interrupts, so I've started =20
 debugging of the interrupt routines and dumped some state information =20
 from the chip - again with printf statements.
 An suddenly all interrupts are lost! The timeout for the transfer =20
 happens, but no timeout processing takes place ...
 I've located the main problem in the "wrong order" of register access =20
 and operation mode of the chip. - The debug dump of the chip register =20
 has cleared the interrupt cause. (And of cause in the missing timeout =20
 handling code ...)

 I'm not shure why the interrupt processing takes too much time on our =20
 systems sometimes.
 Due to the fact, that it was not realy reproducable with access to =20
 large files or the raw-device with larger transfers (-> 64k-transfers, =20
 the outdated limit from the ATA-stuff ...), the cause of the problem =20
 may be short writes to disc as done by the filesystem when updating =20
 some inode information.
 To transfer 4k to the disc over a 3 Gbit-link will take something =20
 around 13 us.
 The driver uses only one slot and I'm not what will happen in the =20
 upper layer of the wd-driver if a transfer completes. (The reason for =20
 useing only one slot is the ATA abstraction in netbsd that doesn't =20
 allow more than one pending transfer ... This should be corrected for =20
 performance reason as soon as possible - but I've no time to fix it =20
 for "all" drivers - sorry ...
 The first try of a port for this chip family (without any usable =20
 eror-handling :-( ...) to netbsd was mine, and it would have used all =20
 slots - as done in the OpenBSD-version, but the ATA-layer cannot realy =20
 handle it. So in fact only two slots are uses.
 By the way: my port had been 10% faster as the driver for the =20
 ICH-series - this one is something about 8% slower ...)

 Back to the main thing ..
 13 us is a short time - even on a fast CPU. And if some processing of =20
 a completed transfer is done after starting the next one, it is =20
 possible that this make take longer than 13 us. (I've haven't looked =20
 in the wd-driver to verify this, but I hope the next transfer will be =20
 started before dooing a wakeup for the completed. Otherwise the driver =20
 would be very very inefficent !!!)
 Due to the missing code fragment for timeout-processing, the timeout =20
 will happen, but is ignored -> still a completed transfer on the chip, =20
 but driver will never stop it.
 remark: on a 32-Bit 33MHz bus this will take at least 30 usec due to =20
 the limited bandwith of the system bus.

 This would be a very good explanation for the problem.
 If there are only "some" transfer for the discs, all processing has =20
 finished before the next transfer is generated and started.
 But if there are more transfers pending at a time and one of the =20
 waiting is e.g. an inode-block update an interrupt is cleared in the =20
 chip before interrupt processing takes place.
 As faster as a CPU is, it will be able to process everything before =20
 the chip generates the next interrupt and no disc is lost. And on a =20
 system with only 132MByte/sec bandwith, the interrupt routine will be =20
 done before the DMA was able to transfer the data to the chip.

 best regards

 W. Stukenbrock

 Zitat von "Jonathan A. Kollasch" <jakllsch@kollasch.net>:

 > The following reply was made to PR kern/41579; it has been noted by GNATS.
 >
 > From: "Jonathan A. Kollasch" <jakllsch@kollasch.net>
 > To: gnats-bugs@NetBSD.org
 > Cc:
 > Subject: Re: kern/41579: siista driver interrupt and command completion
 > =09handling broken
 > Date: Wed, 17 Jun 2009 18:57:38 +0000
 >
 >  I'm unable to reproduce this in either 5.0 or -current on my hardware
 >  (SiI3124 on 32-bit/33MHz PCI and SiI3132, WD5000AA[KJ]S and other
 >  older non-Flash drives).  So, testing a forward-port of your patch
 >  myself will be hard to do.
 >
 >  That said, this mostly looks like stuff that could use fixing.
 >  Thanks for pointing it out.
 >
 >  I'd be interested in knowing more about this system and/or why the
 >  commands complete so fast.
 >
 >


State-Changed-From-To: open->feedback
State-Changed-By: jakllsch@NetBSD.org
State-Changed-When: Thu, 20 Aug 2009 00:34:50 +0000
State-Changed-Why:
I believe this has been fixed in -current since early July.
Could you verify that this issue is no longer present in -current?


From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: jakllsch@NetBSD.org, netbsd-bugs@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/41579 (siisata driver interrupt and command completion handling broken)
Date: Thu, 20 Aug 2009 12:16:15 +0200

 Hi, I've a little problem with this request ....

 OK, I've integrated the driver into the 4.0 kernel, because I need it 
 there. (I've still no time to have a look at 5.0.1 ...)
 I've fixed the interrupt handling by the patch I've send to NetBSD.org, 
 because the driver was instable.
 With the patch it seems to be stable now - no problems up to now anymore 
 with only disks connected to the controler.
 And all three system with this controler type installed are productive 
 systems! They run 24 hours 7 days a week without any interruption ...

 I've no "free" system with that hardware to install an other OS-version 
 on it at the moment.
 I have a spare controler of this type, but currently no free MB where I 
 can insert it - the controler needs a PCI-X-slot ...

 If it's good for you that  I just check the sources, I will do it again 
 - see below.
 I've done it prior sending the bug report and the changes at that time 
 there doesn't affect the problem, but there may be something new now.

 And there are "some" major changes ...

 But the PR_PC_INCOR bit is again not set in the driver and so one main 
 whole is still present where interrupts can be lost.
 It is very hard to garantie that there will be no interrupt lost on the 
 port, if you clear the cmd-complete interrupt by a side effect!
 And as a side effect not setting this bit makes debugging of the 
 behaviour here impossible, because you cannot check the PSS register ...

 In siisata_intr() you loop now until no bits are set in GR_GIS anymore 
 and check the IS in an inner loop for all channels.
 Then you check the PIS for the channel and if CMPCMPL is set, you read 
 PSS that clears the CMPCMPL interupt as side effect.
 This looks OK as long as only one slot is used at all - and I hope this 
 performance issue will be addresses in the near future ...
 You have moved the whole access to the chip into the intr_port routine - 
 was in cmd_complete before.
 This looks good, but there is still the "clear all interrupts" part in ...
 I would prefere to eliminate this and clear only the interrups gooing to 
 serve. This is more stable and of cause required if more than on slot 
 will be used in the future.
 A check with panic() may make sence if there comes any unexpetec 
 interrupt at all - at least if DIGANOSTICS is defined ...

 The missing timeout-check in bio_complete is there now.
 The usage of PRO_PCS/PRO_PCC has to be corrected.

 summary of my analyses:
 The main cause why interrupts are lost in the previous version has been 
 fixed by moving all chip-access in front of the completion handling.
 This problem schould be solved in the latest version of the driver.
 The lost timeout should be fixed and the PCS/PCC usage looks good to me.

 Neverless I would strongly recommend to set the PR_PC_INCOR bit to 
 remove this kind of side effect in interrupt handling.
 It is alway a bad style to use HW side effects, because they normaly 
 generate sync-wholes and you need to write to PIS in any case - currenly 
 still with 0xffffffff without checking for unexpected interrupts - 
 another point that should be changed I think. At least a WARNING should 
 be printed if there is an unexpected interrupt, so that anybody can see 
 that there is still a problem with the driver ...
 I still prefere a panic() on unecpected interrupts, but thats my way to 
 make drivers stable - do not allow any unexpected behaviour of the HW ...
 If you just clear the intterupt, it may come up again and than you may 
 run into a busy-loop by processing (or better not processing) the 
 HW-interrupt. And you will not see it with the current implementation ...
 It is always a good idea for interrupt handlers to clear exactly those 
 interrupts only, that they are gooing to serve. Otherwiese some 
 interrupts may get lost - OK as long as only one slot is used this 
 cannot happen here ...


 Sorry that I've no way to run the current version at the moment - 
 neverless testing this problem by running the system may only show that 
 there is still a problem. It will never say, that there is no one left - 
 for this kind of statement, you need to analyse the source code of the 
 interrupt handler.
 If the request for feedback was intended by a "close-request" for the 
 report, the current source code looks good to me and the problem report 
 can be closed.


 By the way. This mail reached me today at 02:34 (AM) and a reminder for 
 that came today at 07:11 (AM).
 Is it normal that the systems sends a reminder after less than 5 hours?

 W. Stukenbrock

 jakllsch@NetBSD.org wrote:

 >Synopsis: siisata driver interrupt and command completion handling broken
 >
 >State-Changed-From-To: open->feedback
 >State-Changed-By: jakllsch@NetBSD.org
 >State-Changed-When: Thu, 20 Aug 2009 00:34:50 +0000
 >State-Changed-Why:
 >I believe this has been fixed in -current since early July.
 >Could you verify that this issue is no longer present in -current?
 >
 >
 >


State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Thu, 27 Aug 2009 04:42:14 +0000
State-Changed-Why:
Feedback received; submitter can't test current but has some recommendations
for further changes.

(regarding the gnats reminders: gnats sends those when it feels like it,
so while having it send one after only a few hours isn't so great we
probably can't prevent it.)


State-Changed-From-To: open->closed
State-Changed-By: jakllsch@NetBSD.org
State-Changed-When: Fri, 25 Dec 2009 18:47:03 +0000
State-Changed-Why:
while I haven't fully addressed all points in this pr, the basic issue is believed fixed and the submiter reports it can be closed.


>Unformatted:
Home
PR Database Search
(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.