NetBSD Problem Report #41579
From Wolfgang.Stukenbrock@nagler-company.com Fri Jun 12 10:42:32 2009
Return-Path: <Wolfgang.Stukenbrock@nagler-company.com>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by www.NetBSD.org (Postfix) with ESMTP id EB45D63BF62
for <gnats-bugs@gnats.NetBSD.org>; Fri, 12 Jun 2009 10:42:32 +0000 (UTC)
Message-Id: <20090612091155.B73244EA9FE@s012.nagler-company.com>
Date: Fri, 12 Jun 2009 11:11:55 +0200 (CEST)
From: Wolfgang.Stukenbrock@nagler-company.com
Reply-To: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@gnats.NetBSD.org
Subject: siisata driver interrupt and command completion handling broken
X-Send-Pr-Version: 3.95
>Number: 41579
>Category: kern
>Synopsis: siisata driver interrupt and command completion handling broken
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: jakllsch
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Jun 12 10:45:00 +0000 2009
>Closed-Date: Fri Dec 25 18:47:03 +0000 2009
>Last-Modified: Fri Dec 25 18:47:03 +0000 2009
>Originator: Wolfgang Stukenbrock
>Release: NetBSD 5.0
>Organization:
Dr. Nagler & Company GmbH
>Environment:
System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #9: Fri Mar 13 12:31:52 CET 2009 wgstuken@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
Our systems with io-cards that uses the siisata driver has locked up after a short time all the time.
I've analysed the problem and located two errors in the driver.
1. if a command completes very fast after it has been started while the interrupt handling is still running for that port, the driver
clears any pending interrupt on interrupt routine exit for the port. This may clear the next pending interrupt, if it is already present.
2. if a bio-command runs into timeout, the completion routine does not check for the timeout condition and the failed transfer never gets
to the ata-routines. (the code in the atapi-completion routine is broken too.)
The cause for the first problem is a general error in the interrupt code. You may never get some interrupt conditions, process them and
clear it (or all interrupts as done by the driver) later.
The driver must clear the interrupt conditions that it is gooing to process immidiatly after determining the set of conditions and before it
starts processing of the conditions. Otherwise there is a risk of loosing some conditions.
The patch below will correct this behavior. The automatic clear of the completion interrupt by reading the slot-status is disabled
for security reasons by the patch. It makes no longer sence to use it anymore and makes debugging of the driver very hard ...
The cause for the second problem is just a missing check in the routine for the timeout condition.
The patch below will correct this in siisata_bio_complete().
The siisata_atapi_complete() routine has been corrected too. But the patch will not change the processing order (timeout versus interrupt),
because I'm not shure if this is just another error in the driver or if there is a special reason why the interrupts for atapi devices are
not cleared if a command has timed out. Someone else should have a look at this, because I think the current version of the atapi completion
routine is incorrect! At least it is inconsistent to the command and bio completion routines.
While debugging the driver I've seen another programming error, that will not influence the operation as long as there is only one CPU
nodifiying the status register at the time. Ths Stealvine chip has a Bit-set (PCS) and Bit-clear (PCC) semantic for the port configuration register.
So it makes no sence - or may lead to unexpected results if another CPU is just clearing a bit - to read the old status, add some bits and
then write the result into the bit-Set-register.
The way that should be used is just to write the bits that should be set now to the PCS register (or PCC when clearing a bit).
There is an own definition for the read-mode of the port status "PS"in the headerfile. That should be used in all read operations and not
the PCS definition.
The patch below will fix this too.
With this patch, the drivers runs stable on our systems without locking up anymore.
remark:
We have only ata devices connected - so the atapi code does not run and we have never seen any real error of the drives up to now.
The chip-manual accedently says nothing about the correct order reading the cause of the error and clearing the error-interrupt.
And this patch will change the order from "read error then clear interrupt" to "clear interrupt prior read error status".
As long as only one slot will ever used by the driver - this means only one command is on the chip - this should be no problem at all.
Due to the fact that the chip stops processing after the first error and all pending commands need to be reissued again, the order should
be not relevant. But I can be wrong here ..
If desired (or required) it would be easy to delay clearing the interrupt condition until the error status has been read.
(attention: the port-slot-status may not be read prior clearing the completion interrupt! Otherwise there will be the next whole in
interrupt processing and some interrupts may be lost ... So the command-complete interrupt must be cleared prior processing!)
>How-To-Repeat:
Setup a system with e.g. an SIL3124 controller and start IO in it. After some time you will loose the contact to a disk due to a missed
interrupt and the broken timeout handling in siisata_bio_complete().
>Fix:
The following patch to /usr/src/sys/dev/ic/siisata.c will fix the problems above. (output of "diff -c")
*** siisata.c-orig Wed Jun 10 15:50:37 2009
--- siisata.c Fri Jun 12 10:20:19 2009
***************
*** 233,238 ****
--- 233,240 ----
/* come out of reset, 64-bit activation */
PRWRITE(sc, PRX(chp->ch_channel, PRO_PCC),
PR_PC_32BA | PR_PC_PORT_RESET);
+ /* disable clear interrupt on PSS read */
+ PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_INCOR);
/* initialize port */
siisata_reinit_port(chp);
/* clear any interrupts */
***************
*** 393,398 ****
--- 395,410 ----
for (i = 0; i < sc->sc_atac.atac_nchannels; i++)
if (is & GR_GIS_PXIS(i))
siisata_intr_port(sc, &sc->sc_channels[i]);
+ #ifdef DIAGNOSTIC
+ is &= ~GR_GIS_PXIS_MASK;
+ if (is) { /* reset any unexpected interrupts to avoid busy loops ... */
+ log(LOG_WARNING, "%s: resetting unexpected interrupt (GIS 0x%x)\n", SIISATANAME(sc), is);
+ GRWRITE(sc, GR_GIS, is);
+ }
+ #else
+ if (is & ~GR_GIS_PXIS_MASK)
+ panic("%s: %s: unexpected interrupt (GIS 0x%x)", SIISATANAME(sc), __func__, is);
+ #endif
}
return r;
}
***************
*** 407,421 ****
SIISATA_DEBUG_PRINT(("%s: %s port %d\n",
SIISATANAME(sc), __func__, chp->ch_channel), DEBUG_INTR);
if ((xfer != NULL) && (xfer->c_intr != NULL))
xfer->c_intr(chp, xfer, slot);
#ifdef DIAGNOSTIC
- else
log(LOG_WARNING, "%s: unable to handle interrupt\n", __func__);
#endif
!
! /* clear some (ok, all) ints */
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), 0xffffffff);
return;
}
--- 419,434 ----
SIISATA_DEBUG_PRINT(("%s: %s port %d\n",
SIISATANAME(sc), __func__, chp->ch_channel), DEBUG_INTR);
+ /* remark: the called routine must clear all processed interrupt itself - otherwise we may loose some interrupts ... */
if ((xfer != NULL) && (xfer->c_intr != NULL))
xfer->c_intr(chp, xfer, slot);
+ else {
#ifdef DIAGNOSTIC
log(LOG_WARNING, "%s: unable to handle interrupt\n", __func__);
#endif
! /* clear some (ok, all) ints */
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), 0xffffffff);
! }
return;
}
***************
*** 430,436 ****
int slot = SIISATA_NON_NCQ_SLOT;
/* wait for ready */
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
DELAY(10);
prb = schp->sch_prb[slot];
--- 443,449 ----
int slot = SIISATA_NON_NCQ_SLOT;
/* wait for ready */
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
DELAY(10);
prb = schp->sch_prb[slot];
***************
*** 490,496 ****
SIISATANAME(sc), chp->ch_channel);
/* XXX and then ? */
}
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
DELAY(10);
PRWRITE(sc, PRX(chp->ch_channel, PRO_SERROR),
PRREAD(sc, PRX(chp->ch_channel, PRO_SERROR)));
--- 503,509 ----
SIISATANAME(sc), chp->ch_channel);
/* XXX and then ? */
}
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
DELAY(10);
PRWRITE(sc, PRX(chp->ch_channel, PRO_SERROR),
PRREAD(sc, PRX(chp->ch_channel, PRO_SERROR)));
***************
*** 546,552 ****
schp->sch_sstatus)) {
case SStatus_DET_DEV:
/* wait for ready */
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS))
& PR_PS_PORT_READY))
DELAY(10);
--- 559,565 ----
schp->sch_sstatus)) {
case SStatus_DET_DEV:
/* wait for ready */
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS))
& PR_PS_PORT_READY))
DELAY(10);
***************
*** 781,792 ****
("%s: %s\n", SIISATANAME(sc), __func__), DEBUG_FUNCS);
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
if ((xfer->c_flags & C_TIMEOU) != 0)
goto command_done;
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status, clearing completion interrupt */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 794,807 ----
("%s: %s\n", SIISATANAME(sc), __func__), DEBUG_FUNCS);
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ /* clear only those interrupts, we are gooing to server now ... */
+ if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
if ((xfer->c_flags & C_TIMEOU) != 0)
goto command_done;
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status - interrupt cleared above - PR_PC_INCOR set ... */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
***************
*** 1062,1070 ****
uint32_t *prbfis = (void *)fis;
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status, clearing completion interrupt */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 1077,1090 ----
uint32_t *prbfis = (void *)fis;
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ /* clear only those interrupts, we are gooing to server now ... */
+ if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
+
+ if ((xfer->c_flags & C_TIMEOU) != 0)
+ goto command_done;
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status - interrupt cleared above - PR_PC_INCOR set ... */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
***************
*** 1116,1122 ****
command_done:
schp->sch_active_slots &= ~__BIT(slot);
chp->ch_flags &= ~ATACH_IRQ_WAIT;
! callout_stop(&chp->ch_callout);
chp->ch_queue->active_xfer = NULL;
--- 1136,1147 ----
command_done:
schp->sch_active_slots &= ~__BIT(slot);
chp->ch_flags &= ~ATACH_IRQ_WAIT;
! if (xfer->c_flags & C_TIMEOU) {
! ata_bio->error = TIMEOUT;
! } else {
! callout_stop(&chp->ch_callout);
! ata_bio->error = NOERROR;
! }
chp->ch_queue->active_xfer = NULL;
***************
*** 1270,1278 ****
{
struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS),
! PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) | PR_PC_PORT_INITIALIZE);
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
DELAY(10);
}
--- 1295,1302 ----
{
struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_PORT_INITIALIZE);
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
DELAY(10);
}
***************
*** 1281,1289 ****
{
struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS),
! PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) | PR_PC_DEVICE_RESET);
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
DELAY(10);
}
--- 1305,1312 ----
{
struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_DEVICE_RESET);
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
DELAY(10);
}
***************
*** 1456,1464 ****
periph->periph_cap |= PERIPH_CAP_CMD16;
/* configure port for packet length */
! PRWRITE(siic, PRX(chp->ch_channel, PRO_PCS),
! PRREAD(siic, PRX(chp->ch_channel, PRO_PCS)) |
! PR_PC_PACKET_LENGTH);
}
/* XXX This is gross. */
periph->periph_cap |= (id->atap_config & ATAPI_CFG_DRQ_MASK);
--- 1479,1485 ----
periph->periph_cap |= PERIPH_CAP_CMD16;
/* configure port for packet length */
! PRWRITE(siic, PRX(chp->ch_channel, PRO_PCS), PR_PC_PACKET_LENGTH);
}
/* XXX This is gross. */
periph->periph_cap |= (id->atap_config & ATAPI_CFG_DRQ_MASK);
***************
*** 1649,1657 ****
}
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status, clearing completion interrupt */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 1670,1680 ----
}
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ /* clear only those interrupts, we are gooing to server now ... */
+ if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status - interrupt cleared above - PR_PC_INCOR set ... */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
>Release-Note:
>Audit-Trail:
From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/41579: siista driver interrupt and command completion handling broken
Date: Fri, 12 Jun 2009 13:10:32 +0200
Hi - once again
I've done a typo in the driver name - if possible please correct this.
Otherwiese this report cannot be found by searching for the driver
The driver is "siisata" and not "siista" as written in the report itself
- sorry
W. Stukenbrock
gnats-admin@NetBSD.org wrote:
>Thank you very much for your problem report.
>It has the internal identification `kern/41579'.
>The individual assigned to look at your
>report is: kern-bug-people.
>
>>Category: kern
>>Responsible: kern-bug-people
>>Synopsis: siista driver interrupt and command completion handling broken
>>Arrival-Date: Fri Jun 12 10:45:00 +0000 2009
>>
>
From: jnemeth@victoria.tc.ca (John Nemeth)
To: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
netbsd-bugs@NetBSD.org, Wolfgang.Stukenbrock@nagler-company.com
Cc:
Subject: Re: kern/41579: siista driver interrupt and command completion handling broken
Date: Fri, 12 Jun 2009 12:40:54 -0700
On Sep 27, 11:22pm, Wolfgang Stukenbrock wrote:
}
} The following reply was made to PR kern/41579; it has been noted by GNATS.
}
} I've done a typo in the driver name - if possible please correct this.
} Otherwiese this report cannot be found by searching for the driver
}
} The driver is "siisata" and not "siista" as written in the report itself
Done.
}-- End of excerpt from Wolfgang Stukenbrock
Responsible-Changed-From-To: kern-bug-people->jakllsch
Responsible-Changed-By: jakllsch@NetBSD.org
Responsible-Changed-When: Wed, 17 Jun 2009 18:16:08 +0000
Responsible-Changed-Why:
I'll take this.
From: "Jonathan A. Kollasch" <jakllsch@kollasch.net>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/41579: siista driver interrupt and command completion
handling broken
Date: Wed, 17 Jun 2009 18:57:38 +0000
I'm unable to reproduce this in either 5.0 or -current on my hardware
(SiI3124 on 32-bit/33MHz PCI and SiI3132, WD5000AA[KJ]S and other
older non-Flash drives). So, testing a forward-port of your patch
myself will be hard to do.
That said, this mostly looks like stuff that could use fixing.
Thanks for pointing it out.
I'd be interested in knowing more about this system and/or why the
commands complete so fast.
From: Wolfgang.Stukenbrock@nagler-company.com
To: gnats-bugs@NetBSD.org
Cc: jakllsch@kollasch.net
Subject: Re: kern/41579: siista driver interrupt and command completion
handling broken
Date: Mon, 22 Jun 2009 13:23:51 +0200
Hi,
some background information from our setup:
(and some information related to the debugging process ...)
We have had this problem on all server-systems using a PCI-X 133 MHz =20
(4-port) or PCIe (2 port) controler with a chip from this family.
We never try such an IO-card this on a 32-bit bus.
An IO-Card with 2 or 4 3-GBit Ports makes no sence in a 32-bit slot, =20
because you will never get the data stream though it ... (4 * 66MHz =3D =20
264 MByte/sec - 3Gbit/10 =3D 300MByte/sec - so either 600MByte/sec or =20
1,2GByte/sec disc rate versus 264MByte/sec (or only 132MByte/sec on =20
33MHz) theoretical bus bandwith)
For such a card you need at least PCI-X 133MHz -> 8 * 133 =3D 1,064 GByte/se=
c.
This sounds like the reason for me why you cannot reproduce the problem.
Our systems:
2x Intel S3000AHxx/S3200SHxx with Xeon 3xxx dual-Core-CPU (1x PCI-X, 1x PCIe=
)
1x Supermicro (type see bug-report) with Xeon 3xxx quad-core-CPU (PCI-X)
An 4 port card resides on my desc, because it makes no sence to install it .=
..
All systems running with serial console at 38400 baud.
"Originally" these system run an 4.x here with some drivers imported =20
from "-current" due to the need of the HW ...
Before I've reported the problem, I've setup the Supermicro system =20
with a 5.0 and run the test again - same problem. So it was not =20
related to the adaption of the driver to 4.x.
All system run fine for a wile and suddenly looses a disc on the =20
SATA-Card. It was very hard to debug, because I've found no load =20
scenario that will reproduce the problem after a short time reliable. =20
And the select style of the disc looks like random ...
We need more SATA-disc in the tree systems as the motherboards =20
supports, so I've to find the problem in an other way.
I've added some print statements to the driver to get the famous last =20
words before looseing a disc. (not realy successfull ...)
It looks like that the system was loosing interrupts, so I've started =20
debugging of the interrupt routines and dumped some state information =20
from the chip - again with printf statements.
An suddenly all interrupts are lost! The timeout for the transfer =20
happens, but no timeout processing takes place ...
I've located the main problem in the "wrong order" of register access =20
and operation mode of the chip. - The debug dump of the chip register =20
has cleared the interrupt cause. (And of cause in the missing timeout =20
handling code ...)
I'm not shure why the interrupt processing takes too much time on our =20
systems sometimes.
Due to the fact, that it was not realy reproducable with access to =20
large files or the raw-device with larger transfers (-> 64k-transfers, =20
the outdated limit from the ATA-stuff ...), the cause of the problem =20
may be short writes to disc as done by the filesystem when updating =20
some inode information.
To transfer 4k to the disc over a 3 Gbit-link will take something =20
around 13 us.
The driver uses only one slot and I'm not what will happen in the =20
upper layer of the wd-driver if a transfer completes. (The reason for =20
useing only one slot is the ATA abstraction in netbsd that doesn't =20
allow more than one pending transfer ... This should be corrected for =20
performance reason as soon as possible - but I've no time to fix it =20
for "all" drivers - sorry ...
The first try of a port for this chip family (without any usable =20
eror-handling :-( ...) to netbsd was mine, and it would have used all =20
slots - as done in the OpenBSD-version, but the ATA-layer cannot realy =20
handle it. So in fact only two slots are uses.
By the way: my port had been 10% faster as the driver for the =20
ICH-series - this one is something about 8% slower ...)
Back to the main thing ..
13 us is a short time - even on a fast CPU. And if some processing of =20
a completed transfer is done after starting the next one, it is =20
possible that this make take longer than 13 us. (I've haven't looked =20
in the wd-driver to verify this, but I hope the next transfer will be =20
started before dooing a wakeup for the completed. Otherwise the driver =20
would be very very inefficent !!!)
Due to the missing code fragment for timeout-processing, the timeout =20
will happen, but is ignored -> still a completed transfer on the chip, =20
but driver will never stop it.
remark: on a 32-Bit 33MHz bus this will take at least 30 usec due to =20
the limited bandwith of the system bus.
This would be a very good explanation for the problem.
If there are only "some" transfer for the discs, all processing has =20
finished before the next transfer is generated and started.
But if there are more transfers pending at a time and one of the =20
waiting is e.g. an inode-block update an interrupt is cleared in the =20
chip before interrupt processing takes place.
As faster as a CPU is, it will be able to process everything before =20
the chip generates the next interrupt and no disc is lost. And on a =20
system with only 132MByte/sec bandwith, the interrupt routine will be =20
done before the DMA was able to transfer the data to the chip.
best regards
W. Stukenbrock
Zitat von "Jonathan A. Kollasch" <jakllsch@kollasch.net>:
> The following reply was made to PR kern/41579; it has been noted by GNATS.
>
> From: "Jonathan A. Kollasch" <jakllsch@kollasch.net>
> To: gnats-bugs@NetBSD.org
> Cc:
> Subject: Re: kern/41579: siista driver interrupt and command completion
> =09handling broken
> Date: Wed, 17 Jun 2009 18:57:38 +0000
>
> I'm unable to reproduce this in either 5.0 or -current on my hardware
> (SiI3124 on 32-bit/33MHz PCI and SiI3132, WD5000AA[KJ]S and other
> older non-Flash drives). So, testing a forward-port of your patch
> myself will be hard to do.
>
> That said, this mostly looks like stuff that could use fixing.
> Thanks for pointing it out.
>
> I'd be interested in knowing more about this system and/or why the
> commands complete so fast.
>
>
State-Changed-From-To: open->feedback
State-Changed-By: jakllsch@NetBSD.org
State-Changed-When: Thu, 20 Aug 2009 00:34:50 +0000
State-Changed-Why:
I believe this has been fixed in -current since early July.
Could you verify that this issue is no longer present in -current?
From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
To: gnats-bugs@NetBSD.org
Cc: jakllsch@NetBSD.org, netbsd-bugs@NetBSD.org, gnats-admin@NetBSD.org
Subject: Re: kern/41579 (siisata driver interrupt and command completion handling broken)
Date: Thu, 20 Aug 2009 12:16:15 +0200
Hi, I've a little problem with this request ....
OK, I've integrated the driver into the 4.0 kernel, because I need it
there. (I've still no time to have a look at 5.0.1 ...)
I've fixed the interrupt handling by the patch I've send to NetBSD.org,
because the driver was instable.
With the patch it seems to be stable now - no problems up to now anymore
with only disks connected to the controler.
And all three system with this controler type installed are productive
systems! They run 24 hours 7 days a week without any interruption ...
I've no "free" system with that hardware to install an other OS-version
on it at the moment.
I have a spare controler of this type, but currently no free MB where I
can insert it - the controler needs a PCI-X-slot ...
If it's good for you that I just check the sources, I will do it again
- see below.
I've done it prior sending the bug report and the changes at that time
there doesn't affect the problem, but there may be something new now.
And there are "some" major changes ...
But the PR_PC_INCOR bit is again not set in the driver and so one main
whole is still present where interrupts can be lost.
It is very hard to garantie that there will be no interrupt lost on the
port, if you clear the cmd-complete interrupt by a side effect!
And as a side effect not setting this bit makes debugging of the
behaviour here impossible, because you cannot check the PSS register ...
In siisata_intr() you loop now until no bits are set in GR_GIS anymore
and check the IS in an inner loop for all channels.
Then you check the PIS for the channel and if CMPCMPL is set, you read
PSS that clears the CMPCMPL interupt as side effect.
This looks OK as long as only one slot is used at all - and I hope this
performance issue will be addresses in the near future ...
You have moved the whole access to the chip into the intr_port routine -
was in cmd_complete before.
This looks good, but there is still the "clear all interrupts" part in ...
I would prefere to eliminate this and clear only the interrups gooing to
serve. This is more stable and of cause required if more than on slot
will be used in the future.
A check with panic() may make sence if there comes any unexpetec
interrupt at all - at least if DIGANOSTICS is defined ...
The missing timeout-check in bio_complete is there now.
The usage of PRO_PCS/PRO_PCC has to be corrected.
summary of my analyses:
The main cause why interrupts are lost in the previous version has been
fixed by moving all chip-access in front of the completion handling.
This problem schould be solved in the latest version of the driver.
The lost timeout should be fixed and the PCS/PCC usage looks good to me.
Neverless I would strongly recommend to set the PR_PC_INCOR bit to
remove this kind of side effect in interrupt handling.
It is alway a bad style to use HW side effects, because they normaly
generate sync-wholes and you need to write to PIS in any case - currenly
still with 0xffffffff without checking for unexpected interrupts -
another point that should be changed I think. At least a WARNING should
be printed if there is an unexpected interrupt, so that anybody can see
that there is still a problem with the driver ...
I still prefere a panic() on unecpected interrupts, but thats my way to
make drivers stable - do not allow any unexpected behaviour of the HW ...
If you just clear the intterupt, it may come up again and than you may
run into a busy-loop by processing (or better not processing) the
HW-interrupt. And you will not see it with the current implementation ...
It is always a good idea for interrupt handlers to clear exactly those
interrupts only, that they are gooing to serve. Otherwiese some
interrupts may get lost - OK as long as only one slot is used this
cannot happen here ...
Sorry that I've no way to run the current version at the moment -
neverless testing this problem by running the system may only show that
there is still a problem. It will never say, that there is no one left -
for this kind of statement, you need to analyse the source code of the
interrupt handler.
If the request for feedback was intended by a "close-request" for the
report, the current source code looks good to me and the problem report
can be closed.
By the way. This mail reached me today at 02:34 (AM) and a reminder for
that came today at 07:11 (AM).
Is it normal that the systems sends a reminder after less than 5 hours?
W. Stukenbrock
jakllsch@NetBSD.org wrote:
>Synopsis: siisata driver interrupt and command completion handling broken
>
>State-Changed-From-To: open->feedback
>State-Changed-By: jakllsch@NetBSD.org
>State-Changed-When: Thu, 20 Aug 2009 00:34:50 +0000
>State-Changed-Why:
>I believe this has been fixed in -current since early July.
>Could you verify that this issue is no longer present in -current?
>
>
>
State-Changed-From-To: feedback->open
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Thu, 27 Aug 2009 04:42:14 +0000
State-Changed-Why:
Feedback received; submitter can't test current but has some recommendations
for further changes.
(regarding the gnats reminders: gnats sends those when it feels like it,
so while having it send one after only a few hours isn't so great we
probably can't prevent it.)
State-Changed-From-To: open->closed
State-Changed-By: jakllsch@NetBSD.org
State-Changed-When: Fri, 25 Dec 2009 18:47:03 +0000
State-Changed-Why:
while I haven't fully addressed all points in this pr, the basic issue is believed fixed and the submiter reports it can be closed.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.