NetBSD Problem Report #32757

From seebs@vash.cel.plethora.net  Mon Feb  6 09:48:48 2006
Return-Path: <seebs@vash.cel.plethora.net>
Received: from vash.cel.plethora.net (216-243-131-210.static.iphouse.net [216.243.131.210])
	by narn.netbsd.org (Postfix) with ESMTP id 8D12C63B848
	for <gnats-bugs@gnats.NetBSD.org>; Mon,  6 Feb 2006 09:48:47 +0000 (UTC)
Message-Id: <200602060946.k169kLXo012203@vash.cel.plethora.net>
Date: Mon, 6 Feb 2006 03:46:21 -0600 (CST)
From: seebs <seebs@vash.cel.plethora.net>
Reply-To: seebs@plethora.net
To: gnats-bugs@netbsd.org
Subject: TLB IPI rendezvous fails sometimes
X-Send-Pr-Version: 3.95

>Number:         32757
>Category:       kern
>Synopsis:       kernel occasionally panics with "TLB IPI rendezvous failed"
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb 06 09:50:00 +0000 2006
>Closed-Date:    Wed Jul 21 20:53:02 +0000 2010
>Last-Modified:  Wed Jul 21 20:53:02 +0000 2010
>Originator:     seebs
>Release:        NetBSD 2.1
>Organization:
>Environment:
NetBSD ns1.cheetah.net 2.1 NetBSD 2.1 (CHEETAH) #0: Thu Dec 29 04:02:46 PST 2005  beta1@ns1.cheetah.net:/usr/src/2.1/usr/src/sys/arch/i386/compile/CHEETAH i386
Architecture: i386
Machine: i386
>Description:
	On at least some motherboards, NetBSD 2.1 occasionally panics with
	"TLB IPI rendezvous failed".  The patch from pmap.c 1.184 has been
	verified to be present.
>How-To-Repeat:
	Run an SMP kernel under load.

	Someone else on the NetBSD lists reports the same behavior with a
	Pentium 3 system, suggesting that this isn't just a specific
	motherboard, but it's obviously rare.  Here's full dmesg output
	from the system in single-processor mode.  (It's not stable enough
	in SMP mode to run in production.)

NetBSD 2.1 (CHEETAH) #0: Thu Dec 29 04:02:46 PST 2005
	beta1@ns1.cheetah.net:/usr/src/2.1/usr/src/sys/arch/i386/compile/CHEETAH
total memory = 1022 MB
avail memory = 996 MB
BIOS32 rev. 0 found at 0xfd6d0
mainbus0 (root)
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel Xeon (686-class), 3065.96 MHz, id 0xf29
cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: I-cache 12K uOp cache 8-way, D-cache 8 KB 64B/line 4-way
cpu0: L2 cache 512 KB 64B/line 8-way
cpu0: ITLB 4K/4M: 64 entries
cpu0: DTLB 4K/4M: 64 entries
cpu0: calibrating local timer
cpu0: apic clock running at 133 MHz
cpu0: 16 page colors
cpu1 at mainbus0: apid 6 (application processor)
cpu1: not started
cpu2 at mainbus0: apid 1 (application processor)
cpu2: not started
cpu3 at mainbus0: apid 7 (application processor)
cpu3: not started
ioapic0 at mainbus0 apid 2 (I/O APIC)
ioapic0: pa 0xfec00000, version 20, 24 pins
ioapic1 at mainbus0 apid 3 (I/O APIC)
ioapic1: pa 0xfec80000, version 20, 24 pins
ioapic2 at mainbus0 apid 4 (I/O APIC)
ioapic2: pa 0xfec80100, version 20, 24 pins
cpu4 at mainbus0: (uniprocessor)
cpu4: Intel Xeon (686-class), 3065.82 MHz, id 0xf29
cpu4: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu4: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu4: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu4: I-cache 12K uOp cache 8-way, D-cache 8 KB 64B/line 4-way
cpu4: L2 cache 512 KB 64B/line 8-way
cpu4: ITLB 4K/4M: 64 entries
cpu4: DTLB 4K/4M: 64 entries
acpi0 at mainbus0
acpi0: using Intel ACPI CA subsystem version 20040211
acpi0: X/RSDT: OemId <PTLTD ,  RSDT  ,06040000>, AslId < LTP,00000000>
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
acpi: activated PNP0C0F
acpi: activated PNP0C0F
PNP0A03 [PCI Bus] at acpi0 not configured
PNP0C02 [Plug and Play motherboard register resources] at acpi0 not configured
PNP0200 [AT DMA Controller] at acpi0 not configured
PNP0C04 [Math Coprocessor] at acpi0 not configured
PNP0000 [AT Interrupt Controller] at acpi0 not configured
PNP0B00 [AT Real-Time Clock] at acpi0 not configured
PNP0800 [AT-style speaker sound] at acpi0 not configured
PNP0100 [AT Timer] at acpi0 not configured
PNP0303 [IBM Enhanced (101/102-key, PS/2 mouse support)] at acpi0 not configured
PNP0F13 [PS/2 Port for PS/2-style Mice] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
INT0800 at acpi0 not configured
PNP0A05 [Generic ACPI Bus] at acpi0 not configured
PNP0501 [16550A-compatible COM port] at acpi0 not configured
PNP0501 [16550A-compatible COM port] at acpi0 not configured
PNP0700 [PC standard floppy disk controller] at acpi0 not configured
PNP0401 [ECP printer port] at acpi0 not configured
PNP0C0C [ACPI power button device] at acpi0 not configured
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: Intel E7505 MCH Host (rev. 0x03)
agp0 at pchb0: using generic initialization for Intel AGP
agp0: aperture at 0xf4000000, size 0x4000000
Intel E7505 MCH RAS Controller (undefined subclass 0x00, revision 0x03) at pci0 dev 0 function 1 not configured
ppb0 at pci0 dev 1 function 0: Intel E7505 MCH Host-to-AGP Bridge (rev. 0x03)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
ppb1 at pci0 dev 2 function 0: Intel E7505 MCH HI_B PCI-to-PCI (rev. 0x03)
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled
Intel 82870P2 P64H2 IOxAPIC (interrupt system, interface 0x20, revision 0x04) at pci2 dev 28 function 0 not configured
ppb2 at pci2 dev 29 function 0: Intel 82870P2 P64H2 PCI-to-PCI Bridge (rev. 0x04)
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled
wm0 at pci3 dev 3 function 0: Intel i82545EM 1000BASE-T Ethernet, rev. 1
wm0: interrupting at ioapic2 pin 6 (irq 12)
wm0: 64-bit 133MHz PCIX bus
wm0: 256 word (8 address bits) MicroWire EEPROM
wm0: Ethernet address 00:30:48:73:a3:13
ukphy0 at wm0 phy 1: Generic IEEE 802.3u media interface
ukphy0: Marvell 88E1011 Gigabit PHY (OUI 0x000ac2, model 0x0002), rev. 3
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
Intel 82870P2 P64H2 IOxAPIC (interrupt system, interface 0x20, revision 0x04) at pci2 dev 30 function 0 not configured
ppb3 at pci2 dev 31 function 0: Intel 82870P2 P64H2 PCI-to-PCI Bridge (rev. 0x04)
pci4 at ppb3 bus 4
pci4: i/o space, memory space enabled
ahd0 at pci4 dev 3 function 0
ahd0: interrupting at ioapic1 pin 8 (irq 12)
ahd0: aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs
scsibus0 at ahd0: 16 targets, 8 luns per target
ahd1 at pci4 dev 3 function 1
ahd1: interrupting at ioapic1 pin 9 (irq 12)
ahd1: aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs
scsibus1 at ahd1: 16 targets, 8 luns per target
uhci0 at pci0 dev 29 function 0: Intel 82801DB/DBM USB UHCI Controller #1 (rev. 0x02)
uhci0: interrupting at ioapic0 pin 16 (irq 11)
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1 at pci0 dev 29 function 1: Intel 82801DB/DBM USB UHCI Controller #2 (rev. 0x02)
uhci1: interrupting at ioapic0 pin 19 (irq 10)
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2 at pci0 dev 29 function 2: Intel 82801DB/DBM USB UHCI Controller #3 (rev. 0x02)
uhci2: interrupting at ioapic0 pin 18 (irq 5)
usb2 at uhci2: USB revision 1.0
uhub2 at usb2
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0 at pci0 dev 29 function 7: Intel 82801DB/DBM USB EHCI Controller (rev. 0x02)
ehci0: interrupting at ioapic0 pin 23 (irq 12)
ehci0: EHCI version 1.0
ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2
usb3 at ehci0: USB revision 2.0
uhub3 at usb3
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: 6 ports with 6 removable, self powered
ppb4 at pci0 dev 30 function 0: Intel 82801BA Hub-to-PCI Bridge (rev. 0x82)
pci5 at ppb4 bus 5
pci5: i/o space, memory space enabled
ex0 at pci5 dev 1 function 0: 3Com 3c905B-TX 10/100 Ethernet (rev. 0x30)
ex0: interrupting at ioapic0 pin 16 (irq 11)
ex0: MAC address 00:10:5a:83:1b:65
exphy0 at ex0 phy 24: 3Com internal media interface
exphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
vga0 at pci5 dev 2 function 0: ATI Technologies Rage XL (AGP) (rev. 0x65)
wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
ichlpcib0 at pci0 dev 31 function 0
ichlpcib0: Intel 82801DB LPC Interface Bridge (rev. 0x02)
ichlpcib0: TCO (watchdog) timer configured.
pciide0 at pci0 dev 31 function 1
pciide0: Intel 82801DB IDE Controller (UltraATA/100) (rev. 0x02)
pciide0: bus-master DMA support present, but unused (no driver support)
pciide0: primary channel configured to compatibility mode
pciide0: primary channel ignored (not responding; disabled or no drives?)
pciide0: secondary channel configured to compatibility mode
pciide0: secondary channel interrupting at ioapic0 pin 15 (irq 15)
atabus0 at pciide0 channel 1
Intel 82801DB/DBM SMBus Controller (SMBus serial bus, revision 0x02) at pci0 dev 31 function 3 not configured
Intel 82801DB/DBM AC97 Audio Controller (audio multimedia, revision 0x02) at pci0 dev 31 function 5 not configured
isa0 at ichlpcib0
lpt0 at isa0 port 0x378-0x37b irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
sysbeep0 at pcppi0
npx0 at isa0 port 0xf0-0xff: using exception 16
ioapic2: enabling
ioapic1: enabling
ioapic0: enabling
IPsec: Initialized Security Association Processing.
scsibus0: waiting 2 seconds for devices to settle...
scsibus1: waiting 2 seconds for devices to settle...
atapibus0 at atabus0: 2 targets
cd0 at atapibus0 drive 0: <TSSTcorpCD/DVDW TS-H552B, , TS05> cdrom removable
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33)
sd0 at scsibus0 target 0 lun 0: <SEAGATE, ST373207LW, 0003> disk fixed
sd0: 70007 MB, 90774 cyl, 2 head, 789 sec, 512 bytes/sect x 143374744 sectors
sd0: sync (6.25ns offset 63), 16-bit (320.000MB/s) transfers, tagged queueing
boot device: sd0
root on sd0a dumps on sd0b
root file system type: ffs
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)

>Fix:
	Workaround:  Run single-processor.

	Actual fix:  Not known.  The other person I talked to in late December
	had a patch that simply retried the rendezvous (possibly re-sending
	something; I don't have the patch).  It apparently worked, which
	implies that there is a race condition somewhere.
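
	The retry idea described above can be illustrated with a small
	user-space sketch.  This is not kernel code and not the patch itself:
	struct rendezvous, send_ipis() and rendezvous_with_retry() are
	hypothetical names invented for the illustration, and the "lost IPI"
	is simulated with a bit mask rather than real interrupts.  The sketch
	also resets the spin counter on each attempt, which is a choice made
	here and not something taken from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one rendezvous: bit n set means CPU n still owes an ack. */
struct rendezvous {
	uint32_t pending_mask;	/* CPUs that have not acknowledged yet */
	uint32_t deaf_mask;	/* CPUs that lose only the next IPI (simulated) */
	uint32_t stuck_mask;	/* CPUs that never acknowledge (simulated) */
	int	 ipis_sent;	/* number of send rounds performed */
};

/* Simulated send: every pending CPU acks unless it is deaf or stuck. */
static void
send_ipis(struct rendezvous *r)
{
	uint32_t delivered = r->pending_mask & ~(r->deaf_mask | r->stuck_mask);

	r->pending_mask &= ~delivered;
	r->deaf_mask = 0;	/* a lost IPI affects one send only */
	r->ipis_sent++;
}

/*
 * Send, spin up to spin_limit iterations, and re-send up to max_retries
 * times.  Returns false at the point where a kernel would panic.
 */
static bool
rendezvous_with_retry(struct rendezvous *r, long spin_limit, int max_retries)
{
	int retry = 0;

	assert(spin_limit > 0 && max_retries >= 0);
	for (;;) {
		long count = 0;

		send_ipis(r);
		while (r->pending_mask != 0) {
			if (count++ > spin_limit)
				break;	/* timed out waiting for acks */
		}
		if (r->pending_mask == 0)
			return true;
		if (retry++ >= max_retries)
			return false;
	}
}
```

	With max_retries set to 5, a CPU that misses a single IPI is
	recovered by the second send, while a CPU that never responds
	causes six sends in total before the function gives up.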

>Release-Note:

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: kern/32757: TLB IPI rendezvous fails sometimes
Date: Mon, 6 Feb 2006 23:06:26 +0100

 On Mon, Feb 06, 2006 at 09:50:01AM +0000, seebs wrote:
 > Machine: i386
 > >Description:
 > 	On at least some motherboards, NetBSD 2.1 occasionally fails with TLB
 > 	IPI rendezvous failed.  The patch (from pmap.c 1.184) is verified
 > 	present.
 > >How-To-Repeat:
 > 	Run under load.
 > 
 > 	Someone else on the NetBSD lists reports the same behavior with a
 > 	Pentium 3 system, suggesting that this isn't just a specific

 I'm the one who reported the problem. Hardware is PIII-1Ghz on a
 MSI 694D-Pro 2 motherboard:
 mainbus0 (root)
 mainbus0: Intel MP Specification (Version 1.4) (OEM00000 PROD00000000)
 cpu0 at mainbus0: apid 0 (boot processor)
 cpu0: Intel Pentium III (686-class), 1002.37 MHz, id 0x68a
 cpu0: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
 cpu0: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
 cpu0: features 387fbff<FXSR,SSE>
 cpu0: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
 cpu0: L2 cache 256 KB 32B/line 8-way
 cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
 cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
 cpu0: serial number 0000-068A-0001-DDD6-4ED7-4704
 cpu0: calibrating local timer
 cpu0: apic clock running at 133 MHz
 cpu0: 8 page colors
 cpu1 at mainbus0: apid 1 (application processor)
 cpu1: starting
 cpu1: Intel Pentium III (686-class), 1002.28 MHz, id 0x68a
 cpu1: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
 cpu1: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
 cpu1: features 387fbff<FXSR,SSE>
 cpu1: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
 cpu1: L2 cache 256 KB 32B/line 8-way
 cpu1: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
 cpu1: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
 cpu1: serial number 0000-068A-0003-ADAB-C15A-1E54
 pchb0 at pci0 dev 0 function 0
 pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)
 pcib0 at pci0 dev 7 function 0
 pcib0: VIA Technologies VT82C686A PCI-ISA Bridge (rev. 0x40)
 viaide0 at pci0 dev 7 function 1
 viaide0: VIA Technologies VT82C686A (Apollo KX133) ATA100 controller

 I still see it with NetBSD 3.0, both for TLB IPIs and FPU IPIs.
 I'm running with the attached patch, and all my systems are stable with
 it.  I have several systems based on the same hardware running SMP with
 different workloads; all of them show the problem, anywhere from once a
 day to once in several weeks, depending on the workload.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 years of experience will always make the difference
 --

 --SUOF0GtieIMvvwua
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: i386/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/i386/i386/pmap.c,v
 retrieving revision 1.181.2.2
 diff -u -r1.181.2.2 pmap.c
 --- i386/pmap.c	26 Sep 2005 20:24:52 -0000	1.181.2.2
 +++ i386/pmap.c	6 Feb 2006 19:37:12 -0000
 @@ -3652,6 +3652,7 @@
  	int s;
  #ifdef DIAGNOSTIC
  	int count = 0;
 +	int ipi_retry = 0;
  #endif
  #endif

 @@ -3672,6 +3673,9 @@
  	/*
  	 * Send the TLB IPI to other CPUs pending shootdowns.
  	 */
 +#ifdef DIAGNOSTIC
 +ipi_again:
 +#endif
  	for (CPU_INFO_FOREACH(cii, ci)) {
  		if (ci == self)
  			continue;
 @@ -3683,9 +3687,20 @@

  	while (self->ci_tlb_ipi_mask != 0) {
  #ifdef DIAGNOSTIC
 -		if (count++ > 10000000)
 +		if (count++ > 10000000) {
 +			for (CPU_INFO_FOREACH(cii, ci)) {
 +				if (ci == self)
 +					continue;
 +				printf("CPU %ld interrupt level 0x%x pending "
 +				    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
 +				    ci->ci_ilevel, ci->ci_ipending,
 +				    ci->ci_idepth, ci->ci_ipis);
 +			}
 +			if (ipi_retry++ < 5)
 +				goto ipi_again;
  			panic("TLB IPI rendezvous failed (mask %x)",
  			    self->ci_tlb_ipi_mask);
 +		}
  #endif
  		x86_pause();
  	}
 Index: isa/npx.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/i386/isa/npx.c,v
 retrieving revision 1.107
 diff -u -r1.107 npx.c
 --- isa/npx.c	3 Feb 2005 21:08:58 -0000	1.107
 +++ isa/npx.c	6 Feb 2006 19:37:12 -0000
 @@ -732,6 +732,8 @@
  	} else {
  #ifdef DIAGNOSTIC
  		int spincount;
 +		int ipi_retry = 0;
 +ipi_again:
  #endif

  		IPRINTF(("%s: fp ipi to %s %s lwp %p\n",
 @@ -750,6 +752,16 @@
  #ifdef DIAGNOSTIC
  			spincount++;
  			if (spincount > 10000000) {
 +				printf("CPU %ld interrupt level 0x%x pending "
 +				    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
 +				    ci->ci_ilevel, ci->ci_ipending,
 +				    ci->ci_idepth, ci->ci_ipis);
 +				printf("CPU %ld interrupt level 0x%x pending "
 +				    "0x%x depth %d ci_ipis %d\n", oci->ci_cpuid,
 +				    oci->ci_ilevel, oci->ci_ipending,
 +				    oci->ci_idepth, oci->ci_ipis);
 +				if (ipi_retry++ < 5)
 +					goto ipi_again;
  				panic("fp_save ipi didn't");
  			}
  #endif

State-Changed-From-To: open->feedback
State-Changed-By: pooka@NetBSD.org
State-Changed-When: Tue, 29 Jun 2010 19:24:43 +0300
State-Changed-Why:
Is this still an issue?  At least the code is different now.


From: seebs@seebs.net (Peter Seebach)
To: gnats-bugs@NetBSD.org (NetBSD Problem Report DB Administrator)
Cc: 
Subject: Re: kern/32757
Date: Wed, 21 Jul 2010 14:58:37 -0500

 >kern/32757 - critical medium priority sw-bug
 >	kernel occasionally panics with "TLB IPI rendezvous failed"
 >	http://gnats.NetBSD.org/32757

 I don't have access to the machine in question anymore, but I'd guess that
 it's very unlikely that this is still an issue.

 -s

State-Changed-From-To: feedback->closed
State-Changed-By: wiz@NetBSD.org
State-Changed-When: Wed, 21 Jul 2010 20:53:02 +0000
State-Changed-Why:
Submitter agrees this can be closed. Thanks for the feedback!


>Unformatted:
