NetBSD Problem Report #45928

From www@NetBSD.org  Sun Feb  5 07:17:18 2012
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id C0FD563DF42
	for <gnats-bugs@gnats.NetBSD.org>; Sun,  5 Feb 2012 07:17:17 +0000 (UTC)
Message-Id: <20120205071717.2419763BCF4@www.NetBSD.org>
Date: Sun,  5 Feb 2012 07:17:17 +0000 (UTC)
From: mm_lists@pulsar-zone.net
Reply-To: mm_lists@pulsar-zone.net
To: gnats-bugs@NetBSD.org
Subject: Random freeze and interrupt/system storms (netbsd-5/amd64 and -current/amd64)
X-Send-Pr-Version: www-1.0

>Number:         45928
>Category:       kern
>Synopsis:       Random freeze and interrupt/system storms (netbsd-5/amd64 and -current/amd64)
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Feb 05 07:20:00 +0000 2012
>Last-Modified:  Tue Feb 21 20:55:01 +0000 2012
>Originator:     Matthew Mondor
>Release:        netbsd-5, -current
>Organization:
>Environment:
NetBSD ninja.xisop 5.1_STABLE NetBSD 5.1_STABLE (GENERIC_MM) #1: Fri Feb  3 06:04:22 EST 2012  root@ninja.xisop:/usr/obj/sys/arch/amd64/compile/GENERIC_MM amd64

>Description:
Since installing NetBSD on an i5 2500 system, I have been experiencing
instability issues involving freezes the system can recover from and
interrupt/system time storms reported on the cores.

Suddenly, the system becomes unresponsive and top shows two of the four
cores at 100% interrupt time and the other two at 100% system time.
The system becomes usable again, but shortly afterwards the problem
recurs and gets worse, until I have to reset.

This sometimes happens after several hours, but can also occur a minute
or two after a reboot.  The system does not have to be particularly
busy for this to happen; it seems to occur at any time.

Unfortunately I'm not sure how to diagnose this further.  I think that
it happens less often, or is less serious, since I applied the patch
from kern/45160, but it still happens.  It also seems to happen sooner
if I use a hard drive in PIO mode (see kern/45917), but it happens even
when I'm not using such drives (in which case I see no abnormal storms
reported by vmstat/systat, so I suspect soft interrupts).
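
A minimal sketch of how that suspicion could be checked: take two
snapshots of vmstat -i a few seconds apart and print only the hardware
interrupt sources whose counters grew.  The helper name intr_delta and
the snapshot paths are hypothetical, and the sketch assumes every data
line of vmstat -i ends in a total column followed by a rate column.  If
no source grows abnormally while top still shows cores pegged in
interrupt time, that would point away from a hardware pin storm.

```sh
#!/bin/sh
# Hypothetical helper (not from the report): print how much each interrupt
# counter grew between two saved "vmstat -i" outputs.
# Assumes each data line looks like "<name words...> <total> <rate>".
intr_delta() {
    awk '
        # Skip the column header and anything too short to be a data line.
        $1 == "interrupt" || NF < 3 { next }
        {
            total = $(NF - 1) + 0          # force numeric comparison
            name = $1
            for (i = 2; i <= NF - 2; i++)  # names may contain spaces
                name = name " " $i
        }
        FNR == NR { before[name] = total; next }   # first file: remember
        (name in before) && total > before[name] {
            printf "%s\t%d\n", name, total - before[name]
        }
    ' "$1" "$2"
}

# Example use while the problem is occurring (paths are placeholders):
#   vmstat -i > /tmp/intr.before; sleep 10; vmstat -i > /tmp/intr.after
#   intr_delta /tmp/intr.before /tmp/intr.after
```

Sources are printed in the order they appear in the second snapshot;
piping the result through sort would rank them by growth.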

This also happens on -current it seems.
>How-To-Repeat:

>Fix:

>Audit-Trail:
From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/45928: Random freeze and interrupt/system storms
 (netbsd-5/amd64 and -current/amd64)
Date: Sun, 5 Feb 2012 02:35:37 -0500

 Here is the dmesg (on netbsd-5):

 Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
     2006, 2007, 2008, 2009, 2010
     The NetBSD Foundation, Inc.  All rights reserved.
 Copyright (c) 1982, 1986, 1989, 1991, 1993
     The Regents of the University of California.  All rights reserved.

 NetBSD 5.1_STABLE (GENERIC_MM) #1: Fri Feb  3 06:04:22 EST 2012
 	root@ninja.xisop:/usr/obj/sys/arch/amd64/compile/GENERIC_MM
 total memory = 8103 MB
 avail memory = 7840 MB
 timecounter: Timecounters tick every 10.000 msec
 timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
 SMBIOS rev. 2.6 @ 0xeb170 (107 entries)
 System manufacturer System Product Name (System Version)
 mainbus0 (root)
 cpu0 at mainbus0 apid 0/usr/src/sys/arch/x86/x86/mtrr_i686.c: FIXME: more than 8 MTRRs
 : Intel 686-class, 3311MHz, id 0x206a7
 cpu1 at mainbus0 apid 2: Intel 686-class, 3311MHz, id 0x206a7
 cpu2 at mainbus0 apid 4: Intel 686-class, 3311MHz, id 0x206a7
 cpu3 at mainbus0 apid 6: Intel 686-class, 3311MHz, id 0x206a7
 ioapic0 at mainbus0 apid 0: pa 0xfec00000, version 20, 24 pins
 acpi0 at mainbus0: Intel ACPICA 20080321
 acpi0: X/RSDT: OemId <ALASKA,   A M I,01072009>, AslId <AMI ,00010013>
 ACPI Error (psargs-0464): [RAMB] Namespace lookup failure, AE_NOT_FOUND
 ACPI Exception (nsinit-0425): AE_NOT_FOUND, Could not execute arguments for [RAMW] (Region) [20080321]
 acpi0: SCI interrupting at int 9
 acpi0: fixed-feature power button present
 timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
 ACPI-Fast 24-bit timer
 LPTE (PNP0400) at acpi0 not configured
 attimer1 at acpi0 (TMR, PNP0100): io 0x40-0x43 irq 0
 pcppi1 at acpi0 (SPKR, PNP0800): io 0x61
 midi0 at pcppi1: PC speaker (CPU-intensive output)
 spkr0 at pcppi1
 sysbeep0 at pcppi1
 UAR1 (PNP0501) at acpi0 not configured
 pckbc1 at acpi0 (PS2K, PNP0303) (kbd port): io 0x60,0x64 irq 1
 hpet0 at acpi0 (HPET, PNP0103): mem 0xfed00000-0xfed003ff
 timecounter: Timecounter "hpet0" frequency 14318179 Hz quality 2000
 acpibut0 at acpi0 (PWRB, PNP0C0C-170): ACPI Power Button
 attimer1: attached to pcppi1
 pckbd0 at pckbc1 (kbd slot)
 pckbc1: using irq 1 for kbd slot
 wskbd0 at pckbd0: console keyboard
 pci0 at mainbus0 bus 0: configuration mode 1
 pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
 pchb0 at pci0 dev 0 function 0
 pchb0: vendor 0x8086 product 0x0100 (rev. 0x09)
 ppb0 at pci0 dev 1 function 0: vendor 0x8086 product 0x0101 (rev. 0x09)
 ppb0: unsupported PCI Express version
 pci1 at ppb0 bus 1
 pci1: i/o space, memory space enabled, rd/line, wr/inv ok
 vga0 at pci0 dev 2 function 0: vendor 0x8086 product 0x0102 (rev. 0x09)
 wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation), using wskbd0
 wsmux1: connecting to wsdisplay0
 drm at vga0 not configured
 vendor 0x8086 product 0x1c3a (miscellaneous communications, revision 0x04) at pci0 dev 22 function 0 not configured
 ehci0 at pci0 dev 26 function 0: vendor 0x8086 product 0x1c2d (rev. 0x05)
 ehci0: interrupting at ioapic0 pin 23
 ehci0: BIOS has given up ownership
 ehci0: EHCI version 1.0
 usb0 at ehci0: USB revision 2.0
 azalia0 at pci0 dev 27 function 0: Generic High Definition Audio Controller
 azalia0: interrupting at ioapic0 pin 22
 azalia0: host: 0x8086/0x1c20 (rev. 5), HDA rev. 1.0
 ppb1 at pci0 dev 28 function 0: vendor 0x8086 product 0x1c10 (rev. 0xb5)
 ppb1: unsupported PCI Express version
 pci2 at ppb1 bus 2
 pci2: i/o space, memory space enabled, rd/line, wr/inv ok
 ppb2 at pci0 dev 28 function 4: vendor 0x8086 product 0x1c18 (rev. 0xb5)
 ppb2: unsupported PCI Express version
 pci3 at ppb2 bus 3
 pci3: i/o space, memory space enabled, rd/line, wr/inv ok
 pciide0 at pci3 dev 0 function 0
 pciide0: vendor 0x1106 product 0x0415 (rev. 0x00)
 pciide0: bus-master DMA support present, but unused (no driver support)
 pciide0: primary channel wired to native-PCI mode
 pciide0: using ioapic0 pin 16 for native-PCI interrupt
 atabus0 at pciide0 channel 0
 pciide0: secondary channel wired to native-PCI mode
 atabus1 at pciide0 channel 1
 ppb3 at pci0 dev 28 function 5: vendor 0x8086 product 0x1c1a (rev. 0xb5)
 ppb3: unsupported PCI Express version
 pci4 at ppb3 bus 4
 pci4: i/o space, memory space enabled, rd/line, wr/inv ok
 vendor 0x1b21 product 0x1042 (USB serial bus, interface 0x30) at pci4 dev 0 function 0 not configured
 ppb4 at pci0 dev 28 function 6: vendor 0x8086 product 0x1c1c (rev. 0xb5)
 ppb4: unsupported PCI Express version
 pci5 at ppb4 bus 5
 pci5: i/o space, memory space enabled, rd/line, wr/inv ok
 re0 at pci5 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit Ethernet (rev. 0x06)
 re0: interrupting at ioapic0 pin 18
 re0: Ethernet address 54:04:a6:a4:1e:6b
 re0: using 256 tx descriptors
 rgephy0 at re0 phy 7: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 4
 rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
 ppb5 at pci0 dev 28 function 7: vendor 0x8086 product 0x244e (rev. 0xb5)
 ppb5: unsupported PCI Express version
 pci6 at ppb5 bus 6
 pci6: i/o space, memory space enabled, rd/line, wr/inv ok
 ehci1 at pci0 dev 29 function 0: vendor 0x8086 product 0x1c26 (rev. 0x05)
 ehci1: interrupting at ioapic0 pin 23
 ehci1: BIOS has given up ownership
 ehci1: EHCI version 1.0
 usb1 at ehci1: USB revision 2.0
 pcib0 at pci0 dev 31 function 0
 pcib0: vendor 0x8086 product 0x1c4a (rev. 0x05)
 ahcisata0 at pci0 dev 31 function 2: vendor 0x8086 product 0x1c02
 ahcisata0: interrupting at ioapic0 pin 20
 ahcisata0: AHCI revision 0x10300, 6 ports, 32 command slots, features 0xe730e040
 atabus2 at ahcisata0 channel 0
 atabus3 at ahcisata0 channel 5
 vendor 0x8086 product 0x1c22 (SMBus serial bus, revision 0x05) at pci0 dev 31 function 3 not configured
 isa0 at pcib0
 lpt0 at isa0 port 0x378-0x37b irq 7
 com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
 timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
 timecounter: Timecounter "TSC" frequency 3311354560 Hz quality 3000
 azalia0: codec[0]: 0x10ec/0x0892 (rev. 3.2), HDA rev. 1.0
 azalia0: codec[3]: 0x8086/0x2805 (rev. 0.0), HDA rev. 1.0
 audio0 at azalia0: full duplex, playback, capture, independent
 uhub0 at usb0: vendor 0x8086 EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
 uhub0: 2 ports with 2 removable, self powered
 wd0 at atabus0 drive 0: <WDC WD3200AAJB-00J3A0>
 wd0: drive supports 16-sector PIO transfers, LBA48 addressing
 wd0: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x 625142448 sectors
 uhub1 at usb1: vendor 0x8086 EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
 uhub1: 2 ports with 2 removable, self powered
 IPsec: Initialized Security Association Processing.
 ahcisata0 port 5: device present, speed: 1.5Gb/s
 ahcisata0 port 0: device present, speed: 6.0Gb/s
 wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
 wd1 at atabus0 drive 1: <WDC WD3200AAJB-00J3A0>
 wd1: drive supports 16-sector PIO transfers, LBA48 addressing
 wd1: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x 625142448 sectors
 wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
 wd2 at atabus2 drive 0: <ST31000524AS>
 wd2: drive supports 16-sector PIO transfers, LBA48 addressing
 wd2: 931 GB, 1938021 cyl, 16 head, 63 sec, 512 bytes/sect x 1953525168 sectors
 wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
 wd2(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
 atapibus0 at atabus3: 1 targets
 cd0 at atapibus0 drive 0: <HL-DT-ST DVDRAM GH22NS90, K5BB9112905, HN00S30> cdrom removable
 cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
 cd0(ahcisata0:5:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100) (using DMA)
 uhub2 at uhub0 port 1: vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2
 uhub2: single transaction translator
 uhub2: 6 ports with 6 removable, self powered
 uhub3 at uhub1 port 1: vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2
 uhub3: single transaction translator
 uhub3: 8 ports with 8 removable, self powered
 uhidev0 at uhub3 port 5 configuration 1 interface 0
 uhidev0: Logitech USB-PS/2 Optical Mouse, rev 2.00/11.10, addr 3, iclass 3/1
 ums0 at uhidev0: 3 buttons and Z dir
 wsmouse0 at ums0 mux 0
 Kernelized RAIDframe activated
 pad0: outputs: 44100Hz, 16-bit, stereo
 audio1 at pad0: half duplex, playback, capture
 boot device: wd2
 root on wd2a dumps on wd2b
 root file system type: ffs
 wsdisplay0: screen 1 added (80x25, vt100 emulation)
 wsdisplay0: screen 2 added (80x50, vt100 emulation)
 wsdisplay0: screen 3 added (80x50, vt100 emulation)
 wsdisplay0: screen 4 added (80x50, vt100 emulation)

 -- 
 Matt

From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/45928: Random freeze and interrupt/system storms
 (netbsd-5/amd64 and -current/amd64)
Date: Sun, 5 Feb 2012 18:08:06 -0500

 On Sun,  5 Feb 2012 07:40:04 +0000 (UTC)
 Matthew Mondor <mm_lists@pulsar-zone.net> wrote:

 Interestingly, yesterday on -current, music was playing from a radio
 stream, and when the system began to experience the problem mplayer
 could no longer fill its buffer.  I then tried to ping on the LAN
 without success.  Restarting the re0 interface (down/up) immediately
 stopped the problem and I could ping again.

 Afterwards I disabled the re0 interface and decided to leave the system
 running.  The problem so far hasn't occurred.  However, when I looked
 at rtl8169.c I couldn't find any obvious spl or locking issue so far...
 -- 
 Matt

From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/45928: Random freeze and interrupt/system storms
 (netbsd-5/amd64 and -current/amd64)
Date: Mon, 6 Feb 2012 00:51:42 -0500

 On Sun,  5 Feb 2012 23:10:05 +0000 (UTC)
 Matthew Mondor <mm_lists@pulsar-zone.net> wrote:

 >  Afterwards I disabled the re0 interface and decided to leave the system
 >  running.  The problem so far hasn't occurred.  However, when I looked
 >  at rtl8169.c I couldn't find any obvious spl or locking issue so far...

 And today I used netbsd-5 a while with re0 disabled, without issue
 whatsoever.  Then I enabled re0 and three times I could reproduce the
 issue and recover by quickly disabling the interface before the system
 was permanently frozen.  Interestingly, I also left a systat-vmstat
 command running and I could not really notice abnormal pin interrupt
 floods.  However, top will show two of the cores in "100% interrupt"
 and the two others in "100% system", as usual.
 -- 
 Matt

From: "Matthew Mondor" <mm_lists@pulsar-zone.net>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/45928: Random freeze and interrupt/system storms (netbsd-5/amd64 and -current/amd64)
Date: Mon, 6 Feb 2012 21:42:51 -0500 (EST)

 To verify that the hardware was not at fault, I have been running an
 Ubuntu live CD since yesterday and attempted to stress the network.
 No problems have occurred since.
 -- 
 Matthew Mondor

From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/45928: Random freeze and interrupt/system storms
 (netbsd-5/amd64 and -current/amd64)
Date: Tue, 21 Feb 2012 16:59:09 +0000

 On Mon, Feb 06, 2012 at 05:55:03AM +0000, Matthew Mondor wrote:
  >  >  Afterwards I disabled the re0 interface and decided to leave the system
  >  >  running.  The problem so far hasn't occurred.  However, when I looked
  >  >  at rtl8169.c I couldn't find any obvious spl or locking issue so far...
  >  
  >  And today I used netbsd-5 a while with re0 disabled, without issue
  >  whatsoever.  Then I enabled re0 and three times I could reproduce the
  >  issue and recover by quickly disabling the interface before the system
  >  was permanently frozen.  Interestingly, I also left a systat-vmstat
  >  command running and I could not really notice abnormal pin interrupt
  >  floods.  However, top will show two of the cores in "100% interrupt"
  >  and the two others in "100% system", as usual.

 The problem could be in ACPI; it might be worth testing with ACPI
 disabled.
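
 A hedged note on how this could be tried, assuming the standard x86
 boot loader on NetBSD 5 or later: interrupt the boot countdown, drop
 to the boot prompt, and pass the -2 flag, which disables ACPI for that
 boot only.  The kernel name netbsd is the usual default and is an
 assumption here.

 ```
 boot netbsd -2
 ```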

 -- 
 David A. Holland
 dholland@netbsd.org

From: Matthew Mondor <mm_lists@pulsar-zone.net>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/45928: Random freeze and interrupt/system storms
 (netbsd-5/amd64 and -current/amd64)
Date: Tue, 21 Feb 2012 15:52:50 -0500

 On Tue, 21 Feb 2012 17:00:09 +0000 (UTC)
 David Holland <dholland-bugs@netbsd.org> wrote:

 >  The problem could be in ACPI; it might be worth testing with ACPI
 >  disabled.

 Although I remember doing this in an earlier test, I could retest.
 So far, to mitigate this problem, I avoid the onboard re(4) interface
 and use an add-on PCI Ethernet card instead.

 Another issue which affected performance and stability, albeit not as
 drastically as re(4), involves the onboard IDE/PATA controller, which
 viaide(4) does not yet support (kern/45917).  PIO mode was used, and
 with transfers of ~1.2MB/s maximum on the disks attached to that
 adaptor, the system would slow down dramatically, with seemingly random
 unrelated processes accumulating a lot of CPU time.  Fortunately, I can
 for now force DMA using the 0x0001 pciide(4) flag, which makes the
 PATA drives much more usable.
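
 For reference, a sketch of how that flag is typically set in a custom
 kernel configuration file, assuming the stock pciide attachment line
 from GENERIC is otherwise unchanged.  pciide(4) warns that forcing DMA
 can hang the machine if the BIOS did not configure the controller
 properly.

 ```
 # Generic PCI IDE controllers; flags 0x0001 forces DMA (see pciide(4))
 pciide* at pci? dev ? function ? flags 0x0001
 ```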
 -- 
 Matt

