NetBSD Problem Report #50448

From oster@fween.ca  Thu Nov 19 16:18:37 2015
Return-Path: <oster@fween.ca>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id A4DE3A668F
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 19 Nov 2015 16:18:37 +0000 (UTC)
Message-Id: <20151119150041.7E0318266B@quad.fween.ca>
Date: Thu, 19 Nov 2015 09:00:41 -0600 (CST)
From: oster@netbsd.org
Reply-To: oster@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: Xorg on 82G33/G31 Express Integrated Graphics Controller hangs
X-Send-Pr-Version: 3.95

>Number:         50448
>Category:       kern
>Synopsis:       Xorg on 82G33/G31 Express Integrated Graphics Controller hangs
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    riastradh
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Nov 19 16:20:00 +0000 2015
>Closed-Date:    Sat Jun 03 20:28:05 +0000 2023
>Last-Modified:  Sat Jun 03 20:28:05 +0000 2023
>Originator:     Greg Oster
>Release:        NetBSD 7.0
>Organization:
>Environment:


System: NetBSD quad 7.0 NetBSD 7.0 (QUAD) #0: Mon Sep 28 11:54:36 CST 2015 oster@quad:/u1/builds/build265/src/obj/amd64/u1/builds/build265/src/sys/arch/amd64/compile/QUAD amd64
Architecture: x86_64
Machine: amd64
>Description:

While attempting to use X11 on a:

 00:02.0 VGA compatible controller: Intel Corporation 82G33/G31
 Express Integrated Graphics Controller (rev 02) 

graphics chipset, the X11 display freezes.  This freeze can last from
seconds to minutes, to hours, to days.  A little kernel sleuthing has
determined that the hang is here:

(gdb) where
#0  0xffffffff805b0179 in mi_switch (l=l@entry=0xfffffe811b981560) at /u1/builds/build265/src/sys/kern/kern_synch.c:719
#1  0xffffffff805adcd2 in sleepq_block (timo=timo@entry=0, catch=catch@entry=false) at /u1/builds/build265/src/sys/kern/kern_sleepq.c:264
#2  0xffffffff805afdb1 in mtsleep (ident=0xffffffff810a3010 <uvmexp+16>, priority=priority@entry=516, wmesg=wmesg@entry=0xffffffff80d0f17a "age", 
    timo=<optimized out>, mtx=0xffffffff810a2fd0 <uvm_fpageqlock>) at /u1/builds/build265/src/sys/kern/kern_synch.c:210
#3  0xffffffff8093a188 in uvm_wait (wmsg=wmsg@entry=0xffffffff80d0f17a "age") at /u1/builds/build265/src/sys/uvm/uvm_pdaemon.c:162
#4  0xffffffff80929c87 in uao_get (uobj=0xfffffe810835dd78, offset=<optimized out>, pps=<optimized out>, npagesp=<optimized out>, centeridx=0, 
    access_type=<optimized out>, advice=2, flags=18) at /u1/builds/build265/src/sys/uvm/uvm_aobj.c:1013
#5  0xffffffff80936ff5 in uvm_obj_wirepages (uobj=uobj@entry=0xfffffe810835dd78, start=start@entry=0, end=end@entry=6815744, list=list@entry=0xfffffe8072839f30)
    at /u1/builds/build265/src/sys/uvm/uvm_object.c:142
#6  0xffffffff803886f0 in bus_dmamem_wire_uvm_object (start=0, tag=<optimized out>, alignment=4096, boundary=0, flags=1, rsegs=0xfffffe8072839f48, nsegs=1664, 
    segs=0xffff800008a41000, pages=0xfffffe8072839f30, size=6815744, uobj=0xfffffe810835dd78)
    at /u1/builds/build265/src/sys/external/bsd/drm2/include/drm/bus_dma_hacks.h:81
#7  i915_gem_object_get_pages_gtt (obj=0xfffffe8072839e00) at /u1/builds/build265/src/sys/external/bsd/drm2/dist/drm/i915/i915_gem.c:2580
#8  0xffffffff8038935f in i915_gem_object_get_pages (obj=0xfffffe8072839e00) at /u1/builds/build265/src/sys/external/bsd/drm2/dist/drm/i915/i915_gem.c:2761
#9  0xffffffff8038c628 in i915_gem_object_bind_to_vm (flags=1, alignment=<optimized out>, vm=0xffff800007416400, obj=0xfffffe8072839e00)
    at /u1/builds/build265/src/sys/external/bsd/drm2/dist/drm/i915/i915_gem.c:4012
#10 i915_gem_object_pin (obj=obj@entry=0xfffffe8072839e00, vm=0xffff800007416400, alignment=<optimized out>, flags=1)
    at /u1/builds/build265/src/sys/external/bsd/drm2/dist/drm/i915/i915_gem.c:4666
#11 0xffffffff809c3614 in i915_gem_execbuffer_reserve_vma (vma=vma@entry=0xfffffe8113164c48, need_reloc=need_reloc@entry=0xfffffe80426e6d0f, 
    ring=0xffff800007415560) at /u1/builds/build265/src/sys/external/bsd/drm2/dist/drm/i915/i915_gem_execbuffer.c:611
#12 0xffffffff809c3933 in i915_gem_execbuffer_reserve (ring=ring@entry=0xffff800007415560, vmas=vmas@entry=0xfffffe804b4c8a18, 
    need_relocs=need_relocs@entry=0xfffffe80426e6d0f) at /u1/builds/build265/src/sys/external/bsd/drm2/dist/drm/i915/i915_gem_execbuffer.c:750
#13 0xffffffff80390e4f in i915_gem_do_execbuffer (dev=dev@entry=0xfffffe81074d7808, file=file@entry=0xfffffe81133a4cc0, args=args@entry=0xfffffe80426e6df8, 
    exec=exec@entry=0xfffffe812ad2b708, data=0xfffffe80426e6df8) at /u1/builds/build265/src/sys/external/bsd/drm2/dist/drm/i915/i915_gem_execbuffer.c:1246
#14 0xffffffff8039209c in i915_gem_execbuffer2 (dev=0xfffffe81074d7808, data=0xfffffe80426e6df8, file=0xfffffe81133a4cc0)
    at /u1/builds/build265/src/sys/external/bsd/drm2/dist/drm/i915/i915_gem_execbuffer.c:1501
#15 0xffffffff802cd11c in drm_ioctl (fp=<optimized out>, cmd=<optimized out>, data=0xfffffe80426e6df8)
    at /u1/builds/build265/src/sys/external/bsd/drm2/drm/drm_drv.c:673
#16 0xffffffff80869419 in sys_ioctl (l=<optimized out>, uap=0xfffffe80426e6f00, retval=<optimized out>) at /u1/builds/build265/src/sys/kern/sys_generic.c:681
#17 0xffffffff8087386a in sy_call (rval=0xfffffe80426e6eb8, uap=0xfffffe80426e6f00, l=0xfffffe811b981560, sy=0xffffffff80fef5a0 <sysent+864>)
    at /u1/builds/build265/src/sys/sys/syscallvar.h:61
#18 sy_invoke (code=54, rval=0xfffffe80426e6eb8, uap=0xfffffe80426e6f00, l=0xfffffe811b981560, sy=0xffffffff80fef5a0 <sysent+864>)
    at /u1/builds/build265/src/sys/sys/syscallvar.h:85
#19 syscall (frame=0xfffffe80426e6f00) at /u1/builds/build265/src/sys/arch/x86/x86/syscall.c:156
#20 0xffffffff80100691 in Xsyscall ()
(gdb) 

When the kernel gets into this state, of course, one is unable to
connect to the Xorg process with gdb, and 'kill -9' has no effect.

Relevant dmesg output is:

agp0 at pchb0: G33-family chipset
agp0: detected 7164k stolen memory
agp0: aperture at 0xd0000000, size 0x10000000
i915drmkms0 at pci0 dev 2 function 0: vendor 0x8086 product 0x29c2 (rev. 0x02)
drm: Memory usable by graphics device = 1024M
drm: Supports vblank timestamp caching Rev 2 (21.10.2013).
drm: Driver supports precise vblank timestamp query.
i915drmkms0: interrupting at ioapic0 pin 16 (i915)
drm: initialized overlay support
intelfb0 at i915drmkms0
i915drmkms0: info: registered panic notifier
intelfb0: framebuffer at 0xffff800048f25000, size 1920x1200, depth 32, stride 7680
wsdisplay0 at intelfb0 kbdmux 1: console (default, vt100 emulation)
wsmux1: connecting to wsdisplay0


The older 6.1.1 kernel said:

agp0 at pchb0: detected 7164k stolen memory
agp0: aperture at 0xd0000000, size 0x10000000
vga0 at pci0 dev 2 function 0: vendor 0x8086 product 0x29c2 (rev. 0x02)
wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
i915drm0 at vga0: Intel G33
i915drm0: AGP at 0xd0000000 256MB
i915drm0: Initialized i915 1.6.0 20080730


>How-To-Repeat:
	Upgrade a perfectly working NetBSD 6.1.1/amd64 box to NetBSD
	7.0/amd64 and no longer be able to use the machine as a workstation.
>Fix:
	Please.  I'm happy to provide additional information, and/or
test patches.  The alternative is to revert back to NetBSD 6.1.5.

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->riastradh
Responsible-Changed-By: riastradh@NetBSD.org
Responsible-Changed-When: Mon, 14 Dec 2015 16:25:01 +0000
Responsible-Changed-Why:
mine


From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/50448: Xorg on 82G33/G31 Express Integrated Graphics
 Controller hangs
Date: Sun, 21 Feb 2016 16:52:15 -0600

 On Thu, 19 Nov 2015 16:20:00 +0000 (UTC)
 oster@netbsd.org wrote:

 > >How-To-Repeat:
 > 	Upgrade a perfectly working NetBSD 6.1.1/amd64 box to NetBSD
 > 	7.0/amd64 and no longer be able to use the machine as a
 > workstation.
 > >Fix:
 > 	Please.  I'm happy to provide additional information, and/or
 > test patches.  The alternative is to revert back to NetBSD 6.1.5.

 Anything I can do to help with this?  I just tried to login to my
 desktop after a reboot, and it 'hung' again...
 Later...

 Greg Oster

From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Subject: Re: kern/50448: Xorg on 82G33/G31 Express Integrated Graphics Controller hangs
Date: Mon, 22 Feb 2016 19:57:06 +0000

    Date: Sun, 21 Feb 2016 22:55:01 +0000 (UTC)
    From: Greg Oster <oster@netbsd.org>

    Anything I can do to help with this?  I just tried to login to my
    desktop after a reboot, and it 'hung' again...
    Later...

 Hi, Greg!  Sorry to have been so quiet about this -- my time for
 focussed debugging has been pretty limited.

 One immediate workaround you can try, just to make the machine useful
 to you again right now, is to disable the new DRM/KMS code and
 re-enable the old DRM(/UMS) code.  Something like this:

 i915drm* at drm?

 no i915drmkms*
 no intelfb*
 no radeon*
 no radeondrmkmsfb*
 no nouveau*
 no nouveaufb*

 This hasn't undergone much testing, but it is fairly likely that most
 of that code hasn't broken.


 I have two general hypotheses about what's going on here:


 1. The i915drmkms code has allocated too many pages to graphics
 buffers, and we haven't hooked up the mechanism by which the page
 daemon can tell i915drmkms to please relinquish a few pages that
 aren't terribly important, if it doesn't mind.  Hooking up this
 mechanism shouldn't be too hard -- just need to invent a shim for
 Linux `shrinkers' and teach the uvm page daemon to invoke it.
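
 As a rough illustration of the first hypothesis, here is a userland
 sketch of what such a shim could look like.  Every name in it
 (struct shrinker as laid out here, shrinker_register,
 shrinkers_reclaim, gfx_cache) is hypothetical and is not the Linux or
 NetBSD interface; it only shows the shape of the idea: drivers
 register a callback, and something playing the page daemon's role
 calls the callbacks when free pages run low.

```c
#include <stddef.h>

/* Hypothetical shim: these names are illustrative, not a real kernel API. */
struct shrinker {
	/* Try to release up to `want` pages; return how many were released. */
	size_t (*scan)(void *cookie, size_t want);
	void *cookie;
	struct shrinker *next;
};

struct shrinker *shrinkers;	/* list of registered reclaim callbacks */

void
shrinker_register(struct shrinker *s)
{
	s->next = shrinkers;
	shrinkers = s;
}

/* What a page daemon might call when it is short of free pages. */
size_t
shrinkers_reclaim(size_t want)
{
	size_t freed = 0;
	struct shrinker *s;

	for (s = shrinkers; s != NULL && freed < want; s = s->next)
		freed += s->scan(s->cookie, want - freed);
	return freed;
}

/* Toy stand-in for a driver's pool of unpinned graphics buffers. */
struct gfx_cache {
	size_t purgeable_pages;
};

size_t
gfx_cache_scan(void *cookie, size_t want)
{
	struct gfx_cache *c = cookie;
	size_t n = want < c->purgeable_pages ? want : c->purgeable_pages;

	c->purgeable_pages -= n;
	return n;
}
```

 A real shim would also have to decide which locks the callback may
 take, since it would run in the page daemon's context; the sketch
 ignores locking entirely.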


 2. The code path you quoted involves two locks and a wait that only
 releases one of them.  The two locks are the DRM/KMS dev->struct_mutex
 and the GEM/UVM object's vmobjlock:

 Many paths into the DRM/KMS code require exclusive access to the
 whole driver state, serialized by dev->struct_mutex, and I haven't
 checked but it wouldn't surprise me if this one holds it.

 When we do i915_gem_object_get_pages, which calls uvm_obj_wirepages ->
 uao_get, if there are no free pages, then we have to wait -- and
 although uao_get drops the vmobjlock in order to uvm_wait, it does not
 drop dev->struct_mutex.

 If it turns out that this code path is serialized by dev->struct_mutex
 (which it may not be), then we'd have to find some way to disentangle
 it, or some hack to persuade uao_get to drop dev->struct_mutex.
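
 The second hypothesis can be modelled in userland with pthreads.
 This is only a sketch under assumptions: gfx_dev, struct_lock,
 page_lock, gfx_get_pages and gfx_reclaim are made-up names, with
 struct_lock standing in for dev->struct_mutex and page_lock for the
 lock that the page wait sleeps on.  It shows one way a waiter could
 drop the outer lock as well as the inner one before sleeping, so that
 a reclaimer that needs the outer lock can make progress.  It is not
 the actual kernel code and not a proposed patch.

```c
#include <pthread.h>

/* Hypothetical userland model; these names are not from NetBSD. */
struct gfx_dev {
	pthread_mutex_t struct_lock;	/* stands in for dev->struct_mutex */
	pthread_mutex_t page_lock;	/* stands in for the page queue lock */
	pthread_cond_t page_cv;		/* signalled when pages are freed */
	unsigned free_pages;		/* protected by page_lock */
};

void
gfx_dev_init(struct gfx_dev *d, unsigned free_pages)
{
	pthread_mutex_init(&d->struct_lock, NULL);
	pthread_mutex_init(&d->page_lock, NULL);
	pthread_cond_init(&d->page_cv, NULL);
	d->free_pages = free_pages;
}

/* Take `need` pages.  Caller holds struct_lock, and holds it on return. */
void
gfx_get_pages(struct gfx_dev *d, unsigned need)
{
	pthread_mutex_lock(&d->page_lock);
	while (d->free_pages < need) {
		/* Drop the outer lock as well, so a reclaimer can take it. */
		pthread_mutex_unlock(&d->struct_lock);
		/* Atomically releases page_lock while asleep. */
		pthread_cond_wait(&d->page_cv, &d->page_lock);
		/* Retake both in the usual order: outer first, then inner. */
		pthread_mutex_unlock(&d->page_lock);
		pthread_mutex_lock(&d->struct_lock);
		pthread_mutex_lock(&d->page_lock);
	}
	d->free_pages -= need;
	pthread_mutex_unlock(&d->page_lock);
}

/* Reclaimer: needs struct_lock to give pages back, then wakes waiters. */
void
gfx_reclaim(struct gfx_dev *d, unsigned npages)
{
	pthread_mutex_lock(&d->struct_lock);
	pthread_mutex_lock(&d->page_lock);
	d->free_pages += npages;
	pthread_cond_broadcast(&d->page_cv);
	pthread_mutex_unlock(&d->page_lock);
	pthread_mutex_unlock(&d->struct_lock);
}
```

 If gfx_get_pages kept struct_lock across the wait, gfx_reclaim would
 block on it forever and the waiter would never be woken, which is the
 shape of the hang suspected here.  The cost of dropping the outer
 lock is that anything it protects may change during the sleep, so a
 real caller would have to revalidate its state afterwards.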


 These hints might be enough for someone else to do some analysis or
 experiments.  If you'd like to take a look, and want any more
 hand-holding, I'd be happy to give more hints.  I won't have time to
 prepare specific patches to test for a few days at best, though -- but
 feel free to ping me in a week if I haven't piped up again.

State-Changed-From-To: open->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sat, 03 Jun 2023 20:28:05 +0000
State-Changed-Why:
several drm updates since then, submitter no longer has hardware to test


>Unformatted:

Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.