NetBSD Problem Report #57537

From mrg@eterna.com.au  Sat Jul 22 06:40:49 2023
Return-Path: <mrg@eterna.com.au>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id F1DCC1A923E
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 22 Jul 2023 06:40:48 +0000 (UTC)
Message-Id: <20230722063734.3D4E81590CF@splode.eterna.com.au>
Date: Sat, 22 Jul 2023 16:37:34 +1000 (AEST)
From: mrg@eterna.com.au
Reply-To: mrg@eterna.com.au
To: gnats-bugs@NetBSD.org
Subject: radeon drm hangs with multiple glxgears active
X-Send-Pr-Version: 3.95

>Number:         57537
>Category:       kern
>Synopsis:       radeon drm hangs with multiple glxgears active
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    riastradh
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jul 22 06:45:00 +0000 2023
>Closed-Date:    Wed Aug 02 13:09:07 +0000 2023
>Last-Modified:  Wed Aug 02 13:09:07 +0000 2023
>Originator:     matthew green
>Release:        NetBSD 10.99.6 amd64
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:

	-10 or -current amd64.

	radeon0 at pci1 dev 0 function 0: ATI Technologies Mobility Radeon HD 4670 (rev. 0x00)
	[ ... ]
	[drm] initializing kernel modesetting (RV730 0x1002:0x9488 0x1028:0x02FE 0x00).
	[ .. normal looking messages, no errors ]

>Description:

	on a system with a radeon 4670, running 8 concurrent glxgears
	pretty quickly has at least 3-5 of them lock up and stop
	spinning the gear.  (the system has 8 cpu threads.)

	crash(8) shows that each of these glxgears has 3 lwps.  two
	are in lwp_park(); for the 5 stuck instances right now, the
	remaining lwp has this kernel stack trace (all the same):

	crash> bt/a ffff8fb035453140
	trace: pid 1159 lid 1972 at 0xffff928243e406f0
	sleepq_block() at sleepq_block+0x166
	cv_wait_sig() at cv_wait_sig+0x55
	ww_mutex_lock_wait_sig() at ww_mutex_lock_wait_sig+0xab
	linux_ww_mutex_lock_interruptible() at linux_ww_mutex_lock_interruptible+0x1d3
	ttm_eu_reserve_buffers() at ttm_eu_reserve_buffers+0x1a6
	radeon_bo_list_validate() at radeon_bo_list_validate+0xab
	radeon_cs_ioctl() at radeon_cs_ioctl+0x975
	drm_ioctl() at drm_ioctl+0x260
	drm_ioctl_shim() at drm_ioctl_shim+0x45
	sys_ioctl() at sys_ioctl+0x5d3
	syscall() at syscall+0x1ae
	--- syscall (number 54) ---
	syscall+0x1ae:

>How-To-Repeat:

	boot on r600 system, run concurrent glxgears with
	"vblank_mode=1" in the environment.

>Fix:

>Release-Note:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/57537: radeon drm hangs with multiple glxgears active
Date: Sat, 22 Jul 2023 16:59:59 +1000

 i realised that i've been near this problem before, i've got
 a patch that printf()s instead of panic()s here, and it's
 firing in dmesg:

 ww_mutex_lock_wait_sig:408: nopanic: ww mutex class mismatch: 0xffffffff8109e4c0 != 0xffffffff804b89d6

 see patch below.  i'll work on getting more info (stack trace
 at the very least).


 .mrg.


 Index: sys/external/bsd/drm2/linux/linux_ww_mutex.c
 ===================================================================
 RCS file: /cvsroot/src/sys/external/bsd/drm2/linux/linux_ww_mutex.c,v
 retrieving revision 1.14
 diff -p -u -r1.14 linux_ww_mutex.c
 --- sys/external/bsd/drm2/linux/linux_ww_mutex.c	18 Mar 2022 23:33:41 -0000	1.14
 +++ sys/external/bsd/drm2/linux/linux_ww_mutex.c	22 Jul 2023 06:57:19 -0000
 @@ -398,9 +398,16 @@ ww_mutex_lock_wait_sig(struct ww_mutex *
  	KASSERT((mutex->wwm_state == WW_CTX) ||
  	    (mutex->wwm_state == WW_WANTOWN));
  	KASSERT(mutex->wwm_u.ctx != ctx);
 +#if 0
  	KASSERTMSG((ctx->wwx_class == mutex->wwm_u.ctx->wwx_class),
  	    "ww mutex class mismatch: %p != %p",
  	    ctx->wwx_class, mutex->wwm_u.ctx->wwx_class);
 +#else
 +	if (ctx->wwx_class != mutex->wwm_u.ctx->wwx_class)
 +		printf("%s:%d: nopanic: ww mutex class mismatch: %p != %p\n",
 +		    __func__, __LINE__,
 +		    ctx->wwx_class, mutex->wwm_u.ctx->wwx_class);
 +#endif
  	KASSERTMSG((mutex->wwm_u.ctx->wwx_ticket != ctx->wwx_ticket),
  	    "ticket number reused: %"PRId64" (%p) %"PRId64" (%p)",
  	    ctx->wwx_ticket, ctx,
 @@ -751,9 +758,16 @@ retry:	switch (mutex->wwm_state) {
  		 * Owned by a higher-priority party.  Tell the caller
  		 * to unlock everything and start over.
  		 */
 +#if 0
  		KASSERTMSG((ctx->wwx_class == mutex->wwm_u.ctx->wwx_class),
  		    "ww mutex class mismatch: %p != %p",
  		    ctx->wwx_class, mutex->wwm_u.ctx->wwx_class);
 +#else
 +		if (!(ctx->wwx_class == mutex->wwm_u.ctx->wwx_class))
 +			printf("%s:%d: nopanic: ww mutex class mismatch: %p != %p\n",
 +			    __func__, __LINE__,
 +			    ctx->wwx_class, mutex->wwm_u.ctx->wwx_class);
 +#endif
  		ret = -EDEADLK;
  		goto out_unlock;
  	}

Responsible-Changed-From-To: kern-bug-people->riastradh
Responsible-Changed-By: riastradh@NetBSD.org
Responsible-Changed-When: Sat, 29 Jul 2023 22:44:44 +0000
Responsible-Changed-Why:
my bug


State-Changed-From-To: open->needs-pullups
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sat, 29 Jul 2023 22:44:44 +0000
State-Changed-Why:
fix committed


From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57537 CVS commit: src/sys/external/bsd/drm2/linux
Date: Sat, 29 Jul 2023 22:43:56 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Sat Jul 29 22:43:56 UTC 2023

 Modified Files:
 	src/sys/external/bsd/drm2/linux: linux_ww_mutex.c

 Log Message:
 drm/linux_ww_mutex: Fix wait loops.

 If cv_wait_sig returns because a signal is delivered, we may
 nonetheless have been granted the lock.  It is harmless for us to
 ignore this fact in three of the four paths, but in
 ww_mutex_state_wait_sig, we may now have ownership of the lock and
 MUST NOT return failure because the caller MUST release the lock
 before destroying the ww_acquire_ctx.

 While here, restructure the other three loops for clarity, so they
 match the structure of the fourth and so they have a little less
 impenetrable negation.

 PR kern/57537

 XXX pullup-8
 XXX pullup-9
 XXX pullup-10


 To generate a diff of this commit:
 cvs rdiff -u -r1.14 -r1.15 src/sys/external/bsd/drm2/linux/linux_ww_mutex.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
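
 The failure mode described in the log message above can be sketched
 in user space.  This is a minimal, hypothetical model and not the
 code from linux_ww_mutex.c: toy_ctx, toy_mutex, toy_wakeup,
 toy_wait_sig and both toy_lock_* functions are invented for
 illustration, and the scripted wakeups only stand in for
 cv_wait_sig returning with or without a signal.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the NetBSD structures. */
struct toy_ctx { int id; };
struct toy_mutex { const struct toy_ctx *owner; /* NULL when unlocked */ };

/* One scripted wakeup from the toy wait primitive. */
struct toy_wakeup {
	bool signalled;	/* a signal was delivered during the wait */
	bool granted;	/* the lock was handed to us during the wait */
};

/*
 * Toy model of an interruptible wait.  Returns EINTR if a signal
 * arrived, else 0.  The lock may have been granted in either case.
 */
static int
toy_wait_sig(struct toy_mutex *m, const struct toy_ctx *self,
    const struct toy_wakeup *w)
{
	assert(m != NULL && self != NULL && w != NULL);
	if (w->granted)
		m->owner = self;
	return w->signalled ? EINTR : 0;
}

/* Buggy shape: a signal is reported as failure without re-checking. */
static int
toy_lock_buggy(struct toy_mutex *m, const struct toy_ctx *self,
    const struct toy_wakeup *script, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int error = toy_wait_sig(m, self, &script[i]);
		if (error)
			return -error;	/* may already own the lock */
		if (m->owner == self)
			return 0;
	}
	return -EAGAIN;	/* script exhausted; toy-only outcome */
}

/* Fixed shape: check ownership first, then look at the wait result. */
static int
toy_lock_fixed(struct toy_mutex *m, const struct toy_ctx *self,
    const struct toy_wakeup *script, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int error = toy_wait_sig(m, self, &script[i]);
		if (m->owner == self)
			return 0;	/* granted, even if signalled too */
		if (error)
			return -error;
	}
	return -EAGAIN;	/* script exhausted; toy-only outcome */
}
```

 With a wakeup that is both signalled and granted, the buggy shape
 returns -EINTR while the toy mutex is already owned by the caller,
 which is the state the log message says the caller cannot recover
 from.  The fixed shape returns 0 in that case and still returns
 -EINTR when the signal arrives without a grant.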

State-Changed-From-To: needs-pullups->pending-pullups
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sun, 30 Jul 2023 12:36:47 +0000
State-Changed-Why:
pullup-10 #298
(more work needed for netbsd-9 or netbsd-8)


From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57537 CVS commit: [netbsd-10] src/sys/external/bsd/drm2/linux
Date: Tue, 1 Aug 2023 16:53:19 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Aug  1 16:53:18 UTC 2023

 Modified Files:
 	src/sys/external/bsd/drm2/linux [netbsd-10]: linux_ww_mutex.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #298):

 	sys/external/bsd/drm2/linux/linux_ww_mutex.c: revision 1.15

 drm/linux_ww_mutex: Fix wait loops.

 If cv_wait_sig returns because a signal is delivered, we may
 nonetheless have been granted the lock.  It is harmless for us to
 ignore this fact in three of the four paths, but in
 ww_mutex_state_wait_sig, we may now have ownership of the lock and
 MUST NOT return failure because the caller MUST release the lock
 before destroying the ww_acquire_ctx.

 While here, restructure the other three loops for clarity, so they
 match the structure of the fourth and so they have a little less
 impenetrable negation.

 PR kern/57537


 To generate a diff of this commit:
 cvs rdiff -u -r1.14 -r1.14.4.1 \
     src/sys/external/bsd/drm2/linux/linux_ww_mutex.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57537 CVS commit: [netbsd-9] src/sys/external/bsd/drm2/linux
Date: Tue, 1 Aug 2023 17:26:29 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Aug  1 17:26:28 UTC 2023

 Modified Files:
 	src/sys/external/bsd/drm2/linux [netbsd-9]: linux_ww_mutex.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #1696):

 	sys/external/bsd/drm2/linux/linux_ww_mutex.c: revision 1.15

 drm/linux_ww_mutex: Fix wait loops.

 If cv_wait_sig returns because a signal is delivered, we may
 nonetheless have been granted the lock.  It is harmless for us to
 ignore this fact in three of the four paths, but in
 ww_mutex_state_wait_sig, we may now have ownership of the lock and
 MUST NOT return failure because the caller MUST release the lock
 before destroying the ww_acquire_ctx.

 While here, restructure the other three loops for clarity, so they
 match the structure of the fourth and so they have a little less
 impenetrable negation.

 PR kern/57537


 To generate a diff of this commit:
 cvs rdiff -u -r1.7.2.2 -r1.7.2.3 \
     src/sys/external/bsd/drm2/linux/linux_ww_mutex.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/57537 CVS commit: [netbsd-8] src/sys/external/bsd/drm2/linux
Date: Tue, 1 Aug 2023 17:29:15 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Aug  1 17:29:15 UTC 2023

 Modified Files:
 	src/sys/external/bsd/drm2/linux [netbsd-8]: linux_ww_mutex.c

 Log Message:
 Pull up following revision(s) (requested by riastradh in ticket #1876):

 	sys/external/bsd/drm2/linux/linux_ww_mutex.c: revision 1.15

 drm/linux_ww_mutex: Fix wait loops.

 If cv_wait_sig returns because a signal is delivered, we may
 nonetheless have been granted the lock.  It is harmless for us to
 ignore this fact in three of the four paths, but in
 ww_mutex_state_wait_sig, we may now have ownership of the lock and
 MUST NOT return failure because the caller MUST release the lock
 before destroying the ww_acquire_ctx.

 While here, restructure the other three loops for clarity, so they
 match the structure of the fourth and so they have a little less
 impenetrable negation.

 PR kern/57537


 To generate a diff of this commit:
 cvs rdiff -u -r1.2.10.5 -r1.2.10.6 \
     src/sys/external/bsd/drm2/linux/linux_ww_mutex.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: pending-pullups->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Wed, 02 Aug 2023 13:09:07 +0000
State-Changed-Why:
fixed and pulled up to 10, 9, and 8


>Unformatted:
