NetBSD Problem Report #56804

From www@netbsd.org  Mon Apr 25 20:04:04 2022
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 015931A9239
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 25 Apr 2022 20:04:04 +0000 (UTC)
Message-Id: <20220425200402.23C761A923A@mollari.NetBSD.org>
Date: Mon, 25 Apr 2022 20:04:02 +0000 (UTC)
From: prlw1@cam.ac.uk
Reply-To: prlw1@cam.ac.uk
To: gnats-bugs@NetBSD.org
Subject: panic: drm2 overreleasing kref
X-Send-Pr-Version: www-1.0

>Number:         56804
>Category:       kern
>Synopsis:       panic: drm2 overreleasing kref
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 25 20:05:00 +0000 2022
>Closed-Date:    Thu Sep 15 21:04:44 +0000 2022
>Last-Modified:  Thu Sep 15 21:04:44 +0000 2022
>Originator:     Patrick Welche
>Release:        NetBSD-9.99.96/amd64 of 24 April 2022
>Organization:
>Environment:
>Description:
(Oddly, AFAIK no one was logged in; only a pbulk run was in progress.)

System panicked: kernel diagnostic assertion "(count <= old)" failed: file "../../../../external/bsd/drm2/include/linux/kref.h", line 89 overreleasing kref: 0 - 1


#0  0xffffffff802229c5 in cpu_reboot (howto=howto@entry=260, 
    bootstr=bootstr@entry=0x0) at ../../../../arch/amd64/amd64/machdep.c:720
#1  0xffffffff80acfa94 in kern_reboot (howto=howto@entry=260, 
    bootstr=bootstr@entry=0x0) at ../../../../kern/kern_reboot.c:73
#2  0xffffffff80b1599d in vpanic (
    fmt=0xffffffff8101eab0 "kernel %sassertion \"%s\" failed: file \"%s\", line %d overreleasing kref: %u - %u", ap=ap@entry=0xffff9504afe04b68)
    at ../../../../kern/subr_prf.c:293
#3  0xffffffff80c8e97f in kern_assert (
    fmt=fmt@entry=0xffffffff8101eab0 "kernel %sassertion \"%s\" failed: file \"%s\", line %d overreleasing kref: %u - %u")
    at ../../../../../../lib/libkern/kern_assert.c:51
#4  0xffffffff80c3e715 in kref_sub (count=<optimized out>, 
    release=0xffffffff80c3d950 <ttm_bo_release>, kref=0xffff899b95809c60)
    at ../../../../external/bsd/drm2/include/linux/kref.h:89
#5  kref_put (release=0xffffffff80c3d950 <ttm_bo_release>, 
    kref=0xffff899b95809c60)
    at ../../../../external/bsd/drm2/include/linux/kref.h:140
#6  ttm_bo_put (bo=0xffff899b95809ac8)
    at ../../../../external/bsd/drm2/dist/drm/ttm/ttm_bo.c:702
#7  0xffffffff806dba39 in nouveau_bo_ref (pnvbo=<synthetic pointer>, ref=0x0)
    at ../../../../external/bsd/drm2/dist/drm/nouveau/nouveau_bo.h:73
#8  nouveau_gem_new (cli=<optimized out>, size=<optimized out>, 
    align=<optimized out>, domain=12, tile_mode=<optimized out>, 
    tile_flags=<optimized out>, pnvbo=pnvbo@entry=0xffff9504afe04c78)
    at ../../../../external/bsd/drm2/dist/drm/nouveau/nouveau_gem.c:211
#9  0xffffffff806dbaa3 in nouveau_gem_ioctl_new (dev=<optimized out>, 
    data=0xffff9504afe04de0, file_priv=0xffff899a5a964608)
    at ../../../../external/bsd/drm2/dist/drm/nouveau/nouveau_gem.c:278
#10 0xffffffff8094b248 in drm_ioctl (fp=<optimized out>, cmd=<optimized out>, 
    data=0xffff9504afe04de0)
    at ../../../../external/bsd/drm2/dist/drm/drm_ioctl.c:906
#11 0xffffffff80914bb4 in drm_ioctl_shim (fp=<optimized out>, 
    cmd=<optimized out>, data=<optimized out>)
    at ../../../../external/bsd/drm2/drm/drm_cdevsw.c:391
#12 0xffffffff80b27af1 in sys_ioctl (l=<optimized out>, 
    uap=0xffff9504afe04f00, retval=<optimized out>)
    at ../../../../kern/sys_generic.c:673
#13 0xffffffff803fe62e in sy_call (rval=0xffff9504afe04eb0, 
    uap=0xffff9504afe04f00, l=0xffff899a558c02c0, 
    sy=0xffffffff814641b0 <sysent+1296>) at ../../../../sys/syscallvar.h:65
#14 sy_invoke (code=54, rval=0xffff9504afe04eb0, uap=0xffff9504afe04f00, 
    l=0xffff899a558c02c0, sy=0xffffffff814641b0 <sysent+1296>)
    at ../../../../sys/syscallvar.h:94
#15 syscall (frame=0xffff9504afe04f00)
    at ../../../../arch/x86/x86/syscall.c:138
#16 0xffffffff8020867d in handle_syscall ()

(netbsd.18.core)
>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: prlw1@cam.ac.uk
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/56804: panic: drm2 overreleasing kref
Date: Wed, 25 May 2022 01:02:33 +0000

 If you still have the core dump, can you share dmesg, and print what
 `ret' is in frame #8?  I want to see how nouveau_bo_init failed.


 The immediate cause of this panic is that the error branches in
 nouveau_gem_new are broken:

 	ret = drm_gem_object_init(drm->dev, &nvbo->bo.base, size);
 	if (ret) {
 		nouveau_bo_ref(NULL, &nvbo);
 		return ret;
 	}

 	ret = nouveau_bo_init(nvbo, size, align, flags, NULL, NULL);
 	if (ret) {
 		nouveau_bo_ref(NULL, &nvbo);
 		return ret;
 	}

 The function nouveau_bo_ref(NULL, &nvbo) releases the reference to
 nvbo (and sets it to null), by doing ttm_bo_put(&nvbo->bo).  But
 ttm_bo_put isn't valid until ttm_bo_init has completed, and that
 doesn't run until nouveau_bo_init.
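
The over-release described above can be illustrated with a small user-space model. This is only a sketch: `model_kref`, `model_kref_init`, `model_kref_put` and `count_release` are hypothetical names, not the drm2 `kref.h` API, and the model returns false where the kernel assertion would panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for the kernel's struct kref. */
struct model_kref {
	unsigned count;		/* 0 until model_kref_init has run */
};

static int released;		/* how many times the release hook ran */

static void
model_kref_init(struct model_kref *k)
{
	k->count = 1;
}

/* Release hook: only ever runs once the count has reached zero. */
static void
count_release(struct model_kref *k)
{
	assert(k->count == 0);
	released++;
}

/*
 * Drop one reference.  Returns false, where the kernel would panic,
 * when the count is already zero: the "overreleasing kref: 0 - 1" case.
 */
static bool
model_kref_put(struct model_kref *k, void (*release)(struct model_kref *))
{
	if (k->count < 1) {
		fprintf(stderr, "overreleasing kref: %u - 1\n", k->count);
		return false;
	}
	if (--k->count == 0)
		release(k);
	return true;
}
```

A put on an object whose count was never initialized (or has already dropped to zero) takes the failing branch, which corresponds to what frame #4 of the backtrace reports.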

 Instead, this should maybe just use kfree (not sure if
 nv10_bo_put_tile_region is necessary here -- issued by
 nouveau_bo_del_ttm which is normally used by ttm in ttm_bo_put to free
 nvbo when the last reference is dropped).

 But none of this explains why we took this error branch in the first
 place.  Knowing what `ret' is might help to narrow down which
 branch of nouveau_bo_init -> ttm_bo_init -> ttm_bo_init_reserved
 failed.  If you can reproduce this, it might also be helpful to insert
 printfs in every branch of ttm_bo_init_reserved, and of its callees
 ttm_bo_validate/ttm_bo_move_buffer/ttm_bo_mem_space/ttm_bo_handle_move_mem,
 to see where it came from.
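
The printf instrumentation suggested above might look roughly like the following. This is a sketch under assumptions: `TRACE_ERR`, `trace_err` and `model_init_reserved` are hypothetical names made up for illustration, and the branches shown are stand-ins, not the real ttm_bo_init_reserved code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: report where an error was produced, then pass it on. */
#define TRACE_ERR(ret)	trace_err((ret), __func__, __LINE__)

static int trace_count;		/* number of failures reported so far */

static int
trace_err(int ret, const char *func, int line)
{
	assert(func != NULL);
	if (ret != 0) {
		trace_count++;
		fprintf(stderr, "%s:%d: ret=%d\n", func, line, ret);
	}
	return ret;
}

/* Stand-in for a callee with several failure branches. */
static int
model_init_reserved(int fail_alloc, int fail_validate)
{
	if (fail_alloc)
		return TRACE_ERR(-ENOMEM);
	if (fail_validate)
		return TRACE_ERR(-EINVAL);
	return TRACE_ERR(0);
}
```

Wrapping each return this way records the function and line of the branch that produced the error, so the failing branch can be read directly from the console output.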

State-Changed-From-To: open->feedback
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Wed, 25 May 2022 01:14:43 +0000
State-Changed-Why:
feedback requested


From: Patrick Welche <prlw1@talktalk.net>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: prlw1@cam.ac.uk, gnats-bugs@NetBSD.org
Subject: Re: kern/56804: panic: drm2 overreleasing kref
Date: Thu, 26 May 2022 12:21:20 +0100

 On Wed, May 25, 2022 at 01:02:33AM +0000, Taylor R Campbell wrote:
 > If you still have the core dump, can you share dmesg, and print what
 > `ret' is in frame #8?  I want to see how nouveau_bo_init failed.

 ret = -12  = -ENOMEM

 Looking at sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c ttm_bo_init_reserved()

 	ret = ttm_mem_global_alloc(mem_glob, acc_size, ctx);
 	if (ret) {
 		pr_err("Out of kernel memory\n");
 		if (destroy)
 			(*destroy)(bo);
 		else
 			kfree(bo);
 		return -ENOMEM;
 	}
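
One possible reading of the branch quoted above is that the callee disposes of the object itself before returning -ENOMEM, so a caller must not release it a second time. The sketch below models that contract in user space. All `model_*` names are hypothetical, and whether this matches the actual fix is an assumption, not something confirmed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct model_bo {
	unsigned refs;		/* stays 0 until init succeeds */
};

static int destroyed;		/* how many objects the destroy hook freed */

static void
model_destroy(struct model_bo *bo)
{
	destroyed++;
	free(bo);
}

/*
 * Stand-in for the quoted branch: on failure the object is handed to
 * the destroy callback (or freed) before -ENOMEM is returned.
 */
static int
model_init_reserved(struct model_bo *bo, int fail,
    void (*destroy)(struct model_bo *))
{
	if (fail) {
		if (destroy)
			(*destroy)(bo);
		else
			free(bo);
		return -ENOMEM;
	}
	bo->refs = 1;
	return 0;
}

/*
 * Caller that respects that contract: after a failed init the pointer
 * is dead, so it is neither freed nor "put" again.
 */
static int
model_gem_new(int fail, struct model_bo **pbo)
{
	struct model_bo *bo = calloc(1, sizeof(*bo));
	int ret;

	*pbo = NULL;
	if (bo == NULL)
		return -ENOMEM;
	ret = model_init_reserved(bo, fail, model_destroy);
	if (ret)
		return ret;	/* bo already destroyed by the callee */
	assert(bo->refs == 1);
	*pbo = bo;
	return 0;
}
```

In this model, a put after the failed init would operate on freed memory, which is one way a count of 0 could be observed at the time of the release.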

 In nouveau_bo_init(), where nvbo is a struct nouveau_bo *, acc_size is
 computed as

 	acc_size = ttm_bo_dma_acc_size(nvbo->bo.bdev, size, sizeof(*nvbo));

 which should be "small"?

 Something which has puzzled me on this box is that it has

 [   1.0000000] total memory = 65469 MB
 [   1.0000000] avail memory = 63426 MB

 yet it struggles to build (-j24) libreoffice with objdir on tmpfs,
 with processes stuck in biowait swapping. I am not sure I understand
 what "16G Inact" means in "top" as naively I am surprised that so
 little is free that swapping is necessary... When not building,
 e.g.  now, I see 740K Inact 37G Free which makes more sense.

 This panic had happened during a pbulk run.


 [   2.7572726] nouveau0: info: NVIDIA GK104 (0e4000a2)
 [   2.8772749] nouveau0: info: bios: version 80.04.09.00.01
 [   2.8872727] nouveau0: interrupting at msi8 vec 0 (nouveau0)
 [   2.8872727] nouveau0: info: fb: 2048 MiB GDDR5
 [   2.9672727] Zone  kernel: Available graphics memory: 22684942 KiB
 [   2.9672727] Zone   dma32: Available graphics memory: 2097152 KiB
 [   2.9813498] nouveau0: info: DRM: VRAM: 2048 MiB
 [   2.9813498] nouveau0: info: DRM: GART: 1048576 MiB
 [   2.9908122] nouveau0: info: DRM: TMDS table version 2.0
 [   2.9908122] nouveau0: info: DRM: DCB version 4.0
 [   3.0006277] nouveau0: info: DRM: DCB outp 00: 01000f02 00020030
 [   3.0006277] nouveau0: info: DRM: DCB outp 01: 02000f00 00000000
 [   3.0124439] nouveau0: info: DRM: DCB outp 02: 08011f82 00020030
 [   3.0183527] nouveau0: info: DRM: DCB outp 03: 02022f62 00020010
 [   3.0183527] nouveau0: info: DRM: DCB outp 04: 04833fb6 0f420010
 [   3.0301683] nouveau0: info: DRM: DCB outp 05: 04033f72 00020010
 [   3.0301683] nouveau0: info: DRM: DCB conn 00: 00001030
 [   3.0412027] nouveau0: info: DRM: DCB conn 01: 00020131
 [   3.0412027] nouveau0: info: DRM: DCB conn 02: 00010261
 [   3.0514536] nouveau0: info: DRM: DCB conn 03: 00002346
 [   3.0572741] nouveau0: info: DRM: MM: using COPY for buffer copies
 [   3.0572741] kern info: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
 [   3.0738254] kern info: [drm] Driver supports precise vblank timestamp query.
 [   3.1072742] nouveaufb0 at nouveau0
 [   3.1072742] kern info: [drm] Initialized nouveau 1.3.1 20120801 for nouveau0 on minor 0
 [   3.1213511] nouveaufb0: framebuffer at 0xe8260000, size 3840x2160, depth 32, stride 15360
 [   3.1672733] no data for est. mode 640x480x67

From: Tobias Nygren <tnn@NetBSD.org>
To: gnats-bugs@netbsd.org, Patrick Welche <prlw1@talktalk.net>
Cc: 
Subject: Re: kern/56804: panic: drm2 overreleasing kref
Date: Thu, 26 May 2022 14:34:19 +0200

 On Thu, 26 May 2022 11:25:01 +0000 (UTC)
 Patrick Welche <prlw1@talktalk.net> wrote:

 >  [   1.0000000] total memory = 65469 MB
 >  [   1.0000000] avail memory = 63426 MB
 >  
 >  yet it struggles to build (-j24) libreoffice with objdir on tmpfs,
 >  with processes stuck in biowait swapping.

 I think this is an unrelated bug in tmpfs related to how the
 libreoffice build writes files. If you look in df -h output it will
 say /tmp is out of space, but du -h says the libreoffice work directory
 only takes 25GB. Deleting the work directory reclaims all the space so
 tmpfs files seem to leak memory for the duration of their existence.

From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56804 CVS commit: src/sys/external/bsd/drm2/dist/drm/nouveau
Date: Tue, 31 May 2022 00:17:10 +0000

 Module Name:	src
 Committed By:	riastradh
 Date:		Tue May 31 00:17:10 UTC 2022

 Modified Files:
 	src/sys/external/bsd/drm2/dist/drm/nouveau: nouveau_gem.c

 Log Message:
 nouveau(4): Fix error branches in nouveau_gem_new.

 PR kern/56804


 To generate a diff of this commit:
 cvs rdiff -u -r1.13 -r1.14 \
     src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_gem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: feedback->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 15 Sep 2022 21:04:44 +0000
State-Changed-Why:
immediate issue fixed, underlying issue unclear, reopen if still an issue


>Unformatted:
