NetBSD Problem Report #56804
From www@netbsd.org Mon Apr 25 20:04:04 2022
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 015931A9239
for <gnats-bugs@gnats.NetBSD.org>; Mon, 25 Apr 2022 20:04:04 +0000 (UTC)
Message-Id: <20220425200402.23C761A923A@mollari.NetBSD.org>
Date: Mon, 25 Apr 2022 20:04:02 +0000 (UTC)
From: prlw1@cam.ac.uk
Reply-To: prlw1@cam.ac.uk
To: gnats-bugs@NetBSD.org
Subject: panic: drm2 overreleasing kref
X-Send-Pr-Version: www-1.0
>Number: 56804
>Category: kern
>Synopsis: panic: drm2 overreleasing kref
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Apr 25 20:05:00 +0000 2022
>Closed-Date: Thu Sep 15 21:04:44 +0000 2022
>Last-Modified: Thu Sep 15 21:04:44 +0000 2022
>Originator: Patrick Welche
>Release: NetBSD-9.99.96/amd64 of 24 April 2022
>Organization:
>Environment:
>Description:
(Oddly AFAIK no one was logged in, just a pbulk run was happening)
System panicked: kernel diagnostic assertion "(count <= old)" failed: file "../../../../external/bsd/drm2/include/linux/kref.h", line 89 overreleasing kref: 0 - 1
#0 0xffffffff802229c5 in cpu_reboot (howto=howto@entry=260,
bootstr=bootstr@entry=0x0) at ../../../../arch/amd64/amd64/machdep.c:720
#1 0xffffffff80acfa94 in kern_reboot (howto=howto@entry=260,
bootstr=bootstr@entry=0x0) at ../../../../kern/kern_reboot.c:73
#2 0xffffffff80b1599d in vpanic (
fmt=0xffffffff8101eab0 "kernel %sassertion \"%s\" failed: file \"%s\", line %d overreleasing kref: %u - %u", ap=ap@entry=0xffff9504afe04b68)
at ../../../../kern/subr_prf.c:293
#3 0xffffffff80c8e97f in kern_assert (
fmt=fmt@entry=0xffffffff8101eab0 "kernel %sassertion \"%s\" failed: file \"%s\", line %d overreleasing kref: %u - %u")
at ../../../../../../lib/libkern/kern_assert.c:51
#4 0xffffffff80c3e715 in kref_sub (count=<optimized out>,
release=0xffffffff80c3d950 <ttm_bo_release>, kref=0xffff899b95809c60)
at ../../../../external/bsd/drm2/include/linux/kref.h:89
#5 kref_put (release=0xffffffff80c3d950 <ttm_bo_release>,
kref=0xffff899b95809c60)
at ../../../../external/bsd/drm2/include/linux/kref.h:140
#6 ttm_bo_put (bo=0xffff899b95809ac8)
at ../../../../external/bsd/drm2/dist/drm/ttm/ttm_bo.c:702
#7 0xffffffff806dba39 in nouveau_bo_ref (pnvbo=<synthetic pointer>, ref=0x0)
at ../../../../external/bsd/drm2/dist/drm/nouveau/nouveau_bo.h:73
#8 nouveau_gem_new (cli=<optimized out>, size=<optimized out>,
align=<optimized out>, domain=12, tile_mode=<optimized out>,
tile_flags=<optimized out>, pnvbo=pnvbo@entry=0xffff9504afe04c78)
at ../../../../external/bsd/drm2/dist/drm/nouveau/nouveau_gem.c:211
#9 0xffffffff806dbaa3 in nouveau_gem_ioctl_new (dev=<optimized out>,
data=0xffff9504afe04de0, file_priv=0xffff899a5a964608)
at ../../../../external/bsd/drm2/dist/drm/nouveau/nouveau_gem.c:278
#10 0xffffffff8094b248 in drm_ioctl (fp=<optimized out>, cmd=<optimized out>,
data=0xffff9504afe04de0)
at ../../../../external/bsd/drm2/dist/drm/drm_ioctl.c:906
#11 0xffffffff80914bb4 in drm_ioctl_shim (fp=<optimized out>,
cmd=<optimized out>, data=<optimized out>)
at ../../../../external/bsd/drm2/drm/drm_cdevsw.c:391
#12 0xffffffff80b27af1 in sys_ioctl (l=<optimized out>,
uap=0xffff9504afe04f00, retval=<optimized out>)
at ../../../../kern/sys_generic.c:673
#13 0xffffffff803fe62e in sy_call (rval=0xffff9504afe04eb0,
uap=0xffff9504afe04f00, l=0xffff899a558c02c0,
sy=0xffffffff814641b0 <sysent+1296>) at ../../../../sys/syscallvar.h:65
#14 sy_invoke (code=54, rval=0xffff9504afe04eb0, uap=0xffff9504afe04f00,
l=0xffff899a558c02c0, sy=0xffffffff814641b0 <sysent+1296>)
at ../../../../sys/syscallvar.h:94
#15 syscall (frame=0xffff9504afe04f00)
at ../../../../arch/x86/x86/syscall.c:138
#16 0xffffffff8020867d in handle_syscall ()
(netbsd.18.core)
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
From: Taylor R Campbell <riastradh@NetBSD.org>
To: prlw1@cam.ac.uk
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/56804: panic: drm2 overreleasing kref
Date: Wed, 25 May 2022 01:02:33 +0000
If you still have the core dump, can you share dmesg, and print what
`ret' is in frame #8? I want to see how nouveau_bo_init failed.
The immediate cause of this panic is that the error branches in
nouveau_gem_new are broken:
	ret = drm_gem_object_init(drm->dev, &nvbo->bo.base, size);
	if (ret) {
		nouveau_bo_ref(NULL, &nvbo);
		return ret;
	}
	ret = nouveau_bo_init(nvbo, size, align, flags, NULL, NULL);
	if (ret) {
		nouveau_bo_ref(NULL, &nvbo);
		return ret;
	}
The function nouveau_bo_ref(NULL, &nvbo) releases the reference to
nvbo (and sets it to null), by doing ttm_bo_put(&nvbo->bo). But
ttm_bo_put isn't valid until ttm_bo_init has completed, and that
doesn't run until nouveau_bo_init.
Instead, this should maybe just use kfree (not sure if
nv10_bo_put_tile_region is necessary here; it is called by
nouveau_bo_del_ttm, which ttm normally uses in ttm_bo_put to free
nvbo when the last reference is dropped).
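As a rough userland illustration of the failure described above (not the
drm2 kref.h code; every model_* name is hypothetical), the sketch below
treats a reference count that has never been initialized as 0, matching
the "0 - 1" in the panic message, and shows an error path that frees the
object directly instead of dropping a reference it never took.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userland model; not the drm2 <linux/kref.h> code. */
struct model_kref {
	unsigned count;
};

struct model_obj {
	struct model_kref ref;
};

static void
model_kref_init(struct model_kref *k)
{
	k->count = 1;
}

/* True if one reference may be dropped, i.e. 1 <= count. */
static bool
model_kref_can_put(const struct model_kref *k)
{
	return 1 <= k->count;
}

/* Drop one reference; returns true if it was the last one. */
static bool
model_kref_put(struct model_kref *k)
{
	assert(model_kref_can_put(k));	/* stands in for the kernel assertion */
	return --k->count == 0;
}

/*
 * Allocate an object.  If an early step fails before the reference
 * count is initialized, free the memory directly: a put at that
 * point would be an overrelease.
 */
static struct model_obj *
model_obj_new(bool early_step_fails)
{
	struct model_obj *obj = calloc(1, sizeof(*obj));

	if (obj == NULL)
		return NULL;
	if (early_step_fails) {
		assert(!model_kref_can_put(&obj->ref));	/* count is still 0 */
		free(obj);
		return NULL;
	}
	model_kref_init(&obj->ref);
	return obj;
}
```

Calling model_kref_put on the zero-initialized object in the failing
branch would trip the assertion, which is the shape of the panic in this
report; whether plain kfree is sufficient in the real driver is the open
question raised above.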
But none of this explains why we took this error branch in the first
place. Knowing what `ret' is might help to narrow down which
branch of nouveau_bo_init -> ttm_bo_init -> ttm_bo_init_reserved
failed. If you can reproduce this, it might also be helpful to insert
printfs in every branch of ttm_bo_init_reserved, and of its callees
ttm_bo_validate/ttm_bo_move_buffer/ttm_bo_mem_space/ttm_bo_handle_move_mem,
to see where it came from.
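One possible way to instrument every error branch without writing a
separate printf by hand each time is a small wrapper macro around the
return value. The following is a userland sketch only: TRACE_ERR,
trace_err, trace_last and model_step are hypothetical names, and in the
kernel the fprintf would be the kernel printf instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Last trace message, kept so a caller can inspect it. */
static char trace_last[128];

/* Record and print where an error return came from, then pass ret through. */
static int
trace_err(const char *func, int line, int ret)
{
	assert(func != NULL);
	snprintf(trace_last, sizeof(trace_last), "%s:%d: ret=%d",
	    func, line, ret);
	fprintf(stderr, "%s\n", trace_last);
	return ret;
}

#define TRACE_ERR(ret)	trace_err(__func__, __LINE__, (ret))

/* Hypothetical callee with one error branch. */
static int
model_step(int fail)
{
	if (fail)
		return TRACE_ERR(-ENOMEM);	/* logs function and line */
	return 0;
}
```

Replacing each `return ret;' in an error branch with
`return TRACE_ERR(ret);' leaves the control flow unchanged while
identifying which branch fired.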
State-Changed-From-To: open->feedback
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Wed, 25 May 2022 01:14:43 +0000
State-Changed-Why:
feedback requested
From: Patrick Welche <prlw1@talktalk.net>
To: Taylor R Campbell <riastradh@NetBSD.org>
Cc: prlw1@cam.ac.uk, gnats-bugs@NetBSD.org
Subject: Re: kern/56804: panic: drm2 overreleasing kref
Date: Thu, 26 May 2022 12:21:20 +0100
On Wed, May 25, 2022 at 01:02:33AM +0000, Taylor R Campbell wrote:
> If you still have the core dump, can you share dmesg, and print what
> `ret' is in frame #8? I want to see how nouveau_bo_init failed.
ret = -12 = -ENOMEM
Looking at sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c ttm_bo_init_reserved()
	ret = ttm_mem_global_alloc(mem_glob, acc_size, ctx);
	if (ret) {
		pr_err("Out of kernel memory\n");
		if (destroy)
			(*destroy)(bo);
		else
			kfree(bo);
		return -ENOMEM;
	}
nouveau_bo_init():
	struct nouveau_bo *nvbo;
	acc_size = ttm_bo_dma_acc_size(nvbo->bo.bdev, size, sizeof(*nvbo));
should be "small"?
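One way to read the quoted ttm_bo_init_reserved branch: when it fails
there, it has already disposed of bo (through destroy or kfree), so the
caller no longer owns anything to release. The following is a userland
sketch of that ownership rule only; the model_* names are hypothetical
and it does not reproduce the actual ttm or nouveau code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct model_bo {
	int initialized;
};

/* Counts destroy calls so a double release would be visible. */
static int model_destroy_calls;

static void
model_destroy(struct model_bo *bo)
{
	model_destroy_calls++;
	free(bo);
}

/*
 * Same shape as the quoted branch: on failure the object is
 * destroyed here, before returning the error to the caller.
 */
static int
model_bo_init(struct model_bo *bo, int simulate_enomem,
    void (*destroy)(struct model_bo *))
{
	if (simulate_enomem) {
		if (destroy)
			(*destroy)(bo);
		else
			free(bo);
		return -ENOMEM;
	}
	bo->initialized = 1;
	return 0;
}

/* Caller: after a failed init, bo is gone and must not be touched. */
static int
model_bo_new(int simulate_enomem, struct model_bo **out)
{
	struct model_bo *bo;
	int ret;

	assert(out != NULL);
	*out = NULL;
	bo = calloc(1, sizeof(*bo));
	if (bo == NULL)
		return -ENOMEM;
	ret = model_bo_init(bo, simulate_enomem, model_destroy);
	if (ret)
		return ret;	/* no put, no free: init already did it */
	*out = bo;
	return 0;
}
```

In this model a second release in the caller's error branch would be a
double free; in the report, the analogous extra ttm_bo_put is what the
assertion caught.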
Something which has puzzled me on this box is that it has
[ 1.0000000] total memory = 65469 MB
[ 1.0000000] avail memory = 63426 MB
yet it struggles to build (-j24) libreoffice with objdir on tmpfs,
with processes stuck in biowait swapping. I am not sure I understand
what "16G Inact" means in "top" as naively I am surprised that so
little is free that swapping is necessary... When not building,
e.g. now, I see 740K Inact 37G Free which makes more sense.
This panic had happened during a pbulk run.
[ 2.7572726] nouveau0: info: NVIDIA GK104 (0e4000a2)
[ 2.8772749] nouveau0: info: bios: version 80.04.09.00.01
[ 2.8872727] nouveau0: interrupting at msi8 vec 0 (nouveau0)
[ 2.8872727] nouveau0: info: fb: 2048 MiB GDDR5
[ 2.9672727] Zone kernel: Available graphics memory: 22684942 KiB
[ 2.9672727] Zone dma32: Available graphics memory: 2097152 KiB
[ 2.9813498] nouveau0: info: DRM: VRAM: 2048 MiB
[ 2.9813498] nouveau0: info: DRM: GART: 1048576 MiB
[ 2.9908122] nouveau0: info: DRM: TMDS table version 2.0
[ 2.9908122] nouveau0: info: DRM: DCB version 4.0
[ 3.0006277] nouveau0: info: DRM: DCB outp 00: 01000f02 00020030
[ 3.0006277] nouveau0: info: DRM: DCB outp 01: 02000f00 00000000
[ 3.0124439] nouveau0: info: DRM: DCB outp 02: 08011f82 00020030
[ 3.0183527] nouveau0: info: DRM: DCB outp 03: 02022f62 00020010
[ 3.0183527] nouveau0: info: DRM: DCB outp 04: 04833fb6 0f420010
[ 3.0301683] nouveau0: info: DRM: DCB outp 05: 04033f72 00020010
[ 3.0301683] nouveau0: info: DRM: DCB conn 00: 00001030
[ 3.0412027] nouveau0: info: DRM: DCB conn 01: 00020131
[ 3.0412027] nouveau0: info: DRM: DCB conn 02: 00010261
[ 3.0514536] nouveau0: info: DRM: DCB conn 03: 00002346
[ 3.0572741] nouveau0: info: DRM: MM: using COPY for buffer copies
[ 3.0572741] kern info: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 3.0738254] kern info: [drm] Driver supports precise vblank timestamp query.
[ 3.1072742] nouveaufb0 at nouveau0
[ 3.1072742] kern info: [drm] Initialized nouveau 1.3.1 20120801 for nouveau0 on minor 0
[ 3.1213511] nouveaufb0: framebuffer at 0xe8260000, size 3840x2160, depth 32, stride 15360
[ 3.1672733] no data for est. mode 640x480x67
From: Tobias Nygren <tnn@NetBSD.org>
To: gnats-bugs@netbsd.org, Patrick Welche <prlw1@talktalk.net>
Cc:
Subject: Re: kern/56804: panic: drm2 overreleasing kref
Date: Thu, 26 May 2022 14:34:19 +0200
On Thu, 26 May 2022 11:25:01 +0000 (UTC)
Patrick Welche <prlw1@talktalk.net> wrote:
> [ 1.0000000] total memory = 65469 MB
> [ 1.0000000] avail memory = 63426 MB
>
> yet it struggles to build (-j24) libreoffice with objdir on tmpfs,
> with processes stuck in biowait swapping.
I think this is an unrelated bug in tmpfs related to how the
libreoffice build writes files. If you look in df -h output it will
say /tmp is out of space, but du -h says the libreoffice work directory
only takes 25GB. Deleting the work directory reclaims all the space so
tmpfs files seem to leak memory for the duration of their existence.
From: "Taylor R Campbell" <riastradh@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/56804 CVS commit: src/sys/external/bsd/drm2/dist/drm/nouveau
Date: Tue, 31 May 2022 00:17:10 +0000
Module Name: src
Committed By: riastradh
Date: Tue May 31 00:17:10 UTC 2022
Modified Files:
src/sys/external/bsd/drm2/dist/drm/nouveau: nouveau_gem.c
Log Message:
nouveau(4): Fix error branches in nouveau_gem_new.
PR kern/56804
To generate a diff of this commit:
cvs rdiff -u -r1.13 -r1.14 \
src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_gem.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: feedback->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 15 Sep 2022 21:04:44 +0000
State-Changed-Why:
immediate issue fixed, underlying issue unclear, reopen if still an issue
>Unformatted:
Copyright © 1994-2022
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.