NetBSD Problem Report #56826

From tih@hamartun.priv.no  Tue May 10 10:18:08 2022
Return-Path: <tih@hamartun.priv.no>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 6D88D1A9239
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 10 May 2022 10:18:08 +0000 (UTC)
Message-Id: <20220510101754.74C204DF20@thuvia.hamartun.priv.no>
Date: Tue, 10 May 2022 12:17:54 +0200 (CEST)
From: tih@hamartun.priv.no
Reply-To: tih@hamartun.priv.no
To: gnats-bugs@NetBSD.org
Subject: Kernel memory leak with Nvidia GPU
X-Send-Pr-Version: 3.95

>Number:         56826
>Category:       kern
>Synopsis:       Kernel memory leak with Nvidia GPU
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    mrg
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue May 10 10:20:00 +0000 2022
>Closed-Date:    Thu Sep 15 21:01:10 +0000 2022
>Last-Modified:  Thu Sep 15 21:01:10 +0000 2022
>Originator:     Tom Ivar Helbekkmo
>Release:        NetBSD 9.99.94
>Organization:

>Environment:


System: NetBSD thuvia.hamartun.priv.no 9.99.94 NetBSD 9.99.94 (THUVIA) #32: Thu Mar 17 03:29:34 CET 2022 root@barsoom.hamartun.priv.no:/usr/local/obj/amd64/usr/src/sys/arch/amd64/compile/THUVIA amd64
Architecture: x86_64
Machine: amd64
>Description:

I'm using NetBSD/amd64-current on a workstation with 4 GiB of RAM and
an Nvidia graphics board:

nouveau0 at pci1 dev 0 function 0: NVIDIA GeForce GTX 560 Ti (rev. 0xa1)
nouveau0: info: NVIDIA GF114 (0ce000a1)

In use with X, running Firefox 91 from pkgsrc, a few terminal windows,
and rdesktop 1.9.0 from pkgsrc, the kernel leaks memory fast enough
that I can only just get away with rebooting once per day.  It's fine
for the first few hours, but then it starts slowing down, swapping
more and more, until it becomes pretty much useless.  I've tried
keeping it going, to see if the kernel does anything to help itself
out of the memory crisis, but it doesn't: when I finally gave up and
told Firefox to quit, preparatory to rebooting, the system was still
swapping intensely the next day, with Firefox still trying to exit.

The memory leak consists of kmem-04096 objects, shown by 'vmstat -m'
to be steadily getting allocated, but never freed.

It seems that any update to the screen image needs these allocations.
Text scrolling in a terminal window will eat them, slowly.  The highest
rate I've seen is when I start a film playing on YouTube, which leads
to a couple of hundred such allocations per second.
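
To put "a couple of hundred such allocations per second" in perspective,
here is a back-of-the-envelope helper.  It assumes exactly 200
allocations per second, each taking one 4096-byte kmem-04096 object that
is never freed; both numbers are assumptions for illustration, not
measurements.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes lost if `rate` objects of `objsize` bytes are allocated every
 * second for `seconds` seconds and none are ever freed.  Illustrative
 * arithmetic only; the inputs are assumed, not measured.
 */
static uint64_t
leaked_bytes(uint64_t rate, uint64_t seconds, uint64_t objsize)
{
	return rate * seconds * objsize;
}
```

leaked_bytes(200, 3600, 4096) is 2949120000 bytes, roughly 2.7 GiB per
hour, so sustained video playback alone would plausibly push a 4 GiB
machine into heavy swapping within a couple of hours.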

Trying to debug this back in January, I ran some dtrace tests (on the
advice of mrg@).  First, below, is a run showing how kernel memory
gets allocated over time.  Next, a very short run, showing what
happens when I just move the pointer over a link in Firefox, causing
it to be highlighted.

I'm also attaching my current /var/run/dmesg.boot, at the bottom.

-tih

# dtrace -n 'fbt::kmem_intr_alloc:entry /arg0 >= 768 && arg0 < 4096/ { @["alloc", stack()] = count() } fbt::kmem_intr_free:entry /arg0 >= 768 && arg0 < 4096/ { @["free", stack()] = count() }'

[...]

  alloc                                             
              netbsd`kmem_intr_zalloc+0x12
              netbsd`ttm_tt_alloc_page_directory+0x2f
              netbsd`ttm_dma_tt_init+0x29
              netbsd`nouveau_sgdma_create_ttm+0x67
              netbsd`ttm_tt_create+0x9d
              netbsd`ttm_bo_handle_move_mem+0x7b3
              netbsd`ttm_bo_validate+0x14e
              netbsd`ttm_bo_init_reserved+0x35a
              netbsd`ttm_bo_init+0x65
              netbsd`nouveau_bo_init+0xb0
              netbsd`nouveau_gem_new+0x8e
              netbsd`nouveau_gem_ioctl_new+0x4a
              netbsd`drm_ioctl+0x219
              netbsd`drm_ioctl_shim+0x25
              netbsd`sys_ioctl+0x56d
              netbsd`syscall+0x196
              netbsd`handle_syscall+0x2d
             2001
  alloc                                             
              netbsd`kmem_zalloc+0x4e
              netbsd`nvkm_mem_new_type+0x30b
              netbsd`nvkm_umem_new+0x142
              netbsd`nvkm_ioctl_new+0x1cd
              netbsd`nvkm_ioctl+0x100
              netbsd`nvif_object_init+0xff
              netbsd`nvif_mem_init_type+0x97
              netbsd`nouveau_mem_host+0xe9
              netbsd`nv50_sgdma_bind+0x18
              netbsd`ttm_tt_bind+0x4c
              netbsd`ttm_bo_handle_move_mem+0x73c
              netbsd`ttm_bo_validate+0x14e
              netbsd`ttm_bo_init_reserved+0x35a
              netbsd`ttm_bo_init+0x65
              netbsd`nouveau_bo_init+0xb0
              netbsd`nouveau_gem_new+0x8e
              netbsd`nouveau_gem_ioctl_new+0x4a
              netbsd`drm_ioctl+0x219
              netbsd`drm_ioctl_shim+0x25
              netbsd`sys_ioctl+0x56d
             2001
  alloc                                             
              netbsd`kmem_intr_zalloc+0x12
              netbsd`nouveau_bo_alloc+0xa2
              netbsd`nouveau_gem_new+0x54
              netbsd`nouveau_gem_ioctl_new+0x4a
              netbsd`drm_ioctl+0x219
              netbsd`drm_ioctl_shim+0x25
              netbsd`sys_ioctl+0x56d
              netbsd`syscall+0x196
              netbsd`handle_syscall+0x2d
             2055

...and later...

[...]

  alloc                                             
              netbsd`kmem_zalloc+0x4e
              netbsd`nvkm_mem_new_type+0x30b
              netbsd`nvkm_umem_new+0x142
              netbsd`nvkm_ioctl_new+0x1cd
              netbsd`nvkm_ioctl+0x100
              netbsd`nvif_object_init+0xff
              netbsd`nvif_mem_init_type+0x97
              netbsd`nouveau_mem_host+0xe9
              netbsd`nv50_sgdma_bind+0x18
              netbsd`ttm_tt_bind+0x4c
              netbsd`ttm_bo_handle_move_mem+0x73c
              netbsd`ttm_bo_validate+0x14e
              netbsd`ttm_bo_init_reserved+0x35a
              netbsd`ttm_bo_init+0x65
              netbsd`nouveau_bo_init+0xb0
              netbsd`nouveau_gem_new+0x8e
              netbsd`nouveau_gem_ioctl_new+0x4a
              netbsd`drm_ioctl+0x219
              netbsd`drm_ioctl_shim+0x25
              netbsd`sys_ioctl+0x56d
            36645
  alloc                                             
              netbsd`kmem_intr_zalloc+0x12
              netbsd`ttm_tt_alloc_page_directory+0x2f
              netbsd`ttm_dma_tt_init+0x29
              netbsd`nouveau_sgdma_create_ttm+0x67
              netbsd`ttm_tt_create+0x9d
              netbsd`ttm_bo_handle_move_mem+0x7b3
              netbsd`ttm_bo_validate+0x14e
              netbsd`ttm_bo_init_reserved+0x35a
              netbsd`ttm_bo_init+0x65
              netbsd`nouveau_bo_init+0xb0
              netbsd`nouveau_gem_new+0x8e
              netbsd`nouveau_gem_ioctl_new+0x4a
              netbsd`drm_ioctl+0x219
              netbsd`drm_ioctl_shim+0x25
              netbsd`sys_ioctl+0x56d
              netbsd`syscall+0x196
              netbsd`handle_syscall+0x2d
            36656
  alloc                                             
              netbsd`kmem_intr_zalloc+0x12
              netbsd`nouveau_bo_alloc+0xa2
              netbsd`nouveau_gem_new+0x54
              netbsd`nouveau_gem_ioctl_new+0x4a
              netbsd`drm_ioctl+0x219
              netbsd`drm_ioctl_shim+0x25
              netbsd`sys_ioctl+0x56d
              netbsd`syscall+0x196
              netbsd`handle_syscall+0x2d
            37093

...leading to:

: tih@thuvia:~; vmstat -m | head -2; vmstat -m | grep kmem-
Memory resource pool statistics
Name        Size Requests Fail Releases  Pgreq Pgrel  Npage  Hiwat Minpg Maxpg Idle
[...]
kmem-00768   832     3380    0      327    764     0    764    764     0   inf    0
kmem-01024  1088     6881    0     1511   1882    85   1797   1797     0   inf    7
kmem-02048  2048     3725    0      817   1584   122   1462   1462     0   inf    8
kmem-04096  4096   135241    0       79 135162     0 135162 135162     0   inf    0
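
The kmem-04096 row can be read directly: objects still outstanding are
Requests minus Releases, and the pool's footprint is Npage times the
pool page size.  The sketch below does that arithmetic for the rows
shown.  It assumes a 4096-byte pool page, which fits the data (for
kmem-04096 the outstanding count equals Npage, one object per page) but
is not stated in the report.

```c
#include <assert.h>
#include <stdint.h>

/* One row of `vmstat -m` output, keeping only the columns used here. */
struct pool_row {
	const char *name;
	uint64_t size;      /* object size in bytes */
	uint64_t requests;  /* total allocations */
	uint64_t releases;  /* total frees */
	uint64_t npage;     /* pool pages currently held */
};

/* Objects allocated but not yet freed. */
static uint64_t
outstanding(const struct pool_row *r)
{
	return r->requests - r->releases;
}

/* Memory held by the pool, ASSUMING 4096-byte pool pages. */
static uint64_t
pool_bytes(const struct pool_row *r)
{
	return r->npage * 4096;
}

/* The four rows quoted above. */
const struct pool_row vmstat_rows[] = {
	{ "kmem-00768",  832,   3380,  327,    764 },
	{ "kmem-01024", 1088,   6881, 1511,   1797 },
	{ "kmem-02048", 2048,   3725,  817,   1462 },
	{ "kmem-04096", 4096, 135241,   79, 135162 },
};
```

For kmem-04096 this gives 135241 - 79 = 135162 live objects in 135162
pages, about 528 MiB, on a machine reporting 3959 MB of total memory.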


Next, here's a log taken while just highlighting a link on a web page
(and possibly getting a short mouseover text; I don't recall now).  In
any case, I remember that about half the allocations came when the
pointer was moved over the link, and the other half when I moved it
away:

Script started on Mon Jan 24 12:16:05 2022
: thuvia# ;dtrace -n 'fbt::kmem_intr_alloc:entry /arg0 >= 2048 && arg0 < 4096/ { @["alloc", stack()] = count() } fbt::kmem_intr_free:entry /arg0 >= 2048 && arg0 < 4096/ { @["free", stack()] = count() }'
dtrace: description 'fbt::kmem_intr_alloc:entry ' matched 2 probes
dtrace: buffer size lowered to 1m
^C

  alloc                                             
              netbsd`kmem_alloc+0x48
              netbsd`amap_alloc1+0xe4
              netbsd`amap_alloc+0x3c
              netbsd`amap_copy+0x2a3
              netbsd`uvm_fault_internal+0x1a59
              netbsd`trap+0x480
              netbsd`alltraps+0xc3
                1
  alloc                                             
              netbsd`kmem_zalloc+0x4e
              netbsd`sys___fstatvfs190+0x23
              netbsd`syscall+0x196
              netbsd`handle_syscall+0x2d
               15
  alloc                                             
              netbsd`kmem_intr_zalloc+0x12
              netbsd`ttm_tt_alloc_page_directory+0x2f
              netbsd`ttm_dma_tt_init+0x29
              netbsd`nouveau_sgdma_create_ttm+0x67
              netbsd`ttm_tt_create+0x9d
              netbsd`ttm_bo_handle_move_mem+0x7b3
              netbsd`ttm_bo_validate+0x14e
              netbsd`ttm_bo_init_reserved+0x35a
              netbsd`ttm_bo_init+0x65
              netbsd`nouveau_bo_init+0xb0
              netbsd`nouveau_gem_new+0x8e
              netbsd`nouveau_gem_ioctl_new+0x4a
              netbsd`drm_ioctl+0x219
              netbsd`drm_ioctl_shim+0x25
              netbsd`sys_ioctl+0x56d
              netbsd`syscall+0x196
              netbsd`handle_syscall+0x2d
               53
  alloc                                             
              netbsd`kmem_zalloc+0x4e
              netbsd`nvkm_mem_new_type+0x30b
              netbsd`nvkm_umem_new+0x142
              netbsd`nvkm_ioctl_new+0x1cd
              netbsd`nvkm_ioctl+0x100
              netbsd`nvif_object_init+0xff
              netbsd`nvif_mem_init_type+0x97
              netbsd`nouveau_mem_host+0xe9
              netbsd`nv50_sgdma_bind+0x18
              netbsd`ttm_tt_bind+0x4c
              netbsd`ttm_bo_handle_move_mem+0x73c
              netbsd`ttm_bo_validate+0x14e
              netbsd`ttm_bo_init_reserved+0x35a
              netbsd`ttm_bo_init+0x65
              netbsd`nouveau_bo_init+0xb0
              netbsd`nouveau_gem_new+0x8e
              netbsd`nouveau_gem_ioctl_new+0x4a
              netbsd`drm_ioctl+0x219
              netbsd`drm_ioctl_shim+0x25
              netbsd`sys_ioctl+0x56d
               53
: thuvia# ;^D

Script done on Mon Jan 24 12:18:22 2022


Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
    2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
    2018, 2019, 2020, 2021, 2022
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 9.99.94 (THUVIA) #32: Thu Mar 17 03:29:34 CET 2022
	root@barsoom.hamartun.priv.no:/usr/local/obj/amd64/usr/src/sys/arch/amd64/compile/THUVIA
total memory = 3959 MB
avail memory = 3813 MB
timecounter: Timecounters tick every 10.000 msec
Kernelized RAIDframe activated
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
mainbus0 (root)
ACPI: RSDP 0x00000000000F9E00 000024 (v02 HPQOEM)
ACPI: XSDT 0x00000000D3780100 000064 (v01 HPQOEM SLIC-CPC 20091221 MSFT 00000097)
ACPI: FACP 0x00000000D3780290 0000F4 (v04 HPQOEM SLIC-CPC 20091221 MSFT 00000097)
Firmware Warning (ACPI): 32/64X length mismatch in FADT/Gpe0Block: 128/64 (20211217/tbfadt-640)
ACPI: DSDT 0x00000000D37805E0 005AB5 (v02 HPQOEM SLIC-CPC 00000007 INTL 20051117)
ACPI: FACS 0x00000000D378E000 000040
ACPI: APIC 0x00000000D3780390 00008C (v02 HPQOEM SLIC-CPC 20091221 MSFT 00000097)
ACPI: MCFG 0x00000000D3780420 00003C (v01 HPQOEM SLIC-CPC 20091221 MSFT 00000097)
ACPI: SLIC 0x00000000D3780460 000176 (v01 HPQOEM SLIC-CPC 00000001 MSFT 00000001)
ACPI: OEMB 0x00000000D378E040 000072 (v01 HPQOEM SLIC-CPC 20091221 MSFT 00000097)
ACPI: HPET 0x00000000D378A680 000038 (v01 HPQOEM SLIC-CPC 20091221 MSFT 00000097)
ACPI: GSCI 0x00000000D378E0C0 002024 (v01 HPQOEM SLIC-CPC 20091221 MSFT 00000097)
ACPI: SSDT 0x00000000D3791830 000363 (v01 HPQOEM SLIC-CPC 00000012 INTL 20051117)
ACPI: 2 ACPI AML tables successfully acquired and loaded
ioapic0 at mainbus0 apid 6: pa 0xfec00000, version 0x20, 24 pins
cpu0 at mainbus0 apid 0
cpu0: Use lfence to serialize rdtsc
cpu0: Intel(R) Core(TM) i3 CPU         530  @ 2.93GHz, id 0x20652
cpu0: node 0, package 0, core 0, smt 0
cpu1 at mainbus0 apid 4
cpu1: Intel(R) Core(TM) i3 CPU         530  @ 2.93GHz, id 0x20652
cpu1: node 0, package 0, core 2, smt 0
cpu2 at mainbus0 apid 1
cpu2: Intel(R) Core(TM) i3 CPU         530  @ 2.93GHz, id 0x20652
cpu2: node 0, package 0, core 0, smt 1
cpu3 at mainbus0 apid 5
cpu3: Intel(R) Core(TM) i3 CPU         530  @ 2.93GHz, id 0x20652
cpu3: node 0, package 0, core 2, smt 1
acpi0 at mainbus0: Intel ACPICA 20211217
acpi0: X/RSDT: OemId <HPQOEM,SLIC-CPC,20091221>, AslId <MSFT,00000097>
acpi0: MCFG: segment 0, bus 0-255, address 0x00000000e0000000
ACPI: Dynamic OEM Table Load:
ACPI: SSDT 0xFFFFD28007750000 001238 (v01 HPQOEM SLIC-CPC 00000011 INTL 20051117)
ACPI: Dynamic OEM Table Load:
ACPI: SSDT 0xFFFFAF07EFFBD008 0004F4 (v01 HPQOEM SLIC-CPC 00003001 INTL 20051117)
acpi0: SCI interrupting at int 9
acpi0: fixed power button present
timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
hpet0 at acpi0: high precision event timer (mem 0xfed00000-0xfed00400)
timecounter: Timecounter "hpet0" frequency 14318180 Hz quality 2000
IOH (PNP0C01) at acpi0 not configured
attimer1 at acpi0 (TMR, PNP0100): io 0x40-0x43 irq 0
pcppi1 at acpi0 (SPKR, PNP0800): io 0x61
spkr0 at pcppi1: PC Speaker
wsbell at spkr0 not configured
sysbeep0 at pcppi1
RMEM (PNP0C01) at acpi0 not configured
acpibut0 at acpi0 (PWRB, PNP0C0C-170): ACPI Power Button
ACPI: Enabled 2 GPEs in block 00 to 3F
attimer1: attached to pcppi1
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0: Intel Iron Lake Host Bridge (rev. 0x12)
agp0 at pchb0autoconfiguration error: : can't find internal VGA config space
ppb0 at pci0 dev 1 function 0: Intel Core PCIe Root Port (rev. 0x12)
ppb0: PCI Express capability version 2 <Root Port of PCI-E Root Complex> x8 @ 5.0GT/s
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
nouveau0 at pci1 dev 0 function 0: NVIDIA GeForce GTX 560 Ti (rev. 0xa1)
hdaudio0 at pci1 dev 0 function 1: HD Audio Controller
hdaudio0: interrupting at msi0 vec 0
hdaudio0: HDA ver. 1.0, OSS 2, ISS 4, BSS 0, SDO 2, 64-bit
hdafg0 at hdaudio0: vendor 10de product 0016
hdafg0: DP00 8ch: Digital Out [Jack]
hdafg0: 8ch/0ch 48000Hz PCM16*
hdafg1 at hdaudio0: vendor 10de product 0016
hdafg1: DP00 8ch: Digital Out [Jack]
hdafg1: 8ch/0ch 48000Hz PCM16*
hdafg2 at hdaudio0: vendor 10de product 0016
hdafg2: DP00 8ch: Digital Out [Jack]
hdafg2: 8ch/0ch 48000Hz PCM16*
hdafg3 at hdaudio0: vendor 10de product 0016
hdafg3: DP00 8ch: Digital Out [Jack]
hdafg3: 8ch/0ch 48000Hz PCM16*
Intel 3400 MEI (miscellaneous communications, revision 0x06) at pci0 dev 22 function 0 not configured
ehci0 at pci0 dev 26 function 0: Intel 3400 USB ECHI (rev. 0x06)
ehci0: 64-bit DMA
ehci0: interrupting at ioapic0 pin 16
ehci0: EHCI version 1.0
ehci0: Using DMA subregion for control data structures
usb0 at ehci0: USB revision 2.0
hdaudio1 at pci0 dev 27 function 0: HD Audio Controller
hdaudio1: interrupting at msi1 vec 0
hdaudio1: HDA ver. 1.0, OSS 4, ISS 4, BSS 0, SDO 1, 64-bit
hdafg4 at hdaudio1: vendor 10ec product 0888
hdafg4: DAC00 8ch: Speaker [Jack]
hdafg4: DAC01 2ch: HP Out [Jack]
hdafg4: DIG02 2ch: SPDIF Out [Jack]
hdafg4: ADC03 2ch: Mic In [Jack]
hdafg4: ADC04 2ch: Line In [Jack]
hdafg4: 8ch/2ch 44100Hz 48000Hz 96000Hz 192000Hz PCM16 PCM20 PCM24 AC3
audio0 at hdafg4: playback, capture, full duplex, independent
audio0: slinear_le:16 2ch 48000Hz, blk 1920 bytes (10ms) for playback
audio0: slinear_le:16 2ch 48000Hz, blk 1920 bytes (10ms) for recording
spkr1 at audio0: PC Speaker (synthesized)
wsbell at spkr1 not configured
ppb1 at pci0 dev 28 function 0: Intel 3400 PCIe (rev. 0x06)
ppb1: PCI Express capability version 2 <Root Port of PCI-E Root Complex> x1 @ 2.5GT/s
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
re0 at pci2 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit Ethernet (rev. 0x03)
re0: interrupting at msix2 vec 0
re0: RTL8168D/8111D (0x2800)
re0: Ethernet address 40:61:86:92:20:33
re0: using 256 tx descriptors
rgephy0 at re0 phy 7: RTL8211B 1000BASE-T media interface
rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
ppb2 at pci0 dev 28 function 2: Intel 3400 PCIe (rev. 0x06)
ppb2: PCI Express capability version 2 <Root Port of PCI-E Root Complex> x1 @ 2.5GT/s
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled, rd/line, wr/inv ok
fwohci0 at pci3 dev 0 function 0: VIA Technologies product 3403 (rev. 0x00)
fwohci0: interrupting at ioapic0 pin 18
fwohci0: OHCI version 1.10 (ROM=1)
fwohci0: No. of Isochronous channels is 4.
fwohci0: EUI64 00:10:dc:00:01:a4:92:c5
fwohci0: Phy 1394a available S400, 2 ports.
fwohci0: Link S400, max_rec 2048 bytes.
ieee1394if0 at fwohci0: IEEE1394 bus
fwip0 at ieee1394if0: IP over IEEE1394
fwohci0: Initiate bus reset
ehci1 at pci0 dev 29 function 0: Intel 3400 USB EHCI (rev. 0x06)
ehci1: 64-bit DMA
ehci1: interrupting at ioapic0 pin 23
ehci1: EHCI version 1.0
ehci1: Using DMA subregion for control data structures
usb1 at ehci1: USB revision 2.0
ppb3 at pci0 dev 30 function 0: Intel 82801BA Hub-PCI Bridge (rev. 0xa6)
pci4 at ppb3 bus 4
pci4: i/o space, memory space enabled
ichlpcib0 at pci0 dev 31 function 0: Intel H57 LPC Interface Bridge (rev. 0x06)
timecounter: Timecounter "ichlpcib0" frequency 3579545 Hz quality 1000
ichlpcib0: 24-bit timer
tco0 at ichlpcib0: TCO (watchdog) timer configured.
tco0: Min/Max interval 1/367 seconds
ahcisata0 at pci0 dev 31 function 2: Intel 82801H/C6[12]x/X99/Z170/[ZQH]270 RAID SATA Controller (rev. 0x06)
ahcisata0: 64-bit DMA
ahcisata0: AHCI revision 1.30, 6 ports, 32 slots, CAP 0xef20ff65<SXS,EMS,PSC,SSC,PMD,ISS=0x2=Gen2,SCLO,SAL,SALP,SSS,SSNTF,SNCQ,S64A>
ahcisata0: interrupting at msi3 vec 0
atabus0 at ahcisata0 channel 0
atabus1 at ahcisata0 channel 1
atabus2 at ahcisata0 channel 2
atabus3 at ahcisata0 channel 3
atabus4 at ahcisata0 channel 4
atabus5 at ahcisata0 channel 5
ichsmb0 at pci0 dev 31 function 3: Intel 3400 SMBus (rev. 0x06)
ichsmb0: interrupting at ioapic0 pin 18
iic0 at ichsmb0: I2C bus
isa0 at ichlpcib0
acpicpu0 at cpu0: ACPI CPU
acpicpu0: C1: FFH, lat   1 us, pow  1000 mW
acpicpu0: C2: FFH, lat  17 us, pow   500 mW
acpicpu0: C3: FFH, lat  17 us, pow   350 mW
acpicpu0: P0: FFH, lat  10 us, pow 73000 mW, 2933 MHz
acpicpu0: P1: FFH, lat  10 us, pow 65000 mW, 2800 MHz
acpicpu0: P2: FFH, lat  10 us, pow 58000 mW, 2667 MHz
acpicpu0: P3: FFH, lat  10 us, pow 53000 mW, 2533 MHz
acpicpu0: P4: FFH, lat  10 us, pow 48000 mW, 2400 MHz
acpicpu0: P5: FFH, lat  10 us, pow 44000 mW, 2267 MHz
acpicpu0: P6: FFH, lat  10 us, pow 39000 mW, 2133 MHz
acpicpu0: P7: FFH, lat  10 us, pow 36000 mW, 2000 MHz
acpicpu0: P8: FFH, lat  10 us, pow 33000 mW, 1867 MHz
acpicpu0: P9: FFH, lat  10 us, pow 30000 mW, 1733 MHz
acpicpu0: P10: FFH, lat  10 us, pow 28000 mW, 1600 MHz
acpicpu0: P11: FFH, lat  10 us, pow 26000 mW, 1467 MHz
acpicpu0: P12: FFH, lat  10 us, pow 24000 mW, 1333 MHz
acpicpu0: P13: FFH, lat  10 us, pow 23000 mW, 1200 MHz
coretemp0 at cpu0: thermal sensor, 1 C resolution, Tjmax=105
acpicpu1 at cpu1: ACPI CPU
coretemp1 at cpu1: thermal sensor, 1 C resolution, Tjmax=105
acpicpu2 at cpu2: ACPI CPU
acpicpu3 at cpu3: ACPI CPU
fwohci0: BUS reset
fwohci0: node_id=0xc800ffc0, gen=1, CYCLEMASTER mode
ieee1394if0: 1 nodes, maxhop <= 0 cable IRM irm(0) (me)
ieee1394if0: bus manager 0
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
timecounter: Timecounter "TSC" frequency 2926181000 Hz quality 3000
IPsec: Initialized Security Association Processing.
aes: Intel SSSE3 vpaes
chacha: x86 SSE2 ChaCha
uhub0 at usb0: NetBSD (0x0000) EHCI root hub (0x0000), class 9/0, rev 2.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhub1 at usb1: NetBSD (0x0000) EHCI root hub (0x0000), class 9/0, rev 2.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
ahcisata0 port 3: device present, speed: 3.0Gb/s
ahcisata0 port 2: device present, speed: 1.5Gb/s
uhub2 at uhub0 port 1: vendor 8087 (0x8087) product 0020 (0x0020), class 9/0, rev 2.00/0.00, addr 2
uhub2: single transaction translator
uhub3 at uhub1 port 1: vendor 8087 (0x8087) product 0020 (0x0020), class 9/0, rev 2.00/0.00, addr 2
uhub3: single transaction translator
uhub2: 6 ports with 6 removable, self powered
uhub3: 8 ports with 8 removable, self powered
atapibus0 at atabus2: 1 targets
cd0 at atapibus0 drive 0: <TSSTcorp CDDVDW SH-224GB, S1DU6YDH2003PQ, SB00> cdrom removable
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
cd0(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd0 at atabus3 drive 0
wd0: <ST3750528AS>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 698 GB, 1453521 cyl, 16 head, 63 sec, 512 bytes/sect x 1465149168 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100), WRITE DMA FUA, NCQ (32 tags)
wd0(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100) (using DMA), NCQ (31 tags)
uplcom0 at uhub3 port 3
uplcom0: Prolific Technology Inc. (0x067b) USB-Serial Controller D (0x2303), rev 1.10/4.00, addr 3
uhub4 at uhub2 port 3: vendor 05e3 (0x05e3) USB2.0 Hub (0x0608), class 9/0, rev 2.00/77.64, addr 3
uhub4: single transaction translator
ucom0 at uplcom0
uhub4: 4 ports with 4 removable, self powered
umass0 at uhub3 port 4 configuration 1 interface 0
umass0: Western Digital (0x1058) Elements SE 25FE (0x25fe), rev 2.10/10.21, addr 4
umass0: using SCSI over Bulk-Only
scsibus0 at umass0: 2 targets, 2 luns per target
sd0 at scsibus0 target 0 lun 0: <WD, Elements SE 25FE, 1021> disk fixed
sd0: fabricating a geometry
sd0: 931 GB, 953837 cyl, 64 head, 32 sec, 512 bytes/sect x 1953458176 sectors
sd0: fabricating a geometry
sd0: GPT GUID: 2dd60a4f-adf1-4a90-8a54-89a0913c9102
dk0 at sd0: "Elements SE", 1953454080 blocks at 2048, type: ffs
ses0 at scsibus0 target 0 lun 1: <WD, SES Device, 1021> enclosure services fixed
ses0: SCSI-3 SES Device
umidi0 at uhub4 port 3 configuration 1 interface 0
umidi0: Roland (0x0582) UM-ONE (0x012a), rev 1.10/1.00, addr 4
umidi0: (Fixed Endpoint)
umidi0: out=1, in=1
midi0 at umidi0: <0 >0 on umidi0
uaudio0 at uhub3 port 5 configuration 1 interface 0
uaudio0: Burr-Brown from TI (0x08bb) USB Audio CODEC (0x2902), rev 1.10/1.00, addr 5
uaudio0: audio rev 1.00
audio1 at uaudio0: playback, capture, full duplex, independent
audio1: slinear_le:16 2ch 48000Hz, blk 11520 bytes (60ms) for playback
audio1: slinear_le:16 2ch 11025Hz, blk 2880 bytes (65.3ms) for recording
spkr2 at audio1: PC Speaker (synthesized)
wsbell at spkr2 not configured
uhidev0 at uhub3 port 5 configuration 1 interface 3
uhidev0: Burr-Brown from TI (0x08bb) USB Audio CODEC (0x2902), rev 1.10/1.00, addr 5, iclass 3/0
uhid0 at uhidev0: input=1, output=0, feature=0
uhub5 at uhub4 port 4: vendor 05e3 (0x05e3) USB2.0 Hub (0x0608), class 9/0, rev 2.00/77.64, addr 5
uhub5: single transaction translator
uhub5: 4 ports with 4 removable, self powered
uhub6 at uhub3 port 6: vendor 05e3 (0x05e3) USB2.0 Hub (0x0608), class 9/0, rev 2.00/77.64, addr 6
uhub6: single transaction translator
uhub6: 4 ports with 4 removable, self powered
uhidev1 at uhub5 port 1 configuration 1 interface 0
uhidev1: vendor 047d (0x047d) Kensington Expert Mouse (0x1020), rev 2.00/1.06, addr 6, iclass 3/1
ums0 at uhidev1: 5 buttons and Z dir
wsmouse0 at ums0 mux 0
uhub7 at uhub6 port 4: vendor 05e3 (0x05e3) USB2.0 Hub (0x0608), class 9/0, rev 2.00/77.64, addr 7
uhub7: single transaction translator
uhub7: 4 ports with 4 removable, self powered
umidi1 at uhub5 port 2 configuration 1 interface 0
umidi1: Roland (0x0582) UM-ONE (0x012a), rev 1.10/1.10, addr 7
umidi1: (Fixed Endpoint)
umidi1: out=1, in=1
midi1 at umidi1: <0 >0 on umidi1
umidi2 at uhub5 port 4 configuration 1 interface 0
umidi2: Roland (0x0582) UM-ONE (0x012a), rev 1.10/1.00, addr 8
umidi2: (Fixed Endpoint)
umidi2: out=1, in=1
midi2 at umidi2: <0 >0 on umidi2
umass1 at uhub3 port 7 configuration 1 interface 0
umass1: Generic (0x058f) Mass Storage Device (0x6362), rev 2.00/1.00, addr 8
umass1: using SCSI over Bulk-Only
scsibus1 at umass1: 2 targets, 4 luns per target
sd1 at scsibus1 target 0 lun 0: <Generic-, SD/MMC, 1.00> disk removable
sd1: drive offline
sd2 at scsibus1 target 0 lun 1: <Generic-, Compact Flash, 1.01> disk removable
sd2: drive offline
sd3 at scsibus1 target 0 lun 2: <Generic-, SM/xD-Picture, 1.02> disk removable
sd3: drive offline
sd4 at scsibus1 target 0 lun 3: <Generic-, MS/MS-Pro, 1.03> disk removable
sd4: drive offline
uhidev2 at uhub2 port 6 configuration 1 interface 0
uhidev2: ZSA Technology Labs (0x3297) ErgoDox EZ (0x4974), rev 1.10/0.01, addr 9, iclass 3/1
ukbd0 at uhidev2
wskbd0 at ukbd0: console keyboard
uhidev3 at uhub2 port 6 configuration 1 interface 1
uhidev3: ZSA Technology Labs (0x3297) ErgoDox EZ (0x4974), rev 1.10/0.01, addr 9, iclass 3/0
uhidev3: 5 report ids
ums1 at uhidev3 reportid 2: 8 buttons, W and Z dirs
wsmouse1 at ums1 mux 0
uhid1 at uhidev3 reportid 3: input=2, output=0, feature=0
uhid2 at uhidev3 reportid 4: input=2, output=0, feature=0
ukbd1 at uhidev3 reportid 5
wskbd1 at ukbd1 mux 1
WARNING: 1 error while detecting hardware; check system log.
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
kern.module.path=/stand/amd64/9.99.94/modules
nouveau0: info: NVIDIA GF114 (0ce000a1)
nouveau0: info: bios: version 70.24.21.00.02
nouveau0: interrupting at msi4 vec 0 (nouveau0)
nouveau0: info: fb: 1024 MiB GDDR5
Zone  kernel: Available graphics memory: 968514 KiB
nouveau0: info: DRM: VRAM: 1024 MiB
nouveau0: info: DRM: GART: 1048576 MiB
nouveau0: info: DRM: TMDS table version 2.0
nouveau0: info: DRM: DCB version 4.0
nouveau0: info: DRM: DCB outp 00: 02000300 00000000
nouveau0: info: DRM: DCB outp 01: 01000302 00020030
nouveau0: info: DRM: DCB outp 02: 04011380 00000000
nouveau0: info: DRM: DCB outp 03: 08011382 00020030
nouveau0: info: DRM: DCB outp 04: 02022362 00020010
nouveau0: info: DRM: DCB conn 00: 00001030
nouveau0: info: DRM: DCB conn 01: 00010130
nouveau0: info: DRM: DCB conn 02: 00002261
nouveau0: info: DRM: MM: using COPY0 for buffer copies
kern info: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
kern info: [drm] Driver supports precise vblank timestamp query.
nouveaufb0 at nouveau0
kern info: [drm] Initialized nouveau 1.3.1 20120801 for nouveau0 on minor 0
nouveaufb0: framebuffer at 0xd80a0000, size 1920x1080, depth 32, stride 7680
nouveau0: autoconfiguration error: error: DRM: core notifier timeout
no data for est. mode 640x480x67
wsdisplay0 at nouveaufb0 kbdmux 1: console (default, vt100 emulation), using wskbd0
wsmux1: connecting to wsdisplay0
wskbd1: connecting to wsdisplay0
entropy: ready
no data for est. mode 640x480x67
wsdisplay0: screen 1 added (default, vt100 emulation)
no data for est. mode 640x480x67
wsdisplay0: screen 2 added (default, vt100 emulation)
no data for est. mode 640x480x67
wsdisplay0: screen 3 added (default, vt100 emulation)
no data for est. mode 640x480x67
wsdisplay0: screen 4 added (default, vt100 emulation)

>How-To-Repeat:

>Fix:

Unknown
>Release-Note:

>Audit-Trail:
From: matthew green <mrg@eterna.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: re: kern/56826: Kernel memory leak with Nvidia GPU
Date: Tue, 31 May 2022 13:31:41 +1000

 with Taylor's new kmem probes and a little investigation i've
 narrowed down at least two leaks i believe.

 nvkm_mem_new_host() has two paths that allocate "struct nvkm_mem
 *mem":  one that copies a dmamap from the caller via ioctl-like
 arguments, or, one that creates one freshly.  both of them are
 set up to use the same dtor via nvkm_mem_dma.dtor, set to
 nvkm_mem_dtor(), which only does something if "mem->mem" is
 non-NULL.  however, only the second path sets "mem->mem", so this
 dtor call does not free the allocation, and we leak it.
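
The shape of the first leak can be modelled in a few lines.  The sketch
below is a toy, not nouveau code: struct toy_mem and the toy_* functions
are hypothetical, with "mem" and "dma" standing in for the "mem->mem"
and "mem->dma" fields (the later commit message identifies "mem->dma" as
the object that leaks).  A counter replaces the kmem pool so the leak is
observable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for struct nvkm_mem. */
struct toy_mem {
	void *mem;  /* set only on the freshly-created path */
	void *dma;  /* per-segment array, allocated on both paths */
};

static int live_allocs;  /* crude stand-in for the kmem pool */

static void *
toy_alloc(size_t n)
{
	void *p = calloc(1, n);
	if (p != NULL)
		live_allocs++;
	return p;
}

static void
toy_free(void *p)
{
	if (p != NULL) {
		live_allocs--;
		free(p);
	}
}

/* Path 1 (ioctl-like, borrowed dmamap): allocates dma, leaves mem NULL. */
static void
toy_new_borrowed(struct toy_mem *m, size_t nseg)
{
	m->mem = NULL;
	m->dma = toy_alloc(nseg * sizeof(void *));
}

/* Path 2 (fresh): allocates both. */
static void
toy_new_fresh(struct toy_mem *m, size_t nseg)
{
	m->mem = toy_alloc(64);
	m->dma = toy_alloc(nseg * sizeof(void *));
}

/* Everything is gated on mem != NULL, so path 1 never frees dma. */
static void
toy_dtor_buggy(struct toy_mem *m)
{
	if (m->mem == NULL)
		return;
	toy_free(m->dma);
	toy_free(m->mem);
	m->dma = m->mem = NULL;
}
```

As noted above, simply freeing unconditionally triggered crashes in the
real driver; the toy only shows why gating the whole destructor on a
field that one constructor never sets leaks whatever that constructor
did allocate.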

 a simple hack to free is triggering crashes so it's not quite as
 simple as it may appear here.

 the second leak appears to be in the second path, which does a whole
 bus_dma setup phase:  bus_dmamem_alloc(), bus_dmamap_create(),
 bus_dmamap_load_raw().  however, there's no bus_dmamap_destroy()
 that isn't in this setup error path, so, allocations created here
 are also leaking i believe (i haven't confirmed this yet.)
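
The second suspected leak is an unbalanced create/destroy pair.  In the
toy model below, toy_map_create() and toy_map_destroy() are hypothetical
stand-ins for bus_dmamap_create() and bus_dmamap_destroy() that only
count live maps; they are not the bus_dma(9) API.  The setup routine
destroys the map only on its error path, as described above, so every
successful setup leaves one map behind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int live_maps;  /* maps created but not yet destroyed */

struct toy_map {
	int nsegs;
};

static struct toy_map *
toy_map_create(int nsegs)
{
	struct toy_map *m = malloc(sizeof(*m));
	if (m != NULL) {
		m->nsegs = nsegs;
		live_maps++;
	}
	return m;
}

static void
toy_map_destroy(struct toy_map *m)
{
	live_maps--;
	free(m);
}

/* Create, then "load"; the map is destroyed only if loading fails. */
static struct toy_map *
toy_setup(int nsegs, bool load_fails)
{
	struct toy_map *m = toy_map_create(nsegs);
	if (m == NULL)
		return NULL;
	if (load_fails) {
		toy_map_destroy(m);  /* the only destroy call */
		return NULL;
	}
	return m;
}

/* Teardown mirroring the leak: nothing destroys the map. */
static void
toy_teardown_leaky(struct toy_map *m)
{
	(void)m;
}
```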

 for the second one, i'm wondering why we'd be doing create and
 destroy so frequently -- shouldn't we set up a busdma in device
 init, use it from then on, and destroy it in device tear down?


 .mrg.

From: matthew green <mrg@eterna.com.au>
To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, gnats-bugs@netbsd.org
Cc: 
Subject: re: kern/56826: Kernel memory leak with Nvidia GPU
Date: Tue, 31 May 2022 14:18:45 +1000

 OK, i think i figured it out.

 when nvkm_mem_new_host() is called via the in-kernel ioctl method
 it passes the dmamap in via "args->v0.dma", and we borrow this
 dmamap for this memory.  (i don't claim to understand what this is
 really doing.)  in this case, we don't call bus_dmamap_create(),
 so someone else owns this dmamap, and it can be destroyed before
 the dtor for this memory is called.  this means that by the time
 it's called for this memory, "mem->dmamap" is invalid and can't
 be safely used.  fortunately, "mem->nseg" already holds the right
 value in the case that calls _create(), and so copying dm_nsegs
 into it in the non-_create() case gives the size needed to free
 mem->dma.

 additionally, the bus_dmamap_create() in the non-ioctl path here
 is never destroyed.  this is the second leak.

    https://www.netbsd.org/~mrg/nouveau.leak.diff

 works for me.  i worry about the dmamap borrowing and it still
 being accessed after being freed by the real owner, but this code
 is so very opaque and layered i have no idea.


 .mrg.

From: "matthew green" <mrg@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/56826 CVS commit: src/sys/external/bsd/drm2/dist/drm/nouveau/nvkm/subdev/mmu
Date: Tue, 31 May 2022 20:53:35 +0000

 Module Name:	src
 Committed By:	mrg
 Date:		Tue May 31 20:53:35 UTC 2022

 Modified Files:
 	src/sys/external/bsd/drm2/dist/drm/nouveau/nvkm/subdev/mmu:
 	    nouveau_nvkm_subdev_mmu_mem.c

 Log Message:
 reorganise most of the NetBSD portion of nvkm_mem_dtor().

 when nvkm_mem_new_host() is called via the in-kernel ioctl method,
 we copy the supplied dmamap, use its dm_nsegs value for allocation
 of "mem->dma", and assume it remains valid until we're done.

 when this path is taken "mem->mem" remains NULL so all the code in
 nvkm_mem_dtor() is ignored, and the "mem->dma" is leaked.  this is
 one leak seen in PR#56826.  as "dmamap->dm_nsegs" can become invalid
 before the dtor call, store the value in "mem->nseg" for use in the
 dtor, and convert the dtor to free "mem->dma" if "mem->dma" is set.
 additionally, "mem->pages" should end up being the same value as
 "nseg" here; ASSERT() this.

 while here properly mark NetBSD specific code in nvkm_mem_new_host().

 additionally, destroy the dmamap created in the non-ioctl path of
 nvkm_mem_new_host().  this is another leak seen in PR#56826.

 with both of these fixes my "kmem-04096" pool does not grow rapidly
 while using "mpv -vo gpu".  in fact, once i loaded the relevant file
 into memory, this pool remains stable after at least one minute of
 video playback.

 ok riastradh@


 To generate a diff of this commit:
 cvs rdiff -u -r1.7 -r1.8 \
     src/sys/external/bsd/drm2/dist/drm/nouveau/nvkm/subdev/mmu/nouveau_nvkm_subdev_mmu_mem.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
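
The fix described in the commit message above can be sketched with the
same kind of toy model.  Everything here is hypothetical and simplified:
the field names echo "mem->mem", "mem->dma", "mem->nseg" and
"mem->pages" from the log message, but the real nvkm_mem_dtor() is not
reproduced.  The sketch shows three points from the log: dm_nsegs is
copied into nseg at construction time because a borrowed map may be
gone before the destructor runs; the destructor frees dma whenever it is
set rather than only when mem is; and the map created on the non-ioctl
path is destroyed.  Where exactly that destroy happens in the real code
is not stated; here it is assumed to happen before the constructor
returns.

```c
#include <assert.h>
#include <stdlib.h>

static int live;  /* outstanding toy allocations; 0 means nothing leaked */

static void *
xalloc(size_t n)
{
	void *p = calloc(1, n);
	if (p != NULL)
		live++;
	return p;
}

static void
xfree(void *p)
{
	if (p != NULL) {
		live--;
		free(p);
	}
}

struct toy_map {
	int dm_nsegs;
};

struct toy_mem {
	void *mem;  /* NULL on the borrowed-map path */
	void *dma;  /* per-segment array, allocated on both paths */
	int nseg;   /* copied from dm_nsegs at construction */
	int pages;
};

/* ioctl-like path: the map belongs to someone else, so copy what we
 * need now and never dereference the map again. */
static void
toy_new_borrowed(struct toy_mem *m, const struct toy_map *borrowed)
{
	m->mem = NULL;
	m->nseg = borrowed->dm_nsegs;
	m->pages = m->nseg;
	m->dma = xalloc((size_t)m->nseg * sizeof(void *));
}

/* Fresh path: create and "load" our own map, then destroy it (placement
 * assumed) so creates and destroys balance. */
static void
toy_new_fresh(struct toy_mem *m, int npages)
{
	struct toy_map *map = xalloc(sizeof(*map));  /* ~ bus_dmamap_create() */

	assert(map != NULL);
	map->dm_nsegs = npages;                      /* ~ bus_dmamap_load_raw() */
	m->mem = xalloc(64);
	m->nseg = map->dm_nsegs;
	m->pages = npages;
	m->dma = xalloc((size_t)m->nseg * sizeof(void *));
	xfree(map);                                  /* ~ bus_dmamap_destroy() */
}

static void
toy_dtor_fixed(struct toy_mem *m)
{
	assert(m->pages == m->nseg);  /* cf. the ASSERT() in the log message */
	if (m->dma != NULL) {         /* no longer gated on m->mem */
		xfree(m->dma);
		m->dma = NULL;
	}
	xfree(m->mem);
	m->mem = NULL;
}
```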

Responsible-Changed-From-To: kern-bug-people->mrg
Responsible-Changed-By: mrg@NetBSD.org
Responsible-Changed-When: Tue, 31 May 2022 20:58:31 +0000
Responsible-Changed-Why:
i fixed it.


State-Changed-From-To: open->feedback
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Tue, 31 May 2022 20:58:31 +0000
State-Changed-Why:
Tom, can you update and test?  my kmem 4096 pool has only used 1400
pages after 14 hours with several hours of firefox activity and mpv.


State-Changed-From-To: feedback->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Thu, 15 Sep 2022 21:01:10 +0000
State-Changed-Why:
feedback timeout, probably fixed, please reopen if still an issue!


>Unformatted:
