NetBSD Problem Report #56847

From www@netbsd.org  Thu May 19 10:06:20 2022
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 84C2B1A921F
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 19 May 2022 10:06:20 +0000 (UTC)
Message-Id: <20220519100618.CD9B61A923A@mollari.NetBSD.org>
Date: Thu, 19 May 2022 10:06:18 +0000 (UTC)
From: sehnsucht@sdf.org
Reply-To: sehnsucht@sdf.org
To: gnats-bugs@NetBSD.org
Subject: nouveau autoconfiguration error: fifo: fault 01 [WRITE] -> DROPPED_MMU_FAULT; freezed console output on boot on a GTX 1060
X-Send-Pr-Version: www-1.0

>Number:         56847
>Category:       kern
>Synopsis:       nouveau autoconfiguration error: fifo: fault 01 [WRITE] -> DROPPED_MMU_FAULT; freezed console output on boot on a GTX 1060
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu May 19 10:10:00 +0000 2022
>Closed-Date:    Wed May 24 17:52:28 +0000 2023
>Last-Modified:  Wed May 24 17:52:28 +0000 2023
>Originator:     Paolo Vincenzo Olivo
>Release:        NetBSD HEAD (May the 14th, 2022) / amd64
>Organization:
SDF Public Access UNIX System
>Environment:
NetBSD  9.99.96 NetBSD 9.99.96 (GENERIC) #0: Sat May 14 21:04:34 UTC 2022  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Once loaded, the nouveau driver apparently fails to reset the display on an NVIDIA GeForce GTX 1060 with the latest HEAD snapshot.

The console output is stuck at:

```
nouveau0: info: NVIDIA GP106 (13000a1)
nouveau0: info: bios: version 86.06.0e.00.99
nouveau0: interrupting at msi6vec 0 (nouveau0)
nouveau0: info: fb: 6144 MiB GDDR5
```

After that, the console becomes completely unresponsive. The machine nevertheless finishes booting, so I can SSH into it and inspect dmesg, which shows a long list of autoconfiguration errors like the following:

```
fifo: fault 01 [WRITE] at 000000000102d000 engine 04 [BAR1] client 08 [HUB/HOST_CPU_NB] reason 00 [PDE] on channel -1 [017feaa000 unknown]
[    21.541860] nouveau0: autoconfiguration error: error: fifo: fault 01 [WRITE] at 000000000102f000 engine 04 [BAR1] client 08 [HUB/HOST_CPU_NB] reason 00 [PDE] on channel -1 [017feaa000 unknown]
[    21.541860] nouveau0: autoconfiguration error: error: fifo: fault 01 [WRITE] at 0000000001030000 engine 04 [BAR1] client 08 [HUB/HOST_CPU_NB] reason 00 [PDE] on channel -1 [017feaa000 unknown]
[    21.541860] nouveau0: autoconfiguration error: error: fifo: fault 01 [WRITE] at 0000000001032000 engine 04 [BAR1] client 08 [HUB/HOST_CPU_NB] reason 00 [PDE] on channel -1 [017feaa000 unknown]

[...]

[    22.300998] nouveau0: autoconfiguration error: error: fifo: DROPPED_MMU_FAULT 00000000
```

Still, wsdisplay attaches and the ttys are apparently created:

```
[    22.310713] wsdisplay0 at nouveaufb0 kbdmux 1: console (default, vt100 emulation), using wskbd0
[    22.322939] wsmux1: connecting to wsdisplay0
[    22.322939] wskbd1: connecting to wsdisplay0
[    45.060664] cd0(ahcisata0:1:0):  DEFERRED ERROR, key = 0x2
[    81.420583] wsdisplay0: screen 1 added (default, vt100 emulation)
[    81.420583] wsdisplay0: screen 2 added (default, vt100 emulation)
[    81.420583] wsdisplay0: screen 3 added (default, vt100 emulation)
[    81.420583] wsdisplay0: screen 4 added (default, vt100 emulation)

```
I took a picture of the console buffer, which is available at https://ttm.sh/bva.jpg

The full dmesg is available at https://bhh.sh/6cm.

Disabling `nouveau' and `nouveaufb' at boot eliminates the problem: the kernel buffer prints the whole boot sequence, I can see rc starting the services, and I am eventually greeted with a working wscons.

Here's the standard dmesg with nouveau* modules disabled:
https://bhh.sh/6cl

Using a `pcictl'-based script, here is a recap of the names and IDs of all PCI devices in my workstation:

```
 Core 7G (S, Quad) Host Bridge, DRAM (0x591f)
 Core 6G PCIe x16 (0x1901)
 200 Series xHCI (0xa2af)
 200 Series MEI (0xa2ba)
 200 Series SATA (AHCI) (0xa282)
 200 Series PCIe (0xa2e9)
 200 Series PCIe (0xa292)
 200 Series PCIe (0xa298)
 H270 LPC (0xa2c4)
 GeForce GTX 1060 6GB (0x1c03)
 BCM5751 10/100/1000 Ethernet (0x1677)
 Wireless AC 9260 (0x2526)
 ASM1083/1085 PCIe-PCI Bridge (0x1080)
```

The problem appears to be independent of the monitor's resolution and of the video interface used. I've tried both:

- an iiyama ProLite 2560x1440 2K monitor, connected via DisplayPort
- a BenQ 1920x1080 monitor, connected via HDMI

and the result is the same in both cases.

The same bug was also recently reported by another user with a GTX 770:
https://marc.info/?l=netbsd-bugs&m=165167531709750&w=2

Since both a Pascal card (GTX 1060) and a Kepler card (GTX 770) are affected, it seems reasonable to assume that, with the graphics stack currently in HEAD, this bug affects many GPU models across multiple generations.

It also appears to be a known regression in nouveau, affecting at least some users since Linux 4.19, and possibly fixed upstream:
https://bugzilla.kernel.org/show_bug.cgi?id=201847
https://www.spinics.net/lists/kernel/msg3773355.html
>How-To-Repeat:
Boot a HEAD 9.99.96 NetBSD/amd64 snapshot on a desktop equipped with an NVIDIA GeForce GTX 1060, making sure that the gpufw.tar.xz set is properly installed^[1].


^[1] Otherwise the driver fails to attach (`acr: failed to load firmware'), resulting in a kernel panic. Loading firmware on GPUs that require it appears to be mandatory starting with Linux 5.6. See: https://www.mail-archive.com/nouveau@lists.freedesktop.org/msg35424.html
>Fix:

>Release-Note:

>Audit-Trail:
From: Soren Jacobsen <snj@blef.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56847: nouveau autoconfiguration error: fifo: fault 01
 [WRITE] -> DROPPED_MMU_FAULT; freezed console output on boot on a GTX 1060
Date: Thu, 10 Nov 2022 08:11:30 -0800

 I also hit this problem.  For me, it only occurs if the system is booted
 via EFI.  A good ol' fashioned BIOS boot gets me a system with usable
 nouveau.  Should be easy enough to test with a USB stick.

 Soren

From: Onno van der Linden <o.vd.linden@quicknet.nl>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56847: nouveau autoconfiguration error: fifo: fault 01
  [WRITE] -> DROPPED_MMU_FAULT; freezed console output on boot on a GTX 1060
Date: Fri, 11 Nov 2022 20:16:25 +0100

 Soren Jacobsen wrote:
 > I also hit this problem.  For me, it only occurs if the system is booted
 > via EFI.  A good ol' fashioned BIOS boot gets me a system with usable
 > nouveau.  Should be easy enough to test with a USB stick.

 I can confirm that booting with a 9.3 amd64 BIOS install image
 fixes the same problem I had with my NVIDIA GT710 (frozen console,
 but the machine still reachable over the network). Booting the same
 image on the same machine with an R7 240 instead of the GT710 also
 got rid of the "ring test failed" message I saw with an EFI boot,
 as mentioned in kern/57079.

 That EFI boot is doing something wrong during initialization.

 I noticed this difference in dmesg between BIOS and EFI boots with the R7 240:
 Zone  kernel: Available graphics memory: 9007199253819168 KiB with EFI
 Zone  kernel: Available graphics memory: 5509780 kiB with BIOS

 See also kern/53126 

 Onno
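
A quick arithmetic check puts the two `Available graphics memory` figures quoted above in perspective. The sketch below copies both values from those dmesg lines and assumes they are KiB (1024 bytes) as printed; the helper name `kib_to_bytes` is purely illustrative. Converted to bytes, the BIOS value is about 5.25 GiB, a plausible amount, while the EFI value is about 8 EiB, roughly 900 MiB short of 2^63 bytes. A total that close to a power-of-two boundary is a common symptom of an integer wraparound somewhere in the memory-size computation, but that is only a hypothesis, not something established here.

```python
# Figures copied from the two dmesg lines quoted above
# ("Zone  kernel: Available graphics memory: ... KiB").
EFI_KIB = 9007199253819168
BIOS_KIB = 5509780

KIB = 1024
MIB = 2**20
GIB = 2**30
EIB = 2**60


def kib_to_bytes(kib: int) -> int:
    """Convert a KiB count (as printed by the kernel) to bytes."""
    return kib * KIB


efi_bytes = kib_to_bytes(EFI_KIB)
bios_bytes = kib_to_bytes(BIOS_KIB)

# Distance between the EFI figure and the 2**63-byte boundary.
gap = 2**63 - efi_bytes

print(f"BIOS boot: {bios_bytes / GIB:.2f} GiB")
print(f"EFI boot:  {efi_bytes / EIB:.2f} EiB")
print(f"EFI figure is {gap} bytes (~{gap / MIB:.0f} MiB) below 2**63")
```

Python's arbitrary-precision integers are used deliberately so the comparison against 2**63 cannot itself overflow.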

From: brian@bemorehuman.org
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/56847: nouveau autoconfiguration error: fifo: fault 01  
 [WRITE] -> DROPPED_MMU_FAULT; freezed console output on boot on a GTX 1060
Date: Wed, 23 Nov 2022 10:23:17 +1300

 Hi there,

 Just another data point -- I'm seeing the same behavior as described 
 above.

 I'm running a GTX 1070 and could not get nouveau going with 9.3. I then
 tried a HEAD kernel (9.99.106) and got further: the boot recognized the
 card and tried to set up nouveau, but the screen output hung. In this
 scenario, the final message I see on the console is:

 nouveau: fb: 8192 MiB GDDR5

 The boot process continues with no display updates, then I can ssh in 
 from another machine.

 Finally, I reinstalled 9.3 in BIOS mode, copied over gpufw.tar.xz
 and kern-GENERIC.tar.xz (both from HEAD) and installed them. After a
 reboot, nouveau attaches to the 1070 perfectly, as confirmed by

 pcictl pci0 list -N

 which gives:

 ...
 008:00:0: NVIDIA GeForce GTX 1070 (VGA display, revision 0xa1) [nouveau0]
 ...

 Hope this helps.

 Brian


 -- 
 Everything is connected, all the time.

State-Changed-From-To: open->feedback
State-Changed-By: snj@NetBSD.org
State-Changed-When: Tue, 14 Mar 2023 23:52:53 +0000
State-Changed-Why:
Please try a recent kernel.  The following commit fixes it for me:
http://mail-index.netbsd.org/source-changes/2023/03/01/msg143606.html


From: Paolo Vincenzo Olivo <sehnsucht@SDF.ORG>
To: gnats-bugs@gnats.netbsd.org
Cc: riastradh@netbsd.org
Subject: Re: kern/56847
Date: Sun, 21 May 2023 16:29:16 +0200


 Hi, I can confirm that the problem is fixed on my side after the commit
 referenced above by @snj.
 A big thanks to @riastradh for his commitment and his hard work.

 --
 PVO

From: Taylor R Campbell <riastradh@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/56847: nouveau autoconfiguration error: fifo: fault 01 [WRITE] -> DROPPED_MMU_FAULT; freezed console output on boot on a GTX 1060
Date: Sun, 21 May 2023 16:03:08 +0000

 Forwarding Paolo's reply, which was possibly sent to the wrong address at first.

State-Changed-From-To: feedback->closed
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sun, 21 May 2023 16:13:21 +0000
State-Changed-Why:
submitter reports fixed


State-Changed-From-To: closed->needs-pullups
State-Changed-By: riastradh@NetBSD.org
State-Changed-When: Sun, 21 May 2023 16:14:06 +0000
State-Changed-Why:
pullup-10 https://mail-index.netbsd.org/source-changes/2023/03/01/msg143606.html


State-Changed-From-To: needs-pullups->closed
State-Changed-By: snj@NetBSD.org
State-Changed-When: Wed, 24 May 2023 17:52:28 +0000
State-Changed-Why:
already pulled up in ticket 122


>Unformatted:
