NetBSD Problem Report #55326

From www@netbsd.org  Sun May 31 12:15:58 2020
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 383E61A9218
	for <gnats-bugs@gnats.NetBSD.org>; Sun, 31 May 2020 12:15:58 +0000 (UTC)
Message-Id: <20200531121557.4FC161A921A@mollari.NetBSD.org>
Date: Sun, 31 May 2020 12:15:57 +0000 (UTC)
From: rokuyama.rk@gmail.com
Reply-To: rokuyama.rk@gmail.com
To: gnats-bugs@NetBSD.org
Subject: gem(4): memory corruption by RX DMA
X-Send-Pr-Version: www-1.0

>Number:         55326
>Notify-List:    david@gutteridge.ca
>Category:       port-macppc
>Synopsis:       gem(4): memory corruption by RX DMA
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-macppc-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun May 31 12:20:00 +0000 2020
>Last-Modified:  Sun May 31 19:26:52 +0000 2020
>Originator:     Rin Okuyama
>Release:        9.99.64
>Organization:
Department of Physics, Meiji University
>Environment:
NetBSD macmini 9.99.64 NetBSD 9.99.64 (GENERIC) #65: Sun May 31 01:11:41 JST 2020  rin@latipes:/usr/src/sys/arch/macppc/compile/GENERIC macppc
>Description:
With DIAGNOSTIC enabled on a machine with gem(4) (a Mac mini in my case),
the kernel panics as follows:

panic: pr_phinpage_check: [mclpl] item 0x3fb0b040 not part of pool
cpu0: Begin traceback...
...: at vpanic+...
...: at panic+...
...: at pool_cache_put_paddr+...
...: at m_ext_free+...
...: at m_freem.part.7+...
...: at ether_input+...
...: at if_percpuq_softint+...
...: at softint_dispatch+...
...: at softint_fast_dispatch+...
saved LR(0x1c) is invalid.cpu0: End traceback...

I found that the ph_page field had become NULL when this panic occurred,
whereas it was correctly initialized at the time of MCLGET(9).
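
To illustrate why a NULL ph_page leads to this panic, here is a minimal
userland sketch. It assumes 4 KiB pages and that the check compares a page
address stored in a header at the start of the page against the page base
computed from the item address. The struct and all sketch_* names are
hypothetical simplifications, not the actual pool(9) code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* Simplified, hypothetical stand-in for the in-page pool header. */
struct sketch_page_header {
	uintptr_t ph_page;	/* address of the page this header describes */
};

/* Base address of the page that contains addr. */
static uintptr_t
sketch_page_base(uintptr_t addr)
{
	return addr & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);
}

/* Offset of addr within its page. */
static uintptr_t
sketch_page_offset(uintptr_t addr)
{
	return addr & (uintptr_t)(SKETCH_PAGE_SIZE - 1);
}

/* True if the header still claims the page that contains item. */
static bool
sketch_item_in_page(const struct sketch_page_header *ph, uintptr_t item)
{
	return ph->ph_page == sketch_page_base(item);
}
```

For the item in the panic message, 0x3fb0b040, the page base is 0x3fb0b000
and the page offset is 0x40. If ph_page is overwritten with 0, the comparison
in sketch_item_in_page() fails, which in this simplified model corresponds to
the "not part of pool" panic.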

This dirty hack fixes the problem as far as I can see:

----
Index: sys/kern/uipc_mbuf.c
===================================================================
RCS file: /home/netbsd/src/sys/kern/uipc_mbuf.c,v
retrieving revision 1.241
diff -p -u -r1.241 uipc_mbuf.c
--- sys/kern/uipc_mbuf.c	5 May 2020 20:36:48 -0000	1.241
+++ sys/kern/uipc_mbuf.c	25 May 2020 14:08:51 -0000
@@ -188,8 +188,13 @@ mbinit(void)
 	    NULL, IPL_VM, mb_ctor, NULL, NULL);
 	KASSERT(mb_cache != NULL);

+#ifdef GEM_WORKAROUND /* XXXXXXXX */
+	mcl_cache = pool_cache_init(mclbytes, PAGE_SIZE, 0, 0, "mclpl",
+	    NULL, IPL_VM, NULL, NULL, NULL);
+#else
 	mcl_cache = pool_cache_init(mclbytes, COHERENCY_UNIT, 0, 0, "mclpl",
 	    NULL, IPL_VM, NULL, NULL, NULL);
+#endif
 	KASSERT(mcl_cache != NULL);

 	pool_cache_set_drain_hook(mb_cache, mb_drain, NULL);
----

I therefore guess that RX DMA of gem(4) corrupts memory located at the
page offset of the DMA buffer. However, no such behavior is documented in
the manual [1].

(The manual only recommends, and does not require, that buffers be aligned
to a cache line. That alignment holds even with DIAGNOSTIC enabled:
m_ext.ext_buf is aligned to COHERENCY_UNIT = 64, which is larger than 32,
the cache line size of the Mac mini.)
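
The alignment argument can be checked with a small userland sketch. The
helper name sketch_is_aligned is hypothetical, and the values 64, 32, and
4096 are taken from the text above or assumed (4 KiB page size).

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of align; align must be a power of two. */
static bool
sketch_is_aligned(uintptr_t addr, uintptr_t align)
{
	return (addr & (align - 1)) == 0;
}
```

The item address from the panic, 0x3fb0b040, is aligned to both 64 and 32
bytes but not to a 4096-byte page, so the recommended cache-line alignment
is satisfied without the workaround.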

[1] Sun Microsystems, Gigabit Ethernet ASIC Specification
>How-To-Repeat:
Described above.
>Fix:
N/A. Is this a hardware limitation? If so, should gem(4) use its own pool
for DMA buffers?

>Release-Note:

>Audit-Trail:
From: "David H. Gutteridge" <david@gutteridge.ca>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-macppc/55326: gem(4): memory corruption by RX DMA
Date: Sun, 31 May 2020 14:47:53 -0400

 FWIW, I reported a similar backtrace in PR port-macppc/54331:

  [   20.6900578] panic: pr_phinpage_check: [mclpl] item 0x1fb2f040 not part of pool
  [   20.6900578] cpu0: Begin traceback...
  [   20.6900578] 0x10007da0: at vpanic+0x144
  [   20.7300428] 0x10007dd0: at panic+0x50
  [   20.7300428] 0x10007e20: at pool_cache_put_paddr+0x25c
  [   20.7600388] 0x10007e50: at m_ext_free+0x130
  [   20.7600388] 0x10007e60: at m_free+0x9c
  [   20.7600388] 0x10007e70: at m_freem.part.8+0xc
  [   20.7600388] 0x10007e80: at ether_input+0x67c
  [   20.7600388] 0x10007eb0: at if_percpuq_softint+0xb4
  [   20.7600388] 0x10007ed0: at softint_dispatch+0x1d0
  [   20.7600388] 0x10007f20: at softint_fast_dispatch+0xdc
  [   20.8700561] 0x10007fe8: at 0xfffffffc
  [   20.8700561] cpu0: End traceback...
  Stopped in pid 0.3 (system)
 at  netbsd:vpanic+0x148:    or       r3,  r29,  r29
  gem0: receive error: RX overflow sc->rxptr 0, complete 4

 (Since the rest of PR 54331 is addressed, and this PR contains
 analysis and a patch, I'll close 54331.)

 Thanks,

 Dave


>Unformatted:
