NetBSD Problem Report #48372

From mrg@eterna.com.au  Fri Nov  8 22:35:12 2013
Return-Path: <mrg@eterna.com.au>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 068DAA618A
	for <gnats-bugs@gnats.NetBSD.org>; Fri,  8 Nov 2013 22:35:12 +0000 (UTC)
Message-Id: <20131108223509.099F4B380@splode.eterna.com.au>
Date: Sat,  9 Nov 2013 09:35:09 +1100 (EST)
From: mrg@eterna.com.au
Reply-To: mrg@eterna.com.au
To: gnats-bugs@gnats.NetBSD.org
Subject: sunblade 2500 hangs under memory pressure -- loop between uvm/pool/vmem
X-Send-Pr-Version: 3.95

>Number:         48372
>Category:       kern
>Synopsis:       sunblade 2500 hangs under memory pressure -- loop between uvm/pool/vmem
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Nov 08 22:40:00 +0000 2013
>Closed-Date:    Sun Dec 08 07:47:52 +0000 2019
>Last-Modified:  Sun Dec 08 07:47:52 +0000 2019
>Originator:     matthew green
>Release:        NetBSD 6.1_STABLE
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:


System: NetBSD splode.eterna.com.au 6.1_STABLE NetBSD 6.1_STABLE (_splode_) #1: Thu Jul 11 19:09:23 EST 2013 mrg@splode.eterna.com.au:/var/obj/sparc64/usr/src/sys/arch/sparc64/compile/_splode_ sparc64
Architecture: sparc64
Machine: sparc64
>Description:

twice after being up about 6 days my sunblade 2500 with one cpu
and 4GB of ram has hung.  breaking into ddb on the console shows
that a loop was occurring between uvm, pool and vmem.

"show uvm" says that there is only 1 free page, and the bt is:

db{0}> bt
intr_list_handler(59c86d0, a, e0017ed0, 330, 1203f20, 0) at 
netbsd:intr_list_handler+0x10
sparc_interrupt(0, a, 400cc000, 575dec0, 0, 0) at netbsd:sparc_interrupt+0x22c
mutex_spin_enter(18aac80, ff070000000001, ffffffffffffffff, 4000001, 0, 0) at 
netbsd:mutex_spin_enter+0xa0
bt_refill(18abd18, 1002, ff070000000001, 874061e8, 0, 0) at 
netbsd:bt_refill+0x100
vmem_xalloc(18abd18, 2000, 2000, 0, 0, 0) at netbsd:vmem_xalloc+0x6c
vmem_alloc(18abd18, 2000, 1002, 87405d68, 0, 0) at netbsd:vmem_alloc+0x94
pool_page_alloc_meta(18a93f0, 2, ff070000000001, 87406aa8, 0, 0) at 
netbsd:pool_page_alloc_meta+0x2c
pool_grow(18a93f0, 2, 2000, 0, 0, 0) at netbsd:pool_grow+0x1c
pool_get(18a94a8, 2, ff070000000001, 330, 0, 0) at netbsd:pool_get+0x3c
pool_cache_put_slow(18ad840, a, 400cc000, 575dec0, 0, 0) at 
netbsd:pool_cache_put_slow+0x160
pool_cache_put_paddr(18ad600, 400cc000, ffffffffffffffff, 4000001, 0, 0) at 
netbsd:pool_cache_put_paddr+0xa4

[ this repeats 30 more times
uvm_km_kmem_alloc(c, 2000, 0, 87411628, 0, 0) at netbsd:uvm_km_kmem_alloc+0x104
vmem_xalloc(18abd18, 18abfb0, 2000, 0, 0, 0) at netbsd:vmem_xalloc+0x8ac
vmem_alloc(18abd18, 2000, 1002, 874117c8, ff7fffff, ffdfffff) at 
netbsd:vmem_alloc+0x94
pool_page_alloc_meta(18a93f0, 2, ff070000000001, 13, 0, 0) at 
netbsd:pool_page_alloc_meta+0x2c
pool_grow(18a93f0, 2, 59c2000, 13, 7d, 0) at netbsd:pool_grow+0x1c
pool_get(18a94a8, 2, 59c2000, 5, 10c2660, ffffffffffffffff) at 
netbsd:pool_get+0x3c
pool_cache_put_slow(57b7780, 0, 28f50940, 4, 0, 0) at 
netbsd:pool_cache_put_slow+0x160
pool_cache_put_paddr(57b7540, 28f50940, ffffffffffffffff, 201b, 0, 0) at 
netbsd:pool_cache_put_paddr+0xa4
]

ffs_reclaim(0, 59c2000, 59c2000, 0, 0, 0) at netbsd:ffs_reclaim+0xec
VOP_RECLAIM(28f54e70, 1, 0, 59c2000, 0, 0) at netbsd:VOP_RECLAIM+0x28
vclean(28f54e70, 8, 0, 0, ff7fffff, ffdfffff) at netbsd:vclean+0x134
cleanvnode(1884500, 0, 64, 6, 28f54e94, 1884500) at netbsd:cleanvnode+0xc4
vdrain_thread(59c2000, 59c2000, 0, 1c05d38, 7d, 0) at netbsd:vdrain_thread+0x90
lwp_trampoline(f005d730, 113800, 113c00, 111880, 111ce0, 1117f8) at 
netbsd:lwp_trampoline+0x8


>How-To-Repeat:

	not really sure.  the second hang was during a backup
	run, but the first hang was an hour or two after that
	had finished.

>Fix:

>Release-Note:

>Audit-Trail:
From: Petri Laakso <petri.laakso@asd.fi>
To: gnats-bugs@NetBSD.org
Cc: mrg@eterna.com.au, kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: kern/48372: sunblade 2500 hangs under memory pressure -- loop
 between uvm/pool/vmem
Date: Sat, 9 Nov 2013 09:54:57 +0200

 On Fri,  8 Nov 2013 22:40:00 +0000 (UTC)
 mrg@eterna.com.au wrote:

 > >How-To-Repeat:
 > 
 > 	not really sure.  the second hang was during a backup
 > 	run, but the first hang was an hour or two after that
 > 	had finished.

 Hi

 I've been able to reproduce memory errors with pkgsrc/sysutils/memtest
 (in my case due to invalid DRAM initialization in software).
 Maybe you can run it if the problem appears again, or at least
 make sure it is not a hardware fault.

 Best regards
 Petri Laakso

From: Lars Heidieker <lars@heidieker.de>
To: gnats-bugs@NetBSD.org, matthew green <mrg@eterna.com.au>
Cc: 
Subject: Re: kern/48372: sunblade 2500 hangs under memory pressure -- loop
 between uvm/pool/vmem
Date: Fri, 22 Nov 2013 23:40:32 +0100

 On 11/08/2013 11:40 PM, mrg@eterna.com.au wrote:

 > twice after being up about 6 days my sunblade 2500 with one cpu
 > and 4GB of ram has hung.  breaking into ddb on the console shows
 > that a loop was occurring between uvm, pool and vmem.
 > 
 > "show uvm" says that there is only 1 free page, and the bt is:
 > 
 > [ full backtrace quoted in the original report above ]
 > 
 > uvm_km_kmem_alloc(c, 2000, 0, 87411628, 0, 0) at netbsd:uvm_km_kmem_alloc+0x104
 > vmem_xalloc(18abd18, 18abfb0, 2000, 0, 0, 0) at netbsd:vmem_xalloc+0x8ac
 > [...]
 > ffs_reclaim(0, 59c2000, 59c2000, 0, 0, 0) at netbsd:ffs_reclaim+0xec
 > VOP_RECLAIM(28f54e70, 1, 0, 59c2000, 0, 0) at netbsd:VOP_RECLAIM+0x28
 > vclean(28f54e70, 8, 0, 0, ff7fffff, ffdfffff) at netbsd:vclean+0x134
 > cleanvnode(1884500, 0, 64, 6, 28f54e94, 1884500) at netbsd:cleanvnode+0xc4
 > vdrain_thread(59c2000, 59c2000, 0, 1c05d38, 7d, 0) at netbsd:vdrain_thread+0x90
 > lwp_trampoline(f005d730, 113800, 113c00, 111880, 111ce0, 1117f8) at 
 > netbsd:lwp_trampoline+0x8
 > 

 Hi,

 one thing puzzles me about the stack trace: how can uvm_km_kmem_alloc
 call pool_cache_put_* without going via vmem_free?
 Is it missing from the stack trace?
 Anyway, I think I found it: there is a bug fixed by
 revision 1.125 of src/sys/uvm/uvm_km.c.
 It should be pulled up to the netbsd-6 branch.

 Lars


 -- 
 ------------------------------------

 Mystical explanations:
 Mystical explanations are considered deep;
 the truth is that they are not even superficial.

    -- Friedrich Nietzsche
    [ The Gay Science, Book 3, 126 ]

From: Mindaugas Rasiukevicius <rmind@netbsd.org>
To: Lars Heidieker <lars@heidieker.de>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, mrg@eterna.com.au
Subject: Re: kern/48372: sunblade 2500 hangs under memory pressure -- loop
 between uvm/pool/vmem
Date: Sat, 23 Nov 2013 12:33:42 +0000

 Lars Heidieker <lars@heidieker.de> wrote:
 >  Hi,
 >  
 >  one thing puzzles me about the stack trace: how can uvm_km_kmem_alloc
 >  call pool_cache_put_* without going via vmem_free?
 >  Is it missing from the stack trace?
 >  Anyway, I think I found it: there is a bug fixed by
 >  revision 1.125 of src/sys/uvm/uvm_km.c.
 >  It should be pulled up to the netbsd-6 branch.

 That is a good spot, indeed!

 >  
 >  Lars

 -- 
 Mindaugas

State-Changed-From-To: open->closed
State-Changed-By: mrg@NetBSD.org
State-Changed-When: Sun, 08 Dec 2019 07:47:52 +0000
State-Changed-Why:
was fixed, thanks Lars.


>Unformatted:
