NetBSD Problem Report #47030

From gson@gson.org  Wed Oct  3 16:15:05 2012
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id DA69063B8E6
	for <gnats-bugs@gnats.NetBSD.org>; Wed,  3 Oct 2012 16:15:04 +0000 (UTC)
Message-Id: <20121003161459.299E375E8B@guava.gson.org>
Date: Wed,  3 Oct 2012 19:14:59 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: Installation stalls during set extraction
X-Send-Pr-Version: 3.95

>Number:         47030
>Category:       kern
>Synopsis:       low-memory deadlock with wapbl enabled (was: Installation stalls during set extraction)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jdolecek
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 03 16:20:00 +0000 2012
>Closed-Date:    
>Last-Modified:  Mon Apr 10 21:35:01 +0000 2017
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:
>Environment:
System: NetBSD
Architecture: sparc
Machine: sparc
>Description:

When doing automated NetBSD installs in a qemu VM using anita
(pkgsrc: misc/py-anita), the installation sometimes fails because the 
set extraction never completes.  When this happens, the sysinst progress
bar displays the message "- stalled -" and stays that way indefinitely:

     Status: Running
    Command: progress -zf /mnt2//sparc/binary/sets/base.tgz tar --chroot -xhepf

-
--------------------------------------------------------------------------------
  9% |***                                |  8552 KiB    5.06 KiB/s  - stalled -

For example, when qemu was upgraded from 0.12 to 0.15 on the TNF
automated testbed babylon5.netbsd.org, this problem started hitting
every SPARC install run.  Increasing the amount of memory in the
emulated SPARC system from 32M to 64M made the problem go away.

On an earlier occasion involving an i386 VM, I managed to attach to
the VM with gdb for remote kernel debugging and found that the "tar"
process extracting the sets was stuck waiting for a buffer:

  (gdb) where
  #0  0xc064e1ff in cv_timedwait (cv=0xc537a560, mtx=0x0, timo=-985761876) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/lwp.h:377
  #1  0xc0681c86 in sleepq_block (timo=0, catch=false) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/lwp.h:377
  #2  0xc064e276 in cv_timedwait (cv=0xc13b5120, mtx=0xc13f73fc, timo=0) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/lwp.h:377
  #3  0xc0a83dc3 in getnewbuf (slpflag=0, slptimeo=0, from_bufq=0) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/wapbl.h:209
  #4  0xc0a8366e in getblk (vp=0xc54e9380, blkno=0, size=2048, slpflag=0, slptimeo=0) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/wapbl.h:209
  #5  0xc0387f35 in ffs_getblk (vp=0xc54e9380, lblkno=0, blkno=3389092, size=2048, clearbuf=true, bpp=0xc53e7ad0) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/ufs/ffs/ffs_subr.c:122
  #6  0xc037d57f in ffs_balloc_ufs2 (vp=0xc54e9380, off=0, size=512, cred=0xc460bf00, flags=1, bpp=0xc53e7ad0) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/ufs/ffs/ffs_balloc.c:92
  #7  0xc037b7e8 in ffs_balloc (vp=0xc54e9380, off=0, size=512, cred=0xc460bf00, flags=1, bpp=0xc53e7ad0) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/ufs/ffs/ffs_balloc.c:88
  #8  0xc09f8297 in ufs_mkdir (v=0xc53e7b58) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/vnode.h:391
  #9  0xc0abbad9 in VOP_MKDIR (dvp=0xc54c78fc, vpp=0xc53e7c34, cnp=0xc53e7c48, vap=0xc53e7b9c) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/kern/vnode_if.c:86
  #10 0xc0a98cff in do_sys_mkdir (path=0x8235d30 <Address 0x8235d30 out of bounds>, mode=493, seg=UIO_USERSPACE) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/kern/vfs_syscalls.c:204
  #11 0xc0a98bb0 in sys_mkdir (l=dwarf2_read_address: Corrupted DWARF expression.
  ) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/kern/vfs_syscalls.c:204
  #12 0xc09614f7 in sy_call (sy=0xc0e978c0, l=0xc537a560, uap=0xc53e7cf0, rval=0xc53e7d18) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/syscallvar.h:61
  #13 0xc0961872 in syscall (frame=0xc53e7d48) at /bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/userret.h:84
  #14 0xc01005d6 in syscall1 ()

One one occasion when I had interactive terminal access to the sysinst
session, I was able to recover from the stall by suspending sysinst
with control-Z, maybe running some random shell commands, and then
continuing the the stopped sysinst job.

The following factors all seem to make it more likely for the problem to occur:

 - Extracting a large install set, such as "base"
 - Installing a release built with debug symbols (which makes the sets even larger)
 - A limited amount of memory in the qemu guest (e.g., 32M rather than 64M)
 - Installing on sparc rather than i386
 - Using qemu 0.15 rather than 0.12

My best guess as to why this happens is that the kernel somehow gets
into a deadlocked state where all the disk buffers are simultaneously
locked, leaving none available for allocation; this might be triggered
by some combination of a small memory size, lots of new files being
created in a short amount of time, changes in disk caching policy
between qemu 0.12 and 0.15, and a lack of other system activity, a
combination that is unlikely to occur during normal system operation
but likely during installation into a small-memory VM.

>How-To-Repeat:

The following reliably reproduces the problem for me; your mileage may vary.

  - Install qemu-0.15.1nb5 from the emulators/qemu0 package (not qemu-1
    from emulators/qemu, because it has unrelated bugs that may prevent
    you from reaching the set extraction stage)

  - Install anita-1.29 from the misc/py-anita package

  - Run the command

      anita --memory-size 32M install http://ftp.netbsd.org/pub/NetBSD/iso/6.0_RC2/NetBSD-6.0_RC2-sparc.iso

>Fix:

>Release-Note:

>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/47030: Installation stalls during set extraction
Date: Thu, 6 Jun 2013 09:24:12 +0300

 The problem is still there in NetBSD 6.1, and I keep running into it
 with some regularity.

 For example, I can reproduce it by running the command

   anita install --memory-size 32M --disk-size 2G http://ftp.netbsd.org/pub/NetBSD/NetBSD-6.1/i386/

 The size of the disk being installed on appears be a factor in
 triggering the bug, because if I change the --disk-size argument
 to 1G, the installation succeeds.
 -- 
 Andreas Gustafsson, gson@gson.org

Responsible-Changed-From-To: kern-bug-people->jdolecek
Responsible-Changed-By: jdolecek@NetBSD.org
Responsible-Changed-When: Mon, 24 Oct 2016 20:27:31 +0000
Responsible-Changed-Why:
This smells very like a problem I'm hitting on a test machine, I'll look into
this.


State-Changed-From-To: open->feedback
State-Changed-By: jdolecek@NetBSD.org
State-Changed-When: Thu, 10 Nov 2016 21:37:46 +0000
State-Changed-Why:
I can repeat the problem using 6.1 isos using qemu 2.7.0 from pkgsrc, but not
any more with 7.0.1. Also on my test machine I can't repeat it any more
after NVMe driver fixes. So I think this problem is actually fixed by changes
in 7.x.
Please confirm if you still see this problem with recent kernels.


From: Andreas Gustafsson <gson@gson.org>
To: jdolecek@NetBSD.org
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/47030 (Installation stalls during set extraction)
Date: Mon, 14 Nov 2016 20:13:33 +0200

 jdolecek@NetBSD.org wrote:
 > I can repeat the problem using 6.1 isos using qemu 2.7.0 from pkgsrc, but not
 > any more with 7.0.1. Also on my test machine I can't repeat it any more
 > after NVMe driver fixes. So I think this problem is actually fixed by changes
 > in 7.x.
 > Please confirm if you still see this problem with recent kernels.

 I don't have an easy way of testing just a recent kernel with a system
 otherwise identical to the the one the bug was originally reported for.
 Instead, I'm testing full release builds from date-based checkouts of
 mainline sources, as that is what I have an automated setup for.

 I did some initial tests with NetBSD/sparc under qemu-current but they
 failed to produce a conclusive result because I kept getting "esp0:
 !TC on DATA XFER" errors, aka PR 46530.

 I then switched to NetBSD/i386, running under qemu-current with 32M
 virtual RAM and a 2G virtual disk.  After some bisections, it looks
 like the problems of installations stalling disappeared with the
 commit

  2012.11.20.20.17.43 jakllsch src/distrib/utils/sysinst/Attic/bsddisklabel.c 1.57

 This commit disabled WAPBL by default.  In other words, the problem
 appears to be triggered by installing on a file system with WAPBL
 enabled.

 To confirm this, I tested installing -current/i386 (source date
 2016.11.12.20.56.49) with following patch to re-enable WAPBL:

 Index: src/usr.sbin/sysinst/bsddisklabel.c
 ===================================================================
 RCS file: /bracket/repo/src/usr.sbin/sysinst/bsddisklabel.c,v
 retrieving revision 1.2
 diff -u -r1.2 bsddisklabel.c
 --- src/usr.sbin/sysinst/bsddisklabel.c	3 Aug 2014 16:09:38 -0000	1.2
 +++ src/usr.sbin/sysinst/bsddisklabel.c	14 Nov 2016 15:32:54 -0000
 @@ -133,8 +133,9 @@
  			asprintf(&buf, "%s", mountpt);
  		strlcpy(p->pi_mount, buf, sizeof p->pi_mount);
  		p->pi_flags |= PIF_MOUNT;
 -		/* Default to UFS2. */
 +		/* Default to logging, UFS2. */
  		if (p->pi_fstype == FS_BSDFFS) {
 +			p->pi_flags |= PIF_LOG;
  #ifdef DEFAULT_UFS2
  #ifndef HAVE_UFS2_BOOT
  			if (strcmp(mountpt, "/") != 0)

 and the installation once again stalled.

 So, my conclusion is that the bug is still there in the -current
 kernel, but is hidden by WAPBL no longer being enabled by default in
 sysinst.
 -- 
 Andreas Gustafsson, gson@gson.org

State-Changed-From-To: feedback->open
State-Changed-By: gson@NetBSD.org
State-Changed-When: Mon, 14 Nov 2016 18:17:10 +0000
State-Changed-Why:
Feedback provided.


From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@NetBSD.org
Cc: jdolecek@NetBSD.org
Subject: Re: kern/47030 (Installation stalls during set extraction)
Date: Tue, 13 Dec 2016 18:52:05 +0200

 This problem has recently started hitting the i386 and amd64 testbeds.

 Here are two recent log files where sysinst prints "- stalled -"
 during the installation of one of the distribution sets and then
 never recovers:

   http://releng.netbsd.org/b5reports/i386/build/2016.12.12.10.10.05/install.log

   http://releng.netbsd.org/b5reports/amd64/build/2016.12.13.02.05.48/install.log

 -- 
 Andreas Gustafsson, gson@gson.org

From: "Jaromir Dolecek" <jdolecek@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47030 CVS commit: src/sys/kern
Date: Mon, 10 Apr 2017 21:34:37 +0000

 Module Name:	src
 Committed By:	jdolecek
 Date:		Mon Apr 10 21:34:37 UTC 2017

 Modified Files:
 	src/sys/kern: vfs_wapbl.c

 Log Message:
 improve performance of journal writes by parallelizing the I/O - use 4 bufs
 by default, add sysctl vfs.wapbl.journal_iobufs to control it

 this also removes need to allocate iobuf during commit, so it
 might help to avoid deadlock during memory shortages like PR kern/47030


 To generate a diff of this commit:
 cvs rdiff -u -r1.94 -r1.95 src/sys/kern/vfs_wapbl.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2014 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.