NetBSD Problem Report #54591

From neitzel@hackett.marshlabs.gaertner.de  Tue Oct  1 20:35:12 2019
Return-Path: <neitzel@hackett.marshlabs.gaertner.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 2AF177A174
	for <gnats-bugs@gnats.NetBSD.org>; Tue,  1 Oct 2019 20:35:12 +0000 (UTC)
Message-Id: <20191001203502.D2F4D348AB@marshlabs-mx.gaertner.de>
Date: Tue,  1 Oct 2019 22:35:02 +0200 (CEST)
From: neitzel@hackett.marshlabs.gaertner.de
Reply-To: neitzel@hackett.marshlabs.gaertner.de
To: gnats-bugs@NetBSD.org
Subject: lvm drops volumes on initial start
X-Send-Pr-Version: 3.95

>Number:         54591
>Category:       bin
>Synopsis:       lvm drops volumes on initial start
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Oct 01 20:40:00 +0000 2019
>Last-Modified:  Wed Jan 01 21:30:01 +0000 2020
>Originator:     Martin Neitzel
>Release:        NetBSD 9.99.12 2019-09-21
>Organization:
Gaertner Datensysteme, Marshlabs
>Environment:
System: NetBSD eddie.marshlabs.gaertner.de 9.99.12 NetBSD 9.99.12 (GENERIC) #0: Fri Sep 27 01:08:12 CEST 2019 neitzel@eddie.marshlabs.gaertner.de:/scratch/obj/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:

Upon boot, /etc/rc.d/lvm fails to set up all logical volumes which have
been created.  Random entries in /dev/mapper/ are missing.

As a consequence, the missing filesystems cannot be mounted, and depending
on the missing filesystem, the boot may already abort in single user mode.
Recovery can be... tricky.


>How-To-Repeat:

During NetBSD installation, I defined a disklabel(8) partition /dev/rwd0e to
hold the space for an LVM physical volume:

	neitzel 6 > disklabel wd0
	[...]
	total sectors: 234441648
	[...]
	5 partitions:
	#        size    offset     fstype [fsize bsize cpg/sgs]
	 a:   4194304      2048     4.2BSD      0     0     0  # /
	 b:   2097152   4196416       swap                     # swap
	 c:  41943040      2048     unused      0     0        # NetBSD part.
	 d: 234441648         0     unused      0     0        # whole disk
	 e:  35651520   6293568      vinum                     # LVM PV

I created a simple volume group out of a single physical volume
and four logical volumes:

	# lvm pvcreate /dev/rwd0e
	# lvm vgcreate vg0 /dev/rwd0e

	# lvm lvcreate -L 4g -n src     vg0
	# lvm lvcreate -L 5g -n scratch vg0
	# lvm lvcreate -L 1g -n pkg     vg0
	# lvm lvcreate -L 2g -n local   vg0

I newfs'ed the filesystems on the volumes, prepared mount points and
made /etc/fstab entries as usual.  The "noauto" option and the "0" fsck
pass number are already part of the workaround:

	/dev/mapper/vg0-local   /usr/local      ffs     rw,noauto       0 0
	/dev/mapper/vg0-pkg     /usr/pkglv      ffs     rw,noauto       0 0
	/dev/mapper/vg0-scratch /scratch        ffs     rw,noauto       0 0
	/dev/mapper/vg0-src     /usr/src        ffs     rw,noauto       0 0

While you would typically boot with

	lvm=YES

in /etc/rc.conf, things get easier to repeat/debug/work around with lvm=NO
and running things manually.  Once in multi-user mode:

Pre-flight check:

	/root 5 # modstat | grep -w dm
	/root 6 # dmsetup table
	No devices found

First lvm start, bringing only three out of four volumes online:

	/root 7 # /etc/rc.d/lvm onestart
	Configuring lvm devices.
	 Activated Volume Groups: vg0

	/root 8 # modstat | grep -w dm
	dm                  driver   filesys  a        0   18432 dk_subr

	/root 9 # dmsetup table
	vg0-local: 0 4194304 linear /dev/wd0e 384
	vg0-pkg: 0 2097152 linear /dev/wd0e 4194688
	vg0-src: 0 8388608 linear /dev/wd0e 6291840
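
The table lines can be cross-checked against the lvcreate sizes.  A
minimal sketch, assuming the usual device-mapper convention that lengths
and offsets are counted in 512-byte sectors (2097152 sectors per GiB);
the function name dm_sizes is made up for illustration:

```sh
#!/bin/sh
# dm_sizes: read `dmsetup table` output on stdin and print, for each
# linear target, its size in GiB, its start offset on the PV and the
# offset right behind it.  Assumes 512-byte sectors.
dm_sizes() {
	awk '$4 == "linear" {
		# $1 = "name:", $3 = length, $6 = offset on the backing device
		printf "%s %g GiB at offset %d, next offset %d\n", \
		    $1, $3 / 2097152, $6, $6 + $3
	}'
}

# Usage on a live system:
#	dmsetup table | dm_sizes
```

For the vg0-local line this gives 2 GiB, matching "lvcreate -L 2g", and a
next offset of 4194688, which is exactly where vg0-pkg starts.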

	/root 10 # ls -l /dev/mapper
	total 0
	crw-rw----  1 root  operator  194, 0 Aug  7 00:12 control
	crw-r-----  1 root  operator  194, 1 Oct  1 21:34 rvg0-local
	crw-r-----  1 root  operator  194, 2 Oct  1 21:34 rvg0-pkg
	crw-r-----  1 root  operator  194, 3 Oct  1 21:34 rvg0-src
	brw-r-----  1 root  operator  169, 1 Oct  1 21:34 vg0-local
	brw-r-----  1 root  operator  169, 2 Oct  1 21:34 vg0-pkg
	brw-r-----  1 root  operator  169, 3 Oct  1 21:34 vg0-src

This time, the "scratch" volume was missing.  The "3 out of 4" count
seems to be fixed, but which LV is missing is random.
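
Since the missing LV changes from run to run, it helps to check for it
mechanically.  A minimal sketch, assuming the "vg-lv:" line format shown
in the dmsetup output above; missing_lvs is a hypothetical helper, not
part of the LVM tools:

```sh
#!/bin/sh
# missing_lvs: read `dmsetup table` output on stdin and print every
# expected device-mapper name (given as arguments) that is not listed.
missing_lvs() {
	table=$(cat)
	for name in "$@"; do
		# Each table line starts with "<name>: ".
		printf '%s\n' "$table" | grep -q "^$name: " || echo "$name"
	done
}

# Usage on a live system:
#	dmsetup table | missing_lvs vg0-local vg0-pkg vg0-scratch vg0-src
```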

Recover by restarting the LVM service.  In separate steps:

	/root 11 # /etc/rc.d/lvm onestop
	Unconfiguring lvm devices.
	  Shutting Down logical volume: vg0/local
	  Command failed with status code 5.
	  Shutting Down logical volume: vg0/pkg

Obviously the "stop" runs into inconsistent information, and a bit
of debris is left;  the "dm" kernel module stays loaded:

	/root 12 # modstat | grep -w dm
	dm                  driver   filesys  a        0   18432 dk_subr

	/root 13 # ls -l /dev/mapper
	total 0
	crw-rw----  1 root  operator  194, 0 Aug  7 00:12 control
	crw-r-----  1 root  operator  194, 1 Oct  1 21:34 rvg0-local
	brw-r-----  1 root  operator  169, 1 Oct  1 21:34 vg0-local

	/root 14 # dmsetup table
	vg0-local: 0 4194304 linear /dev/wd0e 384

A second start brings all four volumes online:

	/root 15 # /etc/rc.d/lvm onestart
	Configuring lvm devices.
	 Activated Volume Groups: vg0

	/root 16 # ls -l /dev/mapper
	total 0
	crw-rw----  1 root  operator  194, 0 Aug  7 00:12 control
	crw-r-----  1 root  operator  194, 1 Oct  1 21:34 rvg0-local
	crw-r-----  1 root  operator  194, 4 Oct  1 21:52 rvg0-pkg
	crw-r-----  1 root  operator  194, 6 Oct  1 21:52 rvg0-scratch
	crw-r-----  1 root  operator  194, 5 Oct  1 21:52 rvg0-src
	brw-r-----  1 root  operator  169, 1 Oct  1 21:34 vg0-local
	brw-r-----  1 root  operator  169, 4 Oct  1 21:52 vg0-pkg
	brw-r-----  1 root  operator  169, 6 Oct  1 21:52 vg0-scratch
	brw-r-----  1 root  operator  169, 5 Oct  1 21:52 vg0-src

	/root 17 # dmsetup table
	vg0-local: 0 4194304 linear /dev/wd0e 384
	vg0-pkg: 0 2097152 linear /dev/wd0e 4194688
	vg0-src: 0 8388608 linear /dev/wd0e 6291840
	vg0-scratch: 0 10485760 linear /dev/wd0e 14680448
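
The stop/start cycle above can be scripted as a stopgap.  A rough sketch
only: lvm_retry and all_nodes_present are hypothetical helpers, and the
MAPPER_DIR and LVM_RC variables exist merely so the logic can be tried
without touching a live system:

```sh
#!/bin/sh
# Hypothetical workaround helper, not part of NetBSD.
MAPPER_DIR=${MAPPER_DIR:-/dev/mapper}
LVM_RC=${LVM_RC:-/etc/rc.d/lvm}

# Succeed only if every named node exists below $MAPPER_DIR.
all_nodes_present() {
	for name in "$@"; do
		[ -e "$MAPPER_DIR/$name" ] || return 1
	done
}

# lvm_retry TRIES NAME...: restart LVM until all nodes exist or the
# attempts are used up.  The exit status tells whether all nodes exist.
lvm_retry() {
	tries=$1; shift
	while ! all_nodes_present "$@" && [ "$tries" -gt 0 ]; do
		"$LVM_RC" onestop
		"$LVM_RC" onestart
		tries=$((tries - 1))
	done
	all_nodes_present "$@"
}

# Usage on a live system:
#	lvm_retry 3 vg0-local vg0-pkg vg0-scratch vg0-src
```

In the run shown above a single restart was enough, even though the
"onestop" step reported an error for vg0/local.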

"mount -a"  and work (almost) as usual.



>Fix:

None known yet.  This may well be a category "kern" instead of "bin" bug.

Hey, I'm happy that I got this far to actually be able to load the
sources and still access them on next boot ;-)

It took me three or four installation attempts to get a 9.99.x
-current running at all, with the workarounds as described here.  In
earlier attempts, I tried to install parts of the base system into
LVs and went nuts because randomly different parts would be missing
upon reboot.  My first installation attempts were with GPT partitioning
and a GPT partition as LVM physical volume, then I reverted to MBR
partitioning, then I made sure nothing critical for a multi-user
login (such as /usr/pkg/bin/tcsh) resided on an LV.

Hence "Severity: serious" & "Priority: high".

>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/54591: lvm drops volumes on initial start
Date: Wed, 2 Oct 2019 08:43:01 -0000 (UTC)

 neitzel@hackett.marshlabs.gaertner.de writes:

 >During NetBSD installation, I defined a disklabel(8) partition /dev/rwd0e to
 >hold the space for an LVM physical volume:

 Rare, but possible. 

 >       /root 9 # dmsetup table
 >       vg0-local: 0 4194304 linear /dev/wd0e 384
 >       vg0-pkg: 0 2097152 linear /dev/wd0e 4194688
 >       vg0-src: 0 8388608 linear /dev/wd0e 6291840

 Weird, there is only one PV, one LVM label and it should define
 all volumes.


 >Revover by restarting the LVM service.  In separate steps:
 >       /root 11 # /etc/rc.d/lvm onestop
 >       Unconfiguring lvm devices.
 >         Shutting Down logical volume: vg0/local
 >         Command failed with status code 5.
 >         Shutting Down logical volume: vg0/pkg

 >Obviously the "stop" runs into inconsistent information, and a bit
 >of debris is left;  the "dm" kernel module stays loaded:

 The LVM tools usually report verbosely but /etc/rc.d/lvm redirects
 their output to /dev/null. You can manually shut down the volume group
 with

 lvm vgchange -a n vg0

 Add one or more -v to get tons of debug information.


 For starting LVM manually:

 - load the dm module with dmsetup
 - lvm pvscan            -> searches devices for LVM labels
 - lvm vgscan --mknodes  -> searches PV devices for volume groups and volumes
 - lvm vgchange -a y     -> generates the /dev/mapper links

 Again, with one or more -v you get debug information.
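
The manual start steps listed above can be collected into a small
script.  This is a sketch under assumptions: "dmsetup version" is used
here as the dmsetup call that gets the dm module loaded, which is a
guess at what "load the dm module with dmsetup" means in practice, and
the run and lvm_manual_start names are made up.  With DRY_RUN=1 the
commands are only printed.

```sh
#!/bin/sh
# run: execute a command, or just print it when DRY_RUN=1.
run() {
	if [ "${DRY_RUN:-0}" = 1 ]; then
		echo "$*"
	else
		"$@"
	fi
}

# lvm_manual_start [VFLAGS]: the manual start sequence, stopping at the
# first failing step.  VFLAGS is optional, e.g. "-v" or "-vvv".
lvm_manual_start() {
	vflags=${1:-}
	# Assumption: any dmsetup call that talks to the kernel loads dm.
	run dmsetup version || return 1
	# Search devices for LVM labels.
	run lvm pvscan $vflags || return 1
	# Search PV devices for volume groups and volumes.
	run lvm vgscan --mknodes $vflags || return 1
	# Activate the volumes, creating the /dev/mapper entries.
	run lvm vgchange -a y $vflags
}

# Usage:
#	DRY_RUN=1 lvm_manual_start -v	# print the commands only
#	lvm_manual_start -vvv		# run them with debug output
```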


 >This may well be a category "kern" instead of "bin" bug.  

 The kernel only handles the mapper tables. Everything else is
 in userland, in particular the list of volumes is just from
 reading and parsing the labels, the dm driver is not involved 
 until that information is converted into tables and loaded
 into the kernel.

 I have fixed some LVM bugs in my local tree, but I don't remember
 anything related. 

 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: neitzel@hackett.marshlabs.gaertner.de (Martin Neitzel)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: bin/54591: lvm drops volumes on initial start
Date: Wed,  2 Oct 2019 16:32:24 +0200 (CEST)

 Thanks for picking this up so quickly, Michael!

 > >During NetBSD installation, I defined a disklabel(8) partition /dev/rwd0e to
 > >hold the space for an LVM physical volume:
 >  
 > Rare, but possible.

 Yes, I have such setups successfully running since the vinum(4) demise
 with NetBSD versions up to and including NetBSD-8.1/8-stable as of now.

 > For starting LVM manually:
 > 
 > - load the dm module with dmsetup
 > - lvm pvscan            -> searches devices for LVM labels
 > - lvm vgscan --mknodes  -> searches PV devices for volume groups and volumes
 > - lvm vgchange -a y     -> generates the /dev/mapper links
 > 
 > Again, with one or more -v you get debug information.

 In the meantime, I also created an /etc/lvm/lvm.conf with

 	log {
 		level = 9
 		verbose = 3
 	}

 and get plenty of source-related messages logged.  I can see where
 the messages for the missing volume start to deviate from the
 messages for the successful volumes while doing the "vgchange -a
 y" step above.

 I should be able to $CC -g the lvm tools and give a much more
 detailed analysis within the next two weeks.

From: John Nemeth <jnemeth@cue.bc.ca>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        neitzel@hackett.marshlabs.gaertner.de
Cc: 
Subject: Re: bin/54591: lvm drops volumes on initial start
Date: Wed, 2 Oct 2019 15:13:36 -0700

 On Oct 2,  8:45am, Michael van Elst wrote:
 }
 } The following reply was made to PR bin/54591; it has been noted by GNATS.
 } 
 }  neitzel@hackett.marshlabs.gaertner.de writes:
 }  
 }  >During NetBSD installation, I defined a disklabel(8) partition /dev/rwd0e to
 }  >hold the space for an LVM physical volume:
 }  
 }  Rare, but possible. 

      I do this all the time to get disk space for domUs.  I have
 never had any problems, except when trying to rename LVs (I just
 don't do that any more).  The biggest issue is that GENERIC doesn't
 have, or didn't have, dm(4).  There is nothing inherent in LVM that
 indicates you should have multiple PVs, it's just another way of
 slicing up a disk.

 }-- End of excerpt from Michael van Elst

From: neitzel@hackett.marshlabs.gaertner.de (Martin Neitzel)
To: gnats-bugs@netbsd.org
Cc: mlelstv@serpens.de, netbsd-bugs@netbsd.org
Subject: Re: bin/54591: lvm drops volumes on initial start
Date: Wed,  1 Jan 2020 22:28:03 +0100 (CET)

 After upgrading from 9.99.12 to 9.99.31 (2019-12-30 cvs update),
 the issue appears to be fixed.  All volumes appear reliably upon
 my reboots into 9.99.31.  (I did about four reboots now.)

 I see that mlelstv@ applied some TLC on 2019-12-14 to
 /usr/src/external/gpl2/lvm2/dist/libdm/ioctl/libdm_netbsd.c
 (commitid: JEWsp8rnvx7AAEOB) so that seems to have fixed it.
 Thanks a lot, Michael!

 As far as I'm concerned, the PR could be closed.

 					Martin Neitzel
