NetBSD Problem Report #54591
From neitzel@hackett.marshlabs.gaertner.de Tue Oct 1 20:35:12 2019
Return-Path: <neitzel@hackett.marshlabs.gaertner.de>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 2AF177A174
for <gnats-bugs@gnats.NetBSD.org>; Tue, 1 Oct 2019 20:35:12 +0000 (UTC)
Message-Id: <20191001203502.D2F4D348AB@marshlabs-mx.gaertner.de>
Date: Tue, 1 Oct 2019 22:35:02 +0200 (CEST)
From: neitzel@hackett.marshlabs.gaertner.de
Reply-To: neitzel@hackett.marshlabs.gaertner.de
To: gnats-bugs@NetBSD.org
Subject: lvm drops volumes on initial start
X-Send-Pr-Version: 3.95
>Number: 54591
>Category: bin
>Synopsis: lvm drops volumes on initial start
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Oct 01 20:40:00 +0000 2019
>Last-Modified: Wed Jan 01 21:30:01 +0000 2020
>Originator: Martin Neitzel
>Release: NetBSD 9.99.12 2019-09-21
>Organization:
Gaertner Datensysteme, Marshlabs
>Environment:
System: NetBSD eddie.marshlabs.gaertner.de 9.99.12 NetBSD 9.99.12 (GENERIC) #0: Fri Sep 27 01:08:12 CEST 2019 neitzel@eddie.marshlabs.gaertner.de:/scratch/obj/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:
Upon boot, /etc/rc.d/lvm fails to set up all logical volumes which have
been created. Random entries in /dev/mapper/ are missing.
As a consequence, the missing filesystems cannot be mounted, and depending
on the missing filesystem, the boot may already abort in single user mode.
Recovery can be... tricky.
>How-To-Repeat:
During NetBSD installation, I defined a disklabel(8) partition /dev/rwd0e to
hold the space for an LVM physical volume:
neitzel 6 > disklabel wd0
[...]
total sectors: 234441648
[...]
5 partitions:
# size offset fstype [fsize bsize cpg/sgs]
a: 4194304 2048 4.2BSD 0 0 0 # /
b: 2097152 4196416 swap # swap
c: 41943040 2048 unused 0 0 # NetBSD part.
d: 234441648 0 unused 0 0 # whole disk
e: 35651520 6293568 vinum # LVM PV
I created a simple volume group out of a single physical volume
and four logical volumes:
# lvm pvcreate /dev/rwd0e
# lvm vgcreate vg0 /dev/rwd0e
# lvm lvcreate -L 4g -n src vg0
# lvm lvcreate -L 5g -n scratch vg0
# lvm lvcreate -L 1g -n pkg vg0
# lvm lvcreate -L 2g -n local vg0
I newfs'ed the filesystems on the volumes, prepared mount points and
made /etc/fstab entries as usual. The "noauto" option and the no-fsck
"0" fields are already part of the workaround:
/dev/mapper/vg0-local /usr/local ffs rw,noauto 0 0
/dev/mapper/vg0-pkg /usr/pkglv ffs rw,noauto 0 0
/dev/mapper/vg0-scratch /scratch ffs rw,noauto 0 0
/dev/mapper/vg0-src /usr/src ffs rw,noauto 0 0
While you would typically boot with
lvm=YES
in /etc/rc.conf, things get easier to repeat/debug/work around with lvm=NO
and running things manually. Once in multi-user mode:
pre-flight check:
/root 5 # modstat | grep -w dm
/root 6 # dmsetup table
No devices found
First lvm start, bringing only three out of four volumes on:
/root 7 # /etc/rc.d/lvm onestart
Configuring lvm devices.
Activated Volume Groups: vg0
/root 8 # modstat | grep -w dm
dm driver filesys a 0 18432 dk_subr
/root 9 # dmsetup table
vg0-local: 0 4194304 linear /dev/wd0e 384
vg0-pkg: 0 2097152 linear /dev/wd0e 4194688
vg0-src: 0 8388608 linear /dev/wd0e 6291840
/root 10 # ls -l /dev/mapper
total 0
crw-rw---- 1 root operator 194, 0 Aug 7 00:12 control
crw-r----- 1 root operator 194, 1 Oct 1 21:34 rvg0-local
crw-r----- 1 root operator 194, 2 Oct 1 21:34 rvg0-pkg
crw-r----- 1 root operator 194, 3 Oct 1 21:34 rvg0-src
brw-r----- 1 root operator 169, 1 Oct 1 21:34 vg0-local
brw-r----- 1 root operator 169, 2 Oct 1 21:34 vg0-pkg
brw-r----- 1 root operator 169, 3 Oct 1 21:34 vg0-src
This time, the "scratch" volume was missing. The "3 out of 4" count
seems to be consistent, but which LV goes missing is random.
Recover by restarting the LVM service. In separate steps:
/root 11 # /etc/rc.d/lvm onestop
Unconfiguring lvm devices.
Shutting Down logical volume: vg0/local
Command failed with status code 5.
Shutting Down logical volume: vg0/pkg
Obviously the "stop" runs into inconsistent information, and a bit
of debris is left; the "dm" kernel module stays loaded:
/root 12 # modstat | grep -w dm
dm driver filesys a 0 18432 dk_subr
/root 13 # ls -l /dev/mapper
total 0
crw-rw---- 1 root operator 194, 0 Aug 7 00:12 control
crw-r----- 1 root operator 194, 1 Oct 1 21:34 rvg0-local
brw-r----- 1 root operator 169, 1 Oct 1 21:34 vg0-local
/root 14 # dmsetup table
vg0-local: 0 4194304 linear /dev/wd0e 384
A second start brings all four volumes online:
/root 15 # /etc/rc.d/lvm onestart
Configuring lvm devices.
Activated Volume Groups: vg0
/root 16 # ls -l /dev/mapper
total 0
crw-rw---- 1 root operator 194, 0 Aug 7 00:12 control
crw-r----- 1 root operator 194, 1 Oct 1 21:34 rvg0-local
crw-r----- 1 root operator 194, 4 Oct 1 21:52 rvg0-pkg
crw-r----- 1 root operator 194, 6 Oct 1 21:52 rvg0-scratch
crw-r----- 1 root operator 194, 5 Oct 1 21:52 rvg0-src
brw-r----- 1 root operator 169, 1 Oct 1 21:34 vg0-local
brw-r----- 1 root operator 169, 4 Oct 1 21:52 vg0-pkg
brw-r----- 1 root operator 169, 6 Oct 1 21:52 vg0-scratch
brw-r----- 1 root operator 169, 5 Oct 1 21:52 vg0-src
/root 17 # dmsetup table
vg0-local: 0 4194304 linear /dev/wd0e 384
vg0-pkg: 0 2097152 linear /dev/wd0e 4194688
vg0-src: 0 8388608 linear /dev/wd0e 6291840
vg0-scratch: 0 10485760 linear /dev/wd0e 14680448
"mount -a" and work (almost) as usual.
>Fix:
None known yet. This may well be a category "kern" instead of "bin" bug.
Hey, I'm happy that I got this far to actually be able to load the
sources and still access them on next boot ;-)
It took me three or four installation attempts to get a 9.99.x
-current running at all, with the workarounds as described here. In
earlier attempts, I tried to install parts of the base system into
LVs and went nuts because randomly different parts would be missing
upon reboot. My first installation attempts were with GPT partitioning
and a GPT partition as LVM physical volume, then I reverted to MBR
partitioning, then I made sure nothing critical for a multi-user
login (such as /usr/pkg/bin/tcsh) resided on an LV.
Hence "Severity: serious" & "Priority: high".
>Audit-Trail:
From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/54591: lvm drops volumes on initial start
Date: Wed, 2 Oct 2019 08:43:01 -0000 (UTC)
neitzel@hackett.marshlabs.gaertner.de writes:
>During NetBSD installation, I defined a disklabel(8) partition /dev/rwd0e to
>hold the space for an LVM physical volume:
Rare, but possible.
> /root 9 # dmsetup table
> vg0-local: 0 4194304 linear /dev/wd0e 384
> vg0-pkg: 0 2097152 linear /dev/wd0e 4194688
> vg0-src: 0 8388608 linear /dev/wd0e 6291840
Weird, there is only one PV, one LVM label and it should define
all volumes.
>Recover by restarting the LVM service. In separate steps:
> /root 11 # /etc/rc.d/lvm onestop
> Unconfiguring lvm devices.
> Shutting Down logical volume: vg0/local
> Command failed with status code 5.
> Shutting Down logical volume: vg0/pkg
>Obviously the "stop" runs into inconsistent information, and a bit
>of debris is left; the "dm" kernel module stays loaded:
The LVM tools usually report verbosely but /etc/rc.d/lvm redirects
their output to /dev/null. You can manually shut down the volume group
with
lvm vgchange -a n vg0
add one or more -v to get tons of debug information.
For starting LVM manually:
- load the dm module with dmsetup
- lvm pvscan -> searches devices for LVM labels
- lvm vgscan --mknodes -> searches PV devices for volume groups and volumes
- lvm vgchange -a y -> generates the /dev/mapper links
Again, with one or more -v you get debug information.
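The steps above can be sketched as a small script (an untested sketch: it assumes the dm kernel module is loadable via modload, that root privileges are available, and the vg0 volume group from this PR; as noted above, any dmsetup invocation should also autoload the module):

```shell
#!/bin/sh
# Manual LVM activation, following the four steps listed above.

modload dm                  # load the device-mapper kernel module
lvm pvscan -v               # search devices for LVM labels
lvm vgscan -v --mknodes     # search PVs for volume groups and volumes
lvm vgchange -v -a y        # activate VGs, generate the /dev/mapper links

dmsetup table               # verify that all logical volumes came up
```

Adding further -v flags to each lvm invocation yields progressively more debug output.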
>This may well be a category "kern" instead of "bin" bug.
The kernel only handles the mapper tables. Everything else is
in userland, in particular the list of volumes is just from
reading and parsing the labels, the dm driver is not involved
until that information is converted into tables and loaded
into the kernel.
I have fixed some LVM bugs in my local tree, but I don't remember
anything related.
--
Michael van Elst
Internet: mlelstv@serpens.de
"A potential Snark may lurk in every tree."
From: neitzel@hackett.marshlabs.gaertner.de (Martin Neitzel)
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: bin/54591: lvm drops volumes on initial start
Date: Wed, 2 Oct 2019 16:32:24 +0200 (CEST)
Thanks for picking this up so quickly, Michael!
> >During NetBSD installation, I defined a disklabel(8) partition /dev/rwd0e to
> >hold the space for an LVM physical volume:
>
> Rare, but possible.
Yes, I have had such setups running successfully since the vinum(4) demise,
with NetBSD versions up to and including NetBSD-8.1/8-stable as of now.
> For starting LVM manually:
>
> - load the dm module with dmsetup
> - lvm pvscan -> searches devices for LVM labels
> - lvm vgscan --mknodes -> searches PV devices for volume groups and volumes
> - lvm vgchange -a y -> generates the /dev/mapper links
>
> Again, with one or more -v you get debug information.
In the meantime, I also created an /etc/lvm/lvm.conf with
log {
level = 9
verbose = 3
}
and get plenty of source-related messages logged. I can see where
the messages for the missing volume start to deviate from the
messages for the successful volumes while doing the "vgchange -a
y" step above.
I should be able to $CC -g the lvm tools and give a much more
detailed analysis within the next two weeks.
From: John Nemeth <jnemeth@cue.bc.ca>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
neitzel@hackett.marshlabs.gaertner.de
Cc:
Subject: Re: bin/54591: lvm drops volumes on initial start
Date: Wed, 2 Oct 2019 15:13:36 -0700
On Oct 2, 8:45am, Michael van Elst wrote:
}
} The following reply was made to PR bin/54591; it has been noted by GNATS.
}
} neitzel@hackett.marshlabs.gaertner.de writes:
}
} >During NetBSD installation, I defined a disklabel(8) partition /dev/rwd0e to
} >hold the space for an LVM physical volume:
}
} Rare, but possible.
I do this all the time to get disk space for domUs. I have
never had any problems, except when trying to rename LVs (I just
don't do that any more). The biggest issue is that GENERIC doesn't
have, or didn't have, dm(4). There is nothing inherent in LVM that
indicates you should have multiple PVs, it's just another way of
slicing up a disk.
}-- End of excerpt from Michael van Elst
From: neitzel@hackett.marshlabs.gaertner.de (Martin Neitzel)
To: gnats-bugs@netbsd.org
Cc: mlelstv@serpens.de, netbsd-bugs@netbsd.org
Subject: Re: bin/54591: lvm drops volumes on initial start
Date: Wed, 1 Jan 2020 22:28:03 +0100 (CET)
After upgrading from 9.99.12 to 9.99.31 (2019-12-30 cvs update),
the issue appears to be fixed. All volumes appear reliably upon
my reboots into 9.99.31. (I have done about four reboots now.)
I see that mlelstv@ applied some TLC on 2019-12-14 to
/usr/src/external/gpl2/lvm2/dist/libdm/ioctl/libdm_netbsd.c
(commitid: JEWsp8rnvx7AAEOB) so that seems to have fixed it.
Thanks a lot, Michael!
As far as I'm concerned, the PR could be closed.
Martin Neitzel