NetBSD Problem Report #51148
From gson@gson.org Tue May 17 12:59:15 2016
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id 8BDFC7A3D6
for <gnats-bugs@gnats.NetBSD.org>; Tue, 17 May 2016 12:59:15 +0000 (UTC)
Message-Id: <20160517125908.A0A43744682@guava.gson.org>
Date: Tue, 17 May 2016 15:59:08 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: i386 install floppies no longer boot
X-Send-Pr-Version: 3.95
>Number: 51148
>Category: kern
>Synopsis: i386 install floppies no longer boot
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: maxv
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue May 17 13:00:00 +0000 2016
>Closed-Date: Thu May 26 13:19:15 +0000 2016
>Last-Modified: Mon Jul 25 12:15:01 +0000 2016
>Originator: Andreas Gustafsson
>Release: NetBSD-current, source date >= 2016.05.14.21.19.05
>Organization:
>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:
The automated i386 tests on babylon5 are failing at the install stage,
hanging immediately after the kernel load.
The problem started with one of the following commits:
2016.05.14.21.19.05 chs src/external/cddl/osnet/dev/dtrace/amd64/dtrace_isa.c 1.5
2016.05.14.21.19.05 chs src/external/cddl/osnet/dev/dtrace/i386/dtrace_asm.S 1.4
2016.05.14.21.19.05 chs src/external/cddl/osnet/dev/dtrace/i386/dtrace_isa.c 1.4
2016.05.14.21.19.05 chs src/external/cddl/osnet/dist/uts/common/dtrace/dtrace.c 1.31
2016.05.14.21.32.50 mlelstv src/doc/roadmaps/storage 1.14
2016.05.15.07.01.36 maxv src/sys/arch/amd64/amd64/locore.S 1.91
2016.05.15.07.01.36 maxv src/sys/arch/i386/i386/locore.S 1.123
2016.05.15.07.17.53 maxv src/sys/arch/amd64/amd64/locore.S 1.92
2016.05.15.07.17.53 maxv src/sys/arch/i386/i386/locore.S 1.124
2016.05.15.10.35.54 maxv src/sys/arch/amd64/amd64/machdep.c 1.217
2016.05.15.10.35.54 maxv src/sys/arch/i386/i386/machdep.c 1.755
2016.05.15.10.35.54 maxv src/sys/arch/x86/x86/pmap.c 1.195
2016.05.15.13.59.36 skrll src/sys/dev/ic/sl811hs.c 1.71
2016.05.15.14.00.08 skrll src/sys/dev/ic/sl811hs.c 1.72
2016.05.15.15.26.04 chs src/sys/arch/i386/include/asm.h 1.42
I can't auto-bisect it more closely because of build breakage.
For log data, see
http://releng.netbsd.org/b5reports/i386/commits-2016.05.html#2016.05.14.21.19.05
It looks like the problem only affects installing from floppies and
not installing from a CD, based on some separate tests I did on my
own testbed.
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Tue, 17 May 2016 16:10:17 +0200
> I can't auto-bisect it more closely because of build breakage.
It is this commit:
2016.05.15.07.17.53 maxv src/sys/arch/i386/i386/locore.S 1.124
--
J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 18 May 2016 14:12:57 +0900
On Tue, May 17, 2016 at 10:00 PM, Andreas Gustafsson <gson@gson.org> wrote:
>>Number: 51148
>>Category: kern
>>Synopsis: i386 install floppies no longer boot
>>Confidential: no
>>Severity: serious
>>Priority: high
>>Responsible: kern-bug-people
>>State: open
>>Class: sw-bug
>>Submitter-Id: net
>>Arrival-Date: Tue May 17 13:00:00 +0000 2016
>>Originator: Andreas Gustafsson
>>Release: NetBSD-current, source date >= 2016.05.14.21.19.05
>>Organization:
>
>>Environment:
> System: NetBSD
> Architecture: i386
> Machine: i386
>>Description:
>
> The automated i386 tests on babylon5 are failing at the install stage,
> hanging immediately after the kernel load.
This problem should be fixed though, I think this may be an opportunity
to switch test frequency between i386 and amd64 because nowadays amd64
has more popularity than i386. How about the idea?
Thanks,
ozaki-r
From: Ryota Ozaki <ozaki-r@netbsd.org>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 18 May 2016 18:34:21 +0900
On Wed, May 18, 2016 at 2:12 PM, Ryota Ozaki <ozaki-r@netbsd.org> wrote:
> On Tue, May 17, 2016 at 10:00 PM, Andreas Gustafsson <gson@gson.org> wrote:
>>>Number: 51148
>>>Category: kern
>>>Synopsis: i386 install floppies no longer boot
>>>Confidential: no
>>>Severity: serious
>>>Priority: high
>>>Responsible: kern-bug-people
>>>State: open
>>>Class: sw-bug
>>>Submitter-Id: net
>>>Arrival-Date: Tue May 17 13:00:00 +0000 2016
>>>Originator: Andreas Gustafsson
>>>Release: NetBSD-current, source date >= 2016.05.14.21.19.05
>>>Organization:
>>
>>>Environment:
>> System: NetBSD
>> Architecture: i386
>> Machine: i386
>>>Description:
>>
>> The automated i386 tests on babylon5 are failing at the install stage,
>> hanging immediately after the kernel load.
>
> This problem should be fixed though,
Oops. I meant "We should fix the problem anyway though, ..."
ozaki-r
> I think this may be an opportunity
> to switch test frequency between i386 and amd64 because nowadays amd64
> has more popularity than i386. How about the idea?
>
> Thanks,
> ozaki-r
From: Andreas Gustafsson <gson@gson.org>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 18 May 2016 16:36:31 +0300
Ryota Ozaki wrote:
> I think this may be an opportunity to switch test frequency between
> i386 and amd64 because nowadays amd64 has more popularity than
> i386. How about the idea?
One argument for using a less popular port as the primary test target
is that it will help catch portability bugs that developers miss when
testing their changes on the popular port before committing. In any
case, this discussion probably belongs on current-users rather than
in this PR.
--
Andreas Gustafsson, gson@gson.org
Responsible-Changed-From-To: kern-bug-people->maxv
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Thu, 26 May 2016 06:24:53 +0000
Responsible-Changed-Why:
Problem started with maxv's commit of src/sys/arch/i386/i386/locore.S 1.124
From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/51148 CVS commit: src/sys/arch
Date: Thu, 26 May 2016 07:24:55 +0000
Module Name: src
Committed By: maxv
Date: Thu May 26 07:24:55 UTC 2016
Modified Files:
src/sys/arch/amd64/amd64: locore.S
src/sys/arch/i386/i386: locore.S
Log Message:
There is an issue in the way the fillkpt macro sets up pages on both
amd64 and i386.
The fillkpt loop is equivalent to the following:
    do {
        /* fill in the slot */
        /* increment %ebx to the next slot */
        /* increment %eax to the next pa */
    } while (%ecx > 0)
The issue here is that if %ecx = 0 (i.e., the chunk we are trying to
map is zero-sized), there is still one entry created in the page table.
The kernel expects the va<->pa translation to be linear in low memory.
If there is a zero-sized chunk, the dead entry creates a +4096 offset in
the virtual space, with two consecutive entries that point to the same
physical address. In other words, the mappings are not linear anymore,
which causes the kernel to die.
Before my recent changes, there were only two big chunks that were
mapped, and neither of these could be zero-sized. Now, with multiple,
fine-grained chunks, it is possible that the [SYMS]+[PRELOADED_MODULES]
chunk could be zero-sized.
[PRELOADED_MODULES] is almost never present, and [SYMS] is always present
on default kernels, except on floppies, where the bootloader does not load
[SYMS].
Should fix PR 51148.
To generate a diff of this commit:
cvs rdiff -u -r1.93 -r1.94 src/sys/arch/amd64/amd64/locore.S
cvs rdiff -u -r1.124 -r1.125 src/sys/arch/i386/i386/locore.S
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Thu, 26 May 2016 13:19:15 +0000
State-Changed-Why:
Verified fixed - the kernel now starts. There is still a problem with execing init,
but that appears to be unrelated.
From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: Maxime Villard <max@m00nbsd.net>
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Tue, 31 May 2016 04:32:32 +0000
This whole subthread didn't get sent to gnats.
------
From: Maxime Villard <max@m00nbsd.net>
To: Andreas Gustafsson <gson@gson.org>, netbsd-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 25 May 2016 13:06:09 +0200
First of all, I'm not subscribed to netbsd-bugs@, so please forward your mails
to me.
I have carefully investigated the mappings on amd64 and i386 with a kernel page
explorer I wrote, and there is no issue. The levels are all linear, with no holes
in the middle, they are correctly linked, and they cover the whole kernel image,
preloaded modules and bootstrap tables.
In fact, there appears to be one bug in the L1 slot that should normally point
to the first page of the data segment: it seems to be destroyed. But this issue
was already here before my changes, so I didn't introduce it.
The changes from me you mentioned are all trivial, and it seems highly unlikely
to me that they cause the install failure. Normally, if there were a bug, it
should have been in the previous commits. Also, my changes are in no way
install-related, and as far as I know, the mappings are the same on
CD/USB/floppy/whatever.
My guess, right now, is that my alignment changes in kern.ldscript somehow
trigger the aforementioned L1 slot bug on floppy installs.
I don't have a floppy device, and right now my NetBSD resources are limited. The
only thing I can do is ask questions.
Is the problem still present? (I don't see new entries in the log)
We are talking about GENERIC, and not GENERIC-PAE, right?
Does reverting only [1] fix the problem?
What if you put 'fillkpt' instead of 'fillkpt_nox' in [1]?
Thanks.
[1] 2016.05.15.07.17.53 maxv src/sys/arch/i386/i386/locore.S 1.124
From: Andreas Gustafsson <gson@gson.org>
To: Maxime Villard <max@m00nbsd.net>
Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 25 May 2016 15:17:10 +0300
Maxime,
You wrote:
> First of all, I'm not subscribed to netbsd-bugs@, so please forward your mails
> to me.
Will do. I would have mailed you about the initial report if you had
been the only developer to commit during the period of build breakage
when the problem appeared, but there were commits by four developers,
and no easy way for me to determine which of them was at fault.
> I have carefully investigated the mappings on amd64 and i386 with a kernel page
> explorer I wrote, and there is no issue. The levels are all linear, with no holes
> in the middle, they are correctly linked, and they cover the whole kernel image,
> preloaded modules and bootstrap tables.
>
> In fact, there appears to be one bug in the L1 slot that should normally point
> to the first page of the data segment: it seems to be destroyed. But this issue
> was already here before my changes, so I didn't introduce it.
>
> The changes from me you mentioned are all trivial, and it seems highly unlikely
> to me that they cause the install failure. Normally, if there were a bug, it
> should have been in the previous commits. Also, my changes are in no way
> install-related, and as far as I know, the mappings are the same on
> CD/USB/floppy/whatever.
>
> My guess, right now, is that my alignment changes in kern.ldscript somehow
> trigger the aforementioned L1 slot bug on floppy installs.
>
> I don't have a floppy device, and right now my NetBSD resources are limited.
If you can run misc/py-anita from pkgsrc against an i386 release
build, that should reproduce the problem without the need for a
physical floppy device or even a NetBSD host.
> The
> only thing I can do is asking.
>
> Is the problem still present? (I don't see new entries in the log)
Yes, the problem is still present. I'm not sure what you mean about
not seeing new entries; the newest test runs are from today, and still
failing with the same error:
http://releng.netbsd.org/b5reports/i386/commits-2016.05.html#2016.05.25.10.15.01
> We are talking about GENERIC, and not GENERIC-PAE, right?
Yes.
> Does reverting only [1] fix the problem?
I will try that and report back.
> What if you put 'fillkpt' instead of 'fillkpt_nox' in [1]?
I will try that, too.
> Thanks.
>
> [1] 2016.05.15.07.17.53 maxv src/sys/arch/i386/i386/locore.S 1.124
--
Andreas Gustafsson, gson@gson.org
From: Maxime Villard <max@m00nbsd.net>
To: Andreas Gustafsson <gson@gson.org>
Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 25 May 2016 16:30:58 +0200
On 25/05/2016 at 14:17, Andreas Gustafsson wrote:
> [...]
>>
>> I don't have a floppy device, and right now my NetBSD resources are limited.
>
> If you can run misc/py-anita from pkgsrc against an i386 release
> build, that should reproduce the problem without the need for a
> physical floppy device or even a NetBSD host.
>
I would be happy to do the tests myself. But the only i386 machine I have right
now is a VirtualBox VM, and there is PR 51134 that reboots the machine every ~5
minutes. I can do almost nothing on it.
From: Christos Zoulas <christos@zoulas.com>
To: Maxime Villard <max@m00nbsd.net>, Andreas Gustafsson <gson@gson.org>
Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 25 May 2016 14:17:20 -0400
On May 25, 4:30pm, max@m00nbsd.net (Maxime Villard) wrote:
-- Subject: Re: kern/51148: i386 install floppies no longer boot
| On 25/05/2016 at 14:17, Andreas Gustafsson wrote:
| > [...]
| >>
| >> I don't have a floppy device, and right now my NetBSD resources are limited.
| >
| > If you can run misc/py-anita from pkgsrc against an i386 release
| > build, that should reproduce the problem without the need for a
| > physical floppy device or even a NetBSD host.
| >
|
| I would be happy to do the tests myself. But the only i386 machine I have right
| now is a VirtualBox VM, and there is PR 51134 that reboots the machine every ~5
| minutes. I can do almost nothing on it.
I am fixing that.
christos
From: Andreas Gustafsson <gson@gson.org>
To: Maxime Villard <max@m00nbsd.net>
Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 25 May 2016 20:06:11 +0300
Maxime,
I have now run the tests you asked for.
> Does reverting only [1] fix the problem?
Yes. The system still doesn't install because the kernel is unable
to exec /sbin/init, but this is a different bug; when I don't revert
[1], the kernel does not even start (there are no kernel messages
on the console).
> What if you put 'fillkpt' instead of 'fillkpt_nox' in [1]?
I tested with this patch against 2016.05.22.09.10.37 sources:
diff -u -r1.124 locore.S
--- locore.S 15 May 2016 07:17:53 -0000 1.124
+++ locore.S 25 May 2016 14:33:35 -0000
@@ -731,7 +731,7 @@
movl RELOC(tablesize),%ecx /* length of BOOTSTRAP TABLES */
shrl $PGSHIFT,%ecx
orl $(PG_V|PG_KW),%eax
- fillkpt_nox
+ fillkpt
/* We are on (4). Map ISA I/O mem (later atdevbase) RWX. */
movl $(IOM_BEGIN|PG_V|PG_KW/*|PG_N*/),%eax
and it did _not_ fix the problem.
Later, you wrote:
> I would be happy to do the tests myself. But the only i386 machine I have right
> now is a VirtualBox VM, and there is PR 51134 that reboots the machine every ~5
> minutes. I can do almost nothing on it.
What do you host VirtualBox on? You can test the i386 port using anita+qemu
even on a non-i386 host.
--
Andreas Gustafsson, gson@gson.org
From: Maxime Villard <max@m00nbsd.net>
To: Andreas Gustafsson <gson@gson.org>
Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Thu, 26 May 2016 09:33:52 +0200
I've committed a patch. Please let me know whether it fixes the issue.
From: Maxime Villard <max@m00nbsd.net>
To: Andreas Gustafsson <gson@gson.org>
Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Thu, 26 May 2016 09:00:44 +0200
On 25/05/2016 at 19:06, Andreas Gustafsson wrote:
> Maxime,
>
> I have now run the tests you asked for.
>
>> Does reverting only [1] fix the problem?
>
> Yes. The system still doesn't install because the kernel is unable
> to exec /sbin/init, but this is a different bug; when I don't revert
> [1], the kernel does not even start (there are no kernel messages
> on the console).
>
>> What if you put 'fillkpt' instead of 'fillkpt_nox' in [1]?
>
> I tested with this patch against 2016.05.22.09.10.37 sources:
>
> diff -u -r1.124 locore.S
> --- locore.S 15 May 2016 07:17:53 -0000 1.124
> +++ locore.S 25 May 2016 14:33:35 -0000
> @@ -731,7 +731,7 @@
> movl RELOC(tablesize),%ecx /* length of BOOTSTRAP TABLES */
> shrl $PGSHIFT,%ecx
> orl $(PG_V|PG_KW),%eax
> - fillkpt_nox
> + fillkpt
>
> /* We are on (4). Map ISA I/O mem (later atdevbase) RWX. */
> movl $(IOM_BEGIN|PG_V|PG_KW/*|PG_N*/),%eax
>
> and it did _not_ fix the problem.
Thanks for the tests. I see where the problem is, and I'll commit a patch
soon.
>
> Later, you wrote:
>
>> I would be happy to do the tests myself. But the only i386 machine I have right
>> now is a VirtualBox VM, and there is PR 51134 that reboots the machine every ~5
>> minutes. I can do almost nothing on it.
>
> What do you host VirtualBox on? You can test the i386 port using anita+qemu
> even on a non-i386 host.
>
I'll answer in the other PR.
From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/51148 CVS commit: src/sys/arch/x86
Date: Mon, 25 Jul 2016 12:11:40 +0000
Module Name: src
Committed By: maxv
Date: Mon Jul 25 12:11:40 UTC 2016
Modified Files:
src/sys/arch/x86/include: pmap.h
src/sys/arch/x86/x86: lapic.c pmap.c
Log Message:
The L1 entry of the first page of the data segment is overwritten for the
LAPIC page, and set as RWX+PG_N. The LAPIC pa is fixed, and its va resides
in the data segment. Because of this error-prone design, the kernel image
map is not linear, and I first thought it was a bug (as I vaguely said in
PR/51148). Using large pages for the data segment is therefore wrong, since
the first page does not actually belong to the data segment (even if its va
is in the range). This bug is not triggered currently, since local_apic is
not large-page-aligned.
We will certainly have to allocate a va dynamically instead of using the
first page of data; but for now, disable large pages on the data segment,
and map the LAPIC as RW.
This is the last x86-specific RWX page.
To generate a diff of this commit:
cvs rdiff -u -r1.58 -r1.59 src/sys/arch/x86/include/pmap.h
cvs rdiff -u -r1.51 -r1.52 src/sys/arch/x86/x86/lapic.c
cvs rdiff -u -r1.216 -r1.217 src/sys/arch/x86/x86/pmap.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.