NetBSD Problem Report #51148

From gson@gson.org  Tue May 17 12:59:15 2016
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 8BDFC7A3D6
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 17 May 2016 12:59:15 +0000 (UTC)
Message-Id: <20160517125908.A0A43744682@guava.gson.org>
Date: Tue, 17 May 2016 15:59:08 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: i386 install floppies no longer boot
X-Send-Pr-Version: 3.95

>Number:         51148
>Category:       kern
>Synopsis:       i386 install floppies no longer boot
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    maxv
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue May 17 13:00:00 +0000 2016
>Closed-Date:    Thu May 26 13:19:15 +0000 2016
>Last-Modified:  Mon Jul 25 12:15:01 +0000 2016
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date >= 2016.05.14.21.19.05
>Organization:

>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

The automated i386 tests on babylon5 are failing at the install stage,
hanging immediately after the kernel load.

The problem started with one of the following commits:

  2016.05.14.21.19.05 chs src/external/cddl/osnet/dev/dtrace/amd64/dtrace_isa.c 1.5
  2016.05.14.21.19.05 chs src/external/cddl/osnet/dev/dtrace/i386/dtrace_asm.S 1.4
  2016.05.14.21.19.05 chs src/external/cddl/osnet/dev/dtrace/i386/dtrace_isa.c 1.4
  2016.05.14.21.19.05 chs src/external/cddl/osnet/dist/uts/common/dtrace/dtrace.c 1.31
  2016.05.14.21.32.50 mlelstv src/doc/roadmaps/storage 1.14
  2016.05.15.07.01.36 maxv src/sys/arch/amd64/amd64/locore.S 1.91
  2016.05.15.07.01.36 maxv src/sys/arch/i386/i386/locore.S 1.123
  2016.05.15.07.17.53 maxv src/sys/arch/amd64/amd64/locore.S 1.92
  2016.05.15.07.17.53 maxv src/sys/arch/i386/i386/locore.S 1.124
  2016.05.15.10.35.54 maxv src/sys/arch/amd64/amd64/machdep.c 1.217
  2016.05.15.10.35.54 maxv src/sys/arch/i386/i386/machdep.c 1.755
  2016.05.15.10.35.54 maxv src/sys/arch/x86/x86/pmap.c 1.195
  2016.05.15.13.59.36 skrll src/sys/dev/ic/sl811hs.c 1.71
  2016.05.15.14.00.08 skrll src/sys/dev/ic/sl811hs.c 1.72
  2016.05.15.15.26.04 chs src/sys/arch/i386/include/asm.h 1.42

I can't auto-bisect it more closely because of build breakage.
For log data, see

  http://releng.netbsd.org/b5reports/i386/commits-2016.05.html#2016.05.14.21.19.05

It looks like the problem only affects installing from floppies and
not installing from a CD, based on some separate tests I did on my
own testbed.

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:
From: "J. Hannken-Illjes" <hannken@eis.cs.tu-bs.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Tue, 17 May 2016 16:10:17 +0200

 > I can't auto-bisect it more closely because of build breakage.

 It is this commit:

 	2016.05.15.07.17.53 maxv src/sys/arch/i386/i386/locore.S 1.124

 --
 J. Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)

From: Ryota Ozaki <ozaki-r@netbsd.org>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 18 May 2016 14:12:57 +0900

 On Tue, May 17, 2016 at 10:00 PM, Andreas Gustafsson <gson@gson.org> wrote:
 >>Number:         51148
 >>Category:       kern
 >>Synopsis:       i386 install floppies no longer boot
 >>Confidential:   no
 >>Severity:       serious
 >>Priority:       high
 >>Responsible:    kern-bug-people
 >>State:          open
 >>Class:          sw-bug
 >>Submitter-Id:   net
 >>Arrival-Date:   Tue May 17 13:00:00 +0000 2016
 >>Originator:     Andreas Gustafsson
 >>Release:        NetBSD-current, source date >= 2016.05.14.21.19.05
 >>Organization:
 >
 >>Environment:
 > System: NetBSD
 > Architecture: i386
 > Machine: i386
 >>Description:
 >
 > The automated i386 tests on babylon5 are failing at the install stage,
 > hanging immediately after the kernel load.

 This problem should be fixed though, I think this may be an opportunity
 to switch test frequency between i386 and amd64 because nowadays amd64
 has more popularity than i386. How about the idea?

 Thanks,
   ozaki-r

From: Ryota Ozaki <ozaki-r@netbsd.org>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 18 May 2016 18:34:21 +0900

 On Wed, May 18, 2016 at 2:12 PM, Ryota Ozaki <ozaki-r@netbsd.org> wrote:
 > On Tue, May 17, 2016 at 10:00 PM, Andreas Gustafsson <gson@gson.org> wrote:
 >>>Number:         51148
 >>>Category:       kern
 >>>Synopsis:       i386 install floppies no longer boot
 >>>Confidential:   no
 >>>Severity:       serious
 >>>Priority:       high
 >>>Responsible:    kern-bug-people
 >>>State:          open
 >>>Class:          sw-bug
 >>>Submitter-Id:   net
 >>>Arrival-Date:   Tue May 17 13:00:00 +0000 2016
 >>>Originator:     Andreas Gustafsson
 >>>Release:        NetBSD-current, source date >= 2016.05.14.21.19.05
 >>>Organization:
 >>
 >>>Environment:
 >> System: NetBSD
 >> Architecture: i386
 >> Machine: i386
 >>>Description:
 >>
 >> The automated i386 tests on babylon5 are failing at the install stage,
 >> hanging immediately after the kernel load.
 >
 > This problem should be fixed though,

 Oops. I meant "We should fix the problem anyway though, ..."

   ozaki-r

 > I think this may be an opportunity
 > to switch test frequency between i386 and amd64 because nowadays amd64
 > has more popularity than i386. How about the idea?
 >
 > Thanks,
 >   ozaki-r

From: Andreas Gustafsson <gson@gson.org>
To: Ryota Ozaki <ozaki-r@netbsd.org>
Cc: gnats-bugs@NetBSD.org
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Wed, 18 May 2016 16:36:31 +0300

 Ryota Ozaki wrote:
 > I think this may be an opportunity to switch test frequency between
 > i386 and amd64 because nowadays amd64 has more popularity than
 > i386. How about the idea?

 One argument for using a less popular port as the primary test target
 is that it will help catch portability bugs that developers miss when
 testing their changes on the popular port before committing.  In any
 case, this discussion probably belongs on current-users rather than
 in this PR.
 -- 
 Andreas Gustafsson, gson@gson.org

Responsible-Changed-From-To: kern-bug-people->maxv
Responsible-Changed-By: gson@NetBSD.org
Responsible-Changed-When: Thu, 26 May 2016 06:24:53 +0000
Responsible-Changed-Why:
Problem started with maxv's commit of src/sys/arch/i386/i386/locore.S 1.124


From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51148 CVS commit: src/sys/arch
Date: Thu, 26 May 2016 07:24:55 +0000

 Module Name:	src
 Committed By:	maxv
 Date:		Thu May 26 07:24:55 UTC 2016

 Modified Files:
 	src/sys/arch/amd64/amd64: locore.S
 	src/sys/arch/i386/i386: locore.S

 Log Message:
 There is an issue in the way the fillkpt macro sets up pages on both
 amd64 and i386.

 The fillkpt loop is equivalent to the following:

 	do {
 		/* fill in the slot */
 		/* increment %ebx to the next slot */
 		/* increment %eax to the next pa */
 	} while (%ecx > 0)

 The issue here is that if %ecx = 0 (i.e., the chunk we are trying to
 map is zero-sized), there is still one entry created in the page table.
 The kernel expects the va<->pa translation to be linear in low memory.
 If there is a zero-sized chunk, the dead entry creates a +4096 offset in
 the virtual space, with two consecutive entries that point to the same
 physical address. In other words, the mappings are not linear anymore,
 which causes the kernel to die.

 Before my recent changes, there were only two big chunks that were
 mapped, and neither of these could be zero-sized. Now, with multiple,
 fine-grained chunks, it is possible that the [SYMS]+[PRELOADED_MODULES]
 chunk could be zero-sized.

 [PRELOADED_MODULES] is almost never present, and [SYMS] is always present
 on default kernels, except on floppies, where the bootloader does not
 load [SYMS].
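
 The loop behaviour described above can be illustrated with a small C
 model. This is only a sketch, not the locore.S code: the names
 fill_body_first, fill_guarded, is_linear, map_chunks and PT_SLOTS are
 hypothetical, and the page table is modelled as a plain array of
 physical addresses. It maps three back-to-back chunks of 2, 0 and 2
 pages, with the empty middle chunk standing in for
 [SYMS]+[PRELOADED_MODULES] on a floppy boot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* 4 KiB pages, matching the +4096 offset above */
#define PT_SLOTS  16u     /* hypothetical table size for this sketch */

/*
 * Loop shape described in the log: the body runs before the count is
 * tested, so a zero-sized chunk still consumes one slot.
 * Returns the next free slot.
 */
static size_t fill_body_first(uint32_t *pt, size_t slot, uint32_t pa, size_t npages)
{
	size_t i = 0;

	do {
		assert(slot < PT_SLOTS);
		pt[slot++] = pa + (uint32_t)(i * PAGE_SIZE);
		i++;
	} while (i < npages);
	return slot;
}

/* Guarded shape: the count is tested first, so npages == 0 writes nothing. */
static size_t fill_guarded(uint32_t *pt, size_t slot, uint32_t pa, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		assert(slot < PT_SLOTS);
		pt[slot++] = pa + (uint32_t)(i * PAGE_SIZE);
	}
	return slot;
}

/* True when every used slot i maps physical address i * PAGE_SIZE. */
static bool is_linear(const uint32_t *pt, size_t nslots)
{
	for (size_t i = 0; i < nslots; i++)
		if (pt[i] != (uint32_t)(i * PAGE_SIZE))
			return false;
	return true;
}

typedef size_t (*fill_fn)(uint32_t *, size_t, uint32_t, size_t);

/* Map chunks of 2, 0 and 2 pages that are contiguous in physical memory. */
static size_t map_chunks(fill_fn fill, uint32_t *pt)
{
	size_t slot = 0;

	slot = fill(pt, slot, 0 * PAGE_SIZE, 2);
	slot = fill(pt, slot, 2 * PAGE_SIZE, 0);	/* the zero-sized chunk */
	slot = fill(pt, slot, 2 * PAGE_SIZE, 2);
	return slot;
}
```

 With fill_body_first, the empty chunk leaves a dead entry, so two
 consecutive slots point to the same physical address and is_linear()
 fails. With fill_guarded, the table stays linear.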

 Should fix PR 51148.


 To generate a diff of this commit:
 cvs rdiff -u -r1.93 -r1.94 src/sys/arch/amd64/amd64/locore.S
 cvs rdiff -u -r1.124 -r1.125 src/sys/arch/i386/i386/locore.S

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Thu, 26 May 2016 13:19:15 +0000
State-Changed-Why:
Verified fixed - the kernel now starts.  There is still a problem with execing init,
but that appears to be unrelated.


From: David Holland <dholland-bugs@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: Maxime Villard <max@m00nbsd.net>
Subject: Re: kern/51148: i386 install floppies no longer boot
Date: Tue, 31 May 2016 04:32:32 +0000

 This whole subthread didn't get sent to gnats.

    ------

 From: Maxime Villard <max@m00nbsd.net>
 To: Andreas Gustafsson <gson@gson.org>, netbsd-bugs@netbsd.org
 Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
 Subject: Re: kern/51148: i386 install floppies no longer boot
 Date: Wed, 25 May 2016 13:06:09 +0200

 First of all, I'm not subscribed to netbsd-bugs@, so please forward your mails
 to me.

 I have carefully investigated the mappings on amd64 and i386 with a kernel page
 explorer I wrote, and there is no issue. The levels are all linear, with no holes
 in the middle, they are correctly linked, and they cover the whole kernel image,
 preloaded modules and bootstrap tables.

 In fact, there appears to be one bug in the L1 slot that should normally point
 to the first page of the data segment: it seems to be destroyed. But this issue
 was already here before my changes, so I didn't introduce it.

 The changes from me you mentioned are all trivial, and it seems highly unlikely
 to me that they cause the install failure. Normally, if there were a bug, it
 should have been in the previous commits. Also, my changes are in no way
 install-related, and as far as I know, the mappings are the same on
 CD/USB/floppy/whatever.

 My guess, right now, is that my alignment changes in kern.ldscript somehow
 trigger the aforementioned L1 slot bug on floppy installs.

 I don't have a floppy device, and right now my NetBSD resources are limited. The
 only thing I can do is ask.

 	Is the problem still present? (I don't see new entries in the log)
 	We are talking about GENERIC, and not GENERIC-PAE, right?
 	Does reverting only [1] fix the problem?
 	What if you put 'fillkpt' instead of 'fillkpt_nox' in [1]?

 Thanks.


  [1] 2016.05.15.07.17.53 maxv src/sys/arch/i386/i386/locore.S 1.124


 From: Andreas Gustafsson <gson@gson.org>
 To: Maxime Villard <max@m00nbsd.net>
 Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
 Subject: Re: kern/51148: i386 install floppies no longer boot
 Date: Wed, 25 May 2016 15:17:10 +0300

 Maxime,

 You wrote:
 > First of all, I'm not subscribed to netbsd-bugs@, so please forward your mails
 > to me.

 Will do.  I would have mailed you about the initial report if you had
 been the only developer to commit during the period of build breakage
 when the problem appeared, but there were commits by four developers,
 and no easy way for me to determine which of them was at fault.

 > I have carefully investigated the mappings on amd64 and i386 with a kernel page
 > explorer I wrote, and there is no issue. The levels are all linear, with no holes
 > in the middle, they are correctly linked, and they cover the whole kernel image,
 > preloaded modules and bootstrap tables.
 > 
 > In fact, there appears to be one bug in the L1 slot that should normally point
 > to the first page of the data segment: it seems to be destroyed. But this issue
 > was already here before my changes, so I didn't introduce it.
 > 
 > The changes from me you mentioned are all trivial, and it seems highly unlikely
 > to me that they cause the install failure. Normally, if there were a bug, it
 > should have been in the previous commits. Also, my changes are in no way
 > install-related, and as far as I know, the mappings are the same on
 > CD/USB/floppy/whatever.
 > 
 > My guess, right now, is that my alignment changes in kern.ldscript somehow
 > trigger the aforementioned L1 slot bug on floppy installs.
 > 
 > I don't have a floppy device, and right now my NetBSD resources are limited.

 If you can run misc/py-anita from pkgsrc against an i386 release
 build, that should reproduce the problem without the need for a
 physical floppy device or even a NetBSD host.

 > The
 > only thing I can do is ask.
 > 
 > 	Is the problem still present? (I don't see new entries in the log)

 Yes, the problem is still present.  I'm not sure what you mean about
 not seeing new entries; the newest test runs are from today, and still
 failing with the same error:

   http://releng.netbsd.org/b5reports/i386/commits-2016.05.html#2016.05.25.10.15.01

 > 	We are talking about GENERIC, and not GENERIC-PAE, right?

 Yes.

 > 	Does reverting only [1] fix the problem?

 I will try that and report back.

 > 	What if you put 'fillkpt' instead of 'fillkpt_nox' in [1]?

 I will try that, too.

 > Thanks.
 > 
 >   [1] 2016.05.15.07.17.53 maxv src/sys/arch/i386/i386/locore.S 1.124
 -- 
 Andreas Gustafsson, gson@gson.org

 From: Maxime Villard <max@m00nbsd.net>
 To: Andreas Gustafsson <gson@gson.org>
 Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
 Subject: Re: kern/51148: i386 install floppies no longer boot
 Date: Wed, 25 May 2016 16:30:58 +0200

 On 25/05/2016 at 14:17, Andreas Gustafsson wrote:
 > [...]
 >> 
 >> I don't have a floppy device, and right now my NetBSD resources are limited.
 > 
 > If you can run misc/py-anita from pkgsrc against an i386 release
 > build, that should reproduce the problem without the need for a
 > physical floppy device or even a NetBSD host.
 > 

 I would be happy to do the tests myself. But the only i386 machine I have right
 now is a VirtualBox VM, and there is PR 51134 that reboots the machine every ~5
 minutes. I can do almost nothing on it.


 From: Christos Zoulas <christos@zoulas.com>
 To: Maxime Villard <max@m00nbsd.net>, Andreas Gustafsson <gson@gson.org>
 Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
 Subject: Re: kern/51148: i386 install floppies no longer boot
 Date: Wed, 25 May 2016 14:17:20 -0400

 On May 25,  4:30pm, max@m00nbsd.net (Maxime Villard) wrote:
 -- Subject: Re: kern/51148: i386 install floppies no longer boot

 | On 25/05/2016 at 14:17, Andreas Gustafsson wrote:
 | > [...]
 | >>
 | >> I don't have a floppy device, and right now my NetBSD resources are limited.
 | >
 | > If you can run misc/py-anita from pkgsrc against an i386 release
 | > build, that should reproduce the problem without the need for a
 | > physical floppy device or even a NetBSD host.
 | >
 | 
 | I would be happy to do the tests myself. But the only i386 machine I have right
 | now is a VirtualBox VM, and there is PR 51134 that reboots the machine every ~5
 | minutes. I can do almost nothing on it.

 I am fixing that.

 christos


 From: Andreas Gustafsson <gson@gson.org>
 To: Maxime Villard <max@m00nbsd.net>
 Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
 Subject: Re: kern/51148: i386 install floppies no longer boot
 Date: Wed, 25 May 2016 20:06:11 +0300

 Maxime,

 I have now run the tests you asked for.

 > 	Does reverting only [1] fix the problem?

 Yes.  The system still doesn't install because the kernel is unable
 to exec /sbin/init, but this is a different bug; when I don't revert
 [1], the kernel does not even start (there are no kernel messages
 on the console).

 > 	What if you put 'fillkpt' instead of 'fillkpt_nox' in [1]?

 I tested with this patch against 2016.05.22.09.10.37 sources:

 diff -u -r1.124 locore.S
 --- locore.S    15 May 2016 07:17:53 -0000      1.124
 +++ locore.S    25 May 2016 14:33:35 -0000
 @@ -731,7 +731,7 @@
         movl    RELOC(tablesize),%ecx   /* length of BOOTSTRAP TABLES */
         shrl    $PGSHIFT,%ecx
         orl     $(PG_V|PG_KW),%eax
 -       fillkpt_nox
 +       fillkpt

         /* We are on (4). Map ISA I/O mem (later atdevbase) RWX. */
         movl    $(IOM_BEGIN|PG_V|PG_KW/*|PG_N*/),%eax

 and it did _not_ fix the problem.

 Later, you wrote:

 > I would be happy to do the tests myself. But the only i386 machine I have right
 > now is a VirtualBox VM, and there is PR 51134 that reboots the machine every ~5
 > minutes. I can do almost nothing on it.

 What do you host VirtualBox on?  You can test the i386 port using anita+qemu
 even on a non-i386 host.
 -- 
 Andreas Gustafsson, gson@gson.org

 From: Maxime Villard <max@m00nbsd.net>
 To: Andreas Gustafsson <gson@gson.org>
 Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
 Subject: Re: kern/51148: i386 install floppies no longer boot
 Date: Thu, 26 May 2016 09:33:52 +0200

 I've committed a patch. Please let me know whether it fixes the issue.


 From: Maxime Villard <max@m00nbsd.net>
 To: Andreas Gustafsson <gson@gson.org>
 Cc: netbsd-bugs@netbsd.org, kern-bug-people@netbsd.org, gnats-admin@netbsd.org
 Subject: Re: kern/51148: i386 install floppies no longer boot
 Date: Thu, 26 May 2016 09:00:44 +0200

 On 25/05/2016 at 19:06, Andreas Gustafsson wrote:
 > Maxime,
 > 
 > I have now run the tests you asked for.
 > 
 >> 	Does reverting only [1] fix the problem?
 > 
 > Yes.  The system still doesn't install because the kernel is unable
 > to exec /sbin/init, but this is a different bug; when I don't revert
 > [1], the kernel does not even start (there are no kernel messages
 > on the console).
 > 
 >> 	What if you put 'fillkpt' instead of 'fillkpt_nox' in [1]?
 > 
 > I tested with this patch against 2016.05.22.09.10.37 sources:
 > 
 > diff -u -r1.124 locore.S
 > --- locore.S    15 May 2016 07:17:53 -0000      1.124
 > +++ locore.S    25 May 2016 14:33:35 -0000
 > @@ -731,7 +731,7 @@
 >         movl    RELOC(tablesize),%ecx   /* length of BOOTSTRAP TABLES */
 >         shrl    $PGSHIFT,%ecx
 >         orl     $(PG_V|PG_KW),%eax
 > -       fillkpt_nox
 > +       fillkpt
 > 
 >         /* We are on (4). Map ISA I/O mem (later atdevbase) RWX. */
 >         movl    $(IOM_BEGIN|PG_V|PG_KW/*|PG_N*/),%eax
 > 
 > and it did _not_ fix the problem.

 Thanks for the tests. I see where the problem is, and I'll commit a patch
 soon.

 > 
 > Later, you wrote:
 > 
 >> I would be happy to do the tests myself. But the only i386 machine I have right
 >> now is a VirtualBox VM, and there is PR 51134 that reboots the machine every ~5
 >> minutes. I can do almost nothing on it.
 > 
 > What do you host VirtualBox on?  You can test the i386 port using anita+qemu
 > even on a non-i386 host.
 > 

 I'll answer in the other PR.


From: "Maxime Villard" <maxv@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/51148 CVS commit: src/sys/arch/x86
Date: Mon, 25 Jul 2016 12:11:40 +0000

 Module Name:	src
 Committed By:	maxv
 Date:		Mon Jul 25 12:11:40 UTC 2016

 Modified Files:
 	src/sys/arch/x86/include: pmap.h
 	src/sys/arch/x86/x86: lapic.c pmap.c

 Log Message:
 The L1 entry of the first page of the data segment is overwritten for the
 LAPIC page, and set as RWX+PG_N. The LAPIC pa is fixed, and its va resides
 in the data segment. Because of this error-prone design, the kernel image
 map is not linear, and I first thought it was a bug (as I vaguely said in
 PR/51148). Using large pages for the data segment is therefore wrong, since
 the first page does not actually belong to the data segment (even if its va
 is in the range). This bug is not triggered currently, since local_apic is
 not large-page-aligned.

 We will certainly have to allocate a va dynamically instead of using the
 first page of data; but for now, disable large pages on the data segment,
 and map the LAPIC as RW.

 This is the last x86-specific RWX page.
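
 The alignment condition mentioned above can be sketched in C. This is
 an illustration under stated assumptions, not the pmap.c logic: it
 assumes that large-page promotion only covers whole, aligned large
 pages inside the segment, and that a large page is 4 MiB (non-PAE
 i386). The names roundup_large and large_pages_swallow are
 hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define LARGE_PAGE 0x400000u	/* assumed 4 MiB large page (non-PAE i386) */

/* Round va up to the next large-page boundary (64-bit to avoid overflow). */
static uint64_t roundup_large(uint64_t va)
{
	return (va + LARGE_PAGE - 1) & ~((uint64_t)LARGE_PAGE - 1);
}

/*
 * Hypothetical check: if [data_start, data_end) were remapped with large
 * pages, would the special 4 KiB page at special_va (here, the page whose
 * L1 entry is overwritten for the LAPIC) fall inside a large page?
 */
static bool large_pages_swallow(uint32_t data_start, uint32_t data_end,
    uint32_t special_va)
{
	uint64_t lp_start = roundup_large(data_start);
	uint64_t lp_end = (uint64_t)data_end & ~((uint64_t)LARGE_PAGE - 1);

	if (lp_start >= lp_end)
		return false;	/* no whole large page fits in the segment */
	return special_va >= lp_start && special_va < lp_end;
}
```

 In this model, a special page at the start of a data segment that is
 not large-page-aligned stays outside the large-page range, which is
 consistent with the bug not being triggered currently. An aligned
 start would place the special page inside the first large page.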


 To generate a diff of this commit:
 cvs rdiff -u -r1.58 -r1.59 src/sys/arch/x86/include/pmap.h
 cvs rdiff -u -r1.51 -r1.52 src/sys/arch/x86/x86/lapic.c
 cvs rdiff -u -r1.216 -r1.217 src/sys/arch/x86/x86/pmap.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:
