NetBSD Problem Report #55286
From gson@gson.org Fri May 22 19:48:32 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 7FFA91A9227
for <gnats-bugs@gnats.NetBSD.org>; Fri, 22 May 2020 19:48:32 +0000 (UTC)
Message-Id: <20200522194826.A0B7A253F45@guava.gson.org>
Date: Fri, 22 May 2020 22:48:26 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Kernel messages cause loss of serial console input
X-Send-Pr-Version: 3.95
>Number: 55286
>Category: kern
>Synopsis: Kernel messages cause loss of serial console input
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri May 22 19:50:00 +0000 2020
>Closed-Date: Tue May 26 14:23:51 +0000 2020
>Last-Modified: Tue May 26 14:23:51 +0000 2020
>Originator: Andreas Gustafsson
>Release: NetBSD-current and others
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
I have a couple of automated test setups where one NetBSD system
controls another over a serial console. Both are running on real
hardware and the serial baud rate is set to 9600.
For a long time, these systems have been suffering from random
failures that occur when the kernel of the controlled system happens
to log a console message at an inopportune moment.
Sometimes things fail because a kernel message is printed in the
middle of some string that the controlling system is looking for.
For example, if it is waiting for a "login:" prompt but gets
"log[ 51.0923743] ipmi0: ID 32.1 IPMI 2.0 Available +SDRs\r\nin:",
there's not much you can do short of eliminating the message.
That case is not what this PR is about.
Rather, this PR is about failures that happen when a kernel message is
printed even if it does _not_ cause any of the expected output to be
corrupted, and specifically, that console _input_ appears to be lost
if it happens to arrive at a time when the kernel is printing a
message.
For example, in a recent run, the controlling system sent a long
one-line shell command beginning with
set -ex; sleep 60; mkdir -p /b && mount -t tmpfs tmpfs /b && rm -rf /b/*
This was echoed by the controlled system as
set -ex; sleep 60; mkdir -p /[ 50.9384968] ipmi0: ID 32.1 IPMI 2.0 Available +SDRs\r\nb && m/b \x08 && rm -rf /b/*
Not only did a kernel message appear in the middle of the echoed
command, but more importantly, the "ount -t tmpfs tmpfs " part of
the command was lost. When the controlled system then tried to
execute the command, it printed
sh: m/b: not found
so clearly there was a loss of input, not just a loss of echo.
I can also reproduce the issue manually on a freshly booted system by
logging in as root over the serial console and running the shell command
# for i in $(seq 200); do luactl >/dev/null 2>&1; done
to trigger a series of console messages like
[ 69.6188016] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
[ 69.7088437] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
[ 69.7988866] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
(etc)
and then typing the letters of the alphabet on the keyboard while the
messages are being printed, followed by a return (at 9600 bps, the
messages print slowly enough that there is time to do this before they
finish). When I do this, the initial part of the input is lost, but
the rest is accepted, leading to output that ends like this:
[ 88.0675327] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
# uvxyz
-sh: uvwxyz: not found
Strangely, this only happens once. If I repeat the test, every letter
I type is accepted, and I get the expected output ending in
[ 981.5203980] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
# abcdefghijklmnopqrstuvwxyz
-sh: abcdefghijklmnopqrstuvwxyz: not found
After rebooting, I can again reproduce the problem, once. I have
tried this test using two different pairs of amd64 machines and they
both behave the same way.
In both the automated test failure and the manual tests, exactly 20
characters were lost, and I see that the magic number 20 also occurs
in com.c as the value of MAX_READAHEAD. This may or may not be a
coincidence.
--
Andreas Gustafsson, gson@gson.org
>How-To-Repeat:
>Fix:
Unknown
>Release-Note:
>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sat, 23 May 2020 21:09:22 +0300
I had a closer look at the code, and the issue is definitely caused
by the read-ahead code in com.c. Quoting a comment in com.c:
* The read-ahead code is so that you can detect pending in-band
* cn_magic in polled mode while doing output rather than having to
* wait until the kernel decides it needs input.
So, if I understand this correctly, this code is intended to support a
nonstandard way of breaking into DDB (as opposed to the standard break
condition) and will only work if the kernel happens to be printing
something in polled mode at the time (!).
To me this seems like a bad idea from the start, and I would prefer to
simply remove this code rather than try to fix it. Or does someone
think it actually has some value?
I have confirmed that after disabling the read-ahead using the patch
below, the manual test procedure in the PR no longer reproduces the
problem.
Index: src/sys/dev/ic/com.c
===================================================================
RCS file: /bracket/repo/src/sys/dev/ic/com.c,v
retrieving revision 1.358
diff -u -r1.358 com.c
--- src/sys/dev/ic/com.c 1 May 2020 07:27:51 -0000 1.358
+++ src/sys/dev/ic/com.c 23 May 2020 12:33:23 -0000
@@ -2402,17 +2402,18 @@
com_common_putc(dev_t dev, struct com_regs *regsp, int c)
{
int s = splserial();
- int cin, stat, timo;
+ int timo;
+#if 0
if (com_readaheadcount < MAX_READAHEAD
&& ISSET(stat = CSR_READ_1(regsp, COM_REG_LSR), LSR_RXRDY)) {
- int cn_trapped = 0;
+ int cin, stat, cn_trapped = 0;
cin = CSR_READ_1(regsp, COM_REG_RXDATA);
stat = CSR_READ_1(regsp, COM_REG_IIR);
cn_check_magic(dev, cin, com_cnm_state);
com_readahead[com_readaheadcount++] = cin;
}
-
+#endif
/* wait for any pending transmission to finish */
timo = 150000;
while (!ISSET(CSR_READ_1(regsp, COM_REG_LSR), LSR_TXRDY) && --timo)
--
Andreas Gustafsson, gson@gson.org
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 14:50:35 +0200
On Sat, May 23, 2020 at 06:10:01PM +0000, Andreas Gustafsson wrote:
> So, if I understand this correctly, this code is intended to support a
> nonstandard way of breaking into DDB (as opposed to the standard break
> condition) and will only work if the kernel happens to be printing
> something in polled mode at the time (!).
no, it works fine even if nothing is being printed. I know, I'm using '+++++'
as a key to enter ddb.
>
> To me this seems like a bad idea from the start, and I would prefer to
> simply remove this code rather than try to fix it. Or does someone
> think it actually has some value?
It definitively has some value. There are setups where BREAK can't be
sent (e.g. qemu, Dell idrac, ...). I need it.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 06:46:22 -0700 (PDT)
On Sun, 24 May 2020, Manuel Bouyer wrote:
<snip>
> It definitively has some value. There are setups where BREAK can't be
> sent (e.g. qemu, Dell idrac, ...). I need it.
Just FYI, you should be able to send BREAK from within qemu. From the
qemu man page:
...
During emulation, if you are using a character backend
multiplexer (which is the default if you are using -nographic)
then several commands are available via an escape sequence.
These key sequences all start with an escape character, which
is Ctrl-a by default, but can be changed with -echr. The list
below assumes you're using the default.
...
Ctrl-a b
Send break (magic sysrq in Linux)
...
+--------------------+--------------------------+-----------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul@whooppee.com |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoyette@netbsd.org |
+--------------------+--------------------------+-----------------------+
From: Martin Husemann <martin@duskware.de>
To: Paul Goyette <paul@whooppee.com>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 15:51:49 +0200
I have not followed the changes properly, but in general every
sequence configure via cnmagic(9) needs to work.
What we likely don't need (and that is how I understood Andreas) is the
readahead handling during output.
Andreas: can you verify with the patched kernel that cnmagic still works?
Martin
From: Andreas Gustafsson <gson@gson.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 16:51:52 +0300
Manuel Bouyer wrote:
> no, it works fine even if nothing is being printed. I know, I'm using '+++++'
> as a key to enter ddb.
If it's working when nothing is being printed, you are not using
the code path in case, but a different one that will continue to work.
During normal system operation when serial port interrupts are
enabled, the cnmagic check happens in comintr(), and that's the one
you are relying on.
The code I'd like to delete is in com_common_putc() and is intended
for situations such as early startup where interrupts are not yet
enabled, or late shutdown when they have already been disabled. If
you type '+++++' during one of those phases, it is only detected in
com_common_putc(), and that only works if the kernel happens to print
a character to the console after each '+'. It will not work if the
kernel is deadlocked or in an infinite loop where nothing is printed
to the console.
> It definitively has some value. There are setups where BREAK can't be
> sent (e.g. qemu, Dell idrac, ...).
I'm also using both qemu and idrac. The qemu bug is supposedly fixed
(https://bugs.launchpad.net/qemu/+bug/1654137), and the Dell server
I'm using has a nice NMI button. But that really only matters for the
interrupts-disabled case; normally, '+++++' will continue to work.
> I need it.
Sure, but do you need it specifically in the case when serial
interrupts are disabled and the kernel is not polling for console
input but is printing something to the console? That is the only
case that deleting this code will break.
--
Andreas Gustafsson, gson@gson.org
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 15:55:54 +0200
On Sun, May 24, 2020 at 04:51:52PM +0300, Andreas Gustafsson wrote:
> > I need it.
>
> Sure, but do you need it specifically in the case when serial
> interrupts are disabled and the kernel is not polling for console
> input but is printing something to the console? That is the only
> case that deleting this code will break.
No, I don't think I ever used it in this case. Thanks for the clarification
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
From: Andreas Gustafsson <gson@gson.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 22:56:38 +0300
Martin Husemann wrote:
> I have not followed the changes properly, but in general every
> sequence configure via cnmagic(9) needs to work.
>
> What we likely don't need (and that is how I understood Andreas) is the
> readahead handling during output.
Right.
> Andreas: can you verify with the patched kernel that cnmagic still works?
I don't see any reason why it shouldn't still work during normal
operation where serial interrupts are enabled, but sure, I can test it.
--
Andreas Gustafsson, gson@gson.org
From: Valery Ushakov <uwe@stderr.spb.ru>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 03:18:46 +0300
On Sun, May 24, 2020 at 13:55:03 +0000, Andreas Gustafsson wrote:
> The code I'd like to delete is in com_common_putc() and is intended
> for situations such as early startup where interrupts are not yet
> enabled, or late shutdown when they have already been disabled.
[...]
> > I need it.
>
> Sure, but do you need it specifically in the case when serial
> interrupts are disabled and the kernel is not polling for console
> input but is printing something to the console? That is the only
> case that deleting this code will break.
How does your original test end up in a state like that to trigger
this bug? Kernel message being printed with polled code and steals
the input? In that case isn't the right fix to only do readahead when
interrupts are disabled?
-uwe
From: Andreas Gustafsson <gson@gson.org>
To: Valery Ushakov <uwe@stderr.spb.ru>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 09:34:44 +0300
Valery Ushakov wrote:
> How does your original test end up in a state like that to trigger
> this bug? Kernel message being printed with polled code and steals
> the input?
Yes.
> In that case isn't the right fix to only do readahead when
> interrupts are disabled?
That is the right fix if you think the read-ahead code is worth
keeping, but I don't. I'm volunteering to remove it; are you
volunteering to fix it?
--
Andreas Gustafsson, gson@gson.org
From: Valery Ushakov <uwe@stderr.spb.ru>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 15:03:30 +0300
On Mon, May 25, 2020 at 09:34:44 +0300, Andreas Gustafsson wrote:
> Date: Mon, 25 May 2020 09:34:44 +0300
> From: Andreas Gustafsson <gson@gson.org>
> Subject: Re: kern/55286: Kernel messages cause loss of serial console input
> To: Valery Ushakov <uwe@stderr.spb.ru>
> CC: gnats-bugs@netbsd.org
>
w
> Valery Ushakov wrote:
> > How does your original test end up in a state like that to trigger
> > this bug? Kernel message being printed with polled code and steals
> > the input?
>
> Yes.
That would be early autoconfig, before the interrupts are enabled, if
I understand correctly. That's an important time window, isn't it?
> > In that case isn't the right fix to only do readahead when
> > interrupts are disabled?
>
> That is the right fix if you think the read-ahead code is worth
> keeping, but I don't. I'm volunteering to remove it; are you
> volunteering to fix it?
That sounds pretty rude. I'm asking a question and you are telling me
to shut up unless I'm up to imeediately fixing it.
-uwe
From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 14:56:07 +0200
Here is a patch that uses readahead in comcnputc only while we are still
cold - compile tested only, but should fix the problem as well.
Martin
Index: com.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/com.c,v
retrieving revision 1.358
diff -u -p -r1.358 com.c
--- com.c 1 May 2020 07:27:51 -0000 1.358
+++ com.c 25 May 2020 12:48:20 -0000
@@ -159,7 +159,7 @@ int com_to_tiocm(struct com_softc *);
void com_iflush(struct com_softc *);
int com_common_getc(dev_t, struct com_regs *);
-static void com_common_putc(dev_t, struct com_regs *, int);
+static void com_common_putc(dev_t, struct com_regs *, int, int);
int cominit(struct com_regs *, int, int, int, tcflag_t);
@@ -2399,12 +2399,12 @@ com_common_getc(dev_t dev, struct com_re
}
static void
-com_common_putc(dev_t dev, struct com_regs *regsp, int c)
+com_common_putc(dev_t dev, struct com_regs *regsp, int c, int with_readahead)
{
int s = splserial();
int cin, stat, timo;
- if (com_readaheadcount < MAX_READAHEAD
+ if (with_readahead && com_readaheadcount < MAX_READAHEAD
&& ISSET(stat = CSR_READ_1(regsp, COM_REG_LSR), LSR_RXRDY)) {
int cn_trapped = 0;
cin = CSR_READ_1(regsp, COM_REG_RXDATA);
@@ -2572,7 +2572,7 @@ void
comcnputc(dev_t dev, int c)
{
- com_common_putc(dev, &comcons_info.regs, c);
+ com_common_putc(dev, &comcons_info.regs, c, cold);
}
void
@@ -2641,7 +2641,7 @@ void
com_kgdb_putc(void *arg, int c)
{
- com_common_putc(NODEV, &comkgdbregs, c);
+ com_common_putc(NODEV, &comkgdbregs, c, 0);
}
#endif /* KGDB */
From: Andreas Gustafsson <gson@gson.org>
To: Valery Ushakov <uwe@stderr.spb.ru>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 16:47:02 +0300
Valery Ushakov wrote:
> > > How does your original test end up in a state like that to trigger
> > > this bug? Kernel message being printed with polled code and steals
> > > the input?
> >
> > Yes.
>
> That would be early autoconfig, before the interrupts are enabled, if
> I understand correctly. That's an important time window, isn't it?
Early autoconfig is one of the cases the read-ahead mechanism appears
to be intended for, and yes, that's an important time window, but I
have a hard time imagining a situation where the read-ahead would
actually be useful. For example, if the system hangs during early
autoconfig, read-ahead won't help you break into ddb because a hung
system will not be printing anything to the console.
And to be clear, the kernel messages that are causing data loss for me
are not the ones printed during early autoconfig, but ones printed
after the system has fully booted.
When the read-ahead mechanism was first added in com.c 1.174, it was
not enabled by default (it was conditional on "options DDB_BREAK_CHAR"),
but apparently it got enabled by default with the conversion to use
cnmagic in com.c 1.186. There is no corresponding mechanism in the
other serial drivers.
> > > In that case isn't the right fix to only do readahead when
> > > interrupts are disabled?
> >
> > That is the right fix if you think the read-ahead code is worth
> > keeping, but I don't. I'm volunteering to remove it; are you
> > volunteering to fix it?
>
> That sounds pretty rude. I'm asking a question and you are telling me
> to shut up unless I'm up to imeediately fixing it.
Sorry, I didn't mean to be rude, it was an honest question and I'm
certainly not asking you to shut up. I'm just trying to figure out if
anyone cares about this code, and if you do care about it, I don't
think it's unreasonable to ask if you care about it enough to want to
fix the bugs in it.
--
Andreas Gustafsson, gson@gson.org
From: Andreas Gustafsson <gson@gson.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Tue, 26 May 2020 16:08:59 +0300
Martin Husemann wrote:
> Here is a patch that uses readahead in comcnputc only while we are still
> cold - compile tested only, but should fix the problem as well.
I have now run tested it and can confirm that it fixes the problem.
Will you commit it?
--
Andreas Gustafsson, gson@gson.org
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55286 CVS commit: src/sys/dev/ic
Date: Tue, 26 May 2020 13:24:53 +0000
Module Name: src
Committed By: martin
Date: Tue May 26 13:24:53 UTC 2020
Modified Files:
src/sys/dev/ic: com.c
Log Message:
Make the readahead-while-sending-output code conditional in
com_common_putc(), only erquest readahead before interrupts are enabled
and when called from comcnputc().
Fixes PR kern/55286.
To generate a diff of this commit:
cvs rdiff -u -r1.358 -r1.359 src/sys/dev/ic/com.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Martin Husemann <martin@duskware.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Tue, 26 May 2020 15:25:16 +0200
On Tue, May 26, 2020 at 04:08:59PM +0300, Andreas Gustafsson wrote:
> Martin Husemann wrote:
> > Here is a patch that uses readahead in comcnputc only while we are still
> > cold - compile tested only, but should fix the problem as well.
>
> I have now run tested it and can confirm that it fixes the problem.
> Will you commit it?
Thanks for testing - commited!
Martin
State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Tue, 26 May 2020 14:23:51 +0000
State-Changed-Why:
Fixed by martin, thanks.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.