NetBSD Problem Report #55286

From gson@gson.org  Fri May 22 19:48:32 2020
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 7FFA91A9227
	for <gnats-bugs@gnats.NetBSD.org>; Fri, 22 May 2020 19:48:32 +0000 (UTC)
Message-Id: <20200522194826.A0B7A253F45@guava.gson.org>
Date: Fri, 22 May 2020 22:48:26 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@NetBSD.org
Subject: Kernel messages cause loss of serial console input
X-Send-Pr-Version: 3.95

>Number:         55286
>Category:       kern
>Synopsis:       Kernel messages cause loss of serial console input
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 22 19:50:00 +0000 2020
>Closed-Date:    Tue May 26 14:23:51 +0000 2020
>Last-Modified:  Tue May 26 14:23:51 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current and others
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

I have a couple of automated test setups where one NetBSD system
controls another over a serial console.  Both are running on real
hardware and the serial baud rate is set to 9600.

For a long time, these systems have been suffering from random
failures that occur when the kernel of the controlled system happens
to log a console message at an inopportune moment.

Sometimes things fail because a kernel message is printed in the
middle of some string that the controlling system is looking for.
For example, if it is waiting for a "login:" prompt but gets
"log[  51.0923743] ipmi0: ID 32.1 IPMI 2.0 Available +SDRs\r\nin:",
there's not much you can do short of eliminating the message.
That case is not what this PR is about.

Rather, this PR is about failures that happen when a kernel message is
printed even if it does _not_ cause any of the expected output to be
corrupted, and specifically, that console _input_ appears to be lost
if it happens to arrive at a time when the kernel is printing a
message.

For example, in a recent run, the controlling system sent a long
one-line shell command beginning with

  set -ex; sleep 60; mkdir -p /b && mount -t tmpfs tmpfs /b && rm -rf /b/* 

This was echoed by the controlled system as

  set -ex; sleep 60; mkdir -p /[  50.9384968] ipmi0: ID 32.1 IPMI 2.0 Available +SDRs\r\nb && m/b \x08 && rm -rf /b/*

Not only did a kernel message appear in the middle of the echoed
command, but more importantly, the "ount -t tmpfs tmpfs " part of
the command was lost.  When the controlled system then tried to
execute the command, it printed

 sh: m/b: not found

so clearly there was a loss of input, not just a loss of echo.

I can also reproduce the issue manually on a freshly booted system by
logging in as root over the serial console and running the shell command

  # for i in $(seq 200); do luactl >/dev/null 2>&1; done

to trigger a series of console messages like

  [  69.6188016] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
  [  69.7088437] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
  [  69.7988866] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
  (etc)

and then typing the letters of the alphabet on the keyboard while the
messages are being printed, followed by a return (at 9600 bps, the
messages print slowly enough that there is time to do this before they
finish).  When I do this, the initial part of the input is lost, but
the rest is accepted, leading to output that ends like this:

  [  88.0675327] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
  # uvxyz
  -sh: uvwxyz: not found

Strangely, this only happens once.  If I repeat the test, every letter
I type is accepted, and I get the expected output ending in

  [ 981.5203980] WARNING: module error: incompatible module class 1 for `lua' (wanted 3)
  # abcdefghijklmnopqrstuvwxyz
  -sh: abcdefghijklmnopqrstuvwxyz: not found

After rebooting, I can again reproduce the problem, once.  I have
tried this test using two different pairs of amd64 machines and they
both behave the same way.

In both the automated test failure and the manual tests, exactly 20
characters were lost, and I see that the magic number 20 also occurs
in com.c as the value of MAX_READAHEAD.  This may or may not be a
coincidence.
-- 
Andreas Gustafsson, gson@gson.org

>How-To-Repeat:

>Fix:

Unknown
>Release-Note:

>Audit-Trail:
From: Andreas Gustafsson <gson@gson.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sat, 23 May 2020 21:09:22 +0300

 I had a closer look at the code, and the issue is definitely caused
 by the read-ahead code in com.c.  Quoting a comment in com.c:

  * The read-ahead code is so that you can detect pending in-band
  * cn_magic in polled mode while doing output rather than having to
  * wait until the kernel decides it needs input.

 So, if I understand this correctly, this code is intended to support a
 nonstandard way of breaking into DDB (as opposed to the standard break
 condition) and will only work if the kernel happens to be printing
 something in polled mode at the time (!).

 To me this seems like a bad idea from the start, and I would prefer to
 simply remove this code rather than try to fix it.  Or does someone
 think it actually has some value?

 I have confirmed that after disabling the read-ahead using the patch
 below, the manual test procedure in the PR no longer reproduces the
 problem.

 Index: src/sys/dev/ic/com.c
 ===================================================================
 RCS file: /bracket/repo/src/sys/dev/ic/com.c,v
 retrieving revision 1.358
 diff -u -r1.358 com.c
 --- src/sys/dev/ic/com.c	1 May 2020 07:27:51 -0000	1.358
 +++ src/sys/dev/ic/com.c	23 May 2020 12:33:23 -0000
 @@ -2402,17 +2402,18 @@
  com_common_putc(dev_t dev, struct com_regs *regsp, int c)
  {
  	int s = splserial();
 -	int cin, stat, timo;
 +	int timo;

 +#if 0
  	if (com_readaheadcount < MAX_READAHEAD
  	     && ISSET(stat = CSR_READ_1(regsp, COM_REG_LSR), LSR_RXRDY)) {
 -		int cn_trapped = 0;
 +		int cin, stat, cn_trapped = 0;
  		cin = CSR_READ_1(regsp, COM_REG_RXDATA);
  		stat = CSR_READ_1(regsp, COM_REG_IIR);
  		cn_check_magic(dev, cin, com_cnm_state);
  		com_readahead[com_readaheadcount++] = cin;
  	}
 -
 +#endif
  	/* wait for any pending transmission to finish */
  	timo = 150000;
  	while (!ISSET(CSR_READ_1(regsp, COM_REG_LSR), LSR_TXRDY) && --timo)

 -- 
 Andreas Gustafsson, gson@gson.org

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
        Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 14:50:35 +0200

 On Sat, May 23, 2020 at 06:10:01PM +0000, Andreas Gustafsson wrote:
 >  So, if I understand this correctly, this code is intended to support a
 >  nonstandard way of breaking into DDB (as opposed to the standard break
 >  condition) and will only work if the kernel happens to be printing
 >  something in polled mode at the time (!).

 no, it works fine even if nothing is being printed. I know, I'm using '+++++'
 as a key to enter ddb.

 >  
 >  To me this seems like a bad idea from the start, and I would prefer to
 >  simply remove this code rather than try to fix it.  Or does someone
 >  think it actually has some value?

 It definitively has some value. There are setups where BREAK can't be
 sent (e.g. qemu, Dell idrac, ...). I need it.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Paul Goyette <paul@whooppee.com>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
    Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 06:46:22 -0700 (PDT)

 On Sun, 24 May 2020, Manuel Bouyer wrote:

 <snip>

 > It definitively has some value. There are setups where BREAK can't be
 > sent (e.g. qemu, Dell idrac, ...). I need it.

 Just FYI, you should be able to send BREAK from within qemu.  From the
 qemu man page:

  	...
  	During emulation, if you are using a character backend
  	multiplexer (which is the default if you are using -nographic)
  	then several commands are available via an escape sequence.
  	These key sequences all start with an escape character, which
  	is Ctrl-a by default, but can be changed with -echr. The list
  	below assumes you're using the default.
  	...
  	Ctrl-a b
  	    Send break (magic sysrq in Linux)
  	...


 +--------------------+--------------------------+-----------------------+
 | Paul Goyette       | PGP Key fingerprint:     | E-mail addresses:     |
 | (Retired)          | FA29 0E3B 35AF E8AE 6651 | paul@whooppee.com     |
 | Software Developer | 0786 F758 55DE 53BA 7731 | pgoyette@netbsd.org   |
 +--------------------+--------------------------+-----------------------+

From: Martin Husemann <martin@duskware.de>
To: Paul Goyette <paul@whooppee.com>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
	Andreas Gustafsson <gson@gson.org>
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 15:51:49 +0200

 I have not followed the changes properly, but in general every
 sequence configure via cnmagic(9) needs to work.

 What we likely don't need (and that is how I understood Andreas) is the
 readahead handling during output.

 Andreas: can you verify with the patched kernel that cnmagic still works?

 Martin

From: Andreas Gustafsson <gson@gson.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 16:51:52 +0300

 Manuel Bouyer wrote:
 > no, it works fine even if nothing is being printed. I know, I'm using '+++++'
 > as a key to enter ddb.

 If it's working when nothing is being printed, you are not using
 the code path in case, but a different one that will continue to work.
 During normal system operation when serial port interrupts are
 enabled, the cnmagic check happens in comintr(), and that's the one
 you are relying on.

 The code I'd like to delete is in com_common_putc() and is intended
 for situations such as early startup where interrupts are not yet
 enabled, or late shutdown when they have already been disabled.  If
 you type '+++++' during one of those phases, it is only detected in
 com_common_putc(), and that only works if the kernel happens to print
 a character to the console after each '+'.  It will not work if the
 kernel is deadlocked or in an infinite loop where nothing is printed
 to the console.

 > It definitively has some value. There are setups where BREAK can't be
 > sent (e.g. qemu, Dell idrac, ...). 

 I'm also using both qemu and idrac.  The qemu bug is supposedly fixed
 (https://bugs.launchpad.net/qemu/+bug/1654137), and the Dell server
 I'm using has a nice NMI button.  But that really only matters for the
 interrupts-disabled case; normally, '+++++' will continue to work.

 > I need it.

 Sure, but do you need it specifically in the case when serial
 interrupts are disabled and the kernel is not polling for console
 input but is printing something to the console?  That is the only
 case that deleting this code will break.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 15:55:54 +0200

 On Sun, May 24, 2020 at 04:51:52PM +0300, Andreas Gustafsson wrote:
 > > I need it.
 > 
 > Sure, but do you need it specifically in the case when serial
 > interrupts are disabled and the kernel is not polling for console
 > input but is printing something to the console?  That is the only
 > case that deleting this code will break.

 No, I don't think I ever used it in this case. Thanks for the clarification

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Andreas Gustafsson <gson@gson.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Sun, 24 May 2020 22:56:38 +0300

 Martin Husemann wrote:
 >  I have not followed the changes properly, but in general every
 >  sequence configure via cnmagic(9) needs to work.
 >  
 >  What we likely don't need (and that is how I understood Andreas) is the
 >  readahead handling during output.

 Right.

 >  Andreas: can you verify with the patched kernel that cnmagic still works?

 I don't see any reason why it shouldn't still work during normal
 operation where serial interrupts are enabled, but sure, I can test it.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Valery Ushakov <uwe@stderr.spb.ru>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 03:18:46 +0300

 On Sun, May 24, 2020 at 13:55:03 +0000, Andreas Gustafsson wrote:

 >  The code I'd like to delete is in com_common_putc() and is intended
 >  for situations such as early startup where interrupts are not yet
 >  enabled, or late shutdown when they have already been disabled.
 [...]
 >  > I need it.
 >  
 >  Sure, but do you need it specifically in the case when serial
 >  interrupts are disabled and the kernel is not polling for console
 >  input but is printing something to the console?  That is the only
 >  case that deleting this code will break.

 How does your original test end up in a state like that to trigger
 this bug?  Kernel message being printed with polled code and steals
 the input?  In that case isn't the right fix to only do readahead when
 interrupts are disabled?

 -uwe

From: Andreas Gustafsson <gson@gson.org>
To: Valery Ushakov <uwe@stderr.spb.ru>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 09:34:44 +0300

 Valery Ushakov wrote:
 >  How does your original test end up in a state like that to trigger
 >  this bug?  Kernel message being printed with polled code and steals
 >  the input?

 Yes.

 > In that case isn't the right fix to only do readahead when
 > interrupts are disabled?

 That is the right fix if you think the read-ahead code is worth
 keeping, but I don't.  I'm volunteering to remove it; are you
 volunteering to fix it?
 -- 
 Andreas Gustafsson, gson@gson.org

From: Valery Ushakov <uwe@stderr.spb.ru>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 15:03:30 +0300

 On Mon, May 25, 2020 at 09:34:44 +0300, Andreas Gustafsson wrote:
 > Date: Mon, 25 May 2020 09:34:44 +0300
 > From: Andreas Gustafsson <gson@gson.org>
 > Subject: Re: kern/55286: Kernel messages cause loss of serial console input
 > To: Valery Ushakov <uwe@stderr.spb.ru>
 > CC: gnats-bugs@netbsd.org
 > 
 w
 > Valery Ushakov wrote:
 > >  How does your original test end up in a state like that to trigger
 > >  this bug?  Kernel message being printed with polled code and steals
 > >  the input?
 > 
 > Yes.

 That would be early autoconfig, before the interrupts are enabled, if
 I understand correctly.  That's an important time window, isn't it?


 > > In that case isn't the right fix to only do readahead when
 > > interrupts are disabled?
 > 
 > That is the right fix if you think the read-ahead code is worth
 > keeping, but I don't.  I'm volunteering to remove it; are you
 > volunteering to fix it?

 That sounds pretty rude.  I'm asking a question and you are telling me
 to shut up unless I'm up to imeediately fixing it.

 -uwe

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 14:56:07 +0200

 Here is a patch that uses readahead in comcnputc only while we are still
 cold - compile tested only, but should fix the problem as well.


 Martin

 Index: com.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/com.c,v
 retrieving revision 1.358
 diff -u -p -r1.358 com.c
 --- com.c	1 May 2020 07:27:51 -0000	1.358
 +++ com.c	25 May 2020 12:48:20 -0000
 @@ -159,7 +159,7 @@ int	com_to_tiocm(struct com_softc *);
  void	com_iflush(struct com_softc *);

  int	com_common_getc(dev_t, struct com_regs *);
 -static void	com_common_putc(dev_t, struct com_regs *, int);
 +static void	com_common_putc(dev_t, struct com_regs *, int, int);

  int	cominit(struct com_regs *, int, int, int, tcflag_t);

 @@ -2399,12 +2399,12 @@ com_common_getc(dev_t dev, struct com_re
  }

  static void
 -com_common_putc(dev_t dev, struct com_regs *regsp, int c)
 +com_common_putc(dev_t dev, struct com_regs *regsp, int c, int with_readahead)
  {
  	int s = splserial();
  	int cin, stat, timo;

 -	if (com_readaheadcount < MAX_READAHEAD
 +	if (with_readahead && com_readaheadcount < MAX_READAHEAD
  	     && ISSET(stat = CSR_READ_1(regsp, COM_REG_LSR), LSR_RXRDY)) {
  		int cn_trapped = 0;
  		cin = CSR_READ_1(regsp, COM_REG_RXDATA);
 @@ -2572,7 +2572,7 @@ void
  comcnputc(dev_t dev, int c)
  {

 -	com_common_putc(dev, &comcons_info.regs, c);
 +	com_common_putc(dev, &comcons_info.regs, c, cold);
  }

  void
 @@ -2641,7 +2641,7 @@ void
  com_kgdb_putc(void *arg, int c)
  {

 -	com_common_putc(NODEV, &comkgdbregs, c);
 +	com_common_putc(NODEV, &comkgdbregs, c, 0);
  }
  #endif /* KGDB */


From: Andreas Gustafsson <gson@gson.org>
To: Valery Ushakov <uwe@stderr.spb.ru>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Mon, 25 May 2020 16:47:02 +0300

 Valery Ushakov wrote:
 >  > >  How does your original test end up in a state like that to trigger
 >  > >  this bug?  Kernel message being printed with polled code and steals
 >  > >  the input?
 >  > 
 >  > Yes.
 >  
 >  That would be early autoconfig, before the interrupts are enabled, if
 >  I understand correctly.  That's an important time window, isn't it?

 Early autoconfig is one of the cases the read-ahead mechanism appears
 to be intended for, and yes, that's an important time window, but I
 have a hard time imagining a situation where the read-ahead would
 actually be useful.  For example, if the system hangs during early
 autoconfig, read-ahead won't help you break into ddb because a hung
 system will not be printing anything to the console.

 And to be clear, the kernel messages that are causing data loss for me
 are not the ones printed during early autoconfig, but ones printed
 after the system has fully booted.

 When the read-ahead mechanism was first added in com.c 1.174, it was
 not enabled by default (it was conditional on "options DDB_BREAK_CHAR"),
 but apparently it got enabled by default with the conversion to use
 cnmagic in com.c 1.186.  There is no corresponding mechanism in the
 other serial drivers.

 >  > > In that case isn't the right fix to only do readahead when
 >  > > interrupts are disabled?
 >  > 
 >  > That is the right fix if you think the read-ahead code is worth
 >  > keeping, but I don't.  I'm volunteering to remove it; are you
 >  > volunteering to fix it?
 >  
 >  That sounds pretty rude.  I'm asking a question and you are telling me
 >  to shut up unless I'm up to imeediately fixing it.

 Sorry, I didn't mean to be rude, it was an honest question and I'm
 certainly not asking you to shut up.  I'm just trying to figure out if
 anyone cares about this code, and if you do care about it, I don't
 think it's unreasonable to ask if you care about it enough to want to
 fix the bugs in it.
 -- 
 Andreas Gustafsson, gson@gson.org

From: Andreas Gustafsson <gson@gson.org>
To: Martin Husemann <martin@duskware.de>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Tue, 26 May 2020 16:08:59 +0300

 Martin Husemann wrote:
 >  Here is a patch that uses readahead in comcnputc only while we are still
 >  cold - compile tested only, but should fix the problem as well.

 I have now run tested it and can confirm that it fixes the problem.
 Will you commit it?
 -- 
 Andreas Gustafsson, gson@gson.org

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55286 CVS commit: src/sys/dev/ic
Date: Tue, 26 May 2020 13:24:53 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue May 26 13:24:53 UTC 2020

 Modified Files:
 	src/sys/dev/ic: com.c

 Log Message:
 Make the readahead-while-sending-output code conditional in
 com_common_putc(), only erquest readahead before interrupts are enabled
 and when called from comcnputc().
 Fixes PR kern/55286.


 To generate a diff of this commit:
 cvs rdiff -u -r1.358 -r1.359 src/sys/dev/ic/com.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Martin Husemann <martin@duskware.de>
To: Andreas Gustafsson <gson@gson.org>
Cc: gnats-bugs@netbsd.org
Subject: Re: kern/55286: Kernel messages cause loss of serial console input
Date: Tue, 26 May 2020 15:25:16 +0200

 On Tue, May 26, 2020 at 04:08:59PM +0300, Andreas Gustafsson wrote:
 > Martin Husemann wrote:
 > >  Here is a patch that uses readahead in comcnputc only while we are still
 > >  cold - compile tested only, but should fix the problem as well.
 > 
 > I have now run tested it and can confirm that it fixes the problem.
 > Will you commit it?

 Thanks for testing - commited!

 Martin

State-Changed-From-To: open->closed
State-Changed-By: gson@NetBSD.org
State-Changed-When: Tue, 26 May 2020 14:23:51 +0000
State-Changed-Why:
Fixed by martin, thanks.


>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.46 2020/01/03 16:35:01 leot Exp $
$NetBSD: gnats_config.sh,v 1.9 2014/08/02 14:16:04 spz Exp $
Copyright © 1994-2020 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.