NetBSD Problem Report #58452

From www@netbsd.org  Mon Jul 22 02:41:42 2024
Return-Path: <www@netbsd.org>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits)
	 client-signature RSA-PSS (2048 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 5E5261A9239
	for <gnats-bugs@gnats.NetBSD.org>; Mon, 22 Jul 2024 02:41:42 +0000 (UTC)
Message-Id: <20240722024141.1E2D41A923A@mollari.NetBSD.org>
Date: Mon, 22 Jul 2024 02:41:41 +0000 (UTC)
From: nathanialsloss@yahoo.com.au
Reply-To: nathanialsloss@yahoo.com.au
To: gnats-bugs@NetBSD.org
Subject: NCR5380 SCSI fixes for aborting transfers. BlueSCSI(v2)
X-Send-Pr-Version: www-1.0

>Number:         58452
>Category:       kern
>Synopsis:       NCR5380 SCSI fixes for aborting transfers. BlueSCSI(v2)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    nat
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Jul 22 02:45:00 +0000 2024
>Last-Modified:  Sun Sep 15 01:35:01 +0000 2024
>Originator:     Nat Sloss
>Release:        NetBSD-10.0; applies to -9 and -current
>Organization:
NetBSD
>Environment:
NetBSD princess 10.99.10 NetBSD 10.99.10 (WSFBPWR) #12: Tue Oct 17 12:38:18 AEDT 2023  build@microrusty:/home/build/nbsd10/sys/arch/mac68k/compile/WSFBPWR mac68k
>Description:
With the dse(4) devices emulated by BlueSCSI (v2) and RaSCSI hardware,
transfers are aborted more frequently, which can lead to a panic.

The problem stems from the fact that transfers can take a re-entrant fast path, which makes them difficult to abort.


>How-To-Repeat:
Everyday use of a SCSISBC kernel and dse(4) can cause a panic.

>Fix:
I'll upload a patch in a follow-up email.

There are four components to the patch:

1. No more fast path for transfers.  The old API is still there for drivers that use it safely.

2. Abort messages are printed only when DEBUG is enabled, as they are harmless.

3. Medium errors follow the normal retry path.

4. Existing transfers are aborted on SCSI bus reset.
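
As an illustration of component 1, here is a small userland model (not the
kernel code) of why running the queue from inside the completion call is
awkward for a driver that only marks itself idle afterwards.  All toy_* names
are hypothetical.  The assumption that the adapter state is reset to idle only
after the completion call returns is taken from the context lines around
scsipi_done() in the ncr5380sbc.c part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the adapter state (compare sc_state). */
enum toy_state { TOY_IDLE, TOY_BUSY };

struct toy_chan {
	enum toy_state state;	/* adapter state */
	int queued;		/* transfers waiting on the channel queue */
	int started_while_busy;	/* requests that arrived before idle */
};

/* Adapter entry point: note whether it was entered while still busy. */
void
toy_request(struct toy_chan *c)
{
	if (c->state != TOY_IDLE)
		c->started_while_busy++;
	c->state = TOY_BUSY;
}

/* Start the next queued transfer, if there is one. */
void
toy_run_queue(struct toy_chan *c)
{
	if (c->queued > 0) {
		c->queued--;
		toy_request(c);
	}
}

/*
 * Midlayer completion.  run_more == true models a completion that also
 * runs the queue; run_more == false models a "done once" variant that
 * leaves the queue alone.
 */
void
toy_done(struct toy_chan *c, bool run_more)
{
	if (run_more)
		toy_run_queue(c);	/* may re-enter toy_request() */
}

/* Driver completion: the adapter goes idle only after toy_done() returns. */
void
toy_driver_done(struct toy_chan *c, bool run_more)
{
	assert(c->state == TOY_BUSY);
	toy_done(c, run_more);
	c->state = TOY_IDLE;
}
```

Calling toy_driver_done() with run_more set to true and one transfer queued
lets toy_request() see a busy adapter.  With run_more set to false the queued
transfer stays on the queue until something else runs it.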

With these changes the system (PowerBook 160) has performed very well.

Filing this under kern as its scope is greater than mac68k.

Best regards,

Nat

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: kern-bug-people->nat
Responsible-Changed-By: nat@NetBSD.org
Responsible-Changed-When: Mon, 22 Jul 2024 08:10:07 +0000
Responsible-Changed-Why:
Mine.


From: Nat Sloss <nathanialsloss@yahoo.com.au>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org,
 netbsd-bugs@netbsd.org,
 gnats-admin@netbsd.org
Subject: Re: kern/58452 (NCR5380 SCSI fixes for aborting transfers. BlueSCSI(v2))
Date: Mon, 22 Jul 2024 18:13:57 +1000

 --Boundary-00=_GThnmwbRefpRBL5
 Content-Type: Text/Plain;
   charset="iso-8859-15"
 Content-Transfer-Encoding: 7bit

 Attached is the patch mentioned in the report.

 --Boundary-00=_GThnmwbRefpRBL5
 Content-Type: text/x-patch;
   charset="ISO-8859-1";
   name="macscsi.changes.diff"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename="macscsi.changes.diff"

 diff -r 30e85f50c073 sys/dev/ic/ncr5380sbc.c
 --- a/sys/dev/ic/ncr5380sbc.c	Mon Apr 22 21:02:18 2024 +0000
 +++ b/sys/dev/ic/ncr5380sbc.c	Sat Apr 27 02:19:31 2024 +1000
 @@ -393,6 +393,7 @@
  static void
  ncr5380_reset_scsibus(struct ncr5380_softc *sc)
  {
 +	struct sci_req *sr;

  	NCR_TRACE("reset_scsibus, cur=0x%x\n",
  			  (long) sc->sc_current);
 @@ -409,6 +410,9 @@
  	delay(100000);

  	/* XXX - Need to cancel disconnected requests. */
 +	sr = sc->sc_current;
 +	if (sr)
 +		ncr5380_abort(sc);
  }


 @@ -617,9 +621,11 @@
  			/* Terminate any current command. */
  			sr = sc->sc_current;
  			if (sr) {
 +#ifdef	NCR5380_DEBUG
  				printf("%s: polled request aborting %d/%d\n",
  				    device_xname(sc->sc_dev),
  				    sr->sr_target, sr->sr_lun);
 +#endif
  				ncr5380_abort(sc);
  			}
  			if (sc->sc_state != NCR_IDLE) {
 @@ -802,7 +808,7 @@
  	sc->sc_ncmds--;

  	/* Tell common SCSI code it is done. */
 -	scsipi_done(xs);
 +	scsipi_done_once(xs);

  	sc->sc_state = NCR_IDLE;
  	/* Now ncr5380_sched() may be called again. */
 diff -r 30e85f50c073 sys/dev/scsipi/scsipi_base.c
 --- a/sys/dev/scsipi/scsipi_base.c	Mon Apr 22 21:02:18 2024 +0000
 +++ b/sys/dev/scsipi/scsipi_base.c	Sat Apr 27 02:19:31 2024 +1000
 @@ -96,6 +96,8 @@
  SDT_PROBE_DEFINE1(scsi, base, xfer, free,  "struct scsipi_xfer *"/*xs*/);

  static int	scsipi_complete(struct scsipi_xfer *);
 +static struct scsipi_channel*
 +		scsipi_done_internal(struct scsipi_xfer *, bool);
  static void	scsipi_request_sense(struct scsipi_xfer *);
  static int	scsipi_enqueue(struct scsipi_xfer *);
  static void	scsipi_run_queue(struct scsipi_channel *chan);
 @@ -1056,6 +1058,13 @@
  		case SKEY_VOLUME_OVERFLOW:
  			error = ENOSPC;
  			break;
 +		case SKEY_MEDIUM_ERROR:
 +			if (xs->xs_retries != 0) {
 +				xs->xs_retries--;
 +				error = ERESTART;
 +			} else
 +				error = EIO;
 +			break;
  		default:
  			error = EIO;
  			break;
 @@ -1584,6 +1593,28 @@
  void
  scsipi_done(struct scsipi_xfer *xs)
  {
 +	struct scsipi_channel *chan;
 +	/*
 +	 * If there are more xfers on the channel's queue, attempt to
 +	 * run them.
 +	 */
 +	if ((chan = scsipi_done_internal(xs, true)) != NULL)
 +		scsipi_run_queue(chan);
 +}
 +
 +/*
 + * Just like scsipi_done(), but no recursion.  Useful if aborting the current
 + * transfer.
 + */
 +void
 +scsipi_done_once(struct scsipi_xfer *xs)
 +{
 +	(void)scsipi_done_internal(xs, false);
 +}
 +
 +static struct scsipi_channel*
 +scsipi_done_internal(struct scsipi_xfer *xs, bool more)
 +{
  	struct scsipi_periph *periph = xs->xs_periph;
  	struct scsipi_channel *chan = periph->periph_channel;
  	int freezecnt;
 @@ -1672,7 +1703,7 @@
  		 */
  		if (xs->xs_control & XS_CTL_POLL) {
  			mutex_exit(chan_mtx(chan));
 -			return;
 +			return NULL;
  		}
  		cv_broadcast(xs_cv(xs));
  		mutex_exit(chan_mtx(chan));
 @@ -1684,7 +1715,7 @@
  	 * without error; no use in taking a context switch
  	 * if we can handle it in interrupt context.
  	 */
 -	if (xs->error == XS_NOERROR) {
 +	if (xs->error == XS_NOERROR && more == true) {
  		mutex_exit(chan_mtx(chan));
  		(void) scsipi_complete(xs);
  		goto out;
 @@ -1699,11 +1730,7 @@
  	mutex_exit(chan_mtx(chan));

   out:
 -	/*
 -	 * If there are more xfers on the channel's queue, attempt to
 -	 * run them.
 -	 */
 -	scsipi_run_queue(chan);
 +	return chan;
  }

  /*
 diff -r 30e85f50c073 sys/dev/scsipi/scsipiconf.h
 --- a/sys/dev/scsipi/scsipiconf.h	Mon Apr 22 21:02:18 2024 +0000
 +++ b/sys/dev/scsipi/scsipiconf.h	Sat Apr 27 02:19:31 2024 +1000
 @@ -700,6 +700,7 @@
  	    struct scsi_mode_parameter_header_10 *, int, int, int, int);
  int	scsipi_start(struct scsipi_periph *, int, int);
  void	scsipi_done(struct scsipi_xfer *);
 +void	scsipi_done_once(struct scsipi_xfer *);
  void	scsipi_user_done(struct scsipi_xfer *);
  int	scsipi_interpret_sense(struct scsipi_xfer *);
  void	scsipi_wait_drain(struct scsipi_periph *);

 --Boundary-00=_GThnmwbRefpRBL5--

From: Nathanial Sloss <nathanialsloss@yahoo.com.au>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58452 (NCR5380 SCSI fixes for aborting transfers. BlueSCSI(v2))
Date: Sat, 10 Aug 2024 18:24:49 +1000

 --Boundary-00=_SPytmxOh7zWCIRS
 Content-Type: Text/Plain;
   charset="iso-8859-15"
 Content-Transfer-Encoding: 7bit

 Sorry I was confused about the code being re-entrant and the crash point :(

 Attached is the dmesg and backtrace.

 Best regards,

 Nat

 --Boundary-00=_SPytmxOh7zWCIRS
 Content-Type: text/plain;
   charset="ISO-8859-1";
   name="macscsiabort.dmesg.txt"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename="macscsiabort.dmesg.txt"

 [   1.0000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
 [   1.0000000]     2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013,
 [   1.0000000]     2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023,
 [   1.0000000]     2024
 [   1.0000000]     The NetBSD Foundation, Inc.  All rights reserved.
 [   1.0000000] Copyright (c) 1982, 1986, 1989, 1991, 1993
 [   1.0000000]     The Regents of the University of California.  All rights reserved.

 [   1.0000000] NetBSD 10.99.11 (WSFBPWR) #258: Sat Aug 10 10:32:34 AEST 2024
 [   1.0000000] 	build@microrusty:/home/build/nbsd10/sys/arch/mac68k/compile/WSFBPWR
 [   1.0000000] Apple Macintosh PowerBook 160  (68030)
 [   1.0000000] cpu: delay factor 265
 [   1.0000000] fpu: emulator
 [   1.0000000] total memory = 14336 KB
 [   1.0000000] avail memory = 11040 KB
 [   1.0000000] mrg: '2nd Powerbook class ROMs' ROM glue, tracing off, debug off, silent traps
 [   1.0000000] mainbus0 (root)
 [   1.0000000] obio0 at mainbus0
 [   1.0000000] adb0 at obio0
 [   1.0000000] ascaudio0 at obio0: Enhanced Apple Sound Chip
 [   1.0000000] audio0 at ascaudio0: playback, capture, half duplex, independent
 [   1.0000000] audio0: slinear_be:16 1ch 11127Hz, blk 8192 bytes (736.2ms) for playback
 [   1.0000000] audio0: slinear_be:16 1ch 11025Hz, blk 8192 bytes (743ms) for recording
 [   1.0000000] spkr0 at audio0: PC Speaker (synthesized)
 [   1.0000000] wsbell0 at spkr0 mux 1
 [   1.0000000] intvid0 at obio0 @ 60000000: On-board video
 [   1.0000000] intvid0: 640 x 400, 4-bpp color
 [   1.0000000] genfb0 at intvid0: colormap callback not provided
 [   1.0000000] wsdisplay0 at genfb0 kbdmux 1
 [   1.0000000] sbc0 at obio0 addr 0: options=06<RESELECT,INTR>
 [   1.0000000] scsibus0 at sbc0: 8 targets, 8 luns per target
 [   1.0000000] zsc0 at obio0 chip type 0 
 [   1.0000000] zsc0 channel 0: d_speed   9600 DCD clk 0 CTS clk 0
 [   1.0000000] zstty0 at zsc0 channel 0 (console i/o)
 [   1.0000000] zsc0 channel 1: d_speed   9600 DCD clk 0 CTS clk 0
 [   1.0000000] zstty1 at zsc0 channel 1
 [   1.0000000] nubus0 at mainbus0
 [   1.0256579] scsibus0: waiting 2 seconds for devices to settle...
 [   1.1041735] adb0 (direct, PowerBook): 2 targets
 [   2.2808011] aed0 at adb0 addr 0: ADB Event device
 [   2.3429283] akbd0 at adb0 addr 2: PowerBook keyboard
 [   2.4105792] wskbd0 at akbd0 mux 1
 [   2.4613209] ams0 at adb0 addr 3: 1-button, 200 dpi mouse
 [   2.5769255] wsmouse0 at ams0 mux 0
 [   2.6736805] WARNING: system needs entropy for security; see entropy(7)
 [   4.3922339] sd0 at scsibus0 target 0 lun 0: <QUANTUM, BlueSCSI Pico, 1.0> disk fixed
 [   4.4922186] sd0: 20000 MB, 6400 cyl, 200 head, 32 sec, 512 bytes/sect x 40960000 sectors
 [   4.5994720] sd0: async, 8-bit transfers
 [   4.7391242] sd1 at scsibus0 target 2 lun 0: <RaSCSI, SCSI HD 41 MB, 2208> disk fixed
 [   4.8391268] sd1: 40960 KB, 409 cyl, 8 head, 25 sec, 512 bytes/sect x 81920 sectors
 [   4.9393323] sd1: async, 8-bit transfers
 [   5.0693696] dse0 at scsibus0 target 4 lun 0: <Dayna, SCSI/Link, 2.0f> processor fixed
 [   5.6694385] dse0: ethernet address 00:80:19:04:03:5e
 [   5.7542590] dse0: async, 8-bit transfers
 [   5.8027782] cd0 at scsibus0 target 5 lun 0: <RaSCSI, SCSI CD-ROM, 2208> cdrom removable
 [   5.9201572] cd0: async, 8-bit transfers
 [   5.9695304] sd2 at scsibus0 target 6 lun 0: <QUANTUM, FIREBALL, 2208> disk fixed
 [   6.0705912] sd2: 40960 KB, 409 cyl, 8 head, 25 sec, 512 bytes/sect x 81920 sectors
 [   6.1695342] sd2: async, 8-bit transfers
 [   6.2204700] WARNING: 1 error while detecting hardware; check system log.

 --Boundary-00=_SPytmxOh7zWCIRS
 Content-Type: text/plain;
   charset="ISO-8859-1";
   name="macscsiabort.bt.txt"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename="macscsiabort.bt.txt"

 [ 1334.3407620] sbc0: polled request aborting 0/0
 [ 1450.7333887] sbc0: polled request aborting 0/0

 [ 1540.8555152] panic: ncr5380_scsipi_request: polled request, abort failed
 [ 1540.8555152] cpu0: Begin traceback...
 [ 1540.8555152] ?(?)
 [ 1540.8555152] db_panic(0,62f800,3d27d74,13000c,1b7c04) at 0
 [ 1540.8555152] vpanic(1b7c04,3d27d80,3d27da8,3991e,1b7c04) + 18c
 [ 1540.8555152] panic(1b7c04,1b7cb8,12fe10,1,23b16) + c
 [ 1540.8555152] ncr5380_scsipi_request(62f83c,0,6b5e3c,62f834) + c6
 [ 1540.8555152] scsipi_run_queue(?)
 [ 1540.8555152] scsipi_done(62f83c,6b5ef0,0,38800,62f800) + 226
 [ 1540.8555152] ncr5380_done(?)
 [ 1540.8555152] compat_14_sys_msgctl(62f800) + b8
 [ 1540.8555152] ncr5380_machine(?)
 [ 1540.8555152] compat_50_sys___msgctl13(62f800,62f800,12fe10,3d27e78,dcda) + 8dc
 [ 1540.8555152] ncr5380_intr(62f800) + 36
 [ 1540.8555152] sbc_irq_intr(62f800) + 20
 [ 1540.8555152] via2_intr(3d27eb0,2,3d27f00,33ce,2214) + 44
 [ 1540.8555152] intr_dispatch(2214,2004,62f834,6b5e3c,20000000) + 5c
 [ 1540.8555152] intrhand(?)
 [ 1540.8555152] scsipi_execute_xs(6b5e3c,62f834,6b5ed0,3d27f62,6) + 4
 [ 1540.8555152] scsipi_command(655644,3d27f62,6,668000,4000,4,186a0,0,1003) + 78
 [ 1540.8555152] dse_recv_worker(629ad4,629840) + 50
 [ 1540.8555152] workqueue_worker(708100) + c4
 [ 1540.8555152] lwp_trampoline() + e
 [ 1540.8555152] cpu0: End traceback...

 --Boundary-00=_SPytmxOh7zWCIRS--

From: mlelstv@serpens.de (Michael van Elst)
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/58452 (NCR5380 SCSI fixes for aborting transfers. BlueSCSI(v2))
Date: Sat, 10 Aug 2024 20:03:13 -0000 (UTC)

 nathanialsloss@yahoo.com.au (Nathanial Sloss) writes:

 > [   1.0000000] Apple Macintosh PowerBook 160  (68030)

 > [ 1540.8555152] panic: ncr5380_scsipi_request: polled request, abort failed
 > [ 1540.8555152] cpu0: Begin traceback...
 > [ 1540.8555152] ?(?)
 > [ 1540.8555152] db_panic(0,62f800,3d27d74,13000c,1b7c04) at 0
 > [ 1540.8555152] vpanic(1b7c04,3d27d80,3d27da8,3991e,1b7c04) + 18c
 > [ 1540.8555152] panic(1b7c04,1b7cb8,12fe10,1,23b16) + c
 > [ 1540.8555152] ncr5380_scsipi_request(62f83c,0,6b5e3c,62f834) + c6
 > [ 1540.8555152] scsipi_run_queue(?)


 That's a combination of dse(4) doing only polled requests (why?) and the
 ncr5380 driver (sys/dev/ic/ncr5380sbc.c) handling polled requests very badly.

 It's also confusing that atari and mac68k have their own (similar) ncr5380
 drivers, but which aren't used. At a first glance, these drivers ignore
 a request for polled I/O, but probably run each command to completion anyway.

From: Nathanial Sloss <nathanialsloss@yahoo.com.au>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org,
 netbsd-bugs@netbsd.org
Subject: Re: kern/58452 (NCR5380 SCSI fixes for aborting transfers. BlueSCSI(v2))
Date: Sun, 15 Sep 2024 09:32:46 +1000

 On Sun, 11 Aug 2024 06:05:01 Michael van Elst wrote:
 > The following reply was made to PR kern/58452; it has been noted by GNATS.
 > 
 > From: mlelstv@serpens.de (Michael van Elst)
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: kern/58452 (NCR5380 SCSI fixes for aborting transfers. BlueSCSI(v2))
 > Date: Sat, 10 Aug 2024 20:03:13 -0000 (UTC)
 > 
 >  nathanialsloss@yahoo.com.au (Nathanial Sloss) writes:
 >  > [   1.0000000] Apple Macintosh PowerBook 160  (68030)
 >  > 
 >  > [ 1540.8555152] panic: ncr5380_scsipi_request: polled request, abort failed
 >  > [ 1540.8555152] cpu0: Begin traceback...
 >  > [ 1540.8555152] ?(?)
 >  > [ 1540.8555152] db_panic(0,62f800,3d27d74,13000c,1b7c04) at 0
 >  > [ 1540.8555152] vpanic(1b7c04,3d27d80,3d27da8,3991e,1b7c04) + 18c
 >  > [ 1540.8555152] panic(1b7c04,1b7cb8,12fe10,1,23b16) + c
 >  > [ 1540.8555152] ncr5380_scsipi_request(62f83c,0,6b5e3c,62f834) + c6
 >  > [ 1540.8555152] scsipi_run_queue(?)
 > 
 >  That's a combination of dse(4) doing only polled requests (why?) and the
 >  ncr5380 driver (sys/dev/ic/ncr5380sbc.c) handling polled requests very badly.
 > 

 I've since changed dse(4) so that only reads need to be polled, which is a
 quirk of the emulated hardware and ncr5380sbc.

 The patch just makes it possible to handle aborts and medium errors better on
 ncr5380sbc, which is only noticeable when using sbc with a device such as
 dse(4) that sends frequent polled requests.
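
For illustration only, the medium-error handling can be modelled in userland
roughly as follows.  This is a sketch that mirrors the shape of the
SKEY_MEDIUM_ERROR case in the patch, not the kernel code itself: struct
toy_xfer, toy_medium_error() and the TOY_* values are hypothetical stand-ins
for the transfer's retry counter and for the ERESTART and EIO results.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcomes; the patch returns ERESTART or EIO instead. */
enum toy_result { TOY_RESTART, TOY_EIO };

/* Hypothetical transfer carrying only a retry budget. */
struct toy_xfer {
	int retries;	/* retries still allowed for this transfer */
};

/*
 * Decide what to do with a transfer that reported a medium error:
 * retry while the budget lasts, then give up with an I/O error.
 */
enum toy_result
toy_medium_error(struct toy_xfer *xs)
{
	assert(xs != NULL);

	if (xs->retries != 0) {
		xs->retries--;
		return TOY_RESTART;
	}
	return TOY_EIO;
}
```

A transfer that starts with a retry budget of 2 is restarted twice and fails
with the I/O error on the third medium error.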

 >  It's also confusing that atari and mac68k have their own (similar) ncr5380
 >  drivers, but which aren't used. At a first glance, these drivers ignore
 >  a request for polled I/O, but probably run each command to completion anyway.


 I highly doubt that dse(4), at least as emulated by BlueSCSI v2 / RaSCSI /
 PiSCSI, will work with those drivers; they will most likely get garbled input
 from dse(4) due to issues with its emulation.

 Nat

 PS: You could just as well say that the issue is with the emulation of the
 DaynaLINK SCSI device, but I'm only working with the current state of these
 devices, which could be handled better by the kernel and are becoming more
 prevalent as failing spinning disks are replaced.

>Unformatted:


Copyright © 1994-2024 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.