NetBSD Problem Report #31003

From igor@string1.ciencias.uniovi.es  Tue Aug 16 23:23:39 2005
Return-Path: <igor@string1.ciencias.uniovi.es>
Received: from FRESNO.NET.UNIOVI.ES (fresno.net.uniovi.es [156.35.11.2])
	by narn.netbsd.org (Postfix) with ESMTP id 1E74463B116
	for <gnats-bugs@gnats.NetBSD.org>; Tue, 16 Aug 2005 23:23:39 +0000 (UTC)
Message-Id: <20050816232337.4067E3D83@string1.ciencias.uniovi.es>
Date: Wed, 17 Aug 2005 01:23:37 +0200
From: igor@string1.ciencias.uniovi.es
Sender: igor@string1.ciencias.uniovi.es
Reply-To: igor@string1.ciencias.uniovi.es
To: gnats-bugs@netbsd.org
Subject: umass(4) panic provoked by Plextor portable hard disk drive
X-Send-Pr-Version: 3.95

>Number:         31003
>Category:       kern
>Synopsis:       umass(4) panic provoked by Plextor portable hard disk drive
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Aug 16 23:24:00 +0000 2005
>Closed-Date:    Sun Apr 28 07:22:51 +0000 2019
>Last-Modified:  Sun Apr 28 07:22:51 +0000 2019
>Originator:     Igor Sobrado
>Release:        NetBSD 2.0.2
>Organization:
	University of Oviedo
>Environment:
System: NetBSD altair.v6.local 2.0.2 NetBSD 2.0.2 (GENERIC_LAPTOP) #0: Wed Mar 23 08:59:09 UTC 2005 jmc@faith.netbsd.org:/home/builds/ab/netbsd-2-0-2-RELEASE/i386/200503220140Z-obj/home/builds/ab/netbsd-2-0-2-RELEASE/src/sys/arch/i386/compile/GENERIC_LAPTOP i386
Architecture: i386
Machine: i386
>Description:
	The Plextor portable hard disk drive PX-PH08U is a member of a new
	family of USB mass storage devices.  The PX-PH08U is an 80 GB,
	2.5 inch hard disk drive in an external USB enclosure.  It does
	not require an external power supply unit; as a consequence, the
	disk is turned off as soon as the transition to suspend mode is
	honored.  I suspect that this fact may be related to the problem
	outlined in this PR.

	Setup used:

	The PX-PH08U portable hard disk drive is a USB 2.0 device connected
	to a Dell Latitude CPi R400GT laptop (BIOS rev. A14) on its USB 1.1
	port.  This laptop is running NetBSD 2.0.2 and has an internal
	20 GB Hitachi HDD (IC25N020ATDA04).  The Plextor portable hard disk
	drive is identified as a USB mass storage device:

	    Aug 11 12:52:26 altair /netbsd: umass0 at uhub0 port 1 configuration 1 interface 0
	    Aug 11 12:52:26 altair /netbsd: umass0: Plextor S.A./N.V. PLEXTOR PX-PH, rev 2.00/3.02, addr 2
	    Aug 11 12:52:26 altair /netbsd: umass0: using SCSI over Bulk-Only
	    Aug 11 12:52:26 altair /netbsd: scsibus0 at umass0: 2 targets, 1 lun per target

	The PX-PH08U portable hard disk drive contains a UFS-2 filesystem:

	    altair# disklabel sd0
	    # /dev/rsd0d:
	    type: SCSI
	    disk: PX-PH08U/T3
	    label: 
	    flags:
	    bytes/sector: 512
	    sectors/track: 63
	    tracks/cylinder: 16
	    sectors/cylinder: 1008
	    cylinders: 155127
	    total sectors: 156368016
	    rpm: 5400
	    interleave: 1
	    trackskew: 0
	    cylinderskew: 0
	    headswitch: 0           # microseconds
	    track-to-track seek: 0  # microseconds
	    drivedata: 0 

	    4 partitions:
	    #        size    offset     fstype [fsize bsize cpg/sgs]
	     a: 156368016         0     4.2BSD   1024  8192 46936  # (Cyl.      0 - 155126)
	     c: 156368016         0     unused      0     0        # (Cyl.      0 - 155126)
	     d: 156368016         0     unused      0     0        # (Cyl.      0 - 155126)

	I usually mount this filesystem on /mnt:

	    altair# mount /dev/sd0a /mnt
	    altair# df -k
	    Filesystem  1K-blocks     Used     Avail Capacity  Mounted on
	    /dev/wd0a       45903    19049     24558    43%    /
	    /dev/wd0f       31207    14472     15174    48%    /var
	    /dev/wd0e      370295   163482    188298    46%    /usr
	    /dev/wd0g    11476539   209415  10693297     1%    /home
	    /dev/wd0h      247007   113069    121587    48%    /usr/X11R6
	    /dev/wd0i       31207     4038     25608    13%    /usr/contrib
	    /dev/wd0j      986743        1    937404     0%    /usr/obj
	    /dev/wd0k     1973735   774224   1100824    41%    /usr/pkg
	    /dev/wd0l      349711   159174    173051    47%    /usr/pkgsrc
	    /dev/wd0m     1480391   693818    712553    49%    /usr/src
	    /dev/wd0n      986743   445625    491780    47%    /usr/xsrc
	    mfs:433         63959       27     60734     0%    /tmp
	    kernfs              1        1         0   100%    /kern
	    fdesc               1        1         0   100%    /dev
	    /dev/sd0a    73559093        1  69881137     0%    /mnt

	Description of the problem:

	When the computer goes into suspend mode (Fn+Suspend) the next messages
	are registered in /var/log/messages:

	    Aug 16 23:27:26 altair /netbsd: umass0: BBB reset failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-in clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-out clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB reset failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-in clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-out clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB reset failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-in clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-out clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB reset failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-in clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-out clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB reset failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-in clear stall failed, STALLED
	    Aug 16 23:27:26 altair /netbsd: umass0: BBB bulk-out clear stall failed, STALLED

	followed by this error:

	    Aug 16 23:22:34 altair /netbsd: umass0: at uhub0 port 1 (addr 2) disconnected
	    Aug 16 23:22:34 altair /netbsd: sd0(umass0:0:0:0): generic HBA error
	    Aug 16 23:22:34 altair /netbsd: uvm_fault(0xc0601680, 0, 0, 1) -> 0xe

	Once rebooted, both the internal HDD filesystems and the portable
	hard disk drive filesystem must be checked for consistency.

	I have classified this PR as a critical, high-priority problem because
	it can damage filesystems (on portable hard disk drives and on
	other system disks, as the filesystems cannot be cleanly unmounted)
	and because it drops into the in-kernel debugger, stopping the computer.
>How-To-Repeat:
	An easy way to reproduce the problem is to mount a filesystem
	from the portable hard disk drive on a mount point (e.g., /mnt)
	and then request a transition to suspend mode.
>Fix:
	As a temporary workaround, it is possible to unmount the portable
	hard disk drive when a client requests a transition to suspend mode.
	This action can be configured for the related apmd(8) transition in
	the files in /etc/apm.  To be clear, this workaround cannot be
	considered a fix at all for production systems.
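
	The workaround above could be sketched as the hook script below.
	This is only a minimal sketch, not a tested configuration.  The
	mount point /mnt is taken from this PR; installing or linking the
	script as /etc/apm/suspend and /etc/apm/standby is an assumption
	about the local apmd(8) setup, and the helper names is_mounted_on
	and unmount_before_suspend are hypothetical.

```sh
#!/bin/sh
# Sketch of an apmd(8) hook that unmounts the portable disk before suspend.
# Assumptions (not from the PR text): the script is installed or linked as
# /etc/apm/suspend and /etc/apm/standby, and the disk is mounted on /mnt.

MOUNT_POINT="${MOUNT_POINT:-/mnt}"

# Reads mount(8) output on stdin; succeeds if something is mounted on $1.
is_mounted_on() {
	grep -qF " on $1 "
}

# Flush pending writes and unmount the disk if it is currently mounted.
unmount_before_suspend() {
	if mount | is_mounted_on "$MOUNT_POINT"; then
		sync
		umount "$MOUNT_POINT" ||
		    logger -t apm "could not unmount $MOUNT_POINT before suspend"
	fi
}

# Act only when invoked under a suspend-type name, so the same file can be
# linked under several names in /etc/apm without side effects elsewhere.
case "${0##*/}" in
suspend|standby)
	unmount_before_suspend
	;;
esac
```

	If umount fails (for example, because a process still has files
	open under the mount point), the sketch only logs the failure, so
	the filesystem may still be mounted when the machine suspends.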

>Release-Note:

>Audit-Trail:
From: Igor Sobrado <igor@string1.ciencias.uniovi.es>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/31003
Date: Wed, 17 Aug 2005 09:11:22 +0200

 Two random notes on this PR:

   1. The timestamps in the /var/log/messages excerpts come from
      two different events.  That is why the umass "BBB"
      messages are dated Aug 16 23:27:26, but the umass/SCSI/UVM
      errors are dated Aug 16 23:22:34.

   2. Other USB mass storage devices (e.g., USB flash drives)
      do not show this behavior.

 Best regards,

 Igor.

From: Igor Sobrado <igor@string1.ciencias.uniovi.es>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/31003
Date: Wed, 17 Aug 2005 09:53:11 +0200

 I will try to attach a trace from the "db> " prompt in the next few hours
 if this information is useful for isolating the problem.

 Cheers,
 Igor.

From: Igor Sobrado <igor@string1.ciencias.uniovi.es>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/31003
Date: Wed, 17 Aug 2005 10:02:11 +0200

 Sorry for sending a lot of emails about this problem, but I have found
 a serious error in the problem report and I must fix it.  Where I write:

 "When the computer goes into suspend mode (Fn+Suspend) the next messages
 are registered in /var/log/messages:"

 I really meant to write:

 "When the computer RETURNS FROM suspend mode (Fn+Suspend)..."

 The problem happens when the computer returns from suspend mode,
 not when going into that mode.  Certainly it is not the same!
 I will attach a trace as soon as possible.

 Best wishes,
 Igor.

From: Igor Sobrado <igor@string1.ciencias.uniovi.es>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/31003
Date: Wed, 17 Aug 2005 20:21:31 +0200

 Ok, I have connected the PX-PH08U drive to the USB port on my laptop,
 mounted its FFS2 filesystem on /mnt, and requested a transition to
 suspend mode again.  As expected, after resuming from the suspend state
 the system dropped into the in-kernel debugger and the following error
 was logged in /var/log/messages:

   Aug 17 19:44:32 altair /netbsd: umass0: at uhub0 port 1 (addr 2) disconnected
   Aug 17 19:44:32 altair /netbsd: sd0(umass0:0:0:0): generic HBA error
   Aug 17 19:44:32 altair /netbsd: uvm_fault(0xc0601680, 0, 0, 1) -> 0xe
   Aug 17 19:44:32 altair last message repeated 3 times

 Now, I will try to play a bit with the in-kernel debugger:

   db> trace
   uvm_fault(0xc0601680, 0, 0, 1) -> 0xe
   kernel: page fault trap, code=0
   Faulted in DDB; continuing...

 Sorry, not a very helpful output.  All information was in /var/log/messages.
 The output of ps shows:

   db> ps
    PID         PPID   PGRP      UID S   FLAGS LWPS        COMMAND    WAIT
   [...]
   >4              0      0        0 2 0x20200    1        usb0

 Best wishes,
 Igor.

From: Carl Brewer <carl@bl.echidna.id.au>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/31003
Date: Mon, 21 Nov 2005 16:30:42 +1100

 Another data point on this: I have a similar setup, using ViPower
 USB/PATA housings with large (160GB) IDE/PATA disks as backup
 media for NetBSD LAN servers.

 I get this error and server lockups:

 Nov 19 03:13:30 mail /netbsd: umass0: BBB bulk-in clear stall failed, TIMEOUT
 Nov 19 03:14:35 mail /netbsd: umass0: BBB bulk-out clear stall failed, TIMEOUT
 Nov 19 03:16:45 mail /netbsd: umass0: BBB reset failed, TIMEOUT
 Nov 19 03:17:50 mail /netbsd: umass0: BBB bulk-in clear stall failed, TIMEOUT
 Nov 19 03:18:55 mail /netbsd: umass0: BBB bulk-out clear stall failed, TIMEOUT
 Nov 19 03:21:05 mail /netbsd: umass0: BBB reset failed, TIMEOUT
 Nov 19 03:22:10 mail /netbsd: umass0: BBB bulk-in clear stall failed, TIMEOUT
 Nov 19 03:23:15 mail /netbsd: umass0: BBB bulk-out clear stall failed, TIMEOUT
 Nov 19 03:25:25 mail /netbsd: umass0: BBB reset failed, TIMEOUT

 Then the machine hangs and requires a power cycle.

 The box is now :

 NetBSD mail.cashmoredesign.com 2.1 NetBSD 2.1 (GENERIC) #0: Mon Oct 24 22:35:45 UTC 2005 jmc@faith.netbsd.org:/home/builds/ab/netbsd-2-1-RELEASE/i386/200510241747Z-obj/home/builds/ab/netbsd-2-1-RELEASE/src/sys/arch/i386/compile/GENERIC i386

From: Igor Sobrado <igor@string1.ciencias.uniovi.es>
To: gnats-bugs@netbsd.org
Cc: Carl Brewer <carl@bl.echidna.id.au>
Subject: Re: kern/31003
Date: Thu, 29 Dec 2005 18:54:02 +0100

 Hi Carl.

 Thanks a lot for the feedback about your USB/PATA hard disk drive.
 Indeed, perhaps both problems are related.

 NetBSD 3 behaves differently and, perhaps, this will help in
 looking for a fix.  I have attached the Plextor portable hard disk drive
 to my laptop:

 Dec 29 18:27:46 localhost /netbsd: umass0 at uhub0 port 1 configuration 1 interface 0
 Dec 29 18:27:46 localhost /netbsd: 
 Dec 29 18:27:46 localhost /netbsd: umass0: Plextor S.A./N.V. PLEXTOR PX-PH, rev 2.00/3.02, addr 2
 Dec 29 18:27:46 localhost /netbsd: umass0: using SCSI over Bulk-Only
 Dec 29 18:27:46 localhost /netbsd: scsibus0 at umass0: 2 targets, 1 lun per target
 Dec 29 18:27:46 localhost /netbsd: sd0 at scsibus0 target 0 lun 0: <PLEXTOR, PX-PH, 3.02> disk fixed
 Dec 29 18:27:46 localhost /netbsd: sd0: 76351 MB, 155127 cyl, 16 head, 63 sec, 512 bytes/sect x 156368016 sectors

 ...and put the laptop into suspend mode (Fn+Suspend) again.
 This is the sequence of events logged once the computer returns
 from the suspend state:

 Dec 29 18:28:58 localhost /netbsd: atabus0: resuming...
 Dec 29 18:28:58 localhost /netbsd: atabus1: resuming...
 Dec 29 18:28:59 localhost /netbsd: pms0: pms_synaptics_resume: reset on resume 0 0xaa 0x00
 Dec 29 18:28:59 localhost /netbsd: cbb1: wait took 0.009072s
 Dec 29 18:28:59 localhost /netbsd: umass0: at uhub0 port 1 (addr 2) disconnected
 Dec 29 18:28:59 localhost /netbsd: sd0(umass0:0:0:0): generic HBA error
 Dec 29 18:28:59 localhost /netbsd: sd0: cache synchronization failed
 Dec 29 18:28:59 localhost /netbsd: sd0 detached
 Dec 29 18:28:59 localhost /netbsd: scsibus0 detached
 Dec 29 18:28:59 localhost /netbsd: umass0 detached
 Dec 29 18:28:58 localhost /netbsd: umass0 at uhub0 port 1 configuration 1 interface 0
 Dec 29 18:28:58 localhost /netbsd: 
 Dec 29 18:28:58 localhost /netbsd: umass0: Plextor S.A./N.V. PLEXTOR PX-PH, rev 2.00/3.02, addr 2
 Dec 29 18:28:58 localhost /netbsd: umass0: using SCSI over Bulk-Only
 Dec 29 18:28:58 localhost /netbsd: scsibus0 at umass0: 2 targets, 1 lun per target
 Dec 29 18:28:58 localhost /netbsd: sd0 at scsibus0 target 0 lun 0: <PLEXTOR, PX-PH, 3.02> disk fixed
 Dec 29 18:28:58 localhost /netbsd: sd0: 76351 MB, 155127 cyl, 16 head, 63 sec, 512 bytes/sect x 156368016 sectors
 Dec 29 18:30:27 localhost syslogd: restart
 Dec 29 18:30:27 localhost /netbsd: uvm_fault(0xcbd71708, 0, 0, 1) -> 0xe
 Dec 29 18:30:27 localhost /netbsd: syncing disks... done
 Dec 29 18:30:27 localhost /netbsd: unmounting file systems...unmount of /dev failed with error 16
 Dec 29 18:30:27 localhost /netbsd: done
 Dec 29 18:30:27 localhost /netbsd: WARNING: some file systems would not unmount
 Dec 29 18:30:27 localhost /netbsd: rebooting...

 Ok, this time at least I was able to reboot the computer, but the
 screen was turned off.  There is an *unrelated* problem with APM
 in NetBSD: sometimes the screen does not turn on again once the
 machine returns from a power management state.  Will this problem
 be fixed?  It is a bit annoying if someone does not type "xset s off"
 before starting a talk and the computer does not turn on again!!!
 (no... it did not happen to me... yet!).

 The key here is that the laptop can be rebooted and only the portable
 hard disk drive requires a consistency check now (in any case, I do not
 want to repeat this experiment many times!)

 Cheers,
 Igor.

From: Carl Brewer <carl@bl.echidna.id.au>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/31003
Date: Thu, 13 Apr 2006 16:58:00 +1000

 For what it's worth, I still see random total machine
 hangs with the same hardware (ViPower and 160GB IDE HDD;
 swapped hardware, but the same kind, just new disk drives and caddies
 etc.) on 3.0/i386.

 dmesg shows :

 umass0 at uhub4 port 3 configuration 1 interface 0
 umass0: ViPowER, Inc. ViPowER USB2.0 Storage Device, rev 2.00/0.01, addr 2
 umass0: using SCSI over Bulk-Only
 scsibus0 at umass0: 2 targets, 1 lun per target
 sd0 at scsibus0 target 0 lun 0: <ST316002, 1A, 0\0000> disk fixed
 sd0: 149 GB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors

 It shows a lot of this in dmesg :

 sd0: dos partition I/O error
 sd0(umass0:0:0:0):  Check Condition on CDB: 0x28 00 00 00 00 00 00 00 01 00
      SENSE KEY:  Aborted Command
       ASC/ASCQ:  Logical Unit Communication CRC Error

 sd0(umass0:0:0:0):  Check Condition on CDB: 0x28 00 00 00 00 00 00 00 01 00
      SENSE KEY:  Aborted Command
       ASC/ASCQ:  Logical Unit Communication CRC Error

 sd0(umass0:0:0:0):  Check Condition on CDB: 0x28 00 00 00 00 00 00 00 01 00
      SENSE KEY:  Aborted Command
       ASC/ASCQ:  Logical Unit Communication CRC Error

 sd0(umass0:0:0:0):  Check Condition on CDB: 0x28 00 00 00 00 00 00 00 01 00
      SENSE KEY:  Aborted Command
       ASC/ASCQ:  Logical Unit Communication CRC Error

 sd0(umass0:0:0:0):  Check Condition on CDB: 0x28 00 00 00 00 00 00 00 01 00
      SENSE KEY:  Aborted Command
       ASC/ASCQ:  Logical Unit Communication CRC Error

 Does the above suggest a hardware problem or a problem with
 the umass driver?


 Is anyone looking at this or doing anything similar?  I'm
 trying (but having no luck!) to use these removable HDDs
 for backup devices, by mounting them, dumping the FS, and
 umounting so the users at the site can take the HDD home as
 an offsite backup.

State-Changed-From-To: open->feedback
State-Changed-By: gutteridge@NetBSD.org
State-Changed-When: Thu, 21 Mar 2019 00:38:08 +0000
State-Changed-Why:
Is this still reproducible? There have been a lot of changes in the last
14 years. There was a related PR 31428 that I confirmed was fixed ten
years ago. I also just pushed 1TB of data to an external HD via umass
without issue.

State-Changed-From-To: feedback->closed
State-Changed-By: gutteridge@NetBSD.org
State-Changed-When: Sun, 28 Apr 2019 07:22:51 +0000
State-Changed-Why:
Feedback timeout. There have been improvements made since this was
filed, and I can't reproduce this with either an external USB hard
drive or a USB stick.

>Unformatted: