NetBSD Problem Report #39308

From simonb@thistledown.com.au  Thu Aug  7 08:21:08 2008
Return-Path: <simonb@thistledown.com.au>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id 64B9A63BB81
	for <gnats-bugs@gnats.NetBSD.org>; Thu,  7 Aug 2008 08:21:08 +0000 (UTC)
Message-Id: <20080807082105.D8698AFD04@thoreau.thistledown.com.au>
Date: Thu,  7 Aug 2008 18:21:05 +1000 (EST)
From: Simon Burge <simonb@NetBSD.org>
Reply-To: Simon Burge <simonb@NetBSD.org>
To: gnats-bugs@gnats.NetBSD.org
Subject: sort +1n failure
X-Send-Pr-Version: 3.95

>Number:         39308
>Category:       bin
>Synopsis:       sort +1n failure
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    apb
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Aug 07 08:25:00 +0000 2008
>Closed-Date:    Sat Aug 22 11:21:37 +0000 2009
>Last-Modified:  Wed Oct 14 20:45:17 +0000 2009
>Originator:     Simon Burge
>Release:        NetBSD 4.0, NetBSD -current as of 20080807
>Organization:
>Environment:
Architecture: any?  fails on i386 and amd64  not tried on big-endian.
>Description:
	"sort +1n" is failing to numerically sort the second field of a
	file.  Note that the input is important and very touchy.  The
        "x 4" line can have any number of "x"s before the 4, and seems
        like any number of these lines will work.  Shorten the number of
        "x"s at the start of most of the "xxxxxxxxxxxxx 2" lines seems
        to be enough to make it sort properly, as will changing an "x"
        after the "4" or deleting a line most of the time.

>How-To-Repeat:
	Uudecode the following file and run:

		zcat testfile.bz2 | sort +1n | sed -n -e 1p -e \$p

	and observe that 4 is less than 2 (!):

		x 4
		xxxxxxxxxxxxx 2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

	begin-base64 644 testfile.bz2
	QlpoOTFBWSZTWW1WSR4CIjXb/AAwQAAUACAAEf//////////////8ABAAnuUkSAc
	DTTTQaGhoZGgGQBoaA00ZAAAwmIDQ4Gmmmg0NDQyNAMgDQ0BpoyAABhMQGgmqqo0
	GhoNDQaAGhkaBiZBoGTQZDCDI0AApVSQGjQaZAyaADIAMg0AAAA0BkNH6py4pCi3
	m9wa32/uAmFU4KnCJwyOIVWTBN5UplVCy4I3oKb5Im/JFwCouDIhllIZhFLmSgua
	RJzqSjnwodCJBxVUo41KiccIpyJRJyVBTlKqJy6iJzCCmeVCshCK7ECV2YKVkypV
	dqUqrtlCdwqh3UQd4qDq1IHWKA0YFWlJVXXgVdipRdlILtUpGnEi7YUnclBd0Ene
	Kkd+iI8EIegqFoKJokq9Mk0pK9Ui9aj2RV7VK06GoR7pT3qXwRqi+EnxR8i+Y1C+
	lPqJ9lH3ovxFfmK1UjWin6Iv3UX8QtcU/qR/orAkw11EqKezBRKimFRKimMkhBYY
	ikgixKJUUxFIYyiVFMQhEif8XckU4UJBtVkkeA==
	====

	Note that the above file unzips to a ~7.5MB file.

>Fix:
	None given.  Some overflow somewhere given the ease with which it
	will and wont work?

>Release-Note:

>Audit-Trail:

Responsible-Changed-From-To: bin-bug-people->apb
Responsible-Changed-By: apb@NetBSD.org
Responsible-Changed-When: Thu, 07 Aug 2008 08:28:43 +0000
Responsible-Changed-Why:
take


From: David Holland <dholland-bugs@netbsd.org>
To: Simon Burge <simonb@NetBSD.org>, gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/39308: sort +1n failure
Date: Thu, 7 Aug 2008 08:42:24 +0000

 On Thu, Aug 07, 2008 at 08:25:00AM +0000, Simon Burge wrote:
  > >Description:
  > 	"sort +1n" is failing to numerically sort the second field of a
  > 	file.  Note that the input is important and very touchy.  The

 See also PRs 37860, 36816, 30504, 30195, 27257, 25551, 24756, 24316,
 22182, 18614...

 -- 
 David A. Holland
 dholland@netbsd.org

From: Simon Burge <simonb@NetBSD.org>
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Subject: Re: bin/39308: sort +1n failure 
Date: Thu, 07 Aug 2008 19:14:11 +1000

 David Holland wrote:

 > On Thu, Aug 07, 2008 at 08:25:00AM +0000, Simon Burge wrote:
 >  > >Description:
 >  > 	"sort +1n" is failing to numerically sort the second field of a
 >  > 	file.  Note that the input is important and very touchy.  The
 > 
 > See also PRs 37860, 36816, 30504, 30195, 27257, 25551, 24756, 24316,
 > 22182, 18614...

 I should have mentioned that I tried the patch in bin/30504 but it
 did have any effect on this problem.

 Cheers,
 Simon.

From: Simon Burge <simonb@NetBSD.org>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org
Cc: 
Subject: Re: bin/39308: sort +1n failure 
Date: Sat, 16 Aug 2008 18:13:43 +1000

 It looks like the simplified test case fails on amd64 but works on i386.
 This file fails on i386 -current:

 begin-base64 644 testfile2.bz2
 QlpoOTFBWSZTWdqnJF8EHmNf4AAwQAAQAAAAj7///////////EAB+6pKIOaZGQyY
 IaMJgjTRoxA0yZGAAIc0yMhkwQ0YTBGmjRiBpkyMAAQTVVUYEwTCYCDIxNMAIwED
 TQB6gUqqjEyYTBGBBkYJpgBGJk0aYmgb3IVEGJi42PXkZFk5VZasxPKZqzpnxm09
 JeqmcmfNAmimkjTg9dJ7AvTSeqibCqmyiPWlK9gRbUSrXhBsEE2SqjaoRfGVBtwB
 vRIrfKSXBUlThFFOKkicaCHJFVVcqoocSpQcZUg5IIHLSoHzikH0pQHNElXOSodh
 UR20qh3EUnxJFPlFQuuVSX3Qovwgi7CoL8wUnbUFO6opO8qI/VQh4KhX7qiH8glf
 0iD/VUTBUFfyhUQYKFRBgoVEHtUKSSYWChUQeOAMKhUQeKqUpVH/MUFZJlNZReo2
 MQAbBNq8ADBAABQAAAIAQAAH/////+AAMAEbKgw0Mhpk0AxDTTRoaME1VUYjTCDJ
 hBoGmmQApVU0ZGhoDQBoGg2o/VNBApaOFCcQpJpKqOOhRlKFmUlZ0DrijslR2kl3
 ELQoWmQu+RWoo1lLYVd8h4JXiU8qrzRsK9IesL2ovdJ8Qb5HzJXCRfVKcai3IWQp
 +pH8lYJMsiBSxggUsSBSwEITOQKWaQKWYIFL/F3JFOFCQ8KR+jg=
 ====

From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: bin/39308: sort +1n failure
Date: Sat, 15 Aug 2009 22:40:20 +0100

 This is a bug in the handling of large files, not numeric sort fields.

 The code seems to start sorting on subsequent bytes of the key as part
 of the actions when it saves chunks of sorted data to disk.
 Since the key record for the first line is {190,170,0,'x',' ','4'} it
 soon runs off that buffer - and into the later records.
 (It also advances past the key of the other lines, but the later bytes
 are always the same for those.)

 I'm not (yet) at all sure what the 'radix exchange' is trying to achieve,
 but I suspect it is at fault here.

 	David

 -- 
 David Laight: david@l8s.co.uk

From: David Laight <dsl@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/9308 CVS commit: src/usr.bin/sort
Date: Tue, 18 Aug 2009 18:00:28 +0000

 Module Name:	src
 Committed By:	dsl
 Date:		Tue Aug 18 18:00:28 UTC 2009

 Modified Files:
 	src/usr.bin/sort: append.c files.c fsort.c sort.c sort.h

 Log Message:
 The code that attempted to sort large files by sorting each chunk by the
 first key byte and writing to a temp file, then sorting the records from
 each temp file that had the same first key byte (and repeating for upto
 4 key bytes) was a nice idea, but completely doomed to failure.
 Eg PR/9308 where a 70MB file has all but one record the same and short keys.
 Not only does the code not work, it is rather guaranteed to be slow.
 Instead always use a merge sort for fully sorted chunk of records (each
 temporary file contains one lot of sorted records).
 The -H option already did this, so just rip out all the code and variables
 that can't be used when -H was specified.
 Further cleanup to come ...


 To generate a diff of this commit:
 cvs rdiff -u -r1.17 -r1.18 src/usr.bin/sort/append.c
 cvs rdiff -u -r1.33 -r1.34 src/usr.bin/sort/files.c
 cvs rdiff -u -r1.36 -r1.37 src/usr.bin/sort/fsort.c
 cvs rdiff -u -r1.49 -r1.50 src/usr.bin/sort/sort.c
 cvs rdiff -u -r1.22 -r1.23 src/usr.bin/sort/sort.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.


From: David Laight <dsl@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/39308 CVS commit: src/usr.bin/sort
Date: Sat, 22 Aug 2009 10:53:29 +0000

 Module Name:	src
 Committed By:	dsl
 Date:		Sat Aug 22 10:53:28 UTC 2009

 Modified Files:
 	src/usr.bin/sort: append.c fields.c files.c fsort.c init.c msort.c
 	    sort.c sort.h

 Log Message:
 Rework the way sort generates sort keys:
 - If we generate a key, it is always sortable using memcmp()
 - If we are sorting the whole record, then a weight-table must be used
   during compares.
 - Major surgery to encoding of numbers to ensure unique keys for equal
   numeric values.  Reverse numerics are handled by inverting the sign.
 - Case folding (-f) is handled when the sort keys are generated. No other
   code has to care at all.
 - Key uniqueness (-u) is done during merge for large datasets. It only
   has to be done when writing the output file for small files.
   Since the file is in key order this is simple!
 Probably fixes all of: PR/27257 PR/25551 PR/22182 PR/31095 PR/30504
 PR/36816 PR/37860 PR/39308
 Also PR/18614 should no longer die, but a little more work needs to be
 done on the merging for very large files.


 To generate a diff of this commit:
 cvs rdiff -u -r1.19 -r1.20 src/usr.bin/sort/append.c src/usr.bin/sort/init.c
 cvs rdiff -u -r1.24 -r1.25 src/usr.bin/sort/fields.c src/usr.bin/sort/sort.h
 cvs rdiff -u -r1.34 -r1.35 src/usr.bin/sort/files.c
 cvs rdiff -u -r1.38 -r1.39 src/usr.bin/sort/fsort.c
 cvs rdiff -u -r1.22 -r1.23 src/usr.bin/sort/msort.c
 cvs rdiff -u -r1.51 -r1.52 src/usr.bin/sort/sort.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: dsl@NetBSD.org
State-Changed-When: Sat, 22 Aug 2009 11:21:37 +0000
State-Changed-Why:
Fixed by an earlier commit that enforced the use of a merge sort for
the temporary files (ie enforcing the -H arg).


From: Stephen Borrill <sborrill@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/39308 CVS commit: [netbsd-5] src/usr.bin/sort
Date: Wed, 14 Oct 2009 20:41:53 +0000

 Module Name:	src
 Committed By:	sborrill
 Date:		Wed Oct 14 20:41:53 UTC 2009

 Modified Files:
 	src/usr.bin/sort [netbsd-5]: Makefile append.c fields.c files.c fsort.c
 	    fsort.h init.c msort.c sort.1 sort.c sort.h tmp.c
 Added Files:
 	src/usr.bin/sort [netbsd-5]: radix_sort.c

 Log Message:
 Pull up the following revisions(s) (requested by dsl in ticket #1084):
 	usr.bin/sort/Makefile:	revision 1.6-1.8
 	usr.bin/sort/append.c:	revision 1.15-1.22
 	usr.bin/sort/fields.c:	revision 1.20-1.30
 	usr.bin/sort/files.c:	revision 1.27-1.40
 	usr.bin/sort/fsort.c:	revision 1.33-1.45
 	usr.bin/sort/fsort.h:	revision 1.14-1.17
 	usr.bin/sort/init.c:	revision 1.19-1.23
 	usr.bin/sort/msort.c:	revision 1.19-1.28
 	usr.bin/sort/radix_sort.c:	revision 1.1-1.4
 	usr.bin/sort/sort.1:	revision 1.27-1.29
 	usr.bin/sort/sort.c:	revision 1.47-1.56
 	usr.bin/sort/sort.h:	revision 1.20-1.30
 	usr.bin/sort/tmp.c:	revision 1.14-1.15

 Only use radix sort for in-memory sort, always merge temporary files.
 Use a local radixsort() function so we can pass record length.
 Avoid use of weight tables for key compares.
 Fix generation of keys for numbers, negate value for reverse sort.
 Write file in reverse-key order for 'sort -n'.
 'sort -S' now does a posix sort (sort matching keys by record data).
 Ensure merge sort doesn't have too many temporary files open.
 Fixes: PR#18614 PR#27257 PR#25551 PR#22182 PR#31095 PR#30504 PR#36816
 PR#37860 PR#39308 PR#42094


 To generate a diff of this commit:
 cvs rdiff -u -r1.5 -r1.5.40.1 src/usr.bin/sort/Makefile
 cvs rdiff -u -r1.14 -r1.14.6.1 src/usr.bin/sort/append.c
 cvs rdiff -u -r1.19 -r1.19.6.1 src/usr.bin/sort/fields.c \
     src/usr.bin/sort/sort.h
 cvs rdiff -u -r1.26 -r1.26.6.1 src/usr.bin/sort/files.c \
     src/usr.bin/sort/sort.1
 cvs rdiff -u -r1.32 -r1.32.6.1 src/usr.bin/sort/fsort.c
 cvs rdiff -u -r1.13 -r1.13.6.1 src/usr.bin/sort/fsort.h \
     src/usr.bin/sort/tmp.c
 cvs rdiff -u -r1.18 -r1.18.6.1 src/usr.bin/sort/init.c \
     src/usr.bin/sort/msort.c
 cvs rdiff -u -r0 -r1.4.2.2 src/usr.bin/sort/radix_sort.c
 cvs rdiff -u -r1.46 -r1.46.4.1 src/usr.bin/sort/sort.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.