NetBSD Problem Report #39308
From simonb@thistledown.com.au Thu Aug 7 08:21:08 2008
Return-Path: <simonb@thistledown.com.au>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
by narn.NetBSD.org (Postfix) with ESMTP id 64B9A63BB81
for <gnats-bugs@gnats.NetBSD.org>; Thu, 7 Aug 2008 08:21:08 +0000 (UTC)
Message-Id: <20080807082105.D8698AFD04@thoreau.thistledown.com.au>
Date: Thu, 7 Aug 2008 18:21:05 +1000 (EST)
From: Simon Burge <simonb@NetBSD.org>
Reply-To: Simon Burge <simonb@NetBSD.org>
To: gnats-bugs@gnats.NetBSD.org
Subject: sort +1n failure
X-Send-Pr-Version: 3.95
>Number: 39308
>Category: bin
>Synopsis: sort +1n failure
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: apb
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Aug 07 08:25:00 +0000 2008
>Closed-Date: Sat Aug 22 11:21:37 +0000 2009
>Last-Modified: Wed Oct 14 20:45:17 +0000 2009
>Originator: Simon Burge
>Release: NetBSD 4.0, NetBSD -current as of 20080807
>Organization:
>Environment:
Architecture: any? fails on i386 and amd64 not tried on big-endian.
>Description:
"sort +1n" is failing to numerically sort the second field of a
file. Note that the input is important and very touchy. The
"x 4" line can have any number of "x"s before the 4, and seems
like any number of these lines will work. Shorten the number of
"x"s at the start of most of the "xxxxxxxxxxxxx 2" lines seems
to be enough to make it sort properly, as will changing an "x"
after the "4" or deleting a line most of the time.
>How-To-Repeat:
Uudecode the following file and run:
zcat testfile.bz2 | sort +1n | sed -n -e 1p -e \$p
and observe that 4 is less than 2 (!):
x 4
xxxxxxxxxxxxx 2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
begin-base64 644 testfile.bz2
QlpoOTFBWSZTWW1WSR4CIjXb/AAwQAAUACAAEf//////////////8ABAAnuUkSAc
DTTTQaGhoZGgGQBoaA00ZAAAwmIDQ4Gmmmg0NDQyNAMgDQ0BpoyAABhMQGgmqqo0
GhoNDQaAGhkaBiZBoGTQZDCDI0AApVSQGjQaZAyaADIAMg0AAAA0BkNH6py4pCi3
m9wa32/uAmFU4KnCJwyOIVWTBN5UplVCy4I3oKb5Im/JFwCouDIhllIZhFLmSgua
RJzqSjnwodCJBxVUo41KiccIpyJRJyVBTlKqJy6iJzCCmeVCshCK7ECV2YKVkypV
dqUqrtlCdwqh3UQd4qDq1IHWKA0YFWlJVXXgVdipRdlILtUpGnEi7YUnclBd0Ene
Kkd+iI8EIegqFoKJokq9Mk0pK9Ui9aj2RV7VK06GoR7pT3qXwRqi+EnxR8i+Y1C+
lPqJ9lH3ovxFfmK1UjWin6Iv3UX8QtcU/qR/orAkw11EqKezBRKimFRKimMkhBYY
ikgixKJUUxFIYyiVFMQhEif8XckU4UJBtVkkeA==
====
Note that the above file unzips to a ~7.5MB file.
>Fix:
None given. Some overflow somewhere given the ease with which it
will and wont work?
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: bin-bug-people->apb
Responsible-Changed-By: apb@NetBSD.org
Responsible-Changed-When: Thu, 07 Aug 2008 08:28:43 +0000
Responsible-Changed-Why:
take
From: David Holland <dholland-bugs@netbsd.org>
To: Simon Burge <simonb@NetBSD.org>, gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/39308: sort +1n failure
Date: Thu, 7 Aug 2008 08:42:24 +0000
On Thu, Aug 07, 2008 at 08:25:00AM +0000, Simon Burge wrote:
> >Description:
> "sort +1n" is failing to numerically sort the second field of a
> file. Note that the input is important and very touchy. The
See also PRs 37860, 36816, 30504, 30195, 27257, 25551, 24756, 24316,
22182, 18614...
--
David A. Holland
dholland@netbsd.org
From: Simon Burge <simonb@NetBSD.org>
To: David Holland <dholland-bugs@netbsd.org>
Cc: gnats-bugs@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: bin/39308: sort +1n failure
Date: Thu, 07 Aug 2008 19:14:11 +1000
David Holland wrote:
> On Thu, Aug 07, 2008 at 08:25:00AM +0000, Simon Burge wrote:
> > >Description:
> > "sort +1n" is failing to numerically sort the second field of a
> > file. Note that the input is important and very touchy. The
>
> See also PRs 37860, 36816, 30504, 30195, 27257, 25551, 24756, 24316,
> 22182, 18614...
I should have mentioned that I tried the patch in bin/30504 but it
did have any effect on this problem.
Cheers,
Simon.
From: Simon Burge <simonb@NetBSD.org>
To: gnats-bugs@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Cc:
Subject: Re: bin/39308: sort +1n failure
Date: Sat, 16 Aug 2008 18:13:43 +1000
It looks like the simplified test case fails on amd64 but works on i386.
This file fails on i386 -current:
begin-base64 644 testfile2.bz2
QlpoOTFBWSZTWdqnJF8EHmNf4AAwQAAQAAAAj7///////////EAB+6pKIOaZGQyY
IaMJgjTRoxA0yZGAAIc0yMhkwQ0YTBGmjRiBpkyMAAQTVVUYEwTCYCDIxNMAIwED
TQB6gUqqjEyYTBGBBkYJpgBGJk0aYmgb3IVEGJi42PXkZFk5VZasxPKZqzpnxm09
JeqmcmfNAmimkjTg9dJ7AvTSeqibCqmyiPWlK9gRbUSrXhBsEE2SqjaoRfGVBtwB
vRIrfKSXBUlThFFOKkicaCHJFVVcqoocSpQcZUg5IIHLSoHzikH0pQHNElXOSodh
UR20qh3EUnxJFPlFQuuVSX3Qovwgi7CoL8wUnbUFO6opO8qI/VQh4KhX7qiH8glf
0iD/VUTBUFfyhUQYKFRBgoVEHtUKSSYWChUQeOAMKhUQeKqUpVH/MUFZJlNZReo2
MQAbBNq8ADBAABQAAAIAQAAH/////+AAMAEbKgw0Mhpk0AxDTTRoaME1VUYjTCDJ
hBoGmmQApVU0ZGhoDQBoGg2o/VNBApaOFCcQpJpKqOOhRlKFmUlZ0DrijslR2kl3
ELQoWmQu+RWoo1lLYVd8h4JXiU8qrzRsK9IesL2ovdJ8Qb5HzJXCRfVKcai3IWQp
+pH8lYJMsiBSxggUsSBSwEITOQKWaQKWYIFL/F3JFOFCQ8KR+jg=
====
From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: bin/39308: sort +1n failure
Date: Sat, 15 Aug 2009 22:40:20 +0100
This is a bug in the handling of large files, not numeric sort fields.
The code seems to start sorting on subsequent bytes of the key as part
of the actions when it saves chunks of sorted data to disk.
Since the key record for the first line is {190,170,0,'x',' ','4'} it
soon runs off that buffer - and into the later records.
(It also advances past the key of the other lines, but the later bytes
are always the same for those.)
I'm not (yet) at all sure what the 'radix exchange' is trying to achieve,
but I suspect it is at fault here.
David
--
David Laight: david@l8s.co.uk
From: David Laight <dsl@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/9308 CVS commit: src/usr.bin/sort
Date: Tue, 18 Aug 2009 18:00:28 +0000
Module Name: src
Committed By: dsl
Date: Tue Aug 18 18:00:28 UTC 2009
Modified Files:
src/usr.bin/sort: append.c files.c fsort.c sort.c sort.h
Log Message:
The code that attempted to sort large files by sorting each chunk by the
first key byte and writing to a temp file, then sorting the records from
each temp file that had the same first key byte (and repeating for upto
4 key bytes) was a nice idea, but completely doomed to failure.
Eg PR/9308 where a 70MB file has all but one record the same and short keys.
Not only does the code not work, it is rather guaranteed to be slow.
Instead always use a merge sort for fully sorted chunk of records (each
temporary file contains one lot of sorted records).
The -H option already did this, so just rip out all the code and variables
that can't be used when -H was specified.
Further cleanup to come ...
To generate a diff of this commit:
cvs rdiff -u -r1.17 -r1.18 src/usr.bin/sort/append.c
cvs rdiff -u -r1.33 -r1.34 src/usr.bin/sort/files.c
cvs rdiff -u -r1.36 -r1.37 src/usr.bin/sort/fsort.c
cvs rdiff -u -r1.49 -r1.50 src/usr.bin/sort/sort.c
cvs rdiff -u -r1.22 -r1.23 src/usr.bin/sort/sort.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: David Laight <dsl@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/39308 CVS commit: src/usr.bin/sort
Date: Sat, 22 Aug 2009 10:53:29 +0000
Module Name: src
Committed By: dsl
Date: Sat Aug 22 10:53:28 UTC 2009
Modified Files:
src/usr.bin/sort: append.c fields.c files.c fsort.c init.c msort.c
sort.c sort.h
Log Message:
Rework the way sort generates sort keys:
- If we generate a key, it is always sortable using memcmp()
- If we are sorting the whole record, then a weight-table must be used
during compares.
- Major surgery to encoding of numbers to ensure unique keys for equal
numeric values. Reverse numerics are handled by inverting the sign.
- Case folding (-f) is handled when the sort keys are generated. No other
code has to care at all.
- Key uniqueness (-u) is done during merge for large datasets. It only
has to be done when writing the output file for small files.
Since the file is in key order this is simple!
Probably fixes all of: PR/27257 PR/25551 PR/22182 PR/31095 PR/30504
PR/36816 PR/37860 PR/39308
Also PR/18614 should no longer die, but a little more work needs to be
done on the merging for very large files.
To generate a diff of this commit:
cvs rdiff -u -r1.19 -r1.20 src/usr.bin/sort/append.c src/usr.bin/sort/init.c
cvs rdiff -u -r1.24 -r1.25 src/usr.bin/sort/fields.c src/usr.bin/sort/sort.h
cvs rdiff -u -r1.34 -r1.35 src/usr.bin/sort/files.c
cvs rdiff -u -r1.38 -r1.39 src/usr.bin/sort/fsort.c
cvs rdiff -u -r1.22 -r1.23 src/usr.bin/sort/msort.c
cvs rdiff -u -r1.51 -r1.52 src/usr.bin/sort/sort.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: dsl@NetBSD.org
State-Changed-When: Sat, 22 Aug 2009 11:21:37 +0000
State-Changed-Why:
Fixed by an earlier commit that enforced the use of a merge sort for
the temporary files (ie enforcing the -H arg).
From: Stephen Borrill <sborrill@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/39308 CVS commit: [netbsd-5] src/usr.bin/sort
Date: Wed, 14 Oct 2009 20:41:53 +0000
Module Name: src
Committed By: sborrill
Date: Wed Oct 14 20:41:53 UTC 2009
Modified Files:
src/usr.bin/sort [netbsd-5]: Makefile append.c fields.c files.c fsort.c
fsort.h init.c msort.c sort.1 sort.c sort.h tmp.c
Added Files:
src/usr.bin/sort [netbsd-5]: radix_sort.c
Log Message:
Pull up the following revisions(s) (requested by dsl in ticket #1084):
usr.bin/sort/Makefile: revision 1.6-1.8
usr.bin/sort/append.c: revision 1.15-1.22
usr.bin/sort/fields.c: revision 1.20-1.30
usr.bin/sort/files.c: revision 1.27-1.40
usr.bin/sort/fsort.c: revision 1.33-1.45
usr.bin/sort/fsort.h: revision 1.14-1.17
usr.bin/sort/init.c: revision 1.19-1.23
usr.bin/sort/msort.c: revision 1.19-1.28
usr.bin/sort/radix_sort.c: revision 1.1-1.4
usr.bin/sort/sort.1: revision 1.27-1.29
usr.bin/sort/sort.c: revision 1.47-1.56
usr.bin/sort/sort.h: revision 1.20-1.30
usr.bin/sort/tmp.c: revision 1.14-1.15
Only use radix sort for in-memory sort, always merge temporary files.
Use a local radixsort() function so we can pass record length.
Avoid use of weight tables for key compares.
Fix generation of keys for numbers, negate value for reverse sort.
Write file in reverse-key order for 'sort -n'.
'sort -S' now does a posix sort (sort matching keys by record data).
Ensure merge sort doesn't have too many temporary files open.
Fixes: PR#18614 PR#27257 PR#25551 PR#22182 PR#31095 PR#30504 PR#36816
PR#37860 PR#39308 PR#42094
To generate a diff of this commit:
cvs rdiff -u -r1.5 -r1.5.40.1 src/usr.bin/sort/Makefile
cvs rdiff -u -r1.14 -r1.14.6.1 src/usr.bin/sort/append.c
cvs rdiff -u -r1.19 -r1.19.6.1 src/usr.bin/sort/fields.c \
src/usr.bin/sort/sort.h
cvs rdiff -u -r1.26 -r1.26.6.1 src/usr.bin/sort/files.c \
src/usr.bin/sort/sort.1
cvs rdiff -u -r1.32 -r1.32.6.1 src/usr.bin/sort/fsort.c
cvs rdiff -u -r1.13 -r1.13.6.1 src/usr.bin/sort/fsort.h \
src/usr.bin/sort/tmp.c
cvs rdiff -u -r1.18 -r1.18.6.1 src/usr.bin/sort/init.c \
src/usr.bin/sort/msort.c
cvs rdiff -u -r0 -r1.4.2.2 src/usr.bin/sort/radix_sort.c
cvs rdiff -u -r1.46 -r1.46.4.1 src/usr.bin/sort/sort.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
>Unformatted:
(Contact us)
$NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.