NetBSD Problem Report #46255
From njoly@lanfeust.sis.pasteur.fr Sat Mar 24 22:57:15 2012
Return-Path: <njoly@lanfeust.sis.pasteur.fr>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
by www.NetBSD.org (Postfix) with ESMTP id EBE8B63B946
for <gnats-bugs@gnats.NetBSD.org>; Sat, 24 Mar 2012 22:57:14 +0000 (UTC)
Message-Id: <20120324225717.4FB7FDC9BD@lanfeust.sis.pasteur.fr>
Date: Sat, 24 Mar 2012 23:57:17 +0100 (CET)
From: njoly@pasteur.fr
Reply-To: njoly@pasteur.fr
To: gnats-bugs@gnats.NetBSD.org
Subject: apropos(1) sometimes report unrelated responses
X-Send-Pr-Version: 3.95
>Number: 46255
>Category: bin
>Synopsis: apropos(1) sometimes report unrelated results
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: abhinav
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Mar 24 23:00:00 +0000 2012
>Closed-Date: Sun Jun 25 15:22:02 +0000 2017
>Last-Modified: Sun Jun 25 15:22:02 +0000 2017
>Originator: Nicolas Joly
>Release: NetBSD 6.99.4
>Organization:
Institut Pasteur
>Environment:
System: NetBSD lanfeust.sis.pasteur.fr 6.99.4 NetBSD 6.99.4 (LANFEUST) #5: Sat Mar 24 14:34:56 CET 2012 njoly@lanfeust.sis.pasteur.fr:/local/src/NetBSD/obj.amd64/sys/arch/amd64/compile/LANFEUST amd64
Architecture: x86_64
Machine: amd64
>Description:
Sometimes apropos(1) returns unrelated results. For example, the `apropos lfs'
command reports correct entries that include the searched word, but also some
unrelated ones that match the word "LF".
newfs_lfs(8) construct a new LFS file system
rump_lfs(8) mount a lfs image with a userspace server
scan_ffs(8) find FFSv1/FFSv2/LFS partitions on a disk or file
lfs_segclean(2) mark a segment clean
mvme68k/lpt(4) parallel port driver
lfs_segwait(2) wait until a segment is written
x86/lpt(4) Parallel port driver
installboot(8) install disk bootstrap software
PCRE(3) - Perl-compatible regular expressions
PCRE(3) - Perl-compatible regular expressions
Of the 10 results reported, 6 are correct and 4 are wrong (2 lpt and 2 PCRE).
Things are worse for `apropos crs', which only reports pages containing the
word "cr"; not a single "crs" result is found.
njoly@lanfeust [~]> apropos -n 1000 crs | head
mvme68k/lpt(4) parallel port driver
...the driver. Minor Bit Function 128 Use the interruptless driver. (polling) 64 Do not initialize the device on the port. 32 Automatic LF on CR. 16 Select 1.6uS strobe pulse width (default is 6.4uS) pcc(4) , pcctwo(4)
[...]
njoly@lanfeust [~]> apropos -n 1000 crs | grep -ic crs
0
>How-To-Repeat:
Check results from `apropos lfs' or `apropos crs' commands.
>Fix:
>Release-Note:
>Audit-Trail:
From: Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/46255: apropos(1) sometimes report unrelated responses
Date: Sun, 25 Mar 2012 13:57:28 +0900
This is because of the stemmer. It strips the suffix 's' from the end of
every token in an attempt to reduce the tokens to their root word. That is
of course wrong for technical terms such as lfs, or for abbreviations. I
think the fix would require writing a custom tokenizer for the FTS engine
of SQLite, one that does not try to stem such technical keywords, but it
would be a bit of an undertaking :)
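A rough illustration of the effect described above, not the actual SQLite code: the sketch below implements only the plural-stripping step of the Porter algorithm (step 1a). The function name `strip_plural` is made up for this example. It shows why a query for `lfs' ends up matching pages that contain "LF", and why `crs' matches "CR".

```python
def strip_plural(token: str) -> str:
    """Porter step 1a only: SSES -> SS, IES -> I, SS -> SS, S -> ''."""
    word = token.lower()
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


# Both the query term and the words in the man pages go through the
# same reduction, so "lfs" and "LF" collapse to the same index key.
for term in ("lfs", "LF", "crs", "CR"):
    print(term, "->", strip_plural(term))
```

With this reduction, `lfs' and "LF" both become "lf", and `crs' and "CR" both become "cr", which is consistent with the lpt(4) hits in the report ("Automatic LF on CR").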
On the other hand, since the new apropos(1) supports full text search,
I think more detailed queries would get better mileage out of it. It is
hard to get 100% relevant results, but I hope to improve it.
--
Abhinav
Responsible-Changed-From-To: bin-bug-people->abhinav
Responsible-Changed-By: abhinav@NetBSD.org
Responsible-Changed-When: Thu, 15 Jun 2017 16:40:14 +0000
Responsible-Changed-Why:
I am working on this
From: "Abhinav Upadhyay" <abhinav@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/46255 CVS commit: src/usr.sbin/makemandb
Date: Sun, 18 Jun 2017 16:24:10 +0000
Module Name: src
Committed By: abhinav
Date: Sun Jun 18 16:24:10 UTC 2017
Modified Files:
src/usr.sbin/makemandb: Makefile apropos-utils.c apropos-utils.h
Added Files:
src/usr.sbin/makemandb: custom_apropos_tokenizer.c
custom_apropos_tokenizer.h fts3_tokenizer.h nostem.txt
Log Message:
Add a custom tokenizer which does not stem certain keywords.
Which keywords should not be stemmed is specified in the nostem.txt file.
(Right now I have taken all the man page names, split them if they had
underscores, removed common English words and converted everything to
lowercase.)
The tokenizer itself is based on the Porter stemming tokenizer shipped with
SQLite. The code in custom_apropos_tokenizer.c is a copy of that code with
some modifications to prevent stemming of the keywords specified in nostem.txt.
Additionally, it now also treats underscore `_' as a token delimiter. It is
therefore now possible to query for `lwp' and match all `_lwp_*' man page
names, or to query for `unconst' and match `__UNCONST'. This was not possible
earlier: because underscore was not a delimiter, the index would have
__UNCONST as a key rather than UNCONST.
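The two behaviours described above (a no-stem list and underscore as a delimiter) can be sketched roughly as follows. This is a simplified model, not the code in custom_apropos_tokenizer.c: the `NOSTEM` set is a hypothetical excerpt standing in for nostem.txt, `stem` is a stand-in that only strips a trailing "s" instead of running the full Porter algorithm, and treating every non-alphanumeric character as a delimiter is an assumption of this sketch.

```python
import re

# Hypothetical excerpt; the real keyword list lives in nostem.txt.
NOSTEM = {"lfs", "crs", "lwp", "unconst"}


def stem(word: str) -> str:
    # Stand-in for the Porter stemmer: only strips a trailing plural "s".
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text: str) -> list[str]:
    tokens = []
    # Any run of non-alphanumeric characters, underscore included,
    # separates tokens.
    for raw in re.split(r"[^0-9A-Za-z]+", text):
        if not raw:
            continue
        word = raw.lower()
        # Keywords on the no-stem list are indexed unchanged.
        tokens.append(word if word in NOSTEM else stem(word))
    return tokens


print(tokenize("_lwp_create"))     # ['lwp', 'create']
print(tokenize("__UNCONST"))       # ['unconst']
print(tokenize("LFS partitions"))  # ['lfs', 'partition']
```

Under these assumptions, a query for `lwp' finds the `lwp' token produced from `_lwp_create', and `lfs' is no longer reduced to `lf'.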
The tokenizer needs the fts3_tokenizer.h file, which is not shipped with the
amalgamation build of SQLite, so it needs to be added here (unless
we decide there is a better place for it).
To enforce use of the new tokenizer, a schema version bump is needed.
Since tokenization is done both at indexing time (via makemandb) and
at query time (via apropos or whatis), the schema version will need to be
bumped every time nostem.txt is modified. Otherwise the
index will consist of old tokens and the desired changes will not be seen with
apropos.
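To make the reasoning behind the version bump concrete, here is a toy model of a stale index. It is only a sketch: `old_token` and `new_token` are invented names, the stemmer is reduced to stripping a trailing "s", and `NOSTEM` holds a single hypothetical nostem.txt entry. An index built with the old rules stores "lf", while the new query-time tokenizer asks for "lfs", so the lookup misses until the index is rebuilt.

```python
NOSTEM = {"lfs"}  # hypothetical nostem.txt entry


def strip_s(word: str) -> str:
    # Stand-in for the Porter stemmer: only strips a trailing plural "s".
    return word[:-1] if word.endswith("s") and not word.endswith("ss") else word


def old_token(word: str) -> str:
    # Old behaviour: every token is stemmed.
    return strip_s(word.lower())


def new_token(word: str) -> str:
    # New behaviour: keywords on the no-stem list are left alone.
    lowered = word.lower()
    return lowered if lowered in NOSTEM else strip_s(lowered)


page_words = ["LFS", "partitions"]

stale_index = {old_token(w) for w in page_words}  # built before the change
fresh_index = {new_token(w) for w in page_words}  # rebuilt after the bump

query = new_token("lfs")  # apropos now tokenizes the query with the new rules

stale_hit = query in stale_index
fresh_hit = query in fresh_index
print(stale_hit, fresh_hit)  # False True
```

The same mismatch would occur after any later edit to the keyword list, which is why the index has to be regenerated each time.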
This should also fix the issue reported in PR bin/46255. A similar suggestion
was also made on tech-userlevel@ recently:
<http://mail-index.netbsd.org/tech-userlevel/2017/06/08/msg010620.html>
Thanks to christos@ for multiple rounds of reviews of the tokenizer code.
To generate a diff of this commit:
cvs rdiff -u -r1.8 -r1.9 src/usr.sbin/makemandb/Makefile
cvs rdiff -u -r1.37 -r1.38 src/usr.sbin/makemandb/apropos-utils.c
cvs rdiff -u -r1.12 -r1.13 src/usr.sbin/makemandb/apropos-utils.h
cvs rdiff -u -r0 -r1.1 src/usr.sbin/makemandb/custom_apropos_tokenizer.c \
src/usr.sbin/makemandb/custom_apropos_tokenizer.h \
src/usr.sbin/makemandb/fts3_tokenizer.h src/usr.sbin/makemandb/nostem.txt
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: abhinav@NetBSD.org
State-Changed-When: Sun, 25 Jun 2017 15:22:02 +0000
State-Changed-Why:
Fixed with the custom tokenizer implementation.
>Unformatted:
Copyright © 1994-2014
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.