NetBSD Problem Report #46255

From njoly@lanfeust.sis.pasteur.fr  Sat Mar 24 22:57:15 2012
Return-Path: <njoly@lanfeust.sis.pasteur.fr>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id EBE8B63B946
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 24 Mar 2012 22:57:14 +0000 (UTC)
Message-Id: <20120324225717.4FB7FDC9BD@lanfeust.sis.pasteur.fr>
Date: Sat, 24 Mar 2012 23:57:17 +0100 (CET)
From: njoly@pasteur.fr
Reply-To: njoly@pasteur.fr
To: gnats-bugs@gnats.NetBSD.org
Subject: apropos(1) sometimes report unrelated responses
X-Send-Pr-Version: 3.95

>Number:         46255
>Category:       bin
>Synopsis:       apropos(1) sometimes reports unrelated results
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    abhinav
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 24 23:00:00 +0000 2012
>Closed-Date:    Sun Jun 25 15:22:02 +0000 2017
>Last-Modified:  Sun Jun 25 15:22:02 +0000 2017
>Originator:     Nicolas Joly
>Release:        NetBSD 6.99.4
>Organization:
Institut Pasteur
>Environment:
System: NetBSD lanfeust.sis.pasteur.fr 6.99.4 NetBSD 6.99.4 (LANFEUST) #5: Sat Mar 24 14:34:56 CET 2012 njoly@lanfeust.sis.pasteur.fr:/local/src/NetBSD/obj.amd64/sys/arch/amd64/compile/LANFEUST amd64
Architecture: x86_64
Machine: amd64
>Description:
Sometimes apropos(1) returns unrelated results. For example, the `apropos lfs'
command reports correct entries that include the searched word, but also some
unrelated ones that match the word "LF".

newfs_lfs(8)    construct a new LFS file system
rump_lfs(8)     mount a lfs image with a userspace server
scan_ffs(8)     find FFSv1/FFSv2/LFS partitions on a disk or file
lfs_segclean(2) mark a segment clean
mvme68k/lpt(4)  parallel port driver
lfs_segwait(2)  wait until a segment is written
x86/lpt(4)      Parallel port driver
installboot(8)  install disk bootstrap software
PCRE(3) - Perl-compatible regular expressions
PCRE(3) - Perl-compatible regular expressions

Of the 10 results reported, 6 are correct and 4 are wrong (2 lpt and 2 PCRE).

Things are worse for `apropos crs', which only reports pages containing the
word "cr"; not a single "crs" result is found.

njoly@lanfeust [~]> apropos -n 1000 crs | head        
mvme68k/lpt(4)  parallel port driver
...the driver. Minor Bit Function 128 Use the interruptless driver. (polling) 64 Do not initialize the device on the port. 32 Automatic LF on CR. 16 Select 1.6uS strobe pulse width (default is 6.4uS) pcc(4) , pcctwo(4)
[...]
njoly@lanfeust [~]> apropos -n 1000 crs | grep -ic crs
0

>How-To-Repeat:
Check results from `apropos lfs' or `apropos crs' commands.
>Fix:

>Release-Note:

>Audit-Trail:
From: Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
To: gnats-bugs@netbsd.org
Cc: gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: bin/46255: apropos(1) sometimes report unrelated responses
Date: Sun, 25 Mar 2012 13:57:28 +0900

 On Sun, Mar 25, 2012 at 8:00 AM,  <njoly@pasteur.fr> wrote:
 >>Number:         46255
 >>Category:       bin
 >>Synopsis:       apropos(1) sometimes report unrelated results
 >>Confidential:   no
 >>Severity:       non-critical
 >>Priority:       medium
 >>Responsible:    bin-bug-people
 >>State:          open
 >>Class:          sw-bug
 >>Submitter-Id:   net
 >>Arrival-Date:   Sat Mar 24 23:00:00 +0000 2012
 >>Originator:     Nicolas Joly
 >>Release:        NetBSD 6.99.4
 >>Organization:
 > Institut Pasteur
 >>Environment:
 > System: NetBSD lanfeust.sis.pasteur.fr 6.99.4 NetBSD 6.99.4 (LANFEUST) #5: Sat Mar 24 14:34:56 CET 2012 njoly@lanfeust.sis.pasteur.fr:/local/src/NetBSD/obj.amd64/sys/arch/amd64/compile/LANFEUST amd64
 > Architecture: x86_64
 > Machine: amd64
 >>Description:
 > Sometimes, apropos(1) return un-related results. By example, the `apropos lfs'
 > command report correct entries that include the searched word but some
 > un-related ones for the LF word .
 >
 > newfs_lfs(8)    construct a new LFS file system
 > rump_lfs(8)     mount a lfs image with a userspace server
 > scan_ffs(8)     find FFSv1/FFSv2/LFS partitions on a disk or file
 > lfs_segclean(2) mark a segment clean
 > mvme68k/lpt(4)  parallel port driver
 > lfs_segwait(2)  wait until a segment is written
 > x86/lpt(4)      Parallel port driver
 > installboot(8)  install disk bootstrap software
 > PCRE(3) - Perl-compatible regular expressions
 > PCRE(3) - Perl-compatible regular expressions
 >
 > For the 10 results reported, 6 are correct and 4 are wrong (2 lpt and 2 PCRE).
 >
 > Things are worse for `apropos crs' which only report pages with "cr" word,
 > not even a single "crs" result is found.
 >
 > njoly@lanfeust [~]> apropos -n 1000 crs | head
 > mvme68k/lpt(4)  parallel port driver
 > ...the driver. Minor Bit Function 128 Use the interruptless driver. (polling) 64 Do not initialize the device on the port. 32 Automatic LF on CR. 16 Select 1.6uS strobe pulse width (default is 6.4uS) pcc(4) , pcctwo(4)
 > [...]
 > njoly@lanfeust [~]> apropos -n 1000 crs | grep -ic crs

 This is because of the stemmer. The stemmer strips the suffix 's'
 from the end of every token in an attempt to reduce tokens
 to their root word. This of course isn't right for technical terms
 or abbreviations such as lfs. I think the fix would require
 writing a custom tokenizer for the FTS engine of SQLite that does
 not stem such technical keywords, but it would be a bit of
 an undertaking :)
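
A minimal sketch of the effect described above, assuming only the plural-suffix rules of a Porter-style stemmer (the real tokenizer applies further steps). The function name `strip_plural` is hypothetical and is not part of apropos(1) or SQLite.

```python
def strip_plural(token: str) -> str:
    """Apply only the plural-suffix rules of a Porter-style stemmer.

    This is an illustrative simplification, not the SQLite implementation.
    """
    word = token.lower()
    if word.endswith("sses"):
        return word[:-2]   # e.g. "classes" -> "class"
    if word.endswith("ies"):
        return word[:-2]   # e.g. "entries" -> "entri"
    if word.endswith("ss"):
        return word        # a trailing "ss" is left alone
    if word.endswith("s"):
        return word[:-1]   # a single trailing "s" is stripped
    return word


# The query terms from the report collapse onto unrelated index keys.
for query in ("lfs", "crs", "partitions"):
    print(query, "->", strip_plural(query))
```

Under these assumptions the query `lfs` is reduced to `lf` and `crs` to `cr`, so page text such as "Automatic LF on CR" in lpt(4) becomes a match.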

 On the other hand, since the new apropos(1) supports full text search,
 I think more detailed queries would get better mileage out of it.
 It is hard to get 100% relevant results, but I hope to improve it.

 --
 Abhinav

Responsible-Changed-From-To: bin-bug-people->abhinav
Responsible-Changed-By: abhinav@NetBSD.org
Responsible-Changed-When: Thu, 15 Jun 2017 16:40:14 +0000
Responsible-Changed-Why:
I am working on this


From: "Abhinav Upadhyay" <abhinav@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/46255 CVS commit: src/usr.sbin/makemandb
Date: Sun, 18 Jun 2017 16:24:10 +0000

 Module Name:	src
 Committed By:	abhinav
 Date:		Sun Jun 18 16:24:10 UTC 2017

 Modified Files:
 	src/usr.sbin/makemandb: Makefile apropos-utils.c apropos-utils.h
 Added Files:
 	src/usr.sbin/makemandb: custom_apropos_tokenizer.c
 	    custom_apropos_tokenizer.h fts3_tokenizer.h nostem.txt

 Log Message:
 Add a custom tokenizer which does not stem certain keywords.

 Which keywords should not be stemmed is specified in the nostem.txt file.
 (Right now I have taken all the man page names, split them if they had
 underscores, removed common English words and converted everything to
 lowercase.)

 The tokenizer itself is based on the Porter stemming tokenizer shipped with
 SQLite. The code in custom_apropos_tokenizer.c is a copy of that code with
 some modifications to prevent stemming keywords specified in nostem.txt.

 Additionally, it now also uses underscore `_' as a token delimiter. Therefore,
 it is now possible to query for `lwp' and all `_lwp_*' man page names
 will be matched. Or the query can be `unconst' and `__UNCONST' will be matched.
 This was not possible earlier because underscore was not a delimiter, so
 the index would have __UNCONST as a key rather than UNCONST.
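
The behaviour described in this log message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in custom_apropos_tokenizer.c: the `NOSTEM` set is a hypothetical stand-in for nostem.txt, the names `tokenize` and `strip_plural` are invented here, and the stemmer is reduced to its plural-suffix rules.

```python
import re

# Hypothetical stand-in for the keyword list kept in nostem.txt.
NOSTEM = {"lfs", "newfs"}


def strip_plural(word: str) -> str:
    """Simplified stemmer: only the plural-suffix rules (an assumption)."""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def tokenize(text: str, nostem=NOSTEM) -> list:
    """Split on every non-alphanumeric character, underscore included,
    lowercase each token, and stem it unless it is a protected keyword."""
    tokens = []
    for raw in re.split(r"[^A-Za-z0-9]+", text):
        if not raw:
            continue  # skip empty pieces produced by leading delimiters
        word = raw.lower()
        tokens.append(word if word in nostem else strip_plural(word))
    return tokens


print(tokenize("newfs_lfs"))           # protected keywords survive intact
print(tokenize("__UNCONST"))           # underscores no longer end up in the key
print(tokenize("Automatic LF on CR"))  # "lf" no longer collides with "lfs"
```

Because the same function would run at indexing time and at query time, a query for `lfs` stays `lfs` and only matches index keys that were also kept as `lfs`.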

 The tokenizer needs the fts3_tokenizer.h file, which is not shipped with the
 amalgamation build of SQLite, therefore it needs to be added here (unless
 we decide there is a better place for it).

 To enforce using the new tokenizer, a schema version bump is needed.

 Since tokenization is done both at indexing time (via makemandb) and
 at query time (via apropos or whatis), the schema version will need to be
 bumped every time nostem.txt is modified. Otherwise the
 index will consist of old tokens and the desired changes will not be seen with
 apropos.

 This should also fix the issue reported in PR bin/46255. A similar suggestion
 was also made on tech-userlevel@ recently:
 <http://mail-index.netbsd.org/tech-userlevel/2017/06/08/msg010620.html>

 Thanks to christos@ for multiple rounds of reviews of the tokenizer code.


 To generate a diff of this commit:
 cvs rdiff -u -r1.8 -r1.9 src/usr.sbin/makemandb/Makefile
 cvs rdiff -u -r1.37 -r1.38 src/usr.sbin/makemandb/apropos-utils.c
 cvs rdiff -u -r1.12 -r1.13 src/usr.sbin/makemandb/apropos-utils.h
 cvs rdiff -u -r0 -r1.1 src/usr.sbin/makemandb/custom_apropos_tokenizer.c \
     src/usr.sbin/makemandb/custom_apropos_tokenizer.h \
     src/usr.sbin/makemandb/fts3_tokenizer.h src/usr.sbin/makemandb/nostem.txt

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: abhinav@NetBSD.org
State-Changed-When: Sun, 25 Jun 2017 15:22:02 +0000
State-Changed-Why:
Fixed with the custom tokenizer implementation.


>Unformatted:
