NetBSD Problem Report #50216

From gson@gson.org  Mon Sep  7 13:28:39 2015
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
	by mollari.NetBSD.org (Postfix) with ESMTPS id CFC21A5B2E
	for <gnats-bugs@gnats.NetBSD.org>; Mon,  7 Sep 2015 13:28:39 +0000 (UTC)
Message-Id: <20150907132830.E7800743D8C@guava.gson.org>
Date: Mon,  7 Sep 2015 16:28:30 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: USB, PCI ID data grow quadratically in the repository
X-Send-Pr-Version: 3.95

>Number:         50216
>Category:       misc
>Synopsis:       USB, PCI ID data grow quadratically in the repository
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    misc-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Sep 07 13:30:00 +0000 2015
>Last-Modified:  Fri Oct 27 12:15:00 +0000 2017
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current, source date 2015.01.26.10.53.21
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

A while ago, I added a new device to src/sys/dev/usb/usbdevs
and regenerated usbdevs.h and usbdevs_data.h.

When I did this, I noticed that the addition of a single line to
usbdevs caused a large number of lines to change in usbdevs_data.h.

It looks like the number of lines in usbdevs_data.h that change when
adding a new device is typically on the order of a few thousand, and I
suspect it is actually proportional to the number of devices that
already exist, on average.  This is a problem, because as new USB
devices are added one by one, the space used by usbdevs_data.h in the
repository will grow proportionally to the *square* of the total
number of devices.  The PCI device data also appears to have the same
problem.
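
The growth argument can be checked with simple arithmetic.  What follows
is a minimal sketch, assuming (hypothetically) that adding the n-th device
rewrites a fixed fraction of the n lines already present.  The function
name and the fraction 0.5 are illustrative only and are not measured from
the repository.

```python
def cumulative_changed_lines(num_devices, fraction_rewritten=0.5):
    """Total lines stored as deltas after adding devices one at a time.

    Assumption: adding device n rewrites fraction_rewritten * n lines of
    the generated file, so each commit stores a delta of about that size.
    """
    total = 0.0
    for n in range(1, num_devices + 1):
        total += fraction_rewritten * n
    return total


if __name__ == "__main__":
    # Doubling the device count should roughly quadruple the total.
    for count in (1000, 2000, 4000):
        print(count, cumulative_changed_lines(count))
```

Under that assumption, doubling the number of devices roughly quadruples
the total number of changed lines stored, which is the quadratic growth
described above.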

As of today, the size of usbdevs_data.h,v in the repository is 3.2
megabytes, and pcidevs_data.h,v is 57 megabytes.  But it's not the
current size that worries me as much as what will happen over time if
we keep adding new devices and this results in the repository
files growing not just at a constant rate, but at an *accelerating*
rate.

>How-To-Repeat:

  cd src/sys/dev/pci
  cvs log pcidevs_data.h | grep 'lines:' 

Notice how thousands of lines change in each commit.
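
To put a number on the churn rather than eyeballing it, the "lines:"
fields can be totalled.  This is a rough sketch, assuming the usual
"lines: +A -B" field that cvs log prints for every revision after the
first.  The function name is hypothetical, and a log message that happens
to contain the same pattern would be counted as well.

```python
import re

# Matches the "lines: +A -B" field of a cvs log revision entry.
LINES_RE = re.compile(r"lines:\s*\+(\d+)\s+-(\d+)")


def summarize_cvs_log(log_text):
    """Return (deltas, lines_added, lines_removed) for cvs log output."""
    deltas = added = removed = 0
    for match in LINES_RE.finditer(log_text):
        deltas += 1
        added += int(match.group(1))
        removed += int(match.group(2))
    return deltas, added, removed
```

Feeding it the output of "cvs log pcidevs_data.h" (for example by reading
standard input and passing the text to the function) gives the number of
deltas and the total lines added and removed across them.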

>Fix:

Revert the compression done in devlist2h.awk, or replace it with a
different compression algorithm where a small change in the input only
results in a small change in the output.
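
To illustrate what "a small change in the output" means here, the sketch
below compares two toy encodings.  It assumes, without having checked
this against devlist2h.awk, that the compression replaces each word of a
description with its byte offset into a shared word table built in order
of first appearance.  All names and the device data are made up for the
example.

```python
import difflib


def compressed_lines(devices):
    """Encode each description as byte offsets into a shared word table.

    Assumption: words enter the table in order of first appearance, so a
    new word shifts the offset of every word that is added after it.
    """
    offsets = {}
    next_offset = 0
    out = []
    for dev_id, description in devices:
        refs = []
        for word in description.split():
            if word not in offsets:
                offsets[word] = next_offset
                next_offset += len(word) + 1  # word plus NUL terminator
            refs.append(str(offsets[word]))
        out.append(f"{dev_id}, {', '.join(refs)},")
    return out


def plain_lines(devices):
    """Uncompressed form: one self-contained line per device."""
    return [f'{dev_id}, "{description}",' for dev_id, description in devices]


def changed_lines(before, after):
    """Count lines removed plus lines added between two outputs."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


if __name__ == "__main__":
    # Synthetic device list; insert one new entry near the start.
    old = [(f"0x{i:04x}", f"Vendor{i} Widget{i}") for i in range(100)]
    new = old[:10] + [("0xffff", "Newco Gadget")] + old[10:]
    print("plain:", changed_lines(plain_lines(old), plain_lines(new)))
    print("compressed:",
          changed_lines(compressed_lines(old), compressed_lines(new)))
```

In the plain encoding each line depends only on its own device entry, so
one insertion changes one line.  In the offset encoding every line after
the insertion point changes, because all later offsets shift.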

>Audit-Trail:
From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: misc-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: misc/50216: USB, PCI ID data grow quadratically in the repository
Date: Mon, 7 Sep 2015 17:01:15 +0200

 On Mon, Sep 07, 2015 at 01:30:00PM +0000, Andreas Gustafsson wrote:
 > >Fix:
 > 
 > Revert the compression done in devlist2h.awk, or replace it with a
 > different compression algorithm where a small change in the input only
 > results in a small change in the output.

 ...or just create the files during the build.

 Joerg

From: matthew green <mrg@eterna.com.au>
To: Joerg Sonnenberger <joerg@britannica.bec.de>
Cc: misc-bug-people@netbsd.org, gnats-admin@netbsd.org,
    netbsd-bugs@netbsd.org, gnats-bugs@NetBSD.org
Subject: re: misc/50216: USB, PCI ID data grow quadratically in the repository
Date: Tue, 08 Sep 2015 04:35:38 +1000

 Joerg Sonnenberger writes:
 > On Mon, Sep 07, 2015 at 01:30:00PM +0000, Andreas Gustafsson wrote:
 > > >Fix:
 > > 
 > > Revert the compression done in devlist2h.awk, or replace it with a
 > > different compression algorithm where a small change in the input only
 > > results in a small change in the output.
 > 
 > ...or just create the files during the build.

 i would rather avoid this method, since it means "grep" can't find
 the symbol used in the source code.


 .mrg.

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@www.NetBSD.org
Cc: 
Subject: Re: misc/50216: USB, PCI ID data grow quadratically in the repository
Date: Fri, 27 Oct 2017 18:54:06 +0700

 I just got asked to take a look at this issue, and I totally agree with
 Joerg - these files should be generated as part of the build, not
 committed to CVS.

 Matt's objection to that was ...

 mrg@eterna.com.au said (back in Sep 2015):
   |  i would rather avoid this method, since it means "grep" can't find
   |  the symbol used in the source code. 

 and I understand that.   I see two ways that could be resolved.  First,
 the files (as well as being used in the build) could be installed
 somewhere (not in an include directory, somewhere like /usr/share/misc)
 so they'd be available ... but that means only after the system has been
 built and installed, so is probably not the solution that will satisfy.

 Or second, there are two files generated (for each *devs file).  That is,
 for pcidevs we get pcidevs.h and pcidevs_data.h

 The first one I don't think is too much of a problem (and a little tweaking
 might be able to reduce the churn in that one even more.)

 The _data.h file is the one that really causes the problem, and frankly
 I can't see anyone wanting to grep that (or even keep it visible more than
 a few seconds so it can be included in whatever #includes it.)  If
 for some reason you do, you can always generate it - obviously a method
 to do that is going to be needed.

 So, for now the solution I am going to suggest, at least as a first
 quick fix, is that we change the procedure to only generate and check in
 the namedevs.h files when namedevs is altered, and we fix the Makefiles
 to generate namedevs_data.h on the fly during compilation (in the obj
 directories).

 Then we cvs delete the namedevs_data.h files, and arrange with the admins to
 make them all read only in the repo, so there can be no more commits to them,
 ever, from any release - any of the older releases that wants to change the
 device lists will need to adopt the new procedure first. Then after NetBSD-8
 has faded from memory (years from now) we simply nuke them.

 I mean rm namedevs_data.h,v so they are simply gone!  Anyone wanting to
 checkout an older source tree after that will need to regenerate the files
 (we could make a script for that to make life easier) before building, using
 the appropriate devlist2h.awk scripts that match the distribution in question.

 I think this is a much better solution than looking for a more stable
 compression algorithm for the _data.h files (and then having to debug
 the code to extract data from it) where what we have now seems to work
 well enough ... but doing this would also make it much easier for someone
 with the desire to create better algorithms, and experiment.

 Any objections to this strategy?   [The nuking the files from the repo
 part is kind of an optional extra, just to reduce the repo size, it is
 not crucial to the rest of the plan.]

 If not, I'll start the process making all of this happen.   I'll also look
 at why we seem to need 15 devlist2h.awk files in the tree.  That seems
 a little excessive, I find it hard to believe that they're all so
 different from each other that every single one of them is needed.

 kre


From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, misc-bug-people@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, 
	gson@gson.org (Andreas Gustafsson)
Cc: 
Subject: Re: misc/50216: USB, PCI ID data grow quadratically in the repository
Date: Fri, 27 Oct 2017 08:10:59 -0400

 On Oct 27, 11:55am, kre@munnari.OZ.AU (Robert Elz) wrote:
 -- Subject: Re: misc/50216: USB, PCI ID data grow quadratically in the reposi

 |  and I understand that.   I see two ways that could be resolved.  First,
 |  the files (as well as being used in the build) could be installed
 |  somewhere (not in an include directory, somewhere like /usr/share/misc)
 |  so they'd be available ... but that means only after the system has been
 |  built and installed, so is probably not the solution that will satisfy.
 |  
 |  Or second, there are two files generated (for each *devs file).  That is,
 |  for pcidevs we get pcidevs.h and pcidevs_data.h
 |  
 |  The first one I don't think is too much of a problem (and a little tweaking
 |  might be able to reduce the churn in that one even more.)
 |  
 |  The _data.h file is the one that really causes the problem, and frankly
 |  I can't see anyone wanting to grep that (or even keep it visible more than
 |  a few seconds so it can be included in whatever #includes it.)  If
 |  for some reason you do, you can always generate it - obviously a method
 |  to do that is going to be needed.
 |  
 |  So, for now the solution I am going to suggest, at least as a first
 |  quick fix, is that we change the procedure to only generate and check in
 |  the namedevs.h files when namedevs is altered, and we fix the Makefiles
 |  to generate namedevs_data.h on the fly during compilation (in the obj
 |  directories).
 |  
 |  Then we cvs delete the namedevs_data.h files, and arrange with the admins to
 |  make them all read only in the repo, so there can be no more commits to them,
 |  ever, from any release - any of the older releases that wants to change the
 |  device lists will need to adopt the new procedure first. Then after NetBSD-8
 |  has faded from memory (years from now) we simply nuke them.
 |  
 |  I mean rm namedevs_data.h,v so they are simply gone!  Anyone wanting to
 |  checkout an older source tree after that will need to regenerate the files
 |  (we could make a script for that to make life easier) before building, using
 |  the appropriate devlist2h.awk scripts that match the distribution in question.
 |  
 |  I think this is a much better solution than looking for a more stable
 |  compression algorithm for the _data.h files (and then having to debug
 |  the code to extract data from it) where what we have now seems to work
 |  well enough ... but doing this would also make it much easier for someone
 |  with the desire to create better algorithms, and experiment.
 |  
 |  Any objections to this strategy?   [The nuking the files from the repo
 |  part is kind of an optional extra, just to reduce the repo size, it is
 |  not crucial to the rest of the plan.]
 |  
 |  If not, I'll start the process making all of this happen.   I'll also look
 |  at why we seem to need 15 devlist2h.awk files in the tree.  That seems
 |  a little excessive, I find it hard to believe that they're all so
 |  different from each other that every single one of them is needed.

 I am happy with the plan to move the foo_data.h file to be
 auto-generated during the build, and cvs delete it from head.
 Perhaps it should build both files (foo.h foo_data.h) in the obj
 directory and check that the committed foo.h matches the newly
 generated one.
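
 Such a check could be sketched roughly as follows.  This is only an
 illustration: the function names are hypothetical, the generation step
 itself is left to the build, and skipping lines that contain "$NetBSD"
 is an assumption made so that RCS keyword expansion in the committed
 copy does not cause a spurious mismatch.

```python
import sys


def significant_lines(path, ignore_marker="$NetBSD"):
    """Read a header, dropping lines that carry an RCS keyword (assumed)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line for line in handle if ignore_marker not in line]


def header_is_current(committed_path, generated_path):
    """True if the committed header matches the freshly generated one."""
    return (significant_lines(committed_path)
            == significant_lines(generated_path))


def check_or_fail(committed_path, generated_path):
    """Return 0 if the files agree, 1 (with a message) if they do not."""
    if header_is_current(committed_path, generated_path):
        return 0
    print(f"{committed_path} is out of date; regenerate and commit it",
          file=sys.stderr)
    return 1
```

 A build step could call check_or_fail() with the committed foo.h and the
 copy generated in the obj directory, and stop the build on a non-zero
 result.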

 I would not do anything at the repo level until that stabilizes;
 any read-only decision etc, can be deferred.

 As for merging them further, by all means. I tried and merged some
 of them a while ago, the problem is that they are not all the same
 :-)

 christos
