NetBSD Problem Report #50216
From gson@gson.org Mon Sep 7 13:28:39 2015
Return-Path: <gson@gson.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id CFC21A5B2E
for <gnats-bugs@gnats.NetBSD.org>; Mon, 7 Sep 2015 13:28:39 +0000 (UTC)
Message-Id: <20150907132830.E7800743D8C@guava.gson.org>
Date: Mon, 7 Sep 2015 16:28:30 +0300 (EEST)
From: gson@gson.org (Andreas Gustafsson)
Reply-To: gson@gson.org (Andreas Gustafsson)
To: gnats-bugs@gnats.NetBSD.org
Subject: USB, PCI ID data grow quadratically in the repository
X-Send-Pr-Version: 3.95
>Number: 50216
>Category: misc
>Synopsis: USB, PCI ID data grow quadratically in the repository
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: misc-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Sep 07 13:30:00 +0000 2015
>Last-Modified: Fri Oct 27 12:15:00 +0000 2017
>Originator: Andreas Gustafsson
>Release: NetBSD-current, source date 2015.01.26.10.53.21
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
A while ago, I added a new device to src/sys/dev/usb/usbdevs
and regenerated usbdevs.h and usbdevs_data.h.
When I did this, I noticed that the addition of a single line to
usbdevs caused a large number of lines to change in usbdevs_data.h.
It looks like the number of lines in usbdevs_data.h that change when
adding a new device is typically on the order of a few thousand, and I
suspect it is actually proportional to the number of devices that
already exist, on average. This is a problem, because as new USB
devices are added one by one, the space used by usbdevs_data.h in the
repository will grow proportionally to the *square* of the total
number of devices. The PCI device data also appears to have the same
problem.
As of today, the size of usbdevs_data.h,v in the repository is 3.2
megabytes, and pcidevs_data.h,v is 57 megabytes. But it's not the
current size that worries me as much as what will happen over time if
we keep adding new devices and this results in the repository
files growing not just at a constant rate, but at an *accelerating*
rate.
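
To make the growth argument concrete, here is a minimal Python sketch.
It assumes, purely for illustration, that the k-th device addition
rewrites about k lines of the generated file. The function name and
the changed_fraction parameter are invented for this example and are
not measured from the repository.

```python
def repository_growth(num_devices, changed_fraction=1.0):
    """Cumulative delta lines stored after adding devices one at a time.

    Assumption (hypothetical): the k-th addition rewrites
    changed_fraction * k lines of the generated file.
    """
    total = 0
    history = []
    for k in range(1, num_devices + 1):
        total += changed_fraction * k   # delta stored for this commit
        history.append(total)
    return history


growth = repository_growth(1000)
# Compare the stored deltas after 1000 devices with those after 500.
ratio = growth[999] / growth[499]
print(f"deltas after 1000 devices: {growth[999]:.0f}, ratio to 500: {ratio:.2f}")
```

Under that assumption the total is the sum 1 + 2 + ... + N = N(N+1)/2,
so doubling the number of devices roughly quadruples the space used.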
>How-To-Repeat:
cd src/sys/dev/pci
cvs log pcidevs_data.h | grep 'lines:'
Notice how thousands of lines change in each commit.
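
The per-commit numbers can also be totalled. The following is a rough
Python sketch that sums the "lines: +N -M" fields of cvs log output;
the function name and the two sample log lines are invented for
illustration and are not real commits.

```python
import re

# Matches the per-revision "lines: +added -removed" field of `cvs log`.
LINES_RE = re.compile(r"lines: \+(\d+) -(\d+)")


def total_changed_lines(cvs_log_text):
    """Return (added, removed) summed over all revisions in the log text."""
    added = removed = 0
    for match in LINES_RE.finditer(cvs_log_text):
        added += int(match.group(1))
        removed += int(match.group(2))
    return added, removed


# Hypothetical sample input, not taken from the real repository.
sample_log = """\
date: 2015/01/01 00:00:00;  author: someone;  state: Exp;  lines: +2100 -2090
date: 2015/02/01 00:00:00;  author: someone;  state: Exp;  lines: +3050 -3041
"""

print(total_changed_lines(sample_log))
```

To run it on real data, the output of "cvs log pcidevs_data.h" could
be read from a file or from standard input and passed to
total_changed_lines().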
>Fix:
Revert the compression done in devlist2h.awk, or replace it with a
different compression algorithm where a small change in the input only
results in a small change in the output.
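
As an illustration of why a compressed encoding can churn, here is a
toy Python model. It is not the algorithm in devlist2h.awk. It assumes
a scheme in which words are stored once in a sorted table and each
device line refers to its words by byte offset. All names and device
strings below are hypothetical.

```python
import difflib


def plain(devices):
    """One output line per device: adding a device adds one line."""
    return [f"device {name}" for name in devices]


def compressed(devices):
    """Toy shared-word table (an assumed scheme, not the real awk output).

    Words are stored once, sorted; each device line holds the byte
    offsets of its words within that table.
    """
    words = sorted({w for name in devices for w in name.split()})
    offsets, pos = {}, 0
    for w in words:
        offsets[w] = pos
        pos += len(w) + 1  # word plus a terminating NUL
    word_lines = [f'"{w}\\0"' for w in words]
    device_lines = [" ".join(str(offsets[w]) for w in name.split())
                    for name in devices]
    return word_lines + device_lines


def changed_lines(before, after):
    """Count lines touched by the edit from `before` to `after`."""
    sm = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal")


old = [f"vendor{i:03d} product{i:03d}" for i in range(100)]
new = old + ["acme anvil"]  # its words sort first, shifting every offset

plain_churn = changed_lines(plain(old), plain(new))
compressed_churn = changed_lines(compressed(old), compressed(new))
print(plain_churn, compressed_churn)
```

In this model the plain output changes by a single line, while the
offset-based output changes on every existing device line, because
each offset after the inserted words moves.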
>Audit-Trail:
From: Joerg Sonnenberger <joerg@britannica.bec.de>
To: gnats-bugs@NetBSD.org
Cc: misc-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org
Subject: Re: misc/50216: USB, PCI ID data grow quadratically in the repository
Date: Mon, 7 Sep 2015 17:01:15 +0200
On Mon, Sep 07, 2015 at 01:30:00PM +0000, Andreas Gustafsson wrote:
> >Fix:
>
> Revert the compression done in devlist2h.awk, or replace it with a
> different compression algorithm where a small change in the input only
> results in a small change in the output.
...or just create the files during the build.
Joerg
From: matthew green <mrg@eterna.com.au>
To: Joerg Sonnenberger <joerg@britannica.bec.de>
Cc: misc-bug-people@netbsd.org, gnats-admin@netbsd.org,
netbsd-bugs@netbsd.org, gnats-bugs@NetBSD.org
Subject: re: misc/50216: USB, PCI ID data grow quadratically in the repository
Date: Tue, 08 Sep 2015 04:35:38 +1000
Joerg Sonnenberger writes:
> On Mon, Sep 07, 2015 at 01:30:00PM +0000, Andreas Gustafsson wrote:
> > >Fix:
> >
> > Revert the compression done in devlist2h.awk, or replace it with a
> > different compression algorithm where a small change in the input only
> > results in a small change in the output.
>
> ...or just create the files during the build.
i would rather avoid this method, since it means "grep" can't find
the symbol used in the source code.
.mrg.
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@www.NetBSD.org
Cc:
Subject: Re: misc/50216: USB, PCI ID data grow quadratically in the repository
Date: Fri, 27 Oct 2017 18:54:06 +0700
I just got asked to take a look at this issue, and I totally agree with
Joerg - these files should be generated as part of the build, not
committed to CVS.
Matt's objection to that was ...
mrg@eterna.com.au said (back in Sep 2015):
| i would rather avoid this method, since it means "grep" can't find
| the symbol used in the source code.
and I understand that. I see two ways that could be resolved. First,
the files (as well as being used in the build) could be installed
somewhere (not in an include directory, somewhere like /usr/share/misc)
so they'd be available ... but that means only after the system has been
built and installed, so is probably not the solution that will satisfy.
Or second, there are two files generated (for each *devs file). That is,
for pcidevs we get pcidevs.h and pcidevs_data.h
The first one I don't think is too much of a problem (and a little tweaking
might be able to reduce the churn in that one even more.)
The _data.h file is the one that really causes the problem, and frankly
I can't see anyone wanting to grep that (or even keep it visible more than
a few seconds so it can be included in whatever #includes it.) If
for some reason you do, you can always generate it - obviously a method
to do that is going to be needed.
So, for now the solution I am going to suggest, at least as a first
quick fix, is that we change the procedure to only generate and check in
the namedevs.h files when namedevs is altered, and we fix the Makefiles
to generate namedevs_data.h on the fly during compilation (in the obj
directories).
Then we cvs delete the namedevs_data.h files, and arrange with the admins to
make them all read only in the repo, so there can be no more commits to them,
ever, from any release - any of the older releases that wants to change the
device lists will need to adopt the new procedure first. Then after NetBSD-8
has faded from memory (years from now) we simply nuke them.
I mean rm namedevs_data.h,v so they are simply gone! Anyone wanting to
checkout an older source tree after that will need to regenerate the files
(we could make a script for that to make life easier) before building, using
the appropriate devlist2h.awk scripts that match the distribution in question.
I think this is a much better solution than looking for a more stable
compression algorithm for the _data.h files (and then having to debug
the code to extract data from it) where what we have now seems to work
well enough ... but doing this would also make it much easier for someone
with the desire to create better algorithms, and experiment.
Any objections to this strategy? [The nuking the files from the repo
part is kind of an optional extra, just to reduce the repo size, it is
not crucial to the rest of the plan.]
If not, I'll start the process making all of this happen. I'll also look
at why we seem to need 15 devlist2h.awk files in the tree. That seems
a little excessive, I find it hard to believe that they're all so
different from each other that every single one of them is needed.
kre
From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, misc-bug-people@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org,
gson@gson.org (Andreas Gustafsson)
Cc:
Subject: Re: misc/50216: USB, PCI ID data grow quadratically in the repository
Date: Fri, 27 Oct 2017 08:10:59 -0400
On Oct 27, 11:55am, kre@munnari.OZ.AU (Robert Elz) wrote:
-- Subject: Re: misc/50216: USB, PCI ID data grow quadratically in the reposi
| and I understand that. I see two ways that could be resolved. First,
| the files (as well as being used in the build) could be installed
| somewhere (not in an include directory, somewhere like /usr/share/misc)
| so they'd be available ... but that means only after the system has been
| built and installed, so is probably not the solution that will satisfy.
|
| Or second, there are two files generated (for each *devs file). That is,
| for pcidevs we get pcidevs.h and pcidevs_data.h
|
| The first one I don't think is too much of a problem (and a little tweaking
| might be able to reduce the churn in that one even more.)
|
| The _data.h file is the one that really causes the problem, and frankly
| I can't see anyone wanting to grep that (or even keep it visible more than
| a few seconds so it can be included in whatever #includes it.) If
| for some reason you do, you can always generate it - obviously a method
| to do that is going to be needed.
|
| So, for now the solution I am going to suggest, at least as a first
| quick fix, is that we change the procedure to only generate and check in
| the namedevs.h files when namedevs is altered, and we fix the Makefiles
| to generate namedevs_data.h on the fly during compilation (in the obj
| directories).
|
| Then we cvs delete the namedevs_data.h files, and arrange with the admins to
| make them all read only in the repo, so there can be no more commits to them,
| ever, from any release - any of the older releases that wants to change the
| device lists will need to adopt the new procedure first. Then after NetBSD-8
| has faded from memory (years from now) we simply nuke them.
|
| I mean rm namedevs_data.h,v so they are simply gone! Anyone wanting to
| checkout an older source tree after that will need to regenerate the files
| (we could make a script for that to make life easier) before building, using
| the appropriate devlist2h.awk scripts that match the distribution in question.
|
| I think this is a much better solution than looking for a more stable
| compression algorithm for the _data.h files (and then having to debug
| the code to extract data from it) where what we have now seems to work
| well enough ... but doing this would also make it much easier for someone
| with the desire to create better algorithms, and experiment.
|
| Any objections to this strategy? [The nuking the files from the repo
| part is kind of an optional extra, just to reduce the repo size, it is
| not crucial to the rest of the plan.]
|
| If not, I'll start the process making all of this happen. I'll also look
| at why we seem to need 15 devlist2h.awk files in the tree. That seems
| a little excessive, I find it hard to believe that they're all so
| different from each other that every single one of them is needed.
I am happy with the plan to move the foo_data.h file to be
auto-generated during the build, and cvs delete it from head.
Perhaps it should build both files (foo.h foo_data.h) in the obj
directory and check that the committed foo.h matches the newly
generated one.
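
A check of that kind might look roughly like the Python sketch below.
It assumes that an expanded RCS keyword such as "$NetBSD: ... $" is
the only difference to be tolerated between the committed and the
freshly generated header. The function names are hypothetical, and the
real check would more likely live in the Makefiles.

```python
import re
from pathlib import Path

# An expanded "$NetBSD: ... $" keyword, to be collapsed back to "$NetBSD$".
RCS_ID = re.compile(r"\$NetBSD:[^$\n]*\$")


def normalize(text):
    """Collapse expanded RCS keywords so that only real content is compared."""
    return RCS_ID.sub("$NetBSD$", text)


def headers_match(committed_text, generated_text):
    """True if both headers are identical apart from RCS keyword expansion."""
    return normalize(committed_text) == normalize(generated_text)


def check_files(committed_path, generated_path):
    """Compare a committed header with the one generated in the obj directory."""
    return headers_match(Path(committed_path).read_text(),
                         Path(generated_path).read_text())
```

A build step could call check_files() with the committed foo.h and the
generated copy, and fail with a message if it returns False.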
I would not do anything at the repo level until that stabilizes;
any read-only decision etc, can be deferred.
As for merging them further, by all means. I tried and merged some
of them a while ago, the problem is that they are not all the same
:-)
christos