Completed Projects

Projects from the projects available page that have been completed.

allpkgs consistency checking

Not implemented; obsolete, because all generated catalogs are now consistent, and the upload procedure ensures they remain that way.

It's possible that two catalogs are referencing a package with one file name but different md5 sums. For example:

$ (curl -s $CATALOG_URL_1; curl -s $CATALOG_URL_2) | awk '$1 == "samba_doc" { print $4, $5; }'
samba_doc-3.0.22,REV=2006.06.27-SunOS5.8-all-CSW.pkg.gz 189b4a3d408a235c2d4ac04ae63ff104
samba_doc-3.0.22,REV=2006.06.27-SunOS5.8-all-CSW.pkg.gz 3c27aa4a1dca35e9920ebffe907516d1

The consistency check would make sure that:

1. All catalogs associate 1 file name with 1 md5 sum
2. The file actually has that md5 sum (when you read it from allpkgs and compute the md5)
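Both checks can be sketched in a few lines of Python. This is only an illustration, not existing code: the field positions follow the awk one-liner above (field 4 is the file name, field 5 the md5 sum), and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def parse_catalog(lines):
    """Yield (filename, md5) pairs from catalog lines.

    Assumes the usual catalog layout, where field 4 is the package
    file name and field 5 its md5 sum (as in the awk one-liner above).
    """
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 5:
            yield fields[3], fields[4]

def find_conflicts(catalogs):
    """Check (1): return file names mapped to more than one md5 sum."""
    md5s = defaultdict(set)
    for catalog in catalogs:
        for filename, md5 in parse_catalog(catalog):
            md5s[filename].add(md5)
    return {f: sums for f, sums in md5s.items() if len(sums) > 1}

def file_md5(path):
    """Check (2): compute the actual md5 of a file under allpkgs."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running `find_conflicts` over all catalogs would have flagged the samba_doc example above, since the same file name maps to two md5 sums.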

Information necessary for these checks could be obtained from the mirror via HTTP; the utility could also read data from the database (via HTTP).

This could be a standalone utility.

Lazy catalog generation

Our catalogs are currently always regenerated, whether the content has changed or not, so the catalog files are written anew even when nothing is different. This project is about the catalog and description files, not about the hard- and symlinks to package files.

It could be a standalone utility. For example:

catalog_diff -a URI1 -b URI2
echo $?
# 0 means no difference, 1 means there is a difference

On our buildfarm it would be:

catalog_diff -a -b file:///export/mirror/opencsw-official/unstable/i386/5.10/catalog
echo $?
  • The script would detect what kind of URI we're dealing with.
  • The script needs to compare the actual content of the catalogs, so that the outcome does not depend on any comments or gpg signatures in the catalog file on disk.
  • Ideally it would be also a function in our Python code so that we would be able to roll it into our future integrated catalog generation (as opposed to today's collection of shell scripts and Python scripts that call each other).
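A minimal sketch of such a comparison in Python, under stated assumptions: the function names are hypothetical, the PGP-armor handling is simplified, and only comments, blank lines, and the signature wrapper are stripped before comparing.

```python
import sys
import urllib.request

def normalized_lines(uri):
    """Fetch a catalog via any URI urllib understands (file://, http://)
    and keep only the actual catalog entries.  Comments and the optional
    PGP signature wrapper are stripped, so two catalogs with identical
    content but different signatures compare as equal."""
    with urllib.request.urlopen(uri) as f:
        raw = f.read().decode("utf-8", "replace").splitlines()
    lines, in_signature = [], False
    for line in raw:
        if line.startswith("-----BEGIN PGP SIGNATURE"):
            in_signature = True
        elif line.startswith("-----END PGP SIGNATURE"):
            in_signature = False
        elif (not in_signature
              and not line.startswith(("#", "-----", "Hash:", "Version:"))
              and line.strip()):
            lines.append(line)
    return sorted(lines)

def catalogs_differ(uri_a, uri_b):
    return normalized_lines(uri_a) != normalized_lines(uri_b)

if __name__ == "__main__":
    # Same exit-code convention as proposed: 0 = no difference.
    sys.exit(1 if catalogs_differ(sys.argv[1], sys.argv[2]) else 0)
```

Since the comparison works on normalized content, the same function could be called directly from future integrated catalog-generation code instead of shelling out.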

Rewrite of package hard- and symlinking during catalog generation

Our current catalog generation takes about 35 minutes (down from about 80). If we can make it even faster, we can generate the catalog more often, have a quicker build-push-release turnaround, and relieve the buildfarm of most of the current catalog-generation-induced disk stress. We currently run the generation every 3 hours. If we can make it complete in something like 10 minutes (which I think is possible), we could run catalog generation e.g. every hour.

We have a directory on disk with a package catalog, as we can see on the mirror:

We can query the RESTful interface for the current state of the same catalog in the database:

curl -s | python -m json.tool | (head -n 30; cat >/dev/null)

The Python invocation is just for pretty-printing the data.

We also have the 'allpkgs' directory:

It's excluded from rsync, so it doesn't get propagated to mirrors, but it does exist on the master mirror and the buildfarm. It's the central pool for all the package data files.

When we generate catalogs, we do not copy anything, instead we make hardlinks to the allpkgs directory. For example, we make a hardlink from allpkgs/foo-i386-CSW.pkg.gz to unstable/5.9/i386. However, when we generate a catalog for the next OS release (e.g. 5.10), we do not make a hardlink; if possible, we make a symlink from the 5.10 directory to the 5.9 directory. This way we save space on mirrors: we only send out 1 copy of the file (in the lowest OS release in which it occurs), and then we create symlinks to it.

For example:

allpkgs/foo-i386-CSW.pkg.gz (not synced to mirrors)
unstable/i386/5.9/foo-i386-CSW.pkg.gz (hardlink to the file in allpkgs)
unstable/i386/5.10/foo-i386-CSW.pkg.gz → ../5.9/foo-i386-CSW.pkg.gz (symlink)
unstable/i386/5.11/foo-i386-CSW.pkg.gz → ../5.9/foo-i386-CSW.pkg.gz (symlink)
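The linking scheme above can be sketched in Python. This is an illustration, not the existing code: the function name and argument layout are hypothetical, but the hardlink-into-the-lowest-release, symlink-from-the-rest behavior matches the example.

```python
import os

def link_package(allpkgs, catalog_dir, os_releases, filename):
    """Place one package file under catalog_dir (e.g. unstable/i386):
    a hardlink from allpkgs into the lowest OS release, and relative
    symlinks from all higher releases, as in the example above.
    os_releases must be sorted ascending, e.g. ["5.9", "5.10"]."""
    lowest = os_releases[0]
    target = os.path.join(catalog_dir, lowest, filename)
    if not os.path.exists(target):
        # Hardlink: the mirror receives exactly one copy of the data.
        os.link(os.path.join(allpkgs, filename), target)
    for release in os_releases[1:]:
        link_path = os.path.join(catalog_dir, release, filename)
        if not os.path.lexists(link_path):
            # Relative symlink, e.g. ../5.9/foo-i386-CSW.pkg.gz
            os.symlink(os.path.join("..", lowest, filename), link_path)
```

Relative symlinks are essential here: they survive rsync to mirrors, where the absolute path of the catalog tree differs from the master's.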

You can now see that we need to generate catalogs for one catalog release (e.g. unstable) and one architecture, covering all OS releases, in a single program run.

We do currently have code that does this, but the code is really naive. It unlinks everything from the directory and starts from scratch every time. This generates a lot of unnecessary disk operations and makes the whole process slow. It would be much better to compare what's in the database with what's on disk and compute the smallest set of operations that brings the disk to the new state.
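The "smallest set of operations" step is a plain set difference. A sketch, with a hypothetical representation: both states map a path relative to the catalog directory to its intended content, either the literal string "hardlink" or a symlink target.

```python
def plan_operations(db_state, disk_state):
    """Compute the minimal remove/create lists that bring the on-disk
    catalog directory to the state described in the database.

    Both arguments map a relative path to its intended content: the
    string "hardlink", or a symlink target like "../5.9/foo.pkg.gz".
    A path whose target changed appears in both lists (unlink, relink).
    """
    to_remove = sorted(path for path, current in disk_state.items()
                       if db_state.get(path) != current)
    to_create = sorted(path for path, wanted in db_state.items()
                       if disk_state.get(path) != wanted)
    return to_remove, to_create
```

Entries that already match cost nothing, so an unchanged catalog results in zero disk operations instead of a full rebuild.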

Package rename tracking

Implemented by Carsten

We're renaming and splitting packages a lot. We've worked out a system for doing it: we keep the pkgname (e.g. CSWfoo) the same, but alter the catalogname (e.g. foo) by adding _stub, and try to rebuild all dependent packages so that they no longer depend on the stub. The problem is that we don't have a tool to track these renames.

Imagine an upgrade path:

legacy → dublin → kiel →…

We can only make limited changes during one upgrade, so we need 2 catalog transitions in order to successfully rename a package. For instance, let's rename CSWfoo to CSWbar:

  • Release N-2: CSWfoo/foo; there are packages depending on it
  • Release N-1: CSWfoo/foo_stub depends on CSWbar/bar; there still might be packages depending on CSWfoo
  • Release N: CSWbar/bar is marked incompatible with CSWfoo; nothing depends on CSWfoo any more, and the package itself has been removed

There is a mechanism in pkgutil which removes the incompatible packages. This way, installing CSWbar in Release N would cause CSWfoo/foo_stub to be removed from the system.

The project would be about writing a tool which would take 2 catalogs, examine them, and say:

  • Which packages need rebuilding (so they don't depend on the _stub any more)
  • Which _stub packages can be removed
  • Which packages can declare incompatibility on the old packages, so that the old packages can be removed
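The first two questions reduce to simple set operations over one catalog's dependency graph. A sketch under assumed data shapes (the function name and the pkgname → (catalogname, deps) mapping are hypothetical, not the actual catalog format):

```python
def analyze_stubs(catalog):
    """catalog maps pkgname -> (catalogname, set of dependency pkgnames).
    Returns (needs_rebuild, removable_stubs): packages that still
    depend on a _stub package, and stubs nothing depends on any more."""
    stubs = {pkg for pkg, (catalogname, _) in catalog.items()
             if catalogname.endswith("_stub")}
    needs_rebuild = {pkg for pkg, (_, deps) in catalog.items()
                     if deps & stubs}
    depended_on = set()
    for _, deps in catalog.values():
        depended_on |= deps
    removable_stubs = stubs - depended_on
    return needs_rebuild, removable_stubs
```

Answering the third question (which packages can declare incompatibility) additionally needs the second catalog, to see which renames completed between the two releases.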

Package database catalog consistency checking, using chkcat (Done)

Implemented by Rafael.

It sometimes happens that the state of the catalog in the database is inconsistent, for example one of the packages declares a dependency on a package which is not in that catalog. We have safeguards that prevent such a broken catalog from being propagated to the mirror, but we don't have any alerting that would let us know the catalog in the database is in a bad state. This code would be similar to chkcat, but working on data from the database rather than on an on-disk catalog file.
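The core of the check is small: every declared dependency must itself be present in the catalog. A sketch, assuming a hypothetical pkgname → dependency-set mapping such as one decoded from the RESTful JSON interface:

```python
def missing_dependencies(catalog):
    """catalog maps pkgname -> set of dependency pkgnames.
    Returns, per package, the declared dependencies that are
    absent from the catalog itself; empty dict means consistent."""
    present = set(catalog)
    return {pkg: deps - present
            for pkg, deps in catalog.items()
            if deps - present}
```

A cron job running this against each database catalog and mailing the non-empty results would provide the missing alerting.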

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License