================================ The detached binaries repository ================================ .. contents:: A brief description =================== Ideally, all binaries from packages sources (ie. all the binary files inside SOURCES/) will be placed in another subversion repository. This repository is called "tarballs repository", "binaries repository" or just "binrepo". It will contain mostly the same directory structure of the main repository, but instead of having SOURCES and SPECS, it will only have a SOURCES directory. Every copy/move operation should happen in both repositories. In order to allow deceasing binaries from older distributions, each stable distro will have its own subversion repository for binary files. mgarepo knows how to access these binrepos by checking which URL defined in the "[binrepo]" section of the configuration file matches the path-part of the repository being accessed. (see open issues) The package changelogs will be generated from SVN commit logs in the main "plaintext" repository ("txtrepo" for short) only. Old changelogs will be preserved, as even empty revisions are preserved in the binaries-filtering conversion. Mapping repositories states --------------------------- In order to allow the use of `mgarepo {getsrpm,co} -r REV`, mgarepo will have to use a reference in the text repo which will be used to know in what state was the binrepo when a binary was uploaded. We cannot use direct revision number mapping through properties/files/etc mainly because we may have multiple binaries repositories, and eventually they can be filtered for reducing space, thus can't ensure revisions will survive. Thus another mechanism which relies on dates instead of revisions numbers is needed. When a binary is uploaded to the binrepo, the file `sha1.lst` is updated to have the files's hash and committed in the main text repo. This file will be used as the reference when the user uses -r REV on mgarepo. mgarepo will checkout the package in the main text repo with -r REV and then will use the "Last Changed Date" of `sha1.lst` to checkout the binrepo part. Thus, `sha1.lst` should be always committed to the main text repository *after* the corresponding binary files have been committed to the binrepo. Hooks in the main repository may be used to try to enforce this, by checking if the files changed in `sha1.lst` are already committed in the corresponding binrepo. Computation of `sha1.lst` is unlikely to be an issue: - it should not happen too often for any given package - it takes[0] less than 10s to sha1sum all SOURCES of openoffice.org-3.1-1mdv2010.0.src.rpm - it probably takes way less than the time to upload the file into the repository - it can be computed in parallel to the binrepo commit, and probably finish before that, thus ready by the time `sha1.lst` should be committed - users don't need to verify the SHA1s "every time", but the build system does, thus Repsys can default to not verify and avoid wasting users' time The use of `sha1.lst` has the valuable property of tying the state of the main repository and the binrepo. With it, at getsrpm time of a package submission we can verify the SHA1 of the SOURCES-bin, and be sure that either the package will be built with the expected state, or early fail the build. It also allows for verifying binaries without trusting the binrepo, which may be useful if we consider using an unversioned plain filesystem storage in the future (for old distros or whatever), or at "client side", which maintainers may find useful. [0]: In a single core AMD Athlon(tm) 3800+ (2400Mhz) Mapping of revisions using SVN properties ----------------------------------------- Alternatively to using the above "sha1.lst scheme", the revision mapping between the main repository and a binrepo could be done using subversion properties. This could be done by making every commit to binrepos also cause a corresponding commit in the main text repository to happen, which would update a property recording the current date. That is, a subversion property in the main text repository would be kept, such that for any given main repository revision, the corresponding state of the binrepos is obtainable (using the registered date). This would be "more transparent", as it can be maintained simply by using subversion hooks, without user intervention. OTOH, as every time the user commits to a binrepo this would result in a commit in the main repository, it would require the user to "svn up" the directories from there before committing, after every binrepo commit. Also, this might result in a big number of "bogus" commits to the main repository, which could be seen as log pollution, and may potentially increase space usage etc.. Why a new repository without the tarballs ========================================== - the current svn repository is too large, hard to manage - big binary files (in general, "tarballs") history is of little value in the distro development, we care much more about our specs, patches, configurations, etc.; nonetheless, those big files we don't care much for take the most resources and make backups and restoration in case of failure very expensive, much more so than the more valuable data - there is no easy way to strip undesired tarballs without recreating the whole repository - Fedora and Ubuntu have separated repositories, so we must have it too! Numbers ------- Circa 2011 repository is +390000 revisions and ~340Gb big, while the bzip2ed dumps backup for it takes about a bit more than half that size (FIXME: estimate; can't check in the backup server right now). Current txtrepo with the same number of revisions is ~180Gb big, takes about 2-3 days to be imported, while the gzipped full dump backup for it currently takes ~1.2Gb. Initial binrepo for cauldron (only `current/` packages' branches) took ~28Gb in disk, gzipped full dump for it takes ~25Gb, took about 5h30m to be populated from the current in use repository ("oldrepo"). Drawbacks of this layout ========================= - (always) everything that changes the single-repository usage increases the chance of failure and make things more complicated. - subversion can't be used alone as easily as the current scheme allows - copying binaries between distro branches may not be "svn-cheap" anymore (unless they're in the same binrepo) - ... Open issues ============ Multiple binrepos dont allow us to have one permanent URL --------------------------------------------------------- We would have to update the configuration files from all the users in order to add a new stable repository. spuk suggests to use properties in the main text repo that would point to the right repository locations. How to handle failures when operating on more repositories? ----------------------------------------------------------- binrepos should replicate the structure of the main text repo. What we should do if the markrelease succeeds in the binrepo, but fails in the main text repo? R: Markrelease must be done first in the txtrepo. If it fails there "we're in trouble" (though currently, we just miss it[0]). When the markrelease is done in the txtrepo, we can do markrelease in the binrepo using '-r {DATE}', using the markrelease date in the txtrepo as '{DATE}'. [0] We should add transaction support for markrelease. The transaction could be stored out of the packages SVN (another SVN, a DB, a txt file, etc.), and would work like: 0. mark beginning of markrelease, early failing the package build if it fails 1. do markrelease 2. mark successful end of markrelease or mark failed markrelease, so we can replay it later Interesting use cases (first phase) =================================== mgarepo co 1/mutt --------------------- - mgarepo checkouts http://svn.mageia.org/svn/packages/updates/1/mutt/current to the mutt directory - mgarepo checkouts http://svn.mageia.org/svn/binrepo/updates/1/mutt/current/SOURCES into mutt/SOURCES-bin - creates symlinks for all files found in SOURCES-bin/ into ../SOURCES/ (rpm doesn't handle symlinks, this allows us to have explicit links and proper src.rpm generates by rpmbuild) In case the path doesn't exist in the binrepo it will not fail, as we may have not imported all packages or the repository is not prepared to work on this model, etc. markrelease of a package ------------------------ :: $ mgarepo markrelease - will copy current/ to releases/VERSION/RELEASE, as usual - will copy current/ to releases/, on the binrepo too Optionally, markrelease could create revprops indicating which is the revision of current/ on the binrepo that represents the tarballs that are being tagged. Use cases to be implemented after the first phase ================================================= upgrading to a newer version of the package ------------------------------------------- :: $ cd bla/SOURCES/ $ wget https://prdownloads.sourceforge.net/bla/bla-1.6.tar.bz2 $ mgarepo upload bla-1.6.0.tar.bz2 - mgarepo notices this is a tarball (checking filename and/or file size) - mgarepo will move the file to SOURCES-bin/, create the symlink, and svn-add it to the working copy $ # the user updates the spec $ mgarepo rm SOURCES/bla-1.5.1.tar.bz2 - it will remove the symlink and run svn rm on SOURCES-bin/bla-1.6.0.tar.bz2:: $ cd ../ # package top dir $ mgarepo ci - mgarepo will commit the new tarball on SOURCES-bin/ and then on the rest of the working copy mgarepo sync would perform these steps too. importing a package ------------------- $ mgarepo putsrpm mypkg.src.rpm - mgarepo will open the src.rpm - will look for tarballs inside SOURCES/ and import them to http://svn.mageia.org/svn/binrepo/cauldron/mypkg/current/SOURCES/ - will move the tarballs out of SOURCES and import the remaining files to http://svn.mageia.org/svn/packages/cauldron/mypkg/current/ - will do whatever else putsrpm already does TODO ===== First phase ----------- - upload - markrelease - putsrpm - getsrpm Second phase ------------ - up - sync Rejected or postponed ideas =========================== Use of a plain filesystem storage for the tarballs -------------------------------------------------- This was planned, then rejected. It becomes too complicated when thinking about markrelease, and mapping SVN revisions in the main repository to binaries versions in the "tarballs storage", basically requiring implementing VCS-like features on top of filesystem. Would also require implementing another authentication and access scheme. The main feature would be ease of removing old binaries, which isn't much of a point because we don't know precisely what and when we want to remove, so may end up not removing much files anyway. Use of a plain unversioned filesystem storage for the tarballs -------------------------------------------------------------- Different than the previous one, this would mean not relying at all on binary files history keeping. Structure could be something simple like:: packages/${pkg:0:1}/$pkg/$tarball This alternative does not suffice for cauldron, nor for supported distros, for which we want history. It could, however, at some point be used for "very old" distros, for which we may have lost interest in keeping *binaries* history (package history will kept "forever" in the main SVN repository). Alternatively, "resetting" an SVN binrepo (i.e. recreate the repository) to contain only the latest tarballs would probably take about the same amount of space, anyway... Open tarballs repository ------------------------ This idea is not really rejected. It does not go against splitting txtrepo and binrepo, but rather complement this idea, where the open-tarballs-repository would take the place of the binrepo. The txtrepo would still be used +- the same way. This repository could be used selectively, for packages where it makes sense, while most packages could be kept "closed", still as tarballs. Use of externals for more seamless Subversion usage --------------------------------------------------- This idea is not discarded, but it just provides easiness. OTOH, it makes things more complicated: - markrelease: externals would have to be updated in order to make it point to the tagged version in the binrepo, otherwise changes in current@binrepo would change older releases; - branching whole distro: even though subversion now supports "relative externals", we would have to update the URLs for *every* package on the distro, as the path to reach the binrepo spans the local distribution directory; - keeping externals up-to-date (as stated above and below) - authentication and access control: only markrelease action done by the build system should be allowed to change externals (so what about importing new packages?) - just a convenience, we don't need and shouldn't rely on externals for running the build system, while most people will use the repositories via Repsys, so why spend time to implement and keep it? - "svn co" works transparently, cool, but "svn co -r N" does not, otherwise every change in the binrepo would require svn:externals to be updated in the respective package; - it does not solve the problem of creating and handling symlinks between SOURCES and SOURCES-bin. Keeping svn:externals updated for every package has almost the same cost of keeping the `sha1.lst` updated, with the difference that in the latter we would not have to update every package when creating distro branches. Use of "external" xdelta to save space on binaries -------------------------------------------------- But how? First idea is this could be done by defining a protocol and assuming repository manipulation with mgarepo (for ease). Repsys could xdelta tarballs and add it to SVN with a special filename, then use it when checking out. Would require a policy/algorithm on when to ditch old whole binaries, too (i.e. hopefully wouldn't need to be handled manually by the maintainer). Also, this is something complementary to splitting the repository, so we may do it later, for binrepos. The Future ========== - Open tarballs repositories - suited for GIT, maybe multi-VCS - incremental move - not everything will be suited for this, must handle all cases or be optional - Xdelta