diff options
Diffstat (limited to 'README.BINREPO')
-rw-r--r-- | README.BINREPO | 360 |
1 files changed, 360 insertions, 0 deletions
diff --git a/README.BINREPO b/README.BINREPO new file mode 100644 index 0000000..78b951d --- /dev/null +++ b/README.BINREPO @@ -0,0 +1,360 @@ +================================ +The detached binaries repository +================================ + +.. contents:: + +A brief description +=================== + +Ideally, all binaries from packages sources (ie. all the binary files inside +SOURCES/) will be placed in another subversion repository. This repository +is called "tarballs repository", "binaries repository" or just "binrepo". +It will contain mostly the same directory structure of the main repository, +but instead of having SOURCES and SPECS, it will only have a SOURCES +directory. Every copy/move operation should happen in both repositories. + +In order to allow deceasing binaries from older distributions, each stable +distro will have its own subversion repository for binary files. mgarepo +knows how to access these binrepos by checking which URL defined in the +"[binrepo]" section of the configuration file matches the path-part of the +repository being accessed. (see open issues) + +The package changelogs will be generated from SVN commit logs in the main +"plaintext" repository ("txtrepo" for short) only. Old changelogs will be +preserved, as even empty revisions are preserved in the binaries-filtering +conversion. + + +Mapping repositories states +--------------------------- + +In order to allow the use of `mgarepo {getsrpm,co} -r REV`, mgarepo will have +to use a reference in the text repo which will be used to know in what +state was the binrepo when a binary was uploaded. + +We cannot use direct revision number mapping through properties/files/etc +mainly because we may have multiple binaries repositories, and eventually +they can be filtered for reducing space, thus can't ensure revisions will +survive. Thus another mechanism which relies on dates instead of revisions +numbers is needed. + +When a binary is uploaded to the binrepo, the file `sha1.lst` is updated to +have the files's hash and commited in the main text repo. This file will be +used as the reference when the user uses -r REV on mgarepo. mgarepo will +checkout the package in the main text repo with -r REV and then will use +the "Last Changed Date" of `sha1.lst` to checkout the binrepo part. Thus, +`sha1.lst` should be always commited to the main text repository *after* the +corresponding binary files have been commited to the binrepo. Hooks in the +main repository may be used to try to enforce this, by checking if the files +changed in `sha1.lst` are already commited in the corresponding binrepo. + +Computation of `sha1.lst` is unlikely to be an issue: + +- it should not happen too often for any given package +- it takes[0] less than 10s to sha1sum all SOURCES of openoffice.org-3.1-1mdv2010.0.src.rpm +- it probably takes way less than the time to upload the file into the repository +- it can be computed in parallel to the binrepo commit, and probably finish + before that, thus ready by the time `sha1.lst` should be commited +- users don't need to verify the SHA1s "everytime", but the build system + does, thus Repsys can default to not verify and avoid wasting users' time + +The use of `sha1.lst` has the valuable property of tying the state of the main +repository and the binrepo. With it, at getsrpm time of a package +submission we can verify the SHA1 of the SOURCES-bin, and be sure that +either the package will be built with the expected state, or early fail the +build. It also allows for verifying binaries without trusting the binrepo, +which may be useful if we consider using an unversioned plain filesystem +storage in the future (for old distros or whatever), or at "client side", +which maintainers may find useful. + +[0]: In a single core AMD Athlon(tm) 3800+ (2400Mhz) + +Mapping of revisions using SVN properties +----------------------------------------- + +Alternatively to using the above "sha1.lst scheme", the revision mapping +between the main repository and a binrepo could be done using subversion +properties. This could be done by making every commit to binrepos also +cause a corresponding commit in the main text repository to happen, which +would update a property recording the current date. That is, a subversion +property in the main text repository would be kept, such that for any given +main repository revision, the corresponding state of the binrepos is +obtainable (using the registered date). + +This would be "more transparent", as it can be maintened simply by using +subversion hooks, without user intervention. OTOH, as every time the user +commits to a binrepo this would result in a commit in the main repository, +it would require the user to "svn up" the directories from there before +commiting, after every binrepo commit. Also, this might result in a big +number of "bogus" commits to the main repository, which could be seen as log +pollution, and may potentially increase space usage etc.. + +Why a new repository without the tarballs +========================================== + +- the current svn repository is too large, hard to manage +- big binary files (in general, "tarballs") history is of little value in + the distro development, we care much more about our specs, patches, + configurations, etc.; nonetheless, those big files we don't care much for + take the most resources and make backups and restoration in case of + failure very expensive, much more so than the more valuable data +- there is no easy way to strip undesired tarballs without recreating the + whole repository +- fedora and ubuntu have separated repositories, so we must have it too! + +Numbers +------- + +Current repository is +390000 revisions and ~340Gb big, while the bzip2ed +dumps backup for it takes about a bit more than half that size (FIXME: +estimative, can't check in the backup server right now). Current txtrepo +with the same number of revisions is ~180Gb big, takes about 2-3 days to be +imported, while the gzipped full dump backup for it currently takes ~1.2Gb. +Initial binrepo for Cooker (only `current/` packages' branches) took ~28Gb +in disk, gzipped full dump for it takes ~25Gb, took about 5h30m to be +populated from the current in use repository ("oldrepo"). + + +Drawbacks of this layout +========================= + +- (always) everything that changes the single-repository usage increases the chance + of failure and make things more complicated. +- subversion can't be used alone as easily as the current scheme allows +- copying binaries between distro branches may not be "svn-cheap" anymore + (unless they're in the same binrepo) +- ... + + +Open issues +============ + +Multiple binrepos dont allow us to have one permanent URL +--------------------------------------------------------- + +We would have to update the configuration files from all the users in order +to add a new stable repository. spuk suggests to use properties in the main +text repo that would point to the right repository locations. + +How to handle failures when operating on more repositores? +---------------------------------------------------------- + +binrepos should replicate the structure of the main text repo. What we +should do if the markrelease succeeds in the binrepo, but fails in the main +text repo? + +R: Markrelease must be done first in the txtrepo. If it fails there "we're +in trouble" (though currently, we just miss it[0]). When the markrelease is +done in the txtrepo, we can do markrelease in the binrepo using '-r {DATE}', +using the markrelease date in the txtrepo as '{DATE}'. + +[0] We should add transaction support for markrelease. The transaction could +be stored out of the packages SVN (another SVN, a DB, a txt file, etc.), and +would work like: + +0. mark beginning of markrelease, early failing the package build if it fails +1. do markrelease +2. mark sucessful end of markrelease + or mark failed markrelease, so we can replay it later + + +Interesting use cases (first phase) +=================================== + +mgarepo co 1/mutt +--------------------- + +- mgarepo checkouts + http://svn.mageia.org/svn/packages/updates/1/mutt/current to the + mutt directory + +- mgarepo checkouts + http://svn.mageia.org/svn/binrepo/updates/1/mutt/current/SOURCES + into mutt/SOURCES-bin + +- creates symlinks for all files found in SOURCES-bin/ into ../SOURCES/ + + (rpm doesn't handle symlinks, this allows us to have explicit links and + proper src.rpm generates by rpmbuild) + +In case the path doesn't exist in the binrepo it will not fail, as we may +have not imported all packages or the repository is not prepared to work on +this model, etc. + +markrelease of a package +------------------------ + +:: + + $ mgarepo markrelease + +- will copy current/ to releases/VERSION/RELEASE, as usual + +- will copy current/ to releases/, on the binrepo too + +Optionally, markrelease could create revprops indicating which is the +revision of current/ on the binrepo that represents the tarballs that are +being tagged. + + +Use cases to be implemented after the first phase +================================================= + +upgrading to a newer version of the package +------------------------------------------- + +:: + + $ cd bla/SOURCES/ + $ wget http://prdownloads.sourceforge.net/bla/bla-1.6.tar.bz2 + $ mgarepo upload bla-1.6.0.tar.bz2 + +- mgarepo notices this is a tarball (checking filename and/or file size) + +- mgarepo will move the file to SOURCES-bin/, create the symlink, and svn-add + it to the working copy + + $ # the user updates the spec + + $ mgarepo rm SOURCES/bla-1.5.1.tar.bz2 + +- it will remove the symlink and run svn rm on + SOURCES-bin/bla-1.6.0.tar.bz2:: + + $ cd ../ # package top dir + $ mgarepo ci + +- mgarepo will commit the new tarball on SOURCES-bin/ and then on the rest + of the working copy + +mgarepo sync would perform these steps too. + +importing a package +------------------- + + $ mgarepo putsrpm mypkg.src.rpm + +- mgarepo will open the src.rpm + +- will look for tarballs inside SOURCES/ and import them to + http://svn.mageia.org/svn/binrepo/cauldron/mypkg/current/SOURCES/ + +- will move the tarballs out of SOURCES and import the remaining files to + http://svn.mageia.org/svn/packages/cauldron/mypkg/current/ + +- will do whatever else putsrpm already does + +TODO +===== + +First phase +----------- + +- upload +- markrelease +- putsrpm +- getsrpm + + +Second phase +------------ + +- up +- sync + +Rejected or postponed ideas +=========================== + +Use of a plain filesystem storage for the tarballs +-------------------------------------------------- + +This was planned, then rejected. It becomes too complicated when thinking +about markrelease, and mapping SVN revisions in the main repository to +binaries versions in the "tarballs storage", basically requiring +implementing VCS-like features on top of filesystem. Would also require +implementing another authentication and access scheme. The main feature +would be ease of removing old binaries, which isn't much of a point because +we don't know precisely what and when we want to remove, so may end up not +removing much files anyway. + +Use of a plain unversioned filesystem storage for the tarballs +-------------------------------------------------------------- + +Different than the previous one, this would mean not relying at all on +binary files history keeping. Structure could be something simple like:: + + packages/${pkg:0:1}/$pkg/$tarball + +This alternative does not suffice for Cooker, nor for supported distros, for +which we want history. It could, however, at some point be used for "very +old" distros, for which we may have lost interest in keeping *binaries* +history (package history will kept "forever" in the main SVN repository). +Alternatively, "resetting" an SVN binrepo (i.e. recreate the repository) to +contain only the latest tarballs would probably take about the same amount +of space, anyway... + +Open tarballs repository +------------------------ + +This idea is not really rejected. It does not go against splitting txtrepo +and binrepo, but rather complement this idea, where the +open-tarballs-repository would take the place of the binrepo. The txtrepo +would still be used +- the same way. This repository could be used +selectively, for packages where it makes sense, while most packages could be +kept "closed", still as tarballs. + +Use of externals for more seamless Subversion usage +--------------------------------------------------- + +This idea is not discarded, but it just provides easiness. OTOH, it makes +things more complicated: + +- markrelease: externals would have to be updated in order to make it point + to the tagged version in the binrepo, otherwise changes in + current@binrepo would change older releases; +- branching whole distro: even though subversion now supports "relative + externals", we would have to update the URLs for *every* package on the + distro, as the path to reach the binrepo spans the local distribution + directory; +- keeping externals up-to-date (as stated above and below) +- authentication and access control: only markrelease action done by the + build system should be allowed to change externals (so what about importing + new packages?) +- just a convenience, we don't need and shouldn't rely on externals for + running the build system, while most people will use the repositories via + Repsys, so why spend time to implement and keep it? +- "svn co" works transparently, cool, but "svn co -r N" does not, otherwise + every change in the binrepo would require svn:externals to be updated in + the respective package; +- it does not solve the problem of creating and handling symlinks between + SOURCES and SOURCES-bin. + +Keeping svn:externals updated for every package has almost the same cost of +keeping the `sha1.lst` updated, with the difference that in the latter we +would not have to update every package when creating distro branches. + +Use of "external" xdelta to save space on binaries +-------------------------------------------------- + +But how? First idea is this could be done by defining a protocol and +assuming repository manipulation with mgarepo (for ease). Repsys could +xdelta tarballs and add it to SVN with a special filename, then use it when +checking out. Would require a policy/algorithm on when to ditch old whole +binaries, too (i.e. hopefully wouldn't need to be handled manually by the +maintainer). Also, this is something complemental to splitting the +repository, so we may do it later, for binrepos. + + +The Future +========== + +- Open tarballs repositories + + - suited for GIT, maybe multi-VCS + - incremental move + - not everything will be suited for this, must handle all cases or be + optional + +- Xdelta + |