================================
The detached binaries repository
================================

.. contents::

A brief description
===================

Ideally, all binaries from package sources (i.e. all the binary files inside
SOURCES/) will be placed in another subversion repository. This repository
is called the "tarballs repository", "binaries repository" or just "binrepo".
It will contain mostly the same directory structure as the main repository,
but instead of having SOURCES and SPECS, it will only have a SOURCES
directory. Every copy/move operation should happen in both repositories.

In order to allow removing binaries from older distributions, each stable
distro will have its own subversion repository for binary files.  mgarepo
knows how to access these binrepos by checking which URL defined in the
"[binrepo]" section of the configuration file matches the path part of the
repository being accessed (see "Open issues").
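
The lookup described above might be sketched as follows. The layout of the
"[binrepo]" section is not specified in this document, so the mapping from
path prefix to binrepo URL, the example URLs and the function name are all
assumptions made for illustration only.

```python
from urllib.parse import urlparse

# Hypothetical parsed "[binrepo]" section: path prefix -> binrepo URL.
BINREPOS = {
    "/svn/packages/cauldron": "http://svn.mageia.org/svn/binrepo/cauldron",
    "/svn/packages/updates/1": "http://svn.mageia.org/svn/binrepo/updates/1",
}


def find_binrepo(repo_url, binrepos=BINREPOS):
    """Return the binrepo URL matching the path part of repo_url, or None."""
    path = urlparse(repo_url).path.rstrip("/") + "/"
    # Try the longest prefixes first so that nested entries win.
    for prefix in sorted(binrepos, key=len, reverse=True):
        if path.startswith(prefix.rstrip("/") + "/"):
            return binrepos[prefix]
    return None
```

Returning None rather than raising leaves it to the caller to decide what to
do with packages that have no binrepo yet.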

The package changelogs will be generated from SVN commit logs in the main
"plaintext" repository ("txtrepo" for short) only.  Old changelogs will be
preserved, as even empty revisions are preserved in the binaries-filtering
conversion.


Mapping repository states
-------------------------

In order to allow the use of `mgarepo {getsrpm,co} -r REV`, mgarepo needs a
reference in the text repo that tells it what state the binrepo was in when
a binary was uploaded.

We cannot use direct revision number mapping through properties/files/etc.,
mainly because we may have multiple binaries repositories, and eventually
they can be filtered to reduce space, so we can't ensure revisions will
survive.  Thus another mechanism, which relies on dates instead of revision
numbers, is needed.

When a binary is uploaded to the binrepo, the file `sha1.lst` is updated with
the file's hash and committed in the main text repo. This file will be
used as the reference when the user passes -r REV to mgarepo. mgarepo will
check out the package in the main text repo with -r REV and then use
the "Last Changed Date" of `sha1.lst` to check out the binrepo part. Thus,
`sha1.lst` should always be committed to the main text repository *after* the
corresponding binary files have been committed to the binrepo. Hooks in the
main repository may be used to try to enforce this, by checking whether the
files changed in `sha1.lst` are already committed in the corresponding binrepo.
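
As an illustration of the date-based lookup, the sketch below turns the
"Last Changed Date" line of `svn info` output into a `{DATE}` revision
argument that could be passed to `svn checkout -r`. It assumes the default
(untranslated) `svn info` output; the function names are made up for this
example and are not part of mgarepo.

```python
from datetime import datetime, timezone


def last_changed_date(svn_info_text):
    """Extract the "Last Changed Date" of an `svn info` report as a datetime."""
    for line in svn_info_text.splitlines():
        if line.startswith("Last Changed Date:"):
            value = line.split(":", 1)[1].strip()
            # Drop the human-readable suffix, e.g. "(Fri, 04 Mar 2011)".
            stamp = value.split(" (", 1)[0]
            return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
    raise ValueError("no 'Last Changed Date' line found")


def date_revision(when):
    """Format a datetime as a subversion {DATE} revision argument, in UTC."""
    utc = when.astimezone(timezone.utc)
    return "{" + utc.strftime("%Y-%m-%dT%H:%M:%SZ") + "}"
```

The resulting string would then be used for the binrepo part, for example as
`svn checkout -r '{2011-03-04T08:20:30Z}' URL SOURCES-bin`.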

Computation of `sha1.lst` is unlikely to be an issue:

- it should not happen too often for any given package
- it takes[0] less than 10s to sha1sum all SOURCES of openoffice.org-3.1-1mdv2010.0.src.rpm
- it probably takes far less time than uploading the file into the repository
- it can be computed in parallel with the binrepo commit, and will probably
  finish before it, thus being ready by the time `sha1.lst` should be committed
- users don't need to verify the SHA1s "every time", but the build system
  does, thus Repsys can default to not verifying and avoid wasting users' time

The use of `sha1.lst` has the valuable property of tying together the state of
the main repository and the binrepo.  With it, at getsrpm time of a package
submission we can verify the SHA1 of the SOURCES-bin files, and be sure that
either the package will be built with the expected state, or the build will
fail early. It also allows for verifying binaries without trusting the binrepo,
which may be useful if we consider using an unversioned plain filesystem
storage in the future (for old distros or whatever), or on the "client side",
which maintainers may find useful.
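
A minimal sketch of generating and verifying such a listing follows. The
on-disk format of `sha1.lst` is not defined in this document; the sketch
assumes `sha1sum`-style lines (hex digest, two spaces, file name), and the
function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_listing(sources_bin):
    """Build the assumed sha1.lst content for the files in sources_bin."""
    entries = sorted(p for p in Path(sources_bin).iterdir() if p.is_file())
    return "".join(f"{sha1_of(p)}  {p.name}\n" for p in entries)


def verify_listing(sources_bin, listing):
    """Return the names that are missing or whose SHA1 does not match."""
    bad = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        target = Path(sources_bin) / name
        if not target.is_file() or sha1_of(target) != expected:
            bad.append(name)
    return bad
```

A build system could refuse to build whenever `verify_listing` returns a
non-empty list, while a client could skip the check by default.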

[0]: On a single-core AMD Athlon(tm) 3800+ (2400MHz)

Mapping of revisions using SVN properties
-----------------------------------------

As an alternative to the "sha1.lst scheme" above, the revision mapping
between the main repository and a binrepo could be done using subversion
properties.  This could be done by making every commit to binrepos also
cause a corresponding commit in the main text repository to happen, which
would update a property recording the current date.  That is, a subversion
property in the main text repository would be kept, such that for any given
main repository revision, the corresponding state of the binrepos is
obtainable (using the registered date).

This would be "more transparent", as it can be maintained simply by using
subversion hooks, without user intervention.  OTOH, since every commit to a
binrepo would result in a commit in the main repository, the user would have
to "svn up" the affected directories after every binrepo commit before
committing there again.  Also, this might result in a large number of
"bogus" commits to the main repository, which could be seen as log
pollution, and may potentially increase space usage, etc.

Why a new repository without the tarballs
==========================================

- the current svn repository is too large and hard to manage
- the history of big binary files (in general, "tarballs") is of little value
  in distro development; we care much more about our specs, patches,
  configurations, etc.  Nonetheless, those big files we don't care much for
  take the most resources and make backups and restoration in case of
  failure very expensive, much more so than the more valuable data
- there is no easy way to strip undesired tarballs without recreating the
  whole repository
- Fedora and Ubuntu have separate repositories, so we must have them too!

Numbers
-------

Circa 2011, the repository has over 390000 revisions and is ~340GB big, while
the bzip2ed dump backup for it takes a bit more than half that size (FIXME:
estimate; can't check in the backup server right now).  The current txtrepo,
with the same number of revisions, is ~180GB big and takes about 2-3 days to
be imported, while the gzipped full dump backup for it currently takes ~1.2GB.
The initial binrepo for cauldron (only the `current/` branches of packages)
took ~28GB on disk, the gzipped full dump for it takes ~25GB, and it took
about 5h30m to be populated from the repository currently in use ("oldrepo").


Drawbacks of this layout
=========================

- (always) everything that changes the single-repository usage increases the
  chance of failure and makes things more complicated.
- subversion can't be used alone as easily as the current scheme allows
- copying binaries between distro branches may not be "svn-cheap" anymore
  (unless they're in the same binrepo)
- ...


Open issues
============

Multiple binrepos don't allow us to have one permanent URL
----------------------------------------------------------

We would have to update the configuration files of all users in order
to add a new stable repository. spuk suggests using properties in the main
text repo that would point to the right repository locations.

How to handle failures when operating on more repositories?
-----------------------------------------------------------

binrepos should replicate the structure of the main text repo. What should
we do if the markrelease succeeds in the binrepo, but fails in the main
text repo?

R: Markrelease must be done first in the txtrepo. If it fails there "we're
in trouble" (though currently, we just miss it[0]).  When the markrelease is
done in the txtrepo, we can do markrelease in the binrepo using '-r {DATE}',
using the markrelease date in the txtrepo as '{DATE}'.

[0] We should add transaction support for markrelease. The transaction could
be stored outside the packages SVN (another SVN, a DB, a txt file, etc.), and
would work like:

0. mark beginning of markrelease, failing the package build early if this fails
1. do markrelease
2. mark successful end of markrelease
   or mark failed markrelease, so we can replay it later
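
The three steps above could be sketched with a plain text journal, one of the
storage options mentioned. The JSON-lines format, the state names and the
function names are assumptions made for this example.

```python
import json
import time


def record(journal_path, package, state):
    """Append one event; state is 'begin', 'done' or 'failed'."""
    entry = {"package": package, "state": state, "time": time.time()}
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(entry) + "\n")


def pending_replays(journal_path):
    """Return the packages whose latest event is not 'done'."""
    latest = {}
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            entry = json.loads(line)
            latest[entry["package"]] = entry["state"]
    return sorted(pkg for pkg, state in latest.items() if state != "done")


def markrelease_with_journal(journal_path, package, do_markrelease):
    """Run do_markrelease(package) between journal marks."""
    # Step 0: if this write fails, the exception aborts the build early.
    record(journal_path, package, "begin")
    try:
        do_markrelease(package)  # step 1
    except Exception:
        record(journal_path, package, "failed")  # step 2, failure case
        raise
    record(journal_path, package, "done")  # step 2, success case
```

A package left in the 'begin' state (for instance after a crash) is reported
by `pending_replays` just like a 'failed' one, so it can be replayed later.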


Interesting use cases (first phase)
===================================

mgarepo co 1/mutt
---------------------

- mgarepo checks out
  http://svn.mageia.org/svn/packages/updates/1/mutt/current to the
  mutt directory

- mgarepo checks out
  http://svn.mageia.org/svn/binrepo/updates/1/mutt/current/SOURCES
  into mutt/SOURCES-bin

- creates symlinks in ../SOURCES/ for all files found in SOURCES-bin/

  (rpm doesn't handle symlinks, this allows us to have explicit links and
   a proper src.rpm generated by rpmbuild)

If the path doesn't exist in the binrepo, the checkout will not fail, as we
may not have imported all packages yet, or the repository may not be prepared
to work with this model, etc.
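
The symlink step could look roughly like the sketch below. It is only an
illustration: the function name is hypothetical, and skipping dot-files (to
leave out working-copy metadata such as `.svn`) is an assumption.

```python
import os


def link_binaries(package_dir):
    """Symlink the files of SOURCES-bin/ into SOURCES/; return linked names."""
    sources = os.path.join(package_dir, "SOURCES")
    sources_bin = os.path.join(package_dir, "SOURCES-bin")
    if not os.path.isdir(sources_bin):
        return []  # package not in the binrepo (yet): not an error
    os.makedirs(sources, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(sources_bin)):
        if name.startswith("."):
            continue  # skip working-copy metadata such as .svn
        link = os.path.join(sources, name)
        if os.path.lexists(link):
            continue  # never overwrite an existing file or link
        # Relative target, so the checkout can be moved as a whole.
        os.symlink(os.path.join("..", "SOURCES-bin", name), link)
        linked.append(name)
    return linked
```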

markrelease of a package
------------------------

::

   $ mgarepo markrelease 

- will copy current/ to releases/VERSION/RELEASE, as usual

- will copy current/ to releases/, on the binrepo too

Optionally, markrelease could create revprops indicating which revision of
current/ in the binrepo represents the tarballs that are being tagged.


Use cases to be implemented after the first phase
=================================================

upgrading to a newer version of the package
-------------------------------------------

::

  $ cd bla/SOURCES/
  $ wget https://prdownloads.sourceforge.net/bla/bla-1.6.tar.bz2
  $ mgarepo upload bla-1.6.tar.bz2

- mgarepo notices this is a tarball (checking filename and/or file size)

- mgarepo will move the file to SOURCES-bin/, create the symlink, and svn-add
  it to the working copy

::

  $ # the user updates the spec
  $ mgarepo rm SOURCES/bla-1.5.1.tar.bz2

- it will remove the symlink and run svn rm on
  SOURCES-bin/bla-1.5.1.tar.bz2

::

  $ cd ../ # package top dir
  $ mgarepo ci

- mgarepo will commit the new tarball on SOURCES-bin/ and then on the rest
  of the working copy

mgarepo sync would perform these steps too.
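
The "checking filename and/or file size" step mentioned above could be
sketched as follows. The suffix list and the size limit are placeholders
chosen for this example, not values taken from mgarepo.

```python
import os

# Hypothetical heuristics: neither the suffixes nor the limit come from mgarepo.
BINARY_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")
SIZE_LIMIT = 1024 * 1024  # bytes


def looks_like_tarball(path, size_limit=SIZE_LIMIT):
    """Guess whether a source file belongs in SOURCES-bin/."""
    name = os.path.basename(path).lower()
    if name.endswith(BINARY_SUFFIXES):
        return True
    # Fall back to the size check for unknown file names.
    return os.path.getsize(path) > size_limit
```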

importing a package
-------------------

::

  $ mgarepo putsrpm mypkg.src.rpm

- mgarepo will open the src.rpm

- will look for tarballs inside SOURCES/ and import them to
  http://svn.mageia.org/svn/binrepo/cauldron/mypkg/current/SOURCES/

- will move the tarballs out of SOURCES and import the remaining files to
  http://svn.mageia.org/svn/packages/cauldron/mypkg/current/

- will do whatever else putsrpm already does

TODO
=====

First phase
-----------

- upload
- markrelease
- putsrpm
- getsrpm


Second phase
------------

- up
- sync

Rejected or postponed ideas
===========================

Use of a plain filesystem storage for the tarballs
--------------------------------------------------

This was planned, then rejected. It becomes too complicated when thinking
about markrelease and mapping SVN revisions in the main repository to
binaries versions in the "tarballs storage", basically requiring us to
implement VCS-like features on top of a filesystem.  It would also require
implementing another authentication and access scheme.  The main feature
would be the ease of removing old binaries, which isn't much of a point
because we don't know precisely what and when we want to remove, so we may
end up not removing many files anyway.

Use of a plain unversioned filesystem storage for the tarballs
--------------------------------------------------------------

Unlike the previous one, this would mean not relying at all on keeping
the history of binary files.  The structure could be something simple like::

  packages/${pkg:0:1}/$pkg/$tarball

This alternative does not suffice for cauldron, nor for supported distros, for
which we want history.  It could, however, at some point be used for "very
old" distros, for which we may have lost interest in keeping *binaries*
history (package history will be kept "forever" in the main SVN repository).
Alternatively, "resetting" an SVN binrepo (i.e. recreating the repository) to
contain only the latest tarballs would probably take about the same amount
of space, anyway...
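
For reference, the path layout shown above (first letter of the package name,
then the package, then the tarball) corresponds to this small helper; the
function name is made up for the example.

```python
import posixpath


def storage_path(pkg, tarball):
    """Return packages/<first letter of pkg>/<pkg>/<tarball>."""
    if not pkg:
        raise ValueError("package name must not be empty")
    return posixpath.join("packages", pkg[0], pkg, tarball)
```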

Open tarballs repository
------------------------

This idea is not really rejected. It does not go against splitting txtrepo
and binrepo, but rather complements it: the
open-tarballs-repository would take the place of the binrepo.  The txtrepo
would still be used in more or less the same way.  This repository could be
used selectively, for packages where it makes sense, while most packages
could be kept "closed", still as tarballs.

Use of externals for more seamless Subversion usage
---------------------------------------------------

This idea is not discarded, but it only provides convenience. OTOH, it makes
things more complicated:

- markrelease: externals would have to be updated in order to make them point
  to the tagged version in the binrepo, otherwise changes in
  current@binrepo would change older releases;
- branching whole distro: even though subversion now supports "relative
  externals", we would have to update the URLs for *every* package on the
  distro, as the path to reach the binrepo spans the local distribution
  directory;
- keeping externals up-to-date (as stated above and below)
- authentication and access control: only markrelease action done by the
  build system should be allowed to change externals (so what about importing
  new packages?)
- just a convenience, we don't need and shouldn't rely on externals for
  running the build system, while most people will use the repositories via
  Repsys, so why spend time to implement and keep it?
- "svn co" works transparently, cool, but "svn co -r N" does not, otherwise
  every change in the binrepo would require svn:externals to be updated in
  the respective package;
- it does not solve the problem of creating and handling symlinks between
  SOURCES and SOURCES-bin.

Keeping svn:externals updated for every package has almost the same cost as
keeping `sha1.lst` updated, with the difference that in the latter case we
would not have to update every package when creating distro branches.

Use of "external" xdelta to save space on binaries
--------------------------------------------------

But how? A first idea is that this could be done by defining a protocol and
assuming repository manipulation with mgarepo (for ease).  Repsys could
xdelta tarballs and add the deltas to SVN with a special filename, then use
them when checking out.  It would also require a policy/algorithm for when to
ditch old whole binaries (i.e. hopefully this wouldn't need to be handled
manually by the maintainer).  Also, this is complementary to splitting the
repository, so we may do it later, for binrepos.


The Future
==========

- Open tarballs repositories

  - suited for GIT, maybe multi-VCS
  - incremental move
  - not everything will be suited for this, must handle all cases or be
    optional

- Xdelta