README (text version)
A while ago, I was looking for a way to cache my downloaded debian (unstable) packages. I have some computers on a LAN, and it seemed like a waste of time and bandwidth to download the same packages over and over again for each computer. So I looked into some existing solutions…
Squid is a very well known proxy server, which (amongst others) maintains a cache of downloaded files. Quite a few people seem to use squid for the purposes I'm interested in. What bothers me is that these files are accessible only by squid itself. It is (as far as I know) not possible to get a list of all cached files, let alone to manually install or copy a file from cache. This is probably a consequence of squids immense functionality, but for my purpose it seems most logical to just store files as files.
Apt-proxy is a proxy server written specially for the purpose of maintaining a debian cache. The current version is a bash script, so speed is not optimal but it easily beats my internet connection… Actually it isn't a real proxy server but a normal http server that serves files from some configured remote hosts. These hosts are configured in both sources.list and the apt-proxy configuration file. I believe either location has its pros and cons, but in requiring both you'll end up with cons only. I've used it for a while, but I got a strange http error every now and then that screwed up apt so I had to constantly restart. There is also a beta rewrite in python, which I also tried, but I don't remember being to satisfied with that either.
This one looked really promising. It is a specialized debian proxy server, just like apt-proxy, but it is written in perl as an apache module. Again, this isn't a real proxy, but this time sources are defined in sources.list only so that's an improvement. Moreover, this one works (!) and it even seems to store files in the normal way that seems so obvious to me. But… when inspecting the files in cache they appear to be somehow altered, for a reason I cannot understand, but they can't be viewed nor installed manually. Apart from that I don't like the idea of running an entire apache server for a debian cache only.
Yet another specialized proxy server, a true proxy this time. As you can see from the "this project has not released any files" it never came out of cvs. I can't remember if I ever had it up and running. However judging from the description this project would have been ideal if it would have been finished. It is a real proxy, simply caching all files that are downloaded through it or serving from cache if they're there already. Creator lordsuch named this kind of proxy a "replicating proxy". This name seemed so obvious that I immediately googled for more… but only debproxy showed up. That's when I began to realize that the proxy I was looking for didn't yet exist. Hence my own attempt:
Replicator is a general purpose proxy server. That is, not specifically geared towards debian caching. It can be used for any http client, for instance a web browser. While browsing the web it will simply save all the files that go through it. The data is transparently streamed to all clients simultaneously requesting the same file, so everything is downloaded only once. When a requested file is already in cache a check is added to the http header to see if the file is modified since the time it was cached. If it isn't than the file is served from cache, preventing it from being downloaded again. Real speed gains are obtained when the server is configured not to check for modifications but to serve the cached file directly.
To use replicator for debian caching modify or create /etc/apt/apt.conf to contain the following line:
Acquire::http::Proxy "http://(HOST):(PORT)";
Now apt and similar will download through the replicator proxy, caching all packages that are installed. This has worked flawlessly on my computer for quite some time now. The advantage of a general proxy rather than a specialized one becomes apparent when installing packages such as msttcorefonts and flashplugin-nonfree. These packages offer the possibility to use a proxy server, so not only the packages themselves but also related files can be cached. As of version 2.0 replicator is also suitable for maintaining a gentoo package cache, mainly because of the new flat mode - see below. And as of version 2.1 it is also possible to forward requests to an external proxy server.
In version 3.0 a lot of new features are introduced, such as… cache
browsing! Yes, that's right, replicator can now act as a web server.
Simply visit the address where you have replicator configured to
listen for proxy connections in a browser and you'll get an html
listing of the files in cache. And the great thing is that the
streaming principle still applies, so even incomplete files can be
served. Another new feature is client-side support for range requests,
so incomplete downloads can now be resumed. For a complete list of
changes see the changelog. All this added
functionality required some fairly deep changes in code. The
interested will notice that the LocalResponse construct has
disappeared, which is what I personally like most about this version
:-).
The following command line options control the server:
When the server is started from /etc/init.d the command line options
are read from /etc/default/http-replicator. Here two variables are
defined: DAEMON_OPTS that contains the command line options for the
server and MAINTENANCE_OPTS that contains the options for the
maintenance (cron) script. When replicator is just installed all
files are in place but the server is not yet activated. For that you
must remove the exit command in /etc/default/http-replicator. This
is done for two reasons. First, you might not want to have the server
running all the time but just use it from the command line. As I said,
package caching is not replicator's only purpose. Second, if you do
want to have replicator running all the time you need to create the
caching directory first and set up the right permissions. You may also
want to change the port and select the static or flat operation mode
before starting the server.
I already mentioned the cron maintenance script that comes with
replicator. This script helps keeping the package cache down in size.
Without such script a package cache tends to grow very rapidly to
immense proportions. The maintenance script simply deletes all
packages of which more than a certain number of newer versions are
present. This number is set on the commandline with —keep, which is
read from /etc/default/http-replicator. The script is disabled by
setting the number to zero or less.
Some websites that are somehow related to this project: