Navigation

Music

Looking for some new music? Then follow this link and download four great songs from Selma's latest demo Silently I'll wrap my ears around your mouth:

www.selmamusic.tk

Have those playing while reading on!

HTTP Replicator

by Gertjan van Zwieten gertjanvanzwieten@fastmail.fm

README (text version)

project statistics

Freshmeat Project

HTTP Replicator is a general purpose, replicating HTTP proxy server. All downloads through the proxy are checked against a private cache, which is an exact copy of the remote file structure. If the requested file is in the cache, replicator sends it out at LAN speeds. If not in the cache, it will simultaneously download the file and stream it to multiple clients. No matter how many machines request the same file, only one copy comes down the Internet pipe. Very useful for maintaining a cache of Debian or Gentoo packages.

History

A while ago, I was looking for a way to cache my downloaded debian (unstable) packages. I have some computers on a LAN, and it seemed like a waste of time and bandwidth to download the same packages over and over again for each computer. So I looked into some existing solutions…

squid

Squid is a very well known proxy server, which (amongst others) maintains a cache of downloaded files. Quite a few people seem to use squid for the purposes I'm interested in. What bothers me is that these files are accessible only by squid itself. It is (as far as I know) not possible to get a list of all cached files, let alone to manually install or copy a file from cache. This is probably a consequence of squids immense functionality, but for my purpose it seems most logical to just store files as files.

apt-proxy

Apt-proxy is a proxy server written specially for the purpose of maintaining a debian cache. The current version is a bash script, so speed is not optimal but it easily beats my internet connection… Actually it isn't a real proxy server but a normal http server that serves files from some configured remote hosts. These hosts are configured in both sources.list and the apt-proxy configuration file. I believe either location has its pros and cons, but in requiring both you'll end up with cons only. I've used it for a while, but I got a strange http error every now and then that screwed up apt so I had to constantly restart. There is also a beta rewrite in python, which I also tried, but I don't remember being to satisfied with that either.

apt-cacher

This one looked really promising. It is a specialized debian proxy server, just like apt-proxy, but it is written in perl as an apache module. Again, this isn't a real proxy, but this time sources are defined in sources.list only so that's an improvement. Moreover, this one works (!) and it even seems to store files in the normal way that seems so obvious to me. But… when inspecting the files in cache they appear to be somehow altered, for a reason I cannot understand, but they can't be viewed nor installed manually. Apart from that I don't like the idea of running an entire apache server for a debian cache only.

debproxy

Yet another specialized proxy server, a true proxy this time. As you can see from the "this project has not released any files" it never came out of cvs. I can't remember if I ever had it up and running. However judging from the description this project would have been ideal if it would have been finished. It is a real proxy, simply caching all files that are downloaded through it or serving from cache if they're there already. Creator lordsuch named this kind of proxy a "replicating proxy". This name seemed so obvious that I immediately googled for more… but only debproxy showed up. That's when I began to realize that the proxy I was looking for didn't yet exist. Hence my own attempt:

Replicator

Replicator is a general purpose proxy server. That is, not specifically geared towards debian caching. It can be used for any http client, for instance a web browser. While browsing the web it will simply save all the files that go through it. The data is transparently streamed to all clients simultaneously requesting the same file, so everything is downloaded only once. When a requested file is already in cache a check is added to the http header to see if the file is modified since the time it was cached. If it isn't than the file is served from cache, preventing it from being downloaded again. Real speed gains are obtained when the server is configured not to check for modifications but to serve the cached file directly.

To use replicator for debian caching modify or create /etc/apt/apt.conf to contain the following line:

Acquire::http::Proxy "http://(HOST):(PORT)";

Now apt and similar will download through the replicator proxy, caching all packages that are installed. This has worked flawlessly on my computer for quite some time now. The advantage of a general proxy rather than a specialized one becomes apparent when installing packages such as msttcorefonts and flashplugin-nonfree. These packages offer the possibility to use a proxy server, so not only the packages themselves but also related files can be cached. As of version 2.0 replicator is also suitable for maintaining a gentoo package cache, mainly because of the new flat mode - see below. And as of version 2.1 it is also possible to forward requests to an external proxy server.

In version 3.0 a lot of new features are introduced, such as… cache browsing! Yes, that's right, replicator can now act as a web server. Simply visit the address where you have replicator configured to listen for proxy connections in a browser and you'll get an html listing of the files in cache. And the great thing is that the streaming principle still applies, so even incomplete files can be served. Another new feature is client-side support for range requests, so incomplete downloads can now be resumed. For a complete list of changes see the changelog. All this added functionality required some fairly deep changes in code. The interested will notice that the LocalResponse construct has disappeared, which is what I personally like most about this version :-).

The following command line options control the server:

Options

-p, —port PORT
The proxy port on which the server listens for http requests. Default is 8080.
-i, —ip IP
The ip addresses from which access is allowed, optionally with wildcards ? and * for single and multiple digits respectivily. Defaults to localhost * and can be used multiple times.
-d, —dir DIR
The directory to store the cached files in. By default files are simply saved under the current directory.
-a, —alias STR
Colon separated list of mirrors, i.e. hosts serving identical content. Useful for caching (debian) packages from different mirrors to the same directory, like this: cache/dir:host1/somepath:host2/other/path. Can be used multiple times.
-s, —static
Static mode: files are known to never change so files that are present are served from cache directly without contacting the server. Primarily useful for gentoo caching, not for debian because of the changing Packages.gz and Release files.
-f, —flat
Flat mode: all files are saved in a single directory. Primary use for gentoo.
-e, —external
Forward requests to an external proxy server, specified as host:port or username:password@host:port if the server requires authentication.
-q, —quiet
Decrease verbosity. Messages on stdout all have a certain level of importancy. By repeatedly selecting this option the least important messages are skipped.

Daemon options

—daemon
Switches to daemon mode and activates the following options:
—log LOG
Write output to a log file instead of stdout.
—pid PID
Write the process id to a pid file instead of stdout.
—user USER
Change the process user id after opening the log and pid file.

Debugging options

—debug
Set the lowest possible logging level (the quiet option is ignored) and activate the following options:
—intolerant
Crash on unhandled exceptions.

When the server is started from /etc/init.d the command line options are read from /etc/default/http-replicator. Here two variables are defined: DAEMON_OPTS that contains the command line options for the server and MAINTENANCE_OPTS that contains the options for the maintenance (cron) script. When replicator is just installed all files are in place but the server is not yet activated. For that you must remove the exit command in /etc/default/http-replicator. This is done for two reasons. First, you might not want to have the server running all the time but just use it from the command line. As I said, package caching is not replicator's only purpose. Second, if you do want to have replicator running all the time you need to create the caching directory first and set up the right permissions. You may also want to change the port and select the static or flat operation mode before starting the server.

I already mentioned the cron maintenance script that comes with replicator. This script helps keeping the package cache down in size. Without such script a package cache tends to grow very rapidly to immense proportions. The maintenance script simply deletes all packages of which more than a certain number of newer versions are present. This number is set on the commandline with —keep, which is read from /etc/default/http-replicator. The script is disabled by setting the number to zero or less.

Links

Some websites that are somehow related to this project: