Pre-proposal

CS189B – Senior Computer Programming Project

apt-got

http://www.apt-got.com/

 

Computer Science Capstone

 

Tobias Hertkorn        t_hertkorn@umail.ucsb.edu


Project outline

Packages for the Debian distribution are obtained from public HTTP mirrors. To save bandwidth, our program will act as a drop-in, stand-alone proxy for the internal network that also stores requested packages locally. Packages that are not yet stored locally are fetched transparently from the parent server when a client requests them.

Vision

Problem Statement

To write a program that improves the performance of apt-get, the Debian Linux application for updating and installing packages.

Key high-level goals

System features

Project status

This project was not made up for this class. I already use it, and will continue to use it, on the servers I administer in Germany. In fact, the system as it stands is ready for use: although some things are still not very comfortable, everything is reasonably safe and stable.

Because this system will be under constant improvement, we tried to implement a highly modular system that will be easy to extend and enhance after the end of this class. We strictly followed the model-view-controller approach, which will allow us to change or replace parts of the project.

For example, the server that is part of the project could be replaced by an existing HTTP server or servlet container such as Tomcat or Jetty.

Our efforts to follow this approach, and to do a good system design in general, did a lot to cut down the actual implementation time. They also allowed us to assign one person to each part of the project and to synchronize progress easily.

Planned Project Improvements

Critical

-          Create configuration infrastructure.

-          Dynamically load modules, prepare for different kinds of module flavors

-          Solve “blocking”-problem when file is not yet downloaded.

-          Create package-lists parser.

-          Create intelligent caching/purging algorithms.

Important

-          Increase delivery speed to the client by creating an Apache module.

-          Parse all available source-lists and merge them into one.

-          Get MD5sum from source-lists and check downloaded data before storing.

-          Create statistic-/download tracking-extensions.

Optional

-          Create pre-fetch mechanism for frequently requested files and/or dependencies.

Detailed description

Configuration infrastructure

Currently, no configuration values can be changed on the fly. Changing the configuration requires editing and recompiling com.debianmirror.mirror.data.XmlMirrorConf.java.

This is unacceptable. Users must be able to alter the settings of apt-got by editing a text file. In addition, command-line options must be provided for pointing to alternative configuration files.
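A minimal sketch of what this could look like, assuming a Java properties file as the text format. The class name AptGotConfig, the default file name apt-got.properties, the option --config and the setting keys are all hypothetical; none of them is fixed by this proposal.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical sketch: file name, option name and keys are assumptions.
final class AptGotConfig {
    static final String DEFAULT_FILE = "apt-got.properties"; // assumed default

    // Returns the path given after "--config", or the default when the option is absent.
    static Path configPath(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("--config".equals(args[i])) {
                return Paths.get(args[i + 1]);
            }
        }
        return Paths.get(DEFAULT_FILE);
    }

    // Loads "key = value" settings from any character source.
    static Properties load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    // Loads settings from a text file on disk.
    static Properties load(Path file) throws IOException {
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return load(in);
        }
    }

    // Reads an integer setting, falling back to a default when the key is missing.
    static int intSetting(Properties props, String key, int fallback) {
        String value = props.getProperty(key);
        return value == null ? fallback : Integer.parseInt(value.trim());
    }
}
```

With something like this, a changed setting would take effect after a restart, with no recompilation needed.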

Dynamically load modules

If the default Debian-related behavior is not desired, a different class name can be specified in the configuration. As long as this class implements com.debianmirror.mirror.module.MirrorModule (which will become an interface later), the customized module will be loaded and put to work.
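The loading step could be sketched with Java reflection as follows. The MirrorModule interface shown here is only a stand-in for com.debianmirror.mirror.module.MirrorModule, and its name() method, the ModuleLoader class and the simplified DebianMirrorModule are hypothetical.

```java
// Stand-in for com.debianmirror.mirror.module.MirrorModule; the method is hypothetical.
interface MirrorModule {
    String name();
}

// Simplified stand-in for the default module, used when no other class is configured.
class DebianMirrorModule implements MirrorModule {
    public String name() {
        return "debian";
    }
}

final class ModuleLoader {
    // Instantiates the class named in the configuration via its no-argument constructor.
    // Classes that do not implement MirrorModule are rejected.
    static MirrorModule load(String className) {
        try {
            Class<?> raw = Class.forName(className);
            Class<? extends MirrorModule> type = raw.asSubclass(MirrorModule.class);
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load module " + className, e);
        }
    }
}
```

Failing early with a clear message when the configured class is missing or of the wrong type seems preferable to a failure on the first client request.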

Solve “blocking”-problem

This is the current situation: when a file is not yet fully downloaded, the client is put on hold until the download completes. Only then is the first data sent to the client. This can seriously delay the whole process, especially for large files. It could even cause timeouts on the client side.

This must change. Data that is already available should be sent to the client immediately, while the rest of the download continues.
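One possible shape of the fix is a copy loop that writes each chunk to the client and to the cache file as soon as it arrives. This is only a sketch: the class name TeeTransfer is hypothetical, and it covers the single-client case, not a second client asking for a file that is still being downloaded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class TeeTransfer {
    // Copies the upstream data to both the client and the cache file, chunk by chunk,
    // so the client receives bytes as they arrive instead of after the full download.
    static long copy(InputStream upstream, OutputStream client, OutputStream cacheFile)
            throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = upstream.read(buffer)) != -1) {
            client.write(buffer, 0, read);
            client.flush();                 // push the chunk out right away
            cacheFile.write(buffer, 0, read);
            total += read;
        }
        cacheFile.flush();
        return total;                       // number of bytes transferred
    }
}
```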

Create package lists parser

Debian uses package lists to announce which packages are available (example: http://ftp.debian.org/debian/dists/woody/main/binary-i386/Packages). These lists also carry package specifics, such as the md5sum of the file. This information must be parsed so that it is available to the various parts of the DebianMirrorModule.
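A package list consists of stanzas of "Field: value" lines separated by blank lines, with continuation lines starting with whitespace. A parser for that layout might look like the sketch below; the class name PackagesParser and the choice of a plain map per stanza are assumptions, not the planned design.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PackagesParser {
    // Parses stanzas of "Field: value" lines; each stanza becomes one map.
    static List<Map<String, String>> parse(Reader source) throws IOException {
        List<Map<String, String>> stanzas = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        String lastField = null;
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                // A blank line ends the current stanza.
                if (!current.isEmpty()) {
                    stanzas.add(current);
                }
                current = new LinkedHashMap<>();
                lastField = null;
            } else if (Character.isWhitespace(line.charAt(0))) {
                // A continuation line is appended to the previous field.
                if (lastField != null) {
                    current.merge(lastField, line.trim(), (a, b) -> a + "\n" + b);
                }
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    lastField = line.substring(0, colon);
                    current.put(lastField, line.substring(colon + 1).trim());
                }
            }
        }
        if (!current.isEmpty()) {
            stanzas.add(current);   // last stanza without a trailing blank line
        }
        return stanzas;
    }
}
```

Other parts of the module could then look up fields such as Filename or MD5sum by name.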

Create intelligent caching/purging algorithms

Right now apt-got does not purge old packages at all. It also has no space limit. In the worst case, it will fill your disk to the last byte.

At a minimum, there will be a user option to specify how much space apt-got is allowed to use. The plan is also to include an algorithm that purges packages that are no longer listed in the package list.
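The two rules above could be sketched as follows. The class name PurgePlanner and the representation of the cache as a map from file name to size in bytes are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PurgePlanner {
    // Returns the cached files that no longer appear in the package list.
    static List<String> unlisted(Map<String, Long> cachedSizes, Set<String> listedFiles) {
        List<String> purge = new ArrayList<>();
        for (String file : cachedSizes.keySet()) {
            if (!listedFiles.contains(file)) {
                purge.add(file);
            }
        }
        return purge;
    }

    // True when the cache would still exceed the configured limit
    // after the given files have been purged.
    static boolean overLimit(Map<String, Long> cachedSizes, List<String> purged, long limitBytes) {
        long total = 0;
        for (Map.Entry<String, Long> entry : cachedSizes.entrySet()) {
            if (!purged.contains(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total > limitBytes;
    }
}
```

If overLimit still returns true after the unlisted files are gone, a further rule is needed to decide which listed files to drop.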

Increase delivery speed to the client by creating an Apache module

The Apache API allows the creation of modules to fine-tune the behavior of Apache.

The idea is to create a module that lets Apache deliver all files it finds (that is, files already cached). If it does not find the requested file, it forwards the request to apt-got. This will greatly improve delivery speed: Apache is more than 5 times faster at delivering files than our Java HTTP engine.

Parse all available source-lists and merge them into one

The idea behind this is that multiple remote package archives can be specified for one module. The package lists of all archives will be merged into one big package list.
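A sketch of the merge step, assuming each list has already been parsed into one map of fields per package. The class name PackageListMerger is hypothetical, and so is the conflict rule used here: when two archives list the same package and version, the entry from the archive configured first is kept.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PackageListMerger {
    // Merges the package lists of several archives into one list.
    // Entries are identified by package name and version; the first archive wins.
    static List<Map<String, String>> merge(List<List<Map<String, String>>> archives) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        for (List<Map<String, String>> archive : archives) {
            for (Map<String, String> entry : archive) {
                String key = entry.get("Package") + "_" + entry.get("Version");
                merged.putIfAbsent(key, entry);
            }
        }
        return new ArrayList<>(merged.values());
    }
}
```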

Get MD5sum and check data

This is an extension to the caching algorithm. Before a package is stored in the local cache, it is checked against the MD5sum provided in the package list. If the checksum does not match, the file is not stored.
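The check itself can be done with the standard java.security.MessageDigest class. The sketch below hashes a byte array for brevity; the class name Md5Check is hypothetical, and a real implementation would more likely feed the digest while the file is being downloaded.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Check {
    // Returns the MD5 digest of the data as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True when the data matches the MD5sum taken from the package list.
    static boolean matches(byte[] data, String expectedMd5) {
        return md5Hex(data).equalsIgnoreCase(expectedMd5.trim());
    }
}
```

A package would only be moved into the cache when matches returns true.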

Create statistic-/download tracking-extensions

Create an extension to the modules that tracks downloads and produces download statistics. It might be used in the future to further improve the purging/caching algorithm.
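At its simplest, such an extension is a thread-safe counter per file. The class name DownloadStats and its two methods are hypothetical, and the sketch keeps the counts in memory only.

```java
import java.util.concurrent.ConcurrentHashMap;

final class DownloadStats {
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    // Records one delivery of the given file; safe to call from several request threads.
    void record(String file) {
        counts.merge(file, 1L, Long::sum);
    }

    // Returns how often the file has been delivered so far (0 if never).
    long count(String file) {
        return counts.getOrDefault(file, 0L);
    }
}
```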

Create pre-fetch mechanism

The statistics extension can be used to further improve the caching algorithm. If almost every version of a certain package has been downloaded so far, chances are that the next version will be needed as well. Knowing this, apt-got should download the next available version of the package before it is even requested.

On the other hand, files that are not requested very often should be the first to be purged if apt-got runs out of disk space.
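Both heuristics can be expressed as small functions over the download counts. The sketch below is one possible reading of them: the class name CacheHeuristics is hypothetical, and the threshold that stands for "almost every version" is an assumed tuning parameter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class CacheHeuristics {
    // True when the share of known versions that were downloaded reaches the threshold,
    // for example 0.8 for "at least 80 percent of all versions".
    static boolean shouldPrefetch(int downloadedVersions, int knownVersions, double threshold) {
        return knownVersions > 0
                && (double) downloadedVersions / knownVersions >= threshold;
    }

    // Orders files so that the least requested ones come first (purged first);
    // ties are broken by file name to keep the order deterministic.
    static List<String> purgeOrder(Map<String, Long> requestCounts) {
        List<String> files = new ArrayList<>(requestCounts.keySet());
        files.sort(Comparator.comparingLong((String f) -> requestCounts.get(f))
                .thenComparing(Comparator.naturalOrder()));
        return files;
    }
}
```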