Title: OpenBSD: pkg_add performance analysis
Author: Solène
Date: 08 July 2021
Tags: bandwidth openbsd unix
Description: 

# Introduction

The OpenBSD package manager pkg_add is known to be quite slow and to
use a lot of bandwidth.  I'm trying to figure out easy ways to improve
it, and I may have nailed something today by replacing the ftp(1) HTTP
client with curl.

# Testing protocol

On an OpenBSD -current amd64 system, I used the command "pkg_add -u -v
| head -n 70", which checks for updates of the first 70 packages and
then stops.  The packages tested are always the same, so the test is
reproducible.

The traditional "ftp" is tested, as well as "curl" and "curl -N" (the
-N flag disables curl's output buffering).

Bandwidth usage was measured with "pfctl -s labels", using a match rule
that matches the mirror IP; the counters were reset after each test.
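
As a rough sketch of that accounting step, the helper below sums the
byte counter of one label from "pfctl -s labels" output.  The label
name "pkgmirror" and the function name are made up for the example, and
it assumes the fourth field of each line is the total byte count, as
described in pfctl(8).

```shell
#!/bin/sh
# Sum the "bytes total" column for one label in "pfctl -s labels" output.
# Assumptions: the label (here the hypothetical name "pkgmirror")
# contains no spaces, and field 4 of each line is the total byte count.
label_bytes() {
    # usage: pfctl -s labels | label_bytes LABEL
    awk -v label="$1" '$1 == label { sum += $4 } END { printf "%.0f\n", sum }'
}

# On the test machine (as root) a run could look like this:
#   pfctl -z                          # clear per-rule statistics
#   pkg_add -u -v | head -n 70        # the test itself
#   pfctl -s labels | label_bytes pkgmirror
```

Clearing the statistics with "pfctl -z" before each run is one way to
get the per-test reset mentioned above.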

# What happens when pkg_add runs

Here is a quick intro to what happens in the code when you run "pkg_add
-u" against an http:// mirror:

* pkg_add downloads the package list from the mirror (which could be
considered to be an index.html file), which weighs ~2.5 MB.  If you add
two packages separately, the index is downloaded twice.
* pkg_add runs /usr/bin/ftp on the first package to upgrade in order to
read its first bytes, which are piped to gunzip (done in Perl by
pkg_add) and then to signify to check the package signature.  The
"signature" here is the list of dependencies and their versions, which
pkg_add uses to know whether the package requires an update; the
signify signature of the whole package is stored in the gzip header,
for the case where the whole package is downloaded (there are 2
signatures: signify and the package dependencies, don't be misled!).
* if everything is fine, the package is downloaded and the old one is
replaced.
* if there is no need to update, the package is skipped.
* each new package means a new connection with ftp(1) and new pipes to
set up.

Using the FETCH_CMD variable, it's possible to tell pkg_add to use a
command other than /usr/bin/ftp, as long as it understands the "-o -"
parameter and also "-S session" for https:// connections.  Because curl
doesn't support the "-S session=..." parameter, I used a shell wrapper
that discards this parameter.
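
The wrapper itself is not reproduced here, so the following is only a
minimal sketch of what such a script could look like.  It assumes
pkg_add passes "-S" and its "session=..." value as two separate
arguments; the CURL_BIN variable and the function name are invented for
the example, and the curl flags are the ones quoted in the conclusion.

```shell
#!/bin/sh
# Hypothetical FETCH_CMD wrapper: forwards every argument to curl except
# ftp(1)'s "-S session=..." option, which curl does not understand.
# CURL_BIN is an illustrative override, not something pkg_add knows about.
CURL_BIN=${CURL_BIN:-/usr/local/bin/curl}

fetch_without_session() {
    remaining=$#
    while [ "$remaining" -gt 0 ]; do
        arg=$1
        shift
        remaining=$((remaining - 1))
        case "$arg" in
            -S)
                # Drop "-S" and the value that follows it, if any.
                if [ "$remaining" -gt 0 ]; then
                    shift
                    remaining=$((remaining - 1))
                fi
                ;;
            *)
                # Keep the argument by re-appending it to the list.
                set -- "$@" "$arg"
                ;;
        esac
    done
    "$CURL_BIN" -L -s -q -N "$@"
}

# Only act when called with arguments, as pkg_add does.
if [ "$#" -gt 0 ]; then
    fetch_without_session "$@"
fi
```

Saved under a path of your choice (for instance the made-up
/usr/local/bin/curl-fetch) and made executable, it could then be used
with 'FETCH_CMD="/usr/local/bin/curl-fetch"'.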

# Raw results

I measured the whole execution time and the total bytes downloaded for
each combination.  I don't show all the results, but I ran the tests
multiple times and the standard deviation is close to 0, meaning that a
test run multiple times gave the same result each time.

```
operation               time to run     data transferred
---------               -----------     ----------------
ftp http://             39.01           26
curl -N http://         28.74           12
curl http://            31.76           14
ftp https://            76.55           26
curl -N https://        55.62           15
curl https://           54.51           15
```

# Analysis

There are a few surprising facts in the results.

* ftp(1) does not take the same time over http and https, although it
is supposed to reuse the same TLS socket to avoid a handshake for every
package.
* ftp(1) bandwidth usage is drastically higher than curl's, and the
time seems proportional to the bandwidth difference.
* curl -N and curl perform almost identically over https.
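
To put numbers on the comparison between ftp(1) and curl, the ratios
can be computed directly from the results table.  This is just a small
helper built around awk; the function name is made up, and only ratios
are printed, so the units of the table do not matter.

```shell
#!/bin/sh
# Ratios of ftp(1) to "curl -N", using the figures from the results table.
ratio() {
    # usage: ratio NUMERATOR DENOMINATOR
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "http  time ratio: $(ratio 39.01 28.74)"
echo "http  data ratio: $(ratio 26 12)"
echo "https time ratio: $(ratio 76.55 55.62)"
echo "https data ratio: $(ratio 26 15)"
```

Putting the time and data ratios side by side makes it easier to judge
how closely the run time follows the amount of data transferred.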

# Conclusion

Using http:// is much faster than https://.  The risk is about privacy:
in case of a man in the middle, the downloaded packages will be known,
but the signify signature will prevent any maliciously modified package
from being installed.  Using 'FETCH_CMD="/usr/local/bin/curl -L -s -q
-N"' gave the best results.

However, I can't yet explain the very different behaviors between ftp
and curl, or between http and https.

# Extra: set a download speed limit to pkg_add operations

By using curl as FETCH_CMD, you can add the "--limit-rate 900k"
parameter to limit the transfer speed to the given rate.
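
As an illustration, the snippet below builds such a FETCH_CMD value.
The function name is invented for the example; the curl path and base
flags are the ones from the conclusion, and the pkg_add invocation is
left as a comment because it needs root and a configured mirror.

```shell
#!/bin/sh
# Build a FETCH_CMD value that caps curl's transfer rate.
# fetch_cmd_with_limit is a hypothetical helper name.
fetch_cmd_with_limit() {
    # $1 is a rate accepted by curl's --limit-rate, e.g. 900k
    printf '%s\n' "/usr/local/bin/curl -L -s -q -N --limit-rate $1"
}

# Possible use with an http:// mirror, left commented out:
#   FETCH_CMD="$(fetch_cmd_with_limit 900k)" pkg_add -u
fetch_cmd_with_limit 900k
```

curl reads the value as bytes per second, so "900k" means roughly 900
kilobytes per second.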