Title: BTRFS deduplication using bees
Author: Solène
Date: 16 August 2022
Tags: nixos btrfs linux
Description: This explains how to use bees to enable offline
deduplication on a BTRFS file system

# Introduction

BTRFS is a Linux file system that uses a Copy On Write (COW) model.  It
provides many features such as on-the-fly compression, volume
management, snapshots and clones.

Wikipedia page about Copy on write

However, BTRFS doesn't natively support deduplication, a feature that
looks for identical chunks of data across files: when two files contain
the same chunk, only one copy is stored on disk and both files
reference it.  In some scenarios, this can drastically reduce disk
space usage.

This is where we can use "bees", a program that can do offline
deduplication for BTRFS file systems.  In this context, offline means
it's done when you run a command, as opposed to live (on-the-fly)
deduplication, which is applied instantly as data is written.  The
HAMMER file system from DragonFly BSD does offline deduplication, while
ZFS does it live.  There are pros and cons to both models: the ZFS
documentation recommends 1 GB of memory per terabyte of disk when
deduplication is enabled, because it needs to keep all the chunk hashes
in memory.

Bees GitHub project page

# Usage

Bees is a service you need to install and start on your system.  It has
some documented limitations and caveats, but it should work for most
users.

You define the BTRFS file system on which you want deduplication and a
load target.  Bees works silently while your system load is below the
threshold, and stops when the load exceeds it.  This is a simple
mechanism to prevent bees from eating all your system resources when
freshly modified or created files need to be scanned.

The first time you run bees on a file system that is not empty, it may
take a while to scan everything.  After that it's really quiet, except
when you do heavy I/O such as downloading big files, and even then it
does a good job of staying behind the scenes.

# Installation on NixOS

Add this code to /etc/nixos/configuration.nix and run "nixos-rebuild
switch" to apply the changes.

```
services.beesd.filesystems = {
  root = {
    spec = "LABEL=nixos";
    hashTableSizeMB = 256;
    verbosity = "crit";
    extraOptions = [ "--loadavg-target" "2.0" ];
  };
};
```

The code assumes your root partition is labelled "nixos", that you want
a hash table of 256 MB (this will be used by bees), and that you don't
want bees to run when the system load is above 2.0.

You may want to tune the values, mostly the hash table size, depending
on your file system size.  Bees is designed for terabyte-sized file
systems, but this doesn't mean you can't use it on an average user's
disk.
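
As an illustration of that tuning, here is a minimal sketch of a second
entry for a larger data disk.  Everything specific in it is an
assumption made for the example, not a recommendation: the "data" name
and label, the 2 TB size, and the ratio of 128 MB of hash table per
terabyte.  The rounding helper is hypothetical too; it is there because
the NixOS option, as far as I know, expects a multiple of 16.  Check
the bees documentation before picking a real value.

```
services.beesd.filesystems.data =
  let
    # Assumptions for illustration only, not values from this article
    diskSizeTB = 2;   # size of the hypothetical "data" file system
    mbPerTB = 128;    # hash table MB per TB of data, to be tuned
    # Hypothetical helper: round up to the next multiple of 16
    roundUp16 = n: ((n + 15) / 16) * 16;
  in
  {
    spec = "LABEL=data";
    hashTableSizeMB = roundUp16 (diskSizeTB * mbPerTB);  # 256 here
    verbosity = "crit";
    extraOptions = [ "--loadavg-target" "2.0" ];
  };
```

The same attributes can instead be written as a "data" entry next to
"root" inside the existing services.beesd.filesystems set.  Computing
the size from the disk size only keeps the reasoning visible in the
configuration; a plain number works just as well.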

# Results

I tried it on my workstation, which holds a lot of build artifacts and
git repositories: bees reduced the disk usage from 160 GB to 124 GB
(36 GB saved, about 22%), so it's a huge win here.

Later, I tried again on some Steam games with a few Proton versions: it
didn't save much on the games, but saved a lot on the Proton
installations.

On my local cache server, it saved nothing, but this is to be expected.

# Conclusion

BTRFS is a solid alternative to ZFS: it requires less memory while
providing volumes, snapshots and compression.  The only thing I was
missing was deduplication, and I'm glad it's offline, so it doesn't use
too much memory.