I’ve been working with Ethereum for about five years now, and most of that time has been focused in one way or another on interacting with Ethereum clients.

During that time there has always been one particular pain in the ass: syncing.

Depending on which client implementation you choose and whether or not you want access to all archive data, syncing an execution client against mainnet can take anywhere from a few hours to a few months and require anywhere from a few hundred GB of storage to more than 12 TB!

Whilst there are plans to address this, such as history and state expiration, for a while now I’ve been musing about ways to jump-start new instances without having to wait around for them to sync.

This week I finally had the chance to start working on it as part of ethereum.nix.

The Problem

When using snap sync, one of the faster available sync modes, you’re still looking at a sync time of a few hours and a download size on the order of 30 GB or more. Times may vary depending on the quality of the peers you are connected to and your available bandwidth.

Typically, outside of home setups, you’re going to be running at least a few instances within the same data center or, like my current client, several instances across several data centers.

So why are we spending all that time downloading the same data over and over when the client next door already has it?

“Why don’t you just set up each of your clients as trusted peers of one another and optimise the snap sync?”

We could do that, but downloading the data is only part of it. Intermediate hashes within the state trie still need to be computed, and so on; otherwise it wouldn’t take 2-3 hours to download 30 GB of data.

Again, why re-compute all that state when the client next door has already done it, and your nodes are all sitting in the same data center with a high bandwidth connection between them?

The Solution

Since the client next door has what we need, let’s just make a copy.

Most of the contents of the data directory for an Ethereum client are chain-related, with only a small amount being unique to a given node.

Look at Geth’s data directory for example:

.
├── geth
│   ├── chaindata
│   │   ├── 000066.ldb
│   │   ├── 000067.ldb
│   │   ├── ...
│   │   ├── ancient
│   │   │   └── chain
│   │   │       ├── bodies.0000.cdat
│   │   │       ├── bodies.cidx
│   │   │       ├── bodies.meta
│   │   │       ├── diffs.0000.rdat
│   │   │       ├── diffs.meta
│   │   │       ├── diffs.ridx
│   │   │       ├── FLOCK
│   │   │       ├── hashes.0000.rdat
│   │   │       ├── hashes.meta
│   │   │       ├── hashes.ridx
│   │   │       ├── headers.0000.cdat
│   │   │       ├── headers.cidx
│   │   │       ├── headers.meta
│   │   │       ├── receipts.0000.cdat
│   │   │       ├── receipts.cidx
│   │   │       └── receipts.meta
│   │   ├── CURRENT
│   │   ├── CURRENT.bak
│   │   ├── LOCK
│   │   ├── LOG
│   │   └── MANIFEST-000075
│   ├── LOCK
│   ├── nodekey
│   ├── nodes
│   │   ├── 000135.ldb
│   │   ├── 000151.log
│   │   ├── 000153.ldb
│   │   ├── CURRENT
│   │   ├── CURRENT.bak
│   │   ├── LOCK
│   │   ├── LOG
│   │   └── MANIFEST-000137
│   ├── transactions.rlp
│   └── triecache
│       ├── data.0.bin
│       └── metadata.bin
└── keystore

Aside from the nodekey, keystore and maybe transactions.rlp (need to check that one), everything else is related to the state of the chain.

This may vary depending on the client implementation, but for the most part the data directory is re-usable.

Our first step, then, towards being able to share this state between clients is to implement a lightweight mechanism for snapshotting this data directory. That is what the rest of this blog post will focus on.

BTRFS

When we started discussing snapshotting, my mind naturally went to BTRFS subvolumes. I’ve been using BTRFS for a couple of years now, and in my experience it has been stable and without issue.

That being said, I have heard some anecdotal evidence from others about stability issues. A quick google reveals more discussion around this topic, but it’s difficult to tease apart which issues are historic, which come down to not RTFM’ing, and which are genuine problems.

There’s currently an ongoing discussion within Numtide where we are trying to filter the noise from the signal and come to a conclusion for ourselves.

For the time being, I’m going to continue under the assumption that BTRFS is a good choice and see what mileage I get, since it makes what I want to do really easy as you’ll see in a bit.

Managing the data directory

Ethereum.nix allows you to define a Geth execution client running against the Sepolia Testnet like this:

  services.ethereum.geth.sepolia = {
    enable = true;
    openFirewall = true;

    args = {
      network = "sepolia";
      http = {
        enable = true;
        addr = "0.0.0.0";
        vhosts = ["localhost" "phoebe"];
        api = ["net" "web3" "sealer" "eth"];
      };
      authrpc.jwtsecret = sops.secrets.geth_jwt_secret.path;
    };
  };

This creates a Systemd service for running the client process and follows a predictable naming convention: in this instance, geth-sepolia.service.

One notable aspect of the service to be aware of for the purposes of snapshotting is that DynamicUser is enabled. Why is that important?

Well if we look at the module definition, we can see that:

  • the State Directory is being set to /var/lib/geth-sepolia,
  • it is being passed to Geth via --datadir %S/${serviceName}, where %S expands to /var/lib.

    scriptArgs = let
      ...
    in ''
      --ipcdisable ${network} ${jwtSecret} \
      --datadir %S/${serviceName} \
      ${concatStringsSep " \\\n" filteredArgs} \
      ${lib.escapeShellArgs cfg.extraArgs}
    '';
in
    nameValuePair serviceName (mkIf cfg.enable {
      after = ["network.target"];
      wantedBy = ["multi-user.target"];
      description = "Go Ethereum node (${gethName})";

      # create service config by merging with the base config
      serviceConfig = mkMerge [
        baseServiceConfig
        {
          User = serviceName;
          StateDirectory = serviceName; # /var/lib/<serviceName>
          ExecStart = "${cfg.package}/bin/geth ${scriptArgs}";
        }
        (mkIf (cfg.args.authrpc.jwtsecret != null) {
          LoadCredential = "jwtsecret:${cfg.args.authrpc.jwtsecret}";
        })
      ];
    })
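
For completeness: DynamicUser isn’t visible in this excerpt; it is switched on inside the baseServiceConfig that gets merged in above. A rough sketch of what that might contain (illustrative only; the actual ethereum.nix definition includes a number of additional hardening options):

  baseServiceConfig = {
    DynamicUser = true;
    Restart = "always";       # illustrative; the real value may differ
    # ... plus various sandboxing / hardening settings
  };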

Whenever DynamicUser is enabled, Systemd pulls a bit of a switcheroo on the process before startup by:

  • creating a sandboxed state directory located at /var/lib/private/<service-name>
  • creating a symlink at /var/lib/<service-name> that points to /var/lib/private/<service-name>

From the perspective of the host and from inside the service itself, nothing is different. But if you’re trying to manage the storage for a given service you need to be aware of this behaviour.

Replacing the state directory

Now that we know where the state directory for our service is going to be created, we can replace it with a BTRFS subvolume by adding a rule to systemd-tmpfiles:

v /var/lib/private/geth-sepolia

What this does is instruct Systemd to create a subvolume at the specified path, provided the path does not exist yet and the file system supports subvolumes.

There is one caveat: your / filesystem must also be a BTRFS subvolume.

Typically though, when I’m using BTRFS, my / filesystem will be a BTRFS volume, not a subvolume. But if you set the SYSTEMD_TMPFILES_FORCE_SUBVOL=1 environment variable, a subvolume will be created regardless of this constraint.
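
Here’s a minimal sketch of how this could be expressed in a NixOS configuration. The exact wiring inside Ethereum.nix may differ, and overriding the tmpfiles service’s environment is just one way of setting the variable:

  # create a BTRFS subvolume for the state directory if it doesn't already exist
  systemd.tmpfiles.rules = [
    "v /var/lib/private/geth-sepolia"
  ];

  # allow the subvolume to be created even when / is not itself a subvolume
  systemd.services.systemd-tmpfiles-setup.environment = {
    SYSTEMD_TMPFILES_FORCE_SUBVOL = "1";
  };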

And that’s it… well, not quite yet.

Disabling Copy-On-Write

By default, BTRFS uses Copy-On-Write to provide protection against data corruption and enable compression. However, this is not efficient when there are lots of small writes, and Ethereum clients are notorious for their frequent and voluminous random reads and writes.

There are, however, two ways we can disable copy-on-write:

  • mounting the filesystem with the nodatacow option, which disables it for everything on that filesystem, or
  • setting the C file attribute (chattr +C) on a specific file or directory.

Since we don’t want to accidentally disable Copy-On-Write for the whole filesystem, I opted for the file attribute approach. The way we can do this is by adding an ExecStartPre entry to the service config:

set -euo pipefail

# determine the private path to the volume mount
SERVICE_NAME=$(basename "$STATE_DIRECTORY")
VOLUME_DIR=/var/lib/private/$SERVICE_NAME

# ensure copy-on-write is disabled on the data directory
${pkgs.e2fsprogs}/bin/chattr +C "$VOLUME_DIR"

This runs before our client process and ensures the C file attribute is set on the volume directory before anything is written to it.
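
For context, here’s a minimal sketch of how such a script can be attached to the unit, reusing serviceName and baseServiceConfig from the module snippet earlier (the actual ethereum.nix code may organise this differently):

  serviceConfig = mkMerge [
    baseServiceConfig
    {
      # runs before ExecStart, with $STATE_DIRECTORY provided by Systemd
      ExecStartPre = pkgs.writeShellScript "${serviceName}-disable-cow" ''
        set -euo pipefail
        SERVICE_NAME=$(basename "$STATE_DIRECTORY")
        ${pkgs.e2fsprogs}/bin/chattr +C /var/lib/private/$SERVICE_NAME
      '';
      # ... ExecStart and the rest of the service config as shown earlier
    }
  ];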

Snapshotting

Now that we have replaced the state directory with a BTRFS subvolume, and ensured that copy-on-write is disabled such that we don’t adversely impact performance, we can finally get around to the task of snapshotting 🎉.

In the spirit of keeping things simple, for this first version I decided that every time the process stops cleanly, we should snapshot the state directory. That way, if we want to force a snapshot, we just restart the service.

We can achieve this by adding an ExecStopPost entry to the service config:

# check it was a clean shutdown before snapshotting
if [ "$EXIT_STATUS" -ne 0 ]; then
  echo "Unclean shutdown detected: $EXIT_STATUS, skipping snapshot"
  exit 1
fi

# determine the private path to the volume mount
SERVICE_NAME=$(basename "$STATE_DIRECTORY")
VOLUME_DIR=/var/lib/private/$SERVICE_NAME

# ensure the snapshot directory exists
${pkgs.coreutils}/bin/mkdir -p ${cfg.snapshotDirectory}

# ISO 8601 timestamp
TIMESTAMP=$(${pkgs.coreutils}/bin/date +"%Y-%m-%dT%H:%M:%S%:z")

# take a read-only snapshot of the state directory
${pkgs.btrfs-progs}/bin/btrfs subvolume snapshot -r "$VOLUME_DIR" ${cfg.snapshotDirectory}/$SERVICE_NAME-$TIMESTAMP

As you can see, this script is executed regardless of whether or not the process stopped cleanly. Systemd provides us with the exit status, allowing us to make sure we do not snapshot after an unclean shutdown.

Eventually, we could add some kind of integrity check of the state before snapshotting, but this is enough for the time being.

Summary

I’ve given you some insight into the issues surrounding syncing and state management for Ethereum clients, and also touched on how we are going about trying to solve some of these issues with Ethereum.nix.

There is a reason this blog post is titled Part 1 though, and that’s because we are not finished.

Now that we can successfully and simply snapshot a client’s state, the next step is to offload that to a storage server or service, and more importantly, download a state backup and prime a client’s state directory before startup.

That will happen over the next couple of weeks, so keep your eyes peeled for Part 2 👀.

In the meantime, everything I have described has been implemented as a NixOS module within Ethereum.nix and can be enabled like so:

  services.ethereum.snapshot = {
    enable = true;
    interval = "1d";                # restart the service once a day, snapshotting the state directory
    services = ["geth-sepolia"];
  };
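
Under the hood, the interval option boils down to restarting the unit on a schedule so that the ExecStopPost hook takes a snapshot. Purely as an illustration (not the actual module internals), that could look something like this:

  systemd.timers."geth-sepolia-restart" = {
    wantedBy = ["timers.target"];
    timerConfig = {
      OnCalendar = "daily";   # roughly what interval = "1d" corresponds to
      Persistent = true;
    };
  };

  systemd.services."geth-sepolia-restart" = {
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${pkgs.systemd}/bin/systemctl restart geth-sepolia.service";
    };
  };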