NixOS: the power of VM tests
For the past couple of months I’ve been working on another project for the people over at Clan called Data Mesher.
As the name might suggest, it’s a decentralised, eventually consistent data store. Long-term, we want it to support flexible data schemas and merge strategies. Right now though, we’re focused on one use case in particular: decentralised DNS.
I think it’s fair to say that it’s nothing too fancy. In fact, we’re trying very hard to keep it as not fancy as possible. Yet, as simple as we are trying to keep it, it quickly becomes difficult to test once you start putting the pieces together.
In the past few days, I’ve been putting effort into some simulation-style tests to try and tease out bugs and convince myself that it works as expected. But until now, like the cowboy I am, I’ve been doing a fair amount of manual testing as I’ve wrestled with, and been smacked around by, the problem in general.
But that’s not to say I haven’t had any automated testing.
In fact, thanks to the NixOS Testing Framework, I have had a powerful smoke-test which has been running through the most common scenarios and letting me know when I break something fundamental.
Why so hard to test?
Without going into too much detail, Data Mesher currently has two components:
- A service, written in Go, which shares state as part of a memberlist cluster.
- An NSS module, which integrates the DNS entries generated by Data Mesher into host lookup on the machine where it’s running.
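For the curious, wiring an NSS plugin into host lookup on NixOS looks roughly like the sketch below. The system.nssModules and system.nssDatabases.hosts options are real NixOS options; the package and service names are hypothetical stand-ins, not necessarily what Data Mesher actually uses:

{ pkgs, ... }:
{
  # make the plugin's libnss_*.so visible to glibc's NSS machinery
  system.nssModules = [ pkgs.data-mesher-nss ]; # hypothetical package name

  # consult it during host lookups, alongside the usual sources
  system.nssDatabases.hosts = [ "data-mesher" ]; # hypothetical NSS service name
}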
Now, you can run several instances of Data Mesher locally, overriding certain behaviours such as IP resolution, and visually inspect the output of the dns.json file being generated.
This is, in fact, what I have been doing.
But for a truly representative test, we need to involve the NSS module.
And for that, we need to configure a few different machines with Data Mesher running and the NSS module loaded. From there, we can test how host resolution behaves in real-world conditions as nodes come and go from the cluster, and when multiple machines try to claim the same hostname.
Doing this manually every time a change is made is not an option.

NixOS VM Tests
It turns out that when you build a declarative operating system like NixOS, it doesn’t take long to realise that, with some clever plumbing, you can create a test driver which lets you write Python scripts that fire up and drive QEMU-based virtual machines.
It’s called the NixOS Testing Framework, and it’s fucking awesome! 🎉
Here is what it looks like in action, performing the kinds of things I mentioned above.
{
  lib,
  dmModule,
  nixosTest,
}:
nixosTest {
  name = "data-mesher-boot";

  nodes =
    let
      # elided for brevity
      # mkNode is a helper function which helps with repetitive aspects of the NixOS config for each machine
    in
    {
      alpha = mkNode {
        name = "alpha";
        hostnames = [
          "mercury"
          "venus"
          "earth"
        ];
        initNetwork = true;
      };

      beta = mkNode {
        name = "beta";
        hostnames = [
          "mars"
          "jupiter"
        ];
      };

      gamma = mkNode {
        name = "gamma";
        hostnames = [
          "earth"
          "saturn"
          "uranus"
          "neptune"
        ];
      };
    };

  testScript = ''
    # wait until every hostname in `success` resolves to each of its expected IPs,
    # and check that every hostname in `fail` does not resolve
    def resolve(node, success={}, fail=[], timeout=60):
        for hostname, ips in success.items():
            for ip in ips:
                node.wait_until_succeeds(f"getent ahosts {hostname} | grep {ip}", timeout)
        for hostname in fail:
            node.wait_until_fails(f"getent ahosts {hostname}")

    # start a bootstrap node in isolation and check it can resolve its local services
    alpha.wait_for_unit("data-mesher.service")
    resolve(alpha, {
        "mercury.sol": ["2001:db8:1::1", "192.168.1.1"],
        "venus.sol": ["2001:db8:1::1", "192.168.1.1"],
        "earth.sol": ["2001:db8:1::1", "192.168.1.1"]
    })

    # start the other nodes and check for all expected names
    beta.wait_for_unit("data-mesher.service")
    gamma.wait_for_unit("data-mesher.service")
    for node in [alpha, beta, gamma]:
        resolve(node, {
            "mercury.sol": ["2001:db8:1::1", "192.168.1.1"],
            "venus.sol": ["2001:db8:1::1", "192.168.1.1"],
            "earth.sol": ["2001:db8:1::1", "192.168.1.1"],
            "mars.sol": ["2001:db8:1::2", "192.168.1.2"],
            "jupiter.sol": ["2001:db8:1::2", "192.168.1.2"],
            "saturn.sol": ["2001:db8:1::3", "192.168.1.3"],
            "uranus.sol": ["2001:db8:1::3", "192.168.1.3"],
            "neptune.sol": ["2001:db8:1::3", "192.168.1.3"]
        })

    # stop alpha and check that its claims expire
    alpha.shutdown()
    for node in [beta, gamma]:
        resolve(node, {
            "earth.sol": ["2001:db8:1::3", "192.168.1.3"],  # earth has reverted to gamma
            "mars.sol": ["2001:db8:1::2", "192.168.1.2"],
            "jupiter.sol": ["2001:db8:1::2", "192.168.1.2"],
            "saturn.sol": ["2001:db8:1::3", "192.168.1.3"],
            "uranus.sol": ["2001:db8:1::3", "192.168.1.3"],
            "neptune.sol": ["2001:db8:1::3", "192.168.1.3"]
        }, [
            "mercury.sol",
            "venus.sol"
        ])

    # stop beta and check that its claims expire
    beta.shutdown()
    resolve(gamma, {
        "earth.sol": ["2001:db8:1::3", "192.168.1.3"],
        "saturn.sol": ["2001:db8:1::3", "192.168.1.3"],
        "uranus.sol": ["2001:db8:1::3", "192.168.1.3"],
        "neptune.sol": ["2001:db8:1::3", "192.168.1.3"]
    }, [
        "mercury.sol",
        "venus.sol",
        "mars.sol",
        "jupiter.sol"
    ])

    # start alpha again
    # it should reconnect and reclaim its names
    alpha.wait_for_unit("data-mesher.service")
    for node in [alpha, gamma]:
        resolve(node, {
            "mercury.sol": ["2001:db8:1::1", "192.168.1.1"],
            "venus.sol": ["2001:db8:1::1", "192.168.1.1"],
            "earth.sol": ["2001:db8:1::3", "192.168.1.3"],  # still retained by gamma
            "saturn.sol": ["2001:db8:1::3", "192.168.1.3"],
            "uranus.sol": ["2001:db8:1::3", "192.168.1.3"],
            "neptune.sol": ["2001:db8:1::3", "192.168.1.3"]
        }, [
            "mars.sol",
            "jupiter.sol"
        ])

    # stop the service on alpha and check that it stops reporting the alpha dns entries
    alpha.stop_job("data-mesher.service")
    resolve(alpha, {}, [
        "mercury.sol",
        "venus.sol",
        "earth.sol",
        "mars.sol",
        "jupiter.sol",
        "saturn.sol",
        "uranus.sol",
        "neptune.sol"
    ])
  '';
}
And here is what that test is doing:
- Starts a VM called alpha, waits for the data-mesher systemd service to start successfully, and then tests host resolution, waiting until it reaches our expected state.
- Starts two more VMs called beta and gamma, waiting for their data-mesher systemd services to start.
- Tests host resolution again, from the perspective of each machine, based on the names each is configured to claim, waiting as needed for state to propagate and the cluster to reach a steady state.
- Stops alpha, checking that host resolution eventually changes as expected when its host claims expire.
- Stops beta, again checking that host resolution eventually changes.
- Starts alpha again, waiting for host resolution to reach our expected state.
- Finally, stops the data-mesher service on alpha and checks that none of the Data Mesher names resolve locally any more.
All of this is done with complete NixOS configurations similar to a real-world deployment, exercising the NixOS module for Data Mesher, the systemd service running the Data Mesher instance, and the NSS module.
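Better still, the whole test is just a derivation. If the file above lived at tests/boot.nix in a flake (both assumptions for illustration), exposing it as a check would be a one-liner:

# hypothetical flake wiring; callPackage fills in lib and nixosTest from pkgs
checks.x86_64-linux.data-mesher-boot = pkgs.callPackage ./tests/boot.nix {
  inherit dmModule;
};

From there, nix flake check, or nix build .#checks.x86_64-linux.data-mesher-boot, boots the VMs headlessly and fails the build if the test script throws.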
This is just phenomenal, and in my opinion, one of the killer features of NixOS.
Summary
I didn’t want to get into the nitty-gritty of how the NixOS Testing Framework works, but rather to highlight just how powerful it can be, and perhaps provide a little taster for those who are Nix-curious.
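To that end, here is roughly the smallest complete example I can manage, using the testers.runNixOSTest entry point from nixpkgs: one throwaway VM, a plain NixOS configuration, and a two-line test script.

pkgs.testers.runNixOSTest {
  name = "minimal";

  # a single VM with an ordinary NixOS configuration
  nodes.machine =
    { pkgs, ... }:
    {
      environment.systemPackages = [ pkgs.hello ];
    };

  testScript = ''
    machine.wait_for_unit("multi-user.target")  # wait for boot to finish
    machine.succeed("hello | grep -q 'Hello, world!'")
  '';
}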
On more than one occasion, it has allowed me to write automated, realistic tests with ease for complex multi-machine setups, something I just couldn’t have done otherwise.
What I’m using it for in Data Mesher so far only scratches the surface of what it’s capable of.
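For example, the driver also ships primitives like block() and unblock(), which detach and reattach a VM’s virtual network interface. That is exactly what you would want for partition testing; a hypothetical scenario inside a testScript might look like this:

# cut beta off from the virtual network and watch the cluster react
beta.block()
alpha.wait_until_fails("ping -c1 beta")

# heal the partition and check connectivity returns
beta.unblock()
alpha.wait_until_succeeds("ping -c1 beta")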
And if you like the test framework but don’t want to use NixOS, you don’t have to. You can use nix-vm-test to re-use the test driver with Ubuntu, Debian and Fedora instead!