
It was a good run. Longer than most, I would say. But after two months of getting my hands dirty with NATS as I’ve been building out Nits, I discovered my first gripe: controlling read/write permissions for KV and Object stores.

Having reached a point where I could successfully deploy NixOS closures to my test VMs, I began reviewing my prototype and looking for areas where I could improve.

I say I went looking for places to improve, but in truth I went straight to the permissions for my agent processes, which had been bugging me ever since I first figured them out:

# generate users for the agent vms
for AGENT_DIR in $VM_DATA_DIR/*; do
   NKEY=$(${self'.packages.nits}/bin/nits-agent nkey "$AGENT_DIR/ssh_host_ed25519_key")
   BASENAME=$(basename $AGENT_DIR)

   nsc add user -a numtide -k $NKEY -n $BASENAME \
    --allow-pub nits.log.agent.$NKEY \
    --allow-sub nits.inbox.agent.$NKEY.> \
    --allow-pub \$JS.API.STREAM.INFO.KV_deployment,\$JS.API.CONSUMER.CREATE.KV_deployment \
    --allow-sub \$JS.API.CONSUMER.DELETE.KV_deployment.> \
    --allow-sub \$JS.API.DIRECT.GET.KV_deployment.\$KV.deployment.$NKEY,\$KV.deployment.$NKEY \
    --allow-pub \$JS.API.STREAM.INFO.KV_deployment-result,\$KV.deployment-result.$NKEY \
    --allow-pub \$JS.API.STREAM.INFO.KV_nar-info,\$JS.API.DIRECT.GET.KV_nar-info.> \
    --allow-pub \$JS.API.STREAM.INFO.OBJ_nar,\$JS.API.STREAM.MSG.GET.OBJ_nar \
    --allow-pub \$JS.API.STREAM.NAMES,\$JS.API.CONSUMER.CREATE.OBJ_nar,\$JS.FC.OBJ_nar.> \
    --allow-pub \$JS.API.CONSUMER.DELETE.OBJ_nar.> \
    --allow-sub \$O.nar.>

   nsc describe user -n $BASENAME -R > $AGENT_DIR/user.jwt
   echo "$NKEY" > "$AGENT_DIR/nkey.pub"
done

Yeah, it looks a bit shit. And if I’m honest, I’m not 100% sure I achieved what I wanted to, which was:

  • Allow reads (only) against the deployment KV Store for a particular key
  • Allow writes (only) against the deployment-result KV Store for a particular key
  • Allow reads (only) against the nar Object Store and the nar-info KV Store

How We Got Here

As I develop Nits, there are two principles I try and follow:

  1. Keep it simple: the fewer moving parts, the better.
  2. Get as much out of the underlying language and technology choices as possible.

It’s for these reasons that I chose NATS in the first place.

When I needed to distribute NixOS closures to my agent processes, instead of introducing a third-party dependency, I built a Nix Binary Cache on top of KV and Object stores.

This helped keep the security boundary easy to reason about and reduce the number of dependencies for the agent process. Similarly, I opted for more KV stores when it came to tracking deployments and their results.

Through the nats CLI, I can view, modify and listen for changes without building that admin functionality myself. As you can see above, though, the story around permissions for KV and Object Stores is incomplete.

Why So Complicated?

As I explained in a previous post, KV and Object stores are just streams under the hood with some API sugar. This means you must consider all the operations that can be performed on the underlying stream when setting permissions.

Consider the deployment KV store, for example:

    # allow retrieving info about the KV_deployment stream which backs the deployment KV store
    --allow-pub \$JS.API.STREAM.INFO.KV_deployment
    # allow creating/removing consumers for the KV_deployment stream
    --allow-pub \$JS.API.CONSUMER.CREATE.KV_deployment
    --allow-sub \$JS.API.CONSUMER.DELETE.KV_deployment.>
    # allow retrieving the $NKEY sub key and listening for updates to it
    --allow-sub \$JS.API.DIRECT.GET.KV_deployment.\$KV.deployment.$NKEY,\$KV.deployment.$NKEY

Whilst there are some internal helper APIs like $JS.API.DIRECT.GET, there is not yet any API that lets you manage permissions for KV and Object stores as a logical unit rather than in terms of their underlying implementation.

Given how brittle and error-prone the current situation is, I have decided to start moving things behind service endpoints instead, starting with the binary cache. This has reduced many of the permissions above to a single --allow-pub nits.cache.>.
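To give a sense of the difference, here is a rough sketch of what the cache-related grants from the script above collapse into (same account, key and name flags as before; the deployment KV grants are unchanged for now):

# the pile of JetStream subjects for the nar and nar-info stores becomes
# a single subject for the binary cache service
nsc add user -a numtide -k "$NKEY" -n "$BASENAME" \
    --allow-pub "nits.cache.>"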

But to get there, I had to decide how to implement services over NATS.

Hello Old Friend

At a fundamental level, you could say that NATS does not impose many patterns. It has a spectrum of capabilities, many of which you may never need, and when you do need them, NATS tries to let you, the developer, make the decisions about how best to employ them.

When deciding how best to move the binary cache behind a service endpoint, I couldn’t find much to recommend how to proceed.

Much of the internal NATS API relies on subject hierarchies to convey actions, with JSON bodies where required. There is also a nascent micro-services framework, mostly focused on adding discoverability and metrics.

Once again, it doesn’t impose much of a pattern on the implementation.

I did see some recent efforts to generate microservices from Protobuf RPC specs, but on the whole and as best I can tell, it seems everyone rolls their own.

So left to my own devices, I returned to something tried and tested to see if I could make it work: HTTP.

Prior Art

If you google “HTTP over NATS”, you’ll come across a sample project by the main man, the head honcho, El Jefe himself, Derek Collison, who captures the serialised form of HTTP requests and responses and sticks them in a NATS message body.

Whilst functional, it was never intended to be a robust solution and suffers from a few issues which will become apparent later.

The top search result, however, is an excellent post by Peter Gillich in which he fleshes out a more robust approach to bridging HTTP over NATS and explains some of the reasons why you would want to.

That article was a starting point for my attempt: nats-http.

The Happy Path

At its core, HTTP (1.x) is a text-based protocol in which a message begins with a request line (or, for a response, a status line), followed by a series of headers, and ends with an optional body.

# request

GET / HTTP/1.1
Host: google.com
User-Agent: curl/8.1.1
Accept: */*

# response

HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Content-Security-Policy-Report-Only: object-src 'none';base-uri 'self';script-src 'nonce-vPYivhh4U0JhyU9UXothtA' 'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle.com/csp/gws/other-hp
Date: Mon, 03 Jul 2023 14:16:13 GMT
Expires: Wed, 02 Aug 2023 14:16:13 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

In NATS, a message consists of a subject, headers and data. It is easy to see, then, that we can represent an HTTP request or response as a NATS message with little effort.

func ReqToMsg(req *http.Request) (msg *nats.Msg, err error) {
  msg = nats.NewMsg("foo.bar") // our server is listening on "foo.bar"

  // get a reference to the msg headers
  h := msg.Header

  // capture method
  h.Set("X-Method", req.Method)

  // capture path properties with some custom headers
  h.Set("X-Path", req.URL.Path)
  h.Set("X-Query", req.URL.RawQuery)
  h.Set("X-Fragment", req.URL.RawFragment)

  // add request headers
  for key, values := range req.Header {
    for _, value := range values {
      h.Add(key, value)
    }
  }

  // copy the body (it can be nil, e.g. for a GET request)
  if req.Body != nil {
    msg.Data, err = io.ReadAll(req.Body)
  }

  return
}
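Going the other way is much the same in reverse. Here's a rough sketch of how a NATS reply message might be turned back into an *http.Response; the X-Status header is an assumption for the purposes of this sketch, not necessarily how nats-http encodes the status:

// MsgToResp sketches the inverse mapping: a NATS reply message becomes an *http.Response.
// The "X-Status" header name is an assumption for this example.
func MsgToResp(msg *nats.Msg) (*http.Response, error) {
  code, err := strconv.Atoi(msg.Header.Get("X-Status"))
  if err != nil {
    return nil, fmt.Errorf("invalid status header: %w", err)
  }

  // copy the message headers into HTTP response headers
  // (the custom X-* headers come along for the ride; good enough for a sketch)
  header := make(http.Header)
  for key, values := range msg.Header {
    for _, value := range values {
      header.Add(key, value)
    }
  }

  return &http.Response{
    Status:        fmt.Sprintf("%d %s", code, http.StatusText(code)),
    StatusCode:    code,
    Proto:         "HTTP/1.1",
    ProtoMajor:    1,
    ProtoMinor:    1,
    Header:        header,
    Body:          io.NopCloser(bytes.NewReader(msg.Data)),
    ContentLength: int64(len(msg.Data)),
  }, nil
}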

This approach will get you pretty far and is not too dissimilar to the method described by Peter Gillich, albeit using the Headers API. For my use case, serving Nar archives for a Nix binary cache, it doesn’t take long before we run into a problem: max message size.

Chunked Transfers

By default, NATS is configured to allow messages with a maximum size of 1 MB. This can be increased up to 64 MB. As you can imagine, some of the Nar archives we need to transfer will be more than 64 MB. There’s no way around it, then. We need to support chunked transfers.

Given that HTTP is typically implemented over a connection-oriented transport such as TCP, it’s not immediately obvious how to implement chunked transfers over NATS. Whilst the NATS client always maintains an open socket with the server, the messages we pass on top of that connection have no sense of a logical connection.

At first, I considered using an Object store.

If a message body was unknown at the time of sending or too large for a single message, I could upload the body to an Object store and then embed a reference in the original message.

Whilst this might seem elegant at first glance, it adds a dependency on JetStream, which some clients may not have access to, and generally felt over-engineered.

In the end, I settled on a much simpler approach:

  1. Determine if the request needs to be chunked. This is done by examining the Transfer-Encoding header if present or comparing the Content-Length with the conn.MaxPayload() size.
  2. If a chunked transfer is required, we send the first message as normal with the request headers and put as much of the request body as possible into the message payload.
  3. We then wait for a response message from the server, which should include a private inbox. We then send additional messages with the remaining chunks to that inbox until the entire request body has been read.
  4. Once the body has been read in full, we indicate the end of the transfer by sending a final message with no headers and an empty body.

You may be wondering why we are waiting for a response from the server before continuing with the transfer. That is for a very good reason.

If the responder listening on the other end is doing so as part of a Queue Group, we cannot ensure the same server instance will receive all the messages we send.

To overcome this, we wait for the responder to generate a private inbox that only it listens to and to which we can send the additional chunks. This way, regardless of whether or not the responder is participating in a Queue Group, we can be sure the same instance will process all of our messages.

In addition, this handshake approach relies solely upon core NATS functionality and does not require JetStream to be enabled.
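Sketched out in Go, the sender side of this handshake looks roughly like the following. The X-Chunk-Inbox header name is my own invention for the sake of the example, and I'm ignoring the overhead of the headers in the first message for brevity:

// sendChunked is a rough sketch of the sender side of the chunked transfer
// handshake described above; it is not the exact nats-http implementation.
func sendChunked(nc *nats.Conn, subject string, header nats.Header, body io.Reader) error {
  buf := make([]byte, int(nc.MaxPayload()))

  // first message: request headers plus as much of the body as will fit
  first := nats.NewMsg(subject)
  first.Header = header
  n, err := io.ReadFull(body, buf)
  if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
    return err
  }
  first.Data = append([]byte(nil), buf[:n]...)

  // the responder replies with a private inbox that only it listens to
  resp, err := nc.RequestMsg(first, 10*time.Second)
  if err != nil {
    return err
  }
  inbox := resp.Header.Get("X-Chunk-Inbox") // hypothetical header name

  // stream the remaining chunks to the responder's private inbox
  for {
    n, err := io.ReadFull(body, buf)
    if n > 0 {
      if err := nc.Publish(inbox, append([]byte(nil), buf[:n]...)); err != nil {
        return err
      }
    }
    if err == io.EOF || err == io.ErrUnexpectedEOF {
      break
    }
    if err != nil {
      return err
    }
  }

  // an empty message with no headers marks the end of the transfer
  return nc.Publish(inbox, nil)
}

On the other side, the responder would create a private inbox (for example with nats.NewInbox()), subscribe to it, and reassemble the chunks until it sees the empty terminator message.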

Leveraging Subject Hierarchies

At this point, I had integrated nats-http into Nits and was feeling pretty satisfied.

Having moved the binary cache behind a service endpoint and simplified the permissions for an agent, I was thinking about what other services I might want to implement for things such as administration, agent provisioning, etc.

That’s when I started thinking about authorisation and realised I had missed a trick with the subject hierarchy I was using. I had kept it too simple.

A request with the URL nats+http://foo.bar/hello/world would target a service listening on the foo.bar subject.

Nothing about the path or the HTTP method was available in the subject, and you had to look at the message headers for that information instead.

If a user was allowed to publish to a subject foo.bar, they could target any path that foo.bar served.

As you may have noticed earlier, the internal NATS API subjects use the subject hierarchy to carry a bit more information:

  • $JS.API.STREAM.MSG.GET.OBJ_nar
  • $JS.API.CONSUMER.CREATE.KV_deployment

Such a hierarchy allows us to later constrain who can do what by limiting who can publish or subscribe to a given subject.

So taking a leaf out of Synadia’s book, I changed how URLs are mapped to subjects.

Now a GET request for a URL nats+http://foo.bar/hello/world is mapped to a subject foo.bar.hello.world.GET, bringing two main benefits:

  • It is possible to execute fine-grained wiretaps on all or a subset of requests based on URL and HTTP method
  • It is possible to constrain user permissions to specific paths and methods using NATS-native permissions without the need for an additional authorisation mechanism at the router level in the server.
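For illustration, the mapping itself might look something like this (a sketch of the idea; nats-http's actual implementation may differ, and I'm glossing over escaping of characters that aren't valid in NATS subjects):

// subjectFor sketches the URL-to-subject mapping described above: the host
// provides the base subject, each path segment becomes a token, and the
// HTTP method is appended last.
func subjectFor(u *url.URL, method string) string {
  tokens := []string{u.Host}
  for _, segment := range strings.Split(strings.Trim(u.Path, "/"), "/") {
    if segment != "" {
      tokens = append(tokens, segment)
    }
  }
  return strings.Join(append(tokens, method), ".")
}

With the method and path encoded in the subject, a user can then be limited to, for example, publish permission on foo.bar.hello.> to reach only the /hello routes, with no extra authorisation logic in the service itself.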

Summary

In this post, I’ve shown that there is still work to be done regarding permissions for KV and Object stores.

If you have a use case where you need to control fine-grained access, you’re better off sticking all interaction with them behind a service endpoint and leveraging the subject hierarchy, as I have shown.

If you do decide to go the HTTP over NATS route, feel free to give nats-http a shake, but please understand it is heavily focused on my use cases.

I’m sure there are some aspects of HTTP I haven’t mapped quite right, but I will put more effort into improving the tests and documentation over the coming months.

And it goes without saying, but I’ll say it anyway: contributions and feedback are most welcome 🙏