Prototyping a machine deployment tool with Spritely Goblins

by Jonathan Frederickson — Tue 07 January 2025

To wrap up this year's December Adventure, and following on from my last post, here's some early prototyping I did over the holidays on a machine management tool. After I published that post, I had a conversation with some friends in which it was difficult to convey exactly how this might work, so I'm writing this up partly to serve as an explainer for my thought process.

In this post I may assume some basic familiarity with object capabilities and Spritely Goblins, though I'll try to explain things to some degree as I go.

Goals

At a high level, what I'd like to accomplish is something like this: you have a centralized infra management tool that's responsible for provisioning new machines. When you want to create a new machine, you send a message to that management tool, and the response you get back is a reference to a management interface for that new machine.

Crucially, this system must effectively avoid the trust bootstrapping problem; that is to say, the new instance must be able to securely connect to the management tool, and the management tool must have confidence that whatever just connected to it really is the new instance. Despite seeming simple, this has been surprisingly difficult to accomplish with common devops tools in my experience! It can be done, but often either isn't, or takes a lot of work. So that's something I'd like to address right out of the gate.

Furthermore, I'd like to make it straightforward to connect services running on different machines to each other. Even when you have a single deployment tool deploying two different services, it's often more involved than you would hope to get those two services talking to each other. Consider the example of a web application and a database instance deployed specifically for that web application. You've deployed the web application onto one machine, and the database instance on another, and you need the web application to connect to the database. What do you need for this? Typically, at least:

- credentials the web application can use to authenticate to the database, and
- configuration telling the web application where the database lives and which credentials to use.

In my experience, both of those steps are often done semi-manually: someone generates database credentials by hand, and while the webapp's configuration is config-managed, it pulls those static database credentials from some source (maybe a sops-encrypted file, maybe a secrets manager somewhere, etc). When it is done automatically via something like Vault, that relies on a separate tool with its own policies, and bootstrapping client trust to Vault is itself a whole rabbit hole!

Wouldn't it be nice if your provisioning tool could take care of this kind of thing for you?

Building a prototype

I've built a basic prototype of some of what I have in mind called gobs-of-machines. I'll probably work on it (and document it) more in the future, but for now I'd like to go over what I have so far.

The system I've built has a few parts, most with D&D-esque names. There's the boss, which is responsible for triggering the provisioning process for new machines and keeping track of the machines it's provisioned. There are provisioners, which call out to a given cloud provider's APIs to provision a new machine with some specified user data. And then there's the hob, a service in two parts - one colocated with the boss, and another running on each deployed machine - which acts as the communication channel between the boss and each machine. (Named after the household spirit or hobgoblin, not the stove. Maybe I'll rename it if it gets too confusing, but hobgoblin is pretty verbose. Names, ya know!)

Boss

Starting with the boss, we'll create a Goblins object with a hashtable inside it. This will map a human-readable name for each machine to an object we can use to communicate with that machine.

(define (^boss bcom)
  (define machines (spawn ^ghash))

For the time being, since I haven't implemented any cloud provider provisioners, we'll use a dummy provisioner as a placeholder. In a real implementation we'd want a more sophisticated way to manage provisioners for multiple providers, but for now let's just create an instance of the dummy provisioner inside the boss object.

  (define dummy-provisioner (spawn ^dummy-provisioner))

We'll add a get-machine method to fetch a specific machine from the hashtable by name, and a list-machines method to list all the machines that have been registered:

  (methods
   [(get-machine name)
    ($ machines 'ref name)]
   [(list-machines)
    ;; fold over the ghash, collecting just the machine names
    (ghash-fold (lambda (name machine acc) (cons name acc))
                '()
                ($ machines 'data))]

For those coming from other object oriented programming languages, this will hopefully feel somewhat familiar. We're defining the constructor for an object (in Goblins's convention, the ^ at the beginning of names like ^boss denotes a constructor) and defining some methods for that object. There are a few Goblins-specific quirks to explain here:

- spawn takes a constructor (like ^ghash) and returns a reference to a new live object.
- $ makes a synchronous method call on an object: ($ machines 'ref name) calls the ref method of machines with name as its argument. Synchronous calls only work between objects in the same vat, which I'll get to shortly.
- methods defines an object's behavior as a set of named methods, dispatching on the method name each incoming message carries.

One thing worth noting about Goblins programming is that, unlike some other object oriented programming languages like Python, you should generally think of method calls as messages sent between objects. It's a very Alan Kay-style object system.

Okay, now for the only particularly interesting part of the boss, the create-machine method:

   [(create-machine name)
    (define hob-server (spawn ^hob-server-presence))
    (on (<- dummy-provisioner 'new-machine hob-server) ;; TODO selectable provisioners
        (lambda (ret)
          ($ machines 'set name hob-server)))]))

This one adds a couple new things to go over. Within the body of the create-machine method, we're spawning one of the components of the hob service - the one that lives alongside the boss. We're then sending an asynchronous message to the provisioner (that's what <- does) and passing it a reference to the hob object that we created.

Async messages, as in some other languages like JavaScript, return promises that will be fulfilled later rather than waiting for the response to come back. In Goblins, you can trigger an action when a promise is resolved with on. In this case, when the promise is resolved (i.e. when the boss gets a response from the provisioner), we add an entry to the hashtable created above for the new server. Currently I have it waiting until after hearing back from the provisioner to avoid adding entries for new nodes if provisioning fails, but we could just as well create a "pending" entry in the hashtable and update it when provisioning succeeds.
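For comparison, the "pending" variant of create-machine might look something like this (just a sketch, not what's in the repo):

   [(create-machine name)
    (define hob-server (spawn ^hob-server-presence))
    ;; record the machine right away, marked as pending...
    ($ machines 'set name 'pending)
    (on (<- dummy-provisioner 'new-machine hob-server)
        (lambda (ret)
          ;; ...then swap in the live hob-server once provisioning succeeds
          ($ machines 'set name hob-server)))]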

Provisioners

Now that we've seen the code that kicks off the provisioning process, let's take a look at what the provisioner itself does. For now I only have a dummy provisioner with one method to try out the idea:

(define (^dummy-provisioner bcom)
  (methods
   [(new-machine hob-server-presence)
    (display "dummy-provisioner: In a real provisioner, this would talk to a cloud provider\n")
    (let ((new-vat (spawn-vat)))
      (with-vat new-vat
                (define machine (spawn ^dummy-machine))
                ($ machine 'provision hob-server-presence)))]))

This is the method of the provisioner that we saw the boss call earlier.

Part of this again deserves some explanation, because it's Goblins-specific. Because we're not actually provisioning a new VM at this stage, I'm instead spawning a new "vat" to run some extra code in. A vat is Goblins's mechanism of concurrency; it's an event loop containing objects. Objects within the same vat can make synchronous method calls between each other, while objects in different vats can only send asynchronous method calls. Objects on separate machines can communicate with each other via CapTP, but they will necessarily be in different vats and so can only communicate asynchronously.
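Here's a toy illustration of those rules (the ^greeter object is made up for this example, and the module names assume guile-goblins):

(use-modules (goblins)
             (goblins actor-lib methods))

(define (^greeter bcom)
  (methods
   [(greet name) (string-append "hello, " name)]))

(define vat-a (spawn-vat))
(define vat-b (spawn-vat))

(define alice (with-vat vat-a (spawn ^greeter)))
(define bob (with-vat vat-b (spawn ^greeter)))

(with-vat vat-a
  ($ alice 'greet "same vat")      ;; synchronous: alice lives in vat-a
  (<- bob 'greet "different vat")) ;; cross-vat: async only, returns a promise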

Running part of this code in a separate vat is my attempt to simulate running on a separate machine, though of course there are a few things that would be different in a real provisioner:

- It would actually call out to a cloud provider's API to create a new machine, rather than just spawning a new vat locally.
- It would need to get the reference to the hob-server over to code running on that new machine.

The latter is trivial in the dummy provisioner case, when everything is running on the same machine, but Goblins has a trick up its sleeve that makes it possible across machines. If you have OCapN set up, you can serialize the hob-server reference as a sturdyref so you can send it over the network as a string. In a cloud provider that supports user data (Linode, for example), you can include the sturdyref in the user data for the new instance. The code running on the remote machine can then enliven the sturdyref, turning it back into a live reference, after which point objects on each machine can send asynchronous messages to each other as usual.
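Very roughly, and going from my memory of the Goblins OCapN API (imports and netlayer setup elided; my-netlayer is a placeholder for whatever netlayer you've configured), the handoff looks something like:

;; On the boss: register the hob-server with OCapN and stringify its
;; sturdyref so it can be baked into the new instance's user data.
(define mycapn (spawn-mycapn my-netlayer))
(define hob-sref ($ mycapn 'register hob-server 'onion)) ;; 'onion names the netlayer
(define user-data (ocapn-id->string hob-sref))

;; On the new machine (running its own mycapn): enliven the string back
;; into a live reference and register a client with it.
(define sref (string->ocapn-id user-data))
(on (<- mycapn 'enliven sref)
    (lambda (hob-server)
      (<- hob-server 'register (spawn ^hob-client-presence))))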

The dummy provisioning code that runs in the new vat doesn't do a whole lot; it looks like this:

(define (^dummy-machine bcom)
  (methods
   [(provision hob-server-presence)
    (display "dummy: Provisioning new machine\n")
    (let ((client (spawn ^hob-client-presence)))
      (on (<- hob-server-presence 'register client)
          (lambda (ret)
            (display "dummy: Registered new client machine\n"))))]))

Essentially, it just spawns a new instance of a hob-client-presence on the new machine, registers it with the associated hob server, and logs that it's done so. More on what that does in a minute. In a real provisioner, this might also be where we set up persistence for the hob-client-presence object, because we'll want that to stick around across reboots.

Hobs

So I've shown how you start the provisioning process, and I've shown what a trivial provisioner might do. That covers initial provisioning, but what does the interface to each machine look like after that point? That's the job of the hob.

As I mentioned before, the hob is a service in two parts: one that lives on the management server alongside the boss (one instance per managed machine), and one that lives on each machine. As with the other components of this system, these are implemented as Goblins objects.

We saw above that the one thing the machine-side component did was to register the hob-client with its associated server. Let's take a look at the server:

(define-actor (^hob-server-presence bcom #:optional (client #f))

Each machine gets a hob-server instance, and each hob-server instance will communicate with its associated hob-client instance. The hob-server therefore needs to know where its hob-client is. So we'll pass in the client as an optional parameter, defaulting to false initially.

The next part has a few Goblins-isms that deserve some explanation:

  (define pending-beh
    (methods
     [(register client-presence)
      (bcom (^hob-server-presence bcom client-presence))]))

We're using methods like before, only now we're assigning it to a variable. What? Well, it turns out methods isn't a special, core part of the language - it's a macro that emits a function dispatching on a method name as its first argument. And because it emits a function, we can assign that function to a variable as we would anything else. We'll see why this is useful in a bit.
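To make that concrete, here's a throwaway example (not from the codebase) showing that the value methods produces is an ordinary procedure dispatching on its first argument:

(define beh
  (methods
   [(greet name) (string-append "hi, " name)]))

(beh 'greet "world") ;; => "hi, world"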

The other oddity here is the last line in the register method - (bcom (^hob-server-presence bcom client-presence)). I didn't explain it above, but you may have noticed that each actor has bcom in its argument list. This is a capability that allows an object to "become" something else; it lets Goblins objects act as a sort of state machine. If the last thing that an object does on a method call is bcom something else, then the next time you make a call to that object, it'll use the behavior of that new thing instead of its original behavior. Here, the argument to bcom is the constructor for a new hob-server-presence, but this time with a client defined.
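The counter from the Goblins documentation is a nice minimal illustration of bcom (paraphrased from memory):

(define (^counter bcom count)
  (methods
   [(add1)
    ;; become a new counter whose count is one higher
    (bcom (^counter bcom (+ count 1)))]
   [(get-count) count]))

;; (define counter (spawn ^counter 0))
;; ($ counter 'get-count) => 0
;; ($ counter 'add1)
;; ($ counter 'get-count) => 1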

When we spawn a new hob-server initially, it doesn't know where its client is! It can't at that point, because the machine it's on hasn't been provisioned yet. Only after the new machine has been provisioned and its hob-client has been spawned can the server possibly know where it is. So we initially create the hob-server in a sort of pending state, where it's just sitting there waiting for a client to register to it. Only after a client registers does it transition to its active behavior. And what is its active behavior?

  (define active-beh
    (methods
     [(register-binding name svc-cap)
      (<- client 'register-binding name svc-cap)]
     [(list-bindings)
      (<- client 'list-bindings)]
     [(get-binding name)
      (<- client 'get-binding name)]))

For now, it only does three things: it lets you register a service binding, list the already registered service bindings, and get a service binding by name. Each of these method calls is passed through directly to the underlying hob-client.

What is a service binding? Inspired by a CloudFlare Workers blog post, a binding is a mapping between a human-readable name (something like db, for example) and a service capability (currently any arbitrary capability, though it may be better to define an interface for these capabilities later on). This gives each machine in the system its own namespace for looking up services. For example, an application could be configured to look up and connect to the db service, and exactly which service it ends up connecting to depends on the bindings available on that machine.

Lastly, the hob-server needs an initial behavior to start with. The client was an optional parameter, so we'll choose the active behavior if that parameter is truthy (which it would be if we've already registered a client) or the pending behavior if it's false:

  (if client
      active-beh
      pending-beh))

Now for the client side. As it turns out, its implementation is pretty similar to the boss, because it's doing a very similar thing - mapping from a human-readable name to something stored in a hashtable:

(define (^hob-client-presence bcom)
  (define bindings (spawn ^ghash))
  (methods
   [(register-binding name svc-cap)
    ($ bindings 'set name svc-cap)]
   [(list-bindings)
    ;; fold over the ghash, collecting just the binding names
    (ghash-fold (lambda (name svc-cap acc) (cons name acc))
                '()
                ($ bindings 'data))]
   [(get-binding name)
    ($ bindings 'ref name)]))

(An aside on naming - you might be wondering why the hob-client and hob-server are called "presences". The terminology, and the rough outlines of this architecture, come from something known in the ocap world as the unum pattern.)

Putting it all together

Now that we have all this in place, let's think a bit about what this allows us to do. Imagine that we've created two machines, say, app-machine and db-machine. On db-machine, imagine that we're running a database service (and assume for a moment that the database service itself speaks CapTP for simplicity's sake). On the database server, you could run:

($ hob-client 'register-binding "db" db-cap) ;; running on db-machine

You could then do this, from the boss:

(define db-machine ($ boss 'get-machine "db-machine"))
(define app-machine ($ boss 'get-machine "app-machine"))
(define db-cap (<- db-machine 'get-binding "db"))
(<- app-machine 'register-binding "db" db-cap)

Now, app-machine has access to the DB service via its capability!

($ hob-client 'get-binding "db") ;; running on app-machine

Of course, there aren't any database services that speak CapTP yet that I know of, but you can see the above flow working if you define db-cap to be something like a ghash for testing purposes.
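For example, something like this (stitched together from the pieces above) exercises the whole path:

;; On db-machine: a ghash standing in for a real database capability.
(define db-cap (spawn ^ghash))
($ db-cap 'set "greeting" "hello from db-machine")
($ hob-client 'register-binding "db" db-cap)

;; On app-machine, after the boss has copied the binding over:
(on (<- ($ hob-client 'get-binding "db") 'ref "greeting")
    (lambda (greeting)
      (format #t "read from db: ~a\n" greeting)))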

Conclusion

This is still a rough prototype, and there are lots of details I haven't covered yet. For example, you may want to attenuate the capability to a service exposed by one machine before passing it on to another. And because most existing services don't natively speak CapTP, you'd need to write a capability-style wrapper for them. This may be more or less difficult depending on the service in question, and supporting attenuation in particular in a wrapper is likely to be a challenge.

I also haven't defined a declarative interface for passing service references between machines. That's possible to build on top of these foundations, but it's definitely a lot more involved than what I've shown so far. For long-term administrative tasks, you may want to be able to define a service graph (i.e. "binding db on app-machine points to service db on db-machine") and have the system transfer those capabilities between the machines automatically.
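To give a flavor of what I mean, a declarative layer might accept something like this (entirely hypothetical; nothing in the prototype parses anything like it):

(define service-graph
  ;; "give app-machine a binding named db, pointing at the db service
  ;; exposed by db-machine"
  '((bind ("app-machine" "db")
          (service "db-machine" "db"))))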

But rough as it is, I hope this has given you some insight into the things that are possible to build in a capability framework. I know in my professional life there have been many times when I've wished connecting services to each other were a lot more straightforward, and so I wanted to take a crack at showing what's possible. Hopefully I've at least somewhat succeeded. :)
