AWS Startups Blog

Sandboxing Code in the Era of Containers


Guest post by Tomasz Janczuk, Engineer, Auth0

Every application is different. When creating a product aimed at developers, extensibility is a key aspect to take into account. Allowing custom code to run on a user's behalf is a neat way to accomplish that extensibility. However, as an operator of such a system you have limited control over that custom code, and you must assume it is not always well behaved. To maintain the integrity of the data, the code of individual users, and the overall system, you must run custom code in an isolated environment. A sandbox provides such an environment. This is the story of how we built one at Auth0. We call our sandboxing technology Auth0 Webtasks.

What is Auth0 About?

Auth0 was founded to make identity management as simple and approachable to developers as sending SMS messages with Twilio or accepting payments through Stripe. Security in general and identity management in particular have always been complex areas. While authentication is a necessary aspect of any application of value, it is rarely part of the core value proposition of an application. By allowing authentication and authorization to be simply and quickly integrated with any app, Auth0 assumes the tax of identity management and allows developers to spend more time on the core aspects of the application logic.

Auth0 offers authentication as a service. Auth0 acts as a broker between a client application and several supported social (Facebook, Twitter, GitHub, etc.) and enterprise (LDAP, AD, WAAD, etc.) identity providers. Auth0 normalizes the wire protocol and enables single sign-on. It greatly simplifies securing applications by providing developers with reasonable defaults for a wide range of declarative controls over the authentication behavior, as well as SDKs for the most common platforms and languages.

What Is a Sandbox and Why You May Need One

Flexibility and extensibility are two distinctive features of the Auth0 identity management platform. We allow our subscribers to augment the authentication and authorization pipeline by executing custom code as part of the authentication transaction. For example, you can implement logic that requires two-factor authentication for login attempts from unusual locations, create a new record in a custom CRM system for all new users, enhance user profiles using external databases, and much more — all by running arbitrary code on our platform.

Like Auth0, many multi-tenant systems that allow extensibility through custom code face the same problems. How do you prevent malicious or just badly written code of one tenant from accessing data of another tenant? How do you prevent the code of one tenant from consuming an unfair share of computing resources and slowing down, or outright blocking, the code of other tenants?

Because we don’t have strict control of what people can do within custom code, we needed a technology solution to safely execute it. It must provide the necessary data isolation and resource utilization guarantees. We evaluated some off-the-shelf alternatives (e.g., node-sandbox), but they either fell short of our isolation expectations or did not fit our execution model. We decided to build our own sandboxing solution, which we called Auth0 Webtasks.

A Multi-Tenant SaaS Model for Sandboxing Untrusted Code

All custom code we execute at Auth0 using Auth0 Webtasks runs in the context of an HTTP request. Execution time is limited to the typical lifetime of an HTTP request. Given this basic constraint, the execution model we chose is based on HTTP. The webtask cluster accepts an HTTP POST request with the code to execute in the request body. The request also specifies the webtask container name, which denotes the isolation boundary the code will execute in. In the case of Auth0, we map Auth0 customers 1:1 onto webtask containers, which means the code of one Auth0 subscriber is always isolated from the code of another subscriber. The webtask cluster executes the custom code in an isolated environment we call a webtask container, and sends back a JSON response with the results.

HTTP POST request diagram
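
To make the model concrete, here is a minimal sketch of what submitting code to a webtask cluster could look like from Node.js. The endpoint hostname, path, and the way the container name is passed are hypothetical placeholders for illustration, not the actual Auth0 Webtasks API.

var https = require('https');

// The code to execute; the webtask cluster evaluates it and returns the result as JSON.
var code = 'return function (cb) { cb(null, "Hello, world"); }';

var req = https.request({
    method: 'POST',
    hostname: 'webtask.example.com',   // hypothetical cluster endpoint
    path: '/run/my-container',         // hypothetical path; the container name denotes the isolation boundary
    headers: { 'Content-Type': 'text/plain' }
}, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
        console.log(JSON.parse(body)); // JSON response produced by the custom code
    });
});

req.on('error', console.error);
req.write(code);
req.end();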

One key implication of this model is that custom code executes in a uniform environment across all tenants. Custom code of different users can leverage only a single set of capabilities and modules; the list of installed modules or software cannot vary on a per-tenant basis. This implies that a webtask request must include all necessary code as well as contextual data required during its execution. The upside of the uniform execution environment is that it enables specific performance optimizations in the implementation of the system. The downside is that it does not offer as much flexibility as platforms that provide full control over the installed software on a per-tenant basis.

The webtask programming model is simple. The caller submits a JavaScript function closure. The webtask cluster invokes that function and provides a single callback parameter. When the custom code has finished executing, it calls the callback and provides an error or a single result value. That value is then serialized as JSON and returned to the caller in the HTTP response.

return function (cb) {
    cb(null, "Hello, world");
}

The webtask runtime is based on Node.js. It allows custom code to utilize a fixed set of Node.js modules pre-provisioned in the webtask environment. The set of supported modules is dictated by the specific requirements of the Auth0 extensibility scenarios. While the set of modules is easily modified, it remains uniform across all tenants in the multi-tenant webtask cluster deployment. This uniformity allows us to keep a pool of pre-warmed webtask containers ready to be assigned to tenants when a request arrives. This greatly reduces the cold startup latency.
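
For example, a webtask closure can call into one of those pre-provisioned modules. The request module used below is only illustrative; the actual module list is dictated by Auth0's extensibility scenarios and is fixed per cluster, not per tenant.

return function (cb) {
    // 'request' stands in for any module pre-provisioned in the webtask environment.
    var request = require('request');
    request('https://example.com/api/status', function (error, response, body) {
        if (error) return cb(error);
        // Return a single result value; the cluster serializes it to JSON for the HTTP response.
        cb(null, { status: response.statusCode, body: body });
    });
}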

Webtask Isolation and Assurances

A multi-tenant system that executes untrusted code must provide many assurances for secure code execution. First and foremost is data isolation. This means code of one tenant should be prevented from accessing code or data of another tenant. For example, if one tenant is running code that accesses a custom database using a connection string or URL with an embedded password, code of another tenant running in the same system should be unable to discover that password.

Second only to preventing data disclosure or tampering is ensuring fair resource consumption, which mitigates authenticated DoS attacks. To achieve this, the sandbox must limit the amount of memory, CPU, and other system resources any one tenant can use.

Solutions that Did Not Meet Our Requirements

Before we decided on a sandbox solution, we explored many existing options. First, we looked at node-sandbox as a way of sandboxing the execution of Node.js code using a process-level isolation boundary. While it is conveniently integrated with Node.js, which the rest of our platform uses, it falls short of our isolation requirements. The module provides only a basic level of isolation between JavaScript code running in different processes. Out of the box, the solution does not support any OS-level isolation mechanisms for processes. We also did not like the added latency of creating a new Node.js process for every request to evaluate custom code.

Next, we considered using PaaS offerings like Heroku or Windows Azure Web Sites. While these hosting solutions offer adequate isolation given our requirements, using them would require us to provision a separate website for each Auth0 subscriber. This solution simply did not scale from the cost perspective.

At that point, we resorted to building our own solution, which we called Auth0 Webtasks. While we were at it, AWS announced AWS Lambda, which looked like a potential remedy to our challenge. Upon further investigation, however, it became clear AWS Lambda is centered on an asynchronous programming model that did not quite meet our needs. The custom logic we run in Auth0 is RPC-style logic: it accepts inputs and must return outputs at low latency. Given the asynchronous model of AWS Lambda, implementing such logic would require polling for the results of the computation, which would adversely affect latency.

Given the lack of reasonable off-the-shelf offerings, we developed Auth0 Webtasks.

Implementation of Auth0 Webtasks in AWS

The external interface of the Auth0 Webtask technology is an HTTP endpoint that accepts POST requests for executing custom code. Let’s take a top-down look at our deployment of Auth0 Webtask technology on top of AWS, as shown in the following diagram.

deployment of Auth0 Webtask technology on top of AWS

Auth0 services require a high level of availability, and we have decided to employ several mechanisms to support this requirement:

  • We have two webtask clusters deployed in different AWS regions (us-west-1 and us-east-1). One of them is the primary cluster, and the other is a failover cluster. We are using a failover routing policy at the Amazon Route 53 level to implement that failover logic.
  • Each webtask cluster consists of a three-VM deployment with Elastic Load Balancing (ELB) in front of the cluster. ELB is configured with health check monitoring and takes VMs out of circulation upon failure.
  • The VM placement across several Availability Zones in an AWS region supports high availability at the cluster level.

Custom code running on the webtask cluster VMs frequently makes outbound network calls to systems under the control of Auth0 customers. Some of these systems require firewall configuration to allow these inbound calls. To enable our customers to deterministically configure their firewalls, we wanted to guarantee a fixed set of source IP addresses for those outbound network calls. To that end, all outbound traffic from the webtask VMs is routed through NAT VMs with fixed Elastic IPs. This means that even when a webtask VM or the entire cluster is re-provisioned and assigned a new private IP address, the publicly visible IP address of outbound calls remains unchanged.

The NAT machines with fixed IP addresses also give our engineers tunneled SSH access to individual webtask VMs.

Each webtask cluster runs in its own VPC. The VPC has two subnets per Availability Zone: one private and one public. The NAT machines and the ELB are attached to the public subnet. The webtask VMs themselves run on the private subnet.

Now let’s take a look at the architecture of the software running in a webtask cluster. At the high level, the Auth0 Webtasks are built on top of CoreOS, Docker, etcd, and fleet. These technologies provide a great foundation for building scalable and distributed container-based applications. However, by themselves they are not sufficient to provide the kind of guarantees we discussed earlier. We had to use additional mechanisms on top of CoreOS and Docker.

To isolate code and data of one tenant from another, we run every tenant’s code in their own Docker container, which we call a webtask container. When an HTTP request arrives at a webtask VM, it is first processed by a proxy. The proxy looks at the etcd configuration to determine if a webtask container is already assigned for a particular tenant. If it is, the proxy forwards the request to that webtask container. If it is not, the proxy assigns a new webtask container for that tenant from a pool of pre-warmed webtask containers available in the webtask cluster. The proxy then records that association in etcd for the sake of subsequent requests.
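
As a rough sketch of that assignment step (the etcd key layout and the node-etcd client used here are assumptions made for illustration, not our actual proxy code):

// Rough sketch of the proxy's container-assignment step.
var Etcd = require('node-etcd');
var etcd = new Etcd();   // defaults to the local etcd endpoint

// Hypothetical helper: pick an unassigned, pre-warmed container from the pool
// (pool management itself is handled by fleet and omitted here).
function takeFromPool(done) {
    done(null, 'http://10.0.0.42:3000');
}

function assignContainer(tenant, done) {
    var key = '/webtask/containers/' + tenant;   // hypothetical key layout
    etcd.get(key, function (err, res) {
        if (!err && res && res.node) {
            // A webtask container is already assigned to this tenant; reuse it.
            return done(null, res.node.value);
        }
        takeFromPool(function (err, containerUrl) {
            if (err) return done(err);
            // Record the association so subsequent requests for this tenant
            // are forwarded to the same container.
            etcd.set(key, containerUrl, function (err) {
                done(err, containerUrl);
            });
        });
    });
}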

The pre-warmed pool of webtask containers is a distinctive feature of this design. It is made possible given the decision to offer a uniform execution environment for all tenants. Being able to pick a pre-warmed container from a pool greatly reduces cold startup latency compared to provisioning of a container on the fly, even if one takes into account the already low startup latency of Docker containers. We are using fleet to maintain a pre-warmed pool of webtask containers. Containers that are recycled are simply re-started by fleet and put back in the unassigned pool.

Any single webtask container is just a simple HTTP server that allows multiple, concurrent requests to be processed on behalf of a single tenant. Requests executing within a specific webtask container are not isolated from each other in any way. The lifetime of the webtask containers is managed by the controller daemon, which runs in a trusted Docker container and can therefore terminate any webtask container in the cluster following a pre-configured lifetime management policy.

HTTP controller daemon
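
A simplified sketch of what such a lifetime policy could look like, assuming the dockerode client (an assumption for illustration; our controller daemon differs in detail):

// Simplified sketch: stop webtask containers that exceed a maximum lifetime.
var Docker = require('dockerode');
var docker = new Docker({ socketPath: '/var/run/docker.sock' });

var MAX_LIFETIME_SECONDS = 60 * 60;   // hypothetical policy value

function enforceLifetimePolicy() {
    docker.listContainers(function (err, containers) {
        if (err) return console.error(err);
        containers.forEach(function (info) {
            var ageSeconds = Date.now() / 1000 - info.Created;
            if (ageSeconds > MAX_LIFETIME_SECONDS) {
                // Terminate the container; fleet restarts it and returns it to the pool.
                docker.getContainer(info.Id).stop(function (err) {
                    if (err) console.error(err);
                });
            }
        });
    });
}

setInterval(enforceLifetimePolicy, 60 * 1000);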

In addition to running every tenant’s code in its own Docker container, we are also configuring egress firewall rules in the CoreOS webtask VMs. These rules prevent untrusted code in one webtask container from communicating with other webtask containers or webtask infrastructure. Setting up the firewall rules is possible because the HTTP server of the webtask container is running on a local network separated from the host’s network by a bridge created by Docker. The code running in the webtask container can, however, initiate outbound calls to the public internet. This enables outbound communication from the custom code to external data sources and services, for example, a customer’s database or corporate edge services.

To limit memory and CPU consumption, we are using the cgroups mechanism exposed by Docker. In addition, every webtask container creates a transient Linux user and configures PAM limits for that user on startup. These two mechanisms together help prevent a range of attacks on memory and CPU, for example, fork bombs.
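
For illustration, constraining a container's memory and CPU through the limits Docker exposes could look like the following sketch; the dockerode client and the concrete values are assumptions, not our production configuration.

// Illustrative sketch: apply cgroups-backed memory and CPU limits when
// creating a webtask container.
var Docker = require('dockerode');
var docker = new Docker({ socketPath: '/var/run/docker.sock' });

docker.createContainer({
    Image: 'webtask-container',        // hypothetical image name
    Memory: 256 * 1024 * 1024,         // cap memory at 256 MB
    CpuShares: 512                     // relative CPU weight vs. other containers
}, function (err, container) {
    if (err) return console.error(err);
    container.start(function (err) {
        if (err) return console.error(err);
        console.log('webtask container started with memory and CPU limits');
    });
});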

Streaming Real-Time Logs

Developers love logs because they provide tremendous help during development as well as at runtime. Handling logging in a multi-tenant system requires some planning to ensure logs are both readily available and isolated between tenants.

HTTP sandbox VM logging

Auth0 Webtasks allows all internal components, as well as the custom code provided by the tenant, to contribute to the logs. We use bunyan for structured, JSON-based logging. All logs are stored in Kafka, which is deployed across all webtask VMs in the cluster. From there, logs are exposed over HTTP and can be streamed to the developer console using something as simple as curl and a long-running HTTP response.

To support log isolation between tenants, logs generated by any tenant's code are published to a separate Kafka topic. In addition, all information is also logged to a system-wide topic for administrative analysis.
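
A rough sketch of how a per-tenant logger could be wired up follows; the kafka-node producer and the topic naming convention are assumptions made for this example.

// Rough sketch: a bunyan logger whose records are published to a per-tenant Kafka topic.
var bunyan = require('bunyan');
var kafka = require('kafka-node');

var producer = new kafka.Producer(new kafka.Client('localhost:2181'));

function createTenantLogger(tenant) {
    var topic = 'webtask-logs-' + tenant;   // hypothetical per-tenant topic name
    return bunyan.createLogger({
        name: 'webtask',
        streams: [{
            type: 'raw',
            stream: {
                // bunyan calls write() with the structured log record.
                write: function (record) {
                    producer.send([{ topic: topic, messages: JSON.stringify(record) }],
                        function (err) { if (err) console.error(err); });
                }
            }
        }]
    });
}

producer.on('ready', function () {
    var log = createTenantLogger('tenant-a');
    log.info({ webtask: 'hello-world' }, 'custom code started');
});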

How about .NET?

JavaScript and Node.js provide a useful and capable programming model, but many people prefer authoring custom code in .NET and C#. To address this need, Auth0 Webtasks also supports running C# code.

This is achieved with the use of Edge.js, a Node.js module that supports running .NET code in-process with Node.js. Edge.js enables creating JavaScript function proxies around C# async lambda expressions. It supports marshaling data between V8 and CLR heaps in the same process, and reconciles threading models between single-threaded V8 and multi-threaded CLR.

return function (context, cb) {
    require('edge').func(function () {/*
        async (dynamic context) => {
            return "Hello, " + context.data.name + "!";
        }
    */})(context, cb);
}

In the context of the Auth0 Webtasks, Edge.js uses Mono as the CLR implementation on Linux. With this mechanism, developers can use the full power of the Mono framework to write custom code in C#, and run it in a Linux-based webtask container. The C# code is embedded in Node.js and compiled on the fly to an in-memory assembly for execution.

The Outlook

We are quite excited about the possibilities Auth0 Webtasks creates for developers. Currently we use webtasks to allow our subscribers to easily import custom credential databases, as well as run custom logic in the authentication pipeline. Seeing how well this extensibility model resonates with the level of control developers appreciate, we look forward to opening up many more use cases in a similar way.