Building a Presence API using AWS AppSync, AWS Lambda, Amazon ElastiCache and Amazon EventBridge

Introduction

When developing a video game, whether single-player or multiplayer, social and competitive features help create a network effect and increase players’ engagement. These features usually require a backend API. Among them, presence information lets players know about online status changes of other users, allowing them to challenge others quickly or invite them for a game session.

AWS provides developers with a wide spectrum of choice to develop backend services, from plain Amazon Elastic Compute Cloud (Amazon EC2) instance-based servers to containers and serverless. Among those, AWS AppSync simplifies application development by letting you create a flexible API to securely access, manipulate, and combine data from one or more data sources. AWS AppSync is a managed service that uses GraphQL to make it easy for applications to get exactly the data they need. One interesting feature of AWS AppSync is the real-time updates, with which you can notify players about changes on the backend.

To enable users to subscribe to and receive notifications, a game client connects to an AWS AppSync endpoint via a websocket connection. As of today, AWS AppSync does not provide events related to client connections or disconnections.

This post describes a solution to build a Presence API using AWS AppSync, AWS Lambda, Amazon ElastiCache, and Amazon EventBridge.

Defining a Presence API in GraphQL

For this example, we will keep the API as simple as possible. Each player is associated with a status that can take two values: “online” or “offline”. Our API provides three basic operations:

connect mutation: to be called when a player opens their websocket connection
disconnect mutation: to be called when the player gracefully quits the game (and closes their connection)
status query: to retrieve the current status of one player

There is one additional and more elaborate use case. The player can be disconnected from the backend for many reasons: client crashes, network interruptions, or even intentionally to cheat. Still, we want other players to be informed of the disconnection even when the disconnect operation is not called.

To do this, the game client sends regular signals to the backend, called heartbeats, and a threshold (or timeout) is set to decide whether the player is still online. As game clients perform reconnection attempts when disconnected, it’s important to carefully define both the heartbeat interval and the threshold to avoid the blinking player effect, in which a player's status switches quickly between connected and disconnected.
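
The timeout rule boils down to a single comparison. The following is a minimal sketch of it: the helper name isOnline and the interval values are illustrative assumptions, not taken from the sample.

```javascript
// Hypothetical helper: a player counts as online while the time elapsed
// since the last heartbeat is below the timeout. All values are in ms.
function isOnline(lastHeartbeat, now, timeout) {
  return now - lastHeartbeat < timeout;
}

// Illustrative values only: one heartbeat every 5 s and a timeout set
// to three heartbeat intervals.
const HEARTBEAT_INTERVAL = 5000;
const TIMEOUT = 3 * HEARTBEAT_INTERVAL;

const lastSeen = 100000;
console.log(isOnline(lastSeen, lastSeen + 2 * HEARTBEAT_INTERVAL, TIMEOUT)); // true
console.log(isOnline(lastSeen, lastSeen + 4 * HEARTBEAT_INTERVAL, TIMEOUT)); // false
```

A timeout spanning several heartbeat intervals gives a reconnecting client some slack, which is one way to limit the blinking effect, at the cost of a slower disconnection notice.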

Finally, we add a subscription to our API so that players receive notifications when another player’s status changes. The @aws_subscribe annotation is particular to AWS AppSync and specifies the mutations that will trigger the notification. The final schema of the AWS AppSync API should match the following GraphQL code snippet:

enum Status {
        online
        offline
}
type Presence {
        id: ID!
        status: Status!
}
type Mutation {
        connect(id: ID!): Presence
        disconnect(id: ID!): Presence
        disconnected(id: ID!): Presence
}
type Query {
        heartbeat(id: ID!): Presence
        status(id: ID!): Presence
}
type Subscription {
        onStatus(id: ID!): Presence
               @aws_subscribe(mutations: ["connect","disconnect","disconnected"])
}

Presence data storage

AWS AppSync enables developers to decouple the GraphQL schema that is accessed by their client applications from the data source, allowing them to choose the right data source for their workload.

Find a description of the architecture diagram here: https://docs.aws.amazon.com/appsync/latest/devguide/system-overview-and-architecture.html

AppSync Architecture (Ref: AWS AppSync System Overview and Architecture)

For presence data, knowing whether a player is still online translates into checking that the last heartbeat was received within the given timeout. This information is similar to the session information kept for a website.

Amazon ElastiCache is a good fit for this use case. (Reference the Session Management page to learn more.) The key/value cache could store a player id as the key, and the timestamp of the last heartbeat as the value. We also want to be able to quickly retrieve sessions that have expired during a time interval, which explains the choice of Redis Sorted Sets using operations such as ZADD, ZSCORE, ZRANGEBYSCORE, or ZREMRANGEBYSCORE.
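
To make the role of each sorted set command concrete, here is a small in-memory model. It is only a sketch: the PresenceSet class is hypothetical, return values are simplified compared to Redis (which, for instance, returns scores as strings), and the real data lives in the Redis cluster.

```javascript
// In-memory stand-in for a Redis sorted set, for illustration only.
// Members are player ids, scores are last-heartbeat timestamps (ms).
class PresenceSet {
  constructor() {
    this.scores = new Map();
  }
  // ZADD: insert the member, or overwrite its score if it already exists.
  zadd(id, timestamp) {
    this.scores.set(id, timestamp);
  }
  // ZSCORE: the member's score, or null when the member is absent.
  zscore(id) {
    return this.scores.has(id) ? this.scores.get(id) : null;
  }
  // ZRANGEBYSCORE: members with min <= score <= max, ordered by score.
  zrangebyscore(min, max) {
    return [...this.scores.entries()]
      .filter(([, score]) => score >= min && score <= max)
      .sort((a, b) => a[1] - b[1])
      .map(([id]) => id);
  }
  // ZREMRANGEBYSCORE: remove those members and return how many were removed.
  zremrangebyscore(min, max) {
    const expired = this.zrangebyscore(min, max);
    expired.forEach((id) => this.scores.delete(id));
    return expired.length;
  }
}

const presence = new PresenceSet();
presence.zadd("alice", 1000);
presence.zadd("bob", 4000);
presence.zadd("alice", 5000); // a later heartbeat overwrites the score

const cutoff = 4500; // stands for "now - timeout"
console.log(presence.zrangebyscore(-Infinity, cutoff)); // [ 'bob' ]
presence.zremrangebyscore(-Infinity, cutoff);
console.log(presence.zscore("bob"));   // null
console.log(presence.zscore("alice")); // 5000
```

Because the score is the heartbeat timestamp, finding every expired player is a single range query on the score rather than a scan over all keys.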

Architecture overview

Architecture diagram

The infrastructure is defined using AWS Cloud Development Kit (AWS CDK), an open-source software development framework to model and provision your cloud application resources using familiar programming languages (in this case, TypeScript). AWS CDK provides high-level constructs that allow developers to describe infrastructure with a few lines of code, while following the recommended best practices for security, reliability, and performance. It also gives developers the possibility to use more advanced programming features such as functions or loops.

Network overview

An ElastiCache for Redis cluster is deployed within an Amazon VPC. As depicted in the diagram, the VPC is divided into three subnet groups:

the Redis subnet group: fully private for the cluster deployment
the Lambda subnet group: In order to access the Redis endpoints, the Lambda functions must be deployed inside the same VPC.
a public subnet group: The timeout Lambda function requires access to the AppSync endpoint to call mutations. As of today, AWS AppSync does not support AWS PrivateLink, so the function has to access AppSync through a NAT Gateway, which in turn requires public subnets.

For high availability, we use a multi-AZ (Availability Zone) deployment, which requires definitions for one subnet resource per zone and group in our stack, as well as a route table to handle traffic from the Lambda subnets to the internet. Fortunately, this is where AWS CDK comes in handy with the Vpc construct:

this.vpc = new EC2.Vpc(this, 'PresenceVPC', {
  cidr: "10.42.0.0/16",
  subnetConfiguration: [
    // Subnet group for Redis
    {
      cidrMask: 24,
      name: "Redis",
      subnetType: EC2.SubnetType.ISOLATED
    },
    // Subnet group for Lambda functions
    {
      cidrMask: 24,
      name: "Lambda",
      subnetType: EC2.SubnetType.PRIVATE
    },
    // Public subnets required for the NAT Gateway
    {
      cidrMask: 24,
      name: "Public",
      subnetType: EC2.SubnetType.PUBLIC
    }
  ]
});

The Vpc construct creates subnets in different Availability Zones and, by choosing the right combination of SubnetType, can also create other necessary resources, such as the NAT Gateway and route tables. We can then create two security groups inside the VPC, so that the Redis security group only allows incoming traffic from the security group attached to the Lambda functions, and create a multi-AZ Redis cluster with a read replica.

AWS AppSync API

The next part of the stack setup concerns the AWS AppSync API.

Using Lambda functions, we can take advantage of the Direct Lambda Resolvers feature of AWS AppSync. For each query and mutation, a resolver is created with the corresponding Lambda data source and attached to the relevant schema field.

The Lambda function code is rather simple: each function reads the query or mutation arguments directly from the event argument and performs the corresponding Redis operation. As our functions need to access the Redis cluster, we use a Lambda layer containing the redis module.

Here is the heartbeat function code for example:

const redis = require('redis');
const { promisify } = require('util');
const redisEndpoint = process.env.REDIS_HOST;
const redisPort = process.env.REDIS_PORT;
const presence = redis.createClient(redisPort, redisEndpoint);
const zadd = promisify(presence.zadd).bind(presence);

/**
 * Heartbeat handler:
 * use zadd on the redis sorted set to add one entry
 * 
 * @param {object} event 
 */
exports.handler =  async function(event) {
  const id = event && event.arguments && event.arguments.id;
  if (undefined === id || null === id) throw new Error("Missing argument 'id'");
  const timestamp = Date.now();
  // Let failures propagate so that AppSync reports them as GraphQL errors
  // instead of receiving the Error object as the resolver result.
  await zadd("presence", timestamp, id);
  return { id: id, status: "online" };
}

The ZADD Redis command can either add a new entry in the set or update the entry score if it exists. Therefore, the corresponding Lambda data source can also be used by both the connect mutation and the heartbeat query. If you look at the CDK code that creates the resolvers or data sources, there is nothing related to the creation of an AWS Identity and Access Management (IAM) role to give AWS AppSync the permissions required to call the function. This action is automatically handled by the CDK constructs.
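
The status query can follow the same pattern with ZSCORE. The sketch below is an assumption about how such a resolver could look, not the sample's code: makeStatusHandler is a hypothetical factory, and zscore is injected so the logic can run without a Redis connection (in a Lambda function it would be the promisified client method, as in the heartbeat function).

```javascript
// Sketch of a status resolver built around an injected `zscore` function.
function makeStatusHandler(zscore) {
  return async function handler(event) {
    const id = event && event.arguments && event.arguments.id;
    if (id === undefined || id === null) throw new Error("Missing argument 'id'");
    // ZSCORE yields a null reply when the id is not in the sorted set.
    const score = await zscore("presence", id);
    const present = score !== null && score !== undefined;
    return { id, status: present ? "online" : "offline" };
  };
}

// Usage with a stubbed zscore (assumption: only "alice" is in the set).
const fakeScores = { alice: "1700000000000" };
const statusHandler = makeStatusHandler(async (key, id) =>
  Object.prototype.hasOwnProperty.call(fakeScores, id) ? fakeScores[id] : null
);
statusHandler({ arguments: { id: "alice" } }).then(console.log);
statusHandler({ arguments: { id: "bob" } }).then(console.log);
```

With this approach a player is reported as online for as long as the entry remains in the sorted set, so the answer is only as fresh as the last cleanup of expired entries.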

Handling expired connections

The process of handling an expired connection follows the steps annotated in the above diagram:

Triggered at regular intervals, the timeout function retrieves expired connections and removes them from the sorted set.
It performs one AWS AppSync disconnected mutation per disconnection through the NAT Gateway.
AWS AppSync triggers a notification for each disconnection to inform subscribed players.

Next, we need to modify the GraphQL schema to annotate the disconnected mutation, which the timeout function calls:

type Mutation {
        connect(id: ID!): Presence
        disconnect(id: ID!): Presence
        disconnected(id: ID!): Presence
               @aws_iam
}
type Subscription {
        onStatus(id: ID!): Presence
               @aws_subscribe(mutations: ["connect","disconnect","disconnected"])
}

The @aws_iam annotation informs AWS AppSync that this specific mutation requires AWS IAM authentication, through a specific role that the Lambda function assumes. You can learn more about AWS AppSync multiple authorization types in this article.

Finally, use the following code for the timeout function:

const redis = require('redis');
const { promisify } = require('util');
const timeout = parseInt(process.env.TIMEOUT, 10);
const graphqlEndpoint = process.env.GRAPHQL_ENDPOINT;

// Initialize Redis client
const redisEndpoint = process.env.REDIS_HOST;
const redisPort = process.env.REDIS_PORT;
const presence = redis.createClient(redisPort, redisEndpoint);

// Initialize GraphQL client
const AWS = require('aws-sdk/global');
const AUTH_TYPE = require('aws-appsync').AUTH_TYPE;
const AWSAppSyncClient = require('aws-appsync').default;
const gql = require('graphql-tag');
const config = {
  url: graphqlEndpoint,
  region: process.env.AWS_REGION,
  auth: {
    type: AUTH_TYPE.AWS_IAM,
    credentials: AWS.config.credentials,
  },
  disableOffline: true
};
const gqlClient = new AWSAppSyncClient(config);
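// The mutation query: it must target the disconnected field of the schema
// and select the Presence fields to send to subscribers
const mutation = gql`
  mutation disconnected($id: ID!) {
    disconnected(id: $id) {
      id
      status
    }
  }
`;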

exports.handler =  async function() {
  const timestamp = Date.now() - timeout;
  // Use a transaction to both retrieve the list of ids and remove them.
  const transaction = presence.multi();
  transaction.zrangebyscore("presence", "-inf", timestamp);
  transaction.zremrangebyscore("presence", "-inf", timestamp);
  const execute = promisify(transaction.exec).bind(transaction);
  try {
    const [ids] = await execute();
    if (!ids.length) return { expired: 0 };
    // Create and send all mutations to AppSync
    const promises = ids.map(
      (id) => gqlClient.mutate({ mutation, variables: {id} })
    );
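    await Promise.all(promises);
    return { expired: ids.length };
  } catch (error) {
    // Log and rethrow so that the invocation is reported as failed
    console.error(error);
    throw error;
  }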
}

To trigger notifications in AWS AppSync, we use a specific mutation named disconnected. This mutation is attached to a local resolver; it forwards the result of the request mapping template to the response mapping template, without leaving AppSync, to trigger notifications to subscribed clients.

Event-based evolution

Now we have a working Presence API; however, it was defined without the context of other backend APIs, such as a friend or challenge API. Those APIs may also benefit from knowing if a player has been disconnected to perform updates or clean up.

Another issue with this version is that there are two differentiated paths to disconnect the user: one using the disconnect mutation on the API, and one through the disconnected mutation from the timeout function. When users disconnect themselves, other services won’t be notified.

To be consistent, we modify the disconnect function to send a disconnection event to our event bus as well. Here is the evolved architecture:

Evolved event-based architecture diagram

Amazon EventBridge triggers the timeout function.
The function retrieves and removes expired connections.
The function sends events to the custom event bus (as the disconnect function does too).
The event bus triggers the Lambda function on_disconnect set as target.
The on_disconnect function sends a disconnected mutation to AWS AppSync.
AWS AppSync notifies clients that subscribed to this mutation.

Also note that the heartbeat function is now sending connect events to the Amazon EventBridge bus, which can be used by other backend services as well.
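
As an illustration of the last steps of this flow, the on_disconnect function can be very small. The following sketch is an assumption about its shape rather than the sample's code: makeOnDisconnectHandler and sendDisconnected are hypothetical names, and the AppSync call is injected so the logic can be exercised on its own.

```javascript
// Sketch of an on_disconnect handler. `sendDisconnected` is a hypothetical
// injected function standing for the AppSync `disconnected` mutation call.
function makeOnDisconnectHandler(sendDisconnected) {
  return async function handler(event) {
    // EventBridge delivers the published Detail payload as `event.detail`.
    const id = event && event.detail && event.detail.id;
    if (id === undefined || id === null) throw new Error("Missing 'id' in event detail");
    await sendDisconnected(id);
    return { id, status: "offline" };
  };
}

// Usage with a stub that records the ids it was called with.
const notified = [];
const onDisconnect = makeOnDisconnectHandler(async (id) => { notified.push(id); });
onDisconnect({
  "detail-type": "presence.disconnected",
  source: "api.presence",
  detail: { id: "alice" },
}).then((result) => console.log(result, notified));
```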

Network evolution

A point of note in the diagram is that the Lambda functions are not directly connected to AWS AppSync anymore, which removes the need to have private / public subnets and a NAT Gateway. And as Amazon EventBridge supports interface VPC endpoints, we use the following code to add one to our VPC so that the Lambda functions inside the VPC can access the service directly.

// Add an interface endpoint for EventBus
this.vpc.addInterfaceEndpoint("eventsEndPoint", {
  service: InterfaceVpcEndpointAwsService.CLOUDWATCH_EVENTS,
  subnets: this.vpc.selectSubnets({subnetGroupName: "Lambda"})
})

Event rules and targets

The next step is to define the events and the rules they trigger. The stack creates an event rule attached to the custom event bus:

// Rule for disconnection event
new AwsEvents.Rule(this, "PresenceExpiredRule", {
  eventBus: presenceBus,
  description: "Rule for presence disconnection",
  eventPattern: {
    detailType: ["presence.disconnected"],
    source: ["api.presence"]
  },
  targets: [new AwsEventsTargets.LambdaFunction(this.getFn("on_disconnect"))],
  enabled: true
});

The important points here are:

The eventPattern: It defines the events that will trigger this rule. In this case, an event matches when both its detailType and its source equal one of the values listed in the rule definition; all other event fields are ignored.
The targets: The on_disconnect function is added as a target to the rule. Amazon EventBridge rules allow for multiple targets to be triggered by a single rule, which will allow the usage of a fan-out model where the event can trigger other targets for other services.
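
The matching behavior of the eventPattern can be pictured with a few lines of code. This is a deliberately simplified model covering only lists of exact values on top-level fields; matchesPattern is a hypothetical helper, not an EventBridge API. Note that the CDK detailType property corresponds to the "detail-type" field of the event.

```javascript
// Simplified model of rule matching: every field named in the pattern must
// hold one of the listed values; fields absent from the pattern are ignored.
function matchesPattern(pattern, event) {
  return Object.entries(pattern).every(
    ([field, accepted]) => accepted.includes(event[field])
  );
}

const pattern = {
  "detail-type": ["presence.disconnected"],
  source: ["api.presence"],
};

const disconnectedEvent = {
  "detail-type": "presence.disconnected",
  source: "api.presence",
  detail: { id: "alice" }, // not part of the pattern, so not checked
};
const connectedEvent = {
  "detail-type": "presence.connected", // illustrative detail type
  source: "api.presence",
  detail: { id: "alice" },
};

console.log(matchesPattern(pattern, disconnectedEvent)); // true
console.log(matchesPattern(pattern, connectedEvent));    // false
```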

What remains is to change the code of our timeout and disconnect functions to send events to Amazon EventBridge. Here is the main handler for the timeout function as an example:

exports.handler =  async function() {
  const timestamp = Date.now() - timeout;
  const transaction = presence.multi();
  transaction.zrangebyscore("presence", "-inf", timestamp);
  transaction.zremrangebyscore("presence", "-inf", timestamp);
  const execute = promisify(transaction.exec).bind(transaction);
  try {
    const [ids] = await execute();
    if (!ids.length) return { expired: 0 };
    // Keep the count now: splice() below empties the ids array
    const expired = ids.length;
    // putEvents is limited to 10 events per call
    const promises = [];
    while (ids.length) {
      const Entries = ids.splice(0, 10).map((id) => {
        return {
          Detail: JSON.stringify({ id }),
          DetailType: "presence.disconnected",
          Source: "api.presence",
          EventBusName: eventBus,
          // Pass a Date object as the event timestamp
          Time: new Date()
        };
      });
      promises.push(eventBridge.putEvents({ Entries }).promise());
    }
    await Promise.all(promises);
    return { expired };
  } catch (error) {
    // Log and rethrow so that the invocation is reported as failed
    console.error(error);
    throw error;
  }
}
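
The handler above builds its batches with splice, which empties the ids array as it goes. As an alternative sketch, a small non-mutating helper (chunk is a hypothetical name) produces the same batches of at most 10 entries while leaving the original list available, for example to report how many connections expired.

```javascript
// Split an array into batches of at most `size` items without modifying it.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const ids = Array.from({ length: 23 }, (_, i) => `player-${i}`);
const batches = chunk(ids, 10);
console.log(batches.map((b) => b.length)); // [ 10, 10, 3 ]
console.log(ids.length);                   // 23, the input is untouched
```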

Deploying the sample

You can retrieve the full source code from samples on GitHub. More details and deployment instructions are included in the README file. The repository deploys the event version of the architecture.

Conclusion

If you already have an AWS AppSync-based API for your backend, you can easily add a Presence API to it with a set of simple Lambda functions and a Redis cluster. The simple version of the API can be used if there is no need to connect it with, or decouple it from, your existing services. The event-based version lets other backend services hook into presence events by registering an additional target on the existing rule. As there are many different types of rule targets, including Amazon EC2 instances, Amazon ECS tasks, and Amazon API Gateway REST API endpoints, the event-based version could also be used to extend other kinds of existing backends. Lastly, Amazon EventBridge has become even more useful by introducing a feature to archive and replay events (for example, to replay a series of events to debug or review interactions between players).