Delta Sharing on AWS
This post was written by Frank Munz, Staff Developer Advocate at Databricks.
An introduction to Delta Sharing
During the past decade, much thought went into system and application architectures using domain-driven design and microservices, but we are still on the verge of building distributed data meshes. Such data meshes are based on two fundamental principles: centrally governing data and sharing data.
Data sharing itself has been severely limited. Solutions were tied to a single vendor, did not work for live data, couldn’t access multiple data lakes, came with severe security issues, and not scaled to the bandwidth of modern cloud object stores.
In this article, I’ll introduce Delta Sharing. Delta Sharing is a Linux Foundation open source framework that uses an open protocol to secure the real-time exchange of large datasets and enables secure data sharing across products for the first time. Delta Sharing directly leverages modern cloud object stores, such as Amazon Simple Storage Service (Amazon S3), to access large datasets reliably. (To explore the Delta Sharing open source project, go to delta.io/sharing.)
Two parties are involved: data providers, which is the Delta Sharing server; and recipients, which could be a pandas data frame or Apache Spark client.
The underlying REST protocol is open, with documentation available on GitHub.
The data provider decides what data to share and runs a sharing server. Based on Delta Lake, an open source project that provides reliability on top Amazon S3 data lakes, data can be shared as logical tables. Sharing live data that may consist of thousands of underlying objects in Amazon S3 as a table is a key differentiating factor of Delta Sharing. Also, unlike files copied over to an FTP server, a Delta table can be updated. Such updates are transactional, so the user will never read inconsistent data.
Because the Delta Sharing protocol is based on a simple REST API and Delta Lake uses Apache Parquet as its internal storage format, additional Delta Sharing adapters can be developed easily. We encourage and support community projects extending Delta Sharing (refer to the section “Join the community” for information on how to contribute).
Clients always read the latest version of the data. A client can also provide filters on the data based on partitioning to read only a subset of the data or limit the number of results it wants to retrieve for large tables. The server then verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back.
Most open source clients will use pandas or Apache Spark to access shared data, where connecting to the server and retrieving information is straightforward.
The server generates short-lived presigned URLs that allow the client to read the data directly from the cloud provider. The transfer can happen in parallel at the massive bandwidth of Amazon S3, without data streaming through the sharing server as a bottleneck. Note that this architecture also does not require the recipient to authenticate against Amazon S3.
Based on the Delta Sharing open protocol, several companies announced extended support for their products, such as Tableau, Qlik, Power BI, or Looker.
Reading shared data
Let’s get started with the most straightforward use case—accessing shared data.
Databricks is hosting a reference implementation of a Delta Sharing server. As a quick way to get started, use your favorite client to verify that you can read data from the Databricks server. (Later in this blog post, you will learn how to share your own data.)
I am a fan of minimum viable products (MVPs) and getting started with the smallest working example, so let’s kick it off with an MVP. In our case, this would be a Python program retrieving data from Databrick’s hosted sharing server.
Standalone Python in a virtual environment
You could begin installing Python 3.6+ on a local machine, creating a virtual environment using
virtualenv to avoid messing with your local installation, and then adding the Delta sharing library to the environment to use later with your Python code.
Or you could use the Delta Sharing connector interactively within pyspark, Spark’s Python shell. This simplifies it even more since the required packages will then be downloaded automatically. The following command line will start pyspark with Delta Sharing:
Other languages such as Scala, Java, R, and SQL are supported as well. Check the documentation for further details.
Of course, you could even run the Python code in the AWS CloudShell, or any local IDE; however, there is an even more straightforward way. In the following example, I chose AWS Cloud9. Cloud9 is a cloud-native IDE that features code completion and syntax highlighting and only costs pennies since its instance will shut itself down after a configurable idle time. With AWS Cloud9, you don’t have to fret about installing Python and the possible side effects on other applications when installing additional libraries.
A quick check in the terminal window of AWS Cloud9 will show that Python is already installed, but the
delta-sharing library is not:
So let’s install the required libraries for Delta Sharing with the pip command:
To run your receiver, copy the following code:
The code is using a
profile_file that contains the endpoint of the Delta Sharing server hosted at Databricks together with a bearer token that allows you to access the data.
Typically this file is managed and secured on the client-side. Because our first experiment with Delta Sharing is about reading data from the Databricks server, we can stick with the provided example
profile_file on GitHub and retrieve it via HTTP.
To get a better idea of the content and syntax of that file, display it from the built-in AWS Cloud9 terminal:
Now run the code using the run button. The output should look similar to the following screenshot.
When the program runs, the client accesses the server specified in the profile, authenticates with the bearer token, and creates a Delta Sharing client. Then the client loads the data from the server as a pandas data frame, and, finally, the client displays a filtered subset of the data. The following screenshot shows abbreviated output.
To replicate the preceding example without using AWS Cloud9 and only using open source, while still having a convenient and isolated environment for development, you can run the same example in a Jupyter notebook hosted on your laptop or an Amazon Elastic Compute Cloud (Amazon EC2) instance.
Self-managed Jupyter notebook
The high-level steps to get Jupyter working on Amazon EC2 or locally are as follows:
- Make sure you have a recent version of Python 3 installed. Jupyter requires Python 3.3 or later.
- Create a virtual environment using virtualenv (or the Python builtin venv) to separate the installations for this project from others. Then change to the new environment.
- PySpark requires Java 8. Make sure to properly set
JAVA_HOME. On Amazon EC2, I recommend installing Amazon Corretto, the Amazon no-cost distribution of OpenJDK.
- In the new environment, install PySpark:
pip install pyspark.
- Then install the Jupyter notebook:
pip install jupyter.
A Jupyter notebook with the preceding code pointing to the Databricks Delta Sharing server is available to download.
Managing Jupyter notebooks yourself—especially at an enterprise scale—isn’t an easy task because it requires resources and operational experience. Managed notebooks (such as Databricks on AWS, Amazon EMR, or Amazon SageMaker) can import the Jupyter notebook format and remove the heavy lifting of operating a notebook environment in the enterprise.
For machine learning, there is a multitude of options. MLflow is one of the most popular open source platforms to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
Managed machine learning platforms on AWS, such as Amazon SageMaker, can benefit from Delta Sharing in the following way: Typically such platforms load training and test data from Amazon S3 as files (refer to a SageMaker example on GitHub for more details). A single dataset can consist of several files, but machine learning engineers are working on a higher level of abstraction (for example, with pandas data frames).
Using Delta Sharing, you now can directly load the data as a pandas data frame with
load_as_pandas(), then continue with preprocessing, data cleaning, and feature engineering in pandas. Then, from within the same notebook, you can use SageMaker’s integrated machine learning environment with MLflow to train, validate, deploy, and monitor the model.
For larger datasets, in which processing with pandas is out of scope, data can be loaded as a Spark data frame using
load_as_spark(). Data cleaning and feature engineering with Apache Spark automatically scale across a cluster of machines, datasets larger than the main memory of the executor nodes can be processed, and Spot Instances to reduce the costs by up to 80% can be used.
Sharing data with Delta Sharing
Using Delta Sharing allows you to share data from one or many lakehouses. Shared data could be a target of streaming data inserts and be constantly updated. Still, you will get a consistent view of the live data thanks to the atomicity of the operations on the underlying Delta tables.
Host a Delta Sharing server
To host a Delta Sharing server, run it as a Docker container from the Docker hub or download the reference implementation of the server. Let’s first review running the prebuilt server.
Delta Sharing server
Download the prebuilt package
delta-sharing-server-x.y.z.zip with the reference implementation of the server from GitHub releases. Then use the packaged server example template file to create a
The Delta Sharing server is shown in the center of the following architecture diagram. The server’s conf.yaml file specifies the endpoint and port number the server is listening to, whether it uses a bearer token for authorization, and most importantly, the shares, schemas, and tables the server offers from your Delta Lake.
The notation for exporting data specifies the name of the share, the schema (database), and the table.
Figure 1: Delta Sharing architecture
In the following example, the server configuration file specifies the port
/delta-sharing for the endpoint the sharing server is listening to. A fictitious bearer token
123456 is used and car data from a Delta table stored in an Amazon S3 bucket with the name
deltafm2805/cars is exported as
Delta Sharing follows the lake-first approach. With Delta Lake, you can share data from an existing data lake as Delta tables. Note that there can be multiple listings of shares, schemas, and tables, and therefore, a single Delta Sharing server can share data from multiple lakehouses.
Access from Delta Sharing access to Amazon S3
There are various ways to configure access from the Delta Sharing server to Amazon S3. Creating an EC2 instance role that allows the server to access Amazon S3 avoids storing AWS access keys locally or in the metadata service.
Any Delta Sharing recipient (shown on the left in Figure 1) that supports pandas or Apache Spark can access the Delta Sharing server (shown in the center of the diagram). Technically the pandas and Spark libraries make use of the open REST protocol of Delta Sharing. Business intelligence tools vendors such as Tableau, Qlik, Power BI, and Looker build on the open protocol and extended their support for Delta Sharing.
As shown in Figure 1, a
profile_file defines the client settings. The respective client provides the bearer token from the file to the server endpoint (as described previously) to authenticate. Delta Sharing internally returns a list of presigned, short-lived Amazon S3 URLs to the recipient. Data is then directly retrieved from Amazon S3, in parallel, at the bandwidth of the cloud object store without Delta Sharing being a potential bottleneck for the throughput.
Delta Sharing with Docker
Running the Delta Sharing server as a Docker container is even easier. With Docker installed, no prior downloads are necessary. Simply start the Delta Sharing server with the following command:
Docker will retrieve the newest image from Docker hub if it is not available locally and run it as a container.
-p parameter maps the Delta Sharing server port to 9999 on the local machine in the preceding command. The
--mount parameter makes the
config.yaml from the local machine visible at the expected location inside the container.
Built-in sharing as a service
Nowadays, software architects want to build on open source cloud services. Doing so gives you platform independence and the fast innovation cycles of open source without the heavy lifting required to install, configure, integrate, operate, patch, and debug the deployment yourself.
A Databricks workspace on AWS allows you to import Jupyter notebooks and already integrates with AWS Glue, Amazon Redshift, Amazon SageMaker, and many other AWS services. Now, a Databricks account on AWS also comes with Delta Sharing as a built-in service. (For more information on Databricks on AWS, and how to integrate services such as Glue, Redshift, and SageMaker, visit databricks.com/aws).
Creating a share, adding a table to the share, creating a recipient, and assigning the recipient to the share are simple SQL commands in a notebook as shown in the following:
From an architectural perspective, it is interesting to see that operating a server and dealing with configuration files as described previously is all abstracted away and instead implemented as SQL commands for the end user.
Key takeaways in this article about Delta Sharing on AWS include:
- Delta Sharing provides a novel, platform-independent way of sharing massive amounts of live data. It follows a lake-first approach, in which your data remains on Amazon S3 in a lakehouse that unifies data, analytics, and AI.
- Data shared via Delta Sharing can be accessed with pandas, Apache Spark, standalone Python, and an increasing number of products, such as Tableau and Power BI.
- Using cloud services, shared data can be accessed from AWS Cloud9, or managed notebooks such as Amazon EMR, SageMaker, and Databricks.
- For sharing data, the prebuilt reference implementation of the server is also available as a Docker container.
- Databricks on AWS simplifies data and AI and includes a built-in sharing server.
Delta Sharing is young open source project, and we welcome your contributions and feedback. At the time of this writing, Delta Sharing 0.2.0 was released. Beyond sharing Delta tables, as explained in this article, the support for other formats such as Parquet files, machine learning models, database views, and support for further object stores are planned.
Join the community
We are curious to see how businesses will use Delta Sharing. Use cases are endless—from data as a service in medical research, to real-time public data feeds of environmental data lakes, or novel data monetization strategies in companies.
- Delta Sharing on GitHub
- Hackernoon article: Share large amounts of live data with Delta Sharing and Docker
- Databricks blog: Introducing Delta Sharing
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.