How Amazon Uses Amazon AppStream 2.0 to Provide Data Scientists and Analysts with Access to Sensitive Data
On February 28, 2020, in response to the COVID-19 pandemic, Amazon announced that we had taken steps to protect the health of our employees and communities. These included canceling large events, moving stakeholder meetings online, and pausing tours of fulfillment centers.
As of this post, Amazon has invested more than $8 billion in COVID-19 safety measures.
To supplement these safety measures, Amazon had to forecast the spread and risk of COVID-19 at Amazon sites. Forecasting required the construction of interactive reports and machine learning models. That led Amazon to build a secure data lake to store highly sensitive data, along with a global-scale, resilient analytics environment.
The challenge with building such a data lake is its competing requirements. On the one hand you must secure, anonymize and isolate your data, while on the other you have to expose it to its intended consumers.
The architecture for this solution had to meet the following security requirements:
- All data must be stored in an isolated environment with no internet access.
- No one, including administrators of the environment, has direct access to raw data.
- Access is limited to analytic interfaces via IAM roles.
- Access is allowed only when connecting from a corporate device on the corporate network.
- Data can’t leave the isolated environment, including through copying and pasting, or printing.
- Comprehensive auditing of user activities.
To provide access to this environment, Amazon leaned towards Virtual Desktop Infrastructure (VDI) as a solution. With VDI, only the pixels of the data are streamed to the users, while the data itself never leaves the environment. This would allow for increased security, by isolating the working environment of Amazon’s data scientists and data analysts. In addition, it could increase performance by placing the tools closer to the data.
Amazon had to choose between building a solution on Amazon Elastic Compute Cloud (Amazon EC2) and using an AWS managed service such as Amazon AppStream 2.0 or Amazon WorkSpaces. In the end, we went with AppStream 2.0.
What is Amazon AppStream 2.0?
Amazon AppStream 2.0 is a fully managed non-persistent application and desktop streaming service. You centrally manage your desktop applications on AppStream 2.0 and securely deliver them to any computer. You can easily scale to any number of users across the globe without acquiring, provisioning, and operating hardware or infrastructure. AppStream 2.0 is built on AWS, so you benefit from a data center and network architecture designed for the most security-sensitive organizations.
Being a non-persistent solution, and with an image-based approach to managing application updates and operating system patches, AppStream 2.0 met our needs best. It enabled our administrators to curate the experience for our data scientists and data analysts, and reduced the effort involved in deploying updates.
Furthermore, AppStream 2.0’s ability to supply IAM role credentials through instance profiles enabled us to provide access to AWS services without distributing access keys and secret keys. This satisfied one of the important security requirements. In addition, its native auditing capabilities enabled Amazon to meet its security auditing requirements.
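As a rough illustration of what keyless access looks like from inside a streaming session: when an IAM role is attached to a fleet, AppStream 2.0 makes temporary credentials available on the instance through a credential profile named `appstream_machine_role`. The sketch below is not Amazon's internal code. The helper name `open_role_session` is hypothetical, and it takes the session factory as a parameter (for example, `boto3.Session`) so the snippet itself does not depend on an SDK being installed.

```python
# Credential profile that AppStream 2.0 creates on a streaming instance
# when an IAM role is attached to the fleet or image builder.
APPSTREAM_ROLE_PROFILE = "appstream_machine_role"


def open_role_session(session_factory, profile_name=APPSTREAM_ROLE_PROFILE):
    """Return an SDK session backed by the instance's temporary role credentials.

    `session_factory` is any callable that accepts `profile_name=...`,
    such as `boto3.Session`. No access key or secret key is passed here.
    """
    return session_factory(profile_name=profile_name)


# Example usage on a streaming instance (assumes boto3 is installed there):
#   import boto3
#   s3 = open_role_session(boto3.Session).client("s3")
```

Because the credentials are temporary and tied to the role, nothing long-lived has to be baked into the image or handed to users.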
And finally, the use of automatic scaling enabled Amazon to build a cost-effective solution, with the environment automatically scaled down when not in use.
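Scaling of this kind is configured for AppStream 2.0 fleets through Application Auto Scaling. The following is a minimal sketch, not the configuration Amazon used: the fleet name, capacity limits, policy name, and 75 percent utilization target are illustrative assumptions, and the function only builds the request parameters that would be passed to the `register_scalable_target` and `put_scaling_policy` calls.

```python
def fleet_scaling_requests(fleet_name, min_capacity, max_capacity, target_utilization):
    """Build Application Auto Scaling request parameters for an AppStream 2.0 fleet."""
    common = {
        "ServiceNamespace": "appstream",
        "ResourceId": f"fleet/{fleet_name}",
        "ScalableDimension": "appstream:fleet:DesiredCapacity",
    }
    # Bounds within which the fleet's desired capacity may move.
    target = dict(common, MinCapacity=min_capacity, MaxCapacity=max_capacity)
    # Target tracking adds or removes instances to hold utilization near the
    # target, so capacity shrinks toward MinCapacity when few sessions are active.
    policy = dict(
        common,
        PolicyName=f"{fleet_name}-utilization-tracking",  # hypothetical name
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": float(target_utilization),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "AppStreamAverageCapacityUtilization"
            },
        },
    )
    return target, policy


# Illustrative values only.
target, policy = fleet_scaling_requests("analytics-fleet", 1, 20, 75)
```

With boto3, these dictionaries could be passed as `client.register_scalable_target(**target)` and `client.put_scaling_policy(**policy)` on an `application-autoscaling` client.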
Once we settled on AppStream 2.0, Amazon was able to sketch out and validate this solution in about a week. Building a VDI environment on EC2 would have taken considerably longer. We would have had to build the servers and streaming gateways, weigh the benefits and drawbacks of third-party VDI solutions, and manage the environment ourselves.
Amazon’s use of AppStream 2.0 for our COVID-19 project provides our data scientists and data analysts access to isolated data in a secure manner. It ensures that our users are able to access the data only when connected to our corporate network. AppStream 2.0 also provides our administrators with the ability to disable file transfer, printing, or copying and pasting, which prevents a user from bringing the highly sensitive data to their local machine.
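These restrictions correspond to the user settings on an AppStream 2.0 stack. As a hedged sketch (the stack name in the usage comment is a placeholder, and this is not Amazon's actual configuration), the following builds a `UserSettings` list that disables every clipboard, file transfer, and printing action, in the shape accepted by the `CreateStack` and `UpdateStack` APIs.

```python
# Data-movement actions defined by the AppStream 2.0 stack API.
DATA_TRANSFER_ACTIONS = (
    "CLIPBOARD_COPY_FROM_LOCAL_DEVICE",
    "CLIPBOARD_COPY_TO_LOCAL_DEVICE",
    "FILE_UPLOAD",
    "FILE_DOWNLOAD",
    "PRINTING_TO_LOCAL_DEVICE",
)


def locked_down_user_settings():
    """Return stack UserSettings with every data-movement action disabled."""
    return [
        {"Action": action, "Permission": "DISABLED"}
        for action in DATA_TRANSFER_ACTIONS
    ]


settings = locked_down_user_settings()

# Example usage (assumes boto3 and a hypothetical stack name):
#   import boto3
#   boto3.client("appstream").update_stack(Name="example-stack", UserSettings=settings)
```

Listing each action explicitly, rather than relying on defaults, makes the intended lockdown visible in code review and in audit.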
The following is an abbreviated diagram of this solution.
To implement this solution, the project team used Amazon’s internal SAML provider as the entry point to AppStream 2.0. This allowed us to prevent users from accessing the environment outside of the corporate network, in addition to requiring multi-factor authentication before granting them access.
AppStream 2.0’s use of the Microsoft Windows operating system enables us to provide our data analysts with familiar applications. This avoids rework and retraining on different tools. For example, a data analyst can run an Amazon Redshift query via an ODBC driver, then transpose and analyze the data for reports needed by leadership.
Using an Amazon SageMaker environment, our data scientists are able to use Jupyter Notebooks in this secure environment, and are unable to copy out notebooks or cells. This feature allows the data scientists to build machine learning models using the sensitive data sources around COVID-19 without being able to transfer the data to their local device.
AppStream 2.0’s APIs are used in AWS Lambda functions to check the user’s session identifier for spoofing. This functionality creates another layer of user validation and data leakage prevention. It ensures that the user granted access to the data is who they say they are.
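The post does not describe the Lambda logic, so the following is only one plausible shape for such a check, not Amazon's implementation. It assumes the function receives a claimed session ID and user ID from a caller, and that the current sessions were fetched with the AppStream 2.0 `DescribeSessions` API, whose session records include `Id`, `UserId`, and `State` fields. The function name and sample records are hypothetical.

```python
def session_belongs_to_user(sessions, session_id, user_id):
    """Return True only if an active session with this ID is owned by this user.

    `sessions` is a list of session records shaped like the `Sessions`
    list in a DescribeSessions response.
    """
    for session in sessions:
        if session.get("Id") != session_id:
            continue
        # A matching ID is not enough: the owner and state must also agree.
        return session.get("UserId") == user_id and session.get("State") == "ACTIVE"
    # Unknown session IDs are rejected rather than trusted.
    return False


# Made-up sample records for illustration.
sample_sessions = [
    {"Id": "session-1", "UserId": "analyst-a", "State": "ACTIVE"},
    {"Id": "session-2", "UserId": "analyst-b", "State": "EXPIRED"},
]
```

In this sketch, a request that presents `session-1` on behalf of `analyst-b` is refused, as is any request that presents an expired or unknown session ID.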
How do I Audit this Environment?
As mentioned, one of the requirements is comprehensive auditing of our environment. We want to know who enters the environment, and who touched which data. If there is a security event within the AppStream 2.0 environment, our security team has full traceability on all activities.
To accomplish that, we ingest AWS CloudTrail logs and AppStream 2.0 usage reports into a central logging location. This enables our security analysts and incident response teams to determine exactly who logged on to our environment, and every action taken during their session.
The following is an example of logs we collect.
First, we collect the logon event from CloudTrail, which gives us the user ID of the user who logged on. We then collect the Amazon S3 put event from CloudTrail, which gives us the IP address of the AppStream 2.0 instance. Finally, we collect the AppStream 2.0 usage reports, which give us both the IP address of the AppStream 2.0 instance and the user ID. Matching on the IP address lets us tie each Amazon S3 activity to the user ID whose session was running on that instance at that time.
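This correlation can be sketched as a join on instance IP address and time. The example below is illustrative, with made-up records, and is not Amazon's tooling. It assumes CloudTrail events carry `eventTime`, `eventName`, and `sourceIPAddress`, that usage report rows carry `user_id`, `eni_private_ip_address`, `session_start_time`, and `session_end_time`, and that all timestamps are ISO 8601 in UTC.

```python
from datetime import datetime


def parse_utc(timestamp):
    """Parse an ISO 8601 timestamp such as 2020-06-01T12:05:00Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def attribute_events(cloudtrail_events, usage_sessions):
    """Attach a user ID to each event whose source IP and time match a session."""
    attributed = []
    for event in cloudtrail_events:
        when = parse_utc(event["eventTime"])
        user_id = None  # stays None if no session explains the event
        for row in usage_sessions:
            same_instance = row["eni_private_ip_address"] == event["sourceIPAddress"]
            start = parse_utc(row["session_start_time"])
            end = parse_utc(row["session_end_time"])
            if same_instance and start <= when <= end:
                user_id = row["user_id"]
                break
        attributed.append(
            {"eventName": event["eventName"], "eventTime": event["eventTime"], "user_id": user_id}
        )
    return attributed


# Made-up sample data for illustration.
events = [
    {"eventName": "PutObject", "eventTime": "2020-06-01T12:05:00Z", "sourceIPAddress": "10.0.1.25"},
    {"eventName": "PutObject", "eventTime": "2020-06-01T15:00:00Z", "sourceIPAddress": "10.0.1.25"},
]
sessions = [
    {
        "user_id": "analyst-a",
        "eni_private_ip_address": "10.0.1.25",
        "session_start_time": "2020-06-01T12:00:00Z",
        "session_end_time": "2020-06-01T13:00:00Z",
    },
]
result = attribute_events(events, sessions)
```

The time window matters because streaming instances are non-persistent, so the same IP address can later belong to a different user's session. In the sample data, the first event is attributed to `analyst-a`, while the second falls outside any session and is left unattributed for follow-up.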
To assist with its COVID-19 mitigation efforts, Amazon built a secure data lake where it ingests, curates, and analyzes highly sensitive data related to COVID-19. To provide its data scientists and data analysts with access to that data, Amazon realized it needed to build a VDI environment.
Amazon chose AppStream 2.0 as its VDI solution, instead of using EC2 to build its own. This enabled us to move fast, as we didn’t have to spend time building and managing infrastructure. With AppStream 2.0, we were also able to ensure that our users access the data only when connecting from the corporate environment, and that all activities are audited and traceable. Furthermore, AppStream 2.0 let us provide our data scientists and data analysts with a better-performing and more consistent user experience than direct access from their corporate laptops.
As Amazon continues its COVID-19 mitigation and prevention efforts, AppStream 2.0 continues to play an integral role in providing secure access to sensitive data.
For information on how to implement such a solution, visit our technical blog.
About the Authors
As a Sr. Cloud Architect at AWS, Chaim works with large enterprise customers, helping them create innovative solutions to address their cloud challenges. Chaim is passionate about his work, enjoys the creativity that goes into building solutions in the cloud, and derives pleasure from passing on his knowledge. In his spare time, he enjoys outdoor activities, spending time in nature, and immersing himself in his books.
As a Data and Machine Learning Engineer, JD helps organizations design and implement modern data architectures to deliver value to their internal and external customers. In his free time, he enjoys exploring Minneapolis with his fiancée and black lab.