re:Cap part one – open source at re:Invent 2019
As the dust settles after another re:Invent closes, I wanted to put together a quick summary of all the open source-related announcements that happened in the run up to this year’s re:Invent and the week itself. If you are interested in open source in mobile web development, devops, containers, security, big data and data analytics, machine learning, databases, emerging technologies and more, you can read about all the announcements and catch up on sessions from re:Invent, and then take a look at the workshops in case want to try them out from the comfort of your own keyboard.
This is part one of three, in which we’ll be covering all things data, analytics and machine learning. Part two will cover mobile web development and DevOps and part three will cover compute and emerging technologies such as robotics and blockchain, as well as all other areas of open source including Java.
Big data, data analytics, and databases
One of the key announcements that we made during re:Invent was the New Amazon managed Apache Cassandra service. Above is the announcement made during Andy’s keynote. You can also read Matt Assay’s post on how we are Contributing to Apache Cassandra community.
There were also some great announcements prior to re:Invent:
- Athena Federated queries (Query any data source with Amazon Athenas new federated query) – we have open sourced the connectors (you can find the GitHub repo at https://github.com/awslabs/aws–athena–query–federation) so customers can contribute and create their own. Also check out a deep dive video on Amazon Athena Federated Query.
- We announced support for Apache Hudi in Amazon EMR. Hudi is a popular open source project that helps address use cases which need incremental data processing, such as inserts, updates, or deletes (for example, when complying with data privacy regulations where a user invokes their right to be forgotten). Read more in New, insert, update,delete data on S3 with Amazon EMR and Apache HUDI.
- There were a couple of interesting Apache Kafka–related announcements. First, you can now run fully managed Apache Flink applications with Apache Kafka. You can also monitor your MSK cluster with Prometheus, an open–source monitoring system for time-series metric data. You can also use tools compatible with Prometheus-formatted metrics or tools that integrate with Amazon MSK Open Monitoring, such as Datadog, Lenses, New Relic, and Sumo Logic. Read more in Open Monitoring with Prometheus in the Amazon MSK documentation.
Here’s a selection of related sessions:
- ANT206–Leadership session on analytics and data lakes by Andi Gutmans covered many of the open source technologies that customers can use, including some of the new services and features announced during pre:Invent. Well worth starting off with this!
- OPN207–One query language for all your data introduces PartiQL and how it is used in several AWS services such as Amazon Redshift, Amazon S3 Select, and Amazon QLDB. This session also covers the open source PartiQL project and how you can get involved.
- With the recent addition of Apache Hudi in EMR, ANT239–Insert, update and delete data in Amazon S3 using Amazon EMR covers common use cases that drove the creation of this project, and how you can get started.
- If you’re looking to go deeper, ANT308–A deep dive into running Apache Spark on Amazon EMR is for you.
- There were plenty of sessions and workshops on Open Distro for Elasticsearch. As security is job zero, why not start with OPN204–Securing your Open Distro for Elasticsearch cluster and then check out the workshops below.
- In ANT309–Responding to customer needs in real time with Amazon MSK, learn about the foundations of real–time analytics and find out how Adobe are using Amazon MSK in their Adobe Experience Platform.
- There were some great sessions covering open source database technologies. In the DAT209–Leadership session: AWS purpose built databases, the second part of the session does a deeper dive and demo of Amazon Managed Apache Cassandra.
- If you want to know more about MySQL in AWS Aurora, this will sort you out: DAT316–MySQL; self managed, managed and serverless. If PostgresSQL is more your thing, there’s more open source database goodness in this familiar–sounding session: DAT317–PostgresSQL; self managed, managed and serverless. If you want to go deeper with that, check out DAT328–Deep dive with Amazon Aurora with PostgreSQL.
- OPN302 – Open Distro for Elasticsearch
- ARC316 – Deploy and monitor a serverless application
- ANT346 – Know your data with machine learning in Open Distro for Elasticsearch
- ANT303 – Have your front end and monitor it too
The AWS machine learning stack is the broadest and deepest toolkit you can provide your data scientists and web developers, and we made many announcements during re:Invent. Here I will cover the open source-related
- First up was Amazon Sagemaker operators for Kubernetes, allowing you to kick off machine learning workloads on Kubernetes, adding Amazon SageMaker as a custom resource. Read more, including some detailed examples, in Introducing Amazon Sagemaker operators for Kubernetes and then check out the code in the GitHub repo https://github.com/aws/amazon–sagemaker–operator–for–k8s
- Netflix also announced the open sourcing of a new project, Metaflow, a human–centric framework (Python library) for data science that has been battle tested within Netflix against hundreds of data science projects. In the post, Netflix explain how they partnered with AWS to provide a seamless integration between Metaflow and various AWS services.
- We announced Deep Java Library (DJL), an open source library to develop Deep Learning models in Java. The Deep Java Library home page includes links to the GitHub repo where you will find demo code and examples. This announcement was quickly followed by the Deep Graph Library (DGL) , a Python package built for easy implementation of graph neural network model families, on top of existing deep learning frameworks such as PyTorch, MXNet, Gluon, etc.
- Finally, TensorFlow 1.15 is now supported on the Deep Learning AMIs, Deep Learning containers and Amazon SageMaker, and TensorFlow 2.0 is available on the Deep Learning AMIs (and watch this space for it coming to containers and Amazon SageMaker, too)
From machine learning frameworks to running machine learning on containers and using open source tools, here are our picks:
- Going beyond simple “Hello world” examples, the ADM302-End to End machine learning using Spark and Amazon Sagemaker session went on to cover how to create environments for machine learning engineers so they can prototype and explore with TensorFlow before executing in distributed systems using Spark and Amazon SageMaker. The session goes into detail on how to productionize models and deployment.
- If you’re interested in TensorFlow, you’ll want to see this session on scaling TensorFlow and Sagemaker workloads: AIM410–Deep learning applications with Tensorflow, featuring Mobileye, and a repeat of this with a different customer in AIM410-Deep Learning applications with Tensorflow, featuring Fanny Mae.
- Learn from the PyTorch team the latest features and library releases in AIM412–Deep learning applications using PyTorch, featuring Freshworks, or the repeated session with a different customer AIM412-Deep learning applications using PyTorch featuring Autodesk.
- There were some great sessions on running machine learning workloads with containers. Check out CON306–Building machine learning infrastructure on Amazon EKS with kubeflow as well as AIM326–Implementing ML workloads on Kubernetes with Amazon Sagemaker.
Keep up to date with open source at AWS
I hope this summary has been useful. I have looked for all the session videos that have been uploaded to date, but if I have missed anything, please get in touch and I will update this summary. Remember to check out the Open Source homepage to keep up to date with all our activity in open source by following us on Twitter @AWSOpen.