AWS Partner Network (APN) Blog

Category: Amazon EMR

Bigstream Provides Big Data Acceleration with Apache Spark and Amazon EMR

Apache Spark and its parallel processing framework, along with the ease of scaling up in public clouds, have pushed out the limits for data analytics. Learn how Bigstream addresses growing Spark needs, with software that optimizes existing CPU infrastructure and can also seamlessly incorporate advanced programmable hardware. With the same number of servers, Bigstream can accelerate Spark clusters 3x with software alone and 10x when introducing FPGAs.

Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS

Developing and maintaining an on-premises data lake is a complex undertaking. To maximize the value of data and use it as the basis for critical decisions, the data platform must be flexible and cost-effective. Learn how to build a hybrid data lake with Alluxio to leverage analytics and AI on AWS alongside a multi-petabyte on-premises data lake. Alluxio’s solution is called “zero-copy” hybrid cloud, indicating a cloud migration approach without first copying data to Amazon S3.

How nClouds Helps Accelerate Data Delivery with Apache Hudi on Amazon EMR

Apache Hudi on Amazon EMR is an ideal solution for large-scale and near real-time applications that require incremental data pipelines and processing. This post provides a step-by-step method to perform a proof of concept for Apache Hudi on Amazon EMR. Learn how a non-customer-facing PoC solution from nClouds set up a new data and analytics platform using Apache Hudi on Amazon EMR and other managed services, including Amazon QuickSight for data visualization.

How SnapLogic eXtreme Helps Visualize Spark ETL Pipelines on Amazon EMR

Fully managed cloud services enable global enterprises to focus on strategic differentiators rather than on maintaining infrastructure. They do this by creating data lakes and performing big data processing in the cloud. SnapLogic eXtreme allows citizen integrators (those who can’t code) and data integrators to efficiently support and augment data-integration use cases by performing complex transformations on large volumes of data. Learn how to set up SnapLogic eXtreme and use Amazon EMR to do Amazon Redshift ETL.

Implementing SAML AuthN for Amazon EMR Using Okta and Column-Level AuthZ with AWS Lake Formation

As organizations continue to build data lakes on AWS and adopt Amazon EMR, especially when consuming data at enterprise scale, it’s critical to govern your data lakes by establishing federated access and having fine-grained controls to access your data. Learn how to implement SAML-based authentication (AuthN) using Okta for Amazon EMR, querying data using Zeppelin notebooks, and applying column-level authorization (AuthZ) using AWS Lake Formation.

Training Multiple Machine Learning Models Simultaneously Using Spark and Apache Arrow

Spark is a distributed computing framework that has added new features such as Pandas UDFs, which are built on PyArrow. You can leverage Spark for distributed and advanced machine learning model lifecycle capabilities to build massive-scale products with many models in production. Learn how Perion Network implemented a model lifecycle capability to distribute the training and testing stages with a few lines of PySpark code. This capability improved the performance and accuracy of Perion’s ML models.

Lower TCO and Increase Query Performance by Running Hive on Spark in Amazon EMR

Learn how Mactores helped Seagate Technology to use Apache Hive on Apache Spark for queries larger than 10TB, combined with the use of transient Amazon EMR clusters leveraging Amazon EC2 Spot Instances. It was imperative for Seagate to have systems in place to ensure the cost of collecting, storing, and processing data did not exceed their ROI. Moving to Hive on Spark enabled Seagate to continue processing petabytes of data at scale with significantly lower TCO.

Optimizing Presto SQL on Amazon EMR to Deliver Faster Query Processing

Seagate asked Mactores Cognition to evaluate and deliver an alternative data platform to process petabytes of data with consistent performance. It needed to lower query processing time and total cost of ownership, and provide the scalability required to support about 2,000 daily users. Learn about the three migration options Mactores tested and the architecture of the solution Seagate selected. This effort improved the overall efficiency of Seagate’s Amazon EMR cluster and business operations.

How Musicians Use AWS to Go from Big Data to Their Big Break

Despite advances in music-sharing technology, touring and live shows remain staples for emerging artists. How do artists and promoters align the reach of streaming services with fans’ physical locations to better plan, promote, and manage live shows? To answer this question, Vertical Trail partnered with Gigable to create a music platform and app that uses curated playlists, geo-based concert discovery, and online ticketing to close the gap between artists and their fans—all while building a formidable big data asset.

AWS HIPAA Program Update – Removal of Dedicated Instance Requirement

By Aaron Friedman, Partner Solutions Architect at AWS focused on Healthcare and Life Sciences

I love working with Healthcare Competency Partners in the AWS Partner Network (APN) as they deliver solutions that meaningfully impact lives. Whether building SaaS solutions on AWS tackling problems like electronic health records, or offering platforms designed to achieve HIPAA compliance for customers, […]