AWS Big Data Blog

Category: Storage

The following graph shows that the minimum throughput achieved with the persistent HFile

Amazon EMR 6.2.0 adds persistent HFile tracking to improve performance with HBase on Amazon S3

Apache HBase is an open-source, NoSQL database that you can use to achieve low latency random access to billions of rows. Starting with Amazon EMR 5.2.0, you can enable HBase on Amazon Simple Storage Service (Amazon S3). With HBase on Amazon S3, the HBase data files (HFiles) are written to Amazon S3, enabling data lake […]

Read More
In the navigation name, choose Marketplace and search for Salesforce.

Ingest Salesforce data into Amazon S3 using the CData JDBC custom connector with AWS Glue

Organizations that successfully generate business value from their data will outperform their peers. Many AWS customers require a data storage and analytics solution that combines the prospect information stored in Salesforce, a popular and widely used customer relationship management (CRM) platform, with other structured and unstructured data in their data lake to innovate and build […]

Read More
The following diagram shows the flow of our solution.

Integrating Datadog data with AWS using Amazon AppFlow for intelligent monitoring

Infrastructure and operation teams are often challenged with getting a full view into their IT environments to do monitoring and troubleshooting. New monitoring technologies are needed to provide an integrated view of all components of an IT infrastructure and application system. Datadog provides intelligent application and service monitoring by bringing together data from servers, databases, […]

Read More
The following diagram shows the solution architecture for the Vertica custom connector when deployed to AWS.

Querying a Vertica data source in Amazon Athena using the Athena Federated Query SDK

The ability to query data and perform ad hoc analysis across multiple platforms and data stores with a single tool brings immense value to the big data analytical arena. As organizations build out data lakes with increasing volumes of data, there is a growing need to combine that data with large amounts of data in […]

Read More
In the following tree diagram, we’ve outlined what the bucket path may look like as logs are delivered to your S3 bucket

Automating AWS service logs table creation and querying them with Amazon Athena

I was working with a customer who was just getting started using AWS, and they wanted to understand how to query their AWS service logs that were being delivered to Amazon Simple Storage Service (Amazon S3). I introduced them to Amazon Athena, a serverless, interactive query service that allows you to easily analyze data in […]

Read More

Building a cost efficient, petabyte-scale lake house with Amazon S3 lifecycle rules and Amazon Redshift Spectrum: Part 2

In part 1 of this series, we demonstrated building an end-to-end data lifecycle management system integrated with a data lake house implemented on Amazon Simple Storage Service (Amazon S3) with Amazon Redshift and Amazon Redshift Spectrum. In this post, we address the ongoing operation of the solution we built. Data ageing process after a month […]

Read More

Building a cost efficient, petabyte-scale lake house with Amazon S3 lifecycle rules and Amazon Redshift Spectrum: Part 1

The continuous growth of data volumes combined with requirements to implement long-term retention (typically due to specific industry regulations) puts pressure on the storage costs of data warehouse solutions, even for cloud native data warehouse services such as Amazon Redshift. The introduction of the new Amazon Redshift RA3 node types helped in decoupling compute from […]

Read More
The following diagram shows the workflow to connect Apache Airflow to Amazon EMR.

Dream11’s journey to building their Data Highway on AWS

This is a guest post co-authored by Pradip Thoke of Dream11. In their own words, “Dream11, the flagship brand of Dream Sports, is India’s biggest fantasy sports platform, with more than 100 million users. We have infused the latest technologies of analytics, machine learning, social networks, and media technologies to enhance our users’ experience. Dream11 […]

Read More

How FanDuel Group secures personally identifiable information in a data lake using AWS Lake Formation

This post is co-written with Damian Grech from FanDuel FanDuel Group is an innovative sports-tech entertainment company that is changing the way consumers engage with their favorite sports, teams, and leagues. The premier gaming destination in the US, FanDuel Group consists of a portfolio of leading brands across gaming, sports betting, daily fantasy sports, advance-deposit […]

Read More

Working with timestamp with time zone in your Amazon S3-based data lake

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can use the purpose-built analytics services for a range of use cases, from analyzing petabyte-scale datasets to querying the metadata of a single object. AWS analytics services support open file formats such as Parquet, ORC, JSON, Avro, CSV, and more, so it’s […]

Read More