AWS Big Data Blog
Category: Analytics
New features from Apache Hudi available in Amazon EMR
Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, and delete capabilities. This record-level capability is helpful if you’re building your data lakes on Amazon S3 or HDFS. You can use it to comply with data privacy regulations and simplify data […]
Top 9 performance tuning tips for PrestoDB on Amazon EMR
Presto is a popular distributed SQL query engine for interactive data analytics. With its massively parallel processing (MPP) architecture, it’s capable of directly querying large datasets without the need for time-consuming and costly ETL processes. With a properly tuned Presto cluster, you can run fast queries against big data with response times ranging from subsecond […]
Ingest Salesforce data into Amazon S3 using the CData JDBC custom connector with AWS Glue
Organizations that successfully generate business value from their data will outperform their peers. Many AWS customers require a data storage and analytics solution that combines the prospect information stored in Salesforce, a popular and widely used customer relationship management (CRM) platform, with other structured and unstructured data in their data lake to innovate and build […]
Export and import Kibana dashboards with Amazon ES
September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. Kibana is a popular open-source visualization tool designed to work with Elasticsearch. Amazon OpenSearch Service provides an installation of Kibana with every Amazon OpenSearch Service domain. Users of Kibana can create visualizations and add them to a dashboard. As organizations […]
Build a DataOps platform to break silos between engineers and analysts
Organizations across the globe are striving to provide a better service to internal and external stakeholders by enabling various divisions across the enterprise, like customer success, marketing, and finance, to make data-driven decisions. Data teams are the key enablers in this process, and usually consist of multiple roles, such as data engineers and analysts. However, […]
Build a data lake using Amazon Kinesis Data Streams for Amazon DynamoDB and Apache Hudi
Amazon DynamoDB helps you capture high-velocity data such as clickstream data to form customized user profiles and online order transaction data to develop customer order fulfillment applications, improve customer satisfaction, and get insights into sales revenue to create a promotional offer for the customer. It’s essential to store these data points in a centralized data […]
Amazon EMR 2020 year in review
Tens of thousands of customers use Amazon EMR to run big data analytics applications on Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto at scale. Amazon EMR automates the provisioning and scaling of these frameworks, and delivers high performance at low cost with optimized runtimes and support for a wide range […]
Effective data lakes using AWS Lake Formation, Part 1: Getting started with governed tables
February 2023: The content of this blog post can now be found in the AWS Lake Formation public documentation. Please refer to it instead. Thousands of customers are building their data lakes on Amazon Simple Storage Service (Amazon S3). You can use AWS Lake Formation to build your data lakes easily—in a matter of days […]
Run usage analytics on Amazon QuickSight using AWS CloudTrail
Amazon QuickSight is a cloud-native BI service that allows end users to create and publish dashboards in minutes, without provisioning any servers or requiring complex licensing. You can view these dashboards on the QuickSight product console or embed them into applications and websites. After you deploy a dashboard, it’s important to assess how they and […]
Retaining data streams up to one year with Amazon Kinesis Data Streams
Streaming data is used extensively for use cases like sharing data between applications, streaming ETL (extract, transform, and load), real-time analytics, processing data from internet of things (IoT) devices, application monitoring, fraud detection, live leaderboards, and more. Typically, data streams are stored for short durations of time before being loaded into a permanent data store […]