AWS Big Data Blog

Category: Advanced (300)

Build Spark Structured Streaming applications with the open source connector for Amazon Kinesis Data Streams

Apache Spark is a powerful big data engine used for large-scale data analytics. Its in-memory computing makes it great for iterative algorithms and interactive queries. You can use Apache Spark to process streaming data from a variety of streaming sources, including Amazon Kinesis Data Streams for use cases like clickstream analysis, fraud detection, and more. Kinesis Data Streams is a serverless streaming data service that makes it straightforward to capture, process, and store data streams at any scale.

With the new open source Amazon Kinesis Data Streams Connector for Spark Structured Streaming, you can use the newer Spark Data Sources API. It also supports enhanced fan-out for dedicated read throughput and faster stream processing. In this post, we deep dive into the internal details of the connector and show you how to use it to consume and produce records from and to Kinesis Data Streams using Amazon EMR.

Achieve peak performance and boost scalability using multiple Amazon Redshift serverless workgroups and Network Load Balancer

As data analytics use cases grow, factors of scalability and concurrency become crucial for businesses. Your analytic solution architecture should be able to handle large data volumes at high concurrency and without compromising speed, thereby delivering a scalable high-performance analytics environment. Amazon Redshift Serverless provides a fully managed, petabyte-scale, auto scaling cloud data warehouse to […]

Governing data in relational databases using Amazon DataZone

Data governance is a key enabler for teams adopting a data-driven culture and operational model to drive innovation with data. Amazon DataZone is a fully managed data management service that makes it faster and easier for customers to catalog, discover, share, and govern data stored across Amazon Web Services (AWS), on premises, and on third-party […]

Dive deep into security management: The Data on EKS Platform

The construction of big data applications based on open source software has become increasingly uncomplicated since the advent of projects like Data on EKS, an open source project from AWS to provide blueprints for building data and machine learning (ML) applications on Amazon Elastic Kubernetes Service (Amazon EKS). In the realm of big data, securing […]

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

To enable your workforce users for analytics with fine-grained data access controls and audit data access, you might have to create multiple AWS Identity and Access Management (IAM) roles with different data permissions and map the workforce users to one of those roles. Multiple users are often mapped to the same role where they need […]

Dynamic DAG generation with YAML and DAG Factory in Amazon MWAA

Amazon Managed Workflow for Apache Airflow (Amazon MWAA) is a managed service that allows you to use a familiar Apache Airflow environment with improved scalability, availability, and security to enhance and scale your business workflows without the operational burden of managing the underlying infrastructure. In Airflow, Directed Acyclic Graphs (DAGs) are defined as Python code. […]

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

Amazon OpenSearch Service recently introduced the OpenSearch Optimized Instance family (OR1), which delivers up to 30% price-performance improvement over existing memory optimized instances in internal benchmarks, and uses Amazon Simple Storage Service (Amazon S3) to provide 11 9s of durability. With this new instance family, OpenSearch Service uses OpenSearch innovation and AWS technologies to reimagine […]

bdb-3883-image001

Achieve near real time operational analytics using Amazon Aurora PostgreSQL zero-ETL integration with Amazon Redshift

Our zero-ETL integration with Amazon Redshift facilitates point-to-point data movement to get it ready for analytics, artificial intelligence (AI) and machine learning (ML) using Amazon Redshift on petabytes of data. In this post, we provide step-by-step guidance on how to get started with near real time operational analytics using the Amazon Aurora PostgreSQL zero-ETL integration with Amazon Redshift.

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. This information empowers end-users to make informed decisions as to whether or not to use specific assets. In this post, we discuss the latest features of Amazon DataZone for data quality, the integration between Amazon DataZone and AWS Glue Data Quality and how you can import data quality scores produced by external systems into Amazon DataZone via API.

Introducing enhanced functionality for worker configuration management in Amazon MSK Connect

Amazon MSK Connect is a fully managed service for Apache Kafka Connect. With a few clicks, MSK Connect allows you to deploy connectors that move data between Apache Kafka and external systems. MSK Connect now supports the ability to delete MSK Connect worker configurations, tag resources, and manage worker configurations and custom plugins using AWS […]