Building a Streaming Pipeline with Minimal Effort Using Amazon Kinesis and Talend
By Tamara Astakhova, Sr. Partner Solution Architect – AWS
By Monte Denehie, Global AWS Alliance Director – Talend
By Cameron Davie, Contributing Author
Starting under the moniker of Internet of Things (IoT), edge computing has expanded to many aspects of our business and personal lives. Smart meters, drones, and connected devices are all collecting and producing data at incredible rates.
It’s important to consolidate this data so it can be centralized, modernized, and analyzed in a single pipeline, giving multiple teams the flexibility to access the same data from a single location. For many applications, it’s equally important to feed data to that central location on a per-minute or even per-second basis to support time-critical decisions.
To collect massive amounts of time-critical data, setting up data streaming pipelines is essential. There are two core elements to data streaming pipelines: the streaming technology used to gather, collect, and process the data, and the orchestration feature to build, manage, and maintain the data streaming pipelines.
In this post, you’ll learn how integrating Talend with Amazon Kinesis, together with Talend’s support for Spark streaming, provides an accessible, no-code methodology for building Spark streaming pipelines on Amazon Web Services (AWS).
Talend, a Qlik company, is an AWS Specialization Partner and AWS Marketplace Seller with Competencies in both Data and Analytics and Migration and Modernization. Talend also has four Service Ready validations in Amazon Redshift, Amazon Relational Database Service (Amazon RDS), AWS PrivateLink, and AWS Outposts.
Integrating Talend with Amazon Kinesis
Amazon Kinesis is a fully managed, serverless streaming data service that helps simplify the capture, processing, and storage of data streams at scale. With Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning (ML), analytics, and other applications.
Integrating Talend with Amazon Kinesis gives you powerful tools for processing streaming data, analyzing it, and gaining insights from it in real time, which can help you make better decisions, improve operations, and gain a competitive advantage.
Here are some use cases that can be solved with Talend and Amazon Kinesis:
- Application log analytics: Can be used to monitor application performance, drive alerting and troubleshooting, or collect insights about user behavior. For example, a web application can use Kinesis to ingest access logs and analyze them for trends and patterns. This information helps identify performance bottlenecks, improve security, and personalize the user experience.
- Real-time website clickstream analytics: Can be used to understand user behavior, personalize the user experience, and improve website design. For example, a retail company can pull clickstream data from its website and analyze that data to determine popular pages, products, and user journeys. This information may be used to improve the website’s layout, content, and advertising.
- IoT telemetry data analytics: Can be used to detect anomalies, monitor device health, or optimize performance in real time. For example, a manufacturing company can receive telemetry data from its sensors and analyze this data to identify potential problems in the production process. This information can then be used to prevent downtime and improve product quality.
In addition to these specific use cases, the combination of Talend and Amazon Kinesis can be used for a wide variety of other applications that require real-time data ingestion and processing.
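As a rough illustration of the clickstream use case above, the following Python sketch counts page views per one-minute tumbling window. It is not how Talend or Spark implements windowing; the event fields (`ts`, `page`) and the sample data are assumptions made up for this example.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical clickstream events; the field names are assumptions for illustration.
EVENTS = [
    {"ts": "2024-01-01T12:00:05+00:00", "page": "/home"},
    {"ts": "2024-01-01T12:00:40+00:00", "page": "/product/42"},
    {"ts": "2024-01-01T12:00:59+00:00", "page": "/home"},
    {"ts": "2024-01-01T12:01:10+00:00", "page": "/home"},
]


def window_start(ts: str, window_seconds: int = 60) -> int:
    """Return the epoch second at which the tumbling window containing ts begins."""
    epoch = int(datetime.fromisoformat(ts).timestamp())
    return epoch - (epoch % window_seconds)


def page_views_per_window(events, window_seconds: int = 60):
    """Count page views per window start, then per page."""
    counts = defaultdict(Counter)
    for event in events:
        counts[window_start(event["ts"], window_seconds)][event["page"]] += 1
    return counts


if __name__ == "__main__":
    for start, pages in sorted(page_views_per_window(EVENTS).items()):
        print(start, pages.most_common(1))
```

In a real pipeline the same grouping would typically be expressed with the streaming engine's own window operators rather than in-memory dictionaries.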
Talend fully supports Kinesis for both consuming and producing streaming data. The architecture diagram in Figure 1 shows how to ingest streaming data in real time into scalable data lakes on AWS from an on-premises application, database, or other cloud provider using Talend Data Fabric and Amazon Kinesis.
Talend Data Fabric combines data integration, data integrity, and data governance in a single, unified environment that allows you to collect, transform, clean, govern, and share your data.
Figure 1 – Talend cloud architecture for streaming data ingestion.
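For readers who want to see what producing data to a stream looks like outside of Talend, here is a minimal Python sketch. It assumes a client object that behaves like `boto3.client("kinesis")` is passed in; the stream name, the `device_id` partition key field, and the helper names are hypothetical. Batching is used because the Kinesis `PutRecords` call accepts at most 500 records per request.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_entry(event: dict, key_field: str) -> dict:
    """Convert one event into a PutRecords entry (bytes payload plus partition key)."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event[key_field]),
    }


def batches(entries, size: int = MAX_RECORDS_PER_CALL):
    """Yield lists of at most `size` entries."""
    batch = []
    for entry in entries:
        batch.append(entry)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send_events(client, stream_name: str, events, key_field: str = "device_id") -> int:
    """Send events in batches and return the total number of failed records.

    `client` is assumed to expose put_records like boto3's Kinesis client.
    A production producer would also retry the failed records.
    """
    failed = 0
    for batch in batches(to_entry(e, key_field) for e in events):
        response = client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

With boto3 installed and credentials configured, a call might look like `send_events(boto3.client("kinesis", region_name="us-east-1"), "example-stream", events)`, where the region and stream name are placeholders.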
To work with Amazon Kinesis and big data streaming jobs using the Spark streaming framework, follow the steps below in the Talend Studio interface.
- Build the following job to read and write data to an Amazon Kinesis stream. To do this, drag and drop the tKinesisInput and tKinesisOutput components onto the canvas of a Talend job.
- Once you create an Amazon Kinesis stream in the AWS interface, you can immediately reference this stream in Talend using only a few pieces of information.
- To connect to your Kinesis stream, you’ll need your access key, secret key, stream name, and endpoint URL.
- To reduce security risk, follow the principle of least privilege to restrict access to a specific stream.
- Your access key and secret key should only be able to read from or write to the specific stream, as required. Refer to the documentation on security best practices for more information.
- After connecting to your Amazon Kinesis stream in your Talend job, you can link that stream to a Talend component to help enhance data quality and data privacy, or to transform and improve that data in transit before it lands in its desired destination.
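The connection details and the least-privilege advice in the steps above can be sketched in Python as follows. This is an illustration under stated assumptions: the stream name, region, and account ID are placeholders, the endpoint pattern shown applies to standard commercial AWS Regions, and the exact set of Kinesis actions that the Talend components need may differ from the lists used here.

```python
import json
from dataclasses import dataclass

# Assumed action sets for illustration; adjust to what your job actually calls.
READ_ACTIONS = [
    "kinesis:DescribeStreamSummary",
    "kinesis:ListShards",
    "kinesis:GetShardIterator",
    "kinesis:GetRecords",
]
WRITE_ACTIONS = ["kinesis:PutRecord", "kinesis:PutRecords"]


@dataclass(frozen=True)
class StreamConnection:
    """Non-secret connection details; keys should come from a secure store."""
    stream_name: str
    region: str
    account_id: str

    @property
    def endpoint_url(self) -> str:
        # Endpoint pattern for standard commercial Regions.
        return f"https://kinesis.{self.region}.amazonaws.com"

    @property
    def stream_arn(self) -> str:
        return f"arn:aws:kinesis:{self.region}:{self.account_id}:stream/{self.stream_name}"


def least_privilege_policy(conn: StreamConnection, read: bool = True, write: bool = False) -> dict:
    """Build an IAM policy document scoped to a single stream."""
    actions = (READ_ACTIONS if read else []) + (WRITE_ACTIONS if write else [])
    if not actions:
        raise ValueError("Grant at least one of read or write")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": conn.stream_arn}
        ],
    }


if __name__ == "__main__":
    conn = StreamConnection("example-stream", "us-east-1", "123456789012")
    print(conn.endpoint_url)
    print(json.dumps(least_privilege_policy(conn, read=True, write=False), indent=2))
```

Scoping the `Resource` to one stream ARN means the access key used in the Talend job cannot touch any other stream in the account.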
Note that Amazon Kinesis is unidirectional and can only be used as a source dataset in your pipeline. Support documents and information can be found in the Talend Cloud documentation and are described in detail in the Kinesis scenario “Working with Amazon Kinesis and Big Data Streaming Jobs”.
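To make the idea of using a stream as a source and transforming data in transit more concrete, here is a small Python sketch that reads one shard and masks a field before passing records on. It is not Talend's implementation. The client is assumed to behave like `boto3.client("kinesis")`, and the `email` field and `mask_email` helper are hypothetical.

```python
import json


def read_shard(client, stream_name: str, shard_id: str, max_polls: int = 3, limit: int = 100):
    """Yield decoded JSON events from one shard, starting at the oldest record.

    `client` is assumed to expose get_shard_iterator and get_records like
    boto3's Kinesis client. A real consumer would pause between polls and
    track its position so it can resume after a restart.
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    for _ in range(max_polls):
        if iterator is None:
            break
        response = client.get_records(ShardIterator=iterator, Limit=limit)
        for record in response["Records"]:
            yield json.loads(record["Data"])
        iterator = response.get("NextShardIterator")


def mask_email(event: dict) -> dict:
    """Return a copy of the event with the local part of `email` masked."""
    masked = dict(event)
    if "email" in masked:
        local, _, domain = masked["email"].partition("@")
        masked["email"] = local[:1] + "***@" + domain
    return masked
```

A downstream step could then consume `(mask_email(e) for e in read_shard(client, "example-stream", shard_id))` and write the masked events to their destination.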
Conclusion
Streaming data comes from many sources, including the internet, social media, clickstreams, sensors, and devices. The demand for fast analytics to understand the behavior of systems, customers, and devices in real time has increased dramatically.
In this post, we demonstrated how the integration of Talend with Amazon Kinesis can help customers quickly build Spark streaming pipelines that centralize, modernize, and analyze data with minimal effort.
Talend – AWS Partner Spotlight
Talend is an AWS Partner that provides a data integration platform enabling companies to accelerate migrations to cloud data lakes and warehouses on AWS.