This Guidance demonstrates how renewable energy operators can ingest data from renewable assets such as wind turbines, solar farms, and battery energy storage systems (BESS). The data can be collected into a data lake to perform advanced analytics with machine learning. Dashboards, alerts, business intelligence reporting, and comprehensive device management can all help operators derive insights from their asset data.
Architecture Diagram
Part 1
Step 1
The renewable energy site represents the edge and includes multiple assets, topology, and device configurations.
Step 1a
Configuration scenario: Supervisory control and data acquisition (SCADA) is available for the assets (substation equipment and power assets - battery, solar, turbine).
Step 1b
Configuration scenario: SCADA is not available (for example, when the energy output of a site doesn’t justify the investment in a SCADA system). In this case, you communicate directly with the programmable logic controllers (PLCs) of the assets (substation equipment and power assets - battery, solar, turbine).
Step 1c
Configuration scenario: SCADA isn’t available and there is no access to PLCs, which might be due to regulatory, compliance, or security reasons. In this case, renewable energy operators deposit the asset data in an external application, such as a Structured Query Language (SQL) database, a data historian, or CSV files.
Step 1d
Hardware at the edge option: An edge gateway device can ingest Internet of Things (IoT) data from SCADA or PLCs in any protocol (for example, Open Platform Communications Unified Architecture (OPC UA), Distributed Network Protocol 3 (DNP3), Modbus, and SunSpec).
For legacy applications, the device connects through prebuilt or custom connectors. It communicates the IoT data over the internet, through the Message Queuing Telemetry Transport (MQTT) protocol, to the Part 2 solution hosted in the AWS Cloud. All traffic is encrypted using X.509 certificates.
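For illustration, the following is a minimal sketch of how an edge gateway might publish telemetry over MQTT with mutual TLS. It assumes the paho-mqtt library (1.x-style API); the endpoint, topic, and certificate file names are placeholders.

```python
# Minimal sketch (paho-mqtt 1.x style): publish one telemetry message
# over MQTT with X.509 mutual TLS. Endpoint, topic, and file paths are
# hypothetical placeholders.
import json
import ssl

import paho.mqtt.client as mqtt

ENDPOINT = "example-ats.iot.eu-west-1.amazonaws.com"  # assumed AWS IoT endpoint
TOPIC = "site/plant01/turbine07/telemetry"            # assumed topic layout

client = mqtt.Client(client_id="edge-gateway-plant01")
client.tls_set(
    ca_certs="AmazonRootCA1.pem",   # root CA certificate
    certfile="device.pem.crt",      # device X.509 certificate
    keyfile="private.pem.key",      # device private key
    tls_version=ssl.PROTOCOL_TLSv1_2,
)
client.connect(ENDPOINT, port=8883)

payload = {"turbine_id": "T07", "rotor_rpm": 12.4, "power_kw": 1875.0}
client.publish(TOPIC, json.dumps(payload), qos=1)
client.disconnect()
```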
Step 2
The AWS IoT connectivity solution and the Part 2 solution are hosted within the AWS Cloud.
Step 2a
No hardware at the edge option: Sometimes it is too costly to install an IoT edge gateway device at each renewable site. In this case, you can use an AWS IoT connectivity solution hosted entirely in the AWS Cloud. The SCADA systems, PLCs, or legacy applications communicate the IoT data through the native device protocol, over a secure virtual private network (VPN) connection, to the AWS IoT connectivity solution. The connectivity solution forwards the IoT traffic over MQTT for performance analytics, as shown in Part 2. All traffic is encrypted using X.509 certificates.
Step 2b
Continue to Part 2.
Part 2
Step 1a
Non-asset-modeled data is ingested into AWS IoT Core, which includes native integration with 20 AWS services.
Step 1b
Data is ingested through Amazon Data Firehose to Amazon Simple Storage Service (Amazon S3) with optional in-flight data conversion (for example, conversion from JSON to Parquet).
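As a sketch, a producer application can write JSON records to a delivery stream with boto3; the stream name below is a placeholder, and any JSON-to-Parquet conversion is configured on the stream itself rather than in the producer.

```python
# Minimal producer sketch: write one JSON record to an Amazon Data
# Firehose delivery stream. The stream name is hypothetical.
import json

import boto3

firehose = boto3.client("firehose")

record = {"asset_id": "T07", "ts": "2024-01-01T00:00:00Z", "power_kw": 1875.0}
firehose.put_record(
    DeliveryStreamName="renewables-telemetry",  # assumed stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```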
Step 1c
Data is ingested at scale with detailed asset modeling in AWS IoT SiteWise.
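As a hedged sketch, a minimal wind turbine asset model could be defined in AWS IoT SiteWise with boto3 as follows; the model and property names are illustrative.

```python
# Sketch: define a minimal wind turbine asset model in AWS IoT SiteWise.
# Model and property names are assumptions for illustration.
import boto3

sitewise = boto3.client("iotsitewise")

sitewise.create_asset_model(
    assetModelName="WindTurbine",
    assetModelProperties=[
        {
            "name": "RotorSpeed",
            "dataType": "DOUBLE",
            "unit": "rpm",
            "type": {"measurement": {}},  # time-series measurement property
        },
        {
            "name": "ActivePower",
            "dataType": "DOUBLE",
            "unit": "kW",
            "type": {"measurement": {}},
        },
    ],
)
```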
Step 1d
AWS IoT Greengrass stream manager transfers high-volume data directly to the AWS Cloud with low latency.
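A sketch of how a Greengrass component might buffer telemetry at the edge and export it to the cloud with the Stream Manager SDK for Python; the stream name and the Kinesis data stream used as the export target are assumptions.

```python
# Sketch: buffer telemetry in a Greengrass stream manager message stream
# and export it to a (hypothetical) Kinesis data stream in the cloud.
import json

from stream_manager import (
    ExportDefinition,
    KinesisConfig,
    MessageStreamDefinition,
    StrategyOnFull,
    StreamManagerClient,
)

client = StreamManagerClient()
client.create_message_stream(
    MessageStreamDefinition(
        name="TurbineTelemetry",
        strategy_on_full=StrategyOnFull.OverwriteOldestData,  # bounded edge buffer
        export_definition=ExportDefinition(
            kinesis=[KinesisConfig(identifier="ToCloud",
                                   kinesis_stream_name="renewables-stream")]
        ),
    )
)
client.append_message(
    "TurbineTelemetry",
    json.dumps({"turbine_id": "T07", "power_kw": 1875.0}).encode(),
)
```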
Step 2
AWS IoT SiteWise, Amazon Timestream, and Amazon Managed Grafana make up the near real-time operational dashboard of “hot tags” (critical tags for health monitoring of assets).
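For example, a dashboard backend might query the last ten minutes of hot-tag data from Timestream; the database, table, and column names below are assumptions.

```python
# Sketch: query recent hot-tag values from Amazon Timestream for a
# near real-time dashboard. Database, table, and columns are assumed.
import boto3

tsq = boto3.client("timestream-query")

query = """
    SELECT asset_id, measure_name, measure_value::double AS value, time
    FROM "renewables"."hot_tags"
    WHERE time > ago(10m)
    ORDER BY time DESC
"""
for page in tsq.get_paginator("query").paginate(QueryString=query):
    for row in page["Rows"]:
        print([datum.get("ScalarValue") for datum in row["Data"]])
```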
Step 3
Build detector models in AWS IoT Events to continuously monitor the state of assets and issue immediate alerts through Amazon Simple Notification Service (Amazon SNS). This is done through email and short message service (SMS) to operational staff.
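A hedged sketch of a minimal detector model that moves an asset into an Alarm state and notifies operators through Amazon SNS; the input name, attribute, threshold, and ARNs are assumptions.

```python
# Sketch: a two-state IoT Events detector model. Input name, threshold,
# and ARNs are placeholders for illustration only.
import boto3

iotevents = boto3.client("iotevents")

iotevents.create_detector_model(
    detectorModelName="TurbineOverTemperature",
    roleArn="arn:aws:iam::123456789012:role/IoTEventsRole",  # placeholder
    detectorModelDefinition={
        "initialStateName": "Normal",
        "states": [
            {
                "stateName": "Normal",
                "onInput": {
                    "transitionEvents": [{
                        "eventName": "ToAlarm",
                        "condition": "$input.TurbineInput.temperature_c > 90",
                        "nextState": "Alarm",
                    }],
                },
            },
            {
                "stateName": "Alarm",
                "onEnter": {
                    "events": [{
                        "eventName": "NotifyOps",
                        "condition": "true",
                        "actions": [{"sns": {
                            "targetArn": "arn:aws:sns:eu-west-1:123456789012:ops-alerts"
                        }}],
                    }],
                },
            },
        ],
    },
)
```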
Step 4
The industrial data lake is hydrated by different sources at different velocities and serves as a single version of truth for all consumers. Data lands “as-is” from sources in a landing zone Amazon S3 bucket. From there, it is cleansed and normalized through AWS Glue ETL jobs into a curated state and placed in a clean zone Amazon S3 bucket.
Amazon EMR consumes this curated data to calculate 10-minute averages. Amazon EMR also converts the clean data into the IEC 61400-25-2 standard for wind and the IEC 61850-7-420 standard for solar. Amazon EMR then deposits the aggregated and standardized data in a business zone Amazon S3 bucket.
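As a sketch of the aggregation step, a PySpark job on Amazon EMR could compute the 10-minute averages like this; the bucket paths and column names are assumptions, and the mapping to the IEC tag models is out of scope here.

```python
# Sketch: compute 10-minute averages per asset with PySpark on EMR.
# Bucket paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, window

spark = SparkSession.builder.appName("ten-minute-averages").getOrCreate()

# Read curated data from the clean zone bucket.
clean = spark.read.parquet("s3://example-clean-zone/telemetry/")

# Group by asset and 10-minute tumbling window, then average.
agg = (
    clean.groupBy(col("asset_id"), window(col("event_time"), "10 minutes"))
    .agg(avg("power_kw").alias("avg_power_kw"),
         avg("wind_speed_ms").alias("avg_wind_speed_ms"))
)

# Write the aggregates to the business zone bucket.
agg.write.mode("append").parquet("s3://example-business-zone/10min/")
```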
Step 5
Data from the business zone Amazon S3 bucket is loaded into Amazon Redshift. Detailed business intelligence (BI) reporting can be done using Amazon Managed Grafana or Amazon QuickSight, which uses the Super-fast, Parallel, In-memory Calculation Engine (SPICE). It is also possible to connect with external BI tools like Tableau.
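A hedged sketch of the load step using the Redshift Data API; the cluster, database, table, and IAM role identifiers are placeholders.

```python
# Sketch: COPY aggregated Parquet data from the business zone bucket
# into Amazon Redshift. All identifiers are placeholders.
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="renewables-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql="""
        COPY analytics.turbine_10min
        FROM 's3://example-business-zone/10min/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """,
)
```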
Step 6
Artificial intelligence and machine learning (AI/ML) services, like Amazon SageMaker, use curated data from the data lake for predictive health analysis and assessment.
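One possible shape for this step, sketched with the SageMaker Python SDK: train a Random Cut Forest model on telemetry features to flag anomalous asset behavior. The role, instance type, and feature choice are assumptions, and the random array stands in for real curated data.

```python
# Sketch: anomaly detection for asset health with SageMaker Random Cut
# Forest. Role ARN, instance type, and features are placeholders.
import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=512,
    num_trees=50,
    sagemaker_session=session,
)

# Hypothetical feature matrix: rows of (power_kw, wind_speed_ms, temp_c).
train = np.random.rand(10_000, 3).astype("float32")  # stand-in for real data
rcf.fit(rcf.record_set(train))
```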
Step 7
AWS IoT connectivity solutions have the full range of remote device management capabilities.
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
You can safely operate this Guidance and respond to events because Firehose integrates with Amazon CloudWatch alarms. Set these alarms to trigger when metrics exceed buffering limits.
Amazon EMR produces log files that help you identify cluster issues such as failures or errors. You can archive log files in Amazon S3 to troubleshoot issues even after your Amazon EMR cluster terminates. Amazon EMR also integrates with CloudWatch to track performance metrics, and you can configure alarms based on different metrics. For example, “IsIdle” tracks whether a cluster is active and not running tasks, and “HDFSUtilization” monitors the cluster’s capacity to determine whether it requires resizing to add more core nodes.
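For instance, an alarm on the “IsIdle” metric might look like the following sketch; the cluster ID and SNS topic ARN are placeholders.

```python
# Sketch: alarm when an EMR cluster has been idle for 30 minutes, so it
# can be investigated or terminated. Identifiers are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="emr-cluster-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-EXAMPLE12345"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=6,  # six 5-minute periods = 30 minutes idle
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
)
```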
AWS IoT SiteWise allows you to set alarms to identify equipment performance issues. The alarms can be integrated with Amazon Simple Queue Service (Amazon SQS) and Amazon SNS to perform additional actions based on the alarm.
Security
Implementing least privilege access is fundamental in reducing security risk and the impact that could result from errors or malicious intent. We therefore recommend implementing least privilege access for all resources.
The producer and client applications must have valid credentials to access Firehose delivery streams. We recommend you use AWS Identity and Access Management (IAM) roles to manage temporary credentials for producer and client applications.
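A minimal sketch of this pattern: the producer assumes an IAM role scoped to the delivery stream and uses the resulting short-lived credentials instead of long-term access keys. The role ARN is a placeholder.

```python
# Sketch: obtain temporary credentials via STS and use them for a
# Firehose client. The role ARN is hypothetical.
import boto3

creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::123456789012:role/FirehoseProducerRole",
    RoleSessionName="edge-producer",
)["Credentials"]

firehose = boto3.client(
    "firehose",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```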
Amazon EMR allows you to encrypt data in transit and at rest. At-rest encryption can be done by using encrypted Amazon Elastic Block Store (Amazon EBS) volumes, by enabling encryption on Amazon S3 (or both) when using the EMR File System (EMRFS). You can also use Hadoop Distributed File System (HDFS) transparent encryption if you are using HDFS instead of EMRFS.
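A hedged sketch of an EMR security configuration that enables at-rest encryption; the KMS key ARNs are placeholders, and in-transit encryption is left disabled here because it requires a TLS certificate provider of your own.

```python
# Sketch: create an EMR security configuration with S3 and local-disk
# at-rest encryption. KMS key ARNs are placeholders.
import json

import boto3

emr = boto3.client("emr")

security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": False,  # needs a TLS certificate provider
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:eu-west-1:123456789012:key/EXAMPLE",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:eu-west-1:123456789012:key/EXAMPLE",
            },
        },
    }
}

emr.create_security_configuration(
    Name="renewables-emr-encryption",
    SecurityConfiguration=json.dumps(security_config),
)
```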
AWS IoT SiteWise stores data in the AWS Cloud and on a gateway. The data stored in other AWS services is encrypted by default. Encryption at rest integrates with AWS Key Management Service (AWS KMS) to manage the encryption key used to encrypt your asset data.
The AWS IoT SiteWise gateway running on AWS IoT Greengrass relies on Unix file permissions and full-disk encryption to protect data at rest. Full-disk encryption can be enabled.
Reliability
With Firehose, you can back up source data in an Amazon S3 bucket. This allows you to go back to the source data if a failure occurs downstream.
Amazon EMR monitors nodes in the cluster and automatically terminates and replaces an instance in case of a failure.
Performance Efficiency
Firehose allows dynamic partitioning of streaming data. Partitioning the data minimizes the amount of data scanned and optimizes performance. This makes it easier to run high-performance, cost-efficient analytics on streaming data in Amazon S3 using Amazon EMR and Amazon QuickSight.
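A hedged sketch of a delivery stream with dynamic partitioning that extracts a "site" key from each JSON record, so objects land under per-site S3 prefixes; the ARNs, names, and key choice are placeholders.

```python
# Sketch: Firehose delivery stream with dynamic partitioning on a
# "site" field extracted from each JSON record. ARNs are placeholders.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="renewables-partitioned",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::example-landing-zone",
        "Prefix": "site=!{partitionKeyFromQuery:site}/",
        "ErrorOutputPrefix": "errors/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "MetadataExtraction",
                "Parameters": [
                    {"ParameterName": "MetadataExtractionQuery",
                     "ParameterValue": "{site: .site}"},
                    {"ParameterName": "JsonParsingEngine",
                     "ParameterValue": "JQ-1.6"},
                ],
            }],
        },
    },
)
```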
Amazon EMR cluster nodes can be monitored and optimized based on your workload. For some workloads, the primary node needs to be more powerful; in other situations, the core and task nodes need to run on instances with more CPU.
Cost Optimization
This Guidance relies on serverless AWS services that are fully managed and automatically scale according to workload demand. As a result, you only pay for what you use.
Firehose allows you to create interface VPC endpoints, which keep traffic between the VPC and Firehose from leaving the AWS network and also reduce data transfer cost. With Firehose, you can use tags to categorize delivery streams, allowing you to view usage and cost by custom tag.
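A sketch of creating such an endpoint; the VPC, subnet, and security group IDs are placeholders, and the Region in the service name should match your deployment.

```python
# Sketch: interface VPC endpoint for Firehose so producer traffic stays
# on the AWS network. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-1.kinesis-firehose",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
```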
Amazon EMR makes it easy to use Amazon EC2 Spot Instances, saving you both time and money. You can configure the task nodes in the Amazon EMR cluster to use Spot Instances, which reduces cost without losing data if those Spot Instances are reclaimed.
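A hedged sketch of this layout: primary and core nodes on On-Demand capacity, task nodes on Spot. Cluster name, release label, roles, and instance types are illustrative.

```python
# Sketch: EMR cluster with On-Demand primary/core nodes and Spot task
# nodes. Names, roles, and instance types are placeholders.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="renewables-analytics",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
)
```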
Sustainability
Firehose allows you to convert input data from JSON to Apache Parquet before storing it in Amazon S3, saving space and enabling faster queries. These efficiencies reduce the total amount of hardware needed to manage the data.
To further minimize hardware usage, you can use Amazon EMR Serverless. This helps you focus on the workload and not the underutilization of primary, core, or task nodes.
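A hedged sketch of running the same Spark aggregation on EMR Serverless so no cluster nodes sit idle; the application name, role ARN, and script location are placeholders.

```python
# Sketch: run a Spark job on EMR Serverless. Application name, role,
# and script path are hypothetical placeholders.
import boto3

emrs = boto3.client("emr-serverless")

app = emrs.create_application(
    name="renewables-spark",
    releaseLabel="emr-6.15.0",
    type="SPARK",
)

emrs.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EmrServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://example-scripts/ten_minute_averages.py",
        }
    },
)
```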
Implementation Resources
A detailed guide is provided for you to experiment with and use within your AWS account. It walks through each stage of the Guidance, including deployment, usage, and cleanup.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.