AWS for Industries
Accelerating generator interconnection study with serverless workflow and elastic HPC
The rapid growth of clean energy, propelled by cost reduction and decarbonization efforts, has placed significant strain on the existing power grid infrastructure. As more solar, wind, and other distributed energy resources seek to interconnect to the grid, the process of generator interconnection has become increasingly complex and challenging. Outdated grid planning processes, lengthy interconnection queues, and a lack of transmission capacity are just some of the key barriers inhibiting the seamless integration of new generation. These interconnection challenges are further exacerbated by the geographic misalignment between renewable resource-rich areas and load centers, necessitating extensive transmission build-out to deliver clean energy to consumers.
The scale of this issue is staggering: the capacity of projects waiting in the interconnection queue is now twice as large as that of the entire existing U.S. power grid. In 2022, there were almost 2 million megawatts (MW) of clean energy capacity on hold in interconnection queues, and by the end of 2023, this number had increased to 2.6 million MW, equivalent to 2,600 gigawatts (GW) or 2.6 terawatts (TW). These proposed projects were progressing through the various assessments and procedures required before connecting to the electricity network. As of 2024, the average time from entering the queue to commercial operation was around 3 to 5 years (see report).
Figure 1. The U.S. Clean Energy Backlog (data as of the end of 2022, chart source)
Figure 2. The U.S. generator interconnection queue in 2022 vs. 2023 (chart source)
The generator interconnection process primarily consists of four stages: interconnection request, interconnection studies, interconnection agreement, and commercial operation. Before connecting any new generator to the power grid, a series of complex, simulation-based engineering studies must be conducted, known as the Feasibility Study, System Impact Study, and Facilities Study. These interconnection studies assess how one or more proposed generators could impact the grid's overall reliability, stability, and performance. The multi-step study process determines the necessary network upgrades for interconnecting the proposed projects, calculates associated costs, and allocates these costs proportionally among the proposed projects based on their individual impacts. The output of these studies informs the subsequent stages of the interconnection process, as shown in the following table.
Table 1. Generator interconnection process overview
These studies, although not the sole reason, are a significant contributor to the lengthy interconnection process. Taken together, the three study phases, the Feasibility Study (typically 30-60 days), the System Impact Study (usually 90-180 days), and the Facilities Study (often 90-180 days), can last anywhere from six months to two years or more. In practice, studies often take longer than the targeted timeframes due to complexities, re-studies, or resource constraints.
Business opportunity for utilities/ISOs/AWS
The Federal Energy Regulatory Commission (FERC) has taken action to streamline the generator interconnection process through Order No. 2023, titled "Improvements to Generator Interconnection Procedures and Agreements." A crucial aspect of this order mandates that transmission providers shift from a sequential "first-come, first-served" approach to a more efficient "first-ready, first-served" cluster study methodology. This transition enables the simultaneous evaluation of multiple projects and aims to expedite progress through the interconnection queue. The new approach creates an opportunity for stakeholders across the generator interconnection value chain to significantly scale up and accelerate these critical, traditionally time-intensive interconnection studies by using cloud technologies. This shift not only promises to reduce bottlenecks in the interconnection process, but also aligns with the broader goal of facilitating faster integration of new generation resources into the power grid.
Technical challenges of interconnection study on conventional IT resources
Interconnection studies are usually run on on-premises server farms backed by traditional IT infrastructure, an approach that presents several technical challenges, as follows.
- Computational resources can’t keep up with the volume of studies. Interconnection studies, especially for complex systems with clustered project requests, need significant computational power. On-premises systems may struggle with the increasing complexity and volume of studies, leading to longer processing times.
- Scalability issues. On-premises systems have limited scalability, making it difficult to quickly ramp up resources for peak demand periods or cluster studies. Moreover, adding new hardware to increase capacity is often costly and time-consuming.
- Data management. Storing and managing large volumes of data from multiple studies can strain local storage systems, and making sure of data backup on local systems can be challenging and resource-intensive.
- Collaboration difficulties. On-premises setups can hinder real-time collaboration between team members, especially when the studies are conducted on siloed workstations. It is difficult to find out what jobs others are running, to track their progress, and to help each other with troubleshooting. Moreover, sharing large datasets and results with external stakeholders can be cumbersome.
- Availability and disaster recovery. Local hardware failures can lead to significant downtime and potential data loss. Meeting requirements for Recovery Time Objective (RTO) and Recovery Point Objective (RPO) can be more challenging and expensive for on-premises systems.
- Limited access to advanced technologies. On-premises systems may struggle to incorporate advanced technologies such as artificial intelligence (AI) and machine learning (ML), which are increasingly used to optimize interconnection studies.
- Cost of ownership. The total cost of ownership (TCO) for high-performance computing resources needed for complex studies can be significant when maintained on-premises.
Solution
Generator interconnection is crucial for delivering substantial electrical capacity to major industrial hubs and densely populated urban centers with high electric energy demands. This issue extends beyond the concerns of power utilities and manufacturers, holding particular significance for Amazon as the world’s foremost purchaser of renewable energy.
Furthermore, the surge in generative AI is driving unprecedented growth in data center power demands. Meeting these escalating energy needs necessitates immediate attention. We must overcome the technical barriers that currently limit new generator connections to the electric grid. This technical focus should complement ongoing policy reform efforts to create an approach better suited to grid expansion and modernization.
This post presents a comprehensive, tailored solution framework on AWS to address the key obstacles in the generator interconnection study process. It showcases the diverse capabilities of AWS in significantly expediting and enhancing these critical assessments.
The following diagram illustrates an enhanced reference architecture for an interconnection study accelerator, using serverless workflow orchestration and an elastic High Performance Computing (HPC) cluster. This updated design builds upon our previously published guidance, introducing a key improvement: enhanced flexibility in compute service selection. This adaptability allows users to tailor the compute environment to specific study requirements and optimize performance and resource allocation for various interconnection study tasks. The solution uses AWS Step Functions to orchestrate the study process. Depending on task characteristics such as run time, data commonality, complexity, and runtime dependencies, a Step Functions state can be executed on AWS Lambda, Amazon Elastic Container Service (Amazon ECS) on AWS Fargate, or AWS ParallelCluster (or alternatively AWS Parallel Computing Service, a managed service to run and scale high performance computing workloads). Users can choose any of these scalable computing services and combine them flexibly to carry out a complex analysis. For example, a Lambda function, with its 15-minute execution limit, is best suited to short-lived compute tasks such as retrieving a request from the desired interconnection queue, while ECS on Fargate is more suitable for longer-running tasks, such as retrieving (downloading and extracting) common data bundles from a remote repository. For compute-intensive, time-consuming engineering simulation jobs such as contingency analysis, short circuit analysis, transient stability analysis, and investment optimization, Step Functions can designate the HPC cluster, such as ParallelCluster, to tackle them.
Figure 3. Reference architecture for accelerating generator interconnection study process
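To make the orchestration concrete, the following sketch shows how a single Step Functions definition can point different states at different compute services. It is a minimal illustration rather than the solution's actual definition: the function name, cluster, task definition, subnet, and role ARN are hypothetical placeholders.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical names and ARNs; replace with resources from your own deployment.
definition = {
    "Comment": "Minimal sketch: route states to different compute services",
    "StartAt": "GetQueueRequest",
    "States": {
        "GetQueueRequest": {
            # Short-lived task (well under 15 minutes): run on Lambda
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
                "FunctionName": "get-interconnection-request",
                "Payload.$": "$",
            },
            "Next": "RetrieveDataBundle",
        },
        "RetrieveDataBundle": {
            # Longer-running download/extract task: run on ECS with Fargate
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "interconnection-study",
                "TaskDefinition": "retrieve-data-bundle",
                "NetworkConfiguration": {
                    "AwsvpcConfiguration": {"Subnets": ["subnet-0123456789abcdef0"]}
                },
            },
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="interconnection-study-sketch",
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
    definition=json.dumps(definition),
)
```

Simulation states that belong on the HPC cluster are handled differently, using the callback pattern described later in this post.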
On the frontend, the solution uses AWS Amplify to build and host a full stack web application. Before logging into the web user interface, users are authenticated with Amazon Cognito and assume a specific role such as admin, standard user, reviewer, or approver. The solution also uses Amazon DynamoDB to store user data and task information, and AWS AppSync to create GraphQL APIs that query the NoSQL database to display this information on the web portal.
A high-performance file system, Amazon FSx for Lustre or Amazon FSx for OpenZFS, is configured to connect with Lambda, Amazon ECS, and the HPC cluster to store intermediate and consolidated results generated by the interconnection study tools. For users to access the data from the web UI, AWS DataSync is used in this architectural design to move selected portions of the data from Amazon FSx to Amazon S3, which is integrated with the Amplify backend as the storage resource.
The solution uses AWS CodePipeline to implement DevOps practices, automating the build, test, and deployment processes for both the Amplify code and the Amazon ECS task container artifacts. As part of the continuous integration and continuous delivery (CI/CD) pipeline, the solution incorporates AWS CodeConnections and AWS CodeBuild. This integrated approach streamlines development workflows and enhances collaboration between development and operations teams.
The solution’s highlighted features include:
- Support for multi-user, multi-job execution
- Controllable max concurrency for parallel runs (see the sketch after this list)
- Highly cost-effective, due to an event-driven (job submission) serverless workflow and an elastic HPC cluster that scales down to zero compute instances when idle
- User authentication and API authorization integrated with the frontend
- Independent web UI that smooths the user experience and flattens the learning curve
- Fine-grained access control for different personas, such as admins, users, and approvers
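The concurrency control mentioned above can be expressed in several ways. One option, sketched below under the assumption that clustered requests are fanned out with a Step Functions Map state, is the MaxConcurrency field, which caps how many study branches run at once; the state and function names are illustrative only.

```python
# Hypothetical Map state: fan out one study branch per project in the cluster,
# but never run more than 10 branches at the same time.
fan_out_projects = {
    "Type": "Map",
    "ItemsPath": "$.projects",
    "MaxConcurrency": 10,  # controllable cap for parallel runs
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "INLINE"},
        "StartAt": "RunProjectStudy",
        "States": {
            "RunProjectStudy": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "run-project-study", "Payload.$": "$"},
                "End": True,
            }
        },
    },
    "End": True,
}
```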
The interconnection process varies across organizations, with study cycles differing by region based on criteria set by individual Regional Transmission Organizations (RTOs) and Independent System Operators (ISOs). The following figure illustrates the solution architecture's capability to orchestrate a customizable workflow. This example doesn't represent a complete implementation of all interconnection study steps. Rather, it demonstrates the key tasks and how they can be executed using serverless and HPC services to achieve scalability, reliability, performance efficiency, and cost optimization. This sample provides a foundation for users to build upon. Using AWS Step Functions, users can create more sophisticated workflows that accurately mirror real-world study cycles. Step Functions allows for integration with various AWS compute services, which can be tailored to handle specific job requirements. Users have the flexibility to customize the solution to their unique needs, expanding the basic framework into a more comprehensive system that aligns with their particular processes.
Figure 4. A sample workflow of interconnection study
To illustrate the process, we use a range of open source power system engineering tools to execute all critical steps in the interconnection study workflow, including GridCal, GridStatus, ANDES, PyPSA, and its sister library PyPSA-USA. These tools are all Python-based packages, offering flexibility in deployment. They can be executed in all of the aforementioned compute environments, including Lambda, ECS on Fargate, and ParallelCluster. This approach demonstrates the versatility and scalability of these tools within cloud-based environments, catering to various computational needs in power system analysis.
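As a flavor of how these packages are used, the minimal sketch below builds a toy two-bus network in PyPSA and solves a power flow. The network itself is made up for illustration, but the same script pattern can run unchanged inside a Lambda container image, an ECS task, or a ParallelCluster job.

```python
import pypsa

# Toy two-bus network for illustration only
network = pypsa.Network()
network.add("Bus", "bus0", v_nom=230.0)
network.add("Bus", "bus1", v_nom=230.0)
network.add("Line", "line0", bus0="bus0", bus1="bus1", x=0.1, r=0.01, s_nom=200.0)
network.add("Generator", "gen0", bus="bus0", control="Slack", p_set=100.0)
network.add("Load", "load0", bus="bus1", p_set=100.0)

# Solve the non-linear (AC) power flow and inspect line flows and bus voltages
network.pf()
print(network.lines_t.p0)
print(network.buses_t.v_mag_pu)
```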
A state machine is created in Step Functions based on the previously defined workflow. This state machine incorporates three primary state types: Task, Choice, and Fail. The Choice state serves as a flow control mechanism, directing the execution path based on specified conditions. In this example, it controls whether a task should be run or skipped based on whether certain data already exists. The Fail state handles exceptions, including both failures and timeouts, and provides a standardized method for error management. Each Task state is equipped with retry logic, defined in the workflow shown in the following figure, allowing multiple execution attempts in case of transient failures. If a task exhausts its maximum number of retry attempts, the workflow terminates, transitioning to the Fail state. This structure makes sure of robust error handling and flow control throughout the interconnection study process.
Figure 5. Definition of interconnection study workflow in Step Functions
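The exact retry settings belong to the workflow definition shown in Figure 5; the snippet below is only a hedged illustration of the pattern, with a hypothetical simulation task, illustrative retry intervals, and a Catch rule that routes exhausted retries or timeouts to the Fail state.

```python
# Illustrative Task state: retry transient failures, then fall through to Fail
contingency_analysis = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "submit-contingency-analysis", "Payload.$": "$"},
    "TimeoutSeconds": 3600,
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
            "IntervalSeconds": 30,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        # Once retries are exhausted (or on any unhandled error), end the workflow
        {"ErrorEquals": ["States.ALL"], "Next": "StudyFailed"}
    ],
    "Next": "ShortCircuitAnalysis",
}

study_failed = {
    "Type": "Fail",
    "Error": "InterconnectionStudyError",
    "Cause": "A study task exhausted its retry attempts or timed out.",
}
```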
The state machine employs a standardized Lambda function template, integrated with the Wait for Callback pattern, to manage tasks that necessitate submitting interconnection study simulation jobs to ParallelCluster through the AWS Systems Manager Run Command API. This callback mechanism allows the workflow to pause until the task token is returned, indicating job completion. Upon finishing the cluster-submitted job, a callback function is triggered to report the task state as either successful (SendTaskSuccess) or failed (SendTaskFailure). This approach enables efficient management of long-running asynchronous processes within the workflow, maintaining the integrity of the study's sequential dependencies. It allows each phase of the cascading study to properly inform and initiate subsequent stages.
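A rough sketch of this pattern follows, with hypothetical event fields and command arguments: a Lambda function invoked by a ".waitForTaskToken" Task state forwards the task token to the ParallelCluster head node through the Systems Manager Run Command API, and a callback helper reports the outcome back to Step Functions once the cluster job finishes.

```python
import json

import boto3

ssm = boto3.client("ssm")
sfn = boto3.client("stepfunctions")


def submit_job_handler(event, context):
    """Invoked by a '.waitForTaskToken' Task state; the event carries the token."""
    task_token = event["taskToken"]             # injected by Step Functions
    head_node_id = event["headNodeInstanceId"]  # hypothetical field in the state input
    job_command = event["jobCommand"]           # e.g. a job submission script

    # Run the submission command on the cluster head node, passing the task
    # token along so the job wrapper can call back when the job completes.
    ssm.send_command(
        InstanceIds=[head_node_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [f"{job_command} '{task_token}'"]},
    )
    # The workflow now pauses at this state until the token is returned.


def report_job_result(task_token, succeeded, payload=None, error="JobFailed", cause=""):
    """Callback invoked when the cluster job finishes, resuming the workflow."""
    if succeeded:
        sfn.send_task_success(taskToken=task_token, output=json.dumps(payload or {}))
    else:
        sfn.send_task_failure(taskToken=task_token, error=error, cause=cause)
```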
Upon receiving jobs from the state machine, the HPC cluster employs a sophisticated job management process:
- Job Queueing and Scheduling: Incoming jobs are initially placed in a queue and scheduled for processing.
- Dynamic Scaling: The cluster then scales out dynamically, provisioning compute-optimized nodes as needed to handle the queued jobs efficiently.
- Unique Job Identification: Each job is assigned a distinctive identifier, comprising:
- Username
- Submission timestamp
- Job type
- Job number
- Processing core index
- Output File Naming Convention: The cluster uses this unique identifier to name output files, following a consistent pattern. This naming strategy serves two crucial purposes:
- Prevents file conflicts by ensuring each output has a unique name
- Eliminates the risk of accidental file overwrites
This systematic approach makes sure of efficient job processing, scalable resource usage, and organized output management, even when handling multiple concurrent interconnection study runs.
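A simple sketch of such an identifier and naming scheme is shown below; the exact format and field order used by the solution may differ.

```python
from datetime import datetime, timezone


def build_job_id(username, job_type, job_number, core_index):
    """Compose a unique job identifier from the fields listed above."""
    submitted_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{username}_{submitted_at}_{job_type}_{job_number:04d}_core{core_index:03d}"


def output_file_name(job_id, suffix="results.csv"):
    """Derive output file names from the job identifier to avoid overwrites."""
    return f"{job_id}_{suffix}"


# Example: jdoe_20250101T120000Z_contingency_0017_core005_results.csv
print(output_file_name(build_job_id("jdoe", "contingency", 17, 5)))
```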
Figure 6. Job queue status and output results (accessible by cluster admin)
The solution uses an ECS on Fargate cluster to execute the interconnection study tasks that don't run on ParallelCluster, including base model selection, retrieval of project datasets associated with clustered interconnection requests, and network model construction. To make sure of seamless data accessibility and persistence across these diverse computing environments, an FSx volume is created and mounted to multiple components: the ParallelCluster, the Amazon ECS tasks, and the Lambda functions deployed as container images.
At the workflow's conclusion, a Lambda function is triggered to perform two critical tasks. First, it aggregates the analysis results produced throughout the study process and generates a comprehensive study report (generative AI can potentially be used to facilitate report generation). Second, it initiates a DataSync task, which selectively transfers data from the mounted FSx volume to an output layer S3 bucket, making sure of efficient storage and accessibility of the final study outputs.
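A rough sketch of that final function is shown below, assuming a hypothetical mount path, a run layout under /runs/<RunId>, and a DataSync task (FSx to S3) that was created in advance and passed in through an environment variable.

```python
import os

import boto3

datasync = boto3.client("datasync")

FSX_MOUNT = "/mnt/fsx"                                # hypothetical mount path
DATASYNC_TASK_ARN = os.environ["DATASYNC_TASK_ARN"]   # pre-created FSx-to-S3 task


def handler(event, context):
    run_id = event["RunId"]
    run_dir = os.path.join(FSX_MOUNT, "runs", run_id)

    # 1) Aggregate per-step results into a single (simplified) study report.
    sections = []
    for name in sorted(os.listdir(run_dir)):
        if name.endswith(".csv"):
            with open(os.path.join(run_dir, name)) as f:
                sections.append(f"== {name} ==\n{f.read()}")
    with open(os.path.join(run_dir, "study_report.txt"), "w") as f:
        f.write("\n".join(sections))

    # 2) Transfer only this run's outputs from FSx to the output layer S3 bucket.
    datasync.start_task_execution(
        TaskArn=DATASYNC_TASK_ARN,
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": f"/runs/{run_id}/*"}],
    )
```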
Circling back to the solution’s frontend, as mentioned previously, it is hosted on Amplify to provide a user-friendly web interface built with the Cloudscape design system. This interface enables users to submit job runs, monitor job status, and download comprehensive study reports along with detailed step-by-step analysis results in a compressed file format. The user experience is intuitive, engaging, and inclusive at scale.
Figure 7. Generator interconnection study web application home page
From the user’s perspective, the end-to-end study process unfolds as follows:
- Initiation: a run begins when a user uploads a trigger file specifying an individual or clustered interconnection queue request to the Amazon S3 landing zone bucket.
- Preprocessing: this upload triggers a Lambda function that performs a data integrity check.
- Workflow Start: the Lambda function starts the state machine execution through Step Functions and simultaneously records metadata about the run in a DynamoDB table (a minimal sketch of this function follows the list).
- Data Storage: the DynamoDB table stores key information such as RunId, Owner, SubmissionTime, SourceFileName, OutputFileName, and OutputFileLocation, with RunId and Owner serving as partition and sort keys respectively.
- Real-time Updates: a GraphQL API, facilitated by AppSync, periodically queries this table to present up-to-date information to all platform users.
- Progress Monitoring: users can track the progress of specific runs through WebSocket connections managed by Amazon API Gateway (check this AWS sample for details).
- Results Access: upon completion, study results become available for download from the output layer S3 bucket. Access is conditionally granted based on user group membership, as specified by object tags, making sure of both collaboration and data security.
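The first three steps might be implemented roughly as in the sketch below; the table name, state machine ARN, and trigger file fields are hypothetical, and the integrity check is simplified to a schema check.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")
runs_table = boto3.resource("dynamodb").Table("InterconnectionStudyRuns")  # hypothetical name

STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:interconnection-study"


def handler(event, context):
    """Triggered by an upload to the S3 landing zone bucket."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Simplified data integrity check: the trigger file must be valid JSON
    # and contain the fields the workflow expects.
    request = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    if "projects" not in request or "owner" not in request:
        raise ValueError(f"Trigger file {key} is missing required fields")

    # Start the state machine execution for this run.
    run_id = str(uuid.uuid4())
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=run_id,
        input=json.dumps({"RunId": run_id, "SourceFileName": key, **request}),
    )

    # Record run metadata; RunId and Owner serve as partition and sort keys.
    runs_table.put_item(Item={
        "RunId": run_id,
        "Owner": request["owner"],
        "SubmissionTime": datetime.now(timezone.utc).isoformat(),
        "SourceFileName": key,
    })
```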
This comprehensive approach streamlines the user interaction with the interconnection study process while maintaining robust security and efficient data management.
Figure 8. Generator interconnection study web application dashboard (network traffic view)
Figure 9. Generator interconnection study web application “Runs” page
Figure 10. Generator interconnection study web application “Run details” page
To simulate real-world scenarios where multiple engineers simultaneously perform interconnection study runs, we conducted concurrent execution tests on the platform. The solution’s serverless architecture, using Lambda, Step Functions, and ECS on Fargate, demonstrated robust capabilities in efficiently managing these parallel submissions. This architecture’s inherent scalability and the asynchronous processing model enabled seamless handling of multiple concurrent study requests, making sure of responsiveness and performance even under high-demand conditions.
Figure 11. Concurrent execution of multiple workflows initiated by different users
Conclusion
The rapid growth of renewable energy generation has created significant challenges in the generator interconnection process, leading to lengthy queues and delays in integrating new clean energy projects into the power grid. This post presents an innovative solution using AWS cloud technologies to address these challenges and accelerate the interconnection study process.
Combining serverless workflow orchestration with elastic high-performance computing allows the proposed architecture to offer a scalable, cost-effective, and efficient approach to conducting complex interconnection studies. This cloud-based solution addresses many critical limitations of traditional on-premises systems, offering improved scalability, collaboration, and cost-efficiency. By streamlining the interconnection study process, it can significantly reduce bottlenecks in the integration of new generation resources, ultimately supporting the transition to a cleaner, more resilient power grid.
As the energy landscape continues to evolve, utilities, ISOs, and other stakeholders have a powerful ally in cloud computing. Embracing innovative cloud-based solutions such as this one allows these organizations to fast-track their journey toward a more sustainable energy future. These approaches are instrumental in addressing the increasing need for renewable energy integration and power grid modernization.