This Guidance helps organizations provide their data scientists with external package repository access while maintaining information security (InfoSec) compliance. Data scientists commonly need to install open-source packages from public repositories, which introduces security risks. By using an automated orchestration pipeline on AWS, organizations can make sure that all public packages undergo comprehensive security scans before entering data scientists' private Jupyter notebook environments. InfoSec governance controls are integrated into the workflow so that data scientists can work without disruption. With this Guidance, organizations can strike a balance between data scientist agility and robust security measures.

Please note: see the Disclaimer at the end of this page.

Architecture Diagram

[Architecture diagram description]

Download the architecture diagram PDF 

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

  • This Guidance uses AWS managed services that support Amazon CloudWatch or AWS CloudTrail logging for incident and event response. You can view the CodePipeline execution state through the console or the command line interface, and you can monitor each CodeBuild project individually (the first sketch after this list shows one way to poll pipeline state). An automated deployment script monitors the AWS CloudFormation stack status and makes it visible to the user deploying this Guidance. Additionally, CloudTrail captures all API calls for CodeArtifact as events, including calls from package manager clients.

    Read the Operational Excellence whitepaper 
  • This Guidance uses Amazon VPC networking and VPC endpoints to establish a private data perimeter. Secrets Manager securely stores sensitive credentials, such as a GitHub personal access token (PAT) and email address, and the GitHub PAT authenticates the private repository webhook. The cfn-nag tool validates the CloudFormation template to make sure that IAM rules and security groups are not overly permissive, that access logs and encryption are enabled, and that there are no password literals. Additionally, CodePipeline uses an Amazon S3 artifact repository encrypted with AWS Key Management Service (AWS KMS) for its assets. CodeArtifact packages are published using a SHA256 hash that is calculated by the caller and provided with the request (see the second sketch after this list).

    Read the Security whitepaper 
  • This Guidance uses services managed by AWS that natively provide high availability and resilience. For example, SageMaker provides low latency, high throughput, and highly redundant networking. This Guidance is designed to be deployed in one AWS Region, but you can easily adapt its infrastructure as code to launch an identical stack in a secondary disaster recovery Region. For higher availability, AWS Lambda runs your function in multiple Availability Zones (AZs) so that it can still process events if service is interrupted in a single AZ. For resiliency, CodePipeline and CodeBuild can retry failed stage actions either automatically or manually (the third sketch after this list retries the failed actions of a stage). Additionally, the CloudFormation template enables you to quickly launch new versions of the resource stack, and you can use CloudTrail and CloudWatch to access stack logs of resource provisioning and errors. Finally, Amazon Simple Email Service (Amazon SES) will email your account administrators when significant events occur.

    Read the Reliability whitepaper 
  • This Guidance uses higher-abstraction services managed by AWS, chosen for their operational benefits. For example, these services natively provide a minimum of 40 Gbps of burstable throughput, based on VPC endpoint quotas. Of the services used, the lowest default request quota, which can be increased, is 200 emails per day for Amazon SES; at that level, the Guidance scales to 1,000 CodePipeline executions. It also integrates with various third-party source repositories, like GitHub, and you can plug third-party security scanning software into the automation pipeline as a custom CodeBuild project. Additionally, you can use the SageMaker Studio system terminal to pull, edit, and push file copies between local and remote repositories. Alternatively, data scientists can run Git commands from their local system terminal or from another notebook environment.

    Read the Performance Efficiency whitepaper 
  • This Guidance provisions services in the same Region to reduce data transfer charges. As managed serverless services, they reduce your maintenance overhead and infrastructure cost. These services follow the pay-as-you-go model: they are not required to run for extended periods and can scale down when not in use. NAT gateways in Amazon VPC are charged per gigabyte of processed data, support 5 Gbps of bandwidth, and automatically scale up to 100 Gbps. Internet gateways in Amazon VPC, which are horizontally scaled, redundant, and highly available, impose no bandwidth constraints. And each VPC endpoint supports a bandwidth of up to 10 Gbps per AZ and bursts of up to 40 Gbps. Additionally, CodePipeline and CodeBuild provide a unique instance for each run with no reported concurrency limits. Furthermore, Secrets Manager supports 10,000 DescribeSecret and GetSecretValue API requests per second. Finally, SageMaker Studio lets you automatically shut down idle resources, and CloudFormation lets you create and delete stacks as needed, avoiding static provisioning costs (the last sketch after this list deletes a stack on demand).

    Read the Cost Optimization whitepaper 
  • The services used in the Guidance that are managed by AWS scale based on demand and are serverless, so they do not need to be statically provisioned. For example, CodePipeline, CodeBuild, and Lambda all use the elasticity of the cloud to scale infrastructure dynamically, matching the supply of cloud resources to demand, avoiding overprovisioned capacity. Additionally, CloudFormation enables stack deprovisioning so that you can terminate resources that are no longer needed. By reducing overprovisioned compute and storage resources, you can minimize the environmental impact of your workloads.

    Read the Sustainability whitepaper 
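
The sketches below make a few of the practices above concrete using the AWS SDK for Python (Boto3). They are minimal sketches, not part of this Guidance's sample code, and every resource name in them (pipeline, secret, domain, repository, and stack names) is an illustrative assumption. First, polling the execution state of the scanning pipeline, as mentioned under operational excellence:

```python
import boto3

codepipeline = boto3.client("codepipeline")

# Hypothetical pipeline name; substitute the name used in your deployment.
PIPELINE_NAME = "external-package-scanning-pipeline"

def print_pipeline_state(name: str) -> None:
    """Print the latest execution status of each stage in the pipeline."""
    state = codepipeline.get_pipeline_state(name=name)
    for stage in state["stageStates"]:
        latest = stage.get("latestExecution", {})
        print(f'{stage["stageName"]}: {latest.get("status", "NOT_EXECUTED")}')

print_pipeline_state(PIPELINE_NAME)
```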
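
Second, for the security practices: reading the GitHub PAT from Secrets Manager and publishing a scanned package asset to CodeArtifact with a caller-computed SHA256 hash. The secret ID, domain, and repository names are assumptions.

```python
import hashlib
import boto3

secretsmanager = boto3.client("secretsmanager")
codeartifact = boto3.client("codeartifact")

# Hypothetical secret ID; the PAT authenticates the private repository webhook.
github_pat = secretsmanager.get_secret_value(SecretId="github-pat")["SecretString"]

# Publish a scanned asset; CodeArtifact verifies the SHA256 hash we calculate.
with open("package-1.0.0.tar.gz", "rb") as f:
    asset_bytes = f.read()

codeartifact.publish_package_version(
    domain="datascience-domain",        # illustrative domain name
    repository="approved-packages",     # illustrative repository name
    format="generic",
    namespace="scanned",                # generic packages require a namespace
    package="package",
    packageVersion="1.0.0",
    assetName="package-1.0.0.tar.gz",
    assetContent=asset_bytes,
    assetSHA256=hashlib.sha256(asset_bytes).hexdigest(),
)
```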
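
Third, the reliability bullet notes that failed stage actions can be retried; this sketch retries only the failed actions of a hypothetical scan stage:

```python
import boto3

codepipeline = boto3.client("codepipeline")

# Hypothetical names; substitute those from your deployment.
PIPELINE = "external-package-scanning-pipeline"
STAGE = "SecurityScan"

# Find the failed execution of the stage and retry just its failed actions.
state = codepipeline.get_pipeline_state(name=PIPELINE)
for stage in state["stageStates"]:
    latest = stage.get("latestExecution", {})
    if stage["stageName"] == STAGE and latest.get("status") == "Failed":
        codepipeline.retry_stage_execution(
            pipelineName=PIPELINE,
            stageName=STAGE,
            pipelineExecutionId=latest["pipelineExecutionId"],
            retryMode="FAILED_ACTIONS",
        )
```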
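
Last, for cost optimization: deleting the CloudFormation stack when it is not needed deprovisions its resources, and the same template can recreate the stack later. The stack name is an assumption.

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Hypothetical stack name; use the one chosen at deployment time.
STACK_NAME = "external-repo-access-guidance"

# Deleting the stack tears down its resources, avoiding idle-capacity charges.
cloudformation.delete_stack(StackName=STACK_NAME)
cloudformation.get_waiter("stack_delete_complete").wait(StackName=STACK_NAME)
print(f"Stack {STACK_NAME} deleted")
```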

Implementation Resources

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.


Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.
