AWS Machine Learning Blog

Hosting a private PyPI server for Amazon SageMaker Studio notebooks in a VPC

Amazon SageMaker Studio notebooks provide a full-featured integrated development environment (IDE) for flexible machine learning (ML) experimentation and development. Built-in security measures support a versatile and collaborative environment. In some cases, such as to protect sensitive data or meet regulatory requirements, security protocols require that public internet access be disabled in the development environment.

Typically, developers have access to the public internet and can install any new libraries they want to import. You can install Python packages from the public Python Package Index (PyPI), a Python software repository, using standard tools such as pip. PyPI hosts hundreds of thousands of packages, including common packages such as NumPy, Pandas, Matplotlib, Pytest, Requests, Django, and BeautifulSoup.

In a development environment with internet access disabled, you can instead mirror packages and host your own PyPI server in your own Amazon Virtual Private Cloud (Amazon VPC). A VPC is a logically isolated virtual network into which you can launch resources, such as Amazon Elastic Compute Cloud (Amazon EC2) instances and SageMaker Studio domains. You have fine-grained access control over its network connectivity. You can specify an IP address range for the VPC and associate security groups to control its inbound and outbound traffic. You can also add subnets that use a subset of IP addresses within the VPC, and choose whether each subnet is open to the public internet or is private.
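
To make the relationship between a VPC address range and its subnets concrete, the following minimal sketch uses Python's standard ipaddress module to check that a subnet range falls inside a VPC range. The CIDR blocks are hypothetical values chosen for illustration; they are not taken from this solution.

```python
import ipaddress

# Hypothetical address ranges, chosen only for illustration.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")
private_subnet = ipaddress.ip_network("10.0.1.0/24")


def subnet_fits_vpc(subnet, vpc):
    """Return True when every address in `subnet` also lies inside `vpc`."""
    return subnet.subnet_of(vpc)


print(subnet_fits_vpc(private_subnet, vpc_cidr))  # True
print(private_subnet.num_addresses)               # 256
```

A range outside the VPC block, such as 192.168.0.0/24 in this example, would fail the same check.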

When you use a local PyPI server with this architecture and install Python libraries from your SageMaker Studio notebook, you connect to your private server instead of a public package index, and all traffic remains within a single secured VPC and private subnet.

SageMaker Studio recently launched VPC integration to meet these security needs. You can now launch Studio notebooks within a private VPC, disabling internet access. To install Python packages within this secure environment, you can configure an EC2 instance in your VPC that acts as a PyPI server for your notebooks. This enables you to maintain productivity and ease of package installation while working within a private environment that isn’t accessible from the public internet.

Solution overview

This solution creates a private PyPI server on an EC2 instance, and connects it to a SageMaker Studio notebook through network configuration including a VPC, private subnet, security group, and elastic network interface. The following diagram illustrates this architecture.

You complete the following steps to implement this solution:

  1. Launch an EC2 instance within a VPC, subnet, and security group.
  2. Configure the instance to function as a private PyPI server.
  3. Create a VPC endpoint and add security group rules.
  4. Create a VPC-only SageMaker Studio domain, user, and notebook with the necessary permissions and networking.
  5. Install a Python package from the PyPI server onto the SageMaker Studio notebook.

Prerequisites

This is an intermediate-level solution with the following prerequisites:

  • An AWS account
  • Sufficient level of access to create Amazon SageMaker, Amazon EC2, and Amazon VPC resources
  • Familiarity with creating and modifying AWS resources on the AWS Management Console
  • Basic command-line experience, such as SSHing onto an EC2 instance, installing packages, and editing files using vim or another command-line text editor

Launching an EC2 instance

For this post, we launch a new EC2 instance in the us-east-2 Region. For the full list of available Regions supporting SageMaker Studio, see Supported Regions and Quotas.

  1. On the Amazon EC2 console, launch a new instance in a Region supporting SageMaker Studio.
  2. Choose an Amazon Linux 2 AMI.
  3. Choose a t2.medium instance (or larger t2, if preferred).
  4. On the Step 3: Configure Instance Details page, for Network, choose your VPC.
  5. For Subnet, choose your subnet.

You can use the default VPC and subnet, use other existing resources, or create new ones. Make sure to note the VPC and subnet you select for later reference.

  1. Leave all other settings as-is.
  2. Use default storage and tag settings.
  3. On the Step 6: Configure Security Group page, for Assign a security group, select Create a new security group.
  4. For Security group name, enter studio-SG.
  5. For Type, choose SSH on port range 22.
  6. For Source, choose My IP.

This allows you to SSH onto the instance from your current IP address only.

  1. Create a new key pair, studio-host.
  2. Launch the instance.

For more information about launching an instance, see Tutorial: Getting started with Amazon EC2 Linux instances.

Configuring the instance as a PyPI server

To configure your instance, complete the following steps:

  1. Open a terminal window and navigate to the directory containing your .pem file.
  2. Change the key permissions and SSH onto your instance, substituting in the public IP address and Region:
    chmod 400 studio-host.pem
    ssh -i "studio-host.pem" ec2-user@ec2-x-x-x-x.{region}.compute.amazonaws.com

If needed, you can find the SSH command by selecting your instance on the console, choosing Connect, and navigating to the SSH Client tab.

  1. Install pip, which you use to install Python packages, and bandersnatch, which you use to mirror packages from the public PyPI server onto your instance. For this post, we use the package AWS Data Wrangler, an AWS Professional Services open-source library that integrates Pandas DataFrames with AWS services:
    sudo yum install python3-pip
    sudo pip3 install multidict==4.7.6
    sudo pip3 install yarl==1.6.0
    sudo pip3 install bandersnatch

You now configure bandersnatch to specify packages and their versions to mirror.

  1. Open a config file:
    sudo vim /etc/bandersnatch.conf
  1. Enter the following file contents:
    [mirror]
    directory = /pypi
    master = https://pypi.org
    timeout = 10
    workers = 3
    hash-index = false
    stop-on-error = false
    json = false
    
    [plugins]
    enabled =
        whitelist_project
        allowlist_release
    
    [whitelist]
    packages =
        awswrangler==1.10.0
        pyarrow==2.0.0
        SQLAlchemy==1.3.10
        s3fs==0.4.2
        numpy==1.18.4
        sqlalchemy-redshift==0.7.9
        boto3==1.15.10
        pandas==1.1.0
        psycopg2-binary==2.8.0
        pymysql==0.9.3
        botocore==1.18.10
        fsspec==0.7.4
        s3transfer==0.3.2
        jmespath==0.9.4
        pytz==2019.3
        python-dateutil==2.8.1
        urllib3==1.25.8
        six==1.14.0
    
  1. Mirror the libraries and list the directory contents to verify that the libraries have been copied onto the instance:
    sudo /usr/local/bin/bandersnatch mirror
    ls /pypi/web/simple/
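
Instead of comparing the directory listing by eye, you could check it against the config file programmatically. The following is a minimal sketch, assuming that the mirror stores one directory per project under web/simple and that the pinned packages live in the [whitelist] section shown earlier. The helper names (normalize, pinned_projects, missing_from_mirror) are hypothetical and not part of bandersnatch.

```python
import configparser
import re
from pathlib import Path


def normalize(name):
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pinned_projects(config_text, section="whitelist"):
    """Return normalized project names listed under `packages` in `section`."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    names = []
    for line in parser.get(section, "packages").splitlines():
        line = line.strip()
        if line:
            # Keep only the project name, dropping any version specifier.
            project = re.split(r"[=<>!~\s]", line, maxsplit=1)[0]
            names.append(normalize(project))
    return names


def missing_from_mirror(config_text, simple_dir):
    """List pinned projects that have no directory under the simple index."""
    simple = Path(simple_dir)
    mirrored = set()
    if simple.is_dir():
        mirrored = {normalize(p.name) for p in simple.iterdir() if p.is_dir()}
    return [name for name in pinned_projects(config_text) if name not in mirrored]


# Small illustrative config; on the instance you would read /etc/bandersnatch.conf.
SAMPLE_CONFIG = """\
[whitelist]
packages =
    awswrangler==1.10.0
    SQLAlchemy==1.3.10
    python-dateutil==2.8.1
"""
print(pinned_projects(SAMPLE_CONFIG))
```

On the instance, a call such as missing_from_mirror(Path("/etc/bandersnatch.conf").read_text(), "/pypi/web/simple") would return an empty list if every pinned project was mirrored.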

Next, configure pip so that it searches your private PyPI server, rather than the public index, when it installs packages. The pip config file, /etc/pip.conf, already exists on the instance; you add two lines to it.

  1. Open the file:
    sudo vim /etc/pip.conf
  1. Ensure your pip config file reads as follows, adding the last two lines:
    [global] 
    disable_pip_version_check = 1 
    format = columns 
    index-url = http://localhost/simple 
    trusted-host = localhost
  1. Install and configure nginx so that the instance can function as a private web server:
    sudo amazon-linux-extras install nginx1
    sudo vim /etc/nginx/nginx.conf
  1. Update the server section of the nginx config file to change the server_name to localhost, listen on the private IP address, and add the root and index locations. The server section of the nginx config file should be as follows:
    server {
            listen x.x.x.x:80;
            listen       80;
            listen       [::]:80;
            server_name localhost;
            root         /usr/share/nginx/html;
    
            # Load configuration files for the default server block.
            include /etc/nginx/default.d/*.conf;
    
            location / { root /pypi/web/; index index.html index.htm index.php; }
    
            error_page 404 /404.html;
                location = /40x.html {
            }
    
            error_page 500 502 503 504 /50x.html;
                location = /50x.html {
            }
        }
    
  2. Start the server and install the package locally to test it out:
    sudo service nginx start
    pip3 install --user awswrangler

Note that the packages are collected from the localhost, not the public package index.

You now have a private PyPI server ready for use.
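
If you want to inspect what pip sees when it queries the server, the following sketch fetches a project page from the simple index and extracts the file links it lists. It uses only the Python standard library. The function names are hypothetical, and the sample HTML is an invented, minimal page for illustration rather than actual server output.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _LinkCollector(HTMLParser):
    """Collect the href of every anchor tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_simple_index(html):
    """Return the links listed on a simple-index project page."""
    collector = _LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_project_links(base_url, project, timeout=5):
    """Fetch <base_url>/<project>/ and return the links that page lists."""
    url = f"{base_url.rstrip('/')}/{project}/"
    with urlopen(url, timeout=timeout) as response:
        return parse_simple_index(response.read().decode("utf-8"))


# Invented sample page, for illustration only.
SAMPLE_PAGE = '<html><body><a href="../../packages/example-1.0.tar.gz">example-1.0.tar.gz</a></body></html>'
print(parse_simple_index(SAMPLE_PAGE))
```

On the instance, fetch_project_links("http://localhost/simple", "awswrangler") would list the mirrored distribution files, assuming nginx is running with the configuration shown earlier.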

Creating a VPC endpoint

VPC endpoints allow resources within a VPC to access AWS services. For this solution, you will create an endpoint for the SageMaker API. You can extend this solution by adding more endpoints for other services you need to access from your notebook.

There are two types of VPC endpoints:

  • Interface endpoints – Elastic network interfaces within a subnet that serve as entry points for traffic destined to a supported AWS service, such as SageMaker
  • Gateway endpoints – Only supported for Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB

To create the endpoint, complete the following steps:

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create Endpoint.
  3. Create the SageMaker API endpoint com.amazonaws.{region}.sagemaker.api.
  4. Make sure you choose the same VPC, subnet, and security group used by your EC2 instance.

When finished, your endpoint is listed as shown in the following screenshot.

For more information about VPC endpoints, including the distinction between interface endpoints and gateway endpoints, see VPC endpoints.
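
If you prefer to script this step, the following sketch assembles the parameters that the boto3 EC2 client's create_vpc_endpoint call accepts for an interface endpoint. The resource IDs are placeholders, and the helper name build_endpoint_request is hypothetical. The actual API call is left commented out because it requires boto3 and AWS credentials.

```python
def build_endpoint_request(region, vpc_id, subnet_id, security_group_id):
    """Build keyword arguments for an interface endpoint to the SageMaker API."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.sagemaker.api",
        "SubnetIds": [subnet_id],
        "SecurityGroupIds": [security_group_id],
    }


# Placeholder IDs; substitute the VPC, subnet, and security group of your EC2 instance.
request = build_endpoint_request("us-east-2", "vpc-EXAMPLE", "subnet-EXAMPLE", "sg-EXAMPLE")
print(request["ServiceName"])  # com.amazonaws.us-east-2.sagemaker.api

# To submit the request (requires boto3 and AWS credentials):
# import boto3
# boto3.client("ec2", region_name="us-east-2").create_vpc_endpoint(**request)
```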

Editing your security group rules

Edit your security group to add an inbound rule allowing all traffic from within the security group. This allows the Studio notebook to communicate with the EC2 instance because they both reside within this security group.

You can search for the security group by name on the Amazon EC2 console, which then suggests the matching security group ID.

After you add the rule, the security group has two inbound rules: one allowing SSH on port 22 from your IP to connect to the EC2 instance, and another allowing all traffic from within the security group.

For more information about security groups, see Security groups for your VPC.
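
The two inbound rules described above can be expressed as the IpPermissions structures that the boto3 EC2 client's authorize_security_group_ingress call accepts. This is a sketch under assumptions: the IP address and group ID are placeholders, and the helper names are hypothetical.

```python
import ipaddress


def ssh_rule(my_ip):
    """Allow SSH (TCP port 22) from a single IPv4 address."""
    address = ipaddress.IPv4Address(my_ip)  # raises ValueError if malformed
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": f"{address}/32"}],
    }


def self_reference_rule(group_id):
    """Allow all traffic from members of the same security group."""
    return {
        "IpProtocol": "-1",  # -1 means all protocols and ports
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }


# Placeholder values for illustration.
rules = [ssh_rule("203.0.113.10"), self_reference_rule("sg-EXAMPLE")]
print(len(rules))  # 2

# To apply the rules (requires boto3 and AWS credentials):
# import boto3
# boto3.client("ec2").authorize_security_group_ingress(
#     GroupId="sg-EXAMPLE", IpPermissions=rules)
```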

Creating VPC-only SageMaker Studio resources

All SageMaker Studio resources reside within a domain, with a maximum of one domain per Region in an AWS account. A domain contains one or more users, and as a user you can open a Studio notebook. For more information about creating a domain, see CreateDomain.

With the recent release of VPC support for Studio, you can choose from two networking options: public internet only and VPC only. For more information, see Connect SageMaker Studio Notebooks to Resources in a VPC and Securing Amazon SageMaker Studio connectivity using a private VPC. For this post, we create a VPC-only domain.

  1. On the SageMaker Studio console, select Standard setup.

This allows for detailed configuration.

  1. For Authentication method, select AWS Identity and Access Management (IAM).
  2. Under Permissions, choose Create a new role.
  3. Use the default settings.
  4. Choose Create role.

This creates a new SageMaker execution role.

  1. In the Network and Storage section, configure your VPC and subnet to match those of the EC2 instance.
  2. For Network Access for Studio, select VPC Only.
  3. For Security group(s), choose the same security group as used for the EC2 instance.
  4. Choose Submit.

Wait approximately a minute to see the banner notification that SageMaker Studio is ready.

You now create a Studio user within the domain.

  1. Choose Add user.
  2. Give the user a name (for example, studio-user).
  3. Choose the role you just created, AmazonSageMaker-ExecutionRole-<timestamp when the role was created>.
  4. Choose Submit.

This concludes the initial SageMaker Studio resource creation. You now have a Studio domain and user ready for use and can proceed with creating and using a notebook.
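
The console choices above map onto the parameters of the SageMaker CreateDomain and CreateUserProfile APIs. The following sketch assembles those parameters for the boto3 SageMaker client without calling AWS. The domain name, resource IDs, and role ARN are placeholders, and the helper names are hypothetical.

```python
def build_domain_request(domain_name, vpc_id, subnet_id, security_group_id, role_arn):
    """Build keyword arguments for a VPC-only Studio domain with IAM authentication."""
    return {
        "DomainName": domain_name,
        "AuthMode": "IAM",
        "DefaultUserSettings": {
            "ExecutionRole": role_arn,
            "SecurityGroups": [security_group_id],
        },
        "VpcId": vpc_id,
        "SubnetIds": [subnet_id],
        "AppNetworkAccessType": "VpcOnly",
    }


def build_user_request(domain_id, role_arn, user_name="studio-user"):
    """Build keyword arguments for a Studio user profile in the domain."""
    return {
        "DomainId": domain_id,
        "UserProfileName": user_name,
        "UserSettings": {"ExecutionRole": role_arn},
    }


# Placeholder values for illustration.
ROLE_ARN = "arn:aws:iam::123456789012:role/EXAMPLE-ROLE"
domain_request = build_domain_request(
    "studio-domain", "vpc-EXAMPLE", "subnet-EXAMPLE", "sg-EXAMPLE", ROLE_ARN)
user_request = build_user_request("d-EXAMPLE", ROLE_ARN)
print(domain_request["AppNetworkAccessType"])  # VpcOnly

# To submit the requests (requires boto3 and AWS credentials):
# import boto3
# sagemaker = boto3.client("sagemaker", region_name="us-east-2")
# sagemaker.create_domain(**domain_request)
# sagemaker.create_user_profile(**user_request)
```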

Installing a Python package onto the SageMaker Studio notebook

To start using the PyPI server from the SageMaker Studio notebook, complete the following steps:

  1. On the SageMaker Studio Control Panel, choose Open Studio next to the user name.
  2. Wait for your Studio environment to load.

You can now see the Studio UI. For more information, see the Amazon SageMaker Studio UI Overview.

  1. Use the default SageMaker JumpStart Data Science image and create a new Notebook Python 3.
  2. Wait a few minutes for the image to launch and your notebook to be available.

If you try to run a command before the notebook is available, you get the message: Note: The kernel is still starting. Please execute this cell again after the kernel is started. After your image has launched, you see it listed under Kernel Sessions, along with items for Running Instances and Running Apps. The kernel runs within the app, and the app runs on the instance.

Now you’re ready to configure your notebook. The first step is pip configuration, so that when you install a package using pip, your notebook searches for the package on the private PyPI server instead of through the public internet at pypi.org.

  1. Run the following command in a notebook cell, substituting your EC2 instance’s private IP address:
    !printf '[global]\nindex-url = http://x.x.x.x/simple\ntrusted-host = x.x.x.x'| sudo tee /etc/pip.conf
  1. To check that the file was successfully written, run the following command:
    !head /etc/pip.conf
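
Because a typo in the IP address only surfaces later as a failed install, you might validate the address before writing the file. The following sketch renders the same two settings with Python's configparser module and rejects addresses that are malformed or not private. The helper name render_pip_conf and the sample address are hypothetical.

```python
import configparser
import io
import ipaddress


def render_pip_conf(server_ip):
    """Return pip.conf contents that point pip at a private PyPI server."""
    address = ipaddress.IPv4Address(server_ip)  # raises ValueError if malformed
    if not address.is_private:
        raise ValueError(f"{address} is not a private IP address")
    parser = configparser.ConfigParser(interpolation=None)
    parser["global"] = {
        "index-url": f"http://{address}/simple",
        "trusted-host": str(address),
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Hypothetical private address; substitute your EC2 instance's private IP.
print(render_pip_conf("10.0.1.25"))
```

The rendered text contains the same [global] section, index-url, and trusted-host entries that the printf command writes.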

Now you’re ready to install Python packages from your server.

  1. To see that AWS Data Wrangler isn’t installed by default, try to import it. The import fails with a ModuleNotFoundError:
    import awswrangler
  1. Install the package and append its install location to your Python path:
    !pip install awswrangler
    import sys
    sys.path.append('/home/sagemaker-user/.local/lib/python3.7/site-packages')

The library was installed from your private server’s index, as you specified in the pip config file, http://{EC2-IP}/simple.
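
The path appended in the previous step is specific to Python 3.7 and the sagemaker-user home directory. As a hedged alternative, assuming pip installed the package into the per-user site-packages directory (as that path suggests), you could ask the standard site module for the location instead of hard-coding it. The helper name is hypothetical.

```python
import site
import sys


def ensure_user_site_on_path():
    """Append the per-user site-packages directory to sys.path if it is missing."""
    user_site = site.getusersitepackages()
    if user_site not in sys.path:
        sys.path.append(user_site)
    return user_site


print(ensure_user_site_on_path())
```

This keeps the notebook cell working if the image's Python version or user name differs from the one assumed in the hard-coded path.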

  1. Now that the package has been installed, you can import the package smoothly:
    import awswrangler

Now your notebook is ready for development, including installation of the Python libraries of your choice! Moreover, your PyPI server remains operational and available even when you delete your notebooks or use multiple notebooks. Your PyPI server is separated from your development environment, giving you freedom to manage your notebook resources in the way that best suits your needs.

Cleaning up

To clean up your resources, complete the following steps:

  1. Shut down the running instance in the SageMaker Studio notebook.
  2. Delete any remaining user’s apps on the SageMaker Studio console, including the default app.
  3. Delete the SageMaker Studio user.
  4. Delete Studio in the SageMaker Studio Control Panel.
  5. Stop the EC2 instance.
  6. Terminate the EC2 instance.
  7. Delete the IAM role, VPC endpoint, studio-SG security group, and Amazon Elastic File System (EFS) file system.
  8. Delete the rules in the inbound and outbound NFS security groups.
  9. Delete the security groups.

Conclusion

This post demonstrated how to get started with SageMaker Studio in VPC-only mode, while retaining the ability to install Python packages by hosting a private PyPI server. Now you can move forward with your ML development in notebooks residing within this secure environment.

We invite you to explore other exciting applications of SageMaker Studio, including Amazon SageMaker Experiments and scheduling notebooks on SageMaker ephemeral instances.


About the Author

Julia Kroll is a Data & Machine Learning Engineer for AWS Professional Services. She works with enterprise and public sector customers to build data lake, analytics, and machine learning solutions.