In this module, you create an Amazon S3 bucket to stage your interactions dataset. To ensure that Amazon Personalize can access and work with the data, you must also grant permissions using IAM roles and policies.

Time to Complete Module: 20 Minutes


  • Step 1. Create Amazon S3 bucket and upload data

    Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

    Training a model produces model training data and model artifacts. In this lab, you use an Amazon S3 bucket to stage the interactions dataset, and store the model artifacts generated by Amazon Personalize during model training.

    In your Jupyter notebook, copy and paste the following code into a new code cell and choose Run.

    import boto3

    session = boto3.session.Session()
    region = session.region_name
    s3 = boto3.client('s3')
    account_id = boto3.client('sts').get_caller_identity().get('Account')
    bucket_name = account_id + "-" + region + "-" + "personalizedemoml"
    print(bucket_name)
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': region}
            )
    This script creates an Amazon S3 bucket named [account id]-[region]-personalizedemoml.

    Note: If you encounter an error, it may be caused by an existing S3 bucket with the same name. Modify the bucket name and run the code again.
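    If you prefer not to edit the name by hand each time, one option is to append a short random suffix. A minimal sketch (the make_bucket_name helper is illustrative, not part of the tutorial code; note that S3 bucket names must be at most 63 characters):

    ```python
    import uuid

    def make_bucket_name(account_id, region, base="personalizedemoml", suffix=None):
        # Assemble the same [account id]-[region]-personalizedemoml pattern,
        # optionally with a suffix to avoid collisions with existing buckets.
        name = "-".join([account_id, region, base])
        if suffix:
            name = "{}-{}".format(name, suffix)
        return name

    # Example: append the first 8 hex characters of a random UUID
    unique_name = make_bucket_name("123456789012", "us-east-1",
                                   suffix=uuid.uuid4().hex[:8])
    ```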

    Next, upload the data. In your Jupyter notebook, copy and paste the following code into a new code cell and choose Run.

    interactions_file_path = data_dir + "/" + interactions_filename
    boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
    interactions_s3DataPath = "s3://"+bucket_name+"/"+interactions_filename
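    The string concatenation above can be wrapped in a small helper, which makes the URI format easy to check (build_s3_path and the example names are illustrative, not part of the tutorial code):

    ```python
    def build_s3_path(bucket, key):
        # Personalize expects data locations as s3://<bucket>/<key> URIs.
        return "s3://{}/{}".format(bucket, key)

    path = build_s3_path("123456789012-us-east-1-personalizedemoml",
                         "interactions.csv")
    ```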
  • Step 2. Configure S3 bucket policy

    In this step, you configure the Amazon S3 bucket policy so that Amazon Personalize can read the content of your S3 bucket. Run the following code block to create and attach the appropriate policy.
    import json

    policy = {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {
                    "Service": "personalize.amazonaws.com"
                },
                "Action": [
                    "s3:*Object",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name)
                ]
            }
        ]
    }
    
    s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
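    The policy can also be generated by a small function, which makes it easy to sanity-check the statement before applying it (personalize_bucket_policy is an illustrative name; the policy body matches the one above):

    ```python
    import json

    def personalize_bucket_policy(bucket_name):
        # Allow the Personalize service principal to list the bucket and
        # read/write its objects (s3:*Object matches GetObject, PutObject, etc.).
        return {
            "Version": "2012-10-17",
            "Id": "PersonalizeS3BucketAccessPolicy",
            "Statement": [{
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:*Object", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }],
        }

    policy_json = json.dumps(personalize_bucket_policy("example-bucket"))
    ```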

    You also need to create an IAM role that Amazon Personalize can assume, so that the service has permission to perform certain tasks on your behalf. Run the following code to create the role and attach the required policies to it.

    import time

    iam = boto3.client("iam")
    
    role_name = "PersonalizeRolePOC"
    assume_role_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": "personalize.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
            }
        ]
    }
    
    create_role_response = iam.create_role(
        RoleName = role_name,
        AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
    )
    
    # AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
    # if you would like to use a bucket with a different name, please consider creating and attaching a new policy
    # that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
    policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
    iam.attach_role_policy(
        RoleName = role_name,
        PolicyArn = policy_arn
    )
    
    # Now add S3 support
    iam.attach_role_policy(
        PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
        RoleName=role_name
    )
    time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate
    
    role_arn = create_role_response["Role"]["Arn"]
    print(role_arn)
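    The fixed time.sleep(60) above works, but IAM changes are eventually consistent, so a retry loop with backoff is a common alternative. A sketch with a stubbed readiness check (wait_until and ready are illustrative names, not AWS APIs):

    ```python
    import time

    def wait_until(check, attempts=5, delay=1.0, backoff=2.0):
        # Call check() until it returns True, sleeping with exponential
        # backoff between attempts; return False if it never succeeds.
        for _ in range(attempts):
            if check():
                return True
            time.sleep(delay)
            delay *= backoff
        return False

    # Stubbed example: the check succeeds on the third call. In the notebook,
    # check() could instead attempt an operation that requires the new role.
    calls = {"n": 0}
    def ready():
        calls["n"] += 1
        return calls["n"] >= 3

    ok = wait_until(ready, attempts=5, delay=0.01)
    ```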
  • Step 3. Import the dataset into Amazon Personalize

    Recall that you created the dataset group and dataset earlier in this tutorial. Now you can create the import job that loads the data from Amazon S3 into Amazon Personalize for use in your model.

    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = "personalize-demo-import1",
        datasetArn = interactions_dataset_arn,
        dataSource = {
            "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
        },
        roleArn = role_arn
    )
    
    dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    The import job runs asynchronously, so poll its status until it completes:
    %%time
    max_time = time.time() + 6*60*60 # 6 hours
    while time.time() < max_time:
        describe_dataset_import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = dataset_import_job_arn
        )
        status = describe_dataset_import_job_response["datasetImportJob"]['status']
        print("DatasetImportJob: {}".format(status))
        
        if status == "ACTIVE" or status == "CREATE FAILED":
            break
            
        time.sleep(60)

    The output reports the status of the job each minute. Wait for the DatasetImportJob status to show ACTIVE in your notebook. This step takes 10 to 15 minutes.

    DatasetImportJob: CREATE PENDING
    DatasetImportJob: CREATE IN_PROGRESS
    ...
    DatasetImportJob: ACTIVE

    Great! Your dataset is now imported to Amazon Personalize.
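    The loop above treats ACTIVE and CREATE FAILED as terminal states. That check can be isolated for reuse (is_terminal is an illustrative helper; the status strings are the ones reported in this step):

    ```python
    def is_terminal(status):
        # An import job is finished once it is ACTIVE (success) or
        # CREATE FAILED (failure); any other status means keep polling.
        return status in ("ACTIVE", "CREATE FAILED")

    observed = ["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"]
    finished = [s for s in observed if is_terminal(s)]
    ```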


In this module, you created the Amazon S3 bucket to stage your dataset, created the appropriate policies and roles for Amazon Personalize to access the data, then created an import job to import the data into Amazon Personalize.

In the next module, you create an Amazon Personalize solution that you can later deploy for recommendations.