Optimizing your nonprofit mission impact with AWS Glue and Amazon Redshift ML

Nonprofit organizations focus on a specific mission to impact their members, communities, and the world. In the nonprofit space, where resources are limited, it’s important to optimize the impact of your efforts. Data analytics and machine learning (ML) tools on Amazon Web Services (AWS) make it simple to not only quantify your impact but also predict the outcome of an initiative before resources are dedicated. Such predictions can support organizations in decision-making around programming.

In this blog post, learn how you can apply machine learning with Amazon Redshift ML on public datasets to support data-driven decisions optimizing your impact. This walkthrough focuses on the use case for how to use open data to support food security programming, but this solution can be applied to many other initiatives in the nonprofit space.

Dataset overview: Current population survey (CPS) food security supplement

Rather than expending resources gathering data to inform programming nonprofits can benefit from publicly available data. AWS Data Exchange offers a portfolio of more than 3,500 third-party datasets, including data specific to the public sector, that are made available via Amazon Simple Storage Service (Amazon S3), Amazon Redshift, AWS Lake Formation or through API. AWS Data Exchange includes datasets from recent census surveys. In this blog post we are interested in food security specifically. Therefore, we use the latest (2021) CPS Food Security Supplement listed directly on the US Census Bureau website.

This dataset contains demographic information and food security scores for approximately 128,000 US residents from 2021. We have modified this data to showcase 50 representative columns and filtered to residents who completed the full food security supplement. The filtered dataset is available through the AWS CloudFormation stack below. AWS CloudFormation is an orchestration service that helps automatically deploy and configure AWS services and resources.

Variable names and entries are coded by the Census Bureau and can be mapped to readable names and values in the CPS technical documentation.

Solution overview: Optimizing nonprofit mission impact with AWS Glue and Amazon Redshift ML

In this post, we examine a use case in which local organizations want to determine demand for food security programming as new residents enter their communities. This walkthrough uses publicly available data from the US Census Bureau to predict the level of food security for new residents in a given county in the US to better inform local programming needs.

Using Amazon Redshift ML, we train a model that predicts food security of new residents given basic demographic information such as age, race, income, employment status, household type and more. Amazon Redshift ML makes it simple for data analysts and database developers to create, train, and apply ML models using familiar SQL commands in Amazon Redshift data warehouses. Amazon Redshift ML integrates with Amazon SageMaker, a fully managed ML service, without requiring you to become an expert in machine learning.

Prerequisites

Before beginning this solution, you need the following:

1. An AWS Account.

2. Administrative access to the following services

3. A default virtual private cloud (VPC) available in your AWS Account. Learn how to create a default VPC if you have deleted the original.

4. The applied quota value for Maximum number of parallel compilation jobs in the Amazon Sagemaker service is greater than 0. Learn more about AWS service quotas.

5. Basic knowledge of SQL.

Deploying resources with AWS CloudFormation

We can create all the resources needed for our ML model using an AWS CloudFormation template.

This template creates an Amazon S3 bucket that includes the food security data, catalogs the data in AWS Glue, and provisions an Amazon Redshift Serverless cluster in the US East (N. Virginia) Region. Note that deploying these resources into your AWS account incurs charges. See the Cleaning up section for instructions on deleting these resources after deployment. Learn more about AWS Pricing.

Deploy the solution

1. Select the Launch Stack button below or select this text to launch the AWS CloudFormation template. The AWS CloudFormation page in the AWS Management Console opens.

2. Enter a Stack name (example: npo-blog-redshiftml).

3. Check the box that says I acknowledge that AWS CloudFormation might create IAM resources and select the Create Stack.

The stack takes 5-10 minutes to provision. When the template finishes deploying resources, the Stack info tab shows CREATE_COMPLETE as shown in Figure 1.

Figure 1. The AWS CloudFormation Stack info tab showing CREATE_COMPLETE status.

Figure 1. The AWS CloudFormation Stack info tab showing CREATE_COMPLETE status.

Getting started with Amazon Redshift Serverless

Amazon Redshift Serverless makes it simpler to run and scale analytics without having to manage the underlying infrastructure. It automatically provisions and scales data warehouse capacity, and you pay only for what you use.

After running the AWS CloudFormation template above, we are ready to reference the AWS Glue Data Catalog in Amazon Redshift to build a prediction model for food security.

1. Sign in to the AWS Management Console and open the Amazon Redshift Serverless console.

2. Choose Query Data from the Serverless dashboard. This opens a new tab where you are prompted to configure your account.

3. Leave the default settings and choose Configure account.

Create an external schema reference to AWS Glue

1. In the resources pane, select the Serverless: food-security-workgroup resource to create a connection.

2. Select Federated user, leave the database name as dev and select Create connection.

3. Select Create and choose Schema from the dropdown, as shown in Figure 2, to create a reference to your catalogued food security data from the AWS Glue Data Catalog.

NOTE: In a future Amazon Redshift release your AWS Glue Data Catalog can automatically referenced as an external schema. Read more about querying the AWS Glue Data Catalog, now in preview.

Figure 2. The Amazon Redshift Serverless resources pane.

Figure 2. The Amazon Redshift Serverless resources pane.

4. Enter the new schema name food_security.

5. Choose the external schema type.

6. For Glue database name, choose redshift-ml-database.

7. Choose the listed IAM role and choose Create schema. This creates a table reference in Amazon Redshift to AWS Glue.

NOTE: It is important to be logged in to your AWS account as an IAM user with administrative privileges or you may not have permissions to create the schema.

Figure 3. The Amazon Redshift Serverless create schema configuration.

Figure 3. The Amazon Redshift Serverless create schema configuration.

Query food security data

8. Choose the arrow of the food_security schema and the Tables folder to view your tables.

9. Right-click the table name and choose Select table.

10. Select Run in the query editor to launch the generated command to view a sample of your data.

NOTE: Your table name will be different from the example in Figure 4.

Figure 4. The Amazon Redshift Serverless query editor with sample result.

Figure 4. The Amazon Redshift Serverless query editor with sample result.

Creating test sets for ML

Given that the raw dataset from the 2021 CPS Food Security Supplement is over 500 columns, we have modified the dataset to 50 representative columns and filtered to residents who completed the CPS Food Security Supplement.

Now that we have the filtered dataset referenced in Amazon Redshift, we need to split it into separate datasets to train the ML model, validate the model performance, and predict new residents’ level of food security. We use 90 percent of our data for input (training and validation) and the remaining 10 percent for prediction. Note that Amazon Redshift ML automatically splits your input data into training and validation sets, so we do not need to create a separate validation set at this stage.

We add a column called data_group_id with randomly generated integers to segment our data into training, validation and prediction sets.

As you launch queries you will be using the Amazon Redshift query editor. You can add queries to the current query tab, or open a new editor tab by clicking the + icon in the top left of the Amazon Redshift query editor window and selecting Editor.

Insert your auto generated table name from the previous Select Table command into the FROM clause of the query below. Copy and paste the query below into the editor and launch the query to create a trimmed dataset in the public schema.

CREATE TABLE public.food_security_all AS (
    SELECT t1.*, 
        CAST(RANDOM() * 100 AS INT) AS data_group_id
    FROM
        (
SELECT * FROM food_security.<insert-your-table-name-here>
        ) t1
);

Now that we have a table segmented by data_group_id, launch the queries below from the Amazon Redshift query editor to create datasets specific to input and prediction. Use the data_group_id column we generated above as a range to split the data into 90 percent training/validation and 10 percent prediction.

-- creates an input set
-- Amazon Redshift ML will automatically split the input into training and validation sets
CREATE TABLE public.food_security_input AS (
select * from public.food_security_all
where data_group_id <= 90
);

--creates prediction set
CREATE TABLE public.food_security_prediction AS (
select * from public.food_security_all
WHERE data_group_id > 90
);

Creating a model in Amazon Redshift ML

With Amazon Redshift ML, we can use a simple SQL statement to create a model from our training data that was segmented previously.

The target column we use is HRFS12MD, which categorizes the food security of a resident on a categorical scale from 1) High food security to 4) Very low food security. High food security represents no reported indications of food-access problems, while very low food security represents reports of disrupted eating patterns and reduced food intake. Our model generates predicted values for this food security column for new residents entering a community.

Collect the bucket name for Amazon Redshift ML output

1. Navigate to the AWS CloudFormation console.

2. Select the AWS CloudFormation stack you created previously.

3. Select the Outputs tab.

4. Copy the value of the RedshiftOutputBucketName as shown in Figure 5. This is where our generated model artifacts will be stored after training.

NOTE: Your value for the bucket name will be different than the example in Figure 5.

Figure 5. The AWS CloudFormation Outputs tab; the value of the RedshiftOutputBucketName is highlighted.

Figure 5. The AWS CloudFormation Outputs tab; the value of the RedshiftOutputBucketName is highlighted.

Build the model command

Insert your Amazon Redshift output bucket name in the S3_BUCKET parameter under SETTINGS of the query below.

CREATE MODEL food_security_resident_model
FROM (
SELECT
GEDIV,
GEREG,
GESTFIPS,
GTCBSA,
GTCBSASZ,
GTCO,
GTMETSTA,
HEFAMINC,
HEHOUSUT,
HETENURE,
HRHHID,
HRHHID2,
HRHTYPE,
HRINTSTA,
HRNUMHOU,
HRPOOR,
HRSUPINT,
HUBUS,
PEAFNOW,
PECERT1,
PEDISDRS,
PEDISEAR,
PEDISEYE,
PEDISOUT,
PEDISPHY,
PEDISREM,
PEEDUCA,
PEERNPER,
PEFNTVTY,
PEHSPNON,
PEMARITL,
PEMJOT,
PEMLR,
PEMNTVTY,
PENATVTY,
PENLFACT,
PERRP,
PESCHENR,
PESEX,
PRCITSHP,
PRFAMREL,
PRHRUSL,
PRIMIND1,
PRNMCHLD,
PRPERTYP,
PRTAGE,
PTDTRACE,
PULINENO,
PUWK,
HRFS12MD
FROM public.food_security_input
)
TARGET HRFS12MD
FUNCTION food_security_prediction
IAM_ROLE default
SETTINGS (
    S3_BUCKET '<insert-redshift-output-bucket-name>'
    )

2. Launch the query from the Amazon Redshift query editor to create a new model.

The model creates a function (food_security_prediction) that we invoke for predictions. You need to list the specific column names because you do not want to include the data_group_id column we used to split test sets.

The query executes quickly and launches an asynchronous training job in the background. The model takes about 1.5 hours to train.

CREATE MODEL parameters

The CREATE MODEL SQL command takes in a number of parameters to tune your training job. In this example, we include our input data, a target column, a function name, an IAM role, and an Amazon S3 destination for outputting the model artifacts. This is a simple list of parameters and prompts Amazon Redshift to discover an appropriate model on your behalf during training.

There are optional parameters available to further restrict the model during training. These parameters enable you to specify the model, problem type, hyperparameters, and maximum runtime, among other settings.

Learn more about the CREATE MODEL parameters.

Show the model status

Execute the query below from the Amazon Redshift query editor to check the status of your model. If the MODEL STATE is TRAINING then the model has not yet completed.

show model food_security_resident_model;

Figure 6. Result from show model command with model state as TRAINING.

Once the MODEL STATE is READY, then it is available for use. When creating the model, Amazon Redshift ML automatically splits your input into training and validation test sets to determine how close the predicted values are to the actual values. You can see this accuracy value for the completed model in the output under validation:accuracy. Your model should showcase around 85 percent accuracy, but results may differ because the underlying training set is randomized, and hyperparameter selection varies.

Figure 7. Result from show model command with model state as READY.

Figure 7. Result from show model command with model state as READY.

These results will show the model and problem type that were automatically selected by Amazon Redshift ML for the dataset.

Figure 8. Result from show model command with model type xgboost and problem type MulticlassClassification.

Figure 8. Result from show model command with model type xgboost and problem type MulticlassClassification.

Predicting food security for new residents

Once the model finishes training, it is ready to predict food security values for new residents in the community. We use the prediction function in a simple SQL statement to output predicted food security as a new column called predicted_food_security.

These results can be grouped by county, with the food scores summarized to showcase the level of need in a particular area. Launch the query below from the Amazon Redshift query editor to showcase predicted food security counts across a state and county.

select state, county_code, predicted_food_security,
count(unique_respondent_ID) as count_respondents from (
SELECT
HRHHID||HRHHID2||PULINENO as unique_respondent_ID,
GESTFIPS as state,
GTCO as county_code, 
food_security_prediction(
GEDIV,
GEREG,
GESTFIPS,
GTCBSA,
GTCBSASZ,
GTCO,
GTMETSTA,
HEFAMINC,
HEHOUSUT,
HETENURE,
HRHHID,
HRHHID2,
HRHTYPE,
HRINTSTA,
HRNUMHOU,
HRPOOR,
HRSUPINT,
HUBUS,
PEAFNOW,
PECERT1,
PEDISDRS,
PEDISEAR,
PEDISEYE,
PEDISOUT,
PEDISPHY,
PEDISREM,
PEEDUCA,
PEERNPER,
PEFNTVTY,
PEHSPNON,
PEMARITL,
PEMJOT,
PEMLR,
PEMNTVTY,
PENATVTY,
PENLFACT,
PERRP,
PESCHENR,
PESEX,
PRCITSHP,
PRFAMREL,
PRHRUSL,
PRIMIND1,
PRNMCHLD,
PRPERTYP,
PRTAGE,
PTDTRACE,
PULINENO,
PUWK
) AS predicted_food_security
FROM
public.food_security_prediction
order by unique_respondent_id desc 
) t 
group by state, county_code, predicted_food_security
order by state, county_code, predicted_food_security

The expected output of the query below shows the count of respondents for a food security category in a given state and county. Your results may be different depending on your prediction set and the hyperparameters the automatic model tuning selected on your behalf.

Filter to a given state and county code in your SQL query. State and county codes can be found in the CPS technical documentation.

We can see that for state = 26 (Minnesota) where county = 163 (Washington County), there are 10 respondents predicted to have high food security (1), and 9 predicted with marginal, low, or very low food security (2 – 4). Overall, 47 percent of surveyed residents in this county are predicted to showcase some level of food insecurity.

Local organizations can use this information to target support for counties and residents predicted to have low food security and better prepare for the demand for local food security programming.

Figure 9. Sample prediction result in the Amazon Redshift Serverless query editor.

Figure 9. Sample prediction result in the Amazon Redshift Serverless query editor.

Cleaning up

To avoid unnecessary charges like storage and computing costs, you can delete the AWS CloudFormation stack to remove all resources from your AWS account.

Navigate to AWS CloudFormation in the AWS console and select your stack. Choose Delete and select Delete Stack. All resources are deleted after about 10 minutes.

Additionally, Amazon Redshift ML generates several models in the Amazon SageMaker service. You can manually delete these models if you do not wish to retain them.

Navigate to the SageMaker console.
In the side menu, select Inference, then Models.
Select the model you wish to delete.
Choose Actions and select Delete from the dropdown.
Choose Delete to permanently delete your model.
Repeat for the existing models generated by Amazon Redshift ML you do not wish to retain.

Conclusion

In this blog post, we showed how nonprofit organizations can use Amazon Redshift ML to predict the outcomes of their organizational efforts. We walked through the process from configuring Amazon Redshift to testing and evaluating the insights provided by Amazon Redshift ML. We showcased how you can create, train and deploy ML models with familiar SQL commands without needing to be an expert in ML.

Your organization can use existing skillsets to apply ML, saving time and resources in getting to actionable insights. You can apply this approach to your own datasets to make data-driven decisions that optimize your impact.

To learn more about Amazon Redshift ML, explore the Using machine learning in Amazon Redshift documentation. If you would like to discuss this further with your AWS Account Team, complete the Public Sector Contact Us form for your organization.