Using the XOR Secret Computing Platform for machine learning on private data sources
AWS users occasionally need to perform analysis on data sources containing private or sensitive inputs. Inpher’s XOR Secret Computing Platform, available in AWS Marketplace, enables data scientists to train and run machine learning models while maintaining data privacy and without trading utility. Data analysis and machine learning performed by XOR can improve model performance with mathematically guaranteed data privacy while ensuring the data never leaves the data source.
In this post, I show you how to use XOR Trial Beta to predict the risk of coronary heart disease by performing Secret Computing. I show how to use secure multi-party computation on three distributed datasets and how to add features to the training data.
This demonstration involves joining the three datasets using a private set intersection function, which joins features from different data sources on a common identifier. I also show how to use the output of the private set intersection function in a logistic regression to identify the influence of those features on the target variable. A logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist.
This is all performed without viewing the data inputs or requiring the data to be transferred.
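To make the logistic function mentioned above concrete, here is a minimal plain text sketch in Python. It is not part of XOR and doesn't use Secret Computing; the weights and feature values are hypothetical and chosen only for illustration, and the intercept is omitted for brevity.

```python
import math


def logistic(z: float) -> float:
    """Map any real number to a value strictly between 0 and 1."""
    # Branch on the sign of z so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(theta: list[float], features: list[float]) -> float:
    """Probability of the positive class for one row (no intercept term)."""
    z = sum(t * x for t, x in zip(theta, features))
    return logistic(z)


# Hypothetical weights and feature values, for illustration only.
theta = [0.8, -0.5]
row = [1.0, 2.0]
probability = predict_probability(theta, row)
print(probability)
```

A binary prediction then follows from comparing the probability with a threshold, commonly 0.5.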
This solution doesn’t require you to download any software and requires no previous knowledge of machine learning. To access XOR Trial Beta, follow this link to the XOR Secret Computing product detail page and then select the link in the last sentence of the Product Overview section.
- Selecting this link will take you to the registration page, where you can sign up by entering your email, full name, position, and company.
- You will receive an email at the address you entered to set a password.
- Select the Magic Link in the email. This directs you to a page where you can set your password. Set your password and select Update.
- Choose Try out the GUI.
- On the next page, accept the terms and conditions.
- When prompted about taking a Walkthrough, select No.
A. Selecting a use case
While logged into XOR Trial, navigate to the use case landing page.
- On the use case landing page, scroll to the Heart Disease Prediction icon and select it. This expands a description of the use case.
- At the bottom of the page, choose Try it. This opens a new browser window that shows the main user interface of the XOR Secret Computing Engine application.
By default, you are in a three-player network configuration, meaning that one independent, private dataset is located on each distributed XOR Machine. XOR Machines are where data source owners store private data when it can’t be combined or must be siloed.
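To give some intuition for how three players can compute on data that none of them reveals, the following sketch shows additive secret sharing, a textbook building block of secure multi-party computation. This is an assumption made for illustration: this post doesn't describe the protocol XOR uses internally, and the modulus, function names, and input values here are hypothetical.

```python
import secrets

MODULUS = 2**61 - 1  # arbitrary modulus chosen for this sketch


def share(value: int, n_players: int = 3) -> list[int]:
    """Split value into n additive shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_players - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares: list[int]) -> int:
    """Recombine all shares; any smaller subset looks uniformly random."""
    return sum(shares) % MODULUS


def add_shared(a: list[int], b: list[int]) -> list[int]:
    """Each player adds the two shares it holds; nothing is exchanged."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


# Hypothetical private inputs, each split across three players.
a_shares = share(42)
b_shares = share(58)
total = reconstruct(add_shared(a_shares, b_shares))
print(total)  # 100
```

Each player sees only random-looking shares, yet the sum is recovered exactly once the result shares are combined.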
B. Selecting the datasets
You must now select the datasets to use in your Heart Disease Prediction solution.
- At the top of the page, choose Select datasets. This takes you to the Dataset library. Each XOR Machine, player-0, player-1, and player-2, has a dataset preloaded on it.
- To reveal the datasets, choose the down arrow and name of the respective player. This shows only the feature headings of the data but doesn’t reveal any underlying data inputs. The datasets contain features such as a patient’s age, education, and number of cigarettes smoked per day, with each feature having a unique identifier. To view more complete metadata, including the feature description and sample formatting, choose the lowercase i icon next to the feature headings.
- To choose the dataset for each player, choose the gray bar at the top of each dataset, which displays that dataset’s dimensions. The bar turns dark gray to indicate that the dataset is selected.
- In the top right corner, choose Done. This automatically directs you back to the main user interface tab.
C. Selecting the operation
You must now select the operation, or algorithm, for the model to use. To do this, in the Select function section, under the Special column, select Private Set Intersection. It turns green to indicate the selection.
The private set intersection attempts to find a row-level match across all of the datasets on the column you select. In this walkthrough, the first column is already selected as the feature used to identify the matching patient across all three datasets. You can change this for each dataset by choosing a different column until it’s highlighted green. For this walkthrough, make sure that only the first column in each dataset is highlighted green; otherwise, the private set intersection will be unable to find a matching identifier.
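As a plain text analogue of the row-level match described above, the following sketch intersects three identifier columns. It shows only the expected result of the match, not how XOR computes it privately. The identifier values and the function name are hypothetical.

```python
# Hypothetical identifier columns, one per player.
player_0_ids = ["p1", "p2", "p3", "p5"]
player_1_ids = ["p2", "p3", "p4", "p5"]
player_2_ids = ["p3", "p5", "p6"]


def intersect_ids(*columns: list[str]) -> list[str]:
    """Return the identifiers that appear in every column, sorted."""
    common = set(columns[0])
    for column in columns[1:]:
        common &= set(column)
    return sorted(common)


matched = intersect_ids(player_0_ids, player_1_ids, player_2_ids)
print(matched)  # ['p3', 'p5']
```

Only identifiers present in all three columns survive, which is why each dataset must have its identifier column selected.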
D. Specifying the output
Next, specify the output of the operation. In the Specify output section of the page, use a formula to identify which features you want included in the output of the rows that match. For this walkthrough, I want to use all of the features, so if it is not already prepopulated, enter the following for Output: A:B:C.
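The following plain text sketch shows one way to picture the joined output. It assumes that A:B:C means "for each matching row, include the columns of dataset A, then B, then C", which is my reading of the formula rather than a documented definition. The patient identifiers, feature values, and function name are hypothetical.

```python
def join_on_id(*datasets: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Inner-join datasets keyed by identifier, concatenating their features."""
    common = set(datasets[0])
    for dataset in datasets[1:]:
        common &= set(dataset)
    joined = {}
    for key in sorted(common):
        row = {}
        for dataset in datasets:  # columns appear in A, B, C order
            row.update(dataset[key])
        joined[key] = row
    return joined


# Hypothetical datasets keyed by patient identifier.
dataset_a = {"p1": {"age": 52, "education": 2}, "p2": {"age": 61, "education": 1}}
dataset_b = {"p2": {"cigsPerDay": 20}, "p3": {"cigsPerDay": 0}}
dataset_c = {"p1": {"TenYearCHD": 0}, "p2": {"TenYearCHD": 1}}

joined = join_on_id(dataset_a, dataset_b, dataset_c)
print(joined)
```

In XOR, the equivalent joined result is not shown in the clear; it stays private so that it can feed a second operation.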
To maintain the privacy of the individual patients who have provided their data for inclusion in the respective datasets, you must use only the output of the private intersect function in a second operation. To specify only the private intersect function output, do the following:
- Using the Magic Link from the email you received during registration, navigate to the main user interface page. In the Output specification section, choose Specify output.
- Clear Download locally.
- Select Export to PDDStore.
You are now ready to run the operation.
E. Running the operation
If you have entered everything correctly—the datasets, the function, the identifiers to match, what to deliver in the output, and how to deliver it—Run computation turns green. To continue, choose Run computation.
A new section of the main user interface tab called Execution is displayed. It shows how long XOR is taking to perform each step of the computation, including the offline and online phases.
The result of this process appears in the Result section. The newly joined dataset is ready for further analysis. To begin to improve the model’s predictions, at the bottom of the page, choose the New operation button.
F. Improving the model’s predictions
Now you can perform a logistic regression function on the newly joined dataset to improve the model’s ability to predict the ten-year risk of coronary heart disease.
- Select new datasets for analysis. The logistic regression function must run on the joined datasets rather than on the original ones.
- In the Select function section of the main user interface tab, under the Regression & Classification column, select Logistic regression. This replaces the previously selected Private Set Intersection function.
- In the Select datasets section, choose Select datasets.
- Clear the existing datasets you chose in step B.
- To expand the datasets on each XOR Machine, choose the down arrow next to player-0, player-1, and player-2.
- Choose the dark gray dimensions bar above each player. The bar turns light gray to indicate that the dataset is no longer selected.
- Select the new datasets.
- To expand the PDDStore datasets, choose the down arrow next to player-0, player-1, and player-2.
- Choose the light gray dimensions bar above each player. The bar turns dark gray to indicate that the dataset is selected.
- In the top right corner, choose Done.
- This redirects you to the Specify data section of the main user interface tab. From here, you must select the features and target variable to use in your analysis.
- Select features. Next, select all of the features in each dataset except for feature 1 (the first column or ID column) in both the hospital patient dataset and the tobacco company dataset.
- In the Specify data section, for Features, choose Select.
- Highlight each of the features that you want to include by manually selecting each one until the feature turns light green. Do not select feature 1 in any dataset, and do not select feature 2 in the target chronic heart disease dataset.
- Keep the value for intercept (false).
- Choose Confirm.
- Set the target variable.
- For Target, in the Specify data section, choose Select.
- In the target chronic heart disease dataset (TenYearCHD), choose Feature 2. It turns light blue. All of the features, except for the first feature in each dataset, should be shaded a different color from the default white.
- Choose Confirm. If you entered everything properly—the datasets, the correct function, the features to include, and the target variable feature—Run computation turns green.
- To continue, choose Run computation.
- A new section called Execution appears in the main user interface tab. It shows how long XOR is taking to perform each step of the computation, including the offline and online phases.
- The Result section now shows the output of the logistic regression function. Under the features, choose expand. This displays the theta weights of the 15 features, which are the weights the model found most useful for predicting the ten-year risk of coronary heart disease. A theta weight closer to 1 means the feature was predictive; a value closer to 0 means it was not.
- To see a more granular version of the output, choose download. This downloads a .csv file of the theta weights. To assess precision, you can compare this output against the results of a logistic regression function performed on plain text data.
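The comparison suggested in the last step can be sketched as follows. This is a minimal plain text logistic regression fitted by batch gradient descent without an intercept, matching the intercept (false) setting used above. It is not XOR's implementation. The training rows, labels, learning rate, and the "downloaded" weights are all hypothetical, and loading the .csv file is left out because its layout isn't described in this post.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss, with no intercept term."""
    n_features = len(rows[0])
    theta = [0.0] * n_features
    for _ in range(epochs):
        gradient = [0.0] * n_features
        for x, y in zip(rows, labels):
            error = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        theta = [t - learning_rate * g / len(rows) for t, g in zip(theta, gradient)]
    return theta


def max_abs_difference(a, b):
    """Largest element-wise gap between two weight vectors."""
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical plain text training data: two features, binary target.
rows = [[1.0, 0.2], [0.5, 1.5], [1.5, 0.3], [0.2, 1.0], [1.2, 1.1], [0.4, 0.3]]
labels = [1, 0, 1, 0, 1, 0]

plain_theta = fit_logistic(rows, labels)
# Stand-in for weights read from the downloaded file (hypothetical values).
downloaded_theta = [round(t, 3) for t in plain_theta]
print(max_abs_difference(plain_theta, downloaded_theta))
```

A small maximum difference indicates that the private computation and the plain text computation agree to the precision you need.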
In this post, I showed you how to perform Secret Computing using secure multi-party computation on three distributed datasets. These datasets contain private inputs you couldn’t see, and the data stayed localized in three separate XOR Machines. The operation required no data transfer and no third party. The output matches the result of the same computation performed on plain text data.
For more information, see Inpher’s listing in AWS Marketplace and watch the How Secret Computing Works video. You can sign up for a Sandbox License to start experimenting with your own data and reach out to Inpher about testing Secret Computing at your organization.
The content and opinions in this post are those of the third-party author, and AWS is not responsible for the content or accuracy of this post.
About the author
Conor Moran is a senior director of business development at Inpher, where he’s responsible for global client and partnership development.