Anonymizing Sensitive Data of the Migration Evaluator’s Export File

1. Introduction

Migration Evaluator is a complimentary migration assessment service that helps customers to create a data-driven directional business case for migrating workloads to AWS. It includes a data collection tool that discovers server workloads running on-premises, along with their utilization patterns.

Data from the Migration Evaluator collector needs to be received by the AWS Migration Evaluator team so it can be analyzed, workloads right-sized, and mappings to Amazon EC2 created. This can be done using one of two methods:

The Migration Evaluator collector can be configured to send the data automatically to the Migration Evaluator team every day.
The Migration Evaluator collector can export the data to a file and then securely uploaded to the Migration Evaluator team.

The first option is suitable for most customers as it allows Migration Evaluator to verify the scope of the assessment and that collection from the servers in scope is successful.

The inventory data collected by Migration Evaluator includes server names and IP addresses. Some customers have security requirements so server names and IP addresses cannot be disclosed outside of their company. To address this, the data must be anonymized before being sent to AWS. The most suitable approach for meeting the security requirement is to use a scripted solution rather than manual anonymization, as the latter can be a lengthy and error-prone process.

In this blog, I am providing a solution using a python script to anonymize sensitive data collected by Migration Evaluator.

Once the data available at AWS, it will be analyzed at the end of the assessment collection period (as shown here) and two deliverables are offered:

Quick Insights: a one-page summary highlighting the estimated savings to re-host at AWS based on usage patterns with costs divided by infrastructure and software licensing.
A detailed CSV export is also available that combines on-premises discovery data (server hardware provisioning, Microsoft SQL Server configuration, and resource utilization) with recommendations for re-hosting to Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Block Storage (Amazon EBS).
Directional Business Case includes several sections:
- Multiple customized cost model scenarios showing costs when workloads are right-sized and migrated to Amazon EC2, in addition to an Optimization & Licensing Assessment (OLA) with Microsoft Windows and Microsoft SQL Server licensing optimization analysis.
- A Sustainability analysis showing expected CO2e reductions.
- Optionally, and depending on the scope of discovery, details of additional AWS Services such as VMware Cloud on AWS, Amazon Relational Database Service (RDS), Amazon Workspaces and AWS Elastic Disaster Recovery.

2. Solution Overview

Let’s talk about the script that anonymizes the server names, hypervisor host names and IP addresses. The script has two functions:

Take a Migration Evaluator collector export file as input and produce an anonymized version as output. The output file will contain:
- anonymized server names
- anonymized hypervisor host names
- no IP addresses
Take a Migration Evaluator Quick Insights result ZIP file as input and produce de-anonymized results as output.

Note: The mappings between anonymized data and de-anonymized data uses randomly generated identifiers, are kept on the system that runs the script and are never sent to AWS.

2.1 Prerequisites

Python 3 (current version)
The openpyxl library installed. To install the openpyxl library: use the following command:
pip install openpyxl
Using your preferred text editor, create a file called “collector-anonymizer.py”, and copy the script in section 3 “ME-Collector Export File -Anonymizer python code” into this file.

2.2 Anonymize data for Migration Evaluator collector export file

Download and annotate (if required) the export file (see the install guide section 10 for details).
In this example, I will use the filename ‘Inventory_And_Usage_Workbook-2023-03-24.xlsx’

Example Virtual Provisioning sheet showing server names (Column B) and hypervisor names (Column H)

Run the python script you saved earlier
python collector-anonymizer.py an Inventory_And_Usage_Workbook-2023-03-24.xlsx

The output file is named Inventory_And_Usage_Workbook Anonymized.xlsx. Open the file and confirm it is now in accordance with requirements for anonymization. You can then upload the anonymized export to the ME console as explained in section 10 in the install guide.

Example Virtual Provisioning sheet showing anonymized server names (Column B) and hypervisor names (Column H)

2.3 De-anonymize Migration Evaluator Quick Insights

Once the Quick Insights are ready you will be notified via email, then:

Navigate to the ME console, download the Quick Insights Standard Format Zip File, and place it in the same folder as the original export file.
Run the following command to de-anonymize your results
python collector-anonymizer.py de Inventory_And_Usage_Workbook-2023-03-24.xlsx
standard-customernamemas-12345-mas-12345_2023-03-30-11-26-00.zip

In this example the original export file is named Inventory_And_Usage_Workbook-2023-03-24.xlsx, the Quick Insights zip file is named standard-customernamemas-12345-mas-12345_2023-03-30-11-26-00.zip.
Note: The example command has been split over two lines for readability – your command should be one line only.

The script will output a file with your Quick Insights results including your real server names, and, if Microsoft SQL Servers were detected in the assessment collection, a second file containing Microsoft SQL Server information.

3. ME-Collector Export File -Anonymizer python code

#Beginning of Script
#!/usr/bin/env python3
import csv
import zipfile
import argparse
import openpyxl


def get_column_indexes(sheets):
    """Create a dict containing sheet name -> column name -> index # for all sheets provided."""
    headers = {}
    for sheet in sheets:
        headers[sheet.title] = {}
        for idx, column in enumerate(sheet.columns):
            headers[sheet.title][column[0].value] = idx + 1
    return headers


def anonymize(filename):
    """Anonymize the given Excel file, saving output as a new file."""
    wb = openpyxl.load_workbook(filename)

    # Read sheet names
    uti = wb["Utilization"]
    asset = wb["Asset Ownership"]
    virt = wb["Virtual Provisioning"]
    phys = wb["Physical Provisioning"]

    column_indexes = get_column_indexes([uti, asset, virt, phys])

    # replace Hypervisor Name in Virtual Provisioning sheet with hypervisor's Unique Identifier
    for cell in list(virt.columns)[
        column_indexes["Virtual Provisioning"]["Hypervisor Name"] - 1
    ]:
        for b_cell in list(phys.columns)[
            column_indexes["Physical Provisioning"]["Human Name"] - 1
        ]:
            if b_cell.value == cell.value:
                virt.cell(
                    row=cell.row,
                    column=column_indexes["Virtual Provisioning"]["Hypervisor Name"],
                ).value = phys.cell(
                    row=b_cell.row,
                    column=column_indexes["Physical Provisioning"]["Unique Identifier"],
                ).value

    # replace "Human Name" with "Unique Identifier" across all sheets
    for sheet in [uti, asset, virt, phys]:
        for row in range(2, sheet.max_row + 1):
            sheet.cell(
                row=row, column=column_indexes[sheet.title]["Human Name"]
            ).value = sheet.cell(
                row=row, column=column_indexes[sheet.title]["Unique Identifier"]
            ).value

    # Remove IPs in Physical, Virtual Provisioning sheets
    for sheet in [phys, virt]:
        for cell in list(sheet.columns)[column_indexes[sheet.title]["Address"] - 1][1:]:
            cell.value = None

    wb.save("Inventory_And_Usage_Workbook Anonymized.xlsx")
    print(
        "Anonymization successful, Inventory_And_Usage_Workbook Anonymized has been created"
    )


def deanonymize(filename, qi):
    """De-anonymize the given Quick Insights .zip file using the original collector export file."""
    # Get filenames from the Quick Insights zip file
    with zipfile.ZipFile(qi, "r") as z:
        zip_filenames = z.namelist()

        # Get the Asset Ownership and Physical Provisioning of the original pre-anonymized script
        wb = openpyxl.load_workbook(filename)
        asset = wb["Asset Ownership"]
        phys = wb["Physical Provisioning"]

        column_indexes = get_column_indexes([asset, phys])

        unique_id_to_hostname = {}
        for row in asset.rows:
            unique_id_to_hostname[
                row[
                    column_indexes["Asset Ownership"]["Unique Identifier"] - 1
                ].value.upper()
            ] = row[column_indexes["Asset Ownership"]["Human Name"] - 1].value
        for row in phys.rows:
            unique_id_to_hostname[
                row[
                    column_indexes["Physical Provisioning"]["Unique Identifier"] - 1
                ].value.upper()
            ] = row[column_indexes["Physical Provisioning"]["Human Name"] - 1].value

        # create de-anonymized Server and SQL QI files
        for zip_file in zip_filenames:
            with z.open(zip_file) as f:
                file_string = f.read().decode("utf-8")

            reader = csv.reader(file_string.splitlines())
            headers = next(reader)
            csv_col_indexes = {}
            for idx, column in enumerate(headers):
                csv_col_indexes[column] = idx

            # Creating the de-anon file
            output_filename = "deanonymized_" + zip_file
            print(f"Creating output file: {output_filename}")

            with open(output_filename, "w", newline="") as g:
                writer = csv.writer(g)
                writer.writerow(headers)

                for row in reader:
                    # swap the "Server Name" column values back to the original hostnames
                    row[csv_col_indexes["Server Name"]] = unique_id_to_hostname[
                        row[csv_col_indexes["Server Id"]].upper()
                    ]

                    # swap the "Virtualization | Host Name" column values back to the original hostnames (if the column exists and has a value)
                    if (
                        "Virtualization | Host Name" in csv_col_indexes
                        and row[csv_col_indexes["Virtualization | Host Name"]]
                    ):
                        row[
                            csv_col_indexes["Virtualization | Host Name"]
                        ] = unique_id_to_hostname[
                            row[csv_col_indexes["Virtualization | Host Name"]].upper()
                        ]

                    # write modified row to CSV
                    writer.writerow(row)


if __name__ == "__main__":
    # arguments processing
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "method",
        help="The method to use, 'an' for anonymization or 'de' for de-anonymization",
    )
    parser.add_argument("filename", help="Inventory and Utilization Export file")
    parser.add_argument("QI", help="QI .zip file", nargs="?", default=None)
    args = parser.parse_args()

    if args.method == "an":
        anonymize(args.filename)

    elif args.method == "de":
        if args.QI is None:
            parser.error("QI is required for de-anonymization")
        deanonymize(args.filename, args.QI)

    else:
        parser.error("Invalid input. Please enter either 'an' or 'de'.")
#End of script

4. Conclusion

In this post, I introduced a simple way for a Migration Evaluator customer to anonymize and de-anonymize their server meta-data (host name, IP Address, and server name).
Special thanks to Roger Trevor for helping with code optimization.

AWS Cloud Operations & Migrations Blog