Use Migration Evaluator in protected or regulated environments by anonymizing sensitive network data

The Amazon Web Services (AWS) Migration Evaluator helps organisations with VMware fleets by collecting detailed data on virtual machine (VM) usage, then using that data to prepare a business case for moving to the cloud and to estimate the cost of migration. However, sensitive network data collected by the Migration Evaluator, such as system names and IP addresses, cannot leave highly secured and regulated organisations. Manually anonymising and uploading this data is a hurdle that can prevent regulated organisations from making use of Migration Evaluator.

In this blog post, I show how a simple Python script can anonymise AWS Migration Evaluator usage data, allowing it to be uploaded even in highly regulated environments. This unlocks Migration Evaluator insights without compromising data sovereignty or residency requirements such as those found in the Information Security Registered Assessors Program (IRAP).

Why use Migration Evaluator

Anonymising data is one thing, but to get the most value from Migration Evaluator, an organisation also needs to be able to reverse that process. An example where this is useful is in identifying “zombie” or unused VMs—machines with little processor usage and no network usage. For example, test machines that were not decommissioned, or servers for teams that no longer exist.

Where the desire to migrate is driven by a lack of capacity in the on-premises environment, identifying unused VMs can help by freeing up capacity, giving the organisation time and headspace for a well-considered migration.

Consider this scenario: after running the Migration Evaluator agentless collector, you have the following as a data file.

Figure 1. Migration Evaluator output showing details of the VMs in a customer environment.

On examining the file, you’re concerned about the sensitive nature of the VM name and address information it contains. Uploading this file to the Migration Evaluator portal could conflict with security policies requiring that internal network data not leave your local jurisdiction, for example if your operations are subject to the European General Data Protection Regulation (GDPR) or run in an Australian IRAP-compliant environment.

So you anonymise the data as follows:

Figure 2. Migration Evaluator data file with sensitive details replaced by anonymised values.

Now that you have removed all the sensitive information, you upload the data to the Migration Evaluator portal for analysis. That analysis informs you that 33% of your server fleet are zombies. Drilling into the report, you find that Server1 (named “AvengersFileShare” in the original Migration Evaluator output file, shown in Figure 1) is no longer used and can be decommissioned. Now you’ve reduced your potential migration targets by a third, and freed up on-premises resources.

Now imagine that instead of having three servers in your fleet, you have a more realistic 200 or 2,000. Manually anonymising such a list is not only time consuming, but also prone to error.

Solution overview and using Migration Evaluator

The steps and script presented in this blog post can help progress a migration even in heavily regulated environments. The script anonymises sensitive fields in the original Migration Evaluator data by replacing values such as “PrimaryDC01” with “SERVER0” and by replacing IP addresses with generated addresses in the 10.0.0.0/8 range. This anonymised file can be sent to AWS or out of region because it contains no sensitive data. In addition to anonymising the data, the script generates a separate reverse mapping file containing both the original and anonymised values.

When the Migration Evaluator reports are generated, the reverse mapping file can be searched to identify specific VMs, allowing the insights in the Migration Evaluator report to be actioned.

Prerequisites

1. The organisation must have already requested a migration assessment and installed one of the Inventory Discovery tools as per the Migration Evaluator Getting Started hub.

2. The Migration Evaluator Collector has already been running in the VM environment, collecting usage data for the suggested time period (two weeks).

3. Automatic data synchronisation must be disabled; otherwise the unaltered data files will be automatically uploaded to AWS.

Prepare usage data and script

First you need to generate the Migration Evaluator data, export that data as .csv files, and create the Python script using a text editor.

1. Generate an export of the Migration Evaluator collector’s inventory using the instructions in Section 10 of the On-Premises Collector System Requirements and Install Guide, “Annotating Discovered Inventory with Business Data and Provisioning.” This generated export is an Excel workbook (.xlsx). The workbook contains four tabs with system data collected from your environment. All four tabs contain system names, and the “Physical Provisioning” and “Virtual Provisioning” tabs also contain IP addresses.

2. Open the Excel document and export each tab containing sensitive data to a separate .csv file using Excel’s “File/Save As” menu and the file type “Comma Separated”. The names of the .csv files are not important; “Physical_Provisioning.csv”, “Virtual_Provisioning.csv”, “Asset_Ownership.csv”, and “Utilization.csv” are used in these instructions.

3. Copy the .csv files to a folder on a machine that has a Python interpreter. The script works with both Python 2 and Python 3.

4. Use a text editor to create a file named “migr_eval_anon.py” in that same folder and copy-paste the following script into it. Make sure you have the whole script by checking for the “# beginning of script” and “# end of script” comments.

# beginning of script
import csv
import sys
import os
from argparse import ArgumentParser

# Used to store each of the transformations that will be applied to a row in the data
class Transformation:
    def __init__(self, ix, prefix):
        self.col_index = ix
        self.prefix = prefix    

def ProcessArgs(argv):
    parser = ArgumentParser(description = """
    Script to process a CSV generated by the AWS Migration Evaluator tool and anonymise sensitive columns (e.g. system names and IPs)
    This script will generate artificial values for the anonymised columns in the output so that it can be sent to AWS without 
    exposing values. The original values are retained in the reverse mapping output file that allows the anonymised values to be 
    converted back. If only Input File is specified, then anon and reverse map filenames are generated automatically. """)
    parser.add_argument("-if", "--input-file", dest="infile", default='Physical_Provisioning_Tab.csv',
                        help="File to use as source", metavar="FILE")
    parser.add_argument("-of", "--output-file", dest="outfile", default='',
                        help="File for output anonymised version", metavar="FILE")
    parser.add_argument("-rf", "--revmap-file", dest="revmapfile", default='',
                        help="File for output of dereference (original and anonymised) values", metavar="FILE")
    parser.add_argument('-c','--cols', nargs='+', dest="anoncols", default = ['Human Name', 'Server Name', 'Address'],
                        help='List of Column names to anonymise', metavar="COL")
    parser.add_argument('-p','--prefixes', nargs='+', dest="anonprefixes", default = ['SERVER', 'SERVER', 'IP'],
                        help='Prefix to use when generating anonymised values (IP generates anonymised IP Addresses)', metavar="COL")
    parser.add_argument('-id','--idcol', dest="idcol", default = 'Unique Identifier',
                        help='Name of Column with per system unique identifier for de-referencing', metavar="COL")
    args = parser.parse_args()    
    # Generate output filenames if empty (default)
    if (args.outfile == ''):
        args.outfile = os.path.splitext(args.infile)[0]+'_anon.csv'
    if (args.revmapfile == ''):
        args.revmapfile = os.path.splitext(args.infile)[0]+'_revmap.csv'
    return args

# Read a CSV into a structure separating out headers and data
def ReadCSV (filename):
    file_data = []
    file_headers = []
    with open(filename) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        line_count = 0
        for row in csv_reader:
            if line_count > 0:
                # every line after the first is a data row
                file_data.append(row)
            else:
                # the first line holds the column headers
                file_headers = row
            line_count += 1
        print('Loaded row count: ' + str(line_count-1))
    
    result = {
        "headers": file_headers,
        "data": file_data
    }
    return result

# Take filename headers and a data array and create a CSV
def WriteCSV(filename, headers, rows):
    line_count = 0
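    # Avoid blank lines between rows on Windows: Python 3 needs newline='', Python 2 needs binary mode
    open_args = {'mode': 'w', 'newline': ''} if sys.version_info[0] >= 3 else {'mode': 'wb'}
    with open(filename, **open_args) as out_file: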
        out_writer = csv.writer(out_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        out_writer.writerow(headers)
        for row in rows:
            out_writer.writerow(row)
            line_count += 1
    print('Wrote row count: ' + str(line_count))

# Take a single row, the indexes and prefixes of which cols to anonymise and return a transformed version and the de-reference values
def TransformRow(row, rowidix, id, transforms):
    revmap_vals = [ row[rowidix] ]
    for tr in transforms:
        revmap_vals.append(row[tr.col_index]) # original value
        if (tr.prefix == "IP"): # special case for IP, convert count into 10.0.0.0/8 CIDR
            row[tr.col_index] = '10.' + str(int(id / 65536)) + '.' + str(int(id / 256) % 256) + '.' + str(id % 256)   
        else:
            row[tr.col_index] = tr.prefix + str(id)
        revmap_vals.append(row[tr.col_index]) # also include transformed value in same row for easy searching
    return row, revmap_vals

# Take the data, headers and apply transformation rules generating a new structure
def TransformData(filedata, anon_cols, rowid, prefixes):
    # First process headers and generate indexes for any of the columns we need to anonymise that are in the data
    rowidix = filedata['headers'].index(rowid)
    transforms = []
    revmap_headers = [rowid]
    i = 0
    for col in anon_cols:
        if (col in filedata['headers'] ):
            transforms.append(Transformation(filedata['headers'].index(col), prefixes[i]))
            revmap_headers.append(col)
            revmap_headers.append("anon_" + col)
        i = i + 1
    #Now anonymise the data 
    trans_array = []
    revmap_array = []
    line_count = 0
    for row in filedata['data']:
        trans_row = row
        trans_row, revmap_row = TransformRow(trans_row, rowidix, line_count, transforms)
        # Add to result ready for next
        trans_array.append(trans_row)
        revmap_array.append(revmap_row)
        line_count += 1
    print('Transformed row count: ' + str(line_count))
    result = {
        "headers": filedata['headers'],
        "data": trans_array,
        "revmapHeaders": revmap_headers,
        "revmap": revmap_array,
    }
    return result

# Simple Mainline. Process args, Read, Transform, then Write the two files.
args = ProcessArgs(sys.argv)
print ('Anonymising Columns: ' + ','.join(args.anoncols))
readfile = ReadCSV(args.infile)
transfile = TransformData(readfile, args.anoncols, args.idcol, args.anonprefixes)
print ('Write anonymised data to: ' + args.outfile)
WriteCSV(args.outfile, transfile['headers'], transfile['data'])
print ('Write reverse mapping data to: ' + args.revmapfile)
WriteCSV(args.revmapfile, transfile['revmapHeaders'], transfile['revmap'])
# end of script

Remove sensitive data

Now you are ready to remove the sensitive data from the .csv files using the above script. By default, the script takes any columns named “Human Name” and “Server Name” and replaces their values with generated names such as “SERVER1”. It also replaces all values in the “Address” column with 10.X.X.X IP addresses. However, any column can be anonymised, and different prefixes used, by providing alternative command line parameters.
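To make the default mapping concrete, the following sketch mirrors the arithmetic that the script's TransformRow function applies to the zero-based row counter it is given. The helper names anon_name and anon_ip are illustrative only and do not appear in the script.

```python
def anon_name(prefix, row_id):
    """Mirror the script's name rule: the prefix followed by the row counter."""
    return prefix + str(row_id)


def anon_ip(row_id):
    """Mirror the script's IP rule: spread the row counter across 10.0.0.0/8."""
    return "10.{}.{}.{}".format(row_id // 65536, (row_id // 256) % 256, row_id % 256)


# The counter starts at 0 for the first data row of each input file.
for row_id in (0, 1, 255, 256, 70000):
    print(row_id, anon_name("SERVER", row_id), anon_ip(row_id))
```

As the script is written, the counter is the row's position within a single input file, so a system appears to receive the same anonymised value in two files only if it sits on the same row in both. The ID column kept in each reverse mapping file is what ties the values back to a specific system.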

1. Open a command prompt/shell, then change to the folder containing the script and .csv files.

2. Run the script against each .csv file to generate an anonymised file and a reverse mapping file for each. The script prints the number of rows processed as it goes.

Use the following commands, adjusting the input file names if you’ve used different names during .csv file export:

python migr_eval_anon.py --input-file Physical_Provisioning.csv
python migr_eval_anon.py --input-file Virtual_Provisioning.csv
python migr_eval_anon.py --input-file Asset_Ownership.csv
python migr_eval_anon.py --input-file Utilization.csv

3. In addition to the original .csv files, you now have an *_anon.csv and *_revmap.csv for each. For example:

Figure 3. Command output screen-shot showing generated anonymised files.

4. Open the original Excel document you exported from the Migration Evaluator Collector and save a new copy.

5. Open Physical_Provisioning_anon.csv in Excel, select the whole sheet and copy all the data. Then replace the contents of the “Physical Provisioning” tab by selecting the whole tab and using “Paste Special/ Values” to paste in the anonymised data. Repeat this process for each .csv file and tab until you’ve got an Excel workbook with no sensitive data.

6. Keep the *_revmap.csv files safe on a local machine or file share along with the original Excel workbook. You can use the original system names in the *_revmap.csv files to identify specific systems in the Migration Evaluator report once it’s produced.
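The lookup described in step 6 could be scripted along the following lines. This is a minimal sketch for Python 3, assuming the reverse mapping layout written by the script above (the ID column, then each original column followed by its "anon_" counterpart). The function name find_original and the sample rows are hypothetical.

```python
import csv
import io


def find_original(revmap_lines, anon_value):
    """Return reverse-mapping rows in which any anon_* column equals anon_value."""
    matches = []
    for row in csv.DictReader(revmap_lines):
        if any(key and key.startswith("anon_") and value == anon_value
               for key, value in row.items()):
            matches.append(row)
    return matches


# Hypothetical sample data in the layout the script writes to *_revmap.csv.
sample = io.StringIO(
    "Unique Identifier,Server Name,anon_Server Name,Address,anon_Address\n"
    "vm-001,PrimaryDC01,SERVER0,192.0.2.10,10.0.0.0\n"
    "vm-002,AvengersFileShare,SERVER1,192.0.2.11,10.0.0.1\n"
)

for match in find_original(sample, "SERVER1"):
    print(match["Unique Identifier"], match["Server Name"], match["Address"])
```

Against a real file, the same function can be given an open file handle, for example one obtained with open("Physical_Provisioning_revmap.csv", newline=""), in place of the in-memory sample.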

Review and upload

You now have an Excel workbook that can be uploaded to the Migration Evaluator Management Console. Depending on your organisation’s specific security rules, you may still need to have the workbook reviewed. You are now on your way to obtaining a Migration Evaluator report and discovering how many zombie VMs your infrastructure is hosting.
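As an aid to any such review, a short check can confirm that none of the original values recorded in a reverse mapping file still appear in the matching anonymised file. The sketch below is one assumption about how such a check might look in Python 3 and is not part of the script above. The function name find_leaks and the sample data are hypothetical.

```python
import csv
import io


def find_leaks(anon_lines, revmap_lines):
    """Return original values from the reverse map that still appear as cells in the anonymised data."""
    revmap = csv.reader(revmap_lines)
    next(revmap, None)  # skip the header row
    originals = set()
    for row in revmap:
        # Layout: ID column, then (original, anonymised) pairs, so originals sit at odd indexes.
        originals.update(value for value in row[1::2] if value)

    anon = csv.reader(anon_lines)
    next(anon, None)  # skip the header row
    leaks = set()
    for row in anon:
        leaks.update(cell for cell in row if cell in originals)
    return sorted(leaks)


# Hypothetical sample: a free-text "Notes" column still carries an original server name.
revmap_text = (
    "Unique Identifier,Server Name,anon_Server Name,Address,anon_Address\n"
    "vm-001,PrimaryDC01,SERVER0,192.0.2.10,10.0.0.0\n"
)
anon_text = (
    "Unique Identifier,Server Name,Address,Notes\n"
    "vm-001,SERVER0,10.0.0.0,PrimaryDC01\n"
)

print(find_leaks(io.StringIO(anon_text), io.StringIO(revmap_text)))
```

The check only compares whole cells, so an original name embedded inside a longer string would not be reported. It is a supplement to a manual review rather than a replacement for one.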

Migration Evaluator can do more than identify zombie VMs. It uses the usage data from your environment to right-size a migration, and produces an estimated cost for migrating your fleet to AWS along with a business case comparing that cost to a standards-based on-premises deployment.

Learn more on the AWS Migration Evaluator main page, plus read more about how rightsizing infrastructure can help organizations and businesses save on costs.
