Model and data lineage in machine learning experimentation

Modern quantitative finance is based around the approach of pattern recognition in historical data. This approach requires teams of scientists to work in a collaborative and regulated setting in order to develop models that can be used to make trading predictions. With the growing influence of this field, both participants and regulators are looking to put in place mechanisms to understand how and why models have been developed, for reasons such as regulatory compliance and model reproducibility. We refer to this tractability problem as lineage.

The challenge of reproducibility and lineage in machine learning (ML) is three-fold: code lineage, data lineage, and model lineage. Source version control is a standard for managing changes to code. For data lineage, most data storage services support versioning, which gives you the ability to track datasets at a given point in time. Model lineage combines code lineage, data lineage, and ML-specific information such as Docker containers used for training and deployment, model hyperparameters, and more. The focus of this post is model lineage and how it pertains to reproducibility. For the purposes of this post, we illustrate lineage using the example of quantitative finance.

Problem statement

When developing a model, a quantitative finance researcher experiments with many different model configurations and parameters, which are trained and simulated on many different datasets. Such research often entails varying one parameter while holding all others constant and then simulating against some target metric (for example Sharpe ratio). For a given model, the process of research may last for days, weeks, or even months. Eventually the researcher settles on a given model, with a given set of parameters, which produces certain results on a given dataset. The researcher may work as part of a larger team, directly reporting into a research manager and collaborating with peers. The research team might work collaboratively alongside other teams, such as development, quant-dev, and risk. Keeping track of the changes associated with data versions and model parameters has typically happened in spreadsheets or shared notes, and as such doesn’t scale well.

In such a collaborative setting, the researcher has various tools and approaches open to them, the primary of which is version control software, such as GitHub and AWS CodeCommit. With version control software, you can track and share model configurations based in code. You can also look back in history to identify which model was being developed when and by whom, based on commits made to the source code. A second approach is data versioning, whereby data is versioned and the associated metadata (including the path to the data) is stored in a relational or graph database. A related tool is the feature store, whereby the researcher generates features one time and then persists them, allowing different models and different researchers and teams to access the same feature.

At some point, after selecting a given model or set of parameters, you may need to replay the simulation and regenerate the results. Examples of times when such a course of action might be necessary include a trading loss or a regulatory breach. The ex-post examination of the model selection criteria may be internal to the firm, or might be external. One example of the latter occurred in 2014 when AMF, the French financial regulator, asked the market-maker GETCO to explain the actions of one of its algorithms in 2010 on Euronext, resulting in a fine for the firm.

A closely related problem exists in academia. The issue of reproducibility is one that peer-reviewed journals have struggled with for years. A researcher will present results for peer review, with no easy mechanism for the peers to reproduce and thereby verify those results. With the passage of time, the problem compounds as readers who want to replicate published results in order to build on them are faced with a near impossible task.

Theoretical basis to a solution

To ensure reproducibility of experimental results, the researcher needs to store versions of parameters, models, input data, features data, managed services, proprietary libraries, third-party libraries, container images, environmental configurations, RNG seeds, and other variable quantities. They then need to link all these entities together, persist them, and allow the user to recall them on demand.

In addition to storing all the inputs to an experiment, the experimental outputs should also be stored such that the experiment can be reproduced and verified. Experimental outputs range widely, and may include scalars, time series, and images.

Simply recording all this information in a spreadsheet isn’t a viable solution. The way this information should be stored needs to have certain properties. The records should be immutable, searchable, and the relationships between the entities tracked should be reflected by the storage mechanism.

A model’s lineage is a set of associations between a model and all the components that were involved in the creation of that model. A model has relationships with experiments, datasets, container images, and so on. Model building experiments make use of trials, and each trial has components for data preprocessing, model training, and data postprocessing.

When representing a model’s lineage, the associations between the artifacts associated with a model are at the core of what needs to be represented. A graph structure is the most suitable approach for this kind of highly associated artifact representation.

Graphs model data as a set of vertices that may be connected to other vertices by edges. Vertices typically represent entities like people, places, and things. Vertices can further be described with attributes, and edges define relationships between vertices. As with entities, relationships may have additional details that can be modeled.

Building an equivalent application using a relational database requires creating many tables with multiple foreign keys and then writing nested SQL queries and complex joins. Not only does that approach quickly become unwieldy from a coding perspective, its performance degrades quickly as the amount of data increases.

A best practice is to use a model registry as a centralized store for models that should be tracked, shared, deployed, and reused. The model registry should support tracking a model’s lineage so that at any point in time, all the relevant metadata for a model can be viewed. When you use the model registry’s versioning capability for models, you have a mapping between specific versions of a model and its entire lineage all accessible from the model registry itself.

Given the number of components that are part of a model’s lineage, you may want to inspect the lineage of not only the model, but any object associated with the model, for example, an experiment. With a graph as the underlying data structure that supports lineage, you should have the flexibility to traverse an entity’s lineage from different focal points. You should be able to find the entire lineage of an experiment to retrieve the experiment components and its associations, the data used by each trial in the experiment, the parameters used by the trials, and so on.

Your organization’s governance polices may require tight controls over model and lineage data access. Using role or resource-based policies to each of the resources that are part of a model’s lineage allows your organization to enforce governance requirements.

In summary, an end-to-end lineage solution needs to give you the means to access information about parameters, versioning, and object locations, and ensure that access controls are enforced. A GUI enables you to view a model’s lineage by clicking through a model’s associations to get all the way down to the data used to train the model.

One challenge with this approach is although artifacts are small in size and cheap to store, raw data is not. Storing all experimental data for periods of years in order to enable a lineage solution isn’t realistic. A compromise solution might be ex-post an experiment, choosing to enable lineage tracking for that experiment with the option of only storing metadata. Any solution in this space requires you to closely monitor storage costs, as well as enabling a subset of users to delete lineage that’s no longer required.

AWS services and solutions in this space

AWS has a system of services and features that support an end-to-end ML workflow. With Amazon SageMaker features, you can build your own solutions to manage the end-to-end lineage of your ML models:

Amazon SageMaker Studio is an IDE to support the ML experience. Studio consists of multiple separate services, several of which have applicability to the lineage use case.
Amazon SageMaker Feature Store is a fully managed repository to store, update, retrieve, and share ML features. If a common set of features are shared across model building runs or across teams, you can use the feature store to avoid recreating the same set of features each time they’re needed. This saves time and money spent on resources. You can reference feature store data with a feature store ID, which you can associate with a training job while tracking the lineage of a model.
Amazon SageMaker Lineage Tracking creates and stores information about the steps of a ML workflow from data preparation to model deployment. With the tracking information, you can reproduce the workflow steps, track model and dataset lineage, and establish model governance and audit standards. Lineage Tracking uses tracking entities to represent each of the elements in the ML workflow. SageMaker automatically creates tracking entities for SageMaker jobs (training, processing, batch transform), models, model packages, and endpoints if the data is available. Two types of tracking entities are defined: experiment entities and lineage entities. Experiment entities include trial components, trials, and experiments. Lineage entities are contexts, actions, artifacts, and associations. Each entity has a set of properties that you can use to associate relevant information to it, including type (model, image, dataset, and so on). You can supplement the SageMaker-created tracking entities with manually created tracking entities of your own; for example, you can represent a CodeCommit repository or an Amazon Simple Storage Service (Amazon S3) URI as artifacts.
Amazon SageMaker Pipelines is CI/CD service for ML. With Pipelines, you can manage your end-to-end ML workflow at scale by orchestrating all the steps of the ML workflow using a single interface. Pipelines can stitch together data processing steps, model training steps, postprocessing steps, model evaluation steps, and so on into a single workflow. Because a pipeline is defined as a set of steps with metadata associated with each of those steps, saving the pipeline definition (a JSON object) allows you to recreate the workflow at any point in time as long as the resources needed by each step can be accessed. Keeping track of the entire workflow is an important step towards model reproducibility.
Amazon SageMaker Experiments lets you organize, track, compare and evaluate ML experiments and model versions. Experiments are made up of trials and trial components that can be referenced by IDs and can track all metadata associated with a model training job, such as the hyperparameters used to train the model, the performance metrics of the model, the data used, bias reports, model explainability reports, and more. Trials and trial components are encapsulated into experiments. You can use multiple experiments during the model building process.
Amazon SageMaker Projects allow you to create a repeatable end-to-end ML solution using AWS Service Catalog provisioned products. This allows data scientists, MLOps teams, and organizations to create a single template or a set of templates for model building and deployment to bring standardization into the ML process. With this, you can track the specific template and version of the template used in an ML workflow in Lineage Tracking so that the same template can be used at the time of reproducing a model.
The Amazon SageMaker Model Registry stores models and model versions in a centralized repository. The model registry is organized into model package groups, with each group having several versions of a model package. These versions can be marked as approved or rejected depending on the acceptance criteria determined by the team building these models. The model registry provides a centralized place for model management, model sharing within and across teams, and a standard mechanism to access and identify models.

In addition to the various features of SageMaker, several other AWS services relate to the concept of lineage:

Amazon ECR is a fully managed container registry that makes it easy to store, manage, and track your container images. You may have separate container images used for preprocessing, training, and inference that need to be stored and versioned. To reproduce any step of the model building workflow, the exact version of the container image used by that step is required. When you store the Amazon ECR image ID along with the version of that image, you can retrieve the specific version of the container used at the point of interest even if that version is several iterations old and no longer in use.
Amazon Simple Storage Service (Amazon S3) is the Amazon object store. You can use it to store the data for model training and for the model artefacts themselves. Amazon S3 supports data versioning, which allows tracking the data used for training a model not only by the Amazon S3 URI, but also by the version tag associated with that object in Amazon S3. As models are trained, the training dataset is continuously updated as new data is made available, or the entire dataset itself could be replaced with higher quality data. Keeping track of not only the Amazon S3 URI, but the version as well, allows you to have constant references to object locations in Amazon S3 by defining logic in your code to always use the latest version of an object. Without versioning, you need to upload new datasets to different Amazon S3 locations, and update the reference to a training dataset in the training code to reflect that.
AWS CodeCommit is the Amazon fully managed source control service. The model building process is often a collaborative one, with source code being shared across individuals and teams. Source control practices are strongly encouraged when creating ML models for two reasons: collaboration and reproducibility. With a version control tool like CodeCommit, developers can have IDs associated with each version of code they have used (commit ID) while developing their models. After a successful training job, you can commit code and push it to CodeCommit, and link the specific commit ID associated with that code version with this training job. When reproducing a model, or rerunning a training job, you can roll back to the specific version of code used to generate that model or run that training job.

Build a solution with AWS

A complete ML workflow consists of many or all of the services described in the previous section, each of which is an important component of a model’s lineage.

Upstream and downstream lineage

The concepts of upstream and downstream lineage explore different mechanisms by which a model’s lineage is traversed and different problems that need to be solved using lineage. Take the example of reproducibility: given a model, you need to find all the entities that were part of the model creation—the training parameters, the datasets, preprocessing steps, and so on. The upstream lineage of a model is necessary in cases where there are stringent governance policies for ML models and when reproducing models across teams or different points in time.

Downstream lineage approaches traversing the entities leading to a model’s creation from a different angle. Instead of starting with a model as the focal point, you may need to identify all the models that were trained by a given dataset, or trained using a certain version of training code. This is useful when troubleshooting issues or doing a blast radius analysis; for example, you may want to find all the models trained by a version of code that was later found to contain a bug. You need to use a specific version of code as the lineage focal point, and then traverse the lineage of that code artifact until the model artifacts are identified.

SageMaker Search API

You can use the SageMaker Search API to look up information about the data that was used to train a model, the parameters used for training, the container image used, and model metrics. The following code is an example of how you can use the Search API to find the training job that created the model by using the model data URI:

# Get the model artifact location
model_data_url = smclient.describe_model(ModelName=model_name) ['PrimaryContainer']['ModelDataUrl']


# Search the training job by the location of model artifacts in Amazon S3
search_params={
   "MaxResults": 1,
   "Resource": "TrainingJob",
   "SearchExpression": { 
      "Filters": [ 
         { 
            "Name": "ModelArtifacts.S3ModelArtifacts",
            "Operator": "Equals",
            "Value": model_data_url
         }]},
}
results = smclient.search(**search_params)

After the training job is found, the job can be described and all its information can be extracted, such as hyperparameters, input data, experiment configuration, trial components, Amazon ECR image used, and so on.

In addition to the API, SageMaker Search provides a UI with which you can look up information related to a model using several properties. You can look for all the models that were trained using a specific training dataset by providing the Amazon S3 URI to that data, look for models based on roles, use the Search UI to find parameters associated with a training job, and more.

SageMaker Lineage Tracking

SageMaker Lineage automatically creates lineage entities for the components of an ML workflow, such as endpoints, models, datasets, and training jobs. These entities have deep relationships between them and are best modelled as a graph.

This diagram depicts the lineage entities automatically created by SageMaker. All of these relationships can be queried using the SageMaker Lineage API called `query-lineage`.

Using the API, you can construct queries with filters that define the lineage entities you want to find. Let’s take an example where you run a training job with an experiment, register the model in the Model Registry, and deploy the model to an endpoint. The entities in this workflow include datasets used for training, the training job and its trial component, the model package in the Model Registry, and a SageMaker endpoint. Given these entities, there are various questions you may ask about their lineage. As we explore the lineage APIs, we’ll use the concepts of “Focal Point” and “Direction”.

Let’s start with a scenario where it’s discovered that a dataset has an error and all of the endpoints with models trained using that data must be rolled back. Using the lineage API, we must construct a query that will find all of the endpoints associated with that dataset.

The first thing to do is define the filter to look for contexts that represent endpoints. This is because SageMaker Lineage automatically creates a Context for Endpoints.

query_filter = LineageFilter(
    entities=[LineageEntityEnum.CONTEXT], sources=[LineageSourceEnum.ENDPOINT]
)

We pass this filter to the LineageQuery API provided in the SageMaker Python SDK:

query_result = LineageQuery(sagemaker_session).query(
    start_arns = [dataset_artifact_arn],
    direction = LineageQueryDirectionEnum.DESCENDANTS,
    query_filter = query_filter,
    include_edges = False
    
)

Here, the start_arns parameter is used to specify the focal point of the lineage query, which in this case is the dataset that is corrupted. The direction specified is “DESCENDANTS”. When the direction of the LineageQuery is set to `DESCENDANT`, the query result will specify all of the entities to which the focal point is a descendant entity.

In the case where direction is “ASCENDANT”, the query result will specify all of the entities to which the focal point is an ascendant entity.

The lineage relationship directions are similar to that in family tree, where grandfather is the ascendant of children and a child is a descendant of a parent. Similarly, with the following entities,

Dataset -> Training Job -> Model -> Endpoint

Dataset is the ascendant entity of the Endpoint, and Endpoint is a descendant entity of the Dataset.

In the query shown above, since the direction is set to descendant, it will return entities to the right of dataset (training job, model, endpoint), and since the filter is only looking for Endpoints, the query result will contain all of the Endpoints trained with this dataset.

Following the same pattern of defining a LineageFilter and a LineageQuery, you can define all sorts of queries to answer any question about lineage. We call this “multi-hop” lineage since the queries traverse relationships across multiple hops of the lineage graph.

In production workloads, it is often the case that models are trained in one account and deployed in another, or there is a central account that must monitor all of the ML entities in other accounts. Given the number of use cases involving multiple accounts, lineage tracking capabilities must span across account boundaries. SageMaker Lineage provides a way to track ML entity lineage across accounts through “LineageGroups”.

Each account has a default LineageGroup that holds all of the lineage entities and new LineageGroups can be created so that specific entities can be shared with other accounts. Sharing a LineageGroup is done by adding a resource policy to the group or by sharing the entity via AWS RAM.

In production use cases where there are several accounts, we expect users to share LineageGroups between several accounts. This could be Account A, B, and C each sharing their LineageGroups with an Admin account. Or Accounts A, B, and C each sharing their LineageGroups between each other.

Using the RAM console or API, you can create a resource share and add all of the principals to share it with. Each principal will receive a notification in the RAM console that must be accepted to receive the share. This repository has examples for multi-hop and cross account lineage queries.

Conclusion

Reproducibility in the ML research space is a challenging and multi-faceted topic, and quantitative finance is a prominent use case. By working with our customers across industry verticals, AWS is leading the way towards a holistic solution set in this field. We encourage you to reach out and discuss your use cases with the authors via your AWS account manager. To learn more about Amazon SageMaker, visit the webpage.

About the Authors

Hugh Christensen is a Principal Analytics Specialist working in Global Financial Services. Prior to joining AWS, Hugh worked between academia and industry. The former in the Engineering Department at the University of Cambridge, and the latter in various roles in algorithmic trading and data analytics solutions for financial exchanges and trading houses.

Kirit Thadaka is an ML Solutions Architect working in the Amazon SageMaker Service SA team. Prior to joining AWS, Kirit spent time working in early stage AI startups followed by some time in consulting in various roles in AI research, MLOps, and technical leadership.

Venkatesh Krishnan is a Principal Product Manager – Technical for Amazon SageMaker in AWS. He is the product owner for a portfolio of services in the MLOps space, including SageMaker Pipelines, Model Registry, Projects, and Experiments. Earlier he was the Head of Product, Integrations and the lead product manager for Amazon AppFlow, a service that he helped build from the ground up. Before joining Amazon in 2018, Venkatesh served in various research, engineering, and product roles at Qualcomm, Inc. He holds a PhD in Electrical and Computer Engineering from Georgia Tech and an MBA from ULCA’s Anderson School of Management.

Dana Benson is a Software Engineer working in the Amazon SageMaker Experiments, Lineage, and Search team. Prior to joining AWS, Dana spent time enabling smart home functionality in Alexa and mobile ordering at Starbucks.