AWS Machine Learning Blog

Analyze and tag assets stored in Veeva Vault PromoMats using Amazon AppFlow and Amazon AI Services

In a previous post, we talked about analyzing and tagging assets stored in Veeva Vault PromoMats using Amazon AI services and the Veeva Vault Platform’s APIs. In this post, we explore how to use Amazon AppFlow, a fully managed integration service that enables you to securely transfer data from software as a service (SaaS) applications like Veeva Vault to AWS. The Amazon AppFlow Veeva connector allows you to connect your AWS environment to the Veeva ecosystem quickly, reliably, and cost-effectively in order to analyze the rich content stored in Veeva Vault at scale.

The Amazon AppFlow Veeva connector is the first Amazon AppFlow connector supporting automatic transfer of Veeva documents. It allows you to choose between the latest version (the Steady State version in Veeva terms) and all versions of documents. Moreover, you can import document metadata.

With a few clicks, you can easily set up a managed connection and choose the Veeva Vault documents and metadata to import. You can further adjust the import behavior by mapping source fields to destination fields. You can also add filters based on document type and subtype, classification, products, country, site, and more. Lastly, you can add validation and manage on-demand and scheduled flow triggers.

You can use the Amazon AppFlow Veeva connector for various use cases, ranging from Veeva Vault PromoMats to other Veeva Vault solutions such as QualityDocs, eTMF, or Regulatory Information Management (RIM). The following are some of the use cases where you can use the connector:

  • Data synchronization – You can use the connector in the process of establishing consistency and harmonization between data from a source Veeva Vault and any downstream systems over time. For example, you can share Veeva PromoMats marketing assets to Salesforce. You could also use the connector to share Veeva QualityDocs like Standard Operating Procedures (SOPs) or specifications to cached websites that are searchable from tablets present on the manufacturing floor.
  • Anomaly detection – You can share Veeva PromoMats documents to Amazon Lookout for Metrics for anomaly detection. You can also use the connector with Vault RIM in artwork, commercial labels, templates, or patient leaflets before importing them for printing into enterprise labeling solutions such as Loftware.
  • Data lake hydration – The connector can be an effective tool for replicating structured or unstructured data into data lakes, in order to support the creation and hydration of data lakes. For example, you can use the connector to extract standardized study information from protocols stored in Vault RIM and expose it downstream to medical analytics insight teams.
  • Translations – The connector can be useful in sending artwork, clinical documents, marketing materials, or study protocols for translation in native languages to departments such as packaging, clinical trials, or regulatory submissions.

This post focuses on how you can use Amazon AI services in combination with Amazon AppFlow to analyze content stored in Veeva Vault PromoMats, automatically extract tag information, and ultimately feed this information back into the Veeva Vault system. The post discusses the overall architecture, the steps to deploy a solution and dashboard, and a use case of asset metadata tagging. For more information about the proof of concept code base for this use case, see the GitHub repository.

Solution overview

The following diagram illustrates the updated solution architecture.

Solution architecture

Previously, in order to import assets from Veeva Vault, you had to write your own custom code logic using the Veeva Vault APIs to poll for changes and import the data into Amazon Simple Storage Service (Amazon S3). This could be a manual, time-consuming process, in which you had to account for API limitations, failures, and retries, as well as scalability to accommodate an unlimited amount of assets. The updated solution uses Amazon AppFlow to abstract away the complexity of maintaining a custom Veeva to Amazon S3 data import pipeline.

As mentioned in the introduction, Amazon AppFlow is an easy-to-use, no-code self-service tool that uses point-and-click configurations to move data easily and securely between various SaaS applications and AWS services. AppFlow allows you to pull data (objects and documents) from supported sources and write that data to various supported destinations. The source or destination could be a SaaS application or an AWS service such as Amazon S3, Amazon Redshift, or Lookout for Metrics. In addition to the no-code interface, Amazon AppFlow supports configuration via API, AWS CLI, and AWS CloudFormation interfaces.

A flow in Amazon AppFlow describes how data is to be moved, including source details, destination details, flow trigger conditions (on demand, on event, or scheduled), and data processing tasks such as checkpointing, field validation, or masking. When triggered, Amazon AppFlow runs a flow that fetches the source data (generally through the source application’s public APIs), runs data processing tasks, and transfers processed data to the destination.

In this example, you deploy a preconfigured flow using a CloudFormation template. The following screenshot shows the preconfigured veeva-aws-connector flow that is created automatically by the solution template on the Amazon AppFlow console.

Amazon AppFlow Flow Source Details

The flow uses Veeva as a source and is configured to import Veeva Vault component objects. Both the metadata and source files are necessary in order to keep track of the assets that have been processed and push tags back on the correct corresponding asset in the source system. In this situation, only the latest version is being imported, and renditions aren’t included.

Amazon AppFlow Flow destination details

The flow’s destination also needs to be configured. In the following screenshot, we define a file format and folder structure for the S3 bucket that was created as part of the CloudFormation template.

Amazon AppFlow flow trigger options

Finally, the flow is triggered on demand for demonstration purposes. This can be modified so that the flow runs on a schedule, with a maximum granularity of 1 minute. When triggered on a schedule, the transfer mode changes automatically from a full transfer to an incremental transfer mode. You specify a source timestamp field for tracking the changes. For the tagging use case, we have found that the Last Modified Date setting is the most suitable.

Amazon AppFlow flow trigger on schedule options

Amazon AppFlow is then integrated with Amazon EventBridge to publish events whenever a flow run is complete.

For better resiliency, the AVAIAppFlowListener AWS Lambda function is wired into EventBridge. When an Amazon AppFlow event is triggered, it verifies that the specific flow run has completed successfully, reads the metadata information of all imported assets from that specific flow run, and pushes individual document metadata into an Amazon Simple Queue Service (Amazon SQS) queue. Using Amazon SQS provides a loose coupling between the producer and processor sections of the architecture and also allows you to deploy changes to the processor section without stopping the incoming updates.

A second poller function (AVAIQueuePoller) reads the SQS queue at frequent intervals (every minute) and processes the incoming assets. For an even better reaction time from the Lambda function, you can replace the CloudWatch rule by configuring Amazon SQS as a trigger for the function.

Depending on the incoming message type, the solution uses various AWS AI services to derive insights from your data. Some examples include:

  • Text files – The function uses the DetectEntities operation of Amazon Comprehend Medical, a natural language processing (NLP) service that makes it easy to use ML to extract relevant medical information from unstructured text. This operation detects entities in categories like Anatomy, Medical_Condition, Medication, Protected_Health_Information, and Test_Treatment_Procedure. The resulting output is filtered for Protected_Health_Information, and the remaining information, along with confidence scores, is flattened and inserted into an Amazon DynamoDB table. This information is plotted on the OpenSearch Kibana cluster. In real-world applications, you can also use the Amazon Comprehend Medical ICD-10-CM or RxNorm feature to link the detected information to medical ontologies so downstream healthcare applications can use it for further analysis.
  • Images – The function uses the DetectLabels method of Amazon Rekognition to detect labels in the incoming image. These labels can act as tags to identify the rich information buried in your images, such as information about commercial artwork and clinical labels. If labels like Human or Person are detected with a confidence score of more than 80%, the code uses the DetectFaces method to look for key facial features such as eyes, nose, and mouth to detect faces in the input image. Amazon Rekognition delivers all this information with an associated confidence score, which is flattened and stored in the DynamoDB table.
  • Voice recordings – For audio assets, the code uses the StartTranscriptionJob asynchronous method of Amazon Transcribe to transcribe the incoming audio to text, passing in a unique identifier as the TranscriptionJobName. The code assumes the audio language to be English (US), but you can modify it to tie to the information coming from Veeva Vault. The code calls the GetTranscriptionJob method, passing in the same unique identifier as the TranscriptionJobName in a loop, until the job is complete. Amazon Transcribe delivers the output file on an S3 bucket, which is read by the code and deleted. The code calls the text processing workflow (as discussed earlier) to extract entities from transcribed audio.
  • Scanned documents (PDFs) – A large percentage of life sciences assets are represented in PDFs—these could be anything from scientific journals and research papers to drug labels. Amazon Textract is a service that automatically extracts text and data from scanned documents. The code uses the StartDocumentTextDetection method to start an asynchronous job to detect text in the document. The code uses the JobId returned in the response to call GetDocumentTextDetection in a loop, until the job is complete. The output JSON structure contains lines and words of detected text, along with confidence scores for each element it identifies, so you can make informed decisions about how to use the results. The code processes the JSON structure to recreate the text blurb and calls the text processing workflow to extract entities from the text.

A DynamoDB table stores all the processed data. The solution uses DynamoDB Streams and Lambda triggers (AVAIPopulateES) to populate data into an OpenSearch Kibana cluster. The AVAIPopulateES function runs for every update, insert, and delete operation that happens in the DynamoDB table, and inserts one corresponding record in the OpenSearch index. You can visualize these records using Kibana.

To close the feedback loop, the AVAICustomFieldPopulator Lambda function has been created. It’s triggered by events in the DynamoDB stream of the metadata DynamoDB table. For every DocumentID in the DynamoDB records, the function tries to upsert tag information into a predefined custom field property of the asset with the corresponding ID in Veeva, using the Veeva API. To avoid inserting noise into the custom field, the Lambda function filters any tags that have been identified with a confidence score lower than 0.9. Failed requests are forwarded to a dead-letter queue (DLQ) for manual inspection or automatic retry.

This solution offers a serverless, pay-as-you-go approach to process, tag, and enable comprehensive searches on your digital assets. Additionally, each managed component has high availability built in by automatic deployment across multiple Availability Zones. For Amazon OpenSearch Service, you can choose the three-AZ option to provide better availability for your domains.


For this walkthrough, you should have the following prerequisites:

  • An AWS account with appropriate AWS Identity and Access Management (IAM) permissions to launch the CloudFormation template
  • Appropriate access credentials for a Veeva Vault PromoMats domain (domain URL, user name, and password)
  • A custom content tag defined in Veeva for the digital assets that you want to be tagged (as an example, we created the AutoTags custom content tag)
  • Digital assets in the PromoMats Vault accessible to the preceding credentials

Deploy your solution

You use a CloudFormation stack to deploy the solution. The stack creates all the necessary resources, including:

  • An S3 bucket to store the incoming assets.
  • An Amazon AppFlow flow to automatically import assets into the S3 bucket.
  • An EventBridge rule and Lambda function to react to the events generated by Amazon AppFlow (AVAIAppFlowListener).
  • An SQS FIFO queue to act as a loose coupling between the listener function (AVAIAppFlowListener) and the poller function (AVAIQueuePoller).
  • A DynamoDB table to store the output of Amazon AI services.
  • An Amazon OpenSearch Kibana (ELK) cluster to visualize the analyzed tags.
  • A Lambda function to push back identified tags into Veeva (AVAICustomFieldPopulator), with a corresponding DLQ.
  • Required Lambda functions:
    • AVAIAppFlowListener – Triggered by events pushed by Amazon AppFlow to EventBridge. Used for flow run validation and pushing a message to the SQS queue.
    • AVAIQueuePoller – Triggered every 1 minute. Used for polling the SQS queue, processing the assets using Amazon AI services, and populating the DynamoDB table.
    • AVAIPopulateES – Triggered when there is an update, insert, or delete on the DynamoDB table. Used for capturing changes from DynamoDB and populating the ELK cluster.
    • AVAICustomFieldPopulator – Triggered when there is an update, insert, or delete on the DynamoDB table. Used for feeding back tag information into Veeva.
  • The Amazon CloudWatch Events rules that trigger the AVAIQueuePoller function. These triggers are in the DISABLED state by default.
  • Required IAM roles and policies for interacting with EventBridge and the AI services in a scoped-down manner.

To get started, complete the following steps:

  1. Sign in to the AWS Management Console with an account that has the prerequisite IAM permissions.
  2. Choose Launch Stack and open it on a new tab:
    Launch stack button
  3. On the Create stack page, choose Next.Create stack details
  4. On the Specify stack details page, enter a name for the stack.
  5. Enter values for the parameters.
  6. Choose Next.Stack parameters
  7. On the Configure stack options page, leave everything as the default and choose Next.Stack options
  8. On the Review page, in the Capabilities and transforms section, select the three check boxes.
  9. Choose Create stack.Configure stack capabilities
  10. Wait for the stack to complete. You can examine various events from the stack creation process on the Events tab.
  11. After the stack creation is complete, you can look on the Resources tab to see all the resources the CloudFormation template created.
  12. On the Outputs tab, copy the value of ESDomainAccessPrincipal.

This is the ARN of the IAM role that the AVAIPopulateES function assumes. You use it later to configure access to the Amazon OpenSearch Service domain.

Set up Amazon OpenSearch Service and Kibana

This section walks you through securing your Amazon OpenSearch Service cluster and installing a local proxy to access Kibana securely.

  1. On the Amazon OpenSearch Service console, select the domain that was created by the template.
  2. On the Actions menu, choose Modify access policy.OpenSearch modify access policy
  3. For Domain access policy, choose Custom access policy.OpenSearch custom access policy dialog
  4. In the Access policy will be cleared pop-up window, choose Clear and continue.OpenSearch access policy warning
  5. On the next page, configure the following statements to lock down access to the Amazon OpenSearch Service domain:
    1. Allow IPv4 address – Your IP address.
    2. Allow IAM ARN – The value of ESDomainAccessPrincipal you copied earlier.OpenSearch custom access policy details
  6. Choose Submit.

This creates an access policy that grants access to the AVAIPopulateES function and Kibana access from your IP address. For more information about scoping down your access policy, see Configuring access policies.

  1. Wait for the domain status to show as Active.
  2. On the Amazon EventBridge console, under Events, choose Rules. You can see two rules that the CloudFormation template created.
  3. Select the AVAIQueuePollerSchedule rule and enable it by clicking Enable.Enable poller

In 5–8 minutes, the data should start flowing in and entities are created in the Amazon OpenSearch Service cluster. You can now visualize these entities in Kibana. To do this, you use an open-source proxy called aws-es-kibana. To install the proxy on your computer, enter the following code:

aws-es-kibana your_OpenSearch_domain_endpoint

You can find the domain endpoint on the Outputs tab of the CloudFormation stack under ESDomainEndPoint. You should see the following output:

aws-es-kibana console output

Create visualizations and analyze tagged content

Please refer to the original blogpost.

Clean up

To avoid incurring future charges, delete the resources when not in use. You can easily delete all resources by deleting the associated CloudFormation stack. Note that you need to empty the created S3 buckets of content in order for the deletion of the stack to be successful.


In this post, we demonstrated how you can use Amazon AI services in combination with Amazon AppFlow to extend the functionality of Veeva Vault PromoMats and extract valuable information quickly and easily. The built-in loop back mechanism allows you to update the tags back into Veeva Vault and enable auto-tagging of your assets. This makes it easier for your team to find and locate assets quickly.

Although no ML output is perfect, it can come very close to human performance and help offset a substantial portion of your team’s efforts. You can use this additional capacity towards value-added tasks, while dedicating a small capacity to check the output of the ML solution. This solution can also help optimize costs, achieve tagging consistency, and enable quick discovery of existing assets.

Finally, you can maintain ownership of your data and choose which AWS services can process, store, and host the content. AWS doesn’t access or use your content for any purpose without your consent, and never uses customer data to derive information for marketing or advertising. For more information, see Data Privacy FAQ.

You can also extend the functionality of this solution further with additional enhancements. For example, in addition to the AI and ML services in this post, you can easily add any of your custom ML models built using Amazon SageMaker to the architecture.

If you’re interested in exploring additional use cases for Veeva and AWS, please reach out to your AWS account team.

Veeva Systems has reviewed and approved this content. For additional Veeva Vault-related questions, please contact Veeva support.

About the authors

Mayank Thakkar is Head of AI/ML Business Development, Global Healthcare and Life Sciences at AWS. He has more than 18 years of experience in varied industries like healthcare, life sciences, insurance, and retail, specializing in building serverless, artificial intelligence, and machine learning-based solutions to solve real-world industry problems. At AWS, he works closely with big pharma companies around the world to build cutting-edge solutions and help them along their cloud journey. Apart from work, Mayank, along with his wife, is busy raising two energetic and mischievous boys, Aaryan (6) and Kiaan (4), while trying to keep the house from burning down or getting flooded!

Anamaria Todor is a Senior Solutions Architect based in Copenhagen, Denmark. She saw her first computer when she was 4 years old and never let computer science and engineering go ever since. She has worked in various technical roles from full-stack developer, to data engineer, technical lead, and CTO at various Danish companies. Anamaria has a bachelor’s degree in Applied Engineering and Computer Science, a master’s degree in Computer Science, and over 10 years of hands-on AWS experience. At AWS, she works closely with healthcare and life sciences companies in the enterprise segment. When she’s not working or playing video games, she’s coaching girls and female professionals in understanding and finding their path through technology.