
Optimizing document AI and structured outputs by fine-tuning Amazon Nova Models and on-demand inference

Multimodal fine-tuning represents a powerful approach for customizing vision large language models (LLMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that matter to your business.

A common use case is document processing, which includes extracting structured information from documents with complex layouts such as invoices, purchase orders, forms, tables, or technical diagrams. Although off-the-shelf LLMs often struggle with specialized documents like tax forms, invoices, and loan applications, fine-tuned models can learn from highly varied data and deliver significantly higher accuracy while reducing processing costs.

This post provides a comprehensive hands-on guide to fine-tuning Amazon Nova Lite for document processing tasks, with a focus on tax form data extraction. Using our open-source GitHub repository code sample, we demonstrate the complete workflow from data preparation to model deployment. Because Amazon Bedrock provides on-demand inference with pay-per-token pricing for Amazon Nova, we can benefit from the accuracy improvements of model customization while maintaining a pay-as-you-go cost structure.

The document processing challenge

Given a single-page or multi-page document, the goal is to extract or derive specific structured information so that it can be used by downstream systems or for additional insights. The following diagram shows how a vision LLM can derive the structured information based on a combination of text and vision capabilities.

High-level overview of the Intelligent Document Processing workflow

The key challenges enterprises face when automating workflows that process documents, such as invoices or W-2 tax forms, are the following:

  • Complex layouts: Specialized forms contain multiple sections with specific fields arranged in a structured format.
  • Variability of document types: Many diverse document types exist (invoices, contracts, forms).
  • Variability within a single document type: Each vendor can send a different invoice format and style or type.
  • Data quality variations: Scanned documents vary in quality, orientation, and completeness.
  • Language barriers: Documents can be in multiple languages.
  • Critical accuracy requirements: Tax-related data extraction demands extremely high accuracy.
  • Structured output needs: Extracted data must be formatted consistently for downstream processing.
  • Scalability and integration: Grow with business needs and integrate with existing systems; for example, Enterprise Resource Planning (ERP) systems.

Approaches for intelligent document processing that use LLMs or vision LLMs fall into three main categories:

  • Zero-shot prompting: An LLM or vision LLM is used to derive the structured information based on the input document, instructions, and the target schema.
  • Few-shot prompting: A technique used with LLMs or vision LLMs where a few additional examples (document + target output) are provided within the prompt to guide the model in completing a specific task. Unlike zero-shot prompting, which relies solely on natural language instructions, few-shot prompting can improve accuracy and consistency by demonstrating the desired input-output behavior through a set of examples.
  • Fine-tuning: Customize or fine-tune the weights of a given LLM or vision LLM by providing a larger set of annotated documents (input/output pairs) to teach the model exactly how to extract or interpret the relevant information.

For the first two approaches, refer to the amazon-nova-samples repository, which contains sample code on how to use the Amazon Bedrock Converse API for structured output by using tool calling.
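As a minimal illustration of the zero-shot variant (a sketch, not the repository sample itself), the following code calls the Converse API with a tool definition so that the extracted fields come back as schema-conforming JSON. The model ID, the extract_w2_fields tool name, the input file, and the schema fields are illustrative placeholders:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Illustrative target schema -- replace with the fields you actually need
w2_tool = {
    "toolSpec": {
        "name": "extract_w2_fields",
        "description": "Extract structured fields from a W-2 tax form image.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "employee_name": {"type": "string"},
                    "employer_ein": {"type": "string"},
                    "wages": {"type": "number"},
                },
                "required": ["employee_name", "employer_ein", "wages"],
            }
        },
    }
}

with open("w2_sample.png", "rb") as f:
    image_bytes = f.read()

response = bedrock_runtime.converse(
    modelId="us.amazon.nova-lite-v1:0",  # example inference profile ID
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract the W-2 fields using the extract_w2_fields tool."},
        ],
    }],
    # Forcing the tool steers the model toward schema-conforming JSON; drop
    # toolChoice if the model you use doesn't accept a forced tool.
    toolConfig={
        "tools": [w2_tool],
        "toolChoice": {"tool": {"name": "extract_w2_fields"}},
    },
)

# The structured output arrives in the toolUse block of the assistant message
tool_use = next(
    block["toolUse"]
    for block in response["output"]["message"]["content"]
    if "toolUse" in block
)
print(json.dumps(tool_use["input"], indent=2))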

Off-the-shelf LLMs excel at general document understanding, but they might not optimally handle domain-specific challenges. A fine-tuned Nova model can enhance performance by:

  • Learning document-specific layouts and field relationships
  • Adapting to common quality variations in your document dataset
  • Providing consistent, structured outputs
  • Maintaining high accuracy across document variations; for example, invoices can come from hundreds of different vendors, each with different formats, layouts, or even languages

Creating the annotated dataset and selecting the customization technique

Although various customization methods are available for Amazon Nova models, the most relevant for document processing are the following:

  • Fine-tune for specific tasks: Adapt Nova models for specific tasks using supervised fine-tuning (SFT). Choose between Parameter-Efficient Fine-Tuning (PEFT) for lightweight adaptation with limited data, or full fine-tuning when you have extensive training datasets and want to update all parameters of the model.
  • Distill to create smaller, faster models: Use knowledge distillation to transfer knowledge from a larger, more capable model, like Nova Premier (teacher), to a smaller, faster, more cost-efficient model (student). This is ideal when you don’t have enough annotated training data and the teacher model already provides accuracy that meets your requirements.

To learn from previous examples, you need either an annotated dataset from which the model can learn or a model that already performs well enough on your task to serve as a teacher. There are three common ways to get there:

  1. Automated dataset annotation with historical data from Enterprise Resource Planning (ERP) systems, such as SAP: Many customers already have historical documents that have been manually processed and consumed by downstream systems, like ERP or customer relationship management (CRM) systems. Explore existing downstream systems like SAP and the data they contain. This data can often be mapped back to the original source documents it was derived from, helping you bootstrap an annotated dataset very quickly.
  2. Manual dataset annotation: Identify the most relevant documents and formats, and annotate them using human annotators, so that you have document/JSON pairs where the JSON contains the target information that you want to extract or derive from your source documents.
  3. Annotate with a teacher model: Explore whether a larger model like Nova Premier can provide sufficiently accurate results using prompt engineering. If that is the case, you can use distillation.

For the first and second options, we recommend supervised model fine-tuning. For the third, model distillation is the right approach.

Amazon Bedrock currently provides both fine-tuning and distillation, so anyone with a basic data science skillset can easily submit jobs. The jobs run on compute fully managed by AWS, so you don’t have to worry about instance sizes or capacity limits.

Nova customization is also available with Amazon SageMaker, which offers more options and controls. For example, if you have sufficient high-quality labeled data and want deeper customization for your use case, full-rank fine-tuning might produce higher accuracy. Full-rank fine-tuning is supported with SageMaker training jobs and SageMaker HyperPod.

Data preparation best practices

The quality and structure of your training data fundamentally determine the success of fine-tuning. Here are key steps and considerations for preparing effective multimodal datasets and configuring your fine-tuning job:

Dataset analysis and base model evaluation

Our demonstration uses a synthetic dataset of W2 tax forms: the Fake W-2 (US Tax Form) Dataset. This public dataset comprises simulated US tax returns (W-2 statements for years 2016-19), including noisy images that mimic low-quality scanned W2 tax forms.

Before fine-tuning, it’s crucial to:

  1. Analyze dataset characteristics (image quality, field completeness, class distribution), define use-case-specific evaluation metrics, and establish baseline model performance.
  2. Compare each predicted field value against the ground truth, calculating precision, recall, and F1 scores for individual fields and overall performance.
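A minimal sketch of such a field-level comparison is shown below. It assumes predictions and ground truth have already been flattened into lists of {field_name: value} dictionaries and counts only exact (normalized) string matches as correct:

from collections import defaultdict

def evaluate_fields(predictions, ground_truths):
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, truth in zip(predictions, ground_truths):
        for field, expected in truth.items():
            predicted = pred.get(field)
            if predicted is None or str(predicted).strip() == "":
                counts[field]["fn"] += 1  # field missing from the prediction
            elif str(predicted).strip() == str(expected).strip():
                counts[field]["tp"] += 1  # exact match with ground truth
            else:
                counts[field]["fp"] += 1  # wrong value extracted
    metrics = {}
    for field, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[field] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics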

Prompt optimization

Crafting an effective prompt is essential for aligning the model with task requirements. Our system comprises two key components:

  1. System prompt: Defines the task, provides detailed instructions for each field to be extracted, and specifies the output format.
  2. User prompt: Follows Nova vision understanding best practices, utilizing the {media_file}-then-{text} structure as outlined in the Amazon Nova model user guide.

Iterate on your prompts using the base model to optimize performance before fine-tuning.
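The following sketch illustrates this prompt structure for the Converse API. The field list and wording are abbreviated placeholders, not the full prompt used in the repository:

# Illustrative system prompt (field list shortened)
SYSTEM_PROMPT = """You are an expert tax document analyst.
Extract the following fields from the W-2 form image and return valid JSON only:
- employee_name: employee's full name (box e)
- employer_ein: employer identification number (box b)
- wages: wages, tips, other compensation (box 1)
Return null for any field that is not present in the document."""

def build_messages(image_bytes):
    # Nova vision best practice: place the media content before the text instruction
    return [{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract the fields defined in the instructions from this W-2 form."},
        ],
    }]

# Passed as the system parameter of the Converse API
system = [{"text": SYSTEM_PROMPT}]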

Dataset preparation

Prepare your dataset in JSONL format and split it into training, validation, and test sets:

  1. Training set: 70-80% of data
  2. Validation set: 10-20% of data
  3. Test set: 10-20% of data
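A minimal sketch of how a single training record and the splits could be produced is shown below. The schemaVersion value and the s3Location image reference reflect our reading of the Bedrock conversation format for Nova customization; verify the exact schema against the current documentation before submitting a job:

import json
import random

def to_record(image_s3_uri, system_prompt, user_text, target_json):
    # One JSONL line: system prompt, user turn (image then text), assistant target
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": system_prompt}],
        "messages": [
            {
                "role": "user",
                "content": [
                    {"image": {"format": "png",
                               "source": {"s3Location": {"uri": image_s3_uri}}}},
                    {"text": user_text},
                ],
            },
            # The assistant turn carries the target output the model should learn
            {"role": "assistant", "content": [{"text": json.dumps(target_json)}]},
        ],
    }

def split_and_write(records, train_frac=0.8, val_frac=0.1, seed=42):
    # Shuffle deterministically, then cut into train/validation/test JSONL files
    random.Random(seed).shuffle(records)
    n = len(records)
    cut1, cut2 = int(n * train_frac), int(n * (train_frac + val_frac))
    for name, rows in [("train.jsonl", records[:cut1]),
                       ("validation.jsonl", records[cut1:cut2]),
                       ("test.jsonl", records[cut2:])]:
        with open(name, "w") as f:
            f.writelines(json.dumps(r) + "\n" for r in rows)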

Fine-tuning job configuration and monitoring

Once the dataset is prepared and uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, you can configure and submit the fine-tuning job on Amazon Bedrock. Key parameters include:

| Parameter | Definition | Purpose |
|---|---|---|
| Epochs | Number of complete passes through the training dataset | Determines how many times the model sees the entire dataset during training |
| Learning rate | Step size for gradient descent optimization | Controls how much model weights are adjusted in response to estimated error |
| Learning rate warmup steps | Number of steps to gradually increase the learning rate | Prevents instability by slowly ramping up the learning rate from a small value to the target rate |
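The following sketch submits such a job with the create_model_customization_job API. The role ARN, S3 URIs, and base model identifier are placeholders, and the hyperparameter keys mirror the table above; confirm the exact names and value ranges in the Nova customization documentation:

import time
import boto3

bedrock = boto3.client("bedrock")

# list_foundation_models(byCustomizationType="FINE_TUNING") shows which base
# model identifiers support fine-tuning; the one below is an example.
response = bedrock.create_model_customization_job(
    jobName=f"nova-lite-w2-ft-{time.strftime('%Y%m%d-%H%M%S')}",
    customModelName="nova-lite-w2-extraction",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.nova-lite-v1:0:300k",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://your-bucket/w2/train.jsonl"},
    validationDataConfig={"validators": [{"s3Uri": "s3://your-bucket/w2/validation.jsonl"}]},
    outputDataConfig={"s3Uri": "s3://your-bucket/w2/output/"},
    hyperParameters={
        "epochCount": "2",
        "learningRate": "0.00001",
        "learningRateWarmupSteps": "10",
    },
)
print(f"Submitted customization job: {response['jobArn']}")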

Amazon Bedrock customization provides validation loss metrics throughout the training process. Monitor these metrics to:

  • Assess model convergence
  • Detect potential overfitting
  • Gain early insights into model performance on unseen data

The following graph shows an example metric analysis:

Nova Fine-tuning training job training loss and validation loss per step metrics

When analyzing the training and validation loss curves, the relative behavior between these metrics provides crucial insights into the model’s learning dynamics. An optimal learning pattern shows the following:

  • Both training and validation losses decrease steadily over time
  • The curves maintain relatively parallel trajectories
  • The gap between training and validation loss remains stable
  • Final loss values converge to similar ranges
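As a rough sketch, these curves can be plotted from the metrics files the customization job writes to its output S3 location (retrievable via get_model_customization_job and its outputDataConfig.s3Uri). The file and column names below are assumptions; check the actual artifacts under your job's output prefix:

import pandas as pd
import matplotlib.pyplot as plt

# File and column names are assumptions -- inspect your job's output artifacts
train_df = pd.read_csv("step_wise_training_metrics.csv")
val_df = pd.read_csv("validation_metrics.csv")

plt.plot(train_df["step_number"], train_df["training_loss"], label="training loss")
plt.plot(val_df["step_number"], val_df["validation_loss"], label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curves.png")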

Model inference options for customized models

Once your custom model has been created in Amazon Bedrock, you have two main ways to run inference against it: on-demand custom model inference (ODI) deployments, or Provisioned Throughput endpoints. Let’s talk about why and when to choose one over the other.

On-demand custom model deployments provide a flexible and cost-effective way to leverage your custom Bedrock models. With on-demand deployments, you only pay for the compute resources you use, based on the number of tokens processed during inference. This makes on-demand a great choice for workloads with variable or unpredictable usage patterns, where you want to avoid over-provisioning resources. The on-demand approach also offers automatic scaling, so you don’t have to worry about managing infrastructure capacity. Bedrock will automatically provision the necessary compute power to handle your requests in near real time. This self-service, serverless experience can simplify your operations and deployment workflows.

Alternatively, Provisioned Throughput endpoints are recommended for workloads with steady traffic patterns and consistent high-volume requirements, offering predictable performance and cost benefits over on-demand scaling.

This example uses the ODI option to take advantage of per-token pricing. The following code snippet shows how to create an ODI deployment for your custom model:

import time
import boto3

# Amazon Bedrock control plane client used to manage custom model deployments
bedrock = boto3.client("bedrock")

# Function to create on-demand inferencing deployment for custom model
def create_model_deployment(custom_model_arn):
    """
    Create an on-demand inferencing deployment for the custom model
    
    Parameters:
    -----------
    custom_model_arn : str
        ARN of the custom model to deploy
        
    Returns:
    --------
    deployment_arn : str
        ARN of the created deployment
    """
    try:
        print(f"Creating on-demand inferencing deployment for model: {custom_model_arn}")
        
        # Generate a unique name for the deployment
        deployment_name = f"nova-ocr-deployment-{time.strftime('%Y%m%d-%H%M%S')}"
        
        # Create the deployment
        response = bedrock.create_custom_model_deployment(
            modelArn=custom_model_arn,
            modelDeploymentName=deployment_name,
            description=f"on-demand inferencing deployment for model: {custom_model_arn}",
        )
        
        # Get the deployment ARN
        deployment_arn = response.get('customModelDeploymentArn')
        
        print(f"Deployment request submitted. Deployment ARN: {deployment_arn}")
        return deployment_arn
    
    except Exception as e:
        print(f"Error creating deployment: {e}")
        return None
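Once the deployment is created, a sketch like the following can poll its status and then invoke the model through the Converse API by passing the deployment ARN as the modelId. The get_custom_model_deployment parameter and the status values used here are our assumptions and should be checked against the current boto3 documentation:

import time
import boto3

bedrock = boto3.client("bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")

def wait_for_deployment(deployment_arn, poll_seconds=30):
    # Poll until the deployment leaves the "Creating" state (status names are
    # assumptions -- verify against the boto3 documentation)
    while True:
        status = bedrock.get_custom_model_deployment(
            customModelDeploymentIdentifier=deployment_arn
        )["status"]
        print(f"Deployment status: {status}")
        if status != "Creating":
            return status
        time.sleep(poll_seconds)

def extract(deployment_arn, image_bytes, system_prompt, user_text):
    # On-demand custom model deployments are invoked by passing the deployment
    # ARN as the modelId of the Converse API
    response = bedrock_runtime.converse(
        modelId=deployment_arn,
        system=[{"text": system_prompt}],
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": user_text},
            ],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]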

Evaluation: Accuracy improvement with fine-tuning

Our evaluation of the base model and the fine-tuned Nova model shows significant improvements across all field categories. Let’s break down the performance gains:

| Field category | Metric | Base model | Fine-tuned model | Improvement |
|---|---|---|---|---|
| Employee information | Accuracy | 58% | 82.33% | 24.33% |
| | Precision | 57.05% | 82.33% | 25.28% |
| | Recall | 100% | 100% | 0% |
| | F1 score | 72.65% | 90.31% | 17.66% |
| Employer information | Accuracy | 58.67% | 92.67% | 34% |
| | Precision | 53.66% | 92.67% | 39.01% |
| | Recall | 100% | 100% | 0% |
| | F1 score | 69.84% | 96.19% | 26.35% |
| Earnings | Accuracy | 62.71% | 85.57% | 22.86% |
| | Precision | 60.97% | 85.57% | 24.60% |
| | Recall | 99.55% | 100% | 0.45% |
| | F1 score | 75.62% | 92.22% | 16.60% |
| Benefits | Accuracy | 45.50% | 60% | 14.50% |
| | Precision | 45.50% | 60% | 14.50% |
| | Recall | 93.81% | 100% | 6.19% |
| | F1 score | 61.28% | 75% | 13.72% |
| Multi-state employment | Accuracy | 58.29% | 94.19% | 35.90% |
| | Precision | 52.14% | 91.83% | 39.69% |
| | Recall | 99.42% | 100% | 0.58% |
| | F1 score | 68.41% | 95.74% | 27.33% |

The following graphic shows a bar chart comparing the F1 scores of the base model and fine-tuned model for each field category, with the improvement percentage shown in the previous table:

bar chart comparing the F1 scores of base model and fine-tuned model for each field category

Key observations:

  • Substantial improvements across all categories, with the most significant gains in employer information and multi-state employment
  • Consistent 100% recall maintained or achieved in the fine-tuned model, indicating comprehensive field extraction
  • Notable precision improvements, particularly in categories that were challenging for the base model

Clean up

To avoid incurring unnecessary costs when you’re no longer using your custom model, it’s important to properly clean up the resources. Follow these steps to remove both the deployment and the custom model:

  1. Delete the custom model deployment
  2. Delete the custom model
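A minimal cleanup sketch using the corresponding delete APIs is shown below; the delete_custom_model_deployment parameter name is an assumption to verify against the current boto3 documentation:

import boto3

bedrock = boto3.client("bedrock")

def cleanup(deployment_arn, custom_model_arn):
    # 1. Delete the on-demand deployment first ...
    bedrock.delete_custom_model_deployment(
        customModelDeploymentIdentifier=deployment_arn
    )
    # 2. ... then delete the custom model itself
    bedrock.delete_custom_model(modelIdentifier=custom_model_arn)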

Cost analysis

In our example, we chose the Amazon Bedrock fine-tuning job, which uses PEFT and for which ODI is available. PEFT fine-tuning of Nova Lite, paired with on-demand inference, offers a cost-effective and scalable solution for enhanced document processing. The cost structure is straightforward and transparent:

One-time cost:

  • Model training: $0.002 per 1,000 tokens × number of epochs

Ongoing costs:

  • Storage: $1.95 per month per custom model
  • On-demand inference: Same per-token pricing as the base model
    • Example for one page from the preceding dataset: 1,895 input tokens / 1,000 × $0.00006 + 411 output tokens / 1,000 × $0.00024 ≈ $0.00021
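For reference, the per-page figure can be reproduced with a few lines of arithmetic (per-token prices taken from the example above):

# Reproduce the per-page inference cost estimate from the example above
input_tokens, output_tokens = 1895, 411
cost = input_tokens / 1000 * 0.00006 + output_tokens / 1000 * 0.00024
print(f"${cost:.5f} per page")  # ~$0.00021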

On-demand inference allows you to run your custom Nova models without maintaining provisioned endpoints, enabling pay-as-you-go pricing based on actual token usage. This approach eliminates the need for capacity planning while ensuring cost-efficient scaling.

Conclusion

In this post, we’ve demonstrated how fine-tuning Amazon Nova Lite can transform document processing accuracy while maintaining cost efficiency. Our evaluation shows significant performance gains, with up to 39% improvement in precision for critical fields and perfect recall across key document categories. While our implementation did not require constrained decoding, tool calling with Nova can provide additional reliability for more complex structured outputs, especially when working with intricate JSON schemas. Please refer to the resource on structured output with tool calling for further information.

The flexible deployment options, including on-demand inference with pay-per-use pricing, eliminate infrastructure overhead while maintaining the same inference costs as the base model. With the dataset we used for this example, runtime inference per page cost was $0.00021, making it a cost-effective solution. Through practical examples and step-by-step guides, we’ve shown how to prepare training data, fine-tune models, and evaluate performance with clear metrics.

To get started with your own implementation, visit our GitHub repository for complete code samples and detailed documentation.


About the authors

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.

Arlind Nocaj is a GTM Specialist Solutions Architect for AI/ML and Generative AI for Europe Central, based in the AWS Zurich office, who guides enterprise customers through their digital transformation journeys. With a PhD in network analytics and visualization (graph drawing) and over a decade of experience as a research scientist and software engineer, he brings a unique blend of academic rigor and practical expertise to his role. His primary focus lies in using the full potential of data, algorithms, and cloud technologies to drive innovation and efficiency. His areas of expertise include machine learning, generative AI, and in particular agentic systems with multimodal LLMs for document processing and structured insights.

Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.

Malte Reimann is a Solutions Architect based in Zurich, working with customers across Switzerland and Austria on their cloud initiatives. His focus lies in practical machine learning applications—from prompt optimization to fine-tuning vision language models for document processing. The most recent example is working in a small team to provide deployment options for Apertus on AWS. An active member of the ML community, Malte balances his technical work with a disciplined approach to fitness, preferring early morning gym sessions when the gym is empty. During summer weekends, he explores the Swiss Alps on foot and enjoys time in nature. His approach to both technology and life is straightforward: consistent improvement through deliberate practice, whether that’s optimizing a customer’s cloud deployment or preparing for the next hike in the clouds.