Running pgvector in production on Amazon Aurora PostgreSQL

Running pgvector on Amazon Aurora PostgreSQL-Compatible Edition gives you a production-grade vector store on a database you already know, backed by the operational tooling, high availability, and scaling behaviour of Amazon Aurora. That combination is why pgvector has become a common choice for Retrieval Augmented Generation (RAG) workloads moving from proof of concept into production with Service Level Agreements (SLAs). Production traffic does introduce a predictable set of operational considerations: query latency as the corpus grows, recall on filtered vector searches, memory headroom during index builds, and connection behaviour under load. This post is scoped to the database operations that keep the RAG retrieval layer healthy. Model customization through fine-tuning or continued pre-training is out of scope.

In this post, we cover the operational practices that keep a pgvector workload healthy once you depend on it: choosing the right index and distance function, scaling with quantization and partitioning, managing Hierarchical Navigable Small World (HNSW) churn, sizing for memory-resident operation, and the observability signals that catch problems early.

How this post is organized

First, choose an index type (HNSW or Inverted File with Flat Compression) for your dataset and write pattern. Second, set up a baseline schema and query with the AWS-recommended parameters. Third, pick a distance operator that matches your embedding model. Fourth, scale the index to your target dataset size with quantization, parameter tuning, and partitioning. Fifth, plan for churn, capacity, and observability before traffic arrives. Each section builds on the one before.

The examples throughout use a multi-tenant document-store schema on Aurora PostgreSQL-Compatible with Amazon Titan Text Embeddings V2, an Amazon Bedrock foundation model, as the embedding model. Each SQL example is runnable against an Aurora PostgreSQL-Compatible cluster with the vector extension enabled.

This post covers the self-managed path, where you are responsible for the vector store and retrieval pipeline. For a fully managed RAG capability that handles ingestion, embedding, and retrieval for you, use Amazon Bedrock Knowledge Bases, which supports Amazon Aurora PostgreSQL with pgvector as one of its vector store options.

Prerequisites

To follow the examples in this post, you need:

An Aurora PostgreSQL-Compatible cluster running PostgreSQL version 17.4+, 16.8+, 15.12+, 14.17+, or 13.20+, which supports pgvector 0.8.0. pgvector 0.8.0 is available in all AWS commercial Regions and AWS GovCloud (US) Regions, except AWS China Regions. For the launch details, see the pgvector 0.8.0 Aurora announcement. For the current Region list, see the Aurora PostgreSQL extension versions page.
The vector extension enabled on your database.
An embedding model. The examples assume Amazon Titan Text Embeddings V2, a foundation model available through Amazon Bedrock that produces 1024-dimensional embeddings by default.

Choosing the right index strategy

pgvector ships two Approximate Nearest Neighbor (ANN) index types: HNSW and IVFFlat. Both trade perfect accuracy for speed, because a brute-force exact search across millions of vectors is too slow for online queries. The AWS Prescriptive Guidance on vector database options covers where pgvector fits among AWS vector stores. The Self-managed multi-tenant vector search with Amazon Aurora PostgreSQL post walks through the pgvector basics if you want a refresher on the terminology.

For production RAG on Aurora PostgreSQL, HNSW is the default choice for most workloads. But there are two cases where the right answer is no index at all: small datasets, and partitioned datasets where 100% recall is required. The rest of this section explains why HNSW wins for the common case, when IVFFlat still makes sense, and when skipping the index entirely is the correct production decision.

How HNSW works, and why it is the default

HNSW stands for Hierarchical Navigable Small World. It builds a multi-layer proximity graph. Layer 0 contains every vector, densely connected to its nearest neighbors. Each higher layer samples a smaller subset of vectors with longer-range connections. A single vector is the entry point, present in every layer. For a detailed walk-through of both HNSW and IVFFlat internals beyond the summary here, see the deep dive into IVFFlat and HNSW techniques.

A search starts at the entry point in the top layer. Within a layer, pgvector performs a greedy best-first walk along the graph edges, moving to whichever connected neighbor is closest to the query vector. When it can no longer get closer, it descends to the same node in the layer below and continues the walk there. At layer 0 the walk expands into a beam search controlled by the hnsw.ef_search parameter, and the top candidates are returned.

The following diagram shows that descent in action, from the entry point at the top layer down to the beam search at layer 0.

HNSW layered descent. A query enters at the entry point in the top layer, performs a greedy walk within each layer to find the local nearest neighbor, descends to the same node in the layer below, and expands into a beam search at layer 0 to return the top-k results.

This structure makes HNSW fast to query and lets it accept inserts incrementally. You can write new vectors directly into a live index without pausing or batching. Recall and latency are controlled at query time through hnsw.ef_search, so you can tune per workload without rebuilding the index.

The trade-off is build cost. HNSW indexes take longer to build and use more memory than IVFFlat, because the graph itself must be stored alongside the vectors. For the production RAG pattern this post targets, the trade-off is worth it: you build the index once, and queries stay fast provided you manage churn (covered later in the Managing index churn section).

When IVFFlat still makes sense

IVFFlat stands for Inverted File with Flat compression. Instead of a graph, it groups vectors into clusters using k-means. Each cluster has a representative point at its center, called a centroid. At query time, pgvector compares the query vector to the centroids, picks the nearest clusters, and searches only inside those clusters.

The following diagram shows the space partitioned into clusters, with a query compared against the centroids and then against the vectors inside the nearest clusters.

IVFFlat clustering. The vector space is partitioned into clusters with a centroid at the center of each. A query is compared to the centroids first and then searched only inside the nearest clusters.

IVFFlat is cheaper to build than HNSW: clustering is simpler than graph construction, and the index stores no graph, so it uses less memory. However, IVFFlat fixes the centroids at build time. If your data changes after the index is built, new vectors might not match the existing clusters well, and recall drops. The fix is to rebuild the index, which recomputes the centroids against the current data. Centroid retraining means re-running k-means over the full dataset to produce a fresh set of cluster centers.

IVFFlat fits a narrow set of workloads: large, mostly static corpora with an existing batch rebuild schedule. Everything else belongs on HNSW.

When no index is the right choice

There are two production patterns where no index outperforms either ANN option.

The first is the small-dataset case. For tables with roughly 10,000 to 50,000 vectors, a parallel sequential scan is fast enough on its own, and skipping the index avoids build time, maintenance cost, and the recall loss inherent to approximate search. This is the case called out in the pgvector 0.8.0 best practices, and it is the usual starting point for new workloads before the corpus grows.

The second is the partitioned, recall-critical case. If your schema partitions data so that every query touches a bounded subset (for example, per tenant or per user), and your use case requires 100% recall, a brute-force parallel scan within a partition can beat an ANN index. The Ring engineering team runs exactly this pattern in production: 100 to 200 billion embeddings are spread across per-user partitions of roughly 1 GB each. Each partition is scanned with max_parallel_workers_per_gather = 16, with no vector index at all. Removing the index was the change that let PostgreSQL pick parallel sequential scans over single-threaded index scans, which drove EBS throughput from about 50 MB/s to nearly 500 MB/s. Full details are in the Ring billion-scale semantic video search post.

Two conditions must hold for this pattern to work. First, the per-partition scan has to fit the latency budget, which in practice means each partition stays small enough to be served mostly from the buffer cache or local NVMe. Second, your workload must genuinely need 100% recall. If the ANN recall achievable with HNSW is acceptable, HNSW is simpler and faster. When both conditions are met, skipping the index removes an entire class of operational concerns: no build cost, no churn management, no memory sizing for index pages.

Quick decision guide

The following flowchart summarises the decision, starting from dataset size and walking through write pattern, HNSW build cost, and partitioning to a concrete index choice.

Index choice flowchart. Pick an index type based on dataset size, write pattern, HNSW build cost, and partitioning.

The “skip the index” branch covers both cases from the previous section: small tables where a sequential scan is already fast enough, and partitioned schemas where 100% recall is required within a bounded per-partition scan. Ring runs the second pattern in production, with per-customer partitions so every search reads only that customer’s data. For how they designed for it, see the preceding no-index discussion and the Ring billion-scale semantic video search post.

A runnable baseline

Before going deeper into similarity functions and scaling, here is a baseline schema, index, and query that the rest of the post builds on. It reflects the AWS-recommended HNSW parameters from the pgvector 0.8.0 best practices section and the iterative scan default we recommend for production.

The baseline uses HNSW, which fits the common production RAG case: per-tenant partitions are large enough that a sequential scan is not competitive, and ANN recall at the default settings is acceptable. If your workload looks like the no-index pattern covered earlier, small tables, or partitioned schemas where 100% recall is required within a bounded per-partition scan, omit the HNSW index from the following example and keep the rest of the schema as shown.

-- Enable the extension (one-time, per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- A multi-tenant document store with Titan Text Embeddings V2 (1024 dimensions)
CREATE TABLE documents (
    id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    tenant_id text NOT NULL,
    content text NOT NULL,
    embedding vector(1024) NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- B-tree on the filter column, so the planner has an alternative path
CREATE INDEX documents_tenant_idx ON documents (tenant_id);

-- HNSW with the AWS-recommended starting parameters, cosine distance
CREATE INDEX documents_embedding_hnsw_idx
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);

-- Production-ready filtered vector query (iterative scan values are identifiers, not strings)
SET hnsw.iterative_scan = relaxed_order;
SET hnsw.ef_search = 100;

SELECT id, content
FROM documents
WHERE tenant_id = '<your-tenant-id>'
ORDER BY embedding <=> '<your-query-embedding>'
LIMIT 10;

On a populated documents table, this returns up to 10 rows ordered by cosine distance:

id	content
42	Q3 revenue grew 18% year over year, driven by enterprise…
117	The FY26 capital plan was approved at the November board…
203	Customer churn declined to 2.1% following the support model…
…	…
(10 rows)

In production code, wrap the two SET statements and the SELECT in a transaction and use SET LOCAL instead, so the values apply only to that query rather than the whole session. Swap vector_cosine_ops for vector_ip_ops (and the <=> operator for <#>) if your embeddings are unit-normalized. The next section explains when each choice applies.

Similarity functions at scale

pgvector supports several distance operators. The three that matter for text and semantic embeddings are listed in the following table. Each operator is paired with a specific HNSW operator class at index-creation time. If you use the wrong pair, the planner cannot use the index.

Operator	Meaning	HNSW operator class	When to use
<=>	Cosine distance	`vector_cosine_ops`	Safe default for text or semantic embeddings. Ignores vector magnitude.
<#>	Negative inner product	`vector_ip_ops`	Faster than cosine when vectors are already unit-normalized.
<->	L2 (Euclidean) distance	`vector_l2_ops`	Rarely appropriate for semantic search.

A note on <#>: pgvector returns the negative inner product because PostgreSQL can only use ASC-order index scans. This means ORDER BY embedding <#> query ASC returns the most similar vectors first. If you want a positive similarity score in the result, multiply by -1: SELECT (embedding <#> query) * -1 AS similarity.

Picking cosine or inner product

Amazon Titan Text Embeddings V2 normalizes by default through the normalize API parameter, which defaults to true. For Amazon Titan Text Embeddings V2 output with the default settings, inner product (<#>) is the right choice. On normalized vectors, cosine and inner product produce identical rankings, because cosine divides the inner product by the product of the vector norms, and for unit vectors that divisor is always 1. The norms and division still get computed per comparison, but inner product skips that work. If you override normalize to false, or use an embedding model that does not normalize (such as BGE or older Sentence-Transformers models), fall back to cosine.

Verify that your stored vectors are actually unit-normalized before switching operators. Norm should be 1.0 within floating-point tolerance:

SELECT id,
       sqrt(greatest(0, (embedding <#> embedding) * -1)) AS norm
FROM documents
LIMIT 10;

If one or more rows return a norm that is materially different from 1.0, your data is not normalized and the <#> shortcut will produce wrong results. The greatest(0, …) guard protects against sqrt of a tiny negative floating-point residual for vectors that are unit-normalized within tolerance.

Iterative scans for filtered queries

pgvector 0.8.0 introduced iterative index scans, which fix the overfiltering problem. Before 0.8.0, a query that combined a WHERE clause with a vector search often returned fewer results than the LIMIT asked for, because the index returned its top-k candidates before the filter was applied and most candidates were filtered out.

Iterative scans keep pulling candidates from the index until the query satisfies the filter or hits a configurable limit. Three modes are available:

off: the pre-0.8.0 behavior. Fastest, but can under-return.
strict_order: preserves exact distance ordering. Safer, but slower for selective filters.
relaxed_order: returns the correct count with approximate ordering within the result set. We recommend this mode for most production use cases.

Two related parameters bound the scan when you enable iterative mode: hnsw.max_scan_tuples (default 20,000) caps how far the scan walks the index, and hnsw.scan_mem_multiplier (default 1) caps how much memory the scan uses as a multiple of work_mem. Raise scan_mem_multiplier first if max_scan_tuples increases alone do not improve recall for a filtered query.

Scaling to millions of vectors

Three levers matter when the dataset grows beyond a few hundred thousand vectors: quantization, HNSW parameter tuning, and partitioning.

Quantization reduces memory footprint at a small cost to recall. pgvector supports halfvec (16-bit floats) and binary quantization. The AWS pgvector 0.7.0 benchmarks show halfvec cuts memory roughly in half with minimal recall loss, and binary quantization produces large build speed-ups at the cost of recall that you recover with a re-ranking pass, shown in the next subsection. For most workloads, start with halfvec. Move to binary quantization only when you already have a re-rank pipeline in place.

The benchmarks used OpenAI (5M vectors, 1536 dimensions) and Cohere (10M vectors, 768 dimensions) datasets. Amazon Titan Text Embeddings V2 at 1024 dimensions sits between them, so the memory and build-time trade-offs should apply proportionally. Validate against your own data before committing to a quantization strategy.

The following chart shows the memory and recall trade-off across float32, halfvec, and binary representations on those benchmarks.

Memory footprint and recall for float32, halfvec, and binary vector representations, sourced from the AWS pgvector 0.7.0 benchmarks. Halfvec halves memory with minimal recall loss; binary quantization reduces memory further but requires a re-rank pass to recover recall.

Two-stage retrieval with binary quantization

An HNSW index and its full-precision vectors must stay memory-resident for performance. When the working set exceeds what RAM can hold, the choices are to shrink the in-memory representation (quantization), extend effective memory with Aurora Optimized Reads, or both. Binary quantization with re-ranking does the shrinking: coarse candidate selection runs on tiny binary vectors that fit easily, and re-ranking by cosine distance pulls full-precision vectors only for the final top-N.

The following SQL pattern implements that two-stage retrieval: Hamming distance for the coarse pass, cosine distance for the re-rank. The re-rank is what recovers most of the recall lost to quantization.

-- Coarse recall pass (fast, on binary-quantized vectors), then re-rank by cosine
SELECT id, content
FROM (
    SELECT id, content, embedding
    FROM documents
    WHERE tenant_id = '<your-tenant-id>'
    ORDER BY binary_quantize(embedding)::bit(1024)
            <~> binary_quantize('<your-query-embedding>')
    LIMIT 100
) candidates
ORDER BY embedding <=> '<your-query-embedding>'
LIMIT 10;

The outer query sees only the 100 coarse candidates, so the expensive cosine comparison runs on a small set. The inline binary_quantize() cast is shown for clarity. For production at scale, materialize the binary vectors in a stored column and index them with bit_hamming_ops so the coarse pass uses an index rather than computing the quantization at query time. For benchmarks of this pattern on pgvector 0.8.0 and Aurora, see Supercharging vector search performance and relevance with pgvector 0.8.0 on Amazon Aurora PostgreSQL.

HNSW exposes three parameters worth tuning:

m: max connections per layer. Higher values improve recall but increase memory use and build time. AWS recommends m = 16 as the starting point.
ef_construction: dynamic candidate list size during build. Higher values improve index quality at build time. AWS recommends ef_construction = 128. The pgvector default is 64.
ef_search: dynamic candidate list size during query. Higher values improve recall at query latency cost. The pgvector default is 40, which is often too low for production. Tune per workload. Do not hardcode.

Partitioning keeps individual indexes manageable. Partition by tenant for multi-tenant workloads, by time for append-heavy workloads such as event or log data, or by category when queries are scoped to a known subset. Partitioning also supports parallel index builds and per-partition rebuilds.

For bulk loads, defer indexing. Load the data first, then build the index once at the end. Inserting into a live HNSW index is slower than a single post-load build, and the resulting index is often better structured.

When the working set outgrows RAM

RAM residency is the first target, but at some dataset sizes keeping the entire HNSW graph in shared_buffers stops being economical. Aurora Optimized Reads extends effective cache capacity by up to 5x the instance memory using local NVMe storage as a tiered cache, with up to 8x lower read latency for queries that would otherwise fetch from Aurora storage. The feature documentation lists pgvector nearest-neighbor search across millions of vector embeddings as a target use case.

Optimized Reads tiered cache requires an Aurora I/O-Optimized cluster on an NVMe-backed instance family (r6gd, r8gd, or r6id) and is enabled automatically on those instance classes. On an Aurora Standard cluster, Optimized Reads provides only temporary object acceleration, not tiered cache. Monitor the AuroraOptimizedReadsCacheHitRatio Amazon CloudWatch metric alongside BufferCacheHitRatio to see how much of the traffic misses RAM but still hits NVMe rather than storage.

Treat tiered cache as an extension of the memory budget, not a substitute for it: RAM is still faster than NVMe, and NVMe is still faster than Aurora storage. Size the instance so the hot working set stays in RAM, and let Optimized Reads absorb the long tail.

Managing index churn

HNSW indexes degrade with deletes and updates. Each change leaves an invalid entry in the graph. Over time, this hurts recall and inflates index size. There is no in-place compaction.

The following chart illustrates how recall drifts downward between rebuilds and recovers at each scheduled REINDEX CONCURRENTLY.

Illustrative HNSW recall over time. Without scheduled rebuilds, recall drifts downward as invalid graph entries accumulate. With a scheduled REINDEX CONCURRENTLY, recall recovers at each rebuild. Actual curves depend on workload, write mix, and index parameters.

Three patterns work in production:

REINDEX CONCURRENTLY rebuilds the index without blocking writes, but is resource-intensive. Schedule it during low-traffic windows and monitor maintenance_work_mem headroom. For large indexes, plan hours, not minutes.

REINDEX INDEX CONCURRENTLY documents_embedding_hnsw_idx;

Partition-based rebuilds are often cleaner. If you partition by time, you can drop and rebuild an entire partition’s index in one operation rather than churning a monolithic graph. This also caps the scope of a failed rebuild.

Append-only with periodic compaction suits workloads where old data becomes irrelevant. Write new vectors to an active partition, then periodically move or drop older partitions. The active index stays small and fast.

Your write pattern determines which one to pick. Workloads that are mostly inserts with few updates do fine on a scheduled REINDEX CONCURRENTLY. Workloads with heavy updates or deletes are better served by partitioning (by time or by batch) and rebuilding partitions independently. Workloads where relevance is time-bounded, for example a search over only the last 90 days, suit the append-only pattern with periodic compaction of old partitions.

Capacity planning

Churn management determines how often the graph is rebuilt. Capacity planning determines whether each rebuild and each query has the memory it needs. The two go together.

An HNSW index must stay memory-resident. When it spills to disk, search latency degrades, because the graph traversal pattern is random and each miss is a page fault. Plan for the index to fit in RAM with headroom for concurrent queries and maintenance operations.

The HNSW graph consumes more memory than the raw vector data, driven primarily by the m parameter. Size Aurora instances to keep shared_buffers plus connection and work memory below the instance’s available RAM.

Choose a memory-optimized instance class for Aurora PostgreSQL vector workloads, since the HNSW graph must fit in RAM. The Amazon Relational Database Service (Amazon RDS) r-series instance families are sized for memory-bound workloads such as vector search. For the current list of supported instance classes, see the Aurora DB instance classes documentation.

Tune these PostgreSQL parameters:

shared_buffers: Aurora sets a reasonable default, but verify it covers your expected index working set.
effective_cache_size: signals the query planner about available OS-level cache. Set to roughly 75% of instance memory on Aurora.
maintenance_work_mem: critical during index builds. Too low, and CREATE INDEX or REINDEX spill and slow down. pgvector emits a NOTICE when the HNSW graph no longer fits in maintenance_work_mem. If you see it in your build logs, raise the parameter on the instance and re-run the build.
work_mem: per-operation. With many concurrent vector queries, low work_mem causes spills. Too high a value risks memory pressure under load.
max_parallel_maintenance_workers: pgvector 0.7.0 added parallel HNSW builds. Set this to take advantage of larger instances during index creation.

For pricing, see the Amazon Aurora pricing page.

Observability

You cannot tune what you cannot see. Four observability layers matter for pgvector: query-level statistics, instance-level metrics, wait-event analysis, and application-defined custom metrics. Connection handling rounds them out.

Query-level statistics come from pg_stat_statements and the Aurora-specific aurora_stat_statements function. The Aurora variant adds storage I/O and peak memory columns, which matter for vector queries because a spilling search looks different in I/O than a cached one. To find the top vector queries by total time and peak memory:

SELECT substring(query, 1, 80) AS query,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       max_exec_peakmem
FROM aurora_stat_statements(true)
WHERE query ILIKE '%<=>%' OR query ILIKE '%<#>%'
   OR query ILIKE '%<->%' OR query ILIKE '%<~>%'
ORDER BY total_exec_time DESC
LIMIT 10;

total_exec_time is in milliseconds and max_exec_peakmem is in bytes. The peak memory columns require Aurora PostgreSQL 16.3, 15.7, or 14.12 and higher. On earlier minor versions, remove max_exec_peakmem from the SELECT list.

Instance-level metrics in CloudWatch tell you whether the index fits in memory. Watch BufferCacheHitRatio (should stay above 99% for healthy vector workloads), SwapUsage (should be zero), and ReadIOPS (sustained high values on a read-heavy vector workload suggest the index is spilling).

Amazon CloudWatch Database Insights surfaces slow queries with wait event breakdowns. Use it to find vector queries blocked on I/O or lock contention.

Custom metrics close the gap between database health and application health. Two are worth building:

Recall tracking: periodically run a fixed set of evaluation queries whose correct top-k results you know, compare what pgvector returns against the known correct set, and emit the recall percentage to CloudWatch. Drops in recall indicate index drift or query parameter regression.
p99 latency for vector searches, tagged by query type. Tail latency on vector search often moves before CPU or memory metrics do, because a handful of queries that evict the HNSW graph from cache can degrade recall and latency without changing averages.

Connection handling also needs monitoring. Vector queries are memory-heavy, so an over-subscribed connection pool exhausts work_mem quickly. Use Amazon RDS Proxy in front of Aurora and watch DatabaseConnections, ClientConnections, and MaxDatabaseConnectionsAllowed.

Clean up

If you created the baseline documents table and its indexes to follow along with the examples, remove them when you are done to avoid incurring ongoing storage charges for unused data:

DROP INDEX IF EXISTS documents_embedding_hnsw_idx;
DROP INDEX IF EXISTS documents_tenant_idx;
DROP TABLE IF EXISTS documents;

The vector extension does not incur charges on its own, but if you enabled it only for these examples, you can remove it with DROP EXTENSION IF EXISTS vector;. Do not drop the extension if other databases in the cluster are using it.

If you provisioned a dedicated Aurora PostgreSQL-Compatible cluster for testing, delete the cluster and its automated backups with the AWS Command Line Interface (AWS CLI) once you are done. Aurora PostgreSQL-Compatible clusters incur charges for as long as they exist, including storage charges on stopped clusters, so delete the cluster after testing to stop these charges:

# Delete all DB instances in the cluster first
aws rds delete-db-instance \
    --db-instance-identifier <your-db-instance-id> \
    --skip-final-snapshot

# Then delete the cluster
aws rds delete-db-cluster \
    --db-cluster-identifier <your-cluster-id> \
    --skip-final-snapshot \
    --delete-automated-backups

Manual snapshots persist after a cluster is deleted and continue to incur storage charges until you delete them explicitly. If you took any during testing, remove them:

# Remove any manual snapshots you took during testing
aws rds delete-db-cluster-snapshot \
    --db-cluster-snapshot-identifier <your-snapshot-id>

Replace , , and with the values from your test environment. The --skip-final-snapshot flag is appropriate for a disposable test cluster. Do not use it on a cluster that holds data you want to keep.

Key takeaways

Before going live, verify these five:

Plan for HNSW rebuilds from day one. Choose REINDEX CONCURRENTLY on a schedule, partition-based rebuilds, or append-only with compaction based on your write pattern.
Set hnsw.ef_search explicitly at session or query level. The default (40) is often too low for production recall. A value of 100 is a sensible starting point.
Size maintenance_work_mem to cover your largest index build, with headroom. A build that spills to disk produces a lower-quality graph.
Account for concurrent vector searches when sizing connection quotas. Each concurrent query consumes work_mem. Multiply out and use Amazon RDS Proxy in front of the cluster.
Treat BufferCacheHitRatio as the first metric to check. It is the earliest signal that an index has outgrown its instance.

Conclusion

Running pgvector on Amazon Aurora PostgreSQL at production scale rewards teams who design for their workload up front. Corpus size, write mix, filter patterns, and recall targets shape the right index choice, parameter settings, and capacity plan. The sooner you make those decisions deliberately, the smoother the path from proof of concept to SLA-backed production.

In this post, we covered the operational practices that matter: picking the right index and distance function, scaling with quantization and partitioning, managing HNSW churn, sizing for memory-resident operation, and the observability signals that catch problems early. Treat this as a checklist for your first production rollout, and revisit it when traffic or dataset size doubles.

AWS Database Blog