Improve Solana node performance and reduce costs on AWS

Solana Agave v2.0.14 was released on October 18, 2024. Since then, operators of Solana nodes reported that occasionally their Solana nodes struggle to stay in sync with the latest slots on mainnet-beta. Searching for “catch up” on Solana’s StackExchange reveals numerous discussions of this challenge.

In an earlier post, we explained how to run Solana nodes on AWS. In this post, we explore how to configure your nodes for faster operations and initial synchronization. Additionally, we share the experimental traffic optimization technique to make you data transfer for your Solana node cost-effective on Amazon Elastic Compute Cloud (Amazon EC2).

Changes in Solana Agave v2.x

One of the most significant changes in Solana Agave v2.x is the new central scheduler, which existed but was inactive by default in v1.18.x. With the v2.x update, the central scheduler is now enabled by default.

As a core component of the Solana Agave client, the central scheduler significantly changes the transaction processing architecture. It introduces the Scheduling Thread, a single-threaded coordination mechanism that replaces the previous four-thread processing model. For more information about this change, see Introducing the Central Scheduler: An Optional Feature of Agave v1.18.

This change had a significant impact on the recommended minimum CPU clock speed, which increased requirements from 2.8 GHz to 3.2 GHz for optimal performance of transaction processing. Based on that, we have updated our recommended AWS instance types for running Solana nodes in the original blog post and switched to R7a and I7ie EC2 instance families. Those instance families also provide from 384 GiB to over 1,5 TiB of RAM required to run Solana nodes in different configurations.

Other important performance optimizations were introduced in Agave v2.2.15 and focused on removing operations with block storage from the “hot path” of processing logic. This helped reduce the number of input/output operations per second (IOPS) and latency required by block storage devices and reduce the costs of running Agave clients on the cloud even further.

Overcoming challenges with synchronizing Solana in Asia-Pacific AWS Regions

If a Solana Agave node doesn’t have a pre-downloaded snapshot during startup, the client downloads that snapshot from trusted validator nodes. Those nodes are set by the --known-validator flag, as illustrated in the Anza RPC node startup command example. However, those trusted validators are often located in North America or Europe AWS Regions, which creates a problem for clients running in Asia-Pacific Regions such as Tokyo, Hong Kong, Seoul, or Singapore. The snapshot download speeds in those locations are typically slower compared to clients in North America or Europe Regions.

After downloading the snapshot, Agave checks whether the difference between the snapshot’s slot is over 2,500 slots behind the latest. If this difference is greater than 2,500, the client re-downloads the snapshot, further delaying the synchronization process. This problem has been documented in GitHub issue #24486.

By default, the maximum_local_snapshot_age parameter is set to 2,500. Although you can increase this value to avoid re-downloading snapshots, we don’t recommend this approach. Because Solana generates a new slot every 400 milliseconds (approximately 150 slots per minute), setting this value too high might result in the node never catching up to the latest slot.

To improve snapshot download performance when running a Solana node in Asia-Pacific Regions, we recommend the following:

Identify trusted nodes:
- Use validator.app to find trusted nodes’ identities closer to your location in Asia-Pacific Regions.
- Set these identities as parameters to the --known-validator flag of your agave-validator startup command.
- Avoid using validator nodes marked with a warning sign.
Restrict snapshot sources:
- Use the --only-known-rpc flag to configure the client to download snapshots only from trusted nodes.
Set a minimum download speed:
- Use the --minimal-snapshot-download-speed command to define a minimum required snapshot download speed, preventing downloads from slow sources. In our tests, we used 104,857,600 (100 MiBps).

By implementing these optimizations, you can speed up the initial sync and make your Solana node deployment faster and more reliable in Asia-Pacific Regions.

Optimizing data transfer costs for Agave RPC nodes

Solana Agave clients generate a high volume of outbound data traffic due to its data propagation protocol called Turbine. In recent years, monthly traffic volume has increased and currently ranges from 100 TiB to over 200 TiB, even for nodes configured as RPC only. To better manage these costs, we conducted a set of experiments to determine the minimum outbound data throughput required to keep a Solana node synchronized.

We first created a script that checks the node’s “Slots Behind” metric one time every minute and after the initial sync is done. One minute interval is usually enough to reliably detect if Solana node is syncing well or started falling behind. When “Slots Behind” metric reaches zero, indicating the node is fully synced, another script applies a user-defined bandwidth limit in MiBps. If the “Slots Behind” metric exceeds 10, the limit is temporarily removed until the node catches up.To maintain operational efficiency, the system excludes internal network traffic from these restrictions. Traffic within standard internal IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16) remains unrestricted, making sure AWS applications using internal IPs function normally.Although this feature is highly effective for RPC nodes, it should not be implemented on consensus nodes. Restricting outbound traffic on consensus nodes will compromise performance and is not recommended for optimal network participation.

To test this traffic optimization technique, we ran a comparative test with five i7ie.12xlarge EC2 instances in the eu-central-1 Region for 5 days. We configured four nodes with traffic shaping scripts and one control node with no limits and collected “Current Slots” and “Slots Behind” metrics to compare how fast and how consistently the nodes sync.

Testing showed that a node can maintain synchronization with as low as 20 MiBps of outbound traffic bandwidth (approximately 6.5 TiB per month) and optimal price to performance ratio is reached at 40-50 MiBps, reducing estimated data transfer costs by over 85%.Throughout the entire testing period, all five nodes remained synchronized, confirming that reducing outbound bandwidth doesn’t impact how well a node stays in sync. Surprisingly, nodes with Agave v2.2.16 restricted to 20–50 MiBps of outbound traffic bandwidth stayed in sync up to 5% more consistently than a control node with no restrictions.

Configuring traffic shaping for Agave RPC nodes

Based on these results, we have introduced dynamic traffic shaping in the Solana blueprint for AWS Blockchain Node Runners (see Optimizing Data Transfer Costs). The key parts of its implementation are:

net-rules-start.sh turns on traffic shaping:

#!/bin/bash

# Specify max value for outbound data traffic in Mbps. 
LIMIT_OUT_TRAFFIC_MBPS=20

# Step 1: Create nftables rules to mark packets going to public IPs
# Create table if it doesn't exist
if ! nft list table inet mangle >/dev/null 2>&1; then
    nft add table inet mangle
fi

# Create chain if it doesn't exist
if ! nft list chain inet mangle output >/dev/null 2>&1; then
    nft add chain inet mangle output { type route hook output priority mangle\; }
fi

# Check if specific private IP return rule exists
if ! nft list chain inet mangle output | grep -q "10\.0\.0\.0/8.*172\.16\.0\.0/12.*192\.168\.0\.0/16.*169\.254\.0\.0/16.*return"; then
    nft add rule inet mangle output ip daddr { 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16 } return
fi

# Check if mark rule with value 1 exists
if ! nft list chain inet mangle output | grep -q "meta mark set 0x00000001"; then
    nft add rule inet mangle output meta mark set 1
fi

# Step 2: Set up tc with filter for marked packets
INTERFACE=$(ip -br addr show | grep -v '^lo' | awk '{print $1}' | head -n1)

# Check if root qdisc already exists
if ! tc qdisc show dev $INTERFACE | grep -q "qdisc prio 1:"; then
    tc qdisc add dev $INTERFACE root handle 1: prio
fi

# Step 3: Add the tbf filter for marked packets
# Check if filter already exists
if ! tc filter show dev $INTERFACE | grep -q "handle 0x1 fw"; then
    tc filter add dev $INTERFACE parent 1: protocol ip handle 1 fw flowid 1:1
fi

# Check if tbf qdisc already exists on class 1:1
if ! tc qdisc show dev $INTERFACE | grep -q "parent 1:1"; then
    tc qdisc add dev $INTERFACE parent 1:1 tbf rate "${LIMIT_OUT_TRAFFIC_MBPS}mbit" burst 20kb latency 50ms
fi

net-rules-stop.sh removes all traffic shaping:

#!/bin/bashINTERFACE=$(ip -br addr show | grep -v '^lo' | awk '{print $1}' | head -n1)# Remove tc rulestc qdisc del dev $INTERFACE root 2>/dev/null# Remove nftables rules# Delete the entire mangle table (removes all chains and rules)nft delete table inet mangle 2>/dev/nullexit 0;

For simplicity, automation scripts net-rules-start.sh and net-rules-stop.sh are controlled through a systemd service net-rules.service:

[Unit]
Description="ipables and Traffic Control Rules"
After=network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/opt/instance/network/net-rules-start.sh
ExecStop=/opt/instance/network/net-rules-stop.sh

[Install]
WantedBy=multi-user.target

net-syncchecker.sh checks node synchronization status using an internally accessed API and triggers traffic shaping on and off with a net-rules service. It needs to be called every 1 minute, and you can use systemd timer or other services, like cron, to schedule it.

#!/bin/bash

INIT_COMPLETED_FILE=/data/data/init-completed
MAX_SOLANA_SLOTS_BEHIND=10

# Check if jq is available
if ! command -v jq &> /dev/null; then
    echo "Error: jq is required but not installed"
    exit 1
fi

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
if [ -z "$TOKEN" ]; then
    echo "Error: Failed to get EC2 metadata token"
    exit 1
fi

EC2_INTERNAL_IP=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/latest/meta-data/local-ipv4)

# Start checking the sync node status only after the node has finished the initial sync
if [ -f "$INIT_COMPLETED_FILE" ]; then
    SOLANA_SLOTS_BEHIND_DATA=$(curl -s -X POST -H "Content-Type: application/json" -d ' {"jsonrpc":"2.0","id":1, "method":"getHealth"}' http://$EC2_INTERNAL_IP:8899 | jq .error.data)
    SOLANA_SLOTS_BEHIND=$(echo $SOLANA_SLOTS_BEHIND_DATA | jq .numSlotsBehind -r)

    if [ "$SOLANA_SLOTS_BEHIND" == "null" ] || [ -z "$SOLANA_SLOTS_BEHIND" ]
    then
        SOLANA_SLOTS_BEHIND=0
    fi

    if [ $SOLANA_SLOTS_BEHIND -gt $MAX_SOLANA_SLOTS_BEHIND ]
    then
        if systemctl is-active --quiet net-rules; then
            systemctl stop net-rules
        fi
    fi

    if [ $SOLANA_SLOTS_BEHIND -eq 0 ]
    then
        if ! systemctl is-active --quiet net-rules; then
            systemctl start net-rules
        fi
    fi
fi

Before using in production, the code from this post or the AWS Blockchain Node Runners:

Review and test these scripts in a secure environment
Add input validation where necessary
Implement proper error handling and logging
Run the scripts with minimal required privileges

Conclusion

In this post, we explored how the changes introduced in Solana Agave v2.x have increased required CPU clock speed, reviewed a way to improve sync time for Solana Agave clients in the Asia-Pacific Regions, and introduced a way to optimize data transfer costs for Solana RPC nodes. Test these optimizations on your own node or use the Solana blueprint for the AWS Blockchain Node Runners initiative. If you have further questions, feel free to ask on AWS re:Post with the tag “blockchain,” participate in discussions on Solana StackExchange, or reach out to the Solana community.

AWS Web3 Blog

Improve Solana node performance and reduce costs on AWS

Changes in Solana Agave v2.x

Overcoming challenges with synchronizing Solana in Asia-Pacific AWS Regions

Optimizing data transfer costs for Agave RPC nodes

Configuring traffic shaping for Agave RPC nodes

Conclusion

About the authors

Learn

Resources

Developers

Help