How do I decrease the load time for Neptune Bulk Loader?
Last updated: 2020-10-09
I'm trying to load data into my Amazon Neptune cluster using the Neptune Bulk Loader. However, the process is taking more time than I expected. How do I decrease the load time?
Scale up the writer instance to a larger instance class before bulk loading. Writers and readers can use different instance classes, so you can scale only the writer node in the Neptune cluster. When the bulk load is complete, you can scale the writer instance back down.
If you aren't performing any other operations during the load, set the parallelism parameter to OVERSUBSCRIBE. This setting tells Neptune Bulk Loader to use all available resources, which decreases the overall load time.
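As a sketch, a bulk load request with OVERSUBSCRIBE parallelism might look like the following. The S3 source, IAM role ARN, and region are placeholders; substitute your own values:

```python
import json

# Build the Neptune Bulk Loader request payload.
# The S3 source, IAM role ARN, and region below are placeholders.
payload = {
    "source": "s3://your-bucket/load-data/",
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
    # OVERSUBSCRIBE lets the loader use all available resources on the
    # writer instance, which decreases the overall load time when no
    # other write operations are running.
    "parallelism": "OVERSUBSCRIBE",
}

print(json.dumps(payload, indent=2))
# POST this payload to https://<your-neptune-endpoint>:8182/loader
# (for example, with curl or the requests library).
```
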
File size and request throughput
It's a best practice to use larger bulk load files (such as CSV files) in your Amazon Simple Storage Service (Amazon S3) bucket. Doing so lets the Bulk Load API break up those larger files and manage concurrency for you. If you have multiple files, consider one of the following approaches:
- Set the S3 source as a folder: The Bulk Load API automatically starts with the vertex files, and then loads the edge files afterwards.
- Split the data across multiple files: Neptune can load files in parallel, so loading from multiple files can speed up the load process.
- Compress the files: The Bulk Load API supports gzip files, which can reduce the overhead of fetching larger files from your S3 bucket.
- Use the queueRequest and dependencies request parameters: If the queueRequest parameter is set to "TRUE", Neptune queues multiple load jobs instead of rejecting new requests while one is running. You can set up multiple levels of dependency, so that the failure of one job causes the jobs that depend on it to fail. Setting up your dependencies this way can also prevent inconsistencies in your data.
For example, suppose that Job-A and Job-B are independent of each other, and Job-C must not begin until both finish. Submit Job-A and Job-B one after the other (in either order), and save their load IDs. Then, submit Job-C with the load IDs of the two jobs in the dependencies field:
"dependencies" : ["Job_A_loadID", "Job_B_loadID"]
If either Job-A or Job-B fails, then Job-C doesn't run, and its status is set to LOAD_FAILED_BECAUSE_DEPENDENCY_NOT_SATISFIED.
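The three submissions above can be sketched as follows. The bucket, role ARN, and load IDs are hypothetical (in practice, each POST to the loader endpoint returns a loadId in its response):

```python
import json

def build_load_request(source, dependencies=None):
    """Build a Bulk Loader request payload for one S3 source.
    The bucket, role ARN, and region are placeholders."""
    payload = {
        "source": source,
        "format": "csv",
        "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
        "region": "us-east-1",
        # Queue this job if another load is already running.
        "queueRequest": "TRUE",
    }
    if dependencies:
        # This job runs only after every listed load ID succeeds.
        payload["dependencies"] = dependencies
    return payload

# Submit Job-A and Job-B first and save their load IDs
# (the IDs below are hypothetical).
job_a = build_load_request("s3://your-bucket/job-a/")
job_b = build_load_request("s3://your-bucket/job-b/")
job_a_load_id = "0a1b2c3d-example-load-id-a"   # from the Job-A response
job_b_load_id = "4e5f6a7b-example-load-id-b"   # from the Job-B response

# Job-C runs only if both Job-A and Job-B succeed.
job_c = build_load_request(
    "s3://your-bucket/job-c/",
    dependencies=[job_a_load_id, job_b_load_id],
)
print(json.dumps(job_c, indent=2))
```
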
- Convert your data into a supported load data format.
- Use updateSingleCardinalityProperties to update existing values. By default, if a vertex for a Person already has a phone number and you load a new one, Neptune Bulk Loader treats the new value as an error and rejects the change. However, if you set updateSingleCardinalityProperties to "TRUE", then Neptune Bulk Loader replaces the stored value for that vertex property.
Important: The CSV file headers must declare the property as single cardinality. The updateSingleCardinalityProperties parameter applies only to single-cardinality (single-valued) properties; otherwise, it has no effect. For more information, see the Specifying the cardinality of a column section in Property column headers.
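As a minimal sketch, the CSV header below declares phone as a single-cardinality property, and the loader request sets the flag so that a reload replaces the stored value (the column names, values, and bucket are illustrative):

```python
import csv
import io
import json

# A vertex CSV whose header declares "phone" as single cardinality.
# (The column names and values here are illustrative.)
vertex_csv = io.StringIO(
    "~id,~label,phone:String(single)\n"
    "p1,Person,555-0100\n"
)
rows = list(csv.DictReader(vertex_csv))
print(rows[0]["phone:String(single)"])   # 555-0100

# With this flag set to "TRUE", reloading p1 with a different phone
# number replaces the stored value instead of being rejected as an error.
payload = {
    "source": "s3://your-bucket/vertices.csv",   # placeholder
    "format": "csv",
    "updateSingleCardinalityProperties": "TRUE",
}
print(json.dumps(payload))
```
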
- Remove any duplicates or known errors from the bulk load files before the start of a bulk load.
- Reduce the number of unique predicates (such as properties of edges and vertices).
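One of the tips above is to compress your load files. A minimal sketch of gzip-compressing a CSV file before uploading it to S3 (the file name and contents are illustrative):

```python
import gzip
import os
import tempfile

# Generate a small, repetitive vertex CSV for illustration.
rows = "~id,~label\n" + "\n".join(f"v{i},Person" for i in range(1000))

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "vertices.csv")
    gz_path = raw_path + ".gz"
    with open(raw_path, "w") as f:
        f.write(rows)
    # The Bulk Load API accepts gzip-compressed files, which reduces
    # the overhead of fetching larger files from S3.
    with open(raw_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())
    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)
    print(f"{raw_size} bytes -> {gz_size} bytes compressed")
```
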
Load job behavior
You can use the failOnError parameter to control whether a bulk load operation continues after it encounters an error. Or, you can set the mode parameter to RESUME so that your load job reloads only the files that the previous load didn't process successfully. For more information about the mode and failOnError parameters, see Request parameters.
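A sketch of a resume request combining both parameters (the bucket, role ARN, and region are placeholders):

```python
import json

# Resume a load that previously failed part way through. In RESUME mode,
# Neptune reloads only the files that didn't load successfully.
# The S3 source, IAM role ARN, and region below are placeholders.
payload = {
    "source": "s3://your-bucket/load-data/",
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
    # Stop the load on the first error instead of skipping bad records.
    "failOnError": "TRUE",
    "mode": "RESUME",
}
print(json.dumps(payload, indent=2))
```
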
Troubleshooting failed Bulk Loader jobs
You can use the Get-Status API to get more information about a particular load job. Use the following cURL command to check the status of the load and look for any errors in Neptune Bulk Loader:
curl -s "https://neptune:8182/loader/<loaderid>?details=true&errors=true"
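The response can then be inspected for failures. The trimmed example below assumes an illustrative response shape with an overall status and an error log list; the field values shown are made up:

```python
import json

# A trimmed, illustrative Get-Status response, as returned when the
# request includes ?details=true&errors=true (values are made up).
response_body = json.loads("""
{
  "status": "200 OK",
  "payload": {
    "overallStatus": {"status": "LOAD_FAILED", "totalRecords": 10000},
    "errors": {"errorLogs": [
      {"errorCode": "PARSING_ERROR", "errorMessage": "bad record", "recordNum": 42}
    ]}
  }
}
""")

# Pull out the overall status and any logged errors.
status = response_body["payload"]["overallStatus"]["status"]
error_logs = response_body["payload"].get("errors", {}).get("errorLogs", [])
if status != "LOAD_COMPLETED":
    for err in error_logs:
        print(err["errorCode"], "-", err["errorMessage"])
```
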
For more information about troubleshooting Neptune Bulk Loader, see How do I resolve processing errors in Amazon Neptune Bulk Loader?