Overview
The TPC-DS Glue connector enables Glue ETL jobs to generate TPC-DS compliant datasets at the scale you choose. The generated datasets can be used for benchmarking in AWS Glue jobs, Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and other services.
Highlights
- Generate randomized, TPC-DS compliant datasets directly in AWS Glue jobs through the connector.
Details
Pricing
Vendor refund policy
We do not currently support refunds (you can cancel at any time)
Delivery details
Glue 3.0
- Amazon ECS
- Amazon EKS
Container image
Containers are lightweight, portable execution environments that wrap server application software in a filesystem that includes everything it needs to run. Container applications run on supported container runtimes and orchestration services, such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). Both eliminate the need for you to install and operate your own container orchestration software by managing and scheduling containers on a scalable cluster of virtual machines.
Version release notes
TPC-DS data generator for AWS Glue.
Additional details
Usage instructions
Subscribe to the product in AWS Marketplace, then activate the Glue connector for Glue 3.0 from Glue Studio.
What is the TPC-DS connector for AWS Glue?
This connector generates TPC-DS compliant datasets. You don't need any data source to generate them. After the Glue ETL job writes the datasets to a resource such as Amazon S3, you can use them for any benchmarking purpose in your workload.
See http://www.tpc.org/tpcds/ for an overview of TPC-DS and http://tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3.2.0.pdf for the TPC-DS specification.
Connector options you need to set
You can pass the following options to the connector.
- table (required) - The name of the table to generate. See https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-tpcds for the list of 25 tables you can choose from.
- scale (optional, default is 1) - The size of the generated data. The possible range is 1 to 100000. A scale of 1 means that the data for all tables together is generated at 1 GB. For example, a scale of 7 generates 7 GB of data across all tables. See section 3 of http://tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3.2.0.pdf for a description of the scale factor.
- numPartitions (optional, default is 1) - The maximum number of partitions used to generate table data in parallel. Set a value greater than 1 to generate data concurrently with Spark.
- We recommend setting numPartitions based on the "Number of Workers" of your ETL job. The formulas below apply to Glue 2.0 and 3.0 and depend on the "Worker type" of your job:
- G.1X - numPartitions = (Number of Workers - 1) * 4
- G.2X - numPartitions = (Number of Workers - 1) * 8
- For example, if your Glue 3.0 job uses the G.1X worker type with 10 workers, numPartitions is (10 - 1) * 4 = 36.
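The worker-type formulas above can be sketched as a small helper. This is an illustrative sketch only: the function name recommended_num_partitions and the choice to never return less than 1 (the connector's stated default) are assumptions, not part of the connector.

```python
# Multipliers taken from the formulas in this listing (Glue 2.0 and 3.0).
PARTITIONS_PER_WORKER = {"G.1X": 4, "G.2X": 8}


def recommended_num_partitions(worker_type: str, number_of_workers: int) -> int:
    """Return (Number of Workers - 1) * multiplier for the given worker type.

    Hypothetical helper: only G.1X and G.2X are covered because those are
    the only worker types this listing gives a formula for.
    """
    if worker_type not in PARTITIONS_PER_WORKER:
        raise ValueError(f"No formula documented for worker type {worker_type!r}")
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be at least 1")
    computed = (number_of_workers - 1) * PARTITIONS_PER_WORKER[worker_type]
    # Assumption: fall back to the documented default of 1 rather than 0.
    return max(1, computed)


if __name__ == "__main__":
    print(recommended_num_partitions("G.1X", 10))  # (10 - 1) * 4
    print(recommended_num_partitions("G.2X", 10))  # (10 - 1) * 8
```

The result is the value you would enter for the numPartitions connection option.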
You can set up the connector in AWS Glue Studio by following the steps below.
Using the TPC-DS connector for AWS Glue
The setup steps for using the TPC-DS connector are:
- Set up the TPC-DS custom connector and a related connection in the Glue Studio console.
- Create a job, then set the connector options and any necessary job parameters.
- Save and run the job.
Step 1: Set up the TPC-DS connector and create a relevant connection
To set up the TPC-DS connector and create a connection for your job:
- Subscribe to the product and activate the connector using AWS Glue Studio from the top of this instruction page.
- Enter your connection name and choose "Create connection and activate connector". You can optionally add a description and "Network options". Leave "Connection access" empty.
Step 2: Create a job
To create a job from the connection you created in the previous step:
- Choose the connection and choose "Create job".
- Select the node for your connection on the visual canvas.
- Add connection options and enter the necessary information. The table option is required; scale and numPartitions are optional. For example: table = customer, scale = 10, numPartitions = 30
- Enter job properties in the "Job details" tab, and choose "Save".
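For readers who edit the job script instead of using the visual canvas, the options from the example above can be sketched in Python. This is a hedged sketch, not the script Glue Studio generates: the helper names, the connection name my-tpcds-connection, the "connectionName" key, the "marketplace.spark" connection type, and passing option values as strings are all assumptions based on how Marketplace connectors are commonly referenced in Glue scripts. Check them against the script in your own job.

```python
def build_connection_options(connection_name, table, scale=1, num_partitions=1):
    """Validate the documented options and return them as a dict.

    Hypothetical helper; the ranges and defaults mirror this listing.
    """
    if not table:
        raise ValueError("table is required")
    if not 1 <= scale <= 100000:
        raise ValueError("scale must be between 1 and 100000")
    if num_partitions < 1:
        raise ValueError("numPartitions must be at least 1")
    return {
        "connectionName": connection_name,  # assumed key for the Glue connection
        "table": table,
        "scale": str(scale),
        "numPartitions": str(num_partitions),
    }


def read_tpcds_table(glue_context, options):
    """Read one generated table as a DynamicFrame.

    Runs only inside a Glue job, where glue_context is a GlueContext.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",  # assumed type for Marketplace connectors
        connection_options=options,
        transformation_ctx="tpcds_source",
    )


if __name__ == "__main__":
    # Same values as the example: table = customer, scale = 10, numPartitions = 30
    print(build_connection_options("my-tpcds-connection", "customer", 10, 30))
```

Inside a Glue job you would pass the resulting dict to read_tpcds_table and then write the frame to a target such as Amazon S3.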
Step 3: Save and run the job
After filling in all parameters and creating the connector job, run the job.
Support
Vendor support
Please allow 24 hours
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.