Overview
The Google BigQuery Connector for AWS Glue simplifies connecting AWS Glue jobs to BigQuery, both to extract data from it and to load data into it. This connector provides comprehensive access to BigQuery data, facilitating cloud ETL processes for operational reporting, backup and disaster recovery, data governance, and more.
Highlights
- Connect to Google BigQuery from AWS Glue jobs
- Simplify data extracts from Google BigQuery
- Simplify data loads to Google BigQuery
Details
Pricing
Vendor refund policy
No Refunds
Delivery details
Glue 3.0
- Amazon ECS
- Amazon EKS
Container image
Containers are lightweight, portable execution environments that wrap server application software in a filesystem that includes everything it needs to run. Container applications run on supported container runtimes and orchestration services, such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). Both eliminate the need for you to install and operate your own container orchestration software by managing and scheduling containers on a scalable cluster of virtual machines.
Version release notes
Google BigQuery Connector for AWS Glue 0.24.2.
- This version is built with spark-bigquery-connector 0.24.2.
- This version is compatible with AWS Glue 3.0, 2.0 and 1.0.
- This version supports both read from and write into Google BigQuery.
Additional details
Usage instructions
Subscribe to the product from AWS Marketplace and activate the Glue connector from AWS Glue Studio.
Prerequisites
- An account in Google Cloud, specifically a service account that has permissions to Google BigQuery
- GCP credentials (service_account_json_file)
- GCS bucket (only for writes)
- BigQuery dataset (only for writes)
- AWS Secrets Manager secret (you can create the secret in the following steps)
Create a new secret for Google BigQuery in AWS Secrets Manager
We create a secret in AWS Secrets Manager to store the Google service account file contents as a base64-encoded string.
1. Download the service account credentials JSON file from Google Cloud.
2. Base64-encode the file contents. You can use one of the online utilities or a system command to do that. On Linux and Mac, you can run base64 [service_account_json_file] to print the file contents as a base64-encoded string.
3. On the Secrets Manager console, choose Store a new secret.
4. For Secret type, select Other type of secret.
5. Enter your key as credentials and the value as the base64-encoded string.
6. Leave the rest of the options at their default.
7. Choose Next.
8. Name the secret bigquery_credentials.
9. Follow through the rest of the steps to store the secret.
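The encoding step, and optionally the secret creation, can also be scripted. The following is a minimal Python sketch, not part of the vendor instructions: the function names are hypothetical, and store_secret assumes that boto3 is installed and that your AWS credentials allow creating secrets in Secrets Manager.

```python
import base64
import json
from pathlib import Path


def build_secret_string(service_account_json_file: str) -> str:
    """Return the secret body: {"credentials": "<base64 of the key file>"}."""
    raw = Path(service_account_json_file).read_bytes()
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"credentials": encoded})


def store_secret(secret_string: str, name: str = "bigquery_credentials") -> None:
    """Create the secret (assumes boto3 and suitable AWS permissions)."""
    import boto3  # imported lazily so the encoding step works without boto3

    client = boto3.client("secretsmanager")
    client.create_secret(Name=name, SecretString=secret_string)
```

Calling build_secret_string on the downloaded key file yields the same key/value pair that the console steps enter by hand.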
Connection options
You can pass the following options to the connector.
- parentProject (required): The Google Cloud Project ID of the table.
- dataset (optional unless omitted in table): The BigQuery dataset containing the table.
- table (required): The BigQuery table in the format [[project:]dataset.]table.
- temporaryGcsBucket (optional; required for writes): The GCS bucket that temporarily holds the data before it is loaded into BigQuery.
You can see other available options here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/0.24.2
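As an illustration of how these options fit together, here is a minimal Python sketch. The helper names and the Glue connection name bigquery-connection are hypothetical. The connection_type value marketplace.spark and the connectionName option follow the general pattern for AWS Marketplace connectors in Glue and are assumptions here, since this page does not state them.

```python
from typing import Dict, Optional


def build_connection_options(
    parent_project: str,
    table: str,
    dataset: Optional[str] = None,
    temporary_gcs_bucket: Optional[str] = None,
    connection_name: str = "bigquery-connection",  # hypothetical name
) -> Dict[str, str]:
    """Assemble connector options, checking the dataset/table rule."""
    # Drop an optional "project:" prefix, then look for "dataset.table".
    table_part = table.rsplit(":", 1)[-1]
    if "." not in table_part and dataset is None:
        raise ValueError("dataset is required when table does not include it")
    options = {
        "parentProject": parent_project,
        "table": table,
        "connectionName": connection_name,
    }
    if dataset is not None:
        options["dataset"] = dataset
    if temporary_gcs_bucket is not None:
        options["temporaryGcsBucket"] = temporary_gcs_bucket
    return options


def read_table(glue_context, options: Dict[str, str]):
    """Read a BigQuery table as a DynamicFrame; runs only inside a Glue job."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",  # assumed connection type
        connection_options=options,
    )
```

Inside a Glue job, read_table(glueContext, build_connection_options("my-project", "my_dataset.my_table")) would request the table, where the project and table names are placeholders.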
Spark configurations
The following Spark configurations are required only for writes into BigQuery.
- spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- spark.hadoop.fs.gs.auth.service.account.enable=true
You also need to configure credentials using one of the following sets of configurations.
Credential file
- spark.hadoop.fs.gs.auth.service.account.json.keyfile=credentials.json
Upload credentials.json to your S3 bucket, and set its path in Referenced files path.
Private key
- spark.hadoop.fs.gs.auth.service.account.email=[your-email-extracted-from-service_account_json_file]
- spark.hadoop.fs.gs.auth.service.account.private.key.id=[your-private-key-id-extracted-from-service_account_json_file]
- spark.hadoop.fs.gs.auth.service.account.private.key=[your-private-key-body-extracted-from-service_account_json_file]
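The three private key values can be read from the service account JSON file programmatically. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and it relies on the client_email, private_key_id, and private_key fields that Google Cloud service account key files contain.

```python
import json
from typing import Dict

PREFIX = "spark.hadoop.fs.gs.auth.service.account."


def private_key_confs(service_account_json: str) -> Dict[str, str]:
    """Map fields of the key file contents to the three Spark settings."""
    info = json.loads(service_account_json)
    return {
        PREFIX + "email": info["client_email"],
        PREFIX + "private.key.id": info["private_key_id"],
        PREFIX + "private.key": info["private_key"],
    }
```

The returned dictionary can be applied to a SparkConf with one conf.set(key, value) call per entry.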
You can set these Spark configurations in one of the following ways.
- The --conf parameter in the Glue job parameters
- The job script, using SparkConf:
    from pyspark.conf import SparkConf
    conf = SparkConf()
    # Register the GCS filesystem implementation for gs:// paths
    conf.set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    # Enable service account authentication and point to the key file
    conf.set("spark.hadoop.fs.gs.auth.service.account.enable", "true")
    conf.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "credentials.json")
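For the job parameter route, a Glue job accepts only one --conf key, so several Spark settings are usually chained inside its value. The following minimal sketch builds such a value. The function name is hypothetical, and the chaining pattern is an assumption taken from general AWS Glue usage rather than from this page.

```python
from typing import Dict


def glue_conf_parameter(confs: Dict[str, str]) -> str:
    """Join Spark settings into one value for the Glue job parameter --conf."""
    pairs = [f"{key}={value}" for key, value in confs.items()]
    # Every setting after the first is introduced by its own " --conf ".
    return " --conf ".join(pairs)


value = glue_conf_parameter({
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.fs.gs.auth.service.account.enable": "true",
})
print(value)
```

The printed string is what you would enter as the value of the --conf key in the job parameters.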
Support
Vendor support
Please allow 24 hours
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.
Customer reviews
Aws
This is the coolest product ever and it's so useful, and really amazing I appreciate it, so have, it guys
No-fuss connectivity to BigQuery from AWS Glue
- We use it to bring in GA data amounting to multiple gigabytes and process it using PySpark in AWS Glue