Tuesday, July 16, 2024

Pulling BigQuery Tables to S3 with PySpark and AWS Glue

 


Extracting data from Google BigQuery and storing it in Amazon S3 is a common task in data pipelines. This article guides you through creating a PySpark script within an AWS Glue job to achieve this.

Prerequisites:

  • An AWS account with access to AWS Glue and an S3 bucket for the output.
  • A Google Cloud Platform (GCP) project with BigQuery enabled.
  • IAM roles with necessary permissions in both AWS and GCP for data access.

Understanding the Workflow:

The PySpark script within the Glue job will perform the following tasks:

  1. Establish Connections: Configure the SparkSession with the credentials and settings needed to read from BigQuery and write to S3.
  2. Define Tables: Specify the BigQuery tables you want to extract data from.
  3. Read Data: Use the spark-bigquery-connector to read data from the chosen tables.
  4. Transform (Optional): You can optionally include data transformations within the script if needed.
  5. Write Data: Write the processed data to designated locations within your S3 bucket.

Building the PySpark Script:

Here's a breakdown of the PySpark script with explanations:

Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# 1. Establish Connections
# Inside a Glue job, the attached IAM role normally grants S3 access, so the
# explicit fs.s3a.* keys below are only needed when running outside Glue.
spark = SparkSession.builder \
    .appName("BigQueryToS3") \
    .config("spark.hadoop.fs.s3a.access.key", "<S3_ACCESS_KEY>") \
    .config("spark.hadoop.fs.s3a.secret.key", "<S3_SECRET_KEY>") \
    .getOrCreate()

# 2. Define BigQuery Tables (Replace with your dataset.table names)
bigquery_tables = ["dataset1.table1", "dataset2.table2"]

# 3. Read Data from BigQuery with the spark-bigquery-connector
for table in bigquery_tables:
    df = spark.read.format("bigquery") \
        .option("parentProject", "<GCP_PROJECT_ID>") \
        .option("credentialsFile", "<PATH_TO_SERVICE_ACCOUNT_JSON>") \
        .option("table", table) \
        .load()

    # 4. Optional Transformations (Example: Filtering)
    # df = df.filter(col("column_name") > 10)

    # 5. Write Data to S3, one prefix per table so tables don't overwrite each other
    df.write.format("parquet").mode("overwrite") \
        .save(f"s3://your-bucket/path/to/data/{table.replace('.', '/')}/")

spark.stop()

Explanation:

  • We import SparkSession for working with Spark DataFrames and the col function used in the optional transformation example.
  • The SparkSession builder sets the S3 credentials. When the script runs inside a Glue job, the job's IAM role normally grants S3 access, so replace <S3_ACCESS_KEY> and <S3_SECRET_KEY> only if you run the script outside Glue (a sketch of an alternative way to supply the BigQuery credentials follows this list).
  • bigquery_tables lists the BigQuery tables to extract. Replace the entries with your actual dataset and table names.
  • The loop reads each table with spark.read.format("bigquery"); the parentProject and credentialsFile options tell the spark-bigquery-connector which GCP project to bill and where to find the service-account key.
  • The optional transformations comment shows how rows could be filtered with the col function. You can add other transformations as needed.
  • Each DataFrame is written as Parquet to its own prefix under the S3 bucket, so one table's output does not overwrite another's. Replace "s3://your-bucket/path/to/data/" with your desired location.
  • The script concludes by stopping the SparkSession.
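
If you prefer not to ship a service-account key file with the job, the spark-bigquery-connector also accepts the key as a base64-encoded credentials option. Below is a minimal sketch, assuming the JSON key is stored in AWS Secrets Manager under the hypothetical secret name bigquery/service-account-key:

Python
import base64

import boto3
from pyspark.sql import SparkSession

# Hypothetical secret name holding the GCP service-account JSON key.
SECRET_ID = "bigquery/service-account-key"


def read_bigquery_table(spark: SparkSession, table: str):
    """Read one BigQuery table, passing credentials fetched from Secrets Manager."""
    secret = boto3.client("secretsmanager").get_secret_value(SecretId=SECRET_ID)
    # The connector accepts a base64-encoded service-account JSON via "credentials".
    creds_b64 = base64.b64encode(secret["SecretString"].encode("utf-8")).decode("utf-8")
    return (
        spark.read.format("bigquery")
        .option("parentProject", "<GCP_PROJECT_ID>")
        .option("credentials", creds_b64)
        .option("table", table)
        .load()
    )

With this approach, the Glue job's IAM role needs secretsmanager:GetSecretValue on that secret, and no key file has to be uploaded alongside the script.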


Setting Up the Glue Job:

  1. In AWS Glue, create a new job and choose Spark as the job type.
  2. Pick a recent Glue version with Python 3 and point the job at the S3 location of your script (or paste the code into the script editor).
  3. Make the spark-bigquery-connector available to the job, for example via the Google BigQuery Connector for AWS Glue from AWS Marketplace or by adding the connector JAR with the --extra-jars job parameter, and make sure the job's IAM role can write to the target S3 bucket.
  4. In the job parameters section, optionally pass arguments to your script (e.g., the table names to extract); see the sketch after this list.
  5. Save and run the Glue job.
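
Inside the script, any job arguments you configure in step 4 can be read with getResolvedOptions from awsglue.utils. Here is a minimal sketch, assuming two hypothetical arguments, --bq_tables (a comma-separated table list) and --s3_target (the output prefix):

Python
import sys

from awsglue.utils import getResolvedOptions

# Hypothetical arguments configured on the job:
#   --bq_tables  dataset1.table1,dataset2.table2
#   --s3_target  s3://your-bucket/path/to/data/
args = getResolvedOptions(sys.argv, ["JOB_NAME", "bq_tables", "s3_target"])

bigquery_tables = [t.strip() for t in args["bq_tables"].split(",")]
s3_target = args["s3_target"]

The rest of the script can then loop over bigquery_tables and write under s3_target instead of relying on hard-coded values.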

Running the Job and Monitoring:

Once your Glue job is configured, run it. AWS Glue will handle provisioning resources and executing the PySpark script. You can monitor the job's progress and logs within the Glue console.
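
If you prefer to trigger and monitor the job programmatically rather than through the console, boto3's Glue client can do both. A minimal sketch, assuming the job was saved under the hypothetical name BigQueryToS3:

Python
import time

import boto3

glue = boto3.client("glue")

# Start a run of the (hypothetically named) job.
run_id = glue.start_job_run(JobName="BigQueryToS3")["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="BigQueryToS3", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        print(f"Job run {run_id} finished with state {state}")
        break
    time.sleep(30)

The CloudWatch Logs for each run remain available from the Glue console for deeper debugging.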
