Upload Data from Google Shared Drive to BigQuery: Python Connector

Ignacy Janiszewski
5 min read · Jun 8, 2021
Google BigQuery

Moving your data from Cloud Storage or Campaign Manager to BigQuery is easy-peasy, because it is supported by BigQuery. But what if you want to transfer your data from, e.g. Google Shared Drive?

In this article I will try to show you how to move data to BQ from:

  • Google Storage (and why it’s easy)
  • Google Shared Drive (and why it’s more complicated), and finally
  • Google Shared Drive with specific requirements while loading (e.g. files with unusual encoding)

Use case: move CSVs from Storage to Big Query.

Things to do:
→ move CSVs that are stored in year folders

→ skip some unwanted tables by passing a tables_to_skip list
→ check that the number of rows in each file matches the number of rows uploaded to BQ

A couple of the more important code parts:

  1. Create BQ Load Job Config (your parameters to properly ingest the data from Storage)
Create BQ Load Job Config
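The original gist is not embedded here, but a minimal sketch of such a config could look like this (the schema, delimiter and write disposition are illustrative assumptions, not the exact values from the repo):

```python
from google.cloud import bigquery

# A minimal CSV load config; the schema and options below are illustrative.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    field_delimiter=",",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema=[
        bigquery.SchemaField("transaction_id", "STRING"),
        bigquery.SchemaField("amount", "FLOAT"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
    ],
)
```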

2. Load data from Storage

Load all matching files from Storage to Big Query
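A rough sketch of that step, assuming a bucket laid out in year folders and a tables_to_skip list (the bucket name and skip logic are placeholders, not the code from the repo):

```python
from google.cloud import storage

storage_client = storage.Client()
tables_to_skip = ["ignored_table"]  # tables you don't want in BQ

# Walk gs://my-bucket/<year>/... and collect CSV blobs not on the skip list.
blobs_to_load = []
for blob in storage_client.list_blobs("my-bucket"):
    if not blob.name.endswith(".csv"):
        continue
    table_name = blob.name.split("/")[-1].replace(".csv", "")
    if table_name in tables_to_skip:
        continue
    blobs_to_load.append(blob)
```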

3. Upload data to Big Query

Load file to BQ based on Storage Blob URI
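A minimal sketch of the load step: build the blob URI, run the load job, and use the job's output_rows as one way to do the row-count check (the function name and error handling are my own illustration):

```python
from google.cloud import bigquery

bq_client = bigquery.Client()

def load_blob_to_bq(blob, table_id, job_config, expected_rows):
    """Load a single Storage blob into BQ and verify the uploaded row count."""
    uri = f"gs://{blob.bucket.name}/{blob.name}"
    load_job = bq_client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the job to finish

    if load_job.output_rows != expected_rows:
        raise ValueError(
            f"Row count mismatch for {uri}: {load_job.output_rows} != {expected_rows}"
        )
```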

You can find the whole code in my GitHub repository:
(Please remember to add a .env file with the path to your Google credentials JSON, i.e. GOOGLE_APPLICATION_CREDENTIALS.)

As I told you → easy-peasy.

The problem is when you want to move your data from, e.g. Google Shared Drive, which is “not currently supported”:

https://cloud.google.com/bigquery/docs/loading-data

You can query a file on Drive as an external table in BQ:

CREATE TABLE my_dataset.my_non_external_table 
AS SELECT * FROM my_dataset.my_external_table
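The SQL above only materializes an external table that must already be defined. As a hedged sketch, such a Drive-backed external table could also be created from Python with the BigQuery client (the project, dataset, table name and file ID below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Define an external table that points at a CSV stored on Drive.
# Note: the client's credentials need the Drive scope to read the file.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["https://drive.google.com/open?id=YOUR_FILE_ID"]
external_config.autodetect = True

table = bigquery.Table("my_project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table)
```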

But in such a scenario you don’t have much control over the data format while loading. For example, I had a problem with an ID column: its values look like integers (e.g. 067872), but they are transaction IDs and should be strings. Unfortunately, BQ treats them as INT and eats the leading 0. You can explicitly define the schema, but only afterwards (while uploading to the BQ table), not before loading.

At first, I tried the approach described in this blog post, suggested in the answer to my question:

BQ: Query external table

But then a second problem occurred: BQ allows loading CSVs in only two encodings:
→ UTF-8
→ ISO-8859-1

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#csv-options

One of the files I need to transfer is encoded in CP1250.

So if your files have the proper encoding, you can load them as external tables and skip the rest of the article. If you want full control over the data you are loading, please continue.

So I posted a new question on Stack Overflow, and in the meantime I decided:

I need to create my own Drive → BQ connector.

Use case: Move to Big Query all CSVs uploaded to Google Shared Drive

Assumptions:
→ only files uploaded (or modified) since yesterday
→ only files in a given list of folders
→ without ‘trashed’ files (which is not so obvious: a ‘trashed’ file can still have been uploaded yesterday and sit in a monitored folder)
→ the schema of a given CSV and of the corresponding BQ table must always be the same

I encourage you to get familiar with Introduction to Google Drive API.

  1. First, you need to prepare your config file: which folders you want to monitor (all_folders) and the schema of your CSV files:
Google Shared Drive Connector — configs
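Since the config gist is not shown here, a minimal sketch of what it could contain (the folder names, columns and CSV options are purely illustrative):

```python
# Folders on the shared drive that should be monitored.
all_folders = ["transactions", "customers", "invoices"]

# Per-folder CSV reading options, passed later to pandas.read_csv.
csv_configs = {
    "transactions": {
        "encoding": "cp1250",
        "delimiter": ";",
        "decimal": ",",
        "skiprows": 0,
        "parse_dates": ["created_at"],
        "dayfirst": True,
        "dtype": {"transaction_id": str, "amount": str},
    },
}
```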

You will use it while loading a CSV from Drive into a pandas dataframe, so any settings such as encoding, skiprows, parse_dates, dayfirst, dtype, delimiter or decimal are more than welcome.

2. Then, some initialization. Get credentials (with scopes for Drive and BigQuery) based on GOOGLE_APPLICATION_CREDENTIALS to connect to the Drive and BQ APIs. Build a Python representation of the Drive API (self.service), a BQ client, and initialize a SendMail class (my solution, but you can obviously alert any way you like) to send email alerts about completed uploads and errors.

Some initializations
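A rough sketch of that initialization, assuming google.auth picks up GOOGLE_APPLICATION_CREDENTIALS; the class and attribute names are illustrative, and the SendMail part is left out as it is specific to my setup:

```python
import google.auth
from google.cloud import bigquery
from googleapiclient.discovery import build

SCOPES = [
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/bigquery",
]

class DriveToBQConnector:
    def __init__(self, shared_drive_name):
        # GOOGLE_APPLICATION_CREDENTIALS must point to the service account JSON
        # (e.g. loaded from the .env file).
        credentials, project = google.auth.default(scopes=SCOPES)
        self.service = build("drive", "v3", credentials=credentials)
        self.bq_client = bigquery.Client(project=project, credentials=credentials)
        self.shared_drive_name = shared_drive_name
        # self.mailer = SendMail(...)  # email alerting, as described above
```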

3. Get the shared drive ID based on a given string (self.shared_drive_name) via Search for files and folders. It will be necessary to search for files.

Get Shared Drive ID
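A minimal sketch of that lookup with the Drive API drives().list method (the function name is my own):

```python
def get_shared_drive_id(service, shared_drive_name):
    """Find the ID of the shared drive whose name matches shared_drive_name."""
    response = service.drives().list(pageSize=100).execute()
    for drive in response.get("drives", []):
        if drive["name"] == shared_drive_name:
            return drive["id"]
    raise ValueError(f"Shared drive '{shared_drive_name}' not found")
```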

4. Get the folder IDs; they will be necessary to select files only from the wanted folders.
Fields are based on the Files reference.

Get folders IDs from Drive
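A sketch of that step, assuming the all_folders list from the config (the function name and returned shape are my own choices):

```python
def get_monitored_folders(service, drive_id, all_folders):
    """Return id/name pairs of the monitored folders on the shared drive."""
    response = service.files().list(
        q="mimeType = 'application/vnd.google-apps.folder'",
        corpora="drive",
        driveId=drive_id,
        includeItemsFromAllDrives=True,
        supportsAllDrives=True,
        fields="files(id, name)",
    ).execute()
    return [f for f in response.get("files", []) if f["name"] in all_folders]
```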

5. Create a dict with folder ID (from Drive) to folder name mapping. It will look like this:

folders_dict = {'folder_id': 'folder_name'}
Get folders dict based on IDs and names
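Reusing the folder id/name pairs from the previous step, the mapping is a simple dict comprehension (the IDs below are placeholders):

```python
# Example output of the previous files().list call.
folders = [
    {"id": "1AbCdEfG", "name": "transactions"},
    {"id": "2HiJkLmN", "name": "customers"},
]

folders_dict = {folder["id"]: folder["name"] for folder in folders}
# -> {'1AbCdEfG': 'transactions', '2HiJkLmN': 'customers'}
```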

6. Get item IDs. Create a file search query only for files uploaded/modified since yesterday and only for CSVs; there is also, for example, the possibility to search only for files in a particular folder (by passing folder_id).

Get file IDs from Drive
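A sketch of the search query under those assumptions (the mimeType and the "since yesterday" cut-off are built exactly as described above; the function name is my own):

```python
from datetime import datetime, timedelta

def get_recent_csv_files(service, drive_id, folder_id=None):
    """Search for CSV files uploaded or modified since yesterday."""
    yesterday = (datetime.utcnow() - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%S")
    query = f"mimeType = 'text/csv' and modifiedTime >= '{yesterday}'"
    if folder_id:
        # Optionally restrict the search to a single folder.
        query += f" and '{folder_id}' in parents"
    response = service.files().list(
        q=query,
        corpora="drive",
        driveId=drive_id,
        includeItemsFromAllDrives=True,
        supportsAllDrives=True,
        fields="files(id, name, parents, trashed)",
    ).execute()
    return response.get("files", [])
```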

7. Then iterate through all found files.

Iterate through all matched files

Check that the file is not trashed and that its folder (parent) is in our list of wanted folders. If your query is stricter than mine, you can skip such validation; for example, if you define the parent ID in the query, checking whether the file has the proper parent is unnecessary. I want to iterate through almost all files on the Drive, so for me this is the best option.

Next, replace spaces in folder names with “_” (BQ does not allow table names with white spaces), check whether the CSV is not already uploaded to BQ, and then start the upload process based on file_id. At the end, send a confirmation mail about all successfully uploaded files, as in the sketch below.
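A sketch of that loop, assuming the files list and folders_dict from the previous steps; is_already_uploaded and upload_file are hypothetical helpers standing in for the corresponding parts of the connector:

```python
def is_already_uploaded(file_name, table_name):
    """Hypothetical helper: check in BQ whether file_name was already loaded."""
    return False

def upload_file(file_id, table_name):
    """Hypothetical helper: download the CSV, clean it and load it to BQ."""

uploaded = []
for item in files:
    # Skip files sitting in the trash, even though they were modified
    # yesterday and live in a monitored folder.
    if item.get("trashed"):
        continue
    # Keep only files whose parent folder is one of the monitored folders.
    parent_id = (item.get("parents") or [None])[0]
    if parent_id not in folders_dict:
        continue
    # BQ does not allow white spaces in table names.
    table_name = folders_dict[parent_id].replace(" ", "_")
    if is_already_uploaded(item["name"], table_name):
        continue
    upload_file(item["id"], table_name)
    uploaded.append(item["name"])

# send_confirmation_mail(uploaded)  # alerting step, specific to my setup
```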

8. Next, we need to download the CSV into a pandas dataframe, based on file_id.

Download pandas dataframe from CSV file_id

Download a file stored on Google Drive, then pass the params created from the config file.
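A minimal sketch of that download, assuming the csv_config dict from the config file carries the pandas read_csv options (encoding, delimiter, dtype, parse_dates, etc.):

```python
import io

import pandas as pd
from googleapiclient.http import MediaIoBaseDownload

def download_dataframe(service, file_id, csv_config):
    """Download a CSV from Drive into memory and parse it with pandas."""
    request = service.files().get_media(fileId=file_id, supportsAllDrives=True)
    buffer = io.BytesIO()
    downloader = MediaIoBaseDownload(buffer, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()
    buffer.seek(0)
    # csv_config carries encoding, delimiter, dtype, parse_dates, ...
    return pd.read_csv(buffer, **csv_config)
```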

9. Then some cleaning (preparing) of the dataframe.

Prepare dataframe

What is important here is that sometimes float values are saved as strings, with commas instead of dots and with spaces as thousands separators (for example 7 931,29), which makes them difficult to load as floats. We need to load them as strings and then convert them to float values. Also, I am adding the current timestamp (to know when the file was loaded to BQ) and file_name (to check whether this file is already uploaded to BQ). You should do your own dataframe cleaning here.
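A sketch of such a cleaning step; the float_columns parameter and the audit column names are illustrative, not the ones from the repo:

```python
from datetime import datetime

def prepare_dataframe(df, file_name, float_columns):
    """Clean the dataframe before loading it to BQ (illustrative)."""
    # Values like "7 931,29" arrive as strings: drop the thousands separator
    # (a space) and swap the decimal comma for a dot, then cast to float.
    for col in float_columns:
        df[col] = (
            df[col]
            .astype(str)
            .str.replace(" ", "", regex=False)
            .str.replace(",", ".", regex=False)
            .astype(float)
        )
    # Audit columns: when the file was loaded and which file it came from.
    df["loaded_at"] = datetime.utcnow()
    df["file_name"] = file_name
    return df
```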

10. Then create schema for uploading table, based on config file.

Create schema based on types from config file

We need to map Python types to BQ types and handle date columns. Then, based on the types passed while loading the dataframe, we create the schema used to load the table into BigQuery.
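A minimal sketch of that mapping, assuming the dtype and date-column definitions come from the config file; the type map below is my own simplification:

```python
from google.cloud import bigquery

# Map the Python types from the config file to BQ column types.
TYPE_MAP = {str: "STRING", float: "FLOAT", int: "INTEGER"}

def build_schema(dtype_config, date_columns):
    """Build a BQ schema from the types defined in the config file."""
    schema = [
        bigquery.SchemaField(column, TYPE_MAP[py_type])
        for column, py_type in dtype_config.items()
    ]
    schema += [bigquery.SchemaField(column, "TIMESTAMP") for column in date_columns]
    return schema
```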

11. Finally, upload it to BigQuery.

Upload cleaned dataframe to Big Query

Be kind to the BQ API and do not send too many requests; that’s why a counter is implemented. The BQ job config is the same as in the Storage connector.
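The upload step could be sketched roughly like this; the throttling threshold and sleep time are placeholder values of my own, not the ones from the repo:

```python
import time

from google.cloud import bigquery

def upload_dataframe(bq_client, df, table_id, schema, counter):
    """Load the cleaned dataframe into BQ, throttling requests with a counter."""
    job_config = bigquery.LoadJobConfig(
        schema=schema,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = bq_client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish

    # Be kind to the BQ API: pause after every few uploads
    # (threshold and sleep time are arbitrary here).
    counter += 1
    if counter % 5 == 0:
        time.sleep(10)
    return counter
```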

And voilà!

I know there is room for optimization in my code; I just wanted to show you the way. Hopefully you can use part of it in your own handmade connector, and maybe I will save you some time :)

GitHub repository with all the code.

Cheers!
