There are many ways to load data into BigQuery (BQ), but using Python gives us the most control. Below I demonstrate loading files into BQ in parallel.

My test scenario: 191 CSV files, ranging from 10 MB to 1 GB each, 30 GB in total; this data translates to about 200 million records in BQ.

Obviously, loading them sequentially would take a long time, so I considered multithreading first.

Multithreading

threading.Thread is used to spawn number_of_processes threads; each worker wp is started with wp.start(), then wp.join() waits for them all to finish. Note that logging is thread-safe, so it can be used safely inside the function bq_load_local.

In bq_load_local, load_table_from_file is used to load each CSV file into BQ.

As you can see, multithreading is quite easy to implement. Since my test scenario is more IO-bound than CPU-bound, this is probably the best solution; however, if you want to experiment with multiprocessing, below is the modified script.

Multiprocessing

In order to use logging in multiprocessing mode, the script has to be modified to route log records through a queue. A lock is also used because, with many processes running, it's likely a BQ rate limit will be hit: in my case, 5 operations per 10 seconds on a BQ table. Refer to BQ's quota documentation for details.

In general, multiprocessing is very similar to multithreading. Multiple processes obviously consume more memory, and at startup my machine hit 100% CPU for a short period, but during execution there was no bottleneck (my machine has a 4-core CPU and 16 GB of memory). I did notice a performance gain compared to multithreading; I think that's probably because my task was IO-intensive and memory/CPU were not the constraint.

A variant of the above uses the Pool class.

Use a GCP bucket

A different approach is to upload all files to GCP storage and load them from the bucket. This turns out to be quicker, as the disk IO now happens on the GCP side. Below is the demo code (based on the multiprocessing version).

Further enhancement

For my 30 GB of data (191 CSV files), multiprocessing took 30 minutes, multithreading took 40 minutes, and bucket loading took 20 minutes (excluding the time to upload files to the bucket).

Further performance improvement could come from distributing the work across multiple nodes (machines), for example with RQ or Celery.

Bigquery parallel loading