Introduction

We all are living in an Era where we are generating PBs of data every second. Data is the new gold, after which almost all of the enterprises are running after. Insights from data, which is structured, semi-structured and unstructured can change the business decisions completely. For Analysing, there were various tools available already such as Teradata, Hive, RedShift and many more. But there are various problems that occur with these tools, which makes them slower and unproductive. Smart minds at Google solved this problem by creating BigQuery. It is available on Google Cloud Platform, using which many pipelines are sorted. If you want to create pipelines using various Big Data Open Source Technologies on GCP you can follow our courses named as “Big Data Crash Course” and “Big Data for Architects”.

Learning Objectives

  1. Introduction to BigQuery
  2. Historical Background
  3. Why BigQuery?
  4. BigQuery Architecture
  5. Hive vs BigQuery
  6. How to access BigQuery?
  7. How to ingest data into BigQuery?
  8. ML on BigQuery

Introduction to BigQuery

BigQuery is Google’s fully managed, scalable, low cost data warehouse for analytics. It is serverless, which means users will be charged for the amount of data being processed. As there is no infrastructure to manage and no database administrators are required, one can easily focus on analysing data to gain meaningful insights using SQL.

Historical Background

Google BigQuery was released in 2010. It was made available in 2011 i.e. v1, which include: Non-standard SQL, Lot of limitations w.r.t. Joins, etc. But in 2016, v2 was released, which includes Standard SQL, DML(Data Modification Language), Query Planner, etc.

Why BigQuery?

  1. Serverless and fully managed
  2. Perform Data Analysis at internet scale i.e. can perform Data Analysis on Petabytes of Data in a few seconds.
  3. It is a Columnar Database or Query Engine which is inspired from Google Dremel White Paper.
  4. It can store data upto ExaByte scala i.e. it is highly scalable.
  5. Can perform Machine Learning, Predictive Analysis, etc.
  6. Geo Spatial or GIS(Geographical Information System) queries are also supported.
  7. It can easily integrate with Data Visualization Tools such as Data Studio, Looker, etc.
  8. It is highly resilient(fault tolerant) and highly available.

BigQuery Architecture

On a very high level BigQuery has two components i.e. Storage and Computation.

Storage: It is based on the Colossus File System, which is Google’s newest File System. Durability in this file system is very high. Data stored here is replicated and distributed with Durability of about 99.9999999999%

Computation: It is inspired from Google's Dremel White Paper. It works on the data stored using a Distributed memory shuffle tier.

Other Component(s):

  • PetaBit Network: It is powerful network, which helps to enable decoupling among storage and computation i.e. You will have a different storage cluster and different computational cluster.

BigQuery has Distributed Storage from which data is fetched and stored by various workers. Workers basically fetched the data, then they shuffled it and finally aggregated it. It is somewhat similar to the MapReduce process, but not exactly MapReduce.

Hive v/s BigQuery


How to access BigQuery?

There are various options are available through which you can access BigQuery:

  1. Cloud Console: Using https://console.cloud.goole.com/bigquery link, you can easily access the BigQuery through Cloud Console.
  2. REST API: Using https://console.cloud.google.com/bigquery/docs/reference/v2 link, you can access the BigQuery through REST API.
  3. Cloud CLI: Using https://console.cloud.google.com/bigquery/docs/cli_tool link, you can access the BigQuery through Cloud CLI.
  4. For having SDKs: You can have Client Libraries in Java, .Net, Python(through Keras and TensorFlow) by using https://console.cloud.google.com/bigquery/docs/reference/libraries link.

How to ingest data in BigQuery?

Data in BigQuery can be ingested through Web-Based, CLI, REST API, Streaming Data, DataFlows(ETLs) , BigQuery Transfer Service.

Open the BigQuery page from the Navigation Bar in GCP. On the left side on the page you would be able to see the various options available. For Ingesting the data, click on the ADD DATA button in Resource option.


Select your data option from the available options i.e. Pin a Project(for ingesting data from any project), Explore Public Dataset(for ingesting data from the public dataset available in BigQuery), External Data Source(for ingesting data from dataset present on local file system).

For demonstrating some basic commands in BigQuery, we are Selecting one of the dataset available in Public Datasets i.e. covid_19_geographical_distribution_worldwide with following schema:



Now, click on the query table button and write the following query:

SELECT sum(daily_confirmed_cases) as total_cases , sum(daily_deaths) as total_deaths FROM `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`

Click on Run and the query will be executed.

Run another query by executing following command:

SELECT countries_and_territories, sum(daily_confirmed_cases) as total_cases , sum(daily_deaths) as total_deaths FROM `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide` group by countries_and_territories order by total_cases desc

Result:



ML on BigQuery

In BigQuery, if we have to perform the predictive analysis using various ML algorithms then the best part is we don’t have to move the data out from BigQuery. If we start moving data from the BigQuery system to another system, then it would be a long and complicated process. As synchronising the data would not be easy.

Big Query let’s you perform ML algorithms which are available in it. Using BigQuery, one can experiment and innovate fastly. Also you can automate HyperParam Tuning, etc.

Many algorithms such as Linear Regression, Clustering, Product Recommendation, Importing TensorFlow Models, etc. can be implemented easily and effectively in BigQuery.

Sample command for creating a model:

Sample command for predicting on new dataset:



Conclusion

BigQuery is a fully managed, scalable, low cost data warehouse for Analysis. It is a serverless system along with features such as High Availability, Highly Scalable, etc. It allows various ML algorithms, Predictive Analytics, GIS Queries, etc. It can integrate with various Data Visualization tools such as Data Studio, Looker, and many more.