Digging into BigData with Google’s BigQuery
What is Big Data?
More than 100 thousand tweets in 60 seconds are generated , more than 7 million posts have been posted on Facebook, before you read this sentence.So the data is generated faster than you could ever think before.Big data and analysis has exploded recently but there is a barrier. That barrier is indeed it needs lot of money resources and time to setup the infrastructure.Also it needs skillful people to make it all happen. so google solves all these big query.Big query,One of the products of google cloud platform that allows us to easily work with big data.It is google’s fully managed data analysis service offering in the cloud. It enables super fast analysis.Easily store and analyse big data in google infrastructure.
Lets get familiar with the components.
Projects are going to be the top level item inside the google cloud platform. Project contains, users authentication billing information and that is where data sets are going to live.
Data set is really a container for tables.Access controllers cannot be done on tables so that they are don through projects and data sets.Project contains data sets, data sets contains tables.
Tables where the data lives
Jobs are going to be asynchronous process that run on the background to load, export and to execute large queries.
Lets see How to use google cloud platform for big data solution.Architecture is divided in to two workflows named data workflow and visualization workflow.We need to get our source data into big query using any ETL tool and pipe into google cloud storage. Extract it from source and de normalize it,Biquery likes less joins the better .we can use hadoop clusters running on computers to do many pre processing and transforming data.once its in bigquery its all about visualization. Most of the use cases are log analysis which is used to analyse application behavior and user behavior in order to improve the system.Retail forecast – the more data, business has the more accurately they can predict product sales for the next month,that allows them to plan better. Lets see how we can use big query to analyse lots of data in very short time they handle the infrastructure and we can just simply focus on getting our data and analyse it.
Google handles Big Data every second of every day to provide services like Search, YouTube, Gmail and Google Docs.Can you imagine how Google handles this kind of Big Data during daily operations? How they are doing it?
As an example, let’s consider the following SQL query, which requests the Wikipedia® content titles that includes numeric characters in it:
select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH
(title, ‘[0-9]*’) AND wp_namespace = 0;
Notice the following:
• This “wikipedia” table holds all the change history records on Wikipedia’s article content and consists of 314 millions of rows – that’s 35.7GB.
• The expression REGEXP_MATCH(title, ‘[0-9]+’) means it executes a regular expression matching on title of each change history record to extract rows that includes numeric characters in its title (e.g. “United States presidential election, 2015”).
• Most importantly, note that there was no index or any pre-aggregated values
for this table prepared in advance.
Dremel can even execute a complex regular expression text matching on ahuge logging table that consists of about 35 billion rows and 20 TB, in merely tens of seconds. This is the power of Dremel; it has super high scalability and most of the time it returns results within seconds or tens of seconds no matter how big the queried data set is.
Two core technologies which gives Dremel this performance:
1. Columnar Storage. Data is stored in a columnar storage fashion which
makes possible to achieve very high compression ratio and scan throughput.
2. Tree Architecture is used for dispatching queries and aggregating results
across thousands of machines in a few seconds.
Dremel stores data in its columnar storage, which means it separates a record into column values and stores each value on different storage volume, whereas
traditional databases normally store the whole record on one volume.
• Traffic minimization. Only required column values on each query are scanned and transferred on query execution. For example, a query “SELECT top(title) FROM foo” would access the title column values only. In case of the Wikipedia table example, the query would scan only 9.13GB out of 35.7GB.
• Higher compression ratio. One study reports that columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3. Because each column would have similar values, especially if the cardinality of the column (variation of possible column values) is low, it’s easier to gain higher compression ratios than row-based storage. Columnar storage has the disadvantage of not working efficiently when updating existing records. In the case of Dremel, it simply doesn’t support any update operations.
One of the challenges Google had in designing Dremel was how to dispatch queries and collect results across tens of thousands of machines in a matter of seconds. The challenge was resolved by using the Tree architecture. The architecture forms a massively parallel distributed tree for pushing down a query to the tree and then aggregating the results from the leaves at a blazingly fast speed.
The tree architecture also enables multiple queries to run at once within the tree, which lets
different users share the same hardware.You might have heard of hadoop map reduce mechanism,
so what is the difference between mad reduce and bigquery
- Google Spreadsheet
- Web application
- Command Line Tool
- Bigquery UI