Sunday, October 23, 2016

Google BigQuery – Analytics Data Warehouse

Google handles Big Data every second of every day to provide services like Search, YouTube, Gmail and Google Docs. Google created a query service named “Dremel”, which was initially used only inside Google.

Dremel is a query service that lets you run SQL-like queries against very large data sets and get accurate results in seconds. You need only a basic knowledge of SQL to query extremely large datasets in an ad hoc manner.

BigQuery is the public implementation of Dremel. BigQuery provides the core set of features available in Dremel to third-party developers. It does so via a REST API, a command line interface, a Web UI, access control and more, while maintaining the query performance of Dremel.

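BigQuery can scan billions of rows quickly for ad hoc query analysis. It achieves this performance through columnar storage and a tree architecture. Columnar storage keeps each column's values together, so a query reads only the columns it references. The tree architecture spreads a query across many servers and aggregates their partial results on the way back up.

BigQuery client libraries: https://cloud.google.com/bigquery/client-libraries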
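To make the columnar storage idea concrete, here is a small Python sketch. It is only a toy model, not how BigQuery stores data internally: the table, the field names and the helper functions are all made up for illustration. It shows why an aggregation over one column is cheaper when each column is stored on its own.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The records and field names below are hypothetical sample data.

rows = [
    {"account_id": 1, "path": "/home", "bytes": 512},
    {"account_id": 2, "path": "/search", "bytes": 2048},
    {"account_id": 1, "path": "/search", "bytes": 1024},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_bytes_row_store(records):
    # Every full record is visited, even though only one field is needed.
    return sum(record["bytes"] for record in records)


def total_bytes_column_store(cols):
    # Only the "bytes" column is read; the other columns are never touched.
    return sum(cols["bytes"])


columns = to_columns(rows)
print(total_bytes_row_store(rows), total_bytes_column_store(columns))
```

Both functions return the same total, but the column version touches a single list. On a table with hundreds of columns and billions of rows, skipping the unused columns is where much of the saving comes from.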

Currently, Microsoft is planning to provide a Google BigQuery connector for Power BI. In the interim, you can import data from Google BigQuery using an ODBC driver, which is fully supported for import scenarios in Power BI Desktop, and in the Personal/Enterprise Gateway for refresh purposes.

BigQuery vs MapReduce

MapReduce is a distributed computing technology that lets you implement custom “mapper” and “reducer” functions programmatically and run them as batch processes on hundreds or thousands of servers concurrently. MapReduce is designed as a batch processing framework, so it’s not suitable for ad hoc and trial-and-error data analysis.
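The mapper and reducer roles can be sketched in a few lines of Python. This is a single-process illustration of the programming model only, with hypothetical function names and sample input; a real MapReduce framework would run the same two functions in parallel across many servers.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for each word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    # Shuffle phase: group the mapper output by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical sample input: three request log lines.
logs = ["GET /home", "GET /search", "POST /search"]
result = run_job(logs)
print(result)
```

Note that the whole input has to pass through the map, shuffle and reduce phases before any result is available, which is why this model fits batch jobs better than interactive exploration.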

BigQuery is designed to handle structured data using SQL. MapReduce is a better choice when you want to process unstructured data programmatically. The mappers and reducers can take any kind of data and apply complex logic to it.

Use BigQuery

  • Finding particular records with specified conditions. For example, finding request logs with a specified account ID.
  • Quick aggregation of statistics with dynamically changing conditions. For example, getting a summary of request traffic volume from the previous night for a web application and drawing a graph from it.
  • Trial-and-error data analysis. For example, identifying the cause of trouble and aggregating values by various conditions, such as by hour or by day.
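The kinds of queries listed above might look like the following. To keep the example runnable without a Google Cloud account, it uses Python's built-in sqlite3 module as a local stand-in, so the SQL here is SQLite's dialect and not BigQuery's. The request_logs table, its columns and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a much larger log table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE request_logs (account_id INTEGER, ts TEXT, bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO request_logs VALUES (?, ?, ?)",
    [
        (42, "2016-10-22 23:15:00", 500),
        (42, "2016-10-22 23:45:00", 700),
        (7, "2016-10-23 00:05:00", 300),
        (42, "2016-10-23 00:30:00", 100),
    ],
)

# Finding particular records: all requests for one account ID.
account_rows = conn.execute(
    "SELECT ts, bytes FROM request_logs WHERE account_id = ? ORDER BY ts",
    (42,),
).fetchall()

# Quick aggregation: request count and traffic volume per hour.
hourly = conn.execute(
    "SELECT strftime('%Y-%m-%d %H', ts) AS hour_bucket, "
    "COUNT(*), SUM(bytes) "
    "FROM request_logs GROUP BY hour_bucket ORDER BY hour_bucket"
).fetchall()

print(account_rows)
print(hourly)
```

Changing the WHERE clause or the grouping expression and running the query again is the trial-and-error loop described above. With BigQuery the same loop runs against the full dataset instead of a small local sample.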

Use MapReduce

  • Executing complex data mining on Big Data that requires multiple iterations and paths of data processing with programmed algorithms.
  • Executing large join operations across huge datasets.
  • Exporting large amounts of data after processing.
