Week 3 | Notion

OLAP vs OLTP

There are two main types of Data Warehouses: OLAP (Online Analytical Processing) and OLTP (Online Transactional Processing).

The first one is designed to be used by data analysts, business managers, where data is constantly updated in batch, to allow decision making. The latter is designed to control and run essential business operations in real time and the updates are fast, but in smaller amount.

In terms of historical information, OLAP provides multi-dimensional view of the whole data available in a company where as OLTP concatenates daily transactions, for example.

What is a Data Warehouse?

It is a OLAP solution and and it works as follows:

Extracted from: https://datawarehouseinfo.com/wp-content/uploads/2018/09/DW-architecture.jpg

Generally speaking, a DW stores 3 types of data (summary data, raw data and metadata), which are derived from many sources and these data can be split into different data marts depending on the goal, in order to support decision making and help solve business problems. In Staging, it is where we create the ETL layer, responsible for transforming data the way we need

BigQuery

It is a serverless data warehouse, that can scale and has high-availability depending on our needs! It is a DW solution provided by Google and it is charged based on demand.

Partitioning in BQ

Imagine we have raw data , based on questions posted on StackOverflow and these question are marked with their creation date, title and tags. If we only want to query questions made on March 2nd 2018, BigQuery will have to look for every row and filter out only those from March 2nd 2018, which is an expensive task, depending on how much data we have. However, here comes partitioning: we can create the same table partitioned by the creation data, so that BQ will find our desired data much faster and cost us less, like below:

Extracted from: https://storage.googleapis.com/gweb-cloudblog-publish/images/BigQuery_Explained_Storage_5.max-1400x1400.png

Clustering in BQ

From our previous section, we saw that we are able to partition our data in such a way that our queries becomes faster and light-weighted, reducing our costs! Additionally, we can create one more level of compactness, by clustering the data inside the partitions. So, in our previous image, we can see that on March 2nd 2018, we have questions for Android and Linux and this way, we can cluster these 5 rows by Tags, that would make our first 3 rows as being Android and the last 2 rows as being Linux, optimizing our further queries.

Inside Partitioning and Clustering

Partitioning
- Time-unit column
- It can be performed daily, hourly or monthly/yearly partitioning
Clustering
- We can specify up to 4 clustering columns
- The order of the clustering column is important as it determines the sort order (it is like a group by clause)
- Useful when we need more granularity than partitioning allows

NOTE: tables sizes less than 1GB, partitioning and clustering don’t show significant improvement