A data lake is a repository that holds big data from many different sources and can be used for a variety of purposes by data engineers, data scientists, data analysts, and others. There are two main concerns to consider when building a Data Lake: security and scalability.
Extracted from: <https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/>
ETL stands for Extract, Transform and Load, whereas ELT stands for Extract, Load and Transform. The latter is the approach used in data lakes, where we need to store the data as quickly as possible, without worrying about transformations at first. The former is the one typically used by data warehouses, which deal with a smaller amount of data (and when we say small, we mean on the order of terabytes; in a Data Lake, the order is petabytes). Not so small after all, right? ;)
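As a rough illustration, here is a minimal Python sketch contrasting the two orders of operations. The tiny DataFrame, column names and file names are made up, and pandas plus a parquet engine (e.g. pyarrow) are assumed:

```python
import pandas as pd

# Toy trip data standing in for a real source; columns are illustrative.
raw = pd.DataFrame({
    "pickup": ["2021-01-01 00:15", "2021-01-01 00:31"],
    "amount": ["12.5", "7.9"],
})

# ETL (data-warehouse style): transform first, then load only the curated result.
curated = raw.assign(
    pickup=pd.to_datetime(raw["pickup"]),
    amount=raw["amount"].astype(float),
)
curated.to_parquet("warehouse_trips.parquet")    # "load" step, here just a local file

# ELT (data-lake style): land the raw data untouched, transform later on read.
raw.to_parquet("lake_raw_trips.parquet")         # load as-is, as fast as possible
later = pd.read_parquet("lake_raw_trips.parquet")
later["amount"] = later["amount"].astype(float)  # transform only when the data is used
```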
One last thing concerning this topic: in Data Lakes, the schema is only obtained when the data is read, which is known as schema-on-read. (The schema defines the fields and data types contained in a data source.)
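A quick way to see schema-on-read in action with pandas (the CSV content below is just a toy example, a raw file in the lake has no schema declared up front):

```python
import pandas as pd
from io import StringIO

# A raw CSV is just text sitting in the lake; nothing declares its schema on write.
raw_csv = StringIO("id,fare,pickup\n1,12.5,2021-01-01\n2,7.9,2021-01-02\n")

# The schema (field names and types) is only resolved when we read the data.
df = pd.read_csv(raw_csv, parse_dates=["pickup"])
print(df.dtypes)
# id                 int64
# fare             float64
# pickup    datetime64[ns]
```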
In week 1, we wrote an ingest script that downloads a parquet file and then ingests the data into PostgreSQL. However, this is not best practice: since there are two main steps (download and ingestion), we should split the process into two different scripts and establish an order of execution, so that the ingestion script is only triggered after the data has been downloaded.
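A minimal sketch of what that split might look like, assuming Python with pandas and SQLAlchemy; the script names, table name and connection string are placeholders, not the exact ones from week 1:

```python
# download.py — step 1: just fetch the raw file and save it locally, nothing else.
import sys
import urllib.request

def download(url: str, output_file: str) -> None:
    urllib.request.urlretrieve(url, output_file)

if __name__ == "__main__":
    download(sys.argv[1], sys.argv[2])
```

```python
# ingest.py — step 2: load the already-downloaded file into PostgreSQL.
import sys
import pandas as pd
from sqlalchemy import create_engine

def ingest(parquet_file: str, table_name: str) -> None:
    # Connection string is a placeholder for a local PostgreSQL instance.
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")
    df = pd.read_parquet(parquet_file)
    df.to_sql(table_name, con=engine, if_exists="replace", index=False)

if __name__ == "__main__":
    ingest(sys.argv[1], sys.argv[2])
```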
In real scenarios, we may have many steps to execute, and it is recommended to orchestrate them as a data workflow: this minimizes the chance of forgetting a step and also gives us an overview of everything that needs to be done.
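For example, with an orchestrator such as Apache Airflow, the two hypothetical scripts above could be chained into a small DAG. This is only a sketch: the DAG id, schedule, URL and paths are illustrative, and other orchestrators work similarly:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="data_ingestion",
    schedule_interval="@monthly",        # run once per month, for example
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:

    download_task = BashOperator(
        task_id="download",
        bash_command="python download.py {{ params.url }} /tmp/output.parquet",
        params={"url": "https://example.com/yellow_tripdata_2021-01.parquet"},
    )

    ingest_task = BashOperator(
        task_id="ingest",
        bash_command="python ingest.py /tmp/output.parquet yellow_taxi_data",
    )

    # The ingestion only runs after the download has finished.
    download_task >> ingest_task
```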