A data lake is a repository that holds big data from many different sources and can be used for a variety of purposes by data engineers, data scientists, data analysts, and others. There are two main concerns to consider when building a Data Lake: security and scalability.
Extracted from: <https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/>
ETL stands for Extract, Transform and Load, whereas ELT stands for Extract, Load and Transform. The latter is the approach used in data lakes, where we need to store the data as quickly as possible, without worrying about transformations at first. The former is the one typically used by data warehouses, which deal with a smaller amount of data (and when we say small, we mean on the order of terabytes; in a Data Lake, the order is petabytes). Not so small after all, right? ;)
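As a rough illustration, here is a minimal Python sketch contrasting the two orders of operations. The tiny DataFrame, column names and file names are made up, and pandas plus a parquet engine (e.g. pyarrow) are assumed:

```python
import pandas as pd

# Toy trip data standing in for a real source; columns are illustrative.
raw = pd.DataFrame({
    "pickup": ["2021-01-01 00:15", "2021-01-01 00:31"],
    "amount": ["12.5", "7.9"],
})

# ETL (data-warehouse style): transform first, then load only the curated result.
curated = raw.assign(
    pickup=pd.to_datetime(raw["pickup"]),
    amount=raw["amount"].astype(float),
)
curated.to_parquet("warehouse_trips.parquet")    # "load" step, here just a local file

# ELT (data-lake style): land the raw data untouched, transform later on read.
raw.to_parquet("lake_raw_trips.parquet")         # load as-is, as fast as possible
later = pd.read_parquet("lake_raw_trips.parquet")
later["amount"] = later["amount"].astype(float)  # transform only when the data is used
```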
One last thing concerning this topic: in Data Lakes, the schema is only obtained when the data is read, which is known as schema-on-read. (The schema defines the fields and data types contained in a data source.)
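A quick way to see schema-on-read in action with pandas (the CSV content below is just a toy example, a raw file in the lake has no schema declared up front):

```python
import pandas as pd
from io import StringIO

# A raw CSV is just text sitting in the lake; nothing declares its schema on write.
raw_csv = StringIO("id,fare,pickup\n1,12.5,2021-01-01\n2,7.9,2021-01-02\n")

# The schema (field names and types) is only resolved when we read the data.
df = pd.read_csv(raw_csv, parse_dates=["pickup"])
print(df.dtypes)
# id                 int64
# fare             float64
# pickup    datetime64[ns]
```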
In week 1, we wrote an ingest script that downloads a parquet file and then ingests the data into PostgreSQL. However, this is not best practice: since there are two main steps (download and ingestion), we should split the process into two different scripts and establish an order of execution, so that the ingestion script is only triggered after the data has been downloaded.
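A minimal sketch of what that split might look like, assuming Python with pandas and SQLAlchemy; the script names, table name and connection string are placeholders, not the exact ones from week 1:

```python
# download.py — step 1: just fetch the raw file and save it locally, nothing else.
import sys
import urllib.request

def download(url: str, output_file: str) -> None:
    urllib.request.urlretrieve(url, output_file)

if __name__ == "__main__":
    download(sys.argv[1], sys.argv[2])
```

```python
# ingest.py — step 2: load the already-downloaded file into PostgreSQL.
import sys
import pandas as pd
from sqlalchemy import create_engine

def ingest(parquet_file: str, table_name: str) -> None:
    # Connection string is a placeholder for a local PostgreSQL instance.
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")
    df = pd.read_parquet(parquet_file)
    df.to_sql(table_name, con=engine, if_exists="replace", index=False)

if __name__ == "__main__":
    ingest(sys.argv[1], sys.argv[2])
```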
In real scenarios, we may have many steps to execute, and it is recommended to orchestrate them as a data workflow: this minimizes the chance of forgetting a step and also gives us an overview of everything that needs to be done.
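For example, with an orchestrator such as Apache Airflow, the two hypothetical scripts above could be chained into a small DAG. This is only a sketch: the DAG id, schedule, URL and paths are illustrative, and other orchestrators work similarly:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="data_ingestion",
    schedule_interval="@monthly",        # run once per month, for example
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:

    download_task = BashOperator(
        task_id="download",
        bash_command="python download.py {{ params.url }} /tmp/output.parquet",
        params={"url": "https://example.com/yellow_tripdata_2021-01.parquet"},
    )

    ingest_task = BashOperator(
        task_id="ingest",
        bash_command="python ingest.py /tmp/output.parquet yellow_taxi_data",
    )

    # The ingestion only runs after the download has finished.
    download_task >> ingest_task
```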