A data lake is a flexible storage pattern that is typically used for storing massive amounts of raw data in its native format.[^1] Data lakes are flexible in that they can store practically any type of data from structured (tabular data), semi-structured (JSON, XML), and unstructured data (videos, images, audio). Data lakes utilize blob storage which is a cheap and abundant and a compute engine of the user's choice. ```mermaid %%{init: { "flowchart": { "useMaxWidth": true } } }%% graph LR     A1((Structured Data)) --> B[(Blob Storage)]     A2((Semi-structured Data))-->B     A3((Unstructured Data))-->B     B -->D[Data Engineer]     B -->E[Data Scientist]     B -->F[Machine Learning Engineer] ``` ## Data Lake Advantages - Cheaply store large amounts of data. - Flexibly store any type of data (futureproof). - Allows for compute optimization by mixing and matching compute options for different workloads. - All data is stored in one place where all stakeholders can work. ## Data Lake Disadvantages - Data governance is more challenging and relies on robust cataloging and metadata to make the data useful. Cloud providers often offer additional services to address these issues. - Because storage is cheap, there's a tendency to store more data regardless of it's business value. ## Data Lake Reference Architectures - AWS: [Deploy and manage a serverless data lake on the AWS Cloud by using infrastructure as code](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code.html?did=pg_card&trk=pg_card) - Azure: [Introduction to Azure Data Lake Storage Gen2](https://learn.microsoft.com/en-us/training/modules/introduction-to-azure-data-lake-storage/) - GCP: [Data lakes in cloud with Kafka and Confluent](https://cloud.google.com/blog/products/data-analytics/data-lakes-in-cloud-with-kafka-and-confluent) [^1]: [AWS, What is a data lake?](https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/) %% wiki footer: Please don't edit anything below this line %% ## This note in GitHub <span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Concepts/Data%20Architecture/Data%20Lake.md "git-hub-edit-note") | [Copy this note](https://raw.githubusercontent.com/data-engineering-community/data-engineering-wiki/main/Concepts/Data%20Architecture/Data%20Lake.md "git-hub-copy-note")</span> <span class="git-footer">Was this page helpful? [👍](https://tally.so/r/mOaxjk?rating=Yes&url=https://dataengineering.wiki/Concepts/Data%20Architecture/Data%20Lake) or [👎](https://tally.so/r/mOaxjk?rating=No&url=https://dataengineering.wiki/Concepts/Data%20Architecture/Data%20Lake)</span>