Let’s talk about Data Lakes! Maybe you are here because you want to know more about them, or maybe you suspect you have a Data Swamp and want to learn what to do about it. Either way, we will cover some strategies and questions to think about when designing your Data Lake.
What is a Data Lake?
A Data Lake is a scalable, organized, blob-oriented data storage paradigm. It is a place to store data of all sizes and kinds: you can have some structured data in Parquet, say, alongside some unstructured text data in a raw .txt file. It is a great place for storing, for example, raw data prior to processing, or old reporting datasets that don’t need to be accessed on demand.
Those features sound nice, and Data Lakes have been recommended as a great tool for organizations of all sizes, so what is the catch? Well, a Data Lake without proper planning and maintenance can turn into a Data Swamp.
What is a Data Swamp?
A Data Swamp is a collection of data in which analysts struggle to find what they need, whether they are looking for the correct dataset or for the most up-to-date data. Often a group has versions of the same dataset stored in multiple places with slightly different properties or dates.
If you find yourself there, have no fear: we can engineer a way out!
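One quick way to check for the swamp symptom described above (the same dataset living in several places) is to scan a listing of object keys and flag names that show up in more than one folder. The snippet below is only a rough sketch: the function name `find_scattered_datasets`, the sample keys, and the assumption that copies of a dataset share a file name are all hypothetical, and a real lake would supply the listing from its own storage API.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_scattered_datasets(object_keys):
    """Return dataset names that appear in more than one folder.

    Assumption (hypothetical): copies of a dataset share the same file
    stem, ignoring case and file extension.
    """
    folders_by_name = defaultdict(set)
    for key in object_keys:
        path = PurePosixPath(key)
        # Normalize case so "Sales.csv" and "sales.parquet" are grouped together.
        folders_by_name[path.stem.lower()].add(str(path.parent))
    # Keep only names stored under two or more distinct folders.
    return {
        name: sorted(folders)
        for name, folders in folders_by_name.items()
        if len(folders) > 1
    }


# Made-up example keys; in practice these would come from a bucket listing.
example_keys = [
    "raw/sales/2023/sales.parquet",
    "reports/q4/sales.parquet",
    "team_a/scratch/Sales.csv",
    "raw/logs/app.txt",
]

scattered = find_scattered_datasets(example_keys)
for name, folders in scattered.items():
    print(f"{name}: found in {len(folders)} folders -> {folders}")
```

Matching on file name is a crude heuristic, so treat any hits as candidates to review by hand rather than confirmed duplicates.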