A data platform without proper metadata management is like a million-dollar home with no power connection. There is not much you can do with it!
Metadata management is still a problem in the traditional data management landscape, and with Big Data and technologies like Hadoop it is becoming even more complex. The whole value of Hadoop is the freedom it provides. It does not force data to fit into a schema, and you are no longer restricted to structured data.
You get the most value from Hadoop when you extract structured information from vast amounts of unstructured data. In an enterprise context, you then need to connect that extracted information with the rest of the traditional (structured) data. This is where metadata management becomes critical.
What is Metadata?
In simple terms, metadata is data about data. It is usually grouped into three categories.
- Business Metadata
This contains the business definitions and details. It is how business users understand the data. It includes additional details such as who generated the data and who owns it. It can also cover policies, changes and regulatory or compliance aspects. In a nutshell, business metadata is the key reference point for making any data-driven decision.
- Technical Metadata
This is more technical in nature: database information, tables, columns and so on. With proper business metadata in place, you can easily connect it to the table-level details in the technical metadata.
- Operational Metadata
This covers currency and lineage: the history of the data, whether it is active or archived, who accessed it and when, and how it changed over time.
In summary, the business can work with any data if proper metadata is available.
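To make the three categories concrete, here is a minimal Python sketch of how one dataset's metadata could be held together in a single catalog entry. Every class name, field name and the sample dataset below is hypothetical and chosen only for illustration. It is not the model of any particular metadata tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BusinessMetadata:
    definition: str      # what the data means to business users
    owner: str           # who owns the data
    generated_by: str    # who or what produced it
    policies: list[str] = field(default_factory=list)  # compliance notes


@dataclass
class TechnicalMetadata:
    database: str
    table: str
    columns: dict[str, str]  # column name -> data type


@dataclass
class OperationalMetadata:
    status: str = "active"  # "active" or "archived"
    access_log: list[tuple[str, datetime]] = field(default_factory=list)
    change_history: list[str] = field(default_factory=list)

    def record_access(self, user: str) -> None:
        # Keep track of who accessed the data and when.
        self.access_log.append((user, datetime.now(timezone.utc)))


@dataclass
class CatalogEntry:
    name: str
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)


def find_tables(catalog: list[CatalogEntry], term: str) -> list[str]:
    """Map a business term to physical tables via name or definition."""
    term = term.lower()
    return [
        f"{e.technical.database}.{e.technical.table}"
        for e in catalog
        if term in e.name.lower() or term in e.business.definition.lower()
    ]


# Hypothetical sample entry, for illustration only.
catalog = [
    CatalogEntry(
        name="customer_churn",
        business=BusinessMetadata(
            definition="Customers who cancelled their subscription in a given month",
            owner="Retention team",
            generated_by="billing system",
            policies=["contains personal data"],
        ),
        technical=TechnicalMetadata(
            database="sales_dw",
            table="fact_churn",
            columns={"customer_id": "BIGINT", "churn_month": "DATE"},
        ),
    )
]

catalog[0].operational.record_access("analyst_1")
print(find_tables(catalog, "cancelled"))  # ['sales_dw.fact_churn']
```

The lookup in `find_tables` is the link described above: a business user searches by a business term, and the business metadata leads to the table-level details in the technical metadata.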
In a traditional data warehousing environment (which deals with structured data), capturing, maintaining and managing metadata is fairly achievable (though many organizations do not do it properly) because:
- You know where the data is coming from
- You are fitting the data into a proper model
- You have a proper governance mechanism in place
Now think about the concept of data lakes! By definition, you ingest data (structured or unstructured) as it is, and then you swim and fish for information in the lake. The key purposes of building a lake are:
- All data in one place
- You don’t have to model upfront (Schema on read)
- Be more agile with data: the business gets to see and use data overnight
- Enable more and more self-service
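Schema on read is the item in this list that touches metadata most directly, so here is a minimal Python sketch of the idea. The sample records, the `schema` mapping and the `read_with_schema` helper are all hypothetical. The point is only that data lands untouched and the schema is applied by whoever reads it.

```python
import json

# Raw records land in the lake exactly as they arrive (hypothetical samples).
raw_lines = [
    '{"id": "1", "amount": "19.90", "note": "first order"}',
    '{"id": "2", "amount": "5", "channel": "web"}',
    'not json at all',
]

# The schema is declared by the reader, not enforced at ingestion time.
schema = {"id": int, "amount": float}


def read_with_schema(lines, schema):
    """Apply `schema` while reading; return (typed rows, rejected raw lines)."""
    rows, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
            # Keep only the fields the schema names, cast to the declared types.
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            # Unparseable or non-conforming records only surface at read time.
            rejected.append(line)
    return rows, rejected


rows, rejected = read_with_schema(raw_lines, schema)
print(rows)      # [{'id': 1, 'amount': 19.9}, {'id': 2, 'amount': 5.0}]
print(rejected)  # ['not json at all']
```

Notice that the schema lives entirely outside the data. Nothing in the lake records that `amount` is a number or that a third record is unusable, so that knowledge has to be captured somewhere as metadata.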
By now, I am sure you realize how complex it is to address the metadata aspect in a data lake.