Data Catalog

A Key Driver in the Digital Transformation Journey

Ram Tyagi
8 min read · Apr 26, 2021

I originally published this article on LinkedIn.

Data is key to the success of any business, and this is more relevant than ever in the current crisis facing industry and mankind. Data insights will be a key driver in dealing with COVID-19 and will be instrumental in finding a cure as well. They are also important for the financial industry, which must read current and upcoming market trends as events unfold every day. After spending two decades of my career in the financial industry, I have realized that most firms lag in data maturity, and this crisis is revealing many gaps in their governance processes. As I start my journey into retail and transportation with my recent client, I am realizing that data has never been as important for the retail sector, and especially for grocers, as it is now. The players who know their data better will not only survive but will benefit the most during and after the crisis.

What Issues Organizations Face with Data

You may be wondering why I’m describing such an obvious fact. We all know that data and metrics drive the business, as seen with Facebook, Amazon, Uber, Google, and Netflix. But the question is: why do the rest of the industry verticals still not have as good a handle on their data as these companies? Why do they struggle to locate data and provide reports on time when it comes to compliance? In my experience, leading investment banks, asset managers, and retailers still rely on key people to compile the data, and some even rely on Excel spreadsheets. Ask your team about your data, metadata, and related definitions, and most people will answer ambiguously. The following are very common responses that I have heard when interacting with operations leads, compliance officers, and business-tech liaisons:

  1. It takes too long to find data
  2. Data lives in silos
  3. The same data is found in multiple places
  4. No way to tell who is using what data
  5. We don’t know the actual source of data
  6. Documentation is out of date and not trusted. Documents are not kept in a common place where everyone can view them and collaborate.
  7. Different people define the same data differently; there is no common language or common definition of data
  8. Not easy to choose the right data sets for AI and ML engines
  9. We are so dependent on a few key people for certain reports because only they know how to query data
  10. We still have a lot of data in Excel sheets and in our heads

The diagram below shows how users struggle with their data.

All these problems share a common cause: we are not able to identify data and its sources. Data should be as easy to locate as information on Google, and it can only be made rich through collaboration, which is possible only when people in the organization can uniquely identify pieces of information. I often ask my clients whether their teams are data literate, and by this I mean not just a few developers but the entire organization, including all business people. Being data literate means that users can define and agree on a common definition of their data, so that the entire organization, including data authors and consumers, can speak the same “data language”. “Poor data literacy is ranked as the second-biggest internal roadblock to the success of the office of the chief data officer,” according to the Gartner Annual Chief Data Officer Survey. All this points to the need for a platform, tool, or place that everyone can refer to: a common inventory of metadata, definitions, and related artifacts. Let’s start with a few questions that might be coming to your mind now: How can we make everyone in the organization speak the same language about data? What is a data catalog, and what does it contain?

What is a Data Catalog

Imagine that you can connect all your data stores to one system that extracts the metadata from these sources and provides a cohesive view, connects the dots between your many tables and data constituents, and lets you and your peers tag each schema, table, and column with a definition and description. Extend this to collaborating on documentation, linking documents to metadata and metadata to documents, discussing and debating data elements, and seeing statistics about the most-used data and its most active consumers. Eventually, you can run these statistics through a pattern engine that makes suggestions about your data and its quality. All of this is possible if you have the right data catalog in place.

A data catalog is a tool or platform that empowers an organization to store, enrich, and collaborate on key data constituents such as data domains, entities, data glossaries, and documentation. It is a search engine for your metadata and a platform where people can perform Google-like searches and collaborate on writing Wikipedia-like articles. A data catalog should be designed so that it defines a common language for your organization’s data and makes the organization data literate. It automatically captures the rich context of enterprise data and keeps it up to date through collaboration between people and machine learning. A data catalog must provide at least the following (a detailed list is provided in the diagram):

  • Metadata Catalog (Business, Tech, Legal and Customer-specific metadata)
  • Query Catalog
  • Human-Contributed Context (articles, documents, white papers, definitions): wiki-style articles documenting data entities, domains, sources, tables, and columns
  • Templates for collaboration
  • Usage Context (Learned from Queries)
  • Artificial intelligence-powered automation and recommendation
  • Data Stewardship and Lineage
  • A 360-degree search of metadata and documentation

A data catalog can be used to establish a common language for your data across the organization or industry, and to collaborate across multiple data sources such as Teradata, SQL Server, Hive, Vertica, MySQL, Oracle, cloud data sources, data lakes, and data reservoirs.
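
To make the idea of crawling a source and searching its metadata concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular catalog product works: an in-memory SQLite database stands in for sources like Teradata or Oracle, and the names ColumnEntry, crawl_sqlite, search, and sales_db are hypothetical, invented for this example.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    """One catalog record: technical metadata plus human-contributed context."""
    source: str
    table: str
    column: str
    data_type: str
    description: str = ""                    # filled in by people, not the crawler
    tags: set = field(default_factory=set)   # free-form labels added by stewards


def crawl_sqlite(source_name, conn):
    """Collect table and column metadata from one SQLite connection."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        for _cid, col, col_type, *_rest in conn.execute(f'PRAGMA table_info("{table}")'):
            entries.append(ColumnEntry(source_name, table, col, col_type))
    return entries


def search(catalog, term):
    """Case-insensitive keyword search over names, descriptions, and tags."""
    term = term.lower()
    return [
        e for e in catalog
        if term in e.table.lower()
        or term in e.column.lower()
        or term in e.description.lower()
        or any(term in t.lower() for t in e.tags)
    ]


# Hypothetical source: a tiny sales database created just for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

catalog = crawl_sqlite("sales_db", conn)

# A data steward enriches the crawled metadata with business context.
for entry in catalog:
    if entry.table == "orders" and entry.column == "total":
        entry.description = "Order value in USD, including tax"
        entry.tags.add("revenue")

for hit in search(catalog, "revenue"):
    print(f"{hit.source}.{hit.table}.{hit.column} ({hit.data_type}): {hit.description}")
```

The crawler captures only what the database already knows (names and types). The description and tags come from people, which is the collaborative part of the catalog.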

Once a data catalog is built, it can be enriched over time through collaboration among data stewards, data owners, data engineers, data analysts, and business analysts. All members of the organization can contribute, and as a result, everyone in the organization becomes data literate. This in turn helps the organization quickly deploy data integration tools, AI engines, ML algorithms, and analytical tools.

How Data Catalog Can Help in Your Digital Transformation Journey

According to Gartner, “By 2021, organizations that offer a curated catalog of internal and external data to diverse users will realize twice the business value from their data and analytics investments than those that do not.”

The following are some of the key benefits that a data catalog will bring to your journey toward data maturity and digitization:

  • No more key person dependency, as everyone is now data-aware and all data is digital
  • Everyone in the organization (yes, including business people, too) can query data-related information from a common place without relying on developers
  • People can ask questions, others can answer, and discussions can be documented for future reference
  • Helps generate key insights about the most frequent users of data and the most frequent queries; this information can be instrumental in TCO decisions when choosing between on-premises and cloud computing resources
  • A driver for business requirements and implementation where business and tech can collaborate on data requirements, design, data lineage, and implementation
  • Uncovers manual data processes and Excel-like dependencies
  • Artificial intelligence-powered automation and recommendation
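
The usage insight mentioned in the list above (who queries what, and how often) can be sketched in a few lines of Python. This is a rough illustration only: the query log is made up, the function name usage_stats is hypothetical, and the regular expression is a naive stand-in for a real SQL parser, so it will miss subqueries, CTEs, and quoted identifiers.

```python
import re
from collections import Counter

# Hypothetical query log: (user, SQL text) pairs, as a catalog might collect them.
query_log = [
    ("alice", "SELECT * FROM orders WHERE total > 100"),
    ("bob", "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id"),
    ("alice", "SELECT email FROM customers"),
    ("carol", "SELECT COUNT(*) FROM orders"),
]

# Naive pattern: the identifier that follows FROM or JOIN is treated as a table name.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def usage_stats(log):
    """Count how many queries touch each table and how many queries each user ran."""
    table_hits = Counter()
    user_hits = Counter()
    for user, sql in log:
        user_hits[user] += 1
        # Use a set so a table referenced twice in one query is counted once.
        for table in {name.lower() for name in TABLE_REF.findall(sql)}:
            table_hits[table] += 1
    return table_hits, user_hits


table_hits, user_hits = usage_stats(query_log)
print(table_hits.most_common(2))   # most-queried tables
print(user_hits.most_common(1))    # most active user
```

Counts like these are the raw material for the sizing and TCO discussions mentioned above, and for spotting which data sets deserve documentation first.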

In the above diagram, the data catalog connects to all your data sources and crawls through the data stores to collect their metadata.

If we expand the scope of a data catalog across organizations, then a product that I call a catalog of catalogs can be a catalyst in solving even bigger issues that span industries and organizations. For example, researchers and scientists are struggling to speak a common data language when it comes to tackling COVID-19, where coordination is needed not just between entities of one organization but across research institutions and countries. I elaborate on this in the next section. Another good cross-industry use of a data catalog is for regulators such as the SEC and CFTC. In the transportation industry, a data catalog can be useful not only to individual OEMs for their vehicle, dealer, parts, and consumer data, but also as a good driver for a DMS (Dealership Management System) that spans OEMs.

Data Catalog’s Role in COVID-19 Collaboration

While individual organizations can solve their issues by leveraging their own data catalogs, we have to deal with data on the global front as well, as in the current COVID-19 situation. A global data catalog can help us fight COVID-19 by letting scientists collaborate on a common platform. According to a press release from Alation, a key data catalog provider, the company recently released a catalog on the pandemic that includes data from dozens of key sources, including case data from the COVID Tracking Project and Johns Hopkins University, as well as contextualizing data such as census information, comorbidity trends, weather patterns, vaccination histories, and more.

“Combatting, containing, and responding to COVID-19 is a massive data problem and in order to succeed, the brightest minds from different fields will have to work with the best data sets and collaborate with one another. Our catalog will give data and domain experts a single platform to discover relevant data sets, combine, annotate, and analyze them with confidence, and collaborate to generate and validate results,” said Aaron Kalb, co-founder and Chief Data Officer at Alation.

The Alation press release further states that the COVID-19 Data Catalog will enable community members to:

  • Search for and discover relevant data sets
  • Upload and register new data sets for inclusion in the data catalog so they can be combined with and compared to existing data sets
  • Collaborate on answering COVID-19 research questions and pose new questions to the community
  • Post “lab notebooks” and articles on specific topics and have conversations around them — all with easily embedded data
  • Define and publish queries and business intelligence artifacts (e.g., Tableau visualizations) that can be shared and searched within the catalog

Market Offerings for Data Catalog Tools

While leading cloud providers AWS and Azure offer managed data catalog services (AWS Glue and Azure Data Catalog), these are still at an early stage in providing the key features of a data catalog. The following report, “The Forrester Wave: Machine Learning Data Catalogs, Q2 2018”, compares the data catalog tools that the market offers:

My colleague Suresh Kandula wrote an interesting series of articles on data governance that also includes details on data catalogs. I recommend reading his entire series, from part 1 to part 6:

Data Governance 101 — (part 6) Tools & Automation (Medium)

In conclusion, digital transformation can yield the right results when a data catalog is integrated with your data stores and used to its full potential. A data catalog frees employees from depending on specialized resources and helps organizations build a 360-degree view of their metadata. As organizations find new ways of collaborating amid remote working and social distancing, the data catalog can be a driver that accelerates collaboration and increases productivity.
