The Rise of the Data Catalogs

As the need for data to fuel AI applications grows, tools to organise data are proliferating

As discussed in a previous post, the data engineering team (DE) at AI Singapore was formed to architect and deploy a data platform for our organisation. A more robust and consistent approach to data management has been expressed as a critical need by our technical leaders. They want to know where data is located, who has been accessing that data, what data products and other artefacts have been generated from the original dataset, whether there is an expiry date on the data and many other details. This data platform will support our engineers with both improved processes and tools that provide a comprehensive view of our data resources. This, in turn, will enable data governance policies and processes to be applied to better protect our data assets.

Enter the Data Catalog

Over the last few months we have been looking at the technology landscape for data catalogs for the AI Singapore data platform and were surprised at the activities in this domain. There are a number of open source and commercial projects launched by big tech to better organise their data layer and facilitate data scientists and engineers when locating relevant information. Yet another strong indicator that organisations view data as a strategic asset and want to manage it more effectively.

Our team is reviewing the feature sets across these frameworks and will then collaborate with the platform engineering and AI engineering team leaders to align on what the important capabilities are. As an organisation with the mission to develop the AI ecosystem in Singapore, we have data needs that are somewhat unique; we work with external collaborators who provide data that should be restricted only to the project team members. This data and subsequently generated data products should be erased when the project is completed, while metadata about the artefacts created should be maintained so that metrics can be collected to help us refine our practices. Without providing a formal comparison or detailed laundry list of features, this post highlights some of the reasons we found these frameworks interesting and how the technology might fit into the AI Singapore environment.

Mature Solutions

CKAN is an acronym for Comprehensive Knowledge Archive Network, although hardly anyone knows that. When we initially began this effort, CKAN was probably the only data catalog that most data engineers could name off the top of their heads other than TDS. CKAN has been the data portal standard for a number of years with a large community and many plugins and extensions. So many extensions, in fact, that at this point it is a bit daunting to know which to choose and what combinations will work well. Users can also write their own extensions. Initially oriented toward making datasets available to the public or large institutions for search and download, it would be interesting to see if it has evolved to support the more heterogeneous world of AI data.

Apache Atlas is a project oriented to data governance. It provides robust authentication and authorisation features using Apache Ranger. Metadata objects are stored internally as graphs in JanusGraph and a searchable index is generated. Atlas can ingest objects from HBase, Hive, Storm and Kakfa, of which we only use Kafka frequently. It also supports data expiry.

Recent Offerings

What is clear to us is that managing metadata is a common challenge if well known tech companies are creating fully featured solutions for their internal development teams. Multiple projects reference the Ground project from Berkeley as shaping their thinking about design. Ground used a layered graph structure to track versions, models and lineage. Just as clear is that there is not a silver bullet available if they have all decided to build their own solutions as this requires a large commitment of resources to create and maintain.

Google Data Cloud is a relatively new product that integrates with their data storage tools (BigQuery, Pub/Sub, and GCS) to extract and make available both technical and business metadata. The flexible tag schema and ability to attach metadata as a tag to any data asset (down to a specific column in a table) is well designed. This facilitates faceted searches being performed over those tags. As we use a different cloud provider, this will probably not be our choice. However, Google has a knack for delivering simple, flexible solutions and we want to understand how they approach this challenge.

DataHub is an open source project at LinkedIn which is still in development even though it has already been deployed internally for their data engineering teams. The UI is intuitive and provides strong search capabilities and a broad feature set. One initial concern is that many of the supporting technologies are other LinkedIn open source projects, so an organisation will be buying into a lot of LinkedIn technologies. Additionally, it is important to know that this is a complex system composed of about a dozen docker images and many technologies. Currently, there is no standard Kubernetes deployment script, but that is apparently in the works, so we deployed with Docker Compose to do some further prototyping.

Also still in development, Amundsen from Lyft seems a less complex system than DataHub, but it also appears to have a narrower scope. Designed as a set of micro services and based on well known open source projects such as Neo4j, Elastic Search and Airflow, Amundsen uses discovery as well as user annotation to gather information and context, currently focusing more on the technical metadata than the business metadata. The underlying engine is Neo4j which seems a good match for tracking and surfacing the relationships between users, datasets, reports, etc. Unfortunately, in the current version, there are no authorisation capabilities to restrict metadata to project members only (this is on the roadmap). In addition, there are only three types of resources : Users, Tables and Dashboard, but the data model is extensible. The community on Slack is also very active.

Making the Choice

Ideally, the solution we choose will provide integrations that collect much of the technical metadata (schemas, size, create/modified timestamps, etc) automatically from the repositories which would include RDBMS, logs, and object stores. However, only so much collection can be automated, other information needs to be added by the team that is responsible for the associated data. Both Amundsen and DataHub documentations note that a metadata system is only as good as the community that maintains it. Clear processes and team buy-in will be as important as the features of the framework.

Once a solution is chosen there will still be a considerable amount of work to configure the framework to support our projects, integrate it with existing frameworks, and augment the environment to make it as intuitive as possible for the AI engineers.

We could, of course, roll out our own and build only what we need, but like many modern organisations, we run a lean development team and to commit to a significant internal development project at this time is not appealing. It is not only the initial design, development, test and deployment effort, but also the subsequent maintenance effort. A viable open source solution is preferred.

We are still in evaluation mode as part of our effort to enable our engineering teams with a more organised approach to data management. Our team will provide an update down the road on our decisions and progress overall. If you have ideas, success stories with a certain solution or similar challenges, feel free to leave a comment.

The Data Engineering Series