Data Catalog: Your Centralized & Governed Data Registry

Arkham's Data Catalog is the definitive source of truth for all data assets within your organization. It acts as a centralized metadata repository, providing a comprehensive, searchable, and governed view of your entire data landscape. By clearly separating data into Staging, Production, and ML Models tiers, the Catalog provides a clear, reliable path for builders to find and use the right data for the job.

Our Data Catalog is not just a passive registry; it's an active component of your data strategy. It automatically captures metadata, tracks data lineage, and profiles data quality, providing the context necessary for effective data management and governance.

How It Works: The Three Data Tiers

Our Data Catalog is designed around a three-tier system to enforce data quality and provide a clear lifecycle for your data assets. This structure is automatically managed by our platform as you use the core developer tools.

  • Staging Tier: This tier contains raw, unvalidated data ingested directly from your source systems by Connectors. Staging datasets provide an immediate, queryable snapshot of your sources and serve as the direct input for your transformation pipelines.
  • Production Tier: This tier holds the clean, validated, and transformed datasets that are the output of the Pipeline Builder. These are your high-quality, trusted data assets, ready for consumption.
  • ML Models Tier: This tier contains the direct outputs of your machine learning models from the ML Hub. Datasets here include inference results, training/testing data, and model performance metrics, providing a complete, auditable record of your models' activity.
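The tier routing described above can be sketched as a simple mapping from the tool that produced a dataset to the tier it lands in. The tool names and tier labels below mirror this section's description; the function itself is a hypothetical illustration, not an Arkham API.

```python
# Hypothetical sketch: which Catalog tier a dataset lands in, keyed by the
# tool that produced it. Mirrors the tier descriptions above; not a real API.

TIER_BY_PRODUCER = {
    "connector": "staging",            # raw, unvalidated source snapshots
    "pipeline_builder": "production",  # clean, validated, transformed outputs
    "ml_hub": "ml_models",             # inference results, training data, metrics
}

def tier_for(producer: str) -> str:
    """Return the Catalog tier for a dataset, given the tool that wrote it."""
    try:
        return TIER_BY_PRODUCER[producer]
    except KeyError:
        raise ValueError(f"unknown producer: {producer!r}")

# e.g. a dataset ingested by a Connector lands in the Staging tier:
assert tier_for("connector") == "staging"
```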

Key Technical Benefits

  • Clear Data Lifecycle: The three-tier system provides a clear, prescriptive path for all data development, from raw ingestion to ML-driven insights.
  • Automated Data Discovery: The catalog automatically registers datasets from all sources—Connectors, Pipeline Builder, and the ML Hub—ensuring it is always an up-to-date reflection of your Lakehouse.
  • Data Lineage and Provenance: Provides a complete lineage graph for every data asset, allowing you to trace data from its source to its consumption. This is critical for impact analysis, root cause analysis, and regulatory compliance.
  • Fine-Grained Access Control: Secure your data with robust governance tools. You can apply Access Control Lists (ACLs) directly to datasets, ensuring that users and roles only have permission to see and query the data they are authorized to access.
  • Integration with the Arkham Ecosystem: The Data Catalog is the central hub connecting all other components, from Connectors to the Playground, enabling a seamless builder experience.
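The ACL behavior described under Fine-Grained Access Control can be reduced to a per-dataset role check, sketched minimally below. The dataset and role names are invented examples; in practice the platform presumably manages ACLs through its governance tooling rather than code like this.

```python
# Minimal ACL sketch: each dataset carries the set of roles allowed to read it.
# Dataset and role names are hypothetical examples, not real Catalog entries.

acls: dict[str, set[str]] = {
    "production_users": {"analyst", "data_engineer"},
    "staging_raw_orders": {"data_engineer"},
}

def can_query(user_roles: set[str], dataset: str) -> bool:
    """A user may query a dataset if any of their roles appears in its ACL."""
    return bool(user_roles & acls.get(dataset, set()))

assert can_query({"analyst"}, "production_users")        # authorized
assert not can_query({"analyst"}, "staging_raw_orders")  # denied
```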

AI-Assisted Discovery with TARS

The Data Catalog is where TARS's deep understanding of your data landscape shines. It acts as an intelligent discovery tool, saving builders hours of manual exploration and helping decision-makers quickly understand data lineage and impact. You can ask complex questions like:

  • "Search for datasets related to customer orders."
  • "What is the schema for the production_users dataset? Show me the column descriptions."
  • "Show me the lineage for this dataset. What pipeline creates it and what workbooks consume it?"
  • "Write a SQL query to get the first 100 rows from @production_users."
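The last prompt above maps to a straightforward SQL query. As a sketch, the snippet below runs that query against an in-memory SQLite table standing in for production_users; the column names and row data are invented for illustration.

```python
import sqlite3

# Stand-in for the production_users dataset; columns are invented examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production_users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO production_users VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(250)],
)

# The kind of query TARS would generate for "first 100 rows":
rows = conn.execute("SELECT * FROM production_users LIMIT 100").fetchall()
assert len(rows) == 100
```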

See how the Data Catalog connects to the rest of the platform:

  • Data Platform Overview: Understand how the Data Catalog acts as the central hub in the data workflow.
  • Connectors: The source of all datasets in the Staging Tier.
  • Pipeline Builder: Consumes data from the Staging Tier and publishes trusted datasets to the Production Tier.
  • Playground: The primary tool for exploring, querying, and validating datasets in the Catalog.
  • TARS: Your AI co-pilot for intelligent data discovery, schema exploration, and lineage tracing.