Data Catalog: The Single Source of Truth for Data

"Where is the data?" In most organizations, this simple question is the start of a time-consuming forensic investigation through undocumented tables, Slack histories, and conflicting business intelligence dashboards. The result is duplicated effort, inconsistent metrics, and a lack of trust in data.

Arkham's Data Catalog is engineered to be the definitive source of truth for all data assets in your organization. It's not a passive registry; it's an active, central component of your data strategy. By automatically ingesting metadata and organizing assets into clear Staging, Production, and ML Model tiers, our Catalog provides a reliable, searchable, and governed path for builders to find and use the right data for the job.

Our Data Catalog is Arkham's single source of truth, providing rich metadata, lineage, and AI-powered suggestions from TARS to accelerate data discovery.

How It Works: Our Three Data Tiers

Our Data Catalog is designed around a three-tier system to enforce data quality and provide a clear lifecycle for your data assets. This structure is automatically managed by our platform as you use the core developer tools.

  • Staging Tier: This tier contains raw, unvalidated data ingested directly from your source systems by Connectors. Staging datasets provide an immediate, queryable snapshot of your sources and serve as the direct input for your transformation pipelines.
  • Production Tier: This tier holds the clean, validated, and transformed datasets that are the output of our Pipeline Builder. These are your high-quality, trusted data assets, ready for consumption.
  • ML Models Tier: This tier contains the direct outputs of your machine learning models from our ML Hub. Datasets here include inference results, training/testing data, and model performance metrics, providing a complete, auditable record of your models' activity.
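
The tier assignment described above can be pictured as a lookup from the tool that produced a dataset to the tier it lands in. The following is a minimal sketch of that idea, not Arkham's actual API: the names `Tier`, `PRODUCER_TIER`, `CatalogEntry`, and `register`, along with the producer strings, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    STAGING = "staging"        # raw data landed by Connectors
    PRODUCTION = "production"  # validated output of Pipeline Builder
    ML_MODELS = "ml_models"    # outputs of models from ML Hub


# Hypothetical mapping from the producing tool to the tier it writes to.
PRODUCER_TIER = {
    "connector": Tier.STAGING,
    "pipeline_builder": Tier.PRODUCTION,
    "ml_hub": Tier.ML_MODELS,
}


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    producer: str
    tier: Tier


def register(name: str, producer: str) -> CatalogEntry:
    """Assign a tier based on which tool produced the dataset."""
    try:
        tier = PRODUCER_TIER[producer]
    except KeyError:
        raise ValueError(f"unknown producer: {producer!r}") from None
    return CatalogEntry(name=name, producer=producer, tier=tier)


entry = register("raw_orders", "connector")
print(entry.tier.name)  # STAGING
```

Deriving the tier from the producer, rather than letting the caller pick it, mirrors the statement that the structure is managed automatically as you use the core tools.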

🤖 AI-Assisted Discovery with TARS

Our Data Catalog is where TARS's deep understanding of your data landscape shines. It acts as an intelligent discovery tool, saving you hours of manual exploration. You can ask complex questions in natural language:

"Show me the lineage for the @production_orders dataset. What pipelines create it and what workbooks consume it?"

TARS can also help you explore schemas, profile columns, and even generate sample queries, making data discovery faster and more intuitive.
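
To make the lineage question above concrete, lineage can be thought of as a directed graph in which each edge points from a producer to the asset that consumes it. The sketch below assumes such a graph and walks it in both directions. It is illustrative only: the `EDGES` data, the asset names other than `production_orders`, and the `downstream` and `upstream` helpers are hypothetical and do not describe how TARS or the Catalog is implemented.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
EDGES = {
    "staging_orders": ["orders_pipeline"],
    "orders_pipeline": ["production_orders"],
    "production_orders": ["sales_workbook", "churn_model"],
}


def downstream(asset: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset reachable from `asset`."""
    seen = {asset}  # guards against cycles leading back to the start
    order: list[str] = []
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(asset: str, edges: dict[str, list[str]]) -> list[str]:
    """Invert the edges, then reuse the same walk to find producers."""
    reverse: dict[str, list[str]] = {}
    for parent, children in edges.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(asset, reverse)


print(upstream("production_orders", EDGES))    # ['orders_pipeline', 'staging_orders']
print(downstream("production_orders", EDGES))  # ['sales_workbook', 'churn_model']
```

The upstream walk answers "what creates it", and the downstream walk answers "what consumes it", which is the same traversal used for impact analysis.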

Key Technical Benefits

  • Clear Data Lifecycle: Our three-tier system provides a clear, prescriptive path for all data development, from raw ingestion to ML-driven insights.
  • Automated Data Discovery: Our Catalog automatically registers datasets from all sources (Connectors, Pipeline Builder, and our ML Hub), ensuring it always reflects the current state of your Lakehouse.
  • Data Lineage and Provenance: Our Catalog provides complete lineage for every data asset, allowing you to trace data from its source to its consumption. This is critical for impact analysis, root cause analysis, and regulatory compliance.
  • Fine-Grained Access Control: Secure your data with robust governance tools. You can apply Access Control Lists (ACLs) directly to datasets, ensuring that users and roles only have permission to see and query the data they are authorized to access.
  • Integration with Arkham's Ecosystem: Our Data Catalog is the central hub connecting all other components, from Connectors to our Playground, enabling a seamless builder experience.

Related Capabilities

  • Data Platform Overview: Understand how our Data Catalog acts as the central hub in the data workflow.
  • Connectors: The source of all datasets in the Staging Tier.
  • Pipeline Builder: Consumes data from the Staging Tier and publishes trusted datasets to the Production Tier.
  • Playground: The primary tool for exploring, querying, and validating datasets in our Catalog.
  • TARS: Your AI co-pilot for intelligent data discovery, schema exploration, and lineage tracing.
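
Returning to the fine-grained access control benefit listed earlier, a dataset-level ACL check can be sketched as a lookup of the permissions granted to a user or to any role the user holds. This is a minimal sketch under stated assumptions: the `DatasetAcl` class, the `user:` and `role:` prefixes, and the `query` permission name are hypothetical and are not taken from Arkham's governance tools.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetAcl:
    """Hypothetical ACL: principals (users or roles) mapped to permissions."""
    dataset: str
    grants: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, principal: str, permission: str) -> None:
        self.grants.setdefault(principal, set()).add(permission)

    def allows(self, user: str, roles: set[str], permission: str) -> bool:
        """Allow if the user, or any role the user holds, has the permission."""
        principals = {f"user:{user}"} | {f"role:{r}" for r in roles}
        return any(permission in self.grants.get(p, set()) for p in principals)


acl = DatasetAcl("production_orders")
acl.grant("role:analyst", "query")

print(acl.allows("ana", {"analyst"}, "query"))    # True
print(acl.allows("bob", {"marketing"}, "query"))  # False
```

The check denies by default: a principal with no matching grant gets no access, which matches the goal that users and roles only see the data they are authorized to access.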