Data Catalog: The Single Source of Truth for Data
"Where is the data?" In most organizations, this simple question kicks off a time-consuming forensic investigation through undocumented tables, Slack histories, and conflicting business intelligence dashboards. The result is duplicated effort, inconsistent metrics, and a lack of trust in data.
Arkham's Data Catalog is engineered to be the definitive source of truth for all data assets in your organization. It's not a passive registry; it's an active, central component of your data strategy. By automatically ingesting metadata and organizing assets into clear Staging, Production, and ML Model tiers, the Catalog provides a reliable, searchable, and governed path for builders to find and use the right data for the job.

How It Works: The Three Data Tiers
Our Data Catalog is designed around a three-tier system to enforce data quality and provide a clear lifecycle for your data assets. This structure is automatically managed by our platform as you use the core developer tools.

- Staging Tier: This tier contains raw, unvalidated data ingested directly from your source systems by Connectors. Staging datasets provide an immediate, queryable snapshot of your sources and serve as the direct input for your transformation pipelines.
- Production Tier: This tier holds the clean, validated, and transformed datasets that are the output of the Pipeline Builder. These are your high-quality, trusted data assets, ready for consumption.
- ML Models Tier: This tier contains the direct outputs of your machine learning models from the ML Hub. Datasets here include inference results, training/testing data, and model performance metrics, providing a complete, auditable record of your model's activity.
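The three-tier lifecycle above can be sketched in plain Python. This is an illustrative model only: the `Tier` enum, `Dataset` class, and `Catalog` registry are assumptions for the sketch, not Arkham's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    """The three catalog tiers, in lifecycle order."""
    STAGING = "staging"        # raw, unvalidated output of Connectors
    PRODUCTION = "production"  # validated output of the Pipeline Builder
    ML_MODELS = "ml_models"    # inference results and metrics from the ML Hub


@dataclass
class Dataset:
    name: str
    tier: Tier
    lineage: list = field(default_factory=list)  # names of upstream datasets


class Catalog:
    """A toy registry that enforces the Staging -> Production flow."""

    def __init__(self):
        self.datasets = {}

    def register_staging(self, name):
        """Connectors land raw data in the Staging tier."""
        self.datasets[name] = Dataset(name, Tier.STAGING)

    def promote(self, staging_name, production_name):
        """The Pipeline Builder publishes validated output to Production."""
        source = self.datasets[staging_name]
        if source.tier is not Tier.STAGING:
            raise ValueError("pipelines read from the Staging tier")
        self.datasets[production_name] = Dataset(
            production_name, Tier.PRODUCTION, lineage=[staging_name]
        )


catalog = Catalog()
catalog.register_staging("raw_orders")
catalog.promote("raw_orders", "production_orders")
print(catalog.datasets["production_orders"].tier)  # Tier.PRODUCTION
```

The key design point the sketch captures is that promotion is one-directional: Production datasets are always derived from Staging inputs, which is what makes the lineage auditable.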
🤖 AI-Assisted Discovery with TARS
The Data Catalog is where TARS's deep understanding of your data landscape shines. It acts as an intelligent discovery tool, saving you hours of manual exploration. You can ask complex questions in natural language:
"Show me the lineage for the production_orders dataset. What pipelines create it and what workbooks consume it?"
TARS can also help you explore schemas, profile columns, and even generate sample queries, making data discovery faster and more intuitive.
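Conceptually, a lineage question like the one above resolves to a traversal over the catalog's lineage graph. The sketch below shows that idea with a tiny adjacency map; the edge data and asset names (`orders_pipeline`, `revenue_workbook`, and so on) are made up for illustration.

```python
# A lineage graph as adjacency lists: edges point from producer to consumer.
# Names are illustrative, not real Arkham assets.
edges = {
    "orders_pipeline": ["production_orders"],
    "production_orders": ["revenue_workbook", "churn_workbook"],
}


def upstream(target):
    """Who produces `target`? (e.g., the pipelines that create a dataset)"""
    return [src for src, outs in edges.items() if target in outs]


def downstream(target):
    """Who consumes `target`? (e.g., the workbooks it feeds)"""
    return edges.get(target, [])


print(upstream("production_orders"))    # ['orders_pipeline']
print(downstream("production_orders"))  # ['revenue_workbook', 'churn_workbook']
```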
Key Technical Benefits
- Clear Data Lifecycle: The three-tier system provides a clear, prescriptive path for all data development, from raw ingestion to ML-driven insights.
- Automated Data Discovery: The catalog automatically registers datasets from all sources—Connectors, Pipeline Builder, and the ML Hub—ensuring it is always an up-to-date reflection of your Lakehouse.
- Data Lineage and Provenance: Provides a complete lineage graph for every data asset, allowing you to trace data from its source to its consumption. This is critical for impact analysis, root cause analysis, and regulatory compliance.
- Fine-Grained Access Control: Secure your data with robust governance tools. You can apply Access Control Lists (ACLs) directly to datasets, ensuring that users and roles only have permission to see and query the data they are authorized to access.
- Integration with the Arkham Ecosystem: The Data Catalog is the central hub connecting all other components, from Connectors to the Playground, enabling a seamless builder experience.
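The dataset-level ACL model described above boils down to a simple membership check at query time. Here is a minimal sketch of that idea; the role names, dataset names, and `can_query` helper are hypothetical, not Arkham's governance API.

```python
# Each dataset maps to the set of roles allowed to query it.
# Roles and dataset names are illustrative only.
acls = {
    "production_orders": {"analyst", "data_engineer"},
    "raw_customer_pii": {"data_engineer"},
}


def can_query(role, dataset):
    """A role may query a dataset only if it appears in that dataset's ACL."""
    return role in acls.get(dataset, set())


print(can_query("analyst", "production_orders"))  # True
print(can_query("analyst", "raw_customer_pii"))   # False
```

Defaulting to an empty set for unlisted datasets makes the check deny-by-default, which is the safer posture for governed data.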
Related Components
- Data Platform Overview: Understand how the Data Catalog acts as the central hub in the data workflow.
- Connectors: The source of all datasets in the Staging Tier.
- Pipeline Builder: Consumes data from the Staging Tier and publishes trusted datasets to the Production Tier.
- Playground: The primary tool for exploring, querying, and validating datasets in the Catalog.
- TARS: Your AI co-pilot for intelligent data discovery, schema exploration, and lineage tracing.