Data Lakehouse: Your Unified Data Foundation

Arkham's Data Lakehouse provides a modern, open architecture that combines the flexibility and cost-efficiency of a data lake with the data management and transactional guarantees of a data warehouse. It is the central repository for all your enterprise data, from raw logs to curated, business-ready tables, designed to power every data workflow.

Core Architecture

Our Lakehouse implementation is built on a decoupled architecture that ensures scalability, reliability, and openness. It places a transactional layer directly over cloud object storage, delivering warehouse-grade performance and governance while keeping your data in open formats.
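The core idea of a transactional layer over object storage can be illustrated with a minimal Python sketch: data files are written first, then a version-numbered log entry is published atomically, so readers never observe a half-written table. This is a simplified illustration, not Arkham's actual implementation; the file layout and function names are hypothetical.

```python
import json
import os
import tempfile

# Illustrative layout: a table directory with a "_log" of commit entries.
store = tempfile.mkdtemp()
log_dir = os.path.join(store, "_log")
os.makedirs(log_dir)

def commit(version, data_files):
    """Publish a commit atomically: the log entry either fully exists
    or does not, so concurrent readers see a consistent table state."""
    entry = os.path.join(log_dir, f"{version:08d}.json")
    tmp = entry + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"add": data_files}, f)
    # os.rename is atomic on POSIX; a real system would rely on the
    # object store's put-if-absent semantics instead.
    os.rename(tmp, entry)

def latest_snapshot():
    """Readers replay the log in order to reconstruct the file list."""
    files = []
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            files.extend(json.load(f)["add"])
    return files

commit(0, ["part-000.parquet"])
commit(1, ["part-001.parquet"])
print(latest_snapshot())  # ['part-000.parquet', 'part-001.parquet']
```

Because the data files themselves are immutable, the commit log is the only point of coordination, which is what makes independent scaling of storage and compute practical.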

This architecture delivers key technical advantages:

  • Decoupled Storage and Compute: Scale your storage and compute resources independently. You can store petabytes of data cost-effectively in object storage and spin up compute clusters only when needed for queries or transformations.
  • ACID Transactions on the Lake: We bring the reliability of database transactions to your data lake. This prevents data corruption from failed jobs and ensures that users always have a consistent view of the data.
  • Performance with Open Formats: By using optimized, columnar file formats like Apache Parquet under the hood, queries are fast and efficient. Since the formats are open, you avoid vendor lock-in.
  • Schema Enforcement and Evolution: Prevent "schema drift" and data quality issues. The transactional layer validates that all data written to a table conforms to its schema, while also supporting safe, atomic schema evolution over time.
  • Time Travel & Reproducibility: Query previous versions of your data down to the millisecond. This is invaluable for debugging pipelines, auditing changes, and reproducing ML experiments.
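The mechanics behind ACID commits, schema enforcement, and time travel described above can be sketched in a few lines of self-contained Python. This is a toy model for intuition only; the class and method names are hypothetical, not part of Arkham's API.

```python
import copy

class VersionedTable:
    """Toy table with an append-only commit log: each commit is an
    atomic, validated snapshot, which enables time travel."""

    def __init__(self, schema):
        self.schema = schema      # e.g. {"id": int, "amount": float}
        self._versions = []       # immutable snapshots, one per commit

    def _validate(self, rows):
        # Schema enforcement: reject rows that do not match the schema.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: {row}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"{col} must be {typ.__name__}")

    def commit(self, rows):
        # Atomic commit: validate first, then publish a new snapshot.
        # A failed job never leaves a partially written version behind.
        self._validate(rows)
        current = self._versions[-1] if self._versions else []
        self._versions.append(current + copy.deepcopy(rows))
        return len(self._versions) - 1  # new version number

    def read(self, version=None):
        # Time travel: read any historical version (default: latest).
        if not self._versions:
            return []
        if version is None:
            version = len(self._versions) - 1
        return self._versions[version]

table = VersionedTable({"id": int, "amount": float})
v0 = table.commit([{"id": 1, "amount": 9.99}])
table.commit([{"id": 2, "amount": 4.50}])
print(len(table.read(v0)))  # 1 row as of version 0
print(len(table.read()))    # 2 rows at the latest version
```

Reading an old version for debugging or ML reproducibility is just a read of an earlier snapshot; no data is copied or restored.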

The table below compares the Lakehouse with alternative architectures:

| Feature | Data Lakes | Data Warehouses | **Lakehouse (Best of Both)** |
|---------|------------|-----------------|------------------------------|
| **Storage Cost** | ✅ Very low (S3) | ❌ High (compute + storage) | ✅ Very low (S3) |
| **Data Formats** | ✅ Any format (JSON, CSV, Parquet) | ❌ Structured only | ✅ Any format + structure |
| **Scalability** | ✅ Petabyte scale | ❌ Limited by cost | ✅ Petabyte scale |
| **ACID Transactions** | ❌ No guarantees | ✅ Full ACID support | ✅ Full ACID support |
| **Data Quality** | ❌ No enforcement | ✅ Strong enforcement | ✅ Strong enforcement |
| **Schema Evolution** | ❌ Manual management | ❌ Rigid structure | ✅ Automatic evolution |
| **Query Performance** | ❌ Slow, inconsistent | ✅ Fast, optimized | ✅ Fast, optimized |
| **ML/AI Support** | ✅ Great for ML | ❌ Poor ML support | ✅ Great for ML |
| **Real-time Analytics** | ❌ Batch processing | ✅ Real-time queries | ✅ Real-time queries |
| **Time Travel** | ❌ Not available | ❌ Limited versions | ✅ Full version history |
| **Setup Complexity** | ✅ Simple | ❌ Complex ETL | ✅ Moderate complexity |

How It Works: From Source to Lakehouse

For a builder, the Lakehouse architecture directly translates to a more efficient and reliable development experience:

  • One Source of Truth: Eliminate data silos and redundant copies. You can run BI queries, real-time analytics, and ML model training against the same, consistent data repository.
  • Simplified Data Pipelines: The Medallion Architecture becomes simple to implement. You can reliably transform data from raw (Bronze) to aggregated (Gold) with transactional guarantees at each step.
  • Direct Data Access: Connect your favorite tools. Because our architecture is built on open standards, you can query the Lakehouse directly from external tools like Power BI or Tableau, in addition to Arkham's native tools.
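The Medallion flow mentioned above (raw Bronze data refined into business-ready Gold tables) can be sketched as a pair of transformation steps. The data and function names here are illustrative assumptions, not part of any Arkham API:

```python
import json
from collections import defaultdict

# Bronze: raw events landed as-is, e.g. JSON lines from object storage.
raw_events = [
    '{"user": "a", "amount": 10.0}',
    '{"user": "b", "amount": -1.0}',  # invalid: negative amount
    '{"user": "a", "amount": 5.5}',
]

def to_silver(bronze_rows):
    """Silver: parse and validate records; only clean rows move on,
    so each downstream step commits a consistent dataset."""
    silver = []
    for line in bronze_rows:
        row = json.loads(line)
        if row["amount"] >= 0:
            silver.append(row)
    return silver

def to_gold(silver_rows):
    """Gold: business-ready aggregate (total spend per user)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)

silver = to_silver(raw_events)
gold = to_gold(silver)
print(gold)  # {'a': 15.5}
```

In the Lakehouse, each hop would be a transactional write, so a failed Silver or Gold job leaves the previous table version intact rather than a half-updated one.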

Explore how the Data Lakehouse integrates with other core components: