ADR-0001: Snapshots contain semantic information¶

Status¶

Accepted

Context¶

The main purpose of this package is to track potential semantic changes on datasets, for example the fact that new classes have been added or removed, that distributions have changed, etc. There are other tools that either support snapshotting the entire dataset (DVC, git LFS, etc.), or store dataset versions within their own platform ecosystem (W&B, HuggingFace, Valohai, etc.).

Decision¶

Snapshots should contain the following semantic dataset information in each snapshot, grouped by concern.

Snapshot Identity & Provenance¶

Field	Description
`format_version`	Schema version; required for forward compatibility
`snapshot_id`	Unique identifier (e.g. UUID) for this snapshot
`created_at`	ISO-8601 timestamp of when the snapshot was taken
`created_by`	User or process that created it
`dataset_name`	Human-readable name for the dataset
`dataset_path`	Source path or URI at snapshot time
`source_type`	Origin kind: `file`, `directory`, `database`, `huggingface`, `s3`, etc.
`source_hash`	Cryptographic hash of the raw source bytes or Merkle tree hash of a directory
`description`	Free-text description of what this dataset represents
`tags`	Arbitrary key-value labels for filtering and organization

Schema¶

Field	Description
`num_columns`	Total number of columns/features
`columns`	Ordered list of column descriptors (see below)
`schema_hash`	Hash of `(name, dtype)` tuples in order; changes when schema changes

Each entry in columns contains:

name: column identifier.
dtype: storage type (int64, float32, string, bool, datetime64, bytes).
nullable: whether null values are structurally allowed.

Volume¶

Field	Description
`num_rows`	Total number of records

Per-Column Statistics¶

Stored under column_stats[<name>] for each column.

For numeric columns (int*, float*):

null_count, null_fraction.
mean, std, min, max.
percentiles: {p5, p25, p50, p75, p95, p99}.
histogram: {bin_edges: [...], counts: [...]} (fixed number of bins, e.g. 50).
num_unique: exact count if cheap, otherwise sketch estimate.

For string columns:

null_count, null_fraction.
num_unique: cardinality.
top_values: top-K {value: count} pairs (e.g. K=50).
top_values_coverage: fraction of rows covered by top_values.
avg_char_length, min_char_length, max_char_length.
avg_token_count, min_token_count, max_token_count (whitespace-split).
detected_languages: top detected language codes with fractions.

For datetime columns:

null_count, null_fraction.
min, max: ISO-8601 boundaries.
range_days: (max - min) in days.

Data Quality Signals¶

Field	Description
`duplicate_row_fraction`	Estimated fraction of exact-duplicate rows
`near_duplicate_estimate`	Approximate fraction of near-duplicate rows via MinHash/LSH
`constant_columns`	List of column names with zero variance / single unique value
`high_null_columns`	List of columns where `null_fraction > 0.5`

Content Fingerprints & Sketches¶

These compact structures enable cheap diff and similarity queries across snapshots without loading raw data.

Field	Description
`row_minhash`	MinHash signature of the row set; enables Jaccard similarity between snapshots
`row_hyperloglog`	HyperLogLog sketch for cardinality estimation of unique rows

Lineage¶

Field	Description
`parent_snapshot_id`	ID of the previous snapshot of the same dataset, if any
`pipeline_stage`	Where in the pipeline this lives: `raw`, `cleaned`, `featurized`, `augmented`, `sampled`, `final`
`transform_description`	Human-readable description of changes since the parent snapshot

Consequences¶

Snapshots describe data as-is, with no experiment configuration embedded; the same snapshot can be reused across multiple models or tasks without modification.
Per-column statistics vary in structure by dtype, so consumers must branch on column type when reading column_stats.
Content fingerprints (MinHash, HyperLogLog) enable cheap cross-snapshot diff and similarity queries without loading raw data, at the cost of being approximate.
Scope is intentionally limited to tabular data; non-tabular modalities (images, audio, video, graphs) are not supported and would require a new ADR to extend the schema.
format_version is required on every snapshot to allow the schema to evolve without breaking existing readers.
Statistics are computed in a single in-memory pass, so worst-case memory grows with the row count (per-column values, the distinct-row set, and per-string value counts are retained until the pass completes). A configurable max_rows limit bounds this: datasets larger than the limit are rejected with a clear error rather than silently exhausting memory. The limit is exposed on StatsComputer and via the dstrack track --max-rows option, and may be disabled from Python when enough memory is available.