Readers

`readers` ¶

Tabular data readers for dstrack.

Built-in readers¶

CsvReader - reads .csv files; no extra dependencies.

Extending¶

Implement TabularReader on any class to create a custom reader. That alone is enough to build a snapshot from Python:

Examples:

>>> from dstrack.readers import ColumnInfo, TabularReader
>>> class MyParquetReader:
...     def columns(self):
...         return [ColumnInfo("x", "int64")]
...     def iter_batches(self, batch_size=1000):
...         return iter([[]])
...
>>> isinstance(MyParquetReader(), TabularReader)
True

To make a reader reachable by name as well - from the CLI, or by extension inference - it must also satisfy ReaderFactory (a from_path classmethod), and be registered. There are two ways in, for two different situations:

Shipping a reader in a package others install: declare an entry point in the dstrack.readers group, and set EXTENSIONS on the class. It is then picked up automatically, and dstrack track data.parquet just works with nothing extra typed:
```
[project.entry-points."dstrack.readers"]
parquet = "dstrack_parquet:ParquetReader"
```
A reader in your own project, not installed as a plugin: call register_reader from Python, or name it on the command line as --reader "mypackage.readers:ExcelReader".

Classes:

Name	Description
`ColumnInfo`	Metadata for one column of a tabular dataset.
`CsvReader`	Reads a CSV file using the standard-library `csv` module.
`ReaderFactory`	Construction contract for readers reached by name rather than by instance.
`TabularReader`	Structural protocol for tabular data sources.

Functions:

Name	Description
`available_readers`	Return all registered readers by short name, built-in and plugin alike.
`known_extensions`	Return all registered extensions and the reader class that claims each.
`load_reader_class`	Import a reader class from a `"package.module:ClassName"` spec.
`register_reader`	Register a reader under a short name and the extensions it handles.
`resolve_reader`	Build a reader for `path`, explicitly named or inferred from its extension.
`resolve_reader_class`	Choose the reader class for `path`, without instantiating it.

`ColumnInfo(name, dtype, nullable=True)` `dataclass` ¶

Metadata for one column of a tabular dataset.

Attributes:

Name	Type	Description
`name`	`str`	Column name as it appears in the source.
`dtype`	`str`	Storage type using snapshot-schema vocabulary: `int64`, `float64`, `string`, `bool`, `datetime64`, or `bytes`.
`nullable`	`bool`	`True` if any value in this column may be `None`.
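A minimal sketch of what a schema built from these fields might look like. `ColumnInfoSketch` is a stand-in dataclass with the same three fields, not the real `ColumnInfo`, and the `SNAPSHOT_DTYPES` set is simply the vocabulary from the table above collected into a constant (the name is hypothetical).

```python
from dataclasses import dataclass

# Dtype vocabulary copied from the attribute table above.
SNAPSHOT_DTYPES = {"int64", "float64", "string", "bool", "datetime64", "bytes"}


@dataclass
class ColumnInfoSketch:
    """Stand-in with the same three fields as ColumnInfo (not the real class)."""

    name: str
    dtype: str
    nullable: bool = True


schema = [
    ColumnInfoSketch("id", "int64", nullable=False),
    ColumnInfoSketch("label", "string"),  # nullable defaults to True
]

# Names of columns whose dtype falls outside the documented vocabulary.
unknown = [c.name for c in schema if c.dtype not in SNAPSHOT_DTYPES]
```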

`CsvReader(path, *, sample_rows=200, encoding='utf-8', rename_duplicates=False, column_dtypes=None, **csv_kwargs)` ¶

Reads a CSV file using the standard-library csv module.

Satisfies TabularReader without inheriting from it. Column dtypes are inferred from the first sample_rows data rows; every subsequent call to columns() returns the cached result.

The file's modification time and size are recorded when schema inference runs. iter_batches() checks them again before reading and raises RuntimeError if the file has changed, preventing silent schema/data mismatches.

Note

Change detection relies on the mtime_ns and file size reported by the OS. On filesystems with coarse modification-time resolution (FAT32, some network or CI mounts), two writes within the same timestamp tick share the same mtime_ns, so a modification that also preserves the file size may go undetected. If you need guaranteed detection in such environments, ensure at least one full tick elapses between calling columns() and overwriting the file. Waiting 10 ms or more is enough on most systems, but FAT32 records modification times at 2-second granularity.
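The check described above can be sketched with `os.stat`. This is an illustration of the technique, not CsvReader's actual code; `file_fingerprint` and `ensure_unchanged` are hypothetical names.

```python
import os
import tempfile


def file_fingerprint(path):
    """Return (mtime_ns, size) as reported by the OS for `path`."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def ensure_unchanged(path, recorded):
    """Raise RuntimeError if the fingerprint no longer matches `recorded`."""
    if file_fingerprint(path) != recorded:
        raise RuntimeError(f"{path} changed since its schema was inferred")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("a,b\n1,2\n")

    recorded = file_fingerprint(csv_path)
    ensure_unchanged(csv_path, recorded)  # passes: nothing written since

    with open(csv_path, "a", encoding="utf-8") as f:
        f.write("3,4\n")  # the size changes, so detection does not hinge on mtime

    try:
        ensure_unchanged(csv_path, recorded)
        detected = False
    except RuntimeError:
        detected = True
```

Because the appended row changes the size, the change is caught even on a filesystem with coarse timestamps; the blind spot in the note applies only to same-size rewrites within one tick.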

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to the CSV file.	required
`sample_rows`	`int`	Number of leading data rows read to infer each column's dtype.	`200`
`encoding`	`str`	File encoding passed to open. Defaults to `"utf-8"`; use `"cp1252"` or `"latin-1"` for Excel exports.	`'utf-8'`
`rename_duplicates`	`bool`	When `True`, duplicate header names are made unique by appending a counter suffix (e.g. `col`, `col_1`, `col_2`). Headers that already appear exactly once in the file are treated as reserved: generated suffixes will never overwrite them (e.g. `["a", "a", "a_1"]` → `["a", "a_2", "a_1"]`). When `False` (default), a ValueError is raised instead.	`False`
`column_dtypes`	`dict[str, str] \| None`	Optional mapping of column name to dtype string that overrides the inferred dtype for those columns. Only the listed columns are affected; all others are still inferred automatically. `"bytes"` is not a valid override (see ADR-0002); passing it raises ValueError.	`None`
`**csv_kwargs`	`Any`	Forwarded verbatim to DictReader (e.g. `delimiter=";"`, `quotechar="'"`).	`{}`
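One way the `rename_duplicates` rule could work is sketched below. It reproduces the two examples given in the table, but it is an assumption about the algorithm rather than CsvReader's implementation, and `dedupe_headers` is a hypothetical name.

```python
from collections import Counter


def dedupe_headers(headers):
    """Make header names unique by appending _1, _2, ... to repeats.

    Names that occur exactly once in the input are reserved: a generated
    suffix is skipped if it would collide with one of them.
    """
    counts = Counter(headers)
    reserved = {h for h, n in counts.items() if n == 1}
    used = set()       # names already emitted
    next_suffix = {}   # per-header counter for the next suffix to try
    result = []
    for header in headers:
        if header not in used:
            # First occurrence keeps its original name.
            used.add(header)
            result.append(header)
            continue
        k = next_suffix.get(header, 1)
        candidate = f"{header}_{k}"
        while candidate in reserved or candidate in used:
            k += 1
            candidate = f"{header}_{k}"
        next_suffix[header] = k + 1
        used.add(candidate)
        result.append(candidate)
    return result


print(dedupe_headers(["col", "col", "col"]))  # ['col', 'col_1', 'col_2']
print(dedupe_headers(["a", "a", "a_1"]))      # ['a', 'a_2', 'a_1']
```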

Methods:

Name	Description
`columns`	Return column descriptors, inferring dtypes on the first call.
`from_path`	Build a reader for `path` with default options.
`iter_batches`	Yield batches of coerced rows.

Attributes:

Name	Type	Description
`path`	`Path`	Path to the CSV file this reader is bound to.

`path` `property` ¶

Path to the CSV file this reader is bound to.

`columns()` ¶

Return column descriptors, inferring dtypes on the first call.

Returns:

Type	Description
`list[ColumnInfo]`	Ordered list of ColumnInfo objects, one per CSV field.

`from_path(path)` `classmethod` ¶

Build a reader for path with default options.

Satisfies ReaderFactory, which is how the registry and --reader construct a reader they only know by name. Options other than the path are not reachable this way; construct the reader directly to set them.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to the CSV file.	required

Returns:

Type	Description
`CsvReader`	A CsvReader bound to `path`.

`iter_batches(batch_size=1000)` ¶

Yield batches of coerced rows.

Opens the file once per call. Before reading, compares the file's current mtime_ns and size against the values recorded during schema inference and raises if they differ.

Parameters:

Name	Type	Description	Default
`batch_size`	`int`	Maximum number of rows per batch.	`1000`

Yields:

Type	Description
`list[list[Cell]]`	A list of rows; each row is a list of `Cell` values aligned with columns().

Raises:

Type	Description
`RuntimeError`	If the file's modification time or size differs from the values recorded when the schema was inferred.

`ReaderFactory` ¶

Bases: Protocol


Construction contract for readers reached by name rather than by instance.

TabularReader describes how a reader is read, and says nothing about how one is built: code that already holds an instance never needs to know. But the registry and the "package.module:ClassName" spec only ever yield a class, so they need a uniform way to turn that class into an instance given a source path. from_path() is that way, and it is checked against this protocol before the class is ever called.

This is deliberately a second, separate protocol: a reader used only from Python (constructed by the caller, handed straight to SnapshotBuilder) still needs nothing beyond TabularReader. Only readers that are registered, or named on the command line, must also satisfy this one.

Examples:

>>> from dstrack.readers import CsvReader, ReaderFactory
>>> isinstance(CsvReader, ReaderFactory)  # the class, not an instance
True
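Why the check is made against the class can be seen in a self-contained sketch built on `typing.runtime_checkable`. `FactorySketch` is a stand-in for ReaderFactory and both reader classes are hypothetical; the real protocol may be declared differently.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class FactorySketch(Protocol):
    """Stand-in for ReaderFactory: anything exposing from_path()."""

    def from_path(self, path): ...


class WithFactory:
    def __init__(self, path):
        self.path = path

    @classmethod
    def from_path(cls, path):
        # Build an instance from nothing but a source path.
        return cls(path)


class WithoutFactory:
    pass


# The check runs on the class object, because from_path is what turns
# a class into an instance.
has_factory = isinstance(WithFactory, FactorySketch)
lacks_factory = isinstance(WithoutFactory, FactorySketch)
reader = WithFactory.from_path("data.csv")
```

A runtime protocol check only confirms that the attribute exists; it does not verify the signature or the return type.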

Methods:

Name	Description
`from_path`	Build a reader for `path` using default options.

`from_path(path)` ¶

Build a reader for path using default options.

Implemented as a classmethod on the reader class; the protocol is therefore checked against the class object itself (isinstance(MyReader, ReaderFactory)).

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to the dataset source the reader will read.	required

Returns:

Type	Description
`TabularReader`	An instance satisfying TabularReader.

`TabularReader` ¶

Bases: Protocol


Structural protocol for tabular data sources.

Any class that exposes columns() and iter_batches() satisfies this protocol, no inheritance required. Third-party readers for Parquet, SQL, HuggingFace datasets, etc. only need to implement these two methods.

Examples:

>>> from dstrack.readers import ColumnInfo, TabularReader
>>> class MyParquetReader:
...     def columns(self):
...         return [ColumnInfo("x", "int64")]
...     def iter_batches(self, batch_size=1000):
...         return iter([[]])
...
>>> isinstance(MyParquetReader(), TabularReader)
True

Methods:

Name	Description
`columns`	Return column descriptors.
`iter_batches`	Yield non-empty batches of rows.

`columns()` ¶

Return column descriptors.

May open or inspect the source on the first call; subsequent calls should return a cached result.

Returns:

Type	Description
`list[ColumnInfo]`	Ordered list of ColumnInfo objects, one per column.

`iter_batches(batch_size=1000)` ¶

Yield non-empty batches of rows.

Each row is a list of coerced values aligned with columns(). Missing values are represented as None.

Parameters:

Name	Type	Description	Default
`batch_size`	`int`	Maximum number of rows per batch.	`1000`

Yields:

Type	Description
`list[list[Cell]]`	A list of rows, each row being a list of `Cell` values.
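A custom reader has to honour the "non-empty batches of at most `batch_size` rows" contract itself. The generator below is one way to do that over any iterable of rows; `iter_batches_sketch` is a hypothetical helper, not part of dstrack.

```python
from itertools import islice


def iter_batches_sketch(rows, batch_size=1000):
    """Yield lists of at most `batch_size` rows; never yield an empty list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted: stop instead of yielding []
        yield batch


rows = [[1, "a"], [2, None], [3, "c"], [4, "d"], [5, None]]  # None marks a missing value
sizes = [len(batch) for batch in iter_batches_sketch(rows, batch_size=2)]
print(sizes)  # [2, 2, 1]
```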

`available_readers()` ¶

Return all registered readers by short name, built-in and plugin alike.

`known_extensions()` ¶

Return all registered extensions and the reader class that claims each.

`load_reader_class(spec)` ¶

Import a reader class from a "package.module:ClassName" spec.

The class is validated against both reader protocols before being returned, so a mistyped spec that happens to name some other importable object fails here rather than part-way through a snapshot.

Parameters:

Name	Type	Description	Default
`spec`	`str`	A `"<module>:<class>"` string. The module part is imported and the class part is looked up on it.	required

Returns:

Type	Description
`type[TabularReader]`	The referenced class (not an instance).

Raises:

Type	Description
`ValueError`	If `spec` is not of the form `"module:ClassName"`, the module cannot be imported, or the class is not found on it.
`TypeError`	If the referenced object is not a usable reader class.
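The spec handling can be approximated with `importlib`. This sketch mirrors the documented error mapping, but it only checks that the target is a class; the real function also validates both reader protocols. `load_class_from_spec` is a hypothetical name.

```python
import collections
import importlib


def load_class_from_spec(spec):
    """Import the class named by a "module:ClassName" string."""
    module_name, sep, class_name = spec.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {spec!r}")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ValueError(f"cannot import module {module_name!r}") from exc
    try:
        obj = getattr(module, class_name)
    except AttributeError as exc:
        raise ValueError(f"{module_name!r} has no attribute {class_name!r}") from exc
    if not isinstance(obj, type):
        # Stand-in for the protocol validation done by the real function.
        raise TypeError(f"{spec!r} does not name a class")
    return obj


loaded = load_class_from_spec("collections:OrderedDict")
print(loaded is collections.OrderedDict)  # True
```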

`register_reader(reader_cls, *, name, extensions=None)` ¶

Register a reader under a short name and the extensions it handles.

Parameters:

Name	Type	Description	Default
`reader_cls`	`type[TabularReader]`	The reader class. Must satisfy both TabularReader and ReaderFactory.	required
`name`	`str`	Short name, as typed for `--reader` (e.g. `"csv"`).	required
`extensions`	`Sequence[str] \| None`	Extensions this reader claims, leading dot included. When omitted, the class's `EXTENSIONS` attribute is used, so a plugin can declare its extensions once on the class and register through a bare entry point.	`None`

Raises:

Type	Description
`TypeError`	If `reader_cls` does not satisfy both reader protocols.
`ValueError`	If `name` or any extension is already taken. Registration never silently displaces an existing reader: two packages fighting over `.parquet` is a conflict the user has to see, not one to resolve by import order.
`ValueError`	If any extension is malformed. An extension is matched against Path.suffix, so it must be a single leading-dot suffix such as `".csv"`: a value like `"csv"` or `".tar.gz"` could never match and is rejected rather than silently registered as dead weight.

`resolve_reader(path, *, reader=None)` ¶

Build a reader for path, explicitly named or inferred from its extension.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to the dataset source the reader will read.	required
`reader`	`str \| None`	Optional short name or `"package.module:ClassName"` spec. When omitted, the reader is inferred from `path`'s file extension.	`None`

Returns:

Type	Description
`TabularReader`	An instantiated reader satisfying TabularReader.

Raises:

Type	Description
`ValueError`	If `reader` names nothing known, a spec is malformed, or no reader is registered for `path`'s extension.
`TypeError`	If the resolved class is not a usable reader.

`resolve_reader_class(path, *, reader=None)` ¶

Choose the reader class for path, without instantiating it.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to the dataset source.	required
`reader`	`str \| None`	Either a registered short name (`"csv"`) or a `"package.module:ClassName"` spec, told apart by the `":"`. When omitted, the reader is inferred from `path`'s extension.	`None`

Returns:

Type	Description
`type[TabularReader]`	A validated reader class.

Raises:

Type	Description
`ValueError`	If `reader` names nothing known, a spec is malformed, or no reader is registered for `path`'s extension.
`TypeError`	If the resolved class is not a usable reader.
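The three-way choice described for `reader` (spec, short name, or extension inference) can be sketched as a small dispatch function. The registries are passed in as plain dictionaries and `load_spec` is any callable that resolves a spec; all names here are hypothetical, and the real function's validation is omitted.

```python
from pathlib import Path


def resolve_class_sketch(path, by_name, by_extension, load_spec, *, reader=None):
    """Pick a reader class: explicit spec, explicit short name, or by suffix."""
    if reader is not None:
        if ":" in reader:
            # "package.module:ClassName" specs are told apart by the colon.
            return load_spec(reader)
        if reader not in by_name:
            raise ValueError(f"unknown reader {reader!r}")
        return by_name[reader]
    suffix = Path(path).suffix
    if suffix not in by_extension:
        raise ValueError(f"no reader registered for extension {suffix!r}")
    return by_extension[suffix]


class FakeCsvReader:
    pass


class FakeSpecReader:
    pass


by_name = {"csv": FakeCsvReader}
by_extension = {".csv": FakeCsvReader}


def load_spec(spec):
    # Stand-in for a real spec loader; always returns the same class.
    return FakeSpecReader


inferred = resolve_class_sketch("data.csv", by_name, by_extension, load_spec)
named = resolve_class_sketch("data.txt", by_name, by_extension, load_spec, reader="csv")
from_spec = resolve_class_sketch(
    "data.txt", by_name, by_extension, load_spec, reader="mypackage.readers:ExcelReader"
)
```

An explicit `reader` takes precedence over the extension, which is why `data.txt` resolves in the last two calls.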
