Readers

readers

Tabular data readers for dstrack.

Built-in readers

CsvReader - reads .csv files; no extra dependencies.

Extending

Implement TabularReader on any class to create a custom reader.

Examples:

>>> from dstrack.readers import ColumnInfo, TabularReader
>>> class MyParquetReader:
...     def columns(self):
...         return [ColumnInfo("x", "int64")]
...     def iter_batches(self, batch_size=1000):
...         return iter([[]])
...
>>> isinstance(MyParquetReader(), TabularReader)
True

Classes:

| Name | Description |
| --- | --- |
| ColumnInfo | Metadata for one column of a tabular dataset. |
| CsvReader | Reads a CSV file using the standard-library csv module. |
| TabularReader | Structural protocol for tabular data sources. |

ColumnInfo(name, dtype, nullable=True) dataclass

Metadata for one column of a tabular dataset.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | Column name as it appears in the source. |
| dtype | str | Storage type using snapshot-schema vocabulary: int64, float64, string, bool, datetime64, or bytes. |
| nullable | bool | True if any value in this column may be None. |

CsvReader(path, *, sample_rows=200, encoding='utf-8', rename_duplicates=False, column_dtypes=None, **csv_kwargs)

Reads a CSV file using the standard-library csv module.

Satisfies TabularReader without inheriting from it. Column dtypes are inferred from the first sample_rows data rows; every subsequent call to columns() returns the cached result.
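To make the sampling idea concrete, here is a minimal sketch of how a dtype could be inferred from a handful of raw CSV strings. It is an illustration under stated assumptions, not dstrack's actual algorithm: the helper name infer_dtype, the order in which candidate types are tried, and the treatment of empty strings as missing values are all hypothetical.

```python
from typing import Iterable


def _is_int(text: str) -> bool:
    try:
        int(text)
        return True
    except ValueError:
        return False


def _is_float(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_dtype(samples: Iterable[str]) -> tuple[str, bool]:
    """Return (dtype, nullable) for one column of raw CSV strings.

    Hypothetical helper: empty strings are assumed to mean "missing".
    """
    values = list(samples)
    present = [v for v in values if v != ""]
    nullable = len(present) < len(values)
    if not present:
        # Nothing to inspect, so fall back to the most permissive type.
        return "string", True
    if all(v.lower() in ("true", "false") for v in present):
        return "bool", nullable
    if all(_is_int(v) for v in present):
        return "int64", nullable
    if all(_is_float(v) for v in present):
        return "float64", nullable
    return "string", nullable
```

Whatever the real rules are, rows beyond the sample are not inspected during inference, so a column whose later values do not fit the sampled type is the case where raising sample_rows or passing column_dtypes is likely to help.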

The file's modification time and size are recorded when schema inference runs. iter_batches() checks them again before reading and raises RuntimeError if the file has changed, preventing silent schema/data mismatches.

Note

Change detection relies on mtime_ns and file size reported by the OS. On filesystems with coarse modification-time resolution (FAT32, some network or CI mounts), two writes that happen within the same clock tick will share the same mtime_ns, so a modification that also preserves the file size may go undetected. If you need guaranteed detection in such environments, ensure at least one clock tick (≥ 10 ms on most systems) elapses between calling columns() and overwriting the file.
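The check described in this note can be approximated with os.stat alone. The sketch below is not dstrack's code; the function names fingerprint and ensure_unchanged are hypothetical, and only the compared values (mtime_ns and size) and the RuntimeError come from the description above.

```python
import os
from pathlib import Path


def fingerprint(path: Path) -> tuple[int, int]:
    """Return (mtime_ns, size) as reported by the OS for ``path``."""
    st = os.stat(path)
    return st.st_mtime_ns, st.st_size


def ensure_unchanged(path: Path, recorded: tuple[int, int]) -> None:
    """Raise RuntimeError if the file no longer matches ``recorded``."""
    if fingerprint(path) != recorded:
        raise RuntimeError(f"{path} changed since schema inference")
```

A same-size rewrite within one timestamp tick produces an identical pair, which is exactly the blind spot the note warns about.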

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str \| Path | Path to the CSV file. | required |
| sample_rows | int | Number of leading data rows to read for dtype inference. | 200 |
| encoding | str | File encoding passed to open(). Use "cp1252" or "latin-1" for Excel exports. | 'utf-8' |
| rename_duplicates | bool | When True, duplicate header names are made unique by appending a counter suffix (e.g. col, col_1, col_2). Headers that appear exactly once in the file are treated as reserved: a generated suffix never collides with them (e.g. ["a", "a", "a_1"] → ["a", "a_2", "a_1"]). When False (default), duplicate header names raise ValueError. | False |
| column_dtypes | dict[str, str] \| None | Optional mapping of column name to dtype string that overrides the inferred dtype for those columns. Only the listed columns are affected; all others are still inferred automatically. "bytes" is not a valid override (see ADR-0002); passing it raises ValueError. | None |
| **csv_kwargs | Any | Forwarded verbatim to csv.DictReader (e.g. delimiter=";", quotechar="'"). | {} |
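The rename_duplicates rule is easiest to follow in code. The following sketch reproduces the two documented examples; it is one possible implementation, not necessarily the one CsvReader uses, and the name dedupe_headers is hypothetical. Behavior for inputs not covered by the documented examples is an assumption.

```python
from collections import Counter


def dedupe_headers(headers: list[str]) -> list[str]:
    """Make duplicate names unique by appending a counter suffix.

    Names that occur exactly once are reserved: a generated name is
    skipped if it would collide with one of them.
    """
    counts = Counter(headers)
    reserved = {name for name, n in counts.items() if n == 1}
    used: set[str] = set()
    next_suffix: dict[str, int] = {}
    result: list[str] = []
    for name in headers:
        if name not in used:
            # First occurrence keeps the original name.
            new_name = name
        else:
            # Later occurrences take the next free "<name>_<k>".
            k = next_suffix.get(name, 1)
            new_name = f"{name}_{k}"
            while new_name in reserved or new_name in used:
                k += 1
                new_name = f"{name}_{k}"
            next_suffix[name] = k + 1
        used.add(new_name)
        result.append(new_name)
    return result
```

With this sketch, dedupe_headers(["a", "a", "a_1"]) skips the reserved a_1 and assigns a_2 to the second a, matching the example in the table.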

Methods:

| Name | Description |
| --- | --- |
| columns | Return column descriptors, inferring dtypes on the first call. |
| iter_batches | Yield batches of coerced rows. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| path | Path | Path to the CSV file this reader is bound to. |

path property

Path to the CSV file this reader is bound to.

columns()

Return column descriptors, inferring dtypes on the first call.

Returns:

| Type | Description |
| --- | --- |
| list[ColumnInfo] | Ordered list of ColumnInfo objects, one per CSV field. |

iter_batches(batch_size=1000)

Yield batches of coerced rows.

Opens the file once per call. Before reading, compares the file's current mtime_ns and size against the values recorded during schema inference and raises RuntimeError if they differ.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| batch_size | int | Maximum number of rows per batch. | 1000 |

Yields:

| Type | Description |
| --- | --- |
| list[list[Cell]] | A list of rows; each row is a list of Cell values aligned with columns(). |

Raises:

| Type | Description |
| --- | --- |
| RuntimeError | If the file's modification time or size has changed since schema inference recorded them. |
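"Coerced rows" means that the raw strings produced by the csv module are converted to typed cells before they are yielded. The sketch below shows one way such a conversion could look. It is an assumption-laden illustration: the Cell alias is a stand-in for dstrack.readers.Cell covering only four of the documented dtypes, coerce_row is a hypothetical helper, and mapping empty strings to None is assumed rather than documented.

```python
from typing import Callable, Union

# Stand-in for dstrack.readers.Cell; the real alias may differ.
Cell = Union[int, float, str, bool, None]

# One converter per dtype name (datetime64 and bytes are left out here).
_CONVERTERS: dict[str, Callable[[str], Cell]] = {
    "int64": int,
    "float64": float,
    "bool": lambda text: text.strip().lower() == "true",
    "string": str,
}


def coerce_row(raw: list[str], dtypes: list[str]) -> list[Cell]:
    """Convert raw CSV strings to typed cells; empty strings become None."""
    return [
        None if text == "" else _CONVERTERS[dtype](text)
        for text, dtype in zip(raw, dtypes)
    ]
```

Because the converters are chosen from the schema, a file that changes between inference and reading could produce values that no longer match it, which is the mismatch the change check is meant to prevent.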

TabularReader

Bases: Protocol


Structural protocol for tabular data sources.

Any class that exposes columns() and iter_batches() satisfies this protocol, no inheritance required. Third-party readers for Parquet, SQL, HuggingFace datasets, etc. only need to implement these two methods.

Examples:

>>> from dstrack.readers import ColumnInfo, TabularReader
>>> class MyParquetReader:
...     def columns(self):
...         return [ColumnInfo("x", "int64")]
...     def iter_batches(self, batch_size=1000):
...         return iter([[]])
...
>>> isinstance(MyParquetReader(), TabularReader)
True
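For readers unfamiliar with structural typing, the following self-contained sketch shows the mechanism that makes such an isinstance check work. The ColumnInfo and TabularReader classes here are local stand-ins that mirror the documented names and signatures; they are not dstrack's source, and InMemoryReader is a hypothetical custom reader.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Protocol, runtime_checkable


@dataclass
class ColumnInfo:  # stand-in mirroring the documented fields
    name: str
    dtype: str
    nullable: bool = True


@runtime_checkable
class TabularReader(Protocol):  # stand-in, not dstrack's definition
    def columns(self) -> list[ColumnInfo]: ...

    def iter_batches(self, batch_size: int = 1000) -> Iterator[list[list[Any]]]: ...


class InMemoryReader:
    """Hypothetical reader over rows already held in memory."""

    def __init__(self, rows: list[list[Any]]) -> None:
        self._rows = rows

    def columns(self) -> list[ColumnInfo]:
        return [ColumnInfo("x", "int64", nullable=False)]

    def iter_batches(self, batch_size: int = 1000) -> Iterator[list[list[Any]]]:
        # Slicing in steps of batch_size never produces an empty batch.
        for start in range(0, len(self._rows), batch_size):
            yield self._rows[start:start + batch_size]


reader = InMemoryReader([[1], [2], [3]])
```

A runtime isinstance check against a protocol only verifies that the methods exist; it does not compare signatures or return types, so a static type checker remains the better guard for a custom reader.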

Methods:

| Name | Description |
| --- | --- |
| columns | Return column descriptors. |
| iter_batches | Yield non-empty batches of rows. |

columns()

Return column descriptors.

May open or inspect the source on the first call; subsequent calls should return a cached result.

Returns:

| Type | Description |
| --- | --- |
| list[ColumnInfo] | Ordered list of ColumnInfo objects, one per column. |

iter_batches(batch_size=1000)

Yield non-empty batches of rows.

Each row is a list of coerced values aligned with columns(). Missing values are represented as None.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| batch_size | int | Maximum number of rows per batch. | 1000 |

Yields:

| Type | Description |
| --- | --- |
| list[list[Cell]] | A list of rows, each row being a list of Cell values. |
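A custom reader that wraps a row iterator can meet the "non-empty batches of at most batch_size rows" contract with itertools.islice. This is a sketch of one approach; batched_rows is a hypothetical helper name, and rejecting a batch_size below 1 is an assumption rather than documented behavior.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

Row = TypeVar("Row")


def batched_rows(rows: Iterable[Row], batch_size: int = 1000) -> Iterator[list[Row]]:
    """Yield lists of at most ``batch_size`` rows; never yield an empty list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            # Source exhausted: stop instead of yielding an empty batch.
            return
        yield batch
```

Since this is a generator function, the ValueError surfaces when iteration starts, not when batched_rows is called.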