Readers
Tabular data readers for dstrack.
Built-in readers
CsvReader - reads .csv files; no extra dependencies.
Extending
Implement TabularReader on any class to create a custom reader.
Examples:
>>> from dstrack.readers import ColumnInfo, TabularReader
>>> class MyParquetReader:
...     def columns(self):
...         return [ColumnInfo("x", "int64")]
...     def iter_batches(self, batch_size=1000):
...         return iter([[]])
...
>>> isinstance(MyParquetReader(), TabularReader)
True
Classes:
| Name | Description |
|---|---|
| ColumnInfo | Metadata for one column of a tabular dataset. |
| CsvReader | Reads a CSV file using the standard-library csv module. |
| TabularReader | Structural protocol for tabular data sources. |
ColumnInfo(name, dtype, nullable=True)
dataclass
CsvReader(path, *, sample_rows=200, encoding='utf-8', rename_duplicates=False, column_dtypes=None, **csv_kwargs)
Reads a CSV file using the standard-library csv module.
Satisfies TabularReader without inheriting from
it. Column dtypes are inferred from the first sample_rows data rows;
every subsequent call to columns() returns the cached result.
The file's modification time and size are recorded when schema inference
runs. iter_batches() checks them again before reading and raises
RuntimeError if the file has changed, preventing silent
schema/data mismatches.
Note
Change detection relies on mtime_ns and file size reported by the
OS. On filesystems with coarse modification-time resolution (FAT32,
some network or CI mounts), two writes that happen within the same
clock tick will share the same mtime_ns, so a modification that
also preserves the file size may go undetected. If you need
guaranteed detection in such environments, ensure at least one clock
tick elapses between calling columns() and overwriting the file
(≥ 10 ms on most systems, but FAT32 records modification times at
2-second resolution).
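The check described above can be sketched with the standard library alone. This is a minimal illustration, assuming the recorded fingerprint is simply the pair of `st_mtime_ns` and `st_size` from `os.stat`; the helper names `snapshot` and `ensure_unchanged` are hypothetical and are not part of dstrack.

```python
import os
import tempfile
from pathlib import Path


def snapshot(path):
    """Return (mtime_ns, size) for path, as reported by the OS."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def ensure_unchanged(path, recorded):
    """Raise RuntimeError if the file's fingerprint differs from `recorded`."""
    if snapshot(path) != recorded:
        raise RuntimeError(f"{path} changed since its schema was inferred")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    csv_path.write_text("a,b\n1,2\n", encoding="utf-8")
    recorded = snapshot(csv_path)

    ensure_unchanged(csv_path, recorded)  # passes: nothing was written

    # Appending a row changes the size, so the check fails
    # regardless of the filesystem's mtime resolution.
    with csv_path.open("a", encoding="utf-8") as fh:
        fh.write("3,4\n")
    try:
        ensure_unchanged(csv_path, recorded)
        detected = False
    except RuntimeError:
        detected = True
```

A same-size overwrite that lands within one clock tick would produce an identical pair, which is exactly the blind spot the note describes.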
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the CSV file. | required |
| sample_rows | int | Number of data rows read to infer each column's dtype. | 200 |
| encoding | str | File encoding used when opening the file. | 'utf-8' |
| rename_duplicates | bool | When | False |
| column_dtypes | dict[str, str] \| None | Optional mapping of column name to dtype string that overrides the inferred dtype for those columns. Only the listed columns are affected; all others are still inferred automatically. | None |
| **csv_kwargs | Any | Forwarded verbatim to DictReader (e.g. | {} |
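To make the interaction between sample_rows and column_dtypes concrete, here is a small sketch of sample-based inference with per-column overrides. The inference rules (try int, then float, else string), the dtype names other than "int64", and the helpers `infer_dtype` and `infer_schema` are all assumptions for illustration; the source does not document how CsvReader actually infers types.

```python
import csv
import io
from itertools import islice


def infer_dtype(values):
    """Pick the narrowest dtype that parses every non-empty sample value."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "string"
    for name, parse in (("int64", int), ("float64", float)):
        try:
            for v in present:
                parse(v)
        except ValueError:
            continue  # this dtype does not fit; try the next one
        return name
    return "string"


def infer_schema(text, sample_rows=200, column_dtypes=None):
    """Infer dtypes from the first `sample_rows` rows, then apply overrides."""
    overrides = column_dtypes or {}
    reader = csv.DictReader(io.StringIO(text))
    sample = list(islice(reader, sample_rows))
    return {
        name: overrides.get(name, infer_dtype([row[name] for row in sample]))
        for name in reader.fieldnames
    }


text = "id,price,zip\n1,9.5,02134\n2,10,10001\n"
inferred = infer_schema(text)
overridden = infer_schema(text, column_dtypes={"zip": "string"})
```

In this sketch the zip column is inferred as an integer because "02134" parses with int(), which would drop the leading zero. Overriding that one column leaves the other two inferred automatically.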
Methods:
| Name | Description |
|---|---|
| columns | Return column descriptors, inferring dtypes on the first call. |
| iter_batches | Yield batches of coerced rows. |
Attributes:
| Name | Type | Description |
|---|---|---|
| path | Path | Path to the CSV file this reader is bound to. |
path
property
Path to the CSV file this reader is bound to.
columns()
Return column descriptors, inferring dtypes on the first call.
Returns:
| Type | Description |
|---|---|
| list[ColumnInfo] | Ordered list of ColumnInfo objects, one per CSV field. |
iter_batches(batch_size=1000)
Yield batches of coerced rows.
Opens the file once per call. Before reading, compares the file's
current mtime_ns and size against the values recorded during
schema inference and raises if they differ.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int | Maximum number of rows per batch. | 1000 |
Yields:
| Type | Description |
|---|---|
| list[list[Cell]] | A list of rows; each row is a list of Cell values aligned with columns(). |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If the file's current mtime_ns or size differs from the values recorded during schema inference. |
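The batching behaviour can be sketched as follows. This is an illustration only: the function below is a hypothetical stand-in rather than the CsvReader implementation, it reads from an in-memory buffer instead of a file, and it leaves every value as a string because the coercion step is omitted.

```python
import csv
import io
from itertools import islice


def iter_batches(lines, batch_size=1000):
    """Yield lists of at most `batch_size` rows; never yield an empty batch."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


source = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
batches = list(iter_batches(source, batch_size=2))
```

With three data rows and batch_size=2, the final batch is shorter than the others, which is why batch_size is described as a maximum.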
TabularReader
Bases: Protocol
Structural protocol for tabular data sources.
Any class that exposes columns() and iter_batches() satisfies this
protocol; no inheritance is required. Third-party readers for Parquet, SQL,
HuggingFace datasets, etc. only need to implement these two methods.
Examples:
>>> class MyParquetReader:
...     def columns(self):
...         return [ColumnInfo("x", "int64")]
...
...     def iter_batches(self, batch_size=1000):
...         return iter([[]])
...
>>> isinstance(MyParquetReader(), TabularReader)
True
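The isinstance() result above is what Python's typing.runtime_checkable provides for protocols, so TabularReader is presumably declared that way. The sketch below uses a local stand-in protocol, `SupportsBatches`, rather than the real class, to show what such a check does and does not verify.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsBatches(Protocol):
    """Stand-in protocol with the same two method names as TabularReader."""

    def columns(self): ...

    def iter_batches(self, batch_size=1000): ...


class Complete:
    def columns(self):
        return []

    def iter_batches(self, batch_size=1000):
        return iter([])


class MissingIterBatches:
    def columns(self):
        return []


complete_ok = isinstance(Complete(), SupportsBatches)
partial_ok = isinstance(MissingIterBatches(), SupportsBatches)
```

A runtime protocol check only confirms that the named methods exist. It does not inspect signatures or return types, so a class can pass isinstance() and still return rows in the wrong shape.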
Methods:
| Name | Description |
|---|---|
| columns | Return column descriptors. |
| iter_batches | Yield non-empty batches of rows. |
columns()
Return column descriptors.
May open or inspect the source on the first call; subsequent calls should return a cached result.
Returns:
| Type | Description |
|---|---|
| list[ColumnInfo] | Ordered list of ColumnInfo objects, one per column. |
iter_batches(batch_size=1000)
Yield non-empty batches of rows.
Each row is a list of coerced values aligned with columns().
Missing values are represented as None.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int | Maximum number of rows per batch. | 1000 |
Yields:
| Type | Description |
|---|---|
| list[list[Cell]] | A list of rows, each row being a list of Cell values. |
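As a fuller illustration of the two-method contract, the sketch below pairs a hypothetical in-memory reader with a consumer that relies only on columns() and iter_batches(). The `ColumnInfo` dataclass here is a local stand-in mirroring the documented signature, `ListReader` and `count_missing` are invented for this example, and the "string" dtype name is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ColumnInfo:
    """Local stand-in mirroring ColumnInfo(name, dtype, nullable=True)."""
    name: str
    dtype: str
    nullable: bool = True


class ListReader:
    """Hypothetical reader over rows already held in memory."""

    def __init__(self, names, rows):
        self._columns = [ColumnInfo(n, "string") for n in names]
        self._rows = rows

    def columns(self):
        return self._columns

    def iter_batches(self, batch_size=1000):
        # range() produces no start index for an empty row list,
        # so no empty batch is ever yielded.
        for start in range(0, len(self._rows), batch_size):
            yield self._rows[start:start + batch_size]


def count_missing(reader):
    """Count None cells per column for any object with the two methods."""
    names = [col.name for col in reader.columns()]
    missing = dict.fromkeys(names, 0)
    for batch in reader.iter_batches(batch_size=2):
        for row in batch:
            for name, cell in zip(names, row):
                if cell is None:
                    missing[name] += 1
    return missing


reader = ListReader(
    ["city", "zip"],
    [["Oslo", None], [None, None], ["Rome", "00100"]],
)
result = count_missing(reader)
```

Because the consumer depends only on row order matching columns() and on None marking missing values, it would work unchanged with any reader that honours the protocol.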