ADR-0002: bytes dtype is not supported in CsvReader
Status
Accepted
Context
ADR-0001 lists bytes as a valid dtype value in the snapshot schema's column
descriptor (dtype field, Schema section). That document describes the
snapshot schema, which is the serialisation format shared by all readers. It does not
imply that every reader can infer or emit a bytes column.
CSV is a plain-text format. A bytes column has no canonical text
representation: hex strings (0xDEAD), Base64, percent-encoding, and raw
Latin-1 bytes are all plausible encodings, and none is universally adopted.
Auto-detecting which encoding a column uses from sample values is error-prone:
hex-formatted SHA-1 hashes, UUIDs written without hyphens, and many random
alphanumeric strings are valid input for both a hex decoder and a Base64
decoder. Picking the wrong encoding silently produces corrupted data without
raising an error.
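The ambiguity can be shown with the Python standard library alone. The sketch below takes a single illustrative cell value (the string itself is an assumption, not taken from any real file) and decodes it under two of the encodings listed above. Both decoders accept it, and they disagree about the result.

```python
import base64

# Illustrative cell value: every character is valid in both the hex
# alphabet and the Base64 alphabet, and the length is a multiple of 4.
value = "deadbeef"

as_hex = bytes.fromhex(value)                    # interprets 2 chars per byte
as_b64 = base64.b64decode(value, validate=True)  # interprets 4 chars per 3 bytes

# Neither call raises, yet the decoded payloads differ in length and content.
print(len(as_hex), len(as_b64))
print(as_hex == as_b64)
```

Because neither decoder rejects the input, a reader that guessed the encoding would have no signal that it guessed wrong.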
Decision
CsvReader will not support the bytes dtype, either through auto-inference or
through the column_dtypes override parameter. Passing "bytes" as a dtype
override raises a ValueError immediately, rather than silently returning
strings or corrupted binary data.
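A minimal sketch of what such a fail-fast check could look like follows. It is not the actual CsvReader source: the helper name validate_column_dtypes, the constant UNSUPPORTED_CSV_DTYPES, and the wording of the error message are all hypothetical. Only the "bytes" and "string" dtype tokens and the ValueError come from this ADR.

```python
# Hypothetical sketch; names and message text are illustrative only.
UNSUPPORTED_CSV_DTYPES = frozenset({"bytes"})


def validate_column_dtypes(column_dtypes):
    """Reject dtype overrides that a CSV reader cannot honour."""
    for column, dtype in column_dtypes.items():
        if dtype in UNSUPPORTED_CSV_DTYPES:
            raise ValueError(
                f"dtype {dtype!r} is not supported for CSV input "
                f"(column {column!r}); keep the column as 'string' "
                "and decode it in application code"
            )


# A supported override passes without side effects.
validate_column_dtypes({"id": "string"})

# A "bytes" override fails before any file is read.
try:
    validate_column_dtypes({"payload": "bytes"})
except ValueError as exc:
    print(exc)
```

Running the check before any rows are parsed means the error surfaces at configuration time rather than partway through a large file.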
Binary data in a CSV pipeline should be handled by:
- Keeping the column as string and decoding it explicitly in application code
  with the correct codec (bytes.fromhex(...), base64.b64decode(...), etc.).
- Switching to a binary-native format (Parquet, HDF5, Arrow IPC) and using a
  dedicated reader once one is available.
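The first option can be sketched as follows. The example uses the standard-library csv module as a stand-in for reading string columns, since the CsvReader call signature is not described here. The column names and sample data are assumptions made for illustration.

```python
import base64
import csv
import io

# Hypothetical CSV contents: "digest" is hex-encoded, "blob" is Base64-encoded.
raw = "id,digest,blob\n1,deadbeef,aGVsbG8=\n"

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    # The application knows each column's encoding, so it names the codec
    # explicitly instead of relying on inference.
    row["digest"] = bytes.fromhex(row["digest"])
    row["blob"] = base64.b64decode(row["blob"], validate=True)
    rows.append(row)

print(rows[0]["digest"], rows[0]["blob"])
```

Both bytes.fromhex and base64.b64decode with validate=True raise on malformed input, so a mislabelled column is more likely to fail loudly than to yield silently corrupted bytes.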
This decision does not affect the snapshot schema in ADR-0001; bytes
remains a valid dtype token for readers that can natively handle binary columns
(e.g. a future ParquetReader).
Consequences
- CsvReader raises ValueError if "bytes" appears in column_dtypes.
- Binary columns in CSV files that were previously kept as string continue to
  work unchanged. The dtype will still be inferred as string.
- The restriction is isolated to CsvReader; other readers are free to support
  bytes when their source format makes the encoding unambiguous.