Storage API
- class bocoel.corpora.storages.ConcatStorage(storages: Sequence[Storage], /)
Storage that concatenates multiple storages along the first dimension, i.e. row-wise. The resulting storage is read-only, and its length is the sum of the lengths of the underlying storages.
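The indexing behaviour described above can be illustrated without the library. The following is only a sketch of the idea, not the bocoel implementation: `ConcatRows` is a hypothetical class, and plain lists of dicts stand in for storages. It keeps cumulative end offsets and uses a binary search to map a global index to the part that holds it.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Any, List, Mapping, Sequence


class ConcatRows:
    """Hypothetical read-only view over several row sequences (not bocoel's class)."""

    def __init__(self, parts: Sequence[Sequence[Mapping[str, Any]]]) -> None:
        self._parts: List[Sequence[Mapping[str, Any]]] = list(parts)
        # Cumulative end offsets: part lengths [2, 3] become [2, 5].
        self._ends: List[int] = list(accumulate(len(p) for p in self._parts))

    def __len__(self) -> int:
        # Total length is the sum of the lengths of all parts.
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, idx: int) -> Mapping[str, Any]:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Locate the first part whose end offset is greater than idx.
        part = bisect_right(self._ends, idx)
        start = self._ends[part - 1] if part else 0
        return self._parts[part][idx - start]


first = [{"q": "a"}, {"q": "b"}]
second = [{"q": "c"}, {"q": "d"}, {"q": "e"}]
rows = ConcatRows([first, second])
```

Here `len(rows)` is 5 and `rows[2]` is the first row of `second`. No `__setitem__` is defined, which mirrors the read-only nature of the concatenated storage.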
- class bocoel.corpora.storages.DatasetsStorage(path: str, name: str | None = None, split: str | None = None)
Storage for datasets from the HuggingFace Datasets library. Datasets are read from disk, so they may be slower to load but are more memory efficient.
- class bocoel.corpora.storages.Storage(*args, **kwargs)
Storage is responsible for storing the data. This can be thought of as a table.
- __init__(*args, **kwargs)
- class bocoel.corpora.storages.PandasStorage(df: DataFrame, /)
Storage backed by a pandas DataFrame. Because DataFrames are held in memory, this storage is fast but can require a lot of RAM.
- classmethod from_jsonl_file(path: str | Path, /) → PandasStorage
Load data from a JSONL file.
- Parameters:
path – The path to the file.
- Returns:
A PandasStorage instance.
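For readers unfamiliar with the format, a JSONL file holds one JSON object per line, and each line becomes one row of the table. The sketch below shows that mapping using only the standard library. It is not how bocoel loads the file: `read_jsonl`, the sample file name, and the `question`/`answer` fields are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


def read_jsonl(path: Union[str, Path]) -> List[Dict[str, Any]]:
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Write a two-row sample file; the field names are made up for this example.
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"question": "2+2?", "answer": "4"}\n'
        '{"question": "3+3?", "answer": "6"}\n',
        encoding="utf-8",
    )
    records = read_jsonl(sample)
```

Each entry of `records` corresponds to one row, and the keys correspond to columns. With bocoel installed, the same file would be passed as `PandasStorage.from_jsonl_file(path)`.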