Storage API

class bocoel.corpora.storages.ConcatStorage(storages: Sequence[Storage], /)

Storage that concatenates multiple storages together. Concatenation is done on the first dimension. The resulting storage is read-only and has length equal to the sum of the lengths of the storages.
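The index arithmetic behind concatenation on the first dimension can be illustrated with a small, library-independent sketch. The `ConcatView` class and the sample rows below are hypothetical and are not bocoel's implementation; they only show how a global row index could be mapped to one of the underlying parts while keeping the combined view read-only.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Any, Mapping, Sequence

Row = Mapping[str, Any]


class ConcatView:
    """Hypothetical read-only view over several row sequences (illustration only)."""

    def __init__(self, parts: Sequence[Sequence[Row]]) -> None:
        self._parts = list(parts)
        # Cumulative end offsets: part lengths [2, 1] become [2, 3].
        self._ends = list(accumulate(len(p) for p in self._parts))

    def __len__(self) -> int:
        # Total length is the sum of the lengths of all parts.
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, idx: int) -> Row:
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Locate the first part whose end offset exceeds idx.
        part = bisect_right(self._ends, idx)
        start = self._ends[part - 1] if part else 0
        return self._parts[part][idx - start]

    # No __setitem__ is defined, so item assignment fails: the view is read-only.


first = [{"question": "a"}, {"question": "b"}]
second = [{"question": "c"}]
view = ConcatView([first, second])
```

Here `len(view)` is the sum of the part lengths, and index 2 resolves to the first row of `second`.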

__init__(storages: Sequence[Storage], /) -> None

class bocoel.corpora.storages.DatasetsStorage(path: str, name: str | None = None, split: str | None = None)

Storage for datasets loaded through the HuggingFace Datasets library. The data is kept on disk rather than in memory, so access may be slower, but memory use is lower.

__init__(path: str, name: str | None = None, split: str | None = None) -> None

class bocoel.corpora.storages.Storage(*args, **kwargs)

Storage is responsible for storing the data. This can be thought of as a table.

__init__(*args, **kwargs)

class bocoel.corpora.storages.PandasStorage(df: DataFrame, /)

Storage backed by a pandas DataFrame. Because DataFrames are held in memory, this storage is fast but can require a lot of RAM.

__init__(df: DataFrame, /) -> None

classmethod from_jsonl_file(path: str | Path, /) -> PandasStorage

Load data from a JSONL file.

Parameters:

path – The path to the file.

Returns:

A PandasStorage instance.

classmethod from_jsonl(data: Sequence[Mapping[str, str]], /) -> PandasStorage

Load data from parsed JSONL content, that is, a sequence of JSON objects with one mapping per record.

Parameters:

data – The parsed JSONL content: a sequence of JSON objects.

Returns:

A PandasStorage instance.
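To make the two JSONL entry points concrete, the sketch below writes a small JSONL file and parses it into a list of mappings using only the standard library. The `read_jsonl` helper, the file name, and the sample records are assumptions made for illustration; nothing here is taken from bocoel's source.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse a JSONL file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file with two records.
    path = Path(tmp) / "corpus.jsonl"
    path.write_text(
        '{"question": "2+2?", "answer": "4"}\n'
        '{"question": "3+3?", "answer": "6"}\n',
        encoding="utf-8",
    )
    records = read_jsonl(path)
```

The resulting `records` list matches the `Sequence[Mapping[str, str]]` shape in the `from_jsonl` signature, so it is the kind of value that method accepts. The file path itself is what `from_jsonl_file` takes.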