Corpus API
- class bocoel.corpora.corpora.Corpus(index: Index, storage: Storage)
Corpus is the entry point to handling the data in this library.
A corpus has 3 main components:
- Index: Searches one particular column in the storage. Provides fast retrieval.
- Storage: Stores the questions / answers / texts.
- Embedder: Embeds the text into vectors for faster access.
An index corresponds to exactly one key. To search over multiple keys, create a new column or a new corpus (with shared storage).
- classmethod index_storage(storage: Storage, embedder: Embedder, keys: Sequence[str], index_backend: type[Index], concat: Callable[[Iterable[Any]], str] = <built-in method join of str object>, **index_kwargs: Any) Corpus
Creates a corpus from the given storage, embedder, keys and index class, where storage entries are mapped to strings by concatenating the values of the given keys with concat.
- Parameters:
storage – The storage to index.
embedder – The embedder to use.
keys – The keys to use for the index.
index_backend – The index class to use.
concat – The function used to concatenate the values of the keys into a single string.
**index_kwargs – Additional arguments to pass to the index class.
- Returns:
The created corpus.
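The key-concatenation step described above can be sketched in isolation. This is a standalone illustration, not the library's implementation: `rows_to_text` and the column names are hypothetical, and the default separator is assumed to be a single space (the signature only shows that the default is a bound `str.join`).

```python
from collections.abc import Callable, Iterable, Mapping, Sequence
from typing import Any


def rows_to_text(
    rows: Sequence[Mapping[str, Any]],
    keys: Sequence[str],
    # Assumption: a space-separated join, chosen for illustration only.
    concat: Callable[[Iterable[Any]], str] = " ".join,
) -> list[str]:
    """Build one string per row by concatenating the values of `keys`."""
    return [concat(str(row[key]) for key in keys) for row in rows]


# Hypothetical storage rows with two text columns.
rows = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

# Default concat: values joined in the order given by `keys`.
texts = rows_to_text(rows, keys=["question", "answer"])

# Custom concat: any callable taking an iterable and returning a string.
piped = rows_to_text(rows, keys=["question", "answer"], concat=" | ".join)
```

Each resulting string is what an embedder would then encode, so the order of `keys` and the choice of `concat` both affect the vectors that end up in the index.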
- classmethod index_mapped(storage: Storage, embedder: Embedder, transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]], index_backend: type[Index], **index_kwargs: Any) Corpus
Creates a corpus from the given storage, embedder and index class, where storage entries are mapped to strings using the specified batched transform function.
- Parameters:
storage – The storage to index.
embedder – The embedder to use.
transform – The function to use to transform the storage entries.
index_backend – The index class to use.
**index_kwargs – Additional arguments to pass to the index class.
- Returns:
The created corpus.
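A batched transform matching the `Callable[[Mapping[str, Sequence[Any]]], Sequence[str]]` signature could look like the sketch below. It receives a mapping from column name to a sequence of values and returns one string per row. The function name `qa_transform` and the column names `question` and `answer` are hypothetical, chosen only for illustration.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def qa_transform(batch: Mapping[str, Sequence[Any]]) -> Sequence[str]:
    """Map a column-oriented batch to one string per row."""
    questions = batch["question"]
    answers = batch["answer"]
    # zip pairs up the i-th value of each column, i.e. one storage row.
    return [f"Q: {q} A: {a}" for q, a in zip(questions, answers)]


# A hypothetical batch of two rows, laid out column by column.
batch = {
    "question": ["What is 2 + 2?", "Capital of France?"],
    "answer": ["4", "Paris"],
}
texts = qa_transform(batch)
```

Such a function would be passed as the `transform` argument of `index_mapped`. Compared with `index_storage`, this allows arbitrary formatting per row rather than a plain concatenation of key values.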
- classmethod index_embeddings(storage: Storage, embeddings: ndarray[Any, dtype[_ScalarType_co]], index_backend: type[Index], **index_kwargs: Any) Corpus
Creates a corpus from the given storage, precomputed embeddings and index class. This can save time when the text is encoded once and the embeddings are cached.
- Parameters:
storage – The storage to use.
embeddings – The embeddings to use.
index_backend – The index class to use.
**index_kwargs – Additional arguments to pass to the index class.
- Returns:
The created corpus.
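One way to obtain embeddings that are encoded once and then cached is sketched below with NumPy's `np.save` and `np.load`. This is an assumption about how a caller might prepare the `embeddings` argument, not part of the library: `load_or_compute` and `fake_encode` are hypothetical helpers, and the random vectors stand in for a real embedder's output.

```python
import tempfile
from collections.abc import Callable
from pathlib import Path

import numpy as np


def load_or_compute(cache_path: Path, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Return cached embeddings if present, else compute and cache them."""
    if cache_path.exists():
        return np.load(cache_path)
    embeddings = np.asarray(compute(), dtype=np.float32)
    np.save(cache_path, embeddings)
    return embeddings


calls = []  # Records how many times the (expensive) encoder actually runs.


def fake_encode() -> np.ndarray:
    # Stand-in for a real embedder: 4 rows, 8 dimensions each.
    calls.append(1)
    return np.random.default_rng(0).normal(size=(4, 8))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.npy"
    first = load_or_compute(path, fake_encode)   # Encodes and writes the cache.
    second = load_or_compute(path, fake_encode)  # Reads the cache, no encoding.
```

The array returned here would be the value passed as `embeddings`. Its rows are presumably expected to line up one-to-one with the entries in `storage`, so the cache should be invalidated whenever the storage changes.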