Corpus API

class bocoel.corpora.corpora.ComposedCorpus(index: Index, storage: Storage)

Simply a collection of components.

index: Index

The index searches over one particular column of the storage, whose entries are embedded as vectors.

storage: Storage

Storage holds the questions, answers, and other texts. It can be viewed as a dataframe of texts.

classmethod index_storage(storage: Storage, embedder: Embedder, keys: Sequence[str], index_backend: type[Index], concat: Callable[[Iterable[Any]], str] = <built-in method join of str object>, **index_kwargs: Any) -> ComposedCorpus

Creates a corpus from the given storage, embedder, keys and index class, where each storage entry is mapped to a string by concatenating the values at the given keys with concat.

Parameters:
  • storage – The storage to index.

  • embedder – The embedder to use.

  • keys – The keys to use for the index.

  • index_backend – The index class to use.

  • concat – The function to use to concatenate the keys.

  • **index_kwargs – Additional arguments to pass to the index class.

Returns:

The created corpus.
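
As a rough illustration of how keys and concat interact, the sketch below joins the selected columns of each row into one string, which is the text that would then be embedded. It does not call bocoel: the `rows_to_strings` helper, the sample rows, and the column names are hypothetical, and the space separator is an assumption, since the signature above does not show which string the default join is bound to.

```python
from collections.abc import Callable, Iterable, Mapping, Sequence
from typing import Any


def rows_to_strings(
    rows: Sequence[Mapping[str, Any]],
    keys: Sequence[str],
    concat: Callable[[Iterable[Any]], str] = " ".join,  # assumed separator
) -> list[str]:
    """Hypothetical helper: pick `keys` from each row and join them with `concat`."""
    return [concat(str(row[key]) for key in keys) for row in rows]


# Hypothetical storage rows with two text columns.
rows = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

texts = rows_to_strings(rows, keys=["question", "answer"])
print(texts)
```

Passing a different callable, for example `" | ".join`, changes only the separator; the set of columns is still controlled by keys.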

classmethod index_mapped(storage: Storage, embedder: Embedder, transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]], index_backend: type[Index], **index_kwargs: Any) -> ComposedCorpus

Creates a corpus from the given storage, embedder and index class, where storage entries are mapped to strings using the specified batched transform function.

Parameters:
  • storage – The storage to index.

  • embedder – The embedder to use.

  • transform – The function to use to transform the storage entries.

  • index_backend – The index class to use.

  • **index_kwargs – Additional arguments to pass to the index class.

Returns:

The created corpus.
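
The transform type, `Callable[[Mapping[str, Sequence[Any]]], Sequence[str]]`, suggests a function that receives a batch as a mapping from column name to a sequence of values and returns one string per row. A minimal sketch under that reading follows; `qa_transform` and the `question` / `answer` column names are hypothetical and not part of bocoel.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def qa_transform(batch: Mapping[str, Sequence[Any]]) -> Sequence[str]:
    """Hypothetical batched transform: one output string per row in the batch."""
    questions = batch["question"]
    answers = batch["answer"]
    # Columns are parallel sequences, so zip walks the batch row by row.
    return [f"Q: {q} A: {a}" for q, a in zip(questions, answers)]


# A column-oriented batch with two rows (hypothetical data).
batch = {
    "question": ["What is 2 + 2?", "Capital of France?"],
    "answer": ["4", "Paris"],
}
print(qa_transform(batch))
```

A function of this shape would be the value passed as transform, giving full control over how several columns are combined before embedding.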

classmethod index_embeddings(storage: Storage, embeddings: ndarray[Any, dtype[_ScalarType_co]], index_backend: type[Index], **index_kwargs: Any) -> ComposedCorpus

Creates a corpus from precomputed embeddings. This saves time when the texts are encoded once and the embeddings are cached.

Parameters:
  • storage – The storage to use.

  • embeddings – The embeddings to use.

  • index_backend – The index class to use.

  • **index_kwargs – Additional arguments to pass to the index class.

Returns:

The created corpus.

__init__(index: Index, storage: Storage) -> None

class bocoel.corpora.corpora.Corpus(*args, **kwargs)

Corpus is the entry point to handling the data in this library.

A corpus has 3 main components:
  • Index: Searches one particular column in the storage. Provides fast retrieval.

  • Storage: Used to store the questions / answers / texts.

  • Embedder: Embeds the text into vectors for faster access.

An index only corresponds to one key. If search over multiple keys is desired, a new column or a new corpus (with shared storage) should be created.
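
The one-index-per-key rule can be pictured with a toy model: two corpora share the same storage, and each holds an index built over a different column. Everything in this sketch (`toy_embed`, `ToyIndex`, `ToyCorpus`, `build`) is a hypothetical stand-in written with the standard library only; it is not how bocoel implements its indices or embedders.

```python
import math
from dataclasses import dataclass


def toy_embed(text: str) -> list[float]:
    """Toy embedder (assumption): unit-normalized letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


@dataclass
class ToyIndex:
    """Brute-force index over the vectors of a single storage column."""
    key: str
    vectors: list[list[float]]

    def search(self, query: str) -> int:
        """Return the row number whose vector is most similar to the query."""
        q = toy_embed(query)
        scores = [sum(a * b for a, b in zip(q, v)) for v in self.vectors]
        return max(range(len(scores)), key=scores.__getitem__)


@dataclass
class ToyCorpus:
    index: ToyIndex
    storage: list[dict[str, str]]


def build(storage: list[dict[str, str]], key: str) -> ToyCorpus:
    """Embed one column and wrap it, together with the storage, in a corpus."""
    vectors = [toy_embed(row[key]) for row in storage]
    return ToyCorpus(index=ToyIndex(key=key, vectors=vectors), storage=storage)


storage = [
    {"question": "zebra", "answer": "apple"},
    {"question": "apple", "answer": "zebra"},
]

# Two corpora, one per key, backed by the same storage object.
by_question = build(storage, "question")
by_answer = build(storage, "answer")

print(by_question.index.search("zebra"))  # row matched on the question column
print(by_answer.index.search("zebra"))    # row matched on the answer column
```

The same query resolves to different rows depending on which column the index was built over, while both corpora read from one shared storage.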

storage: Storage

Storage holds the questions, answers, and other texts. It can be viewed as a dataframe of texts.

__init__(*args, **kwargs)

index: Index

The index searches over one particular column of the storage, whose entries are embedded as vectors.