Embedder API#

class bocoel.corpora.embedders.EnsembleEmbedder(embedders: Sequence[Embedder], sequential: bool = False)[source]#

An ensemble of embedders. The embeddings produced by the individual embedders are concatenated.

__init__(embedders: Sequence[Embedder], sequential: bool = False) None[source]#
Parameters:
  • embedders – The embedders to use.

  • sequential – Whether to use sequential processing.

Raises:

ValueError – If the embedders have different batch sizes.

property batch: int#

The batch size to use when encoding.

property dims: int#

The dimensions of the embeddings.
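
A minimal usage sketch. The member embedders here are SbertEmbedder instances with assumed model names; any two Embedder instances with matching batch sizes work. The dims arithmetic in the final comment assumes concatenation sums the members' dimensions:

    from bocoel.corpora.embedders import EnsembleEmbedder, SbertEmbedder

    # Two embedders with the same batch size; their outputs are concatenated.
    ensemble = EnsembleEmbedder(
        embedders=[
            SbertEmbedder(model_name="all-mpnet-base-v2", batch_size=64),
            SbertEmbedder(model_name="all-MiniLM-L6-v2", batch_size=64),
        ],
        sequential=True,  # process with the embedders one at a time
    )
    # Presumably the sum of the members' dims (768 + 384 for these models).
    print(ensemble.dims)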

class bocoel.corpora.embedders.HuggingfaceEmbedder(path: str, device: str = 'cpu', batch_size: int = 64, transform: ~collections.abc.Callable[[~typing.Any], ~torch.Tensor] = <function HuggingfaceEmbedder.<lambda>>)[source]#

Huggingface embedder. Uses the transformers library. It is not a traditional encoder: it runs a classifier and uses the output logits as embeddings.

__init__(path: str, device: str = 'cpu', batch_size: int = 64, transform: ~collections.abc.Callable[[~typing.Any], ~torch.Tensor] = <function HuggingfaceEmbedder.<lambda>>) None[source]#

Initializes the Huggingface embedder.

Parameters:
  • path – The path to the model.

  • device – The device to use.

  • batch_size – The batch size for encoding.

  • transform – The transformation function to use.

Raises:
  • ImportError – If transformers is not installed.

  • ValueError – If the model does not have a config.id2label attribute.

property batch: int#

The batch size to use when encoding.

property dims: int#

The dimensions of the embeddings.
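
A usage sketch. The checkpoint path is an assumption and stands in for any transformers classification model whose config defines id2label:

    from bocoel.corpora.embedders import HuggingfaceEmbedder

    # Hypothetical classifier checkpoint; the classifier logits serve
    # as the embedding, so dims equals the number of labels.
    embedder = HuggingfaceEmbedder(
        path="distilbert-base-uncased-finetuned-sst-2-english",
        device="cpu",
        batch_size=32,
    )
    vectors = embedder.encode(["an example sentence"])
    assert vectors.shape == (1, embedder.dims)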

class bocoel.corpora.embedders.Embedder(*args, **kwargs)[source]#

Embedders are responsible for encoding text into vectors. Embedders in this project are considered volatile because encoding costs CPU time, unless a database with built-in encoding capability is used.

encode_storage(storage: Storage, /, transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]]) ndarray[Any, dtype[_ScalarType_co]][source]#

Encodes the storage into embeddings.

Parameters:
  • storage – The storage to encode.

  • transform – The transformation function to use.

Returns:

The encoded embeddings. The shape must be [len(storage), self.dims].
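
For example, given an embedder and a storage already constructed, a transform might join two columns into one string per row. The column names "question" and "answer" below are hypothetical; substitute whatever keys the storage actually exposes:

    from collections.abc import Mapping, Sequence
    from typing import Any

    def transform(batch: Mapping[str, Sequence[Any]]) -> Sequence[str]:
        # "question" and "answer" are hypothetical column names.
        return [f"{q} {a}" for q, a in zip(batch["question"], batch["answer"])]

    embeddings = embedder.encode_storage(storage, transform=transform)
    assert embeddings.shape == (len(storage), embedder.dims)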

encode(text: Sequence[str], /) ndarray[Any, dtype[_ScalarType_co]][source]#

Encodes the given text and performs shape checks on the result. The text is encoded in batches of size batch.

Parameters:

text – The text to encode.

Returns:

The encoded embeddings. The shape must be [len(text), self.dims].
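
Usage follows directly from the shape contract above, assuming an embedder instance is in scope:

    vectors = embedder.encode(["hello world", "goodbye world"])
    assert vectors.shape == (2, embedder.dims)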

abstract property batch: int#

The batch size to use when encoding.

abstract property dims: int#

The dimensions of the embeddings.

__init__(*args, **kwargs)#

class bocoel.corpora.embedders.SbertEmbedder(model_name: str = 'all-mpnet-base-v2', device: str = 'cpu', batch_size: int = 64)[source]#

Sentence-BERT embedder. Uses the sentence_transformers library.

__init__(model_name: str = 'all-mpnet-base-v2', device: str = 'cpu', batch_size: int = 64) None[source]#

Initializes the Sbert embedder.

Parameters:
  • model_name – The model name to use.

  • device – The device to use.

  • batch_size – The batch size for encoding.

Raises:

ImportError – If sentence_transformers is not installed.

property batch: int#

The batch size to use when encoding.

property dims: int#

The dimensions of the embeddings.
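
A minimal usage sketch with the default arguments shown above; requires sentence_transformers to be installed:

    from bocoel.corpora.embedders import SbertEmbedder

    embedder = SbertEmbedder(model_name="all-mpnet-base-v2", device="cpu", batch_size=64)
    vectors = embedder.encode(["sentence embeddings via sentence_transformers"])
    assert vectors.shape == (1, embedder.dims)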