Embedder API#
- class bocoel.corpora.embedders.EnsembleEmbedder(embedders: Sequence[Embedder], sequential: bool = False)[source]#
An ensemble of embedders. The embeddings from each member embedder are concatenated together.
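A minimal construction sketch. The Sentence-BERT model names below are illustrative choices; any Embedder instances could be used as members.

```python
from bocoel.corpora.embedders import EnsembleEmbedder, SbertEmbedder

# Two member embedders; the model names are illustrative examples.
small = SbertEmbedder(model_name="all-MiniLM-L6-v2", device="cpu")
large = SbertEmbedder(model_name="all-mpnet-base-v2", device="cpu")

# The ensemble concatenates the member embeddings, so its dimensionality
# is the sum of the members' dimensions.
ensemble = EnsembleEmbedder([small, large])

vectors = ensemble.encode(["a first sentence", "a second sentence"])
# vectors has shape [2, dims], where dims = small dims + large dims.
```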
- class bocoel.corpora.embedders.HuggingfaceEmbedder(path: str, device: str = 'cpu', batch_size: int = 64, transform: ~collections.abc.Callable[[~typing.Any], ~torch.Tensor] = <function HuggingfaceEmbedder.<lambda>>)[source]#
Huggingface embedder that uses the transformers library. Rather than a traditional text encoder, it runs a classification model and uses the output logits as the embedding vector. See the usage sketch after this entry.
- __init__(path: str, device: str = 'cpu', batch_size: int = 64, transform: ~collections.abc.Callable[[~typing.Any], ~torch.Tensor] = <function HuggingfaceEmbedder.<lambda>>) None [source]#
Initializes the Huggingface embedder.
- Parameters:
path – The path to the model.
device – The device to use.
batch_size – The batch size for encoding.
transform – The transformation function to use.
- Raises:
ImportError – If transformers is not installed.
ValueError – If the model does not have a config.id2label attribute.
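A minimal usage sketch, assuming transformers is installed. The model path is an illustrative sequence-classification checkpoint; per the constructor contract above, any checkpoint whose config defines id2label should work.

```python
from bocoel.corpora.embedders import HuggingfaceEmbedder

# Illustrative classifier checkpoint; its config defines id2label,
# which the constructor requires.
embedder = HuggingfaceEmbedder(
    path="distilbert-base-uncased-finetuned-sst-2-english",
    device="cpu",
    batch_size=32,
)

embeddings = embedder.encode(["the movie was great", "the movie was terrible"])
# Per the Embedder contract, embeddings has shape [len(text), dims];
# here dims presumably corresponds to the classifier's number of labels.
```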
- class bocoel.corpora.embedders.Embedder(*args, **kwargs)[source]#
Embedders are responsible for encoding text into vectors. Embedders in this project are considered volatile because encoding costs CPU time, unless a database with built-in encoding capability is used. See the usage sketch after this entry.
- encode_storage(storage: Storage, /, transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]]) ndarray[Any, dtype[_ScalarType_co]] [source]#
Encodes the storage into embeddings.
- Parameters:
storage – The storage to encode.
transform – The transformation function to use.
- Returns:
The encoded embeddings. The shape must be [len(storage), self.dims].
- encode(text: Sequence[str], /) ndarray[Any, dtype[_ScalarType_co]] [source]#
Calls the underlying encoding function and performs sanity checks on the output. The text is encoded in batches where possible.
- Parameters:
text – The text to encode.
- Returns:
The encoded embeddings. The shape must be [len(text), self.dims].
- __init__(*args, **kwargs)#
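A sketch of how encode_storage is typically wired up, assuming a bocoel Storage has already been constructed elsewhere. The "question" column name and the embed_corpus helper are hypothetical and only illustrate the shape of the transform callable, which maps a batch of storage columns to the strings to embed.

```python
from collections.abc import Mapping, Sequence
from typing import Any

from numpy.typing import NDArray

from bocoel.corpora.embedders import Embedder


def questions(batch: Mapping[str, Sequence[Any]]) -> Sequence[str]:
    # Select the storage column to embed.
    # "question" is a hypothetical column name used for illustration.
    return [str(item) for item in batch["question"]]


def embed_corpus(embedder: Embedder, storage) -> NDArray:
    # `storage` is assumed to be a previously built bocoel Storage.
    # The transform turns each batch (a mapping of column name -> values)
    # into the sequence of strings to encode; the result is an array
    # of shape [len(storage), dims].
    return embedder.encode_storage(storage, transform=questions)
```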
- class bocoel.corpora.embedders.SbertEmbedder(model_name: str = 'all-mpnet-base-v2', device: str = 'cpu', batch_size: int = 64)[source]#
Sentence-BERT embedder. Uses the sentence_transformers library.
- __init__(model_name: str = 'all-mpnet-base-v2', device: str = 'cpu', batch_size: int = 64) None [source]#
Initializes the Sbert embedder.
- Parameters:
model_name – The model name to use.
device – The device to use.
batch_size – The batch size for encoding.
- Raises:
ImportError – If sentence_transformers is not installed.
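A minimal usage sketch, assuming sentence_transformers is installed; the defaults load the all-mpnet-base-v2 checkpoint.

```python
from bocoel.corpora.embedders import SbertEmbedder

# Uses the default all-mpnet-base-v2 checkpoint on CPU.
embedder = SbertEmbedder(device="cpu", batch_size=64)

vectors = embedder.encode(["how tall is mount everest?", "who wrote hamlet?"])
# vectors has shape [2, dims]; all-mpnet-base-v2 produces 768-dimensional embeddings.
```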