Index API

Indices provide fast nearest neighbor search over embeddings. Some implementations also transform the embeddings before indexing.

The module provides a few index implementations:

FaissIndex: Uses the Faiss library for fast nearest neighbor search.
HnswlibIndex: Uses the hnswlib library for fast nearest neighbor search.
PolarIndex: Transforms spatial coordinates into polar coordinates for indexing.
WhiteningIndex: Whitens the data before indexing.
InverseCDFIndex: Maps the fixed range [0, 1) through an inverse CDF before indexing.

class bocoel.corpora.indices.FaissIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, index_string: str, cuda: bool = False, batch_size: int = 64)[source]#

Faiss index. Uses the faiss library.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, index_string: str, cuda: bool = False, batch_size: int = 64) → None[source]#

Initializes the Faiss index.

Parameters:

embeddings – The embeddings to index.
distance – The distance metric to use.
normalize – Whether to normalize the embeddings.
index_string – The index string to use.
cuda – Whether to use CUDA.
batch_size – The batch size to use for searching.

property batch: int#

The batch size used for searching.

Returns:: The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]#

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:: The data.

property distance: Distance#

The distance metric used by the index.

Returns:: The distance metric.

property dims: int#

The number of dimensions a query vector must have.

Returns:: The number of dimensions.

class bocoel.corpora.indices.HnswlibIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, threads: int = -1, batch_size: int = 64)[source]#

HNSWLIB index. Uses the hnswlib library.

Scores are calculated slightly differently; see nmslib/hnswlib.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, threads: int = -1, batch_size: int = 64) → None[source]#

Initializes the HNSWLIB index.

Parameters:

embeddings – The embeddings to index.
distance – The distance metric to use.
normalize – Whether to normalize the embeddings.
threads – The number of threads to use.
batch_size – The batch size to use for searching.

Raises:

ValueError – If the distance is not supported.

property batch: int#

The batch size used for searching.

Returns:: The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]#

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:: The data.

property distance: Distance#

The distance metric used by the index.

Returns:: The distance metric.

class bocoel.corpora.indices.Boundary(bounds: ndarray[Any, dtype[_ScalarType_co]])[source]#

The boundary of embeddings in a corpus. The boundary is defined as a hyperrectangle in the embedding space.

bounds: ndarray[Any, dtype[_ScalarType_co]]#: The boundary arrays of the corpus. Must be of shape [dims, 2], where dims is the number of dimensions. The first column is the lower bound, the second column is the upper bound.

property dims: int#: The number of dimensions.

property lower: ndarray[Any, dtype[_ScalarType_co]]#: The lower bounds. Must be of shape [dims].

property upper: ndarray[Any, dtype[_ScalarType_co]]#: The upper bounds. Must be of shape [dims].

__init__(bounds: ndarray[Any, dtype[_ScalarType_co]]) → None#

classmethod fixed(lower: float, upper: float, dims: int) → Boundary[source]#

Create a fixed boundary for all dimensions.

Parameters:

lower – The lower bound.
upper – The upper bound.
dims – The number of dimensions.

Returns:

A Boundary instance.

Raises:

ValueError – If lower > upper.
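As a rough illustration of the hyperrectangle described above, the following sketch mirrors the documented attributes (`bounds`, `dims`, `lower`, `upper`) and the `fixed` constructor. `BoundarySketch` and its `contains` helper are hypothetical names made up for this example; this is not the library's implementation.

```python
import numpy as np


class BoundarySketch:
    """Hypothetical stand-in for Boundary: a hyperrectangle stored as [dims, 2]."""

    def __init__(self, bounds: np.ndarray) -> None:
        bounds = np.asarray(bounds, dtype=float)
        if bounds.ndim != 2 or bounds.shape[1] != 2:
            raise ValueError(f"Expected shape [dims, 2], got {bounds.shape}")
        self.bounds = bounds

    @property
    def dims(self) -> int:
        return self.bounds.shape[0]

    @property
    def lower(self) -> np.ndarray:
        # First column holds the lower bounds.
        return self.bounds[:, 0]

    @property
    def upper(self) -> np.ndarray:
        # Second column holds the upper bounds.
        return self.bounds[:, 1]

    @classmethod
    def fixed(cls, lower: float, upper: float, dims: int) -> "BoundarySketch":
        # Same bounds for every dimension; rejects an inverted interval.
        if lower > upper:
            raise ValueError("lower must not exceed upper")
        return cls(np.array([[lower, upper]] * dims))

    def contains(self, query: np.ndarray) -> bool:
        # Hypothetical helper: True if every coordinate lies within its bounds.
        return bool(np.all((query >= self.lower) & (query <= self.upper)))


unit_box = BoundarySketch.fixed(lower=-1.0, upper=1.0, dims=3)
```

Here `unit_box` corresponds to the default [-1, 1] range that the index documentation mentions for normalized embeddings.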

class bocoel.corpora.indices.Distance(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Distance metrics.

L2 = 'L2'#: L2 distance. Also known as Euclidean distance.

INNER_PRODUCT = 'IP'#: Inner product distance. When normalized, this is equivalent to cosine similarity.
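The relationship between the two metrics can be checked numerically. The sketch below uses plain NumPy rather than the library; the helper names are assumptions chosen for the example. It shows that the inner product of normalized vectors equals their cosine similarity, and that for unit vectors the squared L2 distance is `2 - 2 * inner_product`.

```python
import numpy as np


def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Euclidean distance between two vectors.
    return float(np.linalg.norm(a - b))


def inner_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Normalize to unit length, as an index with normalize=True would be expected to do.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

ip_normalized = inner_product(a_unit, b_unit)
cos_raw = cosine_similarity(a, b)
l2_normalized = l2_distance(a_unit, b_unit)
```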

class bocoel.corpora.indices.Index(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, **kwargs: Any)[source]#

Index is responsible for fast retrieval given a vector query.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, **kwargs: Any) → None[source]#

Calls the search function and performs some checks.

Parameters:

query – The query vector. Must be of shape [batch, query_dims].
k – The number of nearest neighbors to return.

Returns:

A SearchResultBatch instance. See SearchResultBatch for details.
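To make the search contract concrete, here is a minimal brute-force sketch of a batched k-nearest-neighbor lookup with L2 distance. It does not use the library: `BatchResultSketch` and `brute_force_search` are hypothetical names, and the shapes given for `vectors`, `distances`, and `indices` are assumptions (only the `[batch, query_dims]` query shape comes from the documentation above).

```python
from typing import NamedTuple

import numpy as np


class BatchResultSketch(NamedTuple):
    """Hypothetical result container; field shapes are assumptions."""

    query: np.ndarray      # [batch, dims]
    vectors: np.ndarray    # [batch, k, dims]
    distances: np.ndarray  # [batch, k]
    indices: np.ndarray    # [batch, k]


def brute_force_search(
    embeddings: np.ndarray, query: np.ndarray, k: int = 1
) -> BatchResultSketch:
    # Basic checks, in the spirit of "performs some checks".
    if query.ndim != 2 or query.shape[1] != embeddings.shape[1]:
        raise ValueError("query must have shape [batch, dims]")
    if not 1 <= k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")

    # Pairwise L2 distances between every query and every embedding: [batch, n].
    diffs = query[:, None, :] - embeddings[None, :, :]
    all_distances = np.linalg.norm(diffs, axis=-1)

    # Indices of the k closest embeddings per query, nearest first.
    indices = np.argsort(all_distances, axis=1)[:, :k]
    distances = np.take_along_axis(all_distances, indices, axis=1)
    vectors = embeddings[indices]
    return BatchResultSketch(query, vectors, distances, indices)


embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
result = brute_force_search(embeddings, np.array([[0.9, 0.2]]), k=2)
```

Real backends such as Faiss or hnswlib replace the exhaustive distance computation with an approximate or accelerated structure, but the inputs and outputs follow the same idea.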

abstract property data: ndarray[Any, dtype[_ScalarType_co]]#

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:: The data.

abstract property batch: int#

The batch size used for searching.

Returns:: The batch size.

property boundary: Boundary#

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:: The boundary of the input.

abstract property distance: Distance#

The distance metric used by the index.

Returns:: The distance metric.

property dims: int#

The number of dimensions a query vector must have.

Returns:: The number of dimensions.

class bocoel.corpora.indices.InternalResult(distances, indices)[source]#

distances: ndarray[Any, dtype[_ScalarType_co]]#: Calculated distance.

indices: ndarray[Any, dtype[_ScalarType_co]]#: Index in the original embeddings. Must be integers.

class bocoel.corpora.indices.SearchResult(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]])[source]#

A non-batched version of search result.

__init__(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]]) → None#

class bocoel.corpora.indices.SearchResultBatch(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]])[source]#

A batched version of search result.

__init__(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]]) → None#

class bocoel.corpora.indices.PolarIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, polar_backend: type[Index], **backend_kwargs: Any)[source]#

Index that uses N-sphere coordinates as its interface. See the Wikipedia article linked below for details.

Converting the spatial coordinates into spherical coordinates has the following benefits:

Since the coordinates are normalized, the radius is always 1.
The search region is rectangular in spherical coordinates, which is ideal for Bayesian optimization.

[Wikipedia link on N-sphere](https://en.wikipedia.org/wiki/N-sphere#Spherical_coordinates)

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, polar_backend: type[Index], **backend_kwargs: Any) → None[source]#

Parameters:

embeddings – The embeddings to index.
distance – The distance metric to use.
polar_backend – The backend to use for indexing.
**backend_kwargs – The backend specific keyword arguments.

property batch: int#

The batch size used for searching.

Returns:: The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]#

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:: The data.

property distance: Distance#

The distance metric used by the index.

Returns:: The distance metric.

property boundary: Boundary#

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:: The boundary of the input.

Convert N-sphere coordinates to Cartesian coordinates. See the Wikipedia article linked in the class documentation for details.

Parameters:

r – The radius of the N-sphere. Has the shape [N].
theta – The angles of the N-sphere. Has the shape [N, D].

Returns:

The Cartesian coordinates of the N-sphere.

Convert Cartesian coordinates to N-sphere coordinates. See the Wikipedia article linked in the class documentation for details.

Parameters:: x – The Cartesian coordinates. Has the shape [N, D].
Returns:: A tuple. The radius and the angles of the N-sphere.
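The two conversions can be sketched with NumPy, following the formulas in the Wikipedia article on N-sphere spherical coordinates. The function names below are hypothetical, and the sketch assumes that D angles map to D + 1 Cartesian coordinates; the library's own implementation may differ in conventions and edge-case handling.

```python
import numpy as np


def polar_to_cartesian(r: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """r: [N], theta: [N, D] angles -> Cartesian coordinates [N, D + 1]."""
    n = theta.shape[0]
    # Running products of sines: 1, sin(t1), sin(t1) sin(t2), ...
    sin_prefix = np.concatenate(
        [np.ones((n, 1)), np.cumprod(np.sin(theta), axis=1)], axis=1
    )
    # Each coordinate ends with one cosine, except the last one.
    cos_suffix = np.concatenate([np.cos(theta), np.ones((n, 1))], axis=1)
    return r[:, None] * sin_prefix * cos_suffix


def cartesian_to_polar(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """x: [N, D + 1] -> (r: [N], theta: [N, D])."""
    r = np.linalg.norm(x, axis=1)
    # tail[:, i] = sqrt(x_i^2 + x_{i+1}^2 + ... + x_last^2)
    tail = np.sqrt(np.cumsum((x**2)[:, ::-1], axis=1)[:, ::-1])
    theta = np.arctan2(tail[:, 1:], x[:, :-1])
    # The last angle keeps the sign of the final coordinate (full circle).
    theta[:, -1] = np.arctan2(x[:, -1], x[:, -2])
    return r, theta


x = np.array([[1.0, 2.0, -2.0], [0.5, -0.5, 0.7], [0.0, 0.0, 1.0]])
r, theta = cartesian_to_polar(x)
restored = polar_to_cartesian(r, theta)
```

With unit-norm embeddings, `r` is always 1, so only the angles need to be searched, and their valid ranges form a rectangular region.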

class bocoel.corpora.indices.InverseCDFIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, distribution: str | Distribution = Distribution.NORMAL, inverse_cdf_backend: type[Index], **backend_kwargs: Any)[source]#

An index that maps queries from the fixed range [0, 1) through an inverse cumulative distribution function (CDF) before searching the indexed embeddings.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, distribution: str | Distribution = Distribution.NORMAL, inverse_cdf_backend: type[Index], **backend_kwargs: Any) → None[source]#

Parameters:

embeddings – The embeddings to index.
distance – The distance metric to use.
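distribution – The distribution whose inverse CDF is applied. Defaults to Distribution.NORMAL.
inverse_cdf_backend – The backend to use for indexing.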
**backend_kwargs – The backend specific keyword arguments.
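The idea of mapping a unit-range query through an inverse CDF can be illustrated with the standard library alone. This is a sketch under assumptions: it uses a standard normal distribution (mean 0, standard deviation 1), and the `inverse_cdf_query` name and the `eps` clamping are made up for the example; the library's parameters and edge-case handling may differ.

```python
from statistics import NormalDist


def inverse_cdf_query(unit_query: list[float], eps: float = 1e-9) -> list[float]:
    """Map each coordinate in [0, 1) to the real line via the normal inverse CDF."""
    dist = NormalDist(mu=0.0, sigma=1.0)
    # inv_cdf is undefined at exactly 0 and 1, so clamp into (0, 1).
    return [dist.inv_cdf(min(max(u, eps), 1.0 - eps)) for u in unit_query]


mapped = inverse_cdf_query([0.5, 0.975, 0.025])
edge = inverse_cdf_query([0.0])
```

A query of 0.5 lands at the distribution's median, while values near 0 or 1 map far into the tails, so a uniform search box in [0, 1) covers the embedding space in proportion to the assumed density.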

property batch: int#

The batch size used for searching.

Returns:: The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]#

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:: The data.

property distance: Distance#

The distance metric used by the index.

Returns:: The distance metric.

property dims: int#

The number of dimensions a query vector must have.

Returns:: The number of dimensions.

property boundary: Boundary#

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:: The boundary of the input.

class bocoel.corpora.indices.WhiteningIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, reduced: int, whitening_backend: type[Index], **backend_kwargs: Any)[source]#

Whitening index. Whitens the data before indexing. See https://arxiv.org/abs/2103.15316 for more info.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, reduced: int, whitening_backend: type[Index], **backend_kwargs: Any) → None[source]#

Initializes the whitening index.

Parameters:

embeddings – The embeddings to index.
distance – The distance metric to use.
reduced – The reduced dimensionality. No-op if larger than the embedding dimensionality.
whitening_backend – The backend to use for indexing.
**backend_kwargs – The backend specific keyword arguments.
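A minimal NumPy sketch of whitening with dimensionality reduction follows: center the data, decompose the covariance matrix, scale by the inverse square root of the singular values, and keep the first `reduced` columns. The `whiten` function name is hypothetical, and this is an illustration of the general technique, not the library's code.

```python
import numpy as np


def whiten(embeddings: np.ndarray, reduced: int) -> np.ndarray:
    """Whiten [n, dims] embeddings and keep at most `reduced` dimensions."""
    mean = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mean
    # Covariance across features: [dims, dims].
    cov = np.cov(centered, rowvar=False)
    u, s, _ = np.linalg.svd(cov)
    # Whitening kernel: rotate onto principal axes, then rescale to unit variance.
    kernel = u @ np.diag(1.0 / np.sqrt(s))
    # Truncation is a no-op when `reduced` exceeds the original dimensionality.
    k = min(reduced, embeddings.shape[1])
    return centered @ kernel[:, :k]


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4)) * np.array([1.0, 2.0, 3.0, 4.0]) + 5.0
whitened = whiten(data, reduced=2)
unchanged_dims = whiten(data, reduced=10)
```

After whitening, the retained dimensions have zero mean and an identity covariance matrix, which is why the data exposed by the index may have a different dimensionality than the input embeddings.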

property batch: int#

The batch size used for searching.

Returns:: The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]#

Returns the data. This does not necessarily have the same dimensionality as the original transformed embeddings.

Returns:: The data.

property distance: Distance#

The distance metric used by the index.

Returns:: The distance metric.

property boundary: Boundary#

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:: The boundary of the input.
