Index API

Indices are used for fast nearest neighbor search. Optionally, they may also transform the embeddings before indexing.

The module provides a few index implementations:

  • FaissIndex: Uses the Faiss library for fast nearest neighbor search.

  • HnswlibIndex: Uses the hnswlib library for fast nearest neighbor search.

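  • PolarIndex: Transforms spatial coordinates into polar coordinates for indexing.

  • InverseCDFIndex: Maps the fixed range [0, 1) through an inverse cumulative distribution function (CDF) to index embeddings.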

  • WhiteningIndex: Whitens the data before indexing.

class bocoel.corpora.indices.FaissIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, index_string: str, cuda: bool = False, batch_size: int = 64)

Faiss index. Uses the faiss library.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, index_string: str, cuda: bool = False, batch_size: int = 64) -> None

Initializes the Faiss index.

Parameters:
  • embeddings – The embeddings to index.

  • distance – The distance metric to use.

  • normalize – Whether to normalize the embeddings.

  • index_string – The index string to use.

  • cuda – Whether to use CUDA.

  • batch_size – The batch size to use for searching.

property batch: int

The batch size used for searching.

Returns:

The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:

The data.

property distance: Distance

The distance metric used by the index.

Returns:

The distance metric.

property dims: int

The number of dimensions that the query vector should be.

Returns:

The number of dimensions.

class bocoel.corpora.indices.HnswlibIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, threads: int = -1, batch_size: int = 64)

HNSWLIB index. Uses the hnswlib library.

Scores are calculated slightly differently; see nmslib/hnswlib for details.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, threads: int = -1, batch_size: int = 64) -> None

Initializes the HNSWLIB index.

Parameters:
  • embeddings – The embeddings to index.

  • distance – The distance metric to use.

  • normalize – Whether to normalize the embeddings.

  • threads – The number of threads to use.

  • batch_size – The batch size to use for searching.

Raises:

ValueError – If the distance is not supported.

property batch: int

The batch size used for searching.

Returns:

The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:

The data.

property distance: Distance

The distance metric used by the index.

Returns:

The distance metric.

class bocoel.corpora.indices.Boundary(bounds: ndarray[Any, dtype[_ScalarType_co]])

The boundary of embeddings in a corpus. The boundary is defined as a hyperrectangle in the embedding space.

bounds: ndarray[Any, dtype[_ScalarType_co]]

The boundary array of the corpus. Must be of shape [dims, 2], where dims is the number of dimensions. The first column holds the lower bounds and the second column holds the upper bounds.

property dims: int

The number of dimensions.

property lower: ndarray[Any, dtype[_ScalarType_co]]

The lower bounds. Has the shape [dims].

property upper: ndarray[Any, dtype[_ScalarType_co]]

The upper bounds. Has the shape [dims].

__init__(bounds: ndarray[Any, dtype[_ScalarType_co]]) -> None

classmethod fixed(lower: float, upper: float, dims: int) -> Boundary

Create a fixed boundary for all dimensions.

Parameters:
  • lower – The lower bound.

  • upper – The upper bound.

  • dims – The number of dimensions.

Returns:

A Boundary instance.

Raises:

ValueError – If lower > upper.
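The bounds layout described above can be illustrated with a NumPy-only sketch. The helper name fixed_bounds is hypothetical; it only mirrors the documented behavior of Boundary.fixed and is not the library's implementation.

```python
import numpy as np


def fixed_bounds(lower: float, upper: float, dims: int) -> np.ndarray:
    """Hypothetical stand-in for Boundary.fixed: same bounds in every dimension."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    # One row per dimension: column 0 is the lower bound, column 1 the upper bound.
    return np.tile(np.array([lower, upper], dtype=float), (dims, 1))


bounds = fixed_bounds(-1.0, 1.0, dims=3)  # shape [dims, 2]
lower, upper = bounds[:, 0], bounds[:, 1]  # each has shape [dims]
```

Storing both bounds in one [dims, 2] array keeps the hyperrectangle in a single object, and the two columns can be sliced out when per-dimension comparisons are needed.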

class bocoel.corpora.indices.Distance(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Distance metrics.

L2 = 'L2'

L2 distance. Also known as Euclidean distance.

INNER_PRODUCT = 'IP'

Inner product distance. When the vectors are normalized, this is equivalent to cosine similarity.
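The relationship between the two metrics can be checked numerically. The following is a small NumPy-only sketch that does not use the library; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))

# Cosine similarity of the raw vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product after scaling each vector to unit L2 norm.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = a_unit @ b_unit

# For unit vectors, the squared L2 distance is a function of the inner product:
# ||a - b||^2 = 2 - 2 * <a, b>.
l2_squared = np.sum((a_unit - b_unit) ** 2)
```

Here inner matches cosine, and l2_squared matches 2 - 2 * inner, so on normalized embeddings both metrics rank neighbors in the same order.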

class bocoel.corpora.indices.Index(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, **kwargs: Any)

Index is responsible for fast retrieval given a vector query.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, **kwargs: Any) -> None

search(query: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], k: int = 1) -> SearchResultBatch

Calls the search function and performs some checks.

Parameters:
  • query – The query vector. Must be of shape [batch, query_dims].

  • k – The number of nearest neighbors to return.

Returns:

A SearchResultBatch instance. See SearchResultBatch for details.
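To make the query shape and the role of k concrete, here is an exact brute-force search in plain NumPy. It is only a sketch of what an index approximates: the function brute_force_search, its batch_size argument, and the [batch, k] result shapes are assumptions of this example, not the library's search implementation or the layout of SearchResultBatch.

```python
import numpy as np


def brute_force_search(embeddings, query, k=1, batch_size=64):
    """Exact L2 nearest neighbors; a stand-in for what an index approximates."""
    embeddings = np.asarray(embeddings, dtype=float)
    query = np.asarray(query, dtype=float)
    if query.ndim != 2 or query.shape[1] != embeddings.shape[1]:
        raise ValueError("query must have shape [batch, query_dims]")
    if not 1 <= k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")

    distances, indices = [], []
    # Process the queries in chunks of batch_size to bound memory use.
    for start in range(0, len(query), batch_size):
        chunk = query[start : start + batch_size]
        # Pairwise L2 distances, shape [chunk, n].
        dist = np.linalg.norm(chunk[:, None, :] - embeddings[None, :, :], axis=-1)
        nearest = np.argsort(dist, axis=1)[:, :k]
        indices.append(nearest)
        distances.append(np.take_along_axis(dist, nearest, axis=1))
    return np.concatenate(distances), np.concatenate(indices)


embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
query = np.array([[0.9, 0.1], [0.1, 0.8]])
distances, indices = brute_force_search(embeddings, query, k=2)
```

Each row of indices lists the k closest embeddings for the corresponding query row, nearest first.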

abstract property data: ndarray[Any, dtype[_ScalarType_co]]

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:

The data.

abstract property batch: int

The batch size used for searching.

Returns:

The batch size.

property boundary: Boundary

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:

The boundary of the input.
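The range check described above can be sketched as follows. This is a NumPy-only illustration under the assumption that the default boundary is [-1, 1] in every dimension; the in_range helper is hypothetical and not part of the library.

```python
import numpy as np

dims = 4
# Default boundary described above: [-1, 1] in every dimension, shape [dims, 2].
bounds = np.tile(np.array([-1.0, 1.0]), (dims, 1))
lower, upper = bounds[:, 0], bounds[:, 1]


def in_range(query: np.ndarray) -> np.ndarray:
    """Return one boolean per query row: True if every coordinate is in bounds."""
    return np.all((query >= lower) & (query <= upper), axis=1)


queries = np.array([[0.5, -0.5, 0.0, 1.0], [0.5, -1.5, 0.0, 1.0]])
mask = in_range(queries)  # the second row fails because -1.5 < -1

# A unit-norm vector always passes, since no coordinate can exceed 1 in magnitude.
rng = np.random.default_rng(0)
embedding = rng.normal(size=dims)
unit = embedding / np.linalg.norm(embedding)
unit_ok = in_range(unit[None, :])
```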

abstract property distance: Distance

The distance metric used by the index.

Returns:

The distance metric.

property dims: int

The number of dimensions that the query vector should be.

Returns:

The number of dimensions.

class bocoel.corpora.indices.InternalResult(distances, indices)

distances: ndarray[Any, dtype[_ScalarType_co]]

Calculated distance.

indices: ndarray[Any, dtype[_ScalarType_co]]

Index in the original embeddings. Must be integers.

class bocoel.corpora.indices.SearchResult(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]])

A non-batched version of search result.

__init__(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]]) -> None

class bocoel.corpora.indices.SearchResultBatch(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]])

A batched version of search result.

__init__(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]]) -> None

class bocoel.corpora.indices.PolarIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, polar_backend: type[Index], **backend_kwargs: Any)

Index that uses N-sphere coordinates as interfaces. See the Wikipedia article linked below for details.

Converting the spatial indices into spherical coordinates has the following benefits:

  • Since the coordinates are normalized, the radius is always 1.

  • The search region is rectangular in spherical coordinates, which is ideal for Bayesian optimization.

[Wikipedia link on N-sphere](https://en.wikipedia.org/wiki/N-sphere#Spherical_coordinates)

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, polar_backend: type[Index], **backend_kwargs: Any) -> None

Parameters:
  • embeddings – The embeddings to index.

  • distance – The distance metric to use.

  • polar_backend – The backend to use for indexing.

  • **backend_kwargs – The backend specific keyword arguments.

property batch: int

The batch size used for searching.

Returns:

The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:

The data.

property distance: Distance

The distance metric used by the index.

Returns:

The distance metric.

property boundary: Boundary

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:

The boundary of the input.

static polar_to_spatial(r: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], theta: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) -> ndarray[Any, dtype[_ScalarType_co]]

Convert N-sphere coordinates to Cartesian coordinates. See the Wikipedia article linked in the class documentation for details.

Parameters:
  • r – The radius of the N-sphere. Has the shape [N].

  • theta – The angles of the N-sphere. Has the shape [N, D].

Returns:

The cartesian coordinates of the N-sphere.

static spatial_to_polar(x: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) -> tuple[ndarray[Any, dtype[_ScalarType_co]], ndarray[Any, dtype[_ScalarType_co]]]

Convert Cartesian coordinates to N-sphere coordinates. See the Wikipedia article linked in the class documentation for details.

Parameters:

x – The Cartesian coordinates. Has the shape [N, D].

Returns:

A tuple of the radius and the angles of the N-sphere.
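The two conversions can be sketched in NumPy using the standard N-sphere formulas from the Wikipedia article. The function names sphere_to_cartesian and cartesian_to_sphere are hypothetical, and this sketch assumes that points with D Cartesian coordinates have D - 1 angles; the library's own shape conventions and implementation may differ.

```python
import numpy as np


def sphere_to_cartesian(r, theta):
    """r: [N], theta: [N, D - 1] -> Cartesian points of shape [N, D]."""
    r = np.asarray(r, dtype=float)
    theta = np.asarray(theta, dtype=float)
    ones = np.ones((theta.shape[0], 1))
    # Running products of sines: [1, sin t1, sin t1 * sin t2, ...].
    sin_prod = np.concatenate([ones, np.cumprod(np.sin(theta), axis=1)], axis=1)
    # Cosine factors; the last coordinate has no cosine term.
    cos = np.concatenate([np.cos(theta), ones], axis=1)
    return r[:, None] * sin_prod * cos


def cartesian_to_sphere(x):
    """x: [N, D] with D >= 2 -> (r: [N], theta: [N, D - 1])."""
    x = np.asarray(x, dtype=float)
    r = np.linalg.norm(x, axis=1)
    # tail[:, i] is the norm of x[:, i:], computed with a reversed cumulative sum.
    tail = np.sqrt(np.cumsum((x**2)[:, ::-1], axis=1)[:, ::-1])
    theta = np.arctan2(tail[:, 1:], x[:, :-1])
    # The last angle keeps the sign of the final coordinate.
    theta[:, -1] = np.arctan2(x[:, -1], x[:, -2])
    return r, theta


rng = np.random.default_rng(0)
points = rng.normal(size=(5, 4))
radius, angles = cartesian_to_sphere(points)
restored = sphere_to_cartesian(radius, angles)
```

In this sketch the round trip recovers the original points up to floating-point error, and for normalized embeddings radius is 1 everywhere, so only the angles carry information.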

class bocoel.corpora.indices.InverseCDFIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, distribution: str | Distribution = Distribution.NORMAL, inverse_cdf_backend: type[Index], **backend_kwargs: Any)

An index that maps a fixed range [0, 1) with the inverse cumulative distribution function (CDF) to index embeddings.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, distribution: str | Distribution = Distribution.NORMAL, inverse_cdf_backend: type[Index], **backend_kwargs: Any) -> None

Parameters:
  • embeddings – The embeddings to index.

  • distance – The distance metric to use.

  • distribution – The distribution whose inverse CDF is used. Defaults to Distribution.NORMAL.

  • inverse_cdf_backend – The backend to use for indexing.

  • **backend_kwargs – The backend specific keyword arguments.
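The idea of mapping the unit interval through an inverse CDF can be illustrated with the standard library's NormalDist. This sketch assumes a standard normal distribution and is not the library's implementation; how the index fits or selects the distribution parameters is not covered here.

```python
from statistics import NormalDist

import numpy as np

# Inverse CDF (quantile function) of the standard normal distribution.
inv_cdf = np.vectorize(NormalDist(mu=0.0, sigma=1.0).inv_cdf)

# Query coordinates strictly inside (0, 1): the quantile diverges at the
# endpoints, and NormalDist.inv_cdf raises an error for 0 and 1.
uniform_query = np.array([[0.025, 0.5, 0.975]])
mapped = inv_cdf(uniform_query)
```

The midpoint 0.5 maps to the mean, and values near the ends of the interval map to the tails of the distribution, so a bounded search region in [0, 1) covers an unbounded embedding space.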

property batch: int

The batch size used for searching.

Returns:

The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]

The underlying data that the index searches.

Note

This has the shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:

The data.

property distance: Distance

The distance metric used by the index.

Returns:

The distance metric.

property dims: int

The number of dimensions that the query vector should be.

Returns:

The number of dimensions.

property boundary: Boundary

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:

The boundary of the input.

class bocoel.corpora.indices.WhiteningIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, reduced: int, whitening_backend: type[Index], **backend_kwargs: Any)

Whitening index. Whitens the data before indexing. See https://arxiv.org/abs/2103.15316 for more info.

__init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, reduced: int, whitening_backend: type[Index], **backend_kwargs: Any) -> None

Initializes the whitening index.

Parameters:
  • embeddings – The embeddings to index.

  • distance – The distance metric to use.

  • reduced – The reduced dimensionality. Has no effect if larger than the dimensionality of the embeddings.

  • whitening_backend – The backend to use for indexing.

  • **backend_kwargs – The backend specific keyword arguments.
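The whitening transform from the referenced paper can be sketched in NumPy as follows. The whiten function is hypothetical and only illustrates the technique (centering, then an SVD of the covariance, then truncation to the leading components); the library's own code may differ in details.

```python
import numpy as np


def whiten(embeddings: np.ndarray, reduced: int) -> np.ndarray:
    """Center, decorrelate, and rescale embeddings, keeping `reduced` dimensions."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = centered.T @ centered / len(embeddings)
    # SVD of the symmetric covariance: cov = u @ diag(s) @ u.T
    u, s, _ = np.linalg.svd(cov)
    kernel = u / np.sqrt(s)  # scales column j by 1 / sqrt(s[j])
    # Slicing past the last column keeps every dimension, so a large
    # `reduced` leaves the dimensionality unchanged.
    return centered @ kernel[:, :reduced]


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))
whitened = whiten(embeddings, reduced=3)
covariance = whitened.T @ whitened / len(whitened)
```

After the transform, covariance is approximately the identity matrix, which means the kept dimensions are uncorrelated and have unit variance.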

property batch: int

The batch size used for searching.

Returns:

The batch size.

property data: ndarray[Any, dtype[_ScalarType_co]]

Returns the data. This does not necessarily have the same dimensionality as the original transformed embeddings.

Returns:

The data.

property distance: Distance

The distance metric used by the index.

Returns:

The distance metric.

property boundary: Boundary

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:

The boundary of the input.