Index API
Indices provide fast nearest neighbor search. Optionally, they may also transform the embeddings prior to indexing.
The module provides a few index implementations:
FaissIndex: Uses the Faiss library for fast nearest neighbor search.
HnswlibIndex: Uses the hnswlib library for fast nearest neighbor search.
PolarIndex: Transforms spatial coordinates into polar coordinates for indexing.
InverseCDFIndex: Maps the fixed range [0, 1) through an inverse CDF for indexing.
WhiteningIndex: Whitens the data before indexing.
- class bocoel.corpora.indices.FaissIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, index_string: str, cuda: bool = False, batch_size: int = 64)
Faiss index. Uses the faiss library.
- __init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, index_string: str, cuda: bool = False, batch_size: int = 64) -> None
Initializes the Faiss index.
- Parameters:
embeddings – The embeddings to index.
distance – The distance metric to use.
normalize – Whether to normalize the embeddings.
index_string – The index string to use.
cuda – Whether to use CUDA.
batch_size – The batch size to use for searching.
- class bocoel.corpora.indices.HnswlibIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, threads: int = -1, batch_size: int = 64)
HNSWLIB index. Uses the hnswlib library.
Note that hnswlib calculates scores slightly differently; see the nmslib/hnswlib repository for details.
- __init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, normalize: bool = True, threads: int = -1, batch_size: int = 64) -> None
Initializes the HNSWLIB index.
- Parameters:
embeddings – The embeddings to index.
distance – The distance metric to use.
normalize – Whether to normalize the embeddings.
threads – The number of threads to use.
batch_size – The batch size to use for searching.
- Raises:
ValueError – If the distance is not supported.
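The hnswlib README defines its three spaces as squared L2, one minus the inner product, and one minus the cosine similarity, which is presumably the score difference mentioned above. A minimal NumPy sketch of those formulas follows; the function name `hnswlib_style_distance` is hypothetical and is not part of bocoel or hnswlib.

```python
import numpy as np


def hnswlib_style_distance(a: np.ndarray, b: np.ndarray, space: str) -> float:
    """Distance between two vectors, following the formulas in the hnswlib README."""
    if space == "l2":
        # Squared Euclidean distance (no square root is taken).
        return float(np.sum((a - b) ** 2))
    if space == "ip":
        # One minus the inner product, so more similar vectors score lower.
        return float(1.0 - np.dot(a, b))
    if space == "cosine":
        # One minus the cosine similarity.
        return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    raise ValueError(f"Unsupported space: {space}")


a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
l2_score = hnswlib_style_distance(a, b, "l2")
ip_score = hnswlib_style_distance(a, b, "ip")
cosine_score = hnswlib_style_distance(a, b, "cosine")
```

Because these values are not plain Euclidean distances or raw inner products, scores from different backends may not be directly comparable.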
- class bocoel.corpora.indices.Boundary(bounds: ndarray[Any, dtype[_ScalarType_co]])
The boundary of embeddings in a corpus. The boundary is defined as a hyperrectangle in the embedding space.
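As an illustration of the hyperrectangle idea, the sketch below checks whether points fall inside a set of per-dimension bounds. It assumes `bounds` has shape [dims, 2] with the lower bound in the first column and the upper bound in the second; that layout and the helper name `in_bounds` are assumptions made for this example, not something the documentation above states.

```python
import numpy as np


def in_bounds(points: np.ndarray, bounds: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking which rows of `points` lie inside `bounds`.

    Assumes `bounds` has shape [dims, 2] as (lower, upper) pairs; this layout
    is an assumption made for illustration.
    """
    lower, upper = bounds[:, 0], bounds[:, 1]
    # A point is inside the hyperrectangle only if every coordinate is in range.
    return np.all((points >= lower) & (points <= upper), axis=1)


# The [-1, 1] range per dimension, as used for normalized embeddings.
bounds = np.array([[-1.0, 1.0], [-1.0, 1.0]])
points = np.array([[0.5, -0.5], [1.5, 0.0]])
mask = in_bounds(points, bounds)
```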
- class bocoel.corpora.indices.Distance(value)
Distance metrics.
- L2 = 'L2'
L2 distance. Also known as Euclidean distance.
- INNER_PRODUCT = 'IP'
Inner product distance. When normalized, this is equivalent to cosine similarity.
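The equivalence between the inner product and cosine similarity for normalized vectors can be checked numerically. The sketch below uses plain NumPy and arbitrary random vectors; it also shows the related identity that, for unit vectors, the squared L2 distance equals 2 minus twice the inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))

# Cosine similarity of the raw (unnormalized) vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of the same vectors after scaling them to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = np.dot(a_unit, b_unit)

# For unit vectors: ||a - b||^2 = 2 - 2 * <a, b>.
squared_l2 = np.sum((a_unit - b_unit) ** 2)
```

One consequence is that, on normalized embeddings, ranking neighbors by L2 distance and by inner product gives the same order.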
- class bocoel.corpora.indices.Index(*args, **kwargs)
Index is responsible for fast retrieval given a vector query.
- __init__(*args, **kwargs)
- search(query: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], k: int = 1) -> SearchResultBatch
Calls the search function and performs some checks.
- Parameters:
query – The query vector. Must be of shape [batch, query_dims].
k – The number of nearest neighbors to return.
- Returns:
A SearchResultBatch instance. See SearchResultBatch for details.
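To make the expected shapes concrete, here is a brute-force stand-in for a batched k-nearest-neighbor search under L2 distance. It is only a sketch of the general idea: the function `brute_force_search`, its checks, and its tuple return value are hypothetical and do not reproduce how bocoel validates queries or builds a SearchResultBatch.

```python
import numpy as np


def brute_force_search(data: np.ndarray, query: np.ndarray, k: int = 1):
    """Exact k-nearest-neighbor search under L2 distance.

    data:  [n, dims] indexed vectors.
    query: [batch, dims] query vectors.
    Returns (distances, indices), each of shape [batch, k].
    """
    if query.ndim != 2 or query.shape[1] != data.shape[1]:
        raise ValueError("query must have shape [batch, dims]")
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of indexed vectors")

    # Pairwise distances between every query and every indexed vector: [batch, n].
    diffs = query[:, None, :] - data[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # Keep the k closest vectors per query, nearest first.
    indices = np.argsort(dists, axis=1)[:, :k]
    distances = np.take_along_axis(dists, indices, axis=1)
    return distances, indices


data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
query = np.array([[0.9, 0.1]])
distances, indices = brute_force_search(data, query, k=2)
```

Libraries such as Faiss and hnswlib avoid this full pairwise comparison, which is what makes them fast on large corpora.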
- abstract property data: ndarray[Any, dtype[_ScalarType_co]]
The underlying data that the index searches over.
Note
This has the shape of [n, dims], where dims is the transformed space.
- Returns:
The data.
- property boundary: Boundary
The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.
- Returns:
The boundary of the input.
- class bocoel.corpora.indices.SearchResult(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]])
A non-batched version of search result.
- class bocoel.corpora.indices.SearchResultBatch(query: ndarray[Any, dtype[_ScalarType_co]], vectors: ndarray[Any, dtype[_ScalarType_co]], distances: ndarray[Any, dtype[_ScalarType_co]], indices: ndarray[Any, dtype[_ScalarType_co]])
A batched version of search result.
- class bocoel.corpora.indices.PolarIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, polar_backend: type[Index], **backend_kwargs: Any)
Index that uses N-sphere coordinates as its interface. See the Wikipedia article linked below for details.
Converting the spatial coordinates into spherical coordinates has the following benefits:
- Since the coordinates are normalized, the radius is always 1.
- The search region is rectangular in spherical coordinates, which is ideal for Bayesian optimization.
[Wikipedia link on N-sphere](https://en.wikipedia.org/wiki/N-sphere#Spherical_coordinates)
- __init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, polar_backend: type[Index], **backend_kwargs: Any) -> None
- Parameters:
embeddings – The embeddings to index.
distance – The distance metric to use.
polar_backend – The backend to use for indexing.
**backend_kwargs – The backend specific keyword arguments.
- property data: ndarray[Any, dtype[_ScalarType_co]]
The underlying data that the index searches over.
Note
This has the shape of [n, dims], where dims is the transformed space.
- Returns:
The data.
- property boundary: Boundary
The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.
- Returns:
The boundary of the input.
- static polar_to_spatial(r: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], theta: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) -> ndarray[Any, dtype[_ScalarType_co]]
Convert N-sphere coordinates to Cartesian coordinates. See the Wikipedia article linked in the class documentation for details.
- Parameters:
r – The radius of the N-sphere. Has the shape [N].
theta – The angles of the N-sphere. Has the shape [N, D].
- Returns:
The Cartesian coordinates of the N-sphere.
- static spatial_to_polar(x: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) -> tuple[ndarray[Any, dtype[_ScalarType_co]], ndarray[Any, dtype[_ScalarType_co]]]
Convert Cartesian coordinates to N-sphere coordinates. See the Wikipedia article linked in the class documentation for details.
- Parameters:
x – The Cartesian coordinates. Has the shape [N, D].
- Returns:
A tuple of the radius and the angles of the N-sphere.
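The two conversions can be sketched with NumPy by following the formulas in the Wikipedia article on N-sphere spherical coordinates. This is an independent illustration, not bocoel's implementation: the functions below are free functions that happen to share the method names, and they use the convention that D Cartesian dimensions correspond to D - 1 angles, which may differ from the shapes stated above.

```python
import numpy as np


def spatial_to_polar(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Cartesian [N, D] -> radius [N] and angles [N, D - 1]. Requires D >= 2."""
    r = np.linalg.norm(x, axis=1)
    # tail[:, i] = sqrt(x_i^2 + ... + x_{D-1}^2), using 0-based coordinates.
    tail = np.sqrt(np.cumsum(x[:, ::-1] ** 2, axis=1))[:, ::-1]
    # Angle i is measured between coordinate i and the norm of the remaining ones.
    theta = np.arctan2(tail[:, 1:], x[:, :-1])
    # The last angle keeps the sign of the last coordinate, covering (-pi, pi].
    theta[:, -1] = np.arctan2(x[:, -1], x[:, -2])
    return r, theta


def polar_to_spatial(r: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Radius [N] and angles [N, D - 1] -> Cartesian [N, D]."""
    ones = np.ones((theta.shape[0], 1))
    # sin_prod[:, i] = sin(theta_0) * ... * sin(theta_{i-1}); the empty product is 1.
    sin_prod = np.concatenate([ones, np.cumprod(np.sin(theta), axis=1)], axis=1)
    # Every coordinate except the last is multiplied by the cosine of its own angle.
    cos_term = np.concatenate([np.cos(theta), ones], axis=1)
    return r[:, None] * sin_prod * cos_term


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
r, theta = spatial_to_polar(x)
x_back = polar_to_spatial(r, theta)
```

When the inputs are normalized, `r` is 1 everywhere, so only the angles carry information and each angle ranges over a fixed interval. This is the rectangular search region described in the class documentation.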
- class bocoel.corpora.indices.InverseCDFIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, distribution: str | Distribution = Distribution.NORMAL, inverse_cdf_backend: type[Index], **backend_kwargs: Any)
An index that maps the fixed range [0, 1) through the inverse cumulative distribution function (CDF) in order to index embeddings.
- __init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, distribution: str | Distribution = Distribution.NORMAL, inverse_cdf_backend: type[Index], **backend_kwargs: Any) -> None
- Parameters:
embeddings – The embeddings to index.
distance – The distance metric to use.
distribution – The distribution whose inverse CDF is used.
inverse_cdf_backend – The backend to use for indexing.
**backend_kwargs – The backend specific keyword arguments.
- property data: ndarray[Any, dtype[_ScalarType_co]]
The underlying data that the index searches over.
Note
This has the shape of [n, dims], where dims is the transformed space.
- Returns:
The data.
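The inverse CDF idea can be illustrated with the standard library's `statistics.NormalDist`. The sketch below assumes a standard normal distribution and assumes that values in the unit interval are mapped through the inverse CDF before being passed to a backend; how bocoel actually applies the transform is not described above. The helper names `uniform_to_normal` and `normal_to_uniform` are hypothetical.

```python
from statistics import NormalDist

import numpy as np

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def uniform_to_normal(u: np.ndarray) -> np.ndarray:
    """Map values in the open interval (0, 1) through the normal inverse CDF."""
    # NormalDist.inv_cdf is undefined at exactly 0 and 1, so inputs must avoid them.
    return np.vectorize(standard_normal.inv_cdf)(u)


def normal_to_uniform(z: np.ndarray) -> np.ndarray:
    """Map normally distributed values back into (0, 1) through the CDF."""
    return np.vectorize(standard_normal.cdf)(z)


u = np.array([[0.5, 0.975], [0.025, 0.5]])
z = uniform_to_normal(u)
u_back = normal_to_uniform(z)
```

The appeal of such a mapping is that a search over a bounded box covers an unbounded embedding space, with the box's volume distributed according to the chosen distribution.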
- class bocoel.corpora.indices.WhiteningIndex(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, reduced: int, whitening_backend: type[Index], **backend_kwargs: Any)
Whitening index. Whitens the data before indexing. See https://arxiv.org/abs/2103.15316 for more info.
- __init__(embeddings: ndarray[Any, dtype[_ScalarType_co]], distance: str | Distance, *, reduced: int, whitening_backend: type[Index], **backend_kwargs: Any) -> None
Initializes the whitening index.
- Parameters:
embeddings – The embeddings to index.
distance – The distance metric to use.
reduced – The reduced dimensionality. No-op if larger than the dimensionality of the embeddings.
whitening_backend – The backend to use for indexing.
**backend_kwargs – The backend specific keyword arguments.
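A common way to whiten embeddings, in the spirit of the linked paper, is to center them and then rotate and rescale with the SVD of their covariance matrix so that the result has identity covariance. The sketch below shows that general technique with NumPy; the function `whiten` is hypothetical and is not taken from bocoel's source.

```python
import numpy as np


def whiten(embeddings: np.ndarray, reduced: int) -> np.ndarray:
    """Whiten `embeddings` of shape [n, dims] and keep the first `reduced` dimensions."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings, rowvar=False)
    # cov = U diag(S) U^T, so W = U diag(1 / sqrt(S)) maps the covariance to identity.
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))
    # Slicing past the number of columns keeps all of them, so a `reduced`
    # larger than `dims` leaves the dimensionality unchanged.
    return (embeddings - mu) @ w[:, :reduced]


rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 4))
embeddings = rng.normal(size=(1000, 4)) @ mixing  # correlated features
whitened = whiten(embeddings, reduced=3)
unreduced = whiten(embeddings, reduced=10)
```

Because the SVD orders directions by decreasing variance, keeping the first `reduced` columns retains the highest-variance directions.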