Vector_db
Vector Database
This module provides functionality for creating and querying a vector database, which is a collection of document vectors and their associated file paths. The database is represented as a matrix of vector representations and an index that maps the index of a document in the matrix to the file path of the document.
The main data type is t, which represents the vector database. The module also provides functions for creating a corpus, querying the database, and managing document vectors.
The Vec module defines the vector representation of documents and provides functions for reading and writing vectors to and from disk.
Represents the vector database. corpus is the matrix of vector representations of the underlying documents. index is a hash table that maps the position of a document in the matrix to the file path of that document.
type path = Eio.Fs.dir_ty Eio.Path.t
module Vec : sig ... end
create_corpus docs' creates a vector database from an array of document vectors docs'.
The function normalizes each document vector and constructs a matrix corpus where each column represents a normalized document vector. It also creates an index, which is a hash table that maps the index of a document in the matrix to the file path of the document.
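The normalize-then-pack step described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the record types vec and db, the field names id and vector, and the function create_corpus_sketch are hypothetical stand-ins, and a plain float array array replaces Owl.Mat.mat so the example runs without Owl.

```ocaml
(* Hypothetical stand-ins for Vec.t and t, using stdlib types only. *)
type vec = { id : string; vector : float array }

type db = {
  corpus : float array array;       (* corpus.(row).(col): one column per document *)
  index : (int, string) Hashtbl.t;  (* column number -> file path *)
}

(* Scale a vector to unit L2 norm; an all-zero vector is copied unchanged. *)
let normalize v =
  let norm = sqrt (Array.fold_left (fun acc x -> acc +. (x *. x)) 0.0 v) in
  if norm = 0.0 then Array.copy v else Array.map (fun x -> x /. norm) v

(* Assumes every document vector has the same length as the first one. *)
let create_corpus_sketch (docs : vec array) : db =
  let n = Array.length docs in
  let dim = if n = 0 then 0 else Array.length docs.(0).vector in
  let corpus = Array.make_matrix dim n 0.0 in
  let index = Hashtbl.create (max n 1) in
  Array.iteri
    (fun col d ->
      (* Write the normalized vector into column [col]. *)
      Array.iteri (fun row x -> corpus.(row).(col) <- x) (normalize d.vector);
      (* Remember which file this column came from. *)
      Hashtbl.replace index col d.id)
    docs;
  { corpus; index }
```

Storing unit-length columns means a later similarity query reduces to dot products against the columns.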
val query : t -> Owl.Mat.mat -> int -> int array
query t doc k returns the top k most similar documents to the given doc in the vector database t. The function computes the cosine similarity between the input doc and the documents in the database t.corpus, and returns the indices of the top k most similar documents.
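As a rough illustration of cosine-similarity ranking, the sketch below scores a query against already-normalized document vectors and returns the positions of the k best matches. It is not the module's code: top_k_sketch, dot, and normalize are hypothetical names, documents are held as one plain float array each rather than as columns of an Owl.Mat.mat, and normalizing the query inside the function is an assumption (it does not change the ranking, only the scale of the scores).

```ocaml
(* Dot product of two equal-length vectors. *)
let dot a b =
  let acc = ref 0.0 in
  Array.iteri (fun i x -> acc := !acc +. (x *. b.(i))) a;
  !acc

(* Scale a vector to unit L2 norm; an all-zero vector is copied unchanged. *)
let normalize v =
  let norm = sqrt (dot v v) in
  if norm = 0.0 then Array.copy v else Array.map (fun x -> x /. norm) v

(* [columns] holds one unit-length vector per document (hypothetical layout).
   Returns the positions of the [k] documents most similar to [doc]. *)
let top_k_sketch (columns : float array array) (doc : float array) (k : int)
    : int array =
  let q = normalize doc in
  (* For unit vectors the dot product equals the cosine similarity. *)
  let scored = Array.mapi (fun i col -> (i, dot col q)) columns in
  (* Sort by descending similarity. *)
  Array.sort (fun (_, a) (_, b) -> compare b a) scored;
  let n = max 0 (min k (Array.length scored)) in
  Array.map fst (Array.sub scored 0 n)
```

A full sort is used for brevity; for a large corpus and a small k, a bounded heap or partial selection would avoid sorting every score.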
val add_doc : Owl.Mat.mat -> float array -> Owl.Mat.mat
initialize file initializes a vector database by reading in an array of Vec.t from disk and creating a corpus.
The function reads an array of document vectors with their associated file paths from disk using the Vec.read_vectors_from_disk function. It then creates a corpus and index using the create_corpus function.
val get_docs : Eio.Fs.dir_ty Eio.Path.t -> t -> int array -> string list
get_docs dir t indexs reads the documents corresponding to the given indices from disk and returns a list of their contents.
The function retrieves the file paths of the documents using the index hash table in t and reads the contents of the documents from disk using the Doc.load_prompt function.
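The lookup-then-load flow can be sketched as follows. This is an illustration under stated assumptions: get_docs_sketch is a hypothetical name, the load argument stands in for Doc.load_prompt (whose signature is not shown here), and silently skipping indices that are missing from the table is a choice made for the sketch, not documented behaviour of get_docs.

```ocaml
(* [index] maps a document's position in the corpus to its file path.
   [load] is a hypothetical stand-in for a function that reads a file's
   contents given its path. Indices absent from [index] are skipped. *)
let get_docs_sketch ~(load : string -> string)
    (index : (int, string) Hashtbl.t) (indices : int array) : string list =
  Array.to_list indices
  |> List.filter_map (fun i -> Option.map load (Hashtbl.find_opt index i))
```

The result keeps the order of the input indices, so contents line up with the ranking returned by a similarity query.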