Module Vector_db

Vector Database

This module provides functionality for creating and querying a vector database, which is a collection of document vectors and their associated file paths. The database is represented as a matrix of vector representations and an index that maps the index of a document in the matrix to the file path of the document.

The main data type is t, which represents the vector database. The module also provides functions for creating a corpus, querying the database, and managing document vectors.

The Vec module defines the vector representation of documents and provides functions for reading and writing vectors to and from disk.

type t = {
  corpus : Owl.Mat.mat;
  index : (int, string) Core.Hashtbl.t;
}

This type represents the vector database. corpus is the matrix of vector representations of the underlying documents. index is a hash table that maps the index of a document in the matrix to the file path of that document.

module Vec : sig ... end
val create_corpus : Vec.t array -> t

create_corpus docs' creates a vector database from an array of document vectors docs'.

The function normalizes each document vector and constructs a matrix corpus where each column represents a normalized document vector. It also creates an index, which is a hash table that maps the index of a document in the matrix to the file path of the document.

  • parameter docs'

    is an array of document vectors with their associated file paths.

  • returns

    a record of type t containing the matrix of vector representations corpus and the index mapping document indices to file paths.
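
The normalization and indexing steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's implementation: the vec record is a hypothetical stand-in for Vec.t (whose fields are not shown here), a plain float array array stands in for Owl.Mat.mat with one inner array per document, and the standard Hashtbl stands in for Core.Hashtbl.

```ocaml
(* Hypothetical stand-in for Vec.t: a file path paired with its vector. *)
type vec = { path : string; vector : float array }

(* Scale a vector to unit L2 length; a zero vector is returned unchanged. *)
let normalize v =
  let norm = sqrt (Array.fold_left (fun acc x -> acc +. (x *. x)) 0.0 v) in
  if norm = 0.0 then Array.copy v else Array.map (fun x -> x /. norm) v

(* Build the normalized vectors (one per document, standing in for the
   columns of the corpus matrix) and the position-to-path index. *)
let create_corpus_sketch (docs : vec array) =
  let index = Hashtbl.create (Array.length docs) in
  let columns =
    Array.mapi
      (fun i d ->
        Hashtbl.replace index i d.path;
        normalize d.vector)
      docs
  in
  (columns, index)
```

Because every stored vector has unit length after this step, a later similarity computation against a normalized query reduces to a dot product.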

val query : t -> Owl.Mat.mat -> int -> int array

query t doc k returns the indices of the k documents in the vector database t that are most similar to doc. The function computes the cosine similarity between doc and each document in t.corpus, then selects the k highest-scoring documents.

  • parameter t

    is the vector database containing the corpus and index.

  • parameter doc

    is the document vector to be compared with the documents in the database.

  • parameter k

    is the number of top similar documents to be returned.

  • returns

    an array of indices corresponding to the top k most similar documents in the database.
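
The ranking described above can be illustrated with a small standard-library sketch. It is not the module's implementation: a float array array stands in for Owl.Mat.mat (one inner array per document), the helper names dot, cosine, and top_k are hypothetical, and all vectors are assumed to have the same dimension.

```ocaml
(* Dot product of two vectors of equal length. *)
let dot a b =
  let acc = ref 0.0 in
  Array.iteri (fun i x -> acc := !acc +. (x *. b.(i))) a;
  !acc

(* Cosine similarity; defined as 0.0 when either vector has zero length. *)
let cosine a b =
  let na = sqrt (dot a a) and nb = sqrt (dot b b) in
  if na = 0.0 || nb = 0.0 then 0.0 else dot a b /. (na *. nb)

(* Indices of the k documents most similar to doc, best match first. *)
let top_k (corpus : float array array) (doc : float array) (k : int) : int array =
  let scored = Array.mapi (fun i v -> (i, cosine v doc)) corpus in
  (* Sort by descending similarity; stable_sort keeps ties in corpus order. *)
  Array.stable_sort (fun (_, a) (_, b) -> compare b a) scored;
  let k = max 0 (min k (Array.length scored)) in
  Array.map fst (Array.sub scored 0 k)
```

The sketch clamps k to the corpus size so that asking for more results than there are documents returns every index; whether query behaves the same way is not stated in this documentation.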

val add_doc : Owl.Mat.mat -> float array -> Owl.Mat.mat
val initialize : path -> t

initialize file initializes a vector database by reading in an array of Vec.t from disk and creating a corpus.

The function reads an array of document vectors with their associated file paths from disk using the Vec.read_vectors_from_disk function. It then creates a corpus and index using the create_corpus function.

  • parameter file

    is the file name used to read the array of document vectors from disk.

  • returns

    a record of type t containing the matrix of vector representations corpus and the index mapping document indices to file paths.

val get_docs : Eio.Fs.dir_ty Eio.Path.t -> t -> int array -> string list

get_docs dir t indexs reads the documents corresponding to the given indices from disk and returns a list of their contents.

The function retrieves the file paths of the documents using the index hash table in t and reads the contents of the documents from disk using the Doc.load_prompt function.

  • parameter dir

    is the directory for loading the documents.

  • parameter t

    is the vector database containing the corpus and index.

  • parameter indexs

    is an array of indices corresponding to the documents to be read from disk.

  • returns

    a list of strings containing the contents of the documents read from disk.
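
The lookup step, from indices to file paths to contents, can be sketched as follows. This is a hedged illustration rather than the module's code: the standard Hashtbl stands in for Core.Hashtbl, the load argument is a hypothetical stand-in for the loader that reads a file relative to dir, and skipping indices that are missing from the index is an assumption of this sketch.

```ocaml
(* Resolve each index to a path, then load the document at that path.
   [load] is a hypothetical loader; indices absent from [index] are skipped,
   which is an assumption rather than documented behaviour. *)
let get_docs_sketch ~(load : string -> string)
    (index : (int, string) Hashtbl.t) (idxs : int array) : string list =
  Array.to_list idxs
  |> List.filter_map (fun i -> Option.map load (Hashtbl.find_opt index i))
```

The result keeps the order of the input indices, so contents line up with the ranking produced by a preceding query.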