(If you just want the solution, jump to [[#how-to Keeping Langfuse informed]].)
For a project, I implemented RAG using ChromaDB within Langchain, which worked fine for the most part, retrieving the correct entry above 90% of the time on average. But looking at the entries it missed, I came across some highly odd behavior.
```python
# the retriever as I first set it up: similarity search with the default k=4
vectorstore.as_retriever(search_type='similarity', search_kwargs={'k': 4})
```
Reading the docs ([here](https://api.python.langchain.com/en/latest/vectorstores/langchain_chroma.vectorstores.Chroma.html#langchain_chroma.vectorstores.Chroma.as_retriever)), this is the info you get on what `k` does:
```
search_kwargs (Optional[Dict]) –
Keyword arguments to pass to the search function. Can include things like:
k: Amount of documents to return (Default: 4)
```
As I understand it: give me the top-k documents, right?
Anyway, cutting things short: the default of 4 is probably not good. In my project, I have around 3k entries that all follow the same form/template. Since there are no duplicates, the retriever is able to do its job quite well, but definitely not with the default k.
To check whether the retrieved list had the correct info at the top, I swept k from 1 to 100 for an entry that was otherwise missed, and got this:
![[Pasted image 20240522131334.png]]
I was able to reproduce this every time, as the retrieved elements for a given query string never change; in this regard, there does not seem to be any non-determinism at play (especially not in my workflow).
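For reference, the check itself looked roughly like this (a sketch; `vectorstore`, `query`, and `expected_source` stand in for my actual setup):
```python
# Sweep k and check whether the correct document sits at the top of the
# retrieved list. `vectorstore`, `query`, and `expected_source` are
# placeholders for my actual setup.
for k in range(1, 101):
    retriever = vectorstore.as_retriever(
        search_type='similarity', search_kwargs={'k': k}
    )
    docs = retriever.invoke(query)
    hit = bool(docs) and docs[0].metadata.get('source') == expected_source
    print(k, hit)
```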
There has been a GitHub issue discussing this behavior ([check here](https://github.com/langchain-ai/langchain/issues/1946)), but to me at least it doesn't seem to have been resolved. I was quite dumbfounded by the way the algorithm works, so consider this an FYI of sorts. The performance hit from changing `k=1` to `k=3000` was around a factor of two. I attached some numbers at the bottom: [[#Metrics]].
I'm unsure whether treating k as a hyperparameter to tune is worth it in the long run. I'll update the post when I get new measurements. For now, I am sticking with:
```python
# ask the retriever for every document in the collection
NUMDOCS = nomic_vectorstore._collection.count()
retriever = nomic_vectorstore.as_retriever(search_kwargs={'k': NUMDOCS})
```
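Note that `_collection` is a private attribute of the LangChain wrapper, so this may break between versions; any other way of counting the documents in the collection works just as well.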
![[Pasted image 20240522131350.png]]
## how-to: Keeping Langfuse informed
If you are using Langfuse with Langchain, for example, your `format_doc` step won't report meaningful results to Langfuse anymore, since it now receives the entire collection. Working around the issue with `@observe(capture_input=False, capture_output=False)` didn't really cut it, but that might be on me.
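For completeness, the attempt looked roughly like this (a sketch; the body of `format_doc` is just an example of a typical doc-formatting step):
```python
from langfuse.decorators import observe

# Attempted workaround: keep the huge retriever output out of the trace.
# This did not give me meaningful traces either.
@observe(capture_input=False, capture_output=False)
def format_doc(docs):
    return "\n\n".join(doc.page_content for doc in docs)
```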
My way of alleviating this is ugly, but easily done.
All you need to do is modify the following file: `.../python3.12/site-packages/langchain_chroma/vectorstores.py`.
In that file, extend Chroma's `__init__` with a `top_k` parameter:
```python
class Chroma(VectorStore):
    """`ChromaDB` vector store."""
    ...

    def __init__(
        self,
        ...
        top_k: Optional[int] = None,
    ) -> None:
        ...
        self.top_k = top_k
```
In the same file, modify the `similarity_search` method so that it truncates its result when `top_k` is set:
```python
    def similarity_search(
        self,
        ...
    ) -> List[Document]:
        ...
        if self.top_k:
            return [doc for doc, _ in docs_and_scores][:self.top_k]
        return [doc for doc, _ in docs_and_scores]
```
After doing so, don't forget to actually pass the value to the constructor when using it afterwards :)
```python
vectorstore = langchain_chroma.Chroma(..., top_k=6)
```
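The nice part about storing `top_k` on the instance is that you can keep calling the retriever with `k` set to the full collection size, while only the first `top_k` documents propagate into the chain, and therefore into the Langfuse trace.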
## Metrics
**Below: time per 100 requests, in seconds, for each value of k**
```
1: 6.295977249974385 (number is dodgy, see below)
50: 5.069456249999348
500: 5.898071000003256
1000: 6.084303040988743
1500: 6.105507917003706
2000: 7.373185000033118
3000: 9.040459915995598
```
And then four runs each for k=1 and k=3000:
```
1: 4.9681604580255225
1: 4.562177540967241
1: 4.857097916014027
1: 4.935935541987419
3000: 9.643621333001647
3000: 9.103753249975853
3000: 9.236042082950007
3000: 9.164883833029307
```
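For reference, the numbers above came from something like the following (a sketch; `nomic_vectorstore` and `query` are from my setup, and absolute timings will of course vary by machine):
```python
import timeit

# Time 100 retrievals for each k.
for k in (1, 50, 500, 1000, 1500, 2000, 3000):
    retriever = nomic_vectorstore.as_retriever(search_kwargs={'k': k})
    print(k, timeit.timeit(lambda: retriever.invoke(query), number=100))
```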