In any RAG system, the user writes a query, the query is converted into a vector embedding, and the system searches a vector database in which domain-specific knowledge is stored as embeddings. Based on the semantic similarity between the query and the stored documents, the system pulls the most similar documents. These documents are then combined with the original query into a prompt, which the LLM uses to generate the final response.
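The flow described above can be sketched in a few lines. This is only a toy illustration: `embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, the "vector database" is a plain Python list, and the names `retrieve` and `build_prompt` are made up for this example.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Augment the query with the retrieved context to form the LLM prompt."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical mini knowledge base.
docs = [
    "refund policy allows returns within 30 days",
    "shipping takes five business days",
    "refund requests need an order number",
]
query = "how do I get a refund"
top = retrieve(query, docs, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

In a real system the `embed` function would call an embedding model and the search would run inside the vector database, but the three steps (embed, retrieve, augment) stay the same.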
Evaluation scope of such a system –
One can evaluate whether the retrieved documents (the context) are relevant to the original query. This is a retrieval-side evaluation, which we can call context relevance.
One can evaluate whether the response is relevant to the original query. This is a generation-side evaluation, which we can call response relevance.
One can also evaluate whether the response is supported by the retrieved context. This is a generation-side evaluation, which we can call groundedness.
Context Relevance –
Let’s now focus only on the retrieval part of the system, where we need to evaluate whether the retrieved documents are relevant to the query. To do so we need ground truth: a list of relevant documents for each possible query, which is not available up front in most settings. Generating this ground truth manually is impractical in many cases. A popular way around this problem is to use an LLM as a judge to generate the ground truth.
LLM Judge :
Basically, we give the content of the document and the query to an LLM and ask whether the document is relevant to the query. Using a prompt, we can get a relevancy score of 0 (not relevant) or 1 (relevant), which is then treated as ground truth. A dataset can be created that stores the query, the retrieved document, and the relevancy score generated by the LLM judge as ground truth.
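A minimal sketch of this labeling loop is shown below. The prompt wording, the `build_dataset` and `parse_label` helpers, and the row layout are all assumptions made for illustration. The LLM itself is passed in as a plain callable (`call_llm`), and `fake_llm` is a keyword-matching stub standing in for a real model call so that the example runs on its own.

```python
JUDGE_PROMPT = (
    "Query: {query}\n"
    "Document: {document}\n"
    "Is the document relevant to the query? "
    "Answer with 1 (relevant) or 0 (not relevant)."
)

def parse_label(raw):
    """Map the judge's raw text answer to 0 or 1; anything else becomes None."""
    answer = raw.strip()
    if answer in ("0", "1"):
        return int(answer)
    return None

def build_dataset(pairs, call_llm):
    """Label each (query, retrieved document) pair with the judge's score."""
    rows = []
    for query, document in pairs:
        raw = call_llm(JUDGE_PROMPT.format(query=query, document=document))
        rows.append({
            "query": query,
            "document": document,
            "relevance": parse_label(raw),
        })
    return rows

def fake_llm(prompt):
    """Stub judge (NOT a real LLM): says '1' if the document mentions 'refund'."""
    document_line = prompt.splitlines()[1]
    return "1" if "refund" in document_line else "0"

pairs = [
    ("how do I get a refund", "refund requests need an order number"),
    ("how do I get a refund", "shipping takes five business days"),
]
dataset = build_dataset(pairs, fake_llm)
for row in dataset:
    print(row)
```

Keeping unparseable answers as `None` rather than guessing a score makes it easy to spot and re-run the pairs where the judge did not follow the 0/1 format.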
Use the Dataset to Evaluate the System :
Once the dataset is ready, we can evaluate our system using the two metrics below –
Precision@K : Of the top K retrieved documents, what fraction is relevant.
Recall@K : Of all the relevant documents, what fraction is retrieved in the top K.
You can calculate these two metrics for each query and then average them across all queries to get a baseline for your system.
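The two metrics and the averaging step can be sketched as follows. The function names, the `runs` layout (query mapped to a ranked list of retrieved document IDs and the set of IDs judged relevant), and the sample IDs are assumptions for illustration. Note that when labels exist only for documents the system actually retrieved, recall is measured against that judged pool rather than against every relevant document in the corpus.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k  # conventional definition: divide by k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mean_metrics(runs, k):
    """Average precision@k and recall@k over all queries."""
    if not runs:
        return {"precision@k": 0.0, "recall@k": 0.0}
    precisions = [precision_at_k(ret, rel, k) for ret, rel in runs.values()]
    recalls = [recall_at_k(ret, rel, k) for ret, rel in runs.values()]
    return {
        "precision@k": sum(precisions) / len(runs),
        "recall@k": sum(recalls) / len(runs),
    }

# Hypothetical results: ranked retrieved IDs and the IDs judged relevant.
runs = {
    "q1": (["d1", "d2", "d3"], {"d1", "d3", "d7"}),
    "q2": (["d4", "d5", "d6"], {"d5"}),
}
baseline = mean_metrics(runs, k=2)
print(baseline)
```

With K = 2, query q1 has one relevant document in its top 2 out of three relevant overall, and q2 has its single relevant document in the top 2, so the averages come out to a precision of 0.5 and a recall of about 0.67.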