Understanding ‘Top K’ in Vector Search: A LlamaIndex and Gemini Example with ISTQB AI Testing Data

Sami Sabir
4 min read · Jul 30, 2024


In vector search, “Top K” controls how many of the most similar results a query returns, which affects both the quality of the answer and the cost of producing it. This tutorial explains Top K using a practical example with LlamaIndex and Google’s Gemini model, utilizing real-world AI testing documentation.

Test Data: ISTQB Artificial Intelligence Tester Syllabus

For this tutorial, we’ll be using the “CT-AI syllabus” PDF document from the International Software Testing Qualifications Board (ISTQB). This document contains valuable information about AI testing methodologies and practices.

To follow along:

  1. Download the CT-AI Syllabus PDF from the ISTQB website, or directly from https://istqb-main-web-prod.s3.amazonaws.com/media/documents/ISTQB_CT-AI_Syllabus_v1.0_mghocmT.pdf
  2. Create a directory structure in your project: ./data/istqb/
  3. Place the downloaded PDF in this directory.

This setup ensures that our code can access and process the syllabus content.

What is “Top K”?

“Top K” refers to the K most relevant or similar items retrieved from a dataset in response to a query. In vector search:

  • K is a user-defined integer specifying the number of results to return.
  • “Top” indicates we’re interested in the most relevant or similar items.
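To make the definition concrete, here is a minimal sketch of Top K selection in plain Python. It is not how LlamaIndex implements retrieval; the three-dimensional vectors, the labels, and the helper names cosine_similarity and top_k are all made up for illustration, and cosine similarity is assumed as the similarity measure.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = (
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy "embeddings"; real embeddings have hundreds of dimensions.
doc_vecs = {
    "overfitting": [0.9, 0.1, 0.0],
    "underfitting": [0.8, 0.3, 0.1],
    "test oracles": [0.1, 0.9, 0.2],
    "bias": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

With K = 2, only the two vectors pointing in nearly the same direction as the query are returned. Asking for a K larger than the number of stored items simply returns everything, ranked.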

Practical Example: Document Retrieval with LlamaIndex and Gemini

Let’s break down a real-world example using LlamaIndex and the Gemini model:

from llama_index.llms.gemini import Gemini
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
import os
from dotenv import load_dotenv

# Load environment variables and set up the API key
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")
os.environ["GOOGLE_API_KEY"] = api_key

# Initialize the Gemini model and register it as the LLM for response generation
llm = Gemini(model="models/gemini-1.5-flash-001")
Settings.llm = llm
# Note: embeddings are configured separately through Settings.embed_model.
# If you leave it unset, LlamaIndex falls back to its default embedding model,
# which may need its own API key.

try:
    # Load documents from the ISTQB directory
    documents = SimpleDirectoryReader('./data/istqb').load_data()
    # Create a vector store index
    index = VectorStoreIndex.from_documents(documents)
    # Set up the retriever with Top K = 10
    retriever = VectorIndexRetriever(index=index, similarity_top_k=10)
    query_engine = RetrieverQueryEngine(retriever=retriever)
    # Generate a response about AI concepts from the syllabus
    response = query_engine.query("explain what is Overfitting and Underfitting")
    print("Generated text:")
    print(response)
except Exception as e:
    print(f"An error occurred: {e}")

How Top K Works in This Example

  1. Document Loading: The script loads the ISTQB AI Testing syllabus from the specified directory.
  2. Indexing: VectorStoreIndex.from_documents(documents) splits the syllabus into chunks, embeds each chunk, and stores the resulting vectors in an index.
  3. Retriever Setup: We set similarity_top_k=10, meaning the retriever will fetch the 10 most similar sections (chunks of text, which LlamaIndex calls nodes) from the syllabus for any given query.
  4. Query Processing:
     • The query “explain what is Overfitting and Underfitting” is converted into a vector.
     • This query vector is compared to the vectorized sections of the syllabus.
     • The 10 most similar sections (because Top K = 10) are retrieved.
  5. Response Generation: The Gemini model uses these 10 most relevant sections to generate a response explaining overfitting and underfitting in the context of AI testing.
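It can help to look at what the retriever actually returns before the LLM sees it. The sketch below assumes the retriever exposes retrieve(query) and that each hit has a score and a node with get_content(), as LlamaIndex retrievers do; summarize_hits, FakeRetriever, and the placeholder texts and scores are hypothetical and exist only so the snippet runs without an index or API key.

```python
from types import SimpleNamespace


def summarize_hits(retriever, query, preview_chars=80):
    """Return (rank, score, text preview) for each chunk the retriever returns."""
    hits = retriever.retrieve(query)  # in LlamaIndex: a list of NodeWithScore
    rows = []
    for rank, hit in enumerate(hits, start=1):
        text = hit.node.get_content().replace("\n", " ")
        rows.append((rank, hit.score, text[:preview_chars]))
    return rows


def _fake_hit(text, score):
    """Build a stand-in object shaped like a retrieved node with a score."""
    node = SimpleNamespace(get_content=lambda: text)
    return SimpleNamespace(node=node, score=score)


class FakeRetriever:
    """Hypothetical stand-in; swap in the tutorial's real `retriever` object."""

    def retrieve(self, query):
        return [
            _fake_hit("placeholder chunk about overfitting", 0.82),
            _fake_hit("placeholder chunk about underfitting", 0.79),
        ]


rows = summarize_hits(FakeRetriever(), "explain what is Overfitting and Underfitting")
for rank, score, preview in rows:
    print(f"{rank:>2}. score={score:.2f}  {preview}")
```

Passing the real retriever from the script instead of FakeRetriever should list up to 10 rows, one per retrieved chunk, which makes it easier to judge whether the lower-ranked chunks are still on topic.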

Why is Top K Important in This Context?

  1. Efficiency: By limiting to 10 sections, we reduce the amount of text the Gemini model needs to process, speeding up response time.
  2. Relevance: We’re focusing on the 10 most similar sections, increasing the chances of getting relevant information from the syllabus.
  3. Balancing Comprehensiveness and Precision: For a single document like this syllabus, 10 sections is a reasonable starting point: enough context to answer the question without filling the prompt with loosely related text.
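The efficiency point can be estimated with simple arithmetic: the retrieved text sent to the model grows linearly with K. The sketch below assumes chunks of roughly 1024 tokens each; that figure and the helper name max_context_tokens are assumptions for illustration, so check the chunk size your own splitter uses.

```python
def max_context_tokens(top_k, chunk_size_tokens):
    """Upper bound on the retrieved tokens passed to the LLM for one query."""
    return top_k * chunk_size_tokens


# Assumed chunk size in tokens; adjust to match your splitter settings.
CHUNK_SIZE_TOKENS = 1024

for k in (5, 10, 20):
    budget = max_context_tokens(k, CHUNK_SIZE_TOKENS)
    print(f"K={k:>2}: up to {budget:,} tokens of retrieved text")
```

This is only an upper bound, since chunks are often shorter than the maximum, but it shows why doubling K roughly doubles the amount of text the model has to read.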

Adjusting the Top K Value

You can easily experiment with different K values:

retriever = VectorIndexRetriever(index=index, similarity_top_k=5)   # More focused
retriever = VectorIndexRetriever(index=index, similarity_top_k=20)  # More comprehensive

  • A smaller K (e.g., 5) might be faster but could miss relevant information from the syllabus.
  • A larger K (e.g., 20) might be more comprehensive but could introduce noise and slow down processing.
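One way to see the noise trade-off is to look at the weakest chunk each K would still let in. The scores below are invented for illustration, not measured on the syllabus, and weakest_score, count_above, and the 0.5 cutoff are hypothetical choices.

```python
# Hypothetical similarity scores for 20 ranked chunks, best first.
ranked_scores = [
    0.84, 0.81, 0.79, 0.74, 0.70, 0.61, 0.55, 0.52, 0.48, 0.47,
    0.41, 0.40, 0.38, 0.35, 0.33, 0.31, 0.30, 0.28, 0.27, 0.25,
]


def weakest_score(scores, k):
    """Score of the least similar chunk that a Top K cut would still include."""
    return scores[:k][-1]


def count_above(scores, k, cutoff):
    """How many of the top k chunks score at or above the cutoff."""
    return sum(1 for s in scores[:k] if s >= cutoff)


for k in (5, 10, 20):
    print(
        f"K={k:>2}: weakest score {weakest_score(ranked_scores, k):.2f}, "
        f"{count_above(ranked_scores, k, 0.5)} chunks at or above 0.5"
    )
```

In this made-up ranking, going from K = 10 to K = 20 adds ten more chunks but none of them clears the 0.5 cutoff, which is the kind of pattern that suggests a larger K is mostly adding noise. Running the same check on real retriever scores is a quick way to pick a sensible K for your own data.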

The Value of AI Tester Certification

Using the ISTQB AI Testing syllabus in this example not only provides real-world content for our tutorial but also highlights the importance of specialized knowledge in AI testing. The ISTQB Certified Tester AI Testing certification offers several benefits:

  1. Specialized Knowledge: Gain in-depth understanding of AI systems, machine learning models, and their testing methodologies.
  2. Industry Recognition: ISTQB certifications are globally recognized, enhancing your credibility in the field of software testing.
  3. Career Advancement: Demonstrate your expertise in a rapidly growing field, opening up new career opportunities.
  4. Improved Testing Practices: Apply standardized, best-practice approaches to testing AI systems, ensuring higher quality and reliability.
  5. Staying Current: The syllabus covers cutting-edge topics, helping you stay updated with the latest trends and challenges in AI testing.

Conclusion

Understanding and properly utilizing Top K is crucial for building efficient and effective information retrieval systems. In this LlamaIndex and Gemini example with the ISTQB AI Testing syllabus, Top K determines how many similar sections of the document are used to generate a response, directly impacting the quality and speed of the results.

By experimenting with different K values, you can find the optimal balance between response quality and speed for your specific use case. This approach can be particularly valuable when working with comprehensive documents like the ISTQB syllabus, allowing you to efficiently extract and apply relevant information.

Whether you’re using these techniques for personal learning, professional development, or preparing for the ISTQB AI Testing certification, mastering concepts like Top K in vector search can significantly enhance your capabilities in the field of AI and software testing.

Written by Sami Sabir

Senior Software Development Engineer in Test Automation. Sharing the latest trending topics within the software development space with Artificial Intelligence.
