Prototyping Hybrid Search (RRF) to improve Dataset Discovery

Zohaib_Shahid · January 27, 2026, 8:51pm

Hi all,

I’ve been testing the KnowledgeSpace Agent locally and investigating edge cases where standard Semantic (Vector) Search falls short.

While vector embeddings capture general concepts well, I noticed they sometimes struggle when a query needs an exact match, for example a precise author name, gene identifier, or dataset code that carries little “semantic” weight.

To address this, I have prototyped a Hybrid Search architecture using Reciprocal Rank Fusion (RRF).

The Logic: Instead of relying solely on vector similarity, this implementation:

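1. Runs a Vector Search (for meaning).
2. Runs a Keyword Search (for exact term matching).
3. Fuses the two ranked lists using RRF: each list contributes 1 / (k + rank) for a document, where rank is the document's position in that list and k is a smoothing constant. The contributions are summed to give the document's fused score, so a document that ranks well in both lists rises to the top.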
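
The fusion step above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the PR: the function name `rrf_fuse`, the dataset IDs, and the default k = 60 (a value commonly used in the RRF literature) are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    ranked_lists: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is an assumed default, not taken from the PR).
    Returns (doc_id, score) pairs sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists standing in for the two retrievers.
vector_hits = ["ds_a", "ds_b", "ds_c"]
keyword_hits = ["ds_c", "ds_a", "ds_d"]

fused = rrf_fuse([vector_hits, keyword_hits])
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.6f}")
```

Because only ranks are used, the raw vector similarities and keyword scores never need to be normalised against each other, which is the main reason RRF is convenient for combining retrievers with incompatible score scales.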

In my local benchmarks, this approach moved specific datasets that were previously buried around ranks 10 to 20 into the top 3.

I have submitted a PR with this architectural change here: Pull Request #34 in INCF/knowledge-space-agent on GitHub (“Issue 13 hybrid search” by zohaib-7035).

I would love to get feedback on the ranking logic. Does anyone have a list of “hard queries” or known failures in the current system? I’d like to run them through this new pipeline to measure the improvement.
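
One possible way to score such a list of hard queries is mean reciprocal rank (MRR) over (query, expected dataset) pairs, computed once per pipeline. The sketch below is an assumption about how the comparison could be set up, not the benchmark I actually ran: the helper names, queries, and dataset IDs are hypothetical toy data.

```python
def rank_of(target_id, ranking):
    """Return the 1-based rank of target_id in ranking, or None if absent."""
    try:
        return ranking.index(target_id) + 1
    except ValueError:
        return None


def mean_reciprocal_rank(cases, search_fn):
    """Average 1/rank of the expected dataset over all cases.

    cases: list of (query, expected_id) pairs.
    search_fn: callable mapping a query to a best-first list of IDs.
    A query whose expected dataset is missing contributes 0.
    """
    if not cases:
        return 0.0
    total = 0.0
    for query, expected_id in cases:
        rank = rank_of(expected_id, search_fn(query))
        total += 1.0 / rank if rank else 0.0
    return total / len(cases)


# Hypothetical toy data standing in for real pipeline output.
cases = [("gene XYZ1", "ds_7"), ("author Doe", "ds_2")]
vector_only = {"gene XYZ1": ["ds_1", "ds_3", "ds_7"], "author Doe": ["ds_2", "ds_9"]}
hybrid = {"gene XYZ1": ["ds_7", "ds_1", "ds_3"], "author Doe": ["ds_2", "ds_9"]}

mrr_vector = mean_reciprocal_rank(cases, lambda q: vector_only[q])
mrr_hybrid = mean_reciprocal_rank(cases, lambda q: hybrid[q])
print(f"vector-only MRR: {mrr_vector:.3f}, hybrid MRR: {mrr_hybrid:.3f}")
```

A shared list in this (query, expected dataset) shape would make it easy to compare the current system and the hybrid pipeline on the same cases.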

Best, Zohaib