Prototyping Hybrid Search (RRF) to improve Dataset Discovery

Hi all,

I’ve been testing the KnowledgeSpace Agent locally and investigating edge cases where standard Semantic (Vector) Search falls short.

While vector embeddings are excellent at capturing general concepts, I noticed they sometimes struggle with exact-match queries, specifically searches for precise author names, gene identifiers, or dataset codes that carry little “semantic” weight.

To address this, I have prototyped a Hybrid Search architecture using Reciprocal Rank Fusion (RRF).

The Logic: Instead of relying solely on vector similarity, this implementation:

  1. Runs a Vector Search (for meaning).
  2. Runs a Keyword Search (for exact term matching).
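  3. Fuses the two lists using RRF: each dataset gets 1 / (k + rank) from every list it appears in, and these per-list values are summed into its final score (k is a smoothing constant).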
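
For anyone who wants to see the fusion step concretely, here is a minimal Python sketch of RRF. It is an illustration only, not the code from the PR: the function name `reciprocal_rank_fusion`, the dataset IDs, and the choice of k = 60 (a commonly used default) are all assumptions on my part.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[str]],
    k: int = 60,  # assumed smoothing constant; tune for your data
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of dataset IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists (IDs are made up for illustration).
vector_hits = ["ds_a", "ds_b", "ds_c", "ds_d"]   # semantic search order
keyword_hits = ["ds_d", "ds_a", "ds_e"]          # exact-term search order

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

In this toy input, `ds_d` is last in the vector list but first in the keyword list, so the summed score lifts it to second place overall. RRF only uses ranks, so the raw vector and keyword scores never need to be normalized against each other.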

In my local benchmarks, this approach moved specific datasets that were previously buried around the 10th to 20th position into the top 3.

I have submitted a PR with this architectural change here: Pull Request #34 in INCF/knowledge-space-agent (“Issue 13 hybrid search”, by zohaib-7035).

I would love to get feedback on the ranking logic. Does anyone have a list of “hard queries” or known failures in the current system? I’d like to run them through this new pipeline to measure the improvement.

Best, Zohaib