Hi all,
I’ve been testing the KnowledgeSpace Agent locally and investigating edge cases where standard Semantic (Vector) Search falls short.
While vector embeddings are excellent at capturing general concepts, I noticed they sometimes struggle with exact-match requirements, specifically when searching for precise author names, gene identifiers, or specific dataset codes that don’t carry strong “semantic” weight.
To address this, I have prototyped a Hybrid Search architecture using Reciprocal Rank Fusion (RRF).
The Logic: Instead of relying solely on vector similarity, this implementation:
- Runs a Vector Search (for meaning).
- Runs a Keyword Search (for exact term matching).
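- Fuses the two ranked lists using the RRF formula, summing each result's contribution across both lists:
score(d) = 1 / (k + rank_vector(d)) + 1 / (k + rank_keyword(d)),
where k is a smoothing constant (commonly 60) and a list that does not return d contributes nothing to its score.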
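For anyone who wants to see the fusion step concretely, here is a minimal sketch of RRF in Python. It is not the code from the PR: the function name `rrf_fuse`, the dataset IDs, and the default `k = 60` are illustrative assumptions, and the two input lists stand in for whatever the vector and keyword searches return.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a common default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; a document missing from a list adds nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the two retrievers, best match first.
vector_hits = ["ds_a", "ds_b", "ds_c", "ds_d"]
keyword_hits = ["ds_d", "ds_a", "ds_e"]

fused = rrf_fuse([vector_hits, keyword_hits])
print(fused)
```

In this toy input, `ds_d` sits last in the vector list but first in the keyword list, so its two contributions add up and it moves ahead of `ds_b` and `ds_c`, which appear in only one list. That is the effect I'm relying on for exact-match queries.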
In my local benchmarks, this approach moved specific datasets that were previously buried at positions 10 to 20 into the top 3.
I have submitted a PR with this architectural change here: Pull Request #34, "Issue 13 hybrid search" by zohaib-7035, in INCF/knowledge-space-agent on GitHub.
I would love to get feedback on the ranking logic. Does anyone have a list of “hard queries” or known failures in the current system? I’d like to run them through this new pipeline to measure the improvement.
Best, Zohaib