
December 9, 2025
Omnigrep: State-of-the-Art Agentic Code Search
Alexandru Ungureanu, Shane Rohan Barakat, Polarity Labs
We introduce Omnigrep, a multi-turn agentic code search system that achieves state-of-the-art performance on the CodeSearchEval benchmark. By combining general-purpose LLM reasoning with parallel tool orchestration, we outperform Cognition's SWE-grep by 15% and Claude Code by 33%.
The Problem
Code search is a critical bottleneck in AI-assisted software engineering. When autonomous coding agents tackle repository-level tasks like debugging or feature implementation, they must first locate relevant code spans. Cognition AI's own analysis indicates that agent trajectories spend over 60% of their first turn on context retrieval alone.
Existing approaches fall into two camps: embedding-based retrieval (fast but imprecise) and LLM-based agents (accurate but slow). Neither adequately addresses the precision-recall trade-off inherent to code search.
Our Approach
Omnigrep implements a 4-turn agentic loop with 8 parallel tool calls per turn, orchestrated through intermediate chain-of-thought reasoning. The key insight is that code search benefits from iterative hypothesis refinement: early turns cast a wide net, while subsequent turns validate and narrow results through explicit reasoning.
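The loop described above can be sketched as follows. This is an illustrative outline only, not Omnigrep's actual implementation: the names `plan_tool_calls`, `reason_over`, and `run_tool` are hypothetical stand-ins for the model-driven components, with stub logic in place of real LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_TURNS = 4        # fixed turn budget
CALLS_PER_TURN = 8   # parallel tool calls per turn

def run_tool(call):
    """Hypothetical dispatcher over the three primitives (ripgrep/glob/read)."""
    name, args = call
    return {"call": call, "output": f"<{name} {args}>"}  # stub result

def plan_tool_calls(query, evidence):
    """Stand-in for the LLM proposing the next batch of tool calls."""
    if not evidence:  # discovery turn: cast a wide net
        return [("glob", "**/*.py"), ("ripgrep", "middleware"), ("ripgrep", "auth")]
    return [("read", "src/middleware/auth.py:1-50")]  # refinement turn

def reason_over(query, evidence):
    """Stand-in for the chain-of-thought step; returns final spans or None."""
    return None if len(evidence) < 2 else ["src/middleware/auth.py:42-58"]

def search(query):
    evidence = []
    with ThreadPoolExecutor(max_workers=CALLS_PER_TURN) as pool:
        for turn in range(MAX_TURNS):
            calls = plan_tool_calls(query, evidence)[:CALLS_PER_TURN]
            evidence.extend(pool.map(run_tool, calls))  # fan out in parallel
            answer = reason_over(query, evidence)       # think between turns
            if answer:
                return answer
    return []
```

The essential structure is the alternation: each turn fans out up to eight tool calls concurrently, then a reasoning step inspects the accumulated evidence before the next batch is planned.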
How Omnigrep Works
Natural Language Query
Developer asks: "Where is the authentication middleware defined?"
Discovery Turn
8 parallel tool calls explore the codebase structure.
Tool calls: glob **/*.py, ripgrep "middleware", ripgrep "auth"
Chain-of-Thought Reasoning
Model analyzes results, forms hypotheses, plans next search strategy. This intermediate reasoning is the key differentiator.
Refinement Turns (2-4)
Validate candidates, read file contents, narrow down to precise locations.
Tool call: read src/middleware/auth.py:1-50
Precise Code Location
Returns exact file paths and line numbers (e.g. src/middleware/auth.py:42-58) with high precision.
We use just three high-quality primitives: ripgrep for regex pattern matching, glob for file system traversal, and read for content retrieval. The tool set is intentionally minimal. Our ablation studies demonstrate that these three primitives achieve full coverage of code search subtasks when properly orchestrated.
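The three primitives are simple enough to sketch as thin wrappers. These are illustrative only (Omnigrep's tool layer is not public) and assume the `rg` binary is on PATH.

```python
import glob as _glob
import subprocess

def ripgrep(pattern, path="."):
    """Regex search via the rg binary; returns matches as 'file:line:text'."""
    proc = subprocess.run(["rg", "--line-number", pattern, path],
                          capture_output=True, text=True)
    return proc.stdout.splitlines()

def glob(pattern):
    """Recursive file-system traversal ('**' matches nested directories)."""
    return _glob.glob(pattern, recursive=True)

def read(path, start=1, end=None):
    """Return lines [start, end] of a file (1-indexed, inclusive)."""
    with open(path) as f:
        lines = f.readlines()
    return "".join(lines[start - 1:end])
```

Note that `read` takes a line range rather than returning whole files, which keeps retrieved context small and precise.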
Why Reasoning Wins
The critical insight behind Omnigrep is that reasoning between tool calls enables capabilities impossible with single-turn approaches. After initial results, the model refines its search hypothesis, synthesizes context across outputs, plans subsequent calls based on accumulated evidence, and recovers from failed searches with alternative strategies.
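Recovery from failed searches, in particular, is only possible when the model can observe an empty result and react. A toy illustration, where the fallback patterns are hypothetical rather than Omnigrep's actual strategy:

```python
def widen(pattern):
    """Return progressively broader regex alternatives for a failed search."""
    alternatives = {
        "authentication middleware": [r"auth\w*_middleware", r"middleware", r"auth"],
    }
    return alternatives.get(pattern, [])

def search_with_recovery(pattern, grep):
    """Try the literal pattern first; on empty results, fall back to wider ones."""
    for candidate in [pattern, *widen(pattern)]:
        hits = grep(candidate)
        if hits:  # first non-empty result wins
            return candidate, hits
    return pattern, []
```

A single-turn system commits to its first query; the loop above is what lets a multi-turn agent trade one failed hypothesis for a broader one.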
Impact of Intermediate Reasoning
F0.5 score by reasoning configuration. Removing all reasoning steps degrades F0.5 by 39.4% relative; intermediate reasoning alone contributes an 18.7-point absolute improvement, making it the key differentiator.
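The relative and absolute figures are consistent with each other, as a quick check shows (using only the numbers quoted above):

```python
# A 39.4% relative drop from the full F0.5 of 0.475 corresponds to the
# 18.7-point absolute gap attributed to reasoning.
full_f05 = 0.475
no_reasoning_f05 = full_f05 * (1 - 0.394)   # score with all reasoning removed
absolute_gain = full_f05 - no_reasoning_f05
print(round(absolute_gain, 3))  # → 0.187
```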
Results
On CodeSearchEval (128 tasks across 34 repositories), Omnigrep achieves an F0.5 score of 0.475, representing a 33.1% relative improvement over Claude Code (0.357) and 14.9% improvement over Cognition's SWE-grep (0.413).
CodeSearchEval Leaderboard
F0.5 Score on CodeSearchEval (128 tasks, 34 repositories). Higher is better.
Omnigrep also achieves 38% higher precision than Claude Code (0.46 vs 0.33) while maintaining comparable recall. This precision advantage directly translates to reduced context pollution in downstream tasks.
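F0.5 weights precision more heavily than recall, which is why the precision edge dominates the headline score. The standard F-beta definition makes this concrete (no recall values are assumed here, since the summary does not report them):

```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision; beta > 1 favors recall; beta = 1 is F1.
    """
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 0.5, swapping a system's precision and recall changes its score: a precision-heavy system (P=0.8, R=0.4) scores higher than its mirror image (P=0.4, R=0.8), even though both have the same F1.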
Speed
While Omnigrep's 17.5-second mean latency is slower than non-agentic methods, it is roughly 2x faster than Claude Code and 17x faster than GPT-5, making it practical for interactive use.
Latency Comparison
| System | Mean | Median |
|---|---|---|
| Embedding (top-5) | 0.95s | 0.82s |
| SWE-grep | 2.79s | 2.35s |
| Omnigrep | 17.5s | 15.8s |
| Claude Sonnet 4.5 | 35.9s | 31.2s |
| GPT-5 (High) | 290.6s | 245.8s |
Key Takeaways
Our results indicate that a general-purpose LLM with a well-designed reasoning architecture can outperform RL-specialized models like SWE-grep: the path to better code search lies not in specialized training, but in principled orchestration of reasoning and tools.
Three simple primitives (ripgrep, glob, read) suffice when coordinated through multi-turn reasoning with a general LLM.
For the full report, including all quantitative data, email support@polarity.cc.