
December 9, 2025
Omnigrep: State-of-the-Art Agentic Code Search
Alexandru Ungureanu, Shane Rohan Barakat, Polarity Labs
We introduce Omnigrep, a multi-turn agentic code search system that achieves state-of-the-art performance on the CodeSearchEval benchmark. By combining general-purpose LLM reasoning with parallel tool orchestration, we outperform Cognition's SWE-grep by 15% and Claude Code by 33%.
The Problem
Code search is a critical bottleneck in AI-assisted software engineering. When autonomous coding agents tackle repository-level tasks like debugging or feature implementation, they must first locate relevant code spans. Cognition AI's own analysis indicates that agent trajectories spend over 60% of their first turn on context retrieval alone.
Existing approaches fall into two camps: embedding-based retrieval (fast but imprecise) and LLM-based agents (accurate but slow). Neither adequately addresses the precision-recall trade-off inherent to code search.
Our Approach
Omnigrep implements a 4-turn agentic loop with 8 parallel tool calls per turn, orchestrated through intermediate chain-of-thought reasoning. The key insight is that code search benefits from iterative hypothesis refinement: early turns cast a wide net, while subsequent turns validate and narrow results through explicit reasoning.
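The loop described above can be sketched as follows. This is an illustrative outline only, not Omnigrep's actual implementation: the names `plan_tool_calls`, `reason_over`, and `run_tool` are hypothetical stand-ins for the model-driven components, with stub logic in place of real LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_TURNS = 4        # fixed turn budget
CALLS_PER_TURN = 8   # parallel tool calls per turn

def run_tool(call):
    """Hypothetical dispatcher over the three primitives (ripgrep/glob/read)."""
    name, args = call
    return {"call": call, "output": f"<{name} {args}>"}  # stub result

def plan_tool_calls(query, evidence):
    """Stand-in for the LLM proposing the next batch of tool calls."""
    if not evidence:  # discovery turn: cast a wide net
        return [("glob", "**/*.py"), ("ripgrep", "middleware"), ("ripgrep", "auth")]
    return [("read", "src/middleware/auth.py:1-50")]  # refinement turn

def reason_over(query, evidence):
    """Stand-in for the chain-of-thought step; returns final spans or None."""
    return None if len(evidence) < 2 else ["src/middleware/auth.py:42-58"]

def search(query):
    evidence = []
    with ThreadPoolExecutor(max_workers=CALLS_PER_TURN) as pool:
        for turn in range(MAX_TURNS):
            calls = plan_tool_calls(query, evidence)[:CALLS_PER_TURN]
            evidence.extend(pool.map(run_tool, calls))  # fan out in parallel
            answer = reason_over(query, evidence)       # think between turns
            if answer:
                return answer
    return []
```

The essential structure is the alternation: each turn fans out up to eight tool calls concurrently, then a reasoning step inspects the accumulated evidence before the next batch is planned.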
How Omnigrep Works
Natural Language Query
Developer asks: "Where is the authentication middleware defined?"
Discovery Turn
8 parallel tool calls explore the codebase structure.
Tool calls: glob **/*.py, ripgrep "middleware", ripgrep "auth"
Chain-of-Thought Reasoning
Model analyzes results, forms hypotheses, plans next search strategy. This intermediate reasoning is the key differentiator.
Refinement Turns (2-4)
Validate candidates, read file contents, narrow down to precise locations.
Tool call: read src/middleware/auth.py:1-50
Precise Code Location
Returns exact file paths and line numbers (e.g. src/middleware/auth.py:42-58) with high precision.
We use just three high-quality primitives: ripgrep for regex pattern matching, glob for file system traversal, and read for content retrieval. The tool set is intentionally minimal. Our ablation studies demonstrate that these three primitives achieve full coverage of code search subtasks when properly orchestrated.
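The three primitives are simple enough to sketch as thin wrappers. These are illustrative only (Omnigrep's tool layer is not public) and assume the `rg` binary is on PATH.

```python
import glob as _glob
import subprocess

def ripgrep(pattern, path="."):
    """Regex search via the rg binary; returns matches as 'file:line:text'."""
    proc = subprocess.run(["rg", "--line-number", pattern, path],
                          capture_output=True, text=True)
    return proc.stdout.splitlines()

def glob(pattern):
    """Recursive file-system traversal ('**' matches nested directories)."""
    return _glob.glob(pattern, recursive=True)

def read(path, start=1, end=None):
    """Return lines [start, end] of a file (1-indexed, inclusive)."""
    with open(path) as f:
        lines = f.readlines()
    return "".join(lines[start - 1:end])
```

Note that `read` takes a line range rather than returning whole files, which keeps retrieved context small and precise.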
Why Reasoning Wins
The critical insight behind Omnigrep is that reasoning between tool calls enables capabilities impossible with single-turn approaches. After initial results, the model refines its search hypothesis, synthesizes context across outputs, plans subsequent calls based on accumulated evidence, and recovers from failed searches with alternative strategies.
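Recovery from failed searches, in particular, is only possible when the model can observe an empty result and react. A toy illustration, where the fallback patterns are hypothetical rather than Omnigrep's actual strategy:

```python
def widen(pattern):
    """Return progressively broader regex alternatives for a failed search."""
    alternatives = {
        "authentication middleware": [r"auth\w*_middleware", r"middleware", r"auth"],
    }
    return alternatives.get(pattern, [])

def search_with_recovery(pattern, grep):
    """Try the literal pattern first; on empty results, fall back to wider ones."""
    for candidate in [pattern, *widen(pattern)]:
        hits = grep(candidate)
        if hits:  # first non-empty result wins
            return candidate, hits
    return pattern, []
```

A single-turn system commits to its first query; the loop above is what lets a multi-turn agent trade one failed hypothesis for a broader one.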
Impact of Intermediate Reasoning
F0.5 score by reasoning configuration. Removing all reasoning steps degrades F0.5 by 39.4% relative; intermediate reasoning alone contributes an 18.7-point absolute improvement, making it the key differentiator.
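The relative and absolute figures are consistent with each other, as a quick check shows (using only the numbers quoted above):

```python
# A 39.4% relative drop from the full F0.5 of 0.475 corresponds to the
# 18.7-point absolute gap attributed to reasoning.
full_f05 = 0.475
no_reasoning_f05 = full_f05 * (1 - 0.394)   # score with all reasoning removed
absolute_gain = full_f05 - no_reasoning_f05
print(round(absolute_gain, 3))  # → 0.187
```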
Results
On CodeSearchEval (128 tasks across 34 repositories), Omnigrep achieves an F0.5 score of 0.475, representing a 33.1% relative improvement over Claude Code (0.357) and 14.9% improvement over Cognition's SWE-grep (0.413).
CodeSearchEval Leaderboard
F0.5 Score on CodeSearchEval (128 tasks, 34 repositories). Higher is better.
Omnigrep also achieves 38% higher precision than Claude Code (0.46 vs 0.33) while maintaining comparable recall. This precision advantage directly translates to reduced context pollution in downstream tasks.
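F0.5 weights precision more heavily than recall, which is why the precision edge dominates the headline score. The standard F-beta definition makes this concrete (no recall values are assumed here, since the summary does not report them):

```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision; beta > 1 favors recall; beta = 1 is F1.
    """
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 0.5, swapping a system's precision and recall changes its score: a precision-heavy system (P=0.8, R=0.4) scores higher than its mirror image (P=0.4, R=0.8), even though both have the same F1.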
Speed
While Omnigrep's 17.5-second mean latency is slower than non-agentic methods, it is roughly 2x faster than Claude Code and 17x faster than GPT-5, making it practical for interactive use.
Latency Comparison
| System | Mean | Median |
|---|---|---|
| Embedding (top-5) | 0.95s | 0.82s |
| SWE-grep | 2.79s | 2.35s |
| Omnigrep | 17.5s | 15.8s |
| Claude Sonnet 4.5 | 35.9s | 31.2s |
| GPT-5 (High) | 290.6s | 245.8s |
Key Takeaways
Our results indicate that a general-purpose LLM with a well-designed reasoning architecture can outperform RL-specialized models like SWE-grep: the path to better code search lies not in specialized training, but in principled orchestration of reasoning and tools.
Three simple primitives (ripgrep, glob, read) suffice when coordinated through multi-turn reasoning with a general LLM.
For the full report, including all quantitative data, email support@polarity.cc.