Intent Engine: Building a Self‑Improving Search System

Likhith

"Intent Engine – A self‑improving search platform that turns user queries into structured intent, crawls and indexes relevant content, and gets faster and smarter with every search. Built with FastAPI, Go crawler/indexer, vector search, multi‑layer caching, and full observability."

Traditional search stacks get slower and staler the more you rely on them. The Intent Engine takes the opposite approach: every user query makes the system faster, smarter, and more aligned with what people actually want.

This post walks through how the Intent Engine evolved into a self‑improving search platform with a Go crawler, vector search, intent extraction, multi‑layer caching, and production‑grade observability.

What is the Intent Engine?

The Intent Engine is a modular search stack that turns free‑form queries into structured intent objects, routes them through multiple backends, and returns results ranked to match what the user is actually trying to do.

At a high level, the system runs as a FastAPI service in Python, backed by a Go‑based crawler and indexer, a SearXNG meta‑search engine, a Qdrant vector database, Redis for caching and queues, and PostgreSQL for persistent storage.

The self‑improving search loop

The defining feature of v2 is a self‑improving search loop where every user search seeds new URLs into the crawler, continuously expanding and refining the knowledge base.

A typical loop looks like this:

  1. The user searches for something like "golang tutorial".
  2. The Intent Engine extracts structured intent (goal, use cases, complexity, skill level, temporal horizon, etc.).
  3. SearXNG fans out the query to engines such as Google, Brave, DuckDuckGo, and Bing, then returns aggregated results.
  4. The top unique URLs are extracted and pushed into a Redis‑backed crawl queue with priority scoring.
  5. The Go crawler processes the queue, respecting language filters and depth limits.
  6. The Go indexer enriches crawled pages with intent metadata and builds the search index.
  7. Qdrant stores vector embeddings for semantic retrieval.
  8. Future searches over the same topic become richer and more relevant.
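Step 4 of the loop is the part that makes the system self-improving: discovered URLs go into a prioritized, de-duplicated crawl queue. The real queue is Redis-backed, and the exact scoring formula is not documented in this post, so the following is a minimal in-memory sketch with an illustrative `priority` function (search-result rank plus a penalty for deep URLs):

```python
import heapq
from urllib.parse import urlparse

# Hypothetical priority weights -- the real Redis-backed queue's scoring
# is not documented here, so this function is an illustrative stand-in.
def priority(url: str, rank: int) -> float:
    """Lower score = crawled sooner. Combines the search-result rank
    with a small penalty for deep URLs (more path segments)."""
    depth = len([p for p in urlparse(url).path.split("/") if p])
    return rank + 0.5 * depth

class CrawlQueue:
    """In-memory stand-in for the Redis prioritized crawl queue."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url: str, rank: int) -> None:
        if url not in self._seen:  # only unique URLs are enqueued
            self._seen.add(url)
            heapq.heappush(self._heap, (priority(url, rank), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

queue = CrawlQueue()
for rank, url in enumerate([
    "https://go.dev/tour/welcome/1",
    "https://go.dev/doc/tutorial/getting-started",
    "https://go.dev/doc/tutorial/getting-started",  # duplicate, skipped
]):
    queue.push(url, rank)
```

In the production system this would be a Redis sorted set shared between the Python API (producer) and the Go crawler (consumer); the heap here only illustrates the ordering semantics.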

In production tests, just three searches pushed over 634,000 URLs into the crawl queue, showing how aggressively the system expands its coverage once traffic starts flowing.

Architecture at a glance

The architecture is intentionally polyglot: Python handles the API and orchestration, while Go powers crawling and indexing for raw performance.

Core components include:

  • FastAPI intent engine API (port 8000): Entry point for intent extraction, unified search, ranking, and health checks.
  • Go unified search API (port 8081/8082): Federated search layer that talks to Go indexes, SearXNG, and vector search in parallel.
  • SearXNG (port 8080): Privacy‑respecting meta‑search that feeds the self‑improving loop with fresh URLs.
  • Go crawler and indexer: Crawler with intent‑aware indexing backed by Redis queues and PostgreSQL storage.
  • Redis (port 6379): Used as both a result cache (1‑hour TTL) and a prioritized crawl queue holding more than a million URLs in real runs.
  • PostgreSQL (port 5432): Stores crawled pages, intent metadata, session data, and ad‑related information.
  • Qdrant (port 6333): Vector database that stores document embeddings for semantic search over intent‑tagged content.
  • Prometheus (port 9090) and Grafana (port 3000): Full observability stack with dashboards for latency, cache hit rates, crawl queue size, and service health.

All of this can be brought up with a single docker-compose up -d, with health checks verifying that every service is ready before you start hitting the search API.
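To make the single-command startup concrete, here is an illustrative compose fragment wiring a health-checked Redis ahead of the API. The service names, images, and healthcheck commands are assumptions, not the project's actual file; only the ports match the components listed above:

```yaml
# Illustrative fragment only -- names and images are assumptions.
services:
  redis:
    image: redis:7
    ports: ["6379:6379"]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      retries: 5
  intent-api:
    build: ./api
    ports: ["8000:8000"]
    depends_on:
      redis:
        condition: service_healthy
```

The `condition: service_healthy` dependency is what ensures the search API only starts accepting requests once its backends report ready.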

Making search meaningfully faster

Early versions of the Intent Engine suffered from painful latencies: uncached searches took between 7 and 17 seconds, and cold starts could stretch to 108 seconds.

A dedicated Search Latency Optimization pass introduced several key strategies:

  • Multi‑level caching:
    • L1 in‑memory LRU cache inside the API process (around 2,000 entries) for intent extraction results, delivering cached lookups in around 1 millisecond.
    • L2 Redis cache for full unified search responses with a 1‑hour TTL and target hit rate of 60–80 percent, yielding around 11× faster responses on cache hits.
  • Query normalization: Normalizes queries like "best laptop for programming?" and "best laptop programming" into a canonical form to increase cache reuse.
  • Top‑K ranking: Runs expensive ML ranking models only on the top 40 candidates instead of the full result set, cutting ranking time roughly 5×.
  • Timeouts and graceful fallbacks: Applies per‑stage timeouts to backends, falling back to SearXNG‑only results when federated search is slow.
  • Connection pooling with HTTP/2: Uses persistent HTTPX clients with connection pools and HTTP/2 support to reduce per‑request overhead by tens of milliseconds.

With these changes in place, the team brought uncached search latency down to around 3 seconds P95, cached queries down to roughly 100 milliseconds, and cold starts down to about 5 seconds—a 10–100× improvement depending on the scenario.
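The timeout-and-fallback strategy is simple to express with asyncio. This sketch uses stubbed backends (the function names and the 100 ms budget are assumptions for illustration): if the federated layer misses its per-stage deadline, the request degrades to SearXNG-only results instead of stalling.

```python
import asyncio

async def federated_search(query: str) -> list[str]:
    # Stand-in for the slower Go unified-search backend.
    await asyncio.sleep(5)  # simulate a slow backend
    return [f"federated:{query}"]

async def searxng_search(query: str) -> list[str]:
    # Stand-in for the fast SearXNG meta-search path.
    await asyncio.sleep(0.01)
    return [f"searxng:{query}"]

async def search_with_fallback(query: str, timeout: float = 0.1) -> list[str]:
    """Per-stage timeout: fall back to SearXNG-only results when the
    federated layer does not answer within its budget."""
    try:
        return await asyncio.wait_for(federated_search(query), timeout)
    except asyncio.TimeoutError:
        return await searxng_search(query)

results = asyncio.run(search_with_fallback("golang tutorial"))
```

Degrading to a partial answer on deadline is what keeps P95 bounded even when one backend misbehaves.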

Crawling only what matters

A self‑improving system is only useful if it knows when not to learn. Language filtering for the Go crawler is implemented using a multi‑layer detection strategy instead of naive domain blocking.

The crawler:

  • Reads the HTML lang attribute (e.g., html lang="en", html lang="en-GB", html lang="zh-CN") as a first, cheap signal.
  • Uses the lingua-go library to analyze up to the first 500 characters of page content across 75 languages.
  • Combines both signals with decision logic: prefer the lang attribute when valid, but verify with content; skip pages confidently detected as non‑English.

This approach allows the crawler to index English pages on international domains while skipping non‑English pages even on .com hosts, with sentence‑level detection accuracy reported around 96 percent in the library’s benchmarks.

Observability and control in production

Search systems fail in subtle ways—stuck crawlers, exploding queues, cache regressions—so v2 ships with a serious monitoring stack out of the box.

Key metrics include:

  • Unified search request counts, latencies, and cache hit rates.
  • Intent extraction latency histograms and throughput counters.
  • Crawl queue depth (often over 1.2 million URLs in active runs) and skip reasons (such as non‑English pages).
  • Database connections, memory usage, and error rates visualized via Grafana dashboards.

Prometheus scrapes all services, and Grafana auto‑loads curated dashboards via provisioned YAML and JSON configs, so you get ready‑made panels for latency, cache efficiency, and crawler health on day one.
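A scrape setup like the above boils down to a short Prometheus config. The job and target names below are assumptions for illustration; the ports match the services listed earlier:

```yaml
# Illustrative scrape config -- job names and targets are assumptions.
scrape_configs:
  - job_name: intent-api
    static_configs:
      - targets: ["intent-api:8000"]
  - job_name: unified-search
    static_configs:
      - targets: ["unified-search:8081"]
```

Grafana's provisioning directory then only needs a datasource YAML pointing at Prometheus plus the dashboard JSON files for the panels to appear on first boot.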

What’s coming next

The current roadmap focuses on continuing to harden and scale the self‑improving loop: letting the crawler run overnight to deepen coverage, upgrading to higher‑quality sentence‑transformer models for embeddings, and horizontally scaling crawler instances as the URL queue grows.

Beyond that, the project envisions more advanced topic discovery, ML‑driven categorization, multi‑language support, and tighter feedback loops between user behavior and what the crawler chooses to learn next.

If you are interested in building search systems that get better with every query instead of decaying over time, the Intent Engine provides a concrete, end‑to‑end blueprint you can study, adapt, and extend.