Building a Self-Improving Search Engine for the Privacy-First Web
"What if your search engine didn’t track you—but actually understood you? IntentForge reimagines search from the ground up: intent-first ranking instead of keyword guessing, Tor-routed queries instead of surveillance pipelines, and ultra-lightweight semantic indexing powered by binary quantized vectors. No logs. No profiling. Just fast, private, and genuinely relevant results. This isn’t a tweak to search—it’s a complete reset. "
Tags: search-engine, privacy, rust, machine-learning, open-source, tor, intentforge
Introduction
Search engines today are billion-user surveillance machines. Every query reveals intent, belief, and vulnerability: data that is harvested, profiled, and monetized without your consent. The privacy-first movement has given us VPNs and incognito modes, but these are half measures. A VPN hides your traffic from the network and incognito mode clears your local history, yet the destination (the search engine) can still log everything that matters.
IntentForge was built with a different premise: what if the search engine itself respected your intent, not exploited it?
This is the technical story of how IntentForge works — from intent-first ranking and Tor-routed meta-search to self-improving indexes and binary quantized vectors.
The Core Problem with Keyword Search
Traditional search engines match keywords. Query "best coffee shops near me" and you get pages optimized for "coffee," "shops," "near," "me" — not necessarily the best coffee shops. SEO armies have spent decades gaming this system.
IntentForge takes a different approach. Instead of keyword matching, we rank documents by semantic alignment with user intent. The system classifies incoming queries by intent type, expands queries with synonyms, scores documents based on intent alignment, and filters commercial spam content using dedicated anti-signals.
This means searching for "how to bake sourdough" returns baking guides, not bakery e-commerce pages.
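The four steps above (classify, expand, score, filter) can be sketched in a few lines. This is only a toy illustration: the intent labels, cue phrases, synonym table, anti-signal terms, and the 0.25 penalty weight are all hypothetical stand-ins, not IntentForge's actual model or data.

```python
import re

# Hypothetical lookup tables, for illustration only.
INTENT_CUES = {
    "how_to": ("how to", "guide", "tutorial"),
    "transactional": ("buy", "price", "cheap"),
}
SYNONYMS = {"bake": {"baking"}, "sourdough": {"bread", "starter"}}
COMMERCIAL_ANTI_SIGNALS = {"cart", "checkout", "shipping", "discount"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify_intent(query):
    """Return the first intent whose cue phrase appears in the query."""
    q = query.lower()
    for intent, cues in INTENT_CUES.items():
        if any(cue in q for cue in cues):
            return intent
    return "informational"


def expand(query):
    """Add known synonyms to the query terms."""
    terms = tokenize(query)
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms


def score(query, doc):
    """Fraction of expanded query terms found in the doc, minus a spam penalty."""
    terms = expand(query)
    doc_terms = tokenize(doc)
    overlap = len(terms & doc_terms) / len(terms) if terms else 0.0
    penalty = 0.0
    # Commercial vocabulary counts against a non-transactional query.
    if classify_intent(query) != "transactional":
        penalty = 0.25 * len(doc_terms & COMMERCIAL_ANTI_SIGNALS)
    return overlap - penalty


query = "how to bake sourdough"
guide = "A step by step guide to baking sourdough bread with a starter"
shop = "Buy sourdough bread online, free shipping, add to cart"
ranked = sorted([shop, guide], key=lambda d: score(query, d), reverse=True)
```

With these toy tables the baking guide ranks ahead of the shop page, because the shop page matches fewer expanded terms and is penalized for "shipping" and "cart". A production system would presumably replace the token overlap with embedding similarity.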
Architecture Overview
IntentForge is built in Rust for performance-critical components (crawler, indexer, inference) with Python services for the semantic search layer.
The system consists of:
- Crawler — Batch, robots.txt-compliant web crawling
- Discovery — Sitemap parsing, link following, RSS ingestion
- Indexer — Meilisearch-powered document indexing with binary quantization
- Inference — ONNX-based semantic embeddings with two-stage re-ranking
- Redis Queue — Priority-based task queue with Bloom filters for deduplication
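To make the last component concrete, here is a minimal in-memory sketch of a priority queue guarded by a Bloom filter. It does not talk to Redis; the class names, filter size, and hash count are assumptions chosen for illustration, and the real queue may be structured quite differently.

```python
import hashlib
import heapq


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits  # assumed to be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class CrawlQueue:
    """Min-heap of (priority, url); URLs already seen are dropped."""

    def __init__(self):
        self._heap = []
        self._seen = BloomFilter()
        self._counter = 0  # tie-breaker: equal priorities pop in insertion order

    def push(self, url, priority):
        if url in self._seen:
            return False  # probably a duplicate
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

The trade-off is the usual one for Bloom filters: memory stays constant no matter how many URLs have been seen, at the cost of occasionally skipping a URL that was never actually queued.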
Key Innovations
1. Binary Quantized Vectors (32× Compression)
Standard vector search uses 32-bit floats per dimension. A 384-dimension embedding at float32 is 1,536 bytes per document (384 × 4).
IntentForge uses binary quantization, compressing each dimension to a single bit. This reduces vectors to 48 bytes per document (384 ÷ 8), a 32× reduction (1,536 ÷ 48) with only 3.3% NDCG loss.
This makes privacy-first search viable on commodity hardware. No GPUs. No terabytes of RAM.
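The size arithmetic is easy to check with a small sketch. The version below is plain sign-bit quantization compared by Hamming distance, which is the simplest variant of the idea; it is an assumption for illustration and not necessarily the scheme IntentForge ships. The function names are hypothetical.

```python
import random


def quantize(vec):
    """Keep one bit per dimension (1 if the value is positive), packed into bytes."""
    out = bytearray((len(vec) + 7) // 8)
    for i, value in enumerate(vec):
        if value > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def hamming(a, b):
    """Number of differing bits; a smaller distance means more similar vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


DIMS = 384
rng = random.Random(0)  # fixed seed so the example is reproducible
doc = [rng.uniform(-1.0, 1.0) for _ in range(DIMS)]

code = quantize(doc)
float32_bytes = DIMS * 4                 # 1,536 bytes at 4 bytes per dimension
ratio = float32_bytes // len(code)       # 1,536 / 48
```

Here `len(code)` is 48 and `ratio` is 32. Comparing two codes needs only XOR and a bit count, which is why this kind of index can run without a GPU.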
2. Tor-Routed Meta-Search
Every search query is routed through the Tor network using Snowflake bridges:
- No IP-based tracking or profiling
- Bypassing of geo-restrictions and IP blocks
- Fresh exit IPs on every circuit rotation (5-minute intervals)
- Access to the entire web, not just "allowed" portions
Eight direct search providers (including DuckDuckGo, Bing, GitHub, arXiv, Reddit, GDELT, and Invidious) plus SearXNG (70+ engines) are queried in parallel through Tor.
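A parallel fan-out of this kind might look roughly like the sketch below. The provider callables are hypothetical placeholders, and the proxy address assumes a local Tor client listening on its default SOCKS port (9050); neither is taken from the IntentForge codebase.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Assumed local Tor SOCKS endpoint; "socks5h" asks the proxy to resolve DNS too,
# so hostnames are not leaked to the local resolver.
TOR_PROXY = "socks5h://127.0.0.1:9050"


def fan_out(query, providers, timeout=10.0):
    """Query every provider concurrently; a failing provider yields no results.

    `providers` maps a name to a callable taking (query, proxy_url) and
    returning a list of results. It must not be empty.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {
            pool.submit(fetch, query, TOR_PROXY): name
            for name, fetch in providers.items()
        }
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                results[name] = []  # one blocked provider should not sink the query
    return results


# Stub providers standing in for real HTTP calls made through the proxy.
def stub_ok(query, proxy):
    return [f"{query} result via {proxy}"]


def stub_blocked(query, proxy):
    raise ConnectionError("exit node blocked")


merged = fan_out("rust programming", {"ok": stub_ok, "blocked": stub_blocked})
```

Isolating failures per provider matters more over Tor than on the open web, since individual exit nodes are often rate-limited or blocked.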
3. Self-Improving Index
When a query returns poor results, the system automatically:
- Queues the query for background enrichment
- Expands it using synonym graphs and intent classification
- Crawls additional sources via Tor
- Re-ranks results using RRF + semantic alignment
- Updates the index with improved documents
Each enrichment round runs on a 500 ms cycle, and the average quality score is 8 out of 15 per query.
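The RRF step in the list above, read here as Reciprocal Rank Fusion, merges several ranked lists by summing 1 / (k + rank) for each document. A minimal sketch follows; k = 60 is the constant commonly used in the literature, and whether IntentForge uses the same value is an assumption.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse ranked lists of document ids; best fused score comes first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fused = rrf([
    ["a", "b", "c"],  # e.g. one provider's ordering
    ["b", "a", "d"],  # a second provider
    ["b", "c", "a"],  # a semantic re-ranker
])
```

Because RRF only looks at ranks, it can combine providers whose raw scores are on incompatible scales, which suits a meta-search setup.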
4. Intent-Gated Crawling
Before fetching a page, FastIntentScorer evaluates the URL against current query intent. Pages scoring below 0.6 relevance are skipped — reducing unnecessary traffic by 30-50%.
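The gate can be pictured as a cheap score computed from the URL alone, before any network request is made. The token-overlap scorer below is a hypothetical stand-in, since the internals of FastIntentScorer are not described here; only the 0.6 threshold comes from the text.

```python
import re
from urllib.parse import urlparse

THRESHOLD = 0.6  # cutoff quoted in the text


def url_relevance(url, intent_terms):
    """Share of intent terms that appear as tokens in the URL's host and path."""
    parsed = urlparse(url)
    url_tokens = set(re.findall(r"[a-z0-9]+", (parsed.netloc + parsed.path).lower()))
    if not intent_terms:
        return 0.0
    return len(intent_terms & url_tokens) / len(intent_terms)


def should_fetch(url, intent_terms):
    """Skip the request entirely when the URL looks off-intent."""
    return url_relevance(url, intent_terms) >= THRESHOLD


terms = {"sourdough", "bread", "recipe"}
candidates = [
    "https://example.org/recipe/sourdough-bread",
    "https://example.org/sourdough-tips",
    "https://shop.example.com/cart/checkout",
]
to_crawl = [url for url in candidates if should_fetch(url, terms)]
```

In this toy run only the first URL clears the threshold, so two of the three fetches are never made.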
Performance Metrics
| Metric | Value |
|---|---|
| Search Providers | 8 direct + SearXNG (70+ via Tor) |
| Vector Size | 48 bytes/doc (binary quantized) |
| Query Latency (P95) | <50ms |
| Indexing Throughput | ~30k pages/hr (Starter tier) |
| RSS Sources | 150+ verified feeds across 20+ categories |
| Self-Improvement | 500ms rounds, 8/15 avg quality |
| Tor Coverage | All providers via Snowflake bridges |
Privacy by Design
IntentForge is built on the principle that privacy is the foundation, not a feature:
- Zero logging: No user queries stored, ever
- Tor-only routing: All traffic exits through anonymous circuits
- No cookies or tracking pixels: Sessionless by design
- Open source: Entire codebase auditable
- Self-hosted option: Run it on your own infrastructure
Getting Started
IntentForge runs fully containerized with Docker Compose:
```sh
git clone https://github.com/oxiverse-labs/intentforge
cd intentforge
cp .env.example .env
docker compose --profile full up -d
```
The API is then available at http://localhost:9100:
```sh
curl "http://localhost:9100/search?q=rust+programming"
```
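The same request can be made from Python using only the standard library. This sketch assumes the endpoint returns JSON; the response schema is not documented in this post, so the result is returned as parsed without interpreting any fields. The function names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:9100"  # address from the Getting Started section


def search_url(query):
    """Build the /search URL, encoding spaces as '+' like the curl example."""
    return f"{BASE_URL}/search?{urlencode({'q': query})}"


def search(query, timeout=10.0):
    """Call a locally running instance and return the decoded body.

    Assumes a JSON response; adjust if the API returns something else.
    """
    with urlopen(search_url(query), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `search("rust programming")` requires the Docker Compose stack to be running; `search_url` can be used on its own to inspect the request that would be sent.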
The Future of Search
IntentForge is a proof of concept that privacy, performance, and relevance are not mutually exclusive. By combining intent-first ranking, Tor-routed meta-search, and self-improving indexes, we can build search infrastructure that respects users.
Links:
- Live Search: https://search.oxiverse.com
- GitHub: https://github.com/oxiverse-labs/intentforge
- Documentation: https://github.com/oxiverse-labs/intentforge/tree/master/docs
Related Content
Binary Quantization — 8× Vector Compression with Minimal Accuracy Loss
*“Vector search is memory-hungry. Binary quantization is the answer – but traditional methods lose 15-20% accuracy.”* We cracked asymmetric binary quantization: 48 bytes per document instead of 1,536 bytes, 5× faster queries, and only 3.3% NDCG loss. No GPUs. No terabytes of RAM. Just efficient, accurate search on commodity hardware. Dive into the math, the implementation, and why this makes privacy-first search viable.
IntentForge Architecture — How We Built a Privacy-First Search Engine with Tor
“Google knows what you searched for. Your ISP sees every site you visit. IntentForge changes the game.” Most search engines treat privacy as an afterthought. IntentForge was built with Tor integration from day one – routing every query through Snowflake bridges, matching intent instead of keywords, and running on a self-improving binary-quantized index. No logs. No tracking. No manipulation. Just a search engine that respects you. Read how we built a privacy-first search engine on a $20 VPS.