Semantic Search Engine Built with CocoIndex in 2 Days
These articles are AI-generated summaries. Please check the original sources for full details.
How I Built a Semantic Search Engine with CocoIndex
Linghua Jin built a semantic search engine using CocoIndex, achieving 30-second indexing for 500+ documents and 50ms query responses.
Why This Matters
Traditional keyword-based search matches literal terms and misses context, so a query worded differently from the document returns poor results. Semantic search closes this gap by comparing vector embeddings of the query and the documents, but it needs infrastructure to embed, store, and query those vectors. The CocoIndex example pairs a small embedding model with vector storage in Postgres and reports query responses of about 50 ms without a separate search stack.
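To make the difference concrete, here is a minimal sketch that contrasts a literal keyword match with a cosine-similarity ranking. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration and are not output from any real model; `keyword_match` and `cosine_similarity` are hypothetical helper names.

```python
import math


def keyword_match(query: str, doc: str) -> bool:
    """Literal search: true only if every query word appears in the document."""
    doc_words = set(doc.lower().split())
    return all(word in doc_words for word in query.lower().split())


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this sketch only.
TOY_EMBEDDINGS = {
    "teaching computers": [0.9, 0.1, 0.2],
    "machine learning": [0.8, 0.2, 0.3],
    "baking bread": [0.1, 0.9, 0.1],
}

query = "teaching computers"

# Rank the other phrases by similarity to the query, most similar first.
ranked = sorted(
    (text for text in TOY_EMBEDDINGS if text != query),
    key=lambda text: cosine_similarity(TOY_EMBEDDINGS[query], TOY_EMBEDDINGS[text]),
    reverse=True,
)

print(keyword_match(query, "machine learning"))  # no shared words
print(ranked)
```

With these toy vectors the keyword check finds no overlap between the two phrases, while the similarity ranking still places "machine learning" ahead of the unrelated phrase.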
Key Insights
- “500+ markdown files indexed in 30 seconds” (Real-World Example)
- “Semantic embeddings allow ‘teaching computers’ to match ‘machine learning’” (Key Features)
- “Batch indexing improves performance for large document collections” (Performance Tips)
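The batching tip can be sketched as follows. This is a generic illustration, not CocoIndex's internal batching: `batched`, `fake_embed_batch`, and `index_in_batches` are hypothetical names, and the embedding function is a stand-in that only mimics the shape of a model call.

```python
from collections.abc import Iterable, Iterator


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for one model call that embeds many texts at once."""
    return [[float(len(text))] for text in texts]


def index_in_batches(texts: list[str], batch_size: int = 64) -> tuple[list[list[float]], int]:
    """Embed all texts, counting how many model calls were needed."""
    embeddings: list[list[float]] = []
    calls = 0
    for batch in batched(texts, batch_size):
        embeddings.extend(fake_embed_batch(batch))
        calls += 1
    return embeddings, calls


docs = [f"document {i}" for i in range(500)]
vectors, model_calls = index_in_batches(docs, batch_size=64)
print(len(vectors), model_calls)
```

The point of the design is that per-call overhead is paid once per batch rather than once per document; the right batch size depends on the model and hardware.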
Working Example
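# Install CocoIndex (run in a shell)
pip install cocoindex

import cocoindex
# Assumed source of ConnectionPool; the summary uses the name without importing it.
from psycopg_pool import ConnectionPool

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Read every file under markdown_files/ as a document
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))
    doc_embeddings = data_scope.add_collector()

    # Process and chunk documents
    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500
        )
        # Embed each chunk and collect it for export
        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export to Postgres. The summary mentions this step but omits the call;
    # the form below is an assumption, so check the CocoIndex docs for your version.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
    )

# Query-side embedding. The summary calls text_to_embedding without defining it;
# this definition is an assumption that reuses the indexing model.
@cocoindex.transform_flow()
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))

# Perform semantic search
def search(pool: ConnectionPool, query: str, top_k: int = 5):
    table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
    query_vector = text_to_embedding.eval(query)
    with pool.connection() as conn:
        with conn.cursor() as cur:
            # <=> is the pgvector cosine-distance operator; smaller means more similar
            cur.execute(f"""
                SELECT filename, text, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            # Convert cosine distance to a similarity score
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]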
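The `chunk_size=2000, chunk_overlap=500` settings in the flow can be pictured with a simplified fixed-window splitter. This is only a sketch of the overlap idea, assuming sizes are counted in characters: `SplitRecursively` also takes document structure into account, which this hypothetical `chunk_with_overlap` helper does not attempt.

```python
def chunk_with_overlap(text: str, chunk_size: int = 2000, chunk_overlap: int = 500) -> list[dict]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append({"location": (start, end), "text": text[start:end]})
        if end == len(text):  # the last window already reached the end
            break
    return chunks


# A 5000-character stand-in document made of repeating digits.
document = "".join(str(i % 10) for i in range(5000))
chunks = chunk_with_overlap(document)

for chunk in chunks:
    print(chunk["location"], len(chunk["text"]))
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of embedding some text twice.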
Practical Applications
- Use Case: Documentation search with 500+ markdown files using semantic embeddings
- Pitfall: Choosing an embedding model without weighing accuracy against performance (e.g., the 384-dimensional all-MiniLM-L6-v2 vs. higher-dimensional models, which need more storage and more work per comparison)
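The dimension trade-off can be estimated with simple arithmetic. The sketch below assumes each component is stored as a 32-bit float and counts only the raw vector payload, ignoring row and index overhead; the chunk count and the 768 and 1536 alternatives are hypothetical values chosen for comparison.

```python
BYTES_PER_FLOAT32 = 4


def embedding_storage_bytes(num_chunks: int, dimensions: int) -> int:
    """Raw bytes needed to store one float32 vector per chunk."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32


# Hypothetical corpus size, for illustration only.
num_chunks = 10_000

for dims in (384, 768, 1536):
    mib = embedding_storage_bytes(num_chunks, dims) / (1024 * 1024)
    print(f"{dims:>5} dims: {mib:.1f} MiB")
```

Storage grows linearly with the dimension count, and so does the cost of each distance computation, so a larger model is worth it only if it measurably improves result quality on your own queries.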