AI & ML DevOps General Tech Community Best Practices & Tools All News About Contact
advertisement
General Tech

I Deleted Pinecone, Redis, and 400 Lines of Python. My RAG Pipeline Still Works.

May 2026 8 min read
I Deleted Pinecone, Redis, and 400 Lines of Python. My RAG Pipeline Still Works.
Back to General Tech

Last November I received a Pinecone bill for $70. For a side project. Which no one was using yet.

I sat there looking at my architecture and counted: Pinecone for vectors, Redis for caching, a FastAPI service that hits the OpenAI embed endpoint, LangChain tying it all together with about 400 lines of Python, and PostgreSQL, which had my real data the whole time, sitting there doing nothing interesting.

Five moving parts. I couldn't even run it locally without spinning up four Docker containers and praying they all connected. When Pinecone had that 20-minute outage in October, my “AI-powered” app just… died. The documents were in Postgres. The user was talking to Postgres. But the recovery step went through San Francisco and back, so tough luck.

I spent a weekend ripping it all out. This is what replaced it.

before and after

Before:

User query → FastAPI → OpenAI (embed) → Pinecone (search) → Redis (cache check)

→ PostgreSQL (get documents) → OpenAI (generate) → Response

After:

User Query → PostgreSQL → Response

Same results. Better results actually, because I was finally able to do a hybrid search (more on this later). And the Pinecone bill went to $0.

the configuration

Two extensions. That is the infrastructure change.

CREATE EXTENSION IF NO vector EXISTS;

CREATE EXTENSION IF NOT EXISTS ai;

pgvector handles vector storage and similarity searching. pgai talks to Ollama (or OpenAI, if you prefer) for embeddings and LLM calls. Both install in seconds.

I keep seeing people treat this like it's some exotic gimmick. Which is not. pgvector has been around since 2021 and is shipped by default now by Supabase, Neon, and all major Postgres cloud providers.

the table

Nothing clever here:

CREATE TABLE documents (

id BIGINT ALWAYS GENERATED AS PRIMARY IDENTITY KEY,

Source TEXT NOT NULL,

title NOT NULL TEXT,

NON-NULL TEXT content,

fragment of NOT NULL TEXT,

chunk_index INTEGER NOT NULL DEFAULT 0,

embedding vector (768),

DEFAULT JSONB metadata '{}',

created_in TIMESTAMPTZ DEFAULT now(),

updated_in TIMESTAMPTZ DEFAULT now(),

UNIQUE (source, fragment_index)

);

the embedding

The column is a native pgvector type. 768 dimensions because I'm using nomic-embed-text, but you'd use 1536 for OpenAI's ada-002, 384 for all of MiniLM, whatever. The point is that it lives right next to the text. No synchronization. There's no "oh, the integrated store has a different version of this document than the main database" nonsense.

Generating embeds without leaving SQL

This was the part where I was like "wait, really?"

UPDATE documents

SET embed = ai.ollama_embed(

'nomic-embedded-text',

piece,

host => 'http://ollama:11434'

)

WHERE the embedding IS NULL;

A declaration. Each row missing an embedding gets one. I don't need a Python script, I don't need a queue, I don't need to worry about rate or batch limits. Ollama runs locally, so it's as fast as my GPU allows.

For new insertions, a trigger handles it:

CREATE OR REPLACE generate_embedding() FUNCTION

RETURN TRIGGER AS $$

GET STARTED

NEW.embedding := ai.ollama_embed(

'nomic-embedded-text',

NEW.piece,

host => 'http://ollama:11434'

);

RETURN NEW;

END;

$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_embed_on_insert

BEFORE INSERTING INTO DOCUMENTS

FOR EACH ROW

WHEN (NEW.embedding IS NULL)

RUN FUNCTION generate_embedding();

INSERT a row, get an embed. There is no application code involved.

Indexing (because sequential scans do not scale)

pgvector without index performs a sequential scan. Good for a few thousand rows. Painful at 100K+. Unusable at 1M.

CREATE INDEX idx_documents_hnsw

IN documents USING hnsw (

embedding vector_cosine_ops

)

WITH (m = 16, construction_ef = 128);

meter = 16

It means that each node in the graph connects to 16 neighbors. construction_ef = 128

is how hard the algorithm works during index creation. These are not magic numbers: I tried several combinations and these gave the best balance between recovery and speed.

Related Coverage

General Tech

Top 7 Featured DEV Posts of the Week

General Tech

3 words worth a billion dollars: Drift to Determinism (DriDe)