RAG (Retrieval-Augmented Generation) is an AI technique that combines a retriever with a generator to allow language models to answer questions based on external, up-to-date, or private knowledge that isn't part of their original training data.
What does this really mean though?
You want to build a company policy chatbot
but no LLMs were trained on your company policy docs, so you...
Augment the LLM's knowledge by passing your own data into the prompt context
but you can't just dump in all your docs...
1. cost too much
2. be slow
3. and be more likely to produce hallucinations
so before you pass in your policy docs data...
You retrieve the docs most relevant to the user request
but how do you know what's most relevant?
You could do traditional keyword searching
that's been around forever...
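For instance, a naive keyword search (a hypothetical sketch, not code from the exercise project) only finds literal word matches:

// Hypothetical example: naive keyword search over policy docs.
// "time off" only matches documents that literally contain those words,
// so a doc about vacation and leave can be missed entirely.
const docs = [
  { title: "Vacation and Leave Guidelines", text: "Employees accrue paid leave each month." },
  { title: "Remote Work Policy", text: "Employees may work from home two days per week." },
];

function keywordSearch(query: string) {
  const terms = query.toLowerCase().split(/\s+/);
  return docs.filter((doc) =>
    terms.some((term) => doc.text.toLowerCase().includes(term))
  );
}

console.log(keywordSearch("time off")); // [] — the vacation doc is never found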
but this is likely to miss a lot of relevant docs.
So instead you want to do a fuzzier, more semantic-meaning kind of search
What tools are available for this?
Vector embeddings!
What are vector embeddings?
Numerical representations of text designed so that similar meanings are close together in vector space, even if the exact words are different.
===
For example...
or...
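A rough sketch of the idea (using the same OpenAI embeddings client and model as the scripts below): embed two phrases that share meaning but not words, and compare them with cosine similarity.

// Sketch: semantically similar phrases end up with nearby vectors.
// Assumes the same OpenAI client and embedding model used in the scripts below.
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.NUXT_OPENAI_API_KEY! });

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * (b[i] ?? 0), 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (magA * magB);
}

const res = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: ["How many vacation days do I get?", "What is the paid time off allowance?"],
});

const [a, b] = res.data.map((d) => d.embedding);
// Expect a noticeably higher score here than for two unrelated sentences.
console.log(cosineSimilarity(a!, b!));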
How do you create these embeddings?
With AI of course... specifically an embedding model
there are many of them out there...
All the major LLM providers have them (OpenAI, etc.), or you can use Ollama to run open source models
So you pick one, create vector embeddings for all your docs, and store them in your own vector database
A process totally separate from and before the user's request
Then at the time of an end user question...
At request time (ie the moment the user prompts the chatbot)
1. make an embedding of the user question
(important you use the same embedding model!)
2. query the vector DB for embeddings within a certain distance of that query
3. pass on the matched context to the Generative AI
4. And finally return a grounded/informed response
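Stitched together, those four steps are only a handful of calls. A condensed sketch (askPolicyBot and queryVectorDb are placeholder names; the full, runnable scripts are walked through below):

// Condensed sketch of the request-time flow (placeholder names; full scripts below).
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.NUXT_OPENAI_API_KEY! });

// Stub standing in for your actual vector DB lookup (see the retrieval script below for a real one)
async function queryVectorDb(embedding: number[], topK: number): Promise<{ content: string }[]> {
  return []; // replace with a real vector search
}

async function askPolicyBot(question: string) {
  // 1. Make an embedding of the user question (with the SAME model used for the docs)
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: question,
  });
  const questionEmbedding = res.data[0]!.embedding;

  // 2. Query the vector DB for chunks within a certain distance of that embedding
  const matches = await queryVectorDb(questionEmbedding, 5);

  // 3. Pass the matched context on to the generative model
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Answer based on this context:\n${matches.map((m) => m.content).join("\n")}`,
      },
      { role: "user", content: question },
    ],
  });

  // 4. Return the grounded/informed response
  return response.choices[0]?.message.content;
}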
Vector DB Solutions
libSQL (ie. the SQLite flavor created by Turso, what we're using in the exercise project)
Postgres, via the pgvector extension
Yes, this means you can do RAG with SQLite or Postgres
Cloudflare Vectorize DB
Specialized RAG SaaS products that offer vector DB hosting plus more
+ more!
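If you went the Postgres route, a similarity query with pgvector might look roughly like this (a hypothetical sketch with assumed table and column names, not the libSQL approach used in the exercise project):

// Hypothetical sketch: Postgres + pgvector instead of libSQL.
// Assumes the pgvector extension is installed and a company_policies table
// with a vector-typed "embedding" column already exists.
import pg from "pg";

const client = new pg.Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

async function searchPolicies(queryEmbedding: number[], topK = 5) {
  // <=> is pgvector's cosine distance operator; smaller distance = more similar
  const { rows } = await client.query(
    `SELECT content, filepath
       FROM company_policies
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [JSON.stringify(queryEmbedding), topK]
  );
  return rows;
}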
// Generate Embeddings
import fs from "fs/promises";
import path from "path";
import dotenv from "dotenv";
import OpenAI from "openai";
import { useDb, schema } from "../server/utils/db";
import { sql } from "drizzle-orm";

dotenv.config();

const folder = "company_policies";

const openai = new OpenAI({
  apiKey: process.env.NUXT_OPENAI_API_KEY!,
});

const db = useDb();

// Create a vector index over the embedding column so we can use vector_top_k at query time
await db.run(sql`
  CREATE INDEX IF NOT EXISTS company_policies_vector_idx
  ON company_policies(libsql_vector_idx(embedding));
`);

const directory = path.join(process.cwd(), folder);

async function getMarkdownFiles(dir: string): Promise<string[]> {
  const files = await fs.readdir(dir);
  return files.filter((f) => f.endsWith(".md")).map((f) => path.join(dir, f));
}

// Chunking splits large files into smaller parts to fit token limits and improve semantic search.
async function chunkText(text: string, maxTokens = 50): Promise<string[]> {
  // Split by paragraphs (empty lines)
  const paragraphs = text.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    // Rough heuristic: ~4 characters per token, so cap each chunk at maxTokens * 4 characters
    if ((current + para).length > maxTokens * 4) {
      if (current.trim()) chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }
  if (current.trim()) {
    chunks.push(current.trim());
  }
  return chunks;
}

async function embedAndStore() {
  const files = await getMarkdownFiles(directory);
  for (const file of files) {
    const content = await fs.readFile(file, "utf-8");
    const chunks = await chunkText(content);
    console.log(file, chunks.length);
    for (const [i, chunk] of chunks.entries()) {
      // Create an embedding for each chunk
      const res = await openai.embeddings.create({
        model: "text-embedding-ada-002",
        input: chunk,
      });
      if (!res.data[0]) throw new Error("Failed to generate embedding");
      const embedding = res.data[0].embedding;
      // Store the chunk alongside its embedding (vector32 converts the JSON array into a libSQL vector)
      await db.insert(schema.companyPolicies).values([
        {
          filepath: file,
          content: chunk,
          chunkNumber: i,
          embedding: sql`vector32(${JSON.stringify(embedding)})`,
        },
      ]);
    }
    console.log(`✅ Embedded ${file}`);
  }
}

embedAndStore().catch(console.error);
Walk through together
// Retrieve results similar to the query
import dotenv from "dotenv";
import OpenAI from "openai";
import readline from "readline";
import { useDb, schema } from "../server/utils/db";
import { sql } from "drizzle-orm";

dotenv.config();

const openai = new OpenAI({
  apiKey: process.env.NUXT_OPENAI_API_KEY!,
});

const db = useDb();

// Embed the user's question with the SAME model used to embed the docs
async function getQueryEmbedding(query: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: query,
  });
  if (!res.data[0]) throw new Error("Failed to generate embedding");
  return res.data[0].embedding;
}

// topK is the number of results to return
async function searchCompanyPolicies(query: string, topK = 5) {
  const embedding = await getQueryEmbedding(query);
  // Use vector_top_k to find the most similar vectors, then join with the actual table
  // vector_top_k returns records with the primary key/rowid of matching rows
  const matches = await db
    .select({
      id: sql`vt.id`,
      content: schema.companyPolicies.content,
      filepath: schema.companyPolicies.filepath,
      chunkNumber: schema.companyPolicies.chunkNumber,
    })
    .from(
      sql`vector_top_k('company_policies_vector_idx', vector32(${JSON.stringify(
        embedding
      )}), ${topK}) as vt`
    )
    .leftJoin(
      schema.companyPolicies,
      sql`${schema.companyPolicies.id} = vt.id`
    );

  console.log(`\n🔍 Top ${topK} results for: "${query}"\n`);
  for (const [i, match] of matches.entries()) {
    console.log(`--- Result ${i + 1} ---`);
    console.log(`Source: ${match.filepath} (Chunk ${match.chunkNumber})`);
    console.log(`Content: ${match.content?.substring(0, 200)}...`);
    console.log("");
  }
  return matches;
}

function promptForQuery() {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });
  rl.question("🧠 Enter your question: ", async (query) => {
    const answer = await answerQuery(query);
    console.log(answer);
    rl.close();
  });
}

async function answerQuery(query: string) {
  // Retrieve the most relevant policy chunks for this question
  const matches = await searchCompanyPolicies(query);
  const context = matches.map((match) => match.content).join("\n");
  // Pass the retrieved chunks to the chat model as context so the answer is grounded in the docs
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `You are a helpful assistant that answers questions about company policies.
You are given a context and a question. You should answer the question based on the context.
If you don't know the answer, say 'I don't know'.
<context>${context}</context>`,
      },
      { role: "user", content: query },
    ],
  });
  if (!response.choices[0]) throw new Error("Failed to generate response");
  const answer = response.choices[0].message.content;
  return answer;
}

promptForQuery();
Walk through together
There are plenty of ways you can fine-tune your RAG pipeline for your use case....
topK (how many results to retrieve)
Why it matters: Determines how many documents (or chunks) are retrieved for the model to consider.

Embedding dimensions
Why it matters: Higher dimensional vectors can encode more nuanced semantic meaning.
Higher dimensions = better accuracy, but larger DB size and slower queries.
Choose based on the model (e.g., OpenAI uses 1536, MiniLM is 384).

Chunking strategy
Why it matters: Controls how the original documents are split.

Metadata filtering
e.g. filter: { type: 'policy', language: 'en' }
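A rough sketch of where a few of these knobs typically show up, using hypothetical names loosely modeled on hosted vector DB query APIs (not the libSQL calls above):

// Hypothetical sketch: illustrative names, not a specific library's API.
// Chunk size/strategy is decided earlier, at indexing time (see chunkText in the embedding script).
const results = await vectorIndex.query({
  // The question embedding; its dimensions (e.g. 1536 for OpenAI ada-002, 384 for MiniLM)
  // are fixed by whichever embedding model you chose
  vector: questionEmbedding,
  // How many chunks to retrieve for the model to consider
  topK: 5,
  // Metadata filtering narrows the search to matching records
  filter: { type: "policy", language: "en" },
});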
Don't be overwhelmed by the different options!