RAG (Retrieval-Augmented Generation) is an AI technique that combines a retriever with a generator to allow language models to answer questions based on external, up-to-date, or private knowledge that isn't part of their original training data.
What does this really mean though?
You want to build a company policy chatbot
but no LLMs were trained on your company policy docs, so you...
Augment the LLM's knowledge by passing your own data into the prompt context
but you can't just dump in all your docs...
1. cost too much
2. be slow
3. and be more likely to produce hallucinations
so before you pass in your policy docs data...
You retrieve the docs most relevant to the user request
but how do you know what's most relevant?
You could do traditional keyword searching
that's been around forever...
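For instance, a naive keyword search (a hypothetical sketch, not code from the exercise project) only finds literal word matches:

// Hypothetical example: naive keyword search over policy docs.
// "time off" only matches documents that literally contain those words,
// so a doc about vacation and leave can be missed entirely.
const docs = [
  { title: "Vacation and Leave Guidelines", text: "Employees accrue paid leave each month." },
  { title: "Remote Work Policy", text: "Employees may work from home two days per week." },
];

function keywordSearch(query: string) {
  const terms = query.toLowerCase().split(/\s+/);
  return docs.filter((doc) =>
    terms.some((term) => doc.text.toLowerCase().includes(term))
  );
}

console.log(keywordSearch("time off")); // [] — the vacation doc is never found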
but this is likely to miss a lot of relevant docs.
So instead you want to do a fuzzier, more semantic-meaning kind of search
What tools are available for this?
Vector embeddings!
What are vector embeddings?
Numerical representations of text designed so that similar meanings are close together in vector space, even if the exact words are different.
===
For example...
or...
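A rough sketch of the idea (using the same OpenAI embeddings client and model as the scripts below): embed two phrases that share meaning but not words, and compare them with cosine similarity.

// Sketch: semantically similar phrases end up with nearby vectors.
// Assumes the same OpenAI client and embedding model used in the scripts below.
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.NUXT_OPENAI_API_KEY! });

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * (b[i] ?? 0), 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (magA * magB);
}

const res = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: ["How many vacation days do I get?", "What is the paid time off allowance?"],
});

const [a, b] = res.data.map((d) => d.embedding);
// Expect a noticeably higher score here than for two unrelated sentences.
console.log(cosineSimilarity(a!, b!));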
How do you create these embeddings?
With AI of course... specifically an embedding model
there are many of them out there...
All the major LLM providers have them (OpenAI, etc.), or you can use Ollama to run open source models
So you pick one, create vector embeddings for all your docs, and store them in your own vector database
A process totally separate from and before the user's request
Then at the time of an end user question...
At request time (ie the moment the user prompts the chatbot)
1. make an embedding of the user question
(important you use the same embedding model!)
2. query the vector DB for embeddings within a certain distance of that query
3. pass on the matched context to the Generative AI
4. And finally return a grounded/informed response
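Stitched together, those four steps are only a handful of calls. A condensed sketch (askPolicyBot and queryVectorDb are placeholder names; the full, runnable scripts are walked through below):

// Condensed sketch of the request-time flow (placeholder names; full scripts below).
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.NUXT_OPENAI_API_KEY! });

// Stub standing in for your actual vector DB lookup (see the retrieval script below for a real one)
async function queryVectorDb(embedding: number[], topK: number): Promise<{ content: string }[]> {
  return []; // replace with a real vector search
}

async function askPolicyBot(question: string) {
  // 1. Make an embedding of the user question (with the SAME model used for the docs)
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: question,
  });
  const questionEmbedding = res.data[0]!.embedding;

  // 2. Query the vector DB for chunks within a certain distance of that embedding
  const matches = await queryVectorDb(questionEmbedding, 5);

  // 3. Pass the matched context on to the generative model
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Answer based on this context:\n${matches.map((m) => m.content).join("\n")}`,
      },
      { role: "user", content: question },
    ],
  });

  // 4. Return the grounded/informed response
  return response.choices[0]?.message.content;
}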
Vector DB Solutions
libSQL (ie. the SQLite flavor created by Turso, what we're using in the exercise project)
Postgres, via the pgvector extension
Yes, this means you can do RAG with SQLite or Postgres
Cloudflare Vectorize DB
Specialized RAG SaaS products that offer vector DB hosting plus more
+ more!
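If you went the Postgres route, a similarity query with pgvector might look roughly like this (a hypothetical sketch with assumed table and column names, not the libSQL approach used in the exercise project):

// Hypothetical sketch: Postgres + pgvector instead of libSQL.
// Assumes the pgvector extension is installed and a company_policies table
// with a vector-typed "embedding" column already exists.
import pg from "pg";

const client = new pg.Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

async function searchPolicies(queryEmbedding: number[], topK = 5) {
  // <=> is pgvector's cosine distance operator; smaller distance = more similar
  const { rows } = await client.query(
    `SELECT content, filepath
       FROM company_policies
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [JSON.stringify(queryEmbedding), topK]
  );
  return rows;
}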
// Generate Embeddings
import fs from "fs/promises";
import path from "path";
import dotenv from "dotenv";
import OpenAI from "openai";
import { useDb, schema } from "../server/utils/db";
import { sql } from "drizzle-orm";

dotenv.config();

const folder = "company_policies";

const openai = new OpenAI({
  apiKey: process.env.NUXT_OPENAI_API_KEY!,
});

const db = useDb();

// Create a vector index over the embedding column so we can use vector_top_k at query time
await db.run(sql`
  CREATE INDEX IF NOT EXISTS company_policies_vector_idx
  ON company_policies(libsql_vector_idx(embedding));
`);

const directory = path.join(process.cwd(), folder);

async function getMarkdownFiles(dir: string): Promise<string[]> {
  const files = await fs.readdir(dir);
  return files.filter((f) => f.endsWith(".md")).map((f) => path.join(dir, f));
}

// Chunking splits large files into smaller parts to fit token limits and improve semantic search.
async function chunkText(text: string, maxTokens = 50): Promise<string[]> {
  // Split by paragraphs (empty lines)
  const paragraphs = text.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    // Rough heuristic: ~4 characters per token, so cap each chunk at maxTokens * 4 characters
    if ((current + para).length > maxTokens * 4) {
      if (current.trim()) chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }
  if (current.trim()) {
    chunks.push(current.trim());
  }
  return chunks;
}

async function embedAndStore() {
  const files = await getMarkdownFiles(directory);
  for (const file of files) {
    const content = await fs.readFile(file, "utf-8");
    const chunks = await chunkText(content);
    console.log(file, chunks.length);
    for (const [i, chunk] of chunks.entries()) {
      // Create an embedding for each chunk
      const res = await openai.embeddings.create({
        model: "text-embedding-ada-002",
        input: chunk,
      });
      if (!res.data[0]) throw new Error("Failed to generate embedding");
      const embedding = res.data[0].embedding;
      // Store the chunk alongside its embedding (vector32 converts the JSON array into a libSQL vector)
      await db.insert(schema.companyPolicies).values([
        {
          filepath: file,
          content: chunk,
          chunkNumber: i,
          embedding: sql`vector32(${JSON.stringify(embedding)})`,
        },
      ]);
    }
    console.log(`✅ Embedded ${file}`);
  }
}

embedAndStore().catch(console.error);
Walk through together
// Retrieve results similar to the query
import dotenv from "dotenv";
import OpenAI from "openai";
import readline from "readline";
import { useDb, schema } from "../server/utils/db";
import { sql } from "drizzle-orm";

dotenv.config();

const openai = new OpenAI({
  apiKey: process.env.NUXT_OPENAI_API_KEY!,
});

const db = useDb();

// Embed the user's question with the SAME model used to embed the docs
async function getQueryEmbedding(query: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: query,
  });
  if (!res.data[0]) throw new Error("Failed to generate embedding");
  return res.data[0].embedding;
}

// topK is the number of results to return
async function searchCompanyPolicies(query: string, topK = 5) {
  const embedding = await getQueryEmbedding(query);
  // Use vector_top_k to find the most similar vectors, then join with the actual table
  // vector_top_k returns records with the primary key/rowid of matching rows
  const matches = await db
    .select({
      id: sql`vt.id`,
      content: schema.companyPolicies.content,
      filepath: schema.companyPolicies.filepath,
      chunkNumber: schema.companyPolicies.chunkNumber,
    })
    .from(
      sql`vector_top_k('company_policies_vector_idx', vector32(${JSON.stringify(
        embedding
      )}), ${topK}) as vt`
    )
    .leftJoin(
      schema.companyPolicies,
      sql`${schema.companyPolicies.id} = vt.id`
    );

  console.log(`\n🔍 Top ${topK} results for: "${query}"\n`);
  for (const [i, match] of matches.entries()) {
    console.log(`--- Result ${i + 1} ---`);
    console.log(`Source: ${match.filepath} (Chunk ${match.chunkNumber})`);
    console.log(`Content: ${match.content?.substring(0, 200)}...`);
    console.log("");
  }
  return matches;
}

function promptForQuery() {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });
  rl.question("🧠 Enter your question: ", async (query) => {
    const answer = await answerQuery(query);
    console.log(answer);
    rl.close();
  });
}

async function answerQuery(query: string) {
  // Retrieve the most relevant policy chunks for this question
  const matches = await searchCompanyPolicies(query);
  const context = matches.map((match) => match.content).join("\n");
  // Pass the retrieved chunks to the chat model as context so the answer is grounded in the docs
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `You are a helpful assistant that answers questions about company policies.
You are given a context and a question. You should answer the question based on the context.
If you don't know the answer, say 'I don't know'.
<context>${context}</context>`,
      },
      { role: "user", content: query },
    ],
  });
  if (!response.choices[0]) throw new Error("Failed to generate response");
  const answer = response.choices[0].message.content;
  return answer;
}

promptForQuery();
Walk through together
There are plenty of ways you can fine-tune your RAG pipeline for your use case....
topK (how many results to retrieve)
Why it matters: Determines how many documents (or chunks) are retrieved for the model to consider.

Embedding dimensions
Why it matters: Higher dimensional vectors can encode more nuanced semantic meaning.
Higher dimensions = better accuracy, but larger DB size and slower queries.
Choose based on the model (e.g., OpenAI uses 1536, MiniLM is 384).

Chunking strategy
Why it matters: Controls how the original documents are split.

Metadata filtering
e.g. filter: { type: 'policy', language: 'en' }
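A rough sketch of where a few of these knobs typically show up, using hypothetical names loosely modeled on hosted vector DB query APIs (not the libSQL calls above):

// Hypothetical sketch: illustrative names, not a specific library's API.
// Chunk size/strategy is decided earlier, at indexing time (see chunkText in the embedding script).
const results = await vectorIndex.query({
  // The question embedding; its dimensions (e.g. 1536 for OpenAI ada-002, 384 for MiniLM)
  // are fixed by whichever embedding model you chose
  vector: questionEmbedding,
  // How many chunks to retrieve for the model to consider
  topK: 5,
  // Metadata filtering narrows the search to matching records
  filter: { type: "policy", language: "en" },
});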
Don't be overwhelmed by the different options!