
Retrieval-Augmented Generation

I've been wanting to share my thoughts about RAG for a while now. The premise behind the framework seems so obvious, yet it is widely misunderstood and its impact severely underappreciated. So I thought to myself: the best way to truly grasp the framework, instead of regurgitating existing attempts at explaining RAG, would be to build a retrieval-augmented generation system myself and really get under the hood.

Why Does RAG Exist?

Before we understand the how, we must explore the why.

There are four primary limitations of generative AI:

1) Hallucinations

One of the first drawbacks you've probably heard of: an LLM will very confidently reply to your query or prompt with a response that sounds plausible and convincing, but is not factually correct or is fabricated entirely.

2) Knowledge Cut-offs

Generative AI applications that do not have access to real-time search functionality suffer from knowledge restrictions imposed by their pre-training cut-off dates. If you need information that is very recent, or time-specific and past an LLM's knowledge cut-off, you might as well resort to asking Clippy for help.

3) Limited Context Windows

Large language models have finite context windows. Among commercially available models, Google Gemini 1.5 Pro has the largest context window at the time of writing this article, boasting a context length of up to 2 million tokens. That's approximately 3,000 single-spaced A4 pages of text at a 12pt font size, using Times New Roman. Could you maintain the context of a 3,000-page document? As impressive as this is, because these context windows are finite, for longer-form content or conversations these models eventually slow down and the accuracy of their responses starts to decay.
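As a back-of-the-envelope check on that "3,000 pages" figure, here is the arithmetic spelled out. The conversion factors (roughly 0.75 English words per token, and roughly 500 words per single-spaced A4 page at 12pt) are common rules of thumb, not exact values:

```python
# Rough estimate of how many A4 pages fit in a 2-million-token context window.
# Both conversion factors below are approximations, not exact figures.
context_tokens = 2_000_000
words_per_token = 0.75   # rule of thumb for English text
words_per_page = 500     # rough single-spaced A4 page at 12pt

words = context_tokens * words_per_token   # ~1.5 million words
pages = words / words_per_page             # ~3,000 pages
print(round(pages))                        # 3000
```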

4) Cost Implications

If you need to generate high-quality text, images or videos, especially for longer-form content in high volumes, this often requires the latest and greatest AI models available, and their API costs are not cheap.

RAG aims to combat the first two limitations, hallucinations and knowledge cut-offs, by providing a real-time search layer that can ground an LLM's response in a verifiable source, allowing for more reliable responses.
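To make the "search layer plus grounding" idea concrete, here is a minimal, self-contained sketch of the flow. The corpus, the word-overlap retriever, and the generate() stand-in are all toy placeholders of my own invention, not a real LLM call or vector database:

```python
# Toy RAG flow: retrieve the most relevant document for a query,
# then "generate" an answer grounded in the retrieved text.
# Everything here is a simplified stand-in for illustration only.

CORPUS = [
    "There are 195 official countries in the world as of 2025.",
    "The Eiffel Tower is located in Paris, France.",
    "Python was first released in 1991.",
]

def retrieve(query, corpus, k=1):
    """Rank documents by naive word overlap with the query (a stand-in
    for real embedding-based similarity search)."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(query, context):
    """Stand-in for an LLM call: the response is grounded in the
    retrieved context rather than the model's parametric memory."""
    return "Based on the retrieved source: " + " ".join(context)

query = "How many official countries are there in the world?"
answer = generate(query, retrieve(query, CORPUS))
print(answer)
```

A real system would swap the word-overlap scoring for embedding similarity over a document index, and pass the retrieved passages into the LLM's prompt, but the shape of the pipeline (retrieve, then generate from what was retrieved) is the same.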

I'll explain why this is necessary with an anecdote. A few weeks ago I managed to pry my mates from their newborn babies to decompress over a nice dinner, where we reminisced about old times and chatted about our travels. Not too long into the conversation, one of my friends asked, "I wonder how many official countries there are in the world?". I leapt at this: "Ooh, I read an article about this, there are about 205 official countries in the world!". It sounded plausible, and so he accepted it as fact. However, there were two things wrong with my answer (in addition to my answer being wrong entirely):

  1. I had no credible source to substantiate my answer.
  2. I hadn't been keeping up with the official country count since I last read that article a few years ago.

Now, what would have happened if I had looked up the answer to this question via a credible source such as Worldometer? I would have been able to respond with "There are 195 official countries as of 2025", grounding my answer in something far more believable and factual. This is exactly what RAG aims to solve.