Panda Coding School

The Day We Stopped Predicting the Next Word: How ChatGPT Was Actually Built

Aayush Pagare · June 2, 2026 · 6 min read

When people look at ChatGPT, they often think the breakthrough was purely about scale—just throwing billions of parameters and the entire internet into a massive cluster of GPUs.

But as engineers who work with these systems daily, we know that’s only half the story.

If you just train a massive model on the internet, you don't get a helpful chatbot. You get an extremely capable autocomplete engine. If you typed, "Can you help me write a Python script to sort a list?" a raw base model like GPT-3 might reply with: "Can you help me write a JavaScript function to filter an array?" because it's just trying to predict what text would plausibly come next on a forum page.

The real breakthrough that gave birth to OpenAI’s first true conversational model (which powered ChatGPT’s launch in late 2022) wasn't raw scale. It was Alignment.

Here is how OpenAI took a raw, chaotic text predictor and turned it into an assistant that actually listens, using a three-step pipeline called RLHF (Reinforcement Learning from Human Feedback).


Step 1: Supervised Fine-Tuning (SFT)

Teaching the Model How to Act Like an Assistant

Imagine you have a genius intern who has read every book in the world but has absolutely no social skills or concept of a workplace. Before you let them talk to clients, you need to show them examples of a proper email.

That is Supervised Fine-Tuning. OpenAI hired a large team of human AI trainers to manually write out high-quality dialogues.

  • They gave the trainers a prompt: "Explain quantum computing to a 5-year-old."
  • Instead of letting a base model guess the answer, the humans wrote out the ideal, friendly, step-by-step response.

OpenAI took thousands of these curated prompt-and-response pairs and fine-tuned a model from the GPT-3.5 series on them. This taught the model the structure of a conversation and the tone of an assistant.
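To make the fine-tuning objective concrete, here is a minimal sketch, not OpenAI's training code. Supervised fine-tuning keeps the usual next-token cross-entropy loss, and a common convention (assumed here) is to score only the tokens of the human-written response, so the model is trained to produce the answer rather than to echo the prompt. The function name `sft_loss` and the toy probabilities are made up for illustration.

```python
import math

def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True for tokens of the trainer-written answer,
                 False for prompt tokens (these are masked out).
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("example has no response tokens")
    return sum(losses) / len(losses)

# Toy example: 2 prompt tokens followed by 3 response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
print(round(loss, 4))
```

Lowering this loss means the model assigns higher probability to the exact words the trainer wrote, which is why the quality of the demonstrations matters so much in this step.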

The Limitation: Human-written responses are expensive and slow to scale. Nobody can write a reference answer for every question a user might ever ask. OpenAI needed a scalable way for the model to learn what humans prefer.


Step 2: Training a Reward Model

Creating the "Internal Critic"

Instead of having humans write answers, what if the model generates a bunch of options, and humans just point to the best one? It turns out, grading is much faster than writing from scratch.

For this phase, OpenAI took a prompt and ran it through the model to generate multiple different responses (let's say Responses A, B, C, and D).

[User Prompt] ---> [Model Generates 4 Variations] ---> [Human Ranks Them]
                                                       1. Response C (Best)
                                                       2. Response A
                                                       3. Response B
                                                       4. Response D (Worst)

Human labelers ranked these variations from best to worst based on helpfulness, accuracy, and safety.
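One ranking carries more information than it first appears to. As described in the InstructGPT paper, a ranking of K responses can be expanded into every "X beats Y" pair, giving K(K-1)/2 comparisons. The sketch below shows that expansion for the A/B/C/D example; the helper name `ranking_to_pairs` is hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(ranked_best_to_worst):
    """Expand one ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first item of every
    pair is always the one the labeler ranked higher.
    """
    return list(combinations(ranked_best_to_worst, 2))

pairs = ranking_to_pairs(["C", "A", "B", "D"])
for preferred, rejected in pairs:
    print(f"{preferred} preferred over {rejected}")
```

Four ranked responses yield six training comparisons, which is part of why ranking is so much cheaper per data point than writing answers by hand.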

OpenAI then took this ranking data and used it to train a completely separate neural network called the Reward Model. This model's sole job is to look at any prompt-plus-response pair and output a single numeric score estimating how much a human would like it. It became an automated proxy for human taste.
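How do rankings become a score? The InstructGPT paper trains the reward model with a pairwise loss: for each (preferred, rejected) pair, the loss is -log(sigmoid(score_preferred - score_rejected)). Below is a small sketch of that formula on plain numbers. The scores are invented, and a real reward model would produce them from a neural network rather than take them as arguments.

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    diff = score_preferred - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp().
    return -diff + math.log1p(math.exp(diff))

# The reward model agrees with the human: small loss.
agree = pairwise_loss(2.0, 0.0)
# The reward model disagrees with the human: large loss.
disagree = pairwise_loss(0.0, 2.0)
print(round(agree, 4), round(disagree, 4))
```

Only the difference between the two scores matters, so the reward model learns a relative scale of preference rather than an absolute grade.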


Step 3: Reinforcement Learning (PPO)

Letting the AI Practice in the Simulator

Now comes the actual reinforcement learning phase, using an algorithm called PPO (Proximal Policy Optimization). This is where the magic happens.

The conversational model and the Reward Model are then run together, over and over, in a closed loop:

  1. The conversational model gets a prompt and generates a response.
  2. The Reward Model looks at the response and dishes out a score (a "reward" or a "penalty").
  3. The conversational model adjusts its internal weights to maximize its score next time.
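The three steps above can be sketched as a toy loop. This is a deliberately simplified illustration and not PPO: the "policy" chooses among three canned responses, the "reward model" is a hard-coded table of invented scores, and the update is a plain policy-gradient step using the exact gradient of the expected reward. All names and numbers are assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical reward-model scores for three canned responses to one prompt.
rewards = {"helpful answer": 1.0, "off-topic answer": -0.5, "rude answer": -1.0}
responses = list(rewards)
logits = [0.0, 0.0, 0.0]  # the "policy" starts out indifferent
learning_rate = 0.5

initial_probs = softmax(logits)
for _ in range(200):
    probs = softmax(logits)  # step 1: how likely is each response?
    expected_reward = sum(p * rewards[r] for p, r in zip(probs, responses))
    # Steps 2 and 3: score each response, then nudge the weights toward
    # responses that score above the current average.
    logits = [
        logit + learning_rate * p * (rewards[r] - expected_reward)
        for logit, p, r in zip(logits, probs, responses)
    ]
final_probs = softmax(logits)

for r, before, after in zip(responses, initial_probs, final_probs):
    print(f"{r}: {before:.2f} -> {after:.2f}")
```

The probability of the highest-scoring response climbs toward 1 while the others shrink, which is the same pressure the real loop applies across a huge space of possible responses.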

💡 The Puppy Training Analogy: Think of it like training a puppy. You can't explain to a dog what "sit" means using dictionary definitions. Instead, you wait for the dog to naturally sit, and the moment it does, you give it a treat. Over time, the dog connects the action to the positive reward.

Through many iterations of this automated feedback loop, the model learned to optimize for clarity, politeness, and utility. It learned to reject harmful queries, admit when it was wrong, and sustain a coherent thread over multiple turns.
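Two formulas do most of the work in this step, and both come from the papers listed in the references rather than from anything specific to this post. PPO (Schulman et al.) clips how far a single update can move the policy, and InstructGPT (Ouyang et al.) subtracts a KL-style penalty from the reward so the model does not drift too far from its supervised starting point while chasing a high score. The sketch below applies both to plain numbers. The function names and the values of `epsilon` and `beta` are illustrative assumptions, not ChatGPT's actual settings.

```python
def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate for one action.

    ratio: new_policy_prob / old_policy_prob for the sampled action.
    advantage: how much better the action was than expected.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def kl_shaped_reward(reward_model_score, logprob_policy, logprob_reference, beta=0.02):
    """Reward model score minus a penalty for straying from the reference model."""
    return reward_model_score - beta * (logprob_policy - logprob_reference)

# A good action whose probability already rose 50%: the gain is capped at 1.2.
capped = clipped_objective(1.5, advantage=1.0)
# A bad action whose probability rose 50%: the full penalty still applies.
uncapped = clipped_objective(1.5, advantage=-1.0)
# The policy likes this response more than the reference model did, so it pays a small penalty.
shaped = kl_shaped_reward(1.0, logprob_policy=-1.0, logprob_reference=-2.0, beta=0.1)
print(capped, uncapped, shaped)
```

The clipping is what "proximal" refers to: each update stays close to the previous policy, which keeps training stable when the reward signal is only a learned approximation of human judgment.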


Why ChatGPT Still Asks: "Which response is better?"

If you've ever used ChatGPT and noticed it occasionally generates two different responses side-by-side, asking you to click on the one you prefer, you are being asked for the same kind of comparison that labelers provide in Step 2 of this pipeline.

Alignment is never truly finished. Language changes, new use cases emerge, and user expectations shift. By presenting you with two choices, OpenAI can gather real-world comparison data at no labeling cost, the kind of data that can be used to update and refine Reward Models. Your single click is a small human preference signal of the sort that guides later rounds of reinforcement learning.


The Takeaway for Engineers

When ChatGPT launched, it wasn't a brand-new architectural design; it was an interface and alignment triumph built on top of InstructGPT's foundations. It proved that how we guide the weights after the massive pre-training phase matters just as much as the raw data and compute that go into it.

By moving beyond pure next-token prediction as the only training objective and optimizing for a human reward signal, OpenAI built a system that didn't just know language; it knew how to cooperate with us.


References & Technical Deep Dives

  • Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." (The foundational paper introducing InstructGPT and the core RLHF framework used for ChatGPT).
  • Christiano, P. F., et al. (2017). "Deep reinforcement learning from human preferences." (The original framework establishing how complex goals can be optimized using human rankings rather than explicit reward functions).
  • Schulman, J., et al. (2017). "Proximal Policy Optimization Algorithms." (The core reinforcement learning algorithm utilized in Step 3 of the alignment process).
  • OpenAI Blog (2022). "Introducing ChatGPT." (The official release announcement).
