
How AI Actually Works: Why It Has No Memory

You've chatted with ChatGPT, Claude, or Gemini. It feels like they remember your conversation. But what if I told you they have absolutely no memory?

The Illusion of Memory

Open ChatGPT. Ask it something. Ask a follow-up. It responds perfectly, referencing what you said earlier. It feels like a conversation with someone who remembers everything.

But here's the thing: it doesn't remember anything. Not a single word. Every time you hit send, the AI starts completely from scratch. It has no idea what you said 5 seconds ago unless something very clever is happening behind the scenes.

Let me show you exactly what that "something clever" is.

The AI is a Stateless Black Box

At its core, a large language model (LLM) like GPT-4 or Claude does one thing: it takes in text and predicts the next word (technically, the next token). That's it. No thinking. No understanding. No memory. Just pattern matching at an incredible scale.

Stateless means the model doesn't retain any information between requests. Each API call is completely independent. The model has no idea what happened in the previous call.

Watch this: text goes in, the model processes it, and tokens come out one at a time. The model receives the entire input at once, then generates one token, adds it to the output, and repeats.

Input (Full Context)
  system: You are a helpful assistant.
  user: What is 2+2?
        ↓
LLM (stateless)
        ↓
Output (Next Tokens)

Key insight: The model has no memory. It receives the entire conversation as input and predicts the next token. Every single time.
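
Here's a minimal Python sketch of that loop. Everything in it is a stand-in: `toy_next_token` and `TOY_TABLE` are hypothetical and just look up a canned answer, whereas a real LLM computes a probability for every possible next token. The point is the shape of the loop, not the model.

```python
from typing import Callable, List

# Hypothetical stand-in for a real model: a fixed lookup from the full
# context (as a tuple of tokens) to the next token.
TOY_TABLE = {
    ("What", "is", "2+2?"): "2+2",
    ("What", "is", "2+2?", "2+2"): "is",
    ("What", "is", "2+2?", "2+2", "is"): "4.",
}
END = "<end>"


def toy_next_token(context: List[str]) -> str:
    """Pure function: the output depends only on the context passed in."""
    return TOY_TABLE.get(tuple(context), END)


def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_tokens: int = 16) -> List[str]:
    """Generate tokens one at a time, feeding each one back as input."""
    context = list(prompt)  # copy so the caller's list is untouched
    output: List[str] = []
    for _ in range(max_tokens):
        token = next_token(context)  # the model sees the ENTIRE context
        if token == END:
            break
        output.append(token)
        context.append(token)  # the new token becomes part of the input
    return output


print(" ".join(generate(["What", "is", "2+2?"], toy_next_token)))
# prints: 2+2 is 4.
```

Notice that `toy_next_token` keeps nothing between calls. The only reason it "knows" what it already said is that `generate` hands the growing context back to it each time.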

So How Does It "Remember"?

Here's the trick: every time you send a message, the application (ChatGPT, Claude, etc.) doesn't just send your latest message. It sends the entire conversation history as a JSON array. Every. Single. Time.

The conversation is stored as an array of message objects, each with a role and a content field:

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" },
    { "role": "assistant", "content": "The capital of France is Paris!" },
    { "role": "user", "content": "What about Germany?" }
  ]
}

There are three roles:

  • system: The system prompt, a hidden instruction written by the developer that you never see. It defines the AI's personality, rules, and behavior. A general-purpose assistant's system prompt might tell it to be helpful and harmless; a customer support bot might have one like "You are a support agent for Acme Corp. Only answer questions about our products. Be polite." This is the developer's control layer, and by convention it's the very first message in the array.
  • user: Your messages. What you type in the chat box.
  • assistant: The AI's responses. What it generated previously.
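
To make the roles concrete, here's a rough Python sketch of how an application might keep that array and resend it on every turn. `ChatSession` and `fake_model` are names made up for this illustration; `fake_model` stands in for a real API call and isn't any provider's actual client.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def fake_model(messages: List[Message]) -> str:
    """Hypothetical stand-in for an API call. It only sees `messages`."""
    return f"(reply based on {len(messages)} messages)"


class ChatSession:
    """The application, not the model, owns the conversation history."""

    def __init__(self, system_prompt: str,
                 model: Callable[[List[Message]], str]) -> None:
        self.model = model
        self.messages: List[Message] = [
            {"role": "system", "content": system_prompt}
        ]

    def send(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        # The full history goes out on every call, not just the new message.
        reply = self.model(list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply


chat = ChatSession("You are a helpful assistant.", fake_model)
print(chat.send("What is the capital of France?"))  # (reply based on 2 messages)
print(chat.send("What about Germany?"))             # (reply based on 4 messages)
```

The second call sends four messages: the system prompt, the first question, the first answer, and the follow-up.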

Try It Yourself

Here's a simulated chat interface. On the left is what you see — a normal chat. On the right is what's actually happening — the raw JSON array being sent to the API. Click "Send next message" and watch the array grow with every exchange.

[Interactive demo: "What you see" (the ChatGPT-style chat) side by side with "What's actually sent to the API" (the API request payload, which starts empty and grows with each message)]

Notice on the third exchange: the user asks "Which one has more people?" The AI knows they mean France and Germany only because the entire previous conversation was sent again. Without that context, "which one" would be meaningless.
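
You can check this for yourself with a tiny Python sketch. It doesn't call any model; it only asks what information is present in the payload. `countries_in_view` and `KNOWN_COUNTRIES` are hypothetical helpers, and the assistant reply about Berlin is made up for the example.

```python
from typing import Dict, List

Message = Dict[str, str]
KNOWN_COUNTRIES = ("France", "Germany")  # hypothetical, for this toy only


def countries_in_view(messages: List[Message]) -> List[str]:
    """Return the known countries mentioned anywhere in the payload."""
    text = " ".join(m["content"] for m in messages)
    return [c for c in KNOWN_COUNTRIES if c in text]


question = {"role": "user", "content": "Which one has more people?"}
history = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris!"},
    {"role": "user", "content": "What about Germany?"},
    {"role": "assistant", "content": "The capital of Germany is Berlin."},
]

print(countries_in_view(history + [question]))  # ['France', 'Germany']
print(countries_in_view([question]))            # []
```

With the history attached, both countries are in the input. With the question alone, there is nothing for "which one" to refer to.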

The Payload Keeps Growing

Here's where it gets really interesting. Step through these API calls and watch the payload grow. Each call includes everything from all previous calls, plus the new messages.

API Call #1 (2 messages sent)
Payload size: ~15 tokens
  system: You are a helpful assistant. (new)
  user: What is the capital of France? (new)

This is the first API call. Just the system prompt and the user's question.

This is why:

  • Long conversations get slower — the model has to process more input every time
  • Long conversations cost more — you're paying per token, and the token count keeps growing
  • There's a limit: every model has a "context window" (e.g., 128K tokens for GPT-4 Turbo). When you hit it, the application has to drop or summarize older messages
  • Starting a "new chat" resets everything — the AI genuinely forgets because nothing is being passed anymore
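
The cost point is easy to see with some back-of-the-envelope Python. This sketch assumes one token per whitespace-separated word, which is only a crude proxy (real tokenizers split text differently), and the assistant replies are made up for the example. `payload_sizes` is a hypothetical helper, not part of any API.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def rough_token_count(messages: List[Message]) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return sum(len(m["content"].split()) for m in messages)


def payload_sizes(system_prompt: str,
                  turns: List[Tuple[str, str]]) -> List[int]:
    """Approximate input size of each API call in a conversation.

    `turns` is a list of (user_text, assistant_text) pairs.
    """
    history: List[Message] = [{"role": "system", "content": system_prompt}]
    sizes: List[int] = []
    for user_text, assistant_text in turns:
        history.append({"role": "user", "content": user_text})
        sizes.append(rough_token_count(history))  # what this call sends
        history.append({"role": "assistant", "content": assistant_text})
    return sizes


turns = [
    ("What is the capital of France?", "The capital of France is Paris!"),
    ("What about Germany?", "The capital of Germany is Berlin."),
    ("Which one has more people?", "Germany has the larger population."),
]
sizes = payload_sizes("You are a helpful assistant.", turns)
print(sizes)       # [11, 20, 31]
print(sum(sizes))  # 62
```

Each call is bigger than the last, and the total input processed across the conversation (62 here) grows much faster than the number of words anyone actually typed.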

The Aha Moment

Every single time you send a message,
the ENTIRE conversation
is sent again from scratch.

The AI has no memory. No state. No continuity. The application creates the illusion of a conversation by replaying the full history every time.

This is the fundamental architecture of every AI chatbot you use today. ChatGPT, Claude, Gemini — they all work this way. The model itself is a stateless function:

f(messages[]) → next_token

That's it. A function that takes an array of messages and predicts the next token. The magic is in the scale of the pattern matching, the quality of the training data, and the clever engineering of the applications that wrap around it.

Why Context is "Engineered"

Since the entire conversation is passed to the model every single time, the payload keeps growing. Eventually, the AI hits its context window, the maximum number of tokens (words or pieces of words) it can process in a single request. Once you hit that limit, the full history no longer fits in one request: unless the application drops or condenses older messages, the conversation is effectively over.

This is exactly why context is "engineered". Developers don't just blindly dump everything into the messages array — they carefully choose what to include and what to leave out, so the model gets the most relevant information within its token budget.
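
One of the simplest strategies is to keep the system prompt and only the newest messages that fit. Here's a minimal Python sketch of that idea. `fit_to_budget` is a hypothetical helper, and the token count is the same crude one-token-per-word assumption, not a real tokenizer. Real applications often do something smarter, such as summarizing old turns.

```python
from typing import Dict, List

Message = Dict[str, str]


def rough_token_count(message: Message) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(message["content"].split())


def fit_to_budget(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep the system prompt plus the newest messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    remaining = max_tokens - sum(rough_token_count(m) for m in system)
    kept: List[Message] = []
    # Walk backwards from the newest message and stop at the first one
    # that no longer fits, so the kept history stays contiguous.
    for message in reversed(rest):
        cost = rough_token_count(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris!"},
    {"role": "user", "content": "What about Germany?"},
]
trimmed = fit_to_budget(history, max_tokens=9)
print([m["content"] for m in trimmed])
# ['You are a helpful assistant.', 'What about Germany?']
```

With a budget this tight, only the system prompt and "What about Germany?" survive. The France exchange is gone, so the follow-up has lost the context that gave it meaning. That trade-off is exactly what has to be engineered.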

This is also why RAG (Retrieval-Augmented Generation) architectures exist. Instead of stuffing the entire history or an entire knowledge base into the prompt, RAG systems retrieve only the most relevant pieces of information and inject them into the messages array right before sending. The model gets exactly what it needs — nothing more, nothing less.
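
Here's a toy Python sketch of the retrieve-then-inject step. To stay self-contained it ranks documents by plain word overlap, which is a stand-in I've swapped in for the embedding-based search that production RAG systems typically use. `DOCS`, `retrieve`, and `build_messages` are all hypothetical names for this illustration.

```python
from typing import Dict, List, Set

Message = Dict[str, str]

# Hypothetical knowledge base; a real system would hold many documents.
DOCS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The context window limits how many tokens a model can read.",
]


def words(text: str) -> Set[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Rank documents by how many words they share with the query."""
    query_words = words(query)
    ranked = sorted(docs, key=lambda d: len(words(d) & query_words),
                    reverse=True)
    return ranked[:k]


def build_messages(question: str, docs: List[str], k: int = 1) -> List[Message]:
    """Inject only the retrieved snippets into the messages array."""
    context = "\n".join(retrieve(question, docs, k))
    return [
        {"role": "system",
         "content": "Answer using only this context:\n" + context},
        {"role": "user", "content": question},
    ]


for message in build_messages("What is the capital of Germany?", DOCS):
    print(message["role"], "->", message["content"])
```

Only the Berlin sentence makes it into the payload. The other two documents never reach the model, which is how the prompt stays small no matter how large the knowledge base gets.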

The context window is a hard ceiling. Every AI application has to decide what to pass to the model and what to drop. That decision — what context to include — is where the real engineering happens.