How AI Actually Works:
Why It Has No Memory
You've chatted with ChatGPT, Claude, or Gemini. It feels like they remember your conversation. But what if I told you they have absolutely no memory?
The Illusion of Memory
Open ChatGPT. Ask it something. Ask a follow-up. It responds perfectly, referencing what you said earlier. It feels like a conversation with someone who remembers everything.
But here's the thing: it doesn't remember anything. Not a single word. Every time you hit send, the AI starts completely from scratch. It has no idea what you said 5 seconds ago unless something very clever is happening behind the scenes.
Let me show you exactly what that "something clever" is.
The AI is a Stateless Black Box
At its core, a large language model (LLM) like GPT-4 or Claude does one thing: it takes in text and predicts the next word (technically, the next token). That's it. No thinking. No understanding. No memory. Just pattern matching at an incredible scale.
Here is the flow: text goes in, the model processes it, and tokens come out one at a time. The model receives the entire input at once, generates one token, appends it to the output, and repeats until the response is complete.
Key insight: The model has no memory. It receives the entire conversation as input and predicts the next token. Every single time.
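The generation loop described above can be sketched in a few lines. This is only an illustration: `toy_next_token` and `TOY_TABLE` are made-up stand-ins for a real model, which would compute the prediction with a neural network over the whole input rather than a lookup on the last token.

```python
from typing import Callable, List

# Hypothetical stand-in for a real model: a fixed table mapping the last
# token to the next one. A real LLM looks at the entire input, not just
# the final token.
TOY_TABLE = {
    "the": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": "<end>",
}


def toy_next_token(tokens: List[str]) -> str:
    """Return one predicted token. Nothing is stored between calls."""
    return TOY_TABLE.get(tokens[-1], "<end>")


def generate(prompt_tokens: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int = 10) -> List[str]:
    """Predict one token, append it to the sequence, and repeat."""
    tokens = list(prompt_tokens)  # copy so the caller's list is untouched
    output = []
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # the whole sequence goes in every time
        if token == "<end>":
            break
        tokens.append(token)
        output.append(token)
    return output


print(generate(["the"], toy_next_token))
# ['capital', 'of', 'France', 'is', 'Paris']
```

Calling `generate` twice with the same prompt gives the same result, because the function keeps nothing between calls.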
So How Does It "Remember"?
Here's the trick: every time you send a message, the application (ChatGPT, Claude, etc.) doesn't just send your latest message. It sends the entire conversation history as a JSON array. Every. Single. Time.
The conversation is stored as an array of message objects, each with a role and a content field:
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" },
    { "role": "assistant", "content": "The capital of France is Paris!" },
    { "role": "user", "content": "What about Germany?" }
  ]
}
There are three roles:
- system: The system prompt, a hidden instruction written by the developer that you normally never see. It defines the AI's personality, rules, and behavior. A general-purpose assistant's system prompt might tell it to be helpful and harmless. A customer support bot might have a system prompt like "You are a support agent for Acme Corp. Only answer questions about our products. Be polite." This is the developer's control layer, and by convention it is the very first message in the array.
- user: Your messages. What you type in the chat box.
- assistant: The AI's responses. What it generated previously.
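As a rough sketch, here is how an application might assemble that array in Python before serializing it. The helper names `make_message` and `build_payload` are invented for this example, and the exact payload shape a given provider expects may differ.

```python
import json
from typing import Dict, List

Message = Dict[str, str]
VALID_ROLES = {"system", "user", "assistant"}


def make_message(role: str, content: str) -> Message:
    """Build one message object, rejecting roles outside the three above."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return {"role": role, "content": content}


def build_payload(messages: List[Message]) -> str:
    """Serialize the full conversation as the JSON body of a request."""
    return json.dumps({"messages": messages}, indent=2)


conversation = [
    make_message("system", "You are a helpful assistant."),
    make_message("user", "What is the capital of France?"),
    make_message("assistant", "The capital of France is Paris!"),
    make_message("user", "What about Germany?"),
]

print(build_payload(conversation))
```

The printed JSON has the same structure as the example above: one object per turn, in order, with the system prompt first.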
Try It Yourself
Picture two views of the same chat. On the left is what you see: a normal chat. On the right is what's actually happening: the raw JSON array being sent to the API. With every message you send, the array grows by one more exchange.
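To try this without a live API, here is a small simulation. `ChatApp` and `fake_model` are hypothetical names for this sketch; `fake_model` stands in for the real API call and simply reports how many messages it was handed.

```python
from typing import Dict, List

Message = Dict[str, str]


def fake_model(messages: List[Message]) -> str:
    """Hypothetical stand-in for an API call; reports how much it received."""
    return f"(reply based on {len(messages)} messages)"


class ChatApp:
    """The application layer: it, not the model, holds the history."""

    def __init__(self, system_prompt: str) -> None:
        self.history: List[Message] = [
            {"role": "system", "content": system_prompt}
        ]

    def send(self, user_text: str) -> str:
        self.history.append({"role": "user", "content": user_text})
        reply = fake_model(self.history)  # full history on every call
        self.history.append({"role": "assistant", "content": reply})
        return reply


app = ChatApp("You are a helpful assistant.")
print(app.send("What is the capital of France?"))  # (reply based on 2 messages)
print(app.send("What about Germany?"))             # (reply based on 4 messages)
print(len(app.history))                            # 5
```

The second call receives four messages even though the user typed only one new line, which is the whole trick.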
The Payload Keeps Growing
Here's where it gets really interesting. Each API call includes everything from all previous calls, plus the new messages. The first call carries just the system prompt and the user's question. Every later call carries all of that again, plus each reply and follow-up added since.
This is why:
- Long conversations get slower — the model has to process more input every time
- Long conversations cost more — you're paying per token, and the token count keeps growing
- There's a limit: every model has a "context window" (e.g., 128K tokens for GPT-4 Turbo). When you hit it, the application has to drop or condense older messages
- Starting a "new chat" resets everything — the AI genuinely forgets because nothing is being passed anymore
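A back-of-the-envelope calculation shows how quickly the input grows. The sketch below assumes a very crude token count (one token per whitespace-separated word) and leaves the system prompt out for simplicity; real tokenizers split text differently, so treat the numbers as illustrative only.

```python
from typing import List


def rough_token_count(text: str) -> int:
    """Crude approximation: one token per whitespace-separated word."""
    return len(text.split())


def input_tokens_per_call(turns: List[str]) -> List[int]:
    """Input size of each API call when every call resends all prior turns.

    `turns` alternates user message, assistant reply, user message, ...
    A call is made each time a user message (even index) is added.
    """
    sizes = []
    running_total = 0
    for index, text in enumerate(turns):
        running_total += rough_token_count(text)
        if index % 2 == 0:  # a user turn triggers an API call
            sizes.append(running_total)
    return sizes


turns = [
    "What is the capital of France?",   # 6 words
    "The capital of France is Paris!",  # 6 words
    "What about Germany?",              # 3 words
]

per_call = input_tokens_per_call(turns)
print(per_call)       # [6, 15]
print(sum(per_call))  # 21
```

The second question is only three words long, yet the second call processes fifteen, because the first question and its answer ride along again.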
The Aha Moment
Every single time you send a message, the ENTIRE conversation is sent again from scratch.
The AI has no memory. No state. No continuity. The application creates the illusion of a conversation by replaying the full history every time.
This is the fundamental architecture of every AI chatbot you use today. ChatGPT, Claude, Gemini — they all work this way. The model itself is a stateless function:
f(messages[]) → next_token
That's it. A function that takes an array of messages and predicts the next token. The magic is in the scale of the pattern matching, the quality of the training data, and the clever engineering of the applications that wrap around it.
Why Context is "Engineered"
Since the entire conversation is passed to the model every single time, the payload keeps growing. Eventually, it hits the model's context window: the maximum number of tokens (words or pieces of words) it can process in a single request. Once you hit that limit, nothing more fits. Unless the application drops or condenses older messages, the conversation is effectively over.
This is exactly why context is "engineered". Developers don't just blindly dump everything into the messages array — they carefully choose what to include and what to leave out, so the model gets the most relevant information within its token budget.
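One simple strategy, sketched below, is to keep the system prompt and then as many of the most recent messages as fit in a token budget. This is only one possible approach, not how any particular product does it; `trim_to_budget` is a hypothetical helper, and the word-based token count is a deliberate simplification.

```python
from typing import Dict, List

Message = Dict[str, str]


def rough_token_count(text: str) -> int:
    """Crude approximation: one token per whitespace-separated word."""
    return len(text.split())


def trim_to_budget(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep the leading system message (if any) plus the newest messages
    that still fit within max_tokens."""
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    used = sum(rough_token_count(m["content"]) for m in system)

    kept_newest_first = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = rough_token_count(message["content"])
        if used + cost > max_tokens:
            break  # everything older than this is dropped
        kept_newest_first.append(message)
        used += cost
    return system + kept_newest_first[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris!"},
    {"role": "user", "content": "What about Germany?"},
]

trimmed = trim_to_budget(history, max_tokens=15)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

With a budget of 15, the oldest user question is dropped while the system prompt and the two newest messages survive. Other designs summarize old turns instead of discarding them.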
This is also why RAG (Retrieval-Augmented Generation) architectures exist. Instead of stuffing the entire history or an entire knowledge base into the prompt, RAG systems retrieve only the most relevant pieces of information and inject them into the messages array right before sending. The model gets what it needs for the question at hand, without the token budget being spent on everything else.
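Here is a toy version of that retrieve-then-inject step. It ranks documents by simple word overlap with the question, which is an assumption made to keep the example self-contained; production RAG systems typically use embedding-based similarity search instead. `KNOWLEDGE_BASE`, `retrieve`, and `build_messages` are all invented for this sketch.

```python
from typing import Dict, List, Set

Message = Dict[str, str]


def tokenize(text: str) -> Set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_messages(question: str, documents: List[str]) -> List[Message]:
    """Inject only the retrieved text into the messages array."""
    context = "\n".join(retrieve(question, documents))
    return [
        {"role": "system",
         "content": "Answer using only this context:\n" + context},
        {"role": "user", "content": question},
    ]


# Hypothetical knowledge base for the Acme Corp support bot.
KNOWLEDGE_BASE = [
    "Acme routers ship with a two-year warranty.",
    "Acme support is open Monday to Friday.",
    "Acme headphones pair over Bluetooth.",
]

messages = build_messages("What warranty do Acme routers have?", KNOWLEDGE_BASE)
print(messages[0]["content"])
```

Only the warranty sentence ends up in the system message; the other two documents never reach the model for this question.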