A CPU Story: Why must a function return?


Intro: What does `return` actually return? #

Have you ever stopped and asked yourself: “Why am I even writing return?”

On the surface it’s trivial. “Return returns a value, who cares.” You’ve been doing it for years; it’s muscle memory. But think back to the last time you wrote return at the end of a function that doesn’t really return anything. That tiny moment of hesitation — “What value am I returning here, exactly?” — probably flashed by and got suppressed under “the language wants it, whatever”. GeeksforGeeks+1

If we phrase the question a bit more precisely:

Why does every function have to return in some way?

In some languages you don’t even write the keyword; the compiler just injects a ghost return at the end of the block. With void functions things get even weirder: the type literally says “nothing”, yet the function must still return. So what is actually leaving the function?

What leaves is not a value; it’s where the CPU was.

Every program running on your machine has to live in RAM: the OS, your IDE, the browser, your scripts, your test runner. They all float in the same physical address space; virtual memory and isolation are “how we organize the ocean”, not separate universes. In this sea of instructions the CPU has only one compass: the program counter (PC). Its entire job is to answer a simple question: “Which address in memory am I executing right now?” Wikipedia+3

When you call a function, the CPU must remember whichever address that compass was pointing to, because once the function finishes it has to jump back and continue from there. This is the real meaning of return:

A moment ago you were at this address in memory — go back there.

Whether or not you return a value, at the hardware level every call that finishes “returns” exactly one thing: the address the CPU needs in order to take its next step. Wikipedia+3

That pushes us from “why does a function exist?” into “how do we model computation?”.

From computation’s point of view: FSM and “where you left off” #

This story doesn’t start in hardware. Long before computers, we were using the same idea: computation is just moving from one state to another.

Even on paper, doing basic arithmetic, you’re essentially saying: “I was at 7, I added 3, now I’m at 10.” This state-to-state view of computation is what a finite-state machine (FSM) formalizes. There’s an initial state, a sequence of intermediate states, and eventually a final state where you’re done. Wikipedia+2

You can look at a program the same way: each line, each branch, each function call moves you into a new state. A function is a smaller machine embedded in that chain: it has its own initial state, its own internal steps, and a point where it stops and hands you back to the outer machine — to where you left off.

In theory, “where you left off” is the caller FSM’s next state.
In the CPU, “where you left off” is the caller’s next instruction address.

That’s why return is more than “produce a value”: in the FSM world it means “go back to the caller’s next state”, in the hardware world it means “restore the caller’s PC”. It’s not a convention the language designers made up on a whim; it’s what you get when you push computation theory all the way down into CPU registers. Wikipedia+2
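The FSM reading can be made concrete with a small Python sketch. The two “machines”, their step names and the `run` helper below are hypothetical, chosen only to show the one rule that matters: a call records the caller’s next state, and a return resumes exactly there.

```python
# Hypothetical toy model: each state is a (machine, step) pair.
CALLER = ["start", "call add", "print result", "halt"]
ADD = ["load a and b", "compute a + b", "return"]


def run() -> list:
    """Step through the caller and the callee, recording every state visited."""
    trace = []
    saved = []                      # where each caller left off
    machine, step = "caller", 0
    while True:
        trace.append((machine, step))
        program = CALLER if machine == "caller" else ADD
        op = program[step]
        if op == "halt":
            return trace
        if op.startswith("call"):
            saved.append((machine, step + 1))  # the caller's NEXT state
            machine, step = "add", 0           # enter the inner machine
        elif op == "return":
            machine, step = saved.pop()        # go back to where we left off
        else:
            step += 1                          # ordinary state transition


print(run())
```

In the trace, ("add", 2), the return step, is followed immediately by ("caller", 2): the outer machine picks up at the step after the call, not at the call itself.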

From the CPU’s point of view: PC and call #

The CPU’s world is much simpler than ours. It really only cares about one thing: the program counter (PC). The PC stores a single memory address. Instruction after instruction, the CPU runs the same ritual, the fetch-decode-execute cycle: Baeldung+1

Fetch the instruction at the address in PC.
Decode and execute it.
Compute the “next” PC.

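Conceptually, those three steps are the heartbeat of every CPU that’s ever shipped, from the old Z80 to your modern x86-64. Pipelined, out-of-order cores overlap many instructions at once, but each one still goes through fetch, decode and execute.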

Look at a classic block diagram and the names become concrete: program counter, stack pointer, register file, the buses between them. They’re not just boxes in a slide; they occupy literal area on the die.

Intel 8085 block diagram — program counter, stack pointer and register file as physical blocks connected by buses; modern CPUs run the same architecture at a different scale

For ordinary instructions, “next” basically means “current address + instruction size”. Branches and function calls change the game: “next” is no longer the neighbor — it’s somewhere else entirely. A function is really just a pattern that makes this jump reusable and predictable.
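Here is a minimal sketch of that loop, assuming a made-up toy ISA in which memory maps an address to an (opcode, operand, size) tuple. None of these opcodes belong to a real instruction set; they only show the default “current address + size” step and a branch overriding it.

```python
# Hypothetical toy ISA: address -> (opcode, operand, size in bytes).
MEMORY = {
    0x00: ("load", 7, 2),     # acc = 7
    0x02: ("add", 3, 2),      # acc += 3
    0x04: ("jmp", 0x09, 3),   # branch: next PC is 0x09, not 0x07
    0x07: ("add", 100, 2),    # never executed, the jump skips it
    0x09: ("halt", None, 1),
}


def run(memory: dict) -> tuple:
    """Fetch-decode-execute until halt; return (accumulator, visited PCs)."""
    pc, acc, visited = 0x00, 0, []
    while True:
        op, arg, size = memory[pc]     # fetch the instruction at PC, decode it
        visited.append(pc)
        next_pc = pc + size            # default "next": current address + size
        if op == "halt":               # execute it
            return acc, visited
        elif op == "load":
            acc = arg
        elif op == "add":
            acc += arg
        elif op == "jmp":
            next_pc = arg              # a branch replaces the neighbor
        pc = next_pc                   # compute the next PC and go again


acc, visited = run(MEMORY)
print(acc, [hex(a) for a in visited])
```

Running it visits 0x00, 0x02, 0x04 and then 0x09: the jump skips the instruction at 0x07 entirely, and the accumulator ends at 10.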

The compiler takes your func add(a, b) and turns it into machine code at some address in memory. The CPU neither knows nor cares about the name add; the only thing it knows is something like: “this function starts at 0x400580.” Wikipedia+2

When a call add executes, the CPU does three things: Wikipedia+2

It computes the address of the next instruction (the one right after the call) and treats that as the return address.
It writes that address somewhere: on x86, to the top of the stack; on ARM, RISC-V and many other RISC ISAs, into a dedicated link register.
It sets PC to the entry address of add.
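A sketch of those three steps in Python, contrasting the two places a return address can go. Both classes are hypothetical toy cores that model nothing but PC and the saved address. The 5-byte call size matches an x86 `call rel32`; the 4-byte one matches a fixed-width RISC instruction such as RISC-V `jal` or ARM `bl`.

```python
class StackCore:
    """x86-style toy: call pushes the return address onto a stack."""

    def __init__(self, pc: int) -> None:
        self.pc = pc
        self.stack = []                     # saved return addresses

    def call(self, target: int, size: int) -> None:
        self.stack.append(self.pc + size)   # remember the next instruction
        self.pc = target                    # jump to the callee's entry

    def ret(self) -> None:
        self.pc = self.stack.pop()


class LinkRegisterCore:
    """ARM/RISC-V-style toy: call writes the return address into a register."""

    def __init__(self, pc: int) -> None:
        self.pc = pc
        self.lr = None                      # link register, empty at start

    def call(self, target: int, size: int) -> None:
        self.lr = self.pc + size            # saved in a register, not in memory
        self.pc = target                    # jump to the callee's entry

    def ret(self) -> None:
        assert self.lr is not None, "ret without a saved return address"
        self.pc = self.lr


x86 = StackCore(pc=0x400500)
x86.call(0x400580, size=5)    # PC -> 0x400580, stack holds 0x400505
x86.ret()                     # PC -> 0x400505

rv = LinkRegisterCore(pc=0x400500)
rv.call(0x400580, size=4)     # PC -> 0x400580, LR holds 0x400504
rv.ret()                      # PC -> 0x400504
```

One consequence of the link-register design: a callee that calls another function must first save LR somewhere (usually on the stack), or the inner call overwrites its own way back.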

You just wrote add(2, 3) in source. Underneath, what actually happened is:

There’s a call at this address. I saved the address of the next instruction as the return address. I set PC to 0x400580. We’re going in.

What add(2, 3) becomes — arguments moved into registers via mov, then a call to the function’s address; the original expression disappears and only addresses remain

Stack pointer: the parking slot for return addresses #

Time to bring the stack pointer (SP) on stage. SP points at the top of a LIFO region in RAM. When a call runs on a stack-based design like x86, the sequence looks like this: Wikipedia+3

The CPU reads the address of the next instruction.
It moves SP “down” by one slot, 8 bytes on x86-64 (on most architectures the stack grows downward).
It writes that return address at the new top of the stack.
It sets PC to the function’s entry address.

Each call pushes another return address (and usually some frame data) onto the stack; each ret pops the topmost return address back into the PC and moves SP back up. That discipline is also why stack accesses are fast: everything happens near the top, so the same few cache lines are reused over and over, which is close to the ideal access pattern for a cache. yuriygeorgiev+2
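To see SP and the return address as actual bytes, here is a sketch of a hypothetical 64-bit toy machine whose stack is a small `bytearray` that grows downward. The stack top, stack size and little-endian 8-byte slots are assumptions that mirror x86-64; alignment, frame pointers and other ABI details are not modeled.

```python
import struct


class Machine:
    """Hypothetical 64-bit toy: return addresses on a downward-growing stack."""

    WORD = 8  # bytes per stack slot, as on x86-64

    def __init__(self, pc: int, stack_top: int = 0x1000, stack_size: int = 0x100):
        self.low = stack_top - stack_size   # lowest address the stack may use
        self.ram = bytearray(stack_size)    # backing bytes for [low, stack_top)
        self.sp = stack_top                 # empty stack: SP sits at the top
        self.pc = pc

    def _store(self, addr: int, value: int) -> None:
        struct.pack_into("<Q", self.ram, addr - self.low, value)

    def _load(self, addr: int) -> int:
        return struct.unpack_from("<Q", self.ram, addr - self.low)[0]

    def call(self, target: int, size: int) -> None:
        return_address = self.pc + size        # address of the next instruction
        self.sp -= self.WORD                   # move SP down one slot
        self._store(self.sp, return_address)   # write it at the new top
        self.pc = target                       # jump to the function's entry

    def ret(self) -> None:
        self.pc = self._load(self.sp)          # top of stack back into PC
        self.sp += self.WORD                   # move SP back up


m = Machine(pc=0x400500)
m.call(0x400580, size=5)   # SP: 0x1000 -> 0xff8, slot holds 0x400505
m.call(0x400600, size=5)   # nested: SP -> 0xff0, slot holds 0x400585
m.ret()                    # PC = 0x400585, SP back to 0xff8
m.ret()                    # PC = 0x400505, SP back to 0x1000
```

Every push happens at the current top and every pop undoes the most recent push, so the two `ret`s unwind the calls in exactly reverse order.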

Call/ret in three frames — before the call (SP and PC in the caller), during the call (return address pushed, PC jumped to callee), after the ret (top of stack loaded back into PC, SP restored)

The heap lives in a different universe. That’s the place for variable-sized, long-lived, messy objects. Pointers can connect the two worlds, but the heap doesn’t have anything like a strict “return address” discipline. The stack does: every call pushes a record on top, every ret pops from the top. There is no “remove from the middle”. Wikipedia+3

Process memory layout — code at the bottom, heap growing upwards, stack growing downwards; two different allocation disciplines sharing the same address space

Stack overflow is what happens when that discipline hits its physical ceiling.
In practice it’s mostly one of two things: you forgot a base case in a recursive function, or the recursion is correct but so deep that the stack runs out before the base case is reached.
Each call adds another frame and another return address, SP marches further down every time, and at some point it crosses the limit the OS reserved for this thread’s stack (commonly a few megabytes, usually fenced off by an unmapped guard page). The next push touches memory that isn’t mapped, the CPU raises a fault, the OS effectively says “that’s enough”, and your program crashes. Wikipedia+4
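A sketch of both failure modes, assuming a hypothetical stack that holds at most 64 return addresses. The `StackOverflow` exception stands in for the fault the OS raises; real limits are measured in bytes rather than frames, so the depth at which a real program dies depends on its frame sizes.

```python
class StackOverflow(Exception):
    """Stands in for the fault raised when SP runs past the stack's limit."""


class BoundedStack:
    """Hypothetical stack region with room for `capacity` return addresses."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.frames = []

    def push(self, return_address: int) -> None:
        if len(self.frames) == self.capacity:
            raise StackOverflow(f"no room for frame {len(self.frames) + 1}")
        self.frames.append(return_address)

    def pop(self) -> int:
        return self.frames.pop()


def countdown(n: int, stack: BoundedStack, has_base_case: bool = True) -> int:
    """Simulated recursion: every call pushes, every return pops."""
    if has_base_case and n == 0:
        return 0
    stack.push(0x400505)                          # made-up return address
    result = 1 + countdown(n - 1, stack, has_base_case)
    stack.pop()                                   # the matching ret
    return result


print(countdown(10, BoundedStack(64)))            # fine: depth 10
for n, base in [(5, False), (100, True)]:         # no base case / too deep
    try:
        countdown(n, BoundedStack(64), has_base_case=base)
    except StackOverflow as err:
        print("stack overflow:", err)
```

The first case never stops pushing; the second would stop at zero, but needs 100 frames in a region that only has room for 64.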

Keep that picture in the back of your mind — we’ll come back to it when we talk about the cost of functions.

If you never return: how “nothingness” actually works #

Once you’ve seen the call/ret mechanics, “What happens if I don’t return?” is no longer an abstract question.

When I say “if there’s no return, there’s only nothingness”, I mean something very specific: the CPU keeps ticking, keeps executing instructions, keeps updating PC — but the ret never runs.
The return address at the top of the stack never makes its way back into the PC.
The program is technically still “alive”, but the line after the call is gone for good. Wikipedia+3

Through the FSM lens, what you’ve done is this:
you abandoned the outer state machine (the caller) and decided to spin forever inside the inner one (the callee).
You’ve cut the link that should take you back to the caller’s next state. return is exactly the thing that re-opens that link. Wikipedia+2

The two classic shapes:

An infinite loop inside the function
PC keeps bouncing between a handful of instructions, SP doesn’t move, and control never gets back to the line after the call. In a single-threaded program, everything “after” that call is effectively frozen. StackOverflow+1
Recursion with no base case
Each call writes a new return address onto the stack, SP moves down, and no ret ever consumes those addresses. Eventually you slam into the physical end of the stack region and get a stack overflow. Wikipedia+3
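The infinite-loop shape can be sketched with a hypothetical toy program and a step budget (a real CPU has no budget; it only lets the sketch terminate). The return address of the line after the call is saved, but with no ret in the callee nothing ever loads it back into PC.

```python
# Hypothetical toy programs; addresses are just list indices.
PROGRAM = [
    ("call", 3),      # 0: call the function at 3
    ("work", None),   # 1: the "line after the call"
    ("halt", None),   # 2
    ("jmp", 3),       # 3: function body: an infinite loop, no ret
]
PROGRAM_WITH_RET = PROGRAM[:3] + [("ret", None)]


def run(program: list, max_steps: int) -> tuple:
    """Execute up to max_steps instructions; report state, visited PCs, stack."""
    pc, stack, visited = 0, [], set()
    for _ in range(max_steps):          # budget, since this may never halt
        visited.add(pc)
        op, arg = program[pc]
        if op == "call":
            stack.append(pc + 1)        # the return address is saved...
            pc = arg
        elif op == "ret":
            pc = stack.pop()            # ...but only ret would use it
        elif op == "jmp":
            pc = arg
        elif op == "halt":
            return "halted", visited, stack
        else:
            pc += 1
    return "out of budget", visited, stack


print(run(PROGRAM, 1000))
print(run(PROGRAM_WITH_RET, 1000))
```

With the loop, address 1 never appears among the visited PCs and its return address is still sitting on the stack when the budget runs out; swapping the loop for a ret brings PC straight back to it.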

The choice of language doesn’t matter here. Go, C, Java, Python… if you call a function whose body is an infinite loop, the code after the call simply never runs. The CPU might be very busy executing instructions, but the PC never visits the “line after the call” again.

Inside the compass: where PC and SP actually live #

So far we’ve talked about PC and SP as “special registers”.
On the theory side we’ve used FSMs; on the CPU side we’ve used call/ret + stack and drawn the whole picture at the ISA level.
Now let’s take one more step down and ask: where does the need to return physically land on the die? Where are these state transitions actually stored?

This is why we peek at the die:
to see that return isn’t a purely syntactic rule the language enforces, but a concrete hardware behavior — specific rows in the register file and specific bytes in RAM changing in a disciplined pattern.
Keeping that in mind changes how you look at code: instead of “what’s the harm?”, you start asking “exactly which physical thing am I stressing here?” and you get better at predicting which abstractions will bite you later.

On diagrams we draw a box and label it “register file”. In a debugger we see RSP = 0x7ffeefbff5c0. Easy to think of those as just names. In reality these are physical structures living inside each core, constantly flipping bit patterns on every clock: rows of flip-flops etched into silicon — register rows. Wikipedia+2

Zoom in a bit: under the package there’s the silicon die; on the die there are cores. Inside each core: cache, ALUs, branch units, and a block called the register file. That’s the neighborhood where RAX, RBX, RSP and the PC all live. Wikipedia+1

Intel Sandy Bridge per-core floor plan — a clear view of where the register file, branch unit and cache sit inside a core

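The register file is not one opaque box; it’s more like a grid of rows and columns. Each cell is a tiny flip-flop holding a single bit; each row is a 64‑bit register. The labels you see in ISA docs (RSP, RIP, RAX) are really row names on this grid. (In a modern out-of-order core the mapping is dynamic: register renaming points an architectural name like RSP at whichever physical row currently holds its latest value. The picture still holds: a register is a row of flip-flops.) Wikipedia+2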

Register file as a grid — each row is a register, each cell is a flip-flop holding a single bit

When your debugger says RSP = 0x7ffeefbff5c0, all 64 flip-flops in that row are holding the bits of that address. When a call runs, the bit pattern in that row changes; when a ret runs, the pattern changes back. Climbing one frame up the call stack literally means “this row of flip-flops changed back to its previous value”.
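A small sketch of that row, treating RSP as a list of 64 bits. It assumes the x86-64 case where a call moves RSP down by 8 bytes, and it tracks only the architectural value; which physical row holds it after register renaming is not modeled.

```python
def bits(value: int, width: int = 64) -> list:
    """The register 'row': one entry per flip-flop, most significant first."""
    return [(value >> i) & 1 for i in reversed(range(width))]


rsp_before = 0x7FFEEFBFF5C0          # the value from the debugger example
rsp_after_call = rsp_before - 8      # one 8-byte return address pushed
rsp_after_ret = rsp_after_call + 8   # ret pops it again

row_before = bits(rsp_before)
row_after = bits(rsp_after_call)
flipped = [i for i, (a, b) in enumerate(zip(row_before, row_after)) if a != b]

print(hex(rsp_after_call))           # 0x7ffeefbff5b8
print(len(flipped))                  # 4 flip-flops changed state
print(bits(rsp_after_ret) == row_before)  # True: the row is back where it was
```

Only the low bits move here because subtracting 8 from 0x...5c0 doesn’t borrow past the last hex digits; a push that crosses a larger boundary would flip more of the row.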

This is why I don’t want to leave return at “because the language wants it”: the honest sentence is closer to “when this function finishes, the CPU must restore this particular register row with this address pattern”.

The real cost of a function, and the compiler’s quiet answer #

Let’s go back up one level: functions are not free abstractions.

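Every call / ret comes with work attached: computing and storing the return address, moving SP, redirecting PC, and on the way back, loading that address into PC again and moving SP back up.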

If you slice your code into lots of tiny functions, what you’re really saying at the physical level is: “We’re going to jump PC and SP around more often, and we’re going to grow and shrink the stack more frequently.” In exchange you get readability, testability, reuse.

Looking at that, you might be tempted to say: “Fine, then I’ll just write everything in one giant function and never call anything.”
That has a cost too: a huge body that’s painful to reason about and debug. If you had to manually choose the sweet spot between “everything inline” and “everything tiny”, you’d be right to worry.

Modern compilers step in exactly here. They aggressively inline small, hot functions: instead of emitting an actual call, they paste the function body straight into the call site. call / ret vanish, SP and PC move less, the binary grows a bit, the hot path gets faster. StackOverflow+3
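Inline expansion itself fits in a few lines. The sketch below assumes a hypothetical stack-machine IR (push, add, call, ret, halt), inlines a one-instruction `add` into its caller, and counts executed instructions both ways. Real compilers inline on a far richer IR and decide with size and hotness heuristics; this only shows that call and ret disappear while the result stays the same.

```python
# Hypothetical stack-machine IR: each function is a list of (op, arg) pairs.
FUNCTIONS = {"add": [("add", None), ("ret", None)]}
MAIN = [("push", 2), ("push", 3), ("call", "add"), ("halt", None)]


def inline(program: list, functions: dict) -> list:
    """Paste each callee's body into the call site, dropping its trailing ret."""
    out = []
    for op, arg in program:
        if op == "call":
            out.extend(functions[arg][:-1])  # no return address, no jump back
        else:
            out.append((op, arg))
    return out


def execute(program: list, functions: dict) -> tuple:
    """Run the program; return (result, number of instructions executed)."""
    code = {"main": program, **functions}
    where, pc = "main", 0
    values, returns, executed = [], [], 0
    while True:
        op, arg = code[where][pc]
        executed += 1
        if op == "push":
            values.append(arg)
            pc += 1
        elif op == "add":
            values.append(values.pop() + values.pop())
            pc += 1
        elif op == "call":
            returns.append((where, pc + 1))  # save the return address
            where, pc = arg, 0
        elif op == "ret":
            where, pc = returns.pop()        # jump back to the caller
        elif op == "halt":
            return values.pop(), executed


print(execute(MAIN, FUNCTIONS))                     # (5, 6): call and ret ran
print(execute(inline(MAIN, FUNCTIONS), FUNCTIONS))  # (5, 4): they vanished
```

This naive inliner assumes every callee ends in exactly one ret and contains no calls or jumps of its own; anything beyond that is where real compilers earn their keep.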

The mental model I’d use:

Don’t be afraid to factor code into meaningful functions.
But don’t write a new function every other line either; pointless abstractions still cost real calls and stack frames in debug builds and in cold paths.
The compiler will try to inline “small and hot” functions for you; when you overdo it, you end up paying the real call cost in the places where heuristics decide not to inline.

In debug builds inlining is usually disabled. That’s why you get a nice, clean call stack there, and a much messier picture in release builds — the optimizer is eating calls and spitting out straight-line instructions. UC Berkeley+1

Where we ended up, and what comes next #

We went all the way down: PC and SP as rows inside a register file, the relative positions of stack and heap in memory, and what stack overflow actually hits. We walked the same function call across four layers:

Theory layer — FSM: computation as state→state transitions, and the function’s obligation to bring you back to “where you left off”.
CPU layer — call and ret, and how they move PC and SP.
Hardware layer — actual register rows and flip-flops holding PC and SP.
Compiler/developer layer — inlining, function cost, and how much abstraction to introduce.

After this, it’s hard for me to say “a function just returns” with a straight face. Every return you type nudges a specific row in the register file, touches specific bytes in RAM, and exercises the branch predictor, whose Return Address Stack tries to guess where each ret will land.

I’d like you to leave this piece with one more question in mind:

If every core only has one PC and every thread has its own stack, how does parallelism actually work?

That’s the next story. We’ll talk about per‑thread stacks, per‑core register sets, how PC and SP are saved and restored during context switches, and where branch prediction and the Return Address Stack fit into all of this. In other words:

How can we be “in” so many functions at once?

We’ll answer that from the CPU’s side of the table. Wikipedia+3

Thanks #

UC Berkeley’s open course material is the backbone of this series. Computer architecture, CPU internals, stack/register/memory hierarchy — being able to learn this stuff for free, as someone who came to these details later in their career, is a huge privilege.

🎥 The lecture series that shaped this post:
UC Berkeley — Computer Architecture (YouTube playlist)
