Arena allocator programming in Rust for interpreters and compilers
You are building a Lox interpreter. Right now your parser builds an AST with many Box values and many owned String values. That is a good beginner design because it is clear. But when an interpreter or compiler grows, this design starts to hurt.
This page teaches arena allocator programming from first principles, with zero external crates. We will start from the reason arenas exist, then build multiple arena styles in Rust, then apply the idea to your Lox interpreter.
The word is usually spelled arena allocator. I will use arena in the article.
The short idea
An arena is a place where many objects are allocated together and freed together.
Instead of doing this:
let left = Box::new(Expr::Number(1.0));
let right = Box::new(Expr::Number(2.0));
let expr = Box::new(Expr::Binary { left, right });
You do this:
let left = arena.alloc(Expr::Number(1.0));
let right = arena.alloc(Expr::Number(2.0));
let expr = arena.alloc(Expr::Binary { left, right });
Or, even more common in Rust compilers:
let left = ast.exprs.alloc(Expr::Number(1.0));
let right = ast.exprs.alloc(Expr::Number(2.0));
let expr = ast.exprs.alloc(Expr::Binary { left, right });
Where left, right, and expr are small IDs, not Box pointers.
First principles: what problem are we solving?
When you parse source code, you create many small objects:
- tokens
- expressions
- statements
- declarations
- types
- scopes
- symbols
- errors
- compiler intermediate representation nodes
For a small program, this is not a problem. For a real compiler, it can be millions of objects.
The normal heap allocator is general-purpose. General-purpose means it must support many patterns:
- allocate one object
- free that object later
- allocate different sizes
- keep track of each allocation
- avoid memory corruption
- work across many parts of the program
That generality has a cost.
For an AST, we often do not need that generality. Most AST nodes have one simple lifetime:
create during parsing -> use during analysis/interpreting/codegen -> free all at once
That lifetime is perfect for an arena.
The compiler memory pattern
Most compiler phases look like this:
source text
│
▼
lexer creates tokens
│
▼
parser creates AST nodes
│
▼
resolver/type checker reads AST, creates symbols/types
│
▼
interpreter or code generator walks data
│
▼
drop whole phase memory together
Notice something important: we usually do not free a single AST node. We free the whole AST when the program or REPL input is done.
That is the main reason arenas are popular in compilers.
Why Box everywhere becomes expensive
Your current parser has code like this:
pub enum Expr {
Grouping(Box<Expr>),
Assignment {
id: Identifier,
assign: Box<Expr>,
},
UnaryOperation {
op: UnaryOperator,
operand: Box<Expr>,
},
BinaryOperation {
left: Box<Expr>,
op: BinaryOperator,
right: Box<Expr>,
},
}
This is valid Rust. It is also a common first AST design.
But every Box means one heap allocation. For this expression:
1 + 2 * 3 - 4
The parser creates multiple nodes. Many of them become separate heap allocations.
Box(Binary -)
├─ Box(Binary +)
│ ├─ Number(1)
│ └─ Box(Binary *)
│ ├─ Number(2)
│ └─ Number(3)
└─ Number(4)
The costs are:
- many calls to the global allocator
- many small allocations
- memory spread around the heap
- more pointer chasing
- less CPU cache friendliness
- harder ownership when later compiler phases want references into the AST
CPU cache friendliness means the CPU is faster when related data is close together in memory.
What arena allocation changes
With an arena, AST nodes are stored in a few big containers.
Arena memory
╭────────────────────────────────────────╮
│ Expr 0 │ Expr 1 │ Expr 2 │ Expr 3 │ ... │
╰────────────────────────────────────────╯
ExprId(3) points to Expr 3
Instead of each node owning children with Box, nodes can point to children with IDs or references.
pub struct ExprId(usize);
pub enum Expr {
Number(f32),
Binary {
left: ExprId,
op: BinaryOperator,
right: ExprId,
},
}
Now the AST is a graph of small IDs into an arena.
Why this is faster
Arena allocation is usually faster for compiler data because:
- Allocation is simple.
- Deallocation is almost free.
- Data is more compact.
- We avoid many Box allocations.
- We avoid cloning big AST values.
Allocation can be as simple as pushing into a Vec:
self.items.push(value);
ExprId(self.items.len() - 1)
Deallocation happens when the whole Vec is dropped.
drop(ast);
That is the whole cleanup.
Important warning: arena is not magic
Arena allocation is powerful, but it is not always the correct tool.
Use an arena when:
- many objects have the same lifetime
- you rarely delete individual objects
- you build graph-like data such as AST, HIR, IR, symbol tables
- you care about allocation overhead
Do not use an arena when:
- objects must be deleted one by one
- memory must be returned immediately
- objects have very different lifetimes
- simple Vec or Box is already enough
The main thinking model is this:
An arena trades precise freeing for simple and fast bulk freeing.
Precise freeing means freeing each object exactly when it is no longer used. Bulk freeing means freeing everything together.
Three arena styles you should know
For Rust interpreters and compilers, three styles matter most.
╭────────────────────╮
│ 1. Index arena │ returns ExprId, StmtId, DeclId
╰────────────────────╯
╭────────────────────╮
│ 2. Typed ref arena │ returns &'arena Expr
╰────────────────────╯
╭────────────────────╮
│ 3. Bump arena │ low-level bytes, fastest, unsafe core
╰────────────────────╯
For your current Lox interpreter, I recommend starting with the index arena.
Why? Because it is safe Rust, easy to debug, easy to serialize, and easy to use in later compiler passes.
Part 1: a simple index arena
An index arena stores values in a Vec and returns typed IDs.
Start simple:
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Id<T> {
index: usize,
_marker: std::marker::PhantomData<fn() -> T>,
}
impl<T> Id<T> {
pub fn index(self) -> usize {
self.index
}
}
#[derive(Debug, Default)]
pub struct Arena<T> {
items: Vec<T>,
}
impl<T> Arena<T> {
pub fn new() -> Self {
Self { items: Vec::new() }
}
pub fn alloc(&mut self, value: T) -> Id<T> {
let id = Id {
index: self.items.len(),
_marker: std::marker::PhantomData,
};
self.items.push(value);
id
}
pub fn get(&self, id: Id<T>) -> &T {
&self.items[id.index]
}
pub fn get_mut(&mut self, id: Id<T>) -> &mut T {
&mut self.items[id.index]
}
pub fn len(&self) -> usize {
self.items.len()
}
pub fn is_empty(&self) -> bool {
self.items.is_empty()
}
}
The PhantomData field tells Rust that Id<T> is tied to the type T. This prevents mixing Expr IDs and Stmt IDs by accident.
Without PhantomData, this could happen:
let expr_id = ExprId(0);
let stmt = stmts.get(expr_id); // wrong arena, but same usize shape
With typed IDs, Rust helps stop that mistake.
How the index arena works
Arena<Expr>
╭───────╬────────────────────────────╮
│ index │ value │
├───────┼────────────────────────────┤
│ 0 │ Number(1) │
│ 1 │ Number(2) │
│ 2 │ Binary { left: 0, right:1 } │
╰───────╩────────────────────────────╯
ExprId { index: 2 } means expression at row 2
The arena owns the values. The ID is just a small handle.
Handle means a small value used to find a real value somewhere else.
Why indices are idiomatic for many Rust compilers
References are nice, but Rust references have lifetimes. Lifetimes are useful, but they can make compiler data structures harder.
IDs avoid many lifetime problems.
With IDs, you can store graphs easily:
pub enum Expr {
Number(f32),
Binary {
left: ExprId,
op: BinaryOperator,
right: ExprId,
},
}
No Box. No recursive ownership. No lifetime parameter on Expr.
That is a big ergonomic win.
Ergonomic means comfortable to use.
Part 2: redesigning your AST with IDs
Your current AST owns child expressions with Box. We can redesign it like this.
pub type ExprId = Id<Expr>;
pub type StmtId = Id<Stmt>;
pub type DeclId = Id<Declaration>;
#[derive(Debug, Clone)]
pub enum Expr {
Identifier(Identifier),
String(String),
Number(f32),
Bool(bool),
Nil,
Grouping(ExprId),
Assignment {
id: Identifier,
assign: ExprId,
},
UnaryOperation {
op: UnaryOperator,
operand: ExprId,
},
BinaryOperation {
left: ExprId,
op: BinaryOperator,
right: ExprId,
},
}
#[derive(Debug, Clone)]
pub enum Stmt {
Block(Vec<DeclId>),
PrintStmt(ExprId),
ExprStmt(ExprId),
}
#[derive(Debug, Clone)]
pub struct VarDecl {
pub identifier: Identifier,
pub expr: Option<ExprId>,
}
#[derive(Debug, Clone)]
pub enum Declaration {
VarDecl(VarDecl),
Stmt(StmtId),
}
#[derive(Debug)]
pub struct Program {
pub declarations: Vec<DeclId>,
}
#[derive(Debug, Default)]
pub struct Ast {
pub exprs: Arena<Expr>,
pub stmts: Arena<Stmt>,
pub decls: Arena<Declaration>,
pub program: Program,
}
Now Program does not own declarations directly. It owns IDs into Ast.decls.
The whole parsed result is Ast.
Ast
├─ exprs: Arena<Expr>
├─ stmts: Arena<Stmt>
├─ decls: Arena<Declaration>
└─ program: Vec<DeclId>
Program -> DeclId -> Declaration -> StmtId -> Stmt -> ExprId -> Expr
This is the core pattern.
Why this fixes cloning in your interpreter
Your current interpreter does this:
program
.declarations
.iter()
.try_for_each(|decl| self.eval_decl(decl.clone()))
That clone exists because eval_decl takes ownership of Declaration.
With IDs, the interpreter can borrow from the AST:
fn eval_program(&mut self, ast: &Ast) -> Result<(), RuntimeError> {
for decl_id in &ast.program.declarations {
self.eval_decl(ast, *decl_id)?;
}
Ok(())
}
fn eval_decl(&mut self, ast: &Ast, decl_id: DeclId) -> Result<(), RuntimeError> {
match ast.decls.get(decl_id) {
Declaration::VarDecl(var_decl) => self.eval_var_decl(ast, var_decl),
Declaration::Stmt(stmt_id) => self.eval_stmt(ast, *stmt_id),
}
}
Now eval_decl receives a small ID. It does not clone the AST node.
Part 3: parser shape with an index arena
Your parser can own an Ast while parsing.
pub struct Parser<I: Iterator<Item = Result<Token, LexError>>> {
tokens: std::iter::Peekable<I>,
ast: Ast,
}
impl<I: Iterator<Item = Result<Token, LexError>>> Parser<I> {
pub fn new(tokens: I) -> Self {
Self {
tokens: tokens.peekable(),
ast: Ast::default(),
}
}
pub fn parse(mut self) -> Result<Ast, ParseError> {
let program = self.program()?;
self.ast.program = program;
Ok(self.ast)
}
}
Notice parse now consumes self. This is useful because the parser owns the Ast and returns it when done.
Allocating expressions
You add small helper methods:
impl<I: Iterator<Item = Result<Token, LexError>>> Parser<I> {
fn alloc_expr(&mut self, expr: Expr) -> ExprId {
self.ast.exprs.alloc(expr)
}
fn alloc_stmt(&mut self, stmt: Stmt) -> StmtId {
self.ast.stmts.alloc(stmt)
}
fn alloc_decl(&mut self, decl: Declaration) -> DeclId {
self.ast.decls.alloc(decl)
}
}
Now every parser function returns IDs.
Before:
fn expr(&mut self) -> Result<Expr, ParseError>
After:
fn expr(&mut self) -> Result<ExprId, ParseError>
Example: parsing binary expressions
Your current parser builds a nested Expr value.
Arena version:
fn term(&mut self) -> Result<ExprId, ParseError> {
let mut node = self.factor()?;
while let Some(tk_type) = self.peek_tktype()? {
let op = match tk_type {
TokenType::Plus => BinaryOperator::Add,
TokenType::Minus => BinaryOperator::Sub,
_ => break,
};
self.advance()?;
let right = self.factor()?;
node = self.alloc_expr(Expr::BinaryOperation {
left: node,
op,
right,
});
}
Ok(node)
}
The important line is:
node = self.alloc_expr(Expr::BinaryOperation { left: node, op, right });
Instead of boxing the old node, we store it in the arena and point to it by ID.
Example: parsing primary expressions
fn primary(&mut self) -> Result<ExprId, ParseError> {
let token = self.advance()?;
let expr = match token.tk_type {
TokenType::Identifier(name) => Expr::Identifier(Identifier { name }),
TokenType::String(value) => Expr::String(value),
TokenType::Number(value) => Expr::Number(value),
TokenType::True => Expr::Bool(true),
TokenType::False => Expr::Bool(false),
TokenType::Nil => Expr::Nil,
TokenType::LeftParen => {
let inner = self.expr()?;
self.consume_if_matches(TokenType::RightParen)?;
Expr::Grouping(inner)
}
_ => return Err(ParseError::UnexpectedToken(token)),
};
Ok(self.alloc_expr(expr))
}
All expression parser functions now return ExprId.
Example: parsing statements and declarations
fn declaration(&mut self) -> Result<DeclId, ParseError> {
if let Some(TokenType::Var) = self.peek_tktype()? {
self.consume_if_matches(TokenType::Var)?;
let var_decl = self.var_decl()?;
return Ok(self.alloc_decl(Declaration::VarDecl(var_decl)));
}
let stmt_id = self.stmt()?;
Ok(self.alloc_decl(Declaration::Stmt(stmt_id)))
}
fn stmt(&mut self) -> Result<StmtId, ParseError> {
if let Some(TokenType::LeftBrace) = self.peek_tktype()? {
self.consume_if_matches(TokenType::LeftBrace)?;
let mut declarations = Vec::new();
while let Some(tk_type) = self.peek_tktype()? {
if tk_type == TokenType::RightBrace {
break;
}
declarations.push(self.declaration()?);
}
self.consume_if_matches(TokenType::RightBrace)?;
return Ok(self.alloc_stmt(Stmt::Block(declarations)));
}
if let Some(TokenType::Print) = self.peek_tktype()? {
self.consume_if_matches(TokenType::Print)?;
let expr = self.expr()?;
self.consume_if_matches(TokenType::Semicolon)?;
return Ok(self.alloc_stmt(Stmt::PrintStmt(expr)));
}
let expr = self.expr()?;
self.consume_if_matches(TokenType::Semicolon)?;
Ok(self.alloc_stmt(Stmt::ExprStmt(expr)))
}
Part 4: interpreter with arena IDs
The interpreter now receives Ast instead of Program.
impl Interpreter {
pub fn interpret(&mut self, ast: &Ast) -> Result<(), RuntimeError> {
self.eval_program(ast)
}
fn eval_program(&mut self, ast: &Ast) -> Result<(), RuntimeError> {
for decl_id in &ast.program.declarations {
self.eval_decl(ast, *decl_id)?;
}
Ok(())
}
}
The AST owns nodes. The interpreter only reads them.
eval_expr with IDs
fn eval_expr(&mut self, ast: &Ast, expr_id: ExprId) -> Result<RuntimeValue, RuntimeError> {
match ast.exprs.get(expr_id) {
Expr::Identifier(id) => {
self.global_scope
.get(&id.name)
.cloned()
.ok_or_else(|| RuntimeError::VariableDoesNotExist(id.name.clone()))
}
Expr::String(value) => Ok(RuntimeValue::String(value.clone())),
Expr::Number(value) => Ok(RuntimeValue::Number(*value)),
Expr::Bool(value) => Ok(RuntimeValue::Bool(*value)),
Expr::Nil => Ok(RuntimeValue::Nil),
Expr::Grouping(inner) => self.eval_expr(ast, *inner),
Expr::Assignment { id, assign } => {
if self.global_scope.contains_key(&id.name) {
let value = self.eval_expr(ast, *assign)?;
self.global_scope.insert(id.name.clone(), value.clone());
Ok(value)
} else {
Err(RuntimeError::VariableDoesNotExist(id.name.clone()))
}
}
Expr::UnaryOperation { op, operand } => {
let operand = self.eval_expr(ast, *operand)?;
self.eval_unary(op.clone(), operand)
}
Expr::BinaryOperation { left, op, right } => {
let left = self.eval_expr(ast, *left)?;
let right = self.eval_expr(ast, *right)?;
self.eval_binary(left, op.clone(), right)
}
}
}
This still clones strings and operators in a few places. That is okay at first. Later, you can reduce those clones too.
main.rs after arena AST
Your file mode becomes:
let tokens: Vec<Result<Token, _>> = Lexer::new(&src_code).collect();
let parser = Parser::new(tokens.into_iter());
let ast = parser.parse().unwrap();
let mut interpreter = Interpreter::new();
interpreter.interpret(&ast).unwrap();
In the REPL:
let parser = Parser::new(tokens.into_iter());
let ast = parser.parse().unwrap();
interpreter.interpret(&ast).unwrap();
Each input line gets its own AST arena. After interpret returns, that arena can drop.
This is safe if runtime values do not store references into the AST.
Part 5: the lifetime rule of arenas
Arena data must live longer than anything pointing into it.
With IDs, this is simple because IDs are just indexes. But you still must pass the Ast when reading the ID.
fn eval_expr(ast: &Ast, id: ExprId) { ... }
The ID alone is not useful without the arena.
This is good. It makes ownership clear.
ExprId alone
│
▼
just a number, cannot read node
ExprId + &Ast
│
▼
can read ast.exprs[id]
A common mistake
Do not store ExprId from one Ast and use it with another Ast.
let ast1 = parse("1 + 2");
let id = ast1.program.declarations[0];
let ast2 = parse("3 + 4");
ast2.decls.get(id); // wrong meaning
Typed IDs stop mixing ExprId and StmtId. They do not stop mixing two different Ast values.
For most interpreters, this is fine because you pass IDs around together with the Ast they came from.
Part 6: typed reference arena
An index arena is safe and practical. But many compilers also use arenas that return references.
Example target API:
let expr: &Expr = arena.alloc(Expr::Number(1.0));
This feels nice because child pointers are normal references.
pub enum Expr<'a> {
Number(f32),
Binary {
left: &'a Expr<'a>,
op: BinaryOperator,
right: &'a Expr<'a>,
},
}
But in Rust this requires care.
Why a fully safe reference arena is hard
This simple version looks nice:
pub struct Arena<T> {
items: Vec<T>,
}
impl<T> Arena<T> {
pub fn alloc(&mut self, value: T) -> &T {
self.items.push(value);
self.items.last().unwrap()
}
}
But it has a problem:
let a = arena.alloc(Expr::Number(1.0));
let b = arena.alloc(Expr::Number(2.0)); // borrow checker problem
println!("{:?}", a);
The returned reference keeps the arena borrowed. Rust will not let you mutably borrow the arena again while a is alive.
Also, Vec can move its items when it grows, which would break old references.
A typed arena using Box and small unsafe
One simple reference arena stores Box<T>. Box values have stable addresses. Moving the Box handle does not move the T inside the Box.
use std::cell::RefCell;
pub struct TypedArena<T> {
items: RefCell<Vec<Box<T>>>,
}
impl<T> TypedArena<T> {
pub fn new() -> Self {
Self {
items: RefCell::new(Vec::new()),
}
}
pub fn alloc(&self, value: T) -> &T {
let boxed = Box::new(value);
let ptr: *const T = &*boxed;
self.items.borrow_mut().push(boxed);
// Safety rule:
// - ptr points into a Box<T>
// - the Box<T> is stored in self.items
// - items are only dropped when the arena is dropped
// - we never move the T out of the Box
unsafe { &*ptr }
}
}
This uses unsafe, but the unsafe part is small and easy to reason about.
Unsafe means Rust cannot prove the code is correct, so you must prove it by design.
When to use typed reference arenas
Use a typed reference arena when:
- you want AST nodes to contain references
- your AST has one clear lifetime
- you want simple recursive tree walking
- you accept some unsafe inside the arena implementation
Avoid it when:
- you want fully safe code
- you need to mutate nodes a lot
- you want to store nodes across different arena lifetimes
- IDs would be easier
For your current Lox interpreter, I still recommend the ID arena first.
Part 7: bump arena
A bump arena is the low-level version.
It has a big byte buffer and a cursor.
Allocation means:
- align the cursor
- write the value at that memory
- move the cursor forward
chunk bytes
╭────────────────────────────────────────────╮
│ used values │ cursor -> free free free free │
╰────────────────────────────────────────────╯
alloc T:
cursor = align(cursor, align_of::<T>())
write T at cursor
cursor += size_of::<T>()
This can be extremely fast. But implementing a general bump arena correctly requires unsafe code, alignment handling, and drop handling.
Alignment means some types must start at memory addresses divisible by 2, 4, 8, or more. If you ignore alignment, the program can be wrong or crash.
A learning bump allocator sketch
This sketch shows the idea. It is not the first implementation I recommend shipping.
use std::alloc::{alloc, dealloc, Layout};
use std::cell::Cell;
use std::mem;
use std::ptr;
pub struct Bump {
ptr: *mut u8,
cap: usize,
pos: Cell<usize>,
}
impl Bump {
pub fn with_capacity(cap: usize) -> Self {
let layout = Layout::from_size_align(cap, 16).unwrap();
let ptr = unsafe { alloc(layout) };
if ptr.is_null() {
std::alloc::handle_alloc_error(layout);
}
Self { ptr, cap, pos: Cell::new(0) }
}
pub fn alloc<T>(&self, value: T) -> &T {
let align = mem::align_of::<T>();
let size = mem::size_of::<T>();
let start = align_up(self.pos.get(), align);
let end = start.checked_add(size).expect("arena overflow");
assert!(end <= self.cap, "bump arena out of memory");
let dst = unsafe { self.ptr.add(start) as *mut T };
unsafe { ptr::write(dst, value) };
self.pos.set(end);
unsafe { &*dst }
}
}
fn align_up(value: usize, align: usize) -> usize {
(value + align - 1) & !(align - 1)
}
impl Drop for Bump {
fn drop(&mut self) {
let layout = Layout::from_size_align(self.cap, 16).unwrap();
unsafe { dealloc(self.ptr, layout) };
}
}
This code has a serious limitation: it does not run destructors for allocated T values.
Destructor means code that runs when a value is dropped. String has a destructor because it must free its internal buffer.
So this bump allocator is only safe for values that do not need Drop, unless you also track destructors.
That is why crates like bumpalo exist. You can use them later, but you should understand the idea first.
Part 8: strings in compiler arenas
Your lexer currently allocates many String values:
pub enum TokenType {
Identifier(String),
String(String),
Number(f32),
}
pub struct Token {
pub tk_type: TokenType,
pub lexeme: String,
pub line: u32,
}
This is easy, but it copies text out of the source string.
A compiler often stores slices into the original source:
pub struct Token<'src> {
pub kind: TokenKind<'src>,
pub lexeme: &'src str,
pub line: u32,
}
pub enum TokenKind<'src> {
Identifier(&'src str),
String(&'src str),
Number(f32),
}
Now identifiers do not allocate a new String during lexing.
The source string must live longer than the tokens and AST that borrow from it.
source String
▲
│ borrowed by
tokens and AST names
For a file, this is easy. read_src_file returns a String, then lexer and parser borrow from it.
For a REPL, each input line has a short lifetime. If runtime values need names or strings after the line ends, you may need to store owned Strings or interned strings.
String interning
String interning means storing each unique string once and using an ID for it.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Symbol(usize);
pub struct Interner {
strings: Vec<String>,
map: std::collections::HashMap<String, Symbol>,
}
Then identifier names are Symbol instead of String.
pub struct Identifier {
pub name: Symbol,
}
This is very common in compilers.
Part 9: typed arena vs index arena
Here is the practical difference.
Index arena
enum Expr {
Binary { left: ExprId, right: ExprId }
}
fn eval(ast: &Ast, id: ExprId) { ... }
Pros:
- safe Rust
- easy to store in structs
- easy to compare, copy, hash
- good for graphs
- good for later compiler passes
Cons:
- every access needs ast.exprs.get(id)
- IDs can be used with the wrong Ast if you are careless
Typed reference arena
enum Expr<'a> {
Binary { left: &'a Expr<'a>, right: &'a Expr<'a> }
}
Pros:
- tree walking looks simple
- fewer get calls
- direct references
Cons:
- lifetimes spread through many types
- implementation usually needs unsafe
- mutation is harder
- graphs can be harder
Recommendation for your Lox project
Use index arenas first.
They will teach you the compiler architecture pattern without fighting lifetimes too early.
Part 10: modern compiler usage
Modern compilers use arenas because compiler data often lives by phase.
Examples:
- Rust compiler uses arenas and interners for compiler-owned data such as high-level representations.
- Clang stores AST nodes in an ASTContext, which acts like a long-lived arena for the translation unit.
- LLVM has bump pointer allocators for many internal compiler data structures.
- Many language servers and analyzers use index-based arenas for syntax trees and semantic data.
Translation unit means one compiled source file plus the files it includes or imports.
The common pattern is:
╭──────────────╮
│ parse arena │ owns syntax tree nodes
╰──────┬───────╯
▼
╭──────────────╮
│ type arena │ owns type objects
╰──────┬───────╯
▼
╭──────────────╮
│ IR arena │ owns intermediate representation nodes
╰──────────────╯
Each arena has a clear lifetime.
Part 11: engineering rules for arenas
Follow these rules:
- Give each arena a clear owner.
- Do not hide global mutable arenas everywhere.
- Prefer typed IDs over plain usize.
- Keep IDs small and Copy.
- Do not delete from a simple index arena unless you understand invalid IDs.
- Pass &Ast to functions that use IDs.
- Keep arena lifetimes aligned with compiler phases.
- Start safe with index arenas, then learn unsafe bump arenas later.
About deletion
If you remove an item from Vec, all later indexes shift.
items.remove(3); // index 4 becomes index 3
That breaks old IDs.
So a simple arena should not remove nodes.
If you need deletion, use slots:
Vec<Option<T>>
Or use a generational arena.
Generational means an ID stores both index and version, so old IDs can be detected after deletion.
You do not need this for your AST.
Part 12: how to migrate your Lox interpreter step by step
Do not rewrite everything at once.
Step 1: add arena.rs
Create a new module:
// arena.rs
use std::marker::PhantomData;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Id<T> {
index: usize,
_marker: PhantomData<fn() -> T>,
}
#[derive(Debug, Default)]
pub struct Arena<T> {
items: Vec<T>,
}
impl<T> Arena<T> {
pub fn alloc(&mut self, value: T) -> Id<T> {
let id = Id { index: self.items.len(), _marker: PhantomData };
self.items.push(value);
id
}
pub fn get(&self, id: Id<T>) -> &T {
&self.items[id.index]
}
pub fn get_mut(&mut self, id: Id<T>) -> &mut T {
&mut self.items[id.index]
}
}
Then in main.rs:
mod arena;
Step 2: change Expr children from Box<Expr> to ExprId
Do only expressions first.
Change parser expression functions to return ExprId.
Keep statements and declarations normal if needed. You can migrate gradually.
Step 3: add Ast wrapper
pub struct Ast {
pub exprs: Arena<Expr>,
pub program: Program,
}
Later add stmts and decls arenas.
Step 4: change interpreter eval_expr
fn eval_expr(&mut self, ast: &Ast, expr: ExprId) -> Result<RuntimeValue, RuntimeError>
Now all recursive expression eval uses IDs.
Step 5: move Stmt and Declaration into arenas
After expression migration works, migrate statements and declarations.
Step 6: remove unnecessary Clone usage
Your interpreter should walk by ID and borrow from Ast. Avoid cloning whole declarations and expressions.
Part 13: best final shape for your Lox project
A clean structure:
// parser.rs
use crate::arena::{Arena, Id};
pub type ExprId = Id<Expr>;
pub type StmtId = Id<Stmt>;
pub type DeclId = Id<Declaration>;
pub struct Ast {
pub exprs: Arena<Expr>,
pub stmts: Arena<Stmt>,
pub decls: Arena<Declaration>,
pub program: Program,
}
pub struct Program {
pub declarations: Vec<DeclId>,
}
Interpreter:
impl Interpreter {
pub fn interpret(&mut self, ast: &Ast) -> Result<(), RuntimeError> {
for decl in &ast.program.declarations {
self.eval_decl(ast, *decl)?;
}
Ok(())
}
}
main.rs:
let parser = Parser::new(tokens.into_iter());
let ast = parser.parse().unwrap();
interpreter.interpret(&ast).unwrap();
Part 14: thinking like a compiler engineer
When deciding how to allocate data, ask these questions:
- Who owns this data?
- How long does it live?
- Do I delete items one by one?
- Do many things point to it?
- Do I need mutation after creation?
- Is this phase-local or global?
For AST nodes:
- owner: Ast
- lifetime: one parse result
- deletion: no
- many pointers: yes
- mutation: little or none after parsing
- phase: parse plus later read-only phases
That is arena-friendly.
For runtime variables:
- owner: interpreter environment
- lifetime: dynamic during program execution
- deletion: yes, scopes end
- mutation: yes
That is usually not the same arena as AST.
Part 15: common mistakes
Mistake 1: putting everything in one arena
Do not create one giant arena for every object in your interpreter.
Better:
- Ast owns AST nodes
- Interner owns strings or symbols
- Interpreter owns runtime environments
- Type checker owns type data
Mistake 2: using arena to avoid understanding ownership
Arena does not remove ownership. It changes ownership.
The arena owns everything.
Mistake 3: deleting from index arena
Do not remove AST nodes from Vec. IDs would break.
Mistake 4: storing references into short-lived source text
If tokens borrow from REPL input, do not store those borrowed strings inside long-lived runtime values.
Mistake 5: starting with unsafe bump allocation too early
Learn index arenas first. They solve most of your current problem.
Part 16: what I would do in your exact interpreter
I would change your project in this order:
- Add arena.rs with Arena<T> and Id<T>.
- Change Expr to use ExprId instead of Box<Expr>.
- Make Parser own Ast.
- Make expression parser functions return ExprId.
- Make interpreter eval_expr take &Ast and ExprId.
- Then move Stmt and Declaration into arenas.
- Then reduce clones in interpreter.
- Later, consider token string slices or a string interner.
Do not start with a bump allocator. Start with an index arena.
Part 17: mental model summary
An arena is not just a faster allocator. It is a design decision.
You are saying:
These objects belong together. They are born together. They die together.
That statement matches compilers very well.
Without arena
╭─────╮ ╭─────╮ ╭─────╮ ╭─────╮
│ Box │ │ Box │ │ Box │ │ Box │ many small heap objects
╰─────╯ ╰─────╯ ╰─────╯ ╰─────╯
With arena
╭────────────────────────────────╮
│ node node node node node node │ one owned region
╰────────────────────────────────╯
For Rust, the best first arena is usually not a raw memory allocator. It is a typed index arena built on Vec.
Part 18: a deeper performance model
It is easy to say "arena is faster". But you should know exactly why.
Performance usually improves because of these causes:
- Fewer allocator calls.
- Fewer pointer indirections.
- Better memory locality.
- Cheaper cleanup.
- Better data layout for compiler passes.
Let us explain each one slowly.
Fewer allocator calls
The global allocator is powerful. It can allocate and free many sizes from many parts of the program. But each call has overhead.
Overhead means extra work around the real work.
When you do this many times:
Box::new(expr)
you ask the allocator for one new heap object each time.
With an index arena:
self.exprs.push(expr)
Vec sometimes grows, but not on every push. Most pushes only write into already reserved memory.
Box design
parse node -> allocator call
parse node -> allocator call
parse node -> allocator call
parse node -> allocator call
Arena Vec design
reserve chunk -> push node
-> push node
-> push node
-> push node
Better memory locality
Locality means related data is near each other in memory.
An AST traversal often visits many neighboring nodes. If the nodes live near each other, the CPU cache helps.
Bad locality
Expr A ───────▶ heap object far away
Expr B ─────────────▶ another far heap object
Expr C ───▶ another random place
Good locality
Expr Arena
╭──────────────────────────────╮
│ Expr A │ Expr B │ Expr C │ ... │
╰──────────────────────────────╯
This does not mean every arena program is automatically faster. It means the memory shape is usually better for compiler workloads.
Cheaper cleanup
With Box, each Box is dropped separately. Dropping one Box may call the allocator to free memory.
With Vec arena, the Vec drops all values and frees its backing buffer.
For AST nodes, that matches the real need:
When done with parse result, drop the whole parse result.
Performance is not only allocation speed
In compilers, speed is often affected by walking data again and again.
Examples:
- resolver walks AST
- type checker walks AST
- interpreter walks AST
- optimizer walks IR
- code generator walks IR
If the data layout is better, all those passes can benefit.
Pass means one stage that reads or changes compiler data.
Part 19: the complete expression migration shape
Earlier we showed term and primary. Now let us convert the whole expression side of your parser.
The target signatures:
fn expr(&mut self) -> Result<ExprId, ParseError>;
fn assignment(&mut self) -> Result<ExprId, ParseError>;
fn equality(&mut self) -> Result<ExprId, ParseError>;
fn comparision(&mut self) -> Result<ExprId, ParseError>;
fn term(&mut self) -> Result<ExprId, ParseError>;
fn factor(&mut self) -> Result<ExprId, ParseError>;
fn unary(&mut self) -> Result<ExprId, ParseError>;
fn primary(&mut self) -> Result<ExprId, ParseError>;
assignment
Assignment needs one special thing. You parse the left side first, then check if it is an identifier.
With IDs, you cannot directly match the ExprId. You must look it up in the arena.
fn assignment(&mut self) -> Result<ExprId, ParseError> {
let expr = self.equality()?;
if let Some(TokenType::Equal) = self.peek_tktype()? {
self.consume_if_matches(TokenType::Equal)?;
let value = self.assignment()?;
let id = match self.ast.exprs.get(expr) {
Expr::Identifier(id) => id.clone(),
_ => return Err(ParseError::WrongAssignment),
};
return Ok(self.alloc_expr(Expr::Assignment {
id,
assign: value,
}));
}
Ok(expr)
}
This is a common arena pattern:
have ID -> inspect node in arena -> allocate new node using old ID
equality
fn equality(&mut self) -> Result<ExprId, ParseError> {
let mut node = self.comparision()?;
while let Some(tk_type) = self.peek_tktype()? {
let op = match tk_type {
TokenType::BangEqual => BinaryOperator::NotEqual,
TokenType::EqualEqual => BinaryOperator::DoubleEqual,
_ => break,
};
self.advance()?;
let right = self.comparision()?;
node = self.alloc_expr(Expr::BinaryOperation {
left: node,
op,
right,
});
}
Ok(node)
}
comparison
Your code spells it comparision. That is okay for now, but the common spelling is comparison.
fn comparision(&mut self) -> Result<ExprId, ParseError> {
let mut node = self.term()?;
while let Some(tk_type) = self.peek_tktype()? {
let op = match tk_type {
TokenType::Greater => BinaryOperator::Greater,
TokenType::GreaterEqual => BinaryOperator::GreaterEqual,
TokenType::Less => BinaryOperator::Less,
TokenType::LessEqual => BinaryOperator::LessEqual,
_ => break,
};
self.advance()?;
let right = self.term()?;
node = self.alloc_expr(Expr::BinaryOperation {
left: node,
op,
right,
});
}
Ok(node)
}
factor
fn factor(&mut self) -> Result<ExprId, ParseError> {
let mut node = self.unary()?;
while let Some(tk_type) = self.peek_tktype()? {
let op = match tk_type {
TokenType::Star => BinaryOperator::Mul,
TokenType::Slash => BinaryOperator::Div,
_ => break,
};
self.advance()?;
let right = self.unary()?;
node = self.alloc_expr(Expr::BinaryOperation {
left: node,
op,
right,
});
}
Ok(node)
}
unary
fn unary(&mut self) -> Result<ExprId, ParseError> {
if let Some(tk_type) = self.peek_tktype()? {
match tk_type {
TokenType::Bang => {
self.advance()?;
let operand = self.unary()?;
Ok(self.alloc_expr(Expr::UnaryOperation {
op: UnaryOperator::Not,
operand,
}))
}
TokenType::Minus => {
self.advance()?;
let operand = self.unary()?;
Ok(self.alloc_expr(Expr::UnaryOperation {
op: UnaryOperator::Minus,
operand,
}))
}
_ => self.primary(),
}
} else {
Err(ParseError::UnexpectedEof)
}
}
Notice that recursion still exists. Arena allocation does not remove recursion. It changes where recursive child nodes are stored.
Part 20: full AST walking with IDs
Once you use IDs, you will write many functions like this:
fn walk_expr(ast: &Ast, expr: ExprId) {
match ast.exprs.get(expr) {
Expr::Number(_) => {}
Expr::String(_) => {}
Expr::Bool(_) => {}
Expr::Nil => {}
Expr::Identifier(_) => {}
Expr::Grouping(inner) => walk_expr(ast, *inner),
Expr::Assignment { assign, .. } => walk_expr(ast, *assign),
Expr::UnaryOperation { operand, .. } => walk_expr(ast, *operand),
Expr::BinaryOperation { left, right, .. } => {
walk_expr(ast, *left);
walk_expr(ast, *right);
}
}
}
This is the basic visitor pattern.
Visitor means code that walks a tree and does something at each node.
You can make visitors for:
- printing AST
- resolving variables
- type checking
- interpreting
- compiling
- collecting diagnostics
Diagnostic means an error or warning message for the programmer.
Pretty-printing an arena AST
This is useful for debugging.
fn print_expr(ast: &Ast, expr: ExprId) -> String {
match ast.exprs.get(expr) {
Expr::Number(n) => n.to_string(),
Expr::String(s) => format!("\"{}\"", s),
Expr::Bool(b) => b.to_string(),
Expr::Nil => "nil".to_string(),
Expr::Identifier(id) => id.name.clone(),
Expr::Grouping(inner) => format!("(group {})", print_expr(ast, *inner)),
Expr::UnaryOperation { op, operand } => {
format!("({:?} {})", op, print_expr(ast, *operand))
}
Expr::BinaryOperation { left, op, right } => {
format!(
"({:?} {} {})",
op,
print_expr(ast, *left),
print_expr(ast, *right),
)
}
Expr::Assignment { id, assign } => {
format!("(assign {} {})", id.name, print_expr(ast, *assign))
}
}
}
This is also a good test that your arena AST is linked correctly.
Part 21: spans and source locations
Real compilers need good error messages. For that, AST nodes often store spans.
Span means a range in the source code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
pub start: usize,
pub end: usize,
pub line: u32,
}
Then your token stores a span:
pub struct Token {
pub tk_type: TokenType,
pub lexeme: String,
pub span: Span,
}
And your AST nodes can store spans too.
pub struct ExprNode {
pub kind: Expr,
pub span: Span,
}
pub type ExprId = Id<ExprNode>;
This is often better than putting Span inside every enum variant.
ExprNode
╭─────────────────────╮
│ span: source range │
│ kind: Expr enum │
╰─────────────────────╯
Now every expression has a source location.
Why spans matter with arenas
If your AST uses IDs, diagnostics can store IDs or spans.
For example:
pub struct Diagnostic {
pub span: Span,
pub message: String,
}
When type checking finds an error, it can use the node span:
let expr = ast.exprs.get(expr_id);
diagnostics.push(Diagnostic {
span: expr.span,
message: "expected number".to_string(),
});
This is how compilers connect internal data back to source code.
Part 22: separating node data from node kind
For a serious compiler, this shape is common:
pub struct ExprNode {
pub kind: ExprKind,
pub span: Span,
}
pub enum ExprKind {
Identifier(Identifier),
Number(f32),
Binary {
left: ExprId,
op: BinaryOperator,
right: ExprId,
},
}
Why is this nice?
- every node has common metadata
- enum stays focused on syntax meaning
- passes can access span for any expression
- later you can add node IDs, types, or flags
Metadata means data about data. Span is metadata about where the node came from.
You can do the same for statements:
pub struct StmtNode {
pub kind: StmtKind,
pub span: Span,
}
pub enum StmtKind {
Block(Vec<DeclId>),
Print(ExprId),
Expr(ExprId),
}
This is usually better than a giant enum with no common wrapper.
Part 23: a string interner with zero dependencies
Earlier we introduced string interning. Now let us implement it.
use std::collections::HashMap;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Symbol(usize);
#[derive(Debug, Default)]
pub struct Interner {
strings: Vec<String>,
map: HashMap<String, Symbol>,
}
impl Interner {
pub fn intern(&mut self, value: &str) -> Symbol {
if let Some(symbol) = self.map.get(value) {
return *symbol;
}
let symbol = Symbol(self.strings.len());
self.strings.push(value.to_string());
self.map.insert(value.to_string(), symbol);
symbol
}
pub fn resolve(&self, symbol: Symbol) -> &str {
&self.strings[symbol.0]
}
}
Now Identifier can be:
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Identifier {
pub name: Symbol,
}
Your runtime environment can use Symbol as the key:
pub struct Interpreter {
global_scope: HashMap<Symbol, RuntimeValue>,
}
This makes variable lookup cheaper than String lookup in many cases.
Where should the interner live?
For your interpreter, a good owner is a CompileSession.
pub struct CompileSession {
pub interner: Interner,
}
Then parser can borrow the interner mutably:
pub struct Parser<'a, I> {
tokens: std::iter::Peekable<I>,
ast: Ast,
interner: &'a mut Interner,
}
But be careful. If your lexer already creates String tokens, you can intern in the parser. If your lexer returns borrowed &str tokens, you can intern there or in the parser.
Interning keywords?
Usually keywords are TokenType variants, not interned strings.
TokenType::If
TokenType::While
TokenType::Identifier(Symbol)
Intern user-defined names:
- variable names
- function names
- class names
- property names
Part 24: REPL design with arenas
Your main.rs has two modes:
- file mode
- REPL mode
REPL means read-eval-print-loop. It reads one line, evaluates it, prints result, then loops.
Arena lifetime in file mode is simple:
read source file
│
▼
parse into Ast arena
│
▼
interpret Ast
│
▼
drop Ast at end
REPL mode is more subtle.
input line 1 -> Ast arena 1 -> interpret -> drop arena 1
input line 2 -> Ast arena 2 -> interpret -> drop arena 2
input line 3 -> Ast arena 3 -> interpret -> drop arena 3
The interpreter state lives across inputs:
let mut interpreter = Interpreter::new();
loop {
let ast = parse_line(...);
interpreter.interpret(&ast).unwrap();
}
This is okay only if the interpreter does not store references into ast.
Runtime values should own their data or use stable IDs from a long-lived interner.
For example, this is dangerous:
pub enum RuntimeValue<'ast> {
Function(&'ast FunctionDecl),
}
Because the AST for one REPL input may be dropped, but the function value could remain in the environment.
For a Lox interpreter with functions, you must think carefully about function declarations and closures.
Closure means a function value that remembers variables from its surrounding scope.
A practical REPL rule
For simple expression and variable Lox, drop each input AST after interpreting.
For functions/classes that must live after the input line, store runtime-owned function objects, not references to short-lived AST nodes.
Or use a long-lived REPL arena for declarations that become runtime functions.
Short-lived parse arena
used for normal statements
Long-lived REPL definition arena
stores function/class declarations that survive between inputs
This is advanced, but important.
Part 25: environments are not AST arenas
Your Interpreter currently has:
pub struct Interpreter {
global_scope: HashMap<String, RuntimeValue>,
local_scopes: Option<Box<Interpreter>>,
}
You probably do not want to put runtime scopes in the same arena as AST nodes.
AST data is mostly immutable after parsing.
Runtime scopes are dynamic:
- created when entering a block or function
- mutated by variable assignment
- dropped when leaving scope
That is a different lifetime.
A better environment shape might be:
pub struct Environment {
values: HashMap<Symbol, RuntimeValue>,
parent: Option<EnvId>,
}
pub type EnvId = Id<Environment>;
pub struct EnvArena {
envs: Arena<Environment>,
}
But this is optional. For a tree-walk interpreter, Rc<RefCell<Environment>> is also common.
No external crate is needed:
use std::cell::RefCell;
use std::rc::Rc;
type EnvRef = Rc<RefCell<Environment>>;
pub struct Environment {
values: HashMap<Symbol, RuntimeValue>,
parent: Option<EnvRef>,
}
Rc means reference-counted pointer. RefCell means runtime-checked mutable borrow. These are standard library tools.
For learning compilers, it is okay to use Rc<RefCell> for environments and arena IDs for AST.
Part 26: mutable compiler passes with index arenas
Sometimes a pass needs to mutate AST nodes.
Example: constant folding.
Constant folding means replacing constant expressions with their computed value.
1 + 2 -> 3
With an index arena, you can mutate a node:
fn fold_expr(ast: &mut Ast, expr: ExprId) {
let replacement = match ast.exprs.get(expr) {
Expr::BinaryOperation { left, op, right } => {
let left_id = *left;
let right_id = *right;
fold_expr(ast, left_id);
fold_expr(ast, right_id);
match (ast.exprs.get(left_id), op, ast.exprs.get(right_id)) {
(Expr::Number(a), BinaryOperator::Add, Expr::Number(b)) => Some(Expr::Number(a + b)),
_ => None,
}
}
_ => None,
};
if let Some(new_expr) = replacement {
*ast.exprs.get_mut(expr) = new_expr;
}
}
This example is simplified, but the pattern matters:
- Read child IDs.
- Drop immutable borrow.
- Recurse with mutable Ast.
- Mutate current node.
Rust will force you to avoid holding immutable and mutable borrows at the same time.
That can feel annoying, but it protects you.
Alternative: create a new arena for transformed AST
Many compilers do not mutate old trees. They create a new tree.
AST arena -> lowering pass -> HIR arena -> type pass -> typed HIR arena
Lowering means converting one representation into a simpler or more explicit representation.
Example:
for loop -> while loop
syntactic sugar -> core language
This is often cleaner than mutating the old AST.
Part 27: HIR, MIR, IR and why arenas keep appearing
Compilers often use several internal languages.
- AST: close to source syntax
- HIR: high-level intermediate representation
- MIR: middle-level intermediate representation
- IR: intermediate representation, a general term
Intermediate means between source code and final execution/codegen.
Example:
source code
▼
AST: keeps syntax shape
▼
HIR: names resolved, sugar removed
▼
MIR: control flow clearer
▼
bytecode or machine code
Each representation can have its own arena.
Why not one arena for everything?
Because each phase has different data and lifetime.
pub struct Compiler {
source: String,
interner: Interner,
ast: Ast,
hir: Hir,
types: TypeStore,
}
This is a clean compiler mental model.
Part 28: building a TypeStore arena
When you add static types later, you may store types in an arena too.
pub type TypeId = Id<Type>;
pub enum Type {
Number,
String,
Bool,
Nil,
Function {
params: Vec<TypeId>,
return_type: TypeId,
},
Class {
name: Symbol,
},
}
pub struct TypeStore {
types: Arena<Type>,
}
Then type checking expressions can return TypeId:
fn check_expr(&mut self, ast: &Ast, expr: ExprId) -> Result<TypeId, TypeError> {
match ast.exprs.get(expr) {
Expr::Number(_) => Ok(self.types.number()),
Expr::String(_) => Ok(self.types.string()),
Expr::Bool(_) => Ok(self.types.bool()),
_ => todo!(),
}
}
For built-in types, you may store fixed IDs:
pub struct TypeStore {
types: Arena<Type>,
number: TypeId,
string: TypeId,
bool_: TypeId,
nil: TypeId,
}
This is another arena pattern: allocate canonical objects once and reuse their IDs.
Canonical means the one official shared version.
Part 29: generational arenas
A simple index arena cannot safely delete items. But some programs need deletion.
For that, a generational arena stores an index and a generation.
Generation means version number.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct GenId {
index: usize,
generation: u32,
}
struct Slot<T> {
generation: u32,
value: Option<T>,
}
When you delete an item, you increment the generation.
old id: index 2, generation 0
delete slot 2 -> generation becomes 1
new id: index 2, generation 1
old id no longer matches
Sketch:
pub struct GenArena<T> {
slots: Vec<Slot<T>>,
free: Vec<usize>,
}
impl<T> GenArena<T> {
pub fn get(&self, id: GenId) -> Option<&T> {
let slot = self.slots.get(id.index)?;
if slot.generation != id.generation {
return None;
}
slot.value.as_ref()
}
}
You do not need generational arenas for AST nodes because AST nodes are not deleted.
But it is useful knowledge for runtime objects, editor buffers, ECS systems, and long-lived graphs.
ECS means entity component system, common in game engines.
Part 30: drop handling in bump arenas
The earlier bump allocator did not drop values. That is a big limitation.
If you allocate String in a bump arena and never run its destructor, the String internal heap memory leaks.
Leak means memory is not freed until the program ends.
To support Drop, a bump arena can store cleanup functions.
Conceptually:
struct DropRecord {
ptr: *mut u8,
drop_fn: unsafe fn(*mut u8),
}
When allocating T:
unsafe fn drop_value<T>(ptr: *mut u8) {
std::ptr::drop_in_place(ptr as *mut T);
}
Store:
self.drops.push(DropRecord {
ptr: dst as *mut u8,
drop_fn: drop_value::<T>,
});
When the arena drops:
for record in self.drops.iter().rev() {
unsafe { (record.drop_fn)(record.ptr) };
}
Then deallocate the raw byte chunks.
This is why production bump allocators are non-trivial.
Non-trivial means not simple.
Why reverse drop order?
Rust usually drops values in reverse creation order. If later objects reference earlier objects, reverse drop is often safer.
But with arena references, you should avoid destructors that depend on other arena objects still being alive. Keep arena-stored compiler nodes simple when possible.
Part 31: chunked bump arenas
A fixed-size bump arena can run out of memory.
A production bump arena usually stores chunks.
Chunk 1: full
╭────────────────────╮
│ used used used used │
╰────────────────────╯
Chunk 2: active
╭────────────────────╮
│ used used free free │
╰────────────────────╯
Chunk 3: allocated later if needed
When the active chunk is full, allocate a new chunk.
Large objects may get their own chunk.
This design avoids moving old allocations, so old references stay valid.
That is one reason bump arenas can return references safely from an API, even though the internal implementation uses unsafe.
Part 32: comparing allocation choices for your Lox AST
Here is a practical table.
╭─────────────────┬────────────┬───────────────┬────────────────────╮
│ Design │ Safe Rust? │ Easy walking? │ Best for │
├─────────────────┼────────────┼───────────────┼────────────────────┤
│ Box AST │ yes │ yes │ learning first AST │
│ Rc AST │ yes │ medium │ shared ownership │
│ Index arena │ yes │ medium │ compilers in Rust │
│ Ref arena │ unsafe core│ yes │ phase trees │
│ Raw bump arena │ unsafe core│ advanced │ low-level speed │
╰─────────────────┴────────────┴───────────────┴────────────────────╯
For your current level and project, index arena is the sweet spot.
Sweet spot means best balance.
Part 33: a realistic module layout
As your interpreter grows, organize like this:
src/
├─ main.rs
├─ arena.rs // Id<T>, Arena<T>
├─ interner.rs // Symbol, Interner
├─ span.rs // Span, source locations
├─ lexer.rs // tokens
├─ parser.rs // Ast, ExprId, StmtId, parser
├─ interpreter.rs // runtime execution
└─ error.rs // errors and diagnostics
This separates the concepts.
arena.rs should not know about Expr.
parser.rs should use Arena<ExprNode>.
interpreter.rs should not allocate AST nodes.
This separation makes the project easier to reason about.
Part 34: a more complete arena.rs
Here is a better zero-dependency arena module.
use std::marker::PhantomData;
use std::ops::{Index, IndexMut};
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct Id<T> {
index: u32,
_marker: PhantomData<fn() -> T>,
}
impl<T> Copy for Id<T> {}
impl<T> Clone for Id<T> {
fn clone(&self) -> Self {
*self
}
}
impl<T> Id<T> {
pub fn new(index: usize) -> Self {
assert!(index <= u32::MAX as usize, "too many arena items");
Self {
index: index as u32,
_marker: PhantomData,
}
}
pub fn index(self) -> usize {
self.index as usize
}
}
#[derive(Debug, Default)]
pub struct Arena<T> {
items: Vec<T>,
}
impl<T> Arena<T> {
pub fn new() -> Self {
Self { items: Vec::new() }
}
pub fn with_capacity(capacity: usize) -> Self {
Self { items: Vec::with_capacity(capacity) }
}
pub fn alloc(&mut self, value: T) -> Id<T> {
let id = Id::new(self.items.len());
self.items.push(value);
id
}
pub fn get(&self, id: Id<T>) -> &T {
&self.items[id.index()]
}
pub fn get_mut(&mut self, id: Id<T>) -> &mut T {
&mut self.items[id.index()]
}
pub fn iter(&self) -> impl Iterator<Item = (Id<T>, &T)> {
self.items
.iter()
.enumerate()
.map(|(index, value)| (Id::new(index), value))
}
pub fn len(&self) -> usize {
self.items.len()
}
pub fn is_empty(&self) -> bool {
self.items.is_empty()
}
}
impl<T> Index<Id<T>> for Arena<T> {
type Output = T;
fn index(&self, id: Id<T>) -> &Self::Output {
self.get(id)
}
}
impl<T> IndexMut<Id<T>> for Arena<T> {
fn index_mut(&mut self, id: Id<T>) -> &mut Self::Output {
self.get_mut(id)
}
}
Now you can write:
let expr = &ast.exprs[expr_id];
That is more readable.
Why u32 instead of usize? Many compilers use smaller IDs to reduce memory. u32 is enough for about 4 billion nodes. Your interpreter will not hit that.
Part 35: common Rust borrow problems and fixes
With arenas, you may hit borrow checker errors like:
cannot borrow ast.exprs as mutable because it is also borrowed as immutable
This often happens when you do:
let expr = ast.exprs.get(expr_id);
let new_id = ast.exprs.alloc(...); // mutable borrow while expr is alive
println!("{:?}", expr);
Fix it by copying out the IDs or data you need first:
let (left, right) = match ast.exprs.get(expr_id) {
Expr::BinaryOperation { left, right, .. } => (*left, *right),
_ => return,
};
let new_id = ast.exprs.alloc(...);
The immutable borrow ended before the mutable borrow started.
This style is common in Rust compiler code.
Part 36: arena allocation and error recovery
Parser error recovery means continuing after a parse error to find more errors.
Recovery can create partial AST nodes.
Partial means incomplete.
In an arena parser, this is okay. You can allocate error nodes.
pub enum ExprKind {
Error,
Number(f32),
Binary { left: ExprId, op: BinaryOperator, right: ExprId },
}
Then parser can return an ExprId even after an error:
fn error_expr(&mut self, span: Span) -> ExprId {
self.ast.exprs.alloc(ExprNode {
kind: ExprKind::Error,
span,
})
}
This makes later passes simpler because they can keep walking the tree.
Type checker can say:
ExprKind::Error => TypeId::ERROR
This is a real compiler technique.
Part 37: real compiler patterns in simple words
Clang style
Clang has an ASTContext. You can think of it as a large owner for AST-related data for one translation unit.
Simple mental model:
ASTContext
├─ owns declarations
├─ owns statements
├─ owns expressions
├─ owns types
└─ frees them together
This is arena thinking.
LLVM style
LLVM has bump pointer allocators in many places. A bump pointer allocator is a bump arena.
It is used when allocation is frequent and individual free is not needed.
Rust compiler style
The Rust compiler uses arenas, IDs, interners, and context objects. The exact internals are complex, but the mental model is the same:
- store compiler data in long-lived contexts
- refer to many things by IDs or interned references
- avoid allocating and freeing tiny objects one by one
- organize data by compiler phase
You do not need to copy rustc design exactly. You should copy the thinking model.
Part 38: benchmarking your arena migration
Do not guess performance. Measure it.
Zero-dependency simple timing:
use std::time::Instant;
let start = Instant::now();
let ast = parser.parse().unwrap();
let parse_time = start.elapsed();
println!("parse: {:?}", parse_time);
println!("expr nodes: {}", ast.exprs.len());
Things to measure:
- parse time
- number of expression nodes
- number of statement nodes
- number of declaration nodes
- interpreter time
- total allocations if you use an allocator profiler later
Run many times. One run can lie because the OS and CPU are noisy.
Noisy means affected by random outside things.
For a small Lox interpreter, the arena version may not look much faster at first. That is normal. The design becomes more valuable as programs and compiler phases grow.
Part 39: exercises to become confident
Do these in order.
Exercise 1
Implement Arena<T> and Id<T>. Write a test:
let mut arena = Arena::new();
let a = arena.alloc(10);
let b = arena.alloc(20);
assert_eq!(*arena.get(a), 10);
assert_eq!(*arena.get(b), 20);
Exercise 2
Convert only Expr to ExprId. Keep Stmt and Declaration unchanged.
Goal: make expression parsing and evaluation work.
Exercise 3
Convert Stmt to StmtId.
Goal: blocks store Vec<DeclId> or Vec<Declaration> temporarily.
Exercise 4
Convert Declaration to DeclId.
Goal: Program stores Vec<DeclId>.
Exercise 5
Add Span to ExprNode.
Goal: runtime and parse errors can point to source locations.
Exercise 6
Add an Interner and change Identifier name from String to Symbol.
Goal: variable lookup uses Symbol.
Exercise 7
Write a pretty-printer that prints any ExprId.
Goal: debug the arena AST.
Exercise 8
Add constant folding as a compiler pass.
Goal: learn mutable arena passes.
Part 40: what expert-level understanding feels like
You are becoming expert when you stop asking only:
How do I allocate this object?
And start asking:
What region owns this data, what phase uses it, and when does the whole region die?
That is the arena mindset.
Compiler engineers think in regions:
- source region
- token region
- AST region
- type region
- IR region
- runtime region
Each region has a purpose and a lifetime.
╭──────────────╮
│ Source text │ lives for parse/check
╰──────┬───────╯
▼
╭──────────────╮
│ AST arena │ lives for semantic passes
╰──────┬───────╯
▼
╭──────────────╮
│ HIR arena │ lives for type checking
╰──────┬───────╯
▼
╭──────────────╮
│ Runtime env │ lives during execution
╰──────────────╯
When this picture feels natural, you understand arena allocation at a compiler-design level.
Part 41: mastery track
This section turns the guide into a learning path. Reading gives you the map. These exercises make the knowledge real.
The goal is not only to know what an arena is. The goal is this:
I can design the memory layout of a compiler phase, explain why I chose it, implement it in Rust, debug it, and improve it when the compiler grows.
That is the level where you can give a talk and also build with confidence.
How to use this track
Do not do everything in one day. Arena allocation looks simple, but the real skill is design judgement.
Use this order:
- implement the tiny examples from memory
- rewrite one small part of your Lox interpreter
- add IDs and spans
- add a symbol interner
- add semantic passes
- measure and explain the result
- prepare a talk from your own code
If you can do those steps without copying the guide, you are no longer only reading. You are building the mental model.
Stage 1: explain the idea without code
Before touching Rust, you should be able to explain arenas in simple words.
Try to answer these out loud:
- What problem does an arena solve?
- Why does an AST often have one shared lifetime?
- Why is freeing every node one by one often wasted work?
- Why can bulk free be safe for compiler data?
- Why are arenas not always the right choice?
- What is the difference between a general heap allocator and an arena?
- What does phase-based memory mean?
Good answer shape:
A compiler creates many small objects in phases. Many of those objects die together. An arena stores those objects in one owner and frees them together. That can reduce allocation overhead, make memory more compact, and make ownership easier to reason about.
Bad answer shape:
Arena is faster.
That answer is too small. You must explain why it is faster, when it is faster, and what tradeoff you accepted.
Stage 2: implement a tiny typed arena from memory
Write this without looking at the guide:
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
struct Id(usize);
struct Arena<T> {
items: Vec<T>,
}
impl<T> Arena<T> {
fn new() -> Self {
Self { items: Vec::new() }
}
fn alloc(&mut self, value: T) -> Id {
let id = Id(self.items.len());
self.items.push(value);
id
}
fn get(&self, id: Id) -> &T {
&self.items[id.0]
}
fn get_mut(&mut self, id: Id) -> &mut T {
&mut self.items[id.0]
}
}
Then explain the hidden design choices:
- Id is Copy because it is just a small handle.
- The arena owns the actual values.
- The caller never owns the node directly.
- The ID is stable as long as we only push and never remove.
- Dropping the arena drops all values.
Then answer this:
Why does Vec moving its internal buffer not break Id?
The answer: Id is an index, not a pointer. If Vec moves its buffer, index 3 is still index 3.
Stage 3: make IDs type-safe
The first Id type is too weak. It lets you accidentally use an expression ID inside a statement arena.
A stronger design uses separate ID types:
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
struct ExprId(usize);
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
struct StmtId(usize);
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
struct DeclId(usize);
Your exercise:
- create ExprArena around Vec<ExprNode>
- create StmtArena around Vec<StmtNode>
- create DeclArena around Vec<DeclNode>
- make each arena return only its own ID type
This feels repetitive. That is okay. In compiler code, type clarity often matters more than removing five lines of repetition.
Talk explanation:
I use separate ID types because an ID is not just a number. It means a specific kind of compiler object. The Rust type system can stop me from mixing them.
Stage 4: convert your Lox Expr first
Do not convert the whole interpreter at once. Convert Expr first.
Current style:
pub enum Expr {
Grouping(Box<Expr>),
BinaryOperation {
left: Box<Expr>,
op: BinaryOperator,
right: Box<Expr>,
},
}
Arena index style:
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct ExprId(usize);
pub enum ExprNode {
Grouping(ExprId),
BinaryOperation {
left: ExprId,
op: BinaryOperator,
right: ExprId,
},
}
The important mental shift:
- ExprId is the handle.
- ExprNode is the stored data.
- Ast or Parser owns the arena.
- Parser functions return ExprId instead of Expr.
So this:
fn expr(&mut self) -> Result<Expr, ParseError>
becomes this:
fn expr(&mut self) -> Result<ExprId, ParseError>
Your first migration milestone:
- primary returns ExprId
- unary returns ExprId
- factor returns ExprId
- term returns ExprId
- comparison returns ExprId
- equality returns ExprId
- assignment returns ExprId
This is the exact place where arenas become real in your own code.
Stage 5: add a real Ast owner
A production-style interpreter should not make the parser own random global state. Make one owner for the parsed program data.
Example shape:
pub struct Ast {
pub exprs: ExprArena,
pub stmts: StmtArena,
pub decls: DeclArena,
}
impl Ast {
pub fn new() -> Self {
Self {
exprs: ExprArena::new(),
stmts: StmtArena::new(),
decls: DeclArena::new(),
}
}
}
Then your Program becomes:
pub struct Program {
pub declarations: Vec<DeclId>,
}
And parse returns both:
pub struct ParsedProgram {
pub ast: Ast,
pub program: Program,
}
Now the ownership story is clear:
╭──────────────╮
│ ParsedProgram│
╰──────┬───────╯
│ owns
▼
╭──────────────╮ ╭──────────────╮
│ Ast arenas │◀─────│ Program IDs │
╰──────┬───────╯ ╰──────────────╯
│
▼
╭──────────────╮
│ Expr/Stmt/...│
╰──────────────╯
This is a talk-worthy design because the diagram tells the ownership story.
Stage 6: make the interpreter read through IDs
Your evaluator changes from owning cloned nodes to borrowing nodes from Ast.
Old shape:
fn eval_expr(&mut self, expr: Expr) -> Result<RuntimeValue, RuntimeError>
Better arena shape:
fn eval_expr(&mut self, ast: &Ast, expr: ExprId) -> Result<RuntimeValue, RuntimeError> {
match ast.exprs.get(expr) {
ExprNode::Number(value) => Ok(RuntimeValue::Number(*value)),
ExprNode::Bool(value) => Ok(RuntimeValue::Bool(*value)),
ExprNode::BinaryOperation { left, op, right } => {
let left = self.eval_expr(ast, *left)?;
let right = self.eval_expr(ast, *right)?;
self.eval_binary(left, op.clone(), right)
}
_ => todo!(),
}
}
The important result:
- no AST clone for evaluation
- recursive structure is represented by IDs
- the arena owns all nodes
- interpreter only reads the AST
This is closer to real compiler engineering.
Stage 7: add Span everywhere
A compiler without source locations is hard to debug.
Add this:
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct Span {
pub start: usize,
pub end: usize,
pub line: u32,
}
Then wrap nodes:
pub struct ExprData {
pub kind: ExprNode,
pub span: Span,
}
Your arena stores ExprData, not only ExprNode.
Why this matters:
- parse errors point to source
- runtime errors point to source
- semantic errors point to source
- later compiler passes can keep useful diagnostics
Talk sentence:
Arena IDs are not only for memory. They also become stable keys that let later compiler passes attach extra information to syntax nodes.
Stage 8: add side tables
Modern compilers often avoid putting every piece of data directly inside the AST node.
Instead, they use side tables.
Side table means: a separate storage indexed by the same ID.
Example:
struct TypeTable {
expr_types: Vec<Option<TypeId>>,
}
If ExprId(10) exists, expr_types[10] stores the type for that expression.
Diagram:
ExprId(10)
│
├── ast.exprs[10] -> BinaryOperation
├── type_table[10] -> Number
├── const_table[10] -> maybe constant value
╰── diagnostics[10] -> maybe warning
This is one of the most important production compiler ideas.
Why it is useful:
- AST remains simple
- each pass owns its own computed facts
- you can throw away pass data without destroying AST
- you can recompute one table later
- IDs connect everything safely
Your exercise:
- create Type enum with Number, String, Bool, Nil, Unknown
- create Vec<Option<Type>> same length as expr arena
- write a type_check_expr function
- store the result in the side table
- make binary + check number plus number or string plus string
Stage 9: add a symbol interner
Right now your lexer creates many String values.
For identifiers, compilers usually intern strings.
Interning means:
Store one copy of a string and use a small ID to refer to it.
Example:
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct Symbol(usize);
pub struct Interner {
strings: Vec<String>,
}
Simple version:
impl Interner {
pub fn intern(&mut self, name: &str) -> Symbol {
for (index, existing) in self.strings.iter().enumerate() {
if existing == name {
return Symbol(index);
}
}
let symbol = Symbol(self.strings.len());
self.strings.push(name.to_string());
symbol
}
pub fn get(&self, symbol: Symbol) -> &str {
&self.strings[symbol.0]
}
}
This simple version is not the fastest because it scans linearly. Later you can add HashMap<String, Symbol>. But first learn the model.
Your migration:
- Identifier name changes from String to Symbol
- TokenType::Identifier can store Symbol if lexing owns an interner
- variable environments can use Symbol as key
Production idea:
Compiler data often becomes IDs pointing into arenas and interners. IDs are the compiler's internal language.
Stage 10: build a resolver pass
A resolver connects variable uses to variable declarations.
Your current interpreter looks up variables by name at runtime. That is okay for a small interpreter. But compilers often resolve names before execution.
Add this idea:
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct BindingId(usize);
pub struct ResolverTables {
pub expr_binding: Vec<Option<BindingId>>,
}
For each identifier expression:
- find the declaration in scope
- create or reuse a BindingId
- store binding in expr_binding[expr_id]
Diagram:
source: print name;
ExprId(22) Identifier(name)
│
▼
ResolverTables.expr_binding[22]
│
▼
BindingId(3) -> var name declared at line 1
Why this matters for arena mastery:
- the AST is stable
- the resolver produces side data
- the interpreter can use resolved bindings
- you start thinking like a compiler engineer
Stage 11: implement compiler passes as separate structs
Do not put every pass inside Parser or Interpreter.
Use shapes like this:
pub struct Resolver<'a> {
ast: &'a Ast,
scopes: Vec<Scope>,
tables: ResolverTables,
}
pub struct TypeChecker<'a> {
ast: &'a Ast,
resolver: &'a ResolverTables,
types: TypeTable,
}
This makes phase ownership clear:
Parser ───────▶ Ast
│
▼
Resolver ─────▶ ResolverTables
│
▼
TypeChecker ───▶ TypeTable
Each pass reads earlier data and writes its own data.
That is a production compiler habit.
Stage 12: add tests that prove the design
Arena bugs are often design bugs, not only syntax bugs.
Write tests for behavior like:
- allocating two expressions gives different IDs
- retrieving an ID returns the right node
- BinaryOperation stores child IDs correctly
- parser returns a Program with declaration IDs
- pretty-printer can walk the AST from root IDs
- resolver maps identifier expressions to bindings
- type checker writes type data for every expression
Example test:
#[test]
fn binary_expression_points_to_children() {
let mut ast = Ast::new();
let left = ast.exprs.alloc(ExprNode::Number(1.0));
let right = ast.exprs.alloc(ExprNode::Number(2.0));
let root = ast.exprs.alloc(ExprNode::BinaryOperation {
left,
op: BinaryOperator::Add,
right,
});
match ast.exprs.get(root) {
ExprNode::BinaryOperation { left: l, right: r, .. } => {
assert_eq!(*l, left);
assert_eq!(*r, right);
}
_ => panic!("expected binary expression"),
}
}
If your tests can walk IDs through many passes, your design is becoming solid.
Stage 13: measure the result
Do not say arena is faster without measuring.
For your Lox interpreter, create a benchmark-like command manually with std::time::Instant.
You can measure:
- parse time
- number of Expr nodes
- number of Stmt nodes
- number of Decl nodes
- max scope depth
- total source size
- time to interpret
Simple measurement:
let start = std::time::Instant::now();
let parsed = parser.parse().unwrap();
let parse_time = start.elapsed();
println!("parse time: {:?}", parse_time);
println!("expr nodes: {}", parsed.ast.exprs.len());
Add len to each arena:
fn len(&self) -> usize {
self.items.len()
}
Talk rule:
I do not claim speed only because I used an arena. I claim a better allocation model, then I measure whether it improves this program.
That sentence sounds like a serious engineer.
Stage 14: prepare a stage talk outline
If you want to give a talk, use this structure.
Title:
Arena allocation in Rust compilers: from Box trees to ID-based ASTs
Talk flow:
- show the beginner Box AST
- explain the problem with many small allocations
- explain compiler phase lifetimes
- introduce arena as bulk-owned memory
- show typed arena with ExprId
- show parser returning ExprId
- show interpreter reading through IDs
- show side tables for types and resolver data
- explain tradeoffs and failure modes
- show the final Lox architecture
Simple stage diagram:
Before:
Expr owns Box<Expr>
Stmt owns Expr
Interpreter clones/walks owned nodes
After:
Ast owns arenas
Program stores root IDs
Passes read IDs and write side tables
The best talk is not only theory. Use your own Lox interpreter as the running example. That makes it real.
Stage 15: final project
Final project name:
Lox with arena-backed AST and compiler-style side tables
Build these features:
- ExprId arena
- StmtId arena
- DeclId arena
- Ast owner
- Program root IDs
- Span on every AST node
- Symbol interner for identifiers
- resolver pass
- type or value classification table
- pretty printer using IDs
- interpreter that reads AST by IDs
- error messages that use Span
- simple measurements for parse time and node counts
When you finish this, you can honestly say:
I implemented a compiler-style arena architecture in Rust. I understand typed IDs, AST ownership, side tables, and phase-based memory.
That is a strong claim. It is not fake confidence. It is based on working code.
Production-readiness checklist
Before calling your design production-ready, ask these questions:
- Can every AST node be reached from a Program root or pass root?
- Are IDs type-safe, or can ExprId be mixed with StmtId?
- Do error messages still have source spans?
- Can later passes attach data without mutating every AST node?
- Is there a clear owner for AST, symbols, scopes, types, and IR?
- Can you drop temporary pass data when no longer needed?
- Are there tests for parser, resolver, type data, and interpreter walking IDs?
- Can you print/debug the AST by ID?
- Can you measure node counts and parse time?
- Do you know which arenas live for the whole program and which are scratch arenas?
- Can you explain why an index arena was chosen instead of references?
- Can you explain why an unsafe bump arena was not needed yet?
If you cannot answer one of these, that is not failure. It is the next thing to learn.
Self-interview questions
Use these to test whether you really understand the topic.
Beginner questions:
- What is an arena allocator?
- Why does an AST fit arena allocation well?
- What does bulk free mean?
- Why can many Box allocations be slower than arena allocation?
Intermediate questions:
- Why are IDs often easier than references in Rust compiler data structures?
- What is the difference between typed arena and index arena?
- Why should Parser return ExprId instead of Expr?
- Why should Program store DeclId roots?
- What is a side table?
Advanced questions:
- How would you add spans without making every pass messy?
- How would you resolve variables using side tables?
- How would you store inferred types for expressions?
- When would you choose generational IDs?
- When would a scratch arena be useful?
- What can go wrong if you delete from an index arena?
- Why is unsafe bump allocation harder to expose safely?
Expert questions:
- What are the lifetime boundaries of each compiler phase?
- Which data should live for the whole compilation?
- Which data should die after one pass?
- Which IDs are stable across passes?
- Which IDs should never escape a pass?
- How would this design change for incremental compilation?
- How would this design change for multi-file compilation?
- How would this design change if you add bytecode or IR?
If you can answer these with examples from your own interpreter, you can explain arena allocation on stage.
Implementation milestones for your current Lox code
Use this exact order. Do not skip ahead.
Milestone 1:
- add ExprId
- add ExprArena
- add Ast with exprs only
- parser expression methods return ExprId
- interpreter can evaluate expression IDs
Milestone 2:
- add StmtId
- add StmtArena
- statement parser methods return StmtId
- interpreter evaluates StmtId
Milestone 3:
- add DeclId
- add DeclArena
- Program stores Vec<DeclId>
- interpreter evaluates DeclId
Milestone 4:
- add Span
- store Span on tokens
- store Span on AST nodes
- show line information in errors
Milestone 5:
- add Symbol
- add Interner
- convert Identifier from String to Symbol
- environment uses Symbol keys
Milestone 6:
- add ResolverTables
- resolve variables before interpretation
- report undefined variable before runtime when possible
Milestone 7:
- add TypeTable or ValueKindTable
- check binary operations before runtime when possible
- keep RuntimeError for true runtime-only errors
Milestone 8:
- add pretty-printer
- add node count stats
- add parse time stats
- compare old design and new design on a larger Lox file
After milestone 8, your interpreter will not just use an arena. It will have the shape of a real compiler front end.
What confidence should feel like
Real confidence is not this:
I read one article, so I know production compilers.
Real confidence is this:
I built the core design. I know why each owner exists. I know what IDs point to. I know where side data lives. I know what gets dropped and when. I can debug it. I can explain the tradeoffs.
That is the target.
Final checklist
You understand arena allocation when you can answer these:
- What owns the nodes? The arena or Ast.
- What points to nodes? IDs or arena references.
- When are nodes freed? When the arena drops.
- Why is it fast? Fewer allocator calls, compact data, bulk free.
- Why is it good for compilers? Compiler data is phase-based.
- What is the safest Rust version? Index arena.
- What is the lower-level fast version? Bump arena.
- What should your Lox interpreter use first? Index arena for AST nodes.
If you implement the index arena version for your Lox AST, you will already understand the most useful arena pattern for real interpreters and compilers.