/prog/

The C++ Arena

by 029 | Dec 7, 2022

Arenas are a powerful tool for memory management, evaporating much of the confusion and many of the issues around lifetimes. Though some implementations exist, they're not used extensively in C++, which is unfortunate. The value that arenas provide to C++ programs and their architecture is immense and only grows with the complexity of the program. What follows are some ideas worth noting when implementing an arena that fits well into C++.

Required Reading

To start, you should read Ryan Fleury's excellent article on arenas. It covers, in depth, the problems with traditional manual memory management, what arenas are, and how they solve those problems and simplify lifetimes.

I have to agree with Ryan's critiques of the current state and perception of memory management in C++. RAII1 containers and smart pointers do make memory management more automatic, but they automate entirely the wrong problem: they essentially generate all the glue code that tightly couples lifetimes to memory allocations. There are ways to improve the performance characteristics of this approach (using a smarter malloc, or overloading the new/delete operators), but they are complex and do not solve the actual problem with lifetimes.

Using arenas by default simplifies lifetime management immensely: When does an object's lifetime end? When its containing arena kicks it out. This is simple, because it can only happen when popping from or clearing the arena, which also happens when the arena is destroyed.

But enough proselytization.

Basic Design

It's very easy to adapt the fundamental arena interface:

class Arena
{
public:
  // create or destroy a 'stack' - an "arena"
  static Arena* Create();
  static void Destroy(Arena* arena);

  // push some bytes onto the 'stack' - the way to allocate
  void* Push(u64 bytes, u64 alignment = 1);

  // get the # of bytes currently allocated.
  u64 GetUsage();
 
  // also some useful popping helpers:
  void PopUsageTo(u64 count);
  void Clear();
};
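
For a sense of how the pieces fit together, here is a minimal usage sketch against that interface. It assumes that PopUsageTo rewinds the arena to a value previously returned by GetUsage, which is how I use it below:

Arena* arena = Arena::Create();

// remember where we are, then push some scratch space
u64 marker = arena->GetUsage();
void* scratch = arena->Push(1024, 16);

// ... use the scratch memory ...

// rewind: everything pushed after the marker is released in one call
arena->PopUsageTo(marker);

// or throw everything away at once, then tear the arena down
arena->Clear();
Arena::Destroy(arena);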

It's workable, but not particularly easy to use. It can be improved by using templates to carry type information about what is being allocated:

template<typename T>
T* Place()
{
  void* mem = Push(sizeof(T), alignof(T));
  return static_cast<T*>(mem);
}

It is also useful to wrap the common pattern of allocating an array. This one chooses not to initialize anything, and says so in its name. I find this worthwhile, as it makes the user think about the fact that the memory is uninitialized.

template<typename T>
T* PlaceArrayUninitialized(u64 count)
{
  return static_cast<T*>(Push(count * sizeof(T), alignof(T)));
}
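
Usage then reads naturally; Vertex here is just a hypothetical plain struct for illustration:

struct Vertex { float x, y, z; };

Vertex* one = arena->Place<Vertex>();
Vertex* many = arena->PlaceArrayUninitialized<Vertex>(128);
// 'many' points at 128 uninitialized Vertex slots, as the name warns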

When designing interfaces like this that use templates only for code ergonomics, I prefer to keep the amount of templatized code as small as possible and leave most of the details to the concrete implementation. This is one technique for preventing code bloat and long compile times from monomorphization.

C++isms

At the language level, C++ has a lot more to say about object lifetimes than C. Lifetimes must be properly "begun", and unexpected things will happen if destructors are not called before lifetimes end.2

Beginning lifetimes

Currently, Place does not properly handle beginning the lifetime of T, except in trivial cases where it is a plain struct. This can be fixed by beginning the lifetime of the object before returning it by using placement new.

return static_cast<T*>(new (mem) T{});

The lifetime is now properly begun and we are free of undefined behavior, but it can now only be used on types that are constructible with no arguments (default-constructible). The pattern used in STL containers that support emplace is to accept arguments of any type and pass them to the constructor. Place is essentially the same thing, so we will use the same technique.

template<typename T, typename... Args>
T* Place(Args&&... args)
{
  void* mem = Push(sizeof(T), alignof(T));
  return static_cast<T*>(new (mem) T{ std::forward<Args>(args)... });
}
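
Constructor arguments are now forwarded straight through. With a hypothetical Node type:

struct Node
{
  Node(Node* parent) : parent(parent) {}
  Node* parent;
};

Node* root = arena->Place<Node>(nullptr);
Node* child = arena->Place<Node>(root);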

Destructors

So far, the arena will work, but it behaves in a very unexpected way when ending the lifetimes of contained objects - the destructors are not called. Generally, container-like constructs in C++ are supposed to call the destructors of the objects they contain, but that's not happening yet.

First we need to be able to store destructors. In C++, it's not possible to take the address of a destructor (or of a constructor, for that matter, though both have addresses). Herb Sutter has a post that covers a nice solution. The gist of it is that we can use a captureless lambda inside a template to capture the type and call the destructor. This compiles down to essentially just the destructor call, which is great. It looks like this:

struct Destructor
{
  template<typename T>
  static Destructor Create(T const* object)
  {
    return Destructor{
      .bound_object = object,
      .destroy = +[](void const* x) { static_cast<T const*>(x)->~T(); }
    };
  }

  void operator()() const
  {
    destroy(bound_object);
  }

  void const* bound_object;
  void(*destroy)(void const*);
};
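
In isolation, the type erasure works like this; std::string is just a stand-in for a type with a non-trivial destructor:

alignas(std::string) unsigned char storage[sizeof(std::string)];
auto* str = new (storage) std::string("needs cleanup");

Destructor d = Destructor::Create(str);
// ... later, when that memory is about to be reused ...
d(); // runs the string's destructor through the erased function pointer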

Now, whenever an object is placed in the arena, we will also create a destructor object for it if it needs one.

void* mem = nullptr;
if constexpr (std::is_trivially_destructible_v<T>)
{
  mem = Push(sizeof(T), alignof(T));
}
else
{
  mem = PushWithDestructor(sizeof(T), alignof(T), Destructor::Create<T>(nullptr));
}

PushWithDestructor fills in the bound object's address automatically, which is why it is passed here as nullptr and the type is specified explicitly. It also stores the destructor object in the arena's memory. In my arena, this is implemented by committing pages as needed from the back of the arena block's reservation and filling them with destructor objects as a downwards-growing stack. In the same way that the stack and heap segments of a process traditionally grew towards each other, the data section of the arena and the bookkeeping section filled with destructors grow towards each other.

The destructor objects are invoked and removed in the reverse order that their objects were added to the arena, just like the built-in stack. When popping from the arena, we can simply walk the destructor records, running each one whose bound pointer is beyond the new end of the data section, and stop at the first one that is not.
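
As a rough sketch of that popping logic (the member names here are illustrative and page commit/decommit is elided):

void Arena::PopUsageTo(u64 count)
{
  char* new_end = data_begin + count;

  // destructor records grow down from the back of the reservation;
  // 'destructors' points at the most recently pushed record
  while (destructors != destructors_end)
  {
    Destructor const& d = *destructors;
    // records are walked in reverse allocation order, so the first object
    // that survives the pop means everything pushed before it survives too
    if (static_cast<char const*>(d.bound_object) < new_end)
      break;
    d();           // end the object's lifetime properly
    ++destructors; // pop the record
  }

  usage = count;
}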

Suballocations

One of the best ways arenas simplify lifetime management is in dealing with the extra allocations an object needs to make. I came across a great real-world example of how using arenas can both simplify code and make it more performant, instead of forcing the usual optimization-versus-complexity tradeoff.

Serenity LibJS

In a recent video by Andreas Kling in which he reduces the Ladybird browser's memory usage while loading Twitter, an interesting architectural situation arises. It's bad manners to link a 2-hour video with no context, so here's the code, and I'll also explain the process from the video.

CallExpression

Ladybird's JavaScript interpreter, LibJS, stores various "nodes" corresponding to elements of JavaScript syntax such as statements, variable declarations, numbers, and function calls. The gist of the issue is that certain nodes can contain different amounts of data, even within the same type. We'll focus on the CallExpression node, as it was the impetus for the largest change Andreas made while optimizing memory usage. As the name suggests, CallExpression represents a function call, something like my_func(a, b, c). Each of the arguments to the call is an Expression, and the call is an Expression itself.

When representing this in a traditional way, it might look something like the following:

class CallExpression : public Expression {
public:
  struct Argument {
    std::shared_ptr<Expression> value;
  };
  CallExpression(std::shared_ptr<Expression> callee, std::vector<Argument> arguments)
    : callee(std::move(callee))
    , arguments(std::move(arguments))
  {}
private:
  std::shared_ptr<Expression> callee;
  std::vector<Argument> arguments;
};

And this is indeed what the initial implementation looked like, though with more features. Andreas realized that the number of arguments never changed after construction and decided to refactor so that the arguments lived in the same block of memory as the CallExpression itself. This avoids the extra heap allocation and size overhead of the dynamic array, and it involved adding a function to create the object:

static std::shared_ptr<CallExpression> create(std::shared_ptr<Expression> callee, std::vector<Argument> arguments)
{
  size_t allocation_size = sizeof(CallExpression) + sizeof(Argument) * arguments.size();
  void* slot = malloc(allocation_size);
  // pass the vector as an lvalue so the constructor's std::span parameter can view it
  return std::shared_ptr<CallExpression>{ new (slot) CallExpression(std::move(callee), arguments) };
}

The class needed to be updated with different members to be able to access the tail array, using the "zero size array" idiom which is common in OS development and C programming:

i32 argument_count;
Argument arguments[0];

This isn't standard C++ and relies on vendor extensions implemented by major compilers.
The address of arguments is equivalent to this + 1, or directly following the class's normally-sized members.

The arguments also need to be initialized in the constructor:

CallExpression(std::shared_ptr<Expression> callee, std::span<Argument> arguments)
  : callee(std::move(callee))
  , argument_count(arguments.size())
{
  for (i32 i = 0; i < argument_count; ++i) {
    new (&this->arguments[i]) Argument(std::move(arguments[i]));
  }
}

It was also necessary to overload operator delete to deallocate properly and avoid a mismatch between malloc and delete. However, there's no such thing as a virtual operator delete, so the base-most class in the AST hierarchy had to use malloc in its creation function as well.

class ASTNode : public std::enable_shared_from_this<ASTNode> {
public:
  // ...
  void operator delete(void* ptr)
  {
    free(ptr);
  }

  template<std::derived_from<ASTNode> T, typename... Args>
  static std::shared_ptr<T> create(Args&&... args)
  {
    size_t allocation_size = sizeof(T);
    void* slot = malloc(allocation_size);
    return std::shared_ptr<T>{ new (slot) T(std::forward<Args>(args)...) };
  }
};

This already looks nearly identical in logic to the implementation of Arena::Place. Here I think it is worth noting what the reference-counted smart pointers everywhere in the tree structure are actually for. It has very little to do with lifetime semantics, i.e. allowing code to keep a reference to part of the syntax tree and prevent its deallocation; that functionality is not used. Instead, the purpose of the smart pointers is to handle allocating and deallocating the tree. But that already requires a decent amount of support code, and the support code is flawed.
As Ryan said, misuse is still possible, and often not checkable in a language like C++. Indeed, immediately after this change, Andreas was able to use a memory debugger to find a spot where someone had forgotten to call create and used new directly, resulting in a new/free mismatch. Nothing prevents future code from mistakenly allocating without create. Also, CallExpression has its own creation function, so any time it is created, one must remember to use that instead of the generic ASTNode::create.

Tail Array

Jonathan Müller shared a sample of a pattern that he's used for code like this. It's a generic way of automatically allocating the extra space at the end (or tail, ergo "tail array"). As of this writing, Andreas has implemented something similar, but the full solution could not be adopted as-is because CallExpression has a subclass, NewExpression, which rules out the CRTP derived-class trick. The PR hasn't been merged yet.
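
For a sense of the idea, the core of such a helper might look roughly like this. This is my own paraphrase of the pattern, not Jonathan's actual code, and it glosses over alignment and destruction of the tail elements:

template<typename Derived, typename T>
class TailArray
{
public:
  template<typename... Args>
  static Derived* create(size_t tail_count, Args&&... args)
  {
    // one allocation holds both the object and its tail elements
    void* slot = ::operator new(sizeof(Derived) + sizeof(T) * tail_count);
    return new (slot) Derived(std::forward<Args>(args)...);
  }

  T* tail()
  {
    // valid only if Derived is the most-derived type, which is exactly
    // why a subclass like NewExpression breaks the trick
    return reinterpret_cast<T*>(static_cast<Derived*>(this) + 1);
  }
};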

Rethinking It

Using arenas, the entire situation can be simplified massively. There is no need for shared pointers, as the entire syntax tree's lifetime will be consolidated into the same arena that was provided to the parser. Nodes that need different sizes can simply take an arena parameter and use it to allocate more space for themselves.
In the case of CallExpression the members can look like this:

i32 argument_count;
Expression** arguments;

And the constructor can be parameterized with an arena:

CallExpression(Arena* arena, Expression* callee, std::span<Expression*> arguments)
  : callee(std::move(callee))
  , argument_count(arguments.size())
  , arguments(arena->PlaceArrayUninitialized<Expression*>(argument_count))
{
  for (i32 i = 0; i < argument_count; ++i) {
    this->arguments[i] = arguments[i];
  }
}

And all of the supporting create functions can be removed, along with the custom operator delete. The constructor delegates the responsibility of allocating memory to its caller.

For some added ease-of-use, a helper can be added to ensure correct usage:

std::span<Expression*> Arguments() {
  return { arguments, static_cast<size_t>(argument_count) };
}
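
Given a CallExpression* call, callers never have to touch the raw pointer and count:

for (Expression* argument : call->Arguments())
  Visit(argument); // Visit stands in for whatever the caller does with each argument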

And that's really it! Even if CallExpression is later derived from, the subclasses can simply also be parameterized by an arena and forward it to CallExpression's constructor. The level of indirection prevents issues with the tail array getting mixed up inside the subclasses. If the class were unable to be constructed outside of an arena (think private constructor and making Arena a friend) and also either marked final or somehow able to assert that all derived classes are the same size, then it could use the this + 1 trick to have the tail array inline. This would even work for multiple tail arrays as the count for each one is known, but it gets more complicated to calculate the pointer offset.
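
As a sketch of how that restricted version could be expressed, with a hypothetical PlaceWithTail helper on the arena (the alignment of the tail elements is glossed over):

template<typename T, typename... Args>
T* PlaceWithTail(u64 tail_bytes, Args&&... args)
{
  static_assert(std::is_final_v<T>, "this + 1 is only the tail if nothing derives from T");
  void* mem = Push(sizeof(T) + tail_bytes, alignof(T));
  return static_cast<T*>(new (mem) T(std::forward<Args>(args)...));
}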

More Sugar

In a codebase written to use arenas as the memory management strategy, types parameterized with arenas are very common. Placing such an object with our current setup looks like this:

CallExpression* call = arena->Place<CallExpression>(arena, callee, arguments);

It's awkward to have to pass arena twice in the same function call (recall that the left-hand arena becomes the this pointer). Since the implementation of Place already has access to the arena as this, it can pass it along whenever possible.

if constexpr (std::is_constructible_v<T, Arena*, Args...>)
{
  return static_cast<T*>(new (mem) T(this, std::forward<Args>(args)...));
}
else
{
  return static_cast<T*>(new (mem) T{ std::forward<Args>(args)... });
}

This will automatically insert the arena as the first parameter if possible, making for seamless parameterization.

CallExpression* call = arena->Place<CallExpression>(callee, arguments);

As a bonus, it's still possible to pass an arena explicitly: doing so changes Args, so the constructibility check no longer matches the implicit form and the explicit arguments are used as written.
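
For example, placing a node in one arena while it suballocates from another (other_arena is hypothetical here):

CallExpression* call = arena->Place<CallExpression>(other_arena, callee, arguments);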

Closing

That's about it for some of the details and reasoning behind C++ arenas! They are a fantastically useful concept and I hope that you gained enough value from this to try using this memory management strategy in your own C++ projects.

This is the first post on /prog/. If you liked it, you can subscribe through RSS or follow me on Twitter, where I'll also post whenever something new goes up here.

Bye for now, and see you soon!
~029


Footnotes