In the past I’ve talked a lot about “the perfect metaverse engine”, which was/is the goal of my Brane project. But more recently I’ve been thinking about the sustainable route to get there. Making an engine for a massive social/UGC userbase like that is probably the game engine equivalent of “for my first game I’m going to make an MMO”. So to scale it back, what if I first made the perfect engine for myself and a couple of friends to use, instead of for everyone?

What I’ve learned

When I started working on Brane, I’d only really worked as either a lone wolf programmer or on a team working on tools rather than an end product. Over the last 2 years or so I’ve been working more with teams that are directly shipping to the end user and building content, and it’s a very different experience.

I have a very high expectation for the quality of my work, and when you’re working on tools you can focus on making them perfect, but when you’re in an environment where you’re time constrained and need to ship things to an existing audience it’s a lot more of a juggling act. And I’ve learned a lot about what I want out of my tools.

Nowadays, way too often I feel like I’m fighting tooth and nail against the tools that I’m using. And I don’t mean that said tools are blocking me, but I feel like they’re way too happy to let me or my teammates trip over them, and when we do it’s a “skill issue”. I’m never blocked, but I do wish I could rely on tools more than I do. Maybe a better way to put it is that it feels like it’s my responsibility to keep my tools from falling apart, when ideally the tools should add stability to my workflow.

What Should Tools Be Doing?

Let me make one thing clear. People don’t read docs. They don’t remember what teammates tell them. And people often don’t document what they’re doing in the first place. AI will hallucinate things that don’t work, and people will paste them in.

I am someone who likes to understand the whole picture, and I usually take the time to try and do that. However, even I have to admit that the way I go about understanding things is that I go in with a goal to do something, and then I learn how to do it. I don’t learn about anything outside of what’s required to accomplish that one task (well) at that one time, and then I move on to the next task, and eventually I understand the whole system. This works amazing in solo dev. This does not work as well for teams where the context is constantly shifting, and communication is the main bottleneck.

It should be the tool’s job to allow devs to express their intention as clearly as possible and then enforce that intention. That way, if another dev tries to change a related system, they will be informed, can quickly bring themselves up to speed on what’s going on there, and can see how to adjust the contracts of systems so they work together in an expected way.

I’m going to use the word “contract” a lot here. If you think about what a contract is, you’re given a list of things expected of you, and what you get in return. Sometimes you get a list of what happens if you break the contract, but we write perfect code so we’re not going to worry about that for now. If we have “unbreakable” contracts, then we never have to account for that broken contract case anyways.

To give some specific examples, the main “broken contracts” I run into across every single project can mostly be split up into these categories:

Data shape
Data time write/use
Lack of support for error flow
Bandwidth

Data Shape

A great example of “data shape” is the way that pointers are usually used. This mostly applies to C/C++, but the concept can apply to other languages as well, for example a null object in C#. If you think about the “contract” of a pointer, it’s that any pointer variable can either point to good data, a known null value, OR at any point anyone with an instance of that pointer can “break” it for everyone else by deallocating what it points to. The very loose contract in place basically says: this might point to data, and it might not, and we might tell you which case that is. Now if your program is flawless you’ll never get a pointer to invalid data, only null or valid.

But now this brings up an interesting question: am I now expected to always check if a pointer is null in every single place that I use that pointer? No. There’s an “implicit” contract in a lot of places where it can be assumed that a pointer is always valid, increasing the efficiency and readability of our code. But since in C/C++ non-nullness is very rarely enforced, there are a lot of places where null can sneak in, “breaking” that implicit contract and corrupting/crashing the program. This is compensated for in a lot of cases by over-checking pointers even if it can be reasonably assumed that they’re valid, because one can never be sure. (AI is very fond of over-checking pointers like this.) BUT EVEN THEN THIS CAUSES MORE ISSUES! Let’s say you early return if the pointer you think should always be valid is actually null. What happens to the state of your program? You’ve successfully avoided a crash at that location, but what you’ve actually done is introduce a “silent failure”. What you expected to do there silently didn’t happen. If any logic relied on the code you just skipped, that logic will now fail. It might be a hot take, but I’d rather the program crash at that first point of broken contract, so that I can then trace where that unexpected data came from and fix the root issue.
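To make the pointer contract above concrete, here is a minimal Rust sketch of the same idea expressed in types. `Health` and the three functions are hypothetical names invented for this example, not part of any engine. The point is only that “never null”, “maybe absent”, and “crash at the first broken contract” become three visibly different signatures.

```rust
/// Hypothetical component used only for illustration.
#[derive(Debug, PartialEq)]
pub struct Health {
    pub points: u32,
}

/// Contract: the caller must hand over valid data. A `&mut Health` cannot be
/// null, so there is nothing to check and no branch to forget.
pub fn apply_damage(health: &mut Health, amount: u32) {
    health.points = health.points.saturating_sub(amount);
}

/// Contract: the target may legitimately be absent. The compiler forces us to
/// look inside the `Option`, and the return value reports whether the damage
/// was actually applied instead of silently skipping it.
pub fn apply_damage_if_present(target: Option<&mut Health>, amount: u32) -> bool {
    match target {
        Some(health) => {
            apply_damage(health, amount);
            true
        }
        None => false,
    }
}

/// Fail loudly at the first broken contract: if the caller believes the value
/// must exist, `expect` panics right here with a message instead of letting
/// the program continue in an unknown state.
pub fn apply_required_damage(target: Option<&mut Health>, amount: u32) {
    let health = target.expect("contract broken: damage target must exist");
    apply_damage(health, amount);
}
```

The early-return version still exists (`apply_damage_if_present`), but it has to admit in its signature that it might do nothing, so the caller can’t accidentally treat it as always succeeding.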

Another good example is editor-set properties. An extremely common issue in Unity is forgetting to set something in the editor, and this problem is even worse when your script treats “null” as a valid value, but in that particular instance you meant to set it to something.

Data Time

When you think about a single-threaded program, you’d usually think that there’s no chance for a race condition to occur, right? WRONG. A very common form of bug you will run into in game engines is when you expect data from one part of the frame, but it’s either not been written to that variable yet, or it has been overwritten.

In the lifecycle of a VR character controller, it’s very common to see a tree of transforms representing the bones of the player go through a lifecycle that looks something like this:

Last frame values -> Animation resets values -> IK adjusts those values -> Physics has its way -> Values sent to the GPU.

I’ve had some very hard-to-debug issues where an interaction system was using the post-animation values of transforms, instead of the much more usable post-IK or post-physics values that actually represent where a player’s hands are.

The key takeaway is: At different points in the frame, the same “variable” represents completely different “values”. So it’s very hard to create a contract saying “I want this specific value”, since to grab that value you often need to interact with the very lacking scheduling system of most engines. The standard way that engines handle this nowadays is with events that fire off in a known order: tick/update, physics, animation, etc. And then for more granularity you get pre/post events. I consider this all BS.
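One way to picture “I want this specific value” as a contract is to tag the value with the stage that produced it. The sketch below is a small Rust illustration under that assumption. The stage markers, `HandPose`, and `can_grab` are all made-up names following the lifecycle listed above, and the IK and physics steps are placeholder math.

```rust
use std::marker::PhantomData;

/// Marker types naming the point in the frame a value belongs to.
pub struct PostAnimation;
pub struct PostIk;
pub struct PostPhysics;

/// A hand position tagged with the stage that produced it.
pub struct HandPose<Stage> {
    pub position: [f32; 3],
    _stage: PhantomData<Stage>,
}

impl HandPose<PostAnimation> {
    pub fn from_animation(position: [f32; 3]) -> Self {
        HandPose { position, _stage: PhantomData }
    }

    /// IK consumes the animation pose and produces a differently typed value.
    pub fn solve_ik(self, offset: [f32; 3]) -> HandPose<PostIk> {
        let p = self.position;
        HandPose {
            position: [p[0] + offset[0], p[1] + offset[1], p[2] + offset[2]],
            _stage: PhantomData,
        }
    }
}

impl HandPose<PostIk> {
    /// Physics gets the final say. Here it just keeps the hand above a floor at y = 0.
    pub fn resolve_physics(self) -> HandPose<PostPhysics> {
        let p = self.position;
        HandPose { position: [p[0], p[1].max(0.0), p[2]], _stage: PhantomData }
    }
}

/// The interaction system states which value it wants in its signature.
/// Passing a `HandPose<PostAnimation>` here is a compile error, not a runtime bug.
pub fn can_grab(hand: &HandPose<PostPhysics>, object: [f32; 3], reach: f32) -> bool {
    let dist_sq: f32 = (0..3usize)
        .map(|i| (hand.position[i] - object[i]).powi(2))
        .sum();
    dist_sq.sqrt() <= reach
}
```

This only covers a single value flowing through a fixed pipeline. It says nothing yet about scheduling many objects, which is the larger problem.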

And do NOT get me started on initialization. Bootstrapping Entities/Actors/GameObjects to a point where they’re valid is when event based time contracts break down the most. Because somehow all of my [game units] must implicitly hope that every other thing that’s initializing in the same step has handled all their data dependencies and timing correctly.

This gets even worse when you’re working with a networking system that can just decide to swap out values whenever it wants or even call callbacks. Did you know that in Unreal Engine, networked variable change callbacks can fire off before any other initialization has happened other than the base class constructor? Bane of my existence working with that engine, because in most cases you want those callbacks to operate on a fully instantiated class that’s already had BeginPlay execute, especially when you have a seamless transition going on and the physics system doesn’t exist yet.

Lack of Error Flow

This one is kind of self-explanatory. Once you create and enforce contracts, I’ve found that “failure” cases for code become a lot harder to ignore. If I’m forced to check whether a pointer is null before using it, and it’s guaranteed that it could be null, now I need to account for that in my code. This means that the greater contract of the whole system must either always allow a section of code to complete correctly OR it must provide ways for that code to indicate that something went wrong. Basically, a half crash is never an option. Tools should encourage always keeping the program in a known state instead of letting it silently slip into an invalid one. Either work perfectly or crash immediately and tell me where the problem is. And if I don’t have anywhere to “send” that error, it’s going to turn into either a silent failure or a full crash no matter what I do, even if I successfully detect it.
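As a small illustration of having somewhere to “send” an error, here is a hedged Rust sketch using `Result` and the `?` operator. The `Avatar` type, the bone names, and the `AvatarError` cases are assumptions made up for this example.

```rust
use std::fmt;

/// Hypothetical failure cases for validating a player avatar.
#[derive(Debug, PartialEq)]
pub enum AvatarError {
    MissingBone(&'static str),
    EmptyMesh,
}

impl fmt::Display for AvatarError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AvatarError::MissingBone(name) => write!(f, "avatar is missing bone '{}'", name),
            AvatarError::EmptyMesh => write!(f, "avatar mesh has no vertices"),
        }
    }
}

impl std::error::Error for AvatarError {}

/// Hypothetical avatar data, reduced to what the example needs.
pub struct Avatar {
    pub bones: Vec<&'static str>,
    pub vertex_count: usize,
}

fn require_bone(avatar: &Avatar, name: &'static str) -> Result<(), AvatarError> {
    if avatar.bones.contains(&name) {
        Ok(())
    } else {
        Err(AvatarError::MissingBone(name))
    }
}

/// Either the avatar is fully validated, or the caller receives an error
/// naming exactly what was wrong. There is no half-validated outcome.
pub fn validate_avatar(avatar: &Avatar) -> Result<(), AvatarError> {
    if avatar.vertex_count == 0 {
        return Err(AvatarError::EmptyMesh);
    }
    // `?` forwards the first failure to the caller instead of swallowing it.
    require_bone(avatar, "head")?;
    require_bone(avatar, "left_hand")?;
    require_bone(avatar, "right_hand")?;
    Ok(())
}
```

The caller still has to decide what to do with the error, but it can no longer pretend the failure case doesn’t exist, because the return type won’t let it.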

Bandwidth

Code is written for the most part as if it executes in an instant. Yes, we know that we should write optimized code and all that, but that’s optimizing an instant, not the code over time. What do we do when something takes time to complete? By default you get a lag spike. If you’re making an engine for a screen strapped to someone’s face, missing a frame means someone’s entire reality just froze. That’s “fine” most of the time on a PC where the user has a solid disconnect from the game’s systems, but if it’s reality itself? That messes with your head. Therefore, an engine running in a constrained environment must be aware of its constraints and be able to compensate for them. A non-constrained environment doesn’t exist. Even supercomputers have localized data bandwidth bottlenecks that coders must account for.

Outside of specific implementations, bandwidth is not a widely solved issue yet. The workflow right now is to create a system, use profiling tools to determine how much bandwidth it is using up, then manually find bottlenecks in CPU and GPU compute, memory buses, IO, networking, disk, etc. and correct for them on a case-by-case basis. This is a very slow and tedious process requiring domain-specific knowledge of tools and raw intuition.

The solution?

There are two parts to this problem. The tools used to build the engine, and the architecture that the engine provides.

Language

For building the engine, after searching for a long time, attempting to create my own multiple times, and straight up avoiding it, I think Rust is the best option for creating modern engines. NOT because it’s “memory safe”, but because of the way it goes about that. On the micro level it fixes data shape by enforcing a contract at the compiler level that data must always be valid, and may only be accessed in a way that allows other code to access and mutate it in a valid way. Through doing that, it also accidentally solves timing issues on the micro scale.

But it’s not a silver bullet. It’s still up to the engine designer to create something that builds on top of all that and provides an environment that lets a team move quickly.

On the note of moving quickly, I want to address the sentiment that the borrow checker makes coding slower. Yes. On the small scale it makes writing code take longer because you have more things to express to the compiler. But have you ever had to spend weeks debugging a large project full of memory corruption errors? I have. I prefer negotiating with the borrow checker once or twice for a couple hours a week and then having my code work on the first successful compile most of the time.

Architecture

Once we have a language making it so that things are happy on the small scale, it’s on us to solve the issues on a larger scale. Rust doesn’t fix using variables at the wrong time. It doesn’t fix UX errors in the editor. And it doesn’t fix multithreading locks or resource bottlenecks.

The way to solve the timing issue is quite simple in theory, a little harder in practice. Entities in the engine should explicitly declare what data they need, and when they need it. They should also explicitly declare what data they produce, and when. A graph can then be made from data producers to data consumers, that can validate on-construction or mutation if we have any data loops and tell us if that’s a valid flow of data. Additionally when creating those links between producers of data and consumers of data, you should be able to specify if you want data from the tick currently being executed, or the previous frame. That solves most of the circular data issues by allowing some systems to rely on stale data instead.
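A rough sketch of that validation step, assuming a very simplified model where nodes are just names and every link is tagged with the tick it reads from. `Freshness`, `Link`, and `schedule` are hypothetical names. The ordering uses a standard topological sort (Kahn’s algorithm), which is one possible choice rather than the only one.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Which version of the producer's data a consumer wants.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Freshness {
    CurrentTick,
    PreviousTick,
}

/// Declares that `consumer` reads data written by `producer`.
pub struct Link {
    pub producer: &'static str,
    pub consumer: &'static str,
    pub freshness: Freshness,
}

/// Returns an execution order, or the nodes that could not be scheduled
/// because they sit in (or behind) a same-tick data loop.
/// Assumes node names are unique.
pub fn schedule(
    nodes: &[&'static str],
    links: &[Link],
) -> Result<Vec<&'static str>, Vec<&'static str>> {
    // Count unmet same-tick inputs per node. Previous-tick links read stale
    // data that already exists, so they add no ordering constraint.
    let mut pending: BTreeMap<&'static str, usize> =
        nodes.iter().map(|n| (*n, 0)).collect();
    for link in links.iter().filter(|l| l.freshness == Freshness::CurrentTick) {
        assert!(pending.contains_key(link.producer), "link names an undeclared producer");
        *pending
            .get_mut(link.consumer)
            .expect("link names an undeclared consumer") += 1;
    }

    // Start with every node whose same-tick inputs are already satisfied.
    let mut ready: VecDeque<&'static str> =
        nodes.iter().copied().filter(|n| pending[n] == 0).collect();
    let mut order = Vec::new();
    while let Some(node) = ready.pop_front() {
        order.push(node);
        let outgoing = links
            .iter()
            .filter(|l| l.producer == node && l.freshness == Freshness::CurrentTick);
        for link in outgoing {
            let count = pending.get_mut(link.consumer).unwrap();
            *count -= 1;
            if *count == 0 {
                ready.push_back(link.consumer);
            }
        }
    }

    if order.len() == nodes.len() {
        Ok(order)
    } else {
        Err(nodes.iter().copied().filter(|n| !order.contains(n)).collect())
    }
}
```

With links for animation -> ik -> physics -> interaction on the current tick, plus an interaction -> animation link marked `PreviousTick`, this returns the four stages in that order. Switching the feedback link to `CurrentTick` makes it report the loop instead.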

Now, this might sound a little like an ECS where systems can declare dependencies on other systems, but I have a major problem with ECS, and it’s that you can’t cleanly have objects with similar operations happen at different times in the frame. Trees are really hard to represent because of the ordering inherently required for that.

This graph based architecture also makes it a LOT easier on the programmer. You no longer have to care what any other Actor in the system is doing. You can lay out the entire lifecycle of the object in one place, accepting data when needed, and producing data as relevant. By design, other Actors can’t access data at unpredictable moments, meaning they’ll always receive what they expect. You also no longer have to care about global events. Instead they turn into listening to when data from a system or dependency is ready.

This also has a side effect of making the engine a lot more deterministic. Maybe things execute in different orders, but the data flow remains predictable, which is what actually affects the results. If you also disallow “global” data, you turn all Actor computation into pure deterministic functions. Determinism means that networking with prediction becomes significantly easier, because you’ll always get the same result on both computers if you use the same input data.

The graph does not inherently solve bandwidth issues though. It can help by making automatic multithreading very simple from a runtime perspective, but if any node in that execution graph takes too long our frame still hitches. And that’s unacceptable.

The real and only true solution to bandwidth issues is for the engine itself to be aware of its limits in real time, and then be able to scale what it does to keep a healthy margin in each bottlenecked domain.

This could look like the engine doing a quick profile of a device on startup or while it’s running. How fast is it loading things from disk? How long does it take to transfer memory to the GPU? How much upload/download bandwidth do we have from various servers? And so on.

Then once it knows all its limits, we provide it with operations that are negotiable in load. Any code that isn’t crucial to emit the next frame can be time-sliced, and the engine never provides game scripts with anything that can block a thread, so all IO, disk reads, graphics, networking, etc. should be provided through an ergonomic promise/future API.

Then, through benchmarking how long non-negotiable tasks generally take and enforcing a small time budget for each step of a time-sliced operation, you can do all the required computation for a frame, and then spend the remaining time running increments of time-sliced operations until you get close to running out of time budget. And just like that (he said, leaving a lot of implementation details out) you never miss a frame due to CPU-bound operations.
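To show the shape of that loop (and only the shape, with most of those implementation details still left out), here is a minimal Rust sketch. `SlicedTask`, `run_slices`, and `Countdown` are hypothetical names, and `step_budget` is assumed to be a worst-case step duration the engine has already measured.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// A negotiable operation that can be advanced in small increments.
pub trait SlicedTask {
    /// Does one small step of work. Returns `true` once the task is finished.
    fn step(&mut self) -> bool;
}

/// Runs queued steps until the frame deadline gets too close.
/// A step is only started if a full `step_budget` still fits before the
/// deadline. Returns the number of steps executed this frame.
pub fn run_slices(
    queue: &mut VecDeque<Box<dyn SlicedTask>>,
    frame_deadline: Instant,
    step_budget: Duration,
) -> usize {
    let mut steps = 0;
    while let Some(mut task) = queue.pop_front() {
        if Instant::now() + step_budget > frame_deadline {
            // Out of time: put the task back untouched and resume next frame.
            queue.push_front(task);
            break;
        }
        steps += 1;
        if !task.step() {
            // Unfinished: rotate to the back so every task makes progress.
            queue.push_back(task);
        }
    }
    steps
}

/// Toy task that needs `remaining` steps to finish.
pub struct Countdown {
    pub remaining: u32,
}

impl SlicedTask for Countdown {
    fn step(&mut self) -> bool {
        self.remaining = self.remaining.saturating_sub(1);
        self.remaining == 0
    }
}
```

This trusts each step to actually stay within `step_budget`. A real engine would also need to measure steps as they run and react when one overshoots.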

UX / The Development Experience

Before we were discussing how we’d make an engine that functions without issues. But how do we now make the rest of the experience of developing a game tolerable?

Iteration Cycles

In both Unity and especially Unreal, the iteration cycle is way too slow. You should be able to be in the game with a code terminal open and have it update in half a second the moment any asset changes. This is crucial for developing any “feel” based mechanic. And when I say in the game, I literally mean in it. The “soon” to be released Steam Frame is literally a Linux laptop you can strap to your face, which theoretically means you could run a game engine editor on it.

UI

One potentially controversial opinion I hold is that there shouldn’t be automatic “default” values for properties intended to be set from UI. Perhaps there could be a manual button press that fills in defaults. But especially for references, even optional ones, they should exist in an “invalid” state that blocks compiles/bakes until they’re set to a valid value, such as a valid reference. In the case of an optional, it should take a user action to set them to a None value, so that users are less likely to forget to set them. AND if references break, they should be set back to an invalid state, rather than silently switching to a None value.
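In code, that three-state idea could look something like the Rust sketch below. `Slot`, `bake`, and `mark_broken` are hypothetical names for illustration, and a real editor would carry far more context than a property name.

```rust
/// State of a reference property exposed to the editor.
#[derive(Debug, PartialEq)]
pub enum Slot<T> {
    /// Never assigned, or the reference broke. Blocks baking.
    Invalid,
    /// The user took an explicit action to say "nothing goes here".
    ExplicitNone,
    /// Points at a valid value.
    Set(T),
}

impl<T> Slot<T> {
    /// Converts to a plain `Option` for runtime use, or reports the property
    /// name so the bake can be blocked with a useful message.
    pub fn bake(self, property: &str) -> Result<Option<T>, String> {
        match self {
            Slot::Invalid => Err(format!("property '{}' has not been set", property)),
            Slot::ExplicitNone => Ok(None),
            Slot::Set(value) => Ok(Some(value)),
        }
    }

    /// Called when the referenced asset is deleted or moved. A broken
    /// reference falls back to `Invalid`, never to `ExplicitNone`.
    pub fn mark_broken(&mut self) {
        if matches!(self, Slot::Set(_)) {
            *self = Slot::Invalid;
        }
    }
}
```

At runtime the game code only ever sees the baked `Option<T>`, so the “forgot to set it” state can’t leak past the editor.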

Also, features of an engine hidden in a sub-menu basically don’t exist until a user does a Google search and finds them. The UI of an engine should naturally guide a user to the features they want through normal operation of the program.

Source Control/Artifact distribution

Modern source control is insufficient for working with game engines at the moment. When working with text-based code, everything is just text and can all be stored in a text-based source control. But what happens as soon as you have a team and you want to distribute binaries to them? Are you going to have the art team compile the engine on every change? Are you going to have programmers run light bakes on their computers? The answers to these questions should be no, and one solution is to store those artifacts in some form of source control.

But that’s a terrible idea for many reasons. The main clue is in the name: “Source control”. By nature source control systems are for storing the source of your build process, not any baked/compiled/cached data derived from it. Also, do you need your servers storing multiple end versions of light bakes and binaries? No! At most you just need the last couple versions of those things, and then if you revert, you just re-generate those things from source as needed. But on standard source control those will just exist forever in your history now.

What we need is something more like a “project syncing solution”, where source control is one half of it and a way to distribute cached things derived from that source is the other, all in one executable. Now if your artist needs anything that requires a toolchain, theoretically you can just build it on your side and tell them to refresh the project.

Also, an engine should provide an in-UI merging solution for EVERY asset type. Branches are a necessary solution for working on features in parallel without blocking team members or compromising the ability to ship patches if needed. BUT in a lot of game dev use cases they go completely unused, because merging engine-native assets is often significantly harder than it should be, so teams resort to checking out assets and blocking team members constantly. Ideally you should have a neat little UI tool that lets you navigate through every engine asset conflict in the project, so merging becomes easy and painless enough that keeping all your branches from going stale is a 3 minute process every now and then.

Who builds it?

Well… I guess I am, but I don’t plan on releasing anything soon, this’ll be proprietary tech I’m working on to use with friends.

But I hope sharing this inspires other people to take small bits and pieces here and there to improve their development wherever they are, using existing engines or something more custom. I’ll likely share the knowledge I gain along the way with the hopes that it increases the quality of what we see everywhere.

Liked it? Take a second to support WireWhiz on Patreon!

The Perfect Game Engine, For a VR Studio
