Introduction
The Roblox engine is written in a combination of C++ and Lua, with the code that performs computationally intensive operations written in optimized C++, while game logic and scripts are written in Lua, for ease of development. For this model to be effective, it requires that the transitions between Lua and C++ be as fast as possible, as any time spent in this no man’s land is essentially just wasted milliseconds.
Over the past couple of months, we’ve been rolling out various improvements to this part of the system. One part specifically—the actual invocation of C++ methods from Lua—was particularly interesting, as it led to considerable speed improvements and required digging around in the guts of Lua to understand how things worked under the hood.
We ended up modifying the Lua VM itself, but before we get to that, we need to lay some groundwork.
Compilers, VM, and bytecode
When Lua source code is compiled, it’s compiled into Lua bytecode, which the Lua VM then runs. Lua bytecode has around 35 instructions in total, for things like reading/writing tables, calling functions, performing binary operations, jumps and conditionals, and so on. The Lua VM is register-based, as opposed to being stack-based like many other VMs, so part of what the compiler does when it generates bytecode is determine which registers each instruction should use.
Each instruction has the form “OP_CODE A B,” or “OP_CODE A B C,” where “OP_CODE” is the opcode (for example, CALL for calling a function) and A/B/C are the opcode arguments. The arguments (or registers) aren’t actual values. Instead, they are indices that point into one of two tables: the constant table (written Kst(..)) or the register table (written R(..)).
Interlude
For a detailed description of Lua bytecode, see “A No-Frills introduction to Lua 5.1 VM Instructions.” It’s a lot more exciting than it sounds; I promise!
To give you a feel for what Lua bytecode looks like, we’re going to go over some simple programs first and then progress to some more relevant examples.
Interlude
Using the Chunkspy utility, we can disassemble Lua bytecode into Lua assembly and get a listing of the code, as well as the constant table, so we can essentially see what bytecode gets generated for any given Lua source code.
Basic bytecode examples
A simple program like “x = 10” compiles into:
.const "x" ; 0 .const 10 ; 1 [1] loadk 0 1 ; 10 [2] setglobal 0 0 ; x
The first two lines show the constant table (with the string value “x” in slot 0 and the integer value 10 in slot 1), and the following two lines are the disassembled opcodes.
[Line 1] Looking up the LOADK opcode in “No Frills,” we see that it has the form “LOADK A Bx — R(A) := Kst(Bx).” So, LOADK has two arguments (registers A and B) and its operation is to assign the value found in the constant table at the slot given by the second register, Kst(Bx), to the register table in the slot given by the first argument, R(A). “Bx” just means that because the opcode only has two arguments, the B register is extended and assigned more bits.
[Line 2] SETGLOBAL has the form “SETGLOBAL A Bx — Gbl[Kst(Bx)] := R(A).” It assigns a value to the global table using the key given by the constant table at the slot of the second argument. Because the second argument is 0 and the value of the constant table at 0 is “x,” it writes something to the global table using the key “x.” Whatever is in the register table at the slot given by the first argument is what’s being written, which the previous instruction loaded with the value 10.
Let’s look at a slightly more complicated example, “x = 10; y = x.” I’ll leave the manual execution of the code as an exercise for the reader. 🙂
.const "x" ; 0 .const 10 ; 1 .const "y" ; 2 [1] loadk 0 1 ; 10 [2] setglobal 0 0 ; x [3] getglobal 0 0 ; x [4] setglobal 0 2 ; y
Bytecode for function calls
Let’s look at the code generated for “foo(10):”
.const "foo" ; 0 .const 10 ; 1 [1] getglobal 0 0 ; foo // R(A) := Gbl[Kst(Bx)] [2] loadk 1 1 ; 10 // R(A) := Kst(Bx) [3] call 0 2 1
To execute function calls, the function must be loaded into the first register and arguments into subsequent ones. The semantics for “CALL A B C” are such that A holds the function, B is the number of arguments (actually, it’s the number of arguments +1, due to the way “…” is implemented), and C is the number of return values (again, it’s the number of return values +1, to handle multiple return values).
We’re familiar with the first two lines; they load a value into register table slot 0 and the value 10 into register table slot 1. The third line is what performs the function call, using the value in register A (register table slot 0, which was loaded with “foo”), with B specifying the number of arguments, and C the number of return values (remember, both B and C should have 1 added to their values). Before the function can be called, the VM also verifies that the value in R(A) is in fact callable.
Interlude — Metafunctions, metatables, and userdata
Lua has a mechanism that allows users to extend the functionality of tables by associating a metatable with an existing table. The metatable contains fallback methods that are invoked if a certain method or operation can’t be performed on the main table (see https://www.lua.org/pil/13.html for a thorough description).
For our purposes, the most relevant entries in the metatable are the “__index” and “__call” fields. __index is used when looking up an element in a table, so the code “local x = my_table[10]” would first call the __index method on my_table. If that failed, it would instead try to call __index on my_table’s metatable. __call is similarly used when you try to treat something as a function and call it “local x = foo(42),” for example
In order for Lua and C++ to interoperate, they need some way to be able to share functions and data. Lua facilitates this by providing a data type called UserData. UserData objects can be created in C++ land, and because they are native Lua data types, they can be adorned with metatables that allows Lua code to interact with them as if they were just regular Lua objects.
Member function calls
Okay, back to looking at some bytecode! This next example is a bit more interesting because it shows what happens when you have code like “foo:bar(10),” which is calling the bar method on the instance foo (an instance of the class Foo).
foo:bar(10) .const "foo" ; 0 .const "bar" ; 1 .const 10 ; 2 [1] getglobal 0 0 ; foo [2] self 0 0 257 ; "bar" [3] loadk 2 2 ; 10 [4] call 0 3 1
The new thing here is the self instruction [line 2], which we haven’t seen before. Self has the syntax “SELF A B C — R(A) := R(B)[RK(C)]; R(A+1) := R(B),” so let’s break this down. In the register table at slot R(A), it will place the result of looking up the table in register slot R(B) using the key in slot RK(C). It will also copy whatever was in slot R(B) into slot R(A+1), but more on this later. You might notice that the value of the C register is 257. This is valid because Lua is using RK(C) to look up the value, and RK will either use the register table or the constant table, depending on the value of 9:th bit. If it’s a 1, which it is in this case, then the constant table is used; otherwise, the lookup goes to the register table (after masking out the highest bit).
Line 3 places 10 in slot 2, and finally line 4 will execute the function call.
The SELF instruction serves two purposes. First, it looks for the “bar” method on the Foo class, and places it in R(A). Secondly, because foo is an instance method and we need the instance of the class that we’re invoking the method on when doing the call, it places this instance in R(A+1). If you’re familiar with classes in Python, then you might recognize the concept: methods are usually written as “def my_method(self, arg1, arg2..),” where self is the class instance.
We’ll need to dig into this a little deeper and take a look at what happens when the foo instance is a C++ object, represented in Lua as a UserData object.
The SELF call can be seen as a table lookup, i.e. Foo[“bar”] (capital Foo represents the class Foo, as opposed to foo, the instance), and we know that lookups will use the __index method. When the foo instance was created in C++ land, a metatable was associated with the instance, and the metatable had its __index field set to a piece of C++ code that will get called when __index is invoked.
Interlude — lua_State
When C/C++ gets called from Lua, the only data that gets passed is a lua_State object. This object contains everything related to the currently running Lua thread. The most important information in the state object is the Lua stack, which contains the function arguments (accessed via the lua_tointeger/tostring etc family of functions) and is also used to return values back to Lua.
In pseudo-C++, our __index function looks something like this:
int metaIndex(lua_State* L) { // first argument is the userdata object UserData* userdata = lua_touserdata(L, 1); // get some kind of descriptor, that contains information // about what methods the class exposes ClassDescriptor* desc = getDescriptorForUserData(userdata); // See if the class has the requested method const char* methodName = lua_tostring(L, 2); MemberFunctionPtr method = desc->hasMethod(methodName); if (method) { // Upvalues are values that are available when a C // function is invoked. lua_pushupvalue(L, method); lua_pushcfunction(L, methodInvoker); return 1; } else { lua_pushnil(L); return 0; } }
Lots of internals are glossed over, but here’s the gist of it. Given that the UserData object passed as the first argument on the Lua stack, we’re able to find a descriptor that describes the actual C++ class, and via the descriptor we can see if this class has a method with the given name. If it does, a function pointer to a method invoker is pushed on the Lua stack, and we return success.
After this call, the Lua VM will place the rest of the arguments in the register table, and then call the function we returned from the metaIndex method, which will again call into C++, and land in the invoker function:
int methodInvoker(lua_State* L) {
// Get the userdata and the class descriptor UserData* userdata = lua_touserdata(L, 1); ClassDescriptor* desc = getDescriptorForUserData(userdata); Class* instance = (Class*)userdata; // Using Lua's upvalue mechanism, get the 'method' // that was stored in metaIndex. MemberFunctionPtr method = lua_getupvalue(L, 1); // This is hand-wavey, but we have some mechanism of being // able to invoke a member function via the class descriptor, // and also pop arguments from the Lua stack, and push return values return desc->invokeFunction(instance, method, L); }
The methodInvoker also uses the ClassDescriptor, but this time it’s able to invoke the member function, and pop the correct arguments from the stack.
The home stretch!
Now that we can clearly see the two round trips from Lua to C++, we can try to figure out how to optimize this.
Our end goal is to do a single function call from Lua to C++ and have all the pieces we need on the Lua stack to be able to do method lookup and invocation at once. The problem seems to be that we have one register too few. When we call our combined lookup/invoker function, we want the Lua stack to look like [self, method name, arg1, arg2, …], but looking at SELF, we see that it uses its first slot for the result of looking up the method function and the second slot for storing the instance.
A key realization came when we looked at the way the __call metamethod worked. If an object has the __call metamethod, then before the _call function gets invoked, the object itself is pushed on the stack and all arguments are shifted up. By piggy-backing on this functionality, there was a way of getting “self” on the stack without having to explicitly store it in a register.
The second part involved getting the method name on the stack as well. For this, we had to be sneaky and alter the workings of the SELF opcode.
Remember that in the default case, SELF would try to find the member function and store this in R(A) together with the instance in R(A+1). We ended up skipping the lookup altogether and stored the actual object in R(A) and the method name in R(A+1).
If we now made sure that the object in R(A) had a __call metamethod, then we’d also end up pushing self on the stack. So, we’d have a stack that looked like [self, method name, args…] and did just a single call into C++. Perfect! Well, almost. 🙂
Before considering this done, we wanted to put some final polish on it. We didn’t want to overload the semantics of the __call metamethod, so instead we added a specific metamethod for this type of invocation—called __namecall—that was only available on UserData objects. We also modified the SELF opcode so it only uses the new semantics if the object has a __namecall metamethod.
The second thing we did was mostly make the new path and the old path able to share code easily. Instead of having the method-name as the second argument, we pushed it to the last argument. So, after it had been used to look up the method pointer, it could easily be popped off the stack and the stack looked like it did if the function had been invoked via the old path.
Conclusion
How much of an impact does this optimization have? Well, like most things in programming, the answer is “it depends.” For functions that are heavyweight—and you don’t call that often—you won’t see much of an improvement. But for smaller functions that you call often, the savings can be considerable.
People on the Developer Forum quickly noticed the appearance of this strange, new metamethod, and a table was presented that compared the speed of __namecall against both the old method of calling instance methods and against a workaround that developers had been using to optimize method calling:
local part = workspace.Baseplate local count = 1000000 local start0 = tick() for i=1,count do part:IsA("BasePart") end local end0 = tick() local start1 = tick() for i=1,count do local isa = part.IsA isa(part, "BasePart") end local end1 = tick() local start2 = tick() local isa = part.IsA for i=1,count do isa(part, "Basepart") end local end2 = tick() print("namecall", end0 - start0) print("index+call", end1 - start1) print("call", end2 - start2) > namecall 0.49229717254639 > index+call 0.78510332107544 > call 0.49960780143738
The first loop uses the new __namecall code path, but because all the magic happens under the hood, there is no need for developers to change any existing code to benefit from the optimization.
The second loop emulates the old way of doing an instance method call; first doing a lookup to find the method and then invoking it.
And finally, the third loop shows a common optimization that developers were doing, where the method was first looked up, stored in a local variable, and then the variable was invoked.
The nice thing here is that it shows that with the __namecall optimization, it’s no longer necessary to explicitly cache instance functions, as it’s just as fast as the cached optimization, so the most straightforward code will also be the most performant.
Now that __namecall has been deployed, and we’re happy with the results we’re seeing, it’s time to turn our focus to memory usage, and see what we can do to improve the client in that area!