Using Clang to Minimize Global Variable Use

July 23, 2020

Every non-trivial program has at least some amount of global state, but too much can be a bad thing. In C++ (which constitutes close to 100% of Roblox’s engine code) this global state is initialized before main() and destroyed after returning from main(), and this happens in a mostly non-deterministic order. In addition to leading to confusing startup and shutdown semantics that are difficult to reason about (or change), it can also lead to severe instability.

Roblox code also creates a lot of long-running detached threads (threads which are never joined and just run until they decide to stop, which might be never). These two things together have a very serious negative interaction on shutdown, because long-running threads continue accessing the global state that is being destroyed. This can lead to elevated crash rates, test suite flakiness, and just general instability.

The first step to digging yourself out of a mess like this is to understand the extent of the problem, so in this post I’m going to talk about one technique you can use to gain visibility into your global startup flow. I’m also going to discuss how we are using this to improve stability across the entire Roblox game engine platform by decreasing our use of global variables.

Introducing -finstrument-functions

Nothing excites me more than learning about a new obscure compiler option that I’ve never had a use for before, so I was pretty happy when a colleague pointed me to this option in the Clang Command Line Reference. I’d never used it before, but it sounded very cool. The idea being that if we could get the compiler to tell us every time it entered and exited a function, we could filter this information through a symbolizer of some kind and generate a report of functions that a) occur before main(), and b) are the very first function in the call-stack (indicating it’s a global).

Unfortunately, the documentation basically just tells you that the option exists with no mention of how to use it or if it even actually does what it sounds like it does. There’s also two different options that sound similar to each other (-finstrument-functions and -finstrument-functions-after-inlining), and I still wasn’t entirely sure what the difference was. So I decided to throw up a quick sample on godbolt to see what happened, which you can see here. Note there are two assembly outputs for the same source listing. One uses the first option and the other uses the second option, and we can compare the assembly output to understand the differences. We can gather a few takeaways from this sample:

The compiler is injecting calls to __cyg_profile_func_enter and __cyg_profile_func_exit inside of every function, inline or not.
The only difference between the two options occurs at the call-site of an inline function.
With -finstrument-functions, the instrumentation for the inlined function is inserted at the call-site, whereas with -finstrument-functions-after-inlining we only have instrumentation for the outer function. This means that when using-finstrument-functions-after-inlining you won’t be able to determine which functions are inlined and where.

Of course, this sounds exactly like what the documentation said it did, but sometimes you just need to look under the hood to convince yourself.

To put all of this another way, if we want to know about calls to inline functions in this trace we need to use -finstrument-functions because otherwise their instrumentation is silently removed by the compiler. Sadly, I was never able to get -finstrument-functions to work on a real example. I would always end up with linker errors deep in the Standard C++ Library which I was unable to figure out. My best guess is that inlining is often a heuristic, and this can somehow lead to subtle ODR (one-definition rule) violations when the optimizer makes different inlining decisions from different translation units. Luckily global constructors (which is what we care about) cannot possibly be inlined anyway, so this wasn’t a problem.

I suppose I should also mention that I still got tons of linker errors with -finstrument-functions-after-inlining as well, but I did figure those out. As best as I can tell, this option seems to imply –whole-archive linker semantics. Discussion of –whole-archive is outside the scope of this blog post, but suffice it to say that I fixed it by using linker groups (e.g. -Wl,–start-group and -Wl,–end-group) on the compiler command line. I was a bit surprised that we didn’t get these same linker errors without this option and still don’t totally understand why. If you happen to know why this option would change linker semantics, please let me know in the comments!

Implementing the Callback Hooks

If you’re astute, you may be wondering what in the world __cyg_profile_func_enter and __cyg_profile_func_exit are and why the program is even successfully linking in the first without giving undefined symbol reference errors, since the compiler is apparently trying to call some function we’ve never defined. Luckily, there are some options that allow us to see inside the linker’s algorithm so we can find out where it’s getting this symbol from to begin with. Specifically, -y should tell us how the linker is resolving . We’ll try it with a dummy program first and a symbol that we’ve defined ourselves, then we’ll try it with __cyg_profile_func_enter .

zturner@ubuntu:~/src/sandbox$ cat instr.cpp

int main() {}

zturner@ubuntu:~/src/sandbox$ clang++-9 -fuse-ld=lld -Wl,-y -Wl,main instr.cpp

/usr/bin/../lib/gcc/x86_64-linux-gnu/crt1.o: reference to main

/tmp/instr-5b6c60.o: definition of main

No surprises here. The C Runtime Library references main(), and our object file defines it. Now let’s see what happens with __cyg_profile_func_enter and -finstrument-functions-after-inlining.

zturner@ubuntu:~/src/sandbox$ clang++-9 -fuse-ld=lld

-finstrument-functions-after-inlining -Wl,-y -Wl,__cyg_profile_func_enter instr.cpp

/tmp/instr-8157b3.o: reference to __cyg_profile_func_enter

/lib/x86_64-linux-gnu/libc.so.6: shared definition of __cyg_profile_func_enter

Now, we see that libc provides the definition, and our object file references it. Linking works a bit differently on Unix-y platforms than it does on Windows, but basically this means that if we define this function ourselves in our cpp file, the linker will just automatically prefer it over the shared library version. Working godbolt link without runtime output is here. So now you can kind of see where this is going, however there are still a couple of problems left to solve.

We don’t want to do this for a full run of the program. We want to stop as soon as we reach main.
We need a way to symbolize this trace.

The first problem is easy to solve. All we need to do is compare the address of the function being called to the address of main, and set a flag indicating we should stop tracing henceforth. (Note that taking the address of main is undefined behavior[1], but for our purposes it gets the job done, and we aren’t shipping this code, so ¯\_(ツ)_/¯). The second problem probably deserves a little more discussion though.

Symbolizing the Traces

In order to symbolize these traces, we need two things. First, we need to store the trace somewhere on persistent storage. We can’t expect to symbolize in real time with any kind of reasonable performance. You can write some C code to save the trace to some magic filename, or you can do what I did and just write it to stderr (this way you can pipe stderr to some file when you run it).

Second, and perhaps more importantly, for every address we need to write out the full path to the module the address belongs to. Your program loads many shared libraries, and in order to translate an address into a symbol, we have to know which shared library or executable the address actually belongs to. In addition, we have to be careful to write out the address of the symbol in the file on disk. When your program is running, the operating system could have loaded it anywhere in memory. And if we’re going to symbolize it after the fact we need to make sure we can still reference it after the information about where it was loaded in memory is lost. The linux function dladdr() gives us both pieces of information we need. A working godbolt sample with the exact implementation of our instrumentation hooks as they appear in our codebase can be found here.

Putting it All Together

Now that we have a file in this format saved on disk, all we need to do is symbolize the addresses. addr2line is one option, but I went with llvm-symbolizer as I find it more robust. I wrote a Python script to parse the file and symbolize each address, then print it in the same “visual” hierarchical format that the original output file is in. There are various options for filtering the resulting symbol list so that you can clean up the output to include only things that are interesting for your case. For example, I filtered out any globals that have boost:: in their name, because I can’t exactly go rewrite boost to not use global variables.

The script isn’t as simple as you would think, because simply crawling each line and symbolizing it would be unacceptably slow (when I tried this, it took over 2 hours before I finally killed the process). This is because the same address might appear thousands of times, and there’s no reason to run llvm-symbolizer against the same address multiple times. So there’s a lot of smarts in there to pre-process the address list and eliminate duplicates. I won’t discuss the implementation in more detail because it isn’t super interesting. But I’ll do even better and provide the source!

So after all of this, we can run any one of our internal targets to get the call tree, run it through the script, and then get output like this (actual output from a Roblox process, source file information removed):

excluded_symbols = [‘.*boost.*’]

excluded_modules = [‘/usr.*’]

/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1: 140 unique addresses

InterestingRobloxProcess: 38928 unique addresses

/usr/lib/x86_64-linux-gnu/libstdc++.so.6: 1 unique addresses

/usr/lib/x86_64-linux-gnu/libc++.so.1: 3 unique addresses

Printing call tree with depth 2 for 29276 global variables.

__cxx_global_var_init.5 (InterestingFile1.cpp:418:22)

RBX::InterestingRobloxClass2::InterestingRobloxClass2() (InterestingFile2.cpp.:415:0)

__cxx_global_var_init.19 (InterestingFile2.cpp:183:34)

(anonymous namespace)::InterestingRobloxClass2::InterestingRobloxClass2()

(InterestingFile2.cpp:171:0)

__cxx_global_var_init.274 (InterestingFile3.cpp:2364:33)

RBX::InterestingRobloxClass3::InterestingRobloxClass3()

So there you have it: the first half of the battle is over. I can run this script on every platform, compare results to understand what order our globals are actually initialized in in practice, then slowly migrate this code out of global initializers and into main where it can be deterministic and explicit.

Future Work

It occurred to me sometime after implementing this that we could make a general purpose profiling hook that exposed some public symbols (dllexport’ed if you speak Windows), and allowed a plugin module to hook into this dynamically. This plugin module could filter addresses using whatever arbitrary logic that it was interested in. One interesting use case I came up for this is that it could look up the debug information, check if the current address maps to the constructor of a function local static, and write out the address if so. This effectively allows us to gain a deeper understanding of the order in which our lazy statics are initialized. The possibilities are endless here.