inline does not inline

Over the many years of C++’s life, some keywords, such as auto, register, and export, became obsolete, and most of them were later reused with a new meaning. If you know what these meant originally, congratulations, you’re old!

inline went through subtler changes that were not directly caused by the language standard redefining it, yet its current meaning has nothing to do with the original. C++17 formally acknowledged the status quo and built upon it.

C++ is really averse to introducing new keywords[1], so even though inline officially has nothing to do with inlining anymore (as we’ll see later), it was kept instead of being renamed to something more appropriate, such as shared.

What it was

inline was originally intended to be a compiler hint to inline a function, but as compilers evolved, it became apparent that programmers are terrible at hinting these things correctly. Better code could be generated by more or less ignoring the hints (or accepting them only as a small nudge, depending on the compiler) and deciding what to inline based on the compiler’s far more advanced heuristics.

This goes both ways: compilers inline functions that aren’t marked inline, and emit regular function calls for functions that are.

This is the same fate that befell for(register int i = 0; i < n; ++i): once no compiler cared about register, it eventually fell out of use. As of C++17, register is an unused keyword to be repurposed later.

What it has become

inline has a far more useful side effect than its originally intended purpose: something defined as inline, either explicitly or implicitly, can be defined in multiple translation units[2] without violating the One Definition Rule (ODR).

When an inline function is compiled, it’s placed into a COMDAT section (or its equivalent on your platform) instead of, e.g., .text, where most functions would normally go. If the linker sees multiple definitions of something in such a section, it treats them as intended duplicates: instead of raising an error, it discards all but one copy[3], as if the inline function had been declared extern and implemented in exactly one .cpp file.
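As a minimal sketch of what this buys you, assume a hypothetical header counter.h that is included from several .cpp files. The function name next_id is made up for illustration. Because the function is inline, every translation unit may contain its definition, and the linker keeps one copy, so the static local inside it is also a single object for the whole binary.

```cpp
// counter.h (hypothetical header, included from several .cpp files)

// Without `inline`, including this header from two .cpp files would
// produce a duplicate-symbol error at link time.
inline int next_id() {
    // One counter per linked binary, not one per translation unit.
    static int counter = 0;
    return ++counter;
}
```

Every caller in the same binary sees the same sequence of ids, no matter which .cpp file the call comes from.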

inline variables

This is exactly what happens to inline variables, which is why it’s better to think of inline as shared, merged, or duplicates_are_allowed. Inlining a global (non-const[4]) variable does not make sense: you want all of them to end up in the same memory location, reducing the number of copies. Inlining makes extra copies.

They can cause issues if your project consists of more than one binary: the linker will do its job and discard duplicates once per link, so you could end up with one copy in your .exe, another in a .dll[5], …

In an Unreal project, you’ll get both: Development builds are one-binary-per-module, and Shipping builds are monolithically linked together by default, so you can’t even rely on having the same number of copies.

My personal suggestion is to limit inline variables to constants, where it does not matter which copy you’re looking at. Prefer constexpr where the type allows it: constexpr static data members are implicitly inline, while namespace-scope constants need inline constexpr to be shared rather than duplicated per translation unit. Use inline const as a fallback for types that cannot be constexpr. Of course, anything mutable or used with a const_cast would violate this.
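A minimal sketch of that suggestion, assuming a hypothetical header with made-up names (kMaxPlayers, kDefaultName, Limits). Spelling out inline on the namespace-scope constant is deliberate here, so the example does not rely on remembering which declarations are implicitly inline.

```cpp
// constants.h (hypothetical header, included from several .cpp files)
#include <string>

// Compile-time constant, shared across translation units.
inline constexpr int kMaxPlayers = 16;

// std::string is used as an example of a type that may not be usable as a
// constexpr variable, so fall back to inline const.
inline const std::string kDefaultName = "player";

struct Limits {
    // constexpr static data members are implicitly inline since C++17.
    static constexpr int kMaxTeams = 4;
};
```

All three are read-only, so it does not matter if a second binary in the same process ends up with its own copy.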

The confusion

As such, we ended up with inlining, an important optimization technique, and inline, a keyword with a very similar-sounding name that has nothing to do with it, or even has the opposite effect. It is, however, often used in code that’s written to help with inlining, making for a significant correlation that fuels the confusion. Official documentation contradicting itself is not helping either.

Let’s go through some examples to see what really happens! This is on MSVC, with optimization disabled:

inline int foo(int ecx) {
    int rsp_8 = ecx;
    int eax = rsp_8;
    ++eax;
    return eax;
}

int bar()
{
    int ecx = 1;
    return foo(ecx);
}

int foo(int) PROC                      ; foo, COMDAT
        mov     DWORD PTR [rsp+8], ecx
        mov     eax, DWORD PTR 8[rsp]
        inc     eax
        ret     0
int foo(int) ENDP

int bar(void) PROC                     ; bar
        sub     rsp, 40
        mov     ecx, 1
        call    int foo(int)
        add     rsp, 40
        ret     0
int bar(void) ENDP

In this first example, inline caused foo to be placed in a COMDAT section and no inlining happened, since bar contains a call instruction for foo. The linker will see this, find the one single copy of foo, discard 0 extras as instructed, and link normally, rendering inline more or less pointless here.

On the other hand, if we remove inline and enable optimization:

int foo(int ecx) {
    return ecx + 1;
}

int bar() {
    return foo(1); // return 2;
}

int foo(int) PROC                      ; foo, COMDAT
        lea     eax, DWORD PTR [rcx+1]
        ret     0
int foo(int) ENDP

int bar(void) PROC                     ; bar, COMDAT
        mov     eax, 2
        ret     0
int bar(void) ENDP

The compiler decided on its own to place even bar in a COMDAT section (it’s free to do so), and to precompute foo’s result instead of inlining the call, as if foo were constexpr. This example illustrates how using these keywords to control inlining, or as an optimization attempt, is often just a placebo.

When to (not) use inline?

As we just saw, inline does not really affect inlining. The compiler will inline or not inline calls regardless of its presence, assuming it knows about the implementation. As the opening section alluded to, its main use is for what used to be its side effect, allowing multiple definitions of something.

Inlining still remains an important tool in the compiler’s toolbox though, and it needs your help to do it properly.

If your project calls functions across binaries (.dll, .so, .dylib) or even just across object files making up one binary in case you’re not using link-time optimization, those calls cannot be inlined even if the compiler wanted to: it can’t inline what it can’t see.

To alleviate this, you can move code from .cpp to .h files (typically, small methods because doing so negatively affects compile times), which is where inline is sometimes needed if you’re not writing code that’s implicitly inline already (non-template functions outside classes usually require inline). Hopefully with the community’s eventual adoption of C++20 modules, this will improve.
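As a rough sketch of which header definitions need the keyword, assume a hypothetical header vec2.h (all names are made up for illustration):

```cpp
// vec2.h (hypothetical header, included from several .cpp files)

struct Vec2 {
    float x;
    float y;

    // Defined inside the class body: implicitly inline, no keyword needed.
    float dot(const Vec2& other) const { return x * other.x + y * other.y; }
};

// Function template: may be defined in several translation units
// without the keyword.
template <typename T>
T twice(T value) { return value + value; }

// Ordinary free function defined in a header: needs `inline`, otherwise
// every .cpp file including this header emits a conflicting definition.
inline float length_squared(const Vec2& v) { return v.dot(v); }
```

In all three cases the point is the same: the body is visible at the call site, so the compiler is able to inline it if its heuristics say so.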

FORCEINLINE

Let’s go one step further! Another very popular placebo is FORCEINLINE (__forceinline, [[clang::always_inline]], etc.).

Other than causing issues such as breaking debugging, it does not actually force inlining. Depending on the compiler, it will often consider your request since you’re already using a nonstandard extension, but a function may simply not be eligible for inlining. Even if a function is written so that it is eligible on its own, its usage might render it ineligible anyway, which can be hard to track down in a larger project.
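For context, such a macro is typically a thin wrapper over compiler-specific extensions. The sketch below uses a hypothetical name, MY_FORCEINLINE, and is not taken from any particular engine or library. It only maps to __forceinline on MSVC and to the always_inline attribute on GCC and Clang.

```cpp
// Hypothetical portability macro; the name MY_FORCEINLINE is made up.
#if defined(_MSC_VER)
#  define MY_FORCEINLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define MY_FORCEINLINE inline __attribute__((always_inline))
#else
#  define MY_FORCEINLINE inline  // no known extension: plain inline
#endif

// A trivial function is the kind of candidate where the request is
// likely to be honored.
MY_FORCEINLINE int add_one(int x) { return x + 1; }
```

Note that the macro still expands to something containing inline semantics, so the function remains safe to define in a header regardless of whether the inlining request is honored.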

The following example was compiled with Clang at -O3:

[[clang::always_inline]] int fibonacci(int x)
{
    if (x <= 1)
        return x;
    return fibonacci(x - 1) * fibonacci(x - 2);
}

int bar(int x)
{
    return fibonacci(x);
}

fibonacci(int):                         # @fibonacci(int)
        push    r14
        push    rbx
        push    rax
        mov     r14d, edi
        mov     ebx, 1
        cmp     edi, 2
        jge     .LBB0_2
        mov     ecx, r14d
        jmp     .LBB0_3
.LBB0_2:
        lea     edi, [r14 - 1]
        call    fibonacci(int)
        lea     ecx, [r14 - 2]
        imul    ebx, eax
        cmp     r14d, 3
        mov     r14d, ecx
        ja      .LBB0_2
.LBB0_3:
        imul    ebx, ecx
        mov     eax, ebx
        add     rsp, 8
        pop     rbx
        pop     r14
        ret
bar(int):                               # @bar(int)
        jmp     fibonacci(int)          # TAILCALL

There’s a lot going on! You can ignore most of the generated code, but there are a few key takeaways.

First of all, fibonacci obviously did not get inlined. A different optimization called a tail call was performed, where bar transfers control (jmp) instead of calling it (call). This is possible because bar ends with returning the value from fibonacci without any extra conversion, so fibonacci might as well return it on bar’s behalf and its caller will be none the wiser.
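A small sketch of what makes a call eligible for this, using hypothetical functions. In this single-file form the compiler may well inline helper outright, so treat it as an illustration of the shape of the code rather than a guarantee about the generated assembly.

```cpp
// Hypothetical callee; imagine it lives in another translation unit.
int helper(int x) { return x * 2; }

// The call is the last thing the function does and its result is returned
// unchanged, so the compiler may turn the call into a jump.
int pass_through(int x) { return helper(x); }

// Work remains after the call (the + 1), so control has to come back here
// and the call cannot simply be replaced by a jump.
int pass_through_plus_one(int x) { return helper(x) + 1; }
```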

fibonacci also ends up calling itself once instead of the two calls that are in its source code. These transformations are far more involved than “inlining, yes/no” and you have no hope of expressing this nuance with C++ keywords.

I won’t waste space with more code blocks, but making fibonacci regular inline or not inline results in identical assembly. The “force” inline was completely ignored.

I’m also skipping examples for inline+inlining happening together. I don’t think this one would surprise anyone, and it does happen very often.

When to (not) use FORCEINLINE?

So, what are legitimate uses of FORCEINLINE? In line with the first rule of optimization (“don’t optimize”), you should default to not using it at all. You may stop reading now.

One use of it is sacrificing debuggability in debug builds for performance, which is the opposite of what one normally wants from a debug build. As such, it should only be used very sparingly, on functions that are trivial yet called so often that it causes an actual problem for programmers.

Compilers are smart even in debug builds: for instance, MSVC will still skip some calls whose effects it intrinsically knows and there’s no value in debugging, such as std::move.

Another use of it is genuine optimization (when you’re on the third rule of optimization or beyond). This happens once your source code and even your compiler version are mostly frozen, and you have proof that using it on some functions actually makes things better.

This is not a case of littering FORCEINLINE on functions that “gotta go fast” and calling it a day. The 80/20 rule still applies, and even in a relatively large codebase, you’ll likely end up with only a small fraction of functions needing this treatment, assuming you’re already using a monolithic release build with LTCG/LTO + PGO.

Inlining can slow things down (it increases pressure on the instruction cache among other things), and depending on your code, you can do better than it. In one outlier instance, I managed to outperform FORCEINLINE by 30% in vectorized code by not using it and instead moving an if from the function to a few of its call sites. Sometimes, the optimum lies between a full call and inlining.
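The shape of that middle ground can be sketched as follows. This is not the code from the case described above; the function names and the check are hypothetical and only illustrate moving a cheap test out of a function and into its call site, while the heavy part stays an ordinary call.

```cpp
#include <vector>

// Worker with no early-out inside: the loop body stays branch-free,
// which keeps it simple for the vectorizer.
void scale_all(std::vector<float>& values, float factor) {
    for (float& v : values) {
        v *= factor;
    }
}

// Call site: the cheap check lives here, so the call is skipped entirely
// when there is nothing to do, without inlining the whole loop.
void apply_gain(std::vector<float>& samples, float gain) {
    if (gain != 1.0f) {
        scale_all(samples, gain);
    }
}
```

Whether this beats full inlining depends on the compiler, the data, and the CPU, so measure before and after.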

I expect future compilers to eventually deal with that particular case, but on that day, with that version, on my CPU, this was significantly faster. I would not be surprised if on another compiler or microarchitecture it would’ve been slower. It’s important to not over-optimize for one particular computer so that it ends up worse for 90% of your playerbase.

  1. See also the five new meanings of auto and the three separate meanings of static. 

  2. This broadly corresponds to a .cpp file, but it could be multiple .cpp files in the case of unity builds. 

  3. The reality is slightly more complex. For safety, these symbols can come with extra flags, such as asking the linker to verify that all of them are really the same before discarding the copies, or to pick the largest copy if they’re different, etc. 

  4. The “inlining” of constants happens automatically as part of a different optimization technique called constant folding, propagation, or substitution. 

  5. This behavior is platform-dependent to make things even more complex. Windows .dlls and Linux .so files for instance handle duplicates within a single process differently at runtime. 
