inline
does not inline
Over the many years of C++’s life, some keywords such as auto, register, and export have gone obsolete and were either reused with a new, changed meaning or set aside for later reuse.
If you know what these meant originally, congratulations, you’re old!
inline went through subtler changes that were not directly caused by the language standard redefining it, yet its current meaning has nothing to do with the original.
C++17 has formally acknowledged the status quo and built upon it.
C++ is really averse to introducing new keywords [1], so even though inline now officially has nothing to do with inlining anymore (as we’ll see later), it remained instead of being changed to something more appropriate, such as shared.
What it was
inline was originally intended to be a compiler hint to inline a function, but as compilers evolved, it became apparent that programmers are terrible at hinting these things correctly.
Better code could be generated by more or less ignoring the hints (or accepting them only as a small nudge, depending on the compiler) and deciding on inlining based on the compiler’s far more advanced heuristics.
This goes both ways: inlining functions that aren’t inline, and using regular function calls for inline functions.
This is the same fate that befell for(register int i = 0; i < n; ++i): once no compiler cared about register, it eventually fell out of use.
As of C++17, register is an unused, reserved keyword to be repurposed later.
What it has become
inline has a far more useful side effect than its originally intended purpose: something defined as inline (either explicitly or implicitly) can be defined in multiple translation units [2] without violating the One Definition Rule (ODR).
When an inline function is compiled, it’s placed into a COMDAT section (or its equivalent on your platform) instead of, e.g., .text where most functions would normally go.
If the linker sees multiple definitions of something in such a section, it understands this as an intended duplicate and, instead of raising an error, discards all but one copy [3], as if the inline function had been declared extern and implemented in exactly one .cpp file.
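To make this concrete, here is a minimal sketch. The header name math_utils.h, the files a.cpp and b.cpp, and the functions square and area_of_side are all made up for illustration; the point is only that the definition can live in a header that several translation units include.

```cpp
#include <cassert>

// math_utils.h (hypothetical header, included from several .cpp files)

// `inline` permits one definition of square() per translation unit.
// The linker keeps a single copy instead of reporting a duplicate symbol.
inline int square(int x)
{
    return x * x;
}

// a.cpp and b.cpp (both hypothetical) could each contain:
//     #include "math_utils.h"
//     int area_a() { return square(3); }
// Here a single function stands in for such a caller.
inline int area_of_side(int side)
{
    assert(side >= 0); // sanity check for this sketch only
    return square(side);
}
```

Dropping inline from square and including the header from two .cpp files would typically end in a duplicate-symbol link error, which is an easy way to convince yourself of what the keyword really does.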
inline variables
This is exactly what happens to inline variables, which is why it’s better to think of inline as shared, merged, or duplicates_are_allowed.
Inlining a global (non-const [4]) variable does not make sense: you want all of them to end up in the same memory location, reducing the number of copies.
Inlining makes extra copies.
Inline variables can cause issues if your project consists of more than one binary: the linker will do its job and discard duplicates once per link, so you could end up with one copy in your .exe, another in a .dll [5], and so on.
In an Unreal project, you’ll get both: Development builds are one-binary-per-module, and Shipping builds are monolithically linked together by default, so you can’t even rely on having the same number of copies.
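My personal suggestion is to limit inline variables to constants where it does not matter which one you’re looking at.
constexpr should be preferred, with inline const used as a fallback for types that cannot be constexpr.
Note that constexpr implies inline only for functions and static data members; a namespace-scope constexpr variable has internal linkage, so write inline constexpr if you want a single shared object.
Of course, anything mutable or used with a const_cast would violate this.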
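A short sketch of that suggestion, assuming C++17 or later. The header name config.h and the identifiers kMaxPlayers, kDefaultName, and Limits are hypothetical.

```cpp
#include <cassert>
#include <string>

// config.h (hypothetical header shared by several .cpp files)

// Preferred: a compile-time constant. At namespace scope, `inline` is
// spelled out so that all translation units share one object.
inline constexpr int kMaxPlayers = 64;

// Fallback for a type that generally cannot be a constexpr variable,
// such as std::string: still a constant, still one object per binary.
inline const std::string kDefaultName = "Player";

struct Limits
{
    // A constexpr static data member is implicitly inline since C++17.
    static constexpr int kMaxTeams = 4;
};

// Constants are safe to read no matter which copy you end up looking at.
static_assert(kMaxPlayers == 64);
static_assert(Limits::kMaxTeams == 4);
```

Since none of these can change after initialization, it does not matter much if an .exe and a .dll each end up with their own copy.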
The confusion
As such, we ended up with inlining, an important optimization technique, and inline, a keyword with a very similar-sounding name that has nothing to do with it, or even has the opposite effect.
It is, however, often used in code that’s written to help with inlining, making for a significant correlation that fuels the confusion.
Official documentation contradicting itself is not helping either.
Let’s go through some examples to see what really happens! This is on MSVC:
[Listing not preserved: C++ source and the generated assembly, side by side]
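For readers following along without the listing, source of roughly this shape fits the discussion. This is an assumed reconstruction, not the original code: only the names foo and bar come from the text, while the bodies and the value 42 are placeholders.

```cpp
// Assumed reconstruction for illustration only, not the original listing.

// `inline` allows foo to be defined in several translation units.
// It does not oblige the compiler to inline the call inside bar.
inline int foo()
{
    return 42;
}

// bar simply forwards to foo; whether this becomes a `call` instruction,
// an inlined body, or a precomputed constant is up to the optimizer.
int bar()
{
    return foo();
}
```

One way to check what your own compiler does with it is to look at an assembly listing, which MSVC can emit with the /FA switch.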
In this first example, inline caused foo to be placed in a COMDAT section and no inlining happened, since bar contains a call instruction for foo.
The linker will see this, find the one single copy of foo, discard 0 extras as instructed, and link normally, rendering inline more or less pointless here.
On the other hand, if we remove inline and enable optimization:
[Listing not preserved: C++ source and the generated assembly, side by side]
The compiler internally decided to place even bar in a COMDAT section (it’s free to do so), and precompute foo’s result instead of inlining it, as if it was constexpr.
This example illustrates how using these keywords with the intent to control inlining or as an optimization attempt is often just placebo.
When to (not) use inline?
As we just saw, inline does not really affect inlining.
The compiler will inline or not inline calls regardless of its presence, assuming it knows about the implementation.
As the opening section alluded to, its main use is for what used to be its side effect, allowing multiple definitions of something.
Inlining still remains an important tool in the compiler’s toolbox though, and it needs your help to do it properly.
If your project calls functions across binaries (.dll, .so, .dylib) or even just across object files making up one binary in case you’re not using link-time optimization, those calls cannot be inlined even if the compiler wanted to: it can’t inline what it can’t see.
To alleviate this, you can move code from .cpp to .h files (typically, small methods because doing so negatively affects compile times), which is where inline is sometimes needed if you’re not writing code that’s implicitly inline already (non-template functions outside classes usually require inline).
Hopefully with the community’s eventual adoption of C++20 modules, this will improve.
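As a rough guide to which header-defined functions need the keyword and which are implicitly inline, here is a small sketch. The header name geometry.h and all identifiers in it are hypothetical.

```cpp
#include <cassert>

// geometry.h (hypothetical header)

struct Vec2
{
    float x, y;

    // Defined inside the class body: implicitly inline.
    float dot(const Vec2& other) const { return x * other.x + y * other.y; }

    float length_squared() const;
};

// Member defined outside the class body but still in the header:
// needs an explicit `inline`.
inline float Vec2::length_squared() const
{
    return dot(*this);
}

// Function templates may be defined in headers without `inline`.
template <typename T>
T clamp_to_zero(T value)
{
    return value < T(0) ? T(0) : value;
}

// constexpr functions are implicitly inline.
constexpr int doubled(int v) { return 2 * v; }

// A plain free function in a header needs an explicit `inline`.
inline float mix(float a, float b, float t)
{
    assert(t >= 0.0f && t <= 1.0f); // precondition for this sketch
    return a + (b - a) * t;
}
```

In every one of these cases the keyword (or its implicit equivalent) only makes the definition legal in multiple translation units; whether any call gets inlined is still the compiler’s decision.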
FORCEINLINE
Let’s go one step further!
Another very popular placebo is FORCEINLINE (__forceinline, [[clang::always_inline]], etc.).
Other than causing issues such as breaking debugging, it does not actually force inlining. Depending on the compiler, it will often consider your request since you’re already using a nonstandard extension, but a function may simply not be eligible for inlining. Even if a function is written so that it is eligible on its own, its usage might render it ineligible anyway, which can be hard to track down in a larger project.
The following example was compiled with -O3:
[Listing not preserved: C++ source and the generated assembly, side by side]
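The original listing is not available here, so the following is only an assumed sketch of the kind of source the discussion fits. The names fibonacci and bar come from the text; the FORCEINLINE macro definition and the function bodies are my guesses, and the macro deliberately falls back to plain inline on compilers it does not recognize.

```cpp
#include <cassert>

// Hypothetical FORCEINLINE macro; real projects usually ship their own.
#if defined(_MSC_VER)
    #define FORCEINLINE __forceinline
#elif defined(__clang__)
    #define FORCEINLINE __attribute__((always_inline)) inline
#else
    // Plain inline elsewhere, to keep this sketch portable.
    #define FORCEINLINE inline
#endif

// Recursion makes a "force inline" request impossible to honor in full:
// a function cannot be pasted into itself without end.
FORCEINLINE int fibonacci(int n)
{
    return n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2);
}

// bar ends by returning fibonacci's result unchanged, which is the shape
// that makes a tail call possible.
int bar(int n)
{
    return fibonacci(n);
}
```

Compiling something like this with optimizations enabled and reading the assembly is the only reliable way to see what happened to the request.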
There’s a lot going on! You can ignore most of the generated code, but there are a few key takeaways.
First of all, fibonacci obviously did not get inlined.
A different optimization called a tail call was performed, where bar transfers control to fibonacci (jmp) instead of calling it (call).
This is possible because bar ends by returning the value from fibonacci without any extra conversion, so fibonacci might as well return it on bar’s behalf and its caller will be none the wiser.
fibonacci also ends up calling itself once instead of the two calls that are in its source code.
These transformations are far more involved than “inlining, yes/no” and you have
no hope of expressing this nuance with C++ keywords.
I won’t waste space with more code blocks, but making fibonacci regular inline or not inline results in identical assembly.
I’m also skipping examples for inline and inlining happening together.
I don’t think this one would surprise anyone, and it does happen very often.
When to (not) use FORCEINLINE?
So, what are legitimate uses of FORCEINLINE? In line with the first rule of optimization (“don’t optimize”), you should default to not using it at all. You may stop reading now.
One use of it is sacrificing debuggability in debug builds for performance, which is the opposite of what one normally wants from a debug build. As such, it should only be used very sparingly, on functions that are trivial yet called so often that it causes an actual problem for programmers.
Compilers are smart even in debug builds: for instance, MSVC will still skip some calls whose effects it intrinsically knows and there’s no value in debugging, such as std::move.
Another use of it is genuine optimizing (when you’re on the third rule of optimization or beyond). This happens once your source code and even compiler is mostly frozen, and you have proof that using it on some functions actually makes things better.
This is not a case of littering FORCEINLINE on functions that “gotta go fast” and calling it a day. The 80/20 rule still applies, and even in a relatively large codebase, you’ll likely end up with only a small fraction of functions needing this treatment, assuming you’re already using a monolithic release build with LTCG/LTO + PGO.
Inlining can slow things down (it increases pressure on the instruction cache among other things), and depending on your code, you can do better than it.
In one outlier instance, I managed to outperform FORCEINLINE by 30% in vectorized code by not using it and instead moving an if from the function to a few of its call sites.
Sometimes, the optimum lies between a full call and inlining.
I expect future compilers to eventually deal with that particular case, but on that day, with that version, on my CPU, this was significantly faster. I would not be surprised if on another compiler or microarchitecture it would’ve been slower. It’s important to not over-optimize for one particular computer so that it ends up worse for 90% of your playerbase.
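To illustrate what moving an if from a function to its call sites can look like, here is a hedged sketch. It is not the code from the anecdote above; apply_gain, apply_gain_unchecked, and process are hypothetical names, and the audio-gain scenario is just an assumption chosen to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant A: the check lives inside the function, so it is evaluated on
// every call, whether the call is inlined or not.
float apply_gain(float sample, float gain)
{
    if (gain == 1.0f)
        return sample; // nothing to do
    return sample * gain;
}

// Variant B: the function only does the work...
float apply_gain_unchecked(float sample, float gain)
{
    return sample * gain;
}

// ...and a hot call site tests the condition once, outside the loop,
// leaving a branch-free loop body that is easier to vectorize.
void process(std::vector<float>& samples, float gain)
{
    if (gain == 1.0f)
        return;
    for (std::size_t i = 0; i < samples.size(); ++i)
        samples[i] = apply_gain_unchecked(samples[i], gain);
}
```

Whether a change like this helps at all is something only a profiler can tell you, on your compiler and your target hardware.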
[1] See also the five new meanings of auto and the three separate meanings of static.
[2] This broadly corresponds to a .cpp file, but it could be multiple .cpp files in the case of unity builds.
[3] The reality is slightly more complex. For safety, these symbols can come with extra flags, such as asking the linker to verify that all of them are really the same before discarding the copies, or to pick the largest copy if they’re different, etc.
[4] The “inlining” of constants happens automatically as part of a different optimization technique called constant folding, propagation, or substitution.
[5] This behavior is platform-dependent to make things even more complex. Windows .dlls and Linux .so files for instance handle duplicates within a single process differently at runtime.