Document tostring/tonumber and math.pi/huge optimizations
31 KiB
permalink | title | toc |
---|---|---|
/performance | Performance | true |
One of main goals of Luau is to enable high performance code, with gameplay code being the main use case. This can be viewed as two separate goals:
- Make idiomatic code that wasn't tuned faster
- Enable even higher performance through careful tuning
Both of these goals are important - it's insufficient to just focus on the highly tuned code, and all things being equal we prefer to raise all boats by implementing general optimizations. However, in some cases it's important to be aware of optimizations that Luau does and doesn't do.
Worth noting is that Luau is focused on, first and foremost, stable high performance code in interpreted context. This is because JIT compilation is not available on many platforms Luau runs on, and AOT compilation would only work for code that Roblox ships (and even that does not always work). This is in stark contrast with LuaJIT that, while providing an excellent interpreter as well, focuses a lot of the attention on JIT (with many optimizations unavailable in the interpreter).
Luau eventually plans to implement JIT on some platforms, but this is subject to careful memory safety analysis and is likely to not be deployed for client-side scripts, as the extra risk involved in JITs is much more pronounced when it may affect players.
The rest of this document goes into some optimizations that Luau employs and how to best leverage them when writing code. The document is not complete - a lot of optimizations are transparent to the user and involve detailed low-level tuning of various parts that is not described here - and all of this is subject to change without notice, as it doesn't affect the semantics of valid code.
Fast bytecode interpreter
Luau features a very highly tuned portable bytecode interpreter. It's similar to Lua interpreter in that it's written in C, but it's highly tuned to yield efficient assembly when compiled with Clang and latest versions of MSVC. On some workloads it can match the performance of LuaJIT interpreter which is written in highly specialized assembly. We are continuing to tune the interpreter and the bytecode format over time; while extra performance can be extracted by rewriting the interpreter in assembly, we're unlikely to ever do that as the extra gains at this point are marginal, and we gain a lot from C in terms of portability and being able to quickly implement new optimizations.
Of course the interpreter isn't typical C code - it uses many tricks to achieve extreme levels of performance and to coerce the compiler to produce efficient assembly. Due to a better bytecode design and more efficient dispatch loop it's noticeably faster than Lua 5.x (including Lua 5.4 which made some of the changes similar to Luau, but doesn't come close). The bytecode design was partially inspired by excellent LuaJIT interpreter. Most computationally intensive scripts only use the interpreter core loop and builtins, which on x64 compiles into ~16 KB, thus leaving half of the instruction cache for other infrequently called code.
Optimizing compiler
Unlike Lua and LuaJIT, Luau uses a multi-pass compiler with a frontend that parses source into an AST and a backend that generates bytecode from it. This carries a small penalty in terms of compilation time, but results in more flexible code and, crucially, makes it easier to optimize the generated bytecode.
Note: Compilation throughput isn't the main focus in Luau, but our compiler is reasonably fast; with all currently implemented optimizations enabled, it compiles 950K lines of Luau code in 1 second on a single core of a desktop Ryzen 5900X CPU, producing bytecode and debug information.
While bytecode optimizations are limited due to the flexibility of Luau code (e.g. a * 1
may not be equivalent to a
if *
is overloaded through metatables), even in absence of type information Luau compiler can perform some optimizations such as "deep" constant folding across functions and local variables, perform upvalue optimizations for upvalues that aren't mutated, do analysis of builtin function usage, optimize the instruction sequences for multiple variable assignments, and some peephole optimizations on the resulting bytecode. The compiler can also be instructed to use more aggressive optimizations by enabling optimization level 2 (-O2
in CLI tools), some of which are documented further on this page.
Most bytecode optimizations are performed on individual statements or functions, however the compiler also does a limited amount of interprocedural optimizations; notably, calls to local functions can be optimized with the knowledge of the argument count or number of return values involved. Interprocedural optimizations are limited to a single module due to the compilation model.
Luau compiler currently doesn't use type information to do further optimizations, however early experiments suggest that we can extract further wins. Because we control the entire stack (unlike e.g. TypeScript where the type information is discarded completely before reaching the VM), we have more flexibility there and can make some tradeoffs during codegen even if the type system isn't completely sound. For example, it might be reasonable to assume that in presence of known types, we can infer absence of side effects for arithmetic operations and builtins - if the runtime types mismatch due to intentional violation of the type safety through global injection, the code will still be safely sandboxed; this may unlock optimizations such as common subexpression elimination and allocation hoisting without a JIT. This is speculative pending further research.
Epsilon-overhead debugger
It's important for Luau to have stable and predictable performance. Something that comes up in Lua-based environments often is the use of line hooks to implement debugging (both for breakpoints and for stepping). This is problematic because the support for hooks is typically not free in general, but importantly once the hook is enabled, calling the hook has a considerable overhead, and the hook itself may be very costly to evaluate since it will need to associate the script:line pair with the breakpoint information.
Luau does not support hooks at all, and relies on first-class support for breakpoints (using bytecode patching) and single-stepping (using a custom interpreter loop) to implement debugging. As a result, the presence of breakpoints doesn't slow the script execution down - the only noticeable discrepancy between running code under a debugger and without a debugger should be in cases where breakpoints are evaluated and skipped based on breakpoint conditions, or when stepping over long-running fragments of code.
Inline caching for table and global access
Table access for field lookup is optimized in Luau using a mechanism that blends inline caching (classically used in Java/JavaScript VMs) and HREFs (implemented in LuaJIT). Compiler can predict the hash slot used by field lookup, and the VM can correct this prediction dynamically.
As a result, field access can be very fast in Luau, provided that:
- The field name is known at compile time. To make sure this is the case,
table.field
notation is recommended, although the compiler will also optimizetable["field"]
when the expression is known to be a constant string. - The field access doesn't use metatables. The fastest way to work with tables in Luau is to store fields directly inside the table, and store methods in the metatable (see below); access to "static" fields in classic OOP designs is best done through
Class.StaticField
instead ofobject.StaticField
. - The object structure is usually uniform. While it's possible to use the same function to access tables of different shape - e.g.
function getX(obj) return obj.x end
can be used on any table that has a field"x"
- it's best to not vary the keys used in the tables too much, as it defeats this optimization.
The same optimization is applied to the custom globals declared in the script, although it's best to avoid these altogether by using locals instead. Still, this means that the difference between function
and local function
is less pronounced in Luau.
Importing global access chains
While global access for library functions can be optimized in a similar way, this optimization breaks down when the global table is using sandboxing through metatables, and even when globals aren't sandboxed, math.max
still requires two table accesses.
It's always possible to "localize" the global accesses by using local max = math.max
, but this is cumbersome - in practice it's easy to forget to apply this optimization. To avoid relying on programmers remembering to do this, Luau implements a special optimization called "imports", where most global chains such as math.max
are resolved when the script is loaded instead of when the script is executed.
This optimization relies on being able to predict the shape of the environment table for a given function; this is possible due to global sandboxing, however this optimization is invalid in some cases:
loadstring
can load additional code that runs in context of the caller's environmentgetfenv
/setfenv
can directly modify the environment of any function
The use of any of these functions performs a dynamic deoptimization, marking the affected environment as "impure". The optimizations are only in effect on functions with "pure" environments - because of this, the use of loadstring
/getfenv
/setfenv
is not recommended. Note that getfenv
deoptimizes the environment even if it's only used to read values from the environment.
Note: Luau still supports these functions as part of our backwards compatibility promise, although we'd love to switch to Lua 5.2's
_ENV
as that mechanism is cleaner and doesn't require costly dynamic deoptimization.
Fast method calls
Luau specializes method calls to improve their performance through a combination of compiler, VM and binding optimizations. Compiler emits a specialized instruction sequence when methods are called through obj:Method
syntax (while this isn't idiomatic anyway, you should avoid obj.Method(obj)
). When the object in question is a Lua table, VM performs some voodoo magic based on inline caching to try to quickly discover the implementation of this method through the metatable.
For this to be effective, it's crucial that __index
in a metatable points to a table directly. For performance reasons it's strongly recommended to avoid __index
functions as well as deep __index
chains; an ideal object in Luau is a table with a metatable that points to itself through __index
.
When the object in question is a reflected userdata, a special mechanism called "namecall" is used to minimize the interop cost. In classical Lua binding model, obj:Method
is called in two steps, retrieving the function object (obj.Method
) and calling it; both steps are often implemented in C++, and the method retrieval needs to use a method object cache - all of this makes method calls slow.
Luau can directly call the method by name using the "namecall" extension, and an optimized reflection layer can retrieve the correct method quickly through more voodoo magic based on string interning and custom Luau features that aren't exposed through Luau scripts.
As a result of both optimizations, common Lua tricks of caching the method in a local variable aren't very productive in Luau and aren't recommended either.
Specialized builtin function calls
Due to global sandboxing and the ability to dynamically deoptimize code running in impure environments, in pure environments we go beyond optimizing the interpreter and optimize many built-in functions through a "fastcall" mechanism.
For this mechanism to work, function call must be "obvious" to the compiler - it needs to call a builtin function directly, e.g. math.max(x, 1)
, although it also works if the function is "localized" (local max = math.max
); this mechanism doesn't work for indirect function calls unless they were inlined during compilation, and doesn't work for method calls (so calling string.byte
is more efficient than s:byte
).
The mechanism works by directly invoking a highly specialized and optimized implementation of a builtin function from the interpreter core loop without setting up a stack frame and omitting other work; additionally, some fastcall specializations are partial in that they don't support all types of arguments, for example all math
library builtins are only specialized for numeric arguments, so calling math.abs
with a string argument will fall back to the slower implementation that will do string->number coercion.
As a result, builtin calls are very fast in Luau - they are still slightly slower than core instructions such as arithmetic operations, but only slightly so. The set of fastcall builtins is slowly expanding over time and as of this writing contains assert
, type
, typeof
, rawget
/rawset
/rawequal
, getmetatable
/setmetatable
, tonumber
/tostring
, all functions from math
(except noise
and random
/randomseed
) and bit32
, and some functions from string
and table
library.
Some builtin functions have partial specializations that reduce the cost of the common case further. Notably:
assert
is specialized for cases when the assertion return value is not used and the condition is truthy; this helps reduce the runtime cost of assertions to the extent possiblebit32.extract
is optimized further when field and width selectors are constantselect
is optimized when the second argument is...
; in particular,select(x, ...)
is O(1) when using the builtin dispatch mechanism even though it's normally O(N) in variadic argument count.
Some functions from math
library like math.floor
can additionally take advantage of advanced SIMD instruction sets like SSE4.1 when available.
In addition to runtime optimizations for builtin calls, many builtin calls, as well as constants like math.pi
/math.huge
, can also be constant-folded by the bytecode compiler when using aggressive optimizations (level 2); this currently applies to most builtin calls with constant arguments and a single return value. For builtin calls that can not be constant folded, compiler assumes knowledge of argument/return count (level 2) to produce more efficient bytecode instructions.
Optimized table iteration
Luau implements a fully generic iteration protocol; however, for iteration through tables in addition to generalized iteration (for .. in t
) it recognizes three common idioms (for .. in ipairs(t)
, for .. in pairs(t)
and for .. in next, t
) and emits specialized bytecode that is carefully optimized using custom internal iterators.
As a result, iteration through tables typically doesn't result in function calls for every iteration; the performance of iteration using generalized iteration, pairs
and ipairs
is comparable, so generalized iteration (without the use of pairs
/ipairs
) is recommended unless the code needs to be compatible with vanilla Lua or the specific semantics of ipairs
(which stops at the first nil
element) is required. Additionally, using generalized iteration avoids calling pairs
when the loop starts which can be noticeable when the table is very short.
Iterating through array-like tables using for i=1,#t
tends to be slightly slower because of extra cost incurred when reading elements from the table.
Optimized table length
Luau tables use a hybrid array/hash storage, like in Lua; in some sense "arrays" don't truly exist and are an internal optimization, but some operations, notably #t
and functions that depend on it, like table.insert
, are defined by the Luau/Lua language to allow internal optimizations. Luau takes advantage of that fact.
Unlike Lua, Luau guarantees that the element at index #t
is stored in the array part of the table. This can accelerate various table operations that use indices limited by #t
, and this makes #t
worst-case complexity O(logN), unlike Lua where the worst case complexity is O(N). This also accelerates computation of this value for small tables like { [1] = 1 }
since we never need to look at the hash part.
The "default" implementation of #t
in both Lua and Luau is a binary search. Luau uses a special branch-free (depending on the compiler...) implementation of the binary search which results in 50+% faster computation of table length when it needs to be computed from scratch.
Additionally, Luau can cache the length of the table and adjust it following operations like table.insert
/table.remove
; this means that in practice, #t
is almost always a constant time operation.
Creating and modifying tables
Luau implements several optimizations for table creation. When creating object-like tables, it's recommended to use table literals ({ ... }
) and to specify all table fields in the literal in one go instead of assigning fields later; this triggers an optimization inspired by LuaJIT's "table templates" and results in higher performance when creating objects. When creating array-like tables, if the maximum size of the table is known up front, it's recommended to use table.create
function which can create an empty table with preallocated storage, and optionally fill it with a given value.
When the exact table shape isn't known, Luau compiler can still predict the table capacity required in case the table is initialized with an empty literal ({}
) and filled with fields subsequently. For example, the following code creates a correctly sized table implicitly:
local v = {}
v.x = 1
v.y = 2
v.z = 3
return v
When appending elements to tables, it's recommended to use table.insert
(which is the fastest method to append an element to a table if the table size is not known). In cases when a table is filled sequentially, however, it can be more efficient to use a known index for insertion - together with preallocating tables using table.create
this can result in much faster code, for example this is the fastest way to build a table of squares:
local t = table.create(N)
for i=1,N do
t[i] = i * i
end
Native vector math
Luau uses tagged value storage - each value contains a type tag and the data that represents the value of a given type. Because of the need to store 64-bit double precision numbers and 64-bit pointers, we don't use NaN tagging and have to pay the cost of 16 bytes per value.
We take advantage of this to provide a native value type that can store a 32-bit floating point vector with 3 components. This type is fundamental to game computations and as such it's important to optimize the storage and the operations with that type - our VM implements first class support for all math operations and component manipulation, which essentially means we have native 3-wide SIMD support. For code that uses many vector values this results in significantly smaller GC pressure and significantly faster execution, and gives programmers a mechanism to hand-vectorize numeric code if need be.
Optimized upvalue storage
Lua implements upvalues as garbage collected objects that can point directly at the thread's stack or, when the value leaves the stack frame (and is "closed"), store the value inside the object. This representation is necessary when upvalues are mutated, but inefficient when they aren't - and 90% or more of upvalues aren't mutated in typical Lua code. Luau takes advantage of this by reworking upvalue storage to prioritize immutable upvalues - capturing upvalues that don't change doesn't require extra allocations or upvalue closing, resulting in faster closure allocation, faster execution, faster garbage collection and faster upvalue access due to better memory locality.
Note that "immutable" in this case only refers to the variable itself - if the variable isn't assigned to it can be captured by value, even if it's a table that has its contents change.
When upvalues are mutable, they do require an extra allocated object; we carefully optimize the memory consumption and access cost for mutable upvalues to reduce the associated overhead.
Closure caching
With optimized upvalue storage, creating new closures (function objects) is more efficient but still requires allocating a new object every time. This can be problematic for cases when functions are passed to algorithms like table.sort
or functions like pcall
, as it results in excessive allocation traffic which then leads to more work for garbage collector.
To make closure creation cheaper, Luau compiler implements closure caching - when multiple executions of the same function expression are guaranteed to result in the function object that is semantically identical, the compiler may cache the closure and always return the same object. This changes the function identity which may affect code that uses function objects as table keys, but preserves the calling semantics - compiler will only do this if calling the original (cached) function behaves the same way as a newly created function would. The heuristics used for this optimization are subject to change; currently, the compiler will cache closures that have no upvalues, or all upvalues are immutable (see previous section) and are declared at the module scope, as the module scope is (almost always) evaluated only once.
Fast memory allocator
Similarly to LuaJIT, but unlike vanilla Lua, Luau implements a custom allocator that is highly specialized and tuned to the common allocation workloads we see. The allocator design is inspired by classic pool allocators as well as the excellent mimalloc
, but through careful domain-specific tuning it beats all general purpose allocators we've tested, including rpmalloc
, mimalloc
, jemalloc
, ptmalloc
and tcmalloc
.
This doesn't mean that memory allocation in Luau is free - it's carefully optimized, but it still carries a cost, and a high rate of allocations requires more work from the garbage collector. The garbage collector is incremental, so short of some edge cases this rarely results in visible GC pauses, but can impact the throughput since scripts will interrupt to perform "GC assists" (helping clean up the garbage). Thus for high performance Luau code it's recommended to avoid allocating memory in tight loops, by avoiding temporary table and userdata creation.
In addition to a fast allocator, all frequently used structures in Luau have been optimized for memory consumption, especially on 64-bit platforms, compared to Lua 5.1 baseline. This helps to reduce heap memory footprint and improve performance in some cases by reducing the memory bandwidth impact of garbage collection.
Optimized libraries
While the best performing code in Luau spends most of the time in the interpreter, performance of the standard library functions is critical to some applications. In addition to specializing many small and simple functions using the builtin call mechanism, we spend extra care on optimizing all library functions and providing additional functions beyond the Lua standard library that help achieve good performance with idiomatic code.
For example, functions like insert
, remove
and move
from the table
library have been tuned for performance on array-like tables, achieving 3x and more performance compared to un-tuned versions, and Luau provides functions like table.create
and table.find
to achieve further speedup when applicable. We also use a carefully tuned dynamic string buffer implementation for internal string
library to reduce garbage created during string manipulation.
In addition to the array-like specializations mentioned above, our implementation of table.sort
is using introsort
algorithm which results in guaranteed worst case NlogN
complexity regardless of the input.
Improved garbage collector pacing
Luau uses an incremental garbage collector which does a little bit of work every so often, and at no point does it stop the world to traverse the entire heap. The runtime will make sure that the collector runs interspersed with the program execution as the program allocates additional memory, which is known as "garbage collection assists", and can also run in response to explicit garbage collection invocation via lua_gc
. In interactive environments such as video game engines it's possible, and even desirable, to request garbage collection every frame to make sure assists are minimized, since that allows scheduling the garbage collection to run concurrently with other engine processing that doesn't involve script execution.
Inspired by excellent work by Austin Clements on Go's garbage collector pacer, we've implemented a pacing algorithm that uses a proportional–integral–derivative controller to estimate internal garbage collector tunables to reach a target heap size, defined as a percentage of the live heap data (which is more intuitive and actionable than Lua 5.x "GC pause" setting). Luau runtime also estimates the allocation rate making it easy (given uniform allocation rates) to adjust the per-frame garbage collection requests to do most of the required GC work outside of script execution.
Reduced garbage collector pauses
While Luau uses an incremental garbage collector, once per each collector cycle it runs a so-called "atomic" step. While all other GC steps can do very little work by only looking at a few objects at a given time, which means that the collector can have arbitrarily short pauses, the "atomic" step needs to traverse some amount of data that, in some cases, may scale with the application heap. Since atomic step is indivisible, it can result in occasional pauses on the order of tens of milliseconds, which is problematic for interactive applications. We've implemented a series of optimizations to help reduce the atomic step.
Normally objects that have been modified after the GC marked them in an incremental mark phase need to be rescanned during atomic phase, so frequent modifications of existing tables may result in a slow atomic step. To address this, we run a "remark" step where we traverse objects that have been modified after being marked once more (incrementally); additionally, the write barrier that triggers for object modifications changes the transition logic during remark phase to reduce the probability that the object will need to be rescanned.
Another source of scalability challenges is coroutines. Writes to coroutine stacks don't use a write barrier, since that's prohibitively expensive as they are too frequent. This means that coroutine stacks need to be traversed during atomic step, so applications with many coroutines suffer large atomic pauses. To address this, we implement incremental marking of coroutines: marking a coroutine makes it "inactive" and resuming a coroutine (or pushing extra objects on the coroutine stack via C API) makes it "active". Atomic step only needs to traverse active coroutines again, which reduces the cost of atomic step by effectively making coroutine collection incremental as well.
While large tables can be a problem for incremental GC in general since currently marking a single object is indivisible, large weak tables are a unique challenge because they also need to be processed during atomic phase, and the main use case for weak tables - object caches - may result in tables with large capacity but few live objects in long-running applications that exhibit bursts of activity. To address this, weak tables in Luau can be marked as "shrinkable" by including s
as part of __mode
string, which results in weak tables being resized to the optimal capacity during GC. This option may result in missing keys during table iteration if the table is resized while iteration is in progress and as such is only recommended for use in specific circumstances.
Optimized garbage collector sweeping
The incremental garbage collector in Luau runs three phases for each cycle: mark, atomic and sweep. Mark incrementally traverses all live objects, atomic finishes various operations that need to happen without mutator intervention (see previous section), and sweep traverses all objects in the heap, reclaiming memory used by dead objects and performing minor fixup for live objects. While objects allocated during the mark phase are traversed in the same cycle and thus may get reclaimed, objects allocated during the sweep phase are considered live. Because of this, the faster the sweep phase completes, the less garbage will accumulate; and, of course, the less time sweeping takes the less overhead there is from this phase of garbage collection on the process.
Since sweeping traverses the whole heap, we maximize the efficiency of this traversal by allocating garbage-collected objects of the same size in 16 KB pages, and traversing each page at a time, which is otherwise known as a paged sweeper. This ensures good locality of reference as consecutively swept objects are contiugous in memory, and allows us to spend no memory for each object on sweep-related data or allocation metadata, since paged sweeper doesn't need to be able to free objects without knowing which page they are in. Compared to linked list based sweeping that Lua/LuaJIT implement, paged sweeper is 2-3x faster, and saves 16 bytes per object on 64-bit platforms.
Function inlining and loop unrolling
By default, the bytecode compiler performs a series of optimizations that result in faster execution of the code, but they preserve both execution semantics and debuggability. For example, a function call is compiled as a function call, which may be observable via debug.traceback
; a loop is compiled as a loop, which may be observable via lua_getlocal
. To help improve performance in cases where these restrictions can be relaxed, the bytecode compiler implements additional optimizations when optimization level 2 is enabled (which requires using -O2
switch when using Luau CLI), namely function inlining and loop unrolling.
Only loops with loop bounds known at compile time, such as for i=1,4 do
, can be unrolled. The loop body must be simple enough for the optimization to be profitable; compiler uses heuristics to estimate the performance benefit and automatically decide if unrolling should be performed.
Only local functions (defined either as local function foo
or local foo = function
) can be inlined. The function body must be simple enough for the optimization to be profitable; compiler uses heuristics to estimate the performance benefit and automatically decide if each call to the function should be inlined instead. Additionally recursive invocations of a function can’t be inlined at this time, and inlining is completely disabled for modules that use getfenv
/setfenv
functions.
In both cases, in addition to removing the overhead associated with function calls or loop iteration, these optimizations can additionally benefit by enabling additional optimizations, such as constant folding of expressions dependent on loop iteration variable or constant function arguments, or using more efficient instructions for certain expressions when the inputs to these instructions are constants.