luau/docs/performance.md
2020-10-07 12:07:13 -07:00

Performance

One of the main goals of Luau is to enable high-performance code, with gameplay code being the main use case. This can be viewed as two separate goals:

  • Make idiomatic code that wasn't tuned run faster
  • Enable even higher performance through careful tuning

Both of these goals are important - it's insufficient to focus only on highly tuned code, and all else being equal we prefer to lift all boats by implementing general optimizations. However, in some cases it's important to be aware of the optimizations that Luau does and doesn't do.

Worth noting is that Luau is focused, first and foremost, on stable high-performance code in an interpreted context. This is because JIT compilation is not available on many platforms Luau runs on, and AOT compilation would only work for code that Roblox ships (and even that does not always work). This is in stark contrast with LuaJIT, which provides an excellent interpreter as well but focuses much of its attention on the JIT (with many optimizations unavailable in the interpreter).

Luau eventually plans to implement JIT on some platforms, but this is subject to careful memory safety analysis and is unlikely to be deployed for client-side scripts, as the extra risk involved in JITs is much more pronounced when it may affect players.

The rest of this document goes into some optimizations that Luau employs and how to best leverage them when writing code. The document is not complete - a lot of optimizations are transparent to the user and involve detailed low-level tuning of various parts that is not described here - and all of this is subject to change without notice, as it doesn't affect the semantics of valid code.

Fast bytecode interpreter

Luau features a very highly tuned portable bytecode interpreter. It's similar to the Lua interpreter in that it's written in C, but it's highly tuned to yield efficient assembly when compiled with Clang and the latest versions of MSVC. On some workloads it can match the performance of the LuaJIT interpreter, which is written in highly specialized assembly. We are continuing to tune the interpreter and the bytecode format over time; while extra performance can be extracted by rewriting the interpreter in assembly, we're unlikely to ever do that, as the extra gains at this point are marginal and we gain a lot from C in terms of portability and being able to quickly implement new optimizations.

Of course, the interpreter isn't typical C code - it uses many tricks to achieve extreme levels of performance and to coerce the compiler to produce efficient assembly. Due to a better bytecode design and a more efficient dispatch loop, it's noticeably faster than Lua 5.x (including Lua 5.4, which made some changes similar to Luau's but doesn't come close). The bytecode design was partially inspired by the excellent LuaJIT interpreter. Most computationally intensive scripts only use the interpreter core loop and builtins, which on x64 compile into ~16 KB of code, leaving half of a typical 32 KB L1 instruction cache for other, infrequently called code.

Optimizing compiler

Unlike Lua and LuaJIT, Luau uses a multi-pass compiler with a frontend that parses source into an AST and a backend that generates bytecode from it. This carries a small penalty in terms of compilation time, but results in more flexible code and, crucially, makes it easier to optimize the generated bytecode.

Note: Compilation throughput isn't the main focus in Luau, but our compiler is reasonably fast; with all currently implemented optimizations enabled, it compiles 400K lines of Luau code in 0.5 seconds on a single core of a desktop Core i7 CPU, producing bytecode and debug information.

While bytecode optimizations are limited due to the flexibility of Luau code (e.g. a * 1 may not be equivalent to a if * is overloaded through metatables), even in the absence of type information the Luau compiler can perform some optimizations: "deep" constant folding across functions and local variables, upvalue optimizations for upvalues that aren't mutated, analysis of builtin function usage, and some peephole optimizations on the resulting bytecode. In the future we plan to do bytecode-level inlining and possibly other code transformations.
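The a * 1 caveat above can be illustrated with a small sketch. The Loud type and the mulCalls counter are hypothetical names introduced only for this example; the code shows why a compiler cannot rewrite a * 1 to a without knowing the type of a, while a product of plain numeric constants is safe to fold.

```lua
-- Hypothetical metatable whose __mul has a visible side effect.
local Loud = {}

-- Counts how many times the overloaded multiplication runs.
local mulCalls = 0

Loud.__mul = function(lhs, rhs)
    mulCalls = mulCalls + 1
    -- Returns a brand new object instead of the left operand.
    return setmetatable({ value = lhs.value * rhs }, Loud)
end

local a = setmetatable({ value = 21 }, Loud)

-- If this were rewritten to `local b = a`, the side effect would be lost
-- and `b` would alias `a` instead of being a fresh object.
local b = a * 1

-- Only numeric constants are involved here, so folding cannot change behavior.
local SECONDS_PER_DAY = 24 * 60 * 60
```

Here b is a different table from a and mulCalls has been incremented, so the multiplication is observable even though the numeric result is unchanged.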

The Luau compiler currently doesn't use type information to do further optimizations; however, early experiments suggest that we can extract further wins. Because we control the entire stack (unlike e.g. TypeScript, where the type information is discarded completely before reaching the VM), we have more flexibility there and can make some tradeoffs during codegen even if the type system isn't completely sound. For example, it might be reasonable to assume that in the presence of known types we can infer the absence of side effects for arithmetic operations and builtins - if the runtime types mismatch due to intentional violation of type safety through global injection, the code will still be safely sandboxed. This may unlock optimizations such as common subexpression elimination and allocation hoisting without a JIT. This is speculative pending further research.

Epsilon-overhead debugger

It's important for Luau to have stable and predictable performance. Something that often comes up in Lua-based environments is the use of line hooks to implement debugging (both for breakpoints and for stepping). This is problematic because support for hooks is typically not free even when no hook is installed; more importantly, once a hook is enabled, calling it has considerable overhead, and the hook itself may be very costly to evaluate since it needs to associate the script:line pair with the breakpoint information.

Luau does not support hooks at all, and relies on first-class support for breakpoints (using bytecode patching) and single-stepping (using a custom interpreter loop) to implement debugging. As a result, the presence of breakpoints doesn't slow the script execution down - the only noticeable discrepancy between running code under a debugger and without a debugger should be in cases where breakpoints are evaluated and skipped based on breakpoint conditions, or when stepping over long-running fragments of code.

Inline caching for table and global access

Table access for field lookup is optimized in Luau using a mechanism that blends inline caching (classically used in Java/JavaScript VMs) and HREFs (implemented in LuaJIT). The compiler can predict the hash slot used by a field lookup, and the VM can correct this prediction dynamically.

As a result, field access can be very fast in Luau, provided that:

  • The source code uses table.field notation. The compiler doesn't optimize table[field], as it assumes that in this case field is not a string and/or can change between accesses. Because of this, you should avoid table["field"], which isn't idiomatic anyway.
  • The field access doesn't use metatables. The fastest way to work with tables in Luau is to store fields directly inside the table, and store methods in the metatable (see below); access to "static" fields in classic OOP designs is best done through Class.StaticField instead of object.StaticField.
  • The object structure is usually uniform. While it's possible to use the same function to access tables of different shapes - e.g. function getX(obj) return obj.x end can be used on any table that has a field "x" - it's best not to vary the keys used in the tables too much, as this defeats the optimization.
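The three conditions above can be sketched together as follows. The Enemy class and its fields are hypothetical and exist only for illustration: instance fields are stored directly in each table with the same set of keys, and the "static" field is read through the class rather than through an instance.

```lua
-- Hypothetical class used only for illustration.
local Enemy = {}
Enemy.__index = Enemy

-- "Static" field stored on the class table itself.
Enemy.MaxHealth = 100

function Enemy.new(x, y)
    -- Every instance is created with the same keys, keeping the shape uniform.
    return setmetatable({ x = x, y = y, health = Enemy.MaxHealth }, Enemy)
end

local function healthFraction(enemy)
    -- enemy.health is stored directly in the instance table.
    -- Enemy.MaxHealth is read from the class; enemy.MaxHealth would also
    -- work, but only by going through the metatable.
    return enemy.health / Enemy.MaxHealth
end

local e = Enemy.new(1, 2)
e.health = 25
local fraction = healthFraction(e)
```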

The same optimization is applied to the custom globals declared in the script, although it's best to avoid these altogether by using locals instead. Still, this means that the difference between function and local function is less pronounced in Luau.

Importing global access chains

While global access for library functions can be optimized in a similar way, this optimization breaks down when the global table is sandboxed through metatables, and even when globals aren't sandboxed, math.max still requires two table accesses.

It's always possible to "localize" the global accesses by using local max = math.max, but this is cumbersome - in practice it's easy to forget to apply this optimization. To avoid relying on programmers remembering to do this, Luau implements a special optimization called "imports", where most global chains such as math.max are resolved when the script is loaded instead of when the script is executed.
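As a small sketch of the two styles discussed above, the following compares manual localization with writing the global chain out in full. The function names clampLow and clampLowDirect are hypothetical; both functions compute the same result, and the text above suggests that with imports the second form should not need the manual local.

```lua
-- Manual "localization": resolve the global chain once and reuse the local.
local max = math.max

local function clampLow(values, lowest)
    local out = {}
    for i, v in ipairs(values) do
        out[i] = max(v, lowest)
    end
    return out
end

-- The global chain written out in full; this is the form that the
-- "imports" optimization described above is meant to make cheap.
local function clampLowDirect(values, lowest)
    local out = {}
    for i, v in ipairs(values) do
        out[i] = math.max(v, lowest)
    end
    return out
end

local r1 = clampLow({ -5, 3, 10 }, 0)
local r2 = clampLowDirect({ -5, 3, 10 }, 0)
```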

This optimization relies on being able to predict the shape of the environment table for a given function. This is possible due to global sandboxing; however, the optimization is invalid in some cases:

  • loadstring can load additional code that runs in context of the caller's environment
  • getfenv/setfenv can directly modify the environment of any function

The use of any of these functions performs a dynamic deoptimization, marking the affected environment as "impure". The optimizations are only in effect on functions with "pure" environments - because of this, the use of loadstring/getfenv/setfenv is not recommended. Note that getfenv deoptimizes the environment even if it's only used to read values from the environment.

Note: Luau still supports these functions as part of our backwards compatibility promise, although we'd love to switch to Lua 5.2's _ENV as that mechanism is cleaner and doesn't require costly dynamic deoptimization.

Fast method calls

Luau specializes method calls to improve their performance through a combination of compiler, VM and binding optimizations. The compiler emits a specialized instruction sequence when methods are called through obj:Method syntax, so you should avoid obj.Method(obj), which isn't idiomatic anyway. When the object in question is a Lua table, the VM performs some voodoo magic based on inline caching to try to quickly discover the implementation of this method through the metatable.

For this to be effective, it's crucial that __index in a metatable points to a table directly. For performance reasons it's strongly recommended to avoid __index functions as well as deep __index chains; an ideal object in Luau is a table with a metatable that points to itself through __index.
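A minimal sketch of the object layout recommended above, using a hypothetical Counter class: the metatable's __index points directly at the class table (not at a function and not at a chain of further tables), and methods are invoked with the colon syntax.

```lua
-- Hypothetical class used only for illustration.
local Counter = {}

-- The metatable points to itself through __index.
Counter.__index = Counter

function Counter.new()
    return setmetatable({ count = 0 }, Counter)
end

function Counter:increment(step)
    self.count = self.count + (step or 1)
    return self.count
end

local c = Counter.new()

-- Method call syntax, as recommended above (rather than c.increment(c)).
c:increment()
c:increment(4)
```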

When the object in question is a reflected userdata, a special mechanism called "namecall" is used to minimize the interop cost. In the classical Lua binding model, obj:Method is called in two steps: retrieving the function object (obj.Method) and calling it. Both steps are often implemented in C++, and the method retrieval needs to use a method object cache - all of this makes method calls slow.

Luau can directly call the method by name using the "namecall" extension, and an optimized reflection layer can retrieve the correct method quickly through more voodoo magic based on string interning and custom Luau features that aren't exposed through Luau scripts.

As a result of both optimizations, common Lua tricks of caching the method in a local variable aren't very productive in Luau and aren't recommended either.

Specialized builtin function calls

Due to global sandboxing and the ability to dynamically deoptimize code running in impure environments, in pure environments we go beyond optimizing the interpreter and optimize many built-in functions through a "fastcall" mechanism.

For this mechanism to work, the function call must be "obvious" to the compiler - it needs to call a builtin function directly, e.g. math.max(x, 1), although it also works if the function is "localized" (local max = math.max). This mechanism doesn't work for indirect function calls unless they were inlined during compilation, and doesn't work for method calls (so calling string.byte is more efficient than s:byte).

The mechanism works by directly invoking a highly specialized and optimized implementation of a builtin function from the interpreter core loop, without setting up a stack frame and while omitting other work. Additionally, some fastcall specializations are partial in that they don't support all types of arguments. For example, all math library builtins are only specialized for numeric arguments, so calling math.abs with a string argument will fall back to the slower implementation that performs string->number coercion.

As a result, builtin calls are very fast in Luau - they are still slightly slower than core instructions such as arithmetic operations, but only slightly so. The set of fastcall builtins is slowly expanding over time and as of this writing contains math, bit32, assert, type, typeof and some functions from the string library.
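The call forms discussed in this section can be put side by side in a short sketch. All calls below return the same values; the comments only restate which form the text above describes as eligible for the fastcall path, and the variable names are chosen for this example.

```lua
local s = "Luau"

-- Direct call to a builtin: the form described above as "obvious" to the compiler.
local first = string.byte(s, 1)

-- Method call: same result, but method calls are described above as not
-- going through the fastcall mechanism.
local firstViaMethod = s:byte(1)

-- Numeric argument: matches the specialization for math builtins.
local absNumber = math.abs(-3)

-- String argument: still valid, but described above as falling back to the
-- slower implementation that performs string->number coercion.
local absString = math.abs("-3")
```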

Note: The partial specialization mechanism is cute in that for assert, it only specializes on truthy conditions; hopefully performance of assert(false) isn't crucial for most code!

Optimized table iteration

Luau implements a fully generic iteration protocol; however, for iteration through tables it recognizes three common idioms (for .. in ipairs(t), for .. in pairs(t) and for .. in next, t) and emits specialized bytecode that is carefully optimized using custom internal iterators.

As a result, iteration through tables typically doesn't result in function calls for every iteration; the performance of iteration using pairs and ipairs is comparable, so it's recommended to pick the iteration style based on readability instead of performance.

Iterating through array-like tables using for i=1,#t tends to be slightly slower because of extra cost incurred when reading elements from the table.
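For reference, the three recognized idioms and the numeric loop look as follows on a small array. This is only a sketch of the syntax being discussed; all four loops compute the same sum, and the variable names are chosen for this example.

```lua
local scores = { 10, 20, 30 }

local sumIpairs, sumPairs, sumNext, sumIndexed = 0, 0, 0, 0

-- Idiom 1: ipairs
for _, v in ipairs(scores) do
    sumIpairs = sumIpairs + v
end

-- Idiom 2: pairs
for _, v in pairs(scores) do
    sumPairs = sumPairs + v
end

-- Idiom 3: next passed directly as the iterator function
for _, v in next, scores do
    sumNext = sumNext + v
end

-- Numeric loop: each element is read from the table by index,
-- which is the extra cost mentioned above.
for i = 1, #scores do
    sumIndexed = sumIndexed + scores[i]
end
```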

Creating and modifying tables

Luau implements several optimizations for table creation. When creating object-like tables, it's recommended to use table literals ({ ... }) and to specify all table fields in the literal in one go instead of assigning fields later; this triggers an optimization inspired by LuaJIT's "table templates" and results in higher performance when creating objects. When creating array-like tables, if the maximum size of the table is known up front, it's recommended to use the table.create function, which can create an empty table with preallocated storage and optionally fill it with a given value.
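As a sketch of the object-like case, the two hypothetical constructors below build the same table. The first specifies every field in a single literal, which is the form recommended above; the second assigns fields one by one after creating an empty table.

```lua
-- Recommended: all fields are specified in one table literal.
local function newParticle(x, y)
    return { x = x, y = y, vx = 0, vy = 0, alive = true }
end

-- Less favorable: the table starts empty and fields are added afterwards.
local function newParticleIncremental(x, y)
    local p = {}
    p.x = x
    p.y = y
    p.vx = 0
    p.vy = 0
    p.alive = true
    return p
end

local p1 = newParticle(3, 4)
local p2 = newParticleIncremental(3, 4)
```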

When appending elements to tables whose size is not known, it's recommended to use table.insert (it is currently ever so slightly slower than t[#t+1], but this will be improved in the future). When a table is filled sequentially, however, it's much more efficient to use a known index for insertion. Together with preallocating tables using table.create, this can result in much faster code. For example, this is the fastest way to build a table of squares:

-- N is the number of elements, assumed to be known up front
local t = table.create(N)

for i=1,N do
	t[i] = i * i
end

Native Vector3 math

Note: this optimization is still in progress, so this section doesn't document it, but it's going to be great

Optimized upvalue storage

Lua implements upvalues as garbage collected objects that can point directly at the thread's stack or, when the value leaves the stack frame (and is "closed"), store the value inside the object. This representation is necessary when upvalues are mutated, but inefficient when they aren't - and 90% or more of upvalues aren't mutated in typical Lua code. Luau takes advantage of this by reworking upvalue storage to prioritize immutable upvalues - capturing upvalues that don't change doesn't require extra allocations or upvalue closing, resulting in faster closure allocation, faster execution, faster garbage collection and faster upvalue access due to better memory locality.

Note that "immutable" in this case only refers to the variable itself - if the variable isn't assigned to, it can be captured by value, even if it's a table whose contents change.
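The distinction can be sketched with three hypothetical closure factories. Which upvalues count as "immutable" follows the description above; the code itself only demonstrates the three capture patterns.

```lua
-- `scale` is never assigned after initialization: an upvalue that isn't mutated.
local function makeScaler(scale)
    return function(x)
        return x * scale
    end
end

-- `total` is reassigned inside the closure: a mutated upvalue.
local function makeAccumulator()
    local total = 0
    return function(x)
        total = total + x
        return total
    end
end

-- `log` is never reassigned; only the table's contents change, so the
-- variable itself still falls under the "immutable" case described above.
local function makeLogger()
    local log = {}
    return function(message)
        log[#log + 1] = message
        return #log
    end
end

local double = makeScaler(2)
local accumulate = makeAccumulator()
local logMessage = makeLogger()

local doubled = double(21)
accumulate(1)
local runningTotal = accumulate(2)
logMessage("first")
local logSize = logMessage("second")
```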

Fast memory allocator

Similarly to LuaJIT, but unlike vanilla Lua, Luau implements a custom allocator that is highly specialized and tuned to the common allocation workloads we see. The allocator design is inspired by classic pool allocators as well as the excellent mimalloc, but through careful domain-specific tuning it beats all general purpose allocators we've tested, including rpmalloc, mimalloc, jemalloc, ptmalloc and tcmalloc.

This doesn't mean that memory allocation in Luau is free - it's carefully optimized, but it still carries a cost, and a high rate of allocations requires more work from the garbage collector. The garbage collector is incremental, so short of some edge cases this rarely results in visible GC pauses, but it can impact throughput since scripts will be interrupted to perform "GC assists" (helping clean up the garbage). Thus, for high performance Luau code it's recommended to avoid allocating memory in tight loops by avoiding temporary table and userdata creation.
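As a sketch of this advice, the two hypothetical functions below compute the same value. The first creates a temporary table on every iteration; the second keeps the intermediate values in locals, so the loop body allocates nothing.

```lua
-- Allocates a temporary table on every iteration of the loop.
local function sumOfSquaresAllocating(points)
    local total = 0
    for _, p in ipairs(points) do
        local squares = { p.x * p.x, p.y * p.y } -- temporary table
        total = total + squares[1] + squares[2]
    end
    return total
end

-- Keeps intermediate values in locals; no allocation inside the loop.
local function sumOfSquares(points)
    local total = 0
    for _, p in ipairs(points) do
        local x, y = p.x, p.y
        total = total + x * x + y * y
    end
    return total
end

local points = { { x = 3, y = 4 }, { x = 1, y = 2 } }
local slowResult = sumOfSquaresAllocating(points)
local fastResult = sumOfSquares(points)
```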

In addition to a fast allocator, all frequently used structures in Luau have been optimized for memory consumption, especially on 64-bit platforms, compared to the Lua 5.1 baseline. This helps to reduce heap memory footprint and improves performance in some cases by reducing the memory bandwidth impact of garbage collection.

Optimized garbage collector

Note: our garbage collector optimizations are still in progress, so this section doesn't document them.

Fast binding interface

Note: our optimizations of binding interface are still in progress, so this section doesn't document them.