
Performance

One of the main goals of Luau is to enable high performance code, with gameplay code being the main use case. This can be viewed as two separate goals:

  • Make commonly written idiomatic code faster
  • Enable even more high performance code through careful tuning

Both of these goals are important - it's insufficient to just focus on highly tuned code, and, all things being equal, we prefer to raise all boats by implementing general optimizations. However, in some cases it's important to be aware of the optimizations Luau does and doesn't do.

Worth noting is that Luau focuses, first and foremost, on stable high performance in an interpreted context. This is because JIT compilation is not available on many platforms Luau runs on, and AOT compilation would only work for code that Roblox ships (and even that does not always work). This is in stark contrast with LuaJIT which, while providing an excellent interpreter as well, focuses much of its attention on the JIT (with many optimizations unavailable in the interpreter).

Luau eventually plans to implement JIT on some platforms, but this is subject to careful memory safety analysis and is unlikely to be deployed for client-side scripts, as the extra risk involved in JITs is much more pronounced when it may affect players.

The rest of this document goes into some optimizations that Luau employs and how to best leverage them when writing code. The document is not complete - a lot of optimizations are transparent to the user and involve detailed low-level tuning of various parts that is not described here - and all of this is subject to change without notice, as it doesn't affect the semantics of valid code.

Fast bytecode interpreter

Luau features a highly tuned portable bytecode interpreter. It's similar to the Lua interpreter in that it's written in C, but it's carefully tuned to yield efficient assembly when compiled with Clang and the latest versions of MSVC. On some workloads it can match the performance of the LuaJIT interpreter, which is written in highly specialized assembly. We are continuing to tune the interpreter and the bytecode format over time; while some extra performance can be extracted by rewriting the interpreter in assembly, we're unlikely to ever do that, as the extra gains at this point are marginal and we gain a lot in terms of portability and being able to quickly prototype new ideas.

Of course the interpreter isn't typical C code - it uses many tricks to achieve extreme levels of performance and to coerce the compiler to produce efficient code. Due to a better bytecode design and a more efficient dispatch loop it's noticeably faster than Lua 5.x (including Lua 5.4, which made some changes similar to Luau's but doesn't come close). The bytecode design was partially inspired by the excellent LuaJIT interpreter.

Optimizing compiler

Unlike Lua and LuaJIT, Luau uses a more classical compiler construction with a frontend that parses source into an AST and a backend that generates bytecode from it. This carries a small penalty in terms of compilation time, but results in more flexible code and, crucially, makes it easier to optimize the generated bytecode.

While bytecode optimizations are limited due to the flexibility of Luau code (e.g. `a * 1` may not be equivalent to `a` if `*` is overloaded through metatables), even in the absence of type information the Luau compiler can perform some optimizations, such as "deep" constant folding across functions and local variables, optimizing upvalues that aren't mutated, analyzing builtin function usage, and applying peephole optimizations to the resulting bytecode. In the future we plan to do bytecode-level inlining and possibly other code transformations.
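
For instance, here is a sketch of the kind of folding this enables (hypothetical code; the exact bytecode depends on the compiler version and optimization level):

```lua
local kTileSize = 16
local kTilesPerRow = 8

local function rowWidth()
    -- kTileSize and kTilesPerRow are non-mutated locals with constant values,
    -- so the compiler can fold this expression to 128 at compile time
    return kTileSize * kTilesPerRow
end
```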

The Luau compiler currently doesn't use type information to do further optimizations; however, early experiments suggest that we can extract further wins. Because we control the entire stack (unlike e.g. TypeScript, where the type information is discarded completely before reaching the VM), we have more flexibility there and can make some tradeoffs during codegen even if the type system isn't completely sound. For example, it might be reasonable to assume that in the presence of known types we can infer the absence of side effects for arithmetic operations and builtins - if the runtime types mismatch due to an intentional violation of type safety through global injection, the code will still be safely sandboxed; this may unlock optimizations such as common subexpression elimination and allocation hoisting without a JIT. This is speculative pending further research.
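
As an illustration, consider this hypothetical snippet. If the annotated types could be trusted, the repeated field loads below would have no observable side effects, making them candidates for common subexpression elimination:

```lua
type Vec2 = { x: number, y: number }

local function lengthSquared(v: Vec2): number
    -- v.x and v.y are each loaded twice; with trusted types these loads are
    -- side-effect free, so a future compiler could compute each one once
    return v.x * v.x + v.y * v.y
end
```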

Inline caching for table and global access

Table access for field lookup is optimized in Luau using a mechanism that blends inline caching (classically used in Java/JavaScript VMs) and HREFs (implemented in LuaJIT). The compiler can predict the hash slot used by a field lookup, and the VM can correct this prediction dynamically.

As a result, field access can be very fast in Luau, provided that:

  • The source code uses `table.field` notation. The compiler doesn't optimize `table[field]`, as it assumes that in this case `field` is not a string and/or can change between accesses. Because of this you should avoid using `table["field"]`, which isn't idiomatic anyway.
  • The field access doesn't use metatables. The fastest way to work with tables in Luau is to store fields directly inside the table and store methods in the metatable (see the sketch after this list); access to "static" fields in classic OOP designs is best done through `Class.StaticField` instead of `object.StaticField`.
  • The object structure is usually uniform. While it's possible to use the same function to access tables of different shapes - e.g. `function getX(obj) return obj.x end` can be used on any table that has a field `x` - it's best to not vary the keys used in the tables too much, as that defeats this optimization.
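
A minimal sketch that follows these guidelines (the `Point` class is hypothetical; names are illustrative):

```lua
local Point = {}
Point.__index = Point

function Point.new(x, y)
    -- fields are stored directly in the table, keeping the shape uniform
    return setmetatable({ x = x, y = y }, Point)
end

-- methods live in the metatable rather than in each instance
function Point:lengthSquared()
    -- self.x and self.y use field notation the compiler can predict
    return self.x * self.x + self.y * self.y
end

local p = Point.new(3, 4)
print(p:lengthSquared()) -- 25
```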

The same optimization is applied to custom globals declared in the script, although it's best to avoid these altogether by using locals instead. Still, this means that the difference between `function` and `local function` is less pronounced in Luau.
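
For example (illustrative only):

```lua
-- declares a global; calls resolve through the environment table,
-- but benefit from the same inline caching as field access
function globalHelper()
    return 42
end

-- declares a local; calls resolve through a register or upvalue,
-- which remains the preferred form
local function localHelper()
    return 42
end
```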

Importing global access chains

While global access for library functions can be optimized in a similar way, this optimization breaks down when the global table is sandboxed through metatables; and even when globals aren't sandboxed, `math.max` still requires two table accesses.

It's always possible to "localize" global accesses by writing `local max = math.max`, but this is cumbersome - in practice it's easy to forget to apply this optimization. To avoid relying on programmers remembering to do this, Luau implements a special optimization called "imports", where most global chains such as `math.max` are resolved when the script is loaded instead of when the script is executed.
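
To illustrate (a hypothetical `clamp` helper; with imports, both forms should perform similarly):

```lua
-- manual localization: caches the lookups in locals, at the cost of boilerplate
local max = math.max
local min = math.min

local function clampLocalized(x, lo, hi)
    return max(lo, min(x, hi))
end

-- idiomatic form: math.max and math.min are resolved at load time via imports,
-- so there's no per-call chain of table lookups
local function clampIdiomatic(x, lo, hi)
    return math.max(lo, math.min(x, hi))
end
```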

This optimization relies on being able to predict the shape of the environment table for a given function; this is possible due to global sandboxing. However, this optimization is invalid in some cases:

  • `loadstring` can load additional code that runs in the context of the caller's environment
  • `getfenv`/`setfenv` can directly modify the environment of any function

The use of any of these functions performs a dynamic deoptimization, marking the affected environment as "impure". The optimizations are only in effect on functions with "pure" environments - because of this, the use of `loadstring`/`getfenv`/`setfenv` is not recommended. Note that `getfenv` deoptimizes the environment even if it's only used to read values from the environment.
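
For example (illustrative; the exact deoptimization behavior is an implementation detail):

```lua
local function peek()
    -- even a read-only use of getfenv marks this environment as "impure",
    -- disabling the imports optimization for code that runs in it
    local env = getfenv(1)
    return env.math ~= nil
end
```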

Note: Luau still supports these functions as part of our backwards compatibility promise, although we'd love to switch to Lua 5.2's `_ENV` as that mechanism is cleaner and doesn't require costly dynamic deoptimization.

Calling built-in functions

Inserting into tables

Iterating through tables

Creating record-like tables

Native Vector3 math

Note: this optimization is still in progress, so this paragraph represents the desired end state

Fast memory allocator

Optimized garbage collector

Note: our garbage collector optimizations are still in progress, so this section doesn't document them.