luau/CodeGen at c5f4d973d75b17dd82dd7ad197860c96c211b884 - luau

mirror of https://github.com/luau-lang/luau.git synced 2024-11-15 06:15:44 +08:00

include	Improve A64 lowering for vector operations by using vector instructions (#1164)	2024-02-16 08:30:35 -08:00
src	Improve A64 lowering for vector operations by using vector instructions (#1164)	2024-02-16 08:30:35 -08:00

Arseny Kapoulkine c5f4d973d7

Improve A64 lowering for vector operations by using vector instructions (#1164)

This change replaces scalar versions of vector opcodes for A64 with
actual vector instructions.

We take an approach similar to X64: patch the last component with zero,
perform the math, then patch the last component with the type tag. I'm
hoping that in the future the type tag will be placed separately (as a
separate IR opcode?), because right now chains of math operations result
in excessive type tag operations.
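
The three steps above can be modelled outside the code generator. The sketch below is not Luau's implementation: it treats a vector register as four 32-bit lanes, assumes the tag value 4 seen in the `movz w17,#4` of the listing further down, and uses hypothetical helper names (`make_vector`, `mul_vec`).

```python
import struct

TAG_VECTOR = 4  # assumption: value taken from `movz w17,#4` in the listing


def f2b(x: float) -> int:
    """Return the float32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def b2f(bits: int) -> float:
    """Reinterpret an unsigned 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def make_vector(x: float, y: float, z: float) -> list:
    """Three float lanes followed by the integer type tag in lane 3."""
    return [f2b(x), f2b(y), f2b(z), TAG_VECTOR]


def mul_vec(a: list, b: list) -> list:
    # Step 1: patch the last component of both operands with zero.
    a = a[:3] + [0]
    b = b[:3] + [0]
    # Step 2: perform the math on all four lanes, as a 4-wide fmul would.
    out = [f2b(b2f(p) * b2f(q)) for p, q in zip(a, b)]
    # Step 3: patch the last component with the type tag again.
    out[3] = TAG_VECTOR
    return out


result = mul_vec(make_vector(1.5, 2.0, -3.0), make_vector(2.0, 0.5, 4.0))
print([b2f(v) for v in result[:3]], result[3])
```

Only small test values are used here; the model does not handle float32 overflow.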

To patch the type tag without always keeping a mask in a register,
ins.4s instructions can be used; unfortunately they can only patch a
register in-place, so we need an extra register copy when the value is
not at its last use. Usually it is the last use, so the patch is free;
with the IR rework mentioned above, all of this can probably be improved
(e.g. load-with-patch will never need to copy).
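
The last-use decision can be illustrated with a small sketch that produces assembly text. It is an illustration only, not the emitter in this change: `emit_tag_patch` and `alloc_temp` are hypothetical names, and the scratch register `w17` and tag value 4 are copied from the listing further down.

```python
def emit_tag_patch(src_reg: str, is_last_use: bool, alloc_temp) -> tuple:
    """Return (destination register, instructions) for patching the tag lane.

    `alloc_temp` is a hypothetical callable that hands out a free vector
    register name such as "v29".
    """
    code = []
    dst = src_reg
    if not is_last_use:
        # `ins` overwrites its destination, so a value that is still needed
        # later has to be copied to a fresh register first.
        dst = alloc_temp()
        code.append(f"mov {dst}.16b,{src_reg}.16b")
    # Materialize the tag in a scratch register and insert it into lane 3.
    code.append("movz w17,#4")
    code.append(f"ins {dst}.s[3],w17")
    return dst, code


# Last use: the patch happens in place, no copy is emitted.
print(emit_tag_patch("v31", True, lambda: "v29"))
# Not last use: one extra register copy precedes the patch.
print(emit_tag_patch("v31", False, lambda: "v29"))
```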

~It's not 100% clear if we *have* to patch type tag: Apple does preserve
denormals but we'd need to benchmark this to see if there's an actual
performance impact. But for now we're playing it safe.~
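
One possible reading of the denormal remark, not confirmed by the commit message: the tag sits in the fourth lane as a small integer, and a small integer bit pattern reinterpreted as float32 is a subnormal value. A quick check, assuming the tag value 4 from the `movz w17,#4` in the listing further down:

```python
import struct


def bits_to_float32(bits: int) -> float:
    """Reinterpret an unsigned 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


TAG_BITS = 4                        # assumption: tag value from the listing
SMALLEST_NORMAL_F32 = 2.0 ** -126   # smallest positive normal float32

tag_as_float = bits_to_float32(TAG_BITS)
# Non-zero but below the smallest normal float32 means subnormal (denormal).
is_subnormal = 0.0 < tag_as_float < SMALLEST_NORMAL_F32
print(tag_as_float, is_subnormal)
```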

This was tested by running the conformance tests, and new opcode
implementations were checked by comparing the result with
https://armconverter.com/.

Performance testing is complicated by the fact that OSS Luau doesn't
support a vector constructor out of the box, and by other limitations of
codegen. I've hacked a vector constructor/type into the REPL and
confirmed that on a test that calls this function in a loop (not
inlined):

```
function fma(a: vector, b: vector, c: vector)
        return a * b + c
end
```

... this PR improves performance by ~6% (note that probably most of the
overhead here is the call dispatch; I didn't want to brave testing a
more complex expression). The assembly for an individual operation
changes as follows:

Before:

```
#   %14 = MUL_VEC %12, %13                                    ; useCount: 2, lastUse: %22
 dup         s29,v31.s[0]
 dup         s28,v30.s[0]
 fmul        s29,s29,s28
 ins         v31.s[0],v29.s[0]
 dup         s29,v31.s[1]
 dup         s28,v30.s[1]
 fmul        s29,s29,s28
 ins         v31.s[1],v29.s[0]
 dup         s29,v31.s[2]
 dup         s28,v30.s[2]
 fmul        s29,s29,s28
 ins         v31.s[2],v29.s[0]
```

After:

```
#   %14 = MUL_VEC %12, %13                                    ; useCount: 2, lastUse: %22
 ins         v31.s[3],w31
 ins         v30.s[3],w31
 fmul        v31.4s,v31.4s,v30.4s
 movz        w17,#4
 ins         v31.s[3],w17
```

**Edit:** final form (see comments):

```
#   %14 = MUL_VEC %12, %13                                    ; useCount: 2, lastUse: %22
 fmul        v31.4s,v31.4s,v30.4s
 movz        w17,#4
 ins         v31.s[3],w17
```

2024-02-16 08:30:35 -08:00
