luau/CodeGen at master - luau

mirrors/luau


mirror of https://github.com/luau-lang/luau.git synced 2024-11-15 06:15:44 +08:00


include	CodeGen: Rewrite dot product lowering using a dedicated IR instruction (#1512 )	2024-11-08 16:23:09 -08:00
src	CodeGen: Rewrite dot product lowering using a dedicated IR instruction (#1512 )	2024-11-08 16:23:09 -08:00

Arseny Kapoulkine e6bf71871a

CodeGen: Rewrite dot product lowering using a dedicated IR instruction (#1512 )

Instead of doing the dot product related math in scalar IR, we lift the
computation into a dedicated IR instruction.

On x64, we can use VDPPS, which was more or less tailor-made for this
purpose. This is better than manual scalar lowering, which requires
reloading components from memory. It's not always a strict improvement
over a shuffle+add version (which we never had), but the sequence can
now be tuned in the IR lowering (maybe even based on CPU vendor,
although that would create issues for offline compilation).
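
To illustrate the difference, here is a minimal portable sketch that
models what a DPPS/VDPPS-style instruction computes next to the scalar
form. It is not the Luau lowering itself: the names `Vec4`, `dot3Scalar`
and `dppsModel` are hypothetical, and the `0x71` mask used in the usage
note is an assumption chosen for a 3-component vector, not a value taken
from this commit.

```cpp
#include <array>
#include <cstdint>

// Hypothetical 4-lane float vector; lane 3 is unused by a 3-component vector.
using Vec4 = std::array<float, 4>;

// Scalar reference: three multiplies and two adds on individual components.
float dot3Scalar(const Vec4& a, const Vec4& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Portable model of the DPPS/VDPPS immediate:
//   bits 7:4 select which lanes take part in the multiply and sum,
//   bits 3:0 select which result lanes receive the sum (others become 0).
// The hardware may add the products in a different order, so rounding
// can differ from this model for non-exact inputs.
Vec4 dppsModel(const Vec4& a, const Vec4& b, std::uint8_t imm8)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
    {
        if (imm8 & (0x10 << i))
            sum += a[i] * b[i];
    }

    Vec4 result{};
    for (int i = 0; i < 4; ++i)
        result[i] = (imm8 & (1 << i)) ? sum : 0.0f;

    return result;
}
```

With the assumed mask `0x71`, lanes 0 to 2 are multiplied and summed and
the result lands in lane 0, so whatever sits in lane 3 of either input is
ignored and no per-component reload is needed.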

On A64, we can use either naive adds or paired adds, as there is no
dedicated vector-wide horizontal instruction until SVE. Both perform
about the same on M2, but paired adds require fewer instructions and
temporaries.
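
The two A64 options can be sketched in portable C++ as follows. This
only models the arithmetic. The helper names (`mul4`, `faddp`,
`dot3Naive`, `dot3Paired`) are hypothetical, and clearing the unused
fourth lane before the paired adds is an assumption of this sketch, not
something the commit message states.

```cpp
#include <array>

// Hypothetical 4-lane float vector; lane 3 is unused by a 3-component vector.
using Vec4 = std::array<float, 4>;

// Model of a lane-wise multiply on a 4-lane vector.
Vec4 mul4(const Vec4& a, const Vec4& b)
{
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}

// Model of a vector pairwise add: adjacent pairs of `lo` fill the low
// half of the result, adjacent pairs of `hi` fill the high half.
Vec4 faddp(const Vec4& lo, const Vec4& hi)
{
    return {lo[0] + lo[1], lo[2] + lo[3], hi[0] + hi[1], hi[2] + hi[3]};
}

// Naive adds: pull the three products out as scalars and add them.
float dot3Naive(const Vec4& a, const Vec4& b)
{
    Vec4 p = mul4(a, b);
    return (p[0] + p[1]) + p[2];
}

// Paired adds: two pairwise reductions leave the full sum in lane 0.
float dot3Paired(const Vec4& a, const Vec4& b)
{
    Vec4 p = mul4(a, b);
    p[3] = 0.0f;          // assumption: the unused lane must not contribute
    Vec4 s = faddp(p, p); // lane 0 = p0 + p1, lane 1 = p2 + p3
    Vec4 t = faddp(s, s); // lane 0 = (p0 + p1) + (p2 + p3)
    return t[0];
}
```

Both functions return the same value for exactly representable inputs;
the paired form keeps the whole reduction in vector registers instead of
extracting each product into its own temporary.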

I've measured this using the mesh-normal-vector benchmark, changing it
to report only the time of the second loop inside `calculate_normals`
and increasing the grid size to 400 for more stable timings. I compared
master, #1504, and this PR.

On Zen 4 (7950X), this PR is comfortably ~8% faster vs master, while I
see neutral to negative results in #1504.
On M2 (base), this PR is ~28% faster vs master, while #1504 is only
~10% faster.

If I measure the second loop in `calculate_tangent_space` instead, I
get:

On Zen 4 (7950X), this PR is ~12% faster vs master, while #1504 is ~3%
faster.
On M2 (base), this PR is ~24% faster vs master, while #1504 is only
~13% faster.

Note that the loops in question are not quite optimal, as they store
various vectors to dictionary values and reload them instead of keeping
them in locals. The underlying gains in individual functions are thus
larger than the numbers above. For example, if I change the
`calculate_normals` loop to keep the normalized vector in a local
variable (while still saving the result to a dictionary value), this PR
is ~24% faster than master on Zen 4 instead of just ~8% (#1504 is ~15%
slower in this setup).

2024-11-08 16:23:09 -08:00
