Improvement coefficients vary with TLS fragment length and platform, on
most Intel processors maximum improvement is ~50%, while on Ryzen - 80%.
The "secret" is new dedicated ChaCha20_128 code path and vectorized xor
helpers.
Reviewed-by: Rich Salz <rsalz@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/6638)
Hardware used for benchmarking courtesy of Atos, experiments run by
Romain Dolbeau <romain.dolbeau@atos.net>. Kudos!
Reviewed-by: Rich Salz <rsalz@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/4855)
Convert AVX512F+VL+BW code path to pure AVX512F, so that it can be
executed even on Knights Landing. Trigger for modification was
observation that AVX512 code paths can negatively affect overall
Skylake-X system performance. Since we are likely to suppress
AVX512F capability flag [at least on Skylake-X], conversion serves
as kind of "investment protection".
Reviewed-by: Rich Salz <rsalz@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/4758)
Around 138 distinct errors found and fixed; thanks!
Reviewed-by: Kurt Roeckx <kurt@roeckx.be>
Reviewed-by: Tim Hudson <tjh@openssl.org>
Reviewed-by: Rich Salz <rsalz@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/3459)
"Optimize" is in quotes because it's rather a "salvage operation"
for now. Idea is to identify processor capability flags that
drive Knights Landing to suboptimial code paths and mask them.
Two flags were identified, XSAVE and ADCX/ADOX. Former affects
choice of AES-NI code path specific for Silvermont (Knights Landing
is of Silvermont "ancestry"). And 64-bit ADCX/ADOX instructions are
effectively mishandled at decode time. In both cases we are looking
at ~2x improvement.
AVX-512 results cover even Skylake-X :-)
Hardware used for benchmarking courtesy of Atos, experiments run by
Romain Dolbeau <romain.dolbeau@atos.net>. Kudos!
Reviewed-by: Rich Salz <rsalz@openssl.org>
As hinted by its name new subroutine processes 8 input blocks in
parallel by loading data to 512-bit registers. It still needs more
work, as it needs to handle some specific input lengths better.
In this sense it's yet another intermediate step...
Reviewed-by: Rich Salz <rsalz@openssl.org>
As hinted by its name new subroutine processes 4 input blocks in
parallel. It still operates on 256-bit registers and is just
another step toward full-blown AVX512IFMA procedure.
Reviewed-by: Rich Salz <rsalz@openssl.org>
On pre-Skylake best optimization strategy was balancing port-specific
instructions, while on Skylake minimizing the sheer amount appears
more sensible.
Reviewed-by: Rich Salz <rsalz@openssl.org>
Formally only 32-bit AVX2 code path needs this, but I choose to
harmonize all vector code paths.
RT#4346
Reviewed-by: Richard Levitte <levitte@openssl.org>