... script data load.
On a related note, an attempt was made to merge rotations with logical
operations. As is well known, the ARM ISA has merged rotate-and-logical
instructions that can be used here, and they were used to improve
keccak1600-armv4 performance, but they are not used here. Even though
this approach resulted in an improvement on Cortex-A53 proportional to
the reduction in the number of instructions, ~8%, it didn't quite work
out on non-Cortex cores, presumably because those cores break merged
instructions into separate μ-ops, which results in a higher *operation*
count. X-Gene and Denver went ~20% slower, and Apple A7 40% slower. The
optimization was therefore dismissed.
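For illustration, a Keccak theta-step pattern that is a natural candidate
for such merging is D[x] = C[x-1] ^ ROL(C[x+1], 1). The C sketch below is
only an assumption of what "rotate merged with logical" looks like at the
source level. The helper names rol64 and theta_d are hypothetical and are
not taken from the actual assembly module. On AArch64 the XOR of a rotated
register can in principle be encoded as a single EOR with a ROR-shifted
operand (ROL by 1 equals ROR by 63). That is one instruction, but possibly
two μ-ops on cores that split it.

```c
#include <stdint.h>

/* Rotate a 64-bit lane left by n bits (0 <= n < 64), avoiding UB at n == 0. */
static uint64_t rol64(uint64_t v, unsigned n)
{
    return (v << n) | (v >> ((64 - n) & 63));
}

/* Hypothetical helper: compute Keccak theta D[] from column parities C[].
 * Each D[x] is "xor with a rotated operand", the shape that a merged
 * rotate-and-logical instruction could express in one instruction. */
static void theta_d(const uint64_t C[5], uint64_t D[5])
{
    for (int x = 0; x < 5; x++)
        D[x] = C[(x + 4) % 5] ^ rol64(C[(x + 1) % 5], 1);
}
```

Whether a compiler actually emits the merged form depends on the compiler
and target. As noted above, fewer instructions did not translate into
fewer operations on every core.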
Reviewed-by: Rich Salz <rsalz@openssl.org>