ghash-x86_64.pl: minimize stack frame usage.
ghash-x86.pl: modulo-scheduling MMX loop in respect to input vector
results in up to 10% performance improvement.
Submitted by: "Noszticzius, Istvan" <inoszticzius@rightnow.com>
Don't clear the output buffer: ciphers should correctly the same input
and output buffers.
rely on indirect call to block functions, they are not as fast as
non-consolidated routines. However, performance loss(*) is within
measurement error and consolidation advantages are considered to
outweigh it.
(*) actually one can observe performance *improvement* on e.g.
CBC benchmarks thanks to optimization, which also becomes
shared among ciphers.