Even though no test could be found to trigger this, a paper-n-pencil
estimate suggests that the x86 and ARM inner-loop lazy reductions can
lose a bit in the H4>>*5+H0 step.
Reviewed-by: Emilia Käsper <emilia@openssl.org>
Formally, only the 32-bit AVX2 code path needs this, but I chose to
harmonize all vector code paths.
RT#4346
Reviewed-by: Richard Levitte <levitte@openssl.org>