actually be used, namely x86* platforms [because they don't bomb on
unaligned access]. This resulted in 30-40% [depending on message
length] improvement for SHA-256 compiled with gcc and running on P4.
In the lack of assembler implementation I give the compiler all the
help it can possibly get:-)