certain mix of calls to RC4 routine not covered by rc4test.c.
It's fixed now. In addition this patch inadvertently fixes minor
performance problem: in 0.9.7 context P4 was performing 12% slower
than the original implementation...
apparently impossible to compose blended code with would perform
satisfactory on all x86 and x86_64 cores, an extra RC4_CHAR
code-path is introduced and P4 core is detected at run-time. This
way we keep original performance on non-P4 implementations and
turbo-charge P4 performance by factor of 2.8x (on 32-bit core).