Instead of buffering an entire block, buffer only when the input is not
aligned to 8 bytes; otherwise, xor uint64-sized chunks directly into
the state.
The code is a little more complicated, but I think it's worth it.
We could eliminate the buffer entirely, but that would require either
shenanigans with unsafe or fiddly code to xor partial uint64s.
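For reference, xoring a partial uint64 without a buffer would look
something like this (a hypothetical sketch of the byte-shifting code the
small buffer lets us avoid; xorPartial is not a name from the actual
implementation):

```go
package main

// xorPartial xors the first len(p) bytes of p (fewer than 8) into the
// low-order bytes of the state word *w, little-endian. Hypothetical
// helper, shown only to illustrate the fiddliness being avoided.
func xorPartial(w *uint64, p []byte) {
	for i, b := range p {
		*w ^= uint64(b) << (8 * uint(i))
	}
}

func main() {
	var w uint64
	xorPartial(&w, []byte{0x01, 0x02, 0x03})
	_ = w // w is now 0x030201
}
```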
A caveat is that the implementation now only supports sponge capacities
that are a multiple of 8. That's fine for the standard instantiations,
but it may restrict unusual applications.
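The absorb path described above might look roughly like this (a minimal
sketch, not the actual implementation: the sponge struct, its fields, and
the permute stub are all illustrative, and permute stands in for the real
Keccak-f[1600] permutation):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hypothetical sponge state: an array of uint64 words plus an 8-byte
// buffer that holds at most one partial word between Writes.
type sponge struct {
	a    [25]uint64 // Keccak state, as 64-bit words
	buf  [8]byte    // at most one partial uint64 between Writes
	n    int        // bytes currently buffered in buf
	i    int        // index of the next state word to absorb into
	rate int        // sponge rate in bytes; must be a multiple of 8
}

// permute is a stand-in for the Keccak-f[1600] permutation.
func (s *sponge) permute() {
	// ... permutation omitted ...
	s.i = 0
}

func (s *sponge) Write(p []byte) (int, error) {
	written := len(p)

	// Complete a partial word left over from a previous Write, if any.
	if s.n > 0 {
		n := copy(s.buf[s.n:], p)
		s.n += n
		p = p[n:]
		if s.n < 8 {
			return written, nil // still not a full word
		}
		s.a[s.i] ^= binary.LittleEndian.Uint64(s.buf[:])
		s.i++
		s.n = 0
		if s.i*8 == s.rate {
			s.permute()
		}
	}

	// Fast path: xor full uint64-sized chunks directly into the state,
	// with no intermediate copy.
	for len(p) >= 8 {
		s.a[s.i] ^= binary.LittleEndian.Uint64(p)
		s.i++
		p = p[8:]
		if s.i*8 == s.rate {
			s.permute()
		}
	}

	// Buffer the unaligned tail (fewer than 8 bytes).
	s.n = copy(s.buf[:], p)
	return written, nil
}

func main() {
	s := &sponge{rate: 168} // SHAKE128-style rate, for illustration
	s.Write([]byte("ABCDEFGH"))
	fmt.Printf("%#x\n", s.a[0]) // 0x4847464544434241
}
```

Note that the buffer only ever holds a partial word, which is why it can
shrink from the full 200-byte state size to 8 bytes.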
Not only does this let us reduce the buffer from 200 bytes to 8, it
also provides a nice speedup:
name      old time/op    new time/op    delta
256_8-2   1.45µs ± 0%    1.28µs ± 1%    -11.58%  (p=0.000 n=10+10)
256_1k-2  10.1µs ± 0%     9.3µs ± 0%     -7.67%  (p=0.000 n=10+10)
256_8k-2  75.6µs ± 0%    70.2µs ± 1%     -7.09%  (p=0.000 n=10+10)
512_8-2   1.39µs ± 1%    1.29µs ± 1%     -6.85%  (p=0.000 n=10+10)
512_1k-2  18.7µs ± 0%    17.0µs ± 0%     -8.70%  (p=0.000 n=9+10)
512_8k-2   146µs ± 1%     129µs ± 0%    -11.70%  (p=0.000 n=10+9)

name      old speed      new speed      delta
256_8-2   5.53MB/s ± 0%  6.25MB/s ± 0%  +13.06%  (p=0.000 n=10+10)
256_1k-2   102MB/s ± 0%   110MB/s ± 0%   +8.30%  (p=0.000 n=10+10)
256_8k-2   108MB/s ± 0%   117MB/s ± 1%   +7.64%  (p=0.000 n=10+10)
512_8-2   5.78MB/s ± 1%  6.20MB/s ± 1%   +7.32%  (p=0.000 n=10+10)
512_1k-2  54.9MB/s ± 0%  60.1MB/s ± 0%   +9.53%  (p=0.000 n=9+10)
512_8k-2  56.1MB/s ± 1%  63.5MB/s ± 0%  +13.26%  (p=0.000 n=10+9)
This also makes it more similar to the templatized code in gen.go. This
isn't the optimized code path, so performance doesn't matter there.