good example of how a merge can have no conflicts yet nevertheless
require a bunch of adjustments to get something that actually works.
these two changes are conceptually independent but since shrink-buf
changes the internal representation of a digest, the shake code needs to
be adjusted to match. i wanted to capture the code both before and
after the shrink-buf change, thus this merge.
there's a bunch of code duplication that i will clean up later.
instead of buffering an entire block, buffer only when the input is not
aligned to 8 bytes, and otherwise xor uint64-sized chunks directly into
the state.
the code is a little more complicated but i think it's worth it.
we could eliminate the buffer entirely but that requires either
shenanigans with unsafe, or fiddly code to xor partial uint64s.
a caveat is that the implementation now only supports sponge capacities
that are a multiple of 8. that's fine for the standard instantiations
but may restrict unusual applications.
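
to illustrate the caveat: with a 200-byte state, rate = 200 - capacity,
so a capacity that is a multiple of 8 bytes gives a rate that is a whole
number of uint64 lanes. the helper below is a hypothetical check, not
code from this change; rateFor and stateBytes are made-up names.

```go
package main

import "fmt"

const stateBytes = 200 // keccak-f[1600] state size in bytes

// rateFor returns the rate in bytes for a capacity given in bytes.
// It rejects capacities that would leave a rate with a partial lane.
func rateFor(capacity int) (int, error) {
	if capacity <= 0 || capacity >= stateBytes {
		return 0, fmt.Errorf("capacity %d out of range", capacity)
	}
	if capacity%8 != 0 {
		return 0, fmt.Errorf("capacity %d is not a multiple of 8", capacity)
	}
	return stateBytes - capacity, nil
}

func main() {
	// 32, 64 and 128 are the capacities of shake128, sha3-256 and sha3-512;
	// 65 is an arbitrary unsupported value.
	for _, c := range []int{32, 64, 128, 65} {
		r, err := rateFor(c)
		fmt.Println(c, r, err)
	}
}
```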
not only does this let us reduce the buffer from 200 bytes to 8,
it also provides a nice speedup:

name      old time/op    new time/op    delta
256_8-2   1.45µs ± 0%    1.28µs ± 1%    -11.58%  (p=0.000 n=10+10)
256_1k-2  10.1µs ± 0%    9.3µs ± 0%     -7.67%   (p=0.000 n=10+10)
256_8k-2  75.6µs ± 0%    70.2µs ± 1%    -7.09%   (p=0.000 n=10+10)
512_8-2   1.39µs ± 1%    1.29µs ± 1%    -6.85%   (p=0.000 n=10+10)
512_1k-2  18.7µs ± 0%    17.0µs ± 0%    -8.70%   (p=0.000 n=9+10)
512_8k-2  146µs ± 1%     129µs ± 0%     -11.70%  (p=0.000 n=10+9)

name      old speed      new speed      delta
256_8-2   5.53MB/s ± 0%  6.25MB/s ± 0%  +13.06%  (p=0.000 n=10+10)
256_1k-2  102MB/s ± 0%   110MB/s ± 0%   +8.30%   (p=0.000 n=10+10)
256_8k-2  108MB/s ± 0%   117MB/s ± 1%   +7.64%   (p=0.000 n=10+10)
512_8-2   5.78MB/s ± 1%  6.20MB/s ± 1%  +7.32%   (p=0.000 n=10+10)
512_1k-2  54.9MB/s ± 0%  60.1MB/s ± 0%  +9.53%   (p=0.000 n=9+10)
512_8k-2  56.1MB/s ± 1%  63.5MB/s ± 0%  +13.26%  (p=0.000 n=10+9)

(reroll? no, that's something else)
makes it more similar to the templatized code in gen.go. this isn't the
optimized code, so performance doesn't matter.