Commit 7f8e5f84 authored by Gaëtan Cassiers

Add README and PERFORMANCE

parent 541a1ff8
# Performance evaluation
Performance has been evaluated for the two most relevant primitives (Clyde128 and Shadow512) and for the whole AEAD mode (Spook128su512v1).
Performance for the other Spook modes has not been evaluated yet, but Spook128mu512v1 is expected to have the same performance characteristics as Spook128su512v1.
## Build
The code was compiled with gcc 8.2 for the following targets:
* generic `x86_64`
* haswell
* skylake-avx512
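Which primitive implementations can be built depends on the instruction-set extensions of the target: according to the README, `clyde_64bit.c` needs BMI2, `shadow_256bit.c` needs AVX2 and `shadow_512bit.c` needs AVX512F. As a minimal sketch (not part of the repository; the `impl_available` and `print_available` helpers are hypothetical), the following C code reports this at compile time using the macros GCC predefines for the enabled extensions (`__BMI2__`, `__AVX2__`, `__AVX512F__`).

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: can the named implementation be compiled for the
 * current -march target? Requirements follow the README's descriptions. */
static bool impl_available(const char *name)
{
    /* Portable C99 and GCC vector-extension implementations. */
    if (strcmp(name, "clyde_32bit") == 0 || strcmp(name, "shadow_32bit") == 0 ||
        strcmp(name, "shadow_128bit") == 0)
        return true;
#ifdef __BMI2__
    if (strcmp(name, "clyde_64bit") == 0)
        return true; /* uses BMI2 intrinsics */
#endif
#ifdef __AVX2__
    if (strcmp(name, "shadow_256bit") == 0)
        return true; /* uses AVX2 intrinsics */
#endif
#ifdef __AVX512F__
    if (strcmp(name, "shadow_512bit") == 0)
        return true; /* uses AVX512F intrinsics */
#endif
    return false;
}

/* Print one line per implementation: its name and whether it can be built. */
static void print_available(void)
{
    static const char *const names[] = {
        "clyde_32bit",   "clyde_64bit",   "shadow_32bit",
        "shadow_128bit", "shadow_256bit", "shadow_512bit",
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        printf("%-14s %s\n", names[i], impl_available(names[i]) ? "yes" : "no");
}
```

With GCC, `-march=haswell` enables BMI2 and AVX2, `-march=skylake-avx512` additionally enables AVX512F, and `-march=x86-64` enables none of them.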
## Intel IACA
Below are the cycle-count estimates given by the [IACA 3.0](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer) tool for various primitive implementations.
Clyde128 (estimated cycles):
|Implementation|x86-64|haswell|skylake-avx512|
|-|-|-|-|
|clyde_32bit|372.66|252.30|252.30|
|clyde_64bit| |210.00|210.00|
Shadow512 (estimated cycles):
|Implementation|x86-64|haswell|skylake-avx512|
|-|-|-|-|
|shadow_128bit|325.92|333.78|225.78|
|shadow_256bit| |295.56|242.22|
|shadow_32bit| | | |
|shadow_512bit| | |192.00|
## Benchmark
The benchmark was run on an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz.
The reported performance figures are cycles per execution for the primitives, and throughput (cycles per byte) for Spook (for large messages).
Clyde128 (cycles per execution):
|Implementation|x86-64|haswell|skylake-avx512|
|-|-|-|-|
|clyde_32bit|317.20|283.40|283.20|
|clyde_64bit| |271.00|271.40|
Shadow512 (cycles per execution):
|Implementation|x86-64|haswell|skylake-avx512|
|-|-|-|-|
|shadow_128bit|408.80|396.60|304.40|
|shadow_256bit| |432.20|312.40|
|shadow_32bit|904.40|456.60|342.40|
|shadow_512bit| | |454.20|
Spook128su512v1 (cycles per byte):
|Implementation (Clyde-Shadow)|x86-64|haswell|skylake-avx512|
|-|-|-|-|
|clyde_32bit-shadow_128bit|12.70|12.88|10.06|
|clyde_32bit-shadow_32bit|159.56|100.98|11.33|
|clyde_64bit-shadow_128bit| |12.87|10.06|
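How the benchmark scripts compute the figures above is not described here. As a hedged sketch of one common approach (not necessarily what `prim_bench.sh` or `spook_bench.sh` do; all function names are hypothetical), the C helpers below take the median of repeated cycle measurements and compute large-message throughput as the slope between two message lengths, so that fixed per-message costs such as initialization and tag computation cancel out. On x86, `__rdtsc()` counts time-stamp-counter ticks, which may differ from core cycles when frequency scaling is active.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Time-stamp counter: reference ticks, not necessarily core cycles. */
static inline uint64_t read_cycles(void) { return __rdtsc(); }
#endif

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n > 0 samples (upper median for even n); sorts in place.
 * The median is robust against outliers caused by interrupts. */
static uint64_t median_cycles(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}

/* Asymptotic throughput in cycles per byte: the slope between a short and a
 * long message, which cancels the constant per-message overhead.
 * Requires len_long > len_short and cycles_long >= cycles_short. */
static double cycles_per_byte(uint64_t cycles_short, size_t len_short,
                              uint64_t cycles_long, size_t len_long)
{
    return (double)(cycles_long - cycles_short) / (double)(len_long - len_short);
}
```

A caller would time many encryptions of a short and a long message with `read_cycles()` before and after each call, reduce each set of samples with `median_cycles`, and pass both medians to `cycles_per_byte`.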
# Spook High End Implementations
Spook software implementations for high-end microprocessors, from generic 32-bit targets to x86_64 processors with AVX512.
## Implementations
All the implementations share the same code for the S1P mode of operation, but use different Clyde and Shadow primitive implementations.
Clyde:
* `clyde_32bit.c`: A straightforward, standard and portable C99 implementation. It is based on the reference implementation with changes to enable better compiler optimizations.
* `clyde_64bit.c`: This implementation puts the Clyde state in two 64-bit registers, where each register interleaves the bits of two consecutive rows. The 2x32-bit L-box is implemented using 64-bit rotations and XORs. This implementation uses compiler intrinsics to access instructions of the `BMI2` extension of the `x86_64` instruction set and improves performance by about 4% compared to `clyde_32bit.c`.
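The interleaved layout can be illustrated with a minimal, self-contained C sketch (not the repository's code; all function names are hypothetical). Row `x` occupies the even bit positions of a 64-bit word and row `y` the odd ones. Rotating the word left by `2r` rotates both 32-bit rows left by `r`; rotating it by `2r+1` places `y` rotated by `r+1` in the even positions and `x` rotated by `r` in the odd positions, so cross-row terms of a 2x32-bit linear layer can come from a single 64-bit rotation. XOR acts on both rows at once. With BMI2, the (de)interleaving maps onto `_pdep_u64` and `_pext_u64`; whether `clyde_64bit.c` uses these particular instructions is an assumption.

```c
#include <stdint.h>
#ifdef __BMI2__
#include <immintrin.h>
#endif

#define EVEN_BITS 0x5555555555555555ull

static uint32_t rot32(uint32_t x, unsigned r)
{
    r &= 31;
    return r ? (x << r) | (x >> (32 - r)) : x;
}

static uint64_t rot64(uint64_t x, unsigned r)
{
    r &= 63;
    return r ? (x << r) | (x >> (64 - r)) : x;
}

#ifdef __BMI2__
/* BMI2: deposit / extract the bits selected by a mask, one instruction each. */
static uint64_t spread_even(uint32_t x) { return _pdep_u64(x, EVEN_BITS); }
static uint32_t compact_even(uint64_t v) { return (uint32_t)_pext_u64(v, EVEN_BITS); }
#else
/* Portable fallback: move bit i of x to bit 2i of the result. */
static uint64_t spread_even(uint32_t x)
{
    uint64_t v = x;
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFull;
    v = (v | (v << 8)) & 0x00FF00FF00FF00FFull;
    v = (v | (v << 4)) & 0x0F0F0F0F0F0F0F0Full;
    v = (v | (v << 2)) & 0x3333333333333333ull;
    v = (v | (v << 1)) & EVEN_BITS;
    return v;
}

/* Inverse of spread_even: gather bit 2i of v into bit i of the result. */
static uint32_t compact_even(uint64_t v)
{
    v &= EVEN_BITS;
    v = (v | (v >> 1)) & 0x3333333333333333ull;
    v = (v | (v >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    v = (v | (v >> 4)) & 0x00FF00FF00FF00FFull;
    v = (v | (v >> 8)) & 0x0000FFFF0000FFFFull;
    v = (v | (v >> 16)) & 0x00000000FFFFFFFFull;
    return (uint32_t)v;
}
#endif

/* Row x in the even bits, row y in the odd bits. */
static uint64_t interleave(uint32_t x, uint32_t y)
{
    return spread_even(x) | (spread_even(y) << 1);
}

static void deinterleave(uint64_t w, uint32_t *x, uint32_t *y)
{
    *x = compact_even(w);
    *y = compact_even(w >> 1);
}
```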
Implementations of the inverse of Clyde are available (`clyde_32bit_inv.c` and `clyde_64bit_inv.c`) but are not used, since they are not useful in implementations without side-channel countermeasures and perform worse than their direct counterparts (due to the properties of the S-box and L-box functions).
Shadow:
* `shadow_32bit.c`: A straightforward, standard and portable C99 implementation. It is based on the reference implementation with changes to enable better compiler optimizations.
* `shadow_128bit.c`: This implementation uses GCC vector extensions to achieve architecture portability. The Shadow state is put in four 128-bit registers, where each register contains one row of each bundle. This implementation exhibits the best performance and is thus recommended on platforms that have 128-bit registers.
* `shadow_256bit.c` and `shadow_512bit.c`: These implementations use AVX2 and AVX512F intrinsics, respectively. They usually perform worse than the 128-bit implementation. They do not cover the Shadow384 case (only Shadow512).
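The row-per-register layout of `shadow_128bit.c` can be sketched with GCC's vector extensions, which Clang also supports. In this illustration (not the repository's code; all names are hypothetical), `bundles[b][r]` is row `r` of bundle `b`, and vector `rows[r]` holds row `r` of all four bundles, one bundle per 32-bit lane. A per-bundle operation such as a row rotation then becomes a single lane-wise vector operation applied to the four bundles at once.

```c
#include <stdint.h>

/* Four 32-bit lanes in one 128-bit vector (GCC vector extension). */
typedef uint32_t u32x4 __attribute__((vector_size(16)));

enum { BUNDLES = 4, ROWS = 4 }; /* Shadow512: 4 bundles x 4 rows x 32 bits */

/* Transpose bundle-major state into row vectors:
 * lane b of rows[r] is row r of bundle b. */
static void to_rows(uint32_t bundles[BUNDLES][ROWS], u32x4 rows[ROWS])
{
    for (int r = 0; r < ROWS; r++)
        for (int b = 0; b < BUNDLES; b++)
            rows[r][b] = bundles[b][r];
}

/* Inverse transposition back to bundle-major order. */
static void from_rows(const u32x4 rows[ROWS], uint32_t bundles[BUNDLES][ROWS])
{
    for (int r = 0; r < ROWS; r++)
        for (int b = 0; b < BUNDLES; b++)
            bundles[b][r] = rows[r][b];
}

/* Rotate every 32-bit lane left by r (1 <= r <= 31); the scalar shift
 * count applies to all lanes. */
static u32x4 rotl_lanes(u32x4 v, unsigned r)
{
    return (v << r) | (v >> (32 - r));
}

/* Illustrative per-bundle step (not part of Shadow): row0 ^= rotl(row1, 7)
 * in every bundle, performed with one vector rotation and one vector XOR. */
static void example_step(u32x4 rows[ROWS])
{
    rows[0] ^= rotl_lanes(rows[1], 7);
}
```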
These primitive implementations perform much better than the reference implementation; however, they are not fully optimized either, since portability and code simplicity are also a concern.
Code size is not optimized at all; for a smaller code size, please look at the portable embedded implementation.
## Build
Have a look at `test/Makefile`.
## Test
```sh
$ cd test
$ ./test.sh # Compiler errors are expected when building shadow_256bit.c and shadow_512bit.c with SMALL_PERM=1 (these implementations only support Shadow512)
```
## Benchmarking
```sh
$ cd bench
$ ./prim_bench.sh
$ ./spook_bench.sh
```
A few results are shown in [PERFORMANCE](PERFORMANCE.md).
## Contributing
Contributions of any kind (code, bug reports, benchmarks, ...) are welcome. Please contact us at `team@spook.dev`.