Home made crypto

Using home made crypto for a non-cryptographic problem

Stockholm C++ 20200123

Paul Dreik

## About me - const west - let's break ABI in C++ 23 - I like unsigned integers - Greta is right

About Me

"Don't make your own crypto"

Dont TRUST your own crypto.

Don't use home made crypto for cryptographic purposes

unless you write malware
serious motivation

From some random code base

01: int
02: count_set_bits(unsigned int v)
03: {
04:   const int S[] = { 1, 2, 4, 8, 16 };
05:   const int B[] = { 0x55555555,
06:                     0x33333333,
07:                     0x0F0F0F0F,
08:                     0x00FF00FF,
09:                     0x0000FFFF };
10: 
11:   unsigned int c;
12:   c = v - ((v >> 1) & B[0]);
13:   c = ((c >> S[1]) & B[1]) + (c & B[1]);
14:   c = ((c >> S[2]) + c) & B[2];
15:   c = ((c >> S[3]) + c) & B[3];
16:   c = ((c >> S[4]) + c) & B[4];
17:   return c;
18: }
19: 
credit: https://graphics.stanford.edu/~seander/bithacks.html

Write a unit test

1: int
2: main()
3: {
4:   assert(count_set_bits(0) == 0);
5:   assert(count_set_bits(1) == 1);
6:   assert(count_set_bits(1 << 10) == 1);
7:   assert(count_set_bits(0xF) == 4);
8: }
9: 

Exhaustive testing

1: void
2: exhaustive_test()
3: {
4:   unsigned int = 0;
5:   do {
6:     // .......
7:   } while (++i);
8: }
9: 

Exhaustive testing

01: int
02: main()
03: {
04:   unsigned int i = 0;
05:   do {
06:     auto actual = count_set_bits(i);
07:     auto expected = __builtin_popcount(i);
08:     assert(actual == expected);
09:   } while (++i);
10: }
11: 
Demo

Exploration

What if the interesting bug happens only when the leading bit is set?

Xor

01: int
02: main()
03: {
04:   unsigned int i = 0;
05:   const unsigned int flip = std::random_device{}();
06:   do {
07:     const unsigned int j = i ^ flip;
08:     auto actual = count_set_bits(j);
09:     auto expected = __builtin_popcount(j);
10:     assert(actual == expected);
11:   } while (++i);
12: }
13: 
Xor Demo

Use random

01: #include <random>
02: 
03: int
04: main()
05: {
06:   std::mt19937_64 gen;
07:   gen.seed(42);
08: 
09:   for (;;) {
10:     auto x = gen();
11:     auto actual = count_set_bits(x);
12:     auto expected = __builtin_popcountl(x);
13:     assert(actual == expected);
14:   }
15: }
16: 
Random Demo

Good exploration

Only beaten by fuzzing tricks

Analyze the waste

Notation

M possible inputs ($M=2^{32}$ in the example)

We pick $k$ testcases, and get $Y$ distinct inputs

$Y$ is stochastic

variant on the birthday problem

https://en.wikipedia.org/wiki/Birthday_problem
### Comparison | birthday problem | our problem | |:---------------------:|:-------------------- | |365 days | M=2^32 inputs | |30 students | k tries | |Prob(collision) ? | expected # of distinct inputs |

Expected coverage

Y is stochastic. Expected value: $$ E(Y) = M \left(1 - \left(\frac{M-1}{M}\right)^k\right)$$

Simplification

When M is large: $$ E(Y) = M \left(1 - \left(\frac{M-1}{M}\right)^k\right)$$ $$ \frac{E(Y)}{M} \approx 1 - e^{-\frac{k}{M}}$$

Distinct values Y after k samples

$M=2^{32}$

Wasted effort

$M=2^{32}$

Takeaways

  • Random sampling is good
  • wasted effort when k approaches M
  • Inefficient way to get exhaustive testing
Can we get random order, without repetition?

Using shuffle

  • Generate $0, 1, 2, 3, 4 \dotsc M-1$
  • permute with std::shuffle()
  • Use the data

Shuffle - implementation

Generate $0, 1, 2, 3, 4 \dotsc M-1$:
1: std::vector<unsigned int> v(M);
2: std::iota(begin(v), end(v), 0);
3: 
Shuffle the order:
1: std::minstd_rand gen(42);
2: std::shuffle(begin(v), end(v), gen);
3: 
Use the data:
1: for (unsigned int i : v) {
2:   auto actual = count_set_bits(i);
3:   auto expected = __builtin_popcount(i);
4:   assert(actual == expected);
5: }
6: 

Shuffle - implementation

01: int
02: main(int argc, char* argv[])
03: {
04:   assert(argc > 1);
05:   const std::size_t M = std::stoull(argv[1]);
06: 
07:   std::vector<unsigned int> v(M);
08:   std::iota(begin(v), end(v), 0);
09: 
10:   std::minstd_rand gen(42);
11:   std::shuffle(begin(v), end(v), gen);
12: 
13:   for (unsigned int i : v) {
14:     auto actual = count_set_bits(i);
15:     auto expected = __builtin_popcount(i);
16:     assert(actual == expected);
17:   }
18: }
19: 
Shuffle demo
## Performance - clang 8, 64 bit Debian Linux - perf - empty function in other TU - Intel core i7-8565U (laptop) - count cpu cycles instead of time

Performance

Method cycles/sample
sequential 4
xor 5
random 42
shuffle 900

Note: random is the best of mt19937_64, mt19937 and minstd_rand

## Why is shuffle so slow? - allocates large amount of memory - all work up front - terrible cache behaviour https://pauldreik.blogspot.com/2019/12/iterating-in-random-order-with-lazy.html
## We want - random exploration - no repetitions - low memory need - no work up front - be as fast as random()

Block cipher

A crypto building block

Encryption

secret key K
|
V
$\fbox{N bits of cleartext}$   ====>   $\fbox{N bits of cryptotext}$

Decryption

secret key K
|
V
$\fbox{N bits of cleartext}$   <====   $\fbox{N bits of cryptotext}$

Example

Let's make a 3-bit block crypto

$\fbox{3 bits of cleartext}$   ====>   $\fbox{3 bits of cryptotext}$
| 3-bit Input | 3-bit Output | |:-------:|:--------:| | 0 | 7 | | 1 | 2 | | 2 | 4 | | 3 | 5 | | 4 | 3 | | 5 | 1 | | 6 | 6 | | 7 | 0 |
### Size preserving sizeof(i) = sizeof(encrypt(j))
### Distinct $\forall i \ne j$ encrypt(i) != encrypt(j)
block cipher = permutation

Outlining a solution

01: struct BlockCipher32
02: {
03:   unsigned encrypt(unsigned x) const;
04:   unsigned decrypt(unsigned x) const;
05: 
06: private:
07:   SecretKey m_key;
08: };
09: 
10: int
11: main()
12: {
13:   BlockCipher32 cipher;
14:   unsigned int i = 0;
15:   do {
16:     f(cipher.encrypt(i));
17:   } while (++i);
18: }
19: 
## Existing block ciphers - AES,Serpent,Twofish - 128 bit - DES,3DES,Blowfish - 64 bit - RC5 - 32/64/128 bit
## Problem 1: Block size Testing a function f(int, bool) needs a $N=33$ bit cipher
## Problem 2: Speed Don't pay for security if we don't need it

Where do we find a fast N-bit block cipher?

We will make are own!

Feistel Cipher

A simple recipe for making block ciphers
For a N-bit block cipher
1: // example - 32 bit block cipher
2: using half = std::uint16_t;
3: 
4: half
5: roundFunction(half x, Key k);
6: 
### What does the round function do? Wreak havoc with the input

Encrypting with Feistel

  • split in left and right
  • repeat Nround times:
    1. apply round function F to right
    2. xor the output with left
    3. swap left and right
01: struct BlockCipher32
02: {
03:   uint32_t encrypt(uint32_t plain)
04:   {
05:     uint16_t left = plain >> 16;
06:     uint16_t right = plain & 0xFFFF;
07: 
08:     for (int i = 0; i < rounds; ++i) {
09:       uint16_t tmp = roundFunc(right, m_keys[i]);
10:       left ^= tmp;
11:       std::swap(left, right);
12:     }
13:     return (left << 16) | right;
14:   }
15: 
16: private:
17:   static const int Nrounds = 10;
18:   uint16_t roundFunc(uint16_t x, uint16_t key)
19:   {
20:     return (x ^ key) * 35;
21:   }
22:   std::array<uint16_t, Nrounds> m_keys;
23: }
24: 

Generalizing to N-bit

### N bit block size Pick integer types - Full: $\ge N$ bits - Half: $\ge \frac{N}{2}$ bits Precompute a mask with the bottom $\frac{N}{2}$ bits set

24 bit example

01: struct BlockCipher24
02: {
03:   uint32_t encrypt(uint32_t plain)
04:   {
05:     uint16_t left = plain >> 12;
06:     const uint16_t mask = 0xFFF;
07:     uint16_t right = plain & mask;
08: 
09:     for (int i = 0; i < rounds; ++i) {
10:       uint16_t tmp = roundFunc(right, m_keys[i]);
11:       left ^= (tmp & mask);
12:       std::swap(left, right);
13:     }
14:     return (left << 12) | right;
15:   }
16: 
17: private:
18:   static const int Nrounds = 10;
19:   uint16_t roundFunc(uint16_t x, uint16_t key)
20:   {
21:     return (x ^ key) * 35;
22:   }
23:   std::array<uint16_t, Nrounds> m_keys;
24: }
25: 
## What if N is Odd ? - treat left and right differently - not difficult, just tedious
### Generic implementation - templated on integer types - CRTP to avoid runtime cost - Derived class defines the round function and number of rounds
01: template<typename Derived,
02:          typename EncryptTypeFull,
03:          typename EncryptTypeHalf>
04: class GenericFeistel
05: {
06: public:
07:   EncryptTypeFull encrypt(EncryptTypeFull cleartext)
08:   {
09:     return commonEncryptAndDecrypt<Encrypt>(cleartext);
10:   }
11:   EncryptTypeFull decrypt(EncryptTypeFull ciphertext)
12:   {
13:     return commonEncryptAndDecrypt<Decrypt>(ciphertext);
14:   }
15: 
16:   template<Direction direction>
17:   EncryptTypeFull commonEncryptAndDecrypt(
18:     EncryptTypeFull input);
19: 
20: private:
21:   const int m_Nbitshalf;
22:   const EncryptTypeHalf m_mask;
23: };
24: 
01: template<Direction direction>
02: EncryptTypeFull
03: commonEncryptAndDecrypt(EncryptTypeFull input)
04: {
05:   EncryptTypeHalf left = (input >> m_Nbitshalf) & m_mask;
06:   EncryptTypeHalf right = input & m_mask;
07:   int round;
08:   for (int i = 0; i < rounds(); ++i) {
09:     if constexpr (direction == Encrypt)
10:       round = i;
11:     else
12:       round = rounds() - 1 - i;
13:     EncryptTypeHalf Fresult =
14:       This()->roundFunction(right, round);
15:     left ^= Fresult;
16:     left &= m_mask;
17:     std::swap(left, right);
18:   }
19:   std::swap(left, right);
20:   return (EncryptTypeFull{ left } << m_Nbitshalf) | right;
21: }
22: 
01: class Dynamic32
02:   : public GenericFeistel<Dynamic32,
03:                           std::uint32_t,
04:                           std::uint16_t>
05: {
06: public:
07:   static constexpr int ROUNDS = 2;
08:   using Base = GenericFeistel<Dynamic32,
09:                               std::uint32_t,
10:                               std::uint16_t>;
11:   explicit Dynamic32(int Nbits)
12:     : Base(Nbits){};
13: 
14:   std::uint16_t roundFunction(const std::uint16_t x,
15:                               int round) const
16:   {
17:     return hashfnv1a(x) ^ m_key[round];
18:   }
19: 
20: private:
21:   std::array<std::uint16_t, ROUNDS> m_key;
22: };
23: 

Choosing the round function

### Desired properties - fast - mixes key and input well - "bit avalanche"

FNV-1A Hash

algorithm fnv-1a is
    hash := FNV_offset_basis

    for each byte_of_data to be hashed do
        hash := hash XOR byte_of_data
        hash := hash × FNV_prime

    return hash
Exists for 32, 64, 128, 256, 512, 1024 bits
### 16-bit FNV-1a - xor the two halves of 32 bit (correct) - just return the bottom part (truncating)
### Judging encryption quality randomness test on encrypting a counter That's another talk

Tuning

How expensive is all this?

Performance

Method cycles/sample
sequential 4
xor 5
fnv1a (truncated) 10
fnv1a (proper) 12.4
random 42
shuffle 900
Can we do better?
faster hash?
use simd (vector units)? - encrypt 8 values at a time
more instructions in parallel? [wikichip skylake](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler_Ports_.26_Execution_Units)
### Hardware accelerated [intel intrinsics guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5026,5027,5028,5029,5030,5031,5032,459,414,415,416,417,3485,3490,4018,4355,5961,304,402,2936,3656,3668,4584,4582,4580,5808,5952,4567,4329,4151,462,463,599,4027,4151,5960,402,460,1284,3259,1284,1285,4380,4378,3483,4697,595,1287,238,237,682,4426,4537,401,3863,3544,4447,2944,5150,5155,233&cats=Cryptography&othertechs=AES) Use the _mm_aesenc_si128 intrinsic

AES Hardware assistance

1: std::uint16_t
2: roundFunction(const std::uint16_t x, int round) const
3: {
4:   __m128i m = _mm_set1_epi16(x);
5:   m = _mm_aesenc_si128(m, m_key);
6:   return _mm_extract_epi16(m, 0);
7: }
8: 

Performance

Method cycles/sample
sequential 4
xor 5
AES intrinsic 8.9
fnv1a (truncated) 10
fnv1a (proper) 12.4
random 42
shuffle 900

Random iteration

We have all the building blocks to iterate randomly now!

A range with M elements - not always a power of two

IF M is not a power of two

Use $N=\lceil log_2(M)\rceil$
and throw away results above M.
Will waste at most 50%
01: template<typename RandomAccessIterator, typename Callback>
02: void
03: random_for_each(RandomAccessIterator first,
04:                 RandomAccessIterator last,
05:                 Callback&& cb)
06: {
07:   const auto M = std::distance(first, last);
08:   const int N = ceil_log2(M);
09:   if (M <= 32) {
10:     Dynamic32 cipher(N);
11:     unsigned generated = 0;
12:     for (unsigned i = 0; generated < M; ++i) {
13:       unsigned e = cipher.encrypt(i);
14:       if (e >= M) {
15:         continue;
16:       }
17:       ++generated;
18:       cb(*(first + e));
19:     }
20:   } else {
21:     // .... use Dynamic64 etc
22:   }
23: }
24: 
### Thank you! - https://www.pauldreik.se/talks/20200123_crypto/ - https://github.com/pauldreik/random_foreach - https://pauldreik.blogspot.com/2019/12/iterating-in-random-order-with-lazy.html