Home made crypto

Using home made crypto for a non-cryptographic problem

Stockholm C++ 20200123

Paul Dreik

## About me - const west - let's break ABI in C++ 23 - I like unsigned integers - Greta is right

About Me

Independent contractor
security code review
fuzz *all* the things
twitter: @pauldreik
https://github.com/pauldreik/
https://www.pauldreik.se/

"Don't make your own crypto"

Dont TRUST your own crypto.

Don't use home made crypto for cryptographic purposes

unless you write malware
serious motivation

From some random code base

01: int
02: count_set_bits(unsigned int v)
03: {
04:   const int S[] = { 1, 2, 4, 8, 16 };
05:   const int B[] = { 0x55555555,
06:                     0x33333333,
07:                     0x0F0F0F0F,
08:                     0x00FF00FF,
09:                     0x0000FFFF };
10: 
11:   unsigned int c;
12:   c = v - ((v >> 1) & B[0]);
13:   c = ((c >> S[1]) & B[1]) + (c & B[1]);
14:   c = ((c >> S[2]) + c) & B[2];
15:   c = ((c >> S[3]) + c) & B[3];
16:   c = ((c >> S[4]) + c) & B[4];
17:   return c;
18: }
19:

credit: https://graphics.stanford.edu/~seander/bithacks.html

Write a unit test

1: int
2: main()
3: {
4:   assert(count_set_bits(0) == 0);
5:   assert(count_set_bits(1) == 1);
6:   assert(count_set_bits(1 << 10) == 1);
7:   assert(count_set_bits(0xF) == 4);
8: }
9:

Exhaustive testing

1: void
2: exhaustive_test()
3: {
4:   unsigned int = 0;
5:   do {
6:     // .......
7:   } while (++i);
8: }
9:

Exhaustive testing

01: int
02: main()
03: {
04:   unsigned int i = 0;
05:   do {
06:     auto actual = count_set_bits(i);
07:     auto expected = __builtin_popcount(i);
08:     assert(actual == expected);
09:   } while (++i);
10: }
11:

Demo

Exploration

What if the interesting bug happens only when the leading bit is set?

Xor

01: int
02: main()
03: {
04:   unsigned int i = 0;
05:   const unsigned int flip = std::random_device{}();
06:   do {
07:     const unsigned int j = i ^ flip;
08:     auto actual = count_set_bits(j);
09:     auto expected = __builtin_popcount(j);
10:     assert(actual == expected);
11:   } while (++i);
12: }
13:

Xor Demo

Use random

01: #include <random>
02: 
03: int
04: main()
05: {
06:   std::mt19937_64 gen;
07:   gen.seed(42);
08: 
09:   for (;;) {
10:     auto x = gen();
11:     auto actual = count_set_bits(x);
12:     auto expected = __builtin_popcountl(x);
13:     assert(actual == expected);
14:   }
15: }
16:

Random Demo

Good exploration

Only beaten by fuzzing tricks

Analyze the waste

Notation

M possible inputs ($M=2^{32}$ in the example)

We pick $k$ testcases, and get $Y$ distinct inputs

$Y$ is stochastic

variant on the birthday problem

https://en.wikipedia.org/wiki/Birthday_problem

Expected coverage

Y is stochastic. Expected value: $$ E(Y) = M \left(1 - \left(\frac{M-1}{M}\right)^k\right)$$

Simplification

When M is large: $$ E(Y) = M \left(1 - \left(\frac{M-1}{M}\right)^k\right)$$ $$ \frac{E(Y)}{M} \approx 1 - e^{-\frac{k}{M}}$$

Distinct values Y after k samples

$M=2^{32}$

Wasted effort

$M=2^{32}$

Takeaways

Random sampling is good
wasted effort when k approaches M
Inefficient way to get exhaustive testing

Can we get random order, without repetition?

Using shuffle

Generate $0, 1, 2, 3, 4 \dotsc M-1$
permute with std::shuffle()
Use the data

Shuffle - implementation

Generate $0, 1, 2, 3, 4 \dotsc M-1$:

1: std::vector<unsigned int> v(M);
2: std::iota(begin(v), end(v), 0);
3:

Shuffle the order:

1: std::minstd_rand gen(42);
2: std::shuffle(begin(v), end(v), gen);
3:

Use the data:

1: for (unsigned int i : v) {
2:   auto actual = count_set_bits(i);
3:   auto expected = __builtin_popcount(i);
4:   assert(actual == expected);
5: }
6:

Shuffle - implementation

01: int
02: main(int argc, char* argv[])
03: {
04:   assert(argc > 1);
05:   const std::size_t M = std::stoull(argv[1]);
06: 
07:   std::vector<unsigned int> v(M);
08:   std::iota(begin(v), end(v), 0);
09: 
10:   std::minstd_rand gen(42);
11:   std::shuffle(begin(v), end(v), gen);
12: 
13:   for (unsigned int i : v) {
14:     auto actual = count_set_bits(i);
15:     auto expected = __builtin_popcount(i);
16:     assert(actual == expected);
17:   }
18: }
19:

Shuffle demo

## Performance - clang 8, 64 bit Debian Linux - perf - empty function in other TU - Intel core i7-8565U (laptop) - count cpu cycles instead of time

Performance

Method	cycles/sample
sequential	4
xor	5
random	42
shuffle	900

Note: random is the best of mt19937_64, mt19937 and minstd_rand

## Why is shuffle so slow? - allocates large amount of memory - all work up front - terrible cache behaviour https://pauldreik.blogspot.com/2019/12/iterating-in-random-order-with-lazy.html

## We want - random exploration - no repetitions - low memory need - no work up front - be as fast as random()

Block cipher

A crypto building block

Encryption

secret key K
|
V
$\fbox{N bits of cleartext}$ ====> $\fbox{N bits of cryptotext}$

Decryption

secret key K
|
V
$\fbox{N bits of cleartext}$ <==== $\fbox{N bits of cryptotext}$

Example

Let's make a 3-bit block crypto

$\fbox{3 bits of cleartext}$ ====> $\fbox{3 bits of cryptotext}$

| 3-bit Input | 3-bit Output | |:-------:|:--------:| | 0 | 7 | | 1 | 2 | | 2 | 4 | | 3 | 5 | | 4 | 3 | | 5 | 1 | | 6 | 6 | | 7 | 0 |

### Size preserving sizeof(i) = sizeof(encrypt(j))

### Distinct $\forall i \ne j$ encrypt(i) != encrypt(j)

block cipher = permutation

Outlining a solution

01: struct BlockCipher32
02: {
03:   unsigned encrypt(unsigned x) const;
04:   unsigned decrypt(unsigned x) const;
05: 
06: private:
07:   SecretKey m_key;
08: };
09: 
10: int
11: main()
12: {
13:   BlockCipher32 cipher;
14:   unsigned int i = 0;
15:   do {
16:     f(cipher.encrypt(i));
17:   } while (++i);
18: }
19:

## Existing block ciphers - AES,Serpent,Twofish - 128 bit - DES,3DES,Blowfish - 64 bit - RC5 - 32/64/128 bit

## Problem 1: Block size Testing a function f(int, bool) needs a $N=33$ bit cipher

## Problem 2: Speed Don't pay for security if we don't need it

Where do we find a fast N-bit block cipher?

We will make are own!

Feistel Cipher

A simple recipe for making block ciphers

For a N-bit block cipher

1: // example - 32 bit block cipher
2: using half = std::uint16_t;
3: 
4: half
5: roundFunction(half x, Key k);
6:

### What does the round function do? Wreak havoc with the input

Encrypting with Feistel

split in left and right
repeat Nround times:
1. apply round function F to right
2. xor the output with left
3. swap left and right

01: struct BlockCipher32
02: {
03:   uint32_t encrypt(uint32_t plain)
04:   {
05:     uint16_t left = plain >> 16;
06:     uint16_t right = plain & 0xFFFF;
07: 
08:     for (int i = 0; i < rounds; ++i) {
09:       uint16_t tmp = roundFunc(right, m_keys[i]);
10:       left ^= tmp;
11:       std::swap(left, right);
12:     }
13:     return (left << 16) | right;
14:   }
15: 
16: private:
17:   static const int Nrounds = 10;
18:   uint16_t roundFunc(uint16_t x, uint16_t key)
19:   {
20:     return (x ^ key) * 35;
21:   }
22:   std::array<uint16_t, Nrounds> m_keys;
23: }
24:

Generalizing to N-bit

### N bit block size Pick integer types - Full: $\ge N$ bits - Half: $\ge \frac{N}{2}$ bits Precompute a mask with the bottom $\frac{N}{2}$ bits set

24 bit example

01: struct BlockCipher24
02: {
03:   uint32_t encrypt(uint32_t plain)
04:   {
05:     uint16_t left = plain >> 12;
06:     const uint16_t mask = 0xFFF;
07:     uint16_t right = plain & mask;
08: 
09:     for (int i = 0; i < rounds; ++i) {
10:       uint16_t tmp = roundFunc(right, m_keys[i]);
11:       left ^= (tmp & mask);
12:       std::swap(left, right);
13:     }
14:     return (left << 12) | right;
15:   }
16: 
17: private:
18:   static const int Nrounds = 10;
19:   uint16_t roundFunc(uint16_t x, uint16_t key)
20:   {
21:     return (x ^ key) * 35;
22:   }
23:   std::array<uint16_t, Nrounds> m_keys;
24: }
25:

## What if N is Odd ? - treat left and right differently - not difficult, just tedious

### Generic implementation - templated on integer types - CRTP to avoid runtime cost - Derived class defines the round function and number of rounds

01: template<typename Derived,
02:          typename EncryptTypeFull,
03:          typename EncryptTypeHalf>
04: class GenericFeistel
05: {
06: public:
07:   EncryptTypeFull encrypt(EncryptTypeFull cleartext)
08:   {
09:     return commonEncryptAndDecrypt<Encrypt>(cleartext);
10:   }
11:   EncryptTypeFull decrypt(EncryptTypeFull ciphertext)
12:   {
13:     return commonEncryptAndDecrypt<Decrypt>(ciphertext);
14:   }
15: 
16:   template<Direction direction>
17:   EncryptTypeFull commonEncryptAndDecrypt(
18:     EncryptTypeFull input);
19: 
20: private:
21:   const int m_Nbitshalf;
22:   const EncryptTypeHalf m_mask;
23: };
24:

01: template<Direction direction>
02: EncryptTypeFull
03: commonEncryptAndDecrypt(EncryptTypeFull input)
04: {
05:   EncryptTypeHalf left = (input >> m_Nbitshalf) & m_mask;
06:   EncryptTypeHalf right = input & m_mask;
07:   int round;
08:   for (int i = 0; i < rounds(); ++i) {
09:     if constexpr (direction == Encrypt)
10:       round = i;
11:     else
12:       round = rounds() - 1 - i;
13:     EncryptTypeHalf Fresult =
14:       This()->roundFunction(right, round);
15:     left ^= Fresult;
16:     left &= m_mask;
17:     std::swap(left, right);
18:   }
19:   std::swap(left, right);
20:   return (EncryptTypeFull{ left } << m_Nbitshalf) | right;
21: }
22:

01: class Dynamic32
02:   : public GenericFeistel<Dynamic32,
03:                           std::uint32_t,
04:                           std::uint16_t>
05: {
06: public:
07:   static constexpr int ROUNDS = 2;
08:   using Base = GenericFeistel<Dynamic32,
09:                               std::uint32_t,
10:                               std::uint16_t>;
11:   explicit Dynamic32(int Nbits)
12:     : Base(Nbits){};
13: 
14:   std::uint16_t roundFunction(const std::uint16_t x,
15:                               int round) const
16:   {
17:     return hashfnv1a(x) ^ m_key[round];
18:   }
19: 
20: private:
21:   std::array<std::uint16_t, ROUNDS> m_key;
22: };
23:

Choosing the round function

### Desired properties - fast - mixes key and input well - "bit avalanche"

FNV-1A Hash

algorithm fnv-1a is
    hash := FNV_offset_basis

    for each byte_of_data to be hashed do
        hash := hash XOR byte_of_data
        hash := hash × FNV_prime

    return hash

Exists for 32, 64, 128, 256, 512, 1024 bits

### 16-bit FNV-1a - xor the two halves of 32 bit (correct) - just return the bottom part (truncating)

### Judging encryption quality randomness test on encrypting a counter That's another talk

Tuning

How expensive is all this?

Performance

Method	cycles/sample
sequential	4
xor	5
fnv1a (truncated)	10
fnv1a (proper)	12.4
random	42
shuffle	900

Can we do better?

faster hash?

use simd (vector units)? - encrypt 8 values at a time

more instructions in parallel? [wikichip skylake](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler_Ports_.26_Execution_Units)

### Hardware accelerated [intel intrinsics guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5026,5027,5028,5029,5030,5031,5032,459,414,415,416,417,3485,3490,4018,4355,5961,304,402,2936,3656,3668,4584,4582,4580,5808,5952,4567,4329,4151,462,463,599,4027,4151,5960,402,460,1284,3259,1284,1285,4380,4378,3483,4697,595,1287,238,237,682,4426,4537,401,3863,3544,4447,2944,5150,5155,233&cats=Cryptography&othertechs=AES) Use the _mm_aesenc_si128 intrinsic

AES Hardware assistance

1: std::uint16_t
2: roundFunction(const std::uint16_t x, int round) const
3: {
4:   __m128i m = _mm_set1_epi16(x);
5:   m = _mm_aesenc_si128(m, m_key);
6:   return _mm_extract_epi16(m, 0);
7: }
8:

Performance

Method	cycles/sample
sequential	4
xor	5
AES intrinsic	8.9
fnv1a (truncated)	10
fnv1a (proper)	12.4
random	42
shuffle	900

Random iteration

We have all the building blocks to iterate randomly now!

A range with M elements - not always a power of two

IF M is not a power of two

Use $N=\lceil log_2(M)\rceil$
and throw away results above M.
Will waste at most 50%

01: template<typename RandomAccessIterator, typename Callback>
02: void
03: random_for_each(RandomAccessIterator first,
04:                 RandomAccessIterator last,
05:                 Callback&& cb)
06: {
07:   const auto M = std::distance(first, last);
08:   const int N = ceil_log2(M);
09:   if (M <= 32) {
10:     Dynamic32 cipher(N);
11:     unsigned generated = 0;
12:     for (unsigned i = 0; generated < M; ++i) {
13:       unsigned e = cipher.encrypt(i);
14:       if (e >= M) {
15:         continue;
16:       }
17:       ++generated;
18:       cb(*(first + e));
19:     }
20:   } else {
21:     // .... use Dynamic64 etc
22:   }
23: }
24:

### Thank you! - https://www.pauldreik.se/talks/20200123_crypto/ - https://github.com/pauldreik/random_foreach - https://pauldreik.blogspot.com/2019/12/iterating-in-random-order-with-lazy.html