By Paul Dreik 20251102
The problem is:
Given positive integers , find out if .
The obvious solution is to carry out the multiplication and compare. But is it possible to solve this without carrying out the multiplication?
This is interesting, because if the numbers are stored as 32 bit integers, carrying out the multiplication requires 64 bit computations to avoid overflow. That is maybe not so problematic, but what if they were 64 bit or even larger?
We can emulate 64 bit multiplication with several 32 bit operations, for instance with Karatsuba multiplication. This is probably the fastest way (unless 64 bit arithmetic is available).
We will use the chinese remainder theorem.
This says that if we know a value modulo several relatively prime numbers , is unique up to a multiple of the product of the :s. We will exploit this by finding several numbers such that:
The test for equality between and then comes down to:
for each :
If we tested all without finding inequality, .
First off, doing multiplication modulo is very efficient in computers. So it is good to select . If we are to utilize 32-bit multiplications without overflow for the rest of the , we need to keep below . That means we have to use at least three more coefficients , , to satisify point three in the bullet list.
Compilers are able to generate efficient code for calculating modulo when the divisor is known ahead of time. The following C++ code:
template <unsigned M>
unsigned divide(unsigned x) {
return x % M;
}
template unsigned divide<65519>(unsigned);generates the following assembly:
unsigned int divide<65519u>(unsigned int):
mov eax, edi
mov edx, 2148040849
imul rax, rdx
shr rax, 47
imul edx, eax, 65519
mov eax, edi
sub eax, edx
retOne can see that there is no division, just two multiplications, a subtract and a shift.
For other numbers, slightly different variations are generated. Here are some examples:
unsigned int divide<62297u>(unsigned int):
mov edx, edi
mov eax, edi
imul rdx, rdx, 223307689
shr rdx, 32
sub eax, edx
shr eax
add eax, edx
shr eax, 15
imul edx, eax, 62297
mov eax, edi
sub eax, edx
ret
unsigned int divide<65449u>(unsigned int):
mov eax, edi
imul rax, rax, 1075169127
shr rax, 46
imul edx, eax, 65449
mov eax, edi
sub eax, edx
ret
unsigned int divide<65479u>(unsigned int):
mov eax, edi
imul rax, rax, 1074676525
shr rax, 46
imul edx, eax, 65479
mov eax, edi
sub eax, edx
ret
unsigned int divide<65497u>(unsigned int):
mov eax, edi
mov edx, 2148762361
imul rax, rdx
shr rax, 47
imul edx, eax, 65497
mov eax, edi
sub eax, edx
retWe need to search for three numbers which are odd (to avoid a common factor with ), below and relatively prime. They also need to multiply to at least .
The obvious solution is to benchmark all candidate divisors to see which of them are fastest.
Going downwards from , the three first fast I find are 65485, 65483 and 65481. The moduli takes about 0.38 ns to compute, compared to 0.46 ns for 65487 and 0.56 ns for 65535.
We need to check these numbers are relatively prime. Using the factor program, I get:
None of these numbers share any factors, so they are relatively prime. And because they are odd, they are also relatively prime to .
Are their product large enough? Definetely, it is almost .
This is what the finished code looks like:
template<unsigned M>
constexpr unsigned
modulo(unsigned x)
{
return x % M;
}
/**
* multiplies x and y modulo a_i, while avoiding
* internal overflow
* @return xy mod a_i
*/
template<unsigned a_i>
constexpr unsigned
multiply_modulo(const unsigned x, const unsigned y)
{
const auto xm = modulo<a_i>(x);
const auto ym = modulo<a_i>(y);
return modulo<a_i>(xm * ym);
}
/**
* returns true if a*b == c*d, determined while avoiding
* overflow
* @param a
* @param b
* @param c
* @param d
* @return
*/
constexpr bool
products_equal(const unsigned a,
const unsigned b,
const unsigned c,
const unsigned d)
{
// we use the chinese remainder theorem with four coefficients
// carry out the comparision modulo a_1=2**32 which is fast
if (a * b != c * d) {
return false;
}
// this coefficient was chosen after benchmarking
constexpr unsigned a_2 = 65485;
if (multiply_modulo<a_2>(a, b) != multiply_modulo<a_2>(c, d)) {
return false;
}
// this coefficient was chosen after benchmarking
constexpr unsigned a_3 = 65483;
if (multiply_modulo<a_3>(a, b) != multiply_modulo<a_3>(c, d)) {
return false;
}
// this coefficient was chosen after benchmarking
constexpr unsigned a_4 = 65481;
if (multiply_modulo<a_4>(a, b) != multiply_modulo<a_4>(c, d)) {
return false;
}
return true;
}The reference implementation to compare against is
bool
reference_products_equal(const unsigned a,
const unsigned b,
const unsigned c,
const unsigned d)
{
using U = std::uint64_t;
return U{ a } * b == U{ c } * d;
}The benchmark uses random data to prevent optimization. Here is the benchmark fixture, using Catch2 benchmarking:
TEST_CASE("Benchmark equality", "[!benchmark]")
{
BENCHMARK_ADVANCED("64 bit multiplication")(
Catch::Benchmark::Chronometer meter)
{
// prevent optimization by getting these at runtime
std::random_device rd;
const unsigned b = rd();
const unsigned c = rd();
const unsigned d = rd();
meter.measure(
[=](unsigned int i) { return reference_products_equal(i, b, c, d); });
};
BENCHMARK_ADVANCED("chinese remainder")(Catch::Benchmark::Chronometer meter)
{
// prevent optimization by getting these at runtime
std::random_device rd;
const unsigned b = rd();
const unsigned c = rd();
const unsigned d = rd();
meter.measure([=](unsigned int i) { return products_equal(i, b, c, d); });
};
}The results show that the performance is within measurement noise:
benchmark name samples iterations est run time
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------
64 bit multiplication 100 49124 0 ns
0.483111 ns 0.480609 ns 0.488554 ns
0.0178952 ns 0.0100452 ns 0.0317602 ns
chinese remainder 100 49028 0 ns
0.479843 ns 0.479145 ns 0.483282 ns
0.00685153 ns 0.000123012 ns 0.0163458 ns
While testing the code I wrote, I used fuzzing to find test cases. The fuzzer checks that the reference implementation gives the same results as the implementation based on the chinese remainder theorem. I commented out the checks for the respective , waited for the fuzzer to find an example which I then added to test cases.
The interesting thing is that the check for could be removed without me finding any valid counterexample. I let the fuzzer run for about 100 cpu hours without any luck. This got me thinking. Is it perhaps for the particular choise of this happens? Excluding , ends up very close below and that might mean it is impossible to find values such that and ? If so, the code could be cut down to do three modulus operations instead of four.
My friend asked AI to explain this. It failed, but generated a brute force program to search for a counterexample. The program worked and produced the following counterexample: