Determining if two products are equal without multiplying them

By Paul Dreik 20251102

The problem is:

Given positive integers a,b,c,da,b,c,d, find out if ab=cdab=cd.

The obvious solution is to carry out the multiplication and compare. But is it possible to solve this without carrying out the multiplication?

This is interesting, because if the numbers are stored as 32 bit integers, carrying out the multiplication requires 64 bit computations to avoid overflow. That is maybe not so problematic, but what if they were 64 bit or even larger?

The usual solution

We can emulate 64 bit multiplication with several 32 bit operations, for instance with Karatsuba multiplication. This is probably the fastest way (unless 64 bit arithmetic is available).

An alternative solution

We will use the chinese remainder theorem.

This says that if we know a value xx modulo several relatively prime numbers aia_i, xx is unique up to a multiple of the product NN of the aia_i:s. We will exploit this by finding several numbers aia_i such that:

The test for equality between abab and cdcd then comes down to:

for each aia_i:

If we tested all aia_i without finding inequality, ab=cdab=cd.

First off, doing multiplication modulo 2322^32 is very efficient in computers. So it is good to select a1=232a_1=2^32. If we are to utilize 32-bit multiplications without overflow for the rest of the aia_i, we need to keep below 2162^16. That means we have to use at least three more coefficients a2a_2, a3a_3, a4a_4 to satisify point three in the bullet list.

Modulo computation

Compilers are able to generate efficient code for calculating modulo when the divisor is known ahead of time. The following C++ code:

template <unsigned M>
unsigned divide(unsigned x) {
    return x % M;
}
template unsigned divide<65519>(unsigned);

generates the following assembly:

unsigned int divide<65519u>(unsigned int):
        mov     eax, edi
        mov     edx, 2148040849
        imul    rax, rdx
        shr     rax, 47
        imul    edx, eax, 65519
        mov     eax, edi
        sub     eax, edx
        ret

One can see that there is no division, just two multiplications, a subtract and a shift.

For other numbers, slightly different variations are generated. Here are some examples:

unsigned int divide<62297u>(unsigned int):
        mov     edx, edi
        mov     eax, edi
        imul    rdx, rdx, 223307689
        shr     rdx, 32
        sub     eax, edx
        shr     eax
        add     eax, edx
        shr     eax, 15
        imul    edx, eax, 62297
        mov     eax, edi
        sub     eax, edx
        ret
unsigned int divide<65449u>(unsigned int):
        mov     eax, edi
        imul    rax, rax, 1075169127
        shr     rax, 46
        imul    edx, eax, 65449
        mov     eax, edi
        sub     eax, edx
        ret
unsigned int divide<65479u>(unsigned int):
        mov     eax, edi
        imul    rax, rax, 1074676525
        shr     rax, 46
        imul    edx, eax, 65479
        mov     eax, edi
        sub     eax, edx
        ret
unsigned int divide<65497u>(unsigned int):
        mov     eax, edi
        mov     edx, 2148762361
        imul    rax, rdx
        shr     rax, 47
        imul    edx, eax, 65497
        mov     eax, edi
        sub     eax, edx
        ret

Implementation

We need to search for three numbers ai,i>1a_i,i>1 which are odd (to avoid a common factor with a1=232a_1=2^{32}), below 2162^{16} and relatively prime. They also need to multiply to at least 2322^{32}.

The obvious solution is to benchmark all candidate divisors to see which of them are fastest.

Going downwards from 2162^{16}, the three first fast I find are 65485, 65483 and 65481. The moduli takes about 0.38 ns to compute, compared to 0.46 ns for 65487 and 0.56 ns for 65535.

We need to check these numbers are relatively prime. Using the factor program, I get:

None of these numbers share any factors, so they are relatively prime. And because they are odd, they are also relatively prime to a1=232a_1=2^{32}.

Are their product large enough? Definetely, it is almost 2482^{48}.

Resulting code

This is what the finished code looks like:

template<unsigned M>
constexpr unsigned
modulo(unsigned x)
{
  return x % M;
}

/**
 * multiplies x and y modulo a_i, while avoiding
 * internal overflow
 * @return xy mod a_i
 */
template<unsigned a_i>
constexpr unsigned
multiply_modulo(const unsigned x, const unsigned y)
{
  const auto xm = modulo<a_i>(x);
  const auto ym = modulo<a_i>(y);
  return modulo<a_i>(xm * ym);
}

/**
 * returns true if a*b == c*d, determined while avoiding
 * overflow
 * @param a
 * @param b
 * @param c
 * @param d
 * @return
 */
constexpr bool
products_equal(const unsigned a,
               const unsigned b,
               const unsigned c,
               const unsigned d)
{
  // we use the chinese remainder theorem with four coefficients

  // carry out the comparision modulo a_1=2**32 which is fast
  if (a * b != c * d) {
    return false;
  }

  // this coefficient was chosen after benchmarking
  constexpr unsigned a_2 = 65485;
  if (multiply_modulo<a_2>(a, b) != multiply_modulo<a_2>(c, d)) {
    return false;
  }

  // this coefficient was chosen after benchmarking
  constexpr unsigned a_3 = 65483;
  if (multiply_modulo<a_3>(a, b) != multiply_modulo<a_3>(c, d)) {
    return false;
  }

  // this coefficient was chosen after benchmarking
  constexpr unsigned a_4 = 65481;
  if (multiply_modulo<a_4>(a, b) != multiply_modulo<a_4>(c, d)) {
    return false;
  }

  return true;
}

Benchmarking

The reference implementation to compare against is

bool
reference_products_equal(const unsigned a,
                         const unsigned b,
                         const unsigned c,
                         const unsigned d)
{
  using U = std::uint64_t;
  return U{ a } * b == U{ c } * d;
}

The benchmark uses random data to prevent optimization. Here is the benchmark fixture, using Catch2 benchmarking:

TEST_CASE("Benchmark equality", "[!benchmark]")
{
  BENCHMARK_ADVANCED("64 bit multiplication")(
    Catch::Benchmark::Chronometer meter)
  {
    // prevent optimization by getting these at runtime
    std::random_device rd;
    const unsigned b = rd();
    const unsigned c = rd();
    const unsigned d = rd();

    meter.measure(
      [=](unsigned int i) { return reference_products_equal(i, b, c, d); });
  };
  BENCHMARK_ADVANCED("chinese remainder")(Catch::Benchmark::Chronometer meter)
  {
    // prevent optimization by getting these at runtime
    std::random_device rd;
    const unsigned b = rd();
    const unsigned c = rd();
    const unsigned d = rd();

    meter.measure([=](unsigned int i) { return products_equal(i, b, c, d); });
  };
}

The results show that the performance is within measurement noise:

benchmark name             samples       iterations    est run time
                           mean          low mean      high mean
                           std dev       low std dev   high std dev
-------------------------------------------------------------------
64 bit multiplication      100           49124          0 ns
                           0.483111 ns   0.480609 ns    0.488554 ns
                           0.0178952 ns  0.0100452 ns   0.0317602 ns

chinese remainder          100           49028          0 ns
                           0.479843 ns   0.479145 ns    0.483282 ns
                           0.00685153 ns 0.000123012 ns 0.0163458 ns

An interesting observation

While testing the code I wrote, I used fuzzing to find test cases. The fuzzer checks that the reference implementation gives the same results as the implementation based on the chinese remainder theorem. I commented out the checks for the respective aia_i, waited for the fuzzer to find an example which I then added to test cases.

The interesting thing is that the check for a2a_2 could be removed without me finding any valid counterexample. I let the fuzzer run for about 100 cpu hours without any luck. This got me thinking. Is it perhaps for the particular choise of aia_i this happens? Excluding a2a_2, NN ends up very close below 2642^{64} and that might mean it is impossible to find values a,b,c,da,b,c,d such that ab=N+xa b = N+x and cd=xc d = x ? If so, the code could be cut down to do three modulus operations instead of four.

Update: counterexample found

My friend asked AI to explain this. It failed, but generated a brute force program to search for a counterexample. The program worked and produced the following counterexample: a=4294940936,b=4289265232,c=372413899,d=15529856a=4294940936, b=4289265232, c=372413899, d=15529856