When working with CRCs, we work in the ring of polynomials over the field 𝔽2 (mostly modulo some polynomial).
Polynomials are typically notated like x5 + x2 + x1 + 1 but I will also notate them like 100111 (for each exponent, just the coefficient is written) in this document.
Addition is then done using XOR and multiplication is carry-less multiplication, which works like normal multiplication but in each step of the calculation, one uses XOR instead of normal addition.
For example, 1010 + 1100 = 0110
and
110010 · 1011 = 111000110
-------------
1011
+ 1011
+ 1011
-------------
111000110
From a number theoretical perspective, they are similar to the integers:
- the result of multiplicating two non-zero elements is always non-zero and multiplying by 1 leaves it the same.
- division with remainder is possible for each pair: the remainder then always has a degree smaller than the divisor. For example, 1000110100 (degree 9) modulo 11001 (degree 4) is 1101 (degree 3).
- two elements have a unique greatest common divisor that divides both elements. "Greatest" here means that it has the greatest degree.
- each element has a unique factorization. For example, 10001 may be written as 114 and 1000110100 as 102 · 112 · 101001.
One difference to integers is that anything multiplied by 1 + 1 (= 2) is zero because a number XORed by itself is always zero. Because 1 + 1 = 0, one cannot make sign errors as 1 + 1 = 0 = 1 - 1 and therefore 1 = -1.
There are some well-established parameters for a CRC algorithm:
init
: the initial state of the CRC registerpoly
: the polynomial by which the sum gets reducedwidth
: the degree of polyxorout
: a final polynomial that gets added after everything has been donerefin
: whether to reflect the bits of the input bytesrefout
: whether to reflect the bits of the final sum
Let f
be the input file represented as a polynomial (e.g. the bytes 0x45 0x11
would get represented as 01000101 00010001
or x14 + x10 + x8 + x4 + 1) and len(f)
the file length in bits (which is not neccessarily the degree, as the high bits can be zero).
refin
and refout
will be ignored for now.
The CRC can then be represented as: crc
= (init
· xlen(f)
+ f
· xwidth
+ xorout
) % poly
.
Suppose you have some files f
k with k ranging from 1 to n.
Additionally, you have their corresponding checksums crc
k using some unknown CRC algorithm.
Now you want to find the parameters for the algorithm.
First off, we assume that we already know the width of the CRC (otherwise, looping over it is usually not that hard), which is everything needed to calculate
a
k := (crc
k - f
k · xwidth
) % poly
= (init
· xlen(f_k)
+ f
k · xwidth
+ xorout
- f
k · xwidth
) % poly
= (init
· xlen(f_k)
+ xorout
) % poly
The resulting term does not depend on the file content, only its length.
Let's call that term a
k.
If we calculate that term for each k, we can subtract pairs of consecutive terms and get
(a
k+1 - a
k) % poly
= (init
· xlen(f_(k+1))
+ xorout
- init
· xlen(f_k)
- xorout
) % poly
= init
· (xlen(f_(k+1))
- xlen(f_k)
) % poly
Now it might be the case that len(f_k)
= len(f_(k+1))
which would mean that the term modulo poly
is already zero.
Since the remainder mod poly
would be zero, poly
would be a divisor of a
k + 1 - a
k which gives us some information on the possible values of poly
.
Indeed, we can calculate the greatest common divisor of all such result in order to combine all the information we get from that into a single value (let's call it m
).
This is because each of those differences individually must be divisible by poly
, therefore poly
must be a divisor common to all of those values.
Additionally, we can still do something when the files are of different lengths.
If we have three files with len(f_k)
< len(f_(k+1))
< len(f_(k+2))
(for clearer notation, call those lengths l0
, l1
and l2
), we can cancel out the init
term by noting the following equality:
(xl2
- xl1
) · (a
k+1 - a
k) % poly
= (xl2
- xl1
) · init
· (xl1
- xl0
) % poly
= (xl1
- xl0
) · (a
k+2 - a
k+1) % poly
Which means that
((xl2
- xl1
) · (a
k+1 - a
k) - (xl1
- xl0
) · (a
k+2 - a
k+1)) % poly
= 0
We can calculate the value on the left side of the modulo directly ourselves, which means we have a way of calculating a value that poly
divides even if all files are of different sizes. That can be done for all consecutive triples of files (in order) and the values can all be gcd'ed together again with m
.
Finally, we know that poly
is of degree width
, which means that a prime factor of degree d
can occur at most width
/ d
times. The product of all those prime factors is
q
:= ∏n=0..width
(x2n - x)
Why that expression has that property is not important in this context (the subexpression x2n - x contains exactly those prime factors whose degree divides n).
One can see that the result of that expression would be quite large, however what we desire is gcd(q
, m
), which is the same as gcd(q
% m
, m
) and the calculation modulo m
should therefore not use much more space than m
itself.
We can also remove any multiple of x from m
, since poly
will have a non-zero constant term, otherwise it would not be used in a CRC.
Now that we have a non-zero upper bound (in the divisibility sense) on poly
with m
, we can find out init
modulo m
.
First, observe again what we get when we calculate ak + 1 - ak:
a k + 1 - ak = init
· (xlen(f_(k+1))
- xlen(f_k)
) % poly
We do not have poly
, but we can instead use m
, which will yield a init
' modulo m, and for each solution candidate poly
, we can then simply calculate init
= init
' % poly
.
So we now have a set of modular equations
for all k from 1 to n - 1:
init
· zk ≡ ak + 1 - ak mod m
where
zk = xlen(f_(k+1))
- xlen(f_k)
yk = ak + 1 - ak
The only value we do not have is init
.
We could transform the equation into
init
≡ zk-1 · yk mod m
and directly get init
, but wait!
zk-1 may not be invertible!
We can still get all the available information from each equation even if zk is not invertible:
first, yk and zk may have a common factor ck = gcd(yk, zk, m
) which we can remove from zk, obtaining z' and y'
init
· ck · z'k = ck · y'k mod m
⇔ init
· z'k = y'k mod m
/ ck
If z'k is still not invertible, there would be a non-one m
' dividing m
/ ck such that z'k ≡ 0 mod m
', but that would mean
init
· z'k ≡ 0 ≡ y'k mod m
'
which would imply that gcd(y'k, z'k, m
') is not 1, contradicted by the fact that we factored ck out.
Since we assume that the equation modulo poly
does have a solution, we remove those problematic factors from m
until we get something solvable again (when we reach something with degree smaller than width
, we can always abort and output that no solutions were found).
For each equation, there is now a solution of init
mod m
/ ck with m
having all problematic factors from each equation removed.
We can then combine all these solutions with (almost) the chinese remainder theorem.
The traditional chinese remainder theorem assumes that all modulos are coprime, which is not our case.
Instead, in our simplified chinese remainder theorem we make sure that the solutions are the same on the overlapping factors.
If they are not the same, the factors where the solution differs gets removed from m
so that we can still have a solution.
In the end, we get a solution of the form init
' mod m
/ c, such that init
= init
' · c + d mod m
where 0 ≤ deg(d) < deg(c).
xorout
is now easy to calculate modulo m
:
a1 - init
· xlen(f₁)
= (init
· xlen(f₁)
+ xorout
) - init
· xlen(f₁)
≡ xorout
mod m
m
might still be bigger than poly
, however it is not much.
Usually all other spurious factors are removed by this point (depending on how many files are given), but it is very improbable that the degree is bigger than a few 100 at most.
This can be very easily factored with modern factorization algorithms.
With the factors, it is a matter of simply picking those factors whose degrees add up to width
, the subset sum problem.
Again, this is very fast at the scale of typical solutions.
For each possible solution poly
, one can then take the values that are still modulo m
and simply calculate them modulo poly
.
For refin
and refout
, it is easiest to just run the whole algorithm for all the 4 value combinations.
Reverse Engineering the Checksum Range from Files and Their Corresponding Checksums with a Given Checksum Algorithm
Suppose you have some files f
k with k ranging from 1 to n.
Additionally, you have their corresponding checksums crc
k using some known CRC algorithm.
But you do not know over what ranges of the file the checksums are taken.
All following operations are modulo poly
.
Also, x = 108 (a shift equivalent to a byte, not a bit) in this context.
Let the pure checksum (without init
or xorout
) starting at byte a and ending before byte b of file f
k be denoted by [a..b]
k.
Then the actual checksum is
crc
k = [a..b]
k + init
· xb - a + xorout
Then with
start(m
) = ([0..m]
k - init
) · xlen(f_k)
- m
end(m
) = ([0..m]
k + xorout
- crc
k) · xlen(f_k)
- m
we can calculate for a the start of the sum and b the end of the sum
end(b) - start(a) = ([0..b]
k + xorout
- crc
k) · xlen(f_k)
- b - ([0..a]
k - init
) · xlen(f_k)
- a
= ([0..b]
k + xorout
- crc
k) · xlen(f_k)
- b - ([0..a]
k - init
) · xb - axlen(f_k)
- b
= ([0..b]
k + xorout
- crc
k - ([0..a]
k - init
) · xb - a) · xlen(f_k)
- b
= ([a..b]
k + xorout
- crc
k + init
· xb - a) · xlen(f_k)
- b
Note that xn is never 0 since poly
does not contain a factor of 10 anywhere.
Therefore, since [a..b]
k + init
· xb - a + xorout
is the checksum from a to b, the above equation should be 0 iff it is equal to crc
k.
That is equivalent to end(b) = start(a), so we just have to find values of start(a) that are equal to end(a).
All start(a) and end(a) values can be calculated in linear time:
if one has [0..a]
one can of course calculate [0..a+1]
in constant time, which means one can first calculate all ([0..m]
k - init
) and ([0..m]
k + xorout
- crc
k) in linear time.
After that, one can then calculate the whole terms, this time in reverse directions, because one can calculate xa + 1 from xa in constant time.
Using a hashtable to find duplicates, one can then get all pairs (a, b) where start(a) = end(b), again in linear time. One has to sometimes represent the offsets for example as a₁,a₂,a₃:b₁,b₂
meaning that the intervals may start at any of the ak and end at any of the bk to avoid quadratically many pairs when explicitely showing them.
This can be applied to multiple files at once by basically just doing everything in (𝔽[X]/poly
)n instead of plain 𝔽[X]/poly
.