Real world: hashes
You will often see one term used again and again: hash, along with its derivatives, such as hashes and hashing. This topic is closely related to zpaqfranz, so it’s worth providing some introductory clarification. Apologies to the experts, but I will gloss over many important details.
A hash is, to put it as simply as possible, a numeric code associated with a portion of data (typically a file, but it could also be part of a file; we’ll come back to this later when explaining the "magic" of zpaq). This numeric code can have various lengths, such as 32 bits (which is 4 bytes, or 4 characters), 128 bits, 160, 256, 512, and so on. Since these are very large numbers, they are often represented in a specific format called HEX (or hexadecimal), where each character stands for 4 bits. We won’t go into detail, but it’s essentially a string of characters made of the digits 0-9 and the letters A-F. For example, ABC12345 could be (in the context of zpaq) a hash.
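To make the "number written in HEX" idea concrete, here is a minimal Python sketch. It uses CRC-32 from the standard `zlib` module purely as an example of a 32-bit code; the input bytes are made up, and nothing here is meant to show what zpaq does internally.

```python
import zlib

# Made-up sample data, only for illustration.
data = b"hello, zpaqfranz"

# CRC-32 is a 32-bit code: an integer between 0 and 2**32 - 1.
crc = zlib.crc32(data)

# The same number written as 8 hexadecimal characters (4 bits each).
crc_hex = f"{crc:08X}"

print(crc)      # the code as a plain decimal number
print(crc_hex)  # the same code as an 8-character HEX string
```

A string like ABC12345 is exactly this kind of thing: 8 HEX characters, so 32 bits.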
Let’s get back to the point. Given a certain amount of data (for simplicity’s sake, let’s say a .jpg image that is about a million characters in size), through a process we won’t describe in detail, I can generate a hash for the image, and the hash itself is only 16 characters long. Interesting, right? We started with a million characters (or even a billion, or a thousand billion) and got just a handful.
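Here is a small sketch of that "a million in, a handful out" effect. I picked MD5 from Python's `hashlib` only because its output happens to be 16 bytes, which matches the example; that is my choice for illustration, and the data is made up.

```python
import hashlib

# Made-up inputs: one single character and about a million characters.
small = b"x"
large = b"x" * 1_000_000

# MD5 always produces 16 bytes (128 bits), whatever the input size.
small_hash = hashlib.md5(small).digest()
large_hash = hashlib.md5(large).digest()

print(len(small), "bytes in ->", len(small_hash), "bytes out:", small_hash.hex())
print(len(large), "bytes in ->", len(large_hash), "bytes out:", large_hash.hex())
```

Both hashes have the same length, but their values differ because the inputs differ.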
The interesting part is that, for a good hash and for all practical purposes (strictly speaking collisions can exist, but I won’t go into details), no two different files have the same hash.
Let’s go back to our example. Suppose we have a photo of a cat with hash 27. Now suppose we have another photo (I don’t know of what) with hash 154. Well, I can conclude with certainty that the second photo is different from the first one, because the same data always produces the same hash. I can say this without even seeing it. It has a different hash, so it’s different.
The reverse is also true in practice: if two files have the same hash, they are identical. So, if I had two files, let’s say pippo.doc and pluto.pdf, both with the same hash (let’s say 99), I would be certain that, in reality, they are exactly the same file. Interesting, right?
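The pippo.doc / pluto.pdf situation can be sketched in a few lines of Python. The helper `file_hash` is a hypothetical name of mine, SHA-256 is just one reasonable choice, and the files are temporary ones with made-up content.

```python
import hashlib
import os
import tempfile


def file_hash(path, chunk_size=65536):
    """Return the SHA-256 hash (as HEX) of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as folder:
    pippo = os.path.join(folder, "pippo.doc")
    pluto = os.path.join(folder, "pluto.pdf")
    other = os.path.join(folder, "other.jpg")

    # Two files with different names but exactly the same content...
    for path in (pippo, pluto):
        with open(path, "wb") as f:
            f.write(b"the very same bytes")
    # ...and a third file with different content.
    with open(other, "wb") as f:
        f.write(b"something else entirely")

    same = file_hash(pippo) == file_hash(pluto)
    different = file_hash(pippo) != file_hash(other)

print(same)       # True: same content, same hash, whatever the file names
print(different)  # True: different content, different hash
```

The names and extensions play no role: only the bytes inside the files are hashed.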
There are various types of hashes, each with its own name. (Strictly speaking, we should also distinguish between categories such as checksums, cryptographic hashes, and so on; forgive me if I don’t go into that here.) These are important names that you will often see, and there are many of them.
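To give a feel for a few of those names, this sketch asks Python's standard library how long the output of some well-known algorithms is. The selection is mine and is only an example; it says nothing about which hashes a given tool supports.

```python
import hashlib

# Made-up input; the output size does not depend on it.
data = b"the same input for every algorithm"

sizes_in_bits = {
    "crc32": 32,  # zlib.crc32 is a checksum that returns a 32-bit integer
    "md5": hashlib.md5(data).digest_size * 8,
    "sha1": hashlib.sha1(data).digest_size * 8,
    "sha256": hashlib.sha256(data).digest_size * 8,
    "sha512": hashlib.sha512(data).digest_size * 8,
}

for name, bits in sizes_in_bits.items():
    print(f"{name}: {bits} bits")
```

These are the 32, 128, 160, 256 and 512 bit lengths mentioned earlier.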
For now, let’s just talk about one (you can find the technical details in the various pages): SHA-1. We don’t need to say much about it, except that it is the fundamental component behind the zpaq technology.
Do you remember when I mentioned that it can recognize duplicate information? Well, this happens thanks to SHA-1 codes.
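Here is a toy sketch of how a hash can be used to spot duplicate information. To be clear about the assumptions: this is not zpaq's actual algorithm. I cut the data into fixed-size blocks for simplicity, and the names `deduplicate`, `rebuild`, `store` and `recipe` are hypothetical, invented for this example.

```python
import hashlib


def deduplicate(data, block_size=4096):
    """Split data into fixed-size blocks and keep each distinct block once."""
    store = {}   # SHA-1 (as HEX) -> block bytes, each distinct block saved once
    recipe = []  # ordered list of SHA-1 codes needed to rebuild the data
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in store:
            store[digest] = block  # first time we see this content
        recipe.append(digest)      # repeats only cost one more reference
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original data from the stored blocks."""
    return b"".join(store[digest] for digest in recipe)


# Made-up data: five blocks, but only two distinct contents.
data = b"A" * 4096 * 3 + b"B" * 4096 + b"A" * 4096
store, recipe = deduplicate(data)

print(len(recipe), "blocks referenced,", len(store), "blocks stored")
print(rebuild(store, recipe) == data)
```

Blocks with the same SHA-1 are treated as the same content, so they are stored only once and merely referenced afterwards.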
This is a difficult topic, hang in there! A few more notes and then we’ll summarize.
Why are there so many different types of hashes? Isn’t one enough? Why all these complications? The answer is that, in general, the "stronger" (let’s say "better") a hashing algorithm is, the slower it tends to be (OK, this isn’t technically always true, but for now, let’s say it is).
So, a compromise is sought between computational speed, memory requirements, CPU usage (think about how important it is NOT to overheat a phone’s CPU, which would drastically reduce battery life), and so on.
Just as there isn’t a "perfect" vehicle (if you want to move house, a Ferrari won’t do; you’d need a pickup truck. And, conversely, if you want to race on a track, you won’t be able to do it with a van), there isn’t a "perfect" hash either. Some are fast, some are secure, and some are very secure.
So, the goal is to use the right tool for the right situation. "Powerful" (but "heavy") when needed, "fast" and "lightweight" when that’s enough for a particular task.
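If you want to see the trade-off on your own machine, here is a rough timing sketch. The helper `time_hash`, the choice of algorithms and the buffer size are all my own assumptions; the numbers depend heavily on the CPU (some have dedicated instructions for certain hashes), so I deliberately show no expected output.

```python
import hashlib
import time
import zlib


def time_hash(func, data, repeats=5):
    """Return the best (smallest) time in seconds over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    return best


# 4 MB of made-up data; real files would behave similarly.
data = b"\x00" * (4 * 1024 * 1024)

candidates = {
    "crc32": zlib.crc32,                            # a simple checksum
    "sha1": lambda d: hashlib.sha1(d).digest(),     # 160-bit hash
    "sha256": lambda d: hashlib.sha256(d).digest(), # 256-bit hash
    "sha512": lambda d: hashlib.sha512(d).digest(), # 512-bit hash
}

timings = {name: time_hash(func, data) for name, func in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Treat the results as a rough indication only: the ranking can change from one machine to another.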
OK, that was tough, let’s stop here for now!