Creating test set with hash #71
Thanks for your kind words and your question. What we are trying to achieve is to have a stable test set across multiple runs of your program. If the test set changes every time you run the program, then you run the risk of gradually tweaking your program and your model to fit all of the data (including test instances), which is bad because it makes the evaluation of your model on the test set unreliable (the metric will be too optimistic).

Next question: how can you ensure that the test set is stable across multiple runs? Well, there are countless solutions. For example, you could simply pick a set (training or test) at random for each instance, just once, then write the two sets to separate files. In subsequent runs of your program, you would just use these existing files. End of story. However, this extra preprocessing is not necessary if every instance has a unique ID: suppose the ID is a random integer, then you can simply decide that instances whose IDs end with 00 to 19 belong to the test set, and the rest belong to the training set (this would ensure that 20% of your instances go to the test set, and 80% go to the training set). So for example, the instance with ID=5910406 is in the test set, but the instance with ID=48290483 is not. This solution would work fine. However, database IDs are often sequential, not random, so you need a way to shuffle the IDs to ensure that the instances are randomly distributed across the training set and the test set.

This leads us to hash functions. As Wikipedia says, "A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes." The most famous example is the MD5 function:

>>> from hashlib import md5
>>> md5(b"This could be a very long document").digest()
b'\x98\x14\x8d\x9f\xca\xf4\xf69F\x82\xf9kF\xda\x01\x06'
>>> md5(b"This could be a very long document").digest()
b'\x98\x14\x8d\x9f\xca\xf4\xf69F\x82\xf9kF\xda\x01\x06'
>>> md5(b"This may be a very long document").digest()
b'}\xeb\xdb\xdc\r\x81D\xa0\xa1\\\x19\x8c\xccT:-'

As this example shows, the MD5 function is deterministic: if you run it twice on the same data, it outputs the same result twice. However, if you run it on two different documents (even if they differ by only one bit), it outputs very different, seemingly random results. This is exactly what we need to shuffle the instance IDs! What we can do is just run the instance IDs through MD5 (or any other hash function), and this will give us a new, seemingly random ID (but actually stable across multiple runs, since hash functions are deterministic). Now MD5 outputs a 128-bit number, so if we want to have 20% of the instances in the test set, we can just compute the MD5 of each instance's ID, and if the result is lower than 2^128 * 20%, the instance goes to the test set, otherwise it goes to the training set. Unfortunately, things get slightly ugly in Python when dealing with MD5, since it actually returns a byte array (with 16 bytes), not a large integer. We could convert the whole byte array to a long 128-bit integer (for example with int.from_bytes()), but it is simpler to just look at the last byte of the hash: if it is lower than or equal to 51 (roughly 20% of 256), the instance goes to the test set.

Lastly, what do we do if the dataset has no IDs at all? Well, we can just apply the MD5 function to each instance's input features (assuming they are unique and stable across runs). In the case of the California Housing dataset, you could use the longitude, latitude, median_house_value, population and so on. For example:

>>> doc = "\n".join([str(field)
... for field in (longitude, latitude, median_house_value, population)])
...
>>> h = md5(doc.encode("ascii")).digest()
>>> is_test = (h[-1] <= 51)
>>> is_test
True

I hope this helps!
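To put those pieces together, here is a minimal sketch of how this last-byte check could be applied to a whole dataset. The column names and the housing DataFrame below are illustrative assumptions, not code from the book:

from hashlib import md5

def is_test_instance(row, test_ratio=0.2):
    # Build a stable document from the instance's features
    # (assumed unique and stable across runs)
    doc = "\n".join(str(field) for field in row)
    h = md5(doc.encode("ascii")).digest()
    # Keep only the last byte of the hash (0-255, roughly uniform) and send
    # the instance to the test set if it falls in the lowest ~20% of values
    return h[-1] < 256 * test_ratio  # bytes 0..51 for a 20% ratio

# Hypothetical usage with a pandas DataFrame named `housing`:
# in_test = housing[["longitude", "latitude", "median_house_value",
#                    "population"]].apply(is_test_instance, axis=1)
# train_set, test_set = housing[~in_test], housing[in_test]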
Thank you for a clear explanation.
Hi @SumoBBQ,

The MD5 hash function acts like a pseudo-random number generator with a uniform probability distribution (as do most hash functions, in fact). So roughly 52/256 = 20.3125% of the hashes will have a last byte that is strictly lower than 52 (i.e., lower than or equal to 51).

Hope this helps.
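If you want to sanity-check that figure empirically, here is a rough sketch (the number of IDs is arbitrary):

from hashlib import md5
import numpy as np

# Hash 100,000 sequential IDs and count how many would land in the test set
ids = np.arange(100_000)
in_test = sum(md5(np.int64(i)).digest()[-1] <= 51 for i in ids)
print(in_test / len(ids))  # should come out close to 52/256 ≈ 0.203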
Nice! Thank you for your help 👍
Hi @ageron, Thanks for the nice explanation above! I am a beginner in the hash area, so my questions may sound strange! I apologize in advance! I just want to have a clear picture in my mind. Thanks a lot,
Hi @zahraegh,

Here's a typical usage of the MD5 hash function:

>>> from hashlib import md5
>>> md5(b"hello")
<md5 HASH object @ 0x101988580>
>>> md5(b"hello").digest()
b']A@*\xbcK*v\xb9q\x9d\x91\x10\x17\xc5\x92'

You give it a byte array of any length, and when you call the digest() method, it returns a byte array of fixed length: 16 bytes, i.e. 128 bits:

>>> len(md5(b"hello").digest())
16
>>> len(md5(b"hello").digest())*8
128
>>> len(md5(b"hello this is longer").digest())*8
128

The md5() function requires an object that supports the buffer API (such as a byte array), so you cannot pass it a plain Python integer:

>>> md5(123)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object supporting the buffer API required

One workaround is to convert the integer to a string, then encode that string (e.g. using ASCII) to a byte array. But that adds some unnecessary overhead. Instead, we can just create a NumPy int64, which supports the buffer API:

>>> import numpy as np
>>> md5(np.int64(123))
<md5 HASH object @ 0x101f2ba58>
>>> md5(np.int64(123)).digest()
b'\xf1\x8b\x8d\xbe\xfe\x02\xa0\xef\xce(\x1d\xebU\xa2\t\xcd'

Whether the dataset is big or small, we want to keep about 20% for the test set and 80% for the training set. We don't need a one-to-one mapping from instance to hash: all we need is a deterministic mapping from each instance in the dataset to either the training set or the test set, so it does not matter if two instances end up with the same hash. We just need the mapping to be deterministic, so that a given instance always gets assigned to the same subset (train or test). And it should also be pseudo-random, so that instances get well distributed across the training set and the test set.

Hope this helps.
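To make those two workarounds concrete, here is a small sketch showing that both are deterministic (they simply produce different hashes from each other, since the input bytes differ):

from hashlib import md5
import numpy as np

# Workaround 1: convert the integer ID to a string, then to ASCII bytes
h1 = md5(str(123).encode("ascii")).digest()

# Workaround 2: wrap the ID in a NumPy int64, which supports the buffer API
h2 = md5(np.int64(123)).digest()

# Hashing the same ID again always yields the same bytes
assert h1 == md5(str(123).encode("ascii")).digest()
assert h2 == md5(np.int64(123)).digest()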
I would like to ask one question. In the function:

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

Where is
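(For readers following along: test_set_check is not defined in the snippet above. A plausible implementation, consistent with the last-byte rule discussed earlier in this thread, might look like the following; the book's exact code may differ.)

import hashlib
import numpy as np

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    # Keep only the last byte of the hash, and put the instance in the
    # test set if that byte is below 256 * test_ratio (0..51 for a 20% ratio)
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio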
The call ids.apply(...) applies the function test_set_check to each element of ids. In case you don't understand how the function is applied in the above code, I'd suggest looking up lambda functions. Basically, the lambda expression lambda id_: test_set_check(id_, test_ratio, hash) is equivalent to:
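(The equivalent snippet was lost when this page was archived; presumably it looked something like the following, where check is just an illustrative name, and test_ratio, hash, and ids come from the enclosing split_train_test_by_id function:)

def check(id_):
    # Same behaviour as the lambda: forward the ID to test_set_check
    return test_set_check(id_, test_ratio, hash)

in_test_set = ids.apply(check)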
Thank you, @skp104.
Thanks for a clear explanation.
Hi @Riadh123-star,
Dear Mr. Ageron, |
I have one simple question. During the hashing, only the last byte of the hash is used to test whether the instance in question belongs to the test set. Yes, the whole hash is a unique value (unless a collision happens), but only the last byte (0-255) is used as the determinant of which set the instance belongs to. So, are you saying that because the hashing algorithm provides a "uniform distribution", roughly 20% of the last-byte values will be lower than or equal to 51 (about 20% of 256)?

Thank you. BTW, I purchased your book. Love it so far.
Hi @ageron, I really love your book and it is easy to understand for me, but in chapter 2 you wrote:
For example, you could compute a hash of each
instance’s identifier, keep only the last byte of the hash, and put the
instance in the test set if this value is lower or equal to 51 (~20% of 256).
This ensures that the test set will remain consistent across multiple runs,
even if you refresh the dataset. The new test set will contain 20% of the
new instances, but it will not contain any instance that was previously in
the training set
Could you give me some explanation: what is the hash of an instance's identifier, and how can I apply it to split the dataset into two parts, a training set and a test set? And one more question: is an instance's identifier made up of longitude, median_house_value, population, and so on? Is that correct?
Sorry for my bad English :( Thank you in advance.