Couldn't understand the code in chapter 2 while separating the test set #567

Open
sniray opened this issue May 27, 2020 · 1 comment

sniray commented May 27, 2020

Hi Mr. Aurélien Géron,

In your book, while separating the test set, you wrote:
import hashlib
import numpy as np

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

Can you help me understand how the hash helps in separating the test set and avoids the problems mentioned before? In the second book you used crc32, and the code is as follows:
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
How is this equivalent to the code above?

Rosseel commented Jun 24, 2020

He covered it pretty extensively here: #71
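
For anyone who lands here first, a minimal sketch of the idea behind both versions may help. This is an illustration, not the book's code and not a summary of the linked issue: the names `in_test_md5`, `in_test_crc32`, and `pick_test_ids` are made up, and `.tobytes()` is used to hand the 8 bytes of each ID to the hash explicitly. Both versions hash the ID to a roughly uniform number and send the instance to the test set when that number falls in the lowest `test_ratio` fraction of the hash's range: the last MD5 byte ranges over 0 to 255, the CRC32 checksum over 0 to 2**32 - 1. Because the decision depends only on the ID, an instance does not move between sets when the dataset is refreshed or grows.

```python
import hashlib
from zlib import crc32

import numpy as np


def in_test_md5(identifier, test_ratio):
    # The last byte of an MD5 digest is roughly uniform over 0..255,
    # so about `test_ratio` of all IDs fall below 256 * test_ratio.
    id_bytes = np.int64(identifier).tobytes()
    return hashlib.md5(id_bytes).digest()[-1] < 256 * test_ratio


def in_test_crc32(identifier, test_ratio):
    # Same idea with a 32-bit checksum: roughly uniform over 0..2**32 - 1.
    id_bytes = np.int64(identifier).tobytes()
    return crc32(id_bytes) & 0xffffffff < test_ratio * 2**32


def pick_test_ids(ids, test_ratio, check):
    # Hypothetical helper: the set of IDs that `check` sends to the test set.
    return {i for i in ids if check(i, test_ratio)}


if __name__ == "__main__":
    for check in (in_test_md5, in_test_crc32):
        old = pick_test_ids(range(10_000), 0.2, check)
        new = pick_test_ids(range(20_000), 0.2, check)
        # Fraction of the original IDs that land in the test set.
        print(check.__name__, len(old) / 10_000)
        # Adding IDs 10000..19999 does not change where the old IDs went.
        print(check.__name__, {i for i in new if i < 10_000} == old)
```

The two functions serve the same purpose but use different hashes, so they will generally pick different IDs. Each one is stable on its own, which is the property the split relies on.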
