Implement 'seed' for 'train_test_split' (take two) #678

Merged
merged 10 commits into from
Dec 11, 2024

dreadatour
Contributor

@dreadatour dreadatour commented Dec 9, 2024

This PR tests the simple approach discussed in #657.

A new seed parameter was added to the train_test_split toolkit method:

train, test, val = train_test_split(dc, [0.7, 0.2, 0.1], seed=37)
train.save("dataset_train")
test.save("dataset_test")
val.save("dataset_val")
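To illustrate what a fixed seed is meant to guarantee, here is a minimal sketch in plain Python. It is modelled on the test script posted in this PR, not on the merged datachain code. `assign_split`, `RESOLUTION` and the `rows` list are hypothetical names, and the per-row random values are an assumption made for the example.

```python
import random

RESOLUTION = 2**31 - 1  # assumption: same resolution as the test script in this PR


def assign_split(rand_value, weights, seed=None):
    """Return the index of the split a row lands in (hypothetical helper)."""
    if seed is not None:
        # Derive a multiplier from the seed and scramble the row's random value.
        uniform_seed = random.Random(seed).randrange(1, RESOLUTION)
        rand_value = (rand_value % RESOLUTION) * uniform_seed
    rand_value %= RESOLUTION

    total = sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight / total
        if rand_value < round(cumulative * RESOLUTION):
            return index
    return len(weights) - 1  # guard against floating-point rounding


# Assumption: every row carries a fixed random signed 64-bit value.
rows = [random.Random(i).randint(-(2**63), 2**63 - 1) for i in range(1000)]

first = [assign_split(r, [0.7, 0.2, 0.1], seed=37) for r in rows]
second = [assign_split(r, [0.7, 0.2, 0.1], seed=37) for r in rows]
other = [assign_split(r, [0.7, 0.2, 0.1], seed=38) for r in rows]

print(first == second)  # same seed, same assignment
print(first == other)   # a different seed reshuffles the rows
```

In this sketch the same seed always reproduces the same assignment, while a different seed gives a different one.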

Closes #606


codecov bot commented Dec 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.34%. Comparing base (6ca3c98) to head (fc66bec).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #678   +/-   ##
=======================================
  Coverage   87.34%   87.34%           
=======================================
  Files         113      113           
  Lines       10791    10798    +7     
  Branches     1479     1480    +1     
=======================================
+ Hits         9425     9432    +7     
  Misses        989      989           
  Partials      377      377           
Flag Coverage Δ
datachain 87.28% <100.00%> (+<0.01%) ⬆️


@dreadatour dreadatour force-pushed the 606-improve-split-dataset-into-train-test-eval-2 branch from 6a7a5f0 to 0522a33 Compare December 10, 2024 04:30

cloudflare-workers-and-pages bot commented Dec 10, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: fc66bec
Status: ✅  Deploy successful!
Preview URL: https://2433ebdb.datachain-documentation.pages.dev
Branch Preview URL: https://606-improve-split-dataset-in-3xqc.datachain-documentation.pages.dev


@dreadatour dreadatour requested a review from a team December 10, 2024 04:32
Contributor Author

dreadatour commented Dec 10, 2024

Test this implementation with this simple script:

import random
import sys

from PIL import Image

MAX_INT_64 = 2**64 - 1
COLORS = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 255, 0), (0, 0, 255)]


def sys__rand():
    # Uniform random signed 64-bit integer.
    return random.randint(0, MAX_INT_64) - 2**63


# Usage: python test-train-split-img.py SEED WEIGHT [WEIGHT ...]
# A negative SEED disables seeding. At most len(COLORS) weights are supported.
seed = int(sys.argv[1])
weights = [float(val) for val in sys.argv[2:]]

resolution = 2**31 - 1  # Maximum positive value for a 32-bit signed integer.
uniform_seed = random.Random(seed).randrange(1, resolution)
total_weight = sum(weights)
weights_normalized = [weight / total_weight for weight in weights]

width, height = 512, 512
im = Image.new("RGB", (width, height), "black")
pixels = im.load()
res = [0] * len(weights_normalized)  # Number of pixels assigned to each split.
for y in range(height):
    for x in range(width):
        sys_rand_col = sys__rand()
        # With a seed, scramble the value by multiplying with the seed-derived constant.
        rand_col = sys_rand_col if seed < 0 else (sys_rand_col % resolution) * uniform_seed
        rand_col = rand_col % resolution

        # Pick the split whose half-open interval [lower, upper) contains rand_col.
        for index, _ in enumerate(weights_normalized):
            if (
                rand_col >= round(sum(weights_normalized[:index]) * resolution) and
                rand_col < round(sum(weights_normalized[: index + 1]) * resolution)
            ):
                pixels[x, y] = COLORS[index]
                res[index] += 1
                break

im.save("random_pattern.png", "PNG")
print(res)
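The inner loop of the script recomputes the cumulative weight sums for every pixel. A possible alternative, sketched here under the assumption that only the bucket lookup matters, is to compute the interval boundaries once and find the bucket with `bisect`. `split_boundaries` and `split_index` are hypothetical helpers, not part of datachain.

```python
from bisect import bisect_right
from itertools import accumulate

RESOLUTION = 2**31 - 1  # same value as `resolution` in the script


def split_boundaries(weights, resolution=RESOLUTION):
    """Upper bounds of the half-open intervals [lower, upper) for each split."""
    total = sum(weights)
    return [round(c / total * resolution) for c in accumulate(weights)]


def split_index(value, boundaries):
    """Index of the interval containing value, for 0 <= value < boundaries[-1]."""
    # bisect_right counts the boundaries that are <= value, which is exactly
    # the index of the first interval whose upper bound is above value.
    return bisect_right(boundaries, value)


boundaries = split_boundaries([10, 5, 1])
print(boundaries)
print(split_index(0, boundaries), split_index(RESOLUTION - 1, boundaries))
```

The lookup is then O(log n) in the number of splits instead of re-summing the weights on every iteration, although with two or three splits the difference is mostly about readability.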

No seed, 1:1 distribution

$ python test-train-split-img.py -1 1 1
[131389, 130755]

[image: random_pattern.png]

No seed, 10:1 distribution

$ python test-train-split-img.py -1 10 1
[238471, 23673]

[image: random_pattern.png]

No seed, 10:5:1 distribution

$ python test-train-split-img.py -1 10 5 1
[163846, 81782, 16516]

[image: random_pattern.png]

With seed, 1:1 distribution

$ python test-train-split-img.py 123456 1 1
[131329, 130815]

[image: random_pattern.png]

With seed, 10:1 distribution

$ python test-train-split-img.py 123456 10 1
[238164, 23980]

[image: random_pattern.png]
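As a rough sanity check on the five runs above, the observed counts can be compared with the counts the weights predict. This is a sketch, not a statistical test. `RUNS` copies the numbers from this comment and `relative_deviations` is a hypothetical helper.

```python
# (label, weights, observed counts) copied from the runs in this comment.
RUNS = [
    ("no seed, 1:1", [1, 1], [131389, 130755]),
    ("no seed, 10:1", [10, 1], [238471, 23673]),
    ("no seed, 10:5:1", [10, 5, 1], [163846, 81782, 16516]),
    ("seed, 1:1", [1, 1], [131329, 130815]),
    ("seed, 10:1", [10, 1], [238164, 23980]),
]


def relative_deviations(weights, observed):
    """Relative difference between observed and expected count per split."""
    total = sum(observed)
    weight_sum = sum(weights)
    expected = [total * w / weight_sum for w in weights]
    return [(o - e) / e for o, e in zip(observed, expected)]


for label, weights, observed in RUNS:
    deviations = relative_deviations(weights, observed)
    print(label, [f"{d:+.3%}" for d in deviations])

worst = max(abs(d) for _, w, o in RUNS for d in relative_deviations(w, o))
```

Every split in these runs ends up within 1% of its expected size, with or without a seed.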


Unless proven otherwise, I will assume this to be a uniform distribution and consider the task complete.
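One argument in favour of that assumption, offered as a sketch rather than a proof about the merged code: 2**31 - 1 is prime, so multiplying by any uniform_seed in [1, resolution) and reducing modulo resolution only permutes the residues, and a permutation of uniformly distributed values is still uniform. The check below demonstrates the property on the small prime 31 (2**5 - 1) and shows that it fails for a composite modulus. `is_permutation` is a hypothetical helper.

```python
def is_permutation(multiplier, modulus):
    """True if r -> (r * multiplier) % modulus maps 0..modulus-1 onto itself."""
    image = {(r * multiplier) % modulus for r in range(modulus)}
    return image == set(range(modulus))


small_prime = 2**5 - 1  # 31, standing in for 2**31 - 1
prime_ok = all(is_permutation(k, small_prime) for k in range(1, small_prime))

composite = 2**4 - 1  # 15 = 3 * 5, not prime
composite_ok = all(is_permutation(k, composite) for k in range(1, composite))

print(prime_ok, composite_ok)
```

With a composite resolution, some multipliers would collapse several residues onto the same value and skew the split sizes, so the choice of a prime resolution matters for this argument.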

Member

@0x2b3bfa0 0x2b3bfa0 left a comment

Thanks for this awesome work, @dreadatour! 🤩

Just a small note with regard to half-open intervals: should RESOLUTION be RESOLUTION + 1 for modulo and randrange but not for multiplication?
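For context on the half-open intervals in question, here is a small illustration of the value ranges involved, using n = 5 in place of the real resolution. It only shows what Python's `randrange` and `%` produce and does not settle which bound the implementation should use.

```python
import random

n = 5  # small stand-in for RESOLUTION

# randrange(1, n) is half-open: it can return 1 .. n-1, never n.
rng = random.Random(0)
randrange_values = {rng.randrange(1, n) for _ in range(1000)}

# x % n is also half-open: it yields 0 .. n-1, even for negative x in Python.
modulo_values = {x % n for x in range(-20, 21)}

print(sorted(randrange_values))
print(sorted(modulo_values))
```

Both operations exclude n itself, so using n + 1 in either place would widen the range by one value.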

@dreadatour dreadatour requested a review from 0x2b3bfa0 December 10, 2024 15:41
@0x2b3bfa0 0x2b3bfa0 force-pushed the 606-improve-split-dataset-into-train-test-eval-2 branch from 90c082e to fc66bec Compare December 11, 2024 07:39
@dreadatour dreadatour merged commit 5db33a2 into main Dec 11, 2024
34 checks passed
@dreadatour dreadatour deleted the 606-improve-split-dataset-into-train-test-eval-2 branch December 11, 2024 08:14
Development

Successfully merging this pull request may close these issues.

Improve split Dataset into train / test / eval
2 participants