Implement 'seed' for 'train_test_split' (take two) #678
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #678 +/- ##
=======================================
Coverage 87.34% 87.34%
=======================================
Files 113 113
Lines 10791 10798 +7
Branches 1479 1480 +1
=======================================
+ Hits 9425 9432 +7
Misses 989 989
Partials 377 377
6a7a5f0 to 0522a33
Deploying datachain-documentation
Latest commit: fc66bec
Status: ✅ Deploy successful!
Preview URL: https://2433ebdb.datachain-documentation.pages.dev
Branch Preview URL: https://606-improve-split-dataset-in-3xqc.datachain-documentation.pages.dev
Test this implementation with this simple script:

# Usage: python <script> SEED WEIGHT [WEIGHT ...]
# A negative SEED means "no seed".
import random
import sys

from PIL import Image


def sys__rand():
    # Random signed 64-bit integer.
    return random.randint(0, MAX_INT_64) - 2**63


MAX_INT_64 = 2**64 - 1
COLORS = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 255, 0), (0, 0, 255)]

seed = int(sys.argv[1])
weights = [float(val) for val in sys.argv[2:]]

resolution = 2**31 - 1  # Maximum positive value for a 32-bit signed integer.
uniform_seed = random.Random(seed).randrange(1, resolution)
weights_normalized = [weight / sum(weights) for weight in weights]

width, height = 512, 512
im = Image.new("RGB", (width, height), "black")
pixels = im.load()

# res[i] counts the pixels assigned to split i.
res = [0 for _ in range(len(weights_normalized))]
for y in range(height):
    for x in range(int(width)):
        sys_rand_col = sys__rand()
        rand_col = sys_rand_col if seed < 0 else (sys_rand_col % resolution) * uniform_seed
        rand_col = rand_col % resolution
        # Pick the split whose cumulative-weight interval contains rand_col.
        for index, _ in enumerate(weights_normalized):
            if (
                rand_col >= round(sum(weights_normalized[:index]) * resolution) and
                rand_col < round(sum(weights_normalized[: index + 1]) * resolution)
            ):
                pixels[x, y] = COLORS[index]
                res[index] += 1
                break

im.save("random_pattern.png", "PNG")
print(res)

No seed, 1:1 distribution
No seed, 10:1 distribution
No seed, 10:5:1 distribution
With seed, 1:1 distribution
With seed, 10:1 distribution
Unless proven otherwise, I will assume this to be a uniform distribution and consider the task complete.
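One way to go beyond eyeballing the images is a chi-square goodness-of-fit check on the per-split counts that the script prints as `res`. The sketch below is only an illustration: the helper names `chi_square` and `looks_uniform` are hypothetical, the counts in the usage example are made up (not results from this PR), and the critical values are the standard 5% thresholds for 1 to 4 degrees of freedom.

```python
# Standard chi-square critical values at the 5% significance level,
# keyed by degrees of freedom (number of splits minus one).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square(counts, weights):
    """Chi-square statistic of observed counts against the expected weights."""
    total_weight = sum(weights)
    n = sum(counts)
    stat = 0.0
    for observed, weight in zip(counts, weights):
        expected = n * weight / total_weight
        stat += (observed - expected) ** 2 / expected
    return stat


def looks_uniform(counts, weights):
    """True if the counts are consistent with the weights at the 5% level."""
    dof = len(counts) - 1
    return chi_square(counts, weights) <= CRITICAL_05[dof]


if __name__ == "__main__":
    # Hypothetical counts for a 512x512 image with a 1:1 split.
    example = [131000, 131144]
    print(chi_square(example, [1, 1]), looks_uniform(example, [1, 1]))
```

A statistic below the threshold does not prove uniformity; it only means the counts give no evidence against the requested weights.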
Thanks for this awesome work, @dreadatour! 🤩
Just a small note with regard to half-open intervals: should RESOLUTION be RESOLUTION + 1 for modulo and randrange but not for multiplication?
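For context on the half-open intervals in question, here is a small sketch of what Python's `%` and `random.Random.randrange` actually produce, with a tiny `n` standing in for RESOLUTION. The helper names are hypothetical, and the sketch does not settle whether the `+ 1` is needed in the PR; it only shows which endpoint each operation excludes.

```python
import random


def modulo_range(n):
    """Smallest and largest value of x % n over positive and negative x."""
    values = [x % n for x in range(-3 * n, 3 * n)]
    return min(values), max(values)


def randrange_range(n, draws=10_000, seed=0):
    """Smallest and largest value seen from randrange(1, n)."""
    rng = random.Random(seed)
    values = [rng.randrange(1, n) for _ in range(draws)]
    return min(values), max(values)


if __name__ == "__main__":
    n = 5
    # x % n covers [0, n): n itself is never produced.
    print(modulo_range(n))
    # randrange(1, n) covers [1, n): again n is excluded.
    print(randrange_range(n))
```

So with the current code the value RESOLUTION itself is never reached by either operation; using RESOLUTION + 1 there would include it.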
90c082e to fc66bec
Testing simple approach discussed in #657
A new seed param for the train_test_split toolkit method was added.
Closes #606
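To illustrate why a seed makes the split reproducible, here is a minimal pure-Python sketch of the idea tested in the script earlier in this thread. It is not the datachain implementation: `split_rows` is a hypothetical helper, `seed=None` as the "no seed" value is an assumption, and each row is assumed to carry one stored signed 64-bit random value.

```python
import bisect
import random

RESOLUTION = 2**31 - 1  # same resolution as the test script in this thread


def split_rows(rand_values, weights, seed=None):
    """Return a split index per row, given one stored random value per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split, scaled to the resolution.
    bounds, acc = [], 0.0
    for weight in weights:
        acc += weight / total
        bounds.append(round(acc * RESOLUTION))
    bounds[-1] = RESOLUTION  # guard against float rounding on the last bound

    # Assumption: seed=None means "no seed", so the multiplier is 1.
    multiplier = 1 if seed is None else random.Random(seed).randrange(1, RESOLUTION)

    assignments = []
    for raw in rand_values:
        value = ((raw % RESOLUTION) * multiplier) % RESOLUTION
        # Index of the first bound strictly greater than value.
        assignments.append(bisect.bisect_right(bounds, value))
    return assignments


if __name__ == "__main__":
    rng = random.Random(1)
    rows = [rng.randint(-2**63, 2**63 - 1) for _ in range(10)]
    print(split_rows(rows, [0.7, 0.3], seed=42))
    print(split_rows(rows, [0.7, 0.3], seed=42))  # same seed, same split
```

Because the per-row values are fixed and the seed only selects the multiplier, the same seed always yields the same assignment, while a different seed reshuffles the rows.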