-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[skip ci] Fix Broken Link in NYC Taxi Notebook #616
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@@ -58,9 +58,9 @@ | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taureandyernv , are you aware how was the notebook run prior to this change? On what machine, with/without memory pool etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're saving the parquet file as a csv :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taureandyernv do you have an answer to the question about what hardware it was run on, memory pool etc? We need to make sure this is not a real regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future we should always store this information in the notebook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
acutally...this is about right. This should take 30ish seconds on a tesla based hardware, 3.5 minutes on quadro. @isVoid, what slowdown are you talking about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the diff in this PR, you will see the slowdown I am talking about.
The notebook as checked in stores the time taken by the previous run. We should document in the notebook what hardware, RAPIDS and CUDA version, etc. it was run on. Otherwise when people run and they see a slower time they are likely to be confused or surprised.
None of this "this is about right" guesswork. Let's be precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the slowdown:
old:
"CPU times: user 11.2 s, sys: 6.27 s, total: 17.5 s\n",
"Wall time: 17.7 s\n"
new
"CPU times: user 22.6 s, sys: 14.6 s, total: 37.2 s\n",
"Wall time: 37.2 s\n"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is interesting and I've honestly don't recall seeing it run that fast. However, we can't yet compare these numbers and say there's a regression, but we can strongly suspect it. When we built this notebook for the 0.14 release, I did my work on RTX8000s (here is the commit with the ~3.5 minute run), ran some quick tests on the GV100s to see how my time could be so different, and only reran it on GV100s with this update. @thomcom used a different machine that was consistently faster with the PIP algorithm. It was cuSpatial's PIP that helped me realize the depth of the "by algorithm" performance discrepancies that classes of cards in the same arch family could have. Looking back through the branches, the notebook hasn't actually been reran since its merge and the CI jenkins jobs for each release may have already been deleted. This may end up being a @thomcom question, as those 17 second numbers were run on his machine, as per 6c3d853 and this commit (which I could only render ) . We may need him to rerun the notebook on that system to get any realistic understanding about if there is a regression and where it could have happened. @thomcom , may you provide your system details for @harrism and, if you still have that system, rerun the notebook on that system on 22.08 nightlies and provide us numbers?
A bit about the past:
While I can talk about what I recall observing then (and could be mistaken), the challenge you're facing comes from the stoppage of testing and record keeping for regressions on notebooks right around 0.14 and notebooks testing didn't start back up until after 0.18, which wasn't nearly as thorough until very recently. Some libraries reestablished forms of testing and record keeping a bit more whole heartedly than others, but cuGraph remains a shining example of what might help with system spec tracking with their table at the top). For now, would it help that current and future cuSpatial notebooks include a version of cuGraph's tables?
Thoughts on the future:
I absolutely agree on the precise record keeping, with system specs, which is why it's a built in portion of the outputs with that new beta notebooks testing script (although we currently don't test the cuSpatial notebook). We're only collecting data, not yet processing or analyzing it. We're in talks to get it integrated into Ops. Would you like me to start testing cuSpatial notebooks? Happy to share more info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @taureandyernv @harrism I'll try to rerun the notebook this afternoon or tomorrow.
could probably skip ci for PRs like this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please change the taxi2017.csv to taxi2017.parquet in the noted areas. Also, that data needs to be downloaded ad its about 4gb.
getting the data in the folder, which may be hard for a CI machine without getting ops support, may solve the speed issues
Fix a parquet issues, fix dtype issues
"CPU times: user 11.2 s, sys: 6.27 s, total: 17.5 s\n", | ||
"Wall time: 17.7 s\n" | ||
"CPU times: user 22.6 s, sys: 14.6 s, total: 37.2 s\n", | ||
"Wall time: 37.2 s\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the slowdown I'm talking about. More than 2x slower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh thanks! I hadn't noticed that at all!
Since there's still longwinded discussion happening, we have to retarget this to 22.10. We are in code freeze. |
Hmmm, @harrism , i thought the discussion was a side topic of the possibility of a regression, but not dealing with the material updates of the notebook itself. All necessary changes are complete. This means that cuspatial won't have a working notebook for 22.08 and there are more eyes on spatial analytics right now... |
We have rules: PRs must be approved before merging. Also, the labels don't say whether this PR is ready to merge. We should treat opening new PRs during burndown as a "big deal". I approved this PR, but really @thomcom should approve it since he is more of a Python codeowner than I am (and since @isVoid can't approve his own PR). I will leave merging up to @isVoid since I don't know if it is ready. |
I ran the code on a dgx and branch-22.08 and I have this timing for the cell:
I think the function performance depends a lot on which machine it runs on. Since code freeze starts tomorrow. Let's merge this and follow up on the effects of machine to performance. |
@gpucibot merge |
I reran the benchmark and got 14.5 seconds. |
This PR fixes both broken links for the NYC 2016 and NYC 2017 datasets and reran the notebooks under 22.08.