Datashader notebook #2
@mrocklin - here's the stripped-down version of the notebook. I'm in meetings the rest of my day, so I'll let you take it from here. Happy to pick things up again when I wake up. |
Thanks @rrpelgrim, playing now. I'm able to get this down to about 2s so far. I don't have the interactive stuff working though; that code seems a bit strange currently. I'll report back in a while. |
Pushed up to https://github.com/mrocklin/dask-tutorial/blob/main/2-dataframes-at-scale.ipynb. It doesn't flow pedagogically yet, and I had to get rid of interaction (but I hope to get it back after talking to PyViz folks), but everything computes close to where it should. We can process the large 800M-row dataset in about three seconds. If I can get interaction in there then that should leave the students with a positive experience.

Interactivity was weird: it resulted in far more computation and lots of slowness, and yet didn't actually give the ability to zoom around and see things re-render. It was like the worst of both worlds. I'm sure that there is a nicer way to go about this.

If anyone has a chance to try out the notebook, please do. |
OK, the next thing I want, I think, is interactivity. In the previous notebook I think we copy-pasted the following code:

```python
import datashader as ds
import holoviews as hv
import holoviews.operation.datashader as hd
from datashader.colors import Hot

# generate interactive datashader plot
shaded = hd.datashade(hv.Points(ddf, ['dropoff_longitude', 'dropoff_latitude']),
                      cmap=Hot, aggregator=ds.count('passenger_count'))
hd.dynspread(shaded, threshold=0.5, max_px=4).opts(bgcolor='black', xaxis=None,
                                                   yaxis=None, width=900, height=500)
```

But this seemed broken to me in two ways:

1. It kicked off far more computation than the static version and was very slow.
2. Panning and zooming didn't actually cause things to re-render for me.
@rrpelgrim was this also your experience? cc'ing @ianthomas23 from the datashader project in case he has suggestions.

Ian, we have a notebook that uses datashader to plot the NYC Taxi data (not very original, I know). At first this takes 30s to plot one year's worth of data. In the notebook the students will progress through slimming down memory and persisting data in RAM to get this running in about 1-2s. At that performance I would love to give them an interactive pan/zoom experience. None of us know datashader well enough though to get this running smoothly. I've looked at the docs a bit and I'm afraid that there are a few too many concepts for me to learn this quickly. Do you have any suggestions on how to take some of our current code, which looks like this:

```python
import datashader
import datashader.transfer_functions as tf
from datashader.colors import Hot

agg = datashader.Canvas().points(
    source=df,
    x="dropoff_longitude",
    y="dropoff_latitude",
    agg=datashader.count("passenger_count"),
)
tf.shade(agg, cmap=Hot, how="eq_hist")
```

and give it a more interactive recomputation-on-pan-zoom experience? Also, if there are other performance opportunities that you see, please speak up.

Cheers,

In the exercise we |
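For reference, a minimal sketch of the usual holoviews route to interactive re-aggregation, assuming `ddf` is the taxi dataframe; this is not necessarily the code the tutorial ended up with:

```python
import datashader as ds
import holoviews as hv
import holoviews.operation.datashader as hd

hv.extension("bokeh")

# rasterize() wraps the datashader aggregation in a DynamicMap, so the
# count is recomputed at the new resolution on every pan/zoom.
points = hv.Points(ddf, ["dropoff_longitude", "dropoff_latitude"])
hd.rasterize(points, aggregator=ds.count("passenger_count")).opts(
    cmap="hot", cnorm="eq_hist", bgcolor="black",
    xaxis=None, yaxis=None, width=900, height=500,
)
```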
In my experience, this code below:

```python
# generate interactive datashader plot
shaded = hd.datashade(hv.Points(ddf, ['dropoff_longitude', 'dropoff_latitude']),
                      cmap=Hot, aggregator=ds.count('passenger_count'))
hd.dynspread(shaded, threshold=0.5, max_px=4).opts(bgcolor='black', xaxis=None,
                                                   yaxis=None, width=900, height=500)
```

gives me an interactive Bokeh plot which I can pan/zoom. When run on a 100-worker cluster the experience is pretty smooth (a few seconds per render). On a 15-worker cluster the experience is pretty terrible (2-3 minutes per render).
Yes, I saw this too.
I didn't experience this. Panning and zooming does trigger re-renderings for me.

AFAIK, this code below will only ever create static images:

```python
agg = datashader.Canvas().points(
    source=df,
    x="dropoff_longitude",
    y="dropoff_latitude",
    agg=datashader.count("passenger_count"),
)
tf.shade(agg, cmap=Hot, how="eq_hist")
```

@ianthomas23 - if you're by any chance available for a screenshare to sync on this that would be grand. I'm in Portugal and closer to you in terms of timezones :) |
Some initial comments:
I can sync on this now @rrpelgrim. You have my work email, or you can use the Gmail account associated with my GitHub account. |
Thanks @ianthomas23 - I'm going to try to run your suggested code right now, and then reach out for a live sync if still needed. |
Thanks both |
@ianthomas23 - your suggested code runs a lot smoother, thanks. I still have a few questions about potential performance speedups. I've sent an invite for a live sync if that time works for you; otherwise I'm also flexible later today. |
Thanks for the chat @ianthomas23. |
The plan is to give people ~25 workers (4 vCPU each), right? |
Not sure yet. I'll want to tune cluster size to make things appropriately painful when things are done poorly, but pleasant when things are done well. |
@ianthomas23 now that I'm up and active could I also grab a bit of your time? Some questions:
Quick demonstration video here: https://www.loom.com/share/d16be675ccfe40fea9fddda5b24cb1ac |
There might be a pass over the data to check the x and y limits, which can be avoided by specifying them using the kwargs. I don't know what is going on between the passes; that is occurring in hvplot/holoviews, which I am not the resident expert on. We'd need one of my travelling colleagues to explain that. 3 is evidently catastrophic. But it was working for Richard earlier? |
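A minimal sketch of the limits point, using datashader's `Canvas` kwargs (the bounds below are placeholders, not the tutorial's actual values; with the hvplot route the equivalent would be `xlim`/`ylim`, as in Ian's snippet further down):

```python
import datashader

# Giving Canvas explicit ranges avoids an extra pass over the data
# just to discover the x/y extent. These NYC-ish bounds are placeholders.
canvas = datashader.Canvas(
    plot_width=900, plot_height=500,
    x_range=(-74.1, -73.7),  # dropoff_longitude
    y_range=(40.6, 40.9),    # dropoff_latitude
)
agg = canvas.points(
    source=df,
    x="dropoff_longitude",
    y="dropoff_latitude",
    agg=datashader.count("passenger_count"),
)
```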
I don't know. @rrpelgrim? @ianthomas23, if you thought you might be able to diagnose 3 live, I'd suggest a live call (if it's not too much of an imposition). If that's not the case though then I could pass. |
OK, I've got interactivity working (I just reinstalled everything). Three new questions:
|
Answers:

```python
import datashader as ds
import hvplot.dask  # registers .hvplot on dask dataframes

color_key = {'pickup': 'red', 'dropoff': 'blue'}
ddf.hvplot.scatter(x="x", y="y", aggregator=ds.by("type"), datashade=True, cnorm="eq_hist",
                   width=400, aspect=1.23, xlim=(133, 456), ylim=(345, 678), color_key=color_key)
```

but use the correct limits for your data. (Edited to fix typos) |
If you're going to go for the categorical datashade route, you would ideally preprocess your data into the correct format beforehand rather than do the conversion live in the tutorial. |
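A minimal sketch of what that preprocessing might look like, assuming the taxi schema used earlier in the thread (the column names are assumptions, not the tutorial's actual code):

```python
import dask.dataframe as dd

# Stack pickup and dropoff coordinates into one long frame with a
# categorical "type" column, matching Ian's snippet above.
pickups = ddf[["pickup_longitude", "pickup_latitude"]].rename(
    columns={"pickup_longitude": "x", "pickup_latitude": "y"})
pickups["type"] = "pickup"

dropoffs = ddf[["dropoff_longitude", "dropoff_latitude"]].rename(
    columns={"dropoff_longitude": "x", "dropoff_latitude": "y"})
dropoffs["type"] = "dropoff"

points = dd.concat([pickups, dropoffs])
# ds.by() needs a categorical column with *known* categories
points["type"] = points["type"].astype("category").cat.as_known()
points = points.persist()  # keep the prepared frame in cluster memory
```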
Richard, I think that trying this is worth about an hour of your time. I don't think we need to preprocess and store. I think that we can probably do this in a few lines at the end, persist, and give the students a nicer exploratory experience at the end. Thoughts?
|
Ian, thank you for continuing to engage here. I appreciate it.
|
I'm happy to help. Actually, I'm being selfish: this is an example that I'd like to use myself and indeed extend to include use of GPUs for the datashader processing. We should keep a communication channel open to combine our dask and visualisation skills to produce really good examples, perhaps with a longer lead time in future 🙂 |
Thanks for your help here @ianthomas23, I got this to work. @mrocklin - code is over in https://github.com/rrpelgrim/dask-tutorial-mrocklin/blob/main/2-dataframes-at-scale.ipynb under the "More detail please" header. LMK if this is looking like what you had in mind. |
Absolutely. Is there anything you'd need from us to extend this existing example to use GPUs? An example like that would make for a fun blog post, I think :) |
All I need is time, but unfortunately that is not transferable! I have a couple of comments about the latest code in the final cell:
Do you have a screenshot of the initial image produced by the final cell? It is worth a look to see if my somewhat arbitrary choice of colors is good or not. |
Thanks @ianthomas23, adjusting now - here's a screenshot: [screenshot] |
Perfect, that is both informative and beautiful! |
That does look beautiful. I'm looking forward to playing with it. I'll probably be dark on this issue until later this afternoon US time. Thank you for your work here. |
Thanks @rrpelgrim. I've pushed your work up to the end of my branch. I won't turn this into an exercise; it'll just be something fun to play with at the end. |
Running into this, which is a bit odd: dask/distributed#7289 |
@rrpelgrim and I just went through things. I recommend the following narrative:

1. Welcome! We're going to make it easy to interactively visualize N million points.
2. Beloved matplotlib doesn't work in this case (here is an image showing it not working). We'll do this with datashader instead and get beautiful images like this ... (no code here, just images copied in)
3. Let's load some data and plot it (maybe non-interactive at first).
4. This takes a while. If you want to take a look, check out the profile tab in the dashboard (but we won't explain it much here in the interests of time).
5. This doesn't look good. Mostly it's because data is way far away. Exercise: use pandas syntax to filter to where latitude is between X and Y and where longitude is between X and Y.
6. Visualize again. How does it look? (Hopefully it looks beautiful.)
7. This takes a long time though. How can we make it faster? Let's try persisting in memory. What happens?
8. OK, our cluster ran out of memory. What are some solutions?
9. Let's slim down our dataset using one of the following approaches (see the sketch after this list).
10. Let's go interactive with pan/zoom.
11. ... TODO: Matt figures out how to make this even faster.
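A hedged sketch of the filtering and slimming steps above, assuming the taxi schema used earlier in the thread (the lat/lon bounds and dtypes are illustrative placeholders, not the tutorial's actual choices):

```python
# Filter out far-away points (step 5). Bounds are placeholders.
ddf = ddf[
    ddf.dropoff_latitude.between(40.6, 40.9)
    & ddf.dropoff_longitude.between(-74.1, -73.7)
]

# Slim the dataset so it fits in cluster memory (step 9): keep only
# the columns we plot and downcast to smaller dtypes.
ddf = ddf[["dropoff_longitude", "dropoff_latitude", "passenger_count"]]
ddf = ddf.astype({
    "dropoff_longitude": "float32",
    "dropoff_latitude": "float32",
    "passenger_count": "uint8",
})
ddf = ddf.persist()  # re-renders now hit RAM instead of re-reading files
```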