Frame.mapColValues is weirdly slow compared to mapping columns as series and joining with Frame.ofColumns #539

Open
vijoc opened this issue Oct 27, 2021 · 1 comment


vijoc commented Oct 27, 2021

I ran into an issue where applying Frame.mapColValues over a frame with tens of thousands of rows and a handful (<10) of columns is very slow. In contrast, mapping each series "manually" (see below) and then joining the results with Frame.ofColumns is orders of magnitude faster.

What I'm looking to do is a naive hourly averaging of time series data. I implemented it with essentially the following:

open System
open Deedle

// A simple series of 20 000 observations at one-minute intervals
let startFrom = DateTimeOffset.Parse "2021-10-27T00:00:00Z"
let series =
    Seq.init 20000 (fun idx -> startFrom.AddMinutes (float idx), float idx)
    |> Series.ofObservations

// Two columns with the same series from above
let columns = seq { "one", series; "two", series }
let frame = Frame.ofColumns columns

// Comparison of two timestamps to check if the hour is the same
let isSameHour (d1: DateTimeOffset) (d2: DateTimeOffset) =
    d1.Hour = d2.Hour && d1.Day = d2.Day && d1.Month = d2.Month && d1.Year = d2.Year

// Three methods to convert to a new frame with hourly averages
// 1. Using Frame.mapColValues: takes over a second
frame |> Frame.mapColValues (Series.chunkWhileInto isSameHour Stats.mean)

// 2. An approximation of the internals of Frame.mapColValues, takes the same time (over a second):
frame.Columns
    |> Series.mapValues (Series.chunkWhileInto isSameHour Stats.mean)
    |> Frame.ofColumns

// 3. Sidestepping the initial frame, this takes 10-20 *milli*seconds:
columns
    |> Seq.map (fun (k, s) -> k, s |> Series.chunkWhileInto isSameHour Stats.mean)
    |> Frame.ofColumns

It may well be that I'm overlooking something here; I'm not very confident with either the Deedle codebase or performance diagnosis in F#. I do have a BenchmarkDotNet setup, which I could extract and share if that would be helpful.
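For a rough reproduction without BenchmarkDotNet, a Stopwatch-based helper along these lines should be enough to see the gap. This is only a minimal sketch: `time` is a hypothetical helper (not part of Deedle), and a single run includes JIT warm-up, so the numbers are indicative rather than precise.

```fsharp
open System.Diagnostics

/// Hypothetical helper: runs `f` once, prints the elapsed wall-clock time,
/// and returns the result of `f`.
let time (label: string) (f: unit -> 'a) : 'a =
    let sw = Stopwatch.StartNew()
    let result = f ()
    sw.Stop()
    printfn "%s: %d ms" label sw.ElapsedMilliseconds
    result

// Self-contained usage example
let total = time "sum" (fun () -> Seq.sum (seq { 1 .. 1000 }))

// With the definitions from the snippet above in scope, the three approaches
// could be timed like this:
// time "1. mapColValues" (fun () ->
//     frame |> Frame.mapColValues (Series.chunkWhileInto isSameHour Stats.mean)) |> ignore
// time "3. ofColumns" (fun () ->
//     columns
//     |> Seq.map (fun (k, s) -> k, s |> Series.chunkWhileInto isSameHour Stats.mean)
//     |> Frame.ofColumns) |> ignore
```

Running each approach a few times and discarding the first run gives a fairer comparison, since the first call pays the JIT cost.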

Is this kind of performance expected? I believe I can avoid the issue in my use case by using method 3 from above, but I'm struggling to understand what could cause this kind of performance difference in this case.


vijoc commented Oct 28, 2021

For what it's worth, I did some more testing and found that the bad performance can also be avoided by using Frame.getNumericCols or even simply Frame.getCols.

// Note: inlineComparison is not defined in this thread; it is presumably
// an hour comparison equivalent to isSameHour from the first comment.

// About the same performance as approach number 3 from above
frame
    |> Frame.getNumericCols
    |> Series.mapValues (Series.chunkWhileInto inlineComparison Stats.mean)
    |> Frame.ofColumns

// Slightly worse, but still around 50 milliseconds versus ~10+ milliseconds for the above
// or ~1+ seconds for Frame.mapColValues
frame
    |> Frame.getCols
    |> Series.mapValues (Series.chunkWhileInto inlineComparison Stats.mean)
    |> Frame.ofColumns
