Exploring a possible ungroup operation #110

rat-matheson · 2021-02-17T20:10:42Z

This came up in the data forge 2 discussion (Plans for Data Forge version 2).

I suspect you may be able to do this already in Data Forge but let me lay out my use case. In this example, I want to represent employee schedules in a DataFrame. A schedule for a single user consists of a several rows. Each row contains an interval with a start, end, type (such as work, lunch, break), and user to tie them all together.

Here's an example. I'm representing the interval by using a line such as |------| where the first | is the start (such as 9am) and the second | is the end (such as 5pm).

Example DataFrame - showing a schedule

user          interval                                 type
=================================================================
userA         |-------------------------------|        Work
userA                 |----|                           Lunch
userA                                |--|              Break

The operation I want to do is take the schedules for all users, and determine the number of productive headcount I have in a given interval. So in the case of a big department store, I'm trying to determine the number of people working at a given time which I would likely use to ensure that we were staffed for peak demand.

So there are two operations I need to create custom code for. One is to convert a user's schedule into productive time. This removes all the breaks such that I only have the time when they are physically working.

operation: toProductiveTime

// input rows:
user          interval                                 type
=================================================================
userA         |-------------------------------|        Work
userA                 |----|                           Lunch
userA                                |--|              Break

// output rows:   
user          interval                                 type
==================================================================
userA         |-------|                                Work
userA                      |---------|                 Work
userA                                   |-----|        Work

The second operation needs to aggregate all these productive intervals into a headcount number

operation: toAggregateChunks

// input rows:
user          interval                                 type
==================================================================
userB         |---------|                              Work
userC              |---------|                         Work
...

// output rows
Headcount     interval     
==================================================================
1             |----|
2                  |----|
1                       |----|
0                            |----|

So in order to do this, I want to do the following

group by user
for each group, do toProductiveTime transformation so that we have the time when employees are actually working
ungroup so that we have a big collection of productive intervals
aggregate into 30 minute chunks so that we can get productive head count for each interval using toAggregateChunks transformation

I imagine the code might look like

dataFrame.groupBy(r => r.userId)
    .map(scheduleTransformations.toProductiveTime())
    .ungroup()
.map(scheduleTransformations.toAggregateChunks({intervalLength:'30 minutes'}))

Is this a crazy? To me, it feels very natural to group, do some operations on each subgroup, and then ungroup. It feels easy to read and the operations themselves are separately testable (which is good because they are complicated). I've implemented something just like the above in the past that didn't require loading all the data into memory...but I'm not sure if the way I think about this problem is overly complicated compared to how others might approach it.

The text was updated successfully, but these errors were encountered:

ashleydavis · 2021-02-17T21:58:50Z

I don't think there is any equivalent to ungroup. But what you are asking for should be possible.

I'd probably do it like this:

Filter out breaks for all users.
Group by user.
For each user divide their productive time into hourly slots (open column per slot).
Merge groups to look like this:

User     09:00     10:00   11:00    12:00   13:00    etc...
======================================================================
A          1            1          0           1         1
B          0            1          1           0         1

Aggregates rows to find the head count in each hour.

There are so many ways you could do this depending on the outcome you want.

Please feel free to share the code with and I can help you with it.

rat-matheson · 2021-02-18T21:24:09Z

Apologies. I think I phrased my comment poorly but I added this to make a case for adding an ungroup function. It was an example meant to show the type of operation that might require an ungroup function. It has been useful for specific schedule operations I've had in the past, especially for cases where I need to group, do an operation, and then group that group with another operation before ungrouping them all.

With regards to your response above, I think the first step can't be filter out breaks for all users because that first operation actually involves flattening a users' specific schedule into work intervals, not removing the breaks. It combines all the overlapping intervals into a single non-overlapping interval based on precedence. I probably could have better represented it by showing an intermediate step

// input rows:
user          interval                                 type
=================================================================
userA         |-------------------------------|        Work
userA                 |----|                           Lunch
userA                                |--|              Break

// flattened schedule:   
user          interval                                 type
==================================================================
userA         |-------|                                Work
userA                 |----|                           Lunch
userA                      |---------|                 Work
userA                                |--|              Break
userA                                   |-----|        Work

// filter out breaks:   
user          interval                                 type
==================================================================
userA         |-------|                                Work
userA                      |---------|                 Work
userA                                   |-----|        Work

The original work interval has the break and lunch interval removed. The end result is the time when an employee is doing productive work (and not on break or lunch). In reality, that part needs to happen on a per-user basis because there may be multiple kinds of overlap (eg a team meeting overlapping lunch doesn't result in -2 headcount...but you have to look at the specific user's schedule to determine if it is overlapping another break in addition to the work interval.

rat-matheson · 2021-02-19T01:38:49Z

I was thinking a bit more about this...perhaps the existing groupBy function can do something similar. I'm not experienced enough with it but perhaps

df.groupBy(group => {
   // So if this group is essentially a subset dataframe (or a collection of rows), I can manipulate it however I want.
   // If I can return a collection of rows, and then groupBy just aggregates all the rows into a big table again, it is the same thing
});

I really need to play around with data-forge more to understand this part. Honestly, there is no immediate need to get hung up on it so feel free to punt this or close it for a later visit. Just tossing around an idea.

ashleydavis · 2021-03-01T08:32:58Z

Yeah groupBy is probably close to what you need. If you find any good examples of how this kind of thing works in Pandas or R please share them here. I'll keep this ticket open and reconsider it again in the future.

rat-matheson changed the title ~~Exploring a possible ungroup optionation~~ Exploring a possible ungroup operation Feb 17, 2021

rat-matheson mentioned this issue Feb 17, 2021

Plans for Data-Forge version 2 #108

Open

rat-matheson closed this as completed Apr 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exploring a possible ungroup operation #110

Exploring a possible ungroup operation #110

rat-matheson commented Feb 17, 2021 •

edited

Loading

ashleydavis commented Feb 17, 2021 •

edited

Loading

rat-matheson commented Feb 18, 2021

rat-matheson commented Feb 19, 2021

ashleydavis commented Mar 1, 2021

Exploring a possible ungroup operation #110

Exploring a possible ungroup operation #110

Comments

rat-matheson commented Feb 17, 2021 • edited Loading

ashleydavis commented Feb 17, 2021 • edited Loading

rat-matheson commented Feb 18, 2021

rat-matheson commented Feb 19, 2021

ashleydavis commented Mar 1, 2021

rat-matheson commented Feb 17, 2021 •

edited

Loading

ashleydavis commented Feb 17, 2021 •

edited

Loading