Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exploring a possible ungroup operation #110

Closed
rat-matheson opened this issue Feb 17, 2021 · 4 comments
Closed

Exploring a possible ungroup operation #110

rat-matheson opened this issue Feb 17, 2021 · 4 comments

Comments

@rat-matheson
Copy link

rat-matheson commented Feb 17, 2021

This came up in the data forge 2 discussion (Plans for Data Forge version 2).

I suspect you may be able to do this already in Data Forge but let me lay out my use case. In this example, I want to represent employee schedules in a DataFrame. A schedule for a single user consists of a several rows. Each row contains an interval with a start, end, type (such as work, lunch, break), and user to tie them all together.

Here's an example. I'm representing the interval by using a line such as |------| where the first | is the start (such as 9am) and the second | is the end (such as 5pm).

Example DataFrame - showing a schedule

user          interval                                 type
=================================================================
userA         |-------------------------------|        Work
userA                 |----|                           Lunch
userA                                |--|              Break

The operation I want to do is take the schedules for all users, and determine the number of productive headcount I have in a given interval. So in the case of a big department store, I'm trying to determine the number of people working at a given time which I would likely use to ensure that we were staffed for peak demand.

So there are two operations I need to create custom code for. One is to convert a user's schedule into productive time. This removes all the breaks such that I only have the time when they are physically working.

operation: toProductiveTime

// input rows:
user          interval                                 type
=================================================================
userA         |-------------------------------|        Work
userA                 |----|                           Lunch
userA                                |--|              Break

// output rows:   
user          interval                                 type
==================================================================
userA         |-------|                                Work
userA                      |---------|                 Work
userA                                   |-----|        Work

The second operation needs to aggregate all these productive intervals into a headcount number

operation: toAggregateChunks

// input rows:
user          interval                                 type
==================================================================
userB         |---------|                              Work
userC              |---------|                         Work
...

// output rows
Headcount     interval     
==================================================================
1             |----|
2                  |----|
1                       |----|
0                            |----|

So in order to do this, I want to do the following

  1. group by user
  2. for each group, do toProductiveTime transformation so that we have the time when employees are actually working
  3. ungroup so that we have a big collection of productive intervals
  4. aggregate into 30 minute chunks so that we can get productive head count for each interval using toAggregateChunks transformation

I imagine the code might look like

dataFrame.groupBy(r => r.userId)
    .map(scheduleTransformations.toProductiveTime())
    .ungroup()
.map(scheduleTransformations.toAggregateChunks({intervalLength:'30 minutes'}))

Is this a crazy? To me, it feels very natural to group, do some operations on each subgroup, and then ungroup. It feels easy to read and the operations themselves are separately testable (which is good because they are complicated). I've implemented something just like the above in the past that didn't require loading all the data into memory...but I'm not sure if the way I think about this problem is overly complicated compared to how others might approach it.

@rat-matheson rat-matheson changed the title Exploring a possible ungroup optionation Exploring a possible ungroup operation Feb 17, 2021
@ashleydavis
Copy link
Member

ashleydavis commented Feb 17, 2021

I don't think there is any equivalent to ungroup. But what you are asking for should be possible.

I'd probably do it like this:

  1. Filter out breaks for all users.
  2. Group by user.
  3. For each user divide their productive time into hourly slots (open column per slot).
  4. Merge groups to look like this:
User     09:00     10:00   11:00    12:00   13:00    etc...
======================================================================
A          1            1          0           1         1
B          0            1          1           0         1
  1. Aggregates rows to find the head count in each hour.

There are so many ways you could do this depending on the outcome you want.

Please feel free to share the code with and I can help you with it.

@rat-matheson
Copy link
Author

Apologies. I think I phrased my comment poorly but I added this to make a case for adding an ungroup function. It was an example meant to show the type of operation that might require an ungroup function. It has been useful for specific schedule operations I've had in the past, especially for cases where I need to group, do an operation, and then group that group with another operation before ungrouping them all.

With regards to your response above, I think the first step can't be filter out breaks for all users because that first operation actually involves flattening a users' specific schedule into work intervals, not removing the breaks. It combines all the overlapping intervals into a single non-overlapping interval based on precedence. I probably could have better represented it by showing an intermediate step

// input rows:
user          interval                                 type
=================================================================
userA         |-------------------------------|        Work
userA                 |----|                           Lunch
userA                                |--|              Break

// flattened schedule:   
user          interval                                 type
==================================================================
userA         |-------|                                Work
userA                 |----|                           Lunch
userA                      |---------|                 Work
userA                                |--|              Break
userA                                   |-----|        Work

// filter out breaks:   
user          interval                                 type
==================================================================
userA         |-------|                                Work
userA                      |---------|                 Work
userA                                   |-----|        Work

The original work interval has the break and lunch interval removed. The end result is the time when an employee is doing productive work (and not on break or lunch). In reality, that part needs to happen on a per-user basis because there may be multiple kinds of overlap (eg a team meeting overlapping lunch doesn't result in -2 headcount...but you have to look at the specific user's schedule to determine if it is overlapping another break in addition to the work interval.

@rat-matheson
Copy link
Author

I was thinking a bit more about this...perhaps the existing groupBy function can do something similar. I'm not experienced enough with it but perhaps

df.groupBy(group => {
   // So if this group is essentially a subset dataframe (or a collection of rows), I can manipulate it however I want.
   // If I can return a collection of rows, and then groupBy just aggregates all the rows into a big table again, it is the same thing
});

I really need to play around with data-forge more to understand this part. Honestly, there is no immediate need to get hung up on it so feel free to punt this or close it for a later visit. Just tossing around an idea.

@ashleydavis
Copy link
Member

Yeah groupBy is probably close to what you need. If you find any good examples of how this kind of thing works in Pandas or R please share them here. I'll keep this ticket open and reconsider it again in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants