
Create a new function to expand final labels #846

Merged · 6 commits · Jan 15, 2022

Conversation

@shankari (Contributor) commented Jan 2, 2022

This will expand the user labels if they exist, and expand the highest-probability inferred label if we did not prompt the user.

Basically, this is now consistent with
e-mission/e-mission-docs#688 (comment)

There were a few minor tricky spots when working with this data structure in pandas (see the sketch after this list):

  • We had to expand the expectation field so that we could filter by to_label, and then
  • We had to create a separate dataframe for the inferred labels and apply a
    custom filter to them to pick the entry with the max probability
  • We had to ensure that rows didn't appear twice, i.e. once for user inputs
    and once for inferred labels. This holds even if the user_inputs were not
    filled out; without these changes, we would end up with both N/A and the
    inferred label for the same row.

We also had to keep the indices constant throughout these changes.
Added multiple tests, both of the individual pandas steps and of the complete
function, to cover this fairly complex piece of code.
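Roughly, the expansion looks like this (a minimal sketch: the inferred_labels layout of `{"labels": ..., "p": ...}` entries and the helper body are illustrative assumptions, not the exact implementation):

```python
import pandas as pd

def expand_finallabels_sketch(ct_df: pd.DataFrame) -> pd.DataFrame:
    # Expand the user inputs; unlabeled trips have user_input == {} and
    # therefore become all-NaN rows here
    user_df = ct_df.user_input.apply(pd.Series)

    # Expand the expectation field so we can filter by to_label
    expectation_df = ct_df.expectation.apply(pd.Series)
    no_prompt_df = ct_df[expectation_df.to_label == False]

    # Separate dataframe for the inferred labels, keeping only the entry
    # with the max probability for each trip
    def max_p_labels(label_list):
        if not isinstance(label_list, list) or len(label_list) == 0:
            return pd.Series(dtype=object)
        return pd.Series(max(label_list, key=lambda e: e["p"])["labels"])

    inferred_df = no_prompt_df.inferred_labels.apply(max_p_labels)

    # combine_first keeps the user labels where present and fills the rest
    # from the inferred labels; both frames retain the original index, so
    # each row appears exactly once and stays aligned with its source trip
    expanded_df = user_df.combine_first(inferred_df)
    return pd.concat([ct_df, expanded_df], axis=1)
```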

Testing done:

  • Ran the newly added tests

Next steps:

  • Change the metrics code and the leaderboard to use the new functions


@shankari (Contributor, Author)

Included this into the metrics detection code by replacing:

```diff
-            section_group_df = esdt.expand_userinputs(section_group_df)
+            section_group_df = esdt.expand_finallabels(section_group_df)
```

To test, we look for trips that will not be displayed to the user and do not
have a user input. We would expect that these would now have a mode instead of
being undefined.

Our test user has two such trips:

  • 2021-07-21T17:17:23 -> 2021-07-21T17:27:10: inferred labels: pilot_ebike, work, drove_alone
  • 2021-09-13T22:58:58 -> 2021-09-13T23:06:22: inferred labels: drove_alone, shopping, no_travel

Focusing on the first one, we set the current date on the phone to Jul 23,
which retrieves data from Jul 16 to Jul 23 and we don't see any unlabeled data.
However, if we retrieve the corresponding data in the diary, we see two trips
with yellow labels.

  • 5:17 to 5:27 (expected)
  • 8:58 to 9:06 (unexpected)

Why does the 8:58 trip not show up as "unlabeled"?

```
>>> next_confirmed["data"]["expectation"]
{'to_label': True}
>>> next_confirmed["data"]["user_input"]
{}
>>> next_confirmed["data"]["start_fmt_time"]
'2021-07-21T20:58:41.238209-06:00'
>>> next_confirmed["data"]["end_fmt_time"]
'2021-07-21T21:06:24-06:00'
```

Let's look at the logs to figure out what is going on. The query ran from the 9th to the 22nd UTC, while the UI displays the 16th to the 23rd. With the time difference, midnight UTC on the 22nd falls at 18:00 local time on the 21st, before this trip's 20:58 start, so the trip is outside the query range.

The query range is {'$lte': 1626912000, '$gte': 1625788800}, which corresponds to 2021-07-09T00:00:00+00:00 -> 2021-07-22T00:00:00+00:00, i.e. 2021-07-08T18:00:00-06:00 to 2021-07-21T18:00:00-06:00 (see the conversion check after the questions below).

  • Why does our query range end on the 22nd if we are querying on the 23rd?
  • Why should we query by UTC instead of the local time zone?
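For reference, the timestamp conversion can be double-checked with plain stdlib datetime (illustrative only; the bounds are the $gte/$lte values from the log above):

```python
from datetime import datetime, timedelta, timezone

# Convert the query bounds to the phone's -06:00 offset
# (the offset in the trip timestamps above)
tz_local = timezone(timedelta(hours=-6))
for ts in (1625788800, 1626912000):
    utc_dt = datetime.fromtimestamp(ts, timezone.utc)
    print(utc_dt.isoformat(), "->", utc_dt.astimezone(tz_local).isoformat())

# 2021-07-09T00:00:00+00:00 -> 2021-07-08T18:00:00-06:00
# 2021-07-22T00:00:00+00:00 -> 2021-07-21T18:00:00-06:00
# i.e. the query ends at 18:00 local on the 21st, before the 20:58 trip starts
```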

@shankari (Contributor, Author)

These are UI questions. But it certainly looks like the mapping works correctly since the 5:17 trip does not show up as unlabeled.

@shankari (Contributor, Author)

Confirmed this by switching to master and pulling the dashboard. Without this change, we have unlabeled trips. With this change, we do not.

[Screenshots omitted: Screenshot_1642193130 (without change, unlabeled trips present) and Screenshot_1642193470 (with change, none)]

@shankari (Contributor, Author)

Quick performance check to ensure that we don't have to back this out again:

Before

```
2022-01-14 13:09:00,472:DEBUG:123145426894848:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 11.745658159255981
2022-01-14 13:09:00,606:DEBUG:123145421639680:END POST /result/metrics/timestamp  11.886730909347534
```

After

```
2022-01-14 13:02:27,466:DEBUG:123145408503808:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 14.188169002532959
2022-01-14 13:02:27,999:DEBUG:123145403248640:END POST /result/metrics/timestamp  14.753452062606812
```

Not too terrible in percentage terms (~20%), but still bad in absolute numbers (~2.5 secs). Let's see if we can change the structure to get it back down...

@shankari (Contributor, Author)

First, doing some basic repetitions:

Before

```
2022-01-14 13:12:51,773:DEBUG:123145476661248:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 13.517530918121338
2022-01-14 13:12:51,920:DEBUG:123145481916416:END POST /result/metrics/timestamp  13.663254976272583
2022-01-14 13:13:10,760:DEBUG:123145497681920:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 13.531949043273926
2022-01-14 13:13:11,292:DEBUG:123145492426752:END POST /result/metrics/timestamp  14.082863807678223
2022-01-14 13:13:25,893:DEBUG:123145513447424:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 11.667600870132446
2022-01-14 13:13:26,073:DEBUG:123145508192256:END POST /result/metrics/timestamp  11.91217303276062
2022-01-14 13:13:48,042:DEBUG:123145476661248:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 13.24062705039978
2022-01-14 13:13:48,274:DEBUG:123145523957760:END POST /result/metrics/timestamp  13.484846115112305
```

After

```
2022-01-14 13:15:56,881:DEBUG:123145527443456:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 11.992200136184692
2022-01-14 13:15:57,075:DEBUG:123145522188288:END POST /result/metrics/timestamp  12.194638967514038
2022-01-14 13:16:11,672:DEBUG:123145543208960:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 12.579771995544434
2022-01-14 13:16:11,970:DEBUG:123145537953792:END POST /result/metrics/timestamp  12.885691165924072
2022-01-14 13:16:27,193:DEBUG:123145558974464:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 12.590034008026123
2022-01-14 13:16:27,450:DEBUG:123145553719296:END POST /result/metrics/timestamp  12.8558189868927
2022-01-14 13:16:45,553:DEBUG:123145527443456:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 12.550609111785889
2022-01-14 13:16:45,714:DEBUG:123145569484800:END POST /result/metrics/timestamp  12.718816995620728
2022-01-14 13:17:13,471:DEBUG:123145543208960:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 12.2324059009552
2022-01-14 13:17:14,149:DEBUG:123145537953792:END POST /result/metrics/timestamp  12.9130699634552
```

This doesn't actually seem that much worse.

@shankari (Contributor, Author) commented Jan 14, 2022

Let's do some quick checks for repeated calls before moving on.

DB query

We only do one query per call:

```
2022-01-14 13:18:18,119:DEBUG:123145564229632:curr_query = {'$or': [{'metadata.key': 'analysis/confirmed_trip'}], 'data.start_ts': {'$lte': 1626912000, '$gte': 1625788800}}, sort_key = None
2022-01-14 13:18:18,130:DEBUG:123145553719296:curr_query = {'user_id': UUID('742fbefa-e7d7-45a9-bdf6-44659d21e0fa'), '$or': [{'metadata.key': 'analysis/confirmed_trip'}], 'data.start_ts': {'$lte': 1626912000, '$gte': 1625788800}}, sort_key = data.start_ts
```

Added post-processing

We seem to be calling this a lot: the "unlabeled replace" code runs 52 times per call, because we perform the post-processing in the grouped_to_summary call, so it runs once for each time grouping, and that pass is repeated for each of the 4 metrics. This is not as bad as it seems, since each time grouping works with a smaller dataset, but it all adds up.

```
$ grep "123145564229632.*After replacing unlabeled" /var/tmp/webserver.log | wc -l
      52
```

Let's move this up to the queried data, which should help (see the sketch below). It is not going to result in a 4x speedup, since it doesn't change the query time, but hopefully it will help quite a bit.
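Roughly, the restructuring looks like this (a sketch: apart from `esdt.expand_finallabels` from the diff above, the names `time_bucket`, `summarize`, and the import path are illustrative assumptions):

```python
import emission.storage.decorations.trip_queries as esdt  # import path assumed

# Before: expansion inside each grouped summary, so it executes once per time
# bucket per metric -- the 52 calls per request seen in the logs above
def summaries_before(section_df, time_bucket, summarize):
    results = []
    for _, group_df in section_df.groupby(time_bucket):
        expanded_df = esdt.expand_finallabels(group_df)  # repeated pandas work
        results.append(summarize(expanded_df))
    return results

# After: expand once on the full query result, then group; the vectorized
# pandas work happens a single time per request
def summaries_after(section_df, time_bucket, summarize):
    expanded_df = esdt.expand_finallabels(section_df)  # once per request
    return [summarize(g) for _, g in expanded_df.groupby(time_bucket)]
```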

Drop the dashboard metrics query time from 12 secs to 0.6 secs by running the
user input post-processing once at the beginning instead of for every group.
The native code speedups in pandas really do work!

Testing done:
- Reloaded the dashboard multiple times
- Consistently got a time of under one second

```
$ grep "END POST.*timestamp" /var/tmp/webserver.log
2022-01-14 13:47:35,583:DEBUG:123145442308096:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 0.6526408195495605
2022-01-14 13:47:35,632:DEBUG:123145447563264:END POST /result/metrics/timestamp  0.701042890548706
2022-01-14 13:47:39,820:DEBUG:123145463328768:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 0.5813801288604736
2022-01-14 13:47:39,862:DEBUG:123145458073600:END POST /result/metrics/timestamp  0.6311750411987305
2022-01-14 13:47:43,155:DEBUG:123145421287424:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 0.6370611190795898
2022-01-14 13:47:43,198:DEBUG:123145426542592:END POST /result/metrics/timestamp  0.6914339065551758
2022-01-14 13:47:45,669:DEBUG:123145442308096:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 0.5435740947723389
2022-01-14 13:47:45,712:DEBUG:123145437052928:END POST /result/metrics/timestamp  0.6239280700683594
2022-01-14 13:47:49,359:DEBUG:123145463328768:END POST /result/metrics/timestamp 742fbefa-e7d7-45a9-bdf6-44659d21e0fa 0.7017340660095215
2022-01-14 13:47:49,409:DEBUG:123145452818432:END POST /result/metrics/timestamp  0.7662367820739746
```
@shankari (Contributor, Author)

Whoa! It improved performance by almost 20x, from ~12.5 secs to ~0.65 secs (see the timings in the commit message above).

