Enhance `_get_and_store_range` Function to Include Trip Statistics and Last API Call Tracking #993

TeachMeTW · 2024-11-08T19:02:34Z

Overview

Enhances _get_and_store_range by adding total_trips, labeled_trips, and last_call fields to the user profile, enabling better insights into user activities and API usage.

Changes

Extended User Profile Updates:
- total_trips: Total number of trips recorded for the user.
- labeled_trips: Number of labeled/annotated trips.
Integrated Last API Call Tracking:
- Accesses stats/server_api_time to retrieve timestamps of the last GET and PUT API calls.
- Determines the latest call and stores it as last_call in the user's profile in UNIX format.

Testing

Picked a user, cleared pipeline, re-ran pipeline, verified user profile was updated.

Please review and provide feedback or approval.

TeachMeTW · 2024-11-08T20:44:02Z

Now I only have one concern: the current iteration does not use UUID in the api_call query, so I’m not too sure if the last call is accurate. If I do include the UUID, the last call is NULL for that UUID.

docs_cursor = edb.get_timeseries_db().find({
    "metadata.key": "stats/server_api_time",
    "user_id" : user_id
})

Thoughts?

emission/pipeline/intake_stage.py

TeachMeTW · 2024-11-08T22:25:12Z

@JGreenlee Comments addressed, I split the function into a different file as suggested -- though it may be overkill. Seeking thoughts

JGreenlee

Good refactor! I agree that the smaller functions are overkill.
Sometimes it's just a matter of finding the balance between tidiness and complexity.

emission/analysis/result/user_stat.py

TeachMeTW · 2024-11-09T07:20:58Z

@JGreenlee Comments addressed

emission/analysis/result/user_stat.py

TeachMeTW · 2024-11-09T17:33:53Z

@JGreenlee Addressed and rebased

JGreenlee

The code looks good to me now, but I think we should include a unit test for this pipeline stage since it now stores a full set of stats.

I also didn't see a breakdown of manual testing. (Have you tested this locally in conjunction with e-mission/op-admin-dashboard#153 to verify that the new stats show up in the table?)

JGreenlee

Let me know if you have questions

emission/tests/analysisTests/intakeTests/TestUserStat.py

JGreenlee · 2024-12-13T17:32:26Z

emission/tests/analysisTests/intakeTests/TestUserStat.py

+        self.assertIn("last_call_ts", profile, "User profile should contain 'last_call_ts'.")
+
+        expected_total_trips = -
+        expected_labeled_trips = -


0 if you don't load any user inputs

@JGreenlee load any user inputs how so? Like loading from the example or what?

The real_examples files like shankari_2015-aug-27 only have raw sensed data; i.e. what the phone automatically picked up while traveling

For some of the days, e.g. shankari_2016-06-20 we have an additional file shankari_2016-06-20.user_inputs, which does not contain sensed data but it has "manual inputs" or "user inputs". These entries represent labels or survey responses that the user has recorded while using the app.

For a more in-depth test you could load both shankari_2016-06-20 and shankari_2016-06-20.user_inputs and expected_labeled_trips would be something other than 0

emission/tests/analysisTests/intakeTests/TestUserStat.py

TeachMeTW · 2024-12-13T18:45:16Z

@JGreenlee I realized my code was NOT pushed and you may have reviewed the wrong thing.... my bad, I will take your notes into account with the latest.

TeachMeTW · 2024-12-13T20:13:49Z

@JGreenlee Comments addressed, ready for re-review

JGreenlee

Almost there!

emission/tests/analysisTests/intakeTests/TestUserStat.py

JGreenlee · 2024-12-13T20:24:49Z

emission/tests/analysisTests/intakeTests/TestUserStat.py

+        # Retrieve the user profile
+        profile = edb.get_profile_db().find_one({"user_id": self.UUID})
+        if profile is None:
+            # Initialize the profile if it does not exist
+            edb.get_profile_db().insert_one({"user_id": self.UUID})


What happens if you take this out? Does something not work without it?

When I take this out, the profile is empty.

Traceback (most recent call last): File "/Users/rsimpson/e-mission-server/emission/tests/analysisTests/intakeTests/TestUserStat.py", line 81, in testGetAndStoreUserStats self.assertIsNotNone(profile, "User profile should exist after storing stats.") AssertionError: unexpectedly None : User profile should exist after storing stats.

Meaning for some reason the entry is not being inserted. Does that hint to an issue with my additions or existing?

emission/tests/analysisTests/intakeTests/TestUserStat.py

TeachMeTW · 2024-12-13T23:06:51Z

@JGreenlee Addressed, see response -- in need of clarification

JGreenlee · 2024-12-14T06:09:57Z

emission/analysis/result/user_stat.py

+
+TIME_FORMAT = 'YYYY-MM-DD HH:mm:ss'
+
+def count_trips(ts: esta.TimeSeries, key_list: list, extra_query_list: Optional[list] = None) -> int:


nitpick: I had an earlier comment here that was resolved but I don't think was addressed. I don't see the point of this function when it is just a wrapper around ts.find_entries_count with the exact same params and no added functionality

In the event that a suggested change doesn't work for you or you have a good reason to do otherwise, that's fine; just reply and we can discuss

Or, if you just missed my comment, make sure not to resolve conversations until you have made a change that addresses them. When I have my code reviewed I like to react with thumbs up and reply "TODO" so I can keep track of what I have seen vs what I have actually implemented and pushed up

I think I did address it, but I've been trying to rebase and do all this github branch magic and maybe it got reverted -- will do react and/or reply on future comments before resolving

OR I believe a better option is for the reviewer to resolve the comment? what are your thoughts on that approach in addition?

JGreenlee

@TeachMeTW Code looks good but the new tests appear to be failing in GH Actions
Do they pass for you locally?

TeachMeTW · 2024-12-14T06:23:04Z

@JGreenlee Yes the tests pass locally for me. I believe the error is not related to my code changes.

JGreenlee · 2024-12-14T16:30:13Z

I don't know why it's passing locally and not here, but it is related to this PR because the test that's failing is the one you added. It's complaining about timestamps that look a little bit off.

TeachMeTW · 2024-12-14T19:46:17Z

I don't know why it's passing locally and not here, but it is related to this PR because the test that's failing is the one you added. It's complaining about timestamps that look a little bit off.

Shall I add delta/assert if almost equal?

JGreenlee · 2024-12-14T23:25:44Z

I don't know why it's passing locally and not here, but it is related to this PR because the test that's failing is the one you added. It's complaining about timestamps that look a little bit off.

Shall I add delta/assert if almost equal?

No we should investigate why it's different.
The pipeline is supposed to be deterministic, i.e. 100% reproducible

What happens if you checkout master and run that test data through the pipeline?
Is it the same as the CI or the same as you got on your branch?

TeachMeTW · 2024-12-15T04:25:01Z

@JGreenlee I checkout master and run the test data through pipeline and its the same as the branch NOT the CI. I believe running ALL the tests must affect the ts. I tried to test runAllTest.sh but it crashes my console/vscode, perhaps you could verify

JGreenlee · 2024-12-15T17:08:13Z

I will try to verify tomorrow. I will also pass this along in the review process for Shankari's input.

If runAllTests performs differently than running an individual test file, it is concerning as it means we do not have proper isolation between the tests. Maybe one of the tests does not tearDown properly. It may be difficult to track down the culprit.

shankari · 2024-12-17T02:54:23Z

My hunch is that a teardown is broken somewhere and this UUID is being reused. Perhaps I can use a different uuid as a bandaid fix or can dig deeper into the root cause.

Feel free to use either option. I will not merge until this is resolved.

JGreenlee · 2024-12-17T19:45:46Z

We do not understand why the pipeline start ts comes up as 1440688707.867 during runAllTests.sh.

1440688707.867 is the ts of the first background/location on that day. The pipeline start ts is supposed to be the start_ts of the first trip (specifically the first composite_trip)

I am confident that 1440688739.672 is the correct start ts because when I run:

opcode = 'nrelop_dev-emulator-study_test'
day = 'shankari_2015-aug-27'

!./e-mission-py.bash bin/debug/purge_user.py -e $opcode
!./e-mission-py.bash bin/debug/load_timeline_for_day_and_user.py emission/tests/data/real_examples/$day $opcode
!./e-mission-py.bash bin/debug/intake_single_user.py -e $opcode

import emission.storage.timeseries.abstract_timeseries as esta
import emission.core.wrapper.user as ecwu
import pandas as pd

pd.set_option('display.float_format', str)

user_id = ecwu.User.fromEmail(opcode).uuid
ts = esta.TimeSeries.get_time_series(user_id)
ts.get_data_df("analysis/confirmed_trip")

I can see that the first trip has start_ts of 1440688739.672. This is consistent on master and this branch

JGreenlee · 2024-12-17T19:47:54Z

This specific problem has happened once before.

I used Github's search feature with "org:e-mission 1440688707.867" and got a hit: #509 (comment)

Reading through that PR, we still cannot pinpoint what the cause was.

TeachMeTW · 2024-12-17T23:21:19Z

Commenting out all the test cases in TestTripSegmentation and replacing it with a dummy test with a print statment still invokes the same errors. Might be something with setup and teardown, might be not as I cleared everything and still the same error. Maybe its the pipeline -- not sure for now.

I also discovered TestSectionSegmentation with TestGetUserStats has a similar error result.

TeachMeTW · 2024-12-17T23:49:26Z

However, if one runs TestTripSegmentation or TestSectionSegmentation AFTER TestGetUserStats there is NO ERROR

Test User Stats as First

Test User Stats as Second

TeachMeTW · 2024-12-18T01:10:36Z

Alternatively, using emission/tests/data/real_examples/shankari_2015-aug-21 rather than emission/tests/data/real_examples/shankari_2015-aug-27 resulted in NO ERRORS when running multiple tests.

We have a couple options.

We could do above and use Aug 21 dataset for UserStats, call it a day for now.
We could dig deeper as to why this is happening.

Thoughts?

shankari · 2024-12-18T01:38:10Z

We should do both.
We should change the test to Aug 21 for UserStats so that we can merge this and continue to make progress.
We should then file a new issue to dig deeper into why this is happening

Add per-operation timing to segment_current_trips using ect.Timer Added total, labeled, and last call Modified total and labeled trips to match op-admin implementation Modified last call to mirror op-admin implementation

Addressed comments, reduced overkill on refactor Forgot to add last_call_ts

TeachMeTW · 2024-12-18T04:13:27Z

@shankari Modified and pushed accordingly. Will make an issue as suggested.

JGreenlee · 2024-12-19T06:25:29Z

Will make an issue as suggested.

@TeachMeTW I figured out what was causing it and made a PR #1003

TeachMeTW · 2024-12-19T06:44:27Z

Nice find!

shankari

I am going to roll forward and merge to unblock us; @TeachMeTW please address these comments in a follow-on PR

shankari · 2024-12-21T23:36:37Z

emission/pipeline/intake_stage.py

-
-def _get_and_store_range(user_id, trip_key):
-    ts = esta.TimeSeries.get_time_series(user_id)
-    start_ts = ts.get_first_value_for_field(trip_key, "data.start_ts", pymongo.ASCENDING)


For the record, and to assist in understanding #1003 (comment)
this should be the start of the first trip.

Why is not the first location if the location points were received before?
The background sensed objects don't have a data.start_ts, they are associated with data.ts

So this must find a trip-like object; since they are the only ones who have a start_ts. So it would really be the start timestamp of the first procssed trip.

The segmentation/raw_trip is created with a start that corresponds to the start transition (as we do for draft trips). In the clean_and_resample stage, we "fill in" the trip trajectory by interpolating to the end of the previous trip. So this should return the start_ts from a cleaned/confirmed/composite trip.

@JGreenlee for visibility

shankari · 2024-12-21T23:43:16Z

emission/tests/common.py

@@ -107,7 +107,7 @@ def getRealExampleEmail(testObj):
 def fillExistingUUID(testObj):
    userObj = ecwu.User.fromEmail(getRealExampleEmail(testObj))
    print("Setting testUUID to %s" % userObj.uuid)
-    testObj.testUUID = userObj.uuir
+    testObj.testUUID = userObj.uuid


This seems to make sense, but I am not sure how the the tests were working without this. It looks like I changed this ~ 3 years ago
a45d4e3

The change was almost certainly an unintentional typo but the tests have been passing consistently since then

shankari · 2024-12-21T23:46:27Z

emission/analysis/result/user_stat.py

+        total_trips = ts.find_entries_count(key_list=["analysis/confirmed_trip"])
+        labeled_trips = ts.find_entries_count(
+            key_list=["analysis/confirmed_trip"],


cleanup: is there a reason this should be hardcoded instead of being trip_key? In addition to being cleaner, this will also likely result in better performance, since repeated queries on the same objects are likely to hit the cache.

will be addressing in follow up pr; no reason it should be hard coded

shankari · 2024-12-21T23:52:20Z

bin/historical/migrations/populate_pipeline_range.py

@@ -25,7 +26,7 @@ def add_pipeline_range(process_number, uuid_list, skip_if_no_new_data):
            continue

        try:
-            epi._get_and_store_range(uuid, "analysis/composite_trip")
+            eaurs.get_and_store_user_stats(uuid, "analysis/composite_trip")


I don't think this should be changed. The scripts in historical intended to be a record of the changes that we made to the database over time. At the time that the script was created

commit e98218e5f2505225284fa7a73088b8b2861dcc65 Author: Shankari <shankari@eecs.berkeley.edu> Date: Wed Aug 16 07:13:49 2023 -0700 🦕 Migration script to iterate over all users and fill in the pipeline ranges

the function was get_and_store_range

Open to discussion about how to deal with historical migrations, but am reverting this change now.

shankari · 2024-12-22T01:01:17Z

@TeachMeTW there is now a merge conflict after I have pulled from master.

I think this may be because your original base did not include
2a32742

Your diff looks like

You need to pull changes from master periodically, particularly when you are making changes to the same files

…into HEAD

- Introduced comprehensive test suite to validate functionality of the `get_and_store_user_stats` method. - Covered core scenarios including: - Correct data aggregation and storage. - Handling cases with no trips or missing data. - Verifying behavior with partial data and multiple invocations. - handling of invalid UUID inputs. - setup and teardown to ensure test isolation and clean up user data. - Included data insertion for confirmed trips, composite trips, and server API timestamps to simulate realistic scenarios. This initial test suite establishes a baseline for ensuring reliability of the `get_and_store_user_stats` function. remove Modified test Refactored the test and simplified it; validated all new user stats added store to intake Updated Based on requested changes Removed unnecessary wrapper Changed to Aug 21 for the time being

shankari · 2024-12-22T02:25:24Z

@JGreenlee @TeachMeTW this needs to be refactored further for it to work properly.
Note that we have an early return if there is no new data, so this code is effectively in the "else"

        if skip_if_no_new_data and new_entry_count == 0:
            print("No new entries, and skip_if_no_new_data = %s, skipping the rest of the pipeline" % skip_if_no_new_data)
            return
        else:
            print("New entry count == %s and skip_if_no_new_data = %s, continuing" %
                (new_entry_count, skip_if_no_new_data))

That's fine as long as we were only caching pipeline state, because it wouldn't change if the pipeline was skipped
But now that we are caching the last_call_ts the current location will not work.
Concretely, if the app is contacting the server but is not uploading any data, last_call_ts will never be changed.

@TeachMeTW

In e-mission#993 we expanded `_get_and_store_range` to store entries other than the pipeline range, notably the `last_call_ts`. However, adding it to the end of the pipeline causes it to be skipped most of the time. We have an early return from `run_intake_pipeline_for_user` if there is no new data, so the end of the pipeline (the code in e-mission#993) is effectively in the "else" That's fine as long as we were only caching pipeline state, because it wouldn't change if the pipeline was skipped But now that we are caching the last_call_ts the current location will not work. Concretely, if the app is contacting the server but is not uploading any data, last_call_ts will never be changed. Since this is a fix that I am implementing on a holiday weekend, I have not added unit tests to check this. @TeachMeTW please enhance the unit tests to verify that the stats are generated in both cases: (i) when there is new data and (ii) when there is no new data.

shankari · 2024-12-22T02:53:10Z

I have fixed this in #1005, @TeachMeTW please update the unit tests as outlined there

@TeachMeTW

In e-mission#993 we expanded `_get_and_store_range` to store entries other than the pipeline range, notably the `last_call_ts`. However, adding it to the end of the pipeline causes it to be skipped most of the time. We have an early return from `run_intake_pipeline_for_user` if there is no new data, so the end of the pipeline (the code in e-mission#993) is effectively in the "else" That's fine as long as we were only caching pipeline state, because it wouldn't change if the pipeline was skipped But now that we are caching the last_call_ts the current location will not work. Concretely, if the app is contacting the server but is not uploading any data, last_call_ts will never be changed. Since this is a fix that I am implementing on a holiday weekend, I have not added unit tests to check this. @TeachMeTW please enhance the unit tests to verify that the stats are generated in both cases: (i) when there is new data and (ii) when there is no new data. I have also changed the two tests that were running `epi.run_intake_pipeline_for_user` and checking the profile for results, since that is not run in that function any more. I am a bit surprised that `TestUserStat` was calling `epi.run_intake_pipeline_for_user` although `etc.runIntakePipeline` was enhanced to run the new step. @TeachMeTW Why was the enhancement needed? We need to figure out which method to call and be consistent about it as part of adding new tests (@JGreenlee, @TeachMeTW)

JGreenlee suggested changes Nov 8, 2024

View reviewed changes

emission/pipeline/intake_stage.py Outdated Show resolved Hide resolved

emission/pipeline/intake_stage.py Outdated Show resolved Hide resolved

emission/pipeline/intake_stage.py Outdated Show resolved Hide resolved

emission/pipeline/intake_stage.py Outdated Show resolved Hide resolved

TeachMeTW mentioned this pull request Nov 8, 2024

Integrate New User Statistics into Profile Retrieval e-mission/op-admin-dashboard#153

Merged

JGreenlee reviewed Nov 9, 2024

View reviewed changes

emission/analysis/result/user_stat.py Outdated Show resolved Hide resolved

TeachMeTW force-pushed the Extend-Get-Store-Range branch from e4e8ed9 to 8faafe1 Compare November 9, 2024 17:33

JGreenlee suggested changes Dec 4, 2024

View reviewed changes

TeachMeTW force-pushed the Extend-Get-Store-Range branch from dfb5afc to f4c57f0 Compare December 4, 2024 20:43

JGreenlee suggested changes Dec 13, 2024

View reviewed changes

TeachMeTW force-pushed the Extend-Get-Store-Range branch 3 times, most recently from 179594b to 965cc17 Compare December 13, 2024 20:10

JGreenlee suggested changes Dec 13, 2024

View reviewed changes

TeachMeTW force-pushed the Extend-Get-Store-Range branch from b01bd17 to 965cc17 Compare December 13, 2024 22:38

TeachMeTW force-pushed the Extend-Get-Store-Range branch 2 times, most recently from a7008f8 to 3ef0517 Compare December 14, 2024 04:33

JGreenlee reviewed Dec 14, 2024

View reviewed changes

JGreenlee suggested changes Dec 14, 2024

View reviewed changes

TeachMeTW mentioned this pull request Dec 16, 2024

Fix file handling to prevent ValueError #1001

Merged

shankari and others added 2 commits December 17, 2024 19:55

Separated Get and Store into a separate file

39a3358

Addressed comments, reduced overkill on refactor Forgot to add last_call_ts

TeachMeTW force-pushed the Extend-Get-Store-Range branch from 8022a1f to cadb13e Compare December 18, 2024 03:55

shankari mentioned this pull request Dec 21, 2024

clear AND reload analysis config on tests’ tearDown #1003

Merged

shankari force-pushed the Extend-Get-Store-Range branch 2 times, most recently from a612c77 to 957c8fa Compare December 22, 2024 00:51

shankari approved these changes Dec 22, 2024

View reviewed changes

shankari and others added 2 commits December 21, 2024 17:03

Merge branch 'master' of https://github.com/e-mission/e-mission-server …

ad33691

…into HEAD

shankari force-pushed the Extend-Get-Store-Range branch from 148136c to ae549f2 Compare December 22, 2024 01:09

shankari merged commit fee1bef into e-mission:master Dec 22, 2024
5 checks passed

shankari mentioned this pull request Dec 22, 2024

Always run the computation of the user profile stats #1005

Merged

shankari mentioned this pull request Dec 22, 2024

Collect data on performance improvements to the dashboard e-mission/op-admin-dashboard#145

Open

shankari mentioned this pull request Jan 2, 2025

Refactored retrieval of last API call timestamps to improve performance. e-mission/op-admin-dashboard#156

Merged


		TIME_FORMAT = 'YYYY-MM-DD HH:mm:ss'

		def count_trips(ts: esta.TimeSeries, key_list: list, extra_query_list: Optional[list] = None) -> int:

Enhance _get_and_store_range Function to Include Trip Statistics and Last API Call Tracking #993

Enhance _get_and_store_range Function to Include Trip Statistics and Last API Call Tracking #993

Conversation

TeachMeTW commented Nov 8, 2024

Overview

Changes

Testing

TeachMeTW commented Nov 8, 2024 • edited Loading

TeachMeTW commented Nov 8, 2024

JGreenlee left a comment

Choose a reason for hiding this comment

TeachMeTW commented Nov 9, 2024

TeachMeTW commented Nov 9, 2024

JGreenlee left a comment

Choose a reason for hiding this comment

JGreenlee left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JGreenlee Dec 13, 2024 • edited Loading

Choose a reason for hiding this comment

TeachMeTW commented Dec 13, 2024

TeachMeTW commented Dec 13, 2024

JGreenlee left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TeachMeTW commented Dec 13, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JGreenlee left a comment

Choose a reason for hiding this comment

TeachMeTW commented Dec 14, 2024

JGreenlee commented Dec 14, 2024

TeachMeTW commented Dec 14, 2024

JGreenlee commented Dec 14, 2024

TeachMeTW commented Dec 15, 2024

JGreenlee commented Dec 15, 2024

shankari commented Dec 17, 2024

JGreenlee commented Dec 17, 2024

JGreenlee commented Dec 17, 2024 • edited Loading

TeachMeTW commented Dec 17, 2024 • edited Loading

TeachMeTW commented Dec 17, 2024 • edited Loading

Test User Stats as First

Test User Stats as Second

TeachMeTW commented Dec 18, 2024 • edited Loading

shankari commented Dec 18, 2024

TeachMeTW commented Dec 18, 2024

JGreenlee commented Dec 19, 2024

TeachMeTW commented Dec 19, 2024

shankari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shankari commented Dec 22, 2024 • edited Loading

shankari commented Dec 22, 2024

shankari commented Dec 22, 2024

Enhance `_get_and_store_range` Function to Include Trip Statistics and Last API Call Tracking #993

Enhance `_get_and_store_range` Function to Include Trip Statistics and Last API Call Tracking #993

TeachMeTW commented Nov 8, 2024 •

edited

Loading

JGreenlee Dec 13, 2024 •

edited

Loading

JGreenlee commented Dec 17, 2024 •

edited

Loading

TeachMeTW commented Dec 17, 2024 •

edited

Loading

TeachMeTW commented Dec 17, 2024 •

edited

Loading

TeachMeTW commented Dec 18, 2024 •

edited

Loading

shankari commented Dec 22, 2024 •

edited

Loading