
check in notebook code #23

Merged: 18 commits on Jul 24, 2021

Conversation

corinne-hcr
Contributor

No description provided.

@shankari
Contributor

shankari commented Jun 2, 2021

@corinne-hcr Wait, is the entire notebook implementation (other than the imports) in one ~ 300 line cell? You are right, I don't like the structure.

@shankari
Contributor

shankari commented Jun 2, 2021

@corinne-hcr I am loath to ask for significant restructuring changes given the imminent deadline, but this definitely needs to be restructured significantly after the paper is submitted. I will mark my review with [NOW] and [LATER] to indicate priority.

Contributor

@shankari shankari left a comment

The most egregious part of this code is the duplication between tuning and testing. It is not too terrible otherwise. Still debating whether to merge or not.

"\n",
"for a in range(len(all_users)):\n",
" user = all_users[a]\n",
" filter_trips, trips = preprocess.filter_data(user, radius)\n",
Contributor

So you refactored this into filter_data and read_data, but you called read_data from filter_data.
I noticed something similar with get_subdata in terms of refactoring the extraction.
The goal of a pipeline is to have multiple distinct steps, not to call one function from another.

You should have read and filter as separate functions called from the notebook, sort of like

trips = read_data(...)
valid_trips = similarity.filter_too_short(trips)

Contributor

My original feedback was bfdf1e8#r631490113

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
" # get first round labels\n",
" first_labels = []\n",
" for b in range(len(bins)):\n",
" for trip in bins[b]:\n",
Contributor

isn't this the original index too?

Comment on lines 137 to 138
" for i in range(len(first_labels)):\n",
" idx_labels_track[i].append(first_labels[i])\n",
Contributor

this data structure is soo messy.

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
"\n",
" method = 'single'\n",
" # get labels after two rounds of clustering on common trips\n",
" new_labels = lp.get_new_labels(x, low, dist_pct, second_round_idx_labels, new_labels, method=method)\n",
Contributor

this should also be split up into cluster and get_labels

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
Comment on lines 194 to 204
" if curr_score > highest_score:\n",
" highest_score = curr_score\n",
" sel_low = low\n",
" sel_dist_pct = dist_pct\n",
" sel_homo_second = homo_second\n",
" sel_percentge_second = percentge_second\n",
" coll_low.append(sel_low)\n",
" coll_dist_pct.append(sel_dist_pct)\n",
" colle_tune_score.append(highest_score)\n",
" pct_collect_second_train.append(sel_percentge_second)\n",
" homo_collect_second_train.append(sel_homo_second)\n",
Contributor

you could just append all of these as tuples/dicts directly, and find the max in the list.
you can insert the max directly as the entry for this split.
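A rough sketch of that suggestion; evaluate and the parameter grids here are hypothetical stand-ins, not the notebook's actual code:

```python
def evaluate(low, dist_pct):
    # hypothetical stand-in for the real tuning-score computation
    return low * dist_pct

# collect each candidate's parameters and score together as one dict...
candidates = []
for low in [100, 200]:
    for dist_pct in [0.1, 0.2]:
        candidates.append({'low': low, 'dist_pct': dist_pct,
                           'score': evaluate(low, dist_pct)})

# ...then find the max in the list and use it directly as this split's entry
best = max(candidates, key=lambda c: c['score'])
```

This removes the need for the parallel sel_* variables entirely.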

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
Contributor

@shankari shankari left a comment

overall, I would recommend, at a high level, something like:

def tune(...):
   first_round
   second_round
   return tuning_parameters
def test(...., tuning_parameters):
    first_round
    second_round
    return results

Comment on lines 105 to 107
" tune_data = preprocess.get_subdata(filter_trips, test_idx)\n",
" test_data = preprocess.get_subdata(filter_trips, tune_idx)\n",
"\n",
Contributor

from the previous comments:
#23 (comment)

I noticed something similar with get_subdata in terms of refactoring the extraction.
The goal of a pipeline is to have multiple distinct steps, not call one function from another

Please address

Contributor

@shankari shankari Jun 30, 2021

I would like the pipeline to see separate steps like:

training_trips = get_subset(....)
training_features = extract_features(training_trips)

Or the other way around. I don't really care which order the steps happen in at this point, just that the order is clearly visible.

"\n",
" # run tuning set first\n",
" pct_collect_first, homo_collect_first, pct_collect_second, homo_collect_second = ep.init_score()\n",
Contributor

If you have so much state, it seems like it would be better to have a class that encapsulates it.
But it also seems like overkill to create a function just to initialize the score.
It seems like you have not finished the more fundamental refactoring.

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
" homo_second = float('%.3f' % gs.score(bin_trips, new_labels))\n",
" pct_collect_first, homo_collect_first, pct_collect_second, homo_collect_second = ep.init_score()\n",
" # testing\n",
" pct_collect_first,homo_collect_first,pct_collect_second,homo_collect_second,coll_tradeoffs,coll_tune_score = ep.tuning_test(test_data,radius, pct_collect_first, homo_collect_first,pct_collect_second,homo_collect_second, coll_tradeoffs,coll_tune_score,test=True)\n",
Contributor

why do we return coll_tradeoffs and coll_tune_score from the testing invocation?

Contributor Author

At this point, I am not sure if we still need the tradeoffs after the clustering, so I kept them.
I am still figuring out how to predict the labels.

Comment on lines 61 to 80
"all_percentage_first_train = []\n",
"all_percentage_first_tune = []\n",
"all_percentage_first_test = []\n",
"all_percentage_second_train = []\n",
"all_percentage_second_tune = []\n",
Contributor

where are these used? I don't see them reappearing in the diff?

Comment on lines 98 to 114
sim,bins,bin_trips,filter_trips = first_round(data[j], radius)
# it is possible that we don't have common trips for tuning or testing
# bins contain common trips indices
if len(bins) != 0:
gs.compare_trip_orders(bins, bin_trips, filter_trips)
first_labels = get_first_label(bins)
# new_labels temporarily stores the labels from the first round, but later the labels in new_labels will be
# updated with the labels after two rounds of clustering.
new_labels = first_labels.copy()
first_label_set = list(set(first_labels))
track = get_track(bins, first_labels)
# get request percentage for the subset for the first round
percentage_first = grp.get_req_pct(new_labels, track, filter_trips, sim)
# get homogeneity score for the subset for the first round
homo_first = gs.score(bin_trips, first_labels)
pct_collect_first.append(percentage_first)
homo_collect_first.append(homo_first)
Contributor

the code for each subset can be pulled out.

@corinne-hcr
Contributor Author

I have separated the giant function into smaller functions, basically like tune(...) and test(...), but I am not sure if it is ok now. I plan to move evaluation_pipeline.py to e-mission-server.

Contributor

@shankari shankari left a comment

Almost there!

Comment on lines 3 to 4
import emission.analysis.modelling.tour_model.get_request_percentage as grp
import emission.analysis.modelling.tour_model.get_scores as gs
Contributor

where are these files? I don't see them in the pending PR
https://github.com/e-mission/e-mission-server/pull/826/files

Contributor Author

files added

import emission.analysis.modelling.tour_model.label_processing as lp
import emission.analysis.modelling.tour_model.data_preprocessing as preprocess

def second_round(first_label_set,first_labels,bin_trips,filter_trips,low,dist_pct,sim,new_labels,track):
Contributor

do you really need 9 mandatory arguments to this function? What do you need sim for, for example, if you already have filter_trips and track?

do you really need both first_label_set and first_labels? Why can you not get first_label_set from first_labels in this function?

At a high level, if you get to so many arguments, I will recommend that you create a class to pass them in. But given our time constraints, I am not going to push for this now.

Contributor Author

sim is used to find the trips below the cutoff in the 1st round. Those trips didn't go through any analysis beyond the 1st round, so I have to use sim.

At a high level, if you get to so many arguments, I will recommend that you create a class to pass them in. But given our time constraints, I am not going to push for this now.

I don't understand this. Maybe I can reduce 1 or 2 arguments. But given that there are still many arguments I have to pass in, what is the difference between writing a pure function to accept them and writing a class?

Contributor

you can write a class/struct to encapsulate them so that people don't have to write really long lines passing them back and forth. You can have a single class that you both pass in and return and that has the current state of the algorithm. Makes everything a lot less verbose.

Contributor Author

I have created a class for the second round; see second_round_of_clustering.py.
It is not a perfect one, but it works.

Comment on lines 81 to 92
sim, bins, bin_trips, filter_trips = first_round(data, radius)
# it is possible that we don't have common trips for tuning or testing
# bins contain common trips indices
if len(bins) != 0:
gs.compare_trip_orders(bins, bin_trips, filter_trips)
first_labels = get_first_label(bins)
# new_labels temporarily stores the labels from the first round, but later the labels in new_labels will be
# updated with the labels after two rounds of clustering.
new_labels = first_labels.copy()
first_label_set = list(set(first_labels))
track = get_track(bins, first_labels)
# collect tuning scores and parameters
Contributor

all of this is common, so it seems like it can be moved into first_round or into a new function like get_first_round_labels or something which is called from both tune and test

Contributor Author

There were only two repeated lines of code left after refactoring, but I created a function get_first_label_and_track(bins, bin_trips, filter_trips) that can be called from both tune and test.

Comment on lines 101 to 105
if curr_score not in tune_score:
tune_score[curr_score] = (low, dist_pct, homo_second, percentage_second)

best_score = max(tune_score)
sel_tradeoffs = tune_score[best_score][0:2]
Contributor

again, it seems like it would be good to create a class called ClusteringResults with these fields or even return a simple dict with the fields instead of passing around lists that are hard to read. Lists are convenient for you to write, but code like [0:2] doesn't indicate to others what you are returning.
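A rough sketch of that suggestion; the field set and the scoring rule here are assumptions based on the tuple above, not the actual notebook code:

```python
from dataclasses import dataclass

@dataclass
class ClusteringResults:
    # named fields replace the positional tuple (low, dist_pct, ...)
    low: int
    dist_pct: float
    homo_second: float
    percentage_second: float
    score: float

def best_result(results):
    # pick the entry with the highest tuning score
    return max(results, key=lambda r: r.score)

results = [ClusteringResults(100, 0.1, 0.8, 0.3, 0.5),
           ClusteringResults(200, 0.2, 0.9, 0.2, 0.7)]
best = best_result(results)
sel_tradeoffs = (best.low, best.dist_pct)  # intent is explicit, unlike [0:2]
```

A reader can now see exactly which fields are selected without decoding slice indices.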

Contributor Author

I have already changed this part; there is no more [0:2]. I do have a dict to collect the parameters. The parameters are clearly named and returned separately,
e.g. return low, dist_pct

Comment on lines 147 to 153
pct_collect_first = []
homo_collect_first = []
pct_collect_second = []
homo_collect_second = []
coll_score = []
coll_tradeoffs = []

Contributor

Again, can this be moved into a list of results or even a dataframe instead of multiple lists of results?

Contributor Author

I have changed it to use a dataframe to collect all clustering results. These lists are deleted.

Comment on lines 165 to 169
pct_collect_first.append(percentage_first)
homo_collect_first.append(homo_first)
pct_collect_second.append(percentage_second)
homo_collect_second.append(homo_second)
coll_score.append(scores)
Contributor

then this will be significantly simplified.

Contributor Author

These are deleted.

Contributor

@shankari shankari left a comment

  • what is the structure of the user_labels and models data structures?
  • reading all user trips is too intensive
  • don't understand why predict is not good enough to directly find the labels and has to go through first round and second round again

tour_model_eval/load_predict.py (outdated, resolved)
tour_model_eval/load_predict.py (outdated, resolved)
Comment on lines 60 to 65
# Check if start/end locations of the new trip and every start/end locations in this bin are within the range of
# radius. If so, the new trip falls in this bin. Then predict the second round label of the new trip
# using this bin's model
if in_bin(sel_loc_feat, trip_loc_feat, radius):
sel_fl = fl
break
Contributor

This is too intensive for prediction. Basically, for prediction, you do not want to write any code that iterates over all the trips for a user; it will not scale.

It will not scale means that if you have a user with data for 6 months, the server will crash or it will take hours for the computation.

Contributor Author

I didn't actually run all trips. I saved the coordinates of all common trips in the build-and-save step. My logic here is to compare the new trip's coordinates with the other trips' to see which bin it should fall in.
The other way I can think of is finding a representative trip for every bin, then comparing it with the new trip. But there is a risk that the new trip matches the representative trip but not all the existing trips in that bin. (I thought that the new trips would match all trips in a bin if it matched one, but that would not be certain.)
Do you have another way to see if a new trip can fall into any bin?

Contributor

There is no concept of running a trip. A trip is data, you can only run code.
I think that you mean that you are not running the clustering algorithm on all trips.
You are not running the clustering algorithm, but you are still running a for loop over all the trips aka "iterating over the trips". Iterating over 10k trips to check the distances is still going to take time.

Using a representative trip has the problem that you have outlined " I thought that the new trips would match all trips in a bin if it matched one, but that would not be certain". A better option seems to be to store a geometric shape representing all the starts and all the ends of all the trips. You can then quickly check whether the new trip is within the shape.

The simplest shape is probably a circle. A potential algorithm might be:

  • get all the points
  • find the centroid
  • find the max distance of any point from the centroid
  • use that as the circle representing the points
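A minimal sketch of that hacky circle, assuming points are (lon, lat) pairs (the function names here are illustrative, not from the PR):

```python
import math

def bounding_circle(points):
    # centroid of all points...
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    # ...and the max distance of any point from the centroid
    radius = max(math.dist((cx, cy), p) for p in points)
    return (cx, cy), radius

def in_circle(point, center, radius):
    # O(1) membership test for a new trip's start/end point,
    # instead of iterating over every stored trip in the bin
    return math.dist(point, center) <= radius
```

Note that this circle can be larger than the true minimum enclosing circle; it only guarantees that all the bin's points are contained.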

You can also use a minimum enclosing circle algorithm/implementation instead of the hacky solution above.
https://www.nayuki.io/page/smallest-enclosing-circle

Contributor

Actually, you can't use that implementation because its license is LGPL and we don't want to taint our BSD-3 license.
An alternative is OpenCV, which uses an Apache license.

Contributor

A lighter weight alternative appears to be
https://github.com/luwei14/ShapeAnalyzer

Make sure to update the ThirdPartyLicenses.md file if you include a new library

Contributor

This has a super simple example using scipy
https://www.cs.au.dk/~gerth/ipsa18/slides/optimization.pdf
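The scipy approach can be sketched roughly as follows; this is an illustration of the idea (minimizing, over candidate centers, the maximum distance to any point, which is a convex objective whose optimum is the smallest enclosing circle), not the slides' exact code:

```python
import numpy as np
from scipy.optimize import minimize

def smallest_enclosing_circle(points):
    pts = np.asarray(points, dtype=float)

    def radius_at(c):
        # radius required if the circle were centered at c
        return np.max(np.linalg.norm(pts - c, axis=1))

    # the centroid is a reasonable starting guess for the optimizer
    res = minimize(radius_at, pts.mean(axis=0), method='Nelder-Mead')
    return res.x, radius_at(res.x)
```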

tour_model_eval/load_predict.py (outdated, resolved)
tour_model_eval/load_predict.py (outdated, resolved)
tour_model_eval/load_predict.py (outdated, resolved)
Comment on lines 48 to 53
# e.g.{'0': [[start lon1, start lat1, end lon1, end lat1],[start lon, start lat, end lon, end lat]]}
# another explanation: -'0': label from the 1st round
# - the value of key '0': all trips that in this bin
# - for every trip: the coordinates of start/end locations
bin_locations = loadModelStage('locations_' + str(user))[0]

Contributor

this is going to be completely changed in the next commit, so ignoring for now.

tour_model_eval/load_predict.py (outdated, resolved)
Comment on lines 101 to 103
second_label_ls = list(seccond_round_result.keys())
if sel_sl not in second_label_ls:
return {}
return []
Contributor

why do you need this? I think you can directly use sel_ls not in second_round_result

Contributor Author

Because Gabriel's function needs a list, I followed the result format of placeholder_predictor_2(trip).

Contributor

this is not about the return type. It is about the need for second_label_ls = list(seccond_round_result.keys())

Contributor Author

Oh ok, I see your point.

tour_model_eval/load_predict.py (outdated, resolved)
Contributor

@shankari shankari left a comment

I don't see any new design problems here.

tour_model_eval/build_save_model.py (outdated, resolved)
tour_model_eval/build_save_model.py (outdated, resolved)
tour_model_eval/build_save_model.py (outdated, resolved)
tour_model_eval/build_save_model.py (outdated, resolved)
Comment on lines 116 to 120
labels_columns = user_label_df.columns.to_list()
one_set_labels = {}
for i in range(len(unique_labels)):
# e.g. labels_only={'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}
labels_only = {column: unique_labels.iloc[i][column] for column in labels_columns}
Contributor

why not just use the built-in to_dict?

Suggested change
labels_columns = user_label_df.columns.to_list()
one_set_labels = {}
for i in range(len(unique_labels)):
# e.g. labels_only={'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}
labels_only = {column: unique_labels.iloc[i][column] for column in labels_columns}
# e.g. labels_only=[{'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}]
labels_only = user_label_df.to_dict('records')

Comment on lines 124 to 126
# in case append() method changes the dict, we use deepcopy here
labels_set = copy.deepcopy(one_set_labels)
sec_label_obj.append(labels_set)
Contributor

which append method would change the dict?

Contributor Author

The append() on line 126.
This will change the dict if the dict is not an entirely new object. If I directly append one_set_labels, the values in one_set_labels will be changed.

Contributor

really? Because that is not how Python dicts or lists work; see below.
If one_set_labels is changed, there is some other reason; you need to understand and explain it.

>>> sec_label_obj = []
>>> one_set_labels = {}
>>> one_set_labels["labels"] = {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}
>>> one_set_labels["p"] = 1.0
>>> sec_label_obj.append(one_set_labels)
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 1.0}]
>>> one_set_labels
{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 1.0}

Contributor Author

Here you are using one dict. But as you append more dicts with same name(one_set_labels), the previous added dict will be changed.

>>> test_list = []
>>> a={'q':1,'w':[]}
>>> test_list.append(a)
>>> print(test_list)
[{'q': 1, 'w': []}]

>>> a['q']=2
>>> a['w'].append(123)
>>> test_list.append(a)
>>> print(test_list)
[{'q': 2, 'w': [123]}, {'q': 2, 'w': [123]}]

Several days ago, I found some explanations in Chinese. Someone says that what appended to the list is the address of the dict. It doesn't create a new object in the list. So, every time we append a dict with the same name, it actually points to the same address.

Also, there is a difference between copy() and deepcopy(): copy() is a shallow copy.

>>> a={'q':1,'w':[]}
>>> b=a.copy()
>>> b['q']=2
>>> b['w'].append(123)
>>> print(a)
>>> print(b)
{'q': 1, 'w': [123]}
{'q': 2, 'w': [123]}

If we change to deepcopy(), we can have entirely new object.

>>> a={'q':1,'w':[]}
>>> b=copy.deepcopy(a)
>>> b['q']=2
>>> b['w'].append(123)
>>> print(a)
>>> print(b)
{'q': 1, 'w': []}
{'q': 2, 'w': [123]}

Contributor

@shankari shankari Jul 18, 2021

Here you are using one dict. But as you append more dicts with same name(one_set_labels), the previous added dict will be changed.

This is not true if you do it "pythonically"

>>> sec_label_obj = []
>>> for p in range(0,10,1):
...     one_set_labels = {}
...     one_set_labels["labels"] = {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}
...     one_set_labels["p"] = p/10
...     sec_label_obj.append(one_set_labels)
...
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.0}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.1}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.2}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.3}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.4}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.5}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.6}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.7}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.8}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.9}]

This style is more pythonic because it lends itself to list comprehensions, use with itertools, functional style programming, etc - e.g.

>>> sec_label_obj = [{"labels": {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, "p": p/10} for p in range(0,10,1)]
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.0}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.1}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.2}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.3}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.4}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.5}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.6}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.7}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.8}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.9}]

Several days ago, I found some explanations in Chinese. Someone says that what appended to the list is the address of the dict. It doesn't create a new object in the list. So, every time we append a dict with the same name, it actually points to the same address.

@corinne-hcr I am aware of the difference between shallow copy and deepcopy. But the difference is primarily when you modify objects that you have appended to the list (e.g. b['q']=2); merely appending the object doesn't change it. In general modifying objects that you are appending to the list is bad practice because the code cannot be easily refactored using the alternate styles I outlined above.

Where are you modifying objects in your code?

Contributor Author

I think I understand the differences between your output and mine. I created one_set_labels = {} before the loop, so I didn't get a new object every time I changed the values. You created one_set_labels = {} inside the loop, so you get a new object each time. But I am going to change the structure. Glad to figure it out!

Contributor

Creating one object outside the loop and then deepcopying in the loop is not pythonic. It will make it harder to maintain the code in the future.

Contributor

@corinne-hcr I still see the deepcopy here

@shankari
Contributor

shankari commented Jul 19, 2021

@corinne-hcr just to clarify, I don't plan to merge this PR at this point, you can continue working on the refactoring and let me know when you are completely done. I will copy the modeling code to the server repo and will be working there directly.

I will submit a separate PR there as well. I don't think you run the modeling code from the notebook, but if you do, I can just write a harness anyway.

P.S. I am not sure why you included the model creation and prediction code in this repo instead of the server repo since that is its final destination anyway.

@corinne-hcr
Contributor Author

I don't think the notebook calls the modeling code. But the notebook generates results and parameters, and the modeling code needs the parameters.
So, should I keep refactoring the modeling code and let you know when I am done?

@shankari
Contributor

shankari commented Jul 19, 2021

@corinne-hcr I have already started working on moving the modeling code over. You are welcome to refactor the modeling code in parallel if you like as long as you can meet your other commitments around finishing the refactoring and the unit tests, but I cannot wait until you are done.

The modeling code should generate the results and parameters internally; there cannot be a notebook sitting around when it is running in production.

@shankari
Contributor

shankari commented Jul 19, 2021

@corinne-hcr you may want to move over the .py files into the server repo yourself first so you get appropriate credit for them.
If I copy them over and commit them, it will look like I was the one that wrote all the code.

@corinne-hcr
Contributor Author

But if the modeling code needs to generate the best parameters, it has to run through the same steps as the notebook code. That takes a long time and is repeated work.
I thought that all results were saved after running the notebook code, and that the modeling code would then find the best result (best split, parameters) and re-build and save the models.

@shankari
Contributor

@corinne-hcr is there an ETA for copying the modeling files over? Otherwise, I am just going to do it myself.

@corinne-hcr
Contributor Author

@corinne-hcr is there an ETA for copying the modeling files over? otherwise, I am just going to do it myself.

Do you mean moving everything to the server repo except the notebook code?

@shankari
Contributor

shankari commented Jul 19, 2021

@corinne-hcr

Do you mean moving everything to the server repo except the notebook code?

Yes. There are also some differences between the same files in the server repo and this repo; we need to have the most recent version (whichever that is) in the server repo - e.g.

--- ./get_users.py  2021-07-19 11:51:10.000000000 -0700
+++ /Users/kshankar/e-mission/e-mission-eval-private-data/tour_model_eval/./get_users.py    2021-05-27 17:39:27.000000000 -0700
@@ -1,4 +1,4 @@
-import emission.analysis.modelling.tour_model.data_preprocessing as preprocess
+import data_preprocessing as preprocess


 # to determine if the user is valid:
@@ -19,8 +19,7 @@
     for i in range(len(all_users)):
         curr_user = 'user' + str(i + 1)
         user = all_users[i]
-        trips = preprocess.read_data(user)
-        filter_trips = preprocess.filter_data(trips,radius)
+        filter_trips,trips = preprocess.filter_data(user,radius)
         if valid_user(filter_trips,trips):
             valid_user_ls.append(curr_user)
             user_ls.append(curr_user)
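The structure this diff moves toward — distinct read and filter steps invoked by the caller, instead of filter_data() wrapping read_data() — can be sketched generically. The dummy data and the distance check below are stand-ins, not the real e-mission implementation.

```python
def read_data(user):
    # Stand-in for the database read; the real version queries the
    # trip collection for one user.
    return [{"user": user, "distance_m": d} for d in (50, 120, 300)]

def filter_data(trips, radius):
    # Stand-in filter: keep trips at least as long as the radius, so
    # trivially short trips don't pollute the clustering.
    return [t for t in trips if t["distance_m"] >= radius]

# Each pipeline stage is called separately by the caller, so the
# stages stay independent and individually testable.
trips = read_data("user1")
filter_trips = filter_data(trips, radius=100)
```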

@shankari
Contributor

But if the modeling code need to generate the best parameters, it has to run through the same thing in the notebook code. It takes a long time, and it is repeated.

I thought that all results were saved after running the notebook code. Then run the modeling code to find the best result(best split, parameters) and re-build & save the models

The notebook is primarily for visualization and interactive exploration. I think you are arguing that we should pick parameters separately (in the notebook) from building the clusters (in the model). That is a good idea; we might want to generate the optimal parameters only once every 6 months or so.

Three points:

  • I don't see why picking the best parameters needs to happen in the notebook. There is no human intervention in picking the parameters, nothing that you need to "show" to a user.
  • The parameters could be different for different regions and different demographic groups. They might even change over time for the same group.
  • I am not going to sit there running the parameter generation code on my laptop for multiple different pilots.

Again, for this first implementation, tonight, we can even use hardcoded parameters that you got from the notebook. But it is not a long-term solution.

And moving the code to the server does not preclude importing it and running it in a notebook. So let's move things over to the server, please. Right now.

@shankari
Contributor

@corinne-hcr is there an ETA for resolving the differences between this repo and the server and moving the other .py files over? I can take a break for lunch to give you some time to work on this but I can't wait any longer. If I copy the files over, they will appear to have been created by me instead of you.

@corinne-hcr
Contributor Author

corinne-hcr commented Jul 19, 2021

The code has been moved, but the PR checks are still in progress. e-mission/e-mission-server#829
If you don't want to run the notebook to generate the result, can I put the result-generating part of my evaluation_pipeline.py in main()? It just would not call the scatter plot function.

@shankari
Contributor

Not sure what this would look like; please submit a commit showing your idea.

@corinne-hcr
Contributor Author

I have updated the notebook code here

Comment on lines +32 to +33
"participant_uuid_obj = list(edb.get_profile_db().find({\"install_group\": \"participant\"}, {\"user_id\": 1, \"_id\": 0}))\n",
"all_users = [u[\"user_id\"] for u in participant_uuid_obj]"
Contributor

@corinne-hcr The participant UUID check was specific to the CanBikeCO mini-pilot - it is not true for other programs or datasets. I have created a new mongodump of only the participant data; you should load that instead and change this to the call from emission.storage.timeseries.abstract_timeseries.

"# get all/valid user list\n",
"user_ls, valid_users = gu.get_user_ls(all_users, radius)\n",
"# get all filenames of clustering result\n",
"collect_filename = predict.loadModelStage(\"collect_filename\")"
Contributor

I don't understand why you need to have the filenames in a list. Can't you just read the file based on the UUID? Can you clarify your design decision here?

Contributor Author

Oh, I think this one can be deleted. I did use the UUID to save the files. The original design was to save the filenames, but that no longer seems necessary.
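A minimal sketch of reading by UUID rather than from a stored filename list; the "user_<uuid>.csv" pattern is taken from the error message quoted further down in this thread, and the base directory is an assumption.

```python
import os

def result_path(user_uuid, base_dir="."):
    # Derive the per-user result filename from the UUID itself, so no
    # separate collect_filename list has to be saved and loaded.
    return os.path.join(base_dir, "user_%s.csv" % user_uuid)

path = result_path("576e37c7-ab7e-4c03-add7-02486bc3f42e")
```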

@shankari
Contributor

@corinne-hcr I don't think you have addressed all of my comments. I am going to address most of them myself, but there are a few really small and clear ones that it would be good for you to fix. Also, what is the difference between the two notebooks? A short markdown summary would be useful.

@corinne-hcr
Contributor Author

Only first_second_round_evaluation.ipynb is the useful notebook. In the previous version, I ran the clustering in the notebook and saved the results as multiple lists. Now I save the results as one dataframe per user, and the notebook just shows the scatter plots for the two rounds of clustering. The clustering code has been moved to evaluation_pipeline.

I think I have addressed your comments. Why do you see no difference in the notebook? Please see commit bfd6ba6.
In that one I forgot to delete the output, so I committed a new one, which is the latest version.

@shankari
Contributor

shankari commented Jul 21, 2021

I think I have addressed your comments.

I still see the deepcopy code, for example.

Why do you see no difference in the notebook? Please see commit bfd6ba6.
In that one I forgot to delete the output, so I committed a new one, which is the latest version.

I'm not saying that there is no difference. I'm saying that there are two .ipynb files in this PR.

@corinne-hcr
Contributor Author

That's in evaluation_pipeline.py. That file is actually in Modeling and functions #829, and you said to make no more changes in that file to avoid conflicts. I was trying to refactor it according to your suggestion, but that didn't return a result in the right format for Gabriel's function. I thought you were going to refactor it yourself.
You might have a different version of evaluation_pipeline now. Should I delete the collect_filename from there?

@shankari
Contributor

Fine, I will refactor the code in the .py files myself. Once you explain the difference between the two .ipynb files, I will go ahead and merge this.

@shankari
Contributor

@corinne-hcr is there an ETA for updating the two notebooks so I can merge this?

@corinne-hcr
Contributor Author

I am facing a problem now.
I deleted the collect_filename and let the plot code read files from the user id. When I test it, the notebook always reports that the files are not found. The plot file is in e-mission-server, and the clustering result files are also in e-mission-server. The notebook is in the e-mission-eval-private-data repo.

FileNotFoundError: [Errno 2] No such file or directory: 'user_576e37c7-ab7e-4c03-add7-02486bc3f42e.csv'

The file can be read by the plot code. I don't know why the notebook has this error. Previously I did everything in the e-mission-eval-private-data repo, so I didn't have such a problem.

@shankari
Contributor

The file can be read by the plot code. I don't know why the notebook has this error. Previously I did everything in the e-mission-eval-private-data repo, so I didn't have such a problem.

You know the reason why it has this error and have already identified it below.

Previously I did everything in the e-mission-eval-private-data repo, so I didn't have such a problem.

You need to specify the path to the files so they can be read correctly.
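A sketch of the fix: a bare filename is resolved against the notebook's current working directory (inside e-mission-eval-private-data), so the notebook has to build a path pointing into the directory where the result files were actually written. SERVER_DIR below is an assumed checkout location, not a real configured value.

```python
import os

# Assumed location of the server repo checkout; adjust to the actual path.
SERVER_DIR = os.path.abspath(os.path.expanduser("~/e-mission-server"))

def result_path(user_uuid):
    # Absolute path into the server repo, instead of a bare filename
    # that only resolves when the code runs from that directory.
    return os.path.join(SERVER_DIR, "user_%s.csv" % user_uuid)
```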

…o read all data from the database in the notebook
@corinne-hcr
Contributor Author

I think you can merge the code now.

@shankari
Contributor

There is so much refactoring left for this code... but making the changes has now taken around two months.
This PR is way past its "sell by" date, so I am merging it and will deal with the refactoring later.

@shankari shankari merged commit 9cdb64e into e-mission:master Jul 24, 2021
@corinne-hcr
Contributor Author

I deleted the collect_filename in the notebook and in the plot code. Since I will fix my typo in evaluation_pipeline anyway, I can delete the collect_filename there as well.

humbleOldSage pushed a commit to humbleOldSage/e-mission-eval-private-data that referenced this pull request Dec 1, 2023
* check in notebook code

* update notebook

* update notebook code, the previous one was not the latest one

* add some functions

* refactor evaluation_pipeline code, pull out get_socre function from get_tuning_score.py to get_scores.py

* add kmeans after running hierarchical clustering for re-building the model, only for testing, using user1

* adding kmeans only in the test step

* check in codes for generating result for Gabriel's function, will move them to the server repo after refactoring

* change the line of importing label_processing

* address the problems from the previous commit, but has not yet done with geometric shape

* Update tour_model_eval/build_save_model.py

Co-authored-by: shankari <shankari@eecs.berkeley.edu>

* refactored notebook code, not done with plot

* update build_save_model according to notebook refactoring

* check in the changes in the notebook, have put the original clustering code in evaluation_pipeline

* delete output from the notebook

* modify test notebook and add comments on it, remove extraneous files

* add plot code, read filename directly from user id, add another way to read all data from the database in the notebook

Co-authored-by: shankari <shankari@eecs.berkeley.edu>