
check in notebook code #23

Merged: 18 commits on Jul 24, 2021

Conversation

corinne-hcr
Contributor

No description provided.

@shankari
Contributor

shankari commented Jun 2, 2021

@corinne-hcr Wait, is the entire notebook implementation (other than the imports) in one ~ 300 line cell? You are right, I don't like the structure.

@shankari
Contributor

shankari commented Jun 2, 2021

@corinne-hcr I am loath to ask for significant restructuring changes given the imminent deadline, but this definitely needs to be restructured significantly after the paper is submitted. I will mark my review with [NOW] and [LATER] to indicate priority.

Contributor

@shankari shankari left a comment

The most egregious part of this code is the duplication between tuning and testing. It is not too terrible otherwise. Still debating whether to merge or not.

"\n",
"for a in range(len(all_users)):\n",
" user = all_users[a]\n",
" filter_trips, trips = preprocess.filter_data(user, radius)\n",
Contributor

So you refactored this into filter_data and read_data, but you called read_data from filter_data.
I noticed something similar with get_subdata in terms of refactoring the extraction.
The goal of a pipeline is to have multiple distinct steps, not to call one function from another.

You should have read and filter as separate functions called from the notebook, sort of like

trips = read_data(...)
valid_trips = similarity.filter_too_short(trips)

Contributor

My original feedback was bfdf1e8#r631490113

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
" # get first round labels\n",
" first_labels = []\n",
" for b in range(len(bins)):\n",
" for trip in bins[b]:\n",
Contributor

isn't this the original index too?

Comment on lines 137 to 138
" for i in range(len(first_labels)):\n",
" idx_labels_track[i].append(first_labels[i])\n",
Contributor

this data structure is soo messy.

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
"\n",
" method = 'single'\n",
" # get labels after two rounds of clustering on common trips\n",
" new_labels = lp.get_new_labels(x, low, dist_pct, second_round_idx_labels, new_labels, method=method)\n",
Contributor

this should also be split up into cluster and get_labels

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
Comment on lines 194 to 204
" if curr_score > highest_score:\n",
" highest_score = curr_score\n",
" sel_low = low\n",
" sel_dist_pct = dist_pct\n",
" sel_homo_second = homo_second\n",
" sel_percentge_second = percentge_second\n",
" coll_low.append(sel_low)\n",
" coll_dist_pct.append(sel_dist_pct)\n",
" colle_tune_score.append(highest_score)\n",
" pct_collect_second_train.append(sel_percentge_second)\n",
" homo_collect_second_train.append(sel_homo_second)\n",
Contributor

you could just append all of these as tuples/dicts directly, and find the max in the list.
you can insert the max directly as the entry for this split.
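A rough sketch of that suggestion; evaluate and the parameter grids here are hypothetical stand-ins, not the notebook's actual code:

```python
def evaluate(low, dist_pct):
    # hypothetical stand-in for the real tuning-score computation
    return low * dist_pct

# collect each candidate's parameters and score together as one dict...
candidates = []
for low in [100, 200]:
    for dist_pct in [0.1, 0.2]:
        candidates.append({'low': low, 'dist_pct': dist_pct,
                           'score': evaluate(low, dist_pct)})

# ...then find the max in the list and use it directly as this split's entry
best = max(candidates, key=lambda c: c['score'])
```

This removes the need for the parallel sel_* variables entirely.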

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
Contributor

@shankari shankari left a comment

overall, I would recommend, at a high level, something like:

def tune(...):
   first_round
   second_round
   return tuning_parameters
def test(...., tuning_parameters):
    first_round
    second_round
    return results

Comment on lines 105 to 107
" tune_data = preprocess.get_subdata(filter_trips, test_idx)\n",
" test_data = preprocess.get_subdata(filter_trips, tune_idx)\n",
"\n",
Contributor

from the previous comments:
#23 (comment)

I noticed something similar with get_subdata in terms of refactoring the extraction.
The goal of a pipeline is to have multiple distinct steps, not call one function from another

Please address

Contributor

@shankari shankari Jun 30, 2021

I would like the pipeline to see separate steps like:

training_trips = get_subset(....)
training_features = extract_features(training_trips)

Or the other way around. I don't really care which order the steps happen in at this point, just that the order is clearly visible.

"\n",
" # run tuning set first\n",
" pct_collect_first, homo_collect_first, pct_collect_second, homo_collect_second = ep.init_score()\n",
Contributor

If you have so much state, it seems like it would be better to have a class that encapsulates it.
But it also seems like overkill to create a function just to initialize the score.
It seems like you have not finished the more fundamental refactoring.

tour_model_eval/first_second_round_evaluation.ipynb (outdated, resolved)
" homo_second = float('%.3f' % gs.score(bin_trips, new_labels))\n",
" pct_collect_first, homo_collect_first, pct_collect_second, homo_collect_second = ep.init_score()\n",
" # testing\n",
" pct_collect_first,homo_collect_first,pct_collect_second,homo_collect_second,coll_tradeoffs,coll_tune_score = ep.tuning_test(test_data,radius, pct_collect_first, homo_collect_first,pct_collect_second,homo_collect_second, coll_tradeoffs,coll_tune_score,test=True)\n",
Contributor

why do we return coll_tradeoffs and coll_tune_score from the testing invocation?

Contributor Author

At this point, I am not sure if we still need the tradeoffs after the clustering, so I kept them.
I am still figuring out how to predict the labels.

Comment on lines 61 to 80
"all_percentage_first_train = []\n",
"all_percentage_first_tune = []\n",
"all_percentage_first_test = []\n",
"all_percentage_second_train = []\n",
"all_percentage_second_tune = []\n",
Contributor

where are these used? I don't see them reappearing in the diff?

Comment on lines 98 to 114
sim,bins,bin_trips,filter_trips = first_round(data[j], radius)
# it is possible that we don't have common trips for tuning or testing
# bins contain common trips indices
if len(bins) != 0:
gs.compare_trip_orders(bins, bin_trips, filter_trips)
first_labels = get_first_label(bins)
# new_labels temporarily stores the labels from the first round, but later the labels in new_labels will be
# updated with the labels after two rounds of clustering.
new_labels = first_labels.copy()
first_label_set = list(set(first_labels))
track = get_track(bins, first_labels)
# get request percentage for the subset for the first round
percentage_first = grp.get_req_pct(new_labels, track, filter_trips, sim)
# get homogeneity score for the subset for the first round
homo_first = gs.score(bin_trips, first_labels)
pct_collect_first.append(percentage_first)
homo_collect_first.append(homo_first)
Contributor

the code for each subset can be pulled out.

@corinne-hcr
Contributor Author

I have separated the giant function into smaller functions, basically like tune(...) and test(...), but I am not sure if it is ok now. I plan to move evaluation_pipeline.py to e-mission-server.

Contributor

@shankari shankari left a comment

Almost there!

Comment on lines 3 to 4
import emission.analysis.modelling.tour_model.get_request_percentage as grp
import emission.analysis.modelling.tour_model.get_scores as gs
Contributor

where are these files? I don't see them in the pending PR
https://github.com/e-mission/e-mission-server/pull/826/files

Contributor Author

files added

import emission.analysis.modelling.tour_model.label_processing as lp
import emission.analysis.modelling.tour_model.data_preprocessing as preprocess

def second_round(first_label_set,first_labels,bin_trips,filter_trips,low,dist_pct,sim,new_labels,track):
Contributor

do you really need 9 mandatory arguments to this function? What do you need sim for, for example, if you already have filter_trips and track?

do you really need both first_label_set and first_labels? Why can you not get first_label_set from first_labels in this function?

At a high level, if you get to so many arguments, I will recommend that you create a class to pass them in. But given our time constraints, I am not going to push for this now.

Contributor Author

sim is used to find the trips below the cutoff in the 1st round. Those trips didn't go through any analysis beyond the 1st round, so I have to use sim.

At a high level, if you get to so many arguments, I will recommend that you create a class to pass them in. But given our time constraints, I am not going to push for this now.

I don't understand this. Maybe I can reduce 1 or 2 arguments. But given that there are still many arguments I have to pass in, what is the difference between writing a pure function to accept them and writing a class?

Contributor

you can write a class/struct to encapsulate them so that people don't have to write really long lines passing them back and forth. You can have a single class that you both pass in and return and that has the current state of the algorithm. Makes everything a lot less verbose.

Contributor Author

I have created a class for the second round; see second_round_of_clustering.py.
It is not a perfect one, but it works.

Comment on lines 81 to 92
sim, bins, bin_trips, filter_trips = first_round(data, radius)
# it is possible that we don't have common trips for tuning or testing
# bins contain common trips indices
if len(bins) != 0:
gs.compare_trip_orders(bins, bin_trips, filter_trips)
first_labels = get_first_label(bins)
# new_labels temporarily stores the labels from the first round, but later the labels in new_labels will be
# updated with the labels after two rounds of clustering.
new_labels = first_labels.copy()
first_label_set = list(set(first_labels))
track = get_track(bins, first_labels)
# collect tuning scores and parameters
Contributor

all of this is common, so it seems like it can be moved into first_round or into a new function like get_first_round_labels or something which is called from both tune and test

Contributor Author

There were only two repeated lines of code left after refactoring, but I created a function get_first_label_and_track(bins, bin_trips, filter_trips) that can be called from both tune and test.

Comment on lines 101 to 105
if curr_score not in tune_score:
tune_score[curr_score] = (low, dist_pct, homo_second, percentage_second)

best_score = max(tune_score)
sel_tradeoffs = tune_score[best_score][0:2]
Contributor

again, it seems like it would be good to create a class called ClusteringResults with these fields or even return a simple dict with the fields instead of passing around lists that are hard to read. Lists are convenient for you to write, but code like [0:2] doesn't indicate to others what you are returning.
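A rough sketch of that suggestion; the field set and the scoring rule here are assumptions based on the tuple above, not the actual notebook code:

```python
from dataclasses import dataclass

@dataclass
class ClusteringResults:
    # named fields replace the positional tuple (low, dist_pct, ...)
    low: int
    dist_pct: float
    homo_second: float
    percentage_second: float
    score: float

def best_result(results):
    # pick the entry with the highest tuning score
    return max(results, key=lambda r: r.score)

results = [ClusteringResults(100, 0.1, 0.8, 0.3, 0.5),
           ClusteringResults(200, 0.2, 0.9, 0.2, 0.7)]
best = best_result(results)
sel_tradeoffs = (best.low, best.dist_pct)  # intent is explicit, unlike [0:2]
```

A reader can now see exactly which fields are selected without decoding slice indices.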

Contributor Author

I have already changed this part; there is no more [0:2]. I do have a dict to collect the parameters. The parameters are clearly named and returned separately,
e.g. return low, dist_pct

Comment on lines 147 to 153
pct_collect_first = []
homo_collect_first = []
pct_collect_second = []
homo_collect_second = []
coll_score = []
coll_tradeoffs = []

Contributor

Again, can this be moved into a list of results or even a dataframe instead of multiple lists of results?

Contributor Author

I have changed it to use a dataframe to collect all clustering results. These lists are deleted.

Comment on lines 165 to 169
pct_collect_first.append(percentage_first)
homo_collect_first.append(homo_first)
pct_collect_second.append(percentage_second)
homo_collect_second.append(homo_second)
coll_score.append(scores)
Contributor

then this will be significantly simplified.

Contributor Author

These are deleted.

Contributor

@shankari shankari left a comment

  • what is the structure of the user_labels and models data structures?
  • reading all user trips is too intensive
  • don't understand why predict is not good enough to directly find the labels and has to go through first round and second round again

tour_model_eval/load_predict.py (outdated, resolved)
tour_model_eval/load_predict.py (outdated, resolved)
Comment on lines 60 to 65
# Check if start/end locations of the new trip and every start/end locations in this bin are within the range of
# radius. If so, the new trip falls in this bin. Then predict the second round label of the new trip
# using this bin's model
if in_bin(sel_loc_feat, trip_loc_feat, radius):
sel_fl = fl
break
Contributor

This is too intensive for prediction. Basically, for prediction, you do not want to write any code that iterates over all the trips for a user; it will not scale.

It will not scale means that if you have a user with data for 6 months, the server will crash or it will take hours for the computation.

Contributor Author

I didn't actually run all trips. I saved the coordinates of all common trips in the build-and-save step. My logic here is to compare the new trip's coordinates with the other trips' to see which bin it should fall in.
The other way I can think of is finding a representative trip for every bin, then comparing it with the new trip. But there is a risk that the new trip matches the representative trip but not all the existing trips in that bin. (I thought that the new trips would match all trips in a bin if it matched one, but that would not be certain.)
Do you have another way to see if a new trip can fall into any bin?

Contributor

There is no concept of running a trip. A trip is data, you can only run code.
I think that you mean that you are not running the clustering algorithm on all trips.
You are not running the clustering algorithm, but you are still running a for loop over all the trips aka "iterating over the trips". Iterating over 10k trips to check the distances is still going to take time.

Using a representative trip has the problem that you have outlined " I thought that the new trips would match all trips in a bin if it matched one, but that would not be certain". A better option seems to be to store a geometric shape representing all the starts and all the ends of all the trips. You can then quickly check whether the new trip is within the shape.

The simplest shape is probably a circle. A potential algorithm might be:

  • get all the points
  • find the centroid
  • find the max distance of any point from the centroid
  • use that as the circle representing the points
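A minimal sketch of that hacky circle, assuming points are (lon, lat) pairs (the function names here are illustrative, not from the PR):

```python
import math

def bounding_circle(points):
    # centroid of all points...
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    # ...and the max distance of any point from the centroid
    radius = max(math.dist((cx, cy), p) for p in points)
    return (cx, cy), radius

def in_circle(point, center, radius):
    # O(1) membership test for a new trip's start/end point,
    # instead of iterating over every stored trip in the bin
    return math.dist(point, center) <= radius
```

Note that this circle can be larger than the true minimum enclosing circle; it only guarantees that all the bin's points are contained.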

You can also use a minimum enclosing circle algorithm/implementation instead of the hacky solution above.
https://www.nayuki.io/page/smallest-enclosing-circle

Contributor

Actually, you can't use that implementation because its license is LGPL and we don't want to taint our BSD-3 license.
An alternative is OpenCV, which uses an Apache license.

Contributor

A lighter weight alternative appears to be
https://github.com/luwei14/ShapeAnalyzer

Make sure to update the ThirdPartyLicenses.md file if you include a new library

Contributor

This has a super simple example using scipy
https://www.cs.au.dk/~gerth/ipsa18/slides/optimization.pdf
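The scipy approach can be sketched roughly as follows; this is an illustration of the idea (minimizing, over candidate centers, the maximum distance to any point, which is a convex objective whose optimum is the smallest enclosing circle), not the slides' exact code:

```python
import numpy as np
from scipy.optimize import minimize

def smallest_enclosing_circle(points):
    pts = np.asarray(points, dtype=float)

    def radius_at(c):
        # radius required if the circle were centered at c
        return np.max(np.linalg.norm(pts - c, axis=1))

    # the centroid is a reasonable starting guess for the optimizer
    res = minimize(radius_at, pts.mean(axis=0), method='Nelder-Mead')
    return res.x, radius_at(res.x)
```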

tour_model_eval/load_predict.py (outdated, resolved)
tour_model_eval/load_predict.py (outdated, resolved)
tour_model_eval/load_predict.py (outdated, resolved)
Comment on lines 48 to 53
# e.g.{'0': [[start lon1, start lat1, end lon1, end lat1],[start lon, start lat, end lon, end lat]]}
# another explanation: -'0': label from the 1st round
# - the value of key '0': all trips that in this bin
# - for every trip: the coordinates of start/end locations
bin_locations = loadModelStage('locations_' + str(user))[0]

Contributor

this is going to be completely changed in the next commit, so ignoring for now.

tour_model_eval/load_predict.py (outdated, resolved)
Comment on lines 101 to 103
second_label_ls = list(seccond_round_result.keys())
if sel_sl not in second_label_ls:
return {}
return []
Contributor

why do you need this? I think you can directly use sel_ls not in second_round_result

Contributor Author

Because Gabriel's function needs a list, I followed the result format of placeholder_predictor_2(trip).

Contributor

this is not about the return type. It is about the need for second_label_ls = list(seccond_round_result.keys())

Contributor Author

Oh ok, I see your point.

tour_model_eval/load_predict.py (outdated, resolved)
Contributor

@shankari shankari left a comment

I don't see any new design problems here.

tour_model_eval/build_save_model.py (outdated, resolved)
tour_model_eval/build_save_model.py (outdated, resolved)
tour_model_eval/build_save_model.py (outdated, resolved)
tour_model_eval/build_save_model.py (outdated, resolved)
Comment on lines 116 to 120
labels_columns = user_label_df.columns.to_list()
one_set_labels = {}
for i in range(len(unique_labels)):
# e.g. labels_only={'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}
labels_only = {column: unique_labels.iloc[i][column] for column in labels_columns}
Contributor

why not just use the built-in to_dict?

Suggested change
labels_columns = user_label_df.columns.to_list()
one_set_labels = {}
for i in range(len(unique_labels)):
# e.g. labels_only={'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}
labels_only = {column: unique_labels.iloc[i][column] for column in labels_columns}
# e.g. labels_only=[{'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}]
labels_only = user_label_df.to_dict('records')

Comment on lines 124 to 126
# in case append() method changes the dict, we use deepcopy here
labels_set = copy.deepcopy(one_set_labels)
sec_label_obj.append(labels_set)
Contributor

which append method would change the dict?

Contributor Author

The append() on line 126.
This will change the dict if the dict is not an entirely new object. If I directly append one_set_labels, the values in one_set_labels will be changed.

Contributor

really? Because that is not how Python dicts or lists work; see below.
If one_set_labels is changed, there is some other reason; you need to understand and explain it.

>>> sec_label_obj = []
>>> one_set_labels = {}
>>> one_set_labels["labels"] = {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}
>>> one_set_labels["p"] = 1.0
>>> sec_label_obj.append(one_set_labels)
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 1.0}]
>>> one_set_labels
{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 1.0}

Contributor Author

Here you are using one dict. But as you append more dicts with same name(one_set_labels), the previous added dict will be changed.

>>> test_list = []
>>> a={'q':1,'w':[]}
>>> test_list.append(a)
>>> print(test_list)
[{'q': 1, 'w': []}]

>>> a['q']=2
>>> a['w'].append(123)
>>> test_list.append(a)
>>> print(test_list)
[{'q': 2, 'w': [123]}, {'q': 2, 'w': [123]}]

Several days ago, I found some explanations in Chinese. Someone says that what appended to the list is the address of the dict. It doesn't create a new object in the list. So, every time we append a dict with the same name, it actually points to the same address.

Also, there is a difference between copy() and deepcopy(): copy() is a shallow copy.

>>> a={'q':1,'w':[]}
>>> b=a.copy()
>>> b['q']=2
>>> b['w'].append(123)
>>> print(a)
>>> print(b)
{'q': 1, 'w': [123]}
{'q': 2, 'w': [123]}

If we change to deepcopy(), we can have entirely new object.

>>> a={'q':1,'w':[]}
>>> b=copy.deepcopy(a)
>>> b['q']=2
>>> b['w'].append(123)
>>> print(a)
>>> print(b)
{'q': 1, 'w': []}
{'q': 2, 'w': [123]}

Contributor

@shankari shankari Jul 18, 2021

Here you are using one dict. But as you append more dicts with same name(one_set_labels), the previous added dict will be changed.

This is not true if you do it "pythonically"

>>> sec_label_obj = []
>>> for p in range(0,10,1):
...     one_set_labels = {}
...     one_set_labels["labels"] = {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}
...     one_set_labels["p"] = p/10
...     sec_label_obj.append(one_set_labels)
...
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.0}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.1}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.2}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.3}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.4}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.5}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.6}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.7}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.8}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.9}]

This style is more pythonic because it lends itself to list comprehensions, use with itertools, functional style programming, etc - e.g.

>>> sec_label_obj = [{"labels": {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, "p": p/10} for p in range(0,10,1)]
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.0}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.1}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.2}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.3}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.4}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.5}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.6}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.7}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.8}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.9}]

Several days ago, I found some explanations in Chinese. Someone says that what appended to the list is the address of the dict. It doesn't create a new object in the list. So, every time we append a dict with the same name, it actually points to the same address.

@corinne-hcr I am aware of the difference between shallow copy and deepcopy. But the difference is primarily when you modify objects that you have appended to the list (e.g. b['q']=2); merely appending the object doesn't change it. In general modifying objects that you are appending to the list is bad practice because the code cannot be easily refactored using the alternate styles I outlined above.

Where are you modifying objects in your code?

Contributor Author

I think I understand the differences between your output and mine. I created one_set_labels = {} before the loop, so I didn't get a new object every time I changed the values. You created one_set_labels = {} inside the loop, so you get a new object each time. But I am going to change the structure. Glad to figure it out!

Contributor

Creating one object outside the loop and then deepcopying in the loop is not pythonic. It will make it harder to maintain the code in the future.

Contributor

@corinne-hcr I still see the deepcopy here

@shankari
Contributor

shankari commented Jul 19, 2021

@corinne-hcr just to clarify, I don't plan to merge this PR at this point, you can continue working on the refactoring and let me know when you are completely done. I will copy the modeling code to the server repo and will be working there directly.

I will submit a separate PR there as well. I don't think you run the modeling code from the notebook, but if you do, I can just write a harness anyway.

P.S. I am not sure why you included the model creation and prediction code in this repo instead of the server repo since that is its final destination anyway.

@corinne-hcr
Contributor Author

I don't think the notebook calls the modeling code. But the notebook generates results and parameters, and the modeling code needs the parameters.
So, should I keep refactoring the modeling code and let you know when I am done?

@shankari
Contributor

shankari commented Jul 19, 2021

@corinne-hcr I have already started working on moving the modeling code over. You are welcome to refactor the modeling code in parallel if you like as long as you can meet your other commitments around finishing the refactoring and the unit tests, but I cannot wait until you are done.

The modeling code should generate the results and parameters internally; there cannot be a notebook sitting around when it is running in production.

@shankari
Contributor

shankari commented Jul 19, 2021

@corinne-hcr you may want to move over the .py files into the server repo yourself first so you get appropriate credit for them.
If I copy them over and commit them, it will look like I was the one that wrote all the code.

@corinne-hcr
Contributor Author

But if the modeling code needs to generate the best parameters, it has to run through the same steps as the notebook code. That takes a long time and is repeated work.
I thought that all results were saved after running the notebook code, and that the modeling code would then find the best result (best split, parameters) and re-build and save the models.

@shankari
Contributor

@corinne-hcr is there an ETA for copying the modeling files over? Otherwise, I am just going to do it myself.

@corinne-hcr
Contributor Author

@corinne-hcr is there an ETA for copying the modeling files over? otherwise, I am just going to do it myself.

Do you mean moving everything to the server repo except the notebook code?

@shankari
Contributor

shankari commented Jul 19, 2021

@corinne-hcr

Do you mean moving everything to the server repo except the notebook code?

Yes. There are also some differences between the same files in the server repo and this repo; we need to have the most recent version (whichever that is) in the server repo - e.g.

--- ./get_users.py  2021-07-19 11:51:10.000000000 -0700
+++ /Users/kshankar/e-mission/e-mission-eval-private-data/tour_model_eval/./get_users.py    2021-05-27 17:39:27.000000000 -0700
@@ -1,4 +1,4 @@
-import emission.analysis.modelling.tour_model.data_preprocessing as preprocess
+import data_preprocessing as preprocess


 # to determine if the user is valid:
@@ -19,8 +19,7 @@
     for i in range(len(all_users)):
         curr_user = 'user' + str(i + 1)
         user = all_users[i]
-        trips = preprocess.read_data(user)
-        filter_trips = preprocess.filter_data(trips,radius)
+        filter_trips,trips = preprocess.filter_data(user,radius)
         if valid_user(filter_trips,trips):
             valid_user_ls.append(curr_user)
             user_ls.append(curr_user)
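The structure this diff moves toward — distinct read and filter steps invoked by the caller, instead of filter_data() wrapping read_data() — can be sketched generically. The dummy data and the distance check below are stand-ins, not the real e-mission implementation.

```python
def read_data(user):
    # Stand-in for the database read; the real version queries the
    # trip collection for one user.
    return [{"user": user, "distance_m": d} for d in (50, 120, 300)]

def filter_data(trips, radius):
    # Stand-in filter: keep trips at least as long as the radius, so
    # trivially short trips don't pollute the clustering.
    return [t for t in trips if t["distance_m"] >= radius]

# Each pipeline stage is called separately by the caller, so the
# stages stay independent and individually testable.
trips = read_data("user1")
filter_trips = filter_data(trips, radius=100)
```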

@shankari
Contributor

But if the modeling code need to generate the best parameters, it has to run through the same thing in the notebook code. It takes a long time, and it is repeated.

I thought that all results were saved after running the notebook code. Then run the modeling code to find the best result(best split, parameters) and re-build & save the models

The notebook is primarily for visualization and interactive exploration. I think you are arguing that we should pick parameters separately (in the notebook) from building the clusters (in the model). That is a good idea; we might want to generate the optimal parameters only once every 6 months or so.

Three points:

  • I don't see why picking the best parameters needs to happen in the notebook. There is no human intervention in picking the parameters, nothing that you need to "show" to a user.
  • The parameters could be different for different regions and different demographic groups. They might even change over time for the same group.
  • I am not going to sit there running the parameter generation code on my laptop for multiple different pilots.

Again, for this first implementation, tonight, we can even use hardcoded parameters that you got from the notebook. But it is not a long-term solution.

And moving the code to the server does not preclude importing it and running it in a notebook. So let's move things over to the server, please. Right now.

@shankari
Contributor

@corinne-hcr is there an ETA for resolving the differences between this repo and the server and moving the other .py files over? I can take a break for lunch to give you some time to work on this but I can't wait any longer. If I copy the files over, they will appear to have been created by me instead of you.

@corinne-hcr
Contributor Author

corinne-hcr commented Jul 19, 2021

The code has been moved, but the PR checks are still in progress. e-mission/e-mission-server#829
If you don't want to run the notebook to generate the result, can I put the result-generating part of my evaluation_pipeline.py in main()? It just would not call the scatter plot function.

@shankari
Contributor

Not sure what this would look like; please submit a commit showing your idea.

@corinne-hcr
Contributor Author

I have updated the notebook code here

Comment on lines +32 to +33
"participant_uuid_obj = list(edb.get_profile_db().find({\"install_group\": \"participant\"}, {\"user_id\": 1, \"_id\": 0}))\n",
"all_users = [u[\"user_id\"] for u in participant_uuid_obj]"
Contributor

@corinne-hcr The participant UUID check was specific to the CanBikeCO mini-pilot - it is not true for other programs or datasets. I have created a new mongodump of only the participant data; you should load that instead and change this to the call from emission.storage.timeseries.abstract_timeseries.

"# get all/valid user list\n",
"user_ls, valid_users = gu.get_user_ls(all_users, radius)\n",
"# get all filenames of clustering result\n",
"collect_filename = predict.loadModelStage(\"collect_filename\")"
Contributor

I don't understand why you need to have the filenames in a list. Can't you just read the file based on the UUID? Can you clarify your design decision here?

Contributor Author

Oh, I think this one can be deleted. I did use the UUID to save the files. The original design was to save the filenames, but that no longer seems necessary.
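A minimal sketch of reading by UUID rather than from a stored filename list; the "user_<uuid>.csv" pattern is taken from the error message quoted further down in this thread, and the base directory is an assumption.

```python
import os

def result_path(user_uuid, base_dir="."):
    # Derive the per-user result filename from the UUID itself, so no
    # separate collect_filename list has to be saved and loaded.
    return os.path.join(base_dir, "user_%s.csv" % user_uuid)

path = result_path("576e37c7-ab7e-4c03-add7-02486bc3f42e")
```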

@shankari
Contributor

@corinne-hcr I don't think you have addressed all of my comments. I am going to address most of them myself, but there are a few really small and clear ones that it would be good for you to fix. Also, what is the difference between the two notebooks? A short markdown summary would be useful.

@corinne-hcr
Contributor Author

Only first_second_round_evaluation.ipynb is the useful notebook. In the previous version, I ran the clustering in the notebook and saved the results as multiple lists. Now I save the results as one dataframe per user, and the notebook just shows the scatter plots for the two rounds of clustering. The clustering code has been moved to evaluation_pipeline.

I think I have addressed your comments. Why do you see no difference in the notebook? Please see commit bfd6ba6.
In that one I forgot to delete the output, so I committed a new one, which is the latest version.

@shankari
Contributor

shankari commented Jul 21, 2021

I think I have addressed your comments.

I still see the deepcopy code, for example.

Why do you see no difference in the notebook? Please see commit bfd6ba6.
In that one I forgot to delete the output, so I committed a new one, which is the latest version.

I'm not saying that there is no difference. I'm saying that there are two .ipynb files in this PR.

@corinne-hcr
Contributor Author

That's in evaluation_pipeline.py. That file is actually in Modeling and functions #829, and you said to make no more changes in that file to avoid conflicts. I was trying to refactor it according to your suggestion, but that didn't return a result in the right format for Gabriel's function. I thought you were going to refactor it yourself.
You might have a different version of evaluation_pipeline now. Should I delete the collect_filename from there?

@shankari
Contributor

Fine, I will refactor the code in the .py files myself. Once you explain the difference between the two .ipynb files, I will go ahead and merge this.

@shankari
Contributor

@corinne-hcr is there an ETA for updating the two notebooks so I can merge this?

@corinne-hcr
Contributor Author

I am facing a problem now.
I deleted the collect_filename and let the plot code read files from the user id. When I test it, the notebook always reports that the files are not found. The plot file is in e-mission-server, and the clustering result files are also in e-mission-server. The notebook is in the e-mission-eval-private-data repo.

FileNotFoundError: [Errno 2] No such file or directory: 'user_576e37c7-ab7e-4c03-add7-02486bc3f42e.csv'

The file can be read by the plot code. I don't know why the notebook has this error. Previously I did everything in the e-mission-eval-private-data repo, so I didn't have such a problem.

@shankari
Contributor

The file can be read by the plot code. I don't know why the notebook has this error. Previously I did everything in the e-mission-eval-private-data repo, so I didn't have such a problem.

You know the reason why it has this error and have already identified it below.

Previously I did everything in the e-mission-eval-private-data repo, so I didn't have such a problem.

You need to specify the path to the files so they can be read correctly.
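A sketch of the fix: a bare filename is resolved against the notebook's current working directory (inside e-mission-eval-private-data), so the notebook has to build a path pointing into the directory where the result files were actually written. SERVER_DIR below is an assumed checkout location, not a real configured value.

```python
import os

# Assumed location of the server repo checkout; adjust to the actual path.
SERVER_DIR = os.path.abspath(os.path.expanduser("~/e-mission-server"))

def result_path(user_uuid):
    # Absolute path into the server repo, instead of a bare filename
    # that only resolves when the code runs from that directory.
    return os.path.join(SERVER_DIR, "user_%s.csv" % user_uuid)
```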

…o read all data from the database in the notebook
@corinne-hcr
Contributor Author

I think you can merge the code now.

@shankari
Contributor

There is so much refactoring left for this code... but making the changes has now taken around two months.
This PR is way past its "sell by" date, so I am merging it and will deal with the refactoring later.

@shankari shankari merged commit 9cdb64e into e-mission:master Jul 24, 2021
@corinne-hcr
Contributor Author

I deleted the collect_filename in the notebook and in the plot code. Since I will fix my typo in evaluation_pipeline anyway, I can delete the collect_filename there as well.

humbleOldSage pushed a commit to humbleOldSage/e-mission-eval-private-data that referenced this pull request Dec 1, 2023
* check in notebook code

* update notebook

* update notebook code, the previous one was not the latest one

* add some functions

* refactor evaluation_pipeline code, pull out get_socre function from get_tuning_score.py to get_scores.py

* add kmeans after running hierarchical clustering for re-building the model, only for testing, using user1

* adding kmeans only in the test step

* check in codes for generating result for Gabriel's function, will move them to the server repo after refactoring

* change the line of importing label_processing

* address the problems from the previous commit, but has not yet done with geometric shape

* Update tour_model_eval/build_save_model.py

Co-authored-by: shankari <shankari@eecs.berkeley.edu>

* refactored notebook code, not done with plot

* update build_save_model according to notebook refactoring

* check in the changes in the notebook, have put the original clustering code in evaluation_pipeline

* delete output from the notebook

* modify test notebook and add comments on it, remove extraneous files

* add plot code, read filename directly from user id, add another way to read all data from the database in the notebook

Co-authored-by: shankari <shankari@eecs.berkeley.edu>