check in notebook code #23
Conversation
@corinne-hcr Wait, is the entire notebook implementation (other than the imports) in one ~300-line cell? You are right, I don't like the structure.
@corinne-hcr I am loath to ask for significant restructuring changes given the imminent deadline, but this definitely needs to be restructured significantly after the paper is submitted. I will mark my review with [NOW] and [LATER] to indicate priority.
The most egregious part of this code is the duplication between tuning and testing. It is not too terrible otherwise. Still debating whether to merge or not.
"\n", | ||
"for a in range(len(all_users)):\n", | ||
" user = all_users[a]\n", | ||
" filter_trips, trips = preprocess.filter_data(user, radius)\n", |
so you refactored this into filter_data and read_data, but you called read_data from filter_data.
I noticed something similar with get_subdata in terms of refactoring the extraction.
The goal of a pipeline is to have multiple distinct steps, not to call one function from another. You should have read and filter as separate functions called from the notebook, sort of like:

trips = read_data(...)
valid_trips = similarity.filter_too_short(trips)
My original feedback was bfdf1e8#r631490113
" # get first round labels\n", | ||
" first_labels = []\n", | ||
" for b in range(len(bins)):\n", | ||
" for trip in bins[b]:\n", |
isn't this the original index too?
" for i in range(len(first_labels)):\n", | ||
" idx_labels_track[i].append(first_labels[i])\n", |
this data structure is so messy.
"\n", | ||
" method = 'single'\n", | ||
" # get labels after two rounds of clustering on common trips\n", | ||
" new_labels = lp.get_new_labels(x, low, dist_pct, second_round_idx_labels, new_labels, method=method)\n", |
this should also be split up into cluster and get_labels
" if curr_score > highest_score:\n", | ||
" highest_score = curr_score\n", | ||
" sel_low = low\n", | ||
" sel_dist_pct = dist_pct\n", | ||
" sel_homo_second = homo_second\n", | ||
" sel_percentge_second = percentge_second\n", | ||
" coll_low.append(sel_low)\n", | ||
" coll_dist_pct.append(sel_dist_pct)\n", | ||
" colle_tune_score.append(highest_score)\n", | ||
" pct_collect_second_train.append(sel_percentge_second)\n", | ||
" homo_collect_second_train.append(sel_homo_second)\n", |
you could just append all of these as tuples/dicts directly, and find the max in the list.
you can insert the max directly as the entry for this split.
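For example, a minimal sketch of that idea (the names here are illustrative, not the notebook's actual variables):

split_results = []
for low, dist_pct in param_combos:   # assumed sweep over tuning parameters
    split_results.append({
        "low": low,
        "dist_pct": dist_pct,
        "score": curr_score,          # assumed computed inside the loop
        "homo_second": homo_second,
        "pct_second": percentge_second,
    })
# the best entry for this split, found in one step
best_for_split = max(split_results, key=lambda r: r["score"])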
overall, I would recommend, at a high level, something like:

def tune(...):
    first_round
    second_round
    return tuning_parameters

def test(..., tuning_parameters):
    first_round
    second_round
    return results
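For concreteness, a sketch of that shape (the run_second_round* helpers are hypothetical placeholders, not functions in this PR):

def tune(tune_data, radius):
    sim, bins, bin_trips, filter_trips = first_round(tune_data, radius)
    tuning_parameters = run_second_round_search(bins, bin_trips, filter_trips)  # hypothetical
    return tuning_parameters

def test(test_data, radius, tuning_parameters):
    sim, bins, bin_trips, filter_trips = first_round(test_data, radius)
    results = run_second_round(bins, bin_trips, filter_trips, tuning_parameters)  # hypothetical
    return results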
" tune_data = preprocess.get_subdata(filter_trips, test_idx)\n", | ||
" test_data = preprocess.get_subdata(filter_trips, tune_idx)\n", | ||
"\n", |
from the previous comments: #23 (comment)

I noticed something similar with get_subdata in terms of refactoring the extraction. The goal of a pipeline is to have multiple distinct steps, not call one function from another.

Please address
I would like the pipeline to have clearly separate steps like:
training_trips = get_subset(....)
training_features = extract_features(training_trips)
Or the other way around. I don't really care which order the steps happen in at this point, just that the order is clearly visible.
"\n", | ||
" # run tuning set first\n", | ||
" pct_collect_first, homo_collect_first, pct_collect_second, homo_collect_second = ep.init_score()\n", |
If you have so much state, it seems like it would be better to have a class that encapsulates it.
But it also seems like overkill to create a function just to initialize the score.
It seems like you have not finished the more fundamental refactoring.
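For example, a sketch of that encapsulation (field names are guesses based on the existing list names):

from dataclasses import dataclass, field
from typing import List

@dataclass
class ScoreState:
    pct_first: List[float] = field(default_factory=list)
    homo_first: List[float] = field(default_factory=list)
    pct_second: List[float] = field(default_factory=list)
    homo_second: List[float] = field(default_factory=list)

scores = ScoreState()   # replaces the four parallel lists from ep.init_score()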
" homo_second = float('%.3f' % gs.score(bin_trips, new_labels))\n", | ||
" pct_collect_first, homo_collect_first, pct_collect_second, homo_collect_second = ep.init_score()\n", | ||
" # testing\n", | ||
" pct_collect_first,homo_collect_first,pct_collect_second,homo_collect_second,coll_tradeoffs,coll_tune_score = ep.tuning_test(test_data,radius, pct_collect_first, homo_collect_first,pct_collect_second,homo_collect_second, coll_tradeoffs,coll_tune_score,test=True)\n", |
why do we return coll_tradeoffs and coll_tune_score from the testing invocation?
At this point, I am not sure if we still need the tradeoffs after the clustering, so I kept them. I am still figuring out how to predict the labels.
"all_percentage_first_train = []\n", | ||
"all_percentage_first_tune = []\n", | ||
"all_percentage_first_test = []\n", | ||
"all_percentage_second_train = []\n", | ||
"all_percentage_second_tune = []\n", |
where are these used? I don't see them reappearing in the diff?
sim, bins, bin_trips, filter_trips = first_round(data[j], radius)
# it is possible that we don't have common trips for tuning or testing
# bins contain common trips indices
if len(bins) is not 0:
    gs.compare_trip_orders(bins, bin_trips, filter_trips)
    first_labels = get_first_label(bins)
    # new_labels temporarily stores the labels from the first round, but later the labels in new_labels will be
    # updated with the labels after two rounds of clustering.
    new_labels = first_labels.copy()
    first_label_set = list(set(first_labels))
    track = get_track(bins, first_labels)
    # get request percentage for the subset for the first round
    percentage_first = grp.get_req_pct(new_labels, track, filter_trips, sim)
    # get homogeneity score for the subset for the first round
    homo_first = gs.score(bin_trips, first_labels)
    pct_collect_first.append(percentage_first)
    homo_collect_first.append(homo_first)
the code for each subset can be pulled out.
…et_tuning_score.py to get_scores.py
I have separated the functions out of a giant function, basically like tune(...) and test(...), but I am not sure if it is OK now. I plan to move the
Almost there!
import emission.analysis.modelling.tour_model.get_request_percentage as grp
import emission.analysis.modelling.tour_model.get_scores as gs
where are these files? I don't see them in the pending PR
https://github.com/e-mission/e-mission-server/pull/826/files
files added
import emission.analysis.modelling.tour_model.label_processing as lp
import emission.analysis.modelling.tour_model.data_preprocessing as preprocess

def second_round(first_label_set, first_labels, bin_trips, filter_trips, low, dist_pct, sim, new_labels, track):
do you really need 9 mandatory arguments to this function? What do you need sim for, for example, if you already have filter_trips and track?
do you really need both first_label_set and first_labels? Why can you not get first_label_set from first_labels in this function?
At a high level, if you get to so many arguments, I would recommend that you create a class to pass them in. But given our time constraints, I am not going to push for this now.
sim is to find out trips below the cutoff in the 1st round. Those trips didn't go through further analysis beyond the 1st round, so I have to use sim.

At a high level, if you get to so many arguments, I would recommend that you create a class to pass them in. But given our time constraints, I am not going to push for this now.

I don't understand this. Maybe I can reduce 1 or 2 arguments. But given that there are still many arguments I have to pass in, what is the difference between writing a pure function to accept them and writing a class?
you can write a class/struct to encapsulate them so that people don't have to write really long lines passing them back and forth. You can have a single class that you both pass in and return and that has the current state of the algorithm. Makes everything a lot less verbose.
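A sketch of that pass-in/return pattern (the class and field names are illustrative only):

class ClusteringState:
    def __init__(self, bins, bin_trips, filter_trips, labels, track):
        self.bins = bins
        self.bin_trips = bin_trips
        self.filter_trips = filter_trips
        self.labels = labels
        self.track = track

def second_round(state, low, dist_pct):
    # update state.labels etc. here instead of threading nine arguments
    return state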
I have created a class for the second round. See second_round_of_clustering.py. Although it is not perfect, it works.
sim, bins, bin_trips, filter_trips = first_round(data, radius)
# it is possible that we don't have common trips for tuning or testing
# bins contain common trips indices
if len(bins) is not 0:
    gs.compare_trip_orders(bins, bin_trips, filter_trips)
    first_labels = get_first_label(bins)
    # new_labels temporarily stores the labels from the first round, but later the labels in new_labels will be
    # updated with the labels after two rounds of clustering.
    new_labels = first_labels.copy()
    first_label_set = list(set(first_labels))
    track = get_track(bins, first_labels)
    # collect tuning scores and parameters
all of this is common, so it seems like it can be moved into first_round, or into a new function like get_first_round_labels or something, which is called from both tune and test
There were just two lines of code repeated after refactoring, but I created a function get_first_label_and_track(bins, bin_trips, filter_trips), which can be called from both tune and test
if curr_score not in tune_score:
    tune_score[curr_score] = (low, dist_pct, homo_second, percentage_second)

best_score = max(tune_score)
sel_tradeoffs = tune_score[best_score][0:2]
again, it seems like it would be good to create a class called ClusteringResults with these fields, or even return a simple dict with the fields, instead of passing around lists that are hard to read. Lists are convenient for you to write, but code like [0:2] doesn't indicate to others what you are returning.
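A hedged sketch of the ClusteringResults idea (field names assumed from the surrounding code):

from dataclasses import dataclass

@dataclass
class ClusteringResult:
    low: float
    dist_pct: float
    homo_second: float
    pct_second: float

tune_score[curr_score] = ClusteringResult(low, dist_pct, homo_second, percentage_second)
best = tune_score[max(tune_score)]
sel_tradeoffs = (best.low, best.dist_pct)   # self-documenting, unlike [0:2]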
I have already changed this part; there is no more [0:2]. I do have a dict to collect the parameters. The parameters are clearly named and returned separately, e.g. return low, dist_pct
pct_collect_first = []
homo_collect_first = []
pct_collect_second = []
homo_collect_second = []
coll_score = []
coll_tradeoffs = []
Again, can this be moved into a list of results or even a dataframe instead of multiple lists of results?
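For instance, a minimal sketch of the dataframe version (column names assumed from the existing lists):

import pandas as pd

rows = []
# inside the loop: one dict per iteration instead of six parallel appends
rows.append({"pct_first": percentage_first, "homo_first": homo_first,
             "pct_second": percentage_second, "homo_second": homo_second,
             "score": scores})
results_df = pd.DataFrame(rows)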
I have changed to using a dataframe to collect all clustering results. These lists are deleted.
pct_collect_first.append(percentage_first)
homo_collect_first.append(homo_first)
pct_collect_second.append(percentage_second)
homo_collect_second.append(homo_second)
coll_score.append(scores)
then this will be significantly simplified.
These are deleted.
…model, only for testing, using user1
…e them to the server repo after refactoring
- what is the structure of the user_labels and models data structures?
- reading all user trips is too intensive
- I don't understand why predict is not good enough to directly find the labels and has to go through the first round and second round again
tour_model_eval/load_predict.py
# Check if start/end locations of the new trip and every start/end location in this bin are within the range of
# radius. If so, the new trip falls in this bin. Then predict the second round label of the new trip
# using this bin's model
if in_bin(sel_loc_feat, trip_loc_feat, radius):
    sel_fl = fl
    break
This is too intensive for prediction. Basically, for prediction, you do not want to write any code that iterates over all the trips for a user; it will not scale.
"It will not scale" means that if you have a user with data for 6 months, the server will crash or the computation will take hours.
I didn't actually run all trips. I saved the coordinates of all common trips in the build and save step. My logic here is to compare the new trip's coordinates with the other trips' to see which bin it should fall in.
The other way I can think of is finding a representative trip for every bin, then comparing it with the new trip. But there is a risk that the new trip matches a representative trip but not all the existing trips in that bin. (I thought that a new trip would match all trips in a bin if it matched one, but that is not certain.)
Do you have another way to see if a new trip can fall in any bin?
There is no concept of running a trip. A trip is data; you can only run code.
I think that you mean that you are not running the clustering algorithm on all trips.
You are not running the clustering algorithm, but you are still running a for loop over all the trips, aka "iterating over the trips". Iterating over 10k trips to check the distances is still going to take time.
Using a representative trip has the problem that you have outlined: "I thought that the new trips would match all trips in a bin if it matched one, but that would not be certain". A better option seems to be to store a geometric shape representing all the starts and all the ends of the trips. You can then quickly check whether the new trip is within the shape.
The simplest shape is probably a circle. A potential algorithm (sketched in code after this list) might be:
- get all the points
- find the centroid
- find the max distance of any point from the centroid
- use that as the circle representing the points
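A minimal sketch of that hacky circle, assuming a bin's start (or end) points come in as an (N, 2) numpy array:

import numpy as np

def bounding_circle(points):
    # centroid of all points; max distance to any point is the radius
    center = points.mean(axis=0)
    radius = np.linalg.norm(points - center, axis=1).max()
    return center, radius

def new_trip_in_circle(new_point, center, radius, match_radius):
    # the new trip matches the bin if it falls within the (padded) circle
    return np.linalg.norm(np.asarray(new_point) - center) <= radius + match_radius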
You can also use a minimum enclosing circle algorithm/implementation instead of the hacky solution above.
https://www.nayuki.io/page/smallest-enclosing-circle
Actually, you can't use that implementation because its license is LGPL and we don't want to taint our BSD-3 license.
An alternative is OpenCV, which uses an Apache license.
A lighter weight alternative appears to be https://github.com/luwei14/ShapeAnalyzer
Make sure to update the ThirdPartyLicenses.md file if you include a new library
This has a super simple example using scipy
https://www.cs.au.dk/~gerth/ipsa18/slides/optimization.pdf
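As I read those slides, the scipy version minimizes the maximum distance from a candidate center to any point; a sketch under that assumption:

import numpy as np
from scipy.optimize import minimize

points = np.random.rand(20, 2)   # placeholder for a bin's coordinates

def max_dist(center):
    return np.linalg.norm(points - center, axis=1).max()

# the objective is non-smooth, so a derivative-free method is safer
res = minimize(max_dist, points.mean(axis=0), method="Nelder-Mead")
center, radius = res.x, max_dist(res.x)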
…with geometric shape
tour_model_eval/load_predict.py
# e.g. {'0': [[start lon1, start lat1, end lon1, end lat1], [start lon, start lat, end lon, end lat]]}
# another explanation: - '0': label from the 1st round
#                      - the value of key '0': all trips in this bin
#                      - for every trip: the coordinates of start/end locations
bin_locations = loadModelStage('locations_' + str(user))[0]
this is going to be completely changed in the next commit, so ignoring for now.
tour_model_eval/load_predict.py
second_label_ls = list(seccond_round_result.keys())
if sel_sl not in second_label_ls:
    return {}
return []
why do you need this? I think you can directly use sel_ls not in second_round_result
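For example:

>>> seccond_round_result = {'0': 'a', '1': 'b'}
>>> '0' in seccond_round_result   # 'in' checks dict keys directly
True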
Because Gabriel's function needs a list, I followed the result format from placeholder_predictor_2(trip)
this is not about the return type. It is about the need for second_label_ls = list(seccond_round_result.keys())
Oh ok, I see your point.
I don't see any new design problems here.
- You still need to fix the locations model.
- The main function is way too large (in general, it is good to have functions be ~1 page long), but I will not force you to make this change now.
- You tend to treat dataframes as lists and iterate over them; using vectorized operations is more efficient - e.g. https://towardsdatascience.com/efficient-pandas-apply-vs-vectorized-operations-91ca17669e84
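A small illustration of the difference (toy data, not the project's dataframes):

>>> import pandas as pd
>>> df = pd.DataFrame({"distance": [120, 40, 300]})
>>> [d > 100 for d in df["distance"]]   # list-style iteration
[True, False, True]
>>> (df["distance"] > 100).tolist()     # vectorized comparison
[True, False, True]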
tour_model_eval/build_save_model.py
labels_columns = user_label_df.columns.to_list()
one_set_labels = {}
for i in range(len(unique_labels)):
    # e.g. labels_only={'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}
    labels_only = {column: unique_labels.iloc[i][column] for column in labels_columns}
why not just use the built-in to_dict?
Suggested change:

# e.g. labels_only=[{'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}]
labels_only = user_label_df.to_dict('records')
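For reference, pandas' to_dict('records') produces exactly that list-of-dicts shape:

>>> import pandas as pd
>>> unique_labels = pd.DataFrame([{'mode_confirm': 'pilot_ebike',
...                                'purpose_confirm': 'work',
...                                'replaced_mode': 'walk'}])
>>> unique_labels.to_dict('records')
[{'mode_confirm': 'pilot_ebike', 'purpose_confirm': 'work', 'replaced_mode': 'walk'}]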
tour_model_eval/build_save_model.py
# in case the append() method changes the dict, we use deepcopy here
labels_set = copy.deepcopy(one_set_labels)
sec_label_obj.append(labels_set)
which append method would change the dict?
line 126 append(). This will change the dict if the dict is not an entirely new object. If I directly append one_set_labels, the values in one_set_labels will be changed.
really? because that is not how python dicts or lists work. see below.
If one_set_labels is changed, there is some other reason; you need to understand and explain it.
>>> sec_label_obj = []
>>> one_set_labels = {}
>>> one_set_labels["labels"] = {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}
>>> one_set_labels["p"] = 1.0
>>> sec_label_obj.append(one_set_labels)
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 1.0}]
>>> one_set_labels
{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 1.0}
Here you are using one dict. But as you append more dicts with the same name (one_set_labels), the previously added dict will be changed.
>>> test_list = []
>>> a={'q':1,'w':[]}
>>> test_list.append(a)
>>> print(test_list)
[{'q': 1, 'w': []}]
>>> a['q']=2
>>> a['w'].append(123)
>>> test_list.append(a)
>>> print(test_list)
[{'q': 2, 'w': [123]}, {'q': 2, 'w': [123]}]
Several days ago, I found some explanations in Chinese. Someone said that what is appended to the list is the address of the dict; it doesn't create a new object in the list. So, every time we append a dict with the same name, it actually points to the same address.
Also, there is a difference between copy() and deepcopy(): copy() is a shallow copy.
>>> a={'q':1,'w':[]}
>>> b=a.copy()
>>> b['q']=2
>>> b['w'].append(123)
>>> print(a)
>>> print(b)
{'q': 1, 'w': [123]}
{'q': 2, 'w': [123]}
If we change to deepcopy(), we get an entirely new object.
>>> a={'q':1,'w':[]}
>>> b=copy.deepcopy(a)
>>> b['q']=2
>>> b['w'].append(123)
>>> print(a)
>>> print(b)
{'q': 1, 'w': []}
{'q': 2, 'w': [123]}
Here you are using one dict. But as you append more dicts with the same name (one_set_labels), the previously added dict will be changed.
This is not true if you do it "pythonically"
>>> sec_label_obj = []
>>> for p in range(0,10,1):
... one_set_labels = {}
... one_set_labels["labels"] = {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}
... one_set_labels["p"] = p/10
... sec_label_obj.append(one_set_labels)
...
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.0}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.1}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.2}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.3}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.4}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.5}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.6}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.7}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.8}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.9}]
This style is more pythonic because it lends itself to list comprehensions, use with itertools, functional style programming, etc - e.g.
>>> sec_label_obj = [{"labels": {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, "p": p/10} for p in range(0,10,1)]
>>> sec_label_obj
[{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.0}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.1}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.2}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.3}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.4}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.5}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.6}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.7}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.8}, {'labels': {'mode_confirm': 'walk', 'replaced_mode': 'walk', 'purpose_confirm': 'exercise'}, 'p': 0.9}]
Several days ago, I found some explanations in Chinese. Someone said that what is appended to the list is the address of the dict; it doesn't create a new object in the list. So, every time we append a dict with the same name, it actually points to the same address.
@corinne-hcr I am aware of the difference between shallow copy and deepcopy. But the difference matters primarily when you modify objects that you have appended to the list (e.g. b['q']=2); merely appending the object doesn't change it. In general, modifying objects that you are appending to the list is bad practice because the code cannot be easily refactored using the alternate styles I outlined above.
Where are you modifying objects in your code?
I think I understand the difference between your output and mine. I created one_set_labels = {} before the loop, so I didn't get a new object each time I changed the values. You created one_set_labels = {} inside the loop, so you get a new object every time. But I am going to change the structure. Glad to figure it out!
Creating one object outside the loop and then deepcopying in the loop is not pythonic. It will make it harder to maintain the code in the future.
@corinne-hcr I still see the deepcopy here
Co-authored-by: shankari <shankari@eecs.berkeley.edu>
@corinne-hcr just to clarify, I don't plan to merge this PR at this point; you can continue working on the refactoring and let me know when you are completely done. I will copy the modeling code to the server repo and will be working there directly. I will submit a separate PR there as well. I don't think you run the modeling code from the notebook, but if you do, I can just write a harness anyway. P.S. I am not sure why you included the model creation and prediction code in this repo instead of the server repo, since that is its final destination anyway.
I don't think the notebook calls the modeling code. But the notebook generates results and parameters, and the modeling code needs the parameters.
@corinne-hcr I have already started working on moving the modeling code over. You are welcome to refactor the modeling code in parallel if you like, as long as you can meet your other commitments around finishing the refactoring and the unit tests, but I cannot wait until you are done. The modelling code should generate the results and parameters internally; there cannot be a notebook sitting around when it is running in production.
@corinne-hcr you may want to move over the
But if the modeling code needs to generate the best parameters, it has to run through the same steps as the notebook code. That takes a long time, and it is repeated work.
@corinne-hcr is there an ETA for copying the modeling files over? otherwise, I am just going to do it myself.
Do you mean moving everything to the server repo except the notebook code?
Yes. There are also some differences between the same files in the server repo and this repo; we need to have the most recent version (whichever that is) in the server repo - e.g.
The notebook is primarily for visualization and interactive exploration. I think you are arguing that we should pick parameters separately (in the notebook) from building the clusters (in the model). That is a good idea; we might want to generate the optimal parameters only once every 6 months or something. Three points:
Again, for this first implementation, tonight, we can even use hardcoded parameters that you got from the notebook. But it is not a long-term solution. And moving the code to the server does not preclude importing it and running it in a notebook. So let's move things over to the server, please. Right now.
@corinne-hcr is there an ETA for resolving the differences between this repo and the server and moving the other .py files over? I can take a break for lunch to give you some time to work on this, but I can't wait any longer. If I copy the files over, they will appear to have been created by me instead of you.
The code has been moved, but the PR checks are still in progress. e-mission/e-mission-server#829
Not sure what this would look like; please submit a commit showing your idea.
…g code in evaluation_pipeline
I have updated the notebook code here
"participant_uuid_obj = list(edb.get_profile_db().find({\"install_group\": \"participant\"}, {\"user_id\": 1, \"_id\": 0}))\n", | ||
"all_users = [u[\"user_id\"] for u in participant_uuid_obj]" |
@corinne-hcr The participant UUID check was specific to the CanBikeCO mini-pilot - it is not true for other programs or datasets. I have created a new mongodump of only the participant data, you should load that instead and change this to the call from emission.storage.timeseries.abstract_timeseries
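If I understand the suggestion, the replacement would look something like this (assuming the get_uuid_list helper on TimeSeries; verify against the server code):

import emission.storage.timeseries.abstract_timeseries as esta
all_users = esta.TimeSeries.get_uuid_list()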
"# get all/valid user list\n", | ||
"user_ls, valid_users = gu.get_user_ls(all_users, radius)\n", | ||
"# get all filenames of clustering result\n", | ||
"collect_filename = predict.loadModelStage(\"collect_filename\")" |
I don't understand why you need to have the filenames in a list. Can't you just read the file based on the UUID? Can you clarify your design decision here?
Oh, I think this one can be deleted. I did use the UUID to save the files. The original design was to save the filename, but that seems unnecessary now.
@corinne-hcr I don't think you have addressed all of my comments. I am going to address most of them myself, but there are a few really small and clear ones that it would be good for you to fix. Also, what is the difference between the two notebooks? A short markdown summary would be useful.
I think I have addressed your comments. Why do you see no difference in the notebook? Please see commit bfd6ba6
I still see the
I'm not saying that there is no difference. I'm saying that there are two
That's in the
Fine, I will refactor the code in the
@corinne-hcr is there an ETA for updating the two notebooks so I can merge this?
I am facing a problem now.
The file can be read by the
You know the reason why it has this error and have already identified it below.
You need to specify the path to the files so they can be read correctly.
…o read all data from the database in the notebook
I think you can merge the code now.
There is so much refactoring left for this code... but making the changes has now taken around two months.
I deleted the
* check in notebook code
* update notebook
* update notebook code, the previous one was not the latest one
* add some functions
* refactor evaluation_pipeline code, pull out get_socre function from get_tuning_score.py to get_scores.py
* add kmeans after running hierarchical clustering for re-building the model, only for testing, using user1
* adding kmeans only in the test step
* check in codes for generating result for Gabriel's function, will move them to the server repo after refactoring
* change the line of importing label_processing
* address the problems from the previous commit, but has not yet done with geometric shape
* Update tour_model_eval/build_save_model.py (Co-authored-by: shankari <shankari@eecs.berkeley.edu>)
* refactored notebook code, not done with plot
* update build_save_model according to notebook refactoring
* check in the changes in the notebook, have put the original clustering code in evaluation_pipeline
* delete output from the notebook
* modify test notebook and add comments on it, remove extraneous files
* add plot code, read filename directly from user id, add another way to read all data from the database in the notebook

Co-authored-by: shankari <shankari@eecs.berkeley.edu>