Cluster trips between pairs of common places to form common trips #606

Open · shankari opened this issue Jan 25, 2021 · 133 comments

@shankari

shankari commented Jan 25, 2021

  • Featurize trips between pairs of common places
  • Explore classification and clustering algorithms to bin them into similar trip buckets
  • Determine metrics for trip assignment accuracy and user input
  • Implement metrics
  • Visualize metrics and tradeoff
  • Tune algorithms at the tradeoff points
@shankari

shankari commented Feb 25, 2021

Leidy's suggestion was to calculate the difference in start time between all pairs of trips (similar to the difference between start locations for all pairs of trips) and then to include that in the binning.

An alternate suggestion is to just use the "hour" field for the trips. But that may have boundary effects. Need to test.
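
As a rough illustration of the two options (a sketch only; the trips DataFrame, its start_ts column in Unix seconds, and the toy values are all assumptions, not the actual pipeline code):

import numpy as np
import pandas as pd

trips = pd.DataFrame({"start_ts": [1611570600, 1611574200, 1611657000]})  # toy data

# Pairwise absolute start-time differences, analogous to the pairwise
# start-location distances already used for binning.
start = trips["start_ts"].to_numpy()
start_time_diffs = np.abs(start[:, None] - start[None, :])  # seconds, shape (n, n)

# The "hour" alternative: trips at 8:59 and 9:01 land in different buckets,
# which is the boundary effect mentioned above.
hours = pd.to_datetime(trips["start_ts"], unit="s").dt.hour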

@shankari

shankari commented Mar 7, 2021

My suggestion is:

  • pick a user who has a lot of queries
  • pick a bin for that user that has a large number of trips but a low v-score
  • manually look at the trips in that bin and see which features can separate the actual labeled trips into finer-grained bins

@shankari

shankari commented Mar 7, 2021

examples of features:

  • distance (which is the distance along the route between start and end)
  • duration
  • time of day (e.g. 9:15, 9:30)

etc

@corinne-hcr

corinne-hcr commented Mar 18, 2021

Adding more features for clustering doesn't work better; the v-score gets worse.
I am trying to run a second clustering step within one cluster.
For example, for the trips labeled 1, if I cluster them further, what are their labels?
I assume the trips previously labeled 1 would now get labels 0,1,2,3. Would that overwrite the previous label?
e.g. there are 5 trips, all labeled 1, and I cluster them further.
In the second round, they might get labels 1,1,2,0,0. Would the new labels overwrite the previous label and conflict with other, bigger clusters (which got label 2 in the first round)?

@shankari

The clustering algorithm will almost certainly return labels like 1,1,2,0,0 in the second round.
However, that does not mean that you have to use those labels directly for the trips.

You could choose to set the trip labels to a concatenation of the first round and second round labels. Or other such combinations.

So in your example above, the labels you set for the trips would be 11, 11, 12, 10, 10
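
A minimal sketch of that concatenation (the label lists are just the example values above):

first_round_labels = [1, 1, 1, 1, 1]
second_round_labels = [1, 1, 2, 0, 0]

combined_labels = [int(f"{f}{s}") for f, s in zip(first_round_labels, second_round_labels)]
print(combined_labels)  # [11, 11, 12, 10, 10]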

@shankari

shankari commented Mar 18, 2021

I wonder if we should try homogeneity as a metric instead of v-score, especially with the smaller clusters.

For example, if I have 20 trips, 10 with label l1 = (pilot-ebike, work, walk) and 10 with label l2 = (drove_alone, shopping, bike):

  1. If they are split into two clusters (10 l1 + 10 l2) with perfect labels, the v-score will be great.
  2. But if they are split into 4 clusters (5 l1 + 5 l1 + 5 l2 + 5 l2), that would still be fine for us, just with smaller clusters; it would just generate more user requests. If we got a new trip and assigned it to either of the first two clusters, we would get l1 as the label, so that would be fine.

From a homogeneity perspective, this is still a score of 1.0. But I think that the v-score will be lower because all the trips with l1 are not in the same cluster. So maybe we don't need to use the v-score after all, and the completeness will be represented by the user requests instead.

Concretely, the completeness of option (2) above would be bad because all the trips with l1 are not in the same cluster. But we have a different metric that represents the badness of trips with the same label being put into different clusters - the user requests. So if we have user requests, which is a better metric for us, we don't really need completeness.

It is possible that the trips will be split into (5 l1 + 10 (l1+l2) + 5 l2) as well. In that case, the homogeneity of the middle cluster will be bad, so the accuracy will also be lower. If we get a new trip and it is assigned to the first or last cluster, it will be accurate. If it is assigned to the middle cluster, it has a 50:50 chance of being accurate. So again, homogeneity captures the actual behavior we want to represent.
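
The intuition above can be checked directly with sklearn's metrics; a sketch using the 20-trip example (the cluster assignments are illustrative):

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

true_labels = ["l1"] * 10 + ["l2"] * 10

# Option (2): each label is split across two pure clusters.
pred_four_clusters = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5
print(homogeneity_score(true_labels, pred_four_clusters))   # 1.0 - every cluster is pure
print(completeness_score(true_labels, pred_four_clusters))  # < 1.0 - l1 spans two clusters
print(v_measure_score(true_labels, pred_four_clusters))     # < 1.0, pulled down by completeness

# The mixed split (5 l1 + 10 mixed + 5 l2): now homogeneity drops too.
pred_mixed = [0] * 5 + [1] * 10 + [2] * 5
print(homogeneity_score(true_labels, pred_mixed))           # < 1.0 - the middle cluster is 50/50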

@shankari

This is what I mean by using the trip id

In [4]: import pandas as pd

In [7]: x = pd.DataFrame({"id": ["abc123", "bcd123", "cde123", "def123"], "distance": [10,20,30,40]})

In [8]: x
Out[8]:
       id  distance
0  abc123        10
1  bcd123        20
2  cde123        30
3  def123        40

In [18]: cluster_labels = pd.Series([1,2,1,2], name="cluster_labels")

In [19]: pd.concat([x,cluster_labels], axis=1)
Out[19]:
       id  distance  cluster_labels
0  abc123        10               1
1  bcd123        20               2
2  cde123        30               1
3  def123        40               2

@corinne-hcr

I have some questions about testing the algorithm.
For example, I have a list of common trip data and a list of labels like [1,1,10,12,13,2,2,20,21]. If there is a new, unlabeled trip, how do we decide whether to group it into one of the known common clusters? Should we use classification? If so, why didn't we use classification at the beginning?

If we are not using classification for new trips, how does the new label list [1,1,10,12,13,2,2,20,21] work? It is not the list from kmeans.labels_. If I use kmeans.predict on new trip data, it will label it based on the labels from kmeans.labels_. What do we do if the new trip doesn't belong to the common trips?

@shankari

I assume you would use a two step prediction as well. So call predict_ on the first level to find the first digit (e.g. 1) and then cluster again to find the second digit (e.g. 0).

I should also note that you won't have a list of labels like [1,1,10,12,13,2,2,20,21]. Once you have split cluster 1 into 10, 12, 13, you will only use the raw 1 as a first part to the full label (10,12 or 13).

@shankari

shankari commented Mar 19, 2021

Have you looked at https://en.wikipedia.org/wiki/Hierarchical_clustering
Implemented in sklearn already
https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering

and apparently in SciPy as well

@shankari

I don't think that sklearn has divisive clustering, which is what we are looking for, but I see that there are some other implementations...
https://github.com/topics/divisive-clustering

No need to reinvent the wheel...

@corinne-hcr

I should also note that you won't have a list of labels like [1,1,10,12,13,2,2,20,21]. Once you have split cluster 1 into 10, 12, 13, you will only use the raw 1 as a first part to the full label (10,12 or 13).

I don't understand this. Do you mean I keep the original labels - [1,1,1,1,1,2,2,2,2] for the first round, [0,1,2,3,3] for label 1 in the second round - but don't generate a new label list for all trips for predicting a new trip? (I think I need a new list for evaluating the homogeneity score.)

If so, how do I know whether there is a second round or more for label 1?

If the new trip was labeled 1 in the first round and went to the second round, it would be clustered again with the other trips labeled 1 to find the second digit, right? And if, in the second round, the new trip was assigned a new label different from the other trips, it would need to ask the user, right?

@shankari

I don't understand your question either. At the end of the first round, your labels are [1,1,1,1,1,2,2,2,2]. At the end of the second round, your labels will be two-digit - e.g. [11,11,12,12,12,21,21,22,22]. So if you have done two-round clustering, you should never have a single-digit label - that's all I said.

@shankari

If so, how do I know if there is a second round or more for label1?

You can assume that labels will always be two digit. If a cluster is homogenous enough that you don't want to run a second round, just assume they are all in one cluster for the second round.

So if cluster 1 is homogenous, then call it 11. That is the only two digit label starting with 1 that you will have.

@corinne-hcr

Then [11,11,12,12,12,21,21,22,22] is not for predict_, right? It is for evaluating homogeneity score?

@shankari

why would it not be for predict_? It would just be the result of two sequential predict_s

@corinne-hcr

I think [11,11,12,12,12,21,21,22,22] is a new list; it doesn't come from either round 1 or round 2. So 11 itself cannot be used for predict_; only 1 or 2 is a valid label for the first round. If the new trip is labeled 1 in the first round, I think I need to collect the set of second digits. If len(set) is > 1, then the new trip goes to the second round.

@shankari

Why is this not the algorithm?

  • fit a model to all the trips
  • you get back k clusters
  • for each cluster, build a second level model
  • when you get a new trip, call predict on the top-level model to find the first cluster e.g. (1)
  • then call predict on the second level model for cluster 1
  • you will get a second level label (say 3)

The trip is now in cluster 13

@shankari

shankari commented Mar 20, 2021

To fit models

first_level_model = kmeans.fit(above_cutoff_trips)
second_level_models = []
for i in first_level_model.clusters:
     second_level_models.append(kmeans.fit(first_level_model.trips))

print(first_level_model)
print(second_level_models)

To predict with the models

first_level_label = first_level_model.predict(incoming_trip)
sel_second_level_model = second_level_models[first_level_label]
second_level_label = sel_second_level_model.predict(incoming_trip)
final_label = str(first_level_label) + str(second_level_label)

@corinne-hcr

corinne-hcr commented Mar 21, 2021

For first_level_model, what do you expect it to be?
I only got KMeans(n_clusters=23, random_state=8) as the first_level_model.

For first_level_model.clusters, should it be the set of labels? e.g. [0,1,2,3,4]

For first_level_model.trips, do you mean all trips that are labeled 1 (if I choose label 1)? If so, I think I should create a new list of the trips labeled 1 in the first round and pass it in as first_level_model.trips.

If the homo score is > 0.9, how do I have 1 cluster for the second round? For min_cluster = 1, the error is

Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

It comes from silhouette_score.

@shankari

If the homo score is > 0.9, how do I have 1 cluster for the second round? For min_cluster = 1, the error is

You skip the clustering in the second round, and just assign the second digit as 1. So after the first round, if cluster 5 has a score > 0.9, then after the second round it will have label 51. The second round will just be a NOP.

@shankari

I only got KMeans(n_clusters=23, random_state=8) as the first_level_model

That is the model - it is just represented that way. If you call predict on it, it should work. For the record, you can even store the model into a file. If you restore it, it will still work
https://scikit-learn.org/stable/modules/model_persistence.html

For first_level_model.clusters, should it be the set of labels? e.g. [0,1,2,3,4]

In the pseudo code, it can either be the set of labels or the number of clusters. Note that I don't actually use i anywhere
You can choose whichever option works better when you go from pseudocode to code

For first_level_model.trips, do you mean all trips that are labeled 1 (if I choose label 1)?

Yes.

If so, I think I should create a new list of the trips labeled 1 in the first round and pass it in as first_level_model.trips.

Sure. You can also just filter the dataframe and pass it in, which is likely to be more efficient.

Again, as I emphasized yesterday, what I wrote was pseudo-code, and not expected to work exactly as written. Feel free to edit to make it actually work.
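
For reference, one possible runnable version of the pseudo-code (assumptions: above_cutoff_trips is a plain numeric feature matrix, the cluster counts are fixed, and tiny clusters get a NOP second round; this is an illustrative sketch, not the actual implementation):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
above_cutoff_trips = rng.random((50, 4))  # stand-in feature matrix

first_level_model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(above_cutoff_trips)

second_level_models = {}
for label in np.unique(first_level_model.labels_):
    cluster_trips = above_cutoff_trips[first_level_model.labels_ == label]
    n_sub = 2 if len(cluster_trips) > 2 else 1  # NOP second round for tiny/homogeneous clusters
    second_level_models[label] = KMeans(n_clusters=n_sub, random_state=0, n_init=10).fit(cluster_trips)

# Prediction is two sequential predicts, concatenated into a two-digit label.
incoming_trip = rng.random((1, 4))
first_label = int(first_level_model.predict(incoming_trip)[0])
second_label = int(second_level_models[first_label].predict(incoming_trip)[0])
final_label = str(first_label) + str(second_label)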

@shankari

Also, given that you have spent multiple days on this already, are you sure that it wouldn't be faster to just use the existing implementation of divisive clustering?

@shankari

shankari commented Mar 21, 2021

To answer my own question, one advantage of building a homegrown solution over the existing implementation may be around controlling the features for each round. In the divisive clustering algorithm, we specify all features at the same time. With the homegrown solution, we can cluster on the start and end locations first, and then add additional features for the second round.

@corinne-hcr ideally, you would be able to articulate a reason like this for why divisive clustering is not a good fit. We should certainly use this as justification in the paper going forward

@corinne-hcr

corinne-hcr commented Mar 21, 2021

I am a bit confused. Do you mean that the existing divisive clustering algorithm needs all the features to be passed in at the beginning, but the homegrown solution can use location for the first round and then add other features for the second round?

I got two charts that show the homogeneity score for the first round in different situations.
The first one uses the homogeneity score to control the number of clusters. max_clusters is 2*len(bins), so the number of clusters reaches 2*len(bins).
[chart: first-round homogeneity scores with the cluster count controlled by the homogeneity score]

The second one uses the silhouette score to control the number of clusters. max_clusters is 1.5*len(bins), but the number of clusters stays below max_clusters.
[chart: first-round homogeneity scores with the cluster count controlled by the silhouette score]

From these two charts, we can see that using only locations in the first round works better than adding other features.

@shankari

shankari commented Mar 21, 2021

Let me ask that as a question. You said that the implementation here
https://github.com/topics/divisive-clustering
"does not fit".

Why?

If you choose not to use a standard, well-known method and come up with a new algorithm instead, you need to be able to explain why. https://en.wikipedia.org/wiki/Not_invented_here is not a good enough reason.

You are not using the existing DIANA algorithm and are implementing your own two-step algorithm.

So what is the reason?

@corinne-hcr

corinne-hcr commented Mar 24, 2021

After finding some code examples of DIANA, I figure that there is no good metric for deciding the number of clusters.

The metrics in sklearn require max_clusters to be at most n_samples - 1; for example, with 5 samples there can be at most 4 clusters. But in our case, some of the first-round clusters can have fewer sub-clusters than trips, while some cannot (the trips inside such a cluster would have to be divided into sub-clusters with 1 trip each).

For the DIANA algorithm, I have seen one example that keeps dividing trips until there is 1 trip per cluster (from the GitHub link you posted), and one example that is based on the number of clusters that the programmer passes in (for example, setting k=3, the algorithm stops when there are 3 clusters). Neither way works for us (we don't know how many clusters would be enough).

Here is my idea. I can use the homogeneity score for a fake second-round clustering, then collect the differences in points (locations - basically reusing Naomi's code to find the radius), distance, and duration within one cluster (so that I know how close trips need to be to end up in one cluster). After that, we use the median of the differences across all users. We will end up with, for example, a location radius of 10 meters, a distance difference of 50 meters, and a duration difference of 200s. Then we use those as thresholds for the real second-round clustering. I can then write the code like Naomi's binning, without using an existing clustering algorithm. It could be more accurate than using kmeans or DIANA, and it runs even faster than sklearn according to our previous implementations. What do you think?

@shankari

@corinne-hcr for the implementation which keeps dividing trips, that is building out a full tree. It is actually a fairly common strategy for hierarchical clustering algorithms. You can then choose to cut off the tree at any point.

https://en.wikipedia.org/wiki/Hierarchical_clustering

The results of hierarchical clustering[2] are usually presented in a dendrogram.

Cutting the tree at a given height will give a partitioning clustering at a selected precision. In this example, cutting after the second row (from the top) of the dendrogram will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number but larger clusters.
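
A small sketch of that "build the full tree, then cut it" pattern with scipy (random points as a stand-in for the trip features):

import numpy as np
import scipy.cluster.hierarchy as sch

rng = np.random.default_rng(0)
features = rng.random((20, 2))  # stand-in for the trip feature matrix

z = sch.linkage(features, method="single", metric="euclidean")  # the full tree

# Cutting lower gives more, finer clusters; cutting higher gives fewer, coarser ones.
fine_labels = sch.fcluster(z, t=0.1, criterion="distance")
coarse_labels = sch.fcluster(z, t=0.5, criterion="distance")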

@shankari

I see that you are coming up with ad-hoc ideas on how to work with the data. This is great, but you want to be very careful that you don't reinvent the wheel. Particularly for externally peer-reviewed papers, it is not sufficient to say "I did X", you need to put your work into the context of the existing literature so it is easier for people to understand what you are doing, and you can highlight how your work is novel.

Concretely, how is your idea "then collect the differences in points(locations, basically reusing Naomi's code, finding the radius), distance, and duration in one cluster (so that I will know how close they are in order to be in one cluster). " different from using distance metrics for the clustering algorithms?

Putting your work into the context of the existing work also allows you to avoid pitfalls like using the labels in the training step.

Can you try to map your problem into the standard hierarchical clustering model?

Note that the model (per wikipedia is):

In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

  • what is the data?
  • what are the distances?
  • what are the metrics?
  • what is the measure of linkage?

@shankari

shankari commented Jul 9, 2021

If we have a separate predict method, predicting the cluster for a new trip is:

  • extract features for the trip
  • load the saved model (which has one model for each bin)
  • find the model for the bin that the trip is in (first round)
  • call predict on the model (second round)

We do not need to call fit for the prediction step.

Note that this can actually be optimized to:

  • load the saved model (which has one model for each bin)
  • for each new trip:
    • extract trip features
    • find the bin that the trip would be in (first round)
    • call predict on the model for the bin (second round)
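
A sketch of that optimized loop (SAVED_MODEL_FILENAME, extract_features, and find_bin are placeholders for whatever the real pipeline uses, and the per-bin model dict is an assumed layout, not the actual saved format):

import jsonpickle as jpickle

def predict_labels_for(new_trips):
    # load the saved model once: assumed to be a dict of bin id -> fitted second-round model
    with open(SAVED_MODEL_FILENAME, "r") as fd:
        per_bin_models = jpickle.loads(fd.read())

    predictions = []
    for trip in new_trips:
        features = extract_features(trip)   # hypothetical feature extraction
        bin_id = find_bin(features)         # first round: which bin does the trip fall into?
        if bin_id is None:
            predictions.append({})          # novel trip: nothing to predict
            continue
        second_label = per_bin_models[bin_id].predict([features])[0]
        predictions.append({"bin": bin_id, "sub_cluster": int(second_label)})
    return predictions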

@corinne-hcr

corinne-hcr commented Jul 9, 2021

Just confirming that sklearn returns the same result when I pass in the correct distance_threshold.

import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

sel_features = ...  # per-trip feature matrix: location coordinates, distance, duration
# sel_features.append(new_trip_feat)
low = 250
dist_pct = 0.22999999999999995
# scipy
z = sch.linkage(sel_features, method=method, metric='euclidean')  # method is set elsewhere; 'single' would match the sklearn call below
last_d = z[-1][2]

if last_d < low:
    distance_threshold = None
    # n_clusters for sklearn param
    n_clusters = 1
else:
    distance_threshold = dist_pct * last_d 
    n_clusters = None           
clusters = sch.fcluster(z, distance_threshold, criterion='distance')
print('clusters from scipy',clusters)
# n_clusters = len(set(clusters))
# print('n_clusters',n_clusters)

# sklearn
clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='single',
                                     distance_threshold=distance_threshold,
                                     compute_full_tree=True).fit_predict(sel_features)
print('clusters from sklearn',clustering)

Here is the result

clusters from scipy [1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1]
clusters from sklearn [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]

@shankari

shankari commented Jul 9, 2021

@corinne-hcr I would suggest that you push your code to a draft PR periodically so I can see it and potentially run it if I have ideas on how to fix it. Just make sure not to push the results since they are privacy sensitive - clear all outputs before pushing.

@corinne-hcr

corinne-hcr commented Jul 10, 2021

For DBSCAN, there are two key parameters - eps and min_samples. But these two parameters could be different for different bins. For a bin that has 3 trips, min_samples could be 3, but for a bin that has 10 trips, min_samples could be 4. It is not feasible to find one cutoff for all bins.

For affinity propagation, it takes totally different parameters, and I don't see much explanation of the preference setting. I did try damping in range(0.5, 1) on different bins; the results are different from agglomerative clustering or kmeans, but sometimes they are the same. Some examples online show that this algorithm takes much longer than kmeans, so it is not suitable for a large dataset.

For kmeans, I run hierarchical clustering first to get n_clusters, then pass that into kmeans. When I try this for user 1, I get the same results as from hierarchical clustering. The only problem is that it takes more than triple the time to run a user. Hierarchical clustering takes around 20 mins for user 1, but after adding kmeans to re-build the model, it takes more than 1 hr and 20 mins.

Here is what I run. I picked two bins to experiment with.

import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering, KMeans, AffinityPropagation

sel_features = ...  # per-trip feature matrix: location coordinates, distance, duration
low = 250
dist_pct = 0.22999999999999995
# scipy
z = sch.linkage(sel_features, method=method, metric='euclidean')
last_d = z[-1][2]

if last_d < low:
    distance_threshold = None
    # n_clusters for sklearn param
    n_clusters = 1
else:
    distance_threshold = dist_pct * last_d 
    n_clusters = None           
clusters = sch.fcluster(z, distance_threshold, criterion='distance')
print('clusters from scipy AgglomerativeClustering',clusters)

# sklearn
clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='single',
                                     distance_threshold=distance_threshold,
                                     compute_full_tree=True).fit(sel_features)
clusters_sklearn = clustering.labels_
print('clusters from sklearn AgglomerativeClustering',clusters_sklearn )

# kmeans
n_clusters = len(set(clusters))
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(sel_features)
k_clusters = kmeans.labels_
print('clusters from kmeans',k_clusters)

# AffinityPropagation
for i in np.arange(0.5, 1, 0.1):
    damping = i
    ap_clustering = AffinityPropagation(random_state=0,damping=damping).fit(sel_features)
    ap_clusters = ap_clustering.labels_
    print('clusters from AffinityPropagation',ap_clusters)

Results from bin 1

clusters from scipy AgglomerativeClustering [1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1]
clusters from sklearn AgglomerativeClustering [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
clusters from kmeans [1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1]
clusters from AffinityPropagation [0 3 0 1 0 0 0 0 2 3 4 0 5 0 3 0]
clusters from AffinityPropagation [0 3 0 1 0 0 0 0 2 3 4 0 5 0 3 0]
clusters from AffinityPropagation [0 3 0 1 0 0 0 0 2 3 4 0 5 0 3 0]
clusters from AffinityPropagation [0 3 0 1 0 0 0 0 2 3 4 0 5 0 3 0]

Result from bin 2

clusters from scipy AgglomerativeClustering [1 2 1 3 1 1 5 1 1 1 1 1 4]
clusters from sklearn AgglomerativeClustering [0 4 0 3 0 0 1 0 0 0 0 0 2]
clusters from kmeans [0 3 0 2 0 0 1 0 0 0 0 0 4]
clusters from AffinityPropagation [3 0 3 1 3 3 2 3 3 3 3 3 4]
clusters from AffinityPropagation [3 0 3 1 3 3 2 3 3 3 3 3 4]
clusters from AffinityPropagation [3 0 3 1 3 3 2 3 3 3 3 3 4]
clusters from AffinityPropagation [3 0 3 1 3 3 2 3 3 3 3 3 4]
clusters from AffinityPropagation [1 1 1 3 1 1 0 1 2 1 1 1 3]

Here is the result from user 1
scipy AgglomerativeClustering

user 1 filter_trips len 207
all_percentage_first_test [[0.57, 0.642, 0.608, 0.62, 0.639]]
all_homogeneity_score_first_test [[0.79, 0.696, 0.746, 0.723, 0.76]]
all_percentage_second_test [[0.764, 0.697, 0.819, 0.705, 0.711]]
all_homogeneity_score_second_test [[0.938, 0.792, 0.934, 0.828, 0.833]]
all_scores [[0.587, 0.548, 0.558, 0.561, 0.561]]
all_tradoffs [[(250, 0.22999999999999995), (423, 0.44999999999999984), (250, 0.15), (477, 0.15), (590, 0.3299999999999999)]]

kmeans

user 1 filter_trips len 207
all_percentage_first_test [[0.57, 0.642, 0.608, 0.62, 0.639]]
all_homogeneity_score_first_test [[0.79, 0.696, 0.746, 0.723, 0.76]]
all_percentage_second_test [[0.764, 0.697, 0.819, 0.705, 0.711]]
all_homogeneity_score_second_test [[0.938, 0.792, 0.934, 0.828, 0.833]]
all_scores [[0.587, 0.548, 0.558, 0.561, 0.561]]
all_tradoffs [[(250, 0.22999999999999995), (423, 0.44999999999999984), (250, 0.15), (477, 0.15), (590, 0.3299999999999999)]]

The files for testing user 1 are in e-mission/e-mission-eval-private-data#23

@shankari

shankari commented Jul 10, 2021

I don't understand the difference between the results "from bin1 and bin2" and from "user1".

Responding to the comments about the difference between different clustering algorithms:

For kmeans, I run hierarchical clustering first to get n_clusters, then pass that into kmeans. When I try this for user 1, I get the same results as from hierarchical clustering. The only problem is that it takes more than triple the time to run a user. Hierarchical clustering takes around 20 mins for user 1, but after adding kmeans to re-build the model, it takes more than 1 hr and 20 mins.

First, this is a very surprising result. k-means is known for being fast. From the sklearn comparison:
https://scikit-learn.org/stable/modules/clustering.html

for k-means:

It scales well to large number of samples and has been used across a large range of application areas in many different fields.

while agglomerative clustering says:

AgglomerativeClustering can also scale to large number of samples when it is used jointly with a connectivity matrix, but is computationally expensive when no connectivity constraints are added between samples: it considers at each step all the possible merges.

I don't think that you specify a connectivity matrix in your invocation, so I really question the results. Where is your timing code to print out the execution time?

Some other options to check:

  • Is the number of clusters really high?
  • Have you tried mini-batch k-means, which is supposed to converge faster?
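
For the second point, the swap is a one-liner; a sketch with a stand-in feature matrix (the cluster count and batch size here are illustrative, not the real values):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

sel_features = np.random.default_rng(0).random((500, 4))  # stand-in feature matrix
mbk = MiniBatchKMeans(n_clusters=23, random_state=0, batch_size=256).fit(sel_features)
print(mbk.labels_)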

@shankari

shankari commented Jul 10, 2021

However, having said all that, even if it is true that k-means is much slower than agglomerative clustering, this is only for the model building stage, which we only run once a day or once a week. Unless you prove that the model prediction is slower than 20 mins, it still makes sense to use a clustering method that can be applied as a two-stage process.

We will be using the predict function once an hour, and it will ideally take less than 1 minute for each user.

@shankari

shankari commented Jul 10, 2021

There are other comments I have on the other clustering algorithms, but it is fairly easy to swap them out later since they all follow the same pattern. For now, we can use k-means (or minibatch k-means) and move on with the full system implementation. I am really worried about the code cleanup and the unit testing which I don't see getting done. We are now writing additional code, but don't have the basic testing infrastructure in place.

@corinne-hcr

Right now I pass in n_clusters right after I get the labels from hierarchical clustering, but the tuning step takes a long time. I think I should only add kmeans in the test step. I will modify the code and see if it is better.

@corinne-hcr

Adding kmeans only in the test step, it takes around 7 mins to run two rounds of clustering for user 1. Going to explore how to save the models.

@shankari

Wrt these comments, I want to highlight that it is OK to have results that are different from the agglomerative clustering. The agglomerative clustering results are not ground truth, and we don't have to try very hard to get the same clusters. We just have to check that the tradeoff wrt homogeneity/user requests is not significantly worse.

I did try damping in range(0.5, 1) on different bins; the results are different from agglomerative clustering or kmeans, but sometimes they are the same.
For kmeans, I run hierarchical clustering first to get n_clusters, then pass that into kmeans. When I try this for user 1, I get the same results as from hierarchical clustering.

@corinne-hcr

I have a question about the model filename. In your example, you use SAVED_MODEL_FILENAME. Does that mean the exact filename should be changed when this function runs locally on someone's computer?

  def loadModel():
    fd = open(SAVED_MODEL_FILENAME, "r")
    model_rep = fd.read()
    fd.close()
    return jpickle.loads(model_rep)

@shankari

shankari commented Jul 12, 2021

I don't understand your question. SAVED_MODEL_FILENAME is a variable. It is set to seed_model.json in my code.
https://github.com/e-mission/e-mission-server/blob/master/emission/analysis/classification/inference/mode/seed/pipeline.py#L35

  • you can use any filename that you like simply by changing the variable.
  • Why would it need to be different whether it runs locally on a computer or elsewhere?

@corinne-hcr

OK, I got it. I thought I needed to set a path for the saved model file. I use 'seed_model.json' like you did, and it is saved in the same directory as the notebook. (I use a notebook to test the save, load, and predict steps.)

@corinne-hcr

Just for the record.
For saving the model, here is the example from https://github.com/e-mission/e-mission-server/blob/72d1ee69728cc310c54571ab3b819cf41fbf3d77/emission/analysis/classification/inference/mode/seed/pipeline.py#L331

  def saveModelStep(self):
    model_rep = jpickle.dumps(self.model)
    with open(SAVED_MODEL_FILENAME, "w") as fd:
        fd.write(model_rep)

jpickle.dumps(self.model) is the code to convert the model to a string representation so it can be saved. https://jsonpickle.github.io/api.html

@shankari

I thought I needed to set a path for the saved model file.

If you have an assumption like this, a quick way to check the assumption is to Just Run the Code and see what it does.

@corinne-hcr

corinne-hcr commented Jul 13, 2021

  1. I originally planned to save all filter_trips of the selected split, so that I can retrieve any information I need later. But I don't know how to save all the trips and read them again.
  2. I figure I don't need to save all the trips; as long as I save the user labels, I have sufficient information to predict the new trip.
  3. I think I need to use a dataframe to contain all the user labels of one cluster, then save multiple dataframes. But @shankari thinks that I can just save the dict with user labels and p.
  4. Should I just save the combination of user labels with the max pct, or should I save all combinations of user labels in one cluster in one file?
    e.g.
    {"labels": {"mode_confirm": "walk", "purpose_confirm": "shopping"}, "p": 0.45},
    {"labels": {"mode_confirm": "walk", "purpose_confirm": "entertainment"}, "p": 0.35},
    {"labels": {"mode_confirm": "drove_alone", "purpose_confirm": "work"}, "p": 0.15},
    {"labels": {"mode_confirm": "shared_ride", "purpose_confirm": "work"}, "p": 0.05}

Should I save all of them, or only the first one with p of 0.45?

@shankari

shankari commented Jul 13, 2021

I originally planned to save all filter_trips of the selected split, so that I can retrieve any information I need later. But I don't know how to save all the trips and read them again.
I figure I don't need to save all trips, as long as I save the user labels, I have sufficient information to predict the new trip
I think I need to use a dataframe to contain all user labels of one cluster, then save multiple dataframes. But @shankari thinks that I can just save the dict with user labels and p.

Do you see any problems with this suggestion?

Should I save all of them, or only the first one with p of 0.45?

What do you think? If you only store the combination of user labels with the max pct (e.g. the one with p of 0.45), will you be able to meet the expectations for Gabriel's code?
https://github.com/e-mission/e-mission-server/blob/ef04ba4366997d3643b79faf772dae4729a6aba8/emission/analysis/classification/inference/labels/pipeline.py#L38

The interface we agreed on was for Gabriel to expect a list of label dictionaries with probabilities. This would be the list of label dictionaries associated with the cluster that the trip maps to through the prediction algorithm. So you would need to save all of them.
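
So, per cluster, the saved artifact would look something like the structure below (a sketch of the assumed layout, reusing the example values above; the cluster key is hypothetical):

cluster_user_labels = {
    "cluster_13": [
        {"labels": {"mode_confirm": "walk", "purpose_confirm": "shopping"}, "p": 0.45},
        {"labels": {"mode_confirm": "walk", "purpose_confirm": "entertainment"}, "p": 0.35},
        {"labels": {"mode_confirm": "drove_alone", "purpose_confirm": "work"}, "p": 0.15},
        {"labels": {"mode_confirm": "shared_ride", "purpose_confirm": "work"}, "p": 0.05},
    ]
}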

@corinne-hcr

I have a question about how you run the clustering. Would you pass in a specific user id (UUID) to run the clustering pipeline or pass in multiple users and let my program find the user id (like how I read the data from the database)?

@shankari

@corinne-hcr is this while building the model or while predicting the labels? We haven't really discussed the script to build the model, but I assumed you would write it to be similar to the existing intake pipeline.
https://github.com/e-mission/e-mission-server/blob/master/emission/pipeline/intake_stage.py

As you can see, the pipeline code reads in all users, but then calls the steps with one user at a time.
So similarly, your code would also read in all users, but then call the build_model with one user at a time.
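
A sketch of that structure (get_all_uuids and build_user_model are placeholders, not the real function names):

def run_model_building():
    for user_id in get_all_uuids():        # read in all users, like the intake pipeline
        try:
            build_user_model(user_id)      # fit and save the model for one user at a time
        except Exception as e:
            print(f"Model building failed for user {user_id}: {e}")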

@corinne-hcr

corinne-hcr commented Jul 15, 2021

There is one thing we didn't mention before. Among the 13 users, only 8 are valid. Since invalid users are filtered out at the very beginning of my program, they don't have saved files (scores, parameters, model, user labels). So, when Gabriel uses my code to predict, if he passes in a user id but the file cannot be opened, that user should be asked for user labels directly. Or it might be better to filter out invalid users in his program and pass only valid user ids to my functions.

Also, even if the user is valid, it is possible that the user doesn't have common trips; I will not have saved files for that user, either.

@shankari

@corinne-hcr maybe you don't understand the interface separation.

You are expected to write a function in emission.analysis.classification.inference.labels.pipeline.py
The function will look like def placeholder_predictor_1(trip): and def placeholder_predictor_0(trip): etc
It will take a trip and return a dict
How you implement that function is up to you - @GabrielKS doesn't care

Presumably you will implement the function by reading in the model for the user who took that trip.
If there is no file, you will return an empty dict.
But you will write that code, not @GabrielKS
He has no way of knowing what is a valid model and what is not and cannot filter anything before passing in values to you.
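
A sketch of such a predictor, under the interface described above (model_file_for, find_cluster, and the stored layout are illustrative placeholders; the return value is the per-cluster list of label dicts discussed earlier, or an empty result when there is no model):

import os
import jsonpickle as jpickle

def cluster_based_predictor(trip):
    user_id = trip["user_id"]                 # the trip carries the user id
    model_path = model_file_for(user_id)      # hypothetical per-user model filename
    if not os.path.exists(model_path):
        return []                             # no saved model for this user: no prediction
    with open(model_path, "r") as fd:
        saved = jpickle.loads(fd.read())
    cluster_id = find_cluster(saved, trip)    # hypothetical: bin lookup + second-round predict
    # return the list of {"labels": ..., "p": ...} dicts for that cluster (empty if novel)
    return saved["user_labels"].get(cluster_id, [])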

@corinne-hcr

corinne-hcr commented Jul 15, 2021

If there is no file, you will return an empty dict.

OK. It sounds good.
But I still need to get the user id to find the files, not just a trip.

@shankari

The trip has the user_id. Please review the e-mission user model carefully.

@shankari

Tracking initial staging deployment at:
corinne-hcr/e-mission-server#2

@corinne-hcr

corinne-hcr commented Jul 22, 2021

Just some assumptions about why we might not be getting sufficient valid users or data.

  1. I would consider whether the users are working from home now. If most of them are working from home, it is understandable that they don't have regular trips every week (e.g. they might just go out for shopping, exercise, or some other purpose, and those locations are not necessarily the same each time). In that case, the users might have a high proportion of novel trips.
  2. A high proportion of trips are not labeled, or the trips don't meet the criteria.
  3. I think another analysis focusing on common trips might be better than tuning the radius. Right now, after we do two rounds of clustering, we evaluate the scores on all trips we have collected. However, given a high proportion of novel trips after the 1st round of clustering (due to the WFH situation or unstable work/living places), we might not get the expected result for most of the users. We could just evaluate the score on common trips. After the first round, we have common trips; we can compare the result from the 1st round with the one after the 2nd round. The novel trips from the 1st round are unpredictable, so we can consider them noise.
  4. If it is necessary to explain why we have so many novel trips, I think the proportion of purposes like shopping, personal med, exercise, etc. (excluding home/work) in common trips vs. novel trips might help.

@shankari

@corinne-hcr the results are pretty much the same on the minipilot dataset as well. I don't remember the exact numbers now, and I didn't write them down at the time, but my recollection is that we ended up with < 10 inferred trips per user.

@corinne-hcr

Are they more like user 1 (who has a regular travel pattern) or more like user 13 (high request pct but low homogeneity score)?
I remember some of the main reasons for the rejection were:

  1. we drew the conclusion too early
  2. we did not provide much novelty
  3. the work did not have much connection to the main topic of the workshop

We don't have a novel solution to address our research problem, and I am not sure whether the findings from the data can be considered novel.
But the dataset is the main problem. If our users had a regular workplace (e.g. going to the workplace 5 days a week) or other regular patterns, the result should not look like this; at least in the 1st round we would collect many common trips. Is there any background info on the users (like low-income, WFH, etc.) to explain it?

Is there anyone from the mini-pilot also in your full-pilot program? If so, we can compare the results.
As for the 10 inferred trips, how frequent are they? Did they occur once a week/month in the collected trips, or more frequently? I think we should consider the frequency of similar trips instead of just the number of trips. If a trip happens once a month, and we catch it and infer the labels, then the algorithm works.

I really think we should try to focus on the common trips once we have the 1st round result. We cannot deal with novel trips after all.

@shankari

@corinne-hcr I understand that you have a lot of ideas of how to proceed. So do I. I am working on an exploratory analysis according to my ideas now. If you finish the unit tests, you are welcome to continue exploring the data further.

@shankari

I will probably end up evaluating classification v/s clustering algorithms for this in the second round...
