Cluster trips between pairs of common places to form common trips #606
Leidy's suggestion was to calculate the difference in start time between all pairs of trips (similar to the difference between start locations for all pairs of trips) and then to include that in the binning. An alternate suggestion is to just use the "hour" field for the trips. But that may have boundary effects. Need to test.
My suggestion is:
examples of features:
etc.
Adding more features for clustering doesn't work better; the v-score gets worse.
The clustering algorithm will almost certainly return labels like 1,1,2,0,0 in the second round. You could choose to set the trip labels to a concatenation of the first-round and second-round labels, or other such combinations. So in your example above, the labels you set for the trips would be 11, 11, 12, 10, 10.
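A minimal sketch of what that concatenation could look like, assuming the labels from each round are plain integers (the variable names are illustrative, not from the actual code):

```python
# Sketch: combine first- and second-round cluster labels into one
# two-digit label, e.g. first-round 1 + second-round 2 -> 12.
first_round = [1, 1, 1, 1, 1]      # round-1 labels for the trips in one cluster
second_round = [1, 1, 2, 0, 0]     # round-2 labels within that cluster

combined = [f1 * 10 + f2 for f1, f2 in zip(first_round, second_round)]
print(combined)  # [11, 11, 12, 10, 10]
```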
I wonder if we should try homogeneity as a metric instead of v-score, especially with the smaller clusters. For example, if I have 20 trips, 10 with label
From a homogeneity perspective, this is still a score of 1.0. But I think that the v-score will be lower because all the trips with l1 are not in the same cluster. So maybe we need to use the v-score after all, and the completeness will be represented by the user requests instead. Concretely, the completeness of option (2) above would be bad because all trips with l1 are not in the same cluster. But we have a different metric to represent that badness where trips with the same label are put in different clusters - and that is the user requests. So if we have user requests, which is a better metric for us, we don't really need completeness. It is possible that the trips will be split into (5 l1 + 10 (l1+l2) + 5 l2) as well. In that case, the homogeneity of the middle cluster will be bad, so the accuracy will also be lower. If we get a new trip, and it is assigned to the first or last cluster, it will be accurate. If it is assigned to the middle cluster, it has a 50:50 chance of being accurate. So again, the homogeneity captures the actual behavior we want to represent.
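For concreteness, a small check of how sklearn's scores behave in the 20-trip scenario described above (the label values and the cluster split are made up to match that example):

```python
from sklearn.metrics import homogeneity_score, v_measure_score

# 20 trips: 1 = user label l1, 2 = user label l2
true_labels = [1] * 10 + [2] * 10

# Option (2): the l1 trips are split across two pure clusters,
# all the l2 trips are together in a third cluster
clusters = [0] * 5 + [1] * 5 + [2] * 10

print(homogeneity_score(true_labels, clusters))  # 1.0 - every cluster is pure
print(v_measure_score(true_labels, clusters))    # < 1.0 - l1 is split across clusters
```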
This is what I mean by using the trip id
I have some questions about testing the algorithm. If we are not using classification for new trips, how does the new label list [1,1,10,12,13,2,2,20,21] work? It is not the list from
I assume you would use a two-step prediction as well. So call
I should also note that you won't have a list of labels like
Have you looked at hierarchical clustering (https://en.wikipedia.org/wiki/Hierarchical_clustering)? It is apparently available in SciPy as well.
I don't think that sklearn has divisive clustering, which is what we are looking for, but I see that there are some other implementations... No need to reinvent the wheel...
I don't understand this. Do you mean I keep the original labels? [1,1,1,1,1,2,2,2,2] for the first round, [0,1,2,3,3] for label 1 in the second round, but not generate a new label list for all trips for predicting a new trip? (I think I need a new list for evaluating the homogeneity score.) If so, how do I know if there is a second round or more for label 1? If the new trip was labeled 1 in the first round and went to the second round, it would be clustered again with other trips labeled 1 to find the second digit, right? If in the second round the new trip was assigned a new label different from the other trips, it would need to request the user, right?
I don't understand your question either. At the end of the first round, your labels are
You can assume that labels will always be two digits. If a cluster is homogeneous enough that you don't want to run a second round, just assume they are all in one cluster for the second round. So if cluster 1 is homogeneous, then call it
Then [11,11,12,12,12,21,21,22,22] is not for
Why would it not be for
I think [11,11,12,12,12,21,21,22,22] is a new list; it does not come from either round 1 or round 2. So
Why is this not the algorithm?
The trip is now in cluster
To fit models
To predict models
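The pseudo-code that followed these headings did not survive in this thread. Purely as an illustration of the two-round fit/predict flow under discussion (the function names, feature handling, and use of KMeans here are assumptions, not the actual implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_two_rounds(features_round1, features_round2, n1, n2):
    """Fit a first-round model, then one second-round model per first-round cluster.

    Both feature arguments are numpy arrays with one row per trip.
    """
    first_model = KMeans(n_clusters=n1, random_state=0).fit(features_round1)
    second_models = {}
    for label in np.unique(first_model.labels_):
        mask = first_model.labels_ == label
        # only sub-cluster as far as the number of trips in this cluster allows
        k = min(n2, int(mask.sum()))
        second_models[label] = KMeans(n_clusters=k, random_state=0).fit(features_round2[mask])
    return first_model, second_models

def predict_two_rounds(first_model, second_models, feat1, feat2):
    """Two-step prediction: first-round label, then second-round label within it."""
    l1 = int(first_model.predict(feat1.reshape(1, -1))[0])
    l2 = int(second_models[l1].predict(feat2.reshape(1, -1))[0])
    return l1 * 10 + l2  # combined two-digit label, e.g. 1 and 2 -> 12
```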
For
For
For
If the homo score is > 0.9, how do I get 1 cluster for the second round? For min_cluster = 1, the prompt is
It is from silhouette_score
You skip the cluster in the second round, and just assign the second label as 1. So after the first round, if cluster 5 has score > 0.9, then after the second round, it will have label
That is the model - it is just represented that way. If you call
In the pseudo-code, it can either be the set of labels or the number of clusters. Note that I don't actually use
Yes.
Sure. You can also just filter the dataframe and pass it in, which is likely to be more efficient. Again, as I emphasized yesterday, what I wrote was pseudo-code, and not expected to work exactly as written. Feel free to edit to make it actually work.
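As a tiny, self-contained illustration of the filtering idea (the dataframe and column names here are made up, not the project's actual schema):

```python
import pandas as pd

# Hypothetical trips table with the first-round labels already attached
trips_df = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "first_round_label": [1, 1, 2, 2],
    "duration": [600, 620, 1800, 1750],
})

# Keep only the trips in first-round cluster 1 and hand that subset to the
# second round, instead of slicing inside the clustering code itself.
cluster_1_trips = trips_df[trips_df["first_round_label"] == 1]
print(cluster_1_trips)
```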
Also, given that you have spent multiple days on this already, are you sure that it wouldn't be faster to just use the existing implementation of divisive clustering?
To answer my own question, one advantage of building a homegrown solution over the existing implementation may be around controlling the features for each round. In the divisive clustering algorithm, we specify all features at the same time. With the homegrown solution, we can cluster on the start and end locations first, and then add additional features for the second round. @corinne-hcr ideally, you would be able to articulate a reason like this for why divisive clustering is not a good fit. We should certainly use this as justification in the paper going forward.
Let me ask that as a question. You said that the implementation here
Why? If you choose not to use a standard, well-known method and come up with a new algorithm instead, you need to be able to explain why. https://en.wikipedia.org/wiki/Not_invented_here is not a good enough reason. You are not using the existing DIANA algorithm and are implementing your own two-step algorithm. So what is the reason?
After finding some code examples of DIANA, I figure that there is not a good metric for deciding the clusters. Metrics in sklearn require that max_clusters be at most n_samples - 1; for example, with 5 samples there can only be 4 clusters at most. But in our case, some of the first-round clusters can have fewer sub_clusters than the number of trips, but some cannot (the trips inside such a cluster should be divided into sub_clusters with 1 trip each). For the DIANA algorithm, I have seen one example that keeps dividing trips until there is 1 trip per cluster (from the GitHub link you posted), and another example that is based on the number of clusters the programmer passes in (for example, setting k=3, the algorithm stops when there are 3 clusters). But neither way works for us (we don't know how many clusters would be enough). Here is my idea. I can use the homo score for the fake second-round clustering, then collect the differences in points (locations, basically reusing Naomi's code, finding the radius), distance, and duration within one cluster (so that I will know how close trips have to be in order to be in one cluster). After that, we use the median of the differences from all users. We will end up having a location radius of (for example) 10 meters, a distance difference of 50 meters, and a duration difference of 200s. Then use them as metrics for the real second-round clustering. But then, I can write the code like Naomi's binning, not using an existing clustering algorithm. It could be more accurate than using kmeans or DIANA. It also runs faster than using sklearn according to our previous implementations. What do you think?
@corinne-hcr for the implementation which keeps dividing trips, that is building out a full tree. It is actually a fairly common strategy for hierarchical clustering algorithms. You can then choose to cut off the tree at any point. https://en.wikipedia.org/wiki/Hierarchical_clustering
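As an illustration of building out the full tree and then cutting it, here is a hedged sketch using SciPy (this uses agglomerative linkage rather than a divisive implementation, and the data points are made up, but the cut-the-tree-at-any-point step works the same way):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points standing in for trip location features
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Build the full hierarchical tree
tree = linkage(points, method="ward")

# Cut the tree at a chosen distance threshold instead of a fixed k
labels = fcluster(tree, t=1.0, criterion="distance")
print(labels)  # e.g. [1 1 1 2 2 2]
```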
I see that you are coming up with ad-hoc ideas on how to work with the data. This is great, but you want to be very careful that you don't reinvent the wheel. Particularly for externally peer-reviewed papers, it is not sufficient to say "I did X"; you need to put your work into the context of the existing literature so it is easier for people to understand what you are doing, and you can highlight how your work is novel. Concretely, how is your idea "then collect the differences in points (locations, basically reusing Naomi's code, finding the radius), distance, and duration in one cluster (so that I will know how close they are in order to be in one cluster)" different from using distance metrics for the clustering algorithms? Putting your work into the context of the existing work also allows you to avoid pitfalls like using the labels in the training step. Can you try to map your problem into the standard hierarchical clustering model? Note that the model (per Wikipedia) is:
If we have a separate
We do not need to call
Note that this can actually be optimized to:
Just to confirm that sklearn returns the same result when I pass in the correct
Here is the result
@corinne-hcr I would suggest that you push your code to a draft PR periodically so I can see it and potentially run it if I have ideas on how to fix it. Just make sure not to push the results since they are privacy-sensitive - clear all outputs before pushing.
For DBSCAN, there are two required parameters --
For affinity clustering, it takes totally different parameters. I don't see much explanation on
For kmeans, I run hierarchical clustering first and get
Here is what I ran. I picked two bins to experiment with.
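The exact invocations are not preserved above. For reference, a rough sketch of how the three sklearn estimators being compared are typically instantiated (the parameter values are placeholders, not the ones actually used in the experiment):

```python
from sklearn.cluster import DBSCAN, AffinityPropagation, KMeans

# DBSCAN: controlled by a neighborhood radius and a minimum number of
# points per dense region
dbscan = DBSCAN(eps=0.3, min_samples=5)

# Affinity propagation: no cluster count up front; behaviour is mostly
# controlled by damping (and optionally preference)
affinity = AffinityPropagation(damping=0.9, random_state=0)

# k-means: requires the number of clusters up front
kmeans = KMeans(n_clusters=5, random_state=0)

# All three follow the same fit / labels_ pattern, e.g.
# labels = dbscan.fit(features).labels_
```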
Results from bin 1
Result from bin 2
Here is the result from user 1
kmeans
Files for testing user1 are in e-mission/e-mission-eval-private-data#23.
I don't understand the difference between the results "from bin1 and bin2" and from "user1". Responding to the comments about the difference between different clustering algorithms:
First, this is a very surprising result. k-means is known for being fast. From the sklearn comparison: for k-means:
while agglomerative clustering says:
I don't think that you specify a connectivity matrix in your invocation, so I really question the results. Where is your timing code to print out the execution time? Some other options to check:
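Separately, a minimal sketch of the kind of timing instrumentation being asked about (the feature matrix below is random, standing in for the real trip features, and the cluster counts are placeholders):

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

features = np.random.rand(500, 4)  # stand-in for the trip feature matrix

for name, model in [("kmeans", KMeans(n_clusters=10, random_state=0)),
                    ("agglomerative", AgglomerativeClustering(n_clusters=10))]:
    start = time.perf_counter()
    model.fit(features)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```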
However, having said all that, even if it is true that k-means is much slower than agglomerative clustering, this is only for the model building stage, which we only run once a day or once a week. Unless you prove that the model prediction is slower than 20 mins, it still makes sense to use a clustering method that can be applied as a two-stage process. We will be using the
There are other comments I have on the other clustering algorithms, but it is fairly easy to swap them out later since they all follow the same pattern. For now, we can use k-means (or minibatch k-means) and move on with the full system implementation. I am really worried about the code cleanup and the unit testing, which I don't see getting done. We are now writing additional code, but don't have the basic testing infrastructure in place.
Right now I pass in
I only added kmeans in the test step; it takes around 7 mins to run two rounds of clustering for user 1. Going to explore how to save the models.
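One common way to persist fitted sklearn models is joblib. A hedged sketch, with a toy model and a made-up filename scheme (not the project's actual convention):

```python
import joblib
import numpy as np
from sklearn.cluster import KMeans

# Toy model standing in for the real first-round clustering model
model = KMeans(n_clusters=2, random_state=0).fit(np.array([[0.0], [0.1], [5.0], [5.1]]))

# Hypothetical per-user filename scheme
model_path = "user_1_round1_model.joblib"
joblib.dump(model, model_path)

# Load it back later, e.g. in the prediction step
loaded = joblib.load(model_path)
print(loaded.predict(np.array([[0.05]])))  # same cluster as the first two points
```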
With respect to these comments, I want to highlight that it is OK to have results that are different from the agglomerative clustering. The agglomerative clustering results are not ground truth and we don't have to try very hard to get the same clusters. We just have to check that the tradeoff wrt homogeneity/user requests is not significantly worse.
I have a question about the model filename. In your example, you use
I don't understand your question.
OK. I got it. I thought I needed to set a path for the saved model file. I use
Just for the record
If you have an assumption like this, a quick way to check the assumption is to Just Run the Code and see what it does.
Should I save all of them, or only save the first one with p of 0.45?
Do you see any problems with this suggestion?
What do you think? If you only store the combination of user labels with the max pct (e.g. the one with p of 0.45), will you be able to meet the expectations of Gabriel's code? The interface we agreed on was for Gabriel to expect a list of label dictionaries with probabilities. This would be the list of label dictionaries associated with the cluster that the trip maps to through the prediction algorithm. So you would need to save all of them.
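As a concrete illustration of that interface, the per-cluster output would be the full list rather than just the top entry. The label keys and values below are made up for illustration:

```python
# All user-label combinations seen in the matched cluster, each with its
# share of the cluster's trips -- the whole list is returned, not only
# the most likely entry.
predictions = [
    {"labels": {"mode_confirm": "bike", "purpose_confirm": "work"}, "p": 0.45},
    {"labels": {"mode_confirm": "drove_alone", "purpose_confirm": "work"}, "p": 0.35},
    {"labels": {"mode_confirm": "bike", "purpose_confirm": "shopping"}, "p": 0.20},
]
```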
I have a question about how you run the clustering. Would you pass in a specific user id (UUID) to run the clustering pipeline, or pass in multiple users and let my program find the user id (like how I read the data from the database)?
@corinne-hcr is this while building the model or while predicting the labels? We haven't really discussed the script to build the model, but I assumed you would write it to be similar to the existing intake pipeline. As you can see, the pipeline code reads in all users, but then calls the steps with one user at a time.
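A sketch of that pattern, with hypothetical helper names (the real per-user steps live in the existing intake pipeline code):

```python
def run_model_building_pipeline(all_user_ids, build_user_model):
    """Read in all users, but call the per-user model-building step one user at a time."""
    for user_id in all_user_ids:
        try:
            build_user_model(user_id)
        except Exception as e:
            # one user failing should not abort the run for everybody else
            print(f"model building failed for {user_id}: {e}")

# usage sketch with a stand-in per-user step
run_model_building_pipeline(["uuid-1", "uuid-2"],
                            lambda uid: print("building model for", uid))
```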
There is one thing we didn't mention before. Among the 13 users, only 8 are valid. Since invalid users would be filtered out at the very beginning in my program, they don't have saved files (scores, parameters, model, user labels). So, when Gabriel uses my code for
Also, even if the user is valid, it is possible that the user doesn't have common trips; I will not have saved files for that user, either.
@corinne-hcr maybe you don't understand the interface separation. You are expected to write a function in
Presumably you will implement the function by reading in the model for the user who took that trip.
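A hedged sketch of how such a function could be implemented, including the missing-model case raised above (the function name, filename scheme, and data layout are all assumptions, not the agreed interface):

```python
import joblib

def predict_labels(trip, label_distributions):
    """Predict label distributions for one trip by loading that user's saved model.

    `label_distributions` maps (user_id, cluster) -> list of {"labels": ..., "p": ...}.
    """
    user_id = trip["user_id"]
    try:
        model = joblib.load(f"user_{user_id}_model.joblib")
    except FileNotFoundError:
        # invalid users, or users without common trips, have no saved model,
        # so return an empty prediction instead of crashing
        return []
    cluster = int(model.predict([trip["features"]])[0])
    return label_distributions.get((user_id, cluster), [])
```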
OK. It sounds good.
The trip has the
Tracking initial staging deployment at:
Just some assumptions about not getting sufficient valid users or data.
@corinne-hcr the results are pretty much the same on the minipilot dataset as well. I don't remember the exact numbers now, and I didn't write them down at the time, but my recollection is that we ended up with < 10 inferred trips per user.
Are they more like user 1 (who has a regular travel pattern) or more like user 13 (high request pct but low homogeneity score)?
We don't have a novel solution to address our research problem. I am not sure if the findings from the data can be considered novel. Is there anyone from the mini-pilot who is also in your full-pilot program? If so, we can compare the results. I really think that we can try to focus on the common trips once we have the 1st round result. We cannot deal with novel trips after all.
@corinne-hcr I understand that you have a lot of ideas of how to proceed. So do I. I am working on an exploratory analysis according to my ideas now. If you finish the unit tests, you are welcome to continue exploring the data further.
I will probably end up evaluating classification vs. clustering algorithms for this in the second round...