Evaluation Metrics & Process #47
Comments
It's an interesting idea, but how would that differ from comparing the metrics for the "ground" class?
I think my example might have been a bit hard to follow in text; I wish I had some visualisations. To summarise briefly: the current metrics for the "ground" class treat all points the same. If you mis-classify a point as 'ground' when it was a rooftop, a treetop, or a small bucket on the ground, each of those counts as the same 'error' in the metric. But when we care about generating a DTM, it is much 'worse' (has a larger cost/error) to mis-classify a treetop as ground than it is to mis-classify a small bucket on the ground. If we only look at the current metrics for the "ground" class, they would show no difference.

The current metrics are useful and should be kept, but there's an old saying that what is not measured cannot be improved. In this case, if we are not measuring the final DTM accuracy, then who's to say that any new models trained/released are actually improving the DTM? A new model might have better ground class metrics but actually produce worse DTMs for ODM, and we would never know unless we measure the DTM metrics.
Actually I just had another idea in a similar vein @pierotofy: LightGBM has an option to provide sample weights during training. So you could easily add a weighting calculated from the ground truth DEM, so that during training the model learns that it is worse to mis-classify points with large elevation deltas from the terrain/ground. The model would then be able to better learn those patterns and might produce higher quality DTMs without adding any new training data at all (rough sketch below).

I'm also curious, have you experimented at all with hyperparameter tuning? I noticed that the current learning rate and 'num_leaves' parameters are hard-coded, and I wonder if there could be some easy gains from a search over those parameters to find the best ones. Not sure if you've already done this.
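To make that concrete, here is a minimal sketch of passing elevation-delta weights to LightGBM. Everything below is synthetic, and the names (`point_z`, `gt_dtm_z`) and weighting constants are just placeholders for illustration, not the actual OPC training pipeline:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real per-point training data (hypothetical names):
# features -> per-point feature matrix, labels -> class ids,
# point_z  -> each point's elevation,
# gt_dtm_z -> ground-truth DTM elevation sampled at the point's XY location.
n = 10_000
features = rng.normal(size=(n, 16))
labels = rng.integers(0, 5, size=n)
point_z = rng.uniform(0, 30, size=n)
gt_dtm_z = rng.uniform(0, 5, size=n)

# Weight each point by how far it sits from the true terrain, so the model
# pays a larger penalty for confusing high-off-terrain points (treetops,
# rooftops) with ground than for near-terrain clutter like a small bucket.
delta = np.abs(point_z - gt_dtm_z)
weights = 1.0 + 0.5 * delta  # base weight of 1, growing with elevation delta

train_set = lgb.Dataset(features, label=labels, weight=weights)
model = lgb.train(
    {"objective": "multiclass", "num_class": 5,
     "learning_rate": 0.1, "num_leaves": 31, "verbosity": -1},
    train_set,
    num_boost_round=50,
)
```

The `learning_rate` / `num_leaves` values in that params dict are also exactly the kind of thing a small hyperparameter search could sweep over.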
That makes sense, thanks for the explanation. It could be an interesting addition. I have not played with hyperparameters much. We'd welcome improvements in this area as well.
Quick update! I put together a Python script that can be run with something like:

As output it can save a stats file with JSON 'DTM evaluation metrics', and it can also save some graphs, which are helpful for debugging the DEMs and the errors your model is getting. Questions:
Adding some extra info below on the evaluation metrics I came up with, for anyone who might be interested and wants to discuss or provide suggestions. These were just my best initial guesses for metrics that would measure how 'good' or 'useful' a predicted DTM is compared to a ground truth DTM.

Evaluating Predicted DTMs

NOTE: Skip this section if you're not interested in the evaluation metric definitions.

The evaluation metrics produced:
Finally, for each evaluation metric I also re-computed the metric using the DSM as a baseline. For example, for MAE we treat the DSM as a 'predicted DTM' and calculate an MAE of, let's say, 4.5 m, and we see that our actual predicted DTM has an MAE of 0.9 m. In that case, we can consider our 'MAE_relative-dsm' to be 80%. On this 'relative to DSM' scale, every metric is computed as (1 - pred_metric / dsm_pred_metric), where dsm_pred_metric is the same metric computed with the DSM treated as the prediction; 100% means our model is perfect and 0% means our model is no better than just using the DSM. This relative-to-DSM metric seems helpful because it lets us compare across different point clouds, which might have different scales or levels of difficulty.
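For concreteness, a tiny self-contained sketch of that relative-to-DSM computation (toy arrays only; the real script works on full rasters that are aligned first):

```python
import numpy as np

def mae(pred, truth):
    # Mean absolute error between two elevation rasters of the same shape (metres).
    return float(np.nanmean(np.abs(pred - truth)))

def relative_to_dsm(pred_metric, dsm_pred_metric):
    # 1 - pred/dsm: 100% = perfect DTM, 0% = no better than just using the DSM.
    return 1.0 - pred_metric / dsm_pred_metric

# Toy rasters matching the numbers above (illustrative only).
gt_dtm = np.array([[10.0, 10.2], [10.1, 10.3]])  # ground-truth DTM
pred_dtm = gt_dtm + 0.9                          # predicted DTM -> MAE of 0.9 m
dsm = gt_dtm + 4.5                               # DSM treated as a 'predicted DTM' -> MAE of 4.5 m

print(relative_to_dsm(mae(pred_dtm, gt_dtm), mae(dsm, gt_dtm)))  # 0.8 -> 80%
```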
Example metrics for OPC V1.3 run on odm_data_toledo.laz:
I think it might make sense for this to live as a separate effort (at least initially), due to the ODM dependency. I would recommend publishing the script in a separate repo, then adding instructions to the README here on how to run the method.
NOTE: This isn't really an issue, more of a discussion topic/idea that I wanted to raise and get some feedback on, to see whether there is interest or value in it before I implement anything. Also, a huge thanks to everyone who's contributed to this project & OpenPointClass; lots of amazing work already done, so kudos!
Idea: What if we add an additional set of sub-task evaluation metrics that evaluates how well the point cloud classification is able to produce accurate DTMs.
My current understanding is that the evaluation metrics used so far focus on the classification metrics for the point cloud. For example, the metrics found on this PR (#46).
So models are currently evaluated on how accurately they classify points, which makes a lot of sense. The only question I then have is: how accurate are the DTMs generated using the point cloud classification?
For example, I imagine that we could have two models, M1 and M2. It's quite possible that M1 might have worse point classification precision/recall/accuracy scores compared to M2, but could produce higher quality/more accurate DTMs from the classified point clouds.
For that reason, I thought it might be a good idea to add a new 'subtask' evaluation routine that is run roughly as follows (rough sketch below):
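Something along these lines, as a purely illustrative, self-contained toy: synthetic points, a pretend classifier, and a crude min-per-cell gridding step stand in for the real OPC classification and ODM DTM generation:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1. Start from a point cloud with known ground-truth terrain underneath it
#    (synthetic here: a gently sloping plane with some vegetation points on top).
n = 50_000
x, y = rng.uniform(0, 100, size=(2, n))
true_ground = 0.05 * x + 0.02 * y
is_veg = rng.random(n) < 0.3
z = true_ground + np.where(is_veg, rng.uniform(2.0, 15.0, n), rng.normal(0.0, 0.1, n))

# 2. Classify the points (a pretend classifier here; in reality the OPC model).
pred_class = np.where(is_veg, 1, 0)  # 0 = ground, 1 = vegetation

# 3. Generate a DTM from the points classified as ground
#    (crude min-per-cell gridding; in reality ODM would do this step).
cell = 5.0
gx, gy = (x // cell).astype(int), (y // cell).astype(int)
dtm = np.full((gx.max() + 1, gy.max() + 1), np.nan)
for i, j, zi in zip(gx[pred_class == 0], gy[pred_class == 0], z[pred_class == 0]):
    dtm[i, j] = zi if np.isnan(dtm[i, j]) else min(dtm[i, j], zi)

# 4. Compare the predicted DTM against the ground-truth terrain at cell centres
#    and report the 'DTM estimation metrics'.
cx = (np.arange(dtm.shape[0]) + 0.5) * cell
cy = (np.arange(dtm.shape[1]) + 0.5) * cell
gt_dtm = 0.05 * cx[:, None] + 0.02 * cy[None, :]
print("DTM MAE (m):", np.nanmean(np.abs(dtm - gt_dtm)))
```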
This would produce a new set of 'DTM estimation metrics' that would be complementary to the current set of 'point cloud classification metrics'. I would like to hear what others think: does this seem like a useful addition that could be pulled/merged in, or does it not align with the current goals of the project & dataset?