Outliers #189

Closed
ethanplunkett opened this issue Jun 12, 2024 · 11 comments · Fixed by #193

@ethanplunkett
Contributor

Problem

Some species, for example American Robin and American Avocet, have extreme values in small areas within their distribution maps.

The January 4th distribution for American Robin is a clear example:
image

We suspect that there is a concentration of Robins at the hot spot, but highly doubt that 15% of the population is in a single 100 km cell, as is the case in the above image.

Proposed Solution

Prior to training BirdFlow models, truncate the distribution for each species at each timestep at that timestep's 0.95 quantile: any value higher than the 0.95 quantile is assigned the value of the 0.95 quantile. Cells with zero abundance are excluded from the quantile calculation so that the 95% threshold is relative to the occupied range at each timestep.
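As a rough illustration of the proposal (not BirdFlow code; the thread does not show an implementation), here is a minimal Python sketch. The function names are hypothetical, and the quantile uses linear interpolation, which is an assumption about how the threshold would be computed.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def truncate_timestep(values, q=0.95):
    """Cap one timestep's cell values at the q quantile of its non-zero cells."""
    # Zero cells are excluded so the threshold is relative to the occupied range.
    occupied = sorted(v for v in values if v > 0)
    if not occupied:
        return list(values)
    cap = quantile(occupied, q)
    return [min(v, cap) for v in values]


# Toy timestep: 3 empty cells, 20 modest cells, and one extreme hot spot.
cells = [0.0] * 3 + [1.0] * 20 + [50.0]
capped = truncate_timestep(cells, q=0.95)
```

Only the hot spot cell changes in this toy case; empty cells stay at zero and do not influence the threshold.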

We should evaluate the results with both the problem species and a few other species, including some with very limited ranges, and also evaluate a few other quantiles (0.97, 0.98, 0.99, 0.995).

Here's the same January 4th Robin distribution truncated at the 95th percentile.
image

Truncating at the 99th percentile would also solve the problem:
image

@ethanplunkett
Contributor Author

@dsheldon Does this approach make sense to you? Any other ideas?

@slager
Contributor

slager commented Jun 12, 2024

Seems like a good solution for a lot of cases, but it's also important to keep in mind that those types of distributions, when biologically real, are important not to miss:

https://www.audubon.org/news/dependence-threatened-saline-lakes-leaves-eared-grebes-risk

@dsheldon
Contributor

The general approach sounds pretty good to me. A question is which quantile to use. In general, if we can get away with the 99th, I like it better than the 95th. Could we collect some quick statistics across different species to assess the impact of truncating at different thresholds? One starting idea is just to list some quantiles (e.g. 0.5, 0.9, 0.95, 0.99, 0.995) for a bunch of species in a table to get a feel for them.

@ethanplunkett
Contributor Author

@dsheldon Yes, I can make a table. I'm not sure the absolute values will be that useful, but I can include them along with the proportion of the original density lost by truncating and the maximum proportion lost in any given timestep.
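One way such a table could be assembled is sketched below in Python. The data are made up, the names are hypothetical, and the linear-interpolation quantile is an assumption. For each species and quantile it reports the maximum, across timesteps, of the proportion of abundance lost to truncation.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def proportion_lost(values, q):
    """Fraction of a timestep's total abundance removed by capping at quantile q."""
    occupied = sorted(v for v in values if v > 0)
    if not occupied:
        return 0.0
    cap = quantile(occupied, q)
    return sum(v - cap for v in occupied if v > cap) / sum(occupied)


# Made-up data: species -> list of timesteps, each a list of cell values.
toy = {
    "diffuse": [[1.0] * 100, [2.0] * 100],
    "hot spot": [[1.0] * 100, [1.0] * 99 + [400.0]],
}
quantiles = (0.95, 0.99, 0.995)

# For each species and quantile, keep the worst (largest) loss over timesteps.
table = {
    name: {q: max(proportion_lost(ts, q) for ts in steps) for q in quantiles}
    for name, steps in toy.items()
}

for name, row in table.items():
    print(name, {q: round(p, 3) for q, p in row.items()})
```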

@ethanplunkett
Contributor Author

ethanplunkett commented Jun 12, 2024

@slager That's a perfect example of a species that might be negatively affected by this.

Here's a 100km BirdFlow distribution derived from S&T for Eared Grebe for August 30, which is when they are most concentrated.
image

If we truncate even at the conservative 0.995 quantile, we lose 80% of the abundance for that timestep. Here's what that distribution looks like:
image

I wonder if this should be an option that we choose for individual species when we run into the problem, rather than something applied universally.

Edit: changed 20% to 80% (original abundance lost to truncation).

@ethanplunkett
Contributor Author

For reference here's the Robin with truncation at the 0.995 quantile. I'd say it's largely fixed but I like 0.99 better.
image

@ethanplunkett
Contributor Author

I don't know that there's an automated way to distinguish between the Eared Grebe and the American Robin. Maybe total population (independent of eBird) would help?

@dsheldon
Contributor

I don't think there's a way to distinguish between these two cases. In one case (Robin) we want to drop 15% of the "population" because we don't think it's real. In another case (Grebe) we don't want to drop a significant fraction of the population because we do think it's real. It's just differing beliefs in the quality of S&T data.

On the other hand, I don't think the Grebe result looks so terrible. I assume we'll renormalize the abundances for each time step. The effect seems to be to spread out the main area of high density around Great Salt Lake and "bring up" the levels of some other areas (California, Baja California) in comparison, but the overall effect is still that they are highly concentrated near Great Salt Lake.

@ethanplunkett
Contributor Author

@dsheldon Yes, the values would be rescaled to sum to 1; they weren't in the above image though. During rescaling, values everywhere that isn't truncated will increase fivefold, but because most of the extent is so close to zero, the main impact is the spreading of the hot spot, as you observed.
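The fivefold figure follows from the 80% loss quoted earlier for this timestep. As a quick check, write $f$ for the fraction of the original abundance removed by truncation and $s$ for the factor applied when rescaling so the values sum to 1 again (both symbols are introduced here for illustration):

$$
s = \frac{1}{1 - f}, \qquad f = 0.8 \;\Rightarrow\; s = \frac{1}{0.2} = 5
$$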

@ethanplunkett
Contributor Author

ethanplunkett commented Jun 12, 2024

Conclusion

  • Add a truncate argument to preprocess_species(). It will take either a single quantile or a vector of quantiles, one per timestep. In both cases the quantile(s) are applied to each timestep individually to truncate high values.
  • The review process should evaluate whether a model needs truncation. We can include the maximum (across weeks) percentage of the abundance that is above the 0.99 quantile as a summary statistic to help inform this.
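The "single quantile or one per timestep" behavior described in the first bullet could be handled along these lines. This is a Python sketch with a hypothetical helper name, not the preprocess_species() implementation, which is not shown in this thread.

```python
def expand_quantiles(q, n_timesteps):
    """Return one quantile per timestep from a scalar or a per-timestep sequence."""
    if isinstance(q, (int, float)):
        # A single value is reused for every timestep.
        qs = [float(q)] * n_timesteps
    else:
        qs = [float(x) for x in q]
        if len(qs) != n_timesteps:
            raise ValueError(
                f"expected 1 or {n_timesteps} quantiles, got {len(qs)}"
            )
    if any(not 0 < x <= 1 for x in qs):
        raise ValueError("quantiles must be in the interval (0, 1]")
    return qs


# Either form yields one quantile per timestep, ready to pair with the data.
same_everywhere = expand_quantiles(0.99, 3)
per_timestep = expand_quantiles([0.95, 0.99, 1.0], 3)
```

Allowing a value of 1 in the per-timestep form is a design choice in this sketch: it leaves that timestep effectively untruncated, which may be useful when only some weeks show the problem.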

@ethanplunkett ethanplunkett self-assigned this Jun 14, 2024
@ethanplunkett
Contributor Author

I'm working on this and now realize that we use "truncate" to talk about models that cover only part of the year, for example truncate_birdflow() and is_truncated(). I think we need another name for this argument. trim_quantile?


3 participants