inat occ query failing with limit > 3000 #215

Closed
keatonwilson opened this issue Apr 9, 2019 · 27 comments
@keatonwilson

Hi there,

Interesting issue - code worked about a week ago, but now seems non-functional. Querying inat for a butterfly species with lots of records - gbif query works great, but inat query doesn't work when setting the limit greater than 3000. Console output below

[Screenshot of console output, 2019-04-09]

@sckott
Contributor

sckott commented Apr 9, 2019

thanks for the report, please include the actual code next time, and not screenshots.

@sckott sckott added the bug label Apr 9, 2019
@sckott sckott added this to the v1.0 milestone Apr 9, 2019
@sckott sckott closed this as completed in 6899648 Apr 9, 2019
@sckott
Contributor

sckott commented Apr 9, 2019

reinstall remotes::install_github("ropensci/spocc"), reload R session, and try again

@keatonwilson
Author

Thanks for the help - worked like a charm. Will include code next time. I knew it was the wrong choice as soon as I did it. ;)

@sckott
Contributor

sckott commented Apr 9, 2019

glad it works

@keatonwilson
Author

This just cropped up for me again with a different species. The reinstall solution above is now not working. Reproducible example below.

#Reproducible Example of occ with iNat failing at high limits
#Keaton Wilson
#keatonwilson@me.com
#2019-05-21

#fresh install of spocc (as per last fix suggested on this thread)
remotes::install_github("ropensci/spocc", force = TRUE)

#Restart your R session here

#loading spocc
library(spocc)

#Successful query with small limits
monarch_500 = occ("Danaus plexippus", from = "inat", limit = 500)
monarch_500

#Can we pull the total number (53,066) - keep in mind, this takes a while.
monarch_full = occ("Danaus plexippus", from = "inat", limit = 53066)
monarch_full

#Nothing there - let's see if gbif works.
monarch_full_gbif = occ("Danaus plexippus", from = "gbif", limit = 50000)
monarch_full_gbif

#Querying GBIF seems to be functional - so it's an inat problem.

@sckott
Contributor

sckott commented May 22, 2019

thanks, will have a look - what does packageVersion("spocc") give you when you have spocc loaded?

@keatonwilson
Author

Thanks @sckott . It reads 0.9.0.9811.

@sckott
Contributor

sckott commented May 22, 2019

i can't replicate your problem, but I only tried with up to 3200 records for inat. (tethered to phone now, will try with large limit later to see if that causes some kind of problem)

@keatonwilson
Author

Yeah, I just ran it successfully with pulling 3200 as well, so the problem must be pulling some number of records between 3200 and 53066 (or more). :)

@keatonwilson
Author

Additionally, I just found a similar issue with querying gbif. I ran a search for Danaus plexippus for all gbif records (somewhere in the 215k range). It ran overnight (over 12 hours) without finishing. Should I open a new issue for this?

@sckott
Contributor

sckott commented May 28, 2019

having a look

@sckott
Contributor

sckott commented May 28, 2019

I ran a search for Danaus plexippus for all gbif records (somewhere in the 215k range)

for GBIF, with that many records you're better off using the GBIF download API https://www.gbif.org/developer/occurrence#download, available in rgbif with occ_download and related fxns. GBIF downloads aren't available through spocc because the interface is different from the normal GBIF search: you submit a request, then wait for it to be completed, so it wouldn't fit in with the other data sources

@sckott
Contributor

sckott commented May 28, 2019

I'm still not getting the no-data problem on the iNat queries that you are getting. With larger requests I do see some warnings about combining data

x = occ("Danaus plexippus", from = "inat", limit = 18020)
#> There were 41 warnings (use warnings() to see them)
warnings()
#> Warning messages:
#> 1: In data.table::rbindlist(x, fill = TRUE, use.names = TRUE) :
#>   Column 2 ['tag_list'] of item 2 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform.

but the data is still returned in this case.

@keatonwilson
Author

Yeah, I get those warnings too, but when you query the occ object x, it shows 0 occurrences found and returned.
[Screenshot, 2019-05-28]

@keatonwilson
Author

And thanks for the tip on GBIF - I'm trying to write a function that pulls and cleans all records from inat and gbif (a common workflow in a number of projects we're working on), so it will be good to integrate the rgbif stuff for species with large numbers of occurrences.

@sckott
Contributor

sckott commented May 28, 2019

all records meaning literally all data from GBIF and iNat?

@keatonwilson
Author

Sorry - no, nothing that crazy! All records for a particular species on both iNat and GBIF - i.e., can I get all records with lat/long for a particular species from both sources in a nice tidy data frame?

@keatonwilson
Author

Also, more strange behavior on inat query limits:

#Reproducible Example of occ with iNat failing at high limits
#Keaton Wilson
#keatonwilson@me.com
#2019-05-21

#fresh install of spocc (as per last fix suggested on this thread)
remotes::install_github("ropensci/spocc", force = TRUE)

#Restart your R session here

#loading spocc
library(spocc)

#Successful query at a limit that still works (3200)
monarch_3200 = occ("Danaus plexippus", from = "inat", limit = 3200)
monarch_3200

#A larger pull (limit = 18020) - keep in mind, this takes a while.
monarch_bigger = occ("Danaus plexippus", from = "inat", limit = 18020)
monarch_bigger

#This is particularly strange, because it pulls less than the limit (limit = 18020, returned = 10041), but still works? What happens if we
#pull even more?
#
#
monarch_bigger_still = occ("Danaus plexippus", from = "inat", limit = 20000)
monarch_bigger_still

#And this is even weirder - now it pulls less than the total number, but slightly more than when the limit is set at 18020. 

@sckott
Contributor

sckott commented May 29, 2019

okay, i finally did the limit = 53066 request and i do get the empty result - investigating

@sckott sckott reopened this May 29, 2019
@sckott
Contributor

sckott commented May 29, 2019

the root problem here is that inaturalist at some point changed to a limit of 10,000 records maximum - so with pagination, which we do internally in spocc, you can't get, for example, 200 records at page 51, because 51*200 = 10,200, which is more than 10,000

we need to error better so that user gets the message, so we'll do that, but not sure what the workaround is when more than 10K records needed
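
The arithmetic behind the cap can be sketched in a few lines of R. This is only an illustration of the reasoning in the comment above, not spocc's internal code: the helper names `page_reachable` and `last_reachable_page` are hypothetical, the page size of 200 is the example value used above, and the 10,000 cap is the figure reported for iNaturalist.

```r
# Illustrative only: which pages stay within a hard record cap?
# `per_page = 200` and `max_records = 10000` mirror the example above;
# both function names are hypothetical and not part of spocc.
page_reachable <- function(page, per_page = 200, max_records = 10000) {
  # A page is reachable only if its last record index is within the cap
  page * per_page <= max_records
}

last_reachable_page <- function(per_page = 200, max_records = 10000) {
  # Largest page number whose records all fall inside the cap
  floor(max_records / per_page)
}

page_reachable(50)     # TRUE:  50 * 200 = 10,000
page_reachable(51)     # FALSE: 51 * 200 = 10,200
last_reachable_page()  # 50
```

Under these assumptions, any `limit` that requires paging past page 50 cannot be satisfied in full, whatever value is passed to `occ()`.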

sckott added a commit that referenced this issue May 29, 2019
…n, add fixture for inat max records limit

add docs to occ() fxn for inat limits and where to get more data
@sckott
Contributor

sckott commented May 29, 2019

reinstall - i've made some changes. There isn't a fix for the issue of getting all the results though. But there are some alternatives. Staying within spocc, you can try getting inat data through gbif, e.g.:

iNaturalist limits: they allow at most 10,000; query through GBIF to get more than 10,000

The inat research grade dataset on GBIF https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7

x <- occ(query = 'Danaus plexippus', from = 'gbif', limit = 10100, 
   gbifopts = list(datasetKey = "50c9509d-22c7-4a22-a47d-8c48425ef4a7"))
x$gbif

@sckott
Contributor

sckott commented May 29, 2019

ugh, lat/lon vars changed in the new API ...

@keatonwilson
Author

Nice. I'll re-install. I just finished a work-around that interacts with the inat api outside of spocc - it iterates through by year, which removes the page-limit issues. Happy to share code if you're at all interested.

A frustrating problem because I'm sure we're not the only group of folks interested in downloading all occurrence data from multiple sources. Thanks again for all of your hard work on this!

@sckott
Contributor

sckott commented May 30, 2019

nice, that sounds good. by the way, the docs for the new inaturalist API we're using are here https://api.inaturalist.org/v1/docs/#!/Observations/get_observations

you can do date queries with it like:

x <- occ(query = 'Danaus plexippus', from = 'inat', limit = 10,
  inatopts = list(year = 2010))
x$inat$meta$found
#> [1] 193
x$inat$data$Danaus_plexippus$observed_on_details.year
#> [1] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010

y <- occ(query = 'Danaus plexippus', from = 'inat', limit = 10,
  inatopts = list(year = 2012))
y$inat$meta$found
#> [1] 478
y$inat$data$Danaus_plexippus$observed_on_details.year
#> [1] 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012

the output format for data from iNat has changed in the new API so the details of drilling down through data is a bit different i think
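
Building on the year filter shown above, the by-year workaround mentioned earlier in the thread might look roughly like the sketch below. It is not the code linked later in this thread: `fetch_inat_by_year` and its `fetch_one_year` argument are hypothetical names, the default fetcher simply wraps the `occ(..., inatopts = list(year = ...))` call from above, and it assumes that each year returns the same columns and fewer than 10,000 records (a very well-observed species might need a finer split).

```r
# Sketch: collect iNaturalist records one year at a time so that no single
# query has to page past the 10,000-record cap.
# `fetch_inat_by_year` and `fetch_one_year` are hypothetical names.
fetch_inat_by_year <- function(query, years, limit = 10000,
                               fetch_one_year = NULL) {
  if (is.null(fetch_one_year)) {
    # Default fetcher: the year-filtered occ() call shown above
    fetch_one_year <- function(query, year, limit) {
      res <- spocc::occ(query = query, from = "inat", limit = limit,
                        inatopts = list(year = year))
      res$inat$data[[1]]
    }
  }

  pieces <- lapply(years, function(yr) {
    dat <- fetch_one_year(query, yr, limit)
    # Skip years that return nothing
    if (is.null(dat) || NROW(dat) == 0) return(NULL)
    as.data.frame(dat)
  })
  pieces <- Filter(Negate(is.null), pieces)
  if (length(pieces) == 0) return(data.frame())

  # Assumes identical columns in every year; if they differ, a filling
  # bind such as data.table::rbindlist(fill = TRUE) would be needed instead
  do.call(rbind, pieces)
}

# Hypothetical usage (makes live API requests, so not run here):
# monarchs <- fetch_inat_by_year("Danaus plexippus", 2010:2012)
```

Passing the fetcher in as an argument keeps the looping logic separate from the network call, so the combining step can be checked with a stub before making any real requests.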

@keatonwilson
Author

If you're interested: code for the inat/gbif combination and cleaning/munging. Not the most elegant, but currently working (still figuring out some bugs on records with really high occurrence numbers).

https://github.com/keatonwilson/insect_migration/blob/master/scripts/get_clean_obs_function.R

@sckott
Contributor

sckott commented Jun 4, 2019

nice. Are we all good on this? Anything else on this topic?

@keatonwilson
Author

All good - it seems like things are limited by the iNat API, so not much to do about it!
