Data ingestion
- Ingest of resources
- Use & good practices of Biocache
- Commands for the spatial module
- Checks after ingestion
If you are working with just one dataset:
- Connect to the server where your biocache tool is hosted
- Connect to biocache
$ sudo biocache
- Run the following command line:
biocache> ingest -dr <dataresource_id>
You can also run directly on the terminal:
$ sudo biocache ingest -dr <dataresource_id>
Important: you will not have logs if you don't specify an output file.
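For example, a minimal sketch of keeping a log by redirecting standard output and error (the /tmp/ingest-dr.log path is an arbitrary assumption):
$ sudo biocache ingest -dr <dataresource_id> > /tmp/ingest-dr.log 2>&1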
To run the ingestion step by step instead:
- Connect to the server where your biocache tool is hosted
- Connect to biocache:
$ sudo biocache
- Run the following command lines:
biocache> load <dataResource_id>
biocache> process -dr <dataResource_id>
biocache> sample -dr <dataResource_id>
biocache> index -dr <dataResource_id>
You can also run the command lines above directly on the terminal, prefixing each of them with:
$ sudo biocache
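For example, a sketch following the same pattern as the ingest command above:
$ sudo biocache load <dataResource_id>
$ sudo biocache process -dr <dataResource_id>
$ sudo biocache sample -dr <dataResource_id>
$ sudo biocache index -dr <dataResource_id>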
If your DwC-Archive is too big to be ingested directly (see the note about the 1 GB limit below):
- Upload a modified DwC-Archive with only 15 occurrences in order to create the dataset in the system.
- Copy the real DwC-Archive instead of the modified one into the /collectory/upload/ folder (see the sketch after the commands below).
- Then run the load, process, sample and index commands:
biocache> load <dataResource_id>
biocache> process -dr <dataResource_id>
biocache> sample -dr <dataResource_id>
biocache> index -dr <dataResource_id>
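For illustration only, step 2 could look like the following; the destination comes from the /collectory/upload/ folder mentioned above, and it is an assumption that the real archive keeps the same file name as the modified one:
$ sudo cp /path/to/real-dwca.zip /collectory/upload/<same_name_as_modified_archive>.zip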
We need to do step 1 because the ZipFile library used by biocache-store can't open a file bigger than 1 GB.
You also need a server with at least as much RAM as the size of your DwC-Archive.
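A quick way to check the memory available on the server:
$ free -h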
You can run a single command line (as a sudo user):
$ nohup biocache ingest -all > /tmp/load.log &
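Since the command runs in the background with nohup, you can follow the log file from the command above with:
$ tail -f /tmp/load.log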
You can run three different command lines directly on the terminal (as a sudo user):
$ biocache process-local-node
$ biocache sample-local-node
$ biocache index-local-node
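If you want these to run unattended with a log, one option (the log path is an arbitrary assumption) is to chain them in the background so each step starts only after the previous one succeeds:
$ nohup sh -c 'biocache process-local-node && biocache sample-local-node && biocache index-local-node' > /tmp/offline-processing.log 2>&1 &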
In fact, the ALA team loads datasets during the week, but they have Jenkins jobs for offline indexing that run twice a week to process, sample and index everything, especially for big datasets (> 100k occurrences).
In older versions of biocache:
$ biocache bulk-processor load -t 7 > data/output_load.log
$ biocache bulk-processor process -t 6 > data/output_process.log
$ biocache bulk-processor index -ps 1000 -t 8 > /data/output_index.log
With the -t option, you specify the number of CPUs you want to use for the process.
With the -ps option, you specify the number of occurrences per page sent to SOLR.
You can run these tasks via Jenkins so you can store the logs of the tasks and share them with your team.
@Todo : Tip: you don't need to enter the biocache environment to execute biocache command lines (flag by institution/ALA production)
@Todo : Instructions to add
Some manual checks you can perform after an occurrence data resource ingestion to verify that the data was ingested correctly:
- Check that the data resource shows a similar number of occurrences in the collectory and in your source (IPT, DwCA). You can check this (see the example calls below):
- Via your biocache-hub web search
- Via your biocache-ws API with calls like: https://biocache-ws.ala.org.au/ws/occurrences/search?q=data_resource_uid:drNUMBER
- Via a direct Solr index search with a similar query.
If your collection is empty, your data resource is probably not correctly mapped to an institution and/or collection.
See the Jenkins page to do this in a more automated way.
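As a sketch of the web service and Solr checks above (the drNUMBER value, the Solr host and the core name are assumptions to adapt to your portal; the first call returns totalRecords, the second returns numFound):
$ curl "https://biocache-ws.ala.org.au/ws/occurrences/search?q=data_resource_uid:drNUMBER&pageSize=0"
$ curl "http://localhost:8983/solr/biocache/select?q=data_resource_uid:drNUMBER&rows=0"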
Other checks:
- If your data resource has multimedia, you can search your image service using the dataResourceUid criteria (a biocache-ws alternative is sketched after this list). Sample:
- Check in your spatial service that your occurrences were also processed correctly.
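If you prefer a quick multimedia check from biocache-ws instead of the image service, one option, assuming your biocache index exposes a multimedia field as ALA's does (adapt the host and drNUMBER), is a filtered occurrence search:
$ curl "https://biocache-ws.ala.org.au/ws/occurrences/search?q=data_resource_uid:drNUMBER&fq=multimedia:Image&pageSize=0"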