Herbarium encoding issues - dr10574, dr376 #1105
Comments
10/09/2024
11/09/2024 Databox Load; Production Load
13/09/2024 Check data resource after SOLR Index; Record counts
13/09/2024 The encoding issue seems to be caused by Pandas. Niels has said that DuckDB reads the data without problems. Counts for new records are not correct. Have rerun with load_dataset; ingest_large_dataset was run by mistake.
14/09/2024
dr10574 - Tasmania Herbarium - TMAG uploads directory
dr376 - Melbourne Herbarium - IPT
Both are failing with encoding errors. For TMAG the cause is likely a non-UTF-8 line-break character; for dr376 it is a single non-UTF-8 character at a specific location.
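To confirm where the offending bytes sit, a small stdlib-only check along these lines could help. It is only a sketch: the function names are hypothetical and not part of the ingestion tooling, and it assumes the file is small enough to read into memory as bytes.

```python
from typing import Iterator, List, Tuple


def find_invalid_utf8(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (byte_offset, offending_bytes) for each invalid UTF-8 sequence."""
    pos = 0
    while pos < len(data):
        try:
            # Decoding succeeds only if the rest of the data is valid UTF-8.
            data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            yield start, data[start:end]
            # Resume scanning just past the bad sequence.
            pos = end


def report_invalid_utf8(data: bytes) -> List[Tuple[int, int, bytes]]:
    """Return (line_number, byte_offset, offending_bytes), counting lines by LF."""
    return [
        (data.count(b"\n", 0, offset) + 1, offset, bad)
        for offset, bad in find_invalid_utf8(data)
    ]
```

For a real file this would be called as `report_invalid_utf8(open(path, "rb").read())`, where `path` is whatever local copy of the archive's data file is being checked. Each slice copies the remaining data, so this is meant for diagnosis rather than bulk processing.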
An option to clean up the data in pre-ingestion before load is not implemented yet.
Solution: run load_dataset for the Herbarium/IPT datasets, as was done for the NZ herbarium; do not use pre-ingestion.
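If a pre-ingestion clean-up step were added later, one possible shape is sketched below. This is not the existing pipeline. It assumes the stray bytes are Latin-1 and that unusual line separators occur inside field values and can be replaced with a space; the handler name `latin1_fallback` and the function `clean_text` are hypothetical.

```python
import codecs


def _latin1_fallback(err: UnicodeError):
    """Error handler: decode bytes that are invalid UTF-8 as Latin-1 (assumption)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Latin-1 maps every byte to a code point, so this decode cannot fail.
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1_fallback", _latin1_fallback)


def clean_text(data: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 only for the invalid bytes."""
    text = data.decode("utf-8", errors="latin1_fallback")
    # Replace unusual line separators (NEL, LS, PS) with a space so they do not
    # split a record; this is an assumption about the data, not a known fact.
    for sep in ("\x85", "\u2028", "\u2029"):
        text = text.replace(sep, " ")
    return text
```

Valid UTF-8 passes through unchanged, so the step only alters the bytes that currently make the load fail.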