SOLR search results: duplicated CKAN datasets #252

Open
mwengren opened this issue Mar 13, 2024 · 10 comments

Comments

@mwengren
Member

mwengren commented Mar 13, 2024

I noticed several issues with the Solr search index on a spot check today.

  1. Overall dataset count is too high because of duplicated datasets: Take GCOOS for example, which has 59,000+ datasets listed on the default datasets view. If you add up the four individual GCOOS harvest WAFs, the total is only 231 + 156 + 24971 + 906 = 26,264. So there is some dataset duplication. I am not sure if this only affects GCOOS or is more widespread.

  2. Poor search results. If I try to search by an individual GCOOS dataset id (see this search for 'Data for ioos-station-wmo-42400'), I get essentially a full list of datasets returned (~76,068 total datasets). The results do appear to be sorted by relevance (most relevant results at top), but essentially no filtering is applied to the result set. Also, the results list from this search contains three copies of the dataset titled 'Data for ioos-station-wmo-42400', which is an example of the duplication described in the first issue above.

We need to look into how to restore the proper search functionality in the current version of SOLR and also troubleshoot why some datasets are clearly duplicated.
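The arithmetic in item 1 can be sketched as a small check. This is only an illustration: the function name `duplication_report` is hypothetical, the four harvest counts are the GCOOS WAF figures quoted above, and 59,000 stands in for the "59,000+" shown on the datasets view.

```python
def duplication_report(harvest_counts, listed_count):
    """Compare the sum of per-source harvest counts with the listed dataset count."""
    expected = sum(harvest_counts)
    excess = listed_count - expected
    # Guard against an empty or all-zero harvest list.
    ratio = listed_count / expected if expected else float("inf")
    return {"expected": expected, "excess": excess, "ratio": round(ratio, 2)}


# GCOOS figures from the comment above; 59000 approximates "59,000+".
gcoos = duplication_report([231, 156, 24971, 906], 59000)
print(gcoos)
```

Running the same comparison per organization would show whether the excess is specific to GCOOS or more widespread.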

cc @benjwadams

@benjwadams
Contributor

This is duplicated at the database level, so clearing up Solr indices will result in more duplicates the next time AWS is run.

I have a couple of system scripts specifically for clearing out database duplicates, although they usually don't need to handle 10k+ datasets, so this may run somewhat slowly. I'm running them now and it will take a while.

@benjwadams
Contributor

Duplicates have been cleared from database and Solr. Closing issue.

@mwengren
Member Author

mwengren commented Apr 8, 2024

This seems to still be present as of 4/8/24. GCOOS =~ 54000 datasets, Glider DAC =~ 6800.

Still need to diagnose what's causing the counts to be off.

What's the issue with the CKAN database that we're seeing so many duplicates, if it's not a problem in Solr?

Separately, there is also a Solr search results issue. If this should be separated into another issue, let's do that:

Solr is giving poor search filtering results from its index. If I search for 'osu592-20230524T1813-delayed' for example, I get 71,000 results. No filtering applied.

There can't be 71000 datasets with that string occurring in the index, so something is not working that used to work (in earlier Solr versions?).
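One way to check the match count outside the web UI is to ask the CKAN action API directly. The sketch below only builds the request URL. It assumes the Catalog exposes the standard CKAN `package_search` action at the usual `/api/3/action/` path, and the helper name `count_query_url` is hypothetical. With `rows=0`, the response should carry only the match count (`result.count`) and no dataset records.

```python
from urllib.parse import urlencode


def count_query_url(base_url, phrase):
    """Build a package_search URL that asks only for the number of matches."""
    # Quoting the phrase requests an exact phrase match rather than
    # a match on any of its individual tokens.
    params = {"q": f'"{phrase}"', "rows": 0}
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{urlencode(params)}"


# Assumption: the Catalog's CKAN API lives under https://data.ioos.us.
url = count_query_url("https://data.ioos.us", "osu592-20230524T1813-delayed")
print(url)
```

Comparing the count from the quoted phrase with the count from the unquoted string may help show whether the query is being split into individual tokens before it reaches the index.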

@mwengren
Member Author

mwengren commented Apr 8, 2024

I created a new issue #253 to track the Solr issue separately.

@mwengren mwengren changed the title SOLR search index issues/duplicated datasets SOLR search results: duplicated CKAN datasets Apr 8, 2024
@mwengren
Member Author

mwengren commented Apr 8, 2024

@benjwadams says that he'll need to clear out the CKAN database and reharvest. The issue may recur.

It's unclear whether the CKAN database uses the ISO XML title value or the XML 'fileIdentifier' field value. We'll need to keep an eye on this to understand how to minimize the consequences going forward.

@mwengren mwengren closed this as completed Apr 8, 2024
@mwengren mwengren reopened this Apr 8, 2024
@mwengren
Member Author

We're still seeing the issue with inaccurate dataset counts. Here's a tally for GCOOS:

We're getting roughly 2x the count in the SOLR index compared to what is in the database. Coincidence?

@benjwadams If the harvest counts as listed above are coming from the CKAN database, how do you know that this is caused by duplicated datasets in the CKAN database and not an issue in SOLR?

I'm not sure if this affects providers other than GCOOS.

Spot check of AOOS looks better: SOLR dataset count for AOOS: 2,765, harvest source counts: AOOS ERDDAP WAF 2607 + AOOS WAF 127 = 2,734. Would be better if those matched exactly, but this is good enough all things considered.

@benjwadams
Contributor

benjwadams commented May 14, 2024

Counts have improved considerably since the deduplication scripts were run.

@mwengren
Member Author

mwengren commented Jun 3, 2024

@benjwadams has manual cleanup scripts that remove duplicates from the database first and then clear the corresponding datasets from the SOLR index. The scripts check each dataset ID for a numeric suffix and remove the dataset if one is present (a numeric suffix indicates a potential duplicate). This is not an entirely safe check, but it is the best that we have.
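For illustration, the numeric-suffix heuristic could be sketched as below. This is not the actual cleanup script: `find_suffix_duplicates` and the sample names are hypothetical, and the sketch adds one extra condition that is an assumption on my part, namely that a suffixed name is only reported when the same name without the suffix also exists. It only reports candidates and deletes nothing.

```python
import re

# A base ending in a non-digit, followed by a run of digits at the end.
SUFFIX_RE = re.compile(r"^(?P<base>.*?\D)(?P<num>\d+)$")


def find_suffix_duplicates(names):
    """Group names of the form base + digits under a base that is also present."""
    present = set(names)
    groups = {}
    for name in names:
        match = SUFFIX_RE.match(name)
        # Assumption: only flag a suffixed name if its unsuffixed form exists.
        if match and match.group("base") in present:
            groups.setdefault(match.group("base"), []).append(name)
    return groups


# Hypothetical dataset names for illustration only.
candidates = find_suffix_duplicates([
    "indian-river",
    "indian-river2",
    "indian-river3",
    "data-for-ioos-station-wmo-42400",
])
print(candidates)
```

Requiring the unsuffixed name to exist keeps legitimate IDs that happen to end in digits (such as a WMO station number) out of the candidate list, which addresses part of the safety concern, though it would miss duplicates whose original has already been removed.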

Duplicates can result from harvest sources that have been removed from CKAN while their datasets were not fully cleared from the database as part of the removal process. If the same source is re-harvested afterwards, duplicate datasets can be created.

We can keep this as a manual option to be run if necessary. Minor changes to the script would be needed to run it routinely.

@mwengren mwengren closed this as completed Jun 3, 2024
@mwengren
Member Author

@benjwadams A spot check of IOOS Catalog today shows this is happening again, possibly in a snowball-ish sort of way.

When I looked yesterday at https://data.ioos.us/dataset/, there were approx. 80,000 datasets listed in the Solr index.

Today, it's over 90,000! IOOS has definitely not accumulated an additional 10K datasets in the past day.
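A jump like this could be caught automatically with a simple day-over-day check. The sketch below is hypothetical: the function name `flag_jump` and the 5% threshold are arbitrary assumptions, and the two counts are the approximate figures mentioned above.

```python
def flag_jump(previous, current, max_growth=0.05):
    """Return (flagged, growth) for a day-over-day change in dataset count."""
    growth = (current - previous) / previous
    # Assumption: growth above max_growth in one day is treated as suspicious.
    return growth > max_growth, growth


# Approximate counts from yesterday and today.
flagged, growth = flag_jump(80000, 90000)
print(flagged, growth)
```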

Can we look into automating the aforementioned cleanup script (see my previous comment to that effect: #252 (comment))? Alternatively, is there something within the CKAN harvesting code that could be troubleshot to prevent the duplicated datasets from being created during the ingest process in the first place?

We can discuss the best path forward at next week's meeting but, for now, can you attempt to clear duplicates from Catalog?

GCOOS:
GCOOS again appears to be the worst 'offender' (no blame intended), with over 60K datasets listed by Solr on the datasets page: https://data.ioos.us/dataset/.

The GCOOS Historical WAF source looks to be the main cause. It should have roughly 25K datasets in it but is currently listing 65,000+: https://data.ioos.us/harvest/gcoos-waf-historical.

SECOORA:
SECOORA's datasets are also a problem. In SECOORA's case, however, SOLR yields a count of 15,000+ datasets (https://data.ioos.us/dataset/?organization=secoora), whereas the primary ERDDAP harvest source (https://data.ioos.us/harvest/secoora-erddap) only shows 7900.

So in this case it's both the harvest source being duplicated (SECOORA's ERDDAP only has 1,600 datasets - https://erddap.secoora.org/erddap/index.html) and the SOLR count that's off from the already inflated 7900 dataset count resulting from the harvest.

Overall, it seems our harvesting system isn't holding up to what it's being asked to do, or there's a major bug in CKAN's harvesting code causing all of these duplicates to be generated repeatedly. Either way, we need the Catalog to be more stable.

@mwengren
Member Author

mwengren commented Oct 7, 2024

As of 10/7, it looks like SECOORA is the org with the most dataset duplicates:

SECOORA ERDDAP WAF harvest source (~5300 datasets): https://data.ioos.us/harvest/secoora-erddap

SECOORA ERDDAP (~16000 datasets): https://erddap.secoora.org/erddap/index.html

Doing a search for 'Indian River' as an example shows multiple datasets whose URLs carry the numeric duplicate suffix ('https://..../dataset', 'dataset2', 'dataset3', etc.).

Labels
None yet
Projects
Status: In progress
Development

No branches or pull requests

2 participants