fileCount in Solr Documents and API performance #8941
@luddaniel sorry for the slow response. I'm trying to understand what you're up to. 😄 Does the field have to be custom? Are you simply saying, "I want to know which datasets are not using field x?" (By "no files" I assume you mean "no documents". Solr calls results "documents".)
If you're concerned about performance, please try querying Solr directly to see if it's faster. The Dataverse Search API does (sadly) add some overhead, but I'm not sure how much. If the SQL queries are working for you, perhaps you can make a pull request to add an API endpoint.
What is this tool you've built? 😄 You've piqued my curiosity. 😄 Maybe we can chat in real time at https://chat.dataverse.org some day. You're welcome to join!
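(As an aside to the "querying Solr directly" suggestion above, here is a minimal sketch of what that could look like. The host, port, core name, and the `dvObjectType` filter are assumptions about a default local installation; adjust to your setup.)

```python
import requests

# Sketch: hit Solr's select handler directly, bypassing the Dataverse Search API.
# Host, port, and core name ("collection1") are assumptions about a default setup.
solr_url = "http://localhost:8983/solr/collection1/select"
params = {
    "q": "dvObjectType:datasets",  # assumption: how dataset documents are tagged
    "rows": 1000,
    "wt": "json",
}
resp = requests.get(solr_url, params=params)
resp.raise_for_status()
data = resp.json()
print("numFound:", data["response"]["numFound"])
print("QTime (ms):", data["responseHeader"]["QTime"])
```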
Hello @pdurbin :) Thank you for answering.
By "no files" I mean "To retrieve Dataset without Datafile". In other words, I want a SOLR field that tells me this dataset has 32 files. So I can ask Dataverse Search with :
That's what I want to do, but I can't, because there is no Solr field containing information about the datafiles owned by a dataset (1 dataset = 1 Solr document).
Sadly, I don't have enough time at work to contribute at the moment, but our team is thinking about contributing to the Dataverse software; someday we will push code for sure!
We created a little Python web application to help the support team (non-technical people) of our Dataverse installation.
I will sign in and hope to chat with you soon. It might be easier to exchange via screen sharing.
@luddaniel first, I'd like to point you toward a SQL query that might help as a workaround. It's from the useful queries doc linked from https://guides.dataverse.org/en/5.12.1/admin/reporting-tools-and-queries.html
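Along those lines, here is a rough sketch (not the exact query from that doc) of how one might list datasets with zero files straight from the database. It assumes the standard Dataverse schema, where both datasets and datafiles are rows in dvobject and a datafile's owner_id points to its dataset; the connection details are placeholders.

```python
import psycopg2

# Rough sketch, not the exact query from the reporting-tools doc.
# Assumes the standard Dataverse schema: dvobject rows with dtype
# 'Dataset' or 'DataFile', where a datafile's owner_id is its dataset.
SQL = """
SELECT ds.id AS dataset_id, COUNT(df.id) AS file_count
FROM dvobject ds
LEFT JOIN dvobject df ON df.owner_id = ds.id AND df.dtype = 'DataFile'
WHERE ds.dtype = 'Dataset'
GROUP BY ds.id
HAVING COUNT(df.id) = 0;
"""

conn = psycopg2.connect(dbname="dvndb", user="dvnapp")  # placeholders; use your own settings
with conn, conn.cursor() as cur:
    cur.execute(SQL)
    for dataset_id, file_count in cur.fetchall():
        print(dataset_id, file_count)
```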
I see what you mean about fileCount. If I search for my dataset at https://dataverse.harvard.edu/api/search?q=dsPersistentId:%22doi:10.7910/DVN/TJCLKP%22 I can see fileCount in the JSON response for the dataset.
However, this number isn't indexed into Solr; it gets pulled from the database after the Solr results come back.
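(For reference, a small sketch that reads fileCount from the Search API response for that dataset; the URL is the one above, and the "data" → "items" structure is the standard Search API envelope.)

```python
import requests

# Read fileCount for one dataset from the Dataverse Search API.
# The URL is the example above; "data" -> "items" is the Search API envelope.
url = "https://dataverse.harvard.edu/api/search"
params = {"q": 'dsPersistentId:"doi:10.7910/DVN/TJCLKP"'}
resp = requests.get(url, params=params)
resp.raise_for_status()
for item in resp.json()["data"]["items"]:
    print(item["name"], item.get("fileCount"))
```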
It looks like I added it in 055fa5f as part of the PR that introduced fileCount; @landreev even asked why we don't just index the number. I guess we should have. 😞
Hello, I have recently worked on a custom tool that uses the Search API of our Dataverse installation, and I was questioning some of the choices made in the code.
My need was to find all datasets missing one custom field and without any files.
The custom field part is OK:
/api/v1/search?q=-alternativeURL:*&type=dataset
But "no files" is another story. I found that the JSON response has a fileCount property that is constructed by Java code after the Solr result comes back.
The problem is that the search is really slow if you ask for a lot of data. For example:
/api/v1/search?q=*&type=dataset&subtree=root&sort=name&order=asc&per_page=1000
takes 37 seconds to return a 2.3 MB JSON file of 1,000 datasets on our Dataverse installation (1,991 datasets and 29,844 datafiles). I ended up doing it using mostly SQL queries.
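(For what it's worth, a quick sketch of how to reproduce that measurement; the base URL is a placeholder for your own installation.)

```python
import time
import requests

# Sketch: time the slow Search API call described above.
# The base URL is a placeholder for your own installation.
url = "https://your-dataverse.example.org/api/v1/search"
params = {"q": "*", "type": "dataset", "subtree": "root",
          "sort": "name", "order": "asc", "per_page": 1000}

start = time.perf_counter()
resp = requests.get(url, params=params)
elapsed = time.perf_counter() - start
print(f"{elapsed:.1f} s, {len(resp.content) / 1e6:.1f} MB")
```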
I think it would be interesting to have this information on the Solr document instead, in order to get better performance and, above all, to be able to ask
fileCount:0
in the Dataverse repository UI search bar or via the Search API. It could also be displayed in the search result snippet of a Dataverse installation, or in a custom tool like mine (e.g. "66 files").
I have seen that this question and idea already came up in the past, during the
fileCount
implementation: Dataset file count in search results from API #6601 and Search API show fileCount for datasets #6601 (#6623).
I'm a bit disturbed to see a simple query based on a Solr search hitting 37 seconds for only 1,000 elements.
My guess is that too much SQL is done after the Solr result comes back.
The point of using Solr is to get an ultra-fast response; in my humble opinion it starts to lose its value here, and it would be faster to go 100% SQL or 100% Solr, don't you think?
As a Java developer, I do understand that in a complex application these choices are really complicated, because of the maintenance of SQL queries and criteria, and the wish not to overload the Solr documents.
But there is a performance danger in making SQL queries for each Solr document found.
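(To make that concern concrete, a schematic sketch of the two access patterns; the function names are made up for illustration and are not Dataverse's actual code.)

```python
# Schematic only; names are made up for illustration, not Dataverse's actual code.

def count_files_in_database(dataset_id):
    # Stand-in for a per-dataset SQL COUNT(*) against the datafile table.
    return 0

def current_pattern(solr_hits):
    # One Solr query, then one extra database round trip per dataset hit
    # to fill in fileCount: N hits -> N additional SQL queries.
    for hit in solr_hits:
        hit["fileCount"] = count_files_in_database(hit["dataset_id"])
    return solr_hits

def indexed_pattern(solr_hits):
    # If fileCount were indexed into each Solr document, the value would
    # already be present on the hit and no per-result SQL would be needed.
    return solr_hits
```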
I might be off topic; feel free to correct me.
I can't wait to hear back from you and, as always, thank you for your time and your service to the Dataverse community.