
fileCount in Solr Documents and API performance #8941

Closed
luddaniel opened this issue Aug 26, 2022 · 4 comments · Fixed by #10598

Comments

@luddaniel
Contributor

Hello, I have recently worked on a customized tool that uses the Dataverse installation Search API, and I was wondering about some of the choices made in the code.

My need was to find all datasets that are missing one custom field and have no files.
The custom field part is fine: /api/v1/search?q=-alternativeURL:*&type=dataset
But for "no files"... I found that the JSON response has a fileCount property constructed in Java code after the Solr result is returned.
The problem is that search is really slow if you ask for a lot of data.
Example: /api/v1/search?q=*&type=dataset&subtree=root&sort=name&order=asc&per_page=1000 takes 37 seconds to return a 2.3 MB JSON file of 1000 datasets on our Dataverse installation (1,991 datasets and 29,844 datafiles).
I ended up doing it mostly with SQL queries.

  1. I think it would be interesting to have this information on the Solr document instead, both for better performance and so that fileCount:0 could be queried from the Dataverse repository UI search bar or through the Search API.
    It could also be displayed in the Dataverse installation search result snippet, or in a customized tool as in my case (e.g. "66 files").
    I have seen that this question and idea came up in the past during the fileCount implementation: Dataset file count in search results from API #6601 and Search API show fileCount for datasets #6623.

  2. I'm a bit disturbed to see a simple Solr-based query take 37 seconds for only 1,000 elements.
    My guess is that too much SQL work is done after the Solr result is returned.
    The point of using Solr is to get ultra-fast responses; in my humble opinion that benefit starts to get lost here. Wouldn't it be faster to go either 100% SQL or 100% Solr?
    As a Java developer I understand that in a complex application these choices are genuinely difficult, between maintaining SQL queries and criteria on one side and not overloading the Solr documents on the other.
    But there is a performance danger in making SQL queries for each Solr document found.
    I might be off topic; feel free to correct me.
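For reference, the client-side workaround described above (paging through the Search API and filtering on the fileCount property of each JSON result) can be sketched in Python; the base URL is a placeholder for a real Dataverse installation:

```python
# Sketch of the client-side workaround: page through the Search API and
# keep datasets whose fileCount (computed server-side) is zero.
# The base URL below is a placeholder, not a real installation.
import urllib.parse

BASE = "https://demo.example.org/api/v1/search"

def search_url(page, per_page=100):
    # Build one page of the dataset search described above.
    params = {
        "q": "*",
        "type": "dataset",
        "subtree": "root",
        "per_page": per_page,
        "start": page * per_page,
    }
    return BASE + "?" + urllib.parse.urlencode(params)

def datasets_without_files(items):
    # Each item is one entry of data["items"] from the Search API JSON.
    return [d for d in items if d.get("fileCount", 0) == 0]
```

Fetching each page (for example with urllib.request) and feeding data["items"] through datasets_without_files would collect the empty datasets, at the cost of the slow per-page response times noted above.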

I look forward to your reply and, as always, thank you for your time and your service to the Dataverse software community.

@pdurbin
Member

pdurbin commented Oct 5, 2022

@luddaniel sorry for the slow response. I'm trying to understand what you're up to. 😄

Does the field have to be custom? Are you simply saying, "I want to know which datasets are not using field x?"

(By "no files" I assume you mean "no documents". Solr calls results "documents".)

If you're concerned about performance, please try querying Solr directly to see if it's faster. The Dataverse Search API does (sadly) add some overhead but I'm not sure how much.

If the SQL queries are working for you, perhaps you could open a pull request adding an API endpoint.

What is this tool you've built? 😄 You've piqued my curiosity. 😄

Maybe we can chat in real time at https://chat.dataverse.org some day. You're welcome to join!

@luddaniel
Contributor Author

Hello @pdurbin :) Thank you for answering.

Does the field have to be custom? Are you simply saying, "I want to know which datasets are not using field x?"
(By "no files" I assume you mean "no documents". Solr calls results "documents".)

By "no files" I mean "To retrieve Dataset without Datafile".
Maybe my text was mixing too much informations ^^

In other words, I want a SOLR field that tells me this dataset has 32 files. So I can ask Dataverse Search with : fileCount=0 so I retrieve datasets without files or fileCount>10 so I retrieve datasets with at least 10 files.

If you're concerned about performance, please try querying Solr directly to see if it's faster. The Dataverse Search API does (sadly) add some overhead but I'm not sure how much.

That's what I want to do, but I can't, because there is no Solr field containing information about the datafiles owned by a dataset (1 dataset = 1 Solr document).
For example: fileCount (int type) or fileIDs (array type).
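If such a field existed, Solr's standard range syntax would cover both cases. A small sketch (fileCount is the hypothetical field proposed above, not an existing Solr field in Dataverse):

```python
# Build Solr query strings for a hypothetical integer field "fileCount".
# Solr range syntax is field:[low TO high], with * as an open bound.
def file_count_query(minimum=None, maximum=None):
    low = "*" if minimum is None else str(minimum)
    high = "*" if maximum is None else str(maximum)
    return f"fileCount:[{low} TO {high}]"

# Datasets without files:
no_files = file_count_query(0, 0)      # fileCount:[0 TO 0]
# Datasets with at least 10 files:
ten_plus = file_count_query(10, None)  # fileCount:[10 TO *]
```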

If the SQL queries are working for you perhaps you can make a pull request to make an API endpoint.

Sadly, I don't have enough time at work to contribute at the moment, but our team is thinking about contributing to the Dataverse software; someday we will push code for sure!

What is this tool you've built? You've piqued my curiosity.

We created a little Python web application to help the support team (non-technical people) of our Dataverse installation.
It uses Dataverse API endpoints, SQL queries, and the Dataverse Search API. At the moment it:

  1. Shows all datasets without datafiles and with an empty alternativeURL field (so we can ask scientists to populate their datasets with data)
  2. Imports a list of emails to add those users to a group (mass user configuration)
  3. Helps merge non-admin users by selecting, from two lists of users, one user to merge and one user to keep.

Maybe we can chat in real time at https://chat.dataverse.org some day. You're welcome to join!

I will sign up; I hope to chat with you soon. It might be easier to exchange ideas using screen sharing.
Sincerely, Ludovic.

@pdurbin
Member

pdurbin commented Nov 9, 2022

@luddaniel first, I'd like to point you toward a SQL query that might help as a workaround. It's from the useful queries doc linked from https://guides.dataverse.org/en/5.12.1/admin/reporting-tools-and-queries.html

List the titles of datasets without files (4.20):
select dvo.createdate as date_of_creation,
       dfv.value as title,
       dvo.publicationdate as publication_date
from datasetfieldvalue dfv
join datasetfield df on dfv.datasetfield_id = df.id
join datasetversion dv on df.datasetversion_id = dv.id
join dvobject dvo on dv.dataset_id = dvo.id
where datasetfield_id not in
    (select id
     from datasetfield
     where datasetversion_id in
         (select id
          from datasetversion
          where id in
              (select datasetversion_id
               from filemetadata)))
  and datasetfield_id in
    (select id
     from datasetfield
     where datasetfieldtype_id in
         (select id
          from datasetfieldtype
          where name = 'title'))
order by dvo.createdate;

I see what you mean about fileCount. If I search for my dataset at https://dataverse.harvard.edu/api/search?q=dsPersistentId:%22doi:10.7910/DVN/TJCLKP%22 I see it like this:

"fileCount": 111,

However, this number isn't indexed into Solr. It gets pulled from the database like this:

nullSafeJsonBuilder.add("fileCount", dv.getFileMetadatas().size());

It looks like I added it in 055fa5f as part of this PR:

@landreev even asked why we don't just index the number. I guess we should have. 😞
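Indexing the count at index time, as @landreev suggested, would make it directly queryable. A hypothetical sketch of the idea (Dataverse's actual indexing code is Java; this is only an illustration, and the fileCount field name is the one proposed above):

```python
# Hypothetical sketch: include fileCount in the Solr document built for a
# dataset at index time, instead of computing it from the database for
# every search hit.
def build_dataset_solr_doc(dataset_id, title, file_metadatas):
    return {
        "id": f"dataset_{dataset_id}",
        "title": title,
        # Counted once here; queries like fileCount:0 then need no SQL.
        "fileCount": len(file_metadatas),
    }

doc = build_dataset_solr_doc(42, "My Dataset", [])
```

The trade-off, as noted earlier in the thread, is that the indexed count must be kept in sync when files are added or removed, which is the usual cost of denormalizing into Solr.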

@luddaniel
Contributor Author

[Screenshot attached, 2024-05-30]

@DS-INRAE DS-INRAE moved this to 🚧 Dev by Recherche Data Gouv in Recherche Data Gouv Jul 10, 2024
@github-project-automation github-project-automation bot moved this from 🚧 Dev by Recherche Data Gouv to Done in Recherche Data Gouv Nov 22, 2024
@pdurbin pdurbin added this to the 6.5 milestone Nov 24, 2024