fileCount in Solr Documents and API performance #8941
@luddaniel sorry for the slow response. I'm trying to understand what you're up to. 😄 Does the field have to be custom? Are you simply saying, "I want to know which datasets are not using field x?" (By "no files" I assume you mean "no documents". Solr calls results "documents".)
If you're concerned about performance, please try querying Solr directly to see if it's faster. The Dataverse Search API does (sadly) add some overhead, but I'm not sure how much. If the SQL queries are working for you, perhaps you can make a pull request to add an API endpoint.
What is this tool you've built? 😄 You've piqued my curiosity. 😄 Maybe we can chat in real time at https://chat.dataverse.org some day. You're welcome to join!
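(As an aside to the "querying Solr directly" suggestion above, here is a minimal sketch of what that could look like. The host, port, core name, and the `dvObjectType` filter are assumptions about a default local installation; adjust to your setup.)

```python
import requests

# Sketch: hit Solr's select handler directly, bypassing the Dataverse Search API.
# Host, port, and core name ("collection1") are assumptions about a default setup.
solr_url = "http://localhost:8983/solr/collection1/select"
params = {
    "q": "dvObjectType:datasets",  # assumption: how dataset documents are tagged
    "rows": 1000,
    "wt": "json",
}
resp = requests.get(solr_url, params=params)
resp.raise_for_status()
data = resp.json()
print("numFound:", data["response"]["numFound"])
print("QTime (ms):", data["responseHeader"]["QTime"])
```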
Hello @pdurbin :) Thank you for answering.
By "no files" I mean "To retrieve Dataset without Datafile". In other words, I want a SOLR field that tells me this dataset has 32 files. So I can ask Dataverse Search with :
That's what I want to do, but I can't, because there is no Solr field containing information about the datafiles owned by a dataset (1 dataset = 1 Solr document).
Sadly, I don't have enough time at work to contribute at the moment, but our team is thinking about contributing to the Dataverse software; someday we will push code for sure!
We created a little Python web application to help the support team (non-technical people) of our Dataverse installation.
I will sign in and hope to chat with you soon. It might be easier to exchange via screen sharing.
@luddaniel first, I'd like to point you toward a SQL query that might help as a workaround. It's from the useful queries doc linked from https://guides.dataverse.org/en/5.12.1/admin/reporting-tools-and-queries.html
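Along those lines, here is a rough sketch (not the exact query from that doc) of how one might list datasets with zero files straight from the database. It assumes the standard Dataverse schema, where both datasets and datafiles are rows in dvobject and a datafile's owner_id points to its dataset; the connection details are placeholders.

```python
import psycopg2

# Rough sketch, not the exact query from the reporting-tools doc.
# Assumes the standard Dataverse schema: dvobject rows with dtype
# 'Dataset' or 'DataFile', where a datafile's owner_id is its dataset.
SQL = """
SELECT ds.id AS dataset_id, COUNT(df.id) AS file_count
FROM dvobject ds
LEFT JOIN dvobject df ON df.owner_id = ds.id AND df.dtype = 'DataFile'
WHERE ds.dtype = 'Dataset'
GROUP BY ds.id
HAVING COUNT(df.id) = 0;
"""

conn = psycopg2.connect(dbname="dvndb", user="dvnapp")  # placeholders; use your own settings
with conn, conn.cursor() as cur:
    cur.execute(SQL)
    for dataset_id, file_count in cur.fetchall():
        print(dataset_id, file_count)
```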
I see what you mean about fileCount. If I search for my dataset at https://dataverse.harvard.edu/api/search?q=dsPersistentId:%22doi:10.7910/DVN/TJCLKP%22 I can see fileCount in the JSON response for the dataset.
However, this number isn't indexed into Solr; it gets pulled from the database after the Solr results come back.
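(For reference, a small sketch that reads fileCount from the Search API response for that dataset; the URL is the one above, and the "data" → "items" structure is the standard Search API envelope.)

```python
import requests

# Read fileCount for one dataset from the Dataverse Search API.
# The URL is the example above; "data" -> "items" is the Search API envelope.
url = "https://dataverse.harvard.edu/api/search"
params = {"q": 'dsPersistentId:"doi:10.7910/DVN/TJCLKP"'}
resp = requests.get(url, params=params)
resp.raise_for_status()
for item in resp.json()["data"]["items"]:
    print(item["name"], item.get("fileCount"))
```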
It looks like I added it in 055fa5f as part of the PR that introduced fileCount; @landreev even asked why we don't just index the number. I guess we should have. 😞
Hello, I have recently worked on a custom tool that uses the Search API of our Dataverse installation, and I was questioning some of the choices made in the code.
My need was to find all datasets missing one custom field and without any files.
The custom field part is OK:
/api/v1/search?q=-alternativeURL:*&type=dataset
But "no files" is another story. I found that the JSON response has a fileCount property that is constructed by Java code after the Solr result comes back.
The problem is that the search is really slow if you ask for a lot of data. For example:
/api/v1/search?q=*&type=dataset&subtree=root&sort=name&order=asc&per_page=1000
takes 37 seconds to return a 2.3 MB JSON file of 1,000 datasets on our Dataverse installation (1,991 datasets and 29,844 datafiles). I ended up doing it using mostly SQL queries.
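(For what it's worth, a quick sketch of how to reproduce that measurement; the base URL is a placeholder for your own installation.)

```python
import time
import requests

# Sketch: time the slow Search API call described above.
# The base URL is a placeholder for your own installation.
url = "https://your-dataverse.example.org/api/v1/search"
params = {"q": "*", "type": "dataset", "subtree": "root",
          "sort": "name", "order": "asc", "per_page": 1000}

start = time.perf_counter()
resp = requests.get(url, params=params)
elapsed = time.perf_counter() - start
print(f"{elapsed:.1f} s, {len(resp.content) / 1e6:.1f} MB")
```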
I think it would be interesting to have this information on the Solr document instead, in order to get better performance and, above all, to be able to ask
fileCount:0
in the Dataverse repository UI search bar or via the Search API. It could also be displayed in the search result snippet of a Dataverse installation, or in a custom tool like mine (e.g. "66 files").
I have seen that this question and idea already came up in the past, during the
fileCount
implementation: Dataset file count in search results from API #6601 and Search API show fileCount for datasets #6601 (#6623).
I'm a bit disturbed to see a simple query based on a Solr search hitting 37 seconds for only 1,000 elements.
My guess is that too much SQL is done after the Solr result comes back.
The point of using Solr is to get an ultra-fast response; in my humble opinion it starts to lose its value here, and it would be faster to go 100% SQL or 100% Solr, don't you think?
As a Java developer, I do understand that in a complex application these choices are really complicated, because of the maintenance of SQL queries and criteria, and the wish not to overload the Solr documents.
But there is a performance danger in making SQL queries for each Solr document found.
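(To make that concern concrete, a schematic sketch of the two access patterns; the function names are made up for illustration and are not Dataverse's actual code.)

```python
# Schematic only; names are made up for illustration, not Dataverse's actual code.

def count_files_in_database(dataset_id):
    # Stand-in for a per-dataset SQL COUNT(*) against the datafile table.
    return 0

def current_pattern(solr_hits):
    # One Solr query, then one extra database round trip per dataset hit
    # to fill in fileCount: N hits -> N additional SQL queries.
    for hit in solr_hits:
        hit["fileCount"] = count_files_in_database(hit["dataset_id"])
    return solr_hits

def indexed_pattern(solr_hits):
    # If fileCount were indexed into each Solr document, the value would
    # already be present on the hit and no per-result SQL would be needed.
    return solr_hits
```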
I might be off topic; feel free to correct me.
I can't wait to hear back from you and, as always, thank you for your time and your service to the Dataverse community.