Documentation: Clarify how managed/classic schema.xml works, use cases, how to operationalize it. #7306
Comments
@poikilotherm is this something that you're willing to take on? If not, we'll plan to pull it into a future sprint. Thanks!
Just an update: I was digging back through the 4.17 release notes, and there were some instructions that might provide part of the answer. Some of the discussion here about the update script checking which blocks are in the db also made sense. https://github.com/IQSS/dataverse/releases/tag/v4.17

> Flexible Solr Schema, optionally reconfigure Solr
> This is optional, but all future changes will go to these files. It might be a good idea to reconfigure Solr now or be aware to look for changes to these files in the future, too. Here's how: You will need to replace or modify your schema.xml with the recent one (containing XML includes)

I think the goals would be:

1. What is the path for an older installation using a single schema.xml to switch to the managed setup?
2. Is there a single, normalized way we can recommend for describing schema updates in release notes going forward? If not a single way, then describe the use cases.

For instance, for 1., follow the instructions quoted above from the 4.17 release notes (a hedged sketch is below). For 2., developers post changes to the sub-schema files, those are moved into place, and then the update script is run as in the v5.1 release notes; the open question is whether the update script should be run selectively. Adding a new block would follow the normal add-block API instructions per the guide, including running the update script per the guide. Does this sound correct?
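To make the path in goal 1 concrete, here is a minimal sketch of the classic-to-managed migration, assuming the Solr conf directory used elsewhere in this thread and that the release's schema.xml plus the two include files have been downloaded into the current directory; all paths and file names are assumptions to verify against the 4.17 release notes:

```sh
# Hedged sketch: migrate a classic single-file schema.xml install to the
# managed (include-based) setup, per the v4.17 release notes quoted above.
# Paths and version numbers are assumptions; adjust for your installation.
SOLR_CONF=/usr/local/solr/solr-7.7.2/server/solr/collection1/conf

# 1. Back up the existing monolithic schema.xml.
cp "$SOLR_CONF/schema.xml" "$SOLR_CONF/schema.xml.bak"

# 2. Install the new schema.xml (the one containing XML includes)
#    together with the two include files it references.
cp schema.xml schema_dv_mdb_fields.xml schema_dv_mdb_copies.xml "$SOLR_CONF/"

# 3. Reload Solr so it picks up the new configuration.
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"
```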
@poikilotherm Replied via Slack:
@poikilotherm @pdurbin Thinking more about our current system, which uses includes to separate the default metadata block fields out of schema.xml: did we go from a simple, convenient system to a more precise but complicated one?

In our prior system, all source-controlled metadata block fields were added to a single schema.xml. Changes were also made there, and the update instructions were simply to replace schema.xml and restart or reload Solr. This worked for all community members who used default and custom metadata blocks, as long as those blocks were under project source control.

In our current system, if I understand correctly (please comment if I do not), only default metadata blocks are included in the schema include files. Does this mean we do not source control non-default index field changes, such as the mra collection? Also, does it mean that any installation that uses non-default metadata blocks, e.g. Harvard, now needs to always update its schema manually, either in schema.xml if using that method or in the newer include schema files?

So would a more correct set of update instructions for release notes look something like this? First, assuming index changes are for default metadata blocks only (not sure what to do with non-default blocks):

Second, if changes were made to non-default metadata blocks via .tsv files:

Note: the 5.1 release note instructions provided by the developer would have worked for installations that do not use custom metadata blocks. If Harvard had followed the developer's release note instructions for updating the schema for 5.1, we would have dropped our schema config for non-default metadata blocks. The updated instructions added a note to run updateSchemaMDB.sh first, but that alone would effectively ignore any new changes in 5.1. Complete instructions would be: run updateSchemaMDB.sh to generate complete schema include files, identify the changes to the include files from 5.1, then manually add those changes to the generated include files and reload Solr (a sketch of this merge workflow follows below).

Another note: I'm not sure what the cost/benefit analysis was for making this change, but if the goal is simply to separate out schema items that are in less general use, might we add a third include file that preserves those items and allows them to be source controlled, as long as their metadata blocks are also under project source control? It seems like this gets more and more complicated, and I wonder whether it is worth it.

A third note (a question, really): are there any other changes someone might reasonably be expected to make to schema.xml that are not field related? I know config options like search weighting go in solrconfig.xml, but I'm not sure whether any are in schema.xml.
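To spell out the "complete instructions" in the first note, here is a hedged sketch of the merge workflow; it assumes the same Solr conf path as above and that the 5.1 distribution's include files sit in the current directory (see the metadatacustomization guide for the script's options):

```sh
# Hedged sketch: preserve custom fields AND pick up the 5.1 schema changes.
# Paths are assumptions based on this thread; adjust for your installation.
SOLR_CONF=/usr/local/solr/solr-7.7.2/server/solr/collection1/conf

# 1. Regenerate the include files from the blocks loaded in this installation.
#    This preserves custom/non-default fields, but knows nothing about fields
#    added in 5.1 whose updated blocks have not been loaded yet.
./updateSchemaMDB.sh

# 2. Diff the generated files against the 5.1 distribution copies to find
#    the new 5.1 fields that must be merged in manually.
diff "$SOLR_CONF/schema_dv_mdb_fields.xml" schema_dv_mdb_fields.xml
diff "$SOLR_CONF/schema_dv_mdb_copies.xml" schema_dv_mdb_copies.xml

# 3. After merging the differences by hand, reload Solr.
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"
```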
(Just to repeat what I said on Slack earlier:) I have updated the prod Solr config, using the schema.xml file from the distribution (with the
@landreev I didn't see your Slack post; where was it posted? Is this updated to v5.1.1, or just to the baseline managed schema config? I mention it because all Harvard fields are removed from these files, so they need special (manual) handling: the update script will write them out to files, but then what method do you use to combine them with the newer source-controlled files? I am also wondering what this may mean for other installations; see my comments above.

OK, I saw your comments in Slack; they were in a thread. I've followed up there with essentially the same question I posted here.
So, after discussion with Leonid and others, it looks like the following is how it works. An example of release notes, from v5.1, that illustrates this: https://github.com/IQSS/dataverse/releases/tag/v5.1

> Additional Upgrade Steps
> wget https://github.com/IQSS/dataverse/releases/download/v5.1/biomedical.tsv
> Check if your Solr installation is running with the latest schema.xml config file (https://github.com/IQSS/dataverse/releases/download/v5.1/schema.xml), update if needed.
> Run the script updateSchemaMDB.sh to generate updated solr schema files and preserve any other custom fields in your Solr configuration.

Thanks, everyone, for clarifying what appears simple once known but was a bit mystifying to some of us.
Hey @kcondon, thanks for writing all of this down. Yet I'm not sure I get what the underlying problem is. When deploying a Dataverse installation, people using custom metadata blocks need to take care of their Solr config. No one stops people from skipping the script and putting a file in source control instead. But why should the upstream project, which Dataverse now is, force people to load metadata blocks they don't use and that are customized for Harvard's usage?

Before #6146 was merged, everyone had to clean up the upstream files and put their field config inside the configuration manually. Now people can use a script as a less error-prone way to adjust their config. But no one has to stop using a source-controlled version of the configuration file! Instead, they could use the script to update the file, add the patch to version control, and deploy from there in multiple ways (see the sketch below). So why should Harvard Dataverse be in the exclusive position of source controlling its Solr config in the upstream project? I bet there is some private configuration space for Harvard Dataverse anyway, so you could basically regain the simplicity you favor by source controlling it there.

However, the script was invented because community members had trouble maintaining their Solr config, specifically with how to add their custom metadata to it. The API endpoint invented by @pdurbin was a great relief there. The overall goal should be solving #5989 and throwing away this script, which was always meant to be a temporary solution.
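For what it's worth, a minimal sketch of the "script plus source control" workflow suggested above; the private repo path and layout are purely illustrative assumptions:

```sh
# Hedged sketch: regenerate the include files, then track them in a private
# config repo and deploy from there. Paths and repo layout are assumptions.
SOLR_CONF=/usr/local/solr/solr-7.7.2/server/solr/collection1/conf

./updateSchemaMDB.sh                       # regenerate schema_dv_mdb_*.xml

cd /path/to/private-config-repo
cp "$SOLR_CONF"/schema_dv_mdb_*.xml solr/  # copy generated files into the repo
git diff solr/                             # review what the script changed
git add solr/ && git commit -m "Update generated Solr schema include files"
# ...then deploy to Solr from this repo via your usual mechanism.
```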
@poikilotherm My issues were based on a misunderstanding of how updates are managed, plus the idea that "extra" legacy fields were not such a burden on anyone that they required a rearchitecture, if that rearchitecture resulted in other maintenance issues. It turns out it does not cause maintenance issues; that was my misunderstanding. We do not perform updates by manually modifying those include files to account for changes; we can just use the update script, as long as our metadata blocks are loaded.

I was confused by the developer checking changes to those files into source control and, as part of the deployment instructions, indicating that they should be manually put in place before reloading Solr. If you combine that with running the update script, there is an apparent conflict, since the update overwrites the include files. What I missed was that if you first update the block with the change, the update script is all that is needed (see the sketch below). Honestly, I am not trying to create an us-vs-them scenario, more of a was-working/now-complicated scenario, but that again is not the case. Thanks for responding, and if you want to chat offline, we can! Kevin
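To make the resolution concrete: updateSchemaMDB.sh generates the include files from the metadata blocks currently loaded in the installation, so loading the updated block first means the generated files already contain the change. A minimal sketch, reusing the biomedical.tsv example from the v5.1 release notes quoted elsewhere in this thread:

```sh
# Hedged sketch of the corrected ordering. Loading the updated block FIRST
# means updateSchemaMDB.sh regenerates the include files with the change
# already in them, so no manual file merging is needed.

# 1. Load the updated metadata block into Dataverse (v5.1 example).
curl http://localhost:8080/api/admin/datasetfield/load -X POST \
     --data-binary @biomedical.tsv -H "Content-type: text/tab-separated-values"

# 2. Regenerate the Solr include files from the now-updated blocks.
./updateSchemaMDB.sh

# 3. Reload Solr.
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"
```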
Hey @poikilotherm, thanks for the comment. I'm sure some of this can continue to be revisited as the project grows (and as we look at options for flexible metadata); we just need to prioritize it, which we haven't been able to do yet. This was mostly focused on understanding the current state.
Yeah. What @kcondon said. :)
As time goes on, some technical details become murky vis-à-vis updating the Solr schema.
@poikilotherm @pdurbin Apologies, but can we have some clarification on how admins should maintain schema.xml when they have (1) a legacy system or (2) a new system, in cases where a field is added or changed? Also, are there any considerations around a classic schema config versus a managed one?
Current instructions:
http://guides.dataverse.org/en/5.1/admin/metadatacustomization.html?highlight=updateschemamdb

Current 5.1 release notes (https://github.com/IQSS/dataverse/releases/tag/v5.1) covering the schema change (a consolidated sketch of these steps follows below):

> Update Biomedical Metadata Block (if used), Reload Solr, ReExportAll
> - wget https://github.com/IQSS/dataverse/releases/download/v5.1/biomedical.tsv
>   curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @biomedical.tsv -H "Content-type: text/tab-separated-values"
> - copy schema_dv_mdb_fields.xml and schema_dv_mdb_copies.xml to the Solr server, for example into the /usr/local/solr/solr-7.7.2/server/solr/collection1/conf/ directory
> - if your installation employs any custom metadata blocks, you should also run the script updateSchemaMDB.sh to preserve those fields. See http://guides.dataverse.org/en/5.1/admin/metadatacustomization.html?highlight=updateschemamdb . Otherwise it is unnecessary.
> - Restart Solr, or tell Solr to reload its configuration:
>   curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"
> - Run ReExportAll to update JSON exports: http://guides.dataverse.org/en/5.1/admin/metadataexport.html?highlight=export#batch-exports-through-the-api
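Putting the steps above together, a hedged, consolidated sketch of the full sequence as one script; paths are the release-note examples, and the reExportAll endpoint should be verified against the linked metadataexport guide:

```sh
# Hedged end-to-end sketch of the v5.1 upgrade steps listed above.
# Paths are the examples from the release notes; adjust for your install.
SOLR_CONF=/usr/local/solr/solr-7.7.2/server/solr/collection1/conf

# 1. Fetch and load the updated biomedical block.
wget https://github.com/IQSS/dataverse/releases/download/v5.1/biomedical.tsv
curl http://localhost:8080/api/admin/datasetfield/load -X POST \
     --data-binary @biomedical.tsv -H "Content-type: text/tab-separated-values"

# 2a. No custom blocks: copy the distribution include files into place.
cp schema_dv_mdb_fields.xml schema_dv_mdb_copies.xml "$SOLR_CONF/"
# 2b. Custom blocks: regenerate the include files instead, so custom
#     fields are preserved (see the metadatacustomization guide).
# ./updateSchemaMDB.sh

# 3. Reload Solr.
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"

# 4. Re-export all dataset metadata (endpoint per the metadataexport guide).
curl http://localhost:8080/api/admin/metadata/reExportAll
```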
Thanks