Dataset versions, cards, and Solr indexing #380

eaquigley · 2014-07-09T15:37:55Z

Author Name: Philip Durbin (@pdurbin)
Original Redmine Issue: 3795, https://redmine.hmdc.harvard.edu/issues/3795
Original Date: 2014-03-31
Original Assignee: Kevin Condon

So far, every dataset has had a single corresponding Solr document. Now that we have versioning of datasets, Solr documents of published studies are being overwritten by draft versions when they are saved. This needs to change.

From a discussion between Gustavo and Merce on 2014-04-02:

It would be confusing to see multiple cards for the same dataset at once.
Try introducing a facet as a toggle between "published" and "draft".

The idea is that you'd never see two cards for the same dataset at once. By default, you would see the published version. If you click the toggle, you will see only draft versions of datasets. Under the covers there will be two Solr documents such as id:dataset_42 and id:dataset_42_draft. Once the draft study is released, there will again be only one Solr document (and one card) for the dataset.

At a technical level, here's some pseudocode from a whiteboard:

OnSave:

// if it's a draft
if latestVersion == workingCopy
  // create the draft Solr document
  index dataset_42_draft
else
  // index the published version
  index dataset_42
  // delete the draft
  delete dataset_42_draft

If there's a workingCopy, the penultimate version should always be the published version.
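
The whiteboard pseudocode above can be sketched in Java roughly as follows. This is only an illustration, not the Dataverse implementation: the class name, the method names, and the in-memory map standing in for the Solr index are all assumptions. Only the dataset_42 / dataset_42_draft naming comes from the discussion above.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory map stands in for the Solr index.
class DatasetIndexSketch {
    // Solr document id -> indexed content (stand-in for a real Solr core).
    final Map<String, String> index = new TreeMap<>();

    static String publishedId(long datasetId) {
        return "dataset_" + datasetId;
    }

    static String draftId(long datasetId) {
        return publishedId(datasetId) + "_draft";
    }

    // Mirrors the whiteboard OnSave logic.
    void onSave(long datasetId, boolean latestIsWorkingCopy, String content) {
        if (latestIsWorkingCopy) {
            // Draft: write only the draft document; the published one is left untouched.
            index.put(draftId(datasetId), content);
        } else {
            // Published: (re)index the published document and drop the stale draft.
            index.put(publishedId(datasetId), content);
            index.remove(draftId(datasetId));
        }
    }

    public static void main(String[] args) {
        DatasetIndexSketch solr = new DatasetIndexSketch();
        solr.onSave(42, false, "v1.0");          // publish 1.0
        solr.onSave(42, true, "v1.1 draft");     // editing creates a draft
        System.out.println(solr.index.keySet()); // [dataset_42, dataset_42_draft]
        solr.onSave(42, false, "v1.1");          // publish 1.1
        System.out.println(solr.index);          // {dataset_42=v1.1}
    }
}
```

The point of the sketch is that saving a draft never touches the published document, which is what stops published cards from being overwritten.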

Related issue(s): #214, #394, #472, #483
Redmine related issue(s): 3628, 3809, 3887, 3898


eaquigley · 2014-07-09T15:37:56Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-04-16T02:08:33Z

No indexing changes have been made but I laid out the logic that's been talked about with up to two Solr documents (a draft version and a released version) per dataset. Significant refactoring will be required next in the indexDatasetAddOrUpdate() method in the code block below.

In the curl command tests I introduce the notion of a toggle (released=true vs. released=false) but no code has been written to support this yet. That's also next.

This commit:

42aeafb no change but add logic and tests for indexing dataset versions #3795

if (latestVersion.isWorkingCopy()) {
    sb.append("The latest version is a working copy (latestVersionState: " + latestVersionState + ") and will be indexed as " + solrIdDraftStudy + " (only visible by creator)\n");
    if (releasedVersion != null) {
        String releasedVersionState = releasedVersion.getVersionState().name();
        String semanticVersion = releasedVersion.getSemanticVersion();
        sb.append("The released version is " + semanticVersion + " (releasedVersionState: " + releasedVersionState + ") and will be indexed as " + solrIdPublishedStudy + " (visible by anonymous)");
        /**
         * The latest version is a working copy (latestVersionState:
         * DRAFT) and will be indexed as dataset_17_draft (only visible
         * by creator)
         *
         * The released version is 1.0 (releasedVersionState: RELEASED)
         * and will be indexed as dataset_17 (visible by anonymous)
         */
        logger.info(sb.toString());
        String indexDraftResult = indexDatasetAddOrUpdate(dataset);
        String indexReleasedVersionResult = indexDatasetAddOrUpdate(dataset);
        return "indexDraftResult:" + indexDraftResult + ", indexReleasedVersionResult:" + indexReleasedVersionResult + ", " + sb.toString();
    } else {
        sb.append("There is no released version yet so nothing will be indexed as " + solrIdPublishedStudy);
        /**
         * The latest version is a working copy (latestVersionState:
         * DRAFT) and will be indexed as dataset_33_draft (only visible
         * by creator)
         *
         * There is no released version yet so nothing will be indexed
         * as dataset_33
         */
        logger.info(sb.toString());
        String indexDraftResult = indexDatasetAddOrUpdate(dataset);
        return "indexDraftResult:" + indexDraftResult + ", " + sb.toString();
    }
} else {
    sb.append("The latest version is not a working copy (latestVersionState: " + latestVersionState + ") and will be indexed as " + solrIdPublishedStudy + " (visible by anonymous) and we will be deleting " + solrIdDraftStudy + "\n");
    if (releasedVersion != null) {
        String releasedVersionState = releasedVersion.getVersionState().name();
        String semanticVersion = releasedVersion.getSemanticVersion();
        sb.append("The released version is " + semanticVersion + " (releasedVersionState: " + releasedVersionState + ") and will be (again) indexed as " + solrIdPublishedStudy + " (visible by anonymous)");
        /**
         * The latest version is not a working copy (latestVersionState:
         * RELEASED) and will be indexed as dataset_34 (visible by
         * anonymous) and we will be deleting dataset_34_draft
         *
         * The released version is 1.0 (releasedVersionState: RELEASED)
         * and will be (again) indexed as dataset_34 (visible by anonymous)
         */
        logger.info(sb.toString());
        String deleteDraftVersionResult = removeDatasetDraftFromIndex(solrIdDraftStudy);
        String indexReleasedVersionResult = indexDatasetAddOrUpdate(dataset);
        return "deleteDraftVersionResult: " + deleteDraftVersionResult + ", indexReleasedVersionResult:" + indexReleasedVersionResult + ", " + sb.toString();
    } else {
        sb.append("We don't ever expect to ever get here. Why is there no released version if the latest version is not a working copy? The latestVersionState is " + latestVersionState + " and we don't know what to do with it. Nothing will be added or deleted from the index.");
        logger.info(sb.toString());
        return sb.toString();
    }
}

eaquigley · 2014-07-09T15:37:56Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-04-16T18:36:26Z

Philip Durbin wrote:

No indexing changes have been made but I laid out the logic that's been talked about with up to two Solr documents (a draft version and a released version) per dataset. Significant refactoring will be required next in the indexDatasetAddOrUpdate() method in the code block below.

In the curl command tests I introduce the notion of a toggle (released=true vs. released=false) but no code has been written to support this yet. That's also next.

Prior to this commit, we only ever had one Solr document at a time per dataset, so the public could see when a released version 1.0 dataset became a draft. Now we keep the old Solr document around for the released 1.0 dataset until 1.1 (or 2.0) is published:

aad5c01 start hiding drafts after version 1.0 from public #3795

There's still a lot of work to do but this groundwork was important for the next steps... the toggle facet, etc.

Also, I haven't thought a lot about the implications for files... like Dataverses, right now there's still only one Solr document per file.

Also, there's something funny going on with the citation.

eaquigley · 2014-07-09T15:37:56Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-04-24T18:47:35Z

Philip Durbin wrote:

Also, there's something funny going on with the citation.

There are bugs around the citation in general, it seems, as described in #3875. From my perspective, getting the citation for a dataset version should be a black box.

Gustavo and I agreed today we'll stop indexing the citation anyway, which we've only been indexing for #3737 for a week or so anyway (since e9de92a). We should probably remove the custom non-highlighting CSS Mike and I added in b27c341.

eaquigley · 2014-07-09T15:37:56Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-04-24T22:52:07Z

First we tried an "Unpublished/Published" toggle and didn't like it. That was commit 58dfcf9 (still deployed to dvn-alpha).

Then we tried merging Solr documents together with a "group by" feature of Solr. That code is here: https://github.com/IQSS/dataverse_temp/tree/solr-groupby

Then we decided to try showing multiple cards (screenshot with comments at https://docs.google.com/a/harvard.edu/drawings/d/1PRJSP2EG31RPTG2SyTkjqldKe5lkNnV5FxQg2wFyql8/edit?usp=sharing). That's this commit I just made:

80f0be4 replace Published/Unpublished toggle with multiple cards #3795
    - cards can have the labels Unpublished or Draft
    - added facet for "Publication Status"

See also some of the reasoning and meeting notes at https://docs.google.com/a/harvard.edu/document/d/1clGJKOmrH8zhQyG_8vQHui5L4fszdqRjM4t3U6NFJXg/edit?usp=sharing

eaquigley · 2014-07-09T15:37:56Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-05-05T14:58:16Z

Philip Durbin wrote:

Also, I haven't thought a lot about the implications for files... like Dataverses, right now there's still only one Solr document per file.

I finally started taking a look at files. We still have only one card per file (should we?) but by indexing based on the dataset version rather than the dataset itself, a bug has been fixed where new files uploaded to published studies were discoverable:

522642b index files based on version, not dataset itself #3795
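
To illustrate why indexing from the dataset version matters, here is a minimal sketch. It is not the code from the commit: the Version class and the publiclyDiscoverableFiles method are hypothetical names, and versions are modeled as a plain list ordered oldest to newest. Only the DRAFT and RELEASED states come from the log output quoted earlier in this ticket.

```java
import java.util.List;

// Hypothetical sketch; class and method names are illustrative, not Dataverse code.
class FileIndexSketch {
    static final class Version {
        final String state;        // "DRAFT" or "RELEASED"
        final List<String> files;  // file names belonging to this version only

        Version(String state, List<String> files) {
            this.state = state;
            this.files = files;
        }
    }

    // Walk versions from newest to oldest and return the files of the latest RELEASED one.
    static List<String> publiclyDiscoverableFiles(List<Version> versions) {
        for (int i = versions.size() - 1; i >= 0; i--) {
            Version v = versions.get(i);
            if (v.state.equals("RELEASED")) {
                return v.files;
            }
        }
        return List.of(); // nothing released yet, so nothing is public
    }

    public static void main(String[] args) {
        List<Version> versions = List.of(
            new Version("RELEASED", List.of("data.tab")),
            new Version("DRAFT", List.of("data.tab", "new-upload.csv")));
        System.out.println(publiclyDiscoverableFiles(versions)); // [data.tab]
    }
}
```

A file that exists only in the draft version never reaches the public result, whereas collecting files from the dataset as a whole would have included it.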

There's still more work to be done in the area of files, however. Is the expectation that a single file can have multiple cards based on changes made to the description (for example) after the file was initially published? I raise this question in the integration test at https://github.com/IQSS/dataverse/blob/master/scripts/search/tests/dataset-versioning05 as well as in the screenshot with comments at https://docs.google.com/a/harvard.edu/drawings/d/1PRJSP2EG31RPTG2SyTkjqldKe5lkNnV5FxQg2wFyql8/edit?usp=sharing

eaquigley · 2014-07-09T15:37:56Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-05-06T14:03:24Z

Philip Durbin wrote:

Philip Durbin wrote:

Also, there's something funny going on with the citation.

There are bugs around the citation in general, it seems, as described in #3875. From my perspective, getting the citation for a dataset version should be a black box.

Gustavo and I agreed today we'll stop indexing the citation anyway, which we've only been indexing for #3737 for a week or so anyway (since e9de92a). We should probably remove the custom non-highlighting CSS Mike and I added in b27c341.

I stopped indexing the citation and am now looking it up properly on the dataset version rather than the dataset itself:

caea2f1 stop indexing citation, get from dataset version #3795

I did not yet look at removing the custom CSS Mike added to avoid highlighting on the citation. It shouldn't be necessary anymore since the citation isn't being indexed at all now (so it will never show highlights).

eaquigley · 2014-07-09T15:37:56Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-05-07T15:06:14Z

Philip Durbin wrote:

There's still more work to be done in the area of files, however. Is the expectation that a single file can have multiple cards based on changes made to the description (for example) after the file was initially published?

At our Monday meeting it was decided that we do want multiple cards for files (just like datasets).

This functionality is available as of this commit:

d10add4 file card draft/publish lifecycle, including delete #3795

Please note that the minute you edit a published dataset the creator will be able to see two cards for the dataset (published and draft) and for each file in the published study, a draft file card and a published file card. The rules on when draft file cards are created might change in #3943 but for beta this ticket is ready for testing.

#3943 has lots of screenshots of a publishing workflow example that might be useful in understanding what to expect visually. If you're interested in trying the integration tests, they start at highlighting and end with dataset-versioning06 at https://github.com/IQSS/dataverse/tree/master/scripts/search/tests . I put in lots of detail (server.log output) of how the Solr versioning lifecycle (create/update/delete) works.

I'm moving this ticket to QA.

eaquigley · 2014-07-09T15:37:56Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-05-08T19:53:01Z

I realized that by default Solr limits queries to 10 documents, which was imposing a low limit on the number of draft file cards that could be deleted. This commit fixes this, allowing up to 2147483647 (Integer.MAX_VALUE) of these Solr docs to be deleted at once:

ebc5163 don't limit deletion to 10 file solr docs #3795

Should be plenty. Also we may switch how we do this in #3960.
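
The effect of the row limit can be reproduced with a small simulation. This sketch does not use SolrJ or the code from the commit: the query method is a stand-in for a Solr search, and the datafile_N_draft ids are hypothetical. It assumes the deletion works by first querying for draft file documents and then deleting whatever came back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical simulation of find-then-delete with a row limit; not SolrJ code.
class RowLimitSketch {
    static final int DEFAULT_ROWS = 10; // the default query limit described above

    // Stand-in for a Solr query: returns at most `rows` ids ending in the suffix.
    static List<String> query(List<String> index, String suffix, int rows) {
        return index.stream()
                .filter(id -> id.endsWith(suffix))
                .limit(rows)
                .collect(Collectors.toList());
    }

    // Deletes only the documents the query returned and reports how many were removed.
    static int deleteDraftFileDocs(List<String> index, int rows) {
        List<String> hits = query(index, "_draft", rows);
        index.removeAll(hits);
        return hits.size();
    }

    public static void main(String[] args) {
        List<String> drafts = new ArrayList<>();
        for (int i = 1; i <= 11; i++) {
            drafts.add("datafile_" + i + "_draft"); // hypothetical id scheme
        }
        // With the default limit, one of the 11 draft docs survives.
        System.out.println(deleteDraftFileDocs(new ArrayList<>(drafts), DEFAULT_ROWS));      // 10
        // Asking for Integer.MAX_VALUE rows removes all of them.
        System.out.println(deleteDraftFileDocs(new ArrayList<>(drafts), Integer.MAX_VALUE)); // 11
    }
}
```

Eleven documents is the smallest case that exposes the default limit.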

eaquigley · 2014-07-09T15:37:57Z

Original Redmine Comment
Author Name: Kevin Condon (@kcondon)
Original Date: 2014-05-14T19:57:34Z

Tested on 5/14

Basic version logic works: at most 2 versions are viewable (draft/published), results have proper tags, and clicking on them goes to the correct version.

Also tested with 11 files and draft entries for all 11 files are removed when published.

Closing ticket

eaquigley added this to the Dataverse 4.0: Beta 1 milestone Jul 9, 2014

eaquigley assigned kcondon Jul 9, 2014

eaquigley closed this as completed Jul 9, 2014
