Discovery: Determine work to move collections out of GSA CKAN core and into an extension #1461

Closed
10 of 15 tasks
kimwdavidson opened this issue Mar 16, 2020 · 8 comments
Assignees
Labels
component/catalog Related to catalog component playbooks/roles

Comments

@kimwdavidson

kimwdavidson commented Mar 16, 2020

Description

Summary

This is a timeboxed effort to understand the technical implications of, and determine the high-level (HL) effort involved in, moving collections out of catalog-app (CKAN 2.3) and into an extension that works with the OKF main CKAN branch (CKAN 2.8), running in Docker.

The technical criteria to assess the solution against will be:

  • Ease of fit in the CKAN framework ecosystem (will it work with the other extensions / inventory?)
  • Search will work (we can index, etc.)
  • How elegant a solution it is (i.e. it does not incur technical debt)

Time-boxed at 3ish days:

  • 0.5 to 1 day Understanding the code-base (we already have done pre-work here)
  • 0.5 days Summary of "non-core code" in the catalog app (i.e. BSP config, collections, other features). This is more for background: BSP config we can ignore, since we want to focus on the collections code
  • 0.5 days High-level understanding Dept Ed collections approach using group id = collection
  • 1 to 1.5 days analyzing an approach and coming up with HL solution and estimate

This discovery task focuses purely on the question "can we build collections as an extension?". The following are out of scope:

  • Product considerations: Do we need this? What do we need? We assume we need this and that we will build the same features (if this solution looks viable, we can then see which collections features would be required).
  • "Change all WAF collections to be single file harvest, and have metadata point to resource endpoints that are relevant for download. This still doesn't handle the POD collections implementation. CKAN does have a native feature to "connect" datasets, we could consider going that route for those datasets..." (From James Slack, dated 1/31/2020)

Links

Acceptance Criteria

  • A proposed solution to move the Collections feature out of CKAN core and have a technical understanding of the implications of doing that on the other extensions
  • Estimate work for a solution (or a spike) to move collections out of CKAN core and into a separate extension
  • Estimate how long it would take to amend the extensions that have a reverse dependency with the collections code (Basically how long would it take to change the extensions so they work with a new collections extension)

Tasklist - getting set up for this task

  • Esteban to draft the issue and discuss with Julie
  • Julie to ask Tom to add you to Dept Ed repo so you can see the code (this project is temporarily no longer open source 😠)
  • Esteban to provide link to documentation and point Julie to the Dept Ed code and answer any questions she has
  • Understand collections implementation in UI
    • Esteban to do first pass
    • Julie to review document and add/amend

Task list - analysis

  • Understanding the current approach
    • What metadata do they have associated with Collections, and what are the features of collections?
    • Understand the dependencies needed by collections and the reverse dependencies (extensions that use collections code)
  • Run the OKF upstream locally to better understand which collections features come from the extensions and which from core, and where the collections code that exists in CKAN core is being used.
  • If we move this collections code out of core and into an extension - what do we break?
  • Create a list of possible technical ideas and provide pros and cons (Esteban's initial ideas of our options: keep it in the core (boo), CKAN ext scheming (I'd stay clear of that), use group)

Analysis & Recommendation

Recommendation

There are 2 options to deal with collections:

  1. Re-implement the elegant hack in CKAN 2.8 (fork again) - Not recommended
  2. Create an extension - Recommended

If we opt for an extension there are a number of options.

  1. Implement the package id = collections as an extension.
    • This will work fine for the existing collections features.
  2. Use group id = collections as per Dept Ed solution
    • This is a good option if and only if you want to have more features. This would involve rework for the other extensions in addition to the work involved in creating said features
  3. Use Scheming
    • Datopian evaluated this when building the collections feature for Dept Ed and decided against using scheming. To be honest, the scheming extension has a mixed reception in the CKAN community. Some like it more than others, but the main issue when working with scheming is that "changing something here breaks something that is unrelated over there" (which some might say is a general CKAN issue), so this is not recommended.

Therefore the recommendation is to build an extension using the package id = collection pattern. If a richer set of features were required in the future then the extension can be upgraded, possibly using the group id = collections pattern.

As a t-shirt size, we would estimate this to be 2 to 3 weeks of work, of which half is developing unit tests. But we would know after 2 to 3 days of dev time whether this is going to work.

Proposed next steps:

  • Understand the other code in catalog to make sure that we incorporate that as well. See the sister issue.
  • Create a ticket to build a collection extension
    • Estimate the task (i.e. move beyond the high-level estimate provided as part of this discovery task)
    • Create extension
    • Validate that dependencies and reverse dependencies are still working
    • Create unit tests
    • Created another issue to make sure that other harvester types are considered in addition to the data.json harvester and DCAT-US schema, which use isPartOf. That is to say, we need to make sure that our proposed solution works for all harvesters.
    • Note to team: let's not forget DevSecOps considerations. We need to rope Tom in early

Analysis

See more details in our scratchpad analysis doc

Details

Notes about collection in GSA's CKAN fork.

Previous issues related to this

CKAN 2.8 plan of work

Where to start looking for the Collections code

Part of the data.json standard defined in the Project Open Data is the field isPartOf. This field allows the grouping of multiple datasets into a “collection”. This field should be employed by the individual datasets that together make up a collection. The value for this field should match the identifier of the parent dataset.
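As a minimal sketch of that parent/child relationship, assuming data.json-style dataset records with `identifier` and `isPartOf` keys (the `group_by_collection` helper and the sample records are hypothetical and are not taken from the GSA code):

```python
from collections import defaultdict


def group_by_collection(datasets):
    """Map each parent identifier to the identifiers of its child datasets."""
    known_ids = {d["identifier"] for d in datasets}
    collections = defaultdict(list)
    for dataset in datasets:
        parent_id = dataset.get("isPartOf")
        # Only group children whose parent is actually present in the catalog.
        if parent_id and parent_id in known_ids:
            collections[parent_id].append(dataset["identifier"])
    return dict(collections)


# Hypothetical sample records, for illustration only.
catalog = [
    {"identifier": "parent-1", "title": "Parent dataset"},
    {"identifier": "child-a", "title": "Child A", "isPartOf": "parent-1"},
    {"identifier": "child-b", "title": "Child B", "isPartOf": "parent-1"},
    {"identifier": "standalone", "title": "Standalone dataset"},
]

print(group_by_collection(catalog))
```

Datasets without `isPartOf`, or whose parent is missing from the catalog, are simply left out of the grouping in this sketch.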

GSA added this functionality to CKAN. In packages, GSA used some new extras:

  • collection_package_id: Refers to the package_id of the parent.
  • is_collection: True for the packages that have children.

Both seem to have been included in CKAN searches in 2015. Also in 2015, collection_package_id appears to have been added to the datajson extension, along with the is_collection extra. GSA also uses collection_package_id in the DataGovTheme.
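To illustrate how these two extras could be read from a package, here is a rough sketch. It assumes the standard CKAN package dict layout, where `extras` is a list of `{"key": ..., "value": ...}` entries, and it assumes `is_collection` is stored as the string "true"; the `get_extra` and `collection_role` helpers are hypothetical names, not functions from the GSA fork.

```python
def get_extra(pkg_dict, key, default=None):
    """Return the value of a package extra, or default if it is not set."""
    for extra in pkg_dict.get("extras", []):
        if extra.get("key") == key:
            return extra.get("value")
    return default


def collection_role(pkg_dict):
    """Classify a package as 'parent', 'child' or 'standalone'."""
    # Assumption: is_collection is stored as a string such as "true".
    if str(get_extra(pkg_dict, "is_collection", "")).lower() == "true":
        return "parent"
    if get_extra(pkg_dict, "collection_package_id"):
        return "child"
    return "standalone"


# Hypothetical package dicts, for illustration only.
parent_pkg = {"id": "p1", "extras": [{"key": "is_collection", "value": "true"}]}
child_pkg = {"id": "c1", "extras": [{"key": "collection_package_id", "value": "p1"}]}
plain_pkg = {"id": "s1", "extras": []}

print(collection_role(parent_pkg), collection_role(child_pkg), collection_role(plain_pkg))
```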

From Aaron about where to find the collections code:

"One gotcha, isPartOf is specific to the data.json harvester and DCAT-US schema but other harvesters might implement collections too e.g. WAF or CSW. It's unclear how many harvesters have implemented it and it's unclear if any harvest sources are actually using it."

Some info on how collections are implemented in the current UI

In summary, the collections features of data.gov are:

  • If a dataset is part of a collection you can go to the collection
  • Search within a collection
  • Not easily discoverable

See this document for more info

@estebanruseler estebanruseler changed the title Discovery: Determine work to move collections to a new version Discovery: Determine work to move collections out of GSA CKAN core and into an extension Mar 20, 2020
@mogul mogul added the component/catalog Related to catalog component playbooks/roles label Mar 20, 2020
@kimwdavidson kimwdavidson added this to the Data.gov Sprint 18 milestone Mar 23, 2020
@estebanruseler

estebanruseler commented Mar 23, 2020

@thejuliekramer as promised, here's the details of how the collections are done in the current UI based on what I found. But please double-check this...

@estebanruseler

Julie and my wip analysis https://hackmd.io/DP3btMgNQQ6xoxuRlM2Ajw

@estebanruseler

@thejuliekramer as promised, I’ve updated the analysis doc with the ed.gov collections features and code snippets.

@thejuliekramer
Contributor

@estebanruseler After poking around for a few days, it seems like we can move the methods we've changed for the collections search into an extension and override the existing methods in CKAN. The same methods still exist in newer versions of CKAN. As far as I can tell they will interact the same way with the code that exists in the datajson and geodatagov extensions, but I will know more if we get approval to do a spike; I ran out of time to fully get it working locally. I would estimate the work to move the code into an extension would take about a week, and testing the feature would take another week since no tests exist. This all assumes we want to keep exactly the same functionality and UI we have currently and not add anything else (like they did in Dept of Ed). Let me know your thoughts.
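For illustration only, here is a rough sketch of the kind of search tweak an extension might apply. It is not the code from the GSA fork. It assumes the extension would hook into search through the `before_search` method of CKAN's `IPackageController` plugin interface, that child datasets should be hidden from default results unless the user is searching within a collection, and that the extra is indexed in Solr as `extras_collection_package_id` (stock CKAN indexes extras as `extras_<key>`, but the fork may use a different field name). The function is written as plain Python so it can be tried without CKAN installed.

```python
# Assumed Solr field name: stock CKAN indexes extras as "extras_<key>".
COLLECTION_FIELD = "extras_collection_package_id"


def before_search(search_params):
    """Hide collection children unless the query already targets a collection.

    In a real extension this logic would live in the before_search method of a
    plugin class implementing IPackageController (an assumption of this sketch).
    """
    params = dict(search_params)  # do not mutate the caller's dict
    fq = params.get("fq", "")
    if "collection_package_id" in fq:
        # Searching within a collection: leave the filter query untouched.
        return params
    # Exclude every dataset that has a collection_package_id value.
    params["fq"] = (fq + " -" + COLLECTION_FIELD + ":[* TO *]").strip()
    return params


print(before_search({"q": "water"}))
print(before_search({"q": "water", "fq": "collection_package_id:abc"}))
```

Whether this hook is sufficient, or whether core search methods still need overriding, is exactly what the proposed spike would have to confirm.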

@mogul
Contributor

mogul commented Mar 26, 2020

@jbrown-xentity FYI! @estebanruseler is available to talk about the findings in more detail if you like, especially if it's useful to you in the context of other projects.

@jbrown-xentity
Contributor

@estebanruseler would love to sync up about your findings and next steps for this!

@estebanruseler

@jbrown-xentity @thejuliekramer and I were working on this together, so the 3 of us should have a chat. Tomorrow my day is pretty full, but should we meet for 30 mins on Monday? I'm available from 1:30 to 3 pm ET.

@adborden
Contributor

I was under the impression that it was not easy to patch/extend CKAN's search functionality from an extension, but if that's not the case then the approach sounds great.

6 participants