
TRLN Discovery Data Documentation

Organization of this repo

archive

Old stuff for reference

argot

Argot is the core data language for the TRLN Discovery project. This folder contains Argot documentation, including mappings from other formats into Argot.

marc

Contains tools to compare MARC fields with Argot to identify gaps in mappings as the MARC bibliographic standard changes over time. Our decisions about which MARC fields/subfields NOT to map to the Argot data model are kept here.

solr

Resources for working with Solr from the metadata side of the project

Data quickstart

Argot data model

Argot is the core data language for the TRLN Discovery project. All source data must be transformed into valid Argot for ingest into TRLN Discovery.

See the argot directory README for a full explanation of the underlying Argot concepts and the data transformation steps (initial hierarchical Argot → Enriched Argot → stored Solr fields).

Argot elements and subelements (i.e. fields) and the system behavior they are intended to support are defined on the elements tab of argot.xlsx.

Note

KMS, 2019-08-07: I’d mentioned hoping to convert the .xlsx file to a Google Sheet before my departure, but later realized unmapped_marc.ipynb is pulling in data from the .xlsx. I do not have time to update that to pull from a Google Sheet, and don’t want to leave that broken (and am not convinced a Jupyter Notebook is the best way to maintain that tool long term), so I am leaving the spreadsheet in Excel.

My recommendation would be that the spreadsheet get moved to the TRLN Drive space because this provides a better way for multiple people to work with and update this file over time. It should be editable only by TRLN folks. It should be publicly viewable, since staff across TRLN institutions may want to refer to it.

Mappings from MARC and other metadata schemes to each Argot element/subelement are on the mappings tab of the spreadsheet.

The elements and mappings tabs are being converted to .csv files in the argot directory so that this information can be referred to online.

An explanation of the columns/values used in the spreadsheet exists, but is in sorry shape. Apparently I got interrupted working on it and didn’t get back to it. Sorry!

I’ve represented the general types of data transformations for each mapping in the spreadsheet, but for the exact details on how data gets transformed, see the code and tests.

To understand the Argot data patterns, refer to spec_docs files starting with _pattern. To see what fields get handled by special flatteners in argot-ruby, look here. To see exactly how each flattener works, consult the code.

I stopped writing up spec docs for individual elements when I began writing tests for the transformations in MARC-to-Argot instead.

Important/special topics

TRLN Shared Records

Needs to be migrated from Confluence to this repo. This information should be publicly available because staff at the institutions refer to the queries for pulling stats.

Location data

How locations get handled in TRLN Discovery: broader location (loc_b), narrower location (loc_n), and the location facet. Needs to be migrated from Confluence to this repo.

Status and availability

Describes the old Endeca catalog, but we’ve mirrored a lot of this in TRLN Discovery, so these concepts may still be useful.

Tracing data transformations

Source data to Argot transformations are institution-specific, though much of the MARC transformation logic is shared and handled by MARC-to-Argot. The tests serve as a form of documentation, some more complete and/or easier to read than others.

Use https://ingest.discovery.trln.org/ to view the further-transformed data in the system. For each record, you can see:

  • the latest Argot submitted ("Argot")

  • the flattened/enriched version of Argot produced by argot-ruby ("Enriched Content (what gets sent to Solr)")

  • the stored Solr fields available to the Argon application for display or other non-search purposes ("Solr Document")

You can also see the stored Solr fields available to the Argon application by adding '.json' to the end of any catalog record page you are viewing: https://discovery.trln.org/catalog/UNCb5990361.json
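If you want to script checks against the stored Solr fields, the same .json view can be fetched programmatically. A minimal sketch in Python using the requests library (the record id is the one from the example URL above; the exact shape of the returned JSON is an assumption to verify by inspection):

```python
import requests

# Fetch the stored Solr fields for a catalog record via the public .json
# view of its record page (example record id taken from the URL above).
record_id = "UNCb5990361"
url = f"https://discovery.trln.org/catalog/{record_id}.json"

resp = requests.get(url, timeout=30)
resp.raise_for_status()
data = resp.json()

# The exact shape of the JSON response is an assumption here; inspect it to
# find where the stored fields live (e.g. under a "response"/"document" key
# in Blacklight-style applications), then list the field names.
print(sorted(data.keys()))
```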

Note

You cannot view the actual content of the indexed fields in Solr, because they have undergone extensive special analysis for indexing and are unfit for human viewing unless you are doing some deep-dive diagnosis.

To experiment with how data gets transformed (stemmed, tokenized, etc.) for search in Solr, use the Solr admin > collection = trlnbib > Analysis tool. Adam Constabaris maintains access to the Solr admin interface.

Solr

To understand how the "Enriched Content (what gets sent to Solr)" gets turned into the actual Solr fields, you will need to consult: * understand 'dynamic fields' in Solr * consult the dynamic fields section of our Solr schema (This information or a version of it---not sure how in sync these are kept---is publicly available here, but you have to scroll/find in page to get to it)

Querying Solr directly can be extremely helpful for:

  • performing searches not supported by UI

    • search on specific field(s) instead of fields grouped by search index

    • searches using negation

This is useful for:

  • finding examples of data issues/problems

  • exporting ids needing update

  • etc.

A sketch of such a direct query follows below.
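The sketch below searches a single field, uses negation, and returns just the ids. The base URL and field names are hypothetical placeholders; substitute the real host and fields from the schema.

```python
import requests

# Hypothetical host and field names, for illustration only; direct access is
# restricted by IP range, and field names should be checked against the schema.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "trlnbib"

params = {
    # Search one specific field and exclude records that have any value in
    # another field (negation); neither is possible through the UI.
    "q": 'publisher_etc:"Oxford" AND -isbn_number:*',
    "fl": "id",   # return ids only, e.g. to export a list of records needing update
    "rows": 100,
    "wt": "json",
}

resp = requests.get(f"{SOLR_BASE}/{COLLECTION}/select", params=params, timeout=60)
resp.raise_for_status()
result = resp.json()

print("matches:", result["response"]["numFound"])
for doc in result["response"]["docs"]:
    print(doc["id"])
```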

Access to query Solr directly is limited by IP range, but seems to include staff IP ranges for all institutions. If you need to query Solr but can’t, contact the main catalog support person at your institution.

There’s a poorly formatted/organized cheatsheet of Solr queries we’ve found useful for data work at UNC.

Argon search and relevance ranking

The title search in the catalog application is actually searching many fields.

The search options available in the catalog application are specified here.

There are qf and pf values for each search type (except 'all_fields', which really isn’t all fields, but uses the default qf/pf labels).

To see all the fields included in each search type, along with the relevance boost applied to each field, look up that search type’s qf/pf values here.
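To illustrate how qf and pf shape relevance, here is a rough sketch of an edismax-style request. The field names and boosts are hypothetical; the real ones are whatever is defined for each search type in the configuration referenced above.

```python
import requests

# Hypothetical host, field names, and boosts, purely to show what qf/pf do.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "trlnbib"

params = {
    "defType": "edismax",
    "q": "pride and prejudice",
    # qf: the fields searched for the query terms, each with a relevance boost.
    "qf": "title_main^100 title_variant^10 statement_of_responsibility",
    # pf: phrase fields; records where the terms appear together as a phrase
    # in these fields get an additional boost.
    "pf": "title_main^500",
    "fl": "id,score",
    "rows": 10,
    "wt": "json",
}

resp = requests.get(f"{SOLR_BASE}/{COLLECTION}/select", params=params, timeout=60)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc["score"])
```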