Adjust transformation rules to RDA changes #161

Closed
acka47 opened this issue Jun 10, 2015 · 15 comments · Fixed by hbz/lobid-resources#109

Contributor

acka47 commented Jun 10, 2015

From 1 October 2015, people will be cataloging in the hbz union catalog according to the RDA rules as documented here. We will have to adjust the transformation, i.e. the hbz01-to-lobid morph file, accordingly.

After a first cursory look at the documents, I suggest the following approach:

Identifying RDA records
RDA is only applied to newly catalogued resources, which get an RDA marker r in field 030, indicator=blank, position 4 of the Aleph sequentials (aseq); see the documentation. Thus, we will have to add the RDA transformation rules only for these records.
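
A minimal sketch of that check in Python, assuming the content of field 030 is available as a plain string and that "position 4" is counted from 1 (the aseq documentation would have to confirm whether counting starts at 0 or 1). The function name `is_rda_record` and the sample values are made up for illustration and are not taken from real records.

```python
def is_rda_record(field_030: str, position: int = 4) -> bool:
    """Return True if the given 1-based position of field 030 holds 'r'.

    Assumption: positions are counted from 1. Pass position=5 if the
    documentation turns out to count from 0.
    """
    index = position - 1
    return len(field_030) > index and field_030[index] == "r"


# Hypothetical field values, only meant to show the call.
print(is_rda_record("z|1r|||||||17"))  # marker at position 4
print(is_rda_record("z|1u|||||||17"))  # no marker
```

Records for which the check fails would keep going through the existing rules unchanged.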

Checking fields that will be omitted
Several fields won't be used anymore with RDA cataloging. You can see the list here. We will check whether and how we currently transform these to RDF.

Find out how to transform the new data to lobid's current RDF data model
After having identified the data fields where RDA means change, we will have to find out how to integrate the new RDA data into the current lobid RDF.

Discuss how to handle breaks in the cataloging practice
While we will be able to make a seamless transformation for some of the RDA cataloging, so that lobid customers won't even notice that things have changed, this may not be possible for all of the changes. E.g., regarding IMD (Inhaltstyp, Medientyp, Datenträgertyp)/CMC (content type, media type, carrier type) we will get better and more coherent information (see here for details).

In cases where cataloging practice breaks significantly, we will have to consider whether to map the data both to the old/current data model and according to RDA.

@acka47 acka47 self-assigned this Jun 25, 2015
@acka47 acka47 added the ready label Aug 24, 2015
Contributor Author

acka47 commented Oct 22, 2015

Here is a list of RDA records in lobid, courtesy of @donboern: rda-ids.txt

Contributor Author

acka47 commented Oct 22, 2015

Most of the resources listed in rda-ids.txt seem to be periodicals. Here is a print book: http://lobid.org/resource/HT018779822

Contributor Author

acka47 commented Oct 22, 2015

What seems important to me for a start is field 419 with the publisher and publication place/date information.

Snippet from http://lobid.org/resource/HT018779822:

<datafield ind2="1" ind1="-" tag="419">
  <subfield code="a">New York</subfield>
  <subfield code="b">Routledge</subfield>
  <subfield code="c">2014</subfield>
</datafield>

Snippet from http://lobid.org/resource/HT018772912:

          <datafield ind2="1" ind1="-" tag="419">
            <subfield code="a">Sundern</subfield>
            <subfield code="b">Baulmann Leuchten GmbH</subfield>
            <subfield code="c">2011-</subfield>
            <subfield code="A">3</subfield>
          </datafield>
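
To make the structure of field 419 concrete, here is a small Python sketch that reads such a datafield into a dictionary. The meaning of the subfields (a = place, b = publisher, c = date) is only inferred from the two snippets above, the names `SUBFIELD_NAMES` and `parse_419` are made up, and subfield A is ignored because its meaning is not clear from the snippets. The real source XML may also carry a namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<datafield ind2="1" ind1="-" tag="419">
  <subfield code="a">New York</subfield>
  <subfield code="b">Routledge</subfield>
  <subfield code="c">2014</subfield>
</datafield>
"""

# Assumed meaning of the subfield codes, inferred from the snippets only.
SUBFIELD_NAMES = {"a": "place", "b": "publisher", "c": "date"}


def parse_419(datafield: ET.Element) -> dict:
    """Collect the a/b/c subfields of a 419 datafield into a dict."""
    result = {}
    for subfield in datafield.findall("subfield"):
        name = SUBFIELD_NAMES.get(subfield.get("code"))
        if name is not None:  # unknown codes such as "A" are skipped
            result[name] = (subfield.text or "").strip()
    return result


publication = parse_419(ET.fromstring(SNIPPET.strip()))
print(publication)
```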

Member

dr0i commented Oct 28, 2015

Thus, we will have to add the RDA transformation rules only for these records.

Does this mean that fields are ambiguous, e.g. that 419-1c is the publication date if the record is RDA-catalogued but something different if it is old MAB2? (For 419-1c I can see that this is not the case.)
If there is no interference, it's much simpler to configure the transformation rules; that's why I ask.

Contributor Author

acka47 commented Nov 3, 2015

@dr0i Could you please get the Aleph XML source of all records in rda-ids.txt and put them in one file so that I can search for specific fields?

@dr0i dr0i added working and removed ready labels Nov 3, 2015
Member

dr0i commented Nov 3, 2015

for i in $(cat rda-ids.txt); do xmllint --format "$i?format=source" >> rda-ids.alephMabXmlPretty.xml; done
You can find the result at http://lobid.org/download/rda-ids.alephMabXmlPretty.xml .

Contributor Author

acka47 commented Dec 3, 2015

The publisso stakeholders I spoke to want to work with the RDA roles of persons/corporations. We will have to consider these in the transformation. Note to self: Take a look at this and open a separate issue.

Member

dr0i commented Jan 22, 2016

Updated rda-ids.alephMabXmlPretty.xml. I used DE-605-aleph-baseline-marcxchange-2016011515.tar.gz as the base, which yields 16k RDA resources. I hope this suffices.

Contributor Author

acka47 commented Sep 2, 2016

@dr0i Could you please update rda-ids.alephMabXmlPretty.xml once more?

dr0i added a commit to hbz/lobid-resources that referenced this issue Sep 22, 2016
Provides a list of RDA records residing in hbz01 catalogue.

See hbz/lobid#161.
@dr0i dr0i assigned dr0i and unassigned acka47 Sep 22, 2016
@dr0i dr0i removed the working label Sep 23, 2016
Member

dr0i commented Sep 23, 2016

Around 180k docs, concatenated in one big bzipped xml file: http://lobid.org/download/rda-ids.alephMabXmlPretty.xml.bz2

@dr0i dr0i reopened this Sep 23, 2016
@dr0i dr0i assigned acka47 and unassigned dr0i Sep 23, 2016
@dr0i dr0i added review and removed processing labels Sep 23, 2016
Contributor Author

acka47 commented Sep 26, 2016

Thanks. Unwieldy as the file is getting, I won't ask you to create it again. Now thinking about how to work with a 1.5 GB XML file...

@acka47 acka47 added ready and removed review labels Sep 26, 2016
@dr0i dr0i added working and removed ready labels Sep 28, 2016
Member

dr0i commented Sep 28, 2016

Depending on what you want, you can always use the friendly stream tools like less, grep, sed etc.
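
In the same streaming spirit, a Python sketch that never loads the whole file: it reads the bzipped file line by line and counts how often each datafield tag occurs. It assumes the pretty-printed layout produced by xmllint --format (one element per line) and uses a regular expression instead of a real XML parse. The names `count_tags` and `count_tags_in_bz2` are made up for illustration.

```python
import bz2
import re
from collections import Counter

# Matches the tag attribute of a <datafield ...> start tag on one line.
TAG_PATTERN = re.compile(r'<datafield\b[^>]*\btag="([^"]+)"')


def count_tags(lines) -> Counter:
    """Count datafield tags in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TAG_PATTERN.findall(line))
    return counts


def count_tags_in_bz2(path: str) -> Counter:
    """Stream a bzip2-compressed file and count its datafield tags."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return count_tags(handle)


# Small in-memory demonstration instead of the real 1.5 GB file.
demo = [
    '<datafield ind2="1" ind1="-" tag="419">',
    '  <subfield code="a">New York</subfield>',
    '<datafield tag="030" ind1="-" ind2="1">',
]
print(count_tags(demo))
```

Calling `count_tags_in_bz2("rda-ids.alephMabXmlPretty.xml.bz2")` on a local copy would give a quick overview of which fields occur at all before looking at any of them in detail.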

Contributor Author

acka47 commented Oct 7, 2016

There seems to be a problem with the rda-ids.alephMabXmlPretty.xml. When I do for example cat rda-ids.alephMabXmlPretty.xml | xmllint --format - | grep --color -A 4 "<datafield tag=\"064\" ind1=\".\" ind2=\".\">" I get:

-:103: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0" encoding="UTF-8"?>
     ^
-:104: parser error : Extra content at the end of the document
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.o
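
The error looks consistent with the file being a plain concatenation of separately formatted documents, each starting with its own XML declaration, so it is not one well-formed XML document. One possible workaround, sketched in Python: split the stream at every line that starts with an XML declaration and parse each piece on its own. The function name `split_documents` and the sample records are made up; the sketch assumes each declaration sits at the start of a line.

```python
import io
import xml.etree.ElementTree as ET

XML_DECLARATION = "<?xml"


def split_documents(lines):
    """Yield each XML document from a stream of concatenated documents.

    Assumption: every document starts on a line that begins with an
    XML declaration.
    """
    current = []
    for line in lines:
        if line.startswith(XML_DECLARATION):
            if any(part.strip() for part in current):
                yield "".join(current)
            current = []
        current.append(line)
    if any(part.strip() for part in current):
        yield "".join(current)


# Tiny stand-in for the concatenated file.
sample = io.StringIO(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<record><id>1</id></record>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<record><id>2</id></record>\n"
)

roots = [ET.fromstring(doc.encode("utf-8")) for doc in split_documents(sample)]
print([root.findtext("id") for root in roots])  # ['1', '2']
```

Because the generator only keeps one document in memory at a time, the same approach should also work on the large file when it is read line by line.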

Contributor

ChristophEwertowski commented Jan 13, 2017

As a finger exercise, I looked at morph-hbz01-to-lobid.xml to check which of the now-omitted fields are transformed and how, and to document it here.

| Number MAB2 | Field name MAB2 | If & how transformed to RDF |
| --- | --- | --- |
| 300 | Sammlungsvermerk | - |
| 304 | Einheitssachtitel | a → dc/terms/alternative |
| 310 | Hauptsachtitel in Ansetzungsform | → Titel |
| 333 | zu ergänzende Urheber zum Hauptsachtitel | If no title exists, set as title. Also taken as CorporateBodyTitle. |
| 334 | Allgemeine Materialbenennung | Match with Bibo/AudioDocument, bibo/AudioVisualDocument, bibo/Image, RDACarrierType/1020 (Microform Carriers). Used for checking if full text is online. |
| 340, 344, 348, 352 | Parallelsachtitel in Ansetzungsform | - |
| 342, 346, 350, 354 | zu ergänzende Urheber zum Parallelsachtitel | - |
| 361 | Beigefügte Werke | - |
| 410, 411, 412, 415, 416, 417, 418 | Alter Erscheinungsvermerk | - |
| 454, 464, 474, 484, 494 | Gesamttitel in Ansetzungsform – wird auf Verbundebene entschieden! | - |
| 502 | Einheitssachtitel eines beigefügten oder kommentierten Werkes | - |
| 504 | Angabe von Paralleltiteln | → dc/terms/alternative |
| 517 | Angaben zum Inhalt | - |
| 519 | Alter Hochschulschriftenvermerk | If existing, multiple values are combined as RDA Elements/u/P60489 |
| 532 | Hinweise auf frühere und spätere sowie zeitweise gültige Titel | - |
| 610 – 645 | Segment Sekundärformen | 619a (Erscheinungsjahr(e) in Vorlageform) matched with 021 (Identifikationsnummer der Primaerform) |
| 652 | Spezifische Materialbenennung und Dateityp | a (stands for RAK-NBM) → Online resource |
| 653 | Physische Beschreibung der Computerdatei auf Datenträger | - |
| 8XX | Segment Nichtstandardmäßige Nebeneintragungen | Matches with some GND-id? |
| 9XX | Bei RSWK-Schlagwörtern erstes Unterfeld $f | Matches with some GND-id? |

Contributor Author

acka47 commented Apr 21, 2017

Closing this super-issue as the two remaining sub-issues are sufficient for future orientation (and don't need to be implemented for the launch).

@acka47 acka47 closed this as completed Apr 21, 2017
@acka47 acka47 removed the ready label Apr 21, 2017