rOpenGov/estc?? #15

bhughesshelton · 2017-11-13T16:27:43Z

Hi,
I was just wondering what happened to your repo over at http://github.com/rOpenGov/estc. I'm doing some computational bibliography and found your article, "A Quantitative Study of History in the English Short-Title Catalogue (ESTC), 1470-1800." I had a look at the source code a few weeks ago, but the repo seems to be gone now. Is there any way you can send me the src or allow me to fork the repo? Apologies if this isn't the right venue for this kind of question.

antagomir · 2017-11-13T16:30:14Z

Hi! Yes, the repository was permanently moved to http://github.com/COMHIS/estc very recently, and we are still updating all the cross-links. Apologies for the hassle. Let us know if we can provide support.

antagomir · 2017-11-13T16:31:48Z

However, note that this code and the analyses rely on data that is not public. We obtained the data under a confidential collaboration agreement. Therefore, the estc repository itself is mostly of informational value and does not allow reproducing the analysis in the paper unless you have your own copy of the data.

bhughesshelton · 2017-11-13T18:31:10Z

Thanks, I have the complete ESTC already--UC Riverside was kind enough to provide me with it. I appreciate you sending me the new link to the repo. I've written some pretty kludgey Perl scripts (https://github.com/bhughesshelton/ESTC/blob/master/ESTCmeta.pl) to handle the data, but am looking forward to seeing what your code can do. Any suggestions for normalizing historical name data?

Cheers,
Barry

antagomir · 2017-11-13T23:03:22Z

Thanks for your interest! We are now reorganizing the code, and the complete workflow is not replicable at the moment for various technical reasons. The aim is to get this set up for the complete data cleaning process, and we are working on it.

If you are interested in specific fields, I can see what we could do. Do you mean historical person names, place names, or something else?

We would kindly ask you to cite the work where appropriate.

bhughesshelton · 2017-11-14T02:33:11Z

I've been trying to find some way to deal with personal names, specifically normalizing the spelling of the proper names that I'm grepping out of the MARC 260 fields in the ESTC. I just spent a few hours doing it by hand in an Excel sheet, and finished the period up to 1641--everything covered by the original STC. Even though I work mostly in Perl, I'm pretty good with R as well, so let me know if you guys ever need any help. I'd be glad to contribute in any way I can.
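For readers following along, here is a minimal sketch of what pulling names out of a MARC 260 field can look like. It is not the Perl script linked above. It assumes the field is available as a text string with `$a`/`$b`/`$c` subfield markers, and that names in subfield `$b` follow the words "by" or "for"; both are simplifying assumptions, and the function name `imprint_names` is hypothetical.

```python
import re

# Assumed textual form of a 260 field: "$aPlace :$bPublisher statement,$cDate."
SUBFIELD = re.compile(r"\$([a-z0-9])([^$]*)")

# Assumption: a name is a run of capitalized words following "by" or "for".
NAME_AFTER_ROLE = re.compile(r"\b(by|for)\s+([A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)")


def imprint_names(field_260):
    """Return (role_word, name) pairs found in the $b subfield(s)."""
    names = []
    for code, value in SUBFIELD.findall(field_260):
        if code != "b":  # $b holds the printer/publisher statement
            continue
        names.extend(NAME_AFTER_ROLE.findall(value))
    return names


# Invented sample imprint, for illustration only (not taken from the ESTC data).
sample = "$aLondon :$bPrinted by Iohn Windet, for Thomas Man,$c1590."
pairs = imprint_names(sample)
```

Real imprint statements are far less regular than this (Latin forms, initials, "at the sign of" addresses), so a pattern like this would only be a first pass before manual checking.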

markjhill · 2017-11-15T15:44:57Z

I'm actually working on this aspect of the ESTC right now. Out of curiosity, what is your goal in normalizing the spelling? Having unique identifiers for each author?

antagomir · 2017-11-15T15:46:11Z

Great to hear! It might be useful to compare the matches, at least up to 1641, since our procedure is largely automated whereas yours seems to be manual. This would provide some quality control. It would also be helpful to check through our lists to spot possible mistakes. This work is ongoing on our side and should be ready fairly soon.
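A rough sketch of such a comparison, assuming both sides can export their work as a mapping from the raw spelling to the chosen standard form, and that both use the same convention for writing the standard form. The function name `compare_mappings` and the sample entries are hypothetical.

```python
def compare_mappings(manual, automated):
    """Compare two {raw_spelling: standard_form} mappings.

    Only spellings present in both mappings are compared. Returns the share
    of those spellings on which the two mappings agree (None if nothing is
    shared) and the list of disagreements for manual inspection.
    """
    shared = sorted(set(manual) & set(automated))
    disagreements = [
        (raw, manual[raw], automated[raw])
        for raw in shared
        if manual[raw] != automated[raw]
    ]
    if not shared:
        return None, disagreements
    agreement = 1 - len(disagreements) / len(shared)
    return agreement, disagreements


# Invented entries, for illustration only.
manual = {"Iohn Windet": "John Windet", "Thomas Manne": "Thomas Man"}
automated = {"Iohn Windet": "John Windet", "Thomas Manne": "Thomas Manne"}
agreement, disagreements = compare_mappings(manual, automated)
```

The disagreement list is the useful part for quality control: each entry is a spelling where one of the two procedures is likely wrong.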

bhughesshelton · 2017-11-15T18:17:44Z

Well, the names of authors are already (mostly) standardized in the ESTC--what I've done is extract and standardize the names of the other people associated with each text: printers, publishers and booksellers. And yes, I then assigned everyone a UUID, finally tying that number back to the UUID for each text, i.e. the ESTC number.
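As an illustration of the linking step described here, the following sketch assigns one identifier per standardized name and ties it to the ESTC number of each text. It assumes the input is a list of (ESTC number, standardized name) pairs; the function name `link_people_to_texts` and the record numbers are placeholders, not values from the actual data.

```python
import uuid


def link_people_to_texts(records):
    """Assign a UUID to each distinct name and link it to ESTC numbers.

    records: iterable of (estc_number, standardized_name) pairs.
    Returns (person_ids, links), where person_ids maps name -> UUID string
    and links is a list of (person_uuid, estc_number) rows.
    """
    person_ids = {}
    links = []
    for estc_number, name in records:
        if name not in person_ids:
            # Random UUID; the same name always reuses the first one assigned.
            person_ids[name] = str(uuid.uuid4())
        links.append((person_ids[name], estc_number))
    return person_ids, links


# Placeholder record numbers, for illustration only.
records = [
    ("S000001", "John Windet"),
    ("S000001", "Thomas Man"),
    ("S000002", "John Windet"),
]
person_ids, links = link_people_to_texts(records)
```

The `links` rows correspond to a join table in a relational database, with one row per person and text pair.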

bhughesshelton · 2017-11-15T18:24:21Z

Right, developing an automated process for historical name disambiguation would be almost impossible (and you'd end up with loads of mistakes). What I've done is run some fancy pattern matching routines across the MARC records to extract the information I was interested in, and load it into a relational db. Then I was able to open a JDBC connection between that db and a spreadsheet where I could sort the names and copy/paste the most frequent spelling over the less frequent ones.
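The "most frequent spelling wins" rule can be sketched in code as a way to generate candidates for manual review. This is not the spreadsheet workflow described above: it swaps in string similarity from Python's standard `difflib` module to guess which spellings belong together, and the 0.85 threshold and the function name `suggest_standard_forms` are arbitrary assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def suggest_standard_forms(names, threshold=0.85):
    """Map each spelling to the most frequent similar spelling.

    Spellings are visited from most to least frequent. Each one is mapped to
    the first already-accepted form whose similarity ratio reaches the
    threshold; otherwise it becomes a new accepted form itself.
    """
    counts = Counter(names)
    accepted = []  # standard forms, most frequent first
    mapping = {}
    for spelling, _ in counts.most_common():
        for form in accepted:
            ratio = SequenceMatcher(None, spelling.lower(), form.lower()).ratio()
            if ratio >= threshold:
                mapping[spelling] = form
                break
        else:
            accepted.append(spelling)
            mapping[spelling] = spelling
    return mapping


# Invented name list, for illustration only.
names = ["John Windet"] * 3 + ["Thomas Man"] * 2 + ["Iohn Windet", "Thomas Manne"]
mapping = suggest_standard_forms(names)
```

Similarity alone will merge distinct people with near-identical names (fathers and sons, for instance), so the output is best treated as a list of suggestions to confirm by hand.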

antagomir · 2017-11-15T18:34:09Z

Yes, that's the key, and it's what we do as well: automate as much as possible, and do the rest by hand. But some degree of automation is crucial here.
