This README file was generated on 2024-06-25 by (author) Erik Fredner and last modified on 2024-12-31 by (data curator) Sarah Reiff Conell

Title of Dataset:

Moving the Capital of US Literature from Boston to New York: Evidence from 11 million Library of Congress Records

Author Information:

Name: Erik Fredner

Institution: University of Richmond

ORCID: 0000-0002-2993-4961

Email: erik.fredner@richmond.edu

Website: https://fredner.org/

Sharing/Access Information:

Licenses/restrictions placed on the data: CC0-1.0 Creative Commons Universal

Licenses/restrictions placed on the code: MIT The MIT License

Other publicly accessible locations of the data: https://github.com/erikfredner/c19dc

Data Derived from: https://lccn.loc.gov/2020445551

Methods:

Getting nineteenth century data from Library of Congress book records

This data publication contains code and data for "Moving the Capital of US Literature from Boston to New York: Evidence from 11 million Library of Congress records," published by the Nineteenth Century Data Collective.

The code can be modified to extract data from any MARC record fields from any Library of Congress Classification Outline range available in the MDSConnect "Books All" dataset.

Basic information about the dataset

Normalized places of publication for C19 US literature books
Original data source: https://lccn.loc.gov/2020445551
Data analysis: Erik Fredner
Dataset languages: Primarily English, though some titles in other languages

Data structure

See the data folder in this repository for the dataset (data.csv) and a data dictionary (data_dictionary.csv) describing its columns. All transformations to the MDSConnect Books All dataset are recorded in this repository.

Ethics

MDSConnect is the LC's open access MARC records dataset.

Format

The essay and dataset are both available in plain text formats (.qmd and .csv respectively). The essay is also available as Quarto-rendered .html or .pdf documents.

About this repo

This repository follows the conventions outlined in The Good Research Code Handbook:

Patrick J Mineault & The Good Research Code Handbook Community (2021). The Good Research Code Handbook. Zenodo. doi:10.5281/zenodo.5796873.

Local reproduction and customization

This code has been tested on macOS 15.1 (24B83) with the conda environment indicated in environment.yml.

Requirements

Download and extract the most current version of the MDSConnect "Books All" dataset.
- All figures referenced in the essay refer to the 2019 version, the most current as of this writing.
Reset any local paths referenced in the Jupyter notebooks to paths on your machine.

Original use

As written, the code identifies and extracts LC "PS..." records (the American literature range) from the complete books dataset.
It also extracts publication information, including first author, title, publication place, and year, as visible in data/data.csv.
It then normalizes place names and publication years to measure the changing imprint geographies of US literature.
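
The last step above, measuring imprint geography over time, can be sketched as a count of places per decade. This is an illustration only, not the analysis code in the notebooks: the column names `place` and `year`, the function name `place_shares_by_decade`, and the rows in `toy_csv` are all invented for the example and are not results from the dataset. Consult data_dictionary.csv for the real column names.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented rows for illustration only; NOT values or results from data/data.csv.
toy_csv = """place,year
Boston,1845
Boston,1848
New York,1849
New York,1885
New York,1887
Boston,1889
"""


def place_shares_by_decade(rows, place_col="place", year_col="year"):
    """Return {decade: {place: share of that decade's records}}.

    Column names are hypothetical defaults. Rows whose year is 0
    (the marker for uncertain dates) are skipped.
    """
    counts = defaultdict(Counter)
    for row in rows:
        year = int(row[year_col])
        if year == 0:
            continue
        decade = year // 10 * 10
        counts[decade][row[place_col]] += 1

    shares = {}
    for decade, counter in counts.items():
        total = sum(counter.values())
        shares[decade] = {place: n / total for place, n in counter.items()}
    return shares


shares = place_shares_by_decade(csv.DictReader(io.StringIO(toy_csv)))
for decade in sorted(shares):
    print(decade, shares[decade])
```

Working with shares rather than raw counts keeps decades comparable even though the number of catalogued books grows over the century.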

How to modify

To select a different set of records or extract different fields from them, modify src/data_collection.py.
The values hard-coded in process_record() determine which records are pulled.
- For instance, change the str(classification).startswith("PS") expression to select records from any class or subclass given in the Library of Congress Classification Outline.
- To change the subfields retrieved, modify or add calls to extract_subfields() in process_record(), using the field tags and subfield codes defined in the MARC format.
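
The first modification above could be generalized into a small helper that accepts several class prefixes. This is a minimal sketch, not the code in src/data_collection.py: the function name `in_classification_range` and its `prefixes` parameter are hypothetical, and it assumes the classification is available as a call-number string such as "PS1541 .A6".

```python
def in_classification_range(classification, prefixes=("PS",)):
    """Return True if the call number starts with any of the given LCC prefixes.

    Hypothetical helper: generalizes the str(classification).startswith("PS")
    check to more than one class or subclass.
    """
    if classification is None:
        # Without this guard, str(None) == "None" would match the prefix "N".
        return False
    return str(classification).strip().startswith(tuple(prefixes))


# Example: keep English literature (PR) alongside American literature (PS).
print(in_classification_range("PS1541 .A6", prefixes=("PS", "PR")))
print(in_classification_range("QA76.73", prefixes=("PS", "PR")))
```

Note that a one-letter prefix such as "P" matches every subclass beneath it (PR, PS, PZ, and so on), which may or may not be what you want.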

For example, if you wanted to extract information about the extent of each book, you could do so by referencing MARC field 300 (Physical Description) and adding a call inside process_record() like so:

extent = extract_subfields(record=record, tag="300", subfield_code="a", ns=ns)
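
To make the call above concrete, here is a self-contained sketch of how a function with that signature might work on a MARCXML record. It is an assumption about the behavior, not the implementation in src/data_collection.py: the body of `extract_subfields`, the list return type, the `marc` namespace prefix, and the hand-written sample record are all illustrative. Only the MARCXML namespace URI and the datafield/subfield element structure come from the MARCXML schema itself.

```python
import xml.etree.ElementTree as ET

# MARCXML namespace; the "marc" prefix is an arbitrary label chosen for this sketch.
ns = {"marc": "http://www.loc.gov/MARC21/slim"}


def extract_subfields(record, tag, subfield_code, ns):
    """Return the text of every matching subfield in one record (hypothetical sketch)."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{subfield_code}']"
    return [el.text.strip() for el in record.findall(path, ns) if el.text]


# A hand-written record for illustration only, not taken from the dataset.
sample = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">xii, 250 p.</subfield>
    <subfield code="c">20 cm.</subfield>
  </datafield>
</record>
"""

record = ET.fromstring(sample)
extent = extract_subfields(record=record, tag="300", subfield_code="a", ns=ns)
print(extent)  # ['xii, 250 p.']
```

Returning a list rather than a single string is a design choice in this sketch: MARC fields and subfields can repeat, so a list avoids silently dropping values.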

Known Limitations

By design, the automated cleaning processes do not catch 100% of records. For some applications, it might be desirable to perform additional steps to correct certain values.

For example, while almost all works have a four-digit year of publication, catalogers express some years with uncertainty, e.g., "18--?" or "187-?". The current model ignores such values (setting them to 0, so that they are excluded from the analysis) in preference to assuming that "187-?" should be treated as 1870 or 1875. Other researchers might prefer to keep the information encoded at the level of the decade or century in such strings. Fewer than 1% of records have uncertain or ambiguous dates.
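
The two options described above can be sketched side by side. This is a simplified illustration, not the cleaning logic in src/data_cleaning.py: the function names `parse_year` and `parse_year_range` are hypothetical, and the sketch assumes the date string begins with its digits (real catalog dates may carry brackets or prefixes that need stripping first).

```python
import re


def parse_year(raw):
    """Return a clean four-digit year as int, or 0 when the date is uncertain."""
    match = re.fullmatch(r"\s*(\d{4})\s*", str(raw))
    return int(match.group(1)) if match else 0


def parse_year_range(raw):
    """Return the (earliest, latest) years implied by a partial date, or None.

    Alternative for researchers who want decade- or century-level information.
    """
    match = re.match(r"\s*(\d{2,4})", str(raw))
    if not match:
        return None
    digits = match.group(1)
    missing = 4 - len(digits)           # number of unknown trailing digits
    low = int(digits) * 10 ** missing   # e.g. "187" -> 1870, "18" -> 1800
    return (low, low + 10 ** missing - 1)


print(parse_year("1851"))         # 1851
print(parse_year("187-?"))        # 0
print(parse_year_range("187-?"))  # (1870, 1879)
print(parse_year_range("18--?"))  # (1800, 1899)
```

Keeping a range instead of a single guessed year preserves what the cataloger actually recorded and leaves the choice of midpoint or bound to the downstream analysis.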

AI Statement

I used GitHub Copilot completions in writing some of this code.

File and Folder Directory:

├── data
│   ├── data.csv
│   ├── data_dictionary.csv
├── docs
│   ├── essay.html
│   ├── essay.pdf
│   ├── essay.qmd
├── results
│   ├── lc_ps.png
│   ├── lc_ps_pct.png
│   ├── moby.png
├── scripts
│   ├── 1 Get LC Data.ipynb
│   ├── 2 Clean LC Data.ipynb
│   ├── 3 Visualize LC Data.ipynb 
├── src
│   ├── __init__.py
│   ├── data_cleaning.py
│   ├── data_collection.py
├── .gitignore
├── README.md
├── environment.yml

Data Specific Information:

Variables and Abbreviations can be found in data_dictionary.csv
