Commit 397ffd1

Add alt text to brandon cluster
walshbr committed Sep 7, 2019
1 parent cc250ab commit 397ffd1
Showing 9 changed files with 79 additions and 73 deletions.
7 changes: 4 additions & 3 deletions en/lessons/data-mining-the-internet-archive.md
@@ -16,6 +16,7 @@ activity: acquiring
topics: [web-scraping]
abstract: "The collections of the Internet Archive include many digitized historical sources. Many contain rich bibliographic data in a format called MARC. In this lesson, you'll learn how to use Python to automate the downloading of large numbers of MARC files from the Internet Archive and the parsing of MARC records for specific information such as authors, places of publication, and dates. The lesson can be applied more generally to other Internet Archive files and to MARC records found elsewhere."
redirect_from: /lessons/data-mining-the-internet-archive
avatar_alt: Group of men working in a mine
---

{% include toc.html %}
@@ -336,7 +337,7 @@ for item in search:
    print item['identifier']
```

You should get the same results.
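As a hedged aside (not part of this commit), here is what that loop can look like with a current release of the `internetarchive` package, where the search call is named `search_items`; the `*_marc.xml` pattern below is a guess at how the collection names its MARC files.

```
import internetarchive

# iterate over every item in the Anti-Slavery Collection
search = internetarchive.search_items('collection:bplscas')

for result in search:
    item_id = result['identifier']
    print(item_id)
    # fetch the item and download only files matching the assumed MARC XML pattern
    item = internetarchive.get_item(item_id)
    item.download(glob_pattern='*_marc.xml')
```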

The second thing to note about the *for loop* is that the indented block
could have contained other commands. In this case, we printed each
@@ -547,10 +548,10 @@ def map_xml(function, *files):
"""
map a function onto the file, so that for each record that is
parsed the function will get called with the extracted record
def do_it(r):
print r
map_xml(do_it, 'marc.xml')
"""
```
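Following the usage sketched in this docstring, a minimal example of parsing downloaded records might look like the sketch below. The filename `allbibs.xml` is an assumption, and the choice of MARC field 260, subfield a (place of publication) simply echoes the kind of information the lesson's abstract mentions.

```
import pymarc

def get_place_of_pub(record):
    # MARC field 260, subfield a holds the place of publication, when present
    try:
        print(record['260']['a'])
    except (TypeError, KeyError, AttributeError):
        pass  # skip records without a 260$a

pymarc.map_xml(get_place_of_pub, 'allbibs.xml')
```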
1 change: 1 addition & 0 deletions en/lessons/data_wrangling_and_management_in_R.md
@@ -16,6 +16,7 @@ abstract: "This tutorial explores how scholars can organize 'tidy' data, underst
layout: lesson
review-ticket: https://github.com/programminghistorian/ph-submissions/issues/60
redirect_from: /lessons/data-wrangling-and-management-in-R
avatar_alt: Bar of soap
---

{% include toc.html %}

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions en/lessons/downloading-multiple-records-using-query-strings.md
@@ -18,6 +18,7 @@ abstract: "Downloading a single record from a website is easy, but downloading m
previous: output-keywords-in-context-in-html-file
python_warning: true
redirect_from: /lessons/downloading-multiple-records-using-query-strings
avatar_alt: Figures working in a mine, pushing carts
---

{% include toc.html %}
@@ -130,7 +131,7 @@ Take a look at the URL produced with the last search results page. It
should look like this:

``` xml
https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_divs_fulltext=mulatto*+negro*&kwparse=advanced&_divs_div0Type_div1Type=sessionsPaper_trialAccount&fromYear=1700&fromMonth=00&toYear=1750&toMonth=99&start=0&count=0
```

We had a look at URLs in [Viewing HTML Files][], but this looks a lot
@@ -150,7 +151,7 @@ https://www.oldbaileyonline.org/search.jsp
&toYear=1750
&toMonth=99
&start=0
&count=0
```

In this view, we see more clearly our 12 important pieces of information
@@ -161,7 +162,7 @@ it does not do anything.) and a series of 10 *name/value pairs* put
together with & characters. Together these 10 name/value pairs comprise
the query string, which tells the search engine what variables to use in
specific stages of the search. Notice that each name/value pair contains
both a variable name (toYear) and the value assigned to that variable (1750).
This works in exactly the same way as *Function Arguments* by
passing certain information to specific variables. In this case, the
most important variable is `_divs_fulltext=` which has been given the
@@ -243,7 +244,7 @@ page. We have already got the first one by using the
website:

``` xml
https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_divs_fulltext=mulatto*+negro*&kwparse=advanced&_divs_div0Type_div1Type=sessionsPaper_trialAccount&fromYear=1700&fromMonth=00&toYear=1750&toMonth=99&start=0&count=0
```

We could type this URL out twice and alter the ‘*start*’ variable to get
@@ -463,7 +464,7 @@ def getSearchResults(query, kwparse, fromYear, fromMonth, toYear, toMonth, entri
    url += '&toMonth=' + toMonth
    url += '&start=' + str(startValue)
    url += '&count=0'

    #download the page and save the result.
    response = urllib2.urlopen(url)
    webContent = response.read()
@@ -567,7 +568,7 @@ def getSearchResults(query, kwparse, fromYear, fromMonth, toYear, toMonth, entri
    url += '&toMonth=' + toMonth
    url += '&start=' + str(startValue)
    url += '&count=0'

    #download the page and save the result.
    response = urllib2.urlopen(url)
    webContent = response.read()
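As a hedged aside (not part of the lesson's code), the hand-concatenated query string above can also be produced with `urllib.urlencode` in Python 2, which assembles the name/value pairs and handles escaping. Note that it percent-encodes characters such as `*`, so the result is equivalent rather than byte-identical to the URL built by concatenation; the values below are illustrative.

```
import urllib

params = {
    'gen': 1,
    'form': 'searchHomePage',
    '_divs_fulltext': 'mulatto* negro*',  # spaces become '+', '*' becomes '%2A'
    'kwparse': 'advanced',
    '_divs_div0Type_div1Type': 'sessionsPaper_trialAccount',
    'fromYear': 1700, 'fromMonth': '00',
    'toYear': 1750, 'toMonth': '99',
    'start': 0, 'count': 0,
}
url = 'https://www.oldbaileyonline.org/search.jsp?' + urllib.urlencode(params)
```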
@@ -711,7 +712,7 @@ the trials. The first entry starts with “Anne Smith” so you can use the
Notice Anne’s name is part of a link:

``` xml
browse.jsp?id=t17160113-18&div=t17160113-18&terms=mulatto*_negro*#highlight
```
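(A hedged aside, not part of the lesson's code: a query string like this can also be unpacked with the standard library rather than by eye. `urlparse` is the Python 2 module name; in Python 3 the same functions live in `urllib.parse`.)

```
from urlparse import urlparse, parse_qs

link = 'browse.jsp?id=t17160113-18&div=t17160113-18&terms=mulatto*_negro*#highlight'
query = urlparse(link).query        # everything between '?' and '#'
trial_id = parse_qs(query)['id'][0]
print(trial_id)                     # t17160113-18
```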

Perfect, the link contains the trial ID! Scroll through the remaining
@@ -1063,7 +1064,7 @@ the command output so we know which files failed to download. This
should be added as the last line in the function.

```
print "failed to download: " + str(failedAttempts)
print "failed to download: " + str(failedAttempts)
```
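For context, `failedAttempts` has to be accumulated earlier in the download loop. Below is a minimal sketch under the lesson's `urllib2` approach; the `fileNames` and `baseURL` names are illustrative, not the lesson's actual variables.

```
import urllib2

failedAttempts = []
for fileName in fileNames:
    try:
        webContent = urllib2.urlopen(baseURL + fileName).read()
    except IOError:
        # remember the file so it can be reported at the end
        failedAttempts.append(fileName)
```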

Now when you run the program, should there be a problem downloading a
1 change: 1 addition & 0 deletions en/lessons/editing-audio-with-audacity.md
@@ -15,6 +15,7 @@ topics: [data-manipulation]
abstract: "In this lesson you will learn how to use Audacity to load, record, edit, mix, and export audio files."
review-ticket: https://github.com/programminghistorian/ph-submissions/issues/15
redirect_from: /lessons/editing-audio-with-audacity
avatar_alt: Two gramophones facing each other
---

{% include toc.html %}
en/lessons/exploring-and-analyzing-network-data-with-python.md
@@ -23,6 +23,7 @@ topics: [network-analysis]
date: 2017-08-23
abstract: "This lesson introduces network metrics and how to draw conclusions from them when working with humanities data. You will learn how to use the NetworkX Python package to produce and work with these network statistics."
redirect_from: /lessons/exploring-and-analyzing-network-data-with-python
avatar_alt: Train tracks intersecting
---

{% include toc.html %}
1 change: 1 addition & 0 deletions en/lessons/extracting-illustrated-pages.md
@@ -16,6 +16,7 @@ difficulty: 2
activity: acquiring
topics: [api]
abstract: Machine learning and API extensions by HathiTrust and Internet Archive are making it easier to extract page regions of visual interest from digitized volumes. This lesson shows how to efficiently extract those regions and, in doing so, prompt new, visual research questions.
avatar_alt: Scientific measuring device
---

{% include toc.html %}
Expand Down
31 changes: 16 additions & 15 deletions en/lessons/extracting-keywords.md
@@ -17,6 +17,7 @@ topics: [data-manipulation]
abstract: "This lesson will teach you how to use Python to extract a set of keywords very quickly and systematically from a set of texts."
python_warning: true
redirect_from: /lessons/extracting-keywords
avatar_alt: Woman churning butter or milk
---

{% include toc.html %}
@@ -59,7 +60,7 @@ The first step of this process is to take a look at the data that we will be usi

{% include figure.html filename="extracting-keywords-1.png" caption="Screenshot of the first forty entries in the dataset" %}

Download the dataset and spend a couple of minutes looking at the types of information available. You should notice three columns of information. The first, 'Name', contains the name of the graduate. The second, 'Details', contains the biographical information known about that person. The final column, 'Matriculation Year', contains the year in which the person matriculated (began their studies). This final column was extracted from the details column in the pre-processing stage of this tutorial. The first two columns are as you would find them on the British History Online version of the *Alumni Oxonienses*. If you notice more than three columns then your spreadsheet programme has incorrectly set the [delimiter](https://en.wikipedia.org/wiki/Delimiter) between columns. It should be set to "," (double quotes, comma). How you do this depends on your spreadsheet programme, but you should be able to find the solution online.

Most (but not all) of these bibliographic entries contain enough information to tell us what county the graduate came from. Notice that a large number of entries contain placenames that correspond to either major cities ('of London', in the first entry) or English counties ('of Middlesex' in entry 5 or 'of Wilts' - short for Wiltshire in entry 6). If you are not British you may not be familiar with these county names. You can find a list of [historic counties of England](http://en.wikipedia.org/wiki/Historic_counties_of_England) on Wikipedia.

@@ -161,7 +162,7 @@ The fourth line closes the open text file. The fifth line prints out the results

Save this file as `extractKeywords.py`, again to the same folder as the other files, and then run it with Python. To do this from the command line, first you need to launch your command line terminal.

On Windows it is called `Command Prompt`. Windows users may find it easier to launch Python by opening the folder containing your `extractKeywords.py` file, pressing `shift` + `right-click`, and selecting 'open command window here'. Assuming you have Python installed, you should be able to run your programme using the command beginning with 'python' below.

On Mac OS X, this is found in the `Applications` folder and is called `Terminal`. Once the Terminal window is open, you need to point your Terminal at the directory that contains all of the files you have just created. I have called my directory 'ExtractingKeywordSets' and I have it on my computer's Desktop. To change the Terminal to this directory, I use the following command:

@@ -268,7 +269,7 @@ This code will automatically check each word in a text, keeping track of matches

If it looks like it worked, delete the 'print matches' line and move to the next step.

### Step 5: Output results

If you have got to this stage, then your Python program is already finding the matching keywords from your gazetteer. All we need to do now is print them out to the command output pane in a format that's easy to work with.

@@ -282,7 +283,7 @@ Add the following lines to your program, minding the indentation as always:
        matchString = ''
        for matches in storedMatches:
            matchString = matchString + matches + "\t"

        print matchString

```
@@ -293,7 +294,7 @@ If there IS a match, then the program creates a new variable called 'matchString

When all of the matching keywords have been added to 'matchString', the program prints it out to the command output before moving on to the next text.
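(An aside, not part of the lesson's code: the loop-and-concatenate step can be written more compactly with `join`, which also avoids the trailing tab.)

```
matchString = "\t".join(storedMatches)
print matchString
```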

If you save your work and run the program, you should now have code that achieves all of the steps from the algorithm and outputs the results to your command output.

The finished code should look like this:

@@ -314,7 +315,7 @@ f.close()
for entry in allTexts:
    matches = 0
    storedMatches = []

    #for each entry:
    allWords = entry.split(' ')
    for words in allWords:
@@ -332,15 +333,15 @@ for entry in allTexts:
            else:
                storedMatches.append(words)
                matches += 1

    #if there is a stored result, print it out
    if matches == 0:
        print ' '
    else:
        matchString = ''
        for matches in storedMatches:
            matchString = matchString + matches + "\t"

        print matchString
```

@@ -416,7 +417,7 @@ with open('The_Dataset_-_Alumni_Oxonienses-Jas1.csv') as csvfile:
    for row in reader:
        #the full row for each entry, which will be used to recreate the improved CSV file in a moment
        fullRow.append((row['Name'], row['Details'], row['Matriculation Year']))

        #the column we want to parse for our keywords
        row = row['Details'].lower()
        allTexts.append(row)
@@ -484,7 +485,7 @@ with open('The_Dataset_-_Alumni_Oxonienses-Jas1.csv') as csvfile:
    for row in reader:
        #the full row for each entry, which will be used to recreate the improved CSV file in a moment
        fullRow.append((row['Name'], row['Details'], row['Matriculation Year']))

        #the column we want to parse for our keywords
        row = row['Details'].lower()
        allTexts.append(row)
@@ -503,7 +504,7 @@ with open(filename, 'a') as csvfile:
    fieldnames = ['Name', 'Details', 'Matriculation Year', 'Placename']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    #NEW! define the output for each row and then print to the output csv file
    writer = csv.writer(csvfile)

@@ -512,24 +513,24 @@ with open(filename, 'a') as csvfile:

        matches = 0
        storedMatches = []

        #for each entry:
        allWords = entry.split(' ')
        for words in allWords:

            #remove punctuation that will interfere with matching
            words = words.replace(',', '')
            words = words.replace('.', '')
            words = words.replace(';', '')

            #if a keyword match is found, store the result.
            if words in allKeywords:
                if words in storedMatches:
                    continue
                else:
                    storedMatches.append(words)
                    matches += 1

        #CHANGED! send any matches to a new row of the csv file.
        if matches == 0:
            newRow = fullRow[counter]
3 changes: 1 addition & 2 deletions en/lessons/fetch-and-parse-data-with-openrefine.md
@@ -15,6 +15,7 @@ activity: acquiring
topics: [data-manipulation, web-scraping, api]
abstract: "OpenRefine is a powerful tool for exploring, cleaning, and transforming data. In this lesson you will learn how to use Refine to fetch URLs and parse web content."
redirect_from: /lessons/fetch-and-parse-data-with-openrefine
avatar_alt: Machine for water filtration
---

{% include toc.html %}
@@ -640,5 +641,3 @@ OpenRefine is a flexible, pragmatic tool that simplifies routine tasks and, when
[^use]: As of July 2017, see [API Documentation](http://text-processing.com/docs/index.html).
[^1]: Jacob Perkins, "Sentiment Analysis with Python NLTK Text Classification", [http://text-processing.com/demo/sentiment/](http://text-processing.com/demo/sentiment/).
[^2]: Vivek Narayanan, Ishan Arora, and Arjun Bhatia, "Fast and accurate sentiment classification using an enhanced Naive Bayes model", 2013, [arXiv:1305.6143](https://arxiv.org/abs/1305.6143).

