From 397ffd1ff25be40b87ba8c5db2504e0753af4f3c Mon Sep 17 00:00:00 2001 From: Brandon Walsh Date: Sat, 7 Sep 2019 10:34:14 -0400 Subject: [PATCH] Add alt text to brandon cluster --- .../data-mining-the-internet-archive.md | 7 +- .../data_wrangling_and_management_in_R.md | 1 + ...g-data-and-network-analysis-using-neo4j.md | 90 +++++++++---------- ...ng-multiple-records-using-query-strings.md | 17 ++-- en/lessons/editing-audio-with-audacity.md | 1 + ...-and-analyzing-network-data-with-python.md | 1 + en/lessons/extracting-illustrated-pages.md | 1 + en/lessons/extracting-keywords.md | 31 +++---- .../fetch-and-parse-data-with-openrefine.md | 3 +- 9 files changed, 79 insertions(+), 73 deletions(-) diff --git a/en/lessons/data-mining-the-internet-archive.md b/en/lessons/data-mining-the-internet-archive.md index 9d220202b..4355ec699 100755 --- a/en/lessons/data-mining-the-internet-archive.md +++ b/en/lessons/data-mining-the-internet-archive.md @@ -16,6 +16,7 @@ activity: acquiring topics: [web-scraping] abstract: "The collections of the Internet Archive include many digitized historical sources. Many contain rich bibliographic data in a format called MARC. In this lesson, you'll learn how to use Python to automate the downloading of large numbers of MARC files from the Internet Archive and the parsing of MARC records for specific information such as authors, places of publication, and dates. The lesson can be applied more generally to other Internet Archive files and to MARC records found elsewhere." redirect_from: /lessons/data-mining-the-internet-archive +avatar_alt: Group of men working in a mine --- {% include toc.html %} @@ -336,7 +337,7 @@ for item in search: print item['identifier'] ``` -You should get the same results. +You should get the same results. The second thing to note about the *for loop* is that the indented block could have contained other commands.
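To see this concretely, here is a short sketch in which the indented block runs two commands on every pass through the loop. The list of dictionaries is made-up stand-in data (iterating over the real search results works the same way), and it uses Python 3 print syntax rather than the Python 2 syntax shown in the lesson:

```python
# Made-up stand-in for the items returned by the search; the real
# search object is iterated over in exactly the same way.
search = [
    {'identifier': 'letter001'},
    {'identifier': 'letter002'},
    {'identifier': 'letter003'},
]

identifiers = []   # first command: collect each identifier
count = 0          # second command: keep a running total

for item in search:
    identifiers.append(item['identifier'])
    count += 1

print(count)   # prints 3
```

Both statements inside the indented block run once per item, so any number of commands can be stacked there.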
In this case, we printed each @@ -547,10 +548,10 @@ def map_xml(function, *files): """ map a function onto the file, so that for each record that is parsed the function will get called with the extracted record - + def do_it(r): print r - + map_xml(do_it, 'marc.xml') """ ``` diff --git a/en/lessons/data_wrangling_and_management_in_R.md b/en/lessons/data_wrangling_and_management_in_R.md index 7facf3d33..3973fe29e 100755 --- a/en/lessons/data_wrangling_and_management_in_R.md +++ b/en/lessons/data_wrangling_and_management_in_R.md @@ -16,6 +16,7 @@ abstract: "This tutorial explores how scholars can organize 'tidy' data, underst layout: lesson review-ticket: https://github.com/programminghistorian/ph-submissions/issues/60 redirect_from: /lessons/data-wrangling-and-management-in-R +avatar_alt: Bar of soap --- {% include toc.html %} diff --git a/en/lessons/dealing-with-big-data-and-network-analysis-using-neo4j.md b/en/lessons/dealing-with-big-data-and-network-analysis-using-neo4j.md index 73375958a..48f44a9c6 100755 --- a/en/lessons/dealing-with-big-data-and-network-analysis-using-neo4j.md +++ b/en/lessons/dealing-with-big-data-and-network-analysis-using-neo4j.md @@ -2,7 +2,7 @@ title: Dealing with Big Data and Network Analysis Using Neo4j collection: lessons slug: dealing-with-big-data-and-network-analysis-using-neo4j -authors: +authors: - Jon MacKay date: 2018-02-20 reviewers: @@ -17,6 +17,7 @@ activity: analyzing abstract: "In this lesson we will learn how to use a graph database to store and analyze complex networked information. This tutorial will focus on the Neo4j graph database, and the Cypher query language that comes with it." 
layout: lesson redirect_from: /lessons/dealing-with-big-data-and-network-analysis-using-neo4j +avatar_alt: Constellation chart --- {% include toc.html %} @@ -24,24 +25,24 @@ redirect_from: /lessons/dealing-with-big-data-and-network-analysis-using-neo4j # Introduction In this lesson we will learn how to use a graph database to store and analyze complex networked information. -Networks are all around us. +Networks are all around us. Social scientists use networks to better understand how people are connected. This information can be used to understand how things like rumors or even communicable diseases can spread throughout a community of people. The patterns of relationships that people maintain with others captured in a network can also be used to make inferences about a person's position in society. For example, a person with many social ties is likely to receive information more quickly than someone who maintains very few connections with others. Using common network terminology, one would say that a person with many ties is more central in a network, and a person with few ties is more peripheral in a network. -Having access to more information is generally believed to be advantageous. +Having access to more information is generally believed to be advantageous. Similarly, if someone is very well-connected to many other people who are themselves well-connected, then we might infer that these individuals have a higher social status. Network analysis is useful to understand the implications of ties between organizations as well. Before he was appointed to the Supreme Court of the United States, Louis Brandeis called attention to how anti-competitive activities were often organized through a web of appointments that had directors sitting on the boards of multiple ostensibly competing corporations.
Since the 1970s sociologists have taken a more formal network-based approach to examining the network of so-called corporate interlocks that exist when directors sit on the boards of multiple corporations. -Often these ties are innocent, but in some cases they can be indications of morally or legally questionable activities. -The recent release of the +Often these ties are innocent, but in some cases they can be indications of morally or legally questionable activities. +The recent release of the [Paradise Papers](https://neo4j.com/blog/icij-releases-neo4j-desktop-download-paradise-papers/) by -the -[International Consortium of Investigative Journalists](https://icij.org) +the +[International Consortium of Investigative Journalists](https://icij.org) and the ensuing news scandals throughout the world shows how important understanding relationships between people and organizations can be. @@ -58,8 +59,8 @@ By the end of this lesson you will be able to construct, analyze and visualize networks based on big --- or just inconveniently large --- data. The final section of this lesson contains code and data to illustrate the key points of this lesson. -Although beyond the scope of this tutorial, those interested in trying to better understand social networks -can refer to a number of sources. +Although beyond the scope of this tutorial, those interested in trying to better understand social networks +can refer to a number of sources. Sociologists Robert A. Hanneman and Mark Riddle maintain an [on-line textbook on network analysis](http://faculty.ucr.edu/~hanneman/nettext/). There are also regular conferences hosted and useful resources available from the [International Network for Social Network Analysis](http://www.insna.org). @@ -68,13 +69,13 @@ I strongly recommend that you read the lesson through before trying the example Wherever possible I have included links back to more detailed documentation or tutorials. -# What is Neo4j and why use it? 
+# What is Neo4j and why use it? Neo4j is a specialized database that manages graphs. Traditional database software stores information in tables -- much like data is displayed in Excel -spreadsheets except on a much larger scale. Neo4j is also concerned with storing large -amounts of data but it is primarily designed to capture the relationship between items of -information. Therefore, the organizing principle underlying Neo4j is to store information as a network of relationships rather than a table. Networks contain nodes and nodes are connected through +spreadsheets except on a much larger scale. Neo4j is also concerned with storing large +amounts of data but it is primarily designed to capture the relationship between items of +information. Therefore, the organizing principle underlying Neo4j is to store information as a network of relationships rather than a table. Networks contain nodes and nodes are connected through ties. (Nodes are also referred to as "vertices" and ties are referred to as "edges" or links. Networks are also frequently referred to as graphs.) Databases are designed for dealing with large amounts of data. @@ -83,8 +84,8 @@ The *Programming Historian* has excellent tutorials for dealing with network dat For an introduction, see [Exploring and Analyzing Network Data with Python](/lessons/exploring-and-analyzing-network-data-with-python). -# Installing and creating a Neo4j database -Neo4j is currently the most popular graph database on the market. +# Installing and creating a Neo4j database +Neo4j is currently the most popular graph database on the market. It is also well documented and open-source so this tutorial will focus on it. Accessing information within this type of database is as easy as following connections across the nodes of the graph. 
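The difference between the two organizing principles can be sketched in a few lines of Python. This is only a toy illustration of the idea — not a description of how Neo4j stores data internally — and the ids and company names are invented for the example:

```python
# Toy contrast between table-style and graph-style storage.

# A table is a list of rows: finding a director's companies
# means scanning every row.
rows = [
    {'director_id': 4, 'company': 'CANADIAN BANK OF COMMERCE'},
    {'director_id': 4, 'company': 'SHAWINIGAN WATER AND POWER'},
    {'director_id': 7, 'company': 'CANADIAN BANK OF COMMERCE'},
]
companies_by_scan = [r['company'] for r in rows if r['director_id'] == 4]

# A graph stores the relationship itself: each node points straight
# at its neighbours, so following a tie is a single lookup.
graph = {
    4: ['CANADIAN BANK OF COMMERCE', 'SHAWINIGAN WATER AND POWER'],
    7: ['CANADIAN BANK OF COMMERCE'],
}
companies_by_tie = graph[4]

print(companies_by_scan == companies_by_tie)   # prints True
```

Both representations answer the question, but the graph answers it by following connections rather than by scanning a table — the trade-off that motivates a graph database.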
@@ -99,7 +100,7 @@ Once you download the desktop and install it you will be prompted to enter your At this point, you can choose to log in with an existing social media account or create a new login name and password.
-You may be prompted to update the software. Our recommendation is to allow the installation to continue and update the software afterwards. +You may be prompted to update the software. Our recommendation is to allow the installation to continue and update the software afterwards.
Once you start the Neo4j Desktop installation process, the software will take care of installing all of the software it depends on including the latest Java Runtime Environment it depends on. @@ -108,7 +109,7 @@ This step requires that you have a connection to the Internet. {% include figure.html filename="new_neo4j_desktop_install.png" caption="Neo4j Desktop installation" %} ## Creating a new project -When the Neo4j Desktop starts for the first time, you will see a list of icons on the far left. +When the Neo4j Desktop starts for the first time, you will see a list of icons on the far left. The topmost icon is a small file folder. This is the projects tab. You can edit projects by simply clicking on a project in the project list. When you do so, the contents of the project will be displayed on the far right of the application (the part with the white background). @@ -134,7 +135,7 @@ Now that we have the Neo4j database installed, we need to add some example data The easiest way to load data into the Neo4j database is to load the information you are interested in using comma separated value (CSV) files. You will need to separate your data into data for nodes and data for edges. -This is a common way for network information to be separated. +This is a common way for network information to be separated. In this lesson we will use some example data that has already been formatted. Using the CSV batch loading mechanism in Neo4j is the fastest way to import data into your new database. @@ -146,7 +147,7 @@ This process assumes that you have an empty database. [edges_director_duration.csv](/assets/dealing-with-big-data-and-network-analysis-using-neo4j/edges_director_duration.csv). The canonical guide to loading data from CSV is on [the Neo4j website](https://neo4j.com/developer/guide-import-csv/).** -Now that we have the example CSV files downloaded, we will use the **Cypher** query language to load them into our empty Neo4j database. 
+Now that we have the example CSV files downloaded, we will use the **Cypher** query language to load them into our empty Neo4j database. Cypher is a specialized query language that is designed to load and manipulate data in the Neo4j database. ## Formatting CSV files for loading into Neo4j @@ -160,7 +161,7 @@ Let's examine the basic format of the two CSV files we downloaded. | companyId | name | |----------------------|--------------| | 1 | CANADIAN BANK OF COMMERCE | -| 2 | SHAWINIGAN WATER AND POWER | +| 2 | SHAWINIGAN WATER AND POWER | | ... | ... | **edges_director_duration.csv** @@ -176,7 +177,7 @@ By looking at the two data files we can see that the Canadian Bank of Commerce a This director effectively acts as a tie (also known as a corporate interlock) between the two companies. -Note that we could just as easily make the directors the nodes and the companies the edges that connect them. +Note that we could just as easily make the directors the nodes and the companies the edges that connect them. This would give us a clearer picture of the professional network that unites individual directors. Another alternative would be to represent both Companies and Directors as node types. @@ -204,11 +205,11 @@ This process assumes that your data is cleanly separated into node and edge CSV ## Moving the CSV files to the import directory -Click on the "Manage" button in the database pane, then the drop down menu next to "Open Folders" and select "Import." A window will appear with a directory. +Click on the "Manage" button in the database pane, then the drop down menu next to "Open Folders" and select "Import." A window will appear with a directory. {% include figure.html filename="new-neo4j-files.png" caption="Pressing the Open Folders button" %} -You now need to copy the +You now need to copy the `nodes_nodes_companies.csv` and the `edges_director_duration.csv` files there. Now we can use a Cypher command to load the files. 
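Before running the load command, it can be worth checking that a CSV file's header row matches the column names the Cypher commands expect. The following is a hedged Python sketch: the two-line in-memory sample stands in for `edges_director_duration.csv`, and to check a real file you would replace `io.StringIO(...)` with `open(...)`:

```python
import csv
import io

# Stand-in sample for edges_director_duration.csv; the header names
# match those referenced by the Cypher LOAD CSV commands below.
sample = io.StringIO(
    "START_ID,years_served,END_ID\n"
    "1,9,2\n"
)

reader = csv.DictReader(sample)
required = {'START_ID', 'years_served', 'END_ID'}
missing = required - set(reader.fieldnames)

print(missing)   # an empty set means every expected column is present
```

Catching a misspelled or missing header this way is much easier than debugging a Cypher query that silently matches nothing.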
@@ -222,7 +223,7 @@ In order to start the database, press the triangular play icon. In the "Details" tab beneath, you will see information about the database starting. You'll notice that the database server is running on "HTTP port 7474". -Neo4j allows access to the database through a web server. In the next step, we will open +Neo4j allows access to the database through a web server. In the next step, we will open a browser to connect to the database. ## Opening the browser @@ -233,7 +234,7 @@ First, you will need to login to your new database. By default, the username and password are both `neo4j`. After you log in the first time, you will be prompted to create a new password. -At the top of the window is a prompt with a blinking cursor. +At the top of the window is a prompt with a blinking cursor. We can add our Cypher command to load our data here {% include figure.html filename="new-neo4j-browser.png" caption="Browser window" %} @@ -253,7 +254,7 @@ The Cypher command LOADs the CSV file that contains informative headers (i.e. th from the file we placed in the import directory. (By default, Neo4j can only load files from this directory.) The results will be stored as an object called **nodes**. -The second line CREATEs data in our database. In particular, we create a series of node objects of the type COMPANY +The second line CREATEs data in our database. In particular, we create a series of node objects of the type COMPANY that contain a `name` and an `id`. We set the name of this new company node to the name stored in the `nodes.name` object and the id to the same as stored in the `nodes.companyID`. Notice that the fields that are stored in the nodes object corresponds to the headers we set in the CSV files. We also use the `toInteger()` function to make sure our numbers are represented as integers and not as text. @@ -263,11 +264,11 @@ Next we need to load the edge data. This command does something similar. 
However, a new command called MATCH has been introduced. The first line loads the CSV file from the import directory and assigns it to a variable called **edges**. The next two lines use MATCH. The first line goes to the existing database and finds a COMPANY node with -an id the same as START_ID. The next line does the same thing, except looks for a match with the END_ID column +an id the same as START_ID. The next line does the same thing, except looks for a match with the END_ID column in the CSV file. These results are assigned to the variables `a` and `b`, respectively. The final line CREATES a relationship between these nodes. In this case, the relationship type is called INTERLOCK. -There is a field called years within the INTERLOCK that is set to the years_served value from the CSV. +There is a field called years within the INTERLOCK that is set to the years_served value from the CSV. ```sql LOAD CSV WITH HEADERS FROM "file:///edges_director_duration.csv" AS edges @@ -277,7 +278,7 @@ CREATE (a)-[r:INTERLOCK{weight:toInteger(edges.years_served)}]->(b); ``` **Note: If you have difficulties during the loading process, you can delete all of the nodes and -edges in your database using the following command.** +edges in your database using the following command.** ``` MATCH (n) @@ -285,7 +286,7 @@ DETACH DELETE n ``` -### Using the Cypher query language +### Using the Cypher query language Cypher is a powerful language to query graph databases. Cypher is a language dedicated to loading, selecting or altering data that is stored in the Neo4j database. @@ -303,15 +304,15 @@ CREATE (acompany:COMPANY { id:900, name:"Economical Mutual Fire Insurance Compan ``` In this example, `acompany` is the variable name we have given to the node object we created in the database. -We marked the node object as being a `COMPANY` type. +We marked the node object as being a `COMPANY` type. 
A COMPANY has an attribute called `id` which is a unique number assigned to that particular company. -In the examples above, each entry also has a `name` field. +In the examples above, each entry also has a `name` field. We can use this unique id to query the database for information about the ties from each firm. Now suppose that the database already contains data and we aren't sure if there is information about a given company. In this case, we can use the MATCH statement to match a unique node and manipulate it. -In the following example, we MATCH both the companynodes (represented by the variables c and p). +In the following example, we MATCH both the company nodes (represented by the variables c and p). The CREATE statement then uses the match for each company and CREATEs a relationship between the two nodes. In this case, the relationship is of the type INTERLOCK. @@ -326,13 +327,13 @@ The relationship between COMPANIES is defined as an INTERLOCK. But it is important to note that we can define multiple different kinds of nodes and relationships.
-Data can be represented many different ways. +Data can be represented many different ways. It is worth carefully considering what insights you want to get out of your data before you commit to a structure in the database.
Finally, the RETURN statement returns the variables for us to further manipulate. -For example, we might decide to add another attribute to the company. -Here we add a URL attribute to the company object that contains the company's current web site. +For example, we might decide to add another attribute to the company. +Here we add a URL attribute to the company object that contains the company's current web site. ``` SET c.url = "https://economical.com"; @@ -340,12 +341,12 @@ SET c.url = "https://economical.com"; ### Reviewing the data -The data supplied in the `nodes_companies.csv` and `edges_director_duration.csv` files +The data supplied in the `nodes_companies.csv` and `edges_director_duration.csv` files provides us with the basic corporate interlock network that existed in Canada in 1912. If we use the web interface that comes with Neo4j we'll be able to see what parts of this network looks like by using a simple query. -With the Neo4j database running, we can open up the built in browser to make more Cypher queries. +With the Neo4j database running, we can open up the built in browser to make more Cypher queries. (Or we can put the following URL into a browser [http://localhost:7474/browser/](http://localhost:7474/browser/). Add the following Cypher query. @@ -364,10 +365,10 @@ You should see a network that looks something like this. ### A brief note on INDEX -Creating an index is important for any database to run efficiently. +Creating an index is important for any database to run efficiently. An index is a particular field in a database that is designated for the database to optimize so that lookups are as fast as possible. -To create an index in Neo4j, we would issue the following Cypher command. +To create an index in Neo4j, we would issue the following Cypher command. Creating an index only needs to be done once. 
``` CREATE INDEX ON :COMPANY(id) @@ -381,14 +382,14 @@ CREATE INDEX ON :COMPANY(name) Creating this index will greatly speed up any queries we make based on the unique keys `id` and `name`.
-Don't create more indexes than you need. +Don't create more indexes than you need. Creating too many indexes will have the effect of slowing down your database. Again, designing your database so that you have a unique key to do lookups is crucial.
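A rough intuition for why an index matters can be sketched in Python. This is an analogy only, not a description of Neo4j's internals: the indexed field behaves like a dictionary key, letting the database jump straight to a record instead of scanning every node (the company records here are invented for the example):

```python
# Analogy for an index: a dictionary keyed on the indexed field.
companies = [{'id': i, 'name': f'COMPANY {i}'} for i in range(1000)]

def find_by_scan(target_id):
    # Without an index: examine records one by one until the id matches.
    for company in companies:
        if company['id'] == target_id:
            return company

# With an "index": build the lookup table once, then query in one step.
index_on_id = {company['id']: company for company in companies}

assert find_by_scan(900) is index_on_id[900]
print(index_on_id[900]['name'])   # prints COMPANY 900
```

The scan touches up to a thousand records; the keyed lookup touches one. That gap is why queries on the indexed `id` and `name` fields run so much faster — and why each extra index, which must be kept up to date on every write, carries a cost.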
### Querying Neo4j: CREATE, MATCH, SET -So far we have used the basic syntax of the Cypher query language. +So far we have used the basic syntax of the Cypher query language. We've seen that relationships on a graph are written quite intuitively using Cypher. ``` (n1:NODE)-[:relationship]->(n2:NODE) @@ -412,7 +413,7 @@ set c.degree = size((c)-->()); This code simply matches to each node and counts the size (or degree) of each node. We use the SET command to set the degree value as an attribute of each node. -Now we can examine those nodes with the highest degree. +Now we can examine those nodes with the highest degree. Here we list companies where there are 75 or more connections (via high level employees or directors to other companies). ``` match (c0:COMPANY)-[r]-(c1) where c0.degree > 75 @@ -447,10 +448,9 @@ of Finance. Palgrave Macmillan. # Conclusion -In this lesson we've introduced the Neo4j graph database. +In this lesson we've introduced the Neo4j graph database. We've shown how we can talk directly to the database using the Cypher query language. We've also shown how easy it is to visualize different parts of graphs stored in Neo4j using Neo4j's built in visualization system. Finally, we've also included some data and example code that reinforces the key topics of this lesson. Wherever possible this lesson has also linked to primary documents and software to make getting started as easy as possible. 
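As a final check on the degree calculation performed by the Cypher query above, the same count can be reproduced outside the database. The short Python sketch below uses a made-up edge list standing in for the interlock ties; each tie adds one to the degree of both companies it connects:

```python
from collections import Counter

# Made-up stand-in for the interlock ties; each pair links two
# company ids, mirroring the INTERLOCK relationships in the database.
edges = [(1, 2), (1, 3), (2, 3), (1, 4)]

degree = Counter()
for a, b in edges:
    degree[a] += 1   # each tie raises the degree of both endpoints
    degree[b] += 1

print(degree[1])   # prints 3: company 1 sits on three ties
```

Cross-checking a database computation against a few lines of ordinary code like this is a useful habit whenever a query's result seems surprising.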
- diff --git a/en/lessons/downloading-multiple-records-using-query-strings.md b/en/lessons/downloading-multiple-records-using-query-strings.md index 49503e193..e49ddc854 100755 --- a/en/lessons/downloading-multiple-records-using-query-strings.md +++ b/en/lessons/downloading-multiple-records-using-query-strings.md @@ -18,6 +18,7 @@ abstract: "Downloading a single record from a website is easy, but downloading m previous: output-keywords-in-context-in-html-file python_warning: true redirect_from: /lessons/downloading-multiple-records-using-query-strings +avatar_alt: Figures working in a mine, pushing carts --- {% include toc.html %} @@ -130,7 +131,7 @@ Take a look at the URL produced with the last search results page. It should look like this: ``` xml -https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_divs_fulltext=mulatto*+negro*&kwparse=advanced&_divs_div0Type_div1Type=sessionsPaper_trialAccount&fromYear=1700&fromMonth=00&toYear=1750&toMonth=99&start=0&count=0 +https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_divs_fulltext=mulatto*+negro*&kwparse=advanced&_divs_div0Type_div1Type=sessionsPaper_trialAccount&fromYear=1700&fromMonth=00&toYear=1750&toMonth=99&start=0&count=0 ``` We had a look at URLs in [Viewing HTML Files][], but this looks a lot @@ -150,7 +151,7 @@ https://www.oldbaileyonline.org/search.jsp &toYear=1750 &toMonth=99 &start=0 -&count=0 +&count=0 ``` In this view, we see more clearly our 12 important pieces of information @@ -161,7 +162,7 @@ it does not do anything.) and a series of 10 *name/value pairs* put together with & characters. Together these 10 name/value pairs comprise the query string, which tells the search engine what variables to use in specific stages of the search. Notice that each name/value pair contains -both a variable name: toYear, and then assigns that variable a value: 1750. +both a variable name: toYear, and then assigns that variable a value: 1750. 
This works in exactly the same way as *Function Arguments* by passing certain information to specific variables. In this case, the most important variable is `_divs_fulltext=` which has been given the @@ -243,7 +244,7 @@ page. We have already got the first one by using the form on the website: ``` xml -https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_divs_fulltext=mulatto*+negro*&kwparse=advanced&_divs_div0Type_div1Type=sessionsPaper_trialAccount&fromYear=1700&fromMonth=00&toYear=1750&toMonth=99&start=0&count=0 +https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_divs_fulltext=mulatto*+negro*&kwparse=advanced&_divs_div0Type_div1Type=sessionsPaper_trialAccount&fromYear=1700&fromMonth=00&toYear=1750&toMonth=99&start=0&count=0 ``` We could type this URL out twice and alter the ‘*start*’ variable to get @@ -463,7 +464,7 @@ def getSearchResults(query, kwparse, fromYear, fromMonth, toYear, toMonth, entri url += '&toMonth=' + toMonth url += '&start=' + str(startValue) url += '&count=0' - + #download the page and save the result. response = urllib2.urlopen(url) webContent = response.read() @@ -567,7 +568,7 @@ def getSearchResults(query, kwparse, fromYear, fromMonth, toYear, toMonth, entri url += '&toMonth=' + toMonth url += '&start=' + str(startValue) url += '&count=0' - + #download the page and save the result. response = urllib2.urlopen(url) webContent = response.read() @@ -711,7 +712,7 @@ the trials. The first entry starts with “Anne Smith” so you can use the Notice Anne’s name is part of a link: ``` xml -browse.jsp?id=t17160113-18&div=t17160113-18&terms=mulatto*_negro*#highlight +browse.jsp?id=t17160113-18&div=t17160113-18&terms=mulatto*_negro*#highlight ``` Perfect, the link contains the trial ID! Scroll through the remaining @@ -1063,7 +1064,7 @@ the command output so we know which files failed to download. This should be added as the last line in the function. 
``` -print "failed to download: " + str(failedAttempts) +print "failed to download: " + str(failedAttempts) ``` Now when you run the program, should there be a problem downloading a diff --git a/en/lessons/editing-audio-with-audacity.md b/en/lessons/editing-audio-with-audacity.md index 514e0ad96..e005bb67b 100755 --- a/en/lessons/editing-audio-with-audacity.md +++ b/en/lessons/editing-audio-with-audacity.md @@ -15,6 +15,7 @@ topics: [data-manipulation] abstract: "In this lesson you will learn how to use Audacity to load, record, edit, mix, and export audio files." review-ticket: https://github.com/programminghistorian/ph-submissions/issues/15 redirect_from: /lessons/editing-audio-with-audacity +avatar_alt: Two gramophones facing each other --- {% include toc.html %} diff --git a/en/lessons/exploring-and-analyzing-network-data-with-python.md b/en/lessons/exploring-and-analyzing-network-data-with-python.md index c5c283253..8c43129fa 100755 --- a/en/lessons/exploring-and-analyzing-network-data-with-python.md +++ b/en/lessons/exploring-and-analyzing-network-data-with-python.md @@ -23,6 +23,7 @@ topics: [network-analysis] date: 2017-08-23 abstract: "This lesson introduces network metrics and how to draw conclusions from them when working with humanities data. You will learn how to use the NetworkX Python package to produce and work with these network statistics." redirect_from: /lessons/exploring-and-analyzing-network-data-with-python +avatar_alt: Train tracks intersecting --- {% include toc.html %} diff --git a/en/lessons/extracting-illustrated-pages.md b/en/lessons/extracting-illustrated-pages.md index 5598ba713..3ca13981b 100644 --- a/en/lessons/extracting-illustrated-pages.md +++ b/en/lessons/extracting-illustrated-pages.md @@ -16,6 +16,7 @@ difficulty: 2 activity: acquiring topics: [api] abstract: Machine learning and API extensions by HathiTrust and Internet Archive are making it easier to extract page regions of visual interest from digitized volumes. 
This lesson shows how to efficiently extract those regions and, in doing so, prompt new, visual research questions. +avatar_alt: Scientific measuring device --- {% include toc.html %} diff --git a/en/lessons/extracting-keywords.md b/en/lessons/extracting-keywords.md index e38afaeae..2cde79a3b 100755 --- a/en/lessons/extracting-keywords.md +++ b/en/lessons/extracting-keywords.md @@ -17,6 +17,7 @@ topics: [data-manipulation] abstract: "This lesson will teach you how to use Python to extract a set of keywords very quickly and systematically from a set of texts." python_warning: true redirect_from: /lessons/extracting-keywords +avatar_alt: Woman churning butter or milk --- {% include toc.html %} @@ -59,7 +60,7 @@ The first step of this process is to take a look at the data that we will be usi {% include figure.html filename="extracting-keywords-1.png" caption="Screenshot of the first forty entries in the dataset" %} -Download the dataset and spend a couple of minutes looking at the types of information available. You should notice three columns of information. The first, 'Name', contains the name of the graduate. The second: 'Details', contains the biographical information known about that person. The final column, 'Matriculation Year', contains the year in which the person matriculated (began their studies). This final column was extracted from the details column in the pre-processing stage of this tutorial. The first two columns are as you would find them on the British History Online version of the *Alumni Oxonienses*. If you notice more than three columns then your spreadsheet programme has incorrectly set the [delimiter](https://en.wikipedia.org/wiki/Delimiter) between columns. It should be set to "," (double quotes, comma). How you do this depends on your spreadsheet programme, but you should be able to find the solution online. +Download the dataset and spend a couple of minutes looking at the types of information available. 
You should notice three columns of information. The first, 'Name', contains the name of the graduate. The second: 'Details', contains the biographical information known about that person. The final column, 'Matriculation Year', contains the year in which the person matriculated (began their studies). This final column was extracted from the details column in the pre-processing stage of this tutorial. The first two columns are as you would find them on the British History Online version of the *Alumni Oxonienses*. If you notice more than three columns then your spreadsheet programme has incorrectly set the [delimiter](https://en.wikipedia.org/wiki/Delimiter) between columns. It should be set to "," (double quotes, comma). How you do this depends on your spreadsheet programme, but you should be able to find the solution online. Most (but not all) of these bibliographic entries contain enough information to tell us what county the graduate came from. Notice that a large number of entries contain placenames that correspond to either major cities ('of London', in the first entry) or English counties ('of Middlesex' in entry 5 or 'of Wilts' - short for Wiltshire in entry 6). If you are not British you may not be familiar with these county names. You can find a list of [historic counties of England](http://en.wikipedia.org/wiki/Historic_counties_of_England) on Wikipedia. @@ -161,7 +162,7 @@ The fourth line closes the open text file. The fifth line prints out the results Save this file as `extractKeywords.py`, again to the same folder as the other files, and then run it with Python. To do this from the command line, first you need to launch your command line terminal. -On Windows it is called `Command Prompt`. Windows users may find it easier to launch Python by opening the folder containing your `extractKeywords.py` file, then press `shift` + `right-click` and then select 'open command window here'. 
Assuming you have Python installed, you should be able to run your programme using the command beginning with 'python' below. +On Windows it is called `Command Prompt`. Windows users may find it easier to launch Python by opening the folder containing your `extractKeywords.py` file, then press `shift` + `right-click` and then select 'open command window here'. Assuming you have Python installed, you should be able to run your programme using the command beginning with 'python' below. On Mac OS X, this is found in the `Applications` folder and is called `Terminal`. Once the Terminal window is open, you need to point your Terminal at the directory that contains all of the files you have just created. I have called my directory 'ExtractingKeywordSets' and I have it on my computer's Desktop. To change the Terminal to this directory, I use the following command: @@ -268,7 +269,7 @@ This code will automatically check each word in a text, keeping track of matches If it looks like it worked, delete the 'print matches' line and move to the next step. -### Step 5: Output results +### Step 5: Output results If you have got to this stage, then your Python program is already finding the matching keywords from your gazetteer. All we need to do now is print them out to the command output pane in a format that's easy to work with. @@ -282,7 +283,7 @@ Add the following lines to your program, minding the indentation as always: matchString = '' for matches in storedMatches: matchString = matchString + matches + "\t" - + print matchString ``` @@ -293,7 +294,7 @@ If there IS a match, then the program creates a new variable called 'matchString When all of the matching keywords have been added to 'matchString', the program prints it out to the command output before moving on to the next text. -If you save your work and run the program, you should now have code that achieves all of the steps from the algorithm and outputs the results to your command output. 
+If you save your work and run the program, you should now have code that achieves all of the steps from the algorithm and outputs the results to your command output. The finished code should look like this: @@ -314,7 +315,7 @@ f.close() for entry in allTexts: matches = 0 storedMatches = [] - + #for each entry: allWords = entry.split(' ') for words in allWords: @@ -332,7 +333,7 @@ for entry in allTexts: else: storedMatches.append(words) matches += 1 - + #if there is a stored result, print it out if matches == 0: print ' ' @@ -340,7 +341,7 @@ for entry in allTexts: matchString = '' for matches in storedMatches: matchString = matchString + matches + "\t" - + print matchString ``` @@ -416,7 +417,7 @@ with open('The_Dataset_-_Alumni_Oxonienses-Jas1.csv') as csvfile: for row in reader: #the full row for each entry, which will be used to recreate the improved CSV file in a moment fullRow.append((row['Name'], row['Details'], row['Matriculation Year'])) - + #the column we want to parse for our keywords row = row['Details'].lower() allTexts.append(row) @@ -484,7 +485,7 @@ with open('The_Dataset_-_Alumni_Oxonienses-Jas1.csv') as csvfile: for row in reader: #the full row for each entry, which will be used to recreate the improved CSV file in a moment fullRow.append((row['Name'], row['Details'], row['Matriculation Year'])) - + #the column we want to parse for our keywords row = row['Details'].lower() allTexts.append(row) @@ -503,7 +504,7 @@ with open(filename, 'a') as csvfile: fieldnames = ['Name', 'Details', 'Matriculation Year', 'Placename'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() - + #NEW! 
define the output for each row and then print to the output csv file writer = csv.writer(csvfile) @@ -512,16 +513,16 @@ with open(filename, 'a') as csvfile: matches = 0 storedMatches = [] - + #for each entry: allWords = entry.split(' ') for words in allWords: - + #remove punctuation that will interfere with matching words = words.replace(',', '') words = words.replace('.', '') words = words.replace(';', '') - + #if a keyword match is found, store the result. if words in allKeywords: if words in storedMatches: @@ -529,7 +530,7 @@ with open(filename, 'a') as csvfile: else: storedMatches.append(words) matches += 1 - + #CHANGED! send any matches to a new row of the csv file. if matches == 0: newRow = fullRow[counter] diff --git a/en/lessons/fetch-and-parse-data-with-openrefine.md b/en/lessons/fetch-and-parse-data-with-openrefine.md index 3a5fc68cd..f1f7954bc 100755 --- a/en/lessons/fetch-and-parse-data-with-openrefine.md +++ b/en/lessons/fetch-and-parse-data-with-openrefine.md @@ -15,6 +15,7 @@ activity: acquiring topics: [data-manipulation, web-scraping, api] abstract: "OpenRefine is a powerful tool for exploring, cleaning, and transforming data. In this lesson you will learn how to use Refine to fetch URLs and parse web content." redirect_from: /lessons/fetch-and-parse-data-with-openrefine +avatar_alt: Machine for water filtration --- {% include toc.html %} @@ -640,5 +641,3 @@ OpenRefine is a flexible, pragmatic tool that simplifies routine tasks and, when [^use]: As of July 2017, see [API Documentation](http://text-processing.com/docs/index.html). [^1]: Jacob Perkins, "Sentiment Analysis with Python NLTK Text Classification", [http://text-processing.com/demo/sentiment/](http://text-processing.com/demo/sentiment/). [^2]: Vivek Narayanan, Ishan Arora, and Arjun Bhatia, "Fast and accurate sentiment classification using an enhanced Naive Bayes model", 2013, [arXiv:1305.6143](https://arxiv.org/abs/1305.6143). - -
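A note for readers of the `extracting-keywords.md` hunks above: the lesson's finished program targets Python 2 (hence `print matchString` without parentheses). The gazetteer-matching step it describes can be sketched in Python 3 as below. This is a minimal illustration, not the lesson's own file-reading workflow: the `gazetteer` list and `details` entries here are invented stand-ins for the lesson's gazetteer file and the 'Details' column of its CSV.

```python
# Python 3 sketch of the gazetteer-matching step from the
# extracting-keywords lesson. Sample data below is illustrative;
# the lesson reads texts from a CSV and keywords from a gazetteer file.

gazetteer = ['london', 'middlesex', 'wilts']

details = [
    'of London, gent. Christ Church, matric. 1581',
    'of Middlesex, pleb. Magdalen Hall, matric. 1584',
    'of Essex, arm. Broadgates Hall',
]

def extract_keywords(text, keywords):
    """Return the unique keywords found in one text, in order of appearance."""
    stored_matches = []
    for word in text.split(' '):
        # Strip trailing punctuation that would interfere with matching,
        # mirroring the lesson's replace() calls for ',', '.' and ';'.
        word = word.strip(',.;')
        if word in keywords and word not in stored_matches:
            stored_matches.append(word)
    return stored_matches

for entry in details:
    matches = extract_keywords(entry.lower(), gazetteer)
    # Tab-separate the matches, as the lesson's matchString does;
    # print a single space when there is no match.
    print('\t'.join(matches) if matches else ' ')
```

In the full lesson workflow, each row would instead be read with `csv.DictReader` and the matches appended as a new 'Placename' column via `csv.writer`, as the later hunks of the patch show.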