Review Ticket: Working with batches of PDF files #258

Closed
amsichani opened this issue Sep 16, 2019 · 46 comments

@amsichani
Contributor

The Programming Historian has received the following proposal for a lesson on 'Working with batches of PDF files' by @maehr. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/working-with-batches-of-pdf-files

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

@amsichani will act as editor. Her role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on GitHub. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, to make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our Ombudsperson (@amandavisconti). Thank you for helping us to create a safe space.

@amsichani
Contributor Author

Thanks @maehr for this submission 👋 👋
I will be reading the lesson and providing some feedback for you to respond to, and then I will solicit formal reviewers. I anticipate being able to get an initial read back to you by the end of next week. I'll let you know if anything changes, and in the meantime let me know if you have any questions about the process.

@amsichani
Contributor Author

Hi @maehr, before posting my initial feedback on the tutorial, may I ask about the lesson's layout? Have you used a different template to generate the lesson's preview? There are a number of formatting issues related to this, as you can see, and it might be worth correcting them before moving to the next phase of peer review, as it is really hard to read (especially the last sections). Cheers!

@maehr
Contributor

maehr commented Sep 29, 2019

Hi @amsichani
I started to work on this lesson before the new guidelines were released, so there might be a mix-up. Can you please be more specific and point out which parts need correction?
Thanks and best regards
Moritz

@amsichani
Contributor Author

amsichani commented Sep 29, 2019

Everything from 'Text recognition in PDF files' to the end is really messed up - very difficult to even read.

@maehr
Contributor

maehr commented Sep 29, 2019

I removed alert divs with inline code / code blocks and YAML errors; hopefully it helps. I cannot build the Jekyll site locally because of parsing errors in other lessons. (The repo's GitHub Pages have not been rebuilt at the moment, so the change is only visible here: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/lessons/working-with-batches-of-pdf-files.md )

@mdlincoln
Contributor

@amsichani please refer to the updated guidelines: https://programminghistorian.org/en/editor-guidelines#3-add-yaml-metadata-to-the-lesson-file

Unfortunately, what Adam had briefly posted included a lot of square brackets [], which was a fatally incorrect thing to do. If you remove all of those (and the ORCID iD from the authors, which is not in the specification at all???), then this builds fine.

@mdlincoln
Contributor

mdlincoln commented Oct 1, 2019

The formatting problem after Text Recognition in PDF files is that you put Markdown inside the HTML block of the <div class="alert alert-info">. Once you start an HTML block inside a Markdown document, everything in there needs to be HTML, not Markdown.

@amsichani
Contributor Author

amsichani commented Oct 1, 2019

Many thanks @mdlincoln for this! @maehr could you amend this bit so that we have a clearly readable preview of the lesson?

@amsichani
Contributor Author

Also, we're going to use this editorial process to help familiarize a newer member of the editorial team with the process and workflow. So @fdlaramee will be shadowing along as I work with you.

@maehr
Contributor

maehr commented Oct 2, 2019

To my knowledge, I changed everything accordingly. Jekyll builds locally without warnings. Please tell me if any other problem pops up or if I forgot to fix something.

@maehr
Contributor

maehr commented Oct 2, 2019

@mdlincoln In the lesson template the endnote formatting is invalid. It is like this:

#### An End Note:

This is some text.[^1]
This is some more text.[^2]

##### Endnotes
[^1] Properly formatted citation using Chicago Manual of Style
[^2] Properly formatted citation using Chicago Manual of Style

It should be like this, with a colon after each footnote label, as Markdown footnote syntax requires:

#### An End Note:

This is some text.[^1]
This is some more text.[^2]

##### Endnotes
[^1]: Properly formatted citation using Chicago Manual of Style
[^2]: Properly formatted citation using Chicago Manual of Style

@amsichani
Contributor Author

amsichani commented Oct 2, 2019

Hi @maehr ,
The lesson looks great to me. It’s a useful addition to our lessons, and I'm glad that you've taken the time to put it together! What really fascinates me is that you are (re)using existing tools and platforms to execute certain procedures which is a sustainable practice.
I think there are only minor things to address before we send the lesson out for peer review. I have a couple of structural / typo remarks and a few technical points.

  • p. 1 "The (retro-)digitisation": I'm not sure the meaning of this term is clear here. Let's wait to see what the peer reviewers think.

  • p. 2 ". (Batch processing)": change to "(Batch processing)." so that the sentence finishes with a period after the parenthesis.

  • p. 2 Scope: after the first sentence, put ":" instead of "."

  • p. 3 Objectives: after the first sentence, put ":" instead of "."

  • p. 15 "you note that that": omit one "that"

  • p. 16: omit the "–"

  • We managed to get the lesson's rendered version to appear correctly at http://programminghistorian.github.io/ph-submissions/lessons/working-with-batches-of-pdf-files - thanks @mdlincoln!

  • Great catch on the footnotes in the template - we'll amend it! Thanks!

  • The two images are not rendered properly, although they are where and how they should be. I'm guessing this happens because there is no redirection to the lesson's image folder. Could you have a look?

Let me know if there is anything unclear. Given that my remarks are minor, we could try for a quick turnaround. Once you have made these revisions, I could then contact reviewers and move things forward.

maehr added a commit that referenced this issue Oct 2, 2019
solved most of the issues noted in #258 (comment)
@maehr
Contributor

maehr commented Oct 2, 2019

Hi @amsichani Thanks for your feedback. I fixed everything mentioned above, including retro-digitisation (which is a quite literal translation of the German Retrodigitalisierung) and the images.

I found another little issue within the YAML frontmatter of the lesson template. The original field should only be included in translations because it messes with the image path.

original: LEAVE BLANK
review-ticket: LEAVE BLANK
difficulty: LEAVE BLANK
activity: LEAVE BLANK
topics: LEAVE BLANK
abstract: LEAVE BLANK

@amsichani
Contributor Author

Fantastic @maehr! I will now try to contact reviewers for your lesson and I'll get back to you here once I have some news. Stay tuned!

@amsichani
Contributor Author

@cderose and @jackpay have agreed to serve as our lesson reviewers 🎉 . They've agreed to a submission date for their reviews of 15 November 2019 (if not earlier). Do let me know if there is anything I can help with.

@cderose

cderose commented Nov 12, 2019

@maehr and @amsichani, thank you for the opportunity to review this lesson in advance. It will be a great addition to The Programming Historian. It provides a really nice walkthrough of the various steps a researcher might take when working with text files. I especially appreciated that it was structured around a case study since that informed what the driving goal was for each of the steps. The code snippets are also concise and accessible and will be terrific to have on hand.

For the notes that follow, I included my top 3 thoughts/suggestions first, after which I listed light edits or error messages I received. I would be happy to clarify or discuss any of them.

For reference, I was using a computer with MacOS Mojave.

  1. For setting up the working directory, you could have users create a new directory within Downloads from the beginning, as you do in P34. They could then download all of the PDF files for the first section there. This would ensure they're not touching other PDFs they might already have in Downloads. Alternatively, you might encourage users to empty the Downloads folder prior to the lesson and have them wait to download DARIAH until after the initial PDF section is finished (otherwise, the grep command in P24 will search through it and return irrelevant results).

  2. For each code snippet (for example, P21), depending on the anticipated audience for this lesson, you might include a sentence that breaks down what the different pieces of the code are doing. If a more advanced shell user is assumed, you could include a note early on (maybe in P4) that encourages users to paste the code into something like https://explainshell.com/ to see how it's working if they have questions.

  3. P36 & P37 - Would it be possible to include a pre-processed dataset for downloading? While it's very realistic that this work takes several hours, that asks a lot of users working through a tutorial since it essentially ties up their computer. If possible, it might be more effective to have users run the code on a subset of the documents to confirm they can OCR and extract text successfully. After that, if they could download the already fully processed dataset, they would be able to move on to the topic modeling portion without a significant delay.

  • Table of contents
    Change "Evaluation of" to "Evaluate the Topic Models" to match the verb pattern in the rest of the sequence

  • P1 Motivation
    Might insert something like "often" into the first sentence to acknowledge that not all humanities work involves working with text-based sources: "Humanities scholars often work with..."

  • P1
    Might rephrase the sentence that begins: "As a result, humanities scholars are increasingly being forced to..." "Forced to" makes it sound like humanists are unwilling and uninterested. Given the audience for this lesson, it might be more compelling to frame the increase in data as an opportunity that requires/leverages/employs Distant Reading and other algorithmic tools to surface patterns

  • P3 Objectives
    Typo in the fifth bullet point: "Do all of the above..."

  • P3 Objectives
    Could make OCR software the hyperlinked text for Tesseract rather than languages

  • P5 Windows 10
    Might include the full specification for absolute clarity: "Fortunately, since the Windows 10 Fall Creators Update"

  • P7 MacOS
    When I ran the code, I received an error about accepting the Xcode license; if Xcode is indeed a dependency, it could be worth calling it out. Another error I received that seems specific to MacOS Mojave:
    Error: Xcode alone is not sufficient on Mojave. Install the Command Line Tools: xcode-select --install
    After installing the command line tools, I still got the following error:
    Error: The brew link step did not complete successfully
    The formula built, but is not symlinked into /usr/local
    Could not symlink bin/2to3
    Target /usr/local/bin/2to3 already exists. You may want to remove it: rm '/usr/local/bin/2to3'
    This error didn't prevent me from running any of the code in the lesson, but it might be worth mentioning in the main text or in a footnote of possible errors that can be ignored.

  • P10 Topic Modeling
    Holding control and double clicking didn't work on my Mac; you could add something along the lines of: If that doesn't work, go to System Preferences, click on Security & Privacy, and then click Open Anyway

  • P19
    My output looked slightly different - it was nothing significant (I had an additional info line that read: INFO - Start processing 8 pages concurrent), but you could add a parenthetical to the caption that says (output might look slightly different)

  • P21
    This is extremely helpful code to have on hand. Since in this particular case, the other documents are already OCRed, you might include a sentence that explains: All of our PDFs now have text that can be extracted. For future reference, to process all PDF files in your working directory at once, run... If you like, you can run this line now to see the error message that appears when you try to OCR text that has already been OCRed (press Control + C to stop the code if you don't want to wait for it to go through all of the pages).

  • P22
    Could add a sentence to encourage users to look in their working directory and confirm that a new file has indeed been added

  • P23
    This processed all of the files, but it also output an error message that might be worth calling out: Syntax Warning: Invalid Font Weight

  • P29
    This line also returned Syntax Warning: Invalid Font Weight. By "image extraction," are you referring to the images of the scanned pages themselves and not images (like illustrations or photographs) that might be present in some of the pages? Originally, I thought extracting images referred to the latter, but looking at the PNGs that this line of code returned, it looks like it took each PDF and turned each page into an image file. Since there are other Programming Historian lessons that talk about image extraction in the sense of extracting illustrations, you might add a note here that explains that in this context, extracting images from PDF files means turning each page into an image file.

  • P31
    Might add a note that this step could take a few minutes

  • P36
    Typo: "This will download all English..." You could include a parenthetical to say how many texts (340) are being downloaded as a way for users to make sure the code grabbed them all.

  • P37
    Might specify that 340 text files will be generated, along with "list_of_files.txt," a document that includes the names of all of the extracted texts.

  • P38
    For the last bullet point, you might add a sentence about how to interpret/evaluate/explain the results since each run could be different.

  • P39
    In anticipation of things users might do accidentally, you could include a parenthetical note after "all 340 text files" to remind users not to include "list_of_files.txt." Also, is it possible to hyperlink "example Corpus" so that it goes to the stoplist you mention?

  • P40
    Typo: "evaluate the Topic Model and its thirty topics."

  • P46
    It would be helpful to have a sentence here that explains what's in those documents - why should someone read them, what do they say about PDFs (do they describe how they're created or do they focus on working with or archiving them)? Depending on the answers to those questions, this might be better as a footnote rather than as a concluding remark.
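
Point 3 above (trying the workflow on a subset before committing to a run of several hours) could be sketched roughly as follows. This is only an illustration under stated assumptions: the temporary directory, the placeholder file names, the `sample` folder and the sample size of two are all hypothetical and are not taken from the lesson.

```shell
# Sketch only: a temporary directory with empty placeholder files stands in
# for the real corpus; file names and sample size are hypothetical.
cd "$(mktemp -d)"
touch doc-01.pdf doc-02.pdf doc-03.pdf doc-04.pdf doc-05.pdf

# Copy the first two PDFs (in alphabetical order) into a separate folder.
mkdir -p sample
count=0
for f in *.pdf; do
    if [ "$count" -ge 2 ]; then
        break                     # stop once the sample is large enough
    fi
    cp "$f" sample/
    count=$((count + 1))
done

ls sample                         # the subset to try the OCR and extraction steps on
```

Working on a copy in its own folder keeps the trial run from touching the full set of downloaded files.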

@amsichani
Contributor Author

Fantastic @cderose ! Many thanks for this. Waiting now for the review from @jackpay . Once received and in line with our editorial guidelines, I will aim to summarise the reviews as soon as possible and then @maehr could proceed with necessary revisions.

@jackpay

jackpay commented Nov 18, 2019

As per the guidelines, I include a summary of my main observations, followed by point-specific observations and edits. Thank you very much for this opportunity.

Summary:

Navigating file structures:
It might be worth spending a little time in the intro paragraphs on navigating file structures on the command line, so that people get comfortable and don't get lost later on.
e.g. cd ~/ takes you to your home directory, ./ is your current directory, and cd ~/Downloads takes you to Downloads.
This can then be used as a lead-in to creating a working directory other than Downloads, i.e. run the command mkdir ~/PDF2text, which will create a working directory in your home folder, and then navigate to it with cd ~/PDF2text.
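
The navigation commands mentioned here can be tried out safely with a small sketch like the one below. It assumes a POSIX shell; to avoid touching real files it points HOME at a throwaway directory first, which is an addition for illustration only and not part of the reviewer's suggestion.

```shell
# Sketch only: a throwaway directory stands in for the home folder so the
# example cannot touch real files. Remove these two lines to use your real home.
export HOME="$(mktemp -d)"
mkdir -p "$HOME/Downloads"

cd ~/                 # ~ is shorthand for the home directory
pwd                   # print the current directory
cd ./Downloads        # relative path: only works because we are in ~/ right now
cd /tmp               # move somewhere unrelated
cd ~/Downloads        # path starting at ~: works from anywhere

mkdir -p ~/PDF2text   # create a dedicated working directory (-p: no error if it exists)
cd ~/PDF2text
pwd
```

The contrast between `cd ./Downloads` and `cd ~/Downloads` is the reason a path anchored at the home directory is the more robust choice in instructions.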

Topic Modelling:
It might be worth mentioning a little on topic modelling early on. If nothing else, this would establish that we are moving away from the typical way in which we as humans understand topics and towards one in which documents are generated (i.e. a generative model for documents) from probability distributions over words, which are defined as topics.
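
The generative view described here can be written compactly. The following is the standard LDA mixture identity, not something taken from the lesson; the symbols ($K$ topics, $\theta_{d,k}$ for the share of topic $k$ in document $d$, $\phi_{k,w}$ for the probability of word $w$ under topic $k$) are notation assumed for illustration.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
          \;=\; \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k},
\qquad \sum_{w} \phi_{k,w} = 1, \quad \sum_{k=1}^{K} \theta_{d,k} = 1 .
```

Each topic is thus a probability distribution over the whole vocabulary, and each document is a mixture of topics.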

  1. Possible edit: You don’t have access to commercial software, such as Adobe Acrobat Professional or Abbyy FineReader.

  2. Perhaps establish a working directory that is not Downloads to establish a sensible process when they do this for themselves. For example, create one in their home directory.

  3. Provide a link back to the 5 in case they skipped past or missed the links and advise regarding Linux on Windows.

  4. Typo:
    Change:
    you will include one more file to our corpus
    To:
    you will include one more files to our corpus

  5. Could use a little clarification and/or a link for this sentence.
    'To separate the two operations - processing PDF files and Topic Modelling - and avoid confusion, do this later in the lesson.'

  6. In the code snippet, change
    cd ./Downloads to cd ~/Downloads - this ensures that wherever they are on the file system they will navigate to home -> Downloads.

  7. Similar to point made in summary regarding navigating file structures and getting them used to moving around on the command line.

  8. Maybe clarify that the wildcard operator is * and that *.png therefore means all files, whatever their name, that have the suffix .png.
    May also want to specify a directory in the command ~/Downloads/*.png.

  9. 'Only the frequency of words in a document or corpus is measured.'
    This is not true for LDA. The frequency of words matters greatly but more importantly it is capturing topics through co-occurrence of words. i.e. words appearing together across documents increase the likelihood of that topic existing in a document and ultimately the corpus.

'Each word has a probability to belong to one or more topics. The algorithm finds the corresponding probabilities of the individual words.' Technically all words appear in every topic with some probability but are higher in others and therefore define a topic.

  1. Typo:
    ...explore and evaluate the Topic Model ant its thirty topics...
    Should be:
    ...explore and evaluate the Topic Model and its thirty topics...
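
The wildcard behaviour described in point 8 can be checked with a small sketch like this one. It assumes a POSIX shell, and the temporary directory and the placeholder file names are hypothetical.

```shell
# Sketch only: work in a temporary directory with empty placeholder files.
cd "$(mktemp -d)"
touch page-1.png page-2.png notes.txt report.pdf

# *.png expands to every name in the current directory that ends in .png
ls *.png

# The same pattern with an explicit directory in front of it
ls "$PWD"/*.png

# Count the matches without parsing the output of ls
set -- *.png
echo "$# PNG files"   # prints: 2 PNG files
```

The shell expands the pattern before the command runs, so `ls` only ever sees the two matching names and never the .txt or .pdf files.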

@amsichani
Contributor Author

Thanks @jackpay for this!
@maehr given that the two reviews are written in detail, I think there is no point in me summarising and rewriting all the points / remarks. It'd be good if you try to address all of them, as they are to the point. Do you have an estimated date when you will be able to provide an updated version of the lesson?

@maehr
Contributor

maehr commented Nov 20, 2019

Thank you all very much. I agree, a summary is not necessary. I will have uploaded a revised version by Monday 2.12.2019 at the latest.

@amsichani
Contributor Author

Fantastic @maehr ! Looking forward to the edited version of the lesson . If you have any questions , please don't hesitate to ask me here, or if there is anything you need to clarify with the reviewers.

@maehr
Contributor

maehr commented Dec 6, 2019

Hi @amsichani

In P12 I tried to help the user navigate the filesystem. Do you expect more specific explanations?

P37 and P39
The intermediate file "list_of_files.txt" is not generated anymore and I hyperlinked the stopwords. I mention the 340 files in P38.

P39
I included remarks regarding topic modeling and "stable" results in P40 and P41.

ILO got back to me with more specific questions. Hopefully we can put the dataset on Zenodo. I should be able to get a definitive answer before Christmas.

My ORCID is 0000-0002-1367-1618.
My bio: "Moritz Mähr investigates the history of computers and migration at ETH Zurich in Switzerland."

@amsichani
Contributor Author

amsichani commented Dec 6, 2019

Fantastic @maehr! Zenodo should work fine, and we are currently exploring this option for hosting large assets, so this might be an interesting case study for us.
I will now continue with transferring the lesson to our managing editor; please do let us know how you progress with ILO.

@amsichani
Contributor Author

amsichani commented Dec 6, 2019

Hi @svmelton ,

Here are the lesson files you'll need:

lesson file - /lessons/working-with-batches-of-pdf-files.md
images folder - /images/working-with-batches-of-pdf-files
gallery image - /gallery/working-with-batches-of-pdf-files.png
gallery original - /gallery/originals/working-with-batches-of-pdf-files.png

Please note that there isn't an asset folder for this lesson; instead we are still waiting for a dataset to be deposited on Zenodo. @maehr will let us know when it is up, and I guess he will also need to update the lesson accordingly.

also note:

- name: Moritz Mähr
  team: false
  orcid: 0000-0002-1367-1618
  bio:
      en: |
        Moritz Mähr investigates the history of computers and migration at ETH Zurich in Switzerland.

Let me know if I'm missing anything.

@amsichani
Contributor Author

Hi @svmelton, do we have a timeline for the publication of this lesson on your side (I know you have a lot on your plate right now in terms of pubs)? Also, @maehr, do we have an update on the ILO front? Thank you all for your hard work on this, and let's try to publish before xmas(?!)

@svmelton
Contributor

Hi @amsichani—I'm just waiting on the dataset, and then we can move forward with publication. Thanks!

@amsichani
Contributor Author

Fantastic @svmelton - we are now waiting for an update from @maehr on the ILO dataset so we can move forward.

@maehr
Contributor

maehr commented Dec 18, 2019

> Fantastic @svmelton - we are now waiting for an update from @maehr on the ILO dataset so we can move forward.

I sent out a reminder one week ago (and again today). I hope I get an answer before xmas.

@maehr maehr closed this as completed in ee9cf8f Dec 18, 2019
@maehr maehr reopened this Dec 18, 2019
@maehr
Contributor

maehr commented Dec 18, 2019

@svmelton The ILO got back to me and I was able to publish the dataset on Zenodo https://doi.org/10.5281/zenodo.3582736. I added the link to the dataset to the lesson. IMO we can move forward.

PS: Sorry, my commit message closed the issue automatically.

@amsichani
Contributor Author

Many thanks for this @maehr ! Great work! @svmelton do let me know if you need anything else from me at this point!

@svmelton
Contributor

Excellent! I'll work on it this weekend and ping y'all if I need anything.

@svmelton
Contributor

Hi all—I've just run through the lesson, and it looks good! @amsichani—we're just missing a bit of metadata (reviewers, editors, review ticket, difficulty, activity, topics, abstract, avatar_alt).

@amsichani
Contributor Author

Happy New Year everyone! Many thanks for the heads up @svmelton! I'm now working on these -- @maehr could you provide me with a short abstract / description of the lesson (have a look here https://programminghistorian.org/en/lessons/ )?

@maehr
Contributor

maehr commented Jan 3, 2020

> Happy New Year everyone! Many thanks for the heads up @svmelton ! I m now working on these -- @maehr could you provide me a small lesson's abstract / description (have a look here https://programminghistorian.org/en/lessons/ )?

@amsichani I have tried to capture the essence. Feel free to enhance or correct my version.

Learn how to perform OCR and text extraction with free command line tools like Tesseract and Poppler and how to get an overview of large numbers of PDF documents using topic modeling.

@amsichani
Contributor Author

Many thanks @maehr . @svmelton I have now updated the lesson with the necessary metadata (I am not sure I get the avatar_alt) and I think we are ready to go - many thanks for your cooperation!

@spapastamkou
Contributor

avatar_alt = the title of the lesson's avatar image, once the image has been selected by the editor (as per this PR)

@amsichani
Contributor Author

Hi @svmelton , all metadata is now in place.

@svmelton
Contributor

svmelton commented Jan 9, 2020

Thanks so much, @amsichani! Exciting news: we'll be able to pilot our new external copyediting with this piece! We're getting it set up, but I'll let you know ASAP when we have a timeline. Thanks for everyone's patience; I'm excited to have this piece as our first professionally copyedited publication!

@acrymble

Thanks for your patience everyone. The copyeditor has now had a chance to look through the text and make suggestions based on the styleguide. I've attached the PDF with comments to this ticket. Her instructions include:

I've underlined in red instances where changes are needed or suggested. I've used yellow notes to give the detail on each instance. One note is purple and that's because it includes a suggestion about adding a preferred written-out date format to the style guide (just to flag to your attention).

This is our first copyedited lesson, so I think the best thing is for @amsichani and @maehr to incorporate the suggestions and discuss between themselves anywhere they disagree or need further conversation. Once you're both happy you can proceed with the rest of the publication process.
Working with batches of PDF files _ Programming Historian - copy edit 3.pdf

maehr added a commit that referenced this issue Jan 23, 2020
Hi @amsichani and @acrymble  I really love the change requests made by the copyeditor. As a non native speaker this is a blessing!  I corrected everything according to the notes. The last section (Mueller Report) needs some more attention.
Thanks a lot
@maehr
Contributor

maehr commented Jan 23, 2020

Hi @amsichani and @acrymble I really love the change requests made by the copyeditor. As a non native speaker this is a blessing! I corrected everything according to the notes. The last section (Mueller Report) needs some more attention.
Thanks a lot
PS: I forgot that I can push my changes directly and opened (and closed) a pull request. Sorry for the inconvenience.

@amsichani
Contributor Author

Many thanks for your patience and cooperation @maehr and @svmelton & @acrymble for navigating us through the copyediting process -- this is exciting!
I am happy with the changes that @maehr has incorporated and if there is no other comment, I think @svmelton we are ready to go live!

@svmelton
Contributor

Fantastic! I will work on this over the next couple of days and ping you if I have any questions. :)

@svmelton
Contributor

svmelton commented Feb 3, 2020

And we're published! Thanks to everyone for your work, I'm excited to see this live!

@svmelton svmelton closed this as completed Feb 9, 2020