Counting in Context by @hugovk #149

hugovk · 2014-12-01T09:50:14Z

Almost forgot about this. The code and output were created before the deadline; the PDF was knocked up and everything uploaded afterwards.

What happens if we want to find each sequential number, in words, in a big corpus?

This is what happens.

PDF | HTML | MD

It uses the Project Gutenberg CD of 600 books, containing some 3,583,389 sentences.

It runs through twice: first with the first sentence found in the corpus (from zero to fifty-five thousand); second with the shortest matching sentence (zero to forty-eight thousand).
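For anyone wanting to try the idea, here is a minimal sketch of the two passes. It is not the actual gutencounter.py code: `number_to_words`, `find_sentence` and `count_in_context` are hypothetical names, the spelling style (no "and", hyphenated tens) is an assumption, and so is the choice to skip numbers that have no matching sentence.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()


def number_to_words(n):
    """Spell out 0 <= n < 1,000,000 in words (assumed style, no 'and')."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def find_sentence(number_words, sentences, shortest=False):
    """Return the first (or shortest) sentence containing the number words."""
    # Lookarounds stop "one" matching inside "someone" or "twenty-one".
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(number_words) + r"(?![\w-])", re.IGNORECASE)
    matches = (s for s in sentences if pattern.search(s))
    if shortest:
        return min(matches, key=len, default=None)
    return next(matches, None)


def count_in_context(sentences, limit, shortest=False):
    """Yield (n, sentence) for n = 0 .. limit - 1, skipping numbers not found."""
    for n in range(limit):
        sentence = find_sentence(number_to_words(n), sentences, shortest)
        if sentence is not None:
            yield n, sentence
```

Calling `count_in_context(sentences, 100)` would give the "first sentence found" pass, and adding `shortest=True` the "shortest matching sentence" pass. Scanning every sentence for every number is slow on a corpus this size, which is presumably why caching matters in the real thing.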

Made something like this:

gutencounter --cache *.txt >> gutencounter-unsorted.md
gutencounter --sort --cache *.txt >> gutencounter-sorted.md
[leave running until have enough words]
cat gutencounter-unsorted.md > gutencounter.md
cat gutencounter-sorted.md >> gutencounter.md
grep "##" gutencounter.md > contents.txt
[hack contents.txt into links]
cat gutencounter.py >> gutencounter.md
wc -w gutencounter.md
[hack front matter and contents into gutencounter.md and <pre></pre> for source]
multimarkdown gutencounter.md > gutencounter.html
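
The "hack contents.txt into links" step was done by hand, but it could be scripted. A rough sketch, assuming the chapter headings are lines starting with `##`; `slugify` and `contents_links` are hypothetical names, and the anchor format (lowercase, letters and digits only) is an assumption that should be checked against what the Markdown processor actually emits:

```python
import re


def slugify(title):
    """Assumed anchor rule: lowercase and drop everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def contents_links(markdown_text):
    """Turn each heading of level 2 to 6 into a Markdown list item linking to it."""
    links = []
    for line in markdown_text.splitlines():
        # Unlike grep "##", this only matches lines that start with the hashes.
        match = re.match(r"^(#{2,6})\s+(.*?)\s*#*\s*$", line)
        if match:
            title = match.group(2)
            links.append("* [{}](#{})".format(title, slugify(title)))
    return links
```

Writing `"\n".join(contents_links(text))` near the top of gutencounter.md would then stand in for the manual editing.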

Then print to PDF using Chrome. Big thanks to @moonmilk for the CSS.

Source: https://github.com/hugovk/gutengrep/blob/gh-pages/gutencounter.py


MichaelPaulukonis · 2014-12-01T14:59:59Z

I would like to see the numerical sentences closer together, without chapter headings.

Perhaps the numbers could be in bold?

It's too broken up. For me. The layout keeps each sentence in its original isolation. Pushing them together would allow us to see them together. As your algorithm suggests.

hugovk · 2014-12-01T19:46:54Z

Good points, both.

I'd intended to do the bold, but never got round to it. In fact, there's a not-done TODO for that :)

# s = s.replace(args.word, "**" + args.word + "**")  TODO

...

#     parser.add_argument('-b', '--bold', action='store_true',
#                         help="Embolden found text TODO")

I probably won't redo it with bold, as it'd mean re-running lots of slow code. Or messing around with regexes.
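
For reference, the regex route might look something like the sketch below. It is not what gutencounter.py does; `embolden` is a hypothetical helper. Compared with the plain `s.replace(...)` in the TODO, it matches case-insensitively, keeps the original capitalisation, and avoids bolding a word inside a longer one.

```python
import re


def embolden(sentence, word):
    """Wrap whole-word, case-insensitive matches of word in Markdown bold."""
    # Lookarounds skip matches inside longer words or hyphenated numbers,
    # e.g. "one" in "someone" or "twenty" in "twenty-one".
    pattern = re.compile(
        r"(?<![\w-])(" + re.escape(word) + r")(?![\w-])", re.IGNORECASE)
    # \1 re-inserts the text as it was found, so "One" stays capitalised.
    return pattern.sub(r"**\1**", sentence)
```

For example, `embolden("One day, someone counted one by one.", "one")` gives `**One** day, someone counted **one** by **one**.`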

About the grouping, I just re-used the same CSS, but it could be tweaked easily and re-run quickly, so I might do that.

hugovk · 2014-12-05T21:44:39Z

I've done the easy bit and smushed the chapters closer together, which has also decreased the page count from 1,609 to 358.

hugovk added the completed label Dec 1, 2014

hugovk mentioned this issue Dec 1, 2014

In! #25

Open
