Counting in Context by @hugovk #149

Open
hugovk opened this issue Dec 1, 2014 · 3 comments

Comments

@hugovk
Collaborator

hugovk commented Dec 1, 2014

Almost forgot about this. Code and output were created before the deadline; the PDF was knocked up and everything uploaded afterwards.

What happens if we want to find each sequential number, in words, in a big corpus?

This is what happens.

It uses the Project Gutenberg CD of 600 books, containing some 3,583,389 sentences.

It runs through twice: first with the first sentence found in the corpus (from zero to fifty-five thousand); second with the shortest matching sentence (zero to forty-eight thousand).
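A minimal sketch of the two passes, for illustration only. It is not taken from gutencounter.py: the names `number_to_words`, `find_sentence` and `count_in_context` are hypothetical, the spelling convention (hyphenated tens, no "and") is an assumption, and the corpus is assumed to be already split into a list of sentences.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out 0 <= n < 1,000,000 (assumed style: no 'and')."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def find_sentence(sentences, phrase, shortest=False):
    """Return the first (or shortest) sentence containing phrase, else None."""
    # Lookarounds stop "one" from matching inside "twenty-one" or "someone".
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(phrase) + r"(?![\w-])", re.IGNORECASE)
    matches = (s for s in sentences if pattern.search(s))
    if shortest:
        return min(matches, key=len, default=None)
    return next(matches, None)


def count_in_context(sentences, shortest=False):
    """Count upwards from zero until some number has no matching sentence."""
    found = []
    n = 0
    while True:
        sentence = find_sentence(sentences, number_to_words(n), shortest)
        if sentence is None:
            return found
        found.append((n, sentence))
        n += 1
```

Calling `count_in_context(sentences)` corresponds to the first-sentence pass and `count_in_context(sentences, shortest=True)` to the shortest-sentence pass. Scanning the whole list for every number is slow on millions of sentences, which is presumably where something like the `--cache` option earns its keep.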

Made something like this:

gutencounter --cache *.txt >> gutencounter-unsorted.md
gutencounter --sort --cache *.txt >> gutencounter-sorted.md
[leave running until have enough words]
cat gutencounter-unsorted.md > gutencounter.md
cat gutencounter-sorted.md >> gutencounter.md
grep "##" gutencounter.md > contents.txt
[hack contents.txt into links]
cat gutencounter.py >> gutencounter.md
wc -w gutencounter.md
[hack front matter and contents into gutencounter.md and <pre></pre> for source]
multimarkdown gutencounter.md > gutencounter.html

Then print to PDF using Chrome. Big thanks to @moonmilk for the CSS.

Source: https://github.com/hugovk/gutengrep/blob/gh-pages/gutencounter.py
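The "[hack contents.txt into links]" step was done by hand, but it could be scripted roughly like this. This is only a sketch: `slugify` and `contents_links` are hypothetical names, and the anchor rule (lowercase the heading and drop everything except letters and digits) is an assumption that should be checked against the ids your Markdown renderer actually generates. Unlike `grep "##"`, it only picks up lines that start with `##`.

```python
import re


def slugify(title):
    """Assumed anchor rule: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def contents_links(lines):
    """Turn Markdown heading lines into a list of contents links."""
    links = []
    for line in lines:
        if not line.startswith("##"):
            continue  # skip anything that is not a level-2+ heading
        title = line.strip().strip("#").strip()
        links.append("* [{}](#{})".format(title, slugify(title)))
    return links
```

For example, a made-up heading line `## Zero to fifty-five thousand` would become `* [Zero to fifty-five thousand](#zerotofiftyfivethousand)`.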

@MichaelPaulukonis

I would like to see the numerical sentences closer together, without chapter headings.

Perhaps the numbers could be in bold?

It's too broken up, for me. The layout keeps each sentence in its original isolation. Pushing them together would let us see them together, as your algorithm suggests.

@hugovk
Collaborator Author

hugovk commented Dec 1, 2014

Good points, both.

I'd intended to do the bold, but never got round to it. In fact, there's a not-done TODO for that :)

# s = s.replace(args.word, "**" + args.word + "**")  TODO

...

#     parser.add_argument('-b', '--bold', action='store_true',
#                         help="Embolden found text TODO")

I probably won't redo it with bold, as it'd mean re-running lots of slow code. Or messing around with regexes.
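For what it's worth, the regex route could look something like the sketch below. It is not the code from gutencounter.py; `embolden` is a hypothetical helper. Unlike a plain `s.replace(...)`, it keeps the original capitalisation of the match and does not bold "one" inside "twenty-one" or "someone".

```python
import re


def embolden(sentence, word):
    """Wrap standalone, case-insensitive matches of word in Markdown bold."""
    # The lookarounds reject matches glued to letters, digits or hyphens.
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(word) + r"(?![\w-])", re.IGNORECASE)
    return pattern.sub(lambda m: "**" + m.group(0) + "**", sentence)
```

Because it works on the sentences already found, it could be applied to the existing output rather than re-running the slow search.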

About the grouping, I just re-used the same CSS, but it could be tweaked easily and re-run quickly, so I might do that.

@hugovk
Collaborator Author

hugovk commented Dec 5, 2014

I've done the easy bit and smushed the chapters closer together, which has also decreased the page count from 1,609 to 358.
