
Possible modification suggestion?: Sentence boundary disambiguation, and sentence segmentation (each sentence on a new line) – “period” “space” with “period” “new line” #4

Open
JeffKang opened this issue Nov 7, 2014 · 2 comments


JeffKang commented Nov 7, 2014


Apologies in advance, as this isn’t an issue, but an off-topic idea.

Sentence boundary disambiguation, and sentence segmentation (each sentence on a new line) – search and replace

I’ve always been a terrible reader and slow learner, so to aid me in reading longer and more difficult pieces of text, I sometimes segment the text by sentence boundaries (put each sentence on a new line).

(https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation)

This can allow me to quickly re-read portions of the text, as my eyes immediately find the start of sentences.

You also get a good view of the length of each sentence, so you can get a better sense of where the subject(s), verb(s), and object(s) of the sentence structure may lie.

This can be done in a word processor by replacing “period” “space” with “period” followed by a “manual line break”, “new line”, or “paragraph break”.

i.e. Search for: “. ” (period, space)
Replace with: “.\n”

or

“period” “^l” (Word’s manual line break).

or

“period” “^p” (Word’s paragraph mark).

To segment online text, Ditto (open-source clipboard manager) can be used to gather multiple clipboard copies, and then you can paste all of the collected text into a word processor.

Other formatting examples

I don’t know anything about programming, but I think it could be similar to how people use things like the pprint (“pretty-print”) Python module to help read longer, nested data structures (?).

e.g. of pprint (from the Python docs):

    >>> import pprint
    >>> stuff = ['spam', 'eggs', 'lumberjack', 'knights', 'ni']
    >>> stuff.insert(0, stuff[:])
    >>> pp = pprint.PrettyPrinter(indent=4)
    >>> pp.pprint(stuff)
    [   ['spam', 'eggs', 'lumberjack', 'knights', 'ni'],
        'spam',
        'eggs',
        'lumberjack',
        'knights',
        'ni']

Other examples:

XAlign Xcode plugin:

“XAlign automatically aligns assignments just so, to appease your most egregious OCD tendencies.”

http://i.imgur.com/o0Ysfw8.gif

ClangFormat-Xcode plugin:

“ClangFormat-Xcode is a convenient wrapper around the ClangFormat tool, which automatically formats whitespace according to a specified set of style guidelines.”

http://i.imgur.com/vYts5uv.gif

http://nshipster.com/xcode-plugins/

JavaScript search and replace

I’ve been wondering if a piece of JavaScript could be used to segment online text so that you wouldn’t have to keep transferring text to a word processor.

Perhaps the code here in “Literally” could be modified.

Again, I don’t know how to program, but maybe the following could be adjusted:

Replace this:

v = v.replace(/\bliterally\b/g, "figuratively");

with this?:

v = v.replace(/\.\s/g, ".\n");

(I’m not sure if the syntax and/or regular expression is correct)

Installing a plug-in

The link to the .crx file in “README.md” wasn’t working:

“The plugin may be found on the Chrome Extension Store.

Alternatively:

Download the .crx file.”

I managed to grab it from the top of the master branch:

https://github.com/lazerwalker/literally/blob/master/Literally.crx

I dragged Literally.crx into the “Extensions” area of Chrome, but I don’t think you can easily install user scripts in regular Chrome anymore, so Literally isn’t enabled.

However, I did manage to get the plug-in working on Firefox by downloading and installing literally.xpi.

Yeah, I’m just wondering, and throwing the thought out there.
I’d definitely purchase a sentence segmentation browser extension.

@lazerwalker
Owner

Thanks for writing in!

That's a really interesting idea. Do you know if there are any scientific studies on the efficiency of sentence boundary disambiguation compared to other reading aids, or on the most effective way to implement it?

This isn't something I'm particularly interested in pursuing myself, but that shouldn't stop you! The syntax you suggest looks like it should work perfectly — if you're interested in learning to program, figuring out how to make a tweak like that might be a great first step :).

(Thanks for the heads-up on the broken link as well.)

@JeffKang
Author

Do you know if there are any scientific studies on the efficiency of sentence boundary disambiguation compared to other reading aids, or on the most effective way to implement it?

Implementation

The people behind Grammarly (grammar checker software) have a short overview of sentence boundary disambiguation methods.

http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html

To perform a reliable evaluation, you need to have a reliable dataset in terms of size, quality (i.e. manually annotated), coverage of different genres of text and writing styles, and a statistically valid distribution of samples.

The advantage of statistics-based systems is that they may get better with better training sets.
Most of the software we've examined is trained on the Penn TreeBank.

I’m not sure how these more accurate systems work.
(I think that the Punkt Sentence Tokenizer of NLTK (platform for building Python programs to work with human language data) is very popular.
It’s based on this paper:
Kiss, Tibor; Strunk, Jan (2006). "Unsupervised Multilingual Sentence Boundary Detection". Computational Linguistics 32 (4): 485–525. doi:10.1162/coli.2006.32.4.485. ISSN 0891-2017.
http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485#.VGHMMmftFuM).

I’m guessing that if you were to try to implement one of these systems for quick online reading, it would be more difficult to retain the structure and formatting of the original text.
You would have to find a way to automate the scraping of online text, and comparisons to stored data, or do something more complicated.

As the sentence boundary disambiguation Wikipedia entry mentions, identifying a period, a capitalized token, and some special abbreviations (e.g. “Ph.D.” or “Mt.” for mountain) lets you catch about 95% of sentences.
I’m hoping that a simple JavaScript replacement will be enough to achieve what I need.
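As a rough sketch of that rule-based heuristic (this is not the actual algorithm from any of the cited systems, and the abbreviation list is a small made-up sample): split after a period only when the next word starts with a capital letter and the word before the period is not a known abbreviation.

```javascript
// Sketch of a rule-based segmenter: break after a period only if the
// next token starts with a capital letter and the word before the
// period is not in the abbreviation list. The list here is a tiny
// illustrative sample, not a real lexicon.
const ABBREVIATIONS = new Set(["Mt", "Dr", "Mr", "Mrs", "Ph.D", "e.g", "i.e"]);

function segment(text) {
  return text.replace(/(\w[\w.]*)\.\s+(?=[A-Z])/g, (match, word) =>
    ABBREVIATIONS.has(word) ? match : word + ".\n"
  );
}

console.log(segment("We climbed Mt. Everest. It was cold. Dr. Smith agreed."));
// We climbed Mt. Everest.
// It was cold.
// Dr. Smith agreed.
```

This is the simplest version of the heuristic; trained tokenizers like Punkt learn the abbreviation list from the text itself instead of hard-coding it.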

Papers

Line length and readability: speed vs. user experience?

In the past, I’ve found references for line length and readability.
Starting each sentence on a new line shortens many lines, so I feel that the information could be at least remotely related.
Researchers find that you can read longer lines faster, but many people prefer, and are more comfortable reading, shorter and narrower lines.

Dyson and Kipping (1997), Dyson and Haselgrove (2001), Bernard, Fernandez, and Hull (2002), Ling and van Schaik (2006).
samnabi.com.

A YouTube Google presentation on cognitive science also mentioned this.

Eye regressions and backtracking for semantic and syntactic errors?

I haven’t researched hard to find a paper, and scholarly titles are out of my league, but I just came across one possibly relevant paper about a topic that could be tested with sentence segmentation:

Braze D, Shankweiler D, Ni W, Palumbo LC (January 2002). "Readers' eye movements distinguish anomalies of form and content". J Psycholinguist Res 31 (1): 25–44. PMC 2850050. PMID 11924838. http://www.ncbi.nlm.nih.gov/pubmed/11924838

Pursuing this possibility, the present study compares the eye-movement patterns of subjects as they read (for meaning) sentences containing anomalies of verbal morpho-syntax, and anomalies that depend on the relationship between sentence meaning and real-world probabilities (we refer to these as pragmatic anomalies), and non-anomalous sentences.

I think that it’s syntax versus semantics.

grammatically defective
The cats won’t usually eating (eat) the food.
The shirt is surely wrinkle unless it is washed in warm water.

pragmatically odd
The cats won’t usually bake (eat) the food
The bus will surely wrinkle unless it is washed in warm water

(examples from Braze, D., Shankweiler, D. P., & Tabor, W. (2004). Individual Differences in Processing Anomalies of Form and Content. Poster presented at the 17th CUNY Conference on Human Sentence Processing. College Park, MD. http://www.haskins.yale.edu/staff/braze/braze-cuny2004-2up.pdf)

Syntactic anomaly generated many regressions initially, with rapid return to baseline.
Pragmatic anomaly resulted in lengthened reading times, followed by a gradual increase in regressions that reached a maximum at the end of the sentence.

Pragmatic errors increased reading times more than syntactic errors did.

They talk about the eye regression landing sites.

For syntactic anomalies the incidence of regressions was immediately elevated at the point of anomaly and just beyond, thereafter returning to the baseline.

In contrast, frequency of regressions for pragmatic anomalies increased progressively from the point of anomaly to the end of the sentence.

I think that means that if your eyes are going to regress leftward for a syntax error, it’s going to happen right away, at that error.

For semantic errors, as you get further away from an error, you’re more likely to backtrack.

So, not only do pragmatic anomalies provoke more regressions from the sentence-final region, but those regressions land, on average, much closer to the beginning of the sentence than do regressions for either controls or syntactic anomalies.
These differences in landing sites give additional evidence that the parser uses pragmatic and syntactic information differently to guide re-reading.

If you have to go back after experiencing a semantic error, you’re more likely to land closer to the beginning of the sentence than when you regress after a syntactic anomaly.

Thoughts: a simple test for search time only

So for pragmatic anomalies, it takes longer to read, and you regress further back than syntactic errors.
I’d like some more examples, but I personally found that to be the case.

For one basic and preliminary test, I think that you could put general comprehension aside.
Assume that you can reliably make material with pragmatic anomalies that frequently cause a user group to regress all the way back to the beginning of the sentence.
When the tracked eyes are found to regress to, and fixate on a beginning of a sentence, track the time that it takes for a user to locate the beginning.
Compare these go-back-to-the-beginning-of-the-sentence times between a group that had its sentences segmented and a control group that did not.
If you need to go back to the beginning of a sentence, does knowing that it is always somewhere on the left help reduce the time of searching for it?

Counterargument: pragmatic anomalies are not natural – pragmatic anomalies = difficult-to-read material?

I think that a counterargument is that these pragmatic errors are artificially created, and will rarely appear when reading normal text.

However, whether it’s pragmatic anomalies, or difficult-to-read material, I think that either could induce a similar confused state.
Pragmatic errors seem to be more confusing than syntactic ones.
Therefore, naturally elevating the difficulty, and thus, the confusion, might cause more regressions that make users backtrack to the beginning of a sentence.

Counterargument: regressions to the beginning of the currently read sentence aren’t that far

If you need to repeat a sentence after failing to understand it, the beginning isn’t really that far, and you might not be saving that much time.

Other thoughts: backtracking and regression across multiple sentences and paragraphs – thought tangents, and working memory

I would also be interested in backtracking to previous sentences and sections.
Something that you read later could provide context for some of the things that you read earlier.

There’s also the “zoning out” that can occur where you’ve read text, but you weren’t paying attention.
I think that someone mentioned online that this state of inattentiveness can happen more frequently when you come across content that is harder for your brain to process.

For example, people with high working-memory capacities show greater executive control, and therefore report less frequent "thought tangents" during an attention-demanding task (Kane et al., 2007).
Some networks in the frontal cortex control allocation of attentional resources to cognitively demanding tasks.
During problem solving or higher-order cognition, the salience network allocates attentional/working-memory resources to these other networks to permit "thinking hard."
This connectivity is weaker in children than adults, and is associated with poorer task performance in children. (Supekar and Menon, PLoS computational biology 2012 v8).

-Reddit comments

Not comprehending a few sentences in a row due to a thought tangent might require a farther regression, and a longer search for a previous point to start.

Lastly, while individual sentences might be fine, the overall content might be structured less adequately.
The user might want to jump around.

Future experiments: additional factors to manipulate: grammar, length, skill, new material

I think that if there are or will be experiments, additional factors that could create more challenging material, and thus a possibly more confused state could be:

  1. Grammatically correct, but lengthy sentences

There could be sentences with multiple qualifiers, prepositions, clauses, etc.

  2. Reading skill of the user

People can have very different reading-comprehension abilities.
There’s also crystallized intelligence (previously stored knowledge), and fluid intelligence (the ability to think in novel situations).

  3. New material

Fresh material with concepts that a user doesn’t already know can be harder to read.

Thanks for the start

I have somewhat of a repetitive strain injury (tendinosis; keep those wrists neutral, especially when you game), so I move at a snail’s pace. But over 99% of the code is already written for me, and I truly believe that this could be useful for some people if it works out, so I should really get around to figuring out how to tweak it myself.
(I have to attempt to learn programming in hopes of working with accessibility software one day anyway =). E.g. stuff like the Eye Tribe eye tracker, Microsoft Kinect, Touch+ sensor, Nimble Sense sensor, Dragon NaturallySpeaking speech recognition, etc.).

Thanks for having this extension and code up in the first place.
You may have adapted it from somewhere, but the extension idea was popular enough that many people mentioned it online, which helped me find it.
