whitespace issues #1

Closed
codinguncut opened this issue May 26, 2017 · 4 comments · Fixed by #2

Comments

@codinguncut

codinguncut commented May 26, 2017

it appears that .xpath('normalize-space()') does not deal with whitespace in an ideal way in all cases.

Examples:

  • <span class="dropcap">A</span>Telephone => ATelephone
  • <span>Phone</span>1-855-445-9710 => Phone1-855-445-9710
  • <option value="156">Vifon</option><option value="157">Vinamilk</option><option value="158">Vinaphone</option> => VifonVinamilkVinaphone
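
The three results above can be reproduced without the original scraping stack. This is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree` as a stand-in for the selector used in the project. `normalize_space` is a hypothetical helper that only approximates XPath's `normalize-space()`, and the wrapping `<div>` and `<select>` elements are added so that each fragment parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

def normalize_space(markup):
    """Approximate XPath normalize-space() on the root element of `markup`."""
    root = ET.fromstring(markup)
    # String-value of an element: all descendant text nodes, concatenated as-is.
    string_value = ''.join(root.itertext())
    # Strip both ends and collapse internal whitespace runs to one space.
    # (str.split() treats more characters as whitespace than XPath does.)
    return ' '.join(string_value.split())

examples = [
    '<div><span class="dropcap">A</span>Telephone</div>',
    '<div><span>Phone</span>1-855-445-9710</div>',
    '<select><option value="156">Vifon</option>'
    '<option value="157">Vinamilk</option>'
    '<option value="158">Vinaphone</option></select>',
]
for markup in examples:
    print(normalize_space(markup))
# ATelephone
# Phone1-855-445-9710
# VifonVinamilkVinaphone
```

Because the text nodes are concatenated before any whitespace handling happens, there is nothing left to normalize at the tag boundaries.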

I understand that the behavior may be in line with html inline tags and whitespace, but it does not work IMHO for real-world html documents.

I had hoped there would be a way to add ' '.join(fragments), but it doesn't look quite so easy...

I believe Rolando already addressed this previously. Maybe it was done along the lines of ' '.join(x.strip() for x in cleaned.xpath('//text()').extract())...
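
A minimal sketch of that join-based idea, assuming the standard-library `xml.etree.ElementTree` and its `itertext()` in place of `cleaned.xpath('//text()').extract()`. `join_fragments` is a hypothetical name, not a function from the project.

```python
import xml.etree.ElementTree as ET

def join_fragments(markup):
    """Join every text node with a single space, stripping each one first."""
    root = ET.fromstring(markup)
    return ' '.join(text.strip() for text in root.itertext())

print(join_fragments('<div><span class="dropcap">A</span>Telephone</div>'))
# A Telephone
print(join_fragments('<div><span>Phone</span>1-855-445-9710</div>'))
# Phone 1-855-445-9710
```

One caveat with this form: a whitespace-only text node strips down to an empty string, so the join can leave doubled spaces (for example `'A  B'` for `<div><span>A</span> <span>B</span></div>`), and whitespace inside a fragment is left untouched.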

@lopuhin
Contributor

lopuhin commented May 26, 2017

yeah, people use inline tags in weird ways... thanks for spotting this!
Adding a whitespace in these cases looks like a better default: even if some words are split, it's easier to capture them with ngrams. And I don't think pages that wrap every letter in a span are that common.

@lopuhin
Contributor

lopuhin commented May 26, 2017

' '.join(x.strip() for x in cleaned.xpath('//text()').extract())

yes, this seems to work, thanks for the pointer! I'll need to check the difference on some real texts

lopuhin added a commit that referenced this issue May 26, 2017
Thanks @codinguncut for suggestion. Still needs testing.
re.sub is replicating xpath's normalize-space behaviour.
See GH-1
@codinguncut
Author

awesome, thanks for looking into this!
just implemented something similar ;)
it may make sense to precompile the regex for speed: REX = re.compile(r'\s+'), later REX.sub(' ', string)
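
A minimal sketch of the precompiled variant. `WHITESPACE_RE` and `collapse_whitespace` are hypothetical names, and the trailing `.strip()` is an assumption added here to match what `normalize-space()` does at both ends of the string.

```python
import re

# Compiled once at import time and reused for every document.
WHITESPACE_RE = re.compile(r'\s+')

def collapse_whitespace(text):
    """Replace each whitespace run with one space and trim both ends."""
    # The replacement is ' ' rather than '': an empty replacement would
    # glue neighbouring words together again.
    return WHITESPACE_RE.sub(' ', text).strip()

print(collapse_whitespace('  Phone \n\t 1-855-445-9710  '))
# Phone 1-855-445-9710
```

Note that the `re` module also caches recently compiled patterns internally, so the gain over calling `re.sub(r'\s+', ...)` directly may be small.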

@lopuhin
Contributor

lopuhin commented May 26, 2017

Checked old vs. new way on about 1000 html pages: on average the text is 0.2% longer in characters, with most pages showing some difference. In all cases I checked (about 10 pages) the new way is better, separating words that had been joined without spaces, and I didn't find any unwanted splits.

The new way is almost 2x slower though: 7 s for 1000 html pages before, 11.5 s without regexp (about 1.6x), 12.5 s with regexp and caching (about 1.8x). But I guess it's not that bad.
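
A comparison of this kind could be set up roughly as follows. This is only a sketch under loud assumptions: it uses the standard-library `xml.etree.ElementTree` and a synthetic document instead of the real selector and the roughly 1000 real pages, it leaves parsing out of the timing, and all function names are hypothetical. Its timings are not expected to match the numbers above.

```python
import re
import timeit
import xml.etree.ElementTree as ET

WHITESPACE_RE = re.compile(r'\s+')
# Synthetic stand-in for one page (an assumption, not data from the thread).
ROOT = ET.fromstring(
    '<div>' + '<p><span>Phone</span>1-855-445-9710 </p>' * 500 + '</div>'
)

def concatenated(root):
    # Old behaviour, roughly: concatenate the text nodes, then normalize.
    return ' '.join(''.join(root.itertext()).split())

def joined(root):
    # New behaviour without the regexp step.
    return ' '.join(t.strip() for t in root.itertext())

def joined_and_collapsed(root):
    # New behaviour with whitespace runs collapsed by a precompiled regexp.
    return WHITESPACE_RE.sub(' ', joined(root)).strip()

for fn in (concatenated, joined, joined_and_collapsed):
    seconds = timeit.timeit(lambda: fn(ROOT), number=50)
    print(f'{fn.__name__}: {seconds:.4f} s')
```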
