whitespace issues #1
Comments
yeah, people use inline tags in weird ways... thanks for spotting this!
yes, this seems to work, thanks for the pointer! I'll need to check the difference on some real texts
Thanks @codinguncut for the suggestion. Still needs testing. re.sub is replicating XPath's normalize-space behaviour. See GH-1
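For readers who want to see what replicating normalize-space with re.sub could look like, here is a minimal sketch. It is not the code from the commit; the helper name normalize_space is made up for illustration. It assumes the XPath 1.0 definition of whitespace (space, tab, CR, LF only), which is narrower than Python's \s.

```python
import re

# XPath 1.0 treats only space, tab, CR and LF as whitespace,
# so the character class is spelled out instead of using \s.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Hypothetical helper: collapse whitespace runs to one space, trim both ends."""
    # After the substitution, any leading/trailing whitespace is a single space.
    return _XPATH_WS.sub(" ", text).strip(" ")


print(normalize_space("  Phone \n\t 1-855-445-9710 "))
```

Note that a non-breaking space (U+00A0) is left alone here, as it would be by XPath's normalize-space().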
awesome, thanks for looking into this!
Checked old vs. new way on about 1000 html pages: on average the text is 0.2% longer (in characters), with most pages showing some difference. In all cases I checked (about 10 pages) the new way is better, separating words that had been joined without spaces, and I didn't find any unwanted splits. It is almost 2x slower though: 7 s for 1000 html pages before, 11.5 s without regexp, 12.5 s with regexp (and caching). But I guess it's not that bad.
it appears that .xpath('normalize-space()') does not deal with whitespace in an ideal way in all cases. Examples:
<span class="dropcap">A</span>Telephone
=>ATelephone
<span>Phone</span>1-855-445-9710
=>Phone1-855-445-9710
<option value="156">Vifon</option><option value="157">Vinamilk</option><option value="158">Vinaphone</option>
=>VifonVinamilkVinaphone
I understand that the behavior may be in line with how HTML treats inline tags and whitespace, but IMHO it does not work for real-world HTML documents.
I had hoped there would be a way to add ' '.join(fragments), but it doesn't look quite so easy... I believe Rolando already addressed this previously. Maybe it was done along the lines of
' '.join(x.strip() for x in cleaned.xpath('//text()').extract())
...
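To make the join idea concrete, here is a rough sketch using only the standard library. It is a stand-in, not this project's implementation: html.parser replaces the xpath('//text()') call, and the names _TextCollector and text_with_spaces are hypothetical. It does no cleaning, so script and style contents would be collected as well.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects every text node, roughly like //text()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []

    def handle_data(self, data):
        self.fragments.append(data)


def text_with_spaces(html):
    """Join text fragments with a space, then collapse whitespace runs."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    joined = " ".join(collector.fragments)
    return re.sub(r"[ \t\r\n]+", " ", joined).strip(" ")


print(text_with_spaces('<span class="dropcap">A</span>Telephone'))
print(text_with_spaces("<span>Phone</span>1-855-445-9710"))
```

The trade-off is that a space is inserted at every tag boundary, so an inline tag in the middle of a word (say <b>un</b>believable) would also get split.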