whitespace issues #1

codinguncut · 2017-05-26T13:43:25Z

it appears that .xpath('normalize-space()') does not deal with whitespace in an ideal way in all cases.

Examples:

<span class="dropcap">A</span>Telephone => ATelephone
<span>Phone</span>1-855-445-9710 => Phone1-855-445-9710
<option value="156">Vifon</option><option value="157">Vinamilk</option><option value="158">Vinaphone</option> => VifonVinamilkVinaphone

I understand that the behavior may be in line with html inline tags and whitespace, but it does not work IMHO for real-world html documents.

I had hoped there would be a way to add ' '.join(fragments), but it doesn't look quite so easy...

I believe Rolando already addressed this previously. Maybe it was done along the lines of ' '.join(x.strip() for x in cleaned.xpath('//text()').extract())...
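
To make the difference concrete, here is a minimal sketch of both behaviours. It assumes the standard library's xml.etree.ElementTree as a stand-in for the real selector (the library itself uses .xpath(), which is not shown here), and the helper names joined_text and spaced_text are hypothetical. It also drops empty text nodes before joining, which the one-liner above does not do.

```python
import xml.etree.ElementTree as ET


def joined_text(fragment):
    """Concatenate all text nodes directly, then collapse whitespace.

    This mimics what normalize-space() gives for the element:
    adjacent inline tags end up glued together.
    """
    root = ET.fromstring(fragment)
    return ' '.join(''.join(root.itertext()).split())


def spaced_text(fragment):
    """Join text nodes with a single space (the ' '.join(...) idea)."""
    root = ET.fromstring(fragment)
    parts = (text.strip() for text in root.itertext())
    # Skip whitespace-only nodes so they do not produce double spaces.
    return ' '.join(part for part in parts if part)


# Well-formed wrappers around the fragments from the examples above.
examples = [
    '<div><span class="dropcap">A</span>Telephone</div>',
    '<div><span>Phone</span>1-855-445-9710</div>',
    '<select><option value="156">Vifon</option>'
    '<option value="157">Vinamilk</option></select>',
]

for fragment in examples:
    print(joined_text(fragment), '=>', spaced_text(fragment))
```

The first column reproduces the glued output reported above; the second shows what joining on a space would give instead.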

lopuhin · 2017-05-26T13:48:20Z

yeah, people use inline tags in weird ways... thanks for spotting this!
Adding whitespace in these cases looks like a better default: even if some words get split, it's easier to capture them with n-grams. And I don't think that pages that wrap every letter in a span are that common.

lopuhin · 2017-05-26T14:02:39Z

' '.join(x.strip() for x in cleaned.xpath('//text()').extract())

yes, this seems to work, thanks for the pointer! I'll need to check the difference on some real texts

codinguncut · 2017-05-26T14:46:41Z

awesome, thanks for looking into this!
just implemented something similar ;)
it may make sense to precompile the regex for speed: REX = re.compile(r'\s+'), later REX.sub(' ', string)
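
A minimal sketch of that precompiled pattern, assuming the goal is to mimic normalize-space() by collapsing each whitespace run to a single space and trimming the ends. The names _WHITESPACE and normalize_space are hypothetical, not taken from the library.

```python
import re

# Compiled once at import time instead of on every call.
_WHITESPACE = re.compile(r'\s+')


def normalize_space(text):
    """Collapse runs of whitespace to one space and strip both ends."""
    return _WHITESPACE.sub(' ', text).strip()


print(normalize_space('  Phone \n\t 1-855-445-9710  '))
```

Note that the replacement is a single space, not an empty string; replacing with '' would glue the words back together.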

lopuhin · 2017-05-26T15:22:56Z

Checked old vs. new way on about 1000 html pages: on average the text is 0.2% longer in characters, with most pages showing some difference. In all cases I checked (about 10 pages) the new way is better, separating words that had been joined without spaces, and I didn't find any unwanted splits.

It is almost 2x slower though: 7 s for 1000 html pages before, 11.5 s without regexp, 12.5 s with regexp (and caching). But I guess it's not that bad.
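
For anyone who wants to check the regexp part in isolation, here is a rough timing sketch using timeit. The sample string and function names are made up for illustration, and it only measures whitespace normalization on synthetic text, so it will not reproduce the numbers above (those include HTML parsing and text extraction).

```python
import re
import timeit

WS = re.compile(r'\s+')


def with_module_sub(text):
    """Call re.sub directly; re looks the pattern up in its internal cache."""
    return re.sub(r'\s+', ' ', text).strip()


def with_precompiled(text):
    """Use the pattern compiled once at module level."""
    return WS.sub(' ', text).strip()


# Synthetic input with mixed whitespace runs, not real page text.
sample = 'Vifon \n Vinamilk\t\tVinaphone  ' * 200

for func in (with_module_sub, with_precompiled):
    seconds = timeit.timeit(lambda: func(sample), number=200)
    print(f'{func.__name__}: {seconds:.4f} s')
```

Both functions return the same string, so any difference in the printed times comes from the pattern lookup alone.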

lopuhin added a commit that referenced this issue May 26, 2017

Add whitespace even for inline tags

6135ba6

Thanks @codinguncut for suggestion. Still needs testing. re.sub is replicating xpath's normalize-space behaviour. See GH-1

lopuhin mentioned this issue May 26, 2017

Fix unwanted joins for inline tags #2

Merged

lopuhin closed this as completed in #2 May 29, 2017

lopuhin mentioned this issue Sep 6, 2019

Don't always insert spaces around inline tags? #16

Open
