Extra \n characters #93

dpzaba · 2013-04-10T11:33:02Z

Hi,

I'm extracting text (I'm not the author of the pdf) from
http://boe.es/borme/dias/2011/08/23/pdfs/BORME-B-2011-160-28.pdf

On the first page, in the first line (not counting the titles), the character '\n' appears twice, and I think it should appear only once. Here is the output:

artículo 378.7 del Reglamento del\n\n               Registro Mercantil)\n\n

I mean the characters '\n' in the middle of the string.

ruby 1.9.3p0
pdf-reader 1.3.3

Thanks and nice job!

yob · 2013-04-10T11:40:03Z

Hola David, thanks for the report.

Are we looking at the same PDF? The one you linked to does not have "artículo 378.7" anywhere in the document.

dpzaba · 2013-04-10T11:42:28Z

Hi James,

I'm sorry, the PDF is http://boe.es/borme/dias/2011/08/23/pdfs/BORME-B-2011-160-28.pdf

Thanks again.

yob · 2013-04-10T11:53:13Z

OK, I can reproduce it here. It's not ideal - the layout algorithms in lib/pdf/reader/page_layout.rb could definitely be improved.

One idea might be to detect "blocks" of text that appear to be close together vertically and render them as one.
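
To make the "blocks" idea concrete, here is a rough sketch, not pdf-reader's actual code. The `Line` struct, `group_into_blocks`, `render` and the `max_gap_factor` threshold are all hypothetical names and values chosen for illustration. It assumes each line of text already has a y position and a font size, and treats lines as one block when the vertical gap between them is small relative to the font size.

```ruby
# Hypothetical stand-in for a positioned line of text; NOT pdf-reader's TextRun.
Line = Struct.new(:text, :y, :font_size)

# Group lines into blocks: a line joins the current block when the vertical
# gap to the previous line is at most max_gap_factor * font_size.
# max_gap_factor is an assumed tuning value, not something from the library.
def group_into_blocks(lines, max_gap_factor = 1.5)
  sorted = lines.sort_by { |l| -l.y } # PDF y grows upward, so top of page first
  sorted.each_with_object([]) do |line, blocks|
    prev = blocks.last && blocks.last.last
    if prev && (prev.y - line.y) <= max_gap_factor * prev.font_size
      blocks.last << line
    else
      blocks << [line]
    end
  end
end

# Lines inside a block are separated by one "\n"; blocks by a blank line.
def render(lines)
  group_into_blocks(lines)
    .map { |block| block.map(&:text).join("\n") }
    .join("\n\n")
end

lines = [
  Line.new("articulo 378.7 del Reglamento del", 700.0, 8.0),
  Line.new("Registro Mercantil)", 690.0, 8.0),
  Line.new("Next paragraph", 650.0, 8.0)
]
text = render(lines)
```

With the sample data, the first two lines are 10 units apart (under the 12-unit threshold) and end up in one block, while the third line starts a new block.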

I'm pressed for time at the moment so probably can't look into it much for now, but I'd happily accept any pull requests for review.

dpzaba · 2013-04-10T16:37:07Z

Hi James,

I think the problem is the line:

artículo 378.7 del Reglamento del\n\n               Registro Mer...

Should be:

#only one \n
artículo 378.7 del Reglamento del\n               Registro Mer...

** Be careful!! I'm a beginner in Ruby.
I was reading the code in lib/pdf/reader/page_layout.rb, line 35:

interesting_rows(page).map(&:rstrip).join("\n")

I think the problem is that the method interesting_rows receives an invalid element (an empty element) in the page parameter, which is then joined with "\n". Right?
Maybe something is wrong with TextRun? (I need to understand this better.)
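
A small sketch of that suspicion, using plain arrays rather than pdf-reader itself. The `rows` data below is made up for illustration; only the `map(&:rstrip).join("\n")` shape comes from the line quoted above. A whitespace-only row becomes an empty string after `rstrip`, so `join` puts two newlines in a row.

```ruby
# Hypothetical rows, standing in for what a layout step might produce.
rows = [
  "artículo 378.7 del Reglamento del   ",
  "                                    ",
  "               Registro Mercantil)  "
]

# Same expression shape as the quoted line: the blank middle row turns into "",
# which leaves two consecutive newlines in the joined string.
joined = rows.map(&:rstrip).join("\n")

# Illustrative workaround only (an assumption, not the library's behaviour):
# drop rows that are empty after stripping before joining.
compact = rows.map(&:rstrip).reject(&:empty?).join("\n")
```

Dropping every blank row would also remove intentional gaps between paragraphs, so this is only a way to confirm where the extra "\n" comes from, not necessarily the right fix.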
