Extra \n characters #93

Open
dpzaba opened this issue Apr 10, 2013 · 4 comments

Comments

dpzaba commented Apr 10, 2013

Hi,

I'm extracting text (I'm not the author of the pdf) from
http://boe.es/borme/dias/2011/08/23/pdfs/BORME-B-2011-160-28.pdf

On the first page, in the first line (not counting the titles), the '\n' character appears twice, and I think it should appear only once. Here is the output:

artículo 378.7 del Reglamento del\n\n               Registro Mercantil)\n\n

I mean the '\n' characters in the middle of the string.

ruby 1.9.3p0
pdf-reader 1.3.3

Thanks and nice job!

Owner

yob commented Apr 10, 2013

Hola David, thanks for the report.

Are we looking at the same PDF? The one you linked to does not have "artículo 378.7" anywhere in the document.

Author

dpzaba commented Apr 10, 2013

Hi James,

I'm sorry, the PDF is http://boe.es/borme/dias/2011/08/23/pdfs/BORME-B-2011-160-28.pdf

Thanks again.

Owner

yob commented Apr 10, 2013

OK, I can reproduce it here. It's not ideal - the layout algorithms in lib/pdf/reader/page_layout.rb could definitely be improved.

One idea might be to detect "blocks" of text that appear to be close together vertically and render them as one.

I'm pressed for time at the moment so probably can't look into it much for now, but I'd happily accept any pull requests for review.
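
As a rough illustration of the block-detection idea above, here is a minimal sketch. It is not code from pdf-reader: the `Line` struct, the `vertical_blocks` and `render_blocks` helpers, the `max_gap` threshold, and the sample coordinates are all hypothetical, and it assumes each line of text already carries a baseline `y` position that grows toward the top of the page. Note that `slice_when` needs Ruby 2.2 or later.

```ruby
# Hypothetical stand-in for a positioned line of text.
# This is NOT pdf-reader's TextRun; it only carries what the sketch needs.
Line = Struct.new(:text, :y)

# Split lines into blocks: a new block starts wherever the vertical gap
# between two consecutive lines is larger than max_gap (an assumed threshold).
def vertical_blocks(lines, max_gap)
  lines.sort_by { |line| -line.y }                               # top of page first
       .slice_when { |upper, lower| (upper.y - lower.y) > max_gap }
       .map { |block| block.map(&:text) }
end

# Render each block with single newlines, and separate blocks with a blank line.
def render_blocks(blocks)
  blocks.map { |block| block.join("\n") }.join("\n\n")
end

# Made-up coordinates: the first two lines are 14 units apart,
# the third sits 46 units further down the page.
SAMPLE_LINES = [
  Line.new("artículo 378.7 del Reglamento del", 700.0),
  Line.new("               Registro Mercantil)", 686.0),
  Line.new("(some later paragraph)", 640.0)
]

puts render_blocks(vertical_blocks(SAMPLE_LINES, 20.0))
```

With a threshold of 20.0, the first two lines land in the same block and are joined by a single "\n", while the third line starts a new block. Choosing a sensible threshold (for example, relative to font size) is left open here.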

Author

dpzaba commented Apr 10, 2013

Hi James,

I think the problem is the line:

artículo 378.7 del Reglamento del\n\n               Registro Mer...

Should be:

#only one \n
artículo 378.7 del Reglamento del\n               Registro Mer...

** Be careful!! I'm a beginner in Ruby.
I was reading lib/pdf/reader/page_layout.rb, and line 35 is:

interesting_rows(page).map(&:rstrip).join("\n")

I think the problem is that the interesting_rows method receives an invalid (empty) element in its page parameter, which is then joined with "\n". Right?
Maybe something wrong with TextRun? (I need to understand this better)
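
To make the hypothesis above concrete, here is a small standalone sketch of how joining rows with "\n" produces a doubled newline when one of the rows is empty. It does not call pdf-reader: `ROWS` is a hand-written array standing in for whatever `interesting_rows(page)` returns, and `join_rows` and `join_without_blank_rows` are hypothetical helpers. Whether empty rows are really the cause in this PDF is not confirmed.

```ruby
# Hand-written rows, assumed for illustration only.
# The empty strings stand in for blank rows between lines of text.
ROWS = [
  "artículo 378.7 del Reglamento del   ",
  "",
  "               Registro Mercantil)",
  ""
]

# Same shape as the line quoted from page_layout.rb:
# strip trailing whitespace from each row, then join with newlines.
def join_rows(rows)
  rows.map(&:rstrip).join("\n")
end

# Variant that drops rows left empty after stripping, so consecutive
# lines of text are separated by exactly one newline.
def join_without_blank_rows(rows)
  rows.map(&:rstrip).reject(&:empty?).join("\n")
end

p join_rows(ROWS)
# => "artículo 378.7 del Reglamento del\n\n               Registro Mercantil)\n"

p join_without_blank_rows(ROWS)
# => "artículo 378.7 del Reglamento del\n               Registro Mercantil)"
```

Dropping every blank row would also remove blank lines that reflect real vertical spacing on the page, so this shows the mechanism rather than a proposed fix.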
