Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bookmarks/Links drop when combining PDFs #44

Closed
andrewbaker00 opened this issue Jan 26, 2016 · 38 comments
Closed

Bookmarks/Links drop when combining PDFs #44

andrewbaker00 opened this issue Jan 26, 2016 · 38 comments

Comments

@andrewbaker00
Copy link

Combining two PDF's with appropriate bookmarks and links to named destinations creates a pdf where

  1. bookmarks are gone, and
  2. clicking on one of the previous in-document links throws a "The document's page tree contains an invalid node." (in Acrobat).

I see issue #31, but that shouldnt apply here because the pdf has different named destinations (resolved with GUIDs)

@andrewbaker00
Copy link
Author

And just for context, what I am trying to do is I have one pdf file with a cover page and TOC and a group of other "body" PDF's that are the documents referenced in the TOC.

Im trying to merge the cover/toc pdf with the "body" pdf's, and have links in the TOC point to the appropriate page in the final PDF as well as having proper bookmarks in the final PDF.

I can think of two approaches:

  1. I add blank pages to the cover/toc pdf for every "body" pdf page that i will be appending, with a named destination at the very top of the page. Then stamping the "body" pdf pages into their appropriate pages.
  2. Just appending the "body" pdf's and inserting the named destinations on the fly.

The issue stated above refered to approach # 1, I can join the pdfs fine, but the links/bookmarks from the toc/cover pdf are lost in the final pdf.

@boazsegev
Copy link
Owner

Hi Andrew,

Thank you for bringing these issues to my attention.

Allow me to point out that CombinePDF creates a new PDF file, so I doubt if Bookmarks are transferrable, although that could be an interesting feature to have. I will try looking into bookmarks when done with the TOC.

As for keeping named destinations intact, it's definitely on the roadmap.

The issue, at the moment is that ad-hock links within the PDF file (the links using the PDF's "name tree") are preserved, but TOC links (links based on the entries in the table of contents) break since the original table of contents is discarded.

I definitely want to merge the TOC of different PDF files as well, perhaps adding a new "root" for each file (since the TOC is a semi-complex binary tree were each parent must provide data about it's children), but this requires me to dig deeper into the standard again and I'm in the middle of a few projects that take most of my time...

I'm keeping this open. If you have any further ideas about implementing these features, let me know.

I'll keep you posted when I get to this.

Bo.

@aried3r
Copy link

aried3r commented May 27, 2016

I definitely want to merge the TOC of different PDF files as well

With ToC, do you mean the outline 12.3.3 or named destinations 8.2.1?

We're currently facing the problem that our outline went missing.

@boazsegev
Copy link
Owner

When talking about the TOC, I was referring to the outline.

Named destinations are used for links inside a page. The outline is used as an "external" navigation tool.

I'm assuming the outline will need to be extended when merging outlines, so that a root tree node is created and that this root tree node expands each "branch" to include all the data in the original document and any imported document.

On the other hand, this root node would need to recognize CombinePDF root nodes in order to merge with them rather then swallow them.

@aried3r
Copy link

aried3r commented May 27, 2016

Is there a workaround? At least to not lose the outline of the main document?

@boazsegev
Copy link
Owner

There's no workaround at the moment, as preserving the data and moving it around isn't as simple as a single line of code.

At the moment, the ToC is lost during the parsing stage, unlike the way the named destinations are preserved.

After the parsing, the data will need to be saved to the PDF object.

Then, the ToC would need to get updated during the import of new objects, a similar process happens with the names object and the form data.

Also, the ToC might reference named destinations and it might use conflicting keywords... i.e. names are updated to prevent conflicts and the ToC might need to go through a similar process.

The last step to write would be to get the ToC into the new catalog/PDF object, similar to the way the names object is added for the rendering.

I had tons of stuff on my plate, so I didn't get to it, but it is definitely possible to code this... it's just more complex of an undertaking.

@sLe1tner
Copy link
Contributor

Got kind of stuck.. or lost, somewhere in the outlines hash
Merging 2 Outline hashes seems not to work by using

actual_value(@outlines).update actual_value(data.outlines_object), &self.class.method(:hash_merge_new_no_page)

"seems not to work" as in SystemStackError - stack level too deep from calling merge in hash_merge_new_no_page over and over.
This is making me believe that I made a mistake somewhere in the steps 1-3, you described (see 6d16830), especially because of the fact that a pdf with 7 bookmarks needed tens of thousands of lines (literally) on a fullscreened console to print the outlines hash.

I'd appreciate a little nudge in the right direction if you don't mind :)

This is what the beginning on the outlines hash looks like, starts with obvious and usefull data, but deeper in is stuff like fonts, annots, meadiaboxes and so on.

{:Type => :Outlines,
   :Count => 7,
   :First => {:is_reference_only => true, :referenced_object => {:Title => "\xFE\xFF\x00I\x00N\x00H\x00A\x00L\x00T\x00S\x00V\x00E\x00R\x00Z\x00E\x00I\x00C\x00H\x00N\x00I\x00S",
      :Parent => {:is_reference_only => true, 
        :referenced_object => {:Type => :Outlines, 
          :Count => 7,
          :First => {...}, 
          :Last => {:is_reference_only => true, 
            :referenced_object => {:Title => "\xFE\xFF\x00A\x00U\x00F\x00G\x00A\x00B\x00E\x00N", 
              :Parent => {:is_reference_only => true, 
                :referenced_object => {...}}, 
              :Count => 2, 
              :First => {:is_reference_only => true, 
                :referenced_object => {:Title => "\xFE\xFF\x00E\x00v\x00a\x00l\x00u\x00i\x00e\x00r\x00u\x00n\x00g\x00 \x00v\x00o\x00m\x00 \x000\x009\x00.\x000\x006\x00.\x002\x000\x001\x006", 
                  :Parent => {:is_reference_only => true, 
                    :referenced_object => {...}}, 
                  :Count => 0, 
                  :Next => {:is_reference_only => true, 
                    :referenced_object => {:Title => "\xFE\xFF\x00R\x00e\x00e\x00v\x00a\x00l\x00u\x00i\x00e\x00r\x00u\x00n\x00g\x00 \x00v\x00o\x00m\x00 \x000\x009\x00.\x000\x006\x00.\x002\x000\x001\x006", 
                      :Parent => {:is_reference_only => true, :referenced_object => {...}}, 
                    :Count => 0, 
                    :Prev => {:is_reference_only => true, 
                      :referenced_object => {...}}, 
                    :Dest => [{:is_reference_only => true, 
                               :referenced_object => {:Type => :Page, 
                                 :Parent => {:is_reference_only => true, 
                                   :referenced_object => {:Type => :Pages, 
                                     :Count => 9, 
                                     :Kids => [{:is_reference_only => true, 
                                                :referenced_object => {:Type => :Page, 
                                                  :Parent => {:is_reference_only => true, :referenced_object => {...}}, 
                                                  :MediaBox => [0, 0, 595.28, 841.89], 
                                                  :Contents => {:is_reference_only => true, 
                                                    :referenced_object => {:Length => 1423, 
                                                      :Filter => [:FlateDecode], 
                                                      :raw_stream_content => RAW STRUM}}, 

@boazsegev
Copy link
Owner

I don't think you're lost... yet :-)

I think the issue is the recessive nature of the update function.

The Outline hash references the pages which, in turn, reference the catalog, which brings us back to the outline... But this isn't the issue because the hash_merge_new_no_page prevents this recursive branch during it's "deep update" implementation.

There must be another branch of data that causes recursive updating.

It seems that each node in the outline tree references it's parent node (:Parent)... these references might be the one leading to the recursive update.

I'm guessing a manual update scheme should be implemented or a different hash update function should be used to resolve data conflicts (instead of hash_merge_new_no_page). maybe the hash_merge_new_no_page function should ignore and :Parent key names...

@sLe1tner
Copy link
Contributor

sLe1tner commented Jun 22, 2016

Seems like that's the last piece missing.. How to merge the outlines
In theory, it's just about inserting into a two-way list, but I'm having trouble understanding the logic behind the way the object gets build in the parser, or rather the result, where the outlines hash has the first and the last outline node and everything else about the outline is "stuffed" into those 2 entries and some information about the bookmark order can be found multiple times within the hash.
I'm not sure, how to insert the new bookmarks into that, yet.

@boazsegev
Copy link
Owner

I'm sorry it's taking me a while to answer, it's a busy week for me.

To resolve this we need to read through the PDF specification.

Section 12.3 to the specs deals with outlines and I think this is the part we should be looking at.

When merging the data, we should make it fit the specs, either by:

  • merging existing outlines as tree branches (no need to unpack each outline, we simply create a master outline that treats these outlines as children; OR
  • merging the actual outline data (probably preferable but might be complicated) in accordance with location at which the incoming PDF object is injected.

At this point, named destinations (see 12.3.2.3 in the specs) need to be reviewed, to make sure we don't have naming conflicts. Similar code already exists and could probably be adapted for the outline.

It seems to me that in order to combine the outline (which is a linked list), the code could "walk" the list, updating the :First, :Next, :Last and :Count with each "step".

The code will probably need to walk only the root of the outline (assuming branches of the outline remain unchanged).

If the PDF is injected in the middle, the :First and :Last entries might not need to be updated, but in both other cases (injected at the beginning or the end) these entries will need to be updated.

I have no idea how to handle the middle case, as I have no idea how to discover which "bookmarks" reference the page after which the PDF is injected.

@sLe1tner
Copy link
Contributor

sLe1tner commented Jun 28, 2016

Well, I was able to append the second outline to the original one so far (here: ee14937).
As far as the other (probably preferable option goes, I assume we'd have to use the :destination keys in the outline nodes to identify the page it links to and then insert them accordingly, but the insertion part gets kind of tricky, because the outline is full of circular references, of course there are the :Page and the :Parent as you pointed out earlier, but every node can have up to 5 references to other nodes (:Parent :Prev :Next :First :Last) so a node that has a :Next node it points to will also have that node pointing to it with : Prev, which makes "walking" through the hash very difficult, Which is also why I didn't do that in the commit I wrote about at the beginning. I updated the outline :Count and then all the :Parent's for nodes that have the outline as parent, then changed the :Last key of the main outlinedictionary to the new outline and finally appended the new outline to the old one by going to the last node of the old outlines :First key and inserting it there.
Tomorrow I'm going to test it some more and figure out if the result is according to the specifications and then hopefully open a pull request.

Also, I might be wrong here, but I am pretty sure that we don't need to do the same thing that has to be done for the named destinations, named destinations literally have a 'word' as link/anchor while the bookmarks use page links, in my tests so far, even merging 2 identical Pdfs didn't cause any problems with names or page links.

Edit/Add: Depending on what I find in the specifications, I might try to add a parent node to the new outline, so it is easier to identify an inserted part.

@boazsegev
Copy link
Owner

Hi Stefan,

I've been working on a lot of re-writes and bug corrections for resolving lost form data, decryption issues and the like... It actually brought some in-depth changes, so things might be easier now.

For instance, the reference resolution should be both faster and easier... but it is implemented differently, so (I'm sorry) some of the code you wrote will need to be revised.

For instance, you have this line:

@objects << catalog
# [...]
add_referenced catalog[:Outlines], false

which will break under the new implementation (and isn't required, everything in the catalog is automatically processed).

This includes a new recursion (or repetition) protection helper at CombinePDF::PDF::RECORSIVE_PROTECTION

I hope this will help to "walk" the outline.

After reading the code for the update, I thought I would give it a try myself. Your code looked concise and thoughtful, however, I though a destructive pattern in a single function would perform better and be easier to maintain over time.

I came up with this untested (and probably broken) destructive update pattern (updates the data in place).

def merge_outlines(old_data, new_data, position)
  old_data = actual_object(old_data)
  new_data = actual_object(new_data).dup
  if old_data.empty?
    # old_data is a reference to the actual object,
    # so if we update old_data, we're done, no need to take any further action
    old_data.update new_data
  else
    old_data[:Count] += new_data[:Count]
    # walk the Hash here ...
    # I'm just using the start / finish position for now...
    # FIXME to implement an insert in the middle of the file?
    prev = nil
    pos = first = actual_object(((position < 0) ? new_data : old_data)[:First])
    median = actual_object(((position < 0) ? old_data : new_data)[:First])
    last = actual_object(((position < 0) ? old_data : new_data)[:Last])
    first = old_data[:First] = {is_reference_only: true, referenced_object: first}
    last = old_data[:Last] = {is_reference_only: true, referenced_object: last}
    parent = {is_reference_only: true, referenced_object: old_data}
    # the walking
    while(pos)
        pos[:First] = first # already a reference
        pos[:Last] = last # already a reference
        pos[:Parent] = parent # already a reference
        if(prev)
          pos[:Prev] = {is_reference_only: true, referenced_object: prev}
        else
          pos.delete :Prev
        end
        # connect the two outlines
        if(pos[:Next].nil?)
          pos[:Next] = median
          median = nil
        end
        prev = pos
        pos = actual_object(pos[:Next])
    end
    # make sure the last object doesn't have the :Next property
    prev.delete :Next
  end
  # print_dat_outline(old_data)
  return nil # no need to return the data, the update had taken place destructively.
end

I wonder if you tried a similar approach...?

I know you already have something working, but could you test it out and see if there's a performance difference?

I know there are a few things missing:

  1. This code updates the outline directly, I don't think it's corrupting the "incoming" PDF object because I think everything was duplicated, but this will need to be tested.
  2. This code doesn't deal with nested outlines... but does it need to? is there any change required for the nested data?
  3. The code doesn't deal with cases where the insert position isn't 0 (beginning) nor -1 (end), but rather a relative value (i.e. -3, three pages from the end)... but I think this might be as close as we get for now.

The uptake of this approach is that it is minimizing temporary objects and function calls (which are relatively expensive in Ruby), editing the data "in place".

I also think it's easier to maintain, but it might be because my mind is slow at computing function call jumps.

Named destinations?

I understand the PDF files you tested with all used page references as TOC destinations. This is great news.

However, I think the standard states that named destinations (using either Name objects (Ruby symbols) or Strings) are allowed (see sec. 12.3.2.3).

I'm not sure, but it could be that we're be covered either way, we just need to check that the existing name conflict resolution logic will be applied also on the names in the TOC (which it probably would).

Edit/Add:

At first I thought that would be cool, but then there is the question of naming that header position (we could probably use the PDF#title property). This could actually make the outline merge easier, though it would require some thought...

On one hand, I'm not sure this is what I would want to see when merging a file.

On the other hand, this could also be used to create an outline for files without an outline, so all inserted PDF files appear on the outline and the reader can jump directly to this file or the other.

But this will probably only work if files are inserted at the beginning or the end. I don't see how this can be implemented for PDF files inserted in the middle of a page stream.

In practice, I think this idea should be scratched in favor of practicality.

@sLe1tner
Copy link
Contributor

sLe1tner commented Jun 29, 2016

I came up with this untested (and probably broken) destructive update pattern (updates the data in place).

I appreciate it, I see where you are going with that and made it work with your updates the data in place approach.

I know there are a few things missing:
This code doesn't deal with nested outlines... but does it need to? is there any change required for the nested data?

This code updates the outline directly, I don't think it's corrupting the "incoming" PDF object because I think everything was duplicated, but this will need to be tested.
(Edit: Messed up copy-pasting this)

I assume that you are talking about the ToC's subsections. The answer is no. As long as we only enable insertion at the start or end and not somewhere in the middle (within a subsection) it doesn't matter. I would have explained the :First and :Last key a little better if I knew that you were going to come up with an update method. Those keys are only present in a subsection parent node and they point to the start and end of that subsection (the main nodes we see in the outline are basically a subsection of the main outline node (by which I mean the tree base, which only has { Type: Outline, Count: #, First{..}, Last{..}}) which is not visible. Which is why we don't need to touch them, except in the main outline node.
After tweaking your merge method to look like this:

def merge_outlines(old_data, new_data, position)
  old_data = actual_object(old_data)
  new_data = actual_object(new_data).dup
  if old_data.empty?
    # old_data is a reference to the actual object,
    # so if we update old_data, we're done, no need to take any further action
    old_data.update new_data
  else
    old_data[:Count] += new_data[:Count]
    # walk the Hash here ...
    # I'm just using the start / finish position for now...
    # FIXME to implement an insert in the middle of the file?
    prev = nil
    pos = first = actual_object(((position < 0) ? new_data : old_data)[:First])
    last = actual_object(((position < 0) ? old_data : new_data)[:Last])
    median = {is_reference_only: true, referenced_object: actual_object(((position < 0) ? old_data : new_data)[:First])}
    old_data[:First] = {is_reference_only: true, referenced_object: first}
    old_data[:Last] = {is_reference_only: true, referenced_object: last}
    parent = {is_reference_only: true, referenced_object: old_data}
    # the walking
    while(pos)
      # pos[:First] = first # already a reference
      # pos[:Last] = last # already a reference
      pos[:Parent] = parent if pos[:Parent] # already a reference
      if(prev)
        pos[:Prev] = {is_reference_only: true, referenced_object: prev}
      else
        pos.delete :Prev
      end
      # connect the two outlines
      if(pos[:Next].nil?)
        pos[:Next] = median
        median = nil
      end
      prev = pos
      pos = actual_object(pos[:Next])
    end
    # make sure the last object doesn't have the :Next property
    prev.delete :Next
  end
  # print_dat_outline(old_data)
  return nil # no need to return the data, the update had taken place destructively.
end

its performance was the following:

SMALL PDFS:
user     system      total        real
0.000000   0.000000   0.000000 (  0.000007)
0.000000   0.000000   0.000000 (  0.000027)
0.000000   0.000000   0.000000 (  0.000013)
0.000000   0.000000   0.000000 (  0.000014)
LARGE PDFS:
user     system      total        real
0.000000   0.000000   0.000000 (  0.000009)
0.000000   0.000000   0.000000 (  0.000054)
0.000000   0.000000   0.000000 (  0.000089)
0.000000   0.000000   0.000000 (  0.000129)

While my version had this performance:

SMALL PDFS:
user     system      total        real
0.000000   0.000000   0.000000 (  0.000019)
0.000000   0.000000   0.000000 (  0.000016)
0.000000   0.000000   0.000000 (  0.000016)
0.000000   0.000000   0.000000 (  0.000028)
LARGE PDFS:
user     system      total        real
0.000000   0.000000   0.000000 (  0.000006)
0.000000   0.000000   0.000000 (  0.000074)
0.000000   0.000000   0.000000 (  0.000087)
0.000000   0.000000   0.000000 (  0.000112)

Merging 4 identical pdfs,
one of them 4 pages with 5 outline nodes (SMALL PDFS),
one 214 pages with 100+ outline nodes (LARGE PDFS),
Test called in pdf_public.rb, by doing this instead of a normal call to the method:

Benchmark.bm do |bm|
  bm.report { merge_outlines(@outlines, data.outlines_object, 0) }
  # bm.report { actual_value(@outlines).update merge_outlines_old(actual_value(@outlines), actual_value(data.outlines_object)) unless actual_value(data.outlines_object).empty? }
end

using the according call for the new and the old method testing (new being your approach)

I don't think it's corrupting the "incoming" PDF object because I think everything was duplicated, but this will need to be tested.

I'd have to test more specifically, all I can say at this point is that everything looks fine in the result.

The code doesn't deal with cases where the insert position isn't 0 (beginning) nor -1 (end)

I think that's sufficient for now aswell. It would be quite some effort to figure out where to put it, especially since there's even multiple ways to define an outline nodes destination, as you pointed out.

However, I think the standard states that named destinations (using either Name objects (Ruby symbols) or Strings) are allowed (see sec. 12.3.2.3).

Yup, you're right, I'm looking into that now.

In practice, I think this idea should be scratched in favour of practicality.

Alright, I see your point.

Keep in mind that everything I wrote so far is from an unmerged upstream standpoint.
I just started merging your changes in, but I'm not clear on how to do this add_referenced catalog[:Outlines], false now. Just skipping it doesn't work.

@boazsegev
Copy link
Owner

Wow, that's a lot of work... nice. I'm super thankful to you for this... this feature should be named after you 👍

Two questions about performance and the merge function.
  1. Is it easy to move the if statements out of the loop?

    1.1 I think calling Hash#delete will be faster then calling if(prev) for every rotation... we know it's always the last object, or am I wrong?

    1.2 I know it's my bad, I preferred DRY over performance, but we might consider writing two loops instead of the if(pos[:Next].nil?), but two loops should (in theory) be faster.

  2. The benchmarks had wide statistical variance... they both seem fast, but I fear context switches and benchmarking overhead might have to much of an effect to learn much.

    I'm sure my idea will corrupt any TOC, but maybe we can keep adding it to itself a few times (maybe 1000 times), just to see how these functions behave under heavier load. My goal is that the benchmark will spend about a second actually performing the task, so some of the overhead and context switching effects could be drowned out we might need to try more repetitions, but I think they grow exponentially.

Benchmark.bm do |bm|
  merge_outlines(@outlines, data.outlines_object, 0)
  bm.report { 1000.times { merge_outlines(@outlines, @outlines, 0) } }
  # actual_value(@outlines).update merge_outlines_old(actual_value(@outlines), actual_value(data.outlines_object)) 
  # bm.report { actual_value(@outlines).update merge_outlines_old(actual_value(@outlines), @outlines) }
end
Merging with upstreamed changes

I'm sorry I made you work harder... But I hope no big changes will pop up.

I think I might have messed up the RECORSIVE_PROTECTION Hash in a way that messes up the reference collection.

The PDF::RECORSIVE_PROTECTION hash excludes certain keys / objects from being searched.

When I wrote it I assumed there is a :Kids array listing all the outlines, but now that I think of it, there wasn't one... We probably need to remove the First: true, Next: true, Prev: true from the RECORSIVE_PROTECTION Hash... the whole thing should probably be removed and replaced with the simple k == :Parent it was before... Sorry

If this isn't the reason it fails, I'm not sure what it might be, so let me explain the new logic, maybe it will help.

The new reference collector looks at any existing objects and adds any references they hold into the @objects array.

Since this is done after the catalog is added to the (now empty) @objects array, the new reference collector should have automatically added any missing referenced objects from the @outline (referenced by the catalog object).

In other words, it should have worked automatically once you had this in the code in the rebuild_catalog function (and saved the file):

catalog_object = { Type: :Catalog,
                 Pages: { referenced_object: pages_object, is_reference_only: true },
                 Names: { referenced_object: @names, is_reference_only: true },
                 Outlines: { referenced_object: @outlines, is_reference_only: true } }

However, for performance reasons, the reference collector (add_referenced) is only called when there's risk to the data structure's integrity or there's a need to format the PDF data (i.e. saving).

Any tests performed after rebuilding the catalog will fail unless the reference collector is invoked. I'm hoping this is why it failed...

If this doesn't work I might have a bug in my new reference collector... that would definitely suck, but i's possible.

A simple workaround would be to add the @outlines directly to the @objects array before invoking the reference collector... but I would need to search for the bug and fix it anyway.

@sLe1tner
Copy link
Contributor

sLe1tner commented Jun 30, 2016

Happy to help, also you did a lot by coming up with the base of how to update the outlines (:

performance and the merge function.

  1. Is it easy to move the if statements out of the loop?

I'm not sure if you are deleting pos[:Prev] for some performance reasons I can't see? If not, it wouldn't be needed at all.

while(pos)
  pos[:Parent] = parent if pos[:Parent] # already a reference
  # connect the two outlines
  if(pos[:Next].nil?)
    pos[:Prev] = {is_reference_only: true, referenced_object: prev} if prev
    median[:referenced_object][:Prev] = {is_reference_only: true, referenced_object: prev} if median
    pos[:Next] = median
    median = nil
  end
  prev = pos
  pos = actual_object(pos[:Next])
end

this would work aswell. It would kind of combine your 1.1 and 1.2, if the if(prev) was in it's own loop, that loop would have to run through one whole branch checking [:Next].nil? every rotation while it's trying to get to the end. This way everything happens in this 1 if, I'm not sure though if I overlooked anything performance wise.

2. Performance
Let's just say the old way explodes with big data.
Benchmarking the way you suggested didn't quite work for reasons I haven't fully figured out yet (calling the method more than once results in ruby getting stuck like it's in an infinite loop). I worked around that by creating pdfs with huge ToC's using the prawn gem and merging those pdfs. And this is where my method fell short.. when working with multiple really big ToC's. I know it's not very specific, but it was really just a way to confirm what we already knew, so I didn't have to waste time on those methods.

Merging with upstreamed changes
Everything worked out after reverting to k == :Parent

I'm out of time for today, but next up is rebuilding the named destinations used in the outline.

@boazsegev
Copy link
Owner

Cool 👍

1. The loop

I love the new loop and unified if statement, nice work.

I think deleting the :Prev property was my bad, I didn't think it through and it isn't required.

The only thing is, if we have a nil value, it will be a NULL object in the PDF file, which just takes up space for no reason and might cause issues in the future.

The first object probably shouldn't have the :Prev key and the last once probably shouldn't have the :Last key (even if the key value is nil)... but this considerations definitely shouldn't be part of the loop.

Performance

Kudos for taking the time to create PDF files with huge ToCs... If you can post any of them (unless they contain sensitive data), I could use them for some of the tests I perform before releasing updates.

I don't add the PDFs used for testing to the online repo, since some of them might contain sensitive data... and I should probably write automated tests, though... which I haven't gotten around to. I keep using the demo app for testing half the features (merge PDF files, text boxes, font importing, tables and page numbering) and then do some manual work for testing other features (i.e. page stumping)... this isn't very effective...

... I digress.

Upstream

I'm happy this is resolved and that the new reference collection works, as I think the new reference collection will make future updates that much easier.

Named destinations

If the names already exist in the name dictionary, this might be handled by the existing name resolution that is automatically applied whenever the object catalog is rebuilt.

If this suspicion I have is correct, then this feature is practically done 👍

@sLe1tner
Copy link
Contributor

sLe1tner commented Jul 1, 2016

The Loop

if we have a nil value, it will be a NULL object in the PDF file

"if we have a nil value" - where in outline would that nil, you are talking about, be?

The first object probably shouldn't have the :Prev key and the last once probably shouldn't have
the :Last key (even if the key value is nil)... but this considerations definitely shouldn't be part of the loop.

I can definitely add a deletion of those keys after the loop, since those nodes are accessible without looping

Performance
Well, I'm testing my work in the very rails app that brought me here. The app already generates PDFs using Prawn, but for the tests I generated different simplistic PDFs. The code I use are just things like this:

report = CombinePDF.new

b_pdf = Prawn::Document.new
b_pdf.text("Hi prawn")
b_pdf.outline.page title: 'First pdf'
b_pdf.start_new_page
b_pdf.text("Heeeeelloh!")
b_pdf.outline.page title: '1st pdf page 2'
b_pdf.start_new_page
b_pdf.text("aloha!")
b_pdf.outline.define do
  100.times do |t|
    section "Chapter #{t}", closed: false do
      page :title => 'Page 1', destination: 1
      page :title => 'Page 2', destination: 2
      page :title => 'Page 3', destination: 1
      page :title => 'Page 4', destination: 2
    end
  end
end

report << CombinePDF.parse(b_pdf.render)
report << CombinePDF.parse(a_pdf.render) # a_pdf being a similarly created pdf

final_report = report.to_pdf

send_data final_report,
          filename: "test.pdf",
          type: 'application/pdf',
          disposition: 'inline'

Adding multipliers like this 100.times where I see fit, this b_pdf for example has 502 outline nodes + the base node

Named destinations
Well, I'm having a hard time creating an outline that uses this kind of link at the moment.
But I tried merging 2 Prawn manual PDFs and that didn't have the desired result. The merge itself went fine, but both the outlines linked to the same pages of the first PDF. To make things harder my little outline hash print method (don't ask me what happened to the indentation there) also isn't able to print the hash in a readable manner, thanks to the huge raw streams.
So, I'm still trying to figure out what the problem with that PDF is.

Another thing I noticed, by the way, is that the way the outline hash gets build, results in some cases, in the outline :dests linking to the pages dictionary, which bloats the outline up greatly, since potentially every node can be a link to a page, which means that every node's :dest contains all the pages?

inserting pages
When iterating over a PDFs pages and inserting them one by one, that PDFs outline gets left out.
Just letting you know, so you don't get surprised by it later on.
tested like this:

CombinePDF.parse(b_pdf.render).pages.each_with_index do |page, index|
  if index < 2 # b_pdf only has 3 pages here
    report << page
  else
    report << CombinePDF.parse(a_pdf.render)
    report << page
  end
end

Resulting PDF contains only the outline from a_pdf

@boazsegev
Copy link
Owner

boazsegev commented Jul 1, 2016

Hi Stefan,

Named destinations

First, please let me apologize, and hopefully make you happy at the same time.

Last night an Issue was posted that exposed a vulnerability in the named destinations algorithm that was in place.

I had to rewrite the whole thing... but it now works better (although it might be glitchy, as I put it together late at night).

It could be that some of the issues you experienced will self resolve (I doubt, but I hope).

Duplicate objects / bloating

As to having the pages in the named destination hash, it's supposed to be this way. These are the same Ruby objects (same object_id) linked to from different places. They have a container that wraps them, so whenever the PDF is formatted, a reference is printed instead of the actual data.

The objects should only print once, which is what the add_referenced algorithm is making sure of... it's both making sure that they actually print and it's making sure the same object doesn't appear more then once.

which is also where we get this funny issue...

"both the outlines linked to the same pages of the first PDF"

I think this is why you get this funny behavior where both the outlines point to pages of the first PDF instead of their own original PDF.

Except for page objects, all objects that contain the same data are reduced to a single object.

For example, when we add pages that use the same font over and over again, we don't want to have multiple copies of the font... so the reference is resolved to the first copy and the rest of the font objects are discarded.

Pages, because of an issue with an older adobe acrobat reader, must have unique references. For this reason, Pages in the page catalog that contain the same data are have a shallow copy made specifically for the catalog. Only the Page's dictionary is duplicated, but not any of the resources they use (fonts etc' are still reduced to a single copy of each).

Th reason I think this might cause the issue where both the outlines point to pages of the first PDF is since the data in the pages is the same, the reference is forced to resolved to the original page (reduction takes place).

I don't think there's a way around it except (maybe) numbering the pages or editing them, because I can think of too many things that might go wrong if we don't enforce this behavior (like the references pointing to the pages in the other PDF object, and then formatting and managing the PDF can start doing weird things)...

...However, I'm not sure that's a contingency we need to explore. If people are just bloating their PDF files by duplicating their own data many times, I doubt if the ToC will be their source of disappointment.

nil value

I'm sorry, I think you answered me, but I'm just making sure, because sometimes I write too much and then I'm not clear.

"if we have a nil value" - where in outline would that nil, you are talking about, be?

In the :Prev key of the first element and the Next key of the last (which you answered me about)... but it's probably not really important. and

Upstream changes...

I hope it's easy to merge the new named destination function... sorry for the extra work.

Bo.

@sLe1tner
Copy link
Contributor

sLe1tner commented Jul 5, 2016

Named Destinations

.. but it now works better (although it might be glitchy, as I put it together late at night).

I just realized that my named destinations (generated with prawn), don't work anymore at all. Also tested without using the outline merge, so I assume that it is caused by the your new name rebuilding method.
If I'm not mistaken, the link to a page and the anchor on that page should be renamed to contain something like 'CombinePDF_0000006' as well as the original link/anchor name, which the links currently don't have (haven't checked the anchors).

@boazsegev
Copy link
Owner

boazsegev commented Jul 5, 2016

I'm far from my computer at the moment (hurray for smartphones), but I found a small issue with the renaming logic (I have an extra dic << in the first line of the renaming block.

I'll fix this when I get home. Could you please send me the test files? It would save me the time of creating my own test files.

@sLe1tner
Copy link
Contributor

sLe1tner commented Jul 5, 2016

Sure, here are some:
testpdf_1_named_destinations.pdf
testpdf_2_named_destinations.pdf
testpdf_3_named_destinations_no_outline.pdf
testpdf_4_named_destinations_no_outline.pdf
testpdf_5_big_outline.pdf
testpdf_6_bigger_outline.pdf

It doesn't work with any combination of those, actually not even using a single one without merging, would keep working named destination links.

@boazsegev
Copy link
Owner

I found my two mistakes:

  1. I was a spelling mistake (I spelled :Dests as :Dest singular, so it was ignored).
  2. I left the is_reference_only: true key-value pair out... so the linking to the data was all broken.

I think it should be working now... could you see if you're getting better results?

I think the outlines in the test files didn't all work (the ones not under a chapter didn't seem to work even on the originals).

There isn't a need to unpack the page stream and edit it, since the named destinations don't make it into the page stream. It seems that named destinations are a type of annotation that is overlaid over the page data.

I did find a dictionary we might be missing when supporting these page url jumps, and this is the Pageable data in the catalog (section 12.4.2 to the standard).

However, I think we should finish this one (and celebrate it) before tackling a new feature extension.

@sLe1tner
Copy link
Contributor

sLe1tner commented Jul 7, 2016

The test files
They are woking like they are supposed to, only the entries within the chapters are supposed to link to pages 1 or 2, I didn't see the need to make every entry link to any page, after all I'm not talking about named destination links in the outline (which this files don't have), I'm talking about the normal named destinations found on different pages, linking to different other pages. (I'm not sure if that was clear before, so I'm just making sure, in case it wasn't).

The problem with renaming
What you did is definitely going in the right direction.
When not merging, but only having 1 PDF run through the process of the gem, the named destinations work correctly.
When merging 2 PDFs, however, the named destinations of the first PDF don't work correctly anymore (they are still links, but when clicking them it's not jumping to any other page / nothing happens), but the named destination links found in the second PDF work like they are supposed to.

So, to sum this up, before your latest change, none of the named destinations links, found in the files I uploaded above, worked (nothing happened when clicking the link). Now, after your latest changes, the links work if it's only 1 PDF, and if it's 2 PDFs that are being merged, the links in the second PDF work, but the ones in the first PDF don't.

What I noticed:
In the first example, where I only used 1 PDF, the named destinations didn't get renamed to have something like 'CombinePDF_0000006', but obviously neither did the anchors, since these links worked.
In the second example, when merging 2 PDFs, the links of the first PDF (this links did not work!) didn't have the 'CombinePDF_0000006' string. The second PDF (this links worked!) did have a 'CombinePDF_0000006' string!
Sorry I can't tell you what the page anchors contained, since I haven't found an easy way of checking them.

I suspect that the problem with the links that don't work is that the links and anchors don't match at the moment, maybe you are updating all the anchors (to contain a 'CombinePDF_0000006' string), but the links of the first PDF get "forgotten"?

@boazsegev
Copy link
Owner

I suspect that the problem with the links that don't work is that the links and anchors don't match at the moment, maybe you are updating all the anchors (to contain a 'CombinePDF_0000006' string), but the links of the first PDF get "forgotten"?

I'll try and work out an answer and review the code at the same time, this way if I skip something in the logic you can point it out.

The logic is this:

  • The parser unifies all strings - Within a single PDF object, all strings that are equal in content are the same object. This happens in the parser:
def _parse_
#[...]
        ##########################################
        ## parse a Hex String
##########################################
        elsif str = @scanner.scan(/<[0-9a-fA-F]+>/)
          # warn "Found a hex string"
          out << unify_string([str[1..-2]].pack('H*').force_encoding(Encoding::ASCII_8BIT))
# [...]
        ##########################################
        ## parse a Literal String
##########################################
        elsif @scanner.scan(/\(/)
        # [...]
                  out << unify_string(str.pack('C*').force_encoding(Encoding::ASCII_8BIT))
# [...]
end

# [...]
 def unify_string(str)
      @strings_dictionary[str] ||= str
end
  • Since all strings are one string, editing a string in place (v.s replacing the string) should updated the information everywhere this sting was referenced in the PDF. i.e.
a = "test"
b = a
a << " this"
b == "test this" # => true
dic << (pos[i * 2].clear << base.next!)
  • Since all strings are one string, editing the string in place should negate the possibility of "forgetting" to update any links.

Caveats and possible issues:

  • If any strings are parsed but not unified, the whole premise is broken. I don't think is the issue.
  • If, after parsing, strings are copied (str.dup) they will no longer be unified. This might happen when copying PDF data and it could be a good idea to review the copying (and secure copying) logic to make sure string aren't duplicated or (if they are duplicated) that they are re-unified.
  • Strings that are used as Hash keys are automatically duplicated and frozen by the underlying Ruby implementation, which means that if named destination names where placed as hash keys (not values), they wouldn't be updated and the information would be lost.

when merging 2 PDFs, the links of the first PDF (this links did not work!) didn't have the 'CombinePDF_0000006' string. The second PDF (this links worked!) did have a 'CombinePDF_0000006' string!

I'm very happy I asked you to test this also on your system, because I can't seem to duplicate the issue.

Here is the code I used for testing, the list is a list with all the files you sent me and their paths on my machine, I use the list to:

  1. For each file: load and save.
  2. For each file: load, merge with the same file (reloaded, not same object).
  3. Merge all files to review total merger.
lists = %w{./Ruby/test\ pdfs/outlines/big_toc.pdf ./Ruby/test\ pdfs/outlines/bigger_toc.pdf ./Ruby/test\ pdfs/outlines/named_dest_no_toc.pdf ./Ruby/test\ pdfs/outlines/named_dest_no_toc2.pdf ./Ruby/test\ pdfs/outlines/named_dest.pdf ./Ruby/test\ pdfs/outlines/named_dest2.pdf};

i = 0
lists.each{|n| CombinePDF.load(n).save("07_#{(i+=1).to_s}_#{n.split('/')[-1]}"); (CombinePDF.load(n) << CombinePDF.load(n)).save("07_#{(i).to_s}x2_#{n.split('/')[-1]}") }
pdf = CombinePDF.new
lists.each{|n| pdf << CombinePDF.load(n) }
pdf.save("07_named destinations.pdf")

In my test files all the links work... Can you post your test code, so I can try and replicate the issue?

@boazsegev
Copy link
Owner

Stefan,

Happy Weekend 🎉

I hope I'll be able to replicate and find the issue and release a fixed version this weekend.

I'm thankful for all you've done.

B.

@boazsegev
Copy link
Owner

Hi Stefan, I think @Kagetsuki exposed the source of the issue - see issue #71 .

@boazsegev
Copy link
Owner

Nope.. probably not the issue... this shouldn't effect page links, only outlines.

@sLe1tner
Copy link
Contributor

sLe1tner commented Jul 11, 2016

Hi Bo,
This is getting weirder with every test I do..
Honestly I have no clue what's going on here, I can't reproduce the failed examples I told you about anymore.
The one example I was able to reproduce earlier today, used a specific version of a report generated by prawn with loads of images, but after resetting my seeded data that suddenly seems to work aswell now... 😕

If it happens again I'll put together a more detailed report about how it happened and how I'm testing for it. As for the previous examples, the difference between our tests is that I used the data generated by prawn with CombinePDF.parse while you had the actual PDFs, but since I can't even reproduce the behaviour anymore, it doesn't seem that relevant at the moment.

@Kagetsuki
Copy link

@sLe1tner Did you update your gems? @boazsegev released like two patch versions in the last day 🤘🏼. If you updated your gems/bundle the patch probably got in there. You could peg the gem at .21 and then .23 and see if you get the error again - in which case your issue was also fixed in .25.

@boazsegev
Copy link
Owner

🎉
I take this as good news - maybe the issues were related to the fixes in the latest releases (i.e. empty strings being ignored / catalog inheritance)...

...I think this means we have a (probably) stable outline merging feature. I'm very happy about this 👍

I'll close this issue, but if anything pops up or you find something, please open an issue and we'll work on it.

B.

@sLe1tner
Copy link
Contributor

Ah that might have had something to do with it, i'll check it out, thanks @Kagetsuki

...I think this means we have a (probably) stable outline merging feature. I'm very happy about this 👍

Yea that sounds great :)

@daymun
Copy link

daymun commented Feb 9, 2017

I'm attempting to prepend a cover page to my pdf which has a table of contents. I would like the table of contents to remain unchanged.

Currently the TOC is unchanged but links are dropped. Is this expected to work and is there anything I can try to fix the links?

Thanks for all your work on this.

@Kagetsuki
Copy link

@daymun How are the links formed? Are they to named anchors? Can you provide sources/samples?

@daymun
Copy link

daymun commented Feb 9, 2017

@Kagetsuki the TOC links are generated with wkhtmltopdf. Here are some sample files: https://drive.google.com/drive/folders/0B0vK1fMzDbarMndpdExVUFlCbE0

As you can see, links work in main.pdf but not in combined.pdf.

boazsegev pushed a commit that referenced this issue Feb 10, 2017
@boazsegev
Copy link
Owner

boazsegev commented Feb 10, 2017

@daymun ,

Thank you for opening this issue.

I've started looking in to this and it might have something to do with the fact that wkhtmltopdf uses an old way for naming destinations (this format was updated in PDF v.1.2), where the Dest dictionary is referenced directly within the Catalog object, instead of using a Names object.

This was a bigger problem than I thought at first glance, because wkhtmltopdf uses the old format with page numbers instead of page references - which means that it breaks when CombinePDF tries to parse it (page numbers are meaningless before parsing is complete, and even afterwards, this becomes a performance issue).

Anyway...

I've uploaded an updated version to GitHub, to deal with wkhtmltopdf compatibility.

This update might have a negative impact on performance in some cases... but this might be the price we pay when we need to write work arounds.

Also, I didn't finish testing this version for any negative impact on the actual PDF data... I don't promise this change will make it to the release version until I test it some more.

Could you possibly test that this fix actually helps you before I push this through?

@daymun
Copy link

daymun commented Feb 10, 2017

@boazsegev, many thanks for your quick resolution of this issue. Everything seems to be working with the fix you pushed yesterday and I haven't noticed any negative impact on PDF data.

@romarioliveira25
Copy link

romarioliveira25 commented Jan 13, 2020

same issue, isn't working here.

@boazsegev
Copy link
Owner

@romarioliveira25 ,

Could you open a new issue, explaining exactly what isn't working and adding a short example that would allow me to replicate any issue you might be experiencing?

Kindly,
Bo.

P.S.,

I'm mostly offline for the next two-three weeks. I might be slower than usual.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants