Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/ENH: make attachements compatible with kids, and allow list in RF #2197

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

pubpub-zz
Copy link
Collaborator

@pubpub-zz pubpub-zz commented Sep 17, 2023

closes #2087
closes #2090

add also compatibility with RF (adding list)

still in progress

@pubpub-zz pubpub-zz changed the title Kids embd files BUG/ENH: make attachements compatible with kids, and allow list in RF Sep 17, 2023
@codecov
Copy link

codecov bot commented Oct 6, 2023

Codecov Report

Attention: 170 lines in your changes are missing coverage. Please review.

Comparison is base (5a2dd75) 94.45% compared to head (ab96331) 92.53%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2197      +/-   ##
==========================================
- Coverage   94.45%   92.53%   -1.93%     
==========================================
  Files          43       43              
  Lines        7650     7930     +280     
  Branches     1511     1576      +65     
==========================================
+ Hits         7226     7338     +112     
- Misses        262      393     +131     
- Partials      162      199      +37     
Files Coverage Δ
pypdf/_protocols.py 100.00% <ø> (ø)
pypdf/_reader.py 91.53% <100.00%> (-0.17%) ⬇️
pypdf/constants.py 100.00% <100.00%> (ø)
pypdf/generic/__init__.py 100.00% <ø> (ø)
pypdf/generic/_base.py 99.33% <71.42%> (-0.67%) ⬇️
pypdf/_writer.py 86.91% <58.62%> (-1.47%) ⬇️
pypdf/generic/_data_structures.py 80.62% <46.46%> (-11.29%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
progressing well on this PR (still some test to ensure proper coverage) 😋

@pubpub-zz pubpub-zz marked this pull request as ready for review October 13, 2023 20:32
@pubpub-zz pubpub-zz marked this pull request as draft October 13, 2023 20:34
@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
I have a dilemma,
The way attachments are handled has been misunterstood (and I think I have to plead guilty)
for reference, information are mainly available in Pdf 1.7 reference §3.10 (p178) ;
the attachements are stored in a name tree : there can not be any duplicates
in most of the attachement is stored in EF key : in such case there is only one file/data
if RF key are used, there is a list of file/data indexed by filenames (and in such cases, an EF file is also required)

therefore attachements should return a dictionary where keys are the names in the dictionnary
the value should be an list of a single extended bytes with the following functions added:
indirect_reference /property/ : will point to the File Specification Object
name /property/ : returns the preferred file name.
get_file_stream( subfile:str="") : return the embedded file stream structure associated with the subfile name (default is EF entry)
within_page: (reserved for extension) return the page object containing the attachment if within an annotation else returns None

the point to output the return within a list is here to not change the interface. based on the better understanding of the spec, a list can not have more than on entry.

@MartinThoma
Copy link
Member

I'm sorry, I don't understand the dilemma. What are the options we have?

a list can not have more than on entry.

I don't understand. Can you give me a pseudo-code example?

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented Oct 27, 2023

I'm sorry, I don't understand the dilemma. What are the options we have?

a list can not have more than on entry.

I don't understand. Can you give me a pseudo-code example?

As the key in the name tree are uniques, the attachements can only return one file specification stored in /EF field. therefore the list always have only one item which would be a bytes object with some extended properties/functions.
/RF entries (not originally adressed) should return a dictionary where the filenames are the keys. list are meaningless here too. this cannot be easily stored in a extended bytes object.

my dilemma is:
a) is it ok to change the API to return either bytes or dictionnary for the RF entries?
a) should take opportunity to remove the overall useless list

we will get the following interface:

class AttachmentBytes(bytes):
     ...

class PdfReader:   # the same for PdfWriter
    ...
    @property
    def attachments(self) -> Mapping[str, Union[AttachmentBytes, Dict[str, AttachmentBytes]]]:
          ...

@MartinThoma
Copy link
Member

The current signature is

def attachments(self) -> Mapping[str, List[bytes]]

What is AttachmentBytes? In which way would you extend the bytes object?

Changing the signature to def attachments(self) -> Mapping[str, bytes] would be a breaking change and thus we cannot do this for a longer time. We want to have a 4.0 release next year, hence we could add an info-message right now + do the change with pypdf==4.0.0. I don't see a way to do this over multiple releases. Would that be your a)?

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
I've got a version available. can you give me your opinion before I complete coverage

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
+1 ?

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
+1?

@MartinThoma
Copy link
Member

Sorry, I forgot about this 🙈

A bit of notes for myself: RF is short for "Related Files" and "EF" is short for "Embedded File"; see Table 44 – Entries in a file specification dictionary. It is an optional entry with a dictionary:

A dictionary with the same structure as the EF dictionary,
which shall be present. Each key in the RF dictionary shall also be present in
the EF dictionary. Each value shall be a related files array (see 7.11.4.2,
"Related Files Arrays") identifying files that are related to the corresponding
file in the EF dictionary. If this entry is present, the Type entry is required and
the file specification dictionary shall be indirectly referenced.

This PR attempts to solve two issues:

I'd prefer to have a PR + tests which deal with #2087 first before we fix #2090 or do bigger refactorings. First dealing with #2087 would make the current PR smaller, right?

Copy link
Member

@MartinThoma MartinThoma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's super hard for me to review this PR. Would you mind if I made a couple of smaller merges with the trivial parts (e.g. the constants, some smaller tests)?

I'd use the GitHub Co-authored-by feature to give you full credit, of course.

Comment on lines +2232 to +2238
@property
def attachments_names(self) -> List[str]:
"""
Returns:
dictionary of filename -> Union[bytestring or List[ByteString]]
if the filename exists multiple times a List of the different version will be provided
List of names
"""
catalog = cast(DictionaryObject, self.trailer["/Root"])
# From the catalog get the embedded file names
try:
filenames = cast(
ArrayObject,
cast(
DictionaryObject,
cast(DictionaryObject, catalog["/Names"])["/EmbeddedFiles"],
)["/Names"],
)
except KeyError:
return {}
attachments: Dict[str, Union[bytes, List[bytes]]] = {}
# Loop through attachments
for i in range(len(filenames)):
f = filenames[i]
if isinstance(f, str):
if filename is not None and f != filename:
continue
name = f
f_dict = filenames[i + 1].get_object()
f_data = f_dict["/EF"]["/F"].get_data()
if name in attachments:
if not isinstance(attachments[name], list):
attachments[name] = [attachments[name]] # type:ignore
attachments[name].append(f_data) # type:ignore
else:
attachments[name] = f_data
return attachments
return self.attachments.keys()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to have the new attachments_names property. I think users and pypdf itself should call self.attachments.keys() directly.

Comment on lines +152 to +156
F = "/F" # A file specification string of the file as described in Section 3.10.1
UF = "/UF" # A unicode string of the file as described in Section 3.10.1
EF = "/EF" # dictionary, containing a subset of the keys F , UF , DOS , Mac , and Unix
RF = "/RF" # dictionary, containing arrays of /EmbeddedFile
DESC = "/Desc" # description of the file as de
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those could be in an individual PR - that would be easy to review/quick to merge :-)

@MartinThoma MartinThoma added this to the pypdf==4.0.0 milestone Jan 1, 2024
@MartinThoma MartinThoma removed this from the pypdf==4.0.0 milestone Jan 19, 2024
@MartinThoma
Copy link
Member

@pubpub-zz Should we close this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in add_attachment() when attaching several files _list_attachments() doesn't take into account /Kids
2 participants