Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pdfstream as cmap #264

Merged
merged 7 commits into from
Jul 31, 2019
Merged

Pdfstream as cmap #264

merged 7 commits into from
Jul 31, 2019

Conversation

fakabbir
Copy link
Contributor

Extension of work done in #228.
Encapsulates initialization of cmap in a setter function.

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I have some small remarks though, and @ganeshtata also already made some small remarks on #228.

if type(name) is PDFStream:
if 'CMapName' in name:
name = name.get('CMapName').name
if name in('DLIdent-H','OneByteIdentityH','Identity-H') :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ganeshtata made some suggestions about this: #228 (comment).

name = 'unknown'
if type(name) is PDFStream:
if 'CMapName' in name:
name = name.get('CMapName').name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ganeshtata made some suggestions about this: #228 (comment)

I agree that overwriting name is confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree,

if strict:
raise PDFFontError(e)
self.cmap = CMap()
self.cmap = (spec, strict)
Copy link
Member

@pietermarsman pietermarsman Jul 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first time a @property is used in this repository. Do you have any special reason to introduce it here? Otherwise I would suggest keeping it simple and don't use properties.

Suggested change
self.cmap = (spec, strict)
self.cmap = self.get_cmap_from_spec(spec, strict)

@pietermarsman
Copy link
Member

This PR solves #210

@pombredanne
Copy link
Contributor

@fakabbir could you also add some tests?

@pietermarsman
Copy link
Member

pietermarsman commented Jul 15, 2019

Yes, if you add one ore more tests this PR is totally ready for merging.

Suggestion for test (the first is not working on master):

from pdfminer.cmapdb import IdentityCMap
from pdfminer.pdffont import PDFCIDFont
from pdfminer.pdftypes import PDFStream


def test_pdf_cid_font_with_pdfstream_as_cmap_name():
    stream = PDFStream({}, 'Identity-H')

    spec = {'Encoding': stream}
    font = PDFCIDFont(None, spec)

    assert isinstance(font.cmap, IdentityCMap)


def test_pdf_cid_font_with_string_as_cmap_name():
    spec = {'Encoding': 'Identity-H'}
    font = PDFCIDFont(None, spec)

    assert isinstance(font.cmap, IdentityCMap)

@pietermarsman
Copy link
Member

Looking good. I have still a few questions left though:

In fa40043 you remove the mapping from different cmap names to standard cmap names. I think this changes the behavior, is this on purpose?
For example, earlier 'DLIdentit-H' was mapped to 'Identity-H' which returned IdentityCMap(WMode=0) from CMapDB.get_cmap(). With this PR it returns CMap().

-CMAP_ENCODER = {
-    'DLIdent-H': 'Identity-H',
-    'OneByteIdentityH': 'Identity-H',
-    'Identity-H': 'Identity-H',
-    'DLIdent-V': 'Identity-V',
-    'OneByteIdentityV': 'Identity-V',
-    'Identity-V': 'Identity-V'
-}
+IDENTITY_ENCODER = ('Identity-H', 'Identity-V')

Also, I have some remarks in the review.

pdfminer/pdffont.py Outdated Show resolved Hide resolved
spec = {'Encoding': PSLiteral('Identity-V')}
font = PDFCIDFont(None, spec)
assert isinstance(font.cmap, IdentityCMap)

This comment was marked as outdated.

if strict:
raise PDFFontError('Encoding is unspecified')
cmap_name = 'unknown'
if type(cmap_name) is PDFStream:

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def literal_name(x):
    if not isinstance(x, PSLiteral):
        if settings.STRICT:
            raise PSTypeError('Literal required: %r' % (x,))
        else:
            name=x
    else:
        name=x.name
        if six.PY3:
            try:
                name = str(name,'utf-8')
            except:
                pass
    return name
    def get_cmap_from_spec(self, spec, strict):
        try:
            spec_encoding = spec['Encoding']
            if hasattr(spec_encoding, 'name'):
                cmap_name = literal_name(spec['Encoding'])
            else:
                cmap_name = literal_name(spec_encoding['CMapName'])
        except KeyError:
            if strict:
                raise PDFFontError('Encoding is unspecified')
            cmap_name = 'unknown'
        if type(cmap_name) is PDFStream:
            if 'CMapName' in cmap_name:
                cmap_name = cmap_name.get('CMapName').name
            else:
                if strict:
                    raise PDFFontError('CMapName unspecified for encoding')
                cmap_name = 'unknown'
        if cmap_name in IDENTITY_ENCODER:
            return CMapDB.get_cmap(cmap_name)
        else:
            return CMap()

This is never true since literal_name always returns a str

If x is not an instance of PSLiteral it will return the __repr__(if repr exist). __repr__ function actually doesn't return a type str. So the name.replace operation isn't working for such cases.

In case the cmap_name is not an instance of PSLiteral but an instance of PDFStream, we would try to extract cmap_name, so if type(cmap_name) is PDFStream is helpful for this case.

With regards to PDFs mentioned in #210. Those PDF are being extracted fine with no errors.

For example, earlier 'DLIdentit-H' was mapped to 'Identity-H' which returned IdentityCMap(WMode=0) from CMapDB.get_cmap(). With this PR it returns CMap().

I have mapped DLIdentit-H intentionally with CMap() for the following reasons.

  • The PR @jtkese raised solved the issue by using .get method to extract cmap_name when the instance type is PDFStream.
  • I haven't found any conclusive evidence of considering same mapping for DLIdentit-H and Identity-H (documentations available online). Also I haven't seen DLIdentit-H as supported encoding.

Copy link
Member

@pietermarsman pietermarsman Jul 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If x is not an instance of PSLiteral it will return the repr(if repr exist). repr function actually doesn't return a type str.

That seems strange to me. __repr__ should always return a string. But I guess we can leave it as it is for now. Could you point out where this is happening? I cannot find any related use of repr() or __repr__() in the repo.

I haven't found any conclusive evidence of considering same mapping for DLIdentit-H and Identity-H

Good enough. Let's use the default of CMap() when we do not know what to do!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/pdftypes.py#L201

    def __repr__(self):
        if self.data is None:
            assert self.rawdata is not None
            return '<PDFStream(%r): raw=%d, %r>' % (self.objid, len(self.rawdata), self.attrs)
        else:
            assert self.data is not None
            return '<PDFStream(%r): len=%d, %r>' % (self.objid, len(self.data), self.attrs)

That seems strange to me. repr should always return a string.

Yes, I have also assumed that the string operations are valid for __repr__ before coming through this bug. This being the very reason many developers use __str__ data model for denoting class object.

Copy link
Member

@pietermarsman pietermarsman Jul 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds strange to me! I still do not understand how this __repr__() returns something else than a str (or an AssertionError). In which cases does this __repr__() returns something other than a string? If you could provide an example than we can open a new super specific issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which cases does this __repr__() returns something other than a string? If you could provide an example than we can open a new super specific issue.

I feel it's depending on PDFs. Let me see some PDF that have been reported by users as incompatible.

if strict:
raise PDFFontError('Encoding is unspecified')
cmap_name = 'unknown'
if type(cmap_name) is PDFStream:
Copy link
Member

@pietermarsman pietermarsman Jul 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If x is not an instance of PSLiteral it will return the repr(if repr exist). repr function actually doesn't return a type str.

That seems strange to me. __repr__ should always return a string. But I guess we can leave it as it is for now. Could you point out where this is happening? I cannot find any related use of repr() or __repr__() in the repo.

I haven't found any conclusive evidence of considering same mapping for DLIdentit-H and Identity-H

Good enough. Let's use the default of CMap() when we do not know what to do!

@tataganesh
Copy link
Member

The base branch should have been develop, not master. I merged without checking that. Any idea how I can revert this?

@fakabbir
Copy link
Contributor Author

fakabbir commented Aug 1, 2019

@ganeshtata For admins there should be a "revert" and "restore branch" button which appears in the PR thread. Just near the message. https://youtu.be/DgKNEH3capM

@fakabbir
Copy link
Contributor Author

fakabbir commented Aug 1, 2019

There is the Revert and View Details Button
image

This was referenced Aug 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants