-
Notifications
You must be signed in to change notification settings - Fork 944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pdfstream as cmap #264
Pdfstream as cmap #264
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! I have some small remarks though, and @ganeshtata also already made some small remarks on #228.
pdfminer/pdffont.py
Outdated
if type(name) is PDFStream: | ||
if 'CMapName' in name: | ||
name = name.get('CMapName').name | ||
if name in('DLIdent-H','OneByteIdentityH','Identity-H') : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ganeshtata made some suggestions about this: #228 (comment).
pdfminer/pdffont.py
Outdated
name = 'unknown' | ||
if type(name) is PDFStream: | ||
if 'CMapName' in name: | ||
name = name.get('CMapName').name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ganeshtata made some suggestions about this: #228 (comment)
I agree that overwriting name
is confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree,
pdfminer/pdffont.py
Outdated
if strict: | ||
raise PDFFontError(e) | ||
self.cmap = CMap() | ||
self.cmap = (spec, strict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first time a @property
is used in this repository. Do you have any special reason to introduce it here? Otherwise I would suggest keeping it simple and don't use properties.
self.cmap = (spec, strict) | |
self.cmap = self.get_cmap_from_spec(spec, strict) |
This PR solves #210 |
@fakabbir could you also add some tests? |
Yes, if you add one ore more tests this PR is totally ready for merging. Suggestion for test (the first is not working on master): from pdfminer.cmapdb import IdentityCMap
from pdfminer.pdffont import PDFCIDFont
from pdfminer.pdftypes import PDFStream
def test_pdf_cid_font_with_pdfstream_as_cmap_name():
stream = PDFStream({}, 'Identity-H')
spec = {'Encoding': stream}
font = PDFCIDFont(None, spec)
assert isinstance(font.cmap, IdentityCMap)
def test_pdf_cid_font_with_string_as_cmap_name():
spec = {'Encoding': 'Identity-H'}
font = PDFCIDFont(None, spec)
assert isinstance(font.cmap, IdentityCMap) |
Looking good. I have still a few questions left though: In fa40043 you remove the mapping from different cmap names to standard cmap names. I think this changes the behavior, is this on purpose? -CMAP_ENCODER = {
- 'DLIdent-H': 'Identity-H',
- 'OneByteIdentityH': 'Identity-H',
- 'Identity-H': 'Identity-H',
- 'DLIdent-V': 'Identity-V',
- 'OneByteIdentityV': 'Identity-V',
- 'Identity-V': 'Identity-V'
-}
+IDENTITY_ENCODER = ('Identity-H', 'Identity-V') Also, I have some remarks in the review. |
spec = {'Encoding': PSLiteral('Identity-V')} | ||
font = PDFCIDFont(None, spec) | ||
assert isinstance(font.cmap, IdentityCMap) | ||
|
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
if strict: | ||
raise PDFFontError('Encoding is unspecified') | ||
cmap_name = 'unknown' | ||
if type(cmap_name) is PDFStream: |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def literal_name(x):
if not isinstance(x, PSLiteral):
if settings.STRICT:
raise PSTypeError('Literal required: %r' % (x,))
else:
name=x
else:
name=x.name
if six.PY3:
try:
name = str(name,'utf-8')
except:
pass
return name
def get_cmap_from_spec(self, spec, strict):
try:
spec_encoding = spec['Encoding']
if hasattr(spec_encoding, 'name'):
cmap_name = literal_name(spec['Encoding'])
else:
cmap_name = literal_name(spec_encoding['CMapName'])
except KeyError:
if strict:
raise PDFFontError('Encoding is unspecified')
cmap_name = 'unknown'
if type(cmap_name) is PDFStream:
if 'CMapName' in cmap_name:
cmap_name = cmap_name.get('CMapName').name
else:
if strict:
raise PDFFontError('CMapName unspecified for encoding')
cmap_name = 'unknown'
if cmap_name in IDENTITY_ENCODER:
return CMapDB.get_cmap(cmap_name)
else:
return CMap()
This is never true since literal_name always returns a str
If x
is not an instance of PSLiteral
it will return the __repr__
(if repr exist). __repr__
function actually doesn't return a type str. So the name.replace operation isn't working for such cases.
In case the cmap_name is not an instance of PSLiteral but an instance of PDFStream, we would try to extract cmap_name
, so if type(cmap_name) is PDFStream
is helpful for this case.
With regards to PDFs mentioned in #210. Those PDF are being extracted fine with no errors.
For example, earlier 'DLIdentit-H' was mapped to 'Identity-H' which returned IdentityCMap(WMode=0) from CMapDB.get_cmap(). With this PR it returns CMap().
I have mapped DLIdentit-H intentionally with CMap() for the following reasons.
- The PR @jtkese raised solved the issue by using
.get
method to extract cmap_name when the instance type isPDFStream
. - I haven't found any conclusive evidence of considering same mapping for DLIdentit-H and Identity-H (documentations available online). Also I haven't seen DLIdentit-H as supported encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If x is not an instance of PSLiteral it will return the repr(if repr exist). repr function actually doesn't return a type str.
That seems strange to me. __repr__
should always return a string. But I guess we can leave it as it is for now. Could you point out where this is happening? I cannot find any related use of repr()
or __repr__()
in the repo.
I haven't found any conclusive evidence of considering same mapping for DLIdentit-H and Identity-H
Good enough. Let's use the default of CMap()
when we do not know what to do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/pdftypes.py#L201
def __repr__(self):
if self.data is None:
assert self.rawdata is not None
return '<PDFStream(%r): raw=%d, %r>' % (self.objid, len(self.rawdata), self.attrs)
else:
assert self.data is not None
return '<PDFStream(%r): len=%d, %r>' % (self.objid, len(self.data), self.attrs)
That seems strange to me. repr should always return a string.
Yes, I have also assumed that the string operations are valid for __repr__
before coming through this bug. This being the very reason many developers use __str__
data model for denoting class object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds strange to me! I still do not understand how this __repr__()
returns something else than a str
(or an AssertionError
). In which cases does this __repr__()
returns something other than a string? If you could provide an example than we can open a new super specific issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which cases does this
__repr__()
returns something other than a string? If you could provide an example than we can open a new super specific issue.
I feel it's depending on PDFs. Let me see some PDF that have been reported by users as incompatible.
if strict: | ||
raise PDFFontError('Encoding is unspecified') | ||
cmap_name = 'unknown' | ||
if type(cmap_name) is PDFStream: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If x is not an instance of PSLiteral it will return the repr(if repr exist). repr function actually doesn't return a type str.
That seems strange to me. __repr__
should always return a string. But I guess we can leave it as it is for now. Could you point out where this is happening? I cannot find any related use of repr()
or __repr__()
in the repo.
I haven't found any conclusive evidence of considering same mapping for DLIdentit-H and Identity-H
Good enough. Let's use the default of CMap()
when we do not know what to do!
The base branch should have been develop, not master. I merged without checking that. Any idea how I can revert this? |
@ganeshtata For admins there should be a "revert" and "restore branch" button which appears in the PR thread. Just near the message. https://youtu.be/DgKNEH3capM |
Extension of work done in #228.
Encapsulates initialization of cmap in a setter function.