KeyError: "There is no item named 'word/NULL' in the archive" #797

Closed
KristenMoore opened this issue Mar 17, 2020 · 8 comments
Comments

@KristenMoore

I'm getting this error when opening an unremarkable-looking Word file from a corpus of over 1K Word files, none of which has had a problem before. Opening it in Word and saving it made no difference.

<ipython-input-122-9b21415cda29> in <module>
----> 1 doc = docx.Document(f"data/{filename}")

/usr/local/lib/python3.7/site-packages/docx/api.py in Document(docx)
     23     """
     24     docx = _default_docx_path() if docx is None else docx
---> 25     document_part = Package.open(docx).main_document_part
     26     if document_part.content_type != CT.WML_DOCUMENT_MAIN:
     27         tmpl = "file '%s' is not a Word file, content type is '%s'"

/usr/local/lib/python3.7/site-packages/docx/opc/package.py in open(cls, pkg_file)
    126         *pkg_file*.
    127         """
--> 128         pkg_reader = PackageReader.from_file(pkg_file)
    129         package = cls()
    130         Unmarshaller.unmarshal(pkg_reader, package, PartFactory)

/usr/local/lib/python3.7/site-packages/docx/opc/pkgreader.py in from_file(pkg_file)
     34         pkg_srels = PackageReader._srels_for(phys_reader, PACKAGE_URI)
     35         sparts = PackageReader._load_serialized_parts(
---> 36             phys_reader, pkg_srels, content_types
     37         )
     38         phys_reader.close()

/usr/local/lib/python3.7/site-packages/docx/opc/pkgreader.py in _load_serialized_parts(phys_reader, pkg_srels, content_types)
     67         sparts = []
     68         part_walker = PackageReader._walk_phys_parts(phys_reader, pkg_srels)
---> 69         for partname, blob, reltype, srels in part_walker:
     70             content_type = content_types[partname]
     71             spart = _SerializedPart(

/usr/local/lib/python3.7/site-packages/docx/opc/pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)
    108                 phys_reader, part_srels, visited_partnames
    109             )
--> 110             for partname, blob, reltype, srels in next_walker:
    111                 yield (partname, blob, reltype, srels)
    112 

/usr/local/lib/python3.7/site-packages/docx/opc/pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)
    108                 phys_reader, part_srels, visited_partnames
    109             )
--> 110             for partname, blob, reltype, srels in next_walker:
    111                 yield (partname, blob, reltype, srels)
    112 

/usr/local/lib/python3.7/site-packages/docx/opc/pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)
    103             reltype = srel.reltype
    104             part_srels = PackageReader._srels_for(phys_reader, partname)
--> 105             blob = phys_reader.blob_for(partname)
    106             yield (partname, blob, reltype, part_srels)
    107             next_walker = PackageReader._walk_phys_parts(

/usr/local/lib/python3.7/site-packages/docx/opc/phys_pkg.py in blob_for(self, pack_uri)
    106         matching member is present in zip archive.
    107         """
--> 108         return self._zipf.read(pack_uri.membername)
    109 
    110     def close(self):

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in read(self, name, pwd)
   1404     def read(self, name, pwd=None):
   1405         """Return file bytes (as a string) for name."""
-> 1406         with self.open(name, "r", pwd) as fp:
   1407             return fp.read()
   1408 

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in open(self, name, mode, pwd, force_zip64)
   1443         else:
   1444             # Get info object for name
-> 1445             zinfo = self.getinfo(name)
   1446 
   1447         if mode == 'w':

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py in getinfo(self, name)
   1371         if info is None:
   1372             raise KeyError(
-> 1373                 'There is no item named %r in the archive' % name)
   1374 
   1375         return info

KeyError: "There is no item named 'word/NULL' in the archive"

@scanny
Contributor

scanny commented Mar 17, 2020

Hi Kristen. This is a problem we see occasionally. Our best guess is that there is a Word plugin like maybe Small Business Productivity Pak or something like that which is not too careful about cleaning up after itself when it deletes things.

Unfortunately there's no easy fix, but depending on your skill level and determination you can fix it. You can find some other issues related to it by searching Google for: NULL relationship "python-docx", and some others by substituting "python-pptx" for "python-docx".

The way I would fix it on a single file would be to extract the package using opc-diag, grep through the relationship files to find NULL with something like grep NULL *.rels, and then just delete the offending relationship line.
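For anyone who would rather script that step than use opc-diag and grep, here is a minimal sketch using only the standard library. It is not part of python-docx; the function name `strip_null_relationships` and the regular expression are illustrative assumptions. It assumes the offending entry is a self-closing `<Relationship .../>` element with `Target="NULL"` in double quotes, and it writes a repaired copy rather than touching the original file.

```python
import re
import zipfile

# Matches one self-closing <Relationship .../> element whose Target is exactly "NULL".
# Assumption: attributes are double-quoted, as Word normally writes them.
NULL_REL = re.compile(rb'<Relationship\b[^>]*\bTarget="NULL"[^>]*/>')


def strip_null_relationships(src_path, dst_path):
    """Copy the package at src_path to dst_path without NULL-target relationships.

    Returns the names of the .rels members that were changed.
    """
    changed = []
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(
        dst_path, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name.endswith(".rels"):
                cleaned = NULL_REL.sub(b"", data)
                if cleaned != data:
                    changed.append(name)
                    data = cleaned
            # Every other member is copied through unchanged.
            dst.writestr(name, data)
    return changed
```

Any element in the document XML that still refers to the removed relationship ID is left as-is, so it is worth opening the repaired copy in Word to check it.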

There might be one or two more accessible ways if that sounds like Greek.

Let us know how you go.

@KristenMoore
Author

KristenMoore commented Mar 20, 2020

Thanks for the quick reply. This all makes sense, and I looked up some of the other issues as you suggested, but I can't find NULL in any relationship file (or anywhere else, for that matter) in this doc.

@scanny
Contributor

scanny commented Mar 20, 2020

Can you share the doc? You can send it to me by email if you want. Otherwise, I don't see how it could not be in there, given where the exception is happening. "NULL" is not a string that would normally occur in Python, so it would indicate something like Java or C# as the source.

Have you unzipped the .docx file and grep-ed it for "NULL"?
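If unzipping by hand is inconvenient, the same check can be sketched in Python. Searching the .docx file directly generally will not match, because the members inside the archive are usually compressed, so the search has to run on the decompressed contents. The helper name `members_containing` is a hypothetical one chosen for this example.

```python
import zipfile


def members_containing(docx_path, needle=b"NULL"):
    """Return (member name, occurrence count) pairs for members containing needle."""
    hits = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            # zf.read() returns the decompressed bytes of the member.
            count = zf.read(name).count(needle)
            if count:
                hits.append((name, count))
    return hits
```

Calling `members_containing("data/some-file.docx")` (path is a placeholder) lists every part where the string appears, which narrows down which .rels file to inspect.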

@KristenMoore
Author

Apologies - it worked. Don't know what I did wrong the first time.
Many thanks.

@scanny
Contributor

scanny commented Mar 23, 2020

No worries, glad you got it working Kristen :)

@aorsten

aorsten commented Dec 3, 2020

I get a similar error message: KeyError: "There is no item named 'word/#MyBookmark' in the archive"

To reproduce:

  1. Adding a picture to the Word file
  2. Right-click -> Link
  3. Add link to an internal bookmark.

Then the hyperlink ends up like this; note the a:hlinkClick relationship ID:

                <w:drawing>
                    <wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="2A30E332" wp14:editId="3F2B5F70">
                       (...)
                        <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
                            <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
                                <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
                                    <pic:nvPicPr>
                                        <pic:cNvPr id="28" name="Picture 28" descr="(...)">
                                            <a:hlinkClick r:id="rId21"/>
                                        </pic:cNvPr>
                                        <pic:cNvPicPr/>
                                    </pic:nvPicPr>
                                    <pic:blipFill>
                                        (...)
                                    </pic:blipFill>
                                    (...)
                                </pic:pic>
                            </a:graphicData>
                        </a:graphic>
                    </wp:inline>
                </w:drawing>

Now, in word/_rels/document.xml.rels, we get:

    <Relationship Id="rId21" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="#MyBookmark"/>

This relationship breaks python-docx for me. I'll admit I'm using a 2.5-year-old version of the package, since I needed to modify it for my own use case, so I am not sure whether this has been fixed since then. I was looking for whether this had been solved somehow, and it seems closely related to this issue.

@scanny Do you reckon this is easily solved, and do you have any suggestions as to how? I see in the pkgreader that target_mode can be used to identify external targets, and that external targets receive special treatment to avoid such zipfile issues. From what I gather, RT.HYPERLINK relationships whose Target starts with # should be treated specially, as some sort of internal bookmark relationship (or similar).
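To make the distinction concrete, here is a rough standalone sketch of that classification. It is not python-docx code and does not show how the library decides what to load; `classify_relationships` and the labels "external", "fragment", and "part" are names made up for this example. It parses the XML of a .rels part and separates targets that point outside the package, targets that are only a `#bookmark` fragment, and targets that name an actual part in the zip.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship (.rels) parts.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def classify_relationships(rels_xml):
    """Map each relationship Id to 'external', 'fragment', or 'part'."""
    kinds = {}
    root = ET.fromstring(rels_xml)
    for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
        target = rel.get("Target", "")
        if rel.get("TargetMode") == "External":
            kinds[rel.get("Id")] = "external"  # e.g. a web URL; nothing to read from the zip
        elif target.startswith("#"):
            kinds[rel.get("Id")] = "fragment"  # internal bookmark; no zip member either
        else:
            kinds[rel.get("Id")] = "part"  # should correspond to a member of the archive
    return kinds
```

Under this reading, only the "part" relationships would be looked up in the archive, so a target such as `#MyBookmark` would never reach `zipfile`.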

@scanny
Contributor

scanny commented Dec 3, 2020

@aorsten probably a separate ticket is best. You can refer to this one from there if you think this is related enough, but this seems like possibly an ambiguity in the spec rather than a (maybe) violation of it like the NULL relationship one is.

Make sure to include the stack trace in the report.

The error seems to be coming from attempting to load the part, so wherever the code is deciding which relationships are loadable sounds like the right neighborhood. Probably in opc/package.py or thereabouts.

@wonzer

wonzer commented Nov 1, 2022

You can find a solution here: #1105 (comment)
