Parsing big XML files with "lxml.objectify.fromstring" returns an error #87

Open
antonhagg opened this issue Jan 27, 2016 · 7 comments

@antonhagg

This is mainly related to #78, where an XML file can grow quite big (in my case it's around 500 MB and contains 779917 files and 90361 folders). But I guess this could happen in other cases too.

Anyway, there is an option to use a custom parser with the "huge_tree" option (http://stackoverflow.com/questions/11850345/using-python-lxml-etree-for-huge-xml-files). Would this be an option, or is there another way of parsing large XML files, for example in chunks?
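
A minimal sketch of the two options raised above, assuming lxml is installed. `parse_huge` builds an objectify parser with `huge_tree=True`, which lifts libxml2's built-in size and depth limits (so it is best kept to trusted input). `count_tags` is the chunked alternative, streaming through the document with `lxml.etree.iterparse`. The function names, the `SAMPLE` document, and its `folder`/`file` element names are made up for illustration; they are not jottalib code and may not match the real server response.

```python
import io

from lxml import etree, objectify

# Hypothetical stand-in for a (much larger) directory listing
SAMPLE = b"<folder><file name='a'/><file name='b'/></folder>"


def parse_huge(xml_bytes):
    """Parse a whole document into an objectify tree with huge_tree enabled."""
    parser = objectify.makeparser(huge_tree=True)
    return objectify.fromstring(xml_bytes, parser=parser)


def count_tags(fileobj, tag):
    """Stream through the document and count <tag> elements."""
    count = 0
    for _event, elem in etree.iterparse(
        fileobj, events=("end",), tag=tag, huge_tree=True
    ):
        count += 1
        # Free this element's content; the emptied element itself
        # stays attached to its parent until the parse finishes.
        elem.clear()
    return count


root = parse_huge(SAMPLE)
print(root.tag, count_tags(io.BytesIO(SAMPLE), "file"))
```

The first function still builds the full tree in memory, so it only addresses the parser limit, not the memory footprint. The streaming variant keeps memory lower, but callers no longer get a complete objectify tree to navigate.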

reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Wed, 27 Jan 2016 14:13:59 GMT
header: Accept-Ranges: bytes
header: path_list_total_files: 779917
header: path_list_total_folders: 90361
header: Content-Type: text/xml
header: Transfer-Encoding: chunked
header: Server: Jetty(8.1.4.v20120524)
DEBUG:requests.packages.urllib3.connectionpool:"GET /jfs/XX/Jotta/Sync/Backup2?mode=list HTTP/1.1" 200 None
Traceback (most recent call last):
  File "C:\Python27\Scripts\jotta-download-script.py", line 9, in <module>
    load_entry_point('jottalib==0.4.1.post1', 'console_scripts', 'jotta-download')()
  File "c:\python27\lib\site-packages\jottalib\cli.py", line 258, in download
    fileTree = remote_object.filedirlist().tree #Download the folder tree
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 304, in filedirlist
    return self.jfs.getObject(url)
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 851, in getObject
    o = self.get(url)
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 839, in get
    o = lxml.objectify.fromstring(self.raw(url))
  File "src/lxml/lxml.objectify.pyx", line 1801, in lxml.objectify.fromstring (src\lxml\lxml.objectify.c:26755)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:82934)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:124533)
  File "src/lxml/parser.pxi", line 1707, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:123074)
  File "src/lxml/parser.pxi", line 1079, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:117114)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:111367)
lxml.etree.XMLSyntaxError: None
@antonhagg
Author

So I have found a workaround for this: first write the XML to a file, then parse it from that file. This means the raw XML string and the parsed tree don't both have to be in memory at the same time. Is this an acceptable solution?

import os
import lxml.objectify

def get(self, url):
    'Make a GET request for url and return the response content as a generic lxml object'
    url = self.escapeUrl(url)
    if "?mode=list" in url:  # A full directory tree was requested, so the response may be huge
        # Write the response to disk first ("w" truncates any leftover temp.xml)
        with open("temp.xml", "w") as text_file:
            text_file.write(self.raw(url))
        try:
            # Let lxml read from the file instead of from an in-memory string
            o = lxml.objectify.parse("temp.xml").getroot()
        finally:
            os.remove("temp.xml")  # Clean up even if parsing fails
    else:
        o = lxml.objectify.fromstring(self.raw(url))
    if o.tag == 'error':
        JFSError.raiseError(o, url)
    return o

@antonhagg antonhagg mentioned this issue Jan 29, 2016
@havardgulldahl
Owner

Hey @antonhagg, I think you are right, we need to do something to limit our resource requirements. I'll take a look at your code, thanks!

@havardgulldahl
Owner

havardgulldahl commented Jul 2, 2016

Maybe we could try to create a StringIO object and, if we see that the file is really big, write it to disk instead.

Then we parse with objectify.parse(fileobject).
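
One way this idea could be sketched is with the standard library's `tempfile.SpooledTemporaryFile`, which behaves like an in-memory buffer until it grows past `max_size` and then rolls over to a file on disk. This is only an illustration under assumptions, not the change that later landed in jottalib: the helper name `spool_chunks`, the 10 MB default threshold, and the sample chunks are all hypothetical.

```python
import tempfile


def spool_chunks(chunks, max_in_memory=10 * 1024 * 1024):
    """Collect byte chunks in memory, spilling to a temp file past max_in_memory bytes."""
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so a parser can read from the start
    return buf


# Hypothetical stand-in for the chunks of an HTTP response body
chunks = [b"<folder>", b"<file name='a'/>", b"</folder>"]

# A tiny threshold forces the rollover to disk for demonstration purposes
with spool_chunks(chunks, max_in_memory=16) as buf:
    data = buf.read()
```

With lxml available, the returned file object could be passed to `lxml.objectify.parse(buf).getroot()` in place of `buf.read()`, so small responses stay in memory and large ones are parsed from disk.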

@antonhagg
Author

Sounds like a good idea. I won't have time to do anything until August, so if anyone else is up for the job, feel free. =)

havardgulldahl added a commit that referenced this issue Aug 26, 2016
@havardgulldahl
Owner

@antonhagg I had a go at it, will you please test to see if current code in master works for you now?

@havardgulldahl havardgulldahl added this to the 0.6 milestone Aug 26, 2016
@antonhagg
Author

Since "folder download" is not in the 0.5.1 release, I will have to add that first.
I tried a fresh installation of 0.5.1 but ran into a lot of trouble, so I will have to sort that out first.

@havardgulldahl
Owner

@antonhagg The code has not been released yet. Are you able to install from git head? That is, with git clone rather than with pip?
