Parsing big XML files with "lxml.objectify.fromstring" returns an error #87

antonhagg · 2016-01-27T15:24:25Z

This is mainly related to #78, where an XML file can grow quite big (in my case it's around 500 MB and contains 779917 files and 90361 folders). But I guess this could happen in other cases too.

Anyway, lxml lets you pass a custom parser created with the "huge_tree" option (http://stackoverflow.com/questions/11850345/using-python-lxml-etree-for-huge-xml-files). Would this be an option, or is there another way of parsing large XML files, for example in chunks?
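
For reference, a minimal sketch of what the "huge_tree" approach could look like with lxml.objectify. The helper name parse_large and the sample XML are made up for illustration, and whether this resolves the XMLSyntaxError shown below has not been tested here:

```python
import lxml.objectify


def parse_large(xml_bytes):
    """Parse an XML byte string with libxml2's size safety limits lifted."""
    # makeparser() returns an objectify-aware parser; huge_tree=True disables
    # libxml2's security restrictions on very deep trees and very long text
    parser = lxml.objectify.makeparser(huge_tree=True)
    return lxml.objectify.fromstring(xml_bytes, parser)


# Tiny stand-in for a real "?mode=list" response (hypothetical structure)
sample = b"<folder name='Backup2'><file name='a.txt'/><file name='b.txt'/></folder>"
root = parse_large(sample)
print(root.tag, root.countchildren())
```

Note that huge_tree only lifts the parser's limits; the whole document still ends up in memory as a tree.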

reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Wed, 27 Jan 2016 14:13:59 GMT
header: Accept-Ranges: bytes
header: path_list_total_files: 779917
header: path_list_total_folders: 90361
header: Content-Type: text/xml
header: Transfer-Encoding: chunked
header: Server: Jetty(8.1.4.v20120524)
DEBUG:requests.packages.urllib3.connectionpool:"GET /jfs/XX/Jotta/Sync/Backup2?mode=list HTTP/1.1" 200 None
Traceback (most recent call last):
  File "C:\Python27\Scripts\jotta-download-script.py", line 9, in <module>
    load_entry_point('jottalib==0.4.1.post1', 'console_scripts', 'jotta-download')()
  File "c:\python27\lib\site-packages\jottalib\cli.py", line 258, in download
    fileTree = remote_object.filedirlist().tree #Download the folder tree
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 304, in filedirlist
    return self.jfs.getObject(url)
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 851, in getObject
    o = self.get(url)
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 839, in get
    o = lxml.objectify.fromstring(self.raw(url))
  File "src/lxml/lxml.objectify.pyx", line 1801, in lxml.objectify.fromstring (src\lxml\lxml.objectify.c:26755)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:82934)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:124533)
  File "src/lxml/parser.pxi", line 1707, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:123074)
  File "src/lxml/parser.pxi", line 1079, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:117114)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:111367)
lxml.etree.XMLSyntaxError: None

antonhagg · 2016-01-28T13:26:43Z

So I have found a workaround for this: first write the XML to a file, then parse it from that file. This way the raw response string and the parsed tree don't both have to be in memory at the same time. Is this an acceptable solution?

import os
import lxml.objectify

def get(self, url):
    'Make a GET request for url and return the response content as a generic lxml object'
    url = self.escapeUrl(url)
    if "?mode=list" in url:  # A full directory tree was requested; it can be very large
        # Write the response to disk first so that the raw string and the
        # parsed tree don't both sit in memory at the same time.
        # "wb" assumes self.raw() returns a byte string.
        with open("temp.xml", "wb") as xml_file:
            xml_file.write(self.raw(url))
        try:
            o = lxml.objectify.parse("temp.xml").getroot()
        finally:
            os.remove("temp.xml")  # Clean up even if parsing fails
    else:
        o = lxml.objectify.fromstring(self.raw(url))
    if o.tag == 'error':
        JFSError.raiseError(o, url)
    return o

havardgulldahl · 2016-06-11T06:50:43Z

Hey @antonhagg, I think you are right, we need to do something to limit our resource requirements. I'll take a look at your code, thanks!

havardgulldahl · 2016-07-02T16:36:08Z

Maybe we could try to create a StringIO object and, if we see that the file is really big, write it to disk.

Then we parse with objectify.parse(fileobject).
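
A rough sketch of that idea, using the standard library's tempfile.SpooledTemporaryFile in place of a hand-rolled StringIO-then-disk switch: it buffers in memory and moves to a temporary file once a size threshold is passed. This is not the code that was later committed for this issue; the helper name spool_chunks, the threshold, and the sample chunks are assumptions for illustration:

```python
import tempfile


def spool_chunks(chunks, max_in_memory=1024 * 1024):
    """Collect byte chunks in memory, spilling to a temp file past max_in_memory."""
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # Rewind so a parser can read from the start
    return buf


# Stand-in for a chunked HTTP response body; the threshold is set very low
# here so that this small example actually spills over to disk
with spool_chunks([b"<root>", b"<item/>" * 3, b"</root>"], max_in_memory=8) as fileobject:
    data = fileobject.read()
```

In jottalib the chunks would presumably come from the HTTP response, and the rewound file object would be handed to objectify.parse(fileobject) instead of being read back as above; lxml's parse functions accept file-like objects that provide a read() method.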

antonhagg · 2016-07-05T10:34:48Z

Sounds like a good idea. I won't have time to do anything until August, so if anyone else is up for the job, feel free. =)

havardgulldahl · 2016-08-26T17:21:34Z

@antonhagg I had a go at it. Will you please test to see if the current code in master works for you now?

antonhagg · 2016-08-29T15:25:20Z

Since "folder download" is not in the 0.5.1 release, I will have to add that first.
I tried a new installation of 0.5.1, but ran into a lot of trouble... I will have to sort that out first.

havardgulldahl · 2016-09-05T18:34:41Z

@antonhagg The code has not been released yet. Are you able to install from git head? That is, with git clone, and not with pip?

antonhagg mentioned this issue Jan 29, 2016

Master #89

Closed

havardgulldahl added the bug label Jun 11, 2016

havardgulldahl added a commit that referenced this issue Aug 26, 2016

Fixing #87. Parsing big xml files

8506a7c

havardgulldahl added a commit that referenced this issue Aug 26, 2016

Test for #87

2462f1a

havardgulldahl added this to the 0.6 milestone Aug 26, 2016
