
Breaking up large XML files

Andrew Philpot edited this page May 20, 2015 · 1 revision

Because Karma uses a non-streaming XML parser, it does not work well with very large XML files. You can use a tool such as xml_split to break these files into more manageable fragments.

xml_split is a Perl script distributed with the Perl XML::Twig package.

sudo cpan install XML::Twig

xml_split will be installed by default somewhere under your Perl directory. For me this is /opt/local/libexec/perl5.16/sitebin/

/opt/local/libexec/perl5.16/sitebin/xml_split -s 100M input.xml

will create files of roughly the given size, named input-00.xml, input-01.xml, and so on. I used 100M.

input-00.xml is a skeleton file instructing xml_merge how to put those files back together to recreate the exact original. We should ignore it. Each of the other input-NN.xml files will contain top-level objects from our input file, wrapped in a tag which looks like <xml_split:root xmlns:xml_split="http://xmltwig.com/xml_split"> ... </xml_split:root>
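To see the wrapper for yourself, check the first and last lines of any fragment. The sample file below is fabricated for illustration (it mimics the shape of xml_split's output; the inner record is made up):

```shell
# Fabricate a fragment shaped like xml_split's output (illustrative only).
printf '%s\n' \
  '<xml_split:root xmlns:xml_split="http://xmltwig.com/xml_split">' \
  '<record>payload</record>' \
  '</xml_split:root>' > sample-01.xml

head -n 1 sample-01.xml   # prints the opening wrapper line
tail -n 1 sample-01.xml   # prints the closing wrapper line
```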

We can simply get rid of the first and last line of each file:

bash> rm input-00.xml
bash> for file in input-*.xml; do sed -i '' '1d;$d' "$file"; done

(Note the glob input-*.xml rather than input*.xml, so the original input.xml is not touched.)

For the record, the full timed command I ran on a real input was:

time /opt/local/libexec/perl5.16/sitebin/xml_split -s 100M ducksouth.com.xml