Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fails to parse XML comment as stream #14

Open
strarsis opened this issue Sep 7, 2020 · 14 comments
Open

Fails to parse XML comment as stream #14

strarsis opened this issue Sep 7, 2020 · 14 comments

Comments

@strarsis
Copy link

strarsis commented Sep 7, 2020

When a SVG XML file begins with an XML comment, the SVG file is not parsed (stream ends immediately):

<!-- Test -->
<svg height="200" width="500">
  <polyline points="20,20 40,25 60,40 80,120 120,140 200,180" style="fill:none;stroke:black;stroke-width:3" />
</svg>

Without the comment, parsing as stream works fine:

<svg height="200" width="500">
  <polyline points="20,20 40,25 60,40 80,120 120,140 200,180" style="fill:none;stroke:black;stroke-width:3" />
</svg>
@TobiasNickel
Copy link
Owner

that looks like a bug, thanks, I will have a test till the weekend.

@TobiasNickel
Copy link
Owner

I just added comment support in streams, that was missing: 5e55ea2#diff-39fbd46f4381411507cbf464fd4fea02R491

And had some more tests. But first I want to help you with your task at hand. I think it is better to read the complete svg file and parse it in one go, even if the file is 20mb. If you need to do lots of processing, I would advise to use workers processes, maybe using piscina.

So, back to the comments in stream issue. I found this svg file currently only works when the svg file has a trailing comma. You have to know, that when parsing the svg not skipping the opening svg tag, will in the async loop still only do one round, and giving you the entire SVG. If you really need comments in the file, can you add them after the opening svg-tag? When you skip the initial tag, you can get its children one by one. But when the elements are deeper nested, it is likely to make more sense to parse the complete file in one go.

The stream was implemented and debugged using the OSM planet file. there is the initial type-definition, then a genral osm tag followed by the list of all node and waytags, in a pretty much flat manner. It has tag tags nested, but that is ok. For parsing the big wikipedia export file, the structure is quite similar, some initial stuff followed by articles, followed by history information. So big database exports are usually in a pretty much flat format.

the current implementation ignores the comments. I will have some more test before publish to NPM. Thanks for opening this issue, let me know your thoughts.

@strarsis
Copy link
Author

strarsis commented Sep 12, 2020

@TobiasNickel: The current svgo uses the sax parser which is sadly not maintained anymore, and issues with characters/whitespaces/comments have accumulated, resulting in numerous issues (see the linked issues in that PR, that tries to use saxes, but ran into performance issues).

This is the parsing part of SVGO: https://github.com/svg/svgo/blob/master/lib/svgo/svg2js.js#L26
It transforms the SVG to the svgo AST, then svgo plugins modify that AST and in the the third and last step the svgo AST is stringified back to an SVG. As you can see, as sax is used, some kind of streaming must be handled. An advantage for using streaming is that for each node some CSS-related processing is done. It is fairly quick, but doing this in the parsing step should be better for performance than building the whole SVG DOM tree and recursively traversing through it afterwards.

When you could drop-in/integrate your tXml into the svg2js.js part (ideally by PR), those previously mentioned SVG parsing issues with sax should be fixed and also fixed in the future (as this project is maintained), and maybe even a performance boost achieved. The stringifcation of the svgo AST is done separately, so the parser only has to parse the SVG, no AST translation back to string is required.

@TobiasNickel
Copy link
Owner

It would be cool, if svgo could profit from this library. still, the streaming works much different than on other libraries. Not every tag, text, attribute is an event/item, but complete sub-trees. )

At the weekend I made two things, fork svgo and re-implement the scg2js.js. Second, an update on the parser is needed. Throw error when some tag was broken and extened parsing parsing for the doctype. With it currently I was able to fulfill all of the libs own tests, but not the plugin tests jet.

@strarsis what do you think?

for now the changes are in a branch, as i think they are breaking changes in txml. I need some more tests. now that there are a number of users for this lib, we have to be more careful. however woth this work, this parser can get more stable.

@strarsis
Copy link
Author

strarsis commented Sep 14, 2020

@TobiasNickel: Awesome!

  1. The tests run nice and fast, that's great!
  2. From a first glance some tests fail because text is missing and something with the line endings is amiss.
    For example the test sortDefsChildren.01 fails because the <text/> elements are missing their text referenced text.
    And it seems that nodes that contain prefixed attributes are skipped (e.g. xlink:href).
    Then there are tests that just fail because the line breaks or white spaces are slightly different, these tests can be adjusted to pass by updating the expected strings.
  3. Concerning the backward-compatibility of this parser: Node-level streaming parsing is quite useful, maybe you could try to find a common denominator for both streaming approaches (sub-trees and individual nodes) and then offer two different streaming methods/modes?
  4. The following parser issues/requirements cropped up and are unresolved because of sax
    (though most of these issues are probably not a problem with tXml):

@TobiasNickel
Copy link
Owner

hi, just want to say, this matter is not forgotten. Last week I was working on a blog post about Web-Push. Today, I continued a little with debugging the svgo fork. I pass about 15 more plugin tests.

@TobiasNickel
Copy link
Owner

hi, little update, Today, I was able to run tall tests of svgo !!! before making a PR, I first have to make an update to txml. absolutely, some update to the parser was needed. The work of working through this, has lead to some very good changes, to txml, that I am going to publish soon as version 4.

@strarsis
Copy link
Author

@TobiasNickel: This is awesome! Finally some solid progress is possible with svgo.

@TobiasNickel
Copy link
Owner

wow, version 4 of txml is published. now it does not directly export the parse function, but an object with a parse function. this is for better compatibility between node and typescript modules.

and here is the PR to svgo:
svg/svgo#1301

only one file changed and all its tests passed.

@TobiasNickel
Copy link
Owner

and, as for this issue, there is now the keepComments. it can be used for parse and for transformStream.

@santialbo
Copy link

I wansn't sure about opening another issue but processing instructions also cause transformStream to not return anything. Use this xml as example:

<?xml version="1.0" encoding="UTF-8"?>
<a>b</a>

@TobiasNickel
Copy link
Owner

Hi @santialbo. Sorry for by late reply, I was kind of busy during chinese new year.

The parsing of the processing instructions was quite a new feature for the main parse method since December. for the transformStream txml would have to handle the parsing of those instructions separately.

If your file is not to big (even up to ~50-100mb and if you give node.js enough memory, even parsing 2gb would be possible, but you can imagine not optimal) it is ok to parse the complete thing at once (as long as it is only one file at a time).

usually large xml files (wikipedia export or open street map planet file) have a structure like this:

<? processing instructions ?>
<somewrapper>
  <item><child>data</child></item>
  <item><child>data1</child></item>
  <item><child>data2</child></item>
  <item><child>data3</child></item>
</somewrapper>

When parsing large files, with txml you need to first skip over the pre-ample in my example that is until after <somewrapper>. Using a for await loop you will then get one item at a time. Never a child, but child will always be in the children property of the item. To read child and data from the item is custom logic developed by you.

As for now, I am not sure. Are there xml files with a structure without a wrapper?:

<? processing instructions ?>
<item><child>data</child></item>
<item><child>data1</child></item>
<item><child>data2</child></item>
<item><child>data3</child></item>

Currently you can still parse the items with txml transformStream, but not the processing instructions, but not the items:

  const processingInstructionStream = fs.createReadStream(files.processingInstructionStream)
    .pipe(tXml.transformStream(`<?xml version="1.0" encoding="UTF-8"?>`));
  for await(let element of processingInstructionStream) {
    console.log(element)
  }

note: that the first argument to transformStream is an offset as number of string (if its a string only its length is used).

I hope this explanation is clear. can I ask what kind of xml documents you want to process?

@ambianBeing314
Copy link

ambianBeing314 commented Jun 28, 2022

Hi @TobiasNickel, first thanks for this awesome lib. I have a pretty large xml file ~200MB or more and trying to use the transformStream(), it return no elements when the file has xml declaration at the top.

<?xml version="1.0" encoding="utf8"?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
const xmlStream = fs
    .createReadStream('note.xml')
    .pipe(txml.transformStream())

for await(let element of xmlStream) {
    console.log(element)
  }

Also if the file is formatted without the xml decalration,

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

like with spaces and tabs it only works with this < OR 0 is passed to the transformStream()

const xmlStream = fs
    .createReadStream('note.xml')
    .pipe(txml.transformStream('<')) // this works

const xmlStream = fs
    .createReadStream('note.xml')
    .pipe(txml.transformStream('0')) // this works

const xmlStream = fs
    .createReadStream('note.xml')
    .pipe(txml.transformStream()) // this does not work

could you explain more how to use this parameter, in what cases

@TobiasNickel
Copy link
Owner

the idea of the offset(the first argument is to skip the preample that is not data). in your case that would be: '<?xml version="1.0" encoding="utf8"?>'

or '<?xml version="1.0" encoding="utf8"?>'.length.

does this work for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants