HTMLFileContent and component for extracting IDL fragments #46

m-cheung · 2017-05-30T17:29:46Z

This work is towards #41. This PR is dependent on foam-framework/foam2#415 and foam-framework/foam2#423

arobins · 2017-05-30T18:14:40Z

lib/org/chromium/webidl/HTMLFileContent.js

+            });
+          } else {
+            // Currently not doing anything for:
+            //  - Self closing tags (item.type.name === OPEN_CLOSE)


OPEN_CLOSE that aren't pre

arobins · 2017-05-30T18:15:56Z

lib/org/chromium/webidl/HTMLFileContent.js

+
+        if (foam.core.FObject.isInstance(item)) {
+          var top = tags[tags.length - 1];
+          if (top === undefined || item.type.name === OPEN) {


I don't think you need the `top === undefined` check. Only open tags should be pushed, so if you somehow end up with an empty stack and an extra close tag (or plain text, if parseString returns text), not pushing is probably the right thing to do.

arobins · 2017-05-30T18:19:13Z

test/node/parsing/HTMLFileContent-test.js

+  });
+
+  it('should parse a pre tag with no content', function() {
+    var content = '<pre class="idl"></pre>';


Missing the parse step here

Whoops, this was actually one of the old tests that I am no longer using. I will likely modify or remove this test in the next commit.

arobins · 2017-05-30T18:28:24Z

lib/org/chromium/webidl/HTMLFileContent.js

+            // Determine if node is of class IDL
+            var isIDL = false;
+            item.attributes.forEach(function(attr) {
+              if (attr.name === 'class' && attr.value.split(' ').includes('idl')) {


Wasn't there something about having to scan back up the stack for a parent tag with the right class?

I thought about this for a little and decided to implement it slightly differently.

Scanning the stack every time we encounter a tag we are potentially interested in seemed like it could be very costly (e.g. when tags are deeply nested). So I am using a variable (skipStack) to keep track of the levels in the tag stack at which excluded tags sit. It behaves as a stack itself, since excluded tags may be nested.

Please let me know if you see any flaws / problems with this approach.
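
For readers following the thread, the skipStack idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the token shape (`type`, `nodeName`, `classes`, `text`), the `collectIncludedText` name, and the `EXCLUDED` class list are all assumptions made for the example and do not reflect the actual HTMLLexer output.

```javascript
// Hypothetical token shape (NOT the real HTMLLexer output):
//   { type: 'OPEN' | 'CLOSE' | 'TEXT', nodeName, classes, text }
// Hypothetical list of class names whose contents should be skipped.
var EXCLUDED = ['example', 'note'];

function isExcluded(classes) {
  return (classes || []).some(function(cls) {
    return EXCLUDED.indexOf(cls) !== -1;
  });
}

function collectIncludedText(tokens) {
  var tagStack = [];   // Currently open tags.
  var skipStack = [];  // Indices into tagStack of open excluded tags.
  var out = [];

  tokens.forEach(function(tok) {
    if (tok.type === 'OPEN') {
      tagStack.push(tok);
      if (isExcluded(tok.classes)) skipStack.push(tagStack.length - 1);
    } else if (tok.type === 'CLOSE') {
      var top = tagStack[tagStack.length - 1];
      if (top && top.nodeName === tok.nodeName) {
        tagStack.pop();
        // If the tag just closed was the innermost excluded tag,
        // stop tracking it.
        if (skipStack[skipStack.length - 1] === tagStack.length) {
          skipStack.pop();
        }
      }
    } else if (tok.type === 'TEXT' && skipStack.length === 0) {
      // Text is kept only when no excluded tag is open.
      out.push(tok.text);
    }
  });
  return out;
}
```

Deciding whether a piece of text is excluded is then a constant-time check on `skipStack.length`, rather than a scan of the whole tag stack, and nested excluded tags are handled because each one gets its own entry.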

arobins · 2017-05-30T18:32:59Z

test/node/parsing/HTMLFileContent-test.js

+          var expectedContent = fs.readFileSync(`${testDirectory}/${filename}`).toString();
+          expect(preBlocks[testNum].content.trim()).toBe(expectedContent.trim());
+        } else if (filename !== 'spec.html') {
+          console.warn(`${filename} was not used in ${testName} spec test`);


This should probably fail the test.
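
One possible way to turn the warning into a failure, sketched under assumptions: the helper name `findUnusedFixtures` and its arguments are hypothetical, and only `spec.html` is taken from the diff above. The helper gathers the fixture files that no test consumed so the spec can assert that the list is empty.

```javascript
// Hypothetical helper: returns the fixture filenames that were not used.
// `filenames` is every file found in the test directory; `usedNames` is the
// list of files the test actually compared against.
function findUnusedFixtures(filenames, usedNames) {
  return filenames.filter(function(filename) {
    // spec.html is the input document, so it is never an expected-output file.
    return filename !== 'spec.html' && usedNames.indexOf(filename) === -1;
  });
}

// Inside a Jasmine spec this could replace the console.warn call, e.g.:
//   expect(findUnusedFixtures(filenames, usedNames)).toEqual([]);
```

Asserting on the whole list reports every unused file in a single failure message, instead of stopping at the first one.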

m-cheung · 2017-05-31T17:34:03Z

test/node/parsing/WebGL/6

@@ -0,0 +1,552 @@
+typedef ([AllowShared] Uint32Array or sequence<GLuint>) Uint32List;


This line is currently causing problems in parsing. The W3C Grammar does seem to have a rule for [ExtendedAttributes] preceding a type in a typedef definition. It is present in HeyCam's Grammar.

I think the approach taken so far has been to extend the IDL parser for each source of IDL content if something is encountered that doesn't follow the spec.

The comment in Parser.js seems to imply that it was designed around HeyCam's Grammar. Perhaps the spec / grammar changed between the time the Parser was written and the current specs. I am currently working on adding this change, and hope to have a PR up before the end of the day today.

whatwg/webidl#286

Resolved by #47

m-cheung · 2017-06-07T19:09:09Z

Refactor is complete. The test will not pass until foam-framework/foam2#415 and foam-framework/foam2#423 are merged into the beta-1 branch.

codecov-io · 2017-06-08T14:56:11Z

Codecov Report

Merging #46 into master will increase coverage by 0.59%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #46      +/-   ##
==========================================
+ Coverage   94.32%   94.92%   +0.59%     
==========================================
  Files          81       84       +3     
  Lines         564      630      +66     
==========================================
+ Hits          532      598      +66     
  Misses         32       32

Impacted Files	Coverage Δ
lib/org/chromium/webidl/IDLFileContents.js	`100% <ø> (ø)`	⬆️
config/files.js	`100% <ø> (ø)`	⬆️
lib/org/chromium/webidl/IDLFragmentExtractor.js	`100% <100%> (ø)`
lib/org/chromium/webidl/HTMLFileContents.js	`100% <100%> (ø)`
lib/org/chromium/webidl/URLExtractor.js	`100% <0%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 441f1c5...c93dec6.

arobins · 2017-06-09T14:21:20Z

lib/org/chromium/webidl/IDLFragmentExtractor.js

+
+        if (!tagMatching) {
+          // Ignoring all tags. Only extracting text within pre tags.
+          if (isTag && item.nodeName === 'pre') {


Does this need to handle nested pre tags?

It is hard to say at this point whether nested pre tags affect the information we care about. From my observations so far, there haven't been any IDL fragments within nested pre tags (they seem to be used mostly for formatting). So the current handling seems sufficient for our purposes (at least I hope so).

We could attempt to put the content through another round of processing or implement a proper HTML parser (which was my first attempt at this problem, but was scrapped since it did a lot more than it needed to and likely had other issues too).

arobins · 2017-06-09T14:23:10Z

lib/org/chromium/webidl/IDLFragmentExtractor.js

+            tagStack.push(item);
+          } else if (top && item.type.name === CLOSE && top.nodeName === item.nodeName) {
+            var parentCls = extractAttr(top, 'class');
+            if (isExcluded(parentCls)) exclude = false;


Is it possible for there to be an excluded tag inside an excluded tag?

I have not yet observed an excluded tag nested within another, but there is a relatively simple fix for this (using a stack to track excluded tags instead of a bool), so that has been implemented. It will be in the next set of changes.

m-cheung · 2017-06-13T15:28:33Z

All requested changes should be made by now. @arobins please let me know if you have any feedback on the latest set of changes and comments.

arobins · 2017-06-13T15:36:11Z

LGTM. @mdittmer should probably look over it as well, because I'm definitely not a FOAM expert.

arobins · 2017-06-13T15:33:10Z

lib/org/chromium/webidl/HTMLFileContents.js

+      class: 'String',
+      name: 'url',
+      required: true,
+      final: true


nit: add trailing comma

mdittmer

I'm comfortable with this design once comments are addressed. Future alternative also proposed at #53

mdittmer · 2017-06-14T12:38:45Z

lib/org/chromium/webidl/HTMLFileContents.js

+      class: 'String',
+      name: 'url',
+      required: true,
+      final: true,


Here and elsewhere: May not be able to get away with "final: true" anymore; DatastoreDAO (which we will use eventually) instantiates objects first, then sets properties. I think "final: true" will create setters that silently fail.

mdittmer · 2017-06-14T12:39:11Z

lib/org/chromium/webidl/HTMLFileContents.js

+    },
+    {
+      class: 'String',
+      name: 'content',


HTMLFileContents.contents would be a more consistent name.

mdittmer · 2017-06-14T12:39:53Z

lib/org/chromium/webidl/HTMLFileContents.js

+  package: 'org.chromium.webidl',
+  name: 'HTMLFileContents',
+
+  documentation: 'An HTML file that stores it contents.',


More docs: Is the HTMLFileContents.contents pre-processed in any way? (E.g., &foo;-escaped?) or is it the raw request body?

mdittmer · 2017-06-14T12:40:10Z

lib/org/chromium/webidl/IDLFragmentExtractor.js

+foam.CLASS({
+  package: 'org.chromium.webidl',
+  name: 'IDLFragmentExtractor',
+  documentation: 'extracts IDL Fragments from HTML files',


nit: Full sentence. (with capital and period)

mdittmer · 2017-06-14T12:40:36Z

lib/org/chromium/webidl/IDLFragmentExtractor.js

+      var lexer = self.HTMLLexer.create();
+      var OPEN = lexer.TagType.OPEN.name;
+      var CLOSE = lexer.TagType.CLOSE.name;
+      var extractAttr = function(node, attrName) {


I think it's legitimate to have:

<node-name attr="value1 value2" attr="value3">

to yield {attr: ['value1', 'value2', 'value3']}

I assume HTMLLexer doesn't collapse whitespace, so I think we need to revise this.

I have made minor changes to the extract code which allows for this. It will be part of the next set of changes.
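
A minimal sketch of the behaviour requested above, assuming a node shape of `{ attributes: [{ name, value }, ...] }`. That shape and the function body are illustrative assumptions, not the actual change made in this PR. Values are split on runs of whitespace and merged across repeated attributes of the same name.

```javascript
// Hypothetical node shape: { attributes: [{ name: 'class', value: 'a  b' }] }.
// Returns every whitespace-separated value of attrName, across all
// occurrences of that attribute on the node.
function extractAttr(node, attrName) {
  var values = [];
  (node.attributes || []).forEach(function(attr) {
    if (attr.name !== attrName) return;
    // Split on runs of whitespace; drop the empty strings produced by
    // leading or trailing whitespace.
    attr.value.split(/\s+/).forEach(function(v) {
      if (v !== '') values.push(v);
    });
  });
  return values;
}
```

With this shape, a class check becomes `extractAttr(node, 'class').indexOf('idl') !== -1`, regardless of how the attribute value is spaced.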

mdittmer · 2017-06-14T12:45:00Z

lib/org/chromium/webidl/IDLFragmentExtractor.js

+        // As of this writing, there has not been any IDL fragments
+        // that has been found within nested pre tags.
+        if (!tagMatching) {
+          // Ignoring all tags. Only extracting text within pre tags.


I like this comment. Can we get a comment at the top of each if branch in this method? The logic is pretty complex.

mdittmer · 2017-06-14T12:46:57Z

lib/org/chromium/webidl/IDLFragmentExtractor.js

+            }
+            tagStack.push(item);
+          } else if (top && item.type.name === CLOSE && top.nodeName === item.nodeName) {
+            var parentCls = extractAttr(top, 'class');


This isn't parent, right? It's openTag? Maybe the code would be easier to read if this branch started with:

var openTag = top; var closeTag = item

and open and close prefixes are used to refer to tag-related things.

mdittmer · 2017-06-14T12:48:05Z

test/any/htmlFileClasses-test.js

+    expect(file.content).toBe(content);
+  });
+
+  it('should fail to set HTMLFileContent props after creation', function() {


We should probably nix this due to how DatastoreDAO works.

mdittmer · 2017-06-14T12:49:45Z

test/any/htmlFileClasses-test.js

@@ -0,0 +1,51 @@
+// Copyright 2017 The Chromium Authors. All rights reserved.


file name: we just test HTMLFileContents, yes? Let's name this file after that: HTMLFileContents-test.js.

mdittmer · 2017-06-14T12:51:48Z

test/node/parsing/IDLFragmentExtractor-test.js

+  var IDLFragmentExtractor;
+  var Parser;
+
+  function cmpTest(testName, testDirectory, expectedIDL) {


expectedIDL is a count/length, right? numExpectedIDLFragments?

mdittmer

Just a few minor things, and I think we can drop the test.

mdittmer · 2017-06-26T16:47:10Z

lib/org/chromium/webidl/HTMLFileContents.js

+      class: 'Array',
+      of: 'String',
+      name: 'references',
+      factory: function() { return []; },


This is the default factory for Array; you can leave it out.

mdittmer · 2017-06-26T16:47:33Z

lib/org/chromium/webidl/HTMLFileContents.js

+    },
+    {
+      class: 'Array',
+      of: 'String',


documentation?

Nits: usually in order: class, of, documentation, name. Please uncomment documentation.

mdittmer · 2017-06-26T16:49:07Z

lib/org/chromium/webidl/IDLFragmentExtractor.js

          }
        });
        return retVal;
      };

-      var results = lexer.parseString(self.file.content).value;
+      var results = lexer.parseString(this.file.contents).value;
+      if (!results) throw "IDL Parse was not successful.";


nit: throw new Error(<msg>).
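
As a small illustration of why this nit matters: a thrown string carries no stack trace and fails `instanceof Error` checks, whereas an Error object provides both. The wrapper name `requireParseResult` below is hypothetical and only serves to make the sketch self-contained.

```javascript
// Hypothetical wrapper around the check shown in the diff above.
function requireParseResult(results) {
  if (!results) throw new Error('IDL Parse was not successful.');
  return results;
}

// Callers can now distinguish real errors and inspect the message and stack.
function describeFailure(results) {
  try {
    requireParseResult(results);
    return null;
  } catch (e) {
    return e instanceof Error ? e.message : 'non-Error value thrown';
  }
}
```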

mdittmer · 2017-06-26T16:52:23Z

test/any/HTMLFileContents-test.js

@@ -0,0 +1,28 @@
+// Copyright 2017 The Chromium Authors. All rights reserved.


This test doesn't seem worthwhile. It amounts to checking that FOAM's create() implementation is correct.

If you intend to guard against mistakenly finaling something that may be set, we could switch the test to do foo.create(); foo.bar = 'bar'; expect(foo.bar).toBe('bar'), but if that's not what you meant to test, then I'd say we can drop this test entirely.

mdittmer

Please merge after addressing last nits.

m-cheung added 5 commits May 23, 2017 17:59

WIP: Adding HTML Content parsing

dd775b5

Adding preliminary HTMLParsing and corresponding tests

58e5e48

Merge branch 'htmlContent'

32366c5

Renaming HTMLParser to HTMLFileContent and adding additional tests

6376b3d

Removing additional files

959d608

m-cheung requested a review from arobins May 30, 2017 18:27

arobins reviewed May 30, 2017


m-cheung added 2 commits May 30, 2017 16:39

Adding some additional tests and filtering out examples / notes

b35f750

Adding additional tests

853d12d

m-cheung commented May 31, 2017


m-cheung added 6 commits May 31, 2017 13:39

Temporary work on pipeline

0f92a6b

Merge branch 'master' into htmlContent

b640239

Merge branch 'master' into htmlContent

e9eb44d

Merge branch 'master' into htmlContent

5bcfc90

Refactoring HTMLFileContent to correspond with HTMLLexer changes

a041087

Part 2 of refactor due to HTMLLexer changes

5969d4b

Renaming components

ad7669e

m-cheung changed the title ~~Adding HTMLFileContent Class for parsing HTML files~~ HTMLFileContent and component for extract IDL fragments Jun 8, 2017

Removing parsing step from extractor

dd2bd1d

m-cheung changed the title ~~HTMLFileContent and component for extract IDL fragments~~ HTMLFileContent and component for extracting IDL fragments Jun 8, 2017

m-cheung added 3 commits June 8, 2017 11:19

Removing unused variables and fixing comment styling

f812b58

Fixing description for HTMLFileContents

ff6ad90

Removing requirement for HTTPRequest in HTMLFileContents

c5fb6fb

arobins reviewed Jun 9, 2017


m-cheung added 3 commits June 9, 2017 10:51

Allowing for nested excluded tags

c736c28

Adding support for nested exclude tags

af4cd9d

Adding Future message with regards to nested pre tags

20932a6

arobins reviewed Jun 13, 2017


Fixing minor formatting problem

5f93bbb

mdittmer reviewed Jun 14, 2017


m-cheung added 3 commits June 14, 2017 10:35

Additional documentation in IDLFragmentExtractor and HTMLFileContents

30df62a

Adding additional documentation to IDLFragmentExtractor

7c3d78e

Formatting and style fixes on HTMLFileContents and IDLFragmentExtractor

6865625

mdittmer reviewed Jun 26, 2017


Minor changes to HTMLFileContents and removing unneeded test

2dc37e8

mdittmer approved these changes Jun 27, 2017


Addressing formatting changes in HTMLFileContents

c93dec6

m-cheung merged commit f5bdb2e into GoogleChromeLabs:master Jun 27, 2017

m-cheung deleted the htmlContent branch June 27, 2017 14:27

m-cheung mentioned this pull request Jun 27, 2017

Port HTML parser #3

Closed
