Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFP Flowifier – HTML to HtmlFlow translator #43

Closed
lcduarte opened this issue Nov 14, 2018 · 45 comments
Closed

RFP Flowifier – HTML to HtmlFlow translator #43

lcduarte opened this issue Nov 14, 2018 · 45 comments

Comments

@lcduarte
Copy link
Member

In order to setup an HtmlFlow template it would be useful to have a translator tool – Flowifier – that is able to convert an HTML document into the equivalent template definition with HtmlFlow.

This RFP aims to discuss implementation approaches and requirements for Flowifier.

Some of the topics under discussion (not limited):

  • HTML parsing strategy
  • Nodes visiting approach
  • Execution model
  • Parent repository
@ghost
Copy link

ghost commented Oct 3, 2019

@lcduarte What about using JSoup for parsing? I'll have to write this kind of tool as I plan to use HtmlFlow to replace my blog (with hundreds of articles) and the website of my game by a single static website. JSoup can sanitize and output tidy HTML. The DOM traversal would allow to go through all nodes and elements to produce their HtmlFlow equivalents.

I plan to leave Github, would you accept a patch? Is there any mean to contact you both outside Github?

@fmcarvalho
Copy link
Member

fmcarvalho commented Oct 4, 2019

@gouessej Thanks for your proposal. I agree that Jsoup may be the best option for parsing.

I am not used to deal with patch life cycle and coexistence with further releases of the main repository. A pull request looks a better option. I am not sure if I have understood how do you plan to integrate your contributions!?

You can always contact me out of Github to my outlook.com address with username miguelgamboa.

Thanks

@ghost
Copy link

ghost commented Oct 6, 2019

@fmcarvalho You're welcome. Yesterday, I switched my website from HTML 4 to HTML 5 in order to work on converting some clean HTML 5 source code into "flowified" Java source code first.

I suggest to put the Flowifier into a dedicated package "flowifier" in this project as a first step. When the project is a bit more mature, we could move it into a separate project in order to avoid adding JSoup into the dependencies of HtmlFlow (it's a dependency only in the "test" scope right now) whereas only its flowifier will really need it.

Ok, I'll try making a pull request before leaving Github.

By the way, I suggested to use JSoup instead of DOM, JDOM, Sax, Stax, etc because the input is in HTML format, not in XHTML.

@fmcarvalho
Copy link
Member

I suggest to put the Flowifier into a dedicated package "flowifier"
in this project as a first step.

Yes, it seems to be the better choice.

I suggested to use JSoup instead of DOM, JDOM, Sax, Stax, etc
because the input is in HTML format, not in XHTML.

Yes, you are right.

@ghost
Copy link

ghost commented Oct 7, 2019

@fmcarvalho The first blueprint produces some valid Java source code. I'll test it on some much more complicated webpages.

@fmcarvalho
Copy link
Member

@gouessej WoW you were fast! Great! Your approach sounds nicely.

From my analysis of your code I have only a few questions/suggestions:

  1. I did not understand the advantage of the Flowifier constructor receiving a class rather than an instance. Why are you passing to the constructor a Class< HtmlToFlowNodeVisitor> and not only an instance of HtmlToFlowNodeVisitor ? If I clearly understood the only thing you do with that class is to instantiate it in htmlToFlowNodeVisitorClass.getConstructor().newInstance();.

  2. Did you consider using an auxiliary library to write Java source code, such as https://github.com/square/javapoet or any other? I think this would help to ensure the correctness of the generated source code.

  3. Maybe you should not emit the call to .render(). Maybe the string field html could be an HtmlView view field instead and let the end user to decide what to do with that view.

  4. Maybe the package should be htmlflow.flowifier rather than flowifier.

@ghost
Copy link

ghost commented Oct 8, 2019

@fmcarvalho I plan to improve the lifecycle of the visitor so that I can safely accept an object instead of accepting a class and creating an instance. My first try was a bit overcomplicated.

I have improved the look of the code, I'll look at javapoet.

Good idea, I'll return an HtmlView.

Ok, I'll move it into htmlflow.flowifier.

@ghost
Copy link

ghost commented Oct 9, 2019

@fmcarvalho Now I use an object instead of a class, I return an HtmlView, the test is in a separate class and I support partial conversion (conversion starting from a node instead of the whole document). I have some problems with comments and scripts. When they're fixed, I'll be able to convert Wikipedia's homepage.

I don't use Javapoet yet but it might help to produce prettier Java source code.

@ghost
Copy link

ghost commented Oct 9, 2019

@fmcarvalho Only a very few content types are currently supported :s

@ghost
Copy link

ghost commented Oct 10, 2019

@fmcarvalho Please can you clarify which elements need end tags in HtmlFlow? It would help to implement this method correctly.

I borrow 2 methods of javapoet.

I still need to pass Long instances to attrWidth and attrHeight and to use addAttr for unknown attributes.

@lcduarte
Copy link
Member Author

Hello,

The elements without end tags are the following:

 * Void elements: area, base, br, col, embed, hr, img, input, link, meta, param, source, track, wbr.

All the remaining elements should have end tags.

I haven't had much time to analyze your work but from what I've seen it's on the right track, thanks for your help :)

@ghost
Copy link

ghost commented Oct 11, 2019

@lcduarte Thank you. Eclipse's workbench often crashes when the resulting Java class is opened. I'll take your suggestion into account as soon as possible.

@lcduarte
Copy link
Member Author

I had the same issue myself with IntelliJ. Since HtmlFlow relies heavily on types to enforce the language rules the IDEs sometimes have some issues validating the multiple chained calls because of the depth of HTML tree. The only solution I've found to that was to extract variables every now and then, which eases the load from the IDE.

@ghost
Copy link

ghost commented Oct 19, 2019

@lcduarte I tried your suggestion to ensure the generated source code was correct, thank you. Well, it seems to work as expected, feel free to review my source code before I make a pull request:
my branch

I'm glad with the current result. Maybe I'll add an option to disable/enable indentation.

@fmcarvalho
Copy link
Member

@gouessej I will try to check your work tomorrow and return some feedback.

Thank you

@bwfrieds
Copy link

So does Flowifier expect a well formed html5 document as input? Does the document need to pass strict validation first? If yes, does that mean it would fail to run on html files built to use Mustache, Handlebars, Thymeleaf, FreeMarker, and Velocity?

@fmcarvalho
Copy link
Member

@bwfrieds I think so. This is still under development. And for a first release I would not include support for template specific tags. Since this current proposal is using JSoup to parse the HTML input document I am not aware if it is hard or not to add that feature. But from a quick search in Stackoverflow it looks that JSoup has not seamless support for template engine tags.

In your case, I would plan to use only the HTML resulting from your template engines to be used via
Flowifier and then manually add the HtmlFlow control to the resulting Java code. Yet, we are only rambling around a feature that is not still completed!

Thanks

@ghost
Copy link

ghost commented Oct 25, 2019

@bwfrieds Yes but you can use an HTML4 document, use JSoup to clean the mess and pass it to the flowifier but it would fail on template specific tags.

@fmcarvalho
Copy link
Member

@gouessej Great! I took a look to your branch patch-1. It is very nice, organized and very well commented. I see that you are still not using Javapoet or any similar tool.

I tried to manually compile the resulting Flowified class in a separate file and I got a few problems. I have also tried to do it with InteliJ bu the build breaks with a stackoverflow exception.

It would be good if your Unit test asserts the result of Flowified.get().render().replaceAll("\\s", "").toLowerCase(); against the original htmlFlowJavaClassSourceCodeWithHtmlViewGetter.replaceAll("\\s", "").toLowerCase();. Maybe it is not easy but it would give further guarantees.

Ideally the unit test should compile at run-time and load the resulting Flowified class and check the previous assertion.

@ghost
Copy link

ghost commented Oct 26, 2019

@fmcarvalho I use 2 methods of Javapoet. I'm not against using it later.

Yes, it's a known limitation, the same happens with Eclipse.

Yes, it's an excellent idea, I can use javax.tools.ToolProvider. I have to see how to pass the dependencies to the Java compiler task:
https://stackoverflow.com/a/1564025

@fmcarvalho
Copy link
Member

@gouessej Regarding JavaPoet I was thinking that you could use its addCode() feature rather than appending String into an Appendable.

Taking a closer look into AbstractHtmlToJavaHtmlFlowNodeVisitor I see that head() method is quite big. Maybe you could refactor it. It is hard to understand it like this.

I do not sympathies either with the try/catch idiom as a way to control flow. If I am not wrong I saw a lot of try {...} catch (final NoSuchMethodException nsme1) { do something}. Maybe you could move this into an auxiliary method that retrieves the desired method or null if not found.

I think you can move this array into a static field.

Try to create more auxiliary methods with intuitive names that suggest the purpose of their actions. For instance, it is hard to read: https://github.com/gouessej/HtmlFlow/blob/patch-1/src/main/java/htmlflow/flowifier/AbstractHtmlToJavaHtmlFlowNodeVisitor.java#L216

It would be easier to understand if you had invoked a method like getClassNameFrom(nodename) instead, or anything similar.

Thanks,
Miguel

@ghost
Copy link

ghost commented Oct 31, 2019

@fmcarvalho Thank you for your feedbacks, I followed your advice, it's easier to read now and there is no longer any try {...} catch (final NoSuchMethodException nsme1) { do something}. I created the static array and 2 methods.

I haven't had enough spare time to look at JavaPoet, is it this method you were talking about?
https://square.github.io/javapoet/javadoc/javapoet/com/squareup/javapoet/MethodSpec.Builder.html#addCode(com.squareup.javapoet.CodeBlock)

@ghost
Copy link

ghost commented Nov 1, 2019

@fmcarvalho Now the unit test check if the generated Java class can be successfully compiled. I get this exception trace when testing the flowifier on Wikipedia's homepage:

compiler message file broken: key=compiler.misc.msg.bug arguments=11.0.1, {1}, {2}, {3}, {4}, {5}, {6}, {7}
java.lang.StackOverflowError
	at jdk.compiler/com.sun.tools.javac.tree.TreeScanner.scan(TreeScanner.java:49)
	at jdk.compiler/com.sun.tools.javac.tree.TreeScanner.visitSelect(TreeScanner.java:302)
	at jdk.compiler/com.sun.tools.javac.tree.JCTree$JCFieldAccess.accept(JCTree.java:2110)
	at jdk.compiler/com.sun.tools.javac.tree.TreeScanner.scan(TreeScanner.java:49)
	at jdk.compiler/com.sun.tools.javac.tree.TreeScanner.visitApply(TreeScanner.java:237)
	at jdk.compiler/com.sun.tools.javac.tree.JCTree$JCMethodInvocation.accept(JCTree.java:1634)

@ghost
Copy link

ghost commented Nov 5, 2019

@fmcarvalho I've just implemented the comparison between the original HTML and the generated HTML, disabling the indentation doesn't seem to work, I call HtmlView.setIndented(false).

@ghost
Copy link

ghost commented Nov 6, 2019

@fmcarvalho I've just found my mistake, I have to call HtmlView.setIndented(false) earlier. I'll give it another try very soon. I'll add this call here:
https://github.com/gouessej/HtmlFlow/blob/patch-1/src/main/java/htmlflow/flowifier/AbstractHtmlToJavaHtmlFlowNodeVisitor.java#L112

@ghost
Copy link

ghost commented Nov 6, 2019

It's now possible to enable/disable indentation. &nbsp; isn't correctly treated, it's absent from the generated HTML source code whereas it's present in the original HTML source code, I have to fix this bug.

@ghost
Copy link

ghost commented Nov 6, 2019

I've just fixed this bug, I had to escape the content of the text nodes. To sum up, I have to remove the nasty warning message caused by the use of the reflection API to modify the class loader and I have to work around the StackOverflowError occurring in the compiler by splitting the generated Java source code into small pieces. Then, I'll be ready to make a pull request (except if I decide to spend some time in using Java Poet or if I find some new bugs).

The first JUnit test passes :)

@ghost
Copy link

ghost commented Nov 7, 2019

I've just got rid of the warnings :) The last remaining bug is the crash occurring when there are too many chained method calls.

@ghost
Copy link

ghost commented Nov 8, 2019

I just have to use a more recent version of OpenJDK, I suspect that it's fixed in OpenJDK 11.0.5:
https://bugs.openjdk.java.net/browse/JDK-8220578

I'll give it a try as soon as possible.

@fmcarvalho
Copy link
Member

fmcarvalho commented Nov 8, 2019

@gouessej WoW you have been working hard. I just checked out your branch and I see that testFlowifierTuerSourceforgeHomepage is passing successfully. Great news !!!! It will be awesome to have the next release of HtmlFlow including your first implementation of Flowifier. Really amazing! =)

@ghost
Copy link

ghost commented Nov 9, 2019

@fmcarvalho Thank you for the compliment :) The second JUnit test still crashes, it's possibly coming from an unfixed compiler bug, I tested with OpenJDK 11.0.5 yesterday, I obtained the same stack trace (except the version number of course). I'll provide a small Maven project with a single Java class to reproduce this bug, I'll fill a bug report against OpenJDK. If it's not enough, I'll try to write a smaller reproducer, some kind of SSCCE. In my humble opinion, the safest option is to fix the bug in the compiler instead of implementing a dirty kludge in our project. While I don't know exactly why it crashes, implementing a workaround only risks to drive the source code harder to read and to maintain on the long term, it would be like using a solution without knowing the problem to solve.

@fmcarvalho
Copy link
Member

@gouessej I think the bellow tips may help.

Before that, I am not sure if that compilation behavior is a BUG but a feature instead. I made the following observations:

  1. I have compiled to Java 1.8 target an HtmlFlow library and then I used it to compile the output of your testFlowifierWikipediaHomepage and I still got the same StackOverflowError with Java 8 compiler.

  2. I did not find any way of changing the javac stack size configuration. But you can set the maximum stack size of runtime with -Xss option (https://stackoverflow.com/a/2127262/1140754). Thus, I changed in Intellij under Run -> Edit Configurations in Configuration tab the VM options field to include -Xss16m. Now the testFlowifierWikipediaHomepage runs to completion and only fails on assertion.

I changed a little bit this assertion statement to the following to figure out where the assertion is failing:

Iterator<String> actual = Arrays
	.stream(generatedHtmlSourceCode.split("<"))
	.iterator();
Arrays
	.stream(originalHtmlSourceCode.split("<"))
	.forEach(expected -> {
		Assert.assertEquals(
			expected.replaceAll("\\s", "").replaceAll("\\h", "").replaceAll("\\v", "").toLowerCase(),
			actual.next().replaceAll("\\s", "").replaceAll("\\h", "").replaceAll("\\v", "").toLowerCase());
	});

It fails on the following comparison:

junit.framework.ComparisonFailure: 
Expected :linkrel="stylesheet"href="/w/load.php?lang=en&amp;modules=ext.uls.interlanguage%7cext.visualeditor.desktoparticletarget.noscript%7cext.wikimediabadges%7cmediawiki.legacy.commonprint%2cshared%7cmediawiki.skinning.interface%7cskins.vector.styles&amp;only=styles&amp;skin=vector">
Actual   :linkrel="stylesheet"href="/w/load.php?lang=en&modules=ext.uls.interlanguage%7cext.visualeditor.desktoparticletarget.noscript%7cext.wikimediabadges%7cmediawiki.legacy.commonprint%2cshared%7cmediawiki.skinning.interface%7cskins.vector.styles&only=styles&skin=vector">

@fmcarvalho
Copy link
Member

@gouessej BTW we may add the following configuration on pom.xml to suppress the StackOverflowError when deploying with Maven:

<plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-surefire-plugin</artifactId>
   <configuration>
        <argLine>-Xss16m</argLine>
   </configuration>
</plugin>

@ghost
Copy link

ghost commented Nov 12, 2019

@fmcarvalho Thank you very much. Maybe I have to escape the content of DataNode in order to turn & into &amp;, not a big deal.

@ghost
Copy link

ghost commented Nov 12, 2019

@fmcarvalho The second JUnit test passes :) I'll just fix the indentation in the classes I added into the project and I'll make a pull request. Thank you for your valuable help. Please let me know which branch I have to use as a target of the pull request, master or development?

@fmcarvalho
Copy link
Member

@gouessej please provide your Pull Request to the development branch.

Only a few points:

  1. The pom.xml does not include yet the maven surefire plugin configuration mentioned above. I will need that to deploy a release.

  2. Maybe you could disable the log info also in pom.xml to avoid some noise on console output when we are deploying. I know that I made a lot of this mistake in past with many hard-coded println among my unit tests, but one day we must be more accurate :-). So I would propose to add a logging.properties file to src/test/resources with java.util.logging.ConsoleHandler.level=SEVERE and the following configuration in pom.xml:

<plugin>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
       <argLine>-Xss16m</argLine>
       <systemProperties>
         <property>
           <name>java.util.logging.config.file</name>
           <value>src/test/resources/logging.properties</value>
         </property>
       </systemProperties>
    </configuration>
</plugin>
  1. Please add the copyright to all your new files as you made in your Flowifier.java. But replace the line Copyright (c) 2014-18, mcarvalho (gamboa.pt) and lcduarte (github.com/lcduarte) by Copyright (c) 2014-19 HtmlFlow.org. Then I will add you to the HtmlFlow organization too.

Thank you,
Miguel

@ghost
Copy link

ghost commented Nov 13, 2019

  1. I'll do it. Note that I had to modify the configuration of JUnit within Eclipse to modify the stack size.
  2. It makes sense to me, I'll do it as soon as possible too.
  3. Good point, I've forgotten the other files.

I'll make a request for enhancement against JSoup as I need a public method to escape characters inside an attribute.

@ghost
Copy link

ghost commented Nov 13, 2019

Done: jhy/jsoup#1278

@fmcarvalho
Copy link
Member

@gouessej I ran sonar on flowifier and there a couple of code smells. I have pushed results to sonarcloud. You may check it here: https://sonarcloud.io/project/issues?id=com.github.xmlet%3Ahtmlflow&resolved=false&sinceLeakPeriod=true&types=CODE_SMELL

@ghost
Copy link

ghost commented Nov 13, 2019

@fmcarvalho The first code smell comes from Java Poet and is very easy to fix by using some conditional instructions (it will become easier to read even for me), the others are quite trivial to fix too. I'll try to move the management of attributes into a dedicated method to reduce the complexity.

This one requires the intervention of the JSoup's maintainer(s):
https://sonarcloud.io/project/issues?id=com.github.xmlet%3Ahtmlflow&open=AW5k4TaMrXAiup-72QJp&resolved=false&sinceLeakPeriod=true&types=CODE_SMELL

@ghost
Copy link

ghost commented Nov 13, 2019

@fmcarvalho Done: #52

@ghost
Copy link

ghost commented Nov 14, 2019

@fmcarvalho I'll have to fix a tiny bug in the comparison of actual and expected results, I think that it doesn't work when the number of lines differs.

fmcarvalho added a commit that referenced this issue Nov 15, 2019
RFP Flowifier – HTML to HtmlFlow translator: issue #43
@fmcarvalho
Copy link
Member

@gouessej I have already merged your pull request to development branch. Because I am moving my development environment to my new laptop I need to reconfigure GPG and other stuff. So, I will deploy a new release only on next week.

For now I am planning to close this Issue which has already tackled many different aspects. Any new bug or feature should be opened on new issues.

Thank you,
Miguel

@ghost
Copy link

ghost commented Nov 19, 2019

@fmcarvalho JSoup's maintainer hasn't replied yet, I might have to provide a less clumsy workaround, preferably before the end of the week.

@fmcarvalho
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants