Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance problem about reindexChildren with replace parentNode #1281

Closed
gjchoi opened this issue Dec 2, 2019 · 7 comments
Closed

Performance problem about reindexChildren with replace parentNode #1281

gjchoi opened this issue Dec 2, 2019 · 7 comments
Milestone

Comments

@gjchoi
Copy link

gjchoi commented Dec 2, 2019

Hello,
my application suffers from poor performance in case of over 10Mbyte html data..

In my case,
I take a parsed HTML document from jsoup parser, and replace element with wrapperElement class ( custom class RichElement ) to add more custom data with element attributes.

This call the method addChildren at Element.class,
with over 100,000 parsed elements,
but this is very slow...

because addChildren call reparentChild ( => setParentNode) 100,000 times,
each setParentNode occures u​nnecessary reindexing for sibling children about 1/2 * 100,000 times in this case;;

Do you hava any idea,
only replace parent keep children without u​nnecessary reindexing?

or

Could you add some method like below?

Element.class

    protected Element replaceParent(List<Node> children) {

        ArrayList<Node> nodes = new ArrayList<>(children);
        Node[] nodeArray = nodes.toArray(new Node[nodes.size()]);
        replaceChildrenWithoutReindex(nodeArray);
        return this;
    }

Node.class

    protected void replaceChildrenWithoutReindex(Node... children) {
        Validate.isTrue(this.childNodeSize() == 0, "replaceChildrenWithoutReindex valid with new parent (no child)");


        List<Node> nodes = this.ensureChildNodes();
        Node[] var3 = children;
        int var4 = children.length;

        for(int var5 = 0; var5 < var4; ++var5) {
            Node child = var3[var5];
            //this.reparentChild(child); 
            child.parentNode = this;       //without reindex, reparent parent keep sibling indexes;
            nodes.add(child);
            child.setSiblingIndex(nodes.size() - 1);
        }
    }
@jhy
Copy link
Owner

jhy commented Jan 28, 2020

Hi - do you have an example code snippet and some HTML? So I can validate and perf test.

@whsoul
Copy link

whsoul commented Jan 30, 2020

Hi, this is sample code about this

    @Test
    public void sample() throws IOException {
        InputStream is = TempTest.class.getClassLoader().getResourceAsStream("10000row.txt");
        InputStreamReader isr = new InputStreamReader(is);
        BufferedReader br = new BufferedReader(isr);
        String raw = br.lines().collect(Collectors.joining("\n"));

        ReaderInputStream inputStream = new ReaderInputStream(new DocumentCharacterFilterReader(new StringReader(raw), false), StandardCharsets.UTF_8);
        org.jsoup.nodes.Document document = Jsoup.parse(inputStream, StandardCharsets.UTF_8.name(), "");

        ElementWrapper newParent = new ElementWrapper("div", "A", "B");

        // very slow line with long row data
        newParent.insertChildren(0, document.body().childNodes());   

    }

    public class ElementWrapper extends Element{
        String a;
        String b;

        public ElementWrapper(String tag, String a, String b) {
            super(tag);
            this.a = a;
            this.b = b;
        }
    }

sample is not special, just many rows html data below

this is just 100 rows data, it will be not slow
but if you make 100,000 rows data and test, it will be extreamly slow (exponentially)

<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
<p>long text sample line</p>
....

@jhy jhy added this to the 1.12.2 milestone Jan 31, 2020
@jhy
Copy link
Owner

jhy commented Jan 31, 2020

Thanks. I took a first look, and I think the bulk of the spent is in the re-index when removing from the original parent. It looks like it's quadratic performance. I think an approach will be in addChildren, see if all the input nodes have the same parent (which is likely almost all the case), and if so, do a bulk move and only call reindex once on each of the input and output nodes.

@whsoul
Copy link

whsoul commented Jan 31, 2020

it's good your idea.
I'll be wait 1.12.2 release
thanks~!

( can I know when 1.12.2 will be released? )

@jhy jhy closed this as completed in 5a570cf Feb 2, 2020
@jhy
Copy link
Owner

jhy commented Feb 2, 2020

Thanks, have implemented a fast path.

Hoping to get 1.12.2 out over the next week. Please feel free to build from HEAD and let me know if you run into any issues.

@jhy
Copy link
Owner

jhy commented Feb 9, 2020

Hi @whsoul, jsoup 1.12.2 is available now. https://jsoup.org/news/release-1.12.2

@whsoul
Copy link

whsoul commented Feb 10, 2020

@jhy
thanks for your quick response and release (1.12.3)
I will check and apply to my application.

This was referenced Feb 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants