Memory Leak #36

Open
JakeAustwick opened this issue Jul 22, 2013 · 28 comments

@JakeAustwick

Hey, I'm new to Go so please excuse me if the problem is with my code.

When I run the following code, the memory just keeps growing and growing, as if it is being leaked somewhere. The leak is slow: with a list of 42k URLs, memory just keeps climbing and climbing. You should be able to reproduce it with this URL list: https://gist.github.com/JakeAustwick/82c9d4ce300639a4d275/raw/368c41ce6ba95f03cbc25a188dd3c07646a068b0/gistfile1.txt

Can you spot what I'm doing wrong, or have I found a bug?

package main

// import "github.com/hoisie/redis"
// import "code.google.com/p/go.net/html"
// import "github.com/davecheney/profile"
import "github.com/moovweb/gokogiri"
import "github.com/moovweb/gokogiri/xml"
import "github.com/moovweb/gokogiri/html"
import "github.com/mreiferson/go-httpclient"

// import "github.com/PuerkitoBio/goquery"
import "log"
import "time"

// import "strconv"
import "errors"

// import "net/url"

import "strings"

// import "io"
import "io/ioutil"

// import "bytes"

import "net/http"
import "runtime"

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    WORKER_COUNT := 25

    transport := &httpclient.Transport{
        ConnectTimeout:        3 * time.Second,
        RequestTimeout:        8 * time.Second,
        ResponseHeaderTimeout: 3 * time.Second,
    }
    http.DefaultClient = &http.Client{Transport: transport}
    // QUEUES := []string{"feed"}

    jobChannel := make(chan string)
    for id := 0; id < WORKER_COUNT; id++ {
        go fetchPage(id, jobChannel)
    }
    // Not most efficient way to read file, but isn't gonna use up all
    // available RAM. This isn't the problem.
    content, err := ioutil.ReadFile("urls.txt")
    if err != nil {
        log.Panicln("FILE READ ERROR")
    }
    urls := strings.Split(string(content), "\n")

    for _, url := range urls {
        jobChannel <- url
    }

    // Stop program from ending
    c := make(chan string)
    <-c

}

func fetchPage(id int, jobChannel chan string) {
    for url := range jobChannel {
        resp, err := http.Get(url)
        if err != nil {
            log.Println("http error")
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        // Close the body before any early continue so connections are not leaked.
        resp.Body.Close()
        if err != nil {
            log.Println("read error")
            continue
        }
        doc, err := gokogiri.ParseHtml(body)
        if err != nil {
            log.Println("parse error")
            // Skip this page; doc is not usable (and must not be freed) on error.
            continue
        }

        form, err := BiggestForm(doc)
        if err != nil {
            log.Println(err)
        } else {
            log.Println(form.Name())
        }
        doc.Free()
    }
}

func PostForms(doc *html.HtmlDocument) []xml.Node {
    var forms []xml.Node
    foundForms, err := doc.Search("//form")
    if err != nil {
        return forms
    }
    return foundForms
}

func BiggestForm(doc *html.HtmlDocument) (xml.Node, error) {
    forms := PostForms(doc)
    if len(forms) > 0 {
        high_input_count := 0
        high_index := 0
        for i, f := range forms {
            inputs, _ := f.Search(".//descendant::input")
            textareas, _ := f.Search(".//descendant::textarea")
            fieldCount := len(inputs) + len(textareas)
            if fieldCount > high_input_count {
                high_input_count = fieldCount
                high_index = i
            }
        }
        return forms[high_index], nil
    } else {
        var f xml.Node
        return f, errors.New("No form on the Page")
    }
}

Once again I apologise if it is me doing something wrong.

@mdayaram
Contributor

No worries.
So libxml likes to hold onto memory for things. Once all the workers are done processing, you should try calling the following:

import "github.com/moovweb/gokogiri/help"

help.LibxmlCleanUpParser()

Note that once you call the above function, you can no longer use gokogiri to parse any more documents. This should clean up a lot of memory, but it renders the parser unusable until your program exits.
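
For context, here is a minimal sketch (not taken from this issue, and assuming the worker-pool shape of the original program) of where such a call would go: only after every worker has finished, since parsing is impossible afterwards.

package main

import (
    "sync"

    "github.com/moovweb/gokogiri/help"
)

func main() {
    // Sketch only: mirrors the worker pattern from the original program.
    urls := []string{ /* ... load the 42k URLs here ... */ }
    jobs := make(chan string)
    var wg sync.WaitGroup

    for i := 0; i < 25; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for range jobs {
                // fetch, parse with gokogiri, and Free() each document,
                // as in fetchPage from the original program
            }
        }()
    }

    for _, u := range urls {
        jobs <- u
    }
    close(jobs)
    wg.Wait()

    // One-time, global libxml cleanup; gokogiri cannot parse anything after this.
    help.LibxmlCleanUpParser()
}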

@JakeAustwick
Author

Right, that explains why my program still held onto a few hundred MB after it finished.

So are you saying that a long-running process (parsing tens of thousands of large pages) with gokogiri is not possible, because memory will keep growing?

@mdayaram
Copy link
Contributor

Sorry it took me so long to get back to you. Do you have a copy of the 42k list of URLs you're using? I'd like to replicate the issue if possible to investigate further.

@JakeAustwick
Author

No problem, I linked to the 42K url list in the original post.

@mdayaram
Contributor

woops, sorry, I'm silly =x

@mdayaram
Contributor

Also, what version of go are you using?

@JakeAustwick
Author

go1.1.1 amd64

@mdayaram
Contributor

mdayaram commented Aug 5, 2013

Hey @JakeAustwick, sorry, I haven't had too much time to investigate.

However, I did manage to try the following:

I ran the program you provided against your URL list, and though the memory did keep climbing for a while, it seemed to stabilize around 300-400MB on my machine. I totally understand that that's a rather high memory footprint for what the program is doing, but my intuition is that Go's garbage collection is perhaps not as aggressive as it could be when cleaning up resources.

Have you tried tweaking the runtime garbage collection parameters?
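
One way to experiment with this (a sketch, assuming "parameters" means the GOGC percentage; this is the standard runtime/debug API, nothing gokogiri-specific) is to lower GOGC so collections run after less heap growth:

// Sketch only: make the collector more aggressive by lowering GOGC
// (default 100); equivalent to running the program with GOGC=20.
package main

import "runtime/debug"

func init() {
    debug.SetGCPercent(20) // trigger a collection after ~20% heap growth instead of 100%
}

func main() {}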

@JakeAustwick
Author

I'd accept 300-400MB happily; my problem was that it rose to over 1GB and kept rising.

Did you run through the entire list?

@mdayaram
Contributor

mdayaram commented Aug 6, 2013

So I've noticed the memory jump all the way up to 700MB at some points, but after going through the entire list of URLs you provided, the memory settled down to 430MB, at least on my machine, which makes me feel that it's a garbage collection issue.

You could try forcing the garbage collector to run at each iteration and see how that affects it; however, that'll probably take quite a while to run =/
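
A rough version of that experiment (a sketch reusing the fetch-and-parse shape of the original program, not anything added to gokogiri) might look like this:

// Diagnostic sketch only: force a collection after every parsed page so that
// heap growth reflects live objects rather than collector laziness. This will
// slow the crawl down considerably.
package main

import (
    "io/ioutil"
    "log"
    "net/http"
    "runtime"

    "github.com/moovweb/gokogiri"
)

func main() {
    urls := []string{"http://example.com/"} // stand-in for the 42k-URL list
    for _, u := range urls {
        resp, err := http.Get(u)
        if err != nil {
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            continue
        }
        doc, err := gokogiri.ParseHtml(body)
        if err != nil {
            log.Println("parse error:", err)
            continue
        }
        doc.Free()   // release libxml memory for this document
        runtime.GC() // force a full collection after every page
    }
}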

@JakeAustwick
Author

Hmm, I seem to remember my memory going much higher. I'll give it another run and let you know.

@jwarzech

@JakeAustwick @mdayaram Did you find a solution to this? I'm looking at doing something similar with Gokogiri and have run into a similar issue with other libs in the past.

@JakeAustwick
Author

I never continued with this project in Go, I simply rewrote it in Python with gevent and lxml. Sorry.

@mdayaram
Contributor

Unfortunately I was never able to replicate the issue. One thing that we never clarified was what operating system @JakeAustwick was running on. I ran all my tests on Linux/Ubuntu; however, I know that there are some memory issues on Windows. Also, these issues could've all disappeared with the introduction of Go 1.2.

As far as I can tell, it seemed specific to the environment. My advice for you @jwarzech would be to run the test program in this issue with the URLs and see how your memory is handled. If it's doing OK (around the 400MBish range), then I wouldn't be concerned with it.

Here at moovweb we have several production boxes constantly parsing HTML pages using gokogiri without any major memory issues, but all of those boxes are also running Linux and go1.2.

@JakeAustwick
Author

Just for reference, I was running on Ubuntu 12.04, with my version of libxml installed from the Ubuntu repos.

@kpx-dev

kpx-dev commented Mar 21, 2014

Would implementing this inside the lib help? http://golang.org/pkg/runtime/#SetFinalizer
I'm using valgrind to test, and it seems to help so far.

@mdayaram
Contributor

SetFinalizer would only help if explicitly calling the Free helper methods helps alleviate the problems. However, from what I can tell, @JakeAustwick attempted that and still got memory issues which I could not reproduce =(

We did have an implementation of SetFinalizer a while back but it was decided that it was better form to provide explicit freeing of objects to give more power to the programmer as to when memory should be cleaned up. Unfortunately I was not part of those early conversations so I don't know the specifics as to why the decision was made.
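
For readers unfamiliar with the API being discussed, the sketch below is purely illustrative and is not gokogiri's implementation; the managedDoc wrapper and parseManaged helper are invented here to show how a finalizer could tie Free to garbage collection, and why that is less deterministic than calling Free explicitly.

// Hypothetical illustration only; managedDoc and parseManaged are not part of
// gokogiri's API.
package main

import (
    "runtime"

    "github.com/moovweb/gokogiri"
    "github.com/moovweb/gokogiri/html"
)

type managedDoc struct {
    *html.HtmlDocument
}

func parseManaged(page []byte) (*managedDoc, error) {
    doc, err := gokogiri.ParseHtml(page)
    if err != nil {
        return nil, err
    }
    m := &managedDoc{doc}
    // When m becomes unreachable, the GC will eventually run this finalizer
    // and free the underlying libxml document. "Eventually" is the catch:
    // finalizers run at the collector's discretion, so memory is released far
    // less predictably than with an explicit doc.Free().
    runtime.SetFinalizer(m, func(d *managedDoc) { d.HtmlDocument.Free() })
    return m, nil
}

func main() {
    if _, err := parseManaged([]byte("<html><body></body></html>")); err != nil {
        panic(err)
    }
}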

@pkieltyka

Hello, I was wondering if this project is still active? It's been 4 months since any commits, with an open memory leak, which is a pretty scary thing for server use. Either way, thank you for this project; it's great work.

@jbowtie
Contributor

jbowtie commented May 8, 2014

No one has been able to reproduce this memory leak (it should probably be closed by @mdayaram for that reason).

@akhleung
Contributor

akhleung commented May 8, 2014

If it can't be reproduced, we'll close it.

@akhleung akhleung closed this as completed May 8, 2014
@weil

weil commented Jan 17, 2015

Hi @mdayaram, I had the same issue using "doc, err := gokogiri.ParseHtml(page)" on my Mac Pro with go version go1.4 darwin/amd64.

I did a heavy test parsing more than 46557 websites. Memory usage kept climbing to 120GB without declining; then the process got killed and my machine ran out of memory.

Note: I simply used gokogiri.ParseHtml(page). I tried running runtime.GC() every minute, but it did not help.

@akhleung
Contributor

This still appears to be popping up intermittently, so I'll take a look this weekend and see if I can turn up more info. Reopening for now.

@akhleung akhleung reopened this Jan 18, 2015
@weil

weil commented Jan 19, 2015

@akhleung @JakeAustwick
I just tried the following code, parsing the same URL again and again. Memory usage keeps going up to a few GB within a minute.

url = "http://www.bestbuy.com/site/withings-pulse-o2-tracker-black/5686019.p?id=1219148342760&skuId=5686019" 

resp, _ := http.Get(url)
page, _ := ioutil.ReadAll(resp.Body)

for {
    gokogiri.ParseHtml(page)
    runtime.GC()
}

@akhleung
Contributor

Thanks for the test case! I'll try it out.

@akhleung
Contributor

@weil I ran your example, and it did indeed leak very badly. However, when I modified the loop to assign the document to a variable and call its Free() method, the memory usage stabilized. Can you try that and see if it solves the problem? For reference, here's what I ran:

package main

import (
  "io/ioutil"
  "net/http"
  "runtime"

  "github.com/moovweb/gokogiri"
)

func main() {
  url := "http://www.bestbuy.com/site/withings-pulse-o2-tracker-black/5686019.p?id=1219148342760&skuId=5686019" 

  resp, _ := http.Get(url)
  page, _ := ioutil.ReadAll(resp.Body)

  for {
      doc, _ := gokogiri.ParseHtml(page)
      doc.Free()
      runtime.GC()
  }
}

@jbowtie
Contributor

jbowtie commented Jan 19, 2015

If this indeed fixes the leak, the documentation and example for ParseHtml (and ParseXml) should be updated to include a defer doc.Free() call immediately after the call to ParseHtml/ParseXml. This will make it clearer that freeing the returned document needs to be explicitly managed by the programmer.
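
Something like the following (sketched here rather than copied from the gokogiri docs) is presumably the pattern such documentation would show:

// Sketch of the suggested usage pattern: free the document via defer right
// after a successful parse so every return path releases libxml's memory.
package main

import (
    "log"

    "github.com/moovweb/gokogiri"
)

func main() {
    doc, err := gokogiri.ParseHtml([]byte("<html><body><form></form></body></html>"))
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Free() // explicit cleanup; the garbage collector will not do this

    forms, err := doc.Search("//form")
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("found %d form(s)", len(forms))
}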

@hashbender

I'm seeing this as well and can reproduce it with @weil's code.

eonpatapon added a commit to eonpatapon/contrail-introspect-cli that referenced this issue Mar 28, 2017
@rafgugi

rafgugi commented May 29, 2020

Hi, is this issue still happening?

I'm new to Go and I want to use this library to process some small HTML pages.
