Memory Leak #36

Open
JakeAustwick opened this issue Jul 22, 2013 · 28 comments

@JakeAustwick

Hey, I'm new to Go so please excuse me if the problem is with my code.

When I run the following code, the memory just keeps growing and growing, as if it is being leaked somewhere. The leak is slow: with a list of 42k URLs, memory just keeps climbing and climbing. You should be able to reproduce it with this URL list: https://gist.github.com/JakeAustwick/82c9d4ce300639a4d275/raw/368c41ce6ba95f03cbc25a188dd3c07646a068b0/gistfile1.txt

Can you spot what I'm doing wrong, or have I found a bug?

package main

// import "github.com/hoisie/redis"
// import "code.google.com/p/go.net/html"
// import "github.com/davecheney/profile"
import "github.com/moovweb/gokogiri"
import "github.com/moovweb/gokogiri/xml"
import "github.com/moovweb/gokogiri/html"
import "github.com/mreiferson/go-httpclient"

// import "github.com/PuerkitoBio/goquery"
import "log"
import "time"

// import "strconv"
import "errors"

// import "net/url"

import "strings"

// import "io"
import "io/ioutil"

// import "bytes"

import "net/http"
import "runtime"

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    WORKER_COUNT := 25

    transport := &httpclient.Transport{
        ConnectTimeout:        3 * time.Second,
        RequestTimeout:        8 * time.Second,
        ResponseHeaderTimeout: 3 * time.Second,
    }
    http.DefaultClient = &http.Client{Transport: transport}
    // QUEUES := []string{"feed"}

    jobChannel := make(chan string)
    for id := 0; id < WORKER_COUNT; id++ {
        go fetchPage(id, jobChannel)
    }
    // Not most efficient way to read file, but isn't gonna use up all
    // available RAM. This isn't the problem.
    content, err := ioutil.ReadFile("urls.txt")
    if err != nil {
        log.Panicln("FILE READ ERROR")
    }
    urls := strings.Split(string(content), "\n")

    for _, url := range urls {
        jobChannel <- url
    }

    // Stop program from ending
    c := make(chan string)
    <-c

}

func fetchPage(id int, jobChannel chan string) {
    for url := range jobChannel {
        resp, err := http.Get(url)
        if err != nil {
            log.Println("http error")
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        // Close the body before any early continue so connections are not leaked.
        resp.Body.Close()
        if err != nil {
            log.Println("read error")
            continue
        }
        doc, err := gokogiri.ParseHtml(body)
        if err != nil {
            log.Println("parse error")
            // Skip this page; doc is not usable (and must not be freed) on error.
            continue
        }

        form, err := BiggestForm(doc)
        if err != nil {
            log.Println(err)
        } else {
            log.Println(form.Name())
        }
        doc.Free()
    }
}

func PostForms(doc *html.HtmlDocument) []xml.Node {
    var forms []xml.Node
    foundForms, err := doc.Search("//form")
    if err != nil {
        return forms
    }
    return foundForms
}

func BiggestForm(doc *html.HtmlDocument) (xml.Node, error) {
    forms := PostForms(doc)
    if len(forms) > 0 {
        high_input_count := 0
        high_index := 0
        for i, f := range forms {
            inputs, _ := f.Search(".//descendant::input")
            textareas, _ := f.Search(".//descendant::textarea")
            fieldCount := len(inputs) + len(textareas)
            if fieldCount > high_input_count {
                high_input_count = fieldCount
                high_index = i
            }
        }
        return forms[high_index], nil
    } else {
        var f xml.Node
        return f, errors.New("No form on the Page")
    }
}

Once again I apologise if it is me doing something wrong.

@mdayaram
Contributor

No worries.
So libxml likes to hold onto memory for things. Once all the workers are done processing, you should try calling the following:

import "github.com/moovweb/gokogiri/help"

help.LibxmlCleanUpParser()

Note that once you call the above function, you can no longer use gokogiri to parse any more documents. This should clean up a lot of memory, but it renders the parser unusable until your program exits.
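
For context, here is a minimal sketch (not taken from this issue, and assuming the worker-pool shape of the original program) of where such a call would go: only after every worker has finished, since parsing is impossible afterwards.

package main

import (
    "sync"

    "github.com/moovweb/gokogiri/help"
)

func main() {
    // Sketch only: mirrors the worker pattern from the original program.
    urls := []string{ /* ... load the 42k URLs here ... */ }
    jobs := make(chan string)
    var wg sync.WaitGroup

    for i := 0; i < 25; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for range jobs {
                // fetch, parse with gokogiri, and Free() each document,
                // as in fetchPage from the original program
            }
        }()
    }

    for _, u := range urls {
        jobs <- u
    }
    close(jobs)
    wg.Wait()

    // One-time, global libxml cleanup; gokogiri cannot parse anything after this.
    help.LibxmlCleanUpParser()
}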

@JakeAustwick
Author

Right, that explains why my program still held onto a few hundred MB after it finished.

So are you saying that a long-running process (parsing tens of thousands of large pages) with gokogiri is not possible, because memory will keep growing?

@mdayaram
Copy link
Contributor

Sorry it took me so long to get back to you. Do you have a copy of the 42k list of URLs you're using? I'd like to replicate the issue if possible to investigate further.

@JakeAustwick
Author

No problem, I linked to the 42K url list in the original post.

@mdayaram
Contributor

woops, sorry, I'm silly =x

@mdayaram
Contributor

Also, what version of go are you using?

@JakeAustwick
Author

go1.1.1 amd64

@mdayaram
Contributor

mdayaram commented Aug 5, 2013

Hey @JakeAustwick, sorry, I haven't had too much time to investigate.

However, I did manage to try the following:

I ran the program you provided against your URL list, and though the memory did keep climbing for a while, it seemed to stabilize around 300-400MB on my machine. I totally understand that that's a rather high memory footprint for what the program is doing, but my intuition is that Go's garbage collection is perhaps not as aggressive as it could be when cleaning up resources.

Have you tried tweaking the runtime garbage collection parameters?
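
One way to experiment with this (a sketch, assuming "parameters" means the GOGC percentage; this is the standard runtime/debug API, nothing gokogiri-specific) is to lower GOGC so collections run after less heap growth:

// Sketch only: make the collector more aggressive by lowering GOGC
// (default 100); equivalent to running the program with GOGC=20.
package main

import "runtime/debug"

func init() {
    debug.SetGCPercent(20) // trigger a collection after ~20% heap growth instead of 100%
}

func main() {}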

@JakeAustwick
Author

I'd accept 300-400MB happily; my problem was that it rose to over 1GB and kept rising.

Did you run through the entire list?

@mdayaram
Contributor

mdayaram commented Aug 6, 2013

So I've noticed the memory jump all the way up to 700MB at some points, but after going through the entire list of URLs you provided, the memory settled down to 430MB, at least on my machine, which makes me feel that it's a garbage collection issue.

You could try forcing the garbage collector to run at each iteration and see how that affects it; however, that'll probably take quite a while to run =/
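
A rough version of that experiment (a sketch reusing the fetch-and-parse shape of the original program, not anything added to gokogiri) might look like this:

// Diagnostic sketch only: force a collection after every parsed page so that
// heap growth reflects live objects rather than collector laziness. This will
// slow the crawl down considerably.
package main

import (
    "io/ioutil"
    "log"
    "net/http"
    "runtime"

    "github.com/moovweb/gokogiri"
)

func main() {
    urls := []string{"http://example.com/"} // stand-in for the 42k-URL list
    for _, u := range urls {
        resp, err := http.Get(u)
        if err != nil {
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            continue
        }
        doc, err := gokogiri.ParseHtml(body)
        if err != nil {
            log.Println("parse error:", err)
            continue
        }
        doc.Free()   // release libxml memory for this document
        runtime.GC() // force a full collection after every page
    }
}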

@JakeAustwick
Author

Hmm, I seem to remember my memory going much higher. I'll give it another run and let you know.

@jwarzech

@JakeAustwick @mdayaram Did you find a solution to this? I'm looking at doing something similar with Gokogiri and have run into a similar issue with other libs in the past.

@JakeAustwick
Author

I never continued with this project in Go, I simply rewrote it in Python with gevent and lxml. Sorry.

@mdayaram
Contributor

Unfortunately I was never able to replicate the issue. One thing that we never clarified was what operating system @JakeAustwick was running on. I ran all my tests on Linux/Ubuntu; however, I know that there are some memory issues on Windows. Also, these issues could've all disappeared with the introduction of Go 1.2.

As far as I can tell, it seemed specific to the environment. My advice for you @jwarzech would be to run the test program in this issue with the URLs and see how your memory is handled. If it's doing OK (around the 400MBish range), then I wouldn't be concerned with it.

Here at moovweb we have several production boxes constantly parsing HTML pages using gokogiri without any major memory issues, but all of those boxes are also running Linux and go1.2.

@JakeAustwick
Author

Just for reference, I was running on Ubuntu 12.04, with my version of libxml installed from the Ubuntu repos.

@kpx-dev

kpx-dev commented Mar 21, 2014

Would implementing this inside the lib help? http://golang.org/pkg/runtime/#SetFinalizer
I'm using valgrind to test, and it seems to help so far.

@mdayaram
Contributor

SetFinalizer would only help if explicitly calling the Free helper methods helps alleviate the problems. However, from what I can tell, @JakeAustwick attempted that and still got memory issues which I could not reproduce =(

We did have an implementation of SetFinalizer a while back but it was decided that it was better form to provide explicit freeing of objects to give more power to the programmer as to when memory should be cleaned up. Unfortunately I was not part of those early conversations so I don't know the specifics as to why the decision was made.
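
For readers unfamiliar with the API being discussed, the sketch below is purely illustrative and is not gokogiri's implementation; the managedDoc wrapper and parseManaged helper are invented here to show how a finalizer could tie Free to garbage collection, and why that is less deterministic than calling Free explicitly.

// Hypothetical illustration only; managedDoc and parseManaged are not part of
// gokogiri's API.
package main

import (
    "runtime"

    "github.com/moovweb/gokogiri"
    "github.com/moovweb/gokogiri/html"
)

type managedDoc struct {
    *html.HtmlDocument
}

func parseManaged(page []byte) (*managedDoc, error) {
    doc, err := gokogiri.ParseHtml(page)
    if err != nil {
        return nil, err
    }
    m := &managedDoc{doc}
    // When m becomes unreachable, the GC will eventually run this finalizer
    // and free the underlying libxml document. "Eventually" is the catch:
    // finalizers run at the collector's discretion, so memory is released far
    // less predictably than with an explicit doc.Free().
    runtime.SetFinalizer(m, func(d *managedDoc) { d.HtmlDocument.Free() })
    return m, nil
}

func main() {
    if _, err := parseManaged([]byte("<html><body></body></html>")); err != nil {
        panic(err)
    }
}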

@pkieltyka

Hello, I was wondering if this project is still active? It's been 4 months since any commits, with an open memory leak, which is a pretty scary thing for server use. Either way, thank you for this project; it's great work.

@jbowtie
Contributor

jbowtie commented May 8, 2014

No one has been able to reproduce this memory leak (it should probably be closed by @mdayaram for that reason).

@akhleung
Contributor

akhleung commented May 8, 2014

If it can't be reproduced, we'll close it.

@akhleung akhleung closed this as completed May 8, 2014
@weil

weil commented Jan 17, 2015

Hi @mdayaram, I had the same issue using "doc, err := gokogiri.ParseHtml(page)" on my Mac Pro with go version go1.4 darwin/amd64.

I did a heavy test parsing more than 46557 websites. Memory usage kept climbing to 120GB without declining; then the process got killed and my machine ran out of memory.

Note: I simply used gokogiri.ParseHtml(page). I tried running runtime.GC() every minute, but it did not help.

@akhleung
Contributor

This still appears to be popping up intermittently, so I'll take a look this weekend and see if I can turn up more info. Reopening for now.

@akhleung akhleung reopened this Jan 18, 2015
@weil

weil commented Jan 19, 2015

@akhleung @JakeAustwick
I just tried the following code, parsing the same URL again and again. Memory usage keeps going up to a few GB within a minute.

url = "http://www.bestbuy.com/site/withings-pulse-o2-tracker-black/5686019.p?id=1219148342760&skuId=5686019" 

resp, _ := http.Get(url)
page, _ := ioutil.ReadAll(resp.Body)

for {
    gokogiri.ParseHtml(page)
    runtime.GC()
}

@akhleung
Contributor

Thanks for the test case! I'll try it out.

@akhleung
Contributor

@weil I ran your example, and it did indeed leak very badly. However, when I modified the loop to assign the document to a variable and call its Free() method, the memory usage stabilized. Can you try that and see if it solves the problem? For reference, here's what I ran:

package main

import (
  "io/ioutil"
  "net/http"
  "runtime"

  "github.com/moovweb/gokogiri"
)

func main() {
  url := "http://www.bestbuy.com/site/withings-pulse-o2-tracker-black/5686019.p?id=1219148342760&skuId=5686019" 

  resp, _ := http.Get(url)
  page, _ := ioutil.ReadAll(resp.Body)

  for {
      doc, _ := gokogiri.ParseHtml(page)
      doc.Free()
      runtime.GC()
  }
}

@jbowtie
Contributor

jbowtie commented Jan 19, 2015

If this indeed fixes the leak, the documentation and example for ParseHtml (and ParseXml) should be updated to include a defer doc.Free() call immediately after the call to ParseHtml/ParseXml. This will make it clearer that freeing the returned document needs to be explicitly managed by the programmer.
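
Something like the following (sketched here rather than copied from the gokogiri docs) is presumably the pattern such documentation would show:

// Sketch of the suggested usage pattern: free the document via defer right
// after a successful parse so every return path releases libxml's memory.
package main

import (
    "log"

    "github.com/moovweb/gokogiri"
)

func main() {
    doc, err := gokogiri.ParseHtml([]byte("<html><body><form></form></body></html>"))
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Free() // explicit cleanup; the garbage collector will not do this

    forms, err := doc.Search("//form")
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("found %d form(s)", len(forms))
}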

@hashbender

I'm seeing this as well and can reproduce it with @weil's code.

eonpatapon added a commit to eonpatapon/contrail-introspect-cli that referenced this issue Mar 28, 2017
@rafgugi

rafgugi commented May 29, 2020

Hi, is this issue still happening?

I'm new to Go and I want to use this library to process some small HTML pages.
