Consider working around feedparser's issues #265

Open
lemon24 opened this issue Nov 18, 2021 · 5 comments

lemon24 (Owner) commented Nov 18, 2021

feedparser has some issues I would like to solve for reader.

Can we work around them by re-implementing parse()?

lemon24 (Owner) commented Nov 21, 2021

https://gist.github.com/lemon24/10ae478fafb8fc1cb091f04e0ceec03f

Done so far:

  • got rid of memory issues
  • fixed the summary/content order bug

Using an alternate SAX parser is just a matter of exposing the global list as an argument to parse(), so it doesn't need a proof of concept.
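
A minimal sketch of what that could look like, assuming the global in question is PREFERRED_XML_PARSERS in feedparser.api (the list of SAX driver names handed to xml.sax.make_parser()) and that parse() looks it up at call time; the parse() wrapper below is hypothetical, not reader's or feedparser's actual API:

import feedparser
import feedparser.api


def parse(url, preferred_xml_parsers=None, **kwargs):
    """Like feedparser.parse(), but with the SAX driver list as an argument.

    Rebinding a module-level global is not thread-safe; a real
    implementation would pass the list down to xml.sax.make_parser() instead.
    """
    if preferred_xml_parsers is None:
        return feedparser.parse(url, **kwargs)
    old = feedparser.api.PREFERRED_XML_PARSERS
    feedparser.api.PREFERRED_XML_PARSERS = list(preferred_xml_parsers)
    try:
        return feedparser.parse(url, **kwargs)
    finally:
        feedparser.api.PREFERRED_XML_PARSERS = old


# e.g. prefer the libxml2 SAX driver:
# result = parse("https://example.com/feed.xml", preferred_xml_parsers=["drv_libxml2"])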

lemon24 (Owner) commented Nov 29, 2021

The plan:

  • fork feedparser (temporarily), and incorporate all 3 changes
  • cut pull requests for the changes, and a meta-issue explaining them (the encoding stuff, especially)
  • vendor the fork with reader until the changes get merged upstream

I initially wanted to wrap feedparser (like in the gist above) and maybe publish that on PyPI. However, forking is wiser, because it increases the likelihood of actually getting the changes upstream, even if that takes 6-12 months.

lemon24 (Owner) commented Dec 18, 2021

feedparser meta-issue: kurtmckee/feedparser#296

lemon24 (Owner) commented Jan 29, 2022

feedparser PR for reducing memory usage: kurtmckee/feedparser#302

lemon24 (Owner) commented Feb 7, 2022

maxrss (in MiB) for update_feeds() (1 worker), before/after ea64a42, on my database (~160 feeds), with all the feeds stale; fn() below computes the percent reduction, optionally relative to a baseline:

>>> def fn(before, after, base=0):
...     return (1 - (after-base) / (before-base)) * 100
... 
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(76.8, 62.6)
18.489583333333325
>>> fn(76.8, 62.6, 28.8)
29.58333333333334
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(76.5, 57.9)
24.31372549019608
>>> fn(76.5, 57.9, 29.75)
39.7860962566845

Same, with all the feeds up-to-date:

>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(70.0, 55.8)
20.285714285714285
>>> fn(70.0, 55.8, 28.9)
34.54987834549878
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(66.3, 52.3)
21.11613876319759
>>> fn(66.3, 52.3, 30.0)
38.56749311294766

Memory measurements were obtained with this script:
import sys, time, resource

def get_maxrss():
    # ru_maxrss is reported in bytes on macOS and in kibibytes on Linux;
    # convert to MiB either way.
    return (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    )

print(f"before import: {get_maxrss():.1f}")

from reader import make_reader

print(f"before make_reader(): {get_maxrss():.1f}")

reader = make_reader("db.sqlite")

print(f"before update_feeds(): {get_maxrss():.1f}")

start = time.perf_counter()
reader.update_feeds()
end = time.perf_counter()
timing = end - start

print(f"after update_feeds(): {get_maxrss():.1f} (took {timing:.1f}s)")

lemon24 added a commit that referenced this issue Feb 7, 2022