-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider working around feedparser's issues #265
Comments
https://gist.github.com/lemon24/10ae478fafb8fc1cb091f04e0ceec03f Done so far:
Using an alternate SAX parser is just a matter of exposing the global list as an argument to parse(), so it doesn't need a proof of concept. |
The plan:
I initially wanted to wrap feedparser (like in the gist above), and maybe publish that on PyPI. However, forking is wiser, because it increases the likelihood of actually getting the changes upstream, even if it happens in 6-12 months. |
feedparser meta-issue: kurtmckee/feedparser#296 |
feedparser PR for reducing memory usage: kurtmckee/feedparser#302 |
maxrss for update_feeds() (1 worker), before/after ea64a42, on my database (~160 feeds), with all the feeds stale: >>> def fn(before, after, base=0):
... return (1 - (after-base) / (before-base)) * 100
...
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(76.8, 62.6)
18.489583333333325
>>> fn(76.8, 62.6, 28.8)
29.58333333333334
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(76.5, 57.9)
24.31372549019608
>>> fn(76.5, 57.9, 29.75)
39.7860962566845 Same, with all the feeds up-to-date: >>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(70.0, 55.8)
20.285714285714285
>>> fn(70.0, 55.8, 28.9)
34.54987834549878
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(66.3, 52.3)
21.11613876319759
>>> fn(66.3, 52.3, 30.0)
38.56749311294766 Memory measurements obtained with this script:import sys, time, resource
def get_maxrss():
return (
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
/ 2 ** (20 if sys.platform == 'darwin' else 10)
)
print(f"before import: {get_maxrss():.1f}")
from reader import make_reader
print(f"before make_reader(): {get_maxrss():.1f}")
reader = make_reader("db.sqlite")
print(f"before update_feeds(): {get_maxrss():.1f}")
start = time.perf_counter()
reader.update_feeds()
end = time.perf_counter()
timing = end - start
print(f"after update_feeds(): {get_maxrss():.1f} (took {timing:.1f}s)") |
feedparser has some issues I would like to solve for reader:
Can we work around them by re-implementing parse()?
The text was updated successfully, but these errors were encountered: