Scraping fails due to metadata changes #29

Open
andreweacott opened this issue Jul 9, 2019 · 3 comments

@andreweacott

Found in version 0.1.5

As of March 2019, scraping presentations no longer works due to format changes in the presentation HTML page.

Traceback (most recent call last):
  File "/usr/local/bin/infoqscraper", line 33, in <module>
    sys.exit(main.main())
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 374, in main
    return module.main(infoq_client, args.module_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 194, in main
    return command.main(infoq_client, args.command_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 314, in main
    builder.create_presentation()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 82, in create_presentation
    video = self.download_video()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 103, in download_video
    rvideo_path = self.presentation.metadata['video_path']
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 171, in metadata
    'title': get_title(pres_div),
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 91, in get_title
    return pres_div.find('h1', class_="general").div.get_text().strip()
AttributeError: 'NoneType' object has no attribute 'find'

In fact, the fields that scrap.py fails to parse here (such as the title) are descriptive metadata that the main application does not use. Removing them allows presentations to be downloaded correctly.
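An alternative to removing the fields would be to make them optional, so that a page layout change yields a missing value instead of a crash. The following is only a minimal sketch of that idea, not the fix on the fork and not infoqscraper's actual code: the body of `get_title` is copied from the traceback above, while `optional_field` and `build_metadata` are hypothetical names introduced for illustration.

```python
def optional_field(extractor, node):
    """Return extractor(node), or None when the expected markup is missing."""
    try:
        return extractor(node)
    except AttributeError:
        # Raised when node (or an element looked up inside it) is None,
        # which is what the traceback above shows.
        return None


def get_title(pres_div):
    # Same lookup as the line quoted in the traceback.
    return pres_div.find('h1', class_="general").div.get_text().strip()


def build_metadata(pres_div, video_path):
    # Hypothetical helper: the title is optional, the video path is required.
    return {
        'title': optional_field(get_title, pres_div),
        'video_path': video_path,
    }
```

With this shape, `build_metadata(None, path)` returns a dictionary whose `'title'` is `None` while `'video_path'` stays usable, so the download step would not be blocked by a failed title lookup.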

@cykl
Owner

cykl commented Jul 9, 2019

Thanks for the report (I haven't used infoqscraper for a while). I'm kind of busy these days, but I will fix that.

@andreweacott
Author

> Thanks for the report (I haven't used infoqscraper for a while). I'm kind of busy these days, but I will fix that.

No problem. I have a 'fix' (it just removes the unused metadata fields) on my fork (https://github.com/andreweacott/infoqscraper/tree/bugfix/resolve_scraper_failure), but I haven't been able to get the tests to complete, so I didn't want to raise a PR. The fixed app works for me, though.

@skrinakron

Even with the fixed fork by @andreweacott I keep getting the following error:

> ~/.local/bin/infoqscraper presentation download soa-without-esb
Failed to create presentation soa-without-esb.avi: Failed to download video at rtmpe://video.infoq.com/cfx/st/: rtmpdump exited with -11.
	Output:
b''

This happens with both older videos (like the one above) and new ones (e.g. work-purpose). Using Gentoo Linux with RTMPDump 2.4 (version dated 2016/12/10) and Python 2.7.15 / 3.6.5 (not sure which this program runs on). The outcome seems a little weird as rtmpdump's source only ever seems to exit with 0, 1, 2 or 3 (i.e. one of the RD_* constants), and infoqscraper's subprocess.check_output call should pass the child's exit code as-is. I'm not a Python person, but it seems the forked infoqscraper invokes rtmpdump here with the equivalent of

rtmpdump -q -e -r rtmpe://video.infoq.com/cfx/st/ -y mp4:presentations/qcon08-howbigismybus.mp4 -o temp_video.avi

which I can only get to return 1 – it fails to get the last keyframe and closes the connection. If I omit the -e flag, it connects and handshakes but then invariably segfaults with a resulting exit code of 139.
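A note on the two numbers: on POSIX systems, Python's subprocess module reports a child that was killed by signal N as return code -N, while a shell reports the same event as 128 + N. So the -11 in the error message and the 139 seen on the command line may both describe a SIGSEGV (signal 11) rather than a value returned by rtmpdump itself. The sketch below illustrates the mapping; `describe_returncode` and `run_and_describe` are hypothetical helpers, not part of infoqscraper, and the code assumes Python 3.5 or later for `signal.Signals`.

```python
import signal
import subprocess


def describe_returncode(returncode):
    """Explain a return code as reported by Python's subprocess module."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        signum = -returncode
        name = signal.Signals(signum).name
        # A POSIX shell reports the same termination as 128 + N.
        return "killed by %s (a shell would show %d)" % (name, 128 + signum)
    return "exited normally with status %d" % returncode


def run_and_describe(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    return describe_returncode(subprocess.call(cmd))


print(describe_returncode(-11))
print(describe_returncode(1))
```

On Linux, the first call reports SIGSEGV and a shell status of 139, which would be consistent with rtmpdump crashing inside infoqscraper in the same way it does when run by hand without -e.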

Sorry for just dumping this here, but there's no issue tracker for the fork and this is my first time using rtmpdump directly. Do you have any idea if my problem is in infoqscraper, in my version of rtmpdump itself or perhaps some misunderstanding?
