Scraping fails due to metadata changes #29

andreweacott · 2019-07-09T20:55:47Z

Found in version 0.1.5

As of March 2019, scraping presentations no longer works due to format changes in the presentation HTML page.

Traceback (most recent call last):
  File "/usr/local/bin/infoqscraper", line 33, in <module>
    sys.exit(main.main())
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 374, in main
    return module.main(infoq_client, args.module_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 194, in main
    return command.main(infoq_client, args.command_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 314, in main
    builder.create_presentation()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 82, in create_presentation
    video = self.download_video()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 103, in download_video
    rvideo_path = self.presentation.metadata['video_path']
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 171, in metadata
    'title': get_title(pres_div),
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 91, in get_title
    return pres_div.find('h1', class_="general").div.get_text().strip()
AttributeError: 'NoneType' object has no attribute 'find'

In fact, the fields that scrap.py fails to find (such as the title) are descriptive metadata that the main application does not use. Removing them allows the presentation to be grabbed correctly.
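For illustration only, here is a more defensive alternative to deleting the fields: wrap each optional lookup so that a layout change yields None instead of an AttributeError. This is a minimal sketch, not the fix on the fork and not infoqscraper's actual code. `safe_field` and `build_metadata` are hypothetical names, `get_title` copies the one lookup visible in the traceback, and `video_path` is passed in as a stand-in because the real extraction of that field is not shown in this report.

```python
def safe_field(extractor, node, default=None):
    """Run extractor(node), returning default if the expected markup is missing.

    Hypothetical helper: not part of infoqscraper.
    """
    if node is None:
        return default
    try:
        return extractor(node)
    except AttributeError:
        # A chained lookup hit None because the page layout changed.
        return default


def get_title(pres_div):
    # Same lookup as the line shown in the traceback.
    return pres_div.find('h1', class_="general").div.get_text().strip()


def build_metadata(pres_div, video_path):
    """Assemble metadata so that optional fields cannot abort the download.

    Hypothetical function; video_path is a stand-in for the value the
    real scraper extracts from the page.
    """
    return {
        'title': safe_field(get_title, pres_div),  # optional, may be None
        'video_path': video_path,                  # required by the downloader
    }
```

With this shape, a page where the presentation div is no longer found gives a metadata dict whose title is None, while the field the downloader needs is still present.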


cykl · 2019-07-09T21:30:52Z

Thanks for the report (I haven't used infoqscraper for a while). I'm kind of busy these days, but I will fix that.

andreweacott · 2019-07-09T21:33:17Z

> Thanks for the report (I haven't used infoqscraper for a while). I'm kind of busy these days, but I will fix that.

No problem. I have a 'fix' (just removed the unused metadata fields) on my fork (https://github.com/andreweacott/infoqscraper/tree/bugfix/resolve_scraper_failure), but I've not been able to get the tests to complete, so I didn't want to raise a PR. The fixed app works for me though.

skrinakron · 2019-10-26T11:14:25Z

Even with the fixed fork by @andreweacott I keep getting the following error:

> ~/.local/bin/infoqscraper presentation download soa-without-esb
Failed to create presentation soa-without-esb.avi: Failed to download video at rtmpe://video.infoq.com/cfx/st/: rtmpdump exited with -11.
	Output:
b''

This happens with both older videos (like the one above) and new ones (e.g. work-purpose). Using Gentoo Linux with RTMPDump 2.4 (version dated 2016/12/10) and Python 2.7.15 / 3.6.5 (not sure which this program runs on). The outcome seems a little weird as rtmpdump's source only ever seems to exit with 0, 1, 2 or 3 (i.e. one of the RD_* constants), and infoqscraper's subprocess.check_output call should pass the child's exit code as-is. I'm not a Python person, but it seems the forked infoqscraper invokes rtmpdump here with the equivalent of

rtmpdump -q -e -r rtmpe://video.infoq.com/cfx/st/ -y mp4:presentations/qcon08-howbigismybus.mp4 -o temp_video.avi

which I can only get to return 1 – it fails to get the last keyframe and closes the connection. If I omit the -e flag, it connects and handshakes but then invariably segfaults with a resulting exit code of 139.

Sorry for just dumping this here, but there's no issue tracker for the fork and this is my first time using rtmpdump directly. Do you have any idea if my problem is in infoqscraper, in my version of rtmpdump itself or perhaps some misunderstanding?
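A note on the two exit codes, with a small sketch: on POSIX systems Python's subprocess module reports a child that was terminated by signal N as return code -N, while a shell reports the same event as 128 + N. Under that reading, "exited with -11" and the shell's 139 would both describe a SIGSEGV in rtmpdump rather than one of its own exit statuses. The snippet below does not call rtmpdump; it uses a stand-in child process that sends SIGSEGV to itself, and `describe_exit` and `run_crashing_child` are hypothetical helper names. It assumes Python 3.5+ on a POSIX system.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        sig = -returncode
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s); a shell would typically report %d" % (
            sig, name, 128 + sig)
    return "exited with status %d" % returncode


def run_crashing_child():
    """Run a stand-in child that dies from SIGSEGV and return its code."""
    child = [
        sys.executable, "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
    ]
    try:
        subprocess.check_output(child, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as exc:
        # check_output raises on any non-zero code, including signal deaths.
        return exc.returncode
    return 0


if __name__ == "__main__":
    print(describe_exit(run_crashing_child()))
```

If this reading is right, the crash happens inside rtmpdump itself, and infoqscraper is only relaying it.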

naxhh mentioned this issue Feb 3, 2020

Hotfix/fix metadata #30

Open

dotlambda mentioned this issue Apr 30, 2021

pythonPackages: migrate away from ffmpeg_3 NixOS/nixpkgs#121257

Merged
