Replies: 9 comments 1 reply
-
The setup you used above should be the same as using a local Chrome instance by adding the path to the executable in the `File path to the Chrome executable` setting.
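For reference, a minimal sketch of what the local-executable form of that setting can look like in Stash's configuration; the `chrome_cdp_path` key name and the Windows path below are assumptions, so check your own config.yml (or the Chrome CDP path field in the UI):

```yaml
# Sketch only: the key name and path are assumptions, not copied from an
# actual install. Pointing the setting at a local executable makes Stash
# launch that browser itself when a scraper needs CDP.
chrome_cdp_path: 'C:\Program Files\Google\Chrome\Application\chrome.exe'
```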
-
I used the "full path" way and my IAFD-2.yml never worked; it consistently returned only "Please Wait" as the result. The difference between remote CDP and full-path CDP is huge.
-
Did you use the Docker instance and the remote CDP before then, or only the path? We have instructions for using remote CDP with a Docker instance; I don't think we mention anywhere that you could also use a local instance with the remote CDP endpoint the way you did...
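For anyone following along, those docs run a headless browser in Docker and point Stash at it over the network. A rough Compose sketch of that idea; only the chromedp/headless-shell image and port 9222 come from the discussion, the service name and tag are placeholders:

```yaml
# Rough sketch, assuming Docker Compose; the "chromecdp" service name is arbitrary.
services:
  chromecdp:
    image: chromedp/headless-shell:latest
    ports:
      - "9222:9222"   # CDP endpoint exposed to the host
```

Stash's CDP setting would then point at that container, e.g. http://localhost:9222/json/version from the host, or the container name in place of localhost if Stash itself runs on the same Docker network.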
-
I never used a Docker instance for Stash or CDP. This is all running on a local Win10 machine. I don't know why anyone would use remote CDP in Docker.
-
That's from Stash's in-app help. Even when using remote CDP it doesn't always work, though; that is how it is used in the chromedp/headless-shell case, by the way.
-
But the result of my IAFD-2.yml is very encouraging! It works every time.
-
Perhaps an old thread, but this also helped in my case. I tried to write a scraper for indexxx.com and couldn't make it work reliably. Without CDP, it only worked about 1 time in 5 (mostly getting 403 from Cloudflare, sometimes a different robot-blocking page). I got similar results with python-requests, and even with plain wget. I tried various headers, but with no luck either. With just "useCDP: true", I couldn't get anything, neither a result nor an error. Although I saw a "chromium" process appear in the system, Stash didn't show any sign of anything, and the log was silent too (for comparison, the ATK.yml scraper worked OK). Only when I switched to the method that @philpw99 described (starting Chromium separately with --remote-debugging-port) did it start to work reliably. Perhaps their bot detection is somehow able to distinguish headless Chrome/Chromium instances from visible ones? By the way, I'm using a VPN, which Cloudflare probably considers suspicious too. I suppose VPN IP + headless browser together give me enough of a score to get classified as a bot.
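For context, "useCDP: true" refers to the driver section of a Stash scraper YAML. A minimal sketch of the shape (the scraper name, URL, and XPath below are placeholders, not the actual indexxx.com scraper):

```yaml
name: example-site
performerByURL:
  - action: scrapeXPath
    url:
      - example.com
    scraper: performerScraper
xPathScrapers:
  performerScraper:
    performer:
      Name: //h1/text()
driver:
  useCDP: true   # render the page through the Chrome DevTools Protocol before scraping
```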
-
I am new to Stash and like it, but I don't really understand scrapers or where/how to add them. I know what they are supposed to do, but I cannot get them to work. @philpw99, would you mind writing detailed instructions on what I need to put into my Stash app? I run Stash on a Synology NAS using Container Manager. Thank you!!
-
Headless Chrome does differ slightly when fingerprinted; that's where projects like undetected-chromedriver come into play. What you're probably experiencing instead is your browser already carrying cookies and passing Cloudflare's anti-bot checks, since you are a real user using Chrome(ium). It might be worthwhile adding to the docs that this approach will have a higher success rate.
-
I noticed that Stash can use a remote address for CDP, like "http://localhost:9222/json/version". The setting was always there, but I never paid any attention to it until now.
A few weeks ago, I created an IAFD scraper with CDP, but it never worked, even though all the settings were correct. I always got only a "Please Wait" response, so this IAFD-2.yml was just left there accumulating dust.
Today I found out that if I run Chrome in remote debugging mode:
`chrome.exe --remote-debugging-port=9222`
and set the CDP setting accordingly:
`http://localhost:9222/json/version`
then my IAFD-2.yml works every time!
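In other words, the same setting sketched earlier in the thread is simply pointed at the remote endpoint instead of an executable (again assuming the `chrome_cdp_path` key name):

```yaml
# Sketch only: when the value is an http(s) URL, Stash connects to an
# already-running browser instead of launching one itself.
chrome_cdp_path: http://localhost:9222/json/version
```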
IAFD-2.zip
This not only opens the gate for IAFD; it also opens the gates to a lot of other websites protected by Cloudflare and paywalls. You can use the remote debugging instance of Chrome to go to the URL you want to scrape. The website will open, and you will pass all the verification, account logins, and human tests. Then you can start scraping that URL in Stash. Only this time Stash will actually get the data, because you have already cleared all the obstacles for it.
Man, I am so excited about this discovery. This means that for most websites we can just use XPath to scrape any page. Cloudflare and other blocking mechanisms won't stop us anymore!