Replies: 9 comments 1 reply
-
The setup you used above should be the same as using a local Chrome instance by adding the path to the executable in the `File path to the Chrome executable` setting.
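For reference, a minimal sketch of what the local-executable form of that setting can look like in Stash's configuration; the `chrome_cdp_path` key name and the Windows path below are assumptions, so check your own config.yml (or the Chrome CDP path field in the UI):

```yaml
# Sketch only: the key name and path are assumptions, not copied from an
# actual install. Pointing the setting at a local executable makes Stash
# launch that browser itself when a scraper needs CDP.
chrome_cdp_path: 'C:\Program Files\Google\Chrome\Application\chrome.exe'
```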
-
I used the "full path" way and my IAFD-2.yml never worked; it consistently returned only "Please Wait" as the result. The difference between remote CDP and full-path CDP is huge.
-
Did you use the Docker instance and the remote CDP before then, or only the path? We have instructions for using remote CDP with a Docker instance; I don't think we mention anywhere that you could also use a local instance with the remote CDP endpoint the way you did...
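For anyone following along, those docs run a headless browser in Docker and point Stash at it over the network. A rough Compose sketch of that idea; only the chromedp/headless-shell image and port 9222 come from the discussion, the service name and tag are placeholders:

```yaml
# Rough sketch, assuming Docker Compose; the "chromecdp" service name is arbitrary.
services:
  chromecdp:
    image: chromedp/headless-shell:latest
    ports:
      - "9222:9222"   # CDP endpoint exposed to the host
```

Stash's CDP setting would then point at that container, e.g. http://localhost:9222/json/version from the host, or the container name in place of localhost if Stash itself runs on the same Docker network.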
-
I never used a Docker instance for Stash or CDP. This is all running on a local Win10 machine. I don't know why anyone would use remote CDP in Docker.
-
That's from Stash's in-app help. Even when using remote CDP it doesn't always work, though; that is how it is used in the chromedp/headless-shell case, by the way.
-
But the result of my IAFD-2.yml is very encouraging! It works every time.
-
Perhaps an old thread, but this also helped in my case. I tried to write a scraper for indexxx.com and couldn't make it work reliably. Without CDP, it only worked about 1 time in 5 (mostly getting 403 from Cloudflare, sometimes a different robot-blocking page). I got similar results with python-requests, and even with plain wget. I tried various headers, but with no luck either. With just "useCDP: true", I couldn't get anything, neither a result nor an error. Although I saw a "chromium" process appear in the system, Stash didn't show any sign of anything, and the log was silent too (for comparison, the ATK.yml scraper worked OK). Only when I switched to the method that @philpw99 described (starting Chromium separately with --remote-debugging-port) did it start to work reliably. Perhaps their bot detection is somehow able to distinguish headless Chrome/Chromium instances from visible ones? By the way, I'm using a VPN, which Cloudflare probably considers suspicious too. I suppose VPN IP + headless browser together give me enough of a score to get classified as a bot.
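For context, "useCDP: true" refers to the driver section of a Stash scraper YAML. A minimal sketch of the shape (the scraper name, URL, and XPath below are placeholders, not the actual indexxx.com scraper):

```yaml
name: example-site
performerByURL:
  - action: scrapeXPath
    url:
      - example.com
    scraper: performerScraper
xPathScrapers:
  performerScraper:
    performer:
      Name: //h1/text()
driver:
  useCDP: true   # render the page through the Chrome DevTools Protocol before scraping
```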
-
I am new to Stash and like it, but I don't really understand scrapers or where/how to add them. I know what they are supposed to do, but I cannot get them to work. @philpw99, would you mind writing detailed instructions on what I need to put into my Stash app? I run Stash on a Synology NAS using Container Manager. Thank you!!
-
Headless Chrome does differ slightly when fingerprinted; that's where projects like undetected-chromedriver come into play. What you're probably experiencing instead is your browser already carrying cookies and passing Cloudflare's anti-bot checks, since you are a real user using Chrome(ium). It might be worthwhile adding to the docs that this approach will have a higher success rate.
-
I noticed that Stash can use a remote address for CDP, like "http://localhost:9222/json/version". The setting was always there, but I never paid any attention to it until now.
A few weeks ago, I created an IAFD scraper with CDP, but it never worked, even though all the settings were correct. I always got only a "Please Wait" response, so this IAFD-2.yml was just left there accumulating dust.
Today I found out that if I run Chrome in remote debugging mode:
`chrome.exe --remote-debugging-port=9222`
and set the CDP setting accordingly:
`http://localhost:9222/json/version`
then my IAFD-2.yml works every time!
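In other words, the same setting sketched earlier in the thread is simply pointed at the remote endpoint instead of an executable (again assuming the `chrome_cdp_path` key name):

```yaml
# Sketch only: when the value is an http(s) URL, Stash connects to an
# already-running browser instead of launching one itself.
chrome_cdp_path: http://localhost:9222/json/version
```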
IAFD-2.zip
This not only opens the gate for IAFD; it also opens the gates to a lot of other websites protected by Cloudflare and paywalls. You can use the remote debugging instance of Chrome to go to the URL you want to scrape. The website will open, and you will pass all the verification, account logins, and human tests. Then you can start scraping that URL in Stash. Only this time Stash will actually get the data, because you have already cleared all the obstacles for it.
Man, I am so excited about this discovery. This means that for most websites we can just use XPath to scrape any page. Cloudflare and other blocking mechanisms won't stop us anymore!