clear_proxy() doesn't work #309

Open
luisgstv opened this issue Jan 26, 2025 · 9 comments
Labels: needs information

Comments

@luisgstv

I’m running a script that opens 5 browser instances asynchronously and runs a function. It starts by using the clear_proxy() function to clear the current proxy and then sets a new one with set_single_proxy(). But I noticed the proxy doesn’t actually change and keeps using the same one. Is there any way to update the proxy without having to close the browser instance and open another one?

I checked out selenium-injector, but is it as undetectable as selenium-driverless? For my project, selenium-driverless has been the only thing that worked without being detected.

Another issue I’m having is with closing the browsers. Since I’m not using a context manager, it gets tricky to handle. Is there a safer or better way to manage this without relying on a context manager?

@kaliiiiiiiiii
Owner

> I’m running a script that opens 5 browser instances asynchronously and runs a function. It starts by using the clear_proxy() function to clear the current proxy and then sets a new one with set_single_proxy(). But I noticed the proxy doesn’t actually change and keeps using the same one. Is there any way to update the proxy without having to close the browser instance and open another one?

Please provide a minimal reproducible script

> I checked out selenium-injector, but is it as undetectable as selenium-driverless? For my project, selenium-driverless has been the only thing that worked without being detected.

Selenium-injector is the old (deprecated) way

> Another issue I’m having is with closing the browsers. Since I’m not using a context manager, it gets tricky to handle. Is there a safer or better way to manage this without relying on a context manager?

Could you provide a minimal reproducible example? Generally, you can use driver = await webdriver.Chrome(options=options) and await driver.quit(). Just make sure to do proper exception handling.
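
For example, a minimal sketch of that pattern (the URL is just a placeholder):

import asyncio
from selenium_driverless import webdriver

async def main():
    driver = await webdriver.Chrome(options=webdriver.ChromeOptions())
    try:
        await driver.get('https://httpbin.org/ip')  # ... actual work goes here ...
    finally:
        # quit even if the work above raised, so no Chrome process leaks
        await driver.quit()

asyncio.run(main())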

@kaliiiiiiiiii added the needs information label Jan 27, 2025
@luisgstv
Author

luisgstv commented Jan 27, 2025

Hi, thank you for your response! I appreciate your help. Below is a simplified version of the code I'm using.
I'm using a rotating proxy, which should assign a different IP for each request. However, with this setup, it seems to always use the same IP.

from selenium_driverless import webdriver
import psutil
import subprocess
import logging
import asyncio
import os
import shutil
import random

PROXY = 'http://user:pass@host:port/'

async def create_drivers(max_workers: int) -> list[webdriver.Chrome]:
    drivers = []
    for _ in range(max_workers):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless=new')
        driver = await webdriver.Chrome(options=options).start_session()
        drivers.append(driver)

    return drivers

def clean_dirs_sync(dirs: list):
    for _dir in dirs:
        while os.path.isdir(_dir):
            shutil.rmtree(_dir, ignore_errors=True)

async def close_drivers(drivers: list[webdriver.Chrome]) -> None:
    for driver in drivers:
        try:
            with open(os.devnull, 'w') as devnull:
                subprocess.run(
                    ['taskkill', '/F', '/PID', str(driver._process.pid), '/T'],
                    stdout=devnull,
                    stderr=devnull,
                    check=True
                )
            logging.info('Finished chrome process')
        except Exception as e:
            logging.error(f'Could not finish chrome process: {e}')
    
        clean_dirs_sync([driver._temp_dir])
        clean_dirs_sync([driver._options.user_data_dir])

async def set_user_agent(driver: webdriver.Chrome):
    versions = ['132', '131', '130', '129', '128', '127', '126', '125', '124']
    version = random.choice(versions)
    ua_data = {
        '132': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36', '132.0.6834.84', '8'],
        '131': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36', '131.0.6778.205', '24'],
        '130': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36', '130.0.6723.92', '99'],
        '129': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36', '129.0.6668.101', '99'],
        '128': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36', '128.0.6613.138', '99'],
        '127': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36', '127.0.6533.132', '99'],
        '126': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36', '126.0.6478.127', '99'],
        '125': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36', '125.0.6422.142', '99'],
        '124': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36', '124.0.6367.119', '99'],
    }

    user_agent, full_version, not_a_brand_version = ua_data[version]
    user_agent_metadata = {
        "brands": [
            {"brand": "Chromium", "version": version},
            {"brand": "Google Chrome", "version": version},
            {"brand": "Not_A Brand", "version": not_a_brand_version}
        ],
        "fullVersionList": [
            {"brand": "Chromium", "version": full_version},
            {"brand": "Google Chrome", "version": full_version},
            {"brand": "Not_A Brand", "version": f"{not_a_brand_version}.0.0.0"}
        ],
        "platform": 'Windows',
        "platformVersion": '10.0.0',
        "architecture": "x86_64",
        "model": "",
        "mobile": False,
        "bitness": "64",
        "wow64": False
    }
    args = {'userAgent': user_agent, "userAgentMetadata": user_agent_metadata}
    await driver.execute_cdp_cmd('Network.setUserAgentOverride', args, timeout=15)

async def fetch_data(driver: webdriver.Chrome, url: str):
    try:
        await driver.clear_proxy()
        await driver.set_single_proxy(PROXY)

        wsizes = [(1920, 1080), (1366, 768), (1280, 720)]
        width, height = random.choice(wsizes)
        await driver.set_window_size(width=width, height=height)
        await driver.normalize_window()

        try:
            await driver.delete_all_cookies()
        except asyncio.TimeoutError:
            logging.warning('Could not delete cookies')

        try:
            await set_user_agent(driver)
        except asyncio.TimeoutError:
            logging.warning('Could not change user agent')

        await driver.get('about:blank')
        await driver.get(url)

        content = await driver.page_source
        print(f"IP: {content.split(':')[-1].split('}')[0].strip()}")

    except Exception as e:
        logging.error(f'Could not fetch data: {e}')

async def process_urls_browser(urls: list, drivers: list[webdriver.Chrome], max_retries: int = 2, max_workers: int = 5, timeout_seconds: int = 30) -> None:
    remaining_urls = urls[:]
    semaphore = asyncio.Semaphore(max_workers) 

    async def process_with_driver(driver, url):
        async with semaphore:
            try:
                await asyncio.wait_for(fetch_data(driver, url), timeout=timeout_seconds)
            except asyncio.TimeoutError:
                logging.warning(f'TimeoutError for {url}')
                return url
            except Exception as e:
                logging.warning(f'Error: {e} While processing URL: {url}')
                return url

    try:
        for attempt in range(max_retries):
            if not remaining_urls:
                break

            logging.info(f'Attempt {attempt + 1} with {len(remaining_urls)} URLs')

            tasks = [
                process_with_driver(driver, url)
                for driver, url in zip(drivers * (len(remaining_urls) // len(drivers) + 1), remaining_urls)
            ]

            results = await asyncio.gather(*tasks, return_exceptions=True)

            remaining_urls = [url for url, result in zip(remaining_urls, results) if result]

    except Exception as e:
        logging.error(f'Error while processing URLs: {e}')

    if remaining_urls:
        logging.error(f'Failed to process {len(remaining_urls)} URLs after {max_retries} tries')

async def main():
    drivers = await create_drivers(2)
    urls = ['https://httpbin.org/ip'] * 10
    await process_urls_browser(urls, drivers)
    await close_drivers(drivers)

asyncio.run(main())

Also, is there a better way to change the User-Agent and sec-ch-ua headers?

To initialize the driver, using the approach you provided works perfectly. However, I keep the driver open for several hours while running a function. Previously, when I initialized the driver with start_session() and called quit() to close it, I encountered the following error from subprocess:

TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe', '--no-first-run', '--no-service-autorun', '--disable-auto-reload', ...] timed out after 30 seconds

This came from line 823, in quit, inside selenium_driverless/webdriver.py.

Using your method, is there still a chance I might encounter this error? Or would it handle this situation more effectively?

Thanks again for your guidance!

@kaliiiiiiiiii
Owner

> Also, is there a better way to change the User-Agent and sec-ch-ua headers?

Regarding detection, the best way is not to change them at all in this case. In fact, it's probably better not to attempt to change the fingerprint at all here. There are too many other indicators which, in the end, expose your attempt.

> timed out after 30 seconds

This usually means that chrome crashed

> await driver.clear_proxy()
> await driver.set_single_proxy(PROXY)

Pretty sure only setting it (= overwriting) is enough here?

@luisgstv
Author

> Pretty sure only setting it (= overwriting) is enough here?

No, simply setting (or overwriting) it is not sufficient. Using just those two lines with two drivers, I get the following output:

IP: "107.180.180.156"
IP: "107.180.180.156"
IP: "107.180.180.156"
IP: "45.45.203.12"
IP: "45.45.203.12"
IP: "107.180.180.156"
IP: "107.180.180.156"
IP: "45.45.203.12"
IP: "45.45.203.12"
IP: "45.45.203.12"

> This usually means that chrome crashed

Does this happen because I run too many concurrent instances, or because I run them for an extended period of time? I tested using quit() and got the same error again. Is killing the process by PID and cleaning the temp dir acceptable, or do I need to try something else?

> Regarding detection, the best way is not to change them at all in this case.

Yes, I imagined this would expose my scraping. However, to bypass the PerimeterX protection, rotating user agents has made a significant difference. From my testing so far, without rotating them I start getting detected very quickly, and even when emulating a different user agent I achieve better results than without emulating one.
I just wanted to ask in case there’s a safer and more correct approach than this, as I’m currently trying a variety of different methods to bypass PerimeterX. I’ve even tried adjusting the window size to see if I could achieve better results.

@juhacz

juhacz commented Jan 28, 2025

I have a similar program with the logic below, and it works without a problem. There used to be an issue in an early version of selenium-driverless, but in the latest version it works fine.


while True:
    for proxy in proxys:
        await driver.set_single_proxy(proxy)
        await driver.get(url)
        await driver.clear_proxy()

@luisgstv
Author

This works for me when using multiple proxies. However, I am using a single rotating proxy that changes with every request. When I use this proxy and call clear_proxy(), it correctly removes the proxy, and I revert to using my own IP. However, if I set the same proxy again using set_proxy(), I get the same IP. But if I close the driver and open a new one, I receive a different IP.

Could this issue be related to session handling?

@juhacz

juhacz commented Feb 2, 2025

If I understand correctly, your proxy changes with each request. So it should be enough to set it once, then issue the requests in a loop (e.g. 1000 times), and exit the program once the data is downloaded.
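
E.g. a sketch of that idea (PROXY and the URL list are placeholders):

async def scrape(driver, urls):
    # set the rotating proxy once; the upstream rotation changes the exit IP per request
    await driver.set_single_proxy(PROXY)
    for url in urls:
        await driver.get(url)
        data = await driver.page_source
        # ... process data ...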

@kaliiiiiiiiii
Owner

Well, afaik setting/clearing the proxy might have some race conditions. E.g. changing it doesn't apply immediately, due to some internals in Chromium/Chrome I think. => one would have to verify/wait/poll to confirm the change.

There might be more bugs tho, ofc.
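
A rough sketch of such a poll (the echo endpoint and the crude parsing are just examples, not part of the selenium-driverless API):

import asyncio

async def wait_for_new_ip(driver, old_ip: str, attempts: int = 10, delay: float = 1.0) -> str:
    # poll an IP-echo endpoint until the proxy change becomes visible
    for _ in range(attempts):
        await driver.get('https://httpbin.org/ip')
        content = await driver.page_source
        # crude parse of {"origin": "x.x.x.x"}, same approach as the script above
        ip = content.split(':')[-1].split('}')[0].strip().strip('"')
        if ip != old_ip:
            return ip
        await asyncio.sleep(delay)
    raise TimeoutError('proxy change did not become visible')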

@luisgstv
Author

Got it! I believe the issue might be with the proxy itself. I decided to replace it with regular proxies and implemented a function to rotate them. This way, I was able to switch proxies without any issues. Thanks!
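
The rotation itself ended up being simple, e.g. (simplified sketch; the proxy list is a placeholder):

import itertools

PROXIES = [
    'http://user:pass@host1:port/',
    'http://user:pass@host2:port/',
]
proxy_pool = itertools.cycle(PROXIES)

async def rotate_proxy(driver):
    # set the next proxy from the pool, overwriting the previous one
    await driver.set_single_proxy(next(proxy_pool))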
