clear_proxy() doesn't work #309
Please provide a minimal reproducible script
Selenium-injector is the old (deprecated) way
Could you provide a minimal reproducible example?
Hi, thank you for your response! I appreciate your help. Below is a simplified version of the code I'm using:

```python
from selenium_driverless import webdriver
import psutil
import subprocess
import logging
import asyncio
import os
import shutil
import random

PROXY = 'http://user:pass@host:port/'

async def create_drivers(max_workers: int) -> list[webdriver.Chrome]:
    drivers = []
    for _ in range(max_workers):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless=new')
        driver = await webdriver.Chrome(options=options).start_session()
        drivers.append(driver)
    return drivers

def clean_dirs_sync(dirs: list):
    for _dir in dirs:
        while os.path.isdir(_dir):
            shutil.rmtree(_dir, ignore_errors=True)

async def close_drivers(drivers: list[webdriver.Chrome]) -> None:
    for driver in drivers:
        try:
            with open(os.devnull, 'w') as devnull:
                subprocess.run(
                    ['taskkill', '/F', '/PID', str(driver._process.pid), '/T'],
                    stdout=devnull,
                    stderr=devnull,
                    check=True
                )
            logging.info('Finished chrome process')
        except Exception as e:
            logging.error(f'Could not finish chrome process: {e}')
        clean_dirs_sync([driver._temp_dir])
        clean_dirs_sync([driver._options.user_data_dir])

async def set_user_agent(driver: webdriver.Chrome):
    versions = ['132', '131', '130', '129', '128', '127', '126', '125', '124']
    version = random.choice(versions)
    ua_data = {
        '132': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36', '132.0.6834.84', '8'],
        '131': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36', '131.0.6778.205', '24'],
        '130': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36', '130.0.6723.92', '99'],
        '129': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36', '129.0.6668.101', '99'],
        '128': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36', '128.0.6613.138', '99'],
        '127': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36', '127.0.6533.132', '99'],
        '126': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36', '126.0.6478.127', '99'],
        '125': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36', '125.0.6422.142', '99'],
        '124': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36', '124.0.6367.119', '99'],
    }
    user_agent, full_version, not_a_brand_version = ua_data[version]
    user_agent_metadata = {
        "brands": [
            {"brand": "Chromium", "version": version},
            {"brand": "Google Chrome", "version": version},
            {"brand": "Not_A Brand", "version": not_a_brand_version}
        ],
        "fullVersionList": [
            {"brand": "Chromium", "version": full_version},
            {"brand": "Google Chrome", "version": full_version},
            {"brand": "Not_A Brand", "version": f"{not_a_brand_version}.0.0.0"}
        ],
        "platform": 'Windows',
        "platformVersion": '10.0.0',
        "architecture": "x86_64",
        "model": "",
        "mobile": False,
        "bitness": "64",
        "wow64": False
    }
    args = {'userAgent': user_agent, "userAgentMetadata": user_agent_metadata}
    await driver.execute_cdp_cmd('Network.setUserAgentOverride', args, timeout=15)

async def fetch_data(driver: webdriver.Chrome, url: str):
    try:
        await driver.clear_proxy()
        await driver.set_single_proxy(PROXY)
        wsizes = [(1920, 1080), (1366, 768), (1280, 720)]
        width, height = random.choice(wsizes)
        await driver.set_window_size(width=width, height=height)
        await driver.normalize_window()
        try:
            await driver.delete_all_cookies()
        except asyncio.TimeoutError:
            logging.warning('Could not delete cookies')
        try:
            await set_user_agent(driver)
        except asyncio.TimeoutError:
            logging.warning('Could not change user agent')
        await driver.get('about:blank')
        await driver.get(url)
        content = await driver.page_source
        print(f"IP: {content.split(':')[-1].split('}')[0].strip()}")
    except Exception as e:
        logging.error(f'Could not fetch data: {e}')

async def process_urls_browser(urls: list, drivers: list[webdriver.Chrome],
                               max_retries: int = 2, max_workers: int = 5,
                               timeout_seconds: int = 30) -> None:
    remaining_urls = urls[:]
    semaphore = asyncio.Semaphore(max_workers)

    async def process_with_driver(driver, url):
        async with semaphore:
            try:
                await asyncio.wait_for(fetch_data(driver, url), timeout=timeout_seconds)
            except asyncio.TimeoutError:
                logging.warning(f'TimeoutError for {url}')
                return url
            except Exception as e:
                logging.warning(f'Error: {e} while processing URL: {url}')
                return url

    try:
        for attempt in range(max_retries):
            if not remaining_urls:
                break
            logging.info(f'Attempt {attempt + 1} with {len(remaining_urls)} URLs')
            tasks = [
                process_with_driver(driver, url)
                for driver, url in zip(drivers * (len(remaining_urls) // len(drivers) + 1), remaining_urls)
            ]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            remaining_urls = [url for url, result in zip(remaining_urls, results) if result]
    except Exception as e:
        logging.error(f'Error while processing URLs: {e}')
    if remaining_urls:
        logging.error(f'Failed to process {len(remaining_urls)} URLs after {max_retries} tries')

async def main():
    drivers = await create_drivers(2)
    urls = ['https://httpbin.org/ip'] * 10
    await process_urls_browser(urls, drivers)
    await close_drivers(drivers)

asyncio.run(main())
```

Also, is there a better way to change the User-Agent and sec-ch-ua headers?

To initialize the driver, the approach you provided works perfectly. However, I keep the driver open for several hours while running a function. Previously, when I initialized the driver with start_session() and called quit() to close it, I encountered the following error from subprocess:

```
TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe', '--no-first-run', '--no-service-autorun', '--disable-auto-reload', ...]' timed out after 30 seconds
```

This came from line 823, in quit, inside selenium_driverless/webdriver.py. Using your method, is there still a chance I might encounter this error, or would it handle this situation more effectively? Thanks again for your guidance!
Regarding detection, the best approach is not to change them at all in this case. In fact, it's probably better not to attempt to change the fingerprint here at all. There are too many other indicators that will ultimately expose the attempt.
Pretty sure only setting it (i.e. overwriting) is enough here?
No, simply setting (or overwriting) it is not sufficient. Using just those two lines with two drivers, I get the following output:
Does this happen because I run too many concurrent instances, or because I run them for an extended period of time? I tested using quit() and got the same error again. Is killing the process by PID and cleaning the temp dir acceptable, or should I try something else?
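For reference, a common cross-platform alternative to an unconditional `taskkill /F` is a graceful-then-forced shutdown: ask the process to exit, wait briefly, and only hard-kill if it hangs. This is a minimal stdlib sketch using a stand-in subprocess; it assumes the driver's process handle (e.g. `driver._process`, a private attribute) behaves like a `subprocess.Popen`, which is an assumption, not something confirmed by selenium-driverless docs.

```python
import subprocess
import sys

def stop_process(proc: subprocess.Popen, timeout: float = 10.0) -> None:
    """Ask the process to exit, then force-kill it if it hangs."""
    proc.terminate()              # polite request first (SIGTERM / TerminateProcess)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()               # hard kill as a last resort
        proc.wait()               # reap the process so no zombie is left

# Demo with a stand-in process that would otherwise sleep for a minute.
proc = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
stop_process(proc, timeout=5.0)
print(proc.poll() is not None)  # True: the process has exited
```

Unlike `taskkill /F /T`, plain `terminate()` does not kill child processes, so orphaned renderer processes may still need separate handling.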
Yes, I imagined this would expose my scraping. However, to bypass the PerimeterX protection, rotating user agents has made a significant difference. From my testing so far, without rotating them I start getting detected very quickly. Even when emulating a different user agent, I still achieve better results than without emulating one.
I have a similar program with the logic below that works without issues. There used to be a problem in an early version of selenium-driverless, but in the latest version it works fine.
This works for me when using multiple proxies. However, I am using a single rotating proxy that changes with every request. When I use this proxy and call clear_proxy(), it correctly removes the proxy, and I revert to using my own IP. However, if I set the same proxy again using set_proxy(), I get the same IP. But if I close the driver and open a new one, I receive a different IP. Could this issue be related to session handling?
If I understand correctly, your proxy changes with each request. So it should be enough to set it once, run your requests in a loop (e.g. 1000 times), and exit the program after downloading the data.
Well, AFAIK, setting/clearing the proxy might have some race conditions. E.g. a change doesn't apply immediately, due to something in Chromium/Chrome I think, so you would have to verify/wait/poll to confirm the change. There might be more bugs too, of course.
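The poll-and-confirm idea can be sketched as a small generic helper: keep re-reading a value until it differs from the old one, or time out. This is a minimal sketch, not part of the selenium-driverless API; the `fake_ip` getter below is a stand-in for whatever coroutine you would use to observe the current state (e.g. loading an IP-echo page and parsing the result).

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar('T')

async def wait_for_change(get_value: Callable[[], Awaitable[T]],
                          old_value: T,
                          timeout: float = 10.0,
                          interval: float = 0.5) -> T:
    """Poll get_value() until it returns something other than old_value."""
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        current = await get_value()
        if current != old_value:
            return current            # the change has been confirmed
        if asyncio.get_running_loop().time() >= deadline:
            raise asyncio.TimeoutError('value did not change within timeout')
        await asyncio.sleep(interval)

# Demo: a stand-in getter whose value flips after a few polls.
calls = 0
async def fake_ip() -> str:
    global calls
    calls += 1
    return '1.1.1.1' if calls < 3 else '2.2.2.2'

new_ip = asyncio.run(wait_for_change(fake_ip, '1.1.1.1', timeout=5.0, interval=0.01))
print(new_ip)  # 2.2.2.2
```

In the proxy scenario, you would record the IP before calling set_single_proxy(), then call wait_for_change() with a getter that re-fetches the IP, so the script only proceeds once Chromium has actually applied the new proxy.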
Got it! I believe the issue might be with the proxy itself. I decided to replace it with regular proxies and implemented a function to rotate them. This way, I was able to switch proxies without any issues. Thanks! |
I'm running a script that opens 5 browser instances asynchronously and runs a function. It starts by calling clear_proxy() to clear the proxy and then sets a new one with set_single_proxy(). But I noticed the proxy doesn't actually change and keeps using the same one. Is there any way to update the proxy without having to close the browser instance and open another one?
I checked out selenium-injector, but is it as undetectable as selenium-driverless? For my project, selenium-driverless has been the only thing that worked without being detected.
Another issue I’m having is with closing the browsers. Since I’m not using a context manager, it gets tricky to handle. Is there a safer or better way to manage this without relying on a context manager?
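One way to get context-manager-style cleanup for a dynamic number of drivers is `contextlib.AsyncExitStack`: each driver is registered on the stack, and all of them are closed when the single enclosing `async with` unwinds, even on errors. This is a sketch with a stand-in driver class; it assumes the real driver objects support `async with` (i.e. define `__aenter__`/`__aexit__`), which selenium-driverless would need to confirm.

```python
import asyncio
import contextlib

class FakeDriver:
    """Stand-in for webdriver.Chrome; only models open/close bookkeeping."""
    def __init__(self, name: str):
        self.name = name
        self.closed = False
    async def __aenter__(self):
        return self
    async def __aexit__(self, *exc):
        self.closed = True          # where the real driver would quit Chrome

async def main() -> list[FakeDriver]:
    async with contextlib.AsyncExitStack() as stack:
        # enter_async_context keeps each driver open until the stack unwinds,
        # so one `async with` covers any number of drivers.
        drivers = [await stack.enter_async_context(FakeDriver(f'd{i}'))
                   for i in range(3)]
        # ... do work with the drivers here ...
        return drivers              # __aexit__ runs for every driver on the way out

drivers = asyncio.run(main())
print(all(d.closed for d in drivers))  # True: every driver was closed on exit
```

Compared with killing PIDs manually, this keeps the cleanup in one place and guarantees it runs even when the work in the middle raises.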