This issue was moved to a discussion.

Error: Invalid "proxyUrl" option: only HTTP proxies are currently supported #2580

Closed
1 task
mehrdad-shokri opened this issue Jul 10, 2024 · 1 comment

Comments

@mehrdad-shokri

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

The docs never mention that only HTTP proxies are supported. I think using HTTP proxies is a security risk. Digging deeper, you end up here, in code that Crawlee uses. I think it should support HTTPS proxies as well.
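To make the restriction concrete, here is a minimal sketch of a pre-flight check that rejects any proxy URL whose scheme is not `http:`. The helper names `isHttpProxyUrl` and `assertHttpProxyUrls` are hypothetical; they are not part of Crawlee or proxy-chain, and this is not how either library implements its validation. It only mirrors the behaviour described by the error message in the issue title.

```typescript
// Hypothetical helpers for illustration only; not a Crawlee or proxy-chain API.

/** Returns true when the string parses as a URL with the plain `http:` scheme. */
function isHttpProxyUrl(proxyUrl: string): boolean {
    try {
        return new URL(proxyUrl).protocol === 'http:';
    } catch {
        // Not parseable as a URL at all (for example, a non-numeric port).
        return false;
    }
}

/** Throws on the first proxy URL that is not a plain HTTP proxy. */
function assertHttpProxyUrls(proxyUrls: string[]): void {
    proxyUrls.forEach((proxyUrl, index) => {
        if (!isHttpProxyUrl(proxyUrl)) {
            // Report the index rather than the URL so credentials are not leaked into logs.
            throw new Error(`proxyUrls[${index}] is not an http:// proxy URL`);
        }
    });
}
```

Running such a check before constructing `ProxyConfiguration` would surface an unsupported scheme early, with a message that points at the offending entry.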

Code sample

import { Configuration, Dataset, PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder credentials, host, and port, as given in the report.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://Username:Password@proxyUrl:PORT',
    ],
});

const crawler = new PlaywrightCrawler(
    {
        proxyConfiguration,
        // Use the requestHandler to process each of the crawled pages.
        async requestHandler({ request, page, enqueueLinks, log }) {
            const title = await page.title();
            const content = await page.content();
            log.info(`Title of ${request.loadedUrl} is '${title}'`);
            // Save results as JSON to ./storage/datasets/default
            await Dataset.pushData({ title, url: request.loadedUrl, content });

            // Extract links from the current page
            // and add them to the crawling queue.
            await enqueueLinks();
        },
        maxRequestsPerCrawl: 1,
        maxConcurrency: 20,
        retryOnBlocked: true,
        maxRequestRetries: 10,
    },
    new Configuration({
        persistStorage: false,
        maxUsedCpuRatio: 0.95,
        availableMemoryRatio: 0.5,
    }),
);

// `url` is the start URL; the report does not define it.
await crawler.run([url]);

Package version

crawlee@3.11.0 proxy-chain@2.5.1

Node.js version

v20.10.0 typescript@5.5.2

Operating system

macOS

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@mehrdad-shokri mehrdad-shokri added the bug Something isn't working. label Jul 10, 2024
@fnesveda fnesveda added the t-tooling Issues with this label are in the ownership of the tooling team. label Jul 12, 2024
@barjin
Contributor

barjin commented Jul 17, 2024

Hello - and thank you for your interest in this project.

Can you please provide a reproduction scenario for the issue you are having?

"I think using HTTP proxies is a security risk"

Note that this is not true - if you are connecting to the target server via HTTPS, the traffic is still end-to-end encrypted. With HTTP proxies, this is achieved via the HTTP CONNECT method, which asks the proxy server to open an opaque data tunnel between the client and the target server, through which the encrypted data is transferred. The intermediate proxy server cannot read this data (as it's encrypted).
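As a rough illustration of the tunnel setup, the sketch below builds the plaintext CONNECT request a client would send to an HTTP proxy before starting the TLS handshake with the target. The function name `buildConnectRequest` and the example host and credentials are assumptions made for this example; Crawlee and Playwright handle this step internally. The point is that this request is the only part the proxy can read: the target host and port, plus the proxy credentials. Everything sent after the proxy confirms the tunnel is TLS-encrypted bytes.

```typescript
// Hypothetical helper for illustration; not part of Crawlee, Playwright, or proxy-chain.

/**
 * Builds the plaintext HTTP CONNECT request that asks a proxy to open a
 * tunnel to `targetHost:targetPort`. Basic proxy credentials are optional.
 */
function buildConnectRequest(
    targetHost: string,
    targetPort: number,
    username?: string,
    password?: string,
): string {
    const authority = `${targetHost}:${targetPort}`;
    const lines = [`CONNECT ${authority} HTTP/1.1`, `Host: ${authority}`];

    if (username !== undefined && password !== undefined) {
        // Basic auth: base64("username:password"), visible to the proxy only.
        const token = Buffer.from(`${username}:${password}`).toString('base64');
        lines.push(`Proxy-Authorization: Basic ${token}`);
    }

    // A blank line terminates the request headers.
    return `${lines.join('\r\n')}\r\n\r\n`;
}

// Example: request a tunnel to an HTTPS site through an authenticated proxy.
const connectRequest = buildConnectRequest('example.com', 443, 'user', 'pass');
```

Note that the page path, headers, cookies, and response body of the HTTPS request never appear in this exchange, which is why the proxy cannot inspect them.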

If you are connecting to an HTTP target server (or you decide to fiddle around with the TLS settings - see e.g. comments under this issue), the proxy can indeed act as a MITM and read your traffic - but you really have to want this - it will never happen with the default settings.

@apify apify locked and limited conversation to collaborators Jul 17, 2024
@barjin barjin converted this issue into discussion #2584 Jul 17, 2024
