Error: Invalid "proxyUrl" option: only HTTP proxies are currently supported #2580

mehrdad-shokri · 2024-07-10T18:18:35Z

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

The docs never mention that only HTTP proxies are supported. I think using HTTP proxies is a security risk. Digging deeper, you end up in here, which crawlee uses. I think it should support HTTPS proxies as well.

Code sample

import { Configuration, Dataset, PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://Username:Password@proxyUrl:PORT',
  ],
});

const crawler = new PlaywrightCrawler(
  {
    proxyConfiguration,
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log }) {
      const title = await page.title();
      // Declared with const; the original assigned to an undeclared variable.
      const content = await page.content();
      log.info(`Title of ${request.loadedUrl} is '${title}'`);
      // Save results as JSON to ./storage/datasets/default
      await Dataset.pushData({ title, url: request.loadedUrl, content });

      // Extract links from the current page
      // and add them to the crawling queue.
      await enqueueLinks();
    },
    maxRequestsPerCrawl: 1,
    maxConcurrency: 20,
    retryOnBlocked: true,
    maxRequestRetries: 10,
  },
  new Configuration({
    persistStorage: false,
    maxUsedCpuRatio: 0.95,
    availableMemoryRatio: 0.5,
  }),
);

// `url` is the start URL; it is not defined in the report.
await crawler.run([url]);

Package version

crawlee@3.11.0 proxy-chain@2.5.1

Node.js version

v20.10.0 typescript@5.5.2

Operating system

macOS

Apify platform

Tick me if you encountered this issue on the Apify platform

I have tested this on the `next` release

No response

Other context

No response

barjin · 2024-07-17T11:31:18Z

Hello - and thank you for your interest in this project.

Can you please provide a reproduction scenario for the issue you are having?

"I think using http proxies are a security risk"

Note that this is not true - if you are connecting to the target server via HTTPS, the traffic is still end-to-end encrypted. With HTTP proxies, this is achieved via the HTTP CONNECT method, which creates an opaque data tunnel from the client to the proxy server, through which the encrypted data is transferred. The intermediate proxy server cannot read this data (as it's encrypted).
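
To illustrate the point about CONNECT, here is a minimal sketch of the request a client would send to an HTTP proxy before starting the TLS handshake with the target. `buildConnectRequest` is a hypothetical helper written for this example; it is not part of crawlee or proxy-chain, and real clients typically add more headers (for instance proxy authentication).

```typescript
// Sketch only: builds the plaintext CONNECT request a client sends to an
// HTTP proxy when it wants a tunnel to an HTTPS target.
// `buildConnectRequest` is a hypothetical name, not a crawlee/proxy-chain API.
function buildConnectRequest(targetUrl: string): string {
    const target = new URL(targetUrl);
    // URL.port is empty when the URL uses the scheme's default port.
    const port = target.port || (target.protocol === 'https:' ? '443' : '80');
    const authority = `${target.hostname}:${port}`;

    // Only the host and port appear here. The path, query string, headers
    // and body of the real request are sent later, inside the TLS session
    // that runs through the tunnel.
    return [
        `CONNECT ${authority} HTTP/1.1`,
        `Host: ${authority}`,
        '',
        '',
    ].join('\r\n');
}

const connectRequest = buildConnectRequest('https://example.com/secret/path?token=abc');
console.log(JSON.stringify(connectRequest));
```

In this sketch the proxy learns which host and port the client wants to reach, but the path and query string (`/secret/path?token=abc`) never appear in the plaintext part of the exchange.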

If you are connecting to an HTTP target server (or you decide to fiddle around with the TLS settings - see e.g. comments under this issue), the proxy can indeed act as MITM and read your traffic - but you really have to want this - it will never happen with the default

mehrdad-shokri added the bug Something isn't working. label Jul 10, 2024

fnesveda added the t-tooling Issues with this label are in the ownership of the tooling team. label Jul 12, 2024

B4nan assigned barjin Jul 16, 2024

apify locked and limited conversation to collaborators Jul 17, 2024

barjin converted this issue into discussion #2584 Jul 17, 2024
