
Support for headless Chrome #195

Merged
merged 38 commits on Feb 25, 2018

Conversation


@konklone konklone commented Feb 19, 2018

This builds support for headless Chrome into domain-scan.

It rewrites the third_parties scanner to be in Node, and adds central functionality for a scanner to hand off its scan method to Node. The third_parties scanner now uses puppeteer/Chrome instead of phantomas/phantomjs to load websites and trap outbound third party connections.

When running locally, it uses the awkward-but-functional method of shelling out to Node to execute a base.js file, passing the scanner name (as a string) and its parameters (as a serialized JSON blob) as CLI arguments, and reading the scan results back over STDOUT.
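
For illustration, the Node side of that handoff might look roughly like this (the argument order and the scanner module's scan signature are assumptions for the sketch, not the exact implementation in base.js):

// Hypothetical sketch: node base.js <scanner_name> '<json-options>'
var scannerName = process.argv[2];
var options = JSON.parse(process.argv[3] || '{}');

// Load the named scanner module, e.g. scanners/third_parties.js.
var scanner = require('./scanners/' + scannerName);

scanner.scan(options, function(err, data) {
  if (err) {
    console.error(err);
    process.exit(1);
  }
  // Results travel back to the Python caller over STDOUT as serialized JSON.
  process.stdout.write(JSON.stringify(data));
});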

Before it's merged, this PR will be augmented with Lambda integration, so that the HTTP call from the scanner to Lambda can act as the cross-language communication, and so that the Lambda container can use just one language runtime (necessary to leave room for Chrome to be packaged directly into the sub-50MB zip file).

I also cleaned up the repository somewhat: removed some outdated scripts/ files, moved scanners/utils.py to a new utils/ directory, and moved the known third party services data to its own JSON file that can be accessed by either language.

I'm sure there's a bit of technical debt here, but I think it's important to get this closer to shipping so that it can be used and built on. There's a lot that can be done with headless Chrome beyond third party service analysis.

The third_parties scanner has a few notable TODOs (likely for a future PR):

  • Better approach to timeouts: currently the scanner stops when the page is done loading, which will definitely miss some third parties. One idea: "Wait 2 seconds after the Page.frameStoppedLoading event (with a 5 second timeout after the request was sent just in case the event isn't fired for some reason) and if there are any network connections still open wait another second for a maximum of 5 seconds (so a request can be as quick as 2 seconds to complete or as long as 10 seconds) before it is killed. Just waiting for all network connections to terminate means some pages (especially ones with live support systems) 'never' finish loading." (A rough sketch of this kind of bounded wait appears after this list.)
  • The URL correction from pshtt should also ignore nearby redirects, not just external redirects, so that it's only making requests within its own hostname. (In other words, if something.state.gov redirects to history.state.gov/something.html, skip the scan and assume that we'll get to that when we scan history.state.gov.)
  • Better error handling through third_parties.js and base.js. There are some places where a crash will cause a cascade of PromiseNotResolved exceptions.
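
Roughly, the bounded wait described in the first bullet could be sketched with puppeteer like this (the specific timings, events, and option values are illustrative assumptions, not the final design):

const puppeteer = require('puppeteer');

async function visit(url) {
  const browser = await puppeteer.launch({args: ['--no-sandbox']});
  const page = await browser.newPage();

  // Record every outbound request so third parties are captured even if
  // navigation times out before the page "finishes" loading.
  const requestedUrls = new Set();
  page.on('request', (request) => requestedUrls.add(request.url()));

  try {
    // Hard cap on navigation, then a short grace period for late requests.
    await page.goto(url, {waitUntil: 'networkidle2', timeout: 10000});
    await new Promise((resolve) => setTimeout(resolve, 2000));
  } catch (err) {
    // Timeouts are expected on slow or chatty pages; keep what we have.
  }

  await browser.close();
  return Array.from(requestedUrls);
}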

cc @h-m-f-t @jsf9k @gbinal @micahsaul

google-cloud-bigquery
google-auth-oauthlib

# For Lambda support (only used locally):
boto3

I rearranged requirements.txt to much more clearly delineate which deps are used for which sections of the tool, to aid in creating tailored Lambda packages.


I also removed the requests dependency, shrinking our dependency surface area.

@@ -22,8 +21,7 @@ def init(environment, options):
analytics_path = os.path.join(utils.cache_dir(), "analytics.csv")

try:
response = requests.get(analytics_file)
utils.write(response.text, analytics_path)
utils.download(analytics_file, analytics_path)

The analytics scanner was the only place requests was being used, and it was easy to replace with our other download method.

const chromeOptions = [
// Resolves error:
// error when launch(); No usable sandbox! Update your kernel
'--no-sandbox',

I needed --no-sandbox locally on Debian, not just for Lambda.
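
For context, those options end up being handed to puppeteer's launcher along these lines (the env-based executablePath override is an assumption here, tied to the TODO later in this thread about making the Chrome path overridable):

const puppeteer = require('puppeteer');

async function launchBrowser() {
  return puppeteer.launch({
    // --no-sandbox turned out to be needed locally on Debian, not just in Lambda.
    args: chromeOptions,
    // Assumed override hook; not part of the current code.
    executablePath: process.env.CHROME_PATH || undefined
  });
}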

// TODO: use the Public Suffix List.
var baseDomainFor = (input) => {
return input.split("\.").slice(-2).join("\.");
};

TODO for the future: use the Public Suffix List. (For Lambda, this will probably have to be handled similarly to how it's handled for pshtt.)
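
One possible shape of the PSL-aware version, using the psl npm package purely as an example dependency (not something this PR settles on):

const psl = require('psl');

// PSL-aware replacement for the naive "last two labels" split above.
var baseDomainFor = (input) => {
  // e.g. "history.state.gov" -> "state.gov", "foo.gov.uk" -> "foo.gov.uk"
  return psl.get(input) || input;
};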

str.join(" | ", data['nearby_urls']),
str.join(" | ", data['known_services']),
str.join(" | ", data['unknown_services'])
]]

The URLs and domains should all be URL-encoded, so | should be a tolerable delimiter.

        return str(response, encoding='UTF-8')
    except subprocess.CalledProcessError as exc:
        if exc.returncode in allowed_return_codes:
            return str(exc.stdout, encoding='UTF-8')
        else:
            logging.warn("Error running %s." % (str(command)))
            logging.warn("Error running %s." % (str(exc.output)))
            logging.warn(format_last_exception())

Better error handling for failed shelled-out commands.

konklone commented Feb 19, 2018

TODOs before resolving this PR:

  • Verify and document a clear path to get the right node_modules installed to the right place. I think my local environment ended up in a weird place where they're at utils/headless/node_modules, which is unlikely to be a great solution. And there's no package.json yet!
  • The executable path for Chrome needs to be overrideable and sent into base.js for spin-up.

@konklone

@micahsaul This moves the scripts/a11y directory to become utils/a11y. I'm not sure if this will have ramifications for Pulse, but happy to delay on merging this change until you have a chance to review it.

@konklone konklone requested a review from micahsaul February 19, 2018 03:45
@paulirish

btw one alternative to the manually updated known_services.json could be https://github.com/paulirish/third-party-decode

it's hashed for various privacy reasons (we externalized some internal google data for it). but aside from that it seems like it could work decently well for your purposes.

outside of that, the puppeteer integration is quite nice. well done!

@konklone konklone changed the title from "Support for headless Chrome (running locally)" to "Support for headless Chrome" on Feb 25, 2018
@konklone

I've merged in 13 more commits from the last week, which extend this work to provide full Lambda support for running headless Chrome via puppeteer.

  • This adds some additional scaffolding to the lambda/ directory and build system, contained in the lambda/headless directory, since the build process for Node/Chrome functions is more complicated (and involves a transpilation step, since Lambda supports Node 6.10, which doesn't support native async/await).
  • Lambda functions that involve headless Chrome are now expected to use the node6.10 Lambda runtime, and there is now a lambda/headless/handler.js file that is used as the entry point for Lambda. This allows us to avoid the Python->Node shell-based communication channel entirely in Lambda, and have the Lambda API invocation itself serve as the cross-language communication channel. This means that lambda/headless/handler.js and lambda/handler.py both need to independently meet the same API contract/schema that the scan script expects from all Lambda functions. (A rough sketch of the Node side of that contract appears after this list.)
  • The package.json and node_modules artifacts are now moved to the root of the project, rather than partly in the build directory and partly in the scanners directory.
  • The local/Lambda abstraction split is cleaner now, if a bit more indirect - there's now local_bridge.py and local_bridge.js that both focus solely on the local data handoff between Python and Node.
  • There is still a scanners/headless/base.js that is used as the entry point for all Node-based scanners, and is used by either local_bridge.js (locally) or handler.js (in Lambda). So in a Lambda scan, the path is lambda/headless/handler.js -> scanners/headless/base.js -> scanners/third_parties.js. In a local scan, the path is scanners/headless/local_bridge.py -> scanners/headless/local_bridge.js -> scanners/headless/base.js -> scanners/third_parties.js. It's complex in terms of the number of files involved, but it allows for maximum reuse of code between the local and Lambda paths, while letting each file focus on doing one thing well.
  • Chrome 66's binary is now in the repository. While not ideal, there is some significant version-locking that has to happen to make sure that domain-scan's code, puppeteer, and Chrome are all in sync. Puppeteer version updates sometimes require/expect new versions of Chrome. The Puppeteer team has said they expect this brittleness to go away in the near future so that they can commit to forward-compatibility, but until that happens we have to keep things manually versioned.
  • The third_parties scanner is improved substantially, to support better timeouts, to return data once timeouts have occurred, and to drop data: URIs.
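
As a rough illustration of the shared contract mentioned above, the Node handler might look something like this (the event fields and the base.js scan signature are assumptions for the sketch, not the exact schema):

'use strict';

const base = require('./scanners/headless/base');

// nodejs6.10-style entry point; it has to return the same response shape
// that lambda/handler.py produces for the Python-based scanners.
exports.handler = (event, context, callback) => {
  const options = event.options || {};

  base.scan(event.scanner, event.domain, options, (err, data) => {
    if (err) return callback(err);
    callback(null, {data: data});
  });
};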

The third_parties scanner has been tested in Lambda on a full ~22K-domain scan, and was able to return full results for all of them in 3.5 minutes using a pool of 900 Python workers invoking a rolling window of 900 live Lambda functions, at a cost of $4-5.

That's very impressive to me, and really made this work worthwhile. While this new scaffolding has increased domain-scan's complexity somewhat, I think the ROI supports it, and I look forward to evolving this part of domain-scan to become more powerful and maintainable over time.

I'll add documentation before merging this PR, but the rest (including making the local Chrome path overrideable) I'm planning to defer to a future PR.

"presets": [
["env", {
"targets": {
"node": "6.10"

The .babelrc file in the project root is used during Lambda function transpilation to transform async/await calls into iterators during the function build process. The versioned .js code in this repository, including custom scanner code, can freely use async and await keywords, and local scan execution will use whatever version of Node is local to the scan environment.

@@ -0,0 +1 @@
HeadlessChrome-66.0.3343.0.tar.gz

This is just a symlink, so that the Chrome binary can easily be updated in place without updating build scripts which reference it.

../../node_modules/.bin/babel ../../scanners/$SCANNER_NAME.js --out-file build/scanners/$SCANNER_NAME.js

# Copy in the known services map.
cp ../../utils/known_services.json build/utils/

The above lines (similar to the Python Lambda function build script) are the most brittle, since they will need to be updated whenever any utils code is added, or whenever files relevant to JS-based scan execution are moved or renamed.

--handler lambda_handler.handler \
--runtime nodejs6.10 \
--timeout 300 \
--memory-size 1536

Chrome headless uses more memory and needs more CPU (and so is more expensive) than the simpler Python-based scanners currently in use in domain-scan. These functions default to 1536MB during build, based on multiple suggestions online, as well as my own empirical (anecdotal, not statistically significant) testing aimed at optimizing GB-s pricing/performance.

.on('error', (err) => reject(err))
.pipe(tar.x({
C: setupChromePath,
}))

This part relies on Chrome being packaged as a tarball. It's unpacked live in JavaScript and written out to /tmp inside the function. This might be something where we could consciously let "warm" Lambda function environments reuse an already-unpacked Chrome instead of unpacking it every time.
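
Something along these lines could let warm containers skip the re-extraction (the /tmp path, tarball name, and use of tar's promise-returning file mode are assumptions made to keep the sketch short):

const fs = require('fs');
const tar = require('tar');

const setupChromePath = '/tmp/headless-chrome';        // assumed path
const chromeTarball = './headless-chrome.tar.gz';      // assumed filename

function ensureChrome() {
  // A warm Lambda container keeps /tmp, so reuse an already-unpacked Chrome.
  if (fs.existsSync(setupChromePath)) return Promise.resolve(setupChromePath);

  fs.mkdirSync(setupChromePath);
  // tar.x returns a Promise when given a file and no callback.
  return tar.x({file: chromeTarball, C: setupChromePath})
    .then(() => setupChromePath);
}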

@@ -0,0 +1,25 @@
'use strict';

Since the transpilation step to Node 6.10 for Lambda adds 'use strict'; to the generated files and effectively forces this onto the source files (even if we don't say use strict at the top of them), I'm just taking it as a push to always use use strict in all of our JS files and avoid any issues at transpilation time. This requires, among other things, consistently using var or let and not implicitly creating global variables.
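
A trivial example of the kind of thing this catches at runtime (illustrative only):

'use strict';

async function scan(domain) {
  // results = {};  // would throw a ReferenceError under strict mode:
  //                // assignment to an undeclared (implicitly global) variable
  let results = {}; // fine: declared with let (or var/const)
  results[domain] = true;
  return results;
}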

return callback(null, data);
};

module.exports = {scan: scan}

This base.js file doesn't do a whole lot, since local_bridge.js and handler.js handle a lot of the setup work before executing a specific scanner. But it does handle some things that specific scanners won't have to do themselves, and I suspect it will grow to support a bunch of common kinds of error handling and options management.


// Hook to allow slightly easier debugging.
// TEST_LOCAL=1 ./scanners/headless/local_bridge.js third_parties example.com
if (process.env.TEST_LOCAL) {

In normal use, local_bridge.py sends a serialized JSON blob to pass a dict of data to local_bridge.js. To allow for easier local testing of the local JS code path, this adds a simplistic version of the CLI API that just takes a domain name and assumes a bunch of defaults, if the TEST_LOCAL env var is present.
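
The two entry paths might be distinguished roughly like this (the runScanner helper and the exact defaults are hypothetical, just to show the shape of the hook):

if (process.env.TEST_LOCAL) {
  // Simplified CLI: TEST_LOCAL=1 ./scanners/headless/local_bridge.js third_parties example.com
  var scannerName = process.argv[2];
  var domain = process.argv[3];
  runScanner(scannerName, domain, {/* assumed defaults */});
} else {
  // Normal path: local_bridge.py hands over one serialized JSON blob.
  var environment = JSON.parse(process.argv[2]);
  runScanner(environment.scanner, environment.domain, environment.options);
}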

"CFI Group": [
"cfigroup.com"
]
}

I've added a few more third party services to this JSON file, but I very much would like to start using/contributing to an upstream source.

command,
stderr=subprocess.STDOUT,
shell=False, env=env
)

This improves the error message / stacktrace handling for local scan execution where shelling out is involved.

@konklone

Added documentation in bb517bf and 74ae0e8.
