
Support for headless Chrome #195

Merged
merged 38 commits on Feb 25, 2018

Conversation


@konklone konklone commented Feb 19, 2018

This builds support for headless Chrome into domain-scan.

It rewrites the third_parties scanner to be in Node, and adds central functionality for a scanner to hand off its scan method to Node. The third_parties scanner now uses puppeteer/Chrome instead of phantomas/phantomjs to load websites and trap outbound third party connections.

When running locally, it uses the awkward-but-functional method of shelling out to Node to execute a base.js file, passing the scanner name (as a string) and its parameters (as a serialized JSON blob) as CLI arguments, and reading the scan results back over STDOUT.
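
For illustration, the Node side of that handoff might look roughly like this (the argument order and the scanner module's scan signature are assumptions for the sketch, not the exact implementation in base.js):

// Hypothetical sketch: node base.js <scanner_name> '<json-options>'
var scannerName = process.argv[2];
var options = JSON.parse(process.argv[3] || '{}');

// Load the named scanner module, e.g. scanners/third_parties.js.
var scanner = require('./scanners/' + scannerName);

scanner.scan(options, function(err, data) {
  if (err) {
    console.error(err);
    process.exit(1);
  }
  // Results travel back to the Python caller over STDOUT as serialized JSON.
  process.stdout.write(JSON.stringify(data));
});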

Before it's merged, this PR will be augmented with Lambda integration, so that the HTTP call from the scanner to Lambda can act as the cross-language communication, and so that the Lambda container can use just one language runtime (necessary to leave room for Chrome to be packaged directly into the sub-50MB zip file).

I also cleaned up the repository somewhat: removed some outdated scripts/ files, moved scanners/utils.py to a new utils/ directory, and moved the known third party services data to its own JSON file that can be accessed by either language.

I'm sure there's a bit of technical debt here, but I think it's important to get this closer to shipping so that it can be used and built on. There's a lot that can be done with headless Chrome beyond third party service analysis.

The third_parties scanner has a few notable TODOs (likely for a future PR):

  • Better approach to timeouts: currently the scanner stops when the page is done loading, which will definitely miss some third parties. One idea: "Wait 2 seconds after the Page.frameStoppedLoading event (with a 5 second timeout after the request was sent just in case the event isn't fired for some reason) and if there are any network connections still open wait another second for a maximum of 5 seconds (so a request can be as quick as 2 seconds to complete or as long as 10 seconds) before it is killed. Just waiting for all network connections to terminate means some pages (especially ones with live support systems) 'never' finish loading." (A rough sketch of this kind of bounded wait appears after this list.)
  • The URL correction from pshtt should also ignore nearby redirects, not just external redirects, so that it's only making requests within its own hostname. (In other words, if something.state.gov redirects to history.state.gov/something.html, skip the scan and assume that we'll get to that when we scan history.state.gov.)
  • Better error handling through third_parties.js and base.js. There are some places where a crash will cause a cascade of PromiseNotResolved exceptions.
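
Roughly, the bounded wait described in the first bullet could be sketched with puppeteer like this (the specific timings, events, and option values are illustrative assumptions, not the final design):

const puppeteer = require('puppeteer');

async function visit(url) {
  const browser = await puppeteer.launch({args: ['--no-sandbox']});
  const page = await browser.newPage();

  // Record every outbound request so third parties are captured even if
  // navigation times out before the page "finishes" loading.
  const requestedUrls = new Set();
  page.on('request', (request) => requestedUrls.add(request.url()));

  try {
    // Hard cap on navigation, then a short grace period for late requests.
    await page.goto(url, {waitUntil: 'networkidle2', timeout: 10000});
    await new Promise((resolve) => setTimeout(resolve, 2000));
  } catch (err) {
    // Timeouts are expected on slow or chatty pages; keep what we have.
  }

  await browser.close();
  return Array.from(requestedUrls);
}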

cc @h-m-f-t @jsf9k @gbinal @micahsaul

google-cloud-bigquery
google-auth-oauthlib

# For Lambda support (only used locally):
boto3

I rearranged requirements.txt to much more clearly delineate which deps are used for which sections of the tool, to aid in creating tailored Lambda packages.


I also removed the requests dependency, shrinking our dependency surface area.

@@ -22,8 +21,7 @@ def init(environment, options):
analytics_path = os.path.join(utils.cache_dir(), "analytics.csv")

try:
response = requests.get(analytics_file)
utils.write(response.text, analytics_path)
utils.download(analytics_file, analytics_path)

The analytics scanner was the only place requests was being used, and it was easy to replace with our other download method.

const chromeOptions = [
// Resolves error:
// error when launch(); No usable sandbox! Update your kernel
'--no-sandbox',

I needed --no-sandbox locally on Debian, not just for Lambda.
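
For context, those options end up being handed to puppeteer's launcher along these lines (the env-based executablePath override is an assumption here, tied to the TODO later in this thread about making the Chrome path overridable):

const puppeteer = require('puppeteer');

async function launchBrowser() {
  return puppeteer.launch({
    // --no-sandbox turned out to be needed locally on Debian, not just in Lambda.
    args: chromeOptions,
    // Assumed override hook; not part of the current code.
    executablePath: process.env.CHROME_PATH || undefined
  });
}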

// TODO: use the Public Suffix List.
var baseDomainFor = (input) => {
return input.split("\.").slice(-2).join("\.");
};

TODO for the future: use the Public Suffix List. (For Lambda, this will probably have to be handled similarly to how it's handled for pshtt.)
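
One possible shape of the PSL-aware version, using the psl npm package purely as an example dependency (not something this PR settles on):

const psl = require('psl');

// PSL-aware replacement for the naive "last two labels" split above.
var baseDomainFor = (input) => {
  // e.g. "history.state.gov" -> "state.gov", "foo.gov.uk" -> "foo.gov.uk"
  return psl.get(input) || input;
};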

str.join(" | ", data['nearby_urls']),
str.join(" | ", data['known_services']),
str.join(" | ", data['unknown_services'])
]]

The URLs and domains should all be URL-encoded, so | should be a tolerable delimiter.

        return str(response, encoding='UTF-8')
    except subprocess.CalledProcessError as exc:
        if exc.returncode in allowed_return_codes:
            return str(exc.stdout, encoding='UTF-8')
        else:
            logging.warn("Error running %s." % (str(command)))
            logging.warn("Error running %s." % (str(exc.output)))
            logging.warn(format_last_exception())

Better error handling for failed shelled-out commands.

konklone commented Feb 19, 2018

TODOs before resolving this PR:

  • Verify and document a clear path to get the right node_modules installed to the right place. I think my local environment ended up in a weird place where they're at utils/headless/node_modules, which is unlikely to be a great solution. And there's no package.json yet!
  • The executable path for Chrome needs to be overrideable and sent into base.js for spin-up.

@konklone

@micahsaul This moves the scripts/a11y directory to become utils/a11y. I'm not sure if this will have ramifications for Pulse, but happy to delay on merging this change until you have a chance to review it.

@konklone konklone requested a review from micahsaul February 19, 2018 03:45
@paulirish

btw one alternative to the manually updated known_services.json could be https://github.com/paulirish/third-party-decode

it's hashed for various privacy reasons (we externalized some internal google data for it). but aside from that it seems like it could work decently well for your purposes.

outside of that, the puppeteer integration is quite nice. well done!

@konklone konklone changed the title from "Support for headless Chrome (running locally)" to "Support for headless Chrome" on Feb 25, 2018
@konklone

I've merged in 13 more commits from the last week, which extend this work to provide full Lambda support for running headless Chrome via puppeteer.

  • This adds some additional scaffolding to the lambda/ directory and build system, contained in the lambda/headless directory, since the build process for Node/Chrome functions is more complicated (and involves a transpilation step, since Lambda supports Node 6.10, which doesn't support native async/await).
  • Lambda functions that involve headless Chrome are now expected to use the node6.10 Lambda runtime, and there is now a lambda/headless/handler.js file that is used as the entry point for Lambda. This allows us to avoid the Python->Node shell-based communication channel entirely in Lambda, and have the Lambda API invocation itself serve as the cross-language communication channel. This means that lambda/headless/handler.js and lambda/handler.py both need to independently meet the same API contract/schema that the scan script expects from all Lambda functions. (A rough sketch of the Node side of that contract appears after this list.)
  • The package.json and node_modules artifacts are now moved to the root of the project, rather than partly in the build directory and partly in the scanners directory.
  • The local/Lambda abstraction split is cleaner now, if a bit more indirect - there's now local_bridge.py and local_bridge.js that both focus solely on the local data handoff between Python and Node.
  • There is still a scanners/headless/base.js that is used as the entry point for all Node-based scanners, and is used by either local_bridge.js (locally) or handler.js (in Lambda). So in a Lambda scan, the path is lambda/headless/handler.js -> scanners/headless/base.js -> scanners/third_parties.js. In a local scan, the path is scanners/headless/local_bridge.py -> scanners/headless/local_bridge.js -> scanners/headless/base.js -> scanners/third_parties.js. It's complex in terms of the number of files involved, but it allows for maximum reuse of code between the local and Lambda paths, while letting each file focus on doing one thing well.
  • Chrome 66's binary is now in the repository. While not ideal, there is some significant version-locking that has to happen to make sure that domain-scan's code, puppeteer, and Chrome are all in sync. Puppeteer version updates sometimes require/expect new versions of Chrome. The Puppeteer team has said they expect this brittleness to go away in the near future so that they can commit to forward-compatibility, but until that happens we have to keep things manually versioned.
  • The third_parties scanner is improved substantially, to support better timeouts, to return data once timeouts have occurred, and to drop data: URIs.
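
As a rough illustration of the shared contract mentioned above, the Node handler might look something like this (the event fields and the base.js scan signature are assumptions for the sketch, not the exact schema):

'use strict';

const base = require('./scanners/headless/base');

// nodejs6.10-style entry point; it has to return the same response shape
// that lambda/handler.py produces for the Python-based scanners.
exports.handler = (event, context, callback) => {
  const options = event.options || {};

  base.scan(event.scanner, event.domain, options, (err, data) => {
    if (err) return callback(err);
    callback(null, {data: data});
  });
};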

The third_parties scanner has been tested in Lambda on a full ~22K-domain scan, and was able to return full results for all of them in 3.5 minutes using a pool of 900 Python workers invoking a rolling window of 900 live Lambda functions, at a cost of $4-5.

That's very impressive to me, and really made this work worthwhile. While this new scaffolding has increased domain-scan's complexity somewhat, I think the ROI supports it, and I look forward to evolving this part of domain-scan to become more powerful and maintainable over time.

I'll add documentation before merging this PR, but the rest (including making the local Chrome path overrideable) I'm planning to defer to a future PR.

"presets": [
["env", {
"targets": {
"node": "6.10"

The .babelrc file in the project root is used during Lambda function transpilation to transform async/await calls into iterators during the function build process. The versioned .js code in this repository, including custom scanner code, can freely use async and await keywords, and local scan execution will use whatever version of Node is local to the scan environment.

@@ -0,0 +1 @@
HeadlessChrome-66.0.3343.0.tar.gz

This is just a symlink, so that the Chrome binary can easily be updated in place without updating build scripts which reference it.

../../node_modules/.bin/babel ../../scanners/$SCANNER_NAME.js --out-file build/scanners/$SCANNER_NAME.js

# Copy in the known services map.
cp ../../utils/known_services.json build/utils/

The above lines (similar to the Python Lambda function build script) are the most brittle, since they will need to be updated whenever any utils code is added, or whenever files relevant to JS-based scan execution are moved or renamed.

--handler lambda_handler.handler \
--runtime nodejs6.10 \
--timeout 300 \
--memory-size 1536

Chrome headless uses more memory and needs more CPU (and so is more expensive) than the simpler Python-based scanners currently in use in domain-scan. These functions default to 1536MB during build, based on multiple suggestions online, as well as my own empirical (anecdotal, not statistically significant) testing aimed at optimizing GB-s pricing/performance.

.on('error', (err) => reject(err))
.pipe(tar.x({
C: setupChromePath,
}))

This part relies on Chrome being packaged as a tarball. It's unpacked live in JavaScript and written out to /tmp inside the function. This might be something where we could consciously let "warm" Lambda function environments reuse an already-unpacked Chrome instead of unpacking it every time.
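
Something along these lines could let warm containers skip the re-extraction (the /tmp path, tarball name, and use of tar's promise-returning file mode are assumptions made to keep the sketch short):

const fs = require('fs');
const tar = require('tar');

const setupChromePath = '/tmp/headless-chrome';        // assumed path
const chromeTarball = './headless-chrome.tar.gz';      // assumed filename

function ensureChrome() {
  // A warm Lambda container keeps /tmp, so reuse an already-unpacked Chrome.
  if (fs.existsSync(setupChromePath)) return Promise.resolve(setupChromePath);

  fs.mkdirSync(setupChromePath);
  // tar.x returns a Promise when given a file and no callback.
  return tar.x({file: chromeTarball, C: setupChromePath})
    .then(() => setupChromePath);
}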

@@ -0,0 +1,25 @@
'use strict';

Since the transpilation step to Node 6.10 for Lambda adds 'use strict'; to the generated files and effectively forces this onto the source files (even if we don't say use strict at the top of them), I'm just taking it as a push to always use use strict in all of our JS files and avoid any issues at transpilation time. This requires, among other things, consistently using var or let and not implicitly creating global variables.
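
A trivial example of the kind of thing this catches at runtime (illustrative only):

'use strict';

async function scan(domain) {
  // results = {};  // would throw a ReferenceError under strict mode:
  //                // assignment to an undeclared (implicitly global) variable
  let results = {}; // fine: declared with let (or var/const)
  results[domain] = true;
  return results;
}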

return callback(null, data);
};

module.exports = {scan: scan}

This base.js file doesn't do a whole lot, since local_bridge.js and handler.js handle a lot of the setup work before executing a specific scanner. But it does handle some things that specific scanners won't have to do themselves, and I suspect it will grow to support a bunch of common kinds of error handling and options management.


// Hook to allow slightly easier debugging.
// TEST_LOCAL=1 ./scanners/headless/local_bridge.js third_parties example.com
if (process.env.TEST_LOCAL) {

In normal use, local_bridge.py sends a serialized JSON blob to pass a dict of data to local_bridge.js. To allow for easier local testing of the local JS code path, this adds a simplistic version of the CLI API that just takes a domain name and assumes a bunch of defaults, if the TEST_LOCAL env var is present.
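
The two entry paths might be distinguished roughly like this (the runScanner helper and the exact defaults are hypothetical, just to show the shape of the hook):

if (process.env.TEST_LOCAL) {
  // Simplified CLI: TEST_LOCAL=1 ./scanners/headless/local_bridge.js third_parties example.com
  var scannerName = process.argv[2];
  var domain = process.argv[3];
  runScanner(scannerName, domain, {/* assumed defaults */});
} else {
  // Normal path: local_bridge.py hands over one serialized JSON blob.
  var environment = JSON.parse(process.argv[2]);
  runScanner(environment.scanner, environment.domain, environment.options);
}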

"CFI Group": [
"cfigroup.com"
]
}

I've added a few more third party services to this JSON file, but I very much would like to start using/contributing to an upstream source.

command,
stderr=subprocess.STDOUT,
shell=False, env=env
)

This improves the error message / stacktrace handling for local scan execution where shelling out is involved.

@konklone

Added documentation in bb517bf and 74ae0e8.
