-
Notifications
You must be signed in to change notification settings - Fork 137
Conversation
google-cloud-bigquery | ||
google-auth-oauthlib | ||
|
||
# For Lambda support (only used locally): | ||
boto3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rearranged requirements.txt
to much more clearly delineate which deps are used for which sections of the tool, to aid in creating tailored Lambda packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also removed the requests
dependency, shrinking our dependency surface area.
@@ -22,8 +21,7 @@ def init(environment, options): | |||
analytics_path = os.path.join(utils.cache_dir(), "analytics.csv") | |||
|
|||
try: | |||
response = requests.get(analytics_file) | |||
utils.write(response.text, analytics_path) | |||
utils.download(analytics_file, analytics_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The analytics
scanner was the only place requests
was being used, and it was easy to replace with our other download method.
scanners/headless/base.js
Outdated
const chromeOptions = [ | ||
// Resolves error: | ||
// error when launch(); No usable sandbox! Update your kernel | ||
'--no-sandbox', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed --no-sandbox
locally on Debian, not just for Lambda.
scanners/third_parties.js
Outdated
// TODO: use the Public Suffix List. | ||
var baseDomainFor = (input) => { | ||
return input.split("\.").slice(-2).join("\."); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO for the future: use the Public Suffix List. (For Lambda, this will probably have to be handled similarly to how it's handled for pshtt
.)
str.join(" | ", data['nearby_urls']), | ||
str.join(" | ", data['known_services']), | ||
str.join(" | ", data['unknown_services']) | ||
]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URLs and domains should all be URL-encoded, so |
should be a tolerable delimiter.
return str(response, encoding='UTF-8') | ||
except subprocess.CalledProcessError as exc: | ||
if exc.returncode in allowed_return_codes: | ||
return str(exc.stdout, encoding='UTF-8') | ||
else: | ||
logging.warn("Error running %s." % (str(command))) | ||
logging.warn("Error running %s." % (str(exc.output))) | ||
logging.warn(format_last_exception()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better handle error handling for failed shelled-out commands.
TODOs before resolving this PR:
|
@micahsaul This moves the |
btw one alternative to the manually updated known_services.json could be https://github.com/paulirish/third-party-decode it's hashed for various privacy reasons (we externalized some internal google data for it). but aside from that it seems like it could work decently well for your purposes. outside of that, the puppeteer integration is quite nice. well done! |
I've merged in 13 more commits from the last week, which extend this work to provide full Lambda support for running headless Chrome via puppeteer.
The That's very impressive to me, and really made this work worthwhile. While this new scaffolding has increased domain-scan's complexity somewhat, I think the ROI supports it, and I look forward to evolving this part of domain-scan to become more powerful and maintainable over time. I'll add documentation before merging this PR, but the rest (including making the local Chrome path overrideable) I'm planning to defer to a future PR. |
"presets": [ | ||
["env", { | ||
"targets": { | ||
"node": "6.10" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .babelrc
file in the project root is used during Lambda function transpilation to transform async/await
calls into iterators during the function build process. The versioned .js code in this repository, including custom scanner code, can freely use async
and await
keywords, and local scan execution will use whatever version of Node is local to the scan environment.
@@ -0,0 +1 @@ | |||
HeadlessChrome-66.0.3343.0.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a symlink, so that the Chrome binary can easily be updated in place without updating build scripts which reference it.
../../node_modules/.bin/babel ../../scanners/$SCANNER_NAME.js --out-file build/scanners/$SCANNER_NAME.js | ||
|
||
# Copy in the known services map. | ||
cp ../../utils/known_services.json build/utils/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above lines (similar to the Python Lambda function build script) are the most brittle, since they will need to be updated in case of the addition of any utils code, or the moving/renaming of any relevant files to JS-based scan execution.
--handler lambda_handler.handler \ | ||
--runtime nodejs6.10 \ | ||
--timeout 300 \ | ||
--memory-size 1536 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chrome headless uses more memory and needs more CPU (and so is more expensive) than the simpler Python-based scanners currently at use in domain-scan. These functions default to 1536MB during build, based on multiple suggestions online, as well as my own empirical (anecdotal, not statistically significant) testing at optimizing for GB-s pricing/performance.
.on('error', (err) => reject(err)) | ||
.pipe(tar.x({ | ||
C: setupChromePath, | ||
})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part relies on Chrome being packaged as a tarball. It's unpackaged live in JavaScript and written out to /tmp
inside the function. This might be something where we could consciously let "warm" Lambda function environments reuse an already unpackaged Chrome instead of unpacking it every time.
@@ -0,0 +1,25 @@ | |||
'use strict'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the transpilation step to Node 6.10 for Lambda adds 'use strict';
to the generated files and effectively forces this onto the source files (even if we don't say use strict
at the top of them), I'm just taking it as a push to always use use strict
in all of our JS files and avoid any issues at transpilation time. This requires, among other things, consistently using var
or let
and not implicitly creating global variables.
return callback(null, data); | ||
}; | ||
|
||
module.exports = {scan: scan} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This base.js
file doesn't do a whole lot, since local_bridge.js
and handler.js
handle a lot of the setup work before executing a specific scanner. But it does do some stuff, that specific scanners won't have to do, and I suspect this will grow to support a bunch of common kinds of error handling and options management.
|
||
// Hook to allow slightly easier debugging. | ||
// TEST_LOCAL=1 ./scanners/headless/local_bridge.js third_parties example.com | ||
if (process.env.TEST_LOCAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In normal use, local_bridge.py
sends a serialized JSON blob to pass a dict of data to local_bridge.js
. To allow for easier local testing of the local JS code path, this adds a simplistic version of the CLI API that just takes a domain name and assumes a bunch of defaults, if the TEST_LOCAL
env var is present.
"CFI Group": [ | ||
"cfigroup.com" | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a few more third party services to this JSON file, but I very much would like to start using/contributing to an upstream source.
command, | ||
stderr=subprocess.STDOUT, | ||
shell=False, env=env | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This improves the local error message / stacktrace handling for local scan execution where shelling out is involved.
This builds support for headless Chrome into domain-scan.
It rewrites the
third_parties
scanner to be in Node, and adds central functionality for a scanner to hand off itsscan
method to Node. The third_parties scanner now usespuppeteer
/Chrome instead ofphantomas
/phantomjs
to load websites and trap outbound third party connections.When running locally, it uses the awkward-but-functional method of shelling out to Node to execute a
base.js
file where the scanner name (as a string) and parameters (as a serialized JSON blob) over STDOUT as CLI arguments.This PR will be augmented before it's merged with Lambda integration, so that the HTTP call from the scanner to Lambda can act as the cross-language communication, and so that the Lambda container can use just one language runtime (necessary to leave room for Chrome to be packaged directly into the sub-50MB zip file).
I also cleaned up the repository somewhat to remove some outdated
scripts/
files, movescanners/utils.py
to a newutils/
directory, and moved the known third party services data to its own JSON file that can be accessed by either language.I'm sure there's a bit of technical debt here, but I think it's important to get this closer to shipping so that it can be used and built on. There's a lot that can be done with headless Chrome beyond third party service analysis.
The
third_parties
scanner has a few notable TODOs (likely for a future PR):something.state.gov
redirects tohistory.state.gov/something.html
, skip the scan and assume that we'll get to that when we scanhistory.state.gov
.third_parties.js
andbase.js
. There are some places where a crash will cause a cascade of PromiseNotResolved exceptions.cc @h-m-f-t @jsf9k @gbinal @micahsaul