re_extract.py
is a tool for extracting the AWS re:Invent schedule to a CSV for
easier planning. This project also contains a script called interests.py
which
can take a list of session IDs you're interested in and mark them in the
resulting sessions.txt
file output of the extract process. This is useful for
being able to plan out things that you're interested in ahead of time and then
being able to reextract (no pun intended) the catalog multiple times over to get
the freshest content and not lose the things you were interested in.
re_extract.py
is heavily based on the mda590/reinvent_schedule_extract
repo.
Because that repo hadn't been made to work successfully since re:Invent 2017, I
had enough fixes and usage changes that I felt it warranted more than a fork so
others can discover and help maintain it.
- Use the python version specified in
.python-version
file (recommend using pyenv for this) - Install requirements via
pip install -r requirements.txt
(recommend using venv for this) - Download the appropriate version of the Chrome web driver for your Chrome
browser version into the root of this project directory (Don't worry, it's
.gitignore
'd)- https://sites.google.com/a/chromium.org/chromedriver/downloads
- NOTE for Chrome version higher than 115, go here https://googlechromelabs.github.io/chrome-for-testing/
- Duplicate the
example.env
file as simply.env
and update configuration as appropriate (see configuration section below) - Run the script
The pyenv
and venv
patterns ensure this behaves the same way on your system
as it does on mine which is why I showed that model in the example below. If you
choose to use your system python and non vendored packages you may run into
compatibility issues.
# Clone this repo and change into the directory
git clone <this repo>
cd re_extract
# Download and copy chromedriver-mac-arm64 folder to root of repo
# Go here https://googlechromelabs.github.io/chrome-for-testing/
# Next, you need to override macOS malware alert in order to run it
xattr -d com.apple.quarantine ./chromedriver-mac-arm64/chromedriver
# The .env file from the example and update per the config section below
cp example.env .env
nano .env
# Use pyenv to auto-install the version of python in the .python-version file
pyenv install
# Setup pip package vendoring
python -m venv venv
source venv/bin/activate
# Install/vendor the packages
pip install -r requirements.txt
# Do the thing
python re_extract.py
When done, open the sessions.txt
file in Excel or spreadsheet app of choice
The example.env
file in this project must be copied as simply .env
and
updated per these suggestions:
Item | Notes |
---|---|
REINVENT_USERNAME |
Your username for the re:Invent catalog site |
REINVENT_PASSWORD |
Your password for the re:Invent catalog site |
VERIFY_SSL_CERTS |
Set to False to ignore cert warnings |
CHROMEDRIVER_PATH |
Path to the chromedriver executable you downloaded |
CHROMEDRIVER_USER_AGENT |
The user-agent to spoof to avoid 403's. See notes below |
In 2019 they started adding headless user agent detection and throwing 403's if detected. Update this value to the appropriate string for the version of chrome you're using. Since you have to download the version of chromedriver which matches the version of chrome you have installed anyway, go to this site: https://www.whatismybrowser.com/detect/what-is-my-user-agent and copy the user agent string and insert that into the .env file.
The interests.py
takes an input file called interests.txt
which is simply a
line-separated list of session ID's you're interested in and marks them as true
in the Interest column in the sessions.txt file. Optionally, you can add a priority
for each session in the interests.txt file. On each line of the file, put: session ID,priority.
For example, if I want session DEV303 to have priority 1 and DEV101 to have priority 2,
I would write:
DEV303,1
DEV101,2
Once you've run your extract and the sessions.txt
file exists, and created the
interests.txt file full of ID's you're interested in, just run it.
python interests.py
- As the original credited developer noted, the reinvent session catalog site is terrible and just stops paging at 300 items. To be safe I have it doing much smaller chunks at a time (with sleeps to allow for slow loading) by way of filtered searches, looping through session types and topics. This was the only way I could reliably capture 99% of the catalog contents (some sessions aren't categorized at all!)
- I probably could have made it pull session types and topics dynamically but just didn't put that time in. Feel free if the spirit moves you and do a PR.
- The
.env
configuration model is a fairly common pattern in many apps. Didn't want the user to have to enter creds at the command line or edit source