This repository contains everything you need to follow the "Thinking In Arrays" tutorial, presented at the SciPy 2024 conference on Monday, July 8, 2024 at 13:30‒17:30 PDT in Room 315.
This tutorial was compiled by Jim Pivarski. Here is a video of the tutorial, as Jim presented it in 2023.
Abstract:
Despite its reputation for being slow, Python is the leading language of scientific computing, which generally needs large-scale (fast) computations. This is because most scientific problems can be split into "metadata munging" and "number crunching," where the latter is performed by array-oriented (vectorized) calls into precompiled routines.
This tutorial is an introduction to array-oriented programming. We'll focus on techniques that are equally useful in NumPy, Pandas, xarray, CuPy, Awkward Array, and other libraries, and we'll work in groups on three class projects: Conway's Game of Life, evaluating decision trees, and computations on ragged arrays.
Array-oriented programming is a paradigm in its own right, challenging us to think about problems in a different way. From APL in 1966 to NumPy today, most users of array-oriented programming are scientists, analyzing or simulating data. This tutorial focuses on the thought process: every problem is to be solved both in an imperative way (for loops) and in an array-oriented way. Matplotlib will be used for plotting, but all plotting commands will be given (Matplotlib is not a prerequisite).
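To make the imperative versus array-oriented contrast concrete, here is a minimal sketch that is not taken from the tutorial notebooks. The function names and the kinetic-energy formula are illustrative assumptions; the point is only that both versions compute the same result, one element at a time and one whole array at a time.

```python
import numpy as np


def kinetic_energy_loop(mass, velocity):
    """Imperative version: one Python-level iteration per element."""
    result = np.empty(len(mass))
    for i in range(len(mass)):
        result[i] = 0.5 * mass[i] * velocity[i] ** 2
    return result


def kinetic_energy_array(mass, velocity):
    """Array-oriented version: one expression applied to whole arrays."""
    return 0.5 * mass * velocity**2


# Hypothetical sample data, chosen only for illustration.
mass = np.array([1.0, 2.0, 3.0])
velocity = np.array([2.0, 0.5, 4.0])

# Both approaches should agree element by element.
assert np.allclose(
    kinetic_energy_loop(mass, velocity),
    kinetic_energy_array(mass, velocity),
)
```

In the array-oriented version, the loop still happens, but inside NumPy's precompiled routines rather than in the Python interpreter.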
We'll alternate between short lectures and small group projects (3‒4 people each), in which tutors will be available for help, followed by a guided tour through solutions, alternatives, and trade-offs.
You should have a basic familiarity with NumPy, such as the content of the "Introduction to Numerical Computing With NumPy" tutorial.
This tutorial consists of interactive lectures and exercises, all of which run in Jupyter notebooks. These notebooks depend on the libraries listed in environment.yml. On the day of the tutorial, we will use Quansight's Nebari platform to run the notebooks in the cloud with all dependencies installed. See
to get started.
You can also install the packages on your personal computer. If you're accessing this after the day of the tutorial, this is the only way to do it (Nebari won't be available), but if it's the day of the tutorial, Nebari is strongly preferred. We won't take any tutorial time to solve installation problems.
If you have some version of conda/mamba/Anaconda/Miniconda/Miniforge, you can install the environment.yml as a new environment, then activate that environment and run JupyterLab.
If you're new to this package manager, we recommend the mamba/CPython version of Miniforge, which has detailed instructions here.
Once conda/mamba is installed, the commands to download environment.yml and create an environment from it are
wget https://raw.githubusercontent.com/ekourlit/scipy2024-tutorial-thinking-in-arrays/main/environment.yml
conda env create -f environment.yml # can replace "conda" with "mamba"
The command to activate that environment (once per terminal) is
conda activate scipy2024-tutorial-thinking-in-arrays
Get a copy of this repo and enter its directory.
git clone https://github.com/ekourlit/scipy2024-tutorial-thinking-in-arrays.git
cd scipy2024-tutorial-thinking-in-arrays
Start JupyterLab with
jupyter lab
JupyterLab should attempt to open a browser tab connected to the Jupyter process running on your computer. It also prints some URLs in the terminal in case that doesn't work.
In each directory, part-1, part-2, and part-3, there are four notebooks:
- lecture-slides: the presentation, to be viewed in jupyterlab-deck
- lecture-workbook: for participants to use during the lecture; acts as a scratch-pad for solving "quizlets"
- project: a larger exercise to work on after each lecture
- solutions: discussion of different ways to solve the exercise
During the live tutorial, participants should open the lecture-workbook. The lecture-slides will be projected on a big screen. Offline, after the event, open both notebooks and read through the lecture-slides.
When everyone is working on exercises, I'll share
to follow your progress. It should take you to a page that looks like this:
0:00‒0:30 (30 min): Part 1 lecture: array-oriented programming as a paradigm: APL, SPEAKEASY, IDL, MATLAB, S, R, NumPy. Overview of basic and advanced slicing, broadcasting, and dimensional reduction. Powerful concept: element indexing is function application and advanced slicing is function composition.
0:30‒1:00 (30 min): Project 1: Conway's Game of Life. Calculating number of neighbors and updating the board "all at once."
1:00‒1:15 (15 min): Break
1:15‒1:35 (20 min): Guided discussion of solutions to Project 1.
1:35‒2:05 (30 min): Part 2 lecture: array-oriented programming and the "iteration until converged" problem. How to update arrays in which some elements have converged and others haven't.
2:05‒2:20 (15 min): Break
2:20‒2:50 (30 min): Part 3 lecture: non-rectilinear (ragged) arrays and arrays of arbitrary data structures: Apache Arrow and Awkward Array.
2:50‒3:25 (35 min): Project 3: a big, ragged dataset: computing lengths of taxi trips from polylines with varying numbers of edges. Since this is a big dataset, we'll also look at ways to scale it up with Dask.
3:25‒3:40 (15 min): Break
3:40‒4:00 (20 min): Solutions to Project 3.
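The "powerful concept" named in the Part 1 lecture can be sketched in a few lines of NumPy. This is an illustration rather than an excerpt from the lecture notebooks, and the array names `f`, `g`, and `h` are hypothetical. An array is treated as a function from integer positions to values, so indexing with a single integer applies that function, and indexing with an integer array (advanced slicing) composes two such functions.

```python
import numpy as np

# Think of each array as a function from positions to values.
f = np.array([10.0, 20.0, 30.0, 40.0])  # maps {0, 1, 2, 3} to floats
g = np.array([3, 3, 0, 2, 1])           # maps {0, ..., 4} to {0, ..., 3}
h = np.array([4, 0, 2])                 # maps {0, 1, 2} to {0, ..., 4}

# Element indexing is function application: f[2] is "f applied to 2".
assert f[2] == 30.0

# Indexing f by the integer array g is function composition:
# (f[g])[i] equals f[g[i]] for every valid i.
f_of_g = f[g]
assert all(f_of_g[i] == f[g[i]] for i in range(len(g)))

# Like function composition, it is associative.
assert np.array_equal(f[g][h], f[g[h]])
```

The only requirement is that every value in `g` is a valid position in `f`, just as a composed function needs the inner function's outputs to lie in the outer function's domain.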
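For the "iteration until converged" problem in the Part 2 lecture, one common pattern is to keep a boolean mask of elements that are still changing and update only those. The sketch below is an assumption about how such a mask could be used, not the lecture's own code. It applies Newton's method for square roots, and `newton_sqrt`, `tolerance`, and `max_iterations` are hypothetical names. Inputs are assumed to be strictly positive.

```python
import numpy as np


def newton_sqrt(values, tolerance=1e-12, max_iterations=100):
    """Square roots by Newton's method, updating only unconverged elements.

    Assumes every input value is strictly positive.
    """
    values = np.asarray(values, dtype=np.float64)
    estimate = values.copy()                      # initial guess: the value itself
    active = np.ones(values.shape, dtype=bool)    # True where not yet converged

    for _ in range(max_iterations):
        if not active.any():
            break
        x = estimate[active]
        updated = 0.5 * (x + values[active] / x)  # Newton step, active elements only
        estimate[active] = updated
        # An element stays active only while its relative step is still large.
        still_moving = np.abs(updated - x) > tolerance * np.abs(updated)
        active[active] = still_moving
    return estimate


roots = newton_sqrt([4.0, 9.0, 2.0, 1.0e6])
assert np.allclose(roots, np.sqrt([4.0, 9.0, 2.0, 1.0e6]))
```

Elements that converge early drop out of the mask, so later iterations do less work, while the loop itself still runs over iterations rather than over elements.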
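The ragged arrays of Part 3 can be previewed with plain NumPy. Apache Arrow and Awkward Array store variable-length lists as a flat buffer of values plus an array of offsets marking where each list starts and stops. The sketch below builds that layout by hand, with hypothetical names `content` and `offsets`, and computes per-list counts and sums without a Python loop. It illustrates the layout only and does not use either library's API.

```python
import numpy as np

# The ragged data [[1.1, 2.2, 3.3], [], [4.4, 5.5]] stored as two flat arrays:
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])  # all values, concatenated
offsets = np.array([0, 3, 3, 5])               # list i is content[offsets[i]:offsets[i + 1]]

# Number of items in each list.
counts = np.diff(offsets)

# Per-list sums from a cumulative sum: the difference between the running
# total at each list's stop and at its start. Empty lists correctly give 0.
cumulative = np.concatenate([[0.0], np.cumsum(content)])
sums = cumulative[offsets[1:]] - cumulative[offsets[:-1]]

assert counts.tolist() == [3, 0, 2]
assert np.allclose(sums, [6.6, 0.0, 9.9])
```

Subtracting cumulative sums can lose floating-point precision on long arrays, which is one reason to prefer a dedicated library for real analyses.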