---
engine: knitr
---
# Python essentials {#sec-python-essentials}
**Prerequisites**
**Key concepts and skills**
**Software and packages**
- `Python` [@python]
- `uv`
- `datetime>=5.5`
- `matplotlib`
- `numpy`
- `pandas`
- `polars`
- `pydantic`
- `seaborn`
## Introduction
`Python` is a general-purpose programming language created by Guido van Rossum. `Python` version 0.9.0 was released in February 1991, and the current version, 3.13, was released in October 2024. It was named `Python` after *Monty Python's Flying Circus*.
`Python` is a popular language in machine learning, but it was designed, and is more commonly used, for general software applications. This means that we will especially rely on packages when we use `Python` for data science. The use of `Python` in this book is focused on data science, rather than the other, more general, uses for which it was developed.
Knowing `R` will allow you to pick up `Python` for data science quickly, because the main data science packages in both languages solve the same underlying problems.
## Python, VS Code, and uv
We could use `Python` within RStudio, but another option is to use VS Code, which is more widely used by the `Python` community. You can download VS Code for free [here](https://code.visualstudio.com) and then install it. If you have difficulties with this, then in the same way we started with Posit Cloud and then shifted to our local machine, you could initially use Google Colab [here](https://colab.google).
Open VS Code (@fig-vscodesetup-a), and open a new Terminal: Terminal -> New Terminal (@fig-vscodesetup-b). We can then install `uv`, which is a Python package manager, by putting `curl -LsSf https://astral.sh/uv/install.sh | sh` into the Terminal and pressing "return/enter" afterwards (@fig-vscodesetup-c). Finally, to install Python we can use `uv` by putting `uv python install` into that Terminal and pressing "return/enter" afterwards (@fig-vscodesetup-d).
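For reference, those two Terminal commands are:

```{bash}
#| eval: false
#| echo: true
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install
```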
::: {#fig-vscodesetup layout-ncol="2"}
{#fig-vscodesetup-a width="50%"}
{#fig-vscodesetup-b width="50%"}
{#fig-vscodesetup-c width="50%"}
{#fig-vscodesetup-d width="50%"}
Opening VS Code and a new terminal and then installing uv and Python
:::
## Getting started
### Project set-up
We are going to get started with an example that downloads some data from Open Data Toronto. To start, we need to create a project, which will allow all our code to be self-contained.
Open VS Code and open a new Terminal: "Terminal" -> "New Terminal". Then use Unix shell commands to navigate to where you want to create your folder. For instance, use `ls` to list all the folders in the current directory, then move to one using `cd` and then the name of the folder. If you need to go back one level then use `..`.
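For instance, putting those navigation commands together (the folder name "projects" is hypothetical):

```{bash}
#| eval: false
#| echo: true
ls          # list the folders in the current directory
cd projects # move into a folder called "projects"
cd ..       # go back one level
```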
Once you are happy with where you are going to create this new folder, use `uv init` in the Terminal to create it, pressing "return/enter" afterwards; `cd` then moves into the new folder "shelter_usage".
```{bash}
#| eval: false
#| echo: true
uv init shelter_usage
cd shelter_usage
```
By default, there will be a script, `hello.py`, in the new folder. We want to use `uv run` to run that script, which will then create a project environment for us.
```{bash}
#| eval: false
#| echo: true
uv run hello.py
```
A project environment is specific to that project: packages added to it are available there without affecting other projects. We will use the package `numpy` to simulate data, so we need to add it to our environment with `uv add`.
```{bash}
#| eval: false
#| echo: true
uv add numpy
```
We can then modify `hello.py` to use `numpy` to simulate from the Normal distribution.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
import numpy as np


def main():
    np.random.seed(853)
    mu, sigma = 0, 1
    sample_sizes = [10, 100, 1000, 10000]
    differences = []
    for size in sample_sizes:
        sample = np.random.normal(mu, sigma, size)
        sample_mean = np.mean(sample)
        diff = abs(mu - sample_mean)
        differences.append(diff)
        print(f"Sample size: {size}")
        print(f" Difference between sample and population mean: {round(diff, 3)}")


if __name__ == "__main__":
    main()
```
After we have modified and saved `hello.py` we can run it with `uv run` in exactly the same way as before.
At this point we should close VS Code and re-open it to make sure that our project environment works as it should. In VS Code, a project is a self-contained folder. You can open a folder with "File" -> "Open Folder..." and then select the relevant folder, in this case "shelter_usage". You should then be able to re-run `uv run hello.py` and it should work.
### Plan
We first used this dataset in @sec-fire-hose, but as a reminder, for each day, for each shelter, there is the number of people who used the shelter. So the dataset we want to simulate is something like @fig-python_torontohomeless-data, and we want to create a table of the average daily number of occupied beds each month, along the lines of @fig-python_torontohomeless-table.
::: {#fig-python_torontohomeless layout-ncol="2"}
{#fig-python_torontohomeless-data width="50%"}
{#fig-python_torontohomeless-table width="50%"}
Sketches of a dataset and table related to shelter usage in Toronto
:::
### Simulate
We would like to simulate the dataset that we are interested in more thoroughly. We will use `polars` to provide a dataframe to store our simulated results, so we should add it to our environment with `uv add`.
```{bash}
#| eval: false
#| echo: true
uv add polars
```
Create a new Python file called `00-simulate_data.py` and add the following code.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
#### Preamble ####
# Purpose: Simulates a dataset of daily shelter usage
# Author: Rohan Alexander
# Date: 12 November 2024
# Contact: rohan.alexander@utoronto.ca
# License: MIT
# Pre-requisites:
# - Add `polars`: uv add polars
# - Add `numpy`: uv add numpy
# - Add `datetime`: uv add datetime

#### Workspace setup ####
import polars as pl
import numpy as np
from datetime import date

rng = np.random.default_rng(seed=853)

#### Simulate data ####
# Simulate 10 shelters and some set capacity
shelters_df = pl.DataFrame(
    {
        "Shelters": [f"Shelter {i}" for i in range(1, 11)],
        "Capacity": rng.integers(low=10, high=100, size=10),
    }
)

# Create data frame of dates
dates = pl.date_range(
    start=date(2024, 1, 1), end=date(2024, 12, 31), interval="1d", eager=True
).alias("Dates")

# Convert dates into a data frame
dates_df = pl.DataFrame(dates)

# Combine dates and shelters
data = dates_df.join(shelters_df, how="cross")

# Add usage as a Poisson draw, capped at each shelter's capacity
poisson_draw = rng.poisson(lam=data["Capacity"])
usage = np.minimum(poisson_draw, data["Capacity"])
data = data.with_columns([pl.Series("Usage", usage)])

data.write_parquet("simulated_data.parquet")
```
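As with `hello.py`, we can run this script with `uv run`:

```{bash}
#| eval: false
#| echo: true
uv run 00-simulate_data.py
```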
We would like to write tests based on this simulated data that we will then apply to our real data. We use `pydantic` to do this, so we should add it to our environment with `uv add`.
```{bash}
#| eval: false
#| echo: true
uv add pydantic
```
Create a new Python file called `00-test_simulated_data.py`. The first step is to define a subclass, `ShelterData`, of `BaseModel`, which comes from `pydantic`.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
from pydantic import BaseModel, Field, ValidationError, field_validator
from datetime import date


# Define the Pydantic model
class ShelterData(BaseModel):
    Dates: date  # Validates date format (e.g., 'YYYY-MM-DD')
    Shelters: str  # Must be a string
    Capacity: int = Field(..., ge=0)  # Must be a non-negative integer
    Usage: int = Field(..., ge=0)  # Must be non-negative

    # Add a field validator for Usage to ensure it does not exceed Capacity
    @field_validator("Usage")
    def check_usage_not_exceed_capacity(cls, usage, info):
        capacity = info.data.get("Capacity")
        if capacity is not None and usage > capacity:
            raise ValueError(f"Usage ({usage}) exceeds capacity ({capacity}).")
        return usage
```
We are interested in testing that dates are valid, shelters have the correct type, and that capacity and usage are both non-negative integers. One additional wrinkle is that usage should not exceed capacity. To write a test for that we use a `field_validator`.
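As a quick check that the validator works, constructing a hypothetical row where usage exceeds capacity raises an error:

```{python}
#| eval: false
#| echo: true
# A hypothetical row in which usage exceeds capacity
ShelterData(Dates="2024-01-01", Shelters="Shelter 1", Capacity=10, Usage=12)
# Raises a ValidationError because Usage (12) exceeds capacity (10)
```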
We can then import our simulated dataset and test it.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
import polars as pl

df = pl.read_parquet("simulated_data.parquet")

# Convert Polars DataFrame to a list of dictionaries for validation
data_dicts = df.to_dicts()

# Validate the dataset row by row
validated_data = []
errors = []

for i, row in enumerate(data_dicts):
    try:
        validated_row = ShelterData(**row)  # Validate each row
        validated_data.append(validated_row)
    except ValidationError as e:
        errors.append((i, e))

# Convert validated data back to a Polars DataFrame
validated_df = pl.DataFrame([row.model_dump() for row in validated_data])

# Display results
print("Validated Rows:")
print(validated_df)

if errors:
    print("\nErrors:")
    for i, error in errors:
        print(f"Row {i}: {error}")
```
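We can run the tests in the same way as the other scripts:

```{bash}
#| eval: false
#| echo: true
uv run 00-test_simulated_data.py
```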
To see what would happen if there were an error, we can consider a smaller dataset that contains two errors: one poorly formatted date and one case where usage exceeds capacity.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
import polars as pl
from pydantic import BaseModel, Field, ValidationError, field_validator
from datetime import date


# Define the Pydantic model
class ShelterData(BaseModel):
    Dates: date  # Validates date format (e.g., 'YYYY-MM-DD')
    Shelters: str  # Must be a string
    Capacity: int = Field(..., ge=0)  # Must be a non-negative integer
    Usage: int = Field(..., ge=0)  # Must be non-negative

    # Add a field validator for Usage to ensure it does not exceed Capacity
    @field_validator("Usage")
    def check_usage_not_exceed_capacity(cls, usage, info):
        capacity = info.data.get("Capacity")
        if capacity is not None and usage > capacity:
            raise ValueError(f"Usage ({usage}) cannot exceed Capacity ({capacity}).")
        return usage


# Define a small dataset, including two invalid rows for testing:
# a poorly formatted date (row 1) and usage above capacity (row 3)
rows = [
    {"Dates": "2024-01-01", "Shelters": "Shelter 1", "Capacity": 23, "Usage": 22},
    {"Dates": "rohan", "Shelters": "Shelter 2", "Capacity": 62, "Usage": 62},
    {"Dates": "2024-01-01", "Shelters": "Shelter 3", "Capacity": 93, "Usage": 88},
    {"Dates": "2024-01-01", "Shelters": "Shelter 4", "Capacity": 50, "Usage": 55},
]

# Validate the dataset row by row
validated_data = []
errors = []

for i, row in enumerate(rows):
    try:
        validated_row = ShelterData(**row)  # Validate each row
        validated_data.append(validated_row)
    except ValidationError as e:
        errors.append((i, e))

# Convert validated data back to a Polars DataFrame
validated_df = pl.DataFrame([row.model_dump() for row in validated_data])

# Display results
print("Validated Rows:")
print(validated_df)

if errors:
    print("\nErrors:")
    for i, error in errors:
        print(f"Row {i}: {error}")
```
We get the following message:
```
Errors:
Row 1: 1 validation error for ShelterData
Dates
Input should be a valid date or datetime, input is too short [type=date_from_datetime_parsing, input_value='rohan', input_type=str]
For further information visit https://errors.pydantic.dev/2.9/v/date_from_datetime_parsing
Row 3: 1 validation error for ShelterData
Usage
Value error, Usage (55) cannot exceed Capacity (50). [type=value_error, input_value=55, input_type=int]
For further information visit https://errors.pydantic.dev/2.9/v/value_error
```
### Acquire
We download the data from the same source as before: <https://open.toronto.ca/dataset/daily-shelter-overnight-service-occupancy-capacity/>.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
import polars as pl
# URL of the CSV file
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/dataset/21c83b32-d5a8-4106-a54f-010dbe49f6f2/resource/ffd20867-6e3c-4074-8427-d63810edf231/download/Daily%20shelter%20overnight%20occupancy.csv"
# Read the CSV file into a Polars DataFrame
df = pl.read_csv(url)
# Save the raw data
df.write_parquet("shelter_usage.parquet")
```
We are only interested in a few columns, and only in rows where there are data.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
import polars as pl

df = pl.read_parquet("shelter_usage.parquet")

# Select specific columns
selected_columns = ["OCCUPANCY_DATE", "SHELTER_ID", "OCCUPIED_BEDS", "CAPACITY_ACTUAL_BED"]
selected_df = df.select(selected_columns)

# Filter to only rows that have data
filtered_df = selected_df.filter(df["OCCUPIED_BEDS"].is_not_null())

print(filtered_df.head())

# Rename the columns to match those used in ShelterData
renamed_df = filtered_df.rename(
    {
        "OCCUPANCY_DATE": "Dates",
        "SHELTER_ID": "Shelters",
        "CAPACITY_ACTUAL_BED": "Capacity",
        "OCCUPIED_BEDS": "Usage",
    }
)

print(renamed_df.head())

renamed_df.write_parquet("cleaned_shelter_usage.parquet")
```
We may then want to apply the tests to the real dataset.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
import polars as pl
from pydantic import BaseModel, Field, ValidationError, field_validator
from datetime import date


# Define the Pydantic model
class ShelterData(BaseModel):
    Dates: date  # Validates date format (e.g., 'YYYY-MM-DD')
    Shelters: str  # Must be a string
    Capacity: int = Field(..., ge=0)  # Must be a non-negative integer
    Usage: int = Field(..., ge=0)  # Must be non-negative

    # Add a field validator for Usage to ensure it does not exceed Capacity
    @field_validator("Usage")
    def check_usage_not_exceed_capacity(cls, usage, info):
        capacity = info.data.get("Capacity")
        if capacity is not None and usage > capacity:
            raise ValueError(f"Usage ({usage}) cannot exceed Capacity ({capacity}).")
        return usage


df = pl.read_parquet("cleaned_shelter_usage.parquet")

# Convert Polars DataFrame to a list of dictionaries for validation
data_dicts = df.to_dicts()

# Validate the dataset row by row
validated_data = []
errors = []

for i, row in enumerate(data_dicts):
    try:
        validated_row = ShelterData(**row)  # Validate each row
        validated_data.append(validated_row)
    except ValidationError as e:
        errors.append((i, e))

# Convert validated data back to a Polars DataFrame
validated_df = pl.DataFrame([row.model_dump() for row in validated_data])

# Display results
print("Validated Rows:")
print(validated_df)

if errors:
    print("\nErrors:")
    for i, error in errors:
        print(f"Row {i}: {error}")
```
### Explore
We can now manipulate the data, aggregating capacity and usage by day.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
import polars as pl

df = pl.read_parquet("cleaned_shelter_usage.parquet")

# Convert the "Dates" column from string to date
df = df.with_columns(pl.col("Dates").str.strptime(pl.Date, "%Y-%m-%d").alias("Dates"))

# Group by "Dates" and calculate total "Capacity" and "Usage"
aggregated_df = (
    df.group_by("Dates")
    .agg(
        [
            pl.col("Capacity").sum().alias("Total_Capacity"),
            pl.col("Usage").sum().alias("Total_Usage"),
        ]
    )
    .sort("Dates")  # Sort the results by date
)

# Display the aggregated DataFrame
print(aggregated_df)

# Save the aggregated data for graphing
aggregated_df.write_parquet("analysis_data.parquet")
```
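Recall from the plan that we wanted the average daily number of occupied beds in each month. As a sketch of one way to build that from `aggregated_df`, using `dt.month()` to extract the month:

```{python}
#| eval: false
#| echo: true
# Average daily usage by month (a sketch building on aggregated_df)
monthly_df = (
    aggregated_df.with_columns(pl.col("Dates").dt.month().alias("Month"))
    .group_by("Month")
    .agg(pl.col("Total_Usage").mean().alias("Average_daily_usage"))
    .sort("Month")
)
print(monthly_df)
```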
We can then make a graph. For this we use `seaborn` and `matplotlib`, along with `pandas`, each of which should be added to the environment with `uv add` as before.
```{python}
#| eval: false
#| echo: true
#| warning: false
#| message: false
import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates

# Read the Polars DataFrame from a Parquet file
df = pl.read_parquet("analysis_data.parquet")

# Ensure the "Dates" column is of date type in Polars
df = df.with_columns([pl.col("Dates").cast(pl.Date)])

# Select the relevant columns and reshape the DataFrame
df_melted = df.select(["Dates", "Total_Capacity", "Total_Usage"]).melt(
    id_vars="Dates",
    variable_name="Metric",
    value_name="Value",
)

# Convert Polars DataFrame to a Pandas DataFrame for Seaborn
df_melted_pd = df_melted.to_pandas()

# Ensure the "Dates" column is datetime in Pandas
df_melted_pd["Dates"] = pd.to_datetime(df_melted_pd["Dates"])

# Set the plotting style
sns.set_theme(style="whitegrid")

# Create the plot
plt.figure(figsize=(12, 6))
sns.lineplot(
    data=df_melted_pd,
    x="Dates",
    y="Value",
    hue="Metric",
    linewidth=2.5,
)

# Format the x-axis to show dates nicely
plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Add labels and title
plt.xlabel("Date")
plt.ylabel("Values")
plt.title("Total Capacity and Usage Over Time")

# Adjust layout to prevent clipping of tick labels
plt.tight_layout()

# Display the plot
plt.show()
```
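If we wanted to save the graph to a file rather than display it, we could replace `plt.show()` with something like `plt.savefig("shelter_usage.png", dpi=300)`, where the filename is hypothetical.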
### Share
One nice aspect is that we can use `Python` in a Quarto document. To do this we need to install the Quarto extension for VS Code from [here](https://marketplace.visualstudio.com/items?itemName=quarto.quarto). You can then render a document by running `quarto preview` in the Terminal.
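As a minimal sketch, a Quarto document with a `Python` chunk might look like the following, where the title is hypothetical and the chunk reads the aggregated data from earlier:

````
---
title: "Shelter usage"
format: html
---

```{python}
import polars as pl

pl.read_parquet("analysis_data.parquet").head()
```
````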
VS Code is built by Microsoft, which also owns GitHub, so we can add our GitHub account to VS Code by going to Accounts and then signing in.
## Python
- For loops
- List comprehensions
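As a brief illustration of both, a minimal sketch:

```{python}
#| eval: false
#| echo: true
# Build a list of squares with a for loop
squares = []
for i in range(5):
    squares.append(i ** 2)

# The equivalent list comprehension
squares = [i ** 2 for i in range(5)]

print(squares)  # [0, 1, 4, 9, 16]
```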
## Making graphs
- `matplotlib`
- `seaborn`
## Exploring polars
### Importing data
### Dataset manipulation with joins and pivots
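As a placeholder sketch of a left join in `polars`, with hypothetical data:

```{python}
#| eval: false
#| echo: true
import polars as pl

usage = pl.DataFrame({"Shelters": ["Shelter 1", "Shelter 2"], "Usage": [20, 55]})
details = pl.DataFrame({"Shelters": ["Shelter 1", "Shelter 3"], "City": ["Toronto", "Toronto"]})

# A left join keeps every row of usage, matching on "Shelters"
joined = usage.join(details, on="Shelters", how="left")
print(joined)
```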
### String manipulation
### Factor variables
## Exercises
### Practice {.unnumbered}
### Quiz {.unnumbered}
### Task {.unnumbered}
Complete the free Replit "100 Days of Code" Python [course](https://replit.com/learn/100-days-of-python).