-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Review notebook 1 "Pandas is table oriented" #2
base: review
Are you sure you want to change the base?
Conversation
Check out this pull request on You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB. |
@@ -0,0 +1,620 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we use column *names* or *labels* more often?
I typically prefer *labels*, since "names" can conflict with Index.name / MultiIndex.names.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok with using labels. I'll start listing these type of _conventions_ in order to make them consistent in the following tutorials.
@@ -0,0 +1,620 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's provide index=
to the DataFrame call above, so that we have an Index
object here and not a RangeIndex.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, good point
@@ -0,0 +1,620 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite true about object
, especially if https://github.com/pandas-dev/pandas/pull/27949 is merged for 1.0.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even when that is merged, it will not yet be the default, right?
I would reword to somethinglike "any python object, but in practice often means strings"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As pointed in https://github.com/pandas-dev/pandas/issues/26831#issuecomment-529716754 we will exclude the discussion on object types in the first tutorial.
@@ -0,0 +1,620 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo on Pandas
(capital P
).
Rather than "single column version", say something like "each column in a DataFrame is a Series".
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll convert all Pandas references with capital P to lowercase in all of the tutorials
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'each column in a DataFrame is a Series' -> indedd, adjusted
@@ -0,0 +1,620 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
characters -> strings? (I find that a bit more common name, although in SQL land character is maybe often used)
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll make sure adjustments and other tutorials will be adjusted to using strings
(cfr. https://github.com/pandas-dev/pandas/pull/27949)
@@ -0,0 +1,620 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of focusing on "a dataframe has attributes", I would maybe rather start with explaining the concepts of column names and index. And then only after that explain that those two "things" are attributes ?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left discussion on attributes of a tabel out of the starting tutorial to descrease technical elements
@@ -0,0 +1,620 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the reason that you don't show that a column of a dataframe is a series, is because indexing is only explained in a later tutorial?
I would maybe consider to just show df[col]
already here, which might give a more natural introduction to series = column of dataframe.
And then in a second step we can still show that you can create one manually as well
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
A general remark on this tutorial is that it is "not very interesting". I am not sure how to make it more, though. I fully admit I have this in my tutorials as well, but showing the |
I refactored with an attempt to keep a balance between:
|
@datapythonista and @WillAyd can you have a look to see if this improves the first tutorial? |
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed.
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments:
- I think would be nice to just use
df
as the name of this instead ofmy_dataframe
I would defer introducing Categorical types until later
We don't need to specify the
index=
here as it matches the default
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because df
is such a widely used convention for naming DataFrames, I agree with @WillAyd that it would be better to use df
than my_dataframe
.
I also agree that there's no need to specify the index here. I think that @TomAugspurger suggested specifying the index to avoid a RangeIndex, but the index is no longer being printed out in this notebook, so it's not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my_dataframe
->df
; good point, adjusted- index is indeed of no more usage, removed
- I added categorical to make sure that when later on applying the
describe
method it is not taken into account. To overcome this, I now usedSex
as a column instead ofPclass
.
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Note seems a little strange to me. I get what you are saying but for a new user it doesn't really explain much, so probably best to remove
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On overcoming too much links towards _future_ tutorials and due to the potential confusion for new users, I removed this note.
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would delete the second sentence - I think users should be going through sequentially and not hopping around tutorial
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm in agreement with @WillAyd's suggestion here. Perhaps don't encourage users to jump ahead to future lessons, though in future lessons it does make sense to link back to previous lessons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, good point!
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd remove this comment as well - I don't think we care to draw that comparison this early on
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you decide to keep this language, I would change "When you are familiar" to "If you are familiar".
Also, "base" should be "based".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left the note there for the moment. I do think Python users will _like_ the link with something they already know.
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not want to stress a specific function in that section and use the aggregation just as an example of applying _methods_ in general. As such, I use _do something_ in order to explain the general concept of methods applied on DataFrame/Series.
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only ever seen the word "latter" used when there are two things discussed ("former" and "latter"). Since there are three things being discussed in this case, I would replace "the latter" with "the Pclass
column".
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check.
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than saying "methods are like functions", I think we should just say "methods are functions".
Reply via ReviewNB
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: "PClass" should be "Pclass"
Also, I think one useful point to mention here is that many pandas operations return a DataFrame or a Series, and this is an example of a pandas operation (describe) returning a Series.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, added to the tutorial.
@@ -0,0 +1,484 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adusted to:
> This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns and rows. Apart from the representation, also the data manipulations and calculations you would do in spreadsheet software are supported by pandas. Continue reading the next tutorials to get you started!
Further improvement welcome
@@ -0,0 +1,476 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But a Series still has a name ...
Is that something to mention? In the example above we also give it a name.
Reply via ReviewNB
No description provided.