Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review notebook 1 "Pandas is table oriented" #2

Open
wants to merge 3 commits into
base: review
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Collaborator

No description provided.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB.

@@ -0,0 +1,620 @@
{
Copy link

@TomAugspurger TomAugspurger Sep 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use column *names* or *labels* more often?

I typically prefer *labels*, since "names" can conflict with Index.name / MultiIndex.names.


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with using labels. I'll start listing these type of _conventions_ in order to make them consistent in the following tutorials.

@@ -0,0 +1,620 @@
{
Copy link

@TomAugspurger TomAugspurger Sep 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's provide index= to the DataFrame call above, so that we have an Index object here and not a RangeIndex.


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, good point

@@ -0,0 +1,620 @@
{
Copy link

@TomAugspurger TomAugspurger Sep 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite true about object, especially if https://github.com/pandas-dev/pandas/pull/27949 is merged for 1.0.


Reply via ReviewNB

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when that is merged, it will not yet be the default, right?

I would reword to somethinglike "any python object, but in practice often means strings"

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As pointed in https://github.com/pandas-dev/pandas/issues/26831#issuecomment-529716754 we will exclude the discussion on object types in the first tutorial.

@@ -0,0 +1,620 @@
{
Copy link

@TomAugspurger TomAugspurger Sep 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo on Pandas (capital P).

Rather than "single column version", say something like "each column in a DataFrame is a Series".


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll convert all Pandas references with capital P to lowercase in all of the tutorials

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'each column in a DataFrame is a Series' -> indedd, adjusted

@@ -0,0 +1,620 @@
{
Copy link
Collaborator Author

@jorisvandenbossche jorisvandenbossche Sep 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

characters -> strings? (I find that a bit more common name, although in SQL land character is maybe often used)


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll make sure adjustments and other tutorials will be adjusted to using strings (cfr. https://github.com/pandas-dev/pandas/pull/27949)

@@ -0,0 +1,620 @@
{
Copy link
Collaborator Author

@jorisvandenbossche jorisvandenbossche Sep 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of focusing on "a dataframe has attributes", I would maybe rather start with explaining the concepts of column names and index. And then only after that explain that those two "things" are attributes ?


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left discussion on attributes of a tabel out of the starting tutorial to descrease technical elements

@@ -0,0 +1,620 @@
{
Copy link
Collaborator Author

@jorisvandenbossche jorisvandenbossche Sep 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the reason that you don't show that a column of a dataframe is a series, is because indexing is only explained in a later tutorial?

I would maybe consider to just show df[col] already here, which might give a more natural introduction to series = column of dataframe.

And then in a second step we can still show that you can create one manually as well


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jorisvandenbossche
Copy link
Collaborator Author

A general remark on this tutorial is that it is "not very interesting". I am not sure how to make it more, though.

I fully admit I have this in my tutorials as well, but showing the .index, .columns, .dtypes attributes of a dataframe as a starter is maybe not attractive? I am wondering if we can have some other things in this first notebook, but the difficulty is of course that doing something actually useful directly needs some of the methods that is only seen in later notebooks ..

Copy link
Owner

I refactored with an attempt to keep a balance between:

  • introduction to dataframe/series concept
  • show that the aim is to DO something with it
  • make it more engaing
  • keep the technical level low enough

@stijnvanhoey
Copy link
Owner

@datapythonista and @WillAyd can you have a look to see if this improves the first tutorial?

@@ -0,0 +1,484 @@
{
Copy link

@WillAyd WillAyd Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These images (?) aren't rendering - maybe just an issue with ReviewNB?


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed.

@@ -0,0 +1,484 @@
{
Copy link

@WillAyd WillAyd Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments:

  • I think would be nice to just use df as the name of this instead of my_dataframe
  • I would defer introducing Categorical types until later
  • We don't need to specify the index= here as it matches the default

Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because df is such a widely used convention for naming DataFrames, I agree with @WillAyd that it would be better to use df than my_dataframe.

I also agree that there's no need to specify the index here. I think that @TomAugspurger suggested specifying the index to avoid a RangeIndex, but the index is no longer being printed out in this notebook, so it's not an issue.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • my_dataframe -> df; good point, adjusted
  • index is indeed of no more usage, removed
  • I added categorical to make sure that when later on applying the describe method it is not taken into account. To overcome this, I now used Sexas a column instead of Pclass.

@@ -0,0 +1,484 @@
{
Copy link

@WillAyd WillAyd Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Note seems a little strange to me. I get what you are saying but for a new user it doesn't really explain much, so probably best to remove


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On overcoming too much links towards _future_ tutorials and due to the potential confusion for new users, I removed this note.

@@ -0,0 +1,484 @@
{
Copy link

@WillAyd WillAyd Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would delete the second sentence - I think users should be going through sequentially and not hopping around tutorial


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm in agreement with @WillAyd's suggestion here. Perhaps don't encourage users to jump ahead to future lessons, though in future lessons it does make sense to link back to previous lessons.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, good point!

@@ -0,0 +1,484 @@
{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this comment as well - I don't think we care to draw that comparison this early on


Reply via ReviewNB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you decide to keep this language, I would change "When you are familiar" to "If you are familiar".

Also, "base" should be "based".

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the note there for the moment. I do think Python users will _like_ the link with something they already know.

@@ -0,0 +1,484 @@
{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Aggregating Data" would be a more appropriate title


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not want to stress a specific function in that section and use the aggregation just as an example of applying _methods_ in general. As such, I use _do something_ in order to explain the general concept of methods applied on DataFrame/Series.

@@ -0,0 +1,484 @@
{
Copy link

@justmarkham justmarkham Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use the term "alias" rather than "shortcut".


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check

@@ -0,0 +1,484 @@
{
Copy link

@justmarkham justmarkham Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only ever seen the word "latter" used when there are two things discussed ("former" and "latter"). Since there are three things being discussed in this case, I would replace "the latter" with "the Pclass column".


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check.

@@ -0,0 +1,484 @@
{
Copy link

@justmarkham justmarkham Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "parenthesis" should be "parentheses"


Reply via ReviewNB

@@ -0,0 +1,484 @@
{
Copy link

@justmarkham justmarkham Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than saying "methods are like functions", I think we should just say "methods are functions".


Reply via ReviewNB

@@ -0,0 +1,484 @@
{
Copy link

@justmarkham justmarkham Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "PClass" should be "Pclass"

Also, I think one useful point to mention here is that many pandas operations return a DataFrame or a Series, and this is an example of a pandas operation (describe) returning a Series.


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, added to the tutorial.

@@ -0,0 +1,484 @@
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never heard the phrase "besides the looks", so perhaps change the wording?


Reply via ReviewNB

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adusted to:

> This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns and rows. Apart from the representation, also the data manipulations and calculations you would do in spreadsheet software are supported by pandas. Continue reading the next tutorials to get you started!

Further improvement welcome

@@ -0,0 +1,476 @@
{
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But a Series still has a name ...

Is that something to mention? In the example above we also give it a name.


Reply via ReviewNB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants