
Commit 3526a71

DOC: improve shared content between comparison pages (#38933)
1 parent eb53bf7 commit 3526a71

14 files changed, +140 -232 lines changed

doc/source/getting_started/comparison/comparison_with_sas.rst

Lines changed: 21 additions & 118 deletions
@@ -308,8 +308,8 @@ Sorting in SAS is accomplished via ``PROC SORT``
 String processing
 -----------------
 
-Length
-~~~~~~
+Finding length of string
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS determines the length of a character string with the
 `LENGTHN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002284668.htm>`__
@@ -327,8 +327,8 @@ functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailin
 .. include:: includes/length.rst
 
 
-Find
-~~~~
+Finding position of substring
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS determines the position of a character in a string with the
 `FINDW <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002978282.htm>`__ function.
@@ -342,19 +342,11 @@ you supply as the second argument.
 put(FINDW(sex,'ale'));
 run;
 
-Python determines the position of a character in a string with the
-``find`` function. ``find`` searches for the first position of the
-substring. If the substring is found, the function returns its
-position. Keep in mind that Python indexes are zero-based and
-the function will return -1 if it fails to find the substring.
-
-.. ipython:: python
-
-tips["sex"].str.find("ale").head()
+.. include:: includes/find_substring.rst
 
 
-Substring
-~~~~~~~~~
+Extracting substring by position
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS extracts a substring from a string based on its position with the
 `SUBSTR <https://www2.sas.com/proceedings/sugi25/25/cc/25p088.pdf>`__ function.
@@ -366,17 +358,11 @@ SAS extracts a substring from a string based on its position with the
 put(substr(sex,1,1));
 run;
 
-With pandas you can use ``[]`` notation to extract a substring
-from a string by position locations. Keep in mind that Python
-indexes are zero-based.
+.. include:: includes/extract_substring.rst
 
-.. ipython:: python
 
-tips["sex"].str[0:1].head()
-
-
-Scan
-~~~~
+Extracting nth word
+~~~~~~~~~~~~~~~~~~~
 
 The SAS `SCAN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm>`__
 function returns the nth word from a string. The first argument is the string you want to parse and the
@@ -394,20 +380,11 @@ second argument specifies which word you want to extract.
 ;;;
 run;
 
-Python extracts a substring from a string based on its text
-by using regular expressions. There are much more powerful
-approaches, but this just shows a simple approach.
-
-.. ipython:: python
-
-firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
-firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]
-firstlast
+.. include:: includes/nth_word.rst
 
 
-Upcase, lowcase, and propcase
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Changing case
+~~~~~~~~~~~~~
 
 The SAS `UPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245965.htm>`__
 `LOWCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245912.htm>`__ and
@@ -427,27 +404,13 @@ functions change the case of the argument.
 ;;;
 run;
 
-The equivalent Python functions are ``upper``, ``lower``, and ``title``.
+.. include:: includes/case.rst
 
-.. ipython:: python
-
-firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-firstlast["string_up"] = firstlast["String"].str.upper()
-firstlast["string_low"] = firstlast["String"].str.lower()
-firstlast["string_prop"] = firstlast["String"].str.title()
-firstlast
 
 Merging
 -------
 
-The following tables will be used in the merge examples
-
-.. ipython:: python
-
-df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
-df1
-df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
-df2
+.. include:: includes/merge_setup.rst
 
 In SAS, data must be explicitly sorted before merging. Different
 types of joins are accomplished using the ``in=`` dummy
@@ -473,39 +436,13 @@ input frames.
 if a or b then output outer_join;
 run;
 
-pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides
-similar functionality. Note that the data does not have
-to be sorted ahead of time, and different join
-types are accomplished via the ``how`` keyword.
-
-.. ipython:: python
-
-inner_join = df1.merge(df2, on=["key"], how="inner")
-inner_join
-
-left_join = df1.merge(df2, on=["key"], how="left")
-left_join
-
-right_join = df1.merge(df2, on=["key"], how="right")
-right_join
-
-outer_join = df1.merge(df2, on=["key"], how="outer")
-outer_join
+.. include:: includes/merge.rst
 
 
 Missing data
 ------------
 
-Like SAS, pandas has a representation for missing data - which is the
-special float value ``NaN`` (not a number). Many of the semantics
-are the same, for example missing data propagates through numeric
-operations, and is ignored by default for aggregations.
-
-.. ipython:: python
-
-outer_join
-outer_join["value_x"] + outer_join["value_y"]
-outer_join["value_x"].sum()
+.. include:: includes/missing_intro.rst
 
 One difference is that missing data cannot be compared to its sentinel value.
 For example, in SAS you could do this to filter missing values.
@@ -522,25 +459,7 @@ For example, in SAS you could do this to filter missing values.
 if value_x ^= .;
 run;
 
-Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions
-should be used for comparisons.
-
-.. ipython:: python
-
-outer_join[pd.isna(outer_join["value_x"])]
-outer_join[pd.notna(outer_join["value_x"])]
-
-pandas also provides a variety of methods to work with missing data - some of
-which would be challenging to express in SAS. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation<missing_data>` for more.
-
-.. ipython:: python
-
-outer_join.dropna()
-outer_join.fillna(method="ffill")
-outer_join["value_x"].fillna(outer_join["value_x"].mean())
+.. include:: includes/missing.rst
 
 
 GroupBy
@@ -549,7 +468,7 @@ GroupBy
 Aggregation
 ~~~~~~~~~~~
 
-SAS's PROC SUMMARY can be used to group by one or
+SAS's ``PROC SUMMARY`` can be used to group by one or
 more key variables and compute aggregations on
 numeric columns.
 
@@ -561,14 +480,7 @@ numeric columns.
 output out=tips_summed sum=;
 run;
 
-pandas provides a flexible ``groupby`` mechanism that
-allows similar aggregations. See the :ref:`groupby documentation<groupby>`
-for more details and examples.
-
-.. ipython:: python
-
-tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-tips_summed.head()
+.. include:: includes/groupby.rst
 
 
 Transformation
@@ -597,16 +509,7 @@ example, to subtract the mean for each observation by smoker group.
 if a and b;
 run;
 
-
-pandas ``groupby`` provides a ``transform`` mechanism that allows
-these type of operations to be succinctly expressed in one
-operation.
-
-.. ipython:: python
-
-gb = tips.groupby("smoker")["total_bill"]
-tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
-tips.head()
+.. include:: includes/transform.rst
 
 
 By group processing
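
For readers checking the semantics that the removed inline text described (zero-based positions, ``find`` returning -1, slicing by position, taking the first and last word, changing case, and ``NaN`` not comparing equal to itself), here is a minimal plain-Python sketch. It is not the content of the new include files, which this diff does not show; it uses only built-in ``str`` methods and ``math.isnan``, which the pandas ``.str`` accessor and ``pd.isna`` mirror element-wise. The sample values ``"Female"`` and ``"John Smith"`` are illustrative assumptions.

```python
import math

# Illustrative values (assumptions, not taken from the include files).
sex = "Female"
name = "John Smith"

# find: zero-based position of the first match, or -1 when there is none.
pos_found = sex.find("ale")
pos_missing = sex.find("xyz")

# Slicing by position: [0:1] keeps only the first character.
first_char = sex[0:1]

# nth word: split on spaces, then pick the first and the last piece.
first_name = name.split(" ")[0]
last_name = name.rsplit(" ", 1)[-1]

# Changing case.
upper, lower, title = name.upper(), name.lower(), name.title()

# NaN never compares equal to itself, so use a dedicated check instead.
nan = float("nan")
nan_equals_itself = nan == nan
nan_detected = math.isnan(nan)

print(pos_found, pos_missing, first_char, first_name, last_name)
print(upper, lower, title, nan_equals_itself, nan_detected)
```

In the removed nth-word example, both ``First_Name`` and ``Last_Name`` select ``[0]`` after splitting, so both pick the first word. The sketch takes the last piece for the last name instead.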
