@@ -308,8 +308,8 @@ Sorting in SAS is accomplished via ``PROC SORT``
 String processing
 -----------------
 
-Length
-~~~~~~
+Finding length of string
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS determines the length of a character string with the
 `LENGTHN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002284668.htm>`__
@@ -327,8 +327,8 @@ functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailin
 .. include:: includes/length.rst
 
 
-Find
-~~~~
+Finding position of substring
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS determines the position of a character in a string with the
 `FINDW <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002978282.htm>`__ function.
@@ -342,19 +342,11 @@ you supply as the second argument.
    put(FINDW(sex,'ale'));
    run;
 
-Python determines the position of a character in a string with the
-``find`` function. ``find`` searches for the first position of the
-substring. If the substring is found, the function returns its
-position. Keep in mind that Python indexes are zero-based and
-the function will return -1 if it fails to find the substring.
-
-.. ipython:: python
-
-   tips["sex"].str.find("ale").head()
+.. include:: includes/find_substring.rst
 
 
-Substring
-~~~~~~~~~
+Extracting substring by position
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS extracts a substring from a string based on its position with the
 `SUBSTR <https://www2.sas.com/proceedings/sugi25/25/cc/25p088.pdf>`__ function.
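Review note on the removed ``find`` paragraph above: the two behaviours it mentions (zero-based positions and a -1 result when nothing matches) are those of Python's built-in `str.find`, which the pandas `.str.find` accessor is documented to mirror. A minimal sketch with plain strings, assuming made-up sample values rather than the `tips` data:

```python
# Illustrative values only; they are not taken from the tips dataset.
values = ["Female", "Male", "Other"]

# str.find returns the zero-based index of the first match,
# or -1 when the substring does not occur.
positions = [v.find("ale") for v in values]
print(positions)
```

The SAS `FINDW` call in the context lines is one-based and word-oriented, so its results are not expected to line up with these positions.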
@@ -366,17 +358,11 @@ SAS extracts a substring from a string based on its position with the
    put(substr(sex,1,1));
    run;
 
-With pandas you can use ``[]`` notation to extract a substring
-from a string by position locations. Keep in mind that Python
-indexes are zero-based.
+.. include:: includes/extract_substring.rst
 
-.. ipython:: python
 
-   tips["sex"].str[0:1].head()
-
-
-Scan
-~~~~
+Extracting nth word
+~~~~~~~~~~~~~~~~~~~
 
 The SAS `SCAN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm>`__
 function returns the nth word from a string. The first argument is the string you want to parse and the
@@ -394,20 +380,11 @@ second argument specifies which word you want to extract.
    ;;;
    run;
 
-Python extracts a substring from a string based on its text
-by using regular expressions. There are much more powerful
-approaches, but this just shows a simple approach.
-
-.. ipython:: python
-
-   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-   firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
-   firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]
-   firstlast
+.. include:: includes/nth_word.rst
 
 
-Upcase, lowcase, and propcase
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Changing case
+~~~~~~~~~~~~~
 
 The SAS `UPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245965.htm>`__
 `LOWCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245912.htm>`__ and
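Review note on the removed nth-word example above: as far as I can tell, the `Last_Name` line picks piece `[0]` of an `rsplit` with no split limit, which is the same piece as `[0]` of `split`, so it would repeat the first name. The plain-Python sketch below illustrates this under the assumption that the pandas `.str.split` and `.str.rsplit` accessors split the same way as the built-in string methods; the variable names are hypothetical.

```python
names = ["John Smith", "Jane Cook"]

# Index 0 of a split on spaces is the first word.
first = [n.split(" ")[0] for n in names]

# rsplit without a split limit yields the same pieces,
# so index 0 is still the first word.
rsplit_zero = [n.rsplit(" ")[0] for n in names]

# Index -1 selects the last word instead.
last = [n.rsplit(" ")[-1] for n in names]

print(first, rsplit_zero, last)
```

Whether `includes/nth_word.rst` keeps or changes that index is not visible in this diff.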
@@ -427,27 +404,13 @@ functions change the case of the argument.
    ;;;
    run;
 
-The equivalent Python functions are ``upper``, ``lower``, and ``title``.
+.. include:: includes/case.rst
 
-.. ipython:: python
-
-   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-   firstlast["string_up"] = firstlast["String"].str.upper()
-   firstlast["string_low"] = firstlast["String"].str.lower()
-   firstlast["string_prop"] = firstlast["String"].str.title()
-   firstlast
 
 Merging
 -------
 
-The following tables will be used in the merge examples
-
-.. ipython:: python
-
-   df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
-   df1
-   df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
-   df2
+.. include:: includes/merge_setup.rst
 
 In SAS, data must be explicitly sorted before merging. Different
 types of joins are accomplished using the ``in=`` dummy
@@ -473,39 +436,13 @@ input frames.
    if a or b then output outer_join;
    run;
 
-pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides
-similar functionality. Note that the data does not have
-to be sorted ahead of time, and different join
-types are accomplished via the ``how`` keyword.
-
-.. ipython:: python
-
-   inner_join = df1.merge(df2, on=["key"], how="inner")
-   inner_join
-
-   left_join = df1.merge(df2, on=["key"], how="left")
-   left_join
-
-   right_join = df1.merge(df2, on=["key"], how="right")
-   right_join
-
-   outer_join = df1.merge(df2, on=["key"], how="outer")
-   outer_join
+.. include:: includes/merge.rst
 
 
 Missing data
 ------------
 
-Like SAS, pandas has a representation for missing data - which is the
-special float value ``NaN`` (not a number). Many of the semantics
-are the same, for example missing data propagates through numeric
-operations, and is ignored by default for aggregations.
-
-.. ipython:: python
-
-   outer_join
-   outer_join["value_x"] + outer_join["value_y"]
-   outer_join["value_x"].sum()
+.. include:: includes/missing_intro.rst
 
 One difference is that missing data cannot be compared to its sentinel value.
 For example, in SAS you could do this to filter missing values.
@@ -522,25 +459,7 @@ For example, in SAS you could do this to filter missing values.
    if value_x ^= .;
    run;
 
-Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions
-should be used for comparisons.
-
-.. ipython:: python
-
-   outer_join[pd.isna(outer_join["value_x"])]
-   outer_join[pd.notna(outer_join["value_x"])]
-
-pandas also provides a variety of methods to work with missing data - some of
-which would be challenging to express in SAS. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation<missing_data>` for more.
-
-.. ipython:: python
-
-   outer_join.dropna()
-   outer_join.fillna(method="ffill")
-   outer_join["value_x"].fillna(outer_join["value_x"].mean())
+.. include:: includes/missing.rst
 
 
 GroupBy
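Review note on the missing-data hunk above: the reason a sentinel comparison "doesn't work" is that `NaN` never compares equal to anything, itself included. The sketch below shows this with plain floats, using `math.isnan` as a standard-library stand-in for `pd.isna`; the sample values are made up and not taken from `outer_join`.

```python
import math

# Illustrative values only; NaN marks the missing entry.
values = [1.5, float("nan"), 3.0]

# Equality against NaN is always False, so this filter finds nothing.
by_equality = [v for v in values if v == float("nan")]

# A dedicated predicate is needed to separate missing from present values.
missing = [v for v in values if math.isnan(v)]
present = [v for v in values if not math.isnan(v)]

print(by_equality, len(missing), present)
```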
@@ -549,7 +468,7 @@ GroupBy
 Aggregation
 ~~~~~~~~~~~
 
-SAS's PROC SUMMARY can be used to group by one or
+SAS's ``PROC SUMMARY`` can be used to group by one or
 more key variables and compute aggregations on
 numeric columns.
 
@@ -561,14 +480,7 @@ numeric columns.
    output out=tips_summed sum=;
    run;
 
-pandas provides a flexible ``groupby`` mechanism that
-allows similar aggregations. See the :ref:`groupby documentation<groupby>`
-for more details and examples.
-
-.. ipython:: python
-
-   tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-   tips_summed.head()
+.. include:: includes/groupby.rst
 
 
 Transformation
@@ -597,16 +509,7 @@ example, to subtract the mean for each observation by smoker group.
    if a and b;
    run;
 
-
-pandas ``groupby`` provides a ``transform`` mechanism that allows
-these type of operations to be succinctly expressed in one
-operation.
-
-.. ipython:: python
-
-   gb = tips.groupby("smoker")["total_bill"]
-   tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
-   tips.head()
+.. include:: includes/transform.rst
 
 
 By group processing