Commit cda091f

natethedrummer authored and jorisvandenbossche committed
DOC: added string processing comparison with SAS (#16497)
1 parent 7cc0fac commit cda091f

1 file changed, +140 -0 lines changed

doc/source/comparison_with_sas.rst

@@ -357,6 +357,146 @@ takes a list of columns to sort by.
   tips = tips.sort_values(['sex', 'total_bill'])
   tips.head()


String Processing
-----------------

Length
~~~~~~

SAS determines the length of a character string with the
`LENGTHN <http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002284668.htm>`__
and `LENGTHC <http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002283942.htm>`__
functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailing blanks.

.. code-block:: none

   data _null_;
   set tips;
   put(LENGTHN(time));
   put(LENGTHC(time));
   run;

Python determines the length of a character string with the ``len`` function.
``len`` includes trailing blanks. Use ``len`` and ``rstrip`` to exclude
trailing blanks.

.. ipython:: python

   tips['time'].str.len().head()
   tips['time'].str.rstrip().str.len().head()


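The effect of trailing blanks is easier to see on values that actually
contain them. The following is a minimal sketch; the ``padded`` Series is
hypothetical data, not part of the ``tips`` dataset.

.. code-block:: python

   import pandas as pd

   # Hypothetical values; the first two carry trailing blanks.
   padded = pd.Series(["Dinner  ", "Lunch ", "Dinner"])

   # Counts every character, trailing blanks included.
   with_blanks = padded.str.len()

   # Strips trailing whitespace first, then counts what is left.
   without_blanks = padded.str.rstrip().str.len()

   print(pd.DataFrame({"with_blanks": with_blanks,
                       "without_blanks": without_blanks}))

The two columns differ only on the rows whose values end in blanks.
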
Find
~~~~

SAS determines the position of a character in a string with the
`FINDW <http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002978282.htm>`__ function.
``FINDW`` takes the string defined by the first argument and searches for the first position of the substring
you supply as the second argument. Note that ``FINDW`` matches whole words only; the related ``FIND``
function matches arbitrary substrings.

.. code-block:: none

   data _null_;
   set tips;
   put(FINDW(sex,'ale'));
   run;

Python determines the position of a substring in a string with the
``find`` function. ``find`` returns the position of the first occurrence
of the substring, or -1 if the substring is not found. Keep in mind that
Python indexes are zero-based, so the first character is at position 0.

.. ipython:: python

   tips['sex'].str.find("ale").head()


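Because SAS string positions are one-based and report 0 when nothing is
found, results from ``find`` may need shifting when porting code. A minimal
sketch, assuming a hypothetical ``sex`` Series that includes a value without
the substring:

.. code-block:: python

   import pandas as pd

   # Hypothetical values; "Other" does not contain "ale".
   sex = pd.Series(["Female", "Male", "Other"])

   # Zero-based position of the first match, -1 when there is none.
   zero_based = sex.str.find("ale")

   # Adding 1 gives one-based positions and maps "not found" (-1) to 0.
   one_based = zero_based + 1

   print(pd.DataFrame({"sex": sex,
                       "zero_based": zero_based,
                       "one_based": one_based}))

Shifting by one is a convenience for comparison only; idiomatic pandas code
usually keeps the zero-based result.
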
Substring
~~~~~~~~~

SAS extracts a substring from a string based on its position with the
`SUBSTR <http://www2.sas.com/proceedings/sugi25/25/cc/25p088.pdf>`__ function.

.. code-block:: none

   data _null_;
   set tips;
   put(substr(sex,1,1));
   run;

With pandas you can use ``[]`` notation to extract a substring
from a string by position. Keep in mind that Python indexes are
zero-based and that the end of a slice is exclusive, so ``[0:1]``
returns the first character.

.. ipython:: python

   tips['sex'].str[0:1].head()


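The mapping from a one-based start and a length to a Python slice can be
sketched as follows. The helper name ``sas_substr`` and the sample Series are
hypothetical and exist only for illustration.

.. code-block:: python

   import pandas as pd


   def sas_substr(series, start, length):
       """Hypothetical helper: slice by one-based start and length."""
       begin = start - 1  # convert to a zero-based offset
       return series.str[begin:begin + length]


   # Hypothetical sample values.
   sex = pd.Series(["Female", "Male"])

   first_letter = sas_substr(sex, 1, 1)
   middle = sas_substr(sex, 2, 3)

   print(first_letter.tolist(), middle.tolist())

In everyday pandas code, writing the slice directly, such as
``sex.str[1:4]``, is usually clearer than wrapping it in a helper.
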
Scan
~~~~

The SAS `SCAN <http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm>`__
function returns the nth word from a string. The first argument is the string you want to parse and the
second argument specifies which word you want to extract.

.. code-block:: none

   data firstlast;
   input String $60.;
   First_Name = scan(string, 1);
   Last_Name = scan(string, -1);
   datalines2;
   John Smith;
   Jane Cook;
   ;;;
   run;

Python can extract words from a string by splitting it on a
delimiter with ``split`` and ``rsplit``. Regular expressions offer
more powerful approaches, but this shows a simple one.

.. ipython:: python

   firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
   firstlast['First_Name'] = firstlast['String'].str.split(" ", expand=True)[0]
   firstlast['Last_Name'] = firstlast['String'].str.rsplit(" ", expand=True)[1]
   firstlast


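When the number of words varies from row to row, selecting a fixed column
from the expanded result no longer gives the last word. A minimal sketch of
one alternative, assuming hypothetical names and splitting on whitespace only
(``SCAN`` recognizes a wider set of default delimiters):

.. code-block:: python

   import pandas as pd

   # Hypothetical names with one, two, and three words.
   names = pd.Series(["John Smith", "Mary Ann Jones", "Cher"])

   # Split on runs of whitespace; each element becomes a list of words.
   words = names.str.split()

   # Index into each list: 0 is the first word, -1 is the last word.
   first = words.str[0]
   last = words.str[-1]

   print(pd.DataFrame({"name": names, "first": first, "last": last}))

Negative indexes count from the end of each list, which plays the same role
as the negative word number passed to ``scan`` in the SAS example.
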
Upcase, Lowcase, and Propcase
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The SAS `UPCASE <http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245965.htm>`__,
`LOWCASE <http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245912.htm>`__, and
`PROPCASE <http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/a002598106.htm>`__
functions change the case of the argument.

.. code-block:: none

   data firstlast;
   input String $60.;
   string_up = UPCASE(string);
   string_low = LOWCASE(string);
   string_prop = PROPCASE(string);
   datalines2;
   John Smith;
   Jane Cook;
   ;;;
   run;

The equivalent Python functions are ``upper``, ``lower``, and ``title``.

.. ipython:: python

   firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
   firstlast['string_up'] = firstlast['String'].str.upper()
   firstlast['string_low'] = firstlast['String'].str.lower()
   firstlast['string_prop'] = firstlast['String'].str.title()
   firstlast

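The same three methods can also normalize input whose case is inconsistent.
A minimal sketch, assuming a hypothetical Series of mixed-case names:

.. code-block:: python

   import pandas as pd

   # Hypothetical mixed-case input.
   names = pd.Series(["john SMITH", "JANE cook"])

   upper = names.str.upper()   # every letter upper case
   lower = names.str.lower()   # every letter lower case
   title = names.str.title()   # first letter of each word upper case

   print(pd.DataFrame({"upper": upper, "lower": lower, "title": title}))

Each method returns a new Series and leaves ``names`` unchanged.
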
Merging
-------
