From Python to Tidy R (and Back)

A Running List of Key Python Operations Translated to (Mostly) Tidy R

I frequently write code in both Python and R, and my team relies heavily on Tidyverse syntax. As a result, I am often translating key Python operations (pandas, matplotlib, etc.) to tidy R (dplyr, ggplot2, etc.). I created this repo to ease that translation and to crowdsource a running directory of these translations.

This is just a start. Please feel free to share, and to contribute or revise directly via pull requests or issues.

Note: I recommend using the native pipe operator (|>) when constructing piped operations in practice, instead of the magrittr pipe (%>%). However, I used the latter in this repo because the | in the native R pipe threw off formatting of the markdown tables.

Key tasks

Task / Operation	Python (Pandas)	Tidyverse (dplyr, ggplot2)
Data Loading	`import pandas as pd`	`library(readr)`
	`df = pd.read_csv('file.csv')`	`data <- read_csv('file.csv')`
Select Columns	`df[['col1', 'col2']]`	`data %>% select(col1, col2)`
Filter Rows	`df[df['col'] > 5]`	`data %>% filter(col > 5)`
Arrange Rows	`df.sort_values(by='col')`	`data %>% arrange(col)`
Mutate (Add Columns)	`df['new_col'] = df['col1'] + df['col2']`	`data %>% mutate(new_col = col1 + col2)`
Group and Summarize	`df.groupby('col').agg({'col2': 'mean'})`	`data %>% group_by(col) %>% summarize(mean_col2 = mean(col2))`
Pivot/Wide to Long	`pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')`	`data %>% pivot_longer(-id, names_to = 'variable', values_to = 'value')`
Long to Wide/Pivot	`df.pivot(index='id', columns='variable', values='value')`	`data %>% pivot_wider(names_from = variable, values_from = value)`
Data Visualization	Matplotlib, Seaborn, Plotly, etc.	ggplot2
	`import matplotlib.pyplot as plt`	`library(ggplot2)`
	`plt.scatter(df['x'], df['y'])`	`ggplot(data, aes(x=x, y=y)) + geom_point()`
Data Reshaping	`pd.concat([df1, df2], axis=0)`	`bind_rows(df1, df2)`
	`pd.concat([df1, df2], axis=1)`	`bind_cols(df1, df2)`
String Manipulation	`df['col'].str.replace('a', 'b')`	`data %>% mutate(col = str_replace_all(col, 'a', 'b'))`
Date and Time	`pd.to_datetime(df['date_col'])`	`data %>% mutate(date_col = as.Date(date_col))`
Missing Data Handling	`df.dropna()`	`data %>% drop_na()`
Rename Columns	`df.rename(columns={'old_col': 'new_col'})`	`data %>% rename(new_col = old_col)`
Summary Statistics	`df.describe()`	`data %>% summary()` or `data %>% glimpse()`
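
As a minimal sketch of how a few of the pandas entries above fit together, the following reshapes a small table from wide to long, back to wide, and then summarizes by group. The toy data and column names (`id`, `a`, `b`) are assumptions made up for illustration; they do not come from this repo.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"id": [1, 2], "a": [10, 20], "b": [30, 40]})

# Wide to long (the "Pivot/Wide to Long" row).
long = pd.melt(df, id_vars=["id"], var_name="variable", value_name="value")

# Long to wide (the "Long to Wide/Pivot" row).
wide = long.pivot(index="id", columns="variable", values="value").reset_index()
wide.columns.name = None  # drop the leftover "variable" label on the column axis

# Group and summarize, using named aggregation so the output column has a clear name.
means = long.groupby("variable", as_index=False).agg(mean_value=("value", "mean"))
```

The `reset_index()` call is a design choice here: it turns `id` back into an ordinary column, which is closer to what the tidy R side returns.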

Joining Data

This is the only table that includes SQL, since many dplyr join operations were patterned and named after their SQL counterparts.

Join Type	SQL	Python (Pandas)	R (dplyr)
Inner Join	`INNER JOIN`	`pd.merge(df1, df2, on='key')`	`inner_join(df1, df2, by='key')`
Left Join	`LEFT JOIN`	`pd.merge(df1, df2, on='key', how='left')`	`left_join(df1, df2, by='key')`
Right Join	`RIGHT JOIN`	`pd.merge(df1, df2, on='key', how='right')`	`right_join(df1, df2, by='key')`
Full Outer Join	`FULL OUTER JOIN`	`pd.merge(df1, df2, on='key', how='outer')`	`full_join(df1, df2, by='key')`
Cross Join	`CROSS JOIN`	`pd.merge(df1, df2, how='cross')`	`cross_join(df1, df2)` (dplyr 1.1.0 or later)
Anti Join	`LEFT JOIN ... WHERE df2.key IS NULL` or `WHERE NOT EXISTS (...)`	`df1[~df1['key'].isin(df2['key'])]`	`anti_join(df1, df2, by='key')`
Semi Join	`WHERE EXISTS (...)`	`df1[df1['key'].isin(df2['key'])]`	`semi_join(df1, df2, by='key')`
Self Join	`INNER JOIN` with the same table	`pd.merge(df, df, on='key')`	`inner_join(df, df, by='key')`
Multiple Key Join	`INNER JOIN` with multiple keys	`pd.merge(df1, df2, on=['key1', 'key2'])`	`inner_join(df1, df2, by=c('key1', 'key2'))`
Join with Renamed Columns	`INNER JOIN` with renamed columns	`pd.merge(df1.rename(columns={'col1': 'key'}), df2, on='key')`	`inner_join(rename(df1, key = col1), df2, by = 'key')`
Join with Complex Condition	`INNER JOIN ... ON` with multiple conditions	`pd.merge(df1[df1['col1'] > 10], df2, left_on='col2', right_on='col3')`	`inner_join(filter(df1, col1 > 10), df2, by = c('col2' = 'col3'))`; dplyr 1.1.0 or later also supports inequality conditions via `join_by()`
Join with Different Key Names	`INNER JOIN` with different key names	`pd.merge(df1, df2, left_on='key1', right_on='key2')`	`inner_join(df1, df2, by = c('key1' = 'key2'))`
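
Semi and anti joins are filtering joins: they keep or drop rows of the left table without adding columns from the right one. A minimal pandas sketch follows, assuming two made-up frames `df1` and `df2` that share a `key` column. It shows a boolean-mask approach with `isin` alongside the `indicator=True` merge approach, and why the merge route can change the row count when the right table has duplicate keys.

```python
import pandas as pd

# Hypothetical toy data; key 3 appears twice in df2 on purpose.
df1 = pd.DataFrame({"key": [1, 2, 3], "x": ["a", "b", "c"]})
df2 = pd.DataFrame({"key": [2, 3, 3, 4], "y": [0.1, 0.2, 0.3, 0.4]})

# Semi join: rows of df1 that have at least one match in df2.
semi = df1[df1["key"].isin(df2["key"])]

# Anti join: rows of df1 that have no match in df2.
anti = df1[~df1["key"].isin(df2["key"])]

# Merge-based anti join: the indicator column records where each row came from.
merged = pd.merge(df1, df2, on="key", how="left", indicator=True)
anti_via_merge = merged.query('_merge == "left_only"')[df1.columns]
```

Note that `merged` has four rows rather than three, because key 3 matches twice in `df2`. The mask-based versions never duplicate rows of `df1`.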

Iteration

Task / Operation	Python (Pandas)	Tidyverse (dplyr and purrr)
Iterate Over Rows	`for index, row in df.iterrows():`	`data %>% rowwise() %>% mutate(new_col = your_function(col))`
	`print(row['col1'], row['col2'])`
Map Function to Column	`df['new_col'] = df['col'].apply(your_function)`	`data %>% mutate(new_col = map_dbl(col, your_function))`
Apply Function to Column	`df['new_col'] = your_function(df['col'])`	`data %>% mutate(new_col = your_function(col))`
Group and Map	`for group, group_df in df.groupby('group_col'):`	`data %>% nest(data = -group_col) %>% mutate(new_col = map(data, your_function))`
Map Over List Column	`df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])`	`data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))`
Map with Anonymous Function	`df['new_col'] = df['col'].apply(lambda x: your_function(x))`	`data %>% mutate(new_col = map_dbl(col, ~your_function(.)))`
Map Multiple Columns	`df['new_col'] = df.apply(lambda row: your_function(row['col1'], row['col2']), axis=1)`	`data %>% mutate(new_col = pmap_dbl(list(col1, col2), ~your_function(...)))`
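
The pandas side of the iteration patterns above can be sketched as follows. `your_function` is a hypothetical stand-in (here it simply doubles its input), and the data is made up for illustration. When a function is already vectorized, passing the whole column is usually preferable to element-wise `apply` or `iterrows`.

```python
import pandas as pd

def your_function(x):
    # Hypothetical placeholder: works on a scalar or on a whole Series.
    return x * 2

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

# Iterate over rows explicitly (slow; mainly useful for side effects such as printing).
rows_seen = [(row["col1"], row["col2"]) for _, row in df.iterrows()]

# Map a function over one column, element by element.
df["mapped"] = df["col1"].apply(your_function)

# Apply a vectorized function to the whole column at once.
df["vectorized"] = your_function(df["col1"])

# Map over several columns row by row.
df["row_wise"] = df.apply(lambda row: row["col1"] + row["col2"], axis=1)
```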

Iteration Over Lists

Task / Operation	Python (Pandas)	Tidyverse (dplyr and purrr)
Map Function Across List Column	`df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])`	`data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))`
Nested Map in List Column	`df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])`	`data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))`
Nested Map Across Columns	-	`data %>% mutate(new_col = map2(col1, col2, ~map(list(.x, .y), your_function)))`
Nested Map Within List Column	-	`data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))`
Map Across Rows with Nested Map	-	`data %>% mutate(new_col = pmap(list(col1, col2), ~list(your_function(.x), your_function(.y))))`
Nested Map Within Nested List	-	`data %>% mutate(new_col = map(list_col, ~map_depth(.x, 2, your_function)))`
Nested Map Across List of Lists	`df['new_col'] = df['list_col'].apply(lambda x: [list(map(your_function, i)) for i in x])`	`data %>% mutate(new_col = map(list_col, ~map(.x, ~map(.x, your_function))))`
Nested Map Across Rows and Lists	-	`data %>% mutate(new_col = pmap(list(col1, col2, col3), ~list(your_function(..1), your_function(..2), your_function(..3))))`
Map and Reduce Across List	`df['new_col'] = df['list_col'].apply(lambda x: reduce(your_function, x))` (with `from functools import reduce`)	`data %>% mutate(new_col = map(list_col, ~reduce(.x, your_function)))`
Map and Reduce Across Rows	`df['new_col'] = df.apply(lambda row: reduce(your_function, row[['col1', 'col2']]), axis=1)`	`data %>% mutate(new_col = pmap(list(col1, col2), ~reduce(list(...), your_function)))`
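
A minimal sketch of the map and reduce patterns on the pandas side is shown below. The function `add` is a hypothetical two-argument stand-in for `your_function`, and the frame is made up for illustration. `reduce` lives in the standard library's `functools` module and has to be imported explicitly.

```python
from functools import reduce

import pandas as pd

def add(a, b):
    # Hypothetical stand-in for a two-argument your_function.
    return a + b

df = pd.DataFrame({
    "list_col": [[1, 2, 3], [4, 5]],
    "col1": [1, 2],
    "col2": [10, 20],
})

# Map over each element inside every list in the list column.
df["mapped"] = df["list_col"].apply(lambda xs: [x * 2 for x in xs])

# Reduce each list in the list column to a single value.
df["reduced"] = df["list_col"].apply(lambda xs: reduce(add, xs))

# Reduce across selected columns, one row at a time.
df["row_reduced"] = df.apply(lambda row: reduce(add, row[["col1", "col2"]]), axis=1)
```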

String Operations

Task / Operation	Python (Pandas)	Tidyverse (dplyr and stringr)
String Length	`df['col'].str.len()`	`data %>% mutate(new_col = str_length(col))`
Concatenate Strings	`df['new_col'] = df['col1'] + df['col2']`	`data %>% mutate(new_col = str_c(col1, col2))`
Split Strings	`df['col'].str.split(', ')`	`data %>% mutate(new_col = str_split(col, ', '))`
Substring	`df['col'].str.slice(0, 5)`	`data %>% mutate(new_col = str_sub(col, 1, 5))`
Replace Substring	`df['col'].str.replace('old', 'new')`	`data %>% mutate(new_col = str_replace_all(col, 'old', 'new'))`
Uppercase / Lowercase	`df['col'].str.upper()`	`data %>% mutate(new_col = str_to_upper(col))`
	`df['col'].str.lower()`	`data %>% mutate(new_col = str_to_lower(col))`
Strip Whitespace	`df['col'].str.strip()`	`data %>% mutate(new_col = str_trim(col))`
Check for Substring	`df['col'].str.contains('pattern')`	`data %>% mutate(new_col = str_detect(col, 'pattern'))`
Count Substring Occurrences	`df['col'].str.count('pattern')`	`data %>% mutate(new_col = str_count(col, 'pattern'))`
Find First Occurrence of Substring	`df['col'].str.find('pattern')`	`data %>% mutate(new_col = str_locate(col, 'pattern')[, 1])`
Extract Substring with Regex	`df['col'].str.extract(r'(\d+)')`	`data %>% mutate(new_col = str_extract(col, '(\\d+)'))`
Remove Duplicates in Strings	`df['col'].drop_duplicates()`	`data %>% distinct(col)` or `str_unique(data$col)`
Pad Strings	`df['col'].str.pad(width=10, side='right', fillchar='0')`	`data %>% mutate(new_col = str_pad(col, width = 10, side = 'right', pad = '0'))`
Truncate Strings	`df['col'].str.slice(0, 10)`	`data %>% mutate(new_col = str_sub(col, 1, 10))`
Title Case	`df['col'].str.title()`	`data %>% mutate(new_col = str_to_title(col))`
Join List of Strings	`'separator'.join(df['col'])`	`data %>% summarize(new_col = str_flatten(col, collapse = 'separator'))`
Remove Punctuation	`df['col'].str.replace(r'[^\w\s]|_', '', regex=True)`	`data %>% mutate(new_col = str_remove_all(col, '[[:punct:]]'))`
String Encoding/Decoding	`df['col'].str.encode('utf-8')`, `.str.decode('utf-8')`	`data %>% mutate(new_col = iconv(col, to = 'UTF-8'))`
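
A few of the string operations above behave slightly differently between pandas and stringr, so a small sketch may help when translating. The Series below is made up for illustration. It contrasts trimming with collapsing internal whitespace, shows that `str.replace` replaces every occurrence, and shows that `str.find` is 0-based and returns -1 when nothing is found (whereas `str_locate` is 1-based and returns `NA`).

```python
import pandas as pd

# Hypothetical toy data, including a missing value.
s = pd.Series(["  a  b  ", "banana", None], dtype="object")

# Trim leading and trailing whitespace only (like stringr::str_trim).
stripped = s.str.strip()

# Trim and also collapse internal runs of whitespace (like stringr::str_squish).
squished = s.str.strip().str.replace(r"\s+", " ", regex=True)

# Literal replacement of every occurrence (like stringr::str_replace_all).
replaced = s.str.replace("a", "o", regex=False)

# 0-based position of the first match, -1 if absent, missing stays missing.
found = s.str.find("na")
```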

Modeling and Machine Learning

Task / Operation	Python (scikit-learn)	R (various packages)
Data Preprocessing	`from sklearn.preprocessing import ...`	`library(caret)`
	`from sklearn.pipeline import Pipeline`	`library(glmnet)`
	`preprocessor = ...`	`preprocess <- preProcess(data, ...)`
Feature Scaling	`StandardScaler()`	`preProcess(data, method = c('center', 'scale'))`
Feature Selection	`SelectKBest()`	`caret::sbf()` or `caret::rfe()`
Data Splitting	`train_test_split()`	`createDataPartition()`
Model Initialization	`model = ...()`	`model <- ...()`
Model Training	`model.fit(X_train, y_train)`	`model <- train(y ~ ., data = data)`
Model Prediction	`y_pred = model.predict(X_test)`	`y_pred <- predict(model, newdata)`
Model Evaluation	`accuracy_score(y_test, y_pred)`	`confusionMatrix(y_pred, y_true)`
Hyperparameter Tuning	`GridSearchCV()`	`train(..., tuneGrid = expand.grid(...))`
Cross-Validation	`cross_val_score()`	`trainControl(method = "cv")`
Model Pipelining	`pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])`	`model <- train(y ~ ., data = data, method = 'glmnet', preProcess = c('center', 'scale'), trControl = trainControl(method = "cv"))`
Feature Engineering	`from sklearn.preprocessing import ...`	`library(caret)`
	Custom feature transformers	Custom feature transformers
Handling Missing Data	`SimpleImputer()`	`preProcess(data, method = 'medianImpute')`
Encoding Categorical Data	`OneHotEncoder()`	`dummyVars()`
Dimensionality Reduction	`PCA()`	`preProcess(data, method = 'pca')`
Model Selection	`GridSearchCV()`	`caret::train()`
Ensemble Learning	Various ensemble methods	`caretEnsemble::caretStack()`
Regularization	Lasso, Ridge, Elastic Net, etc.	`glmnet()`
Model Interpretability	SHAP, Lime, etc.	DALEX, iml, etc.
Model Export/Serialization	`joblib` or `pickle`	`saveRDS` or other formats
Deploying Models	Web frameworks (e.g., Flask, Django)	Web frameworks (e.g., Shiny, Plumber)
Batch Scoring	Scripting or automation tools	R batch processing
Feature Scaling/Normalization	`StandardScaler()`, `MinMaxScaler()`, etc.	`scale()`, `preProcess(data, method = 'range')`, etc.
Feature Selection with L1 Regularization	`SelectFromModel()`, `Lasso()`	`glmnet()`, `cv.glmnet()`
Handling Imbalanced Data	`RandomUnderSampler()`, `SMOTE()`, etc. (from imbalanced-learn)	`trainControl(sampling = 'down')`, `'up'`, or `'smote'`; or `caret::train()` with `weights`
Model Evaluation Metrics	`classification_report()`, `confusion_matrix()`, `mean_squared_error()`, etc.	`confusionMatrix()`, `postResample()`, `RMSE`, etc.
Feature Importance	`.feature_importances_` (Random Forest, etc.)	`varImp()`, `vip()`, etc.
Model Persistence	`joblib`, `pickle`	`saveRDS`, `save()`, `serialize()`, etc.
Time Series Forecasting	`Prophet`, `ARIMA`, `ExponentialSmoothing`, etc.	`forecast`, `prophet`, `auto.arima`, etc.
Natural Language Processing (NLP)	`nltk`, `spaCy`, `textblob`, etc.	`tm`, `quanteda`, `udpipe`, `tm.plugin.webmining`, etc.
Deep Learning	`Keras`, `TensorFlow`, `PyTorch`, etc.	`keras`, `tensorflow`, `torch`, `mxnet`, etc.
Model Interpretation	`SHAP`, `LIME`, `ELI5`, etc.	`DALEX`, `iml`, `iBreakDown`, `lime`, etc.
Model Deployment in Production	Containers, cloud platforms (e.g., Docker, Kubernetes, AWS SageMaker)	Containers, Shiny, Plumber, APIs, cloud platforms

Network Modeling and Dynamics

Task / Operation	Python (NetworkX)	R (various packages)
Network Creation	`G = nx.Graph()`, `G.add_node()`, `G.add_edge()`	`igraph::make_empty_graph()`, `add_vertices()`, `add_edges()`
Node and Edge Attributes	`G.nodes[node]['attribute'] = value`, `G.edges[edge]['attribute'] = value`	`V(graph)$attribute <- value`, `E(graph)$attribute <- value`
Network Visualization	`nx.draw(G)`, `matplotlib` for customization	`plot(graph)`, `igraph`, `ggplot2`, `visNetwork`, etc.
Network Measures	`nx.degree_centrality(G)`, `nx.betweenness_centrality(G)`, `nx.clustering(G)`, etc.	`degree()`, `betweenness()`, `transitivity()`, etc.
Community Detection	`nx.community.louvain_communities(G)`, `nx.community.girvan_newman(G)`	`cluster_louvain()`, `cluster_edge_betweenness()`, `cluster_walktrap()`, `cluster_fast_greedy()`, `cluster_leading_eigen()`, etc.
Link Prediction	`nx.jaccard_coefficient(G)`, `nx.adamic_adar_index(G)`, `nx.preferential_attachment(G)`	`igraph::similarity(graph, method = 'jaccard')`, `igraph::similarity(graph, method = 'invlogweighted')`
Network Filtering/Selection	`G.subgraph(nodes)`	`induced_subgraph(graph, vids)`
Network Embedding	`node2vec`, `GraphSAGE`, etc.	`igraph::embed_adjacency_matrix()`, `igraph::embed_laplacian_matrix()`, etc.
Network Simulation	`nx.erdos_renyi_graph()`, `nx.watts_strogatz_graph()`, etc.	`igraph::sample_gnp()`, `igraph::sample_smallworld()`, etc.
Network Analysis Pipelines	Custom pipelines using NetworkX, Pandas, and other libraries	Custom pipelines using igraph, dplyr, and other packages
Dynamic Network Analysis	`dynetx` for dynamic networks	`tsna` for temporal networks, `dyngraph` for dynamic graphs, etc.
Geospatial Network Analysis	`osmnx` for urban network analysis	`stplanr` for transport planning, `spatnet` for spatial network analysis, etc.
Network Modeling for Machine Learning	Integration with scikit-learn, PyTorch, etc.	Integration with caret, glmnet, keras, etc.
Community Visualization	Visualization of detected communities using network layouts	`igraph::plot.igraph()` with community coloring
Path Analysis	`nx.shortest_path()`, `nx.shortest_simple_paths()`, `nx.all_simple_paths()`	`shortest_paths()`, `all_simple_paths()`
Centrality Analysis	`nx.closeness_centrality(G)`, `nx.eigenvector_centrality(G)`, `nx.katz_centrality(G)`, etc.	`closeness()`, `eigen_centrality()`, `alpha_centrality()`, etc.
Structural Role Analysis	Structural equivalence, equivalence-based roles	`sna::sedist()`, `sna::equiv.clust()`, `sna::blockmodel()`, etc.
Network Robustness Analysis	Network attack simulations, robustness metrics	`igraph::delete_vertices()` followed by `igraph::components()`, etc.
Temporal Network Analysis	Temporal networks, evolving networks	`dynnet` package for dynamic networks, temporal extensions of `igraph` functions
Multiplex Network Analysis	Analyzing multiple layers of networks	`multiplex` package for multilayer networks, `mgm` package for multilayer graphical models
Network Alignment	Aligning nodes in two or more networks	`netAlign` package for network alignment, `gmatch` package for graph matching
Dynamic Community Detection	Detecting evolving communities over time	`dynCOMM` for dynamic community detection
Network Generative Models	Generating networks from various models (e.g., ER, BA, etc.)	`igraph::sample_gnm()`, `igraph::sample_degseq()`, etc.
Geospatial Network Analysis	Geospatial network analysis and routing	`stplanr` for transport planning, `spatnet` for spatial network analysis, etc.
Network Modeling for Machine Learning	Integrating network data with machine learning libraries	Combining `igraph` or custom network features with caret, glmnet, keras, etc.

pdwaggoner/python-to-tidy-R
