# A Running List of Key Python Operations Translated to (Mostly) Tidy R

I frequently write code in both Python and R, and my team relies heavily on Tidyverse syntax. As a result, I am often translating key Python operations (pandas, matplotlib, etc.) into tidy R (dplyr, ggplot2, etc.). To ease that translation, and to crowdsource a running directory of these translations, I created this repo.

This is just a start. Please feel free to share it, and to contribute or revise directly via pull requests or issues.
Note: In practice, I recommend using the native pipe operator (`|>`) instead of the magrittr pipe (`%>%`) when constructing piped operations. However, I used the latter throughout this repo because the `|` in the native R pipe threw off the formatting of the markdown tables.
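For illustration, the two pipes are interchangeable in most simple pipelines; a minimal sketch using dplyr's built-in `starwars` data:

```r
library(dplyr)

# magrittr pipe (used in the tables below)
starwars %>% filter(species == "Human") %>% count(homeworld)

# native pipe (recommended in practice, R >= 4.1)
starwars |> filter(species == "Human") |> count(homeworld)
```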
- Key tasks
- Joining Data
- Iteration
- Iteration Over Lists
- String Operations
- Modeling and Machine Learning
- Network Modeling and Dynamics
- Parallel Computing
## Key tasks

| Task / Operation | Python (Pandas) | Tidyverse (dplyr, ggplot2) |
|---|---|---|
| Data Loading | `import pandas as pd`<br>`df = pd.read_csv('file.csv')` | `library(readr)`<br>`data <- read_csv('file.csv')` |
| Select Columns | `df[['col1', 'col2']]` | `data %>% select(col1, col2)` |
| Filter Rows | `df[df['col'] > 5]` | `data %>% filter(col > 5)` |
| Arrange Rows | `df.sort_values(by='col')` | `data %>% arrange(col)` |
| Mutate (Add Columns) | `df['new_col'] = df['col1'] + df['col2']` | `data %>% mutate(new_col = col1 + col2)` |
| Group and Summarize | `df.groupby('col').agg({'col2': 'mean'})` | `data %>% group_by(col) %>% summarize(mean_col2 = mean(col2))` |
| Wide to Long (Pivot) | `pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')` | `data %>% pivot_longer(-id, names_to = 'variable', values_to = 'value')` (supersedes `gather()`) |
| Long to Wide (Pivot) | `df.pivot(index='id', columns='variable', values='value')` | `data %>% pivot_wider(names_from = variable, values_from = value)` (supersedes `spread()`) |
| Data Visualization | Matplotlib, Seaborn, Plotly, etc., e.g.:<br>`import matplotlib.pyplot as plt`<br>`plt.scatter(df['x'], df['y'])` | ggplot2, e.g.:<br>`library(ggplot2)`<br>`ggplot(data, aes(x = x, y = y)) + geom_point()` |
| Bind Rows | `pd.concat([df1, df2], axis=0)` | `bind_rows(df1, df2)` |
| Bind Columns | `pd.concat([df1, df2], axis=1)` | `bind_cols(df1, df2)` |
| String Manipulation | `df['col'].str.replace('a', 'b')` | `data %>% mutate(col = str_replace(col, 'a', 'b'))` |
| Date and Time | `pd.to_datetime(df['date_col'])` | `data %>% mutate(date_col = as.Date(date_col))` |
| Missing Data Handling | `df.dropna()` | `data %>% drop_na()` |
| Rename Columns | `df.rename(columns={'old_col': 'new_col'})` | `data %>% rename(new_col = old_col)` |
| Summary Statistics | `df.describe()` | `summary(data)` or `glimpse(data)` |
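To show how these verbs chain together, here is a minimal end-to-end sketch using dplyr's built-in `starwars` data:

```r
library(dplyr)

starwars %>%
  select(name, species, height, mass) %>%        # select columns
  filter(!is.na(species)) %>%                    # filter rows
  mutate(bmi = mass / (height / 100)^2) %>%      # add a column
  group_by(species) %>%                          # group...
  summarize(mean_bmi = mean(bmi, na.rm = TRUE),  # ...and summarize
            n = n()) %>%
  arrange(desc(n))                               # arrange rows
```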
## Joining Data

This is the only table that includes SQL, since most of the R/dplyr join operations were patterned and named after SQL operations.
| Join Type | SQL | Python (Pandas) | R (dplyr) |
|---|---|---|---|
| Inner Join | `INNER JOIN` | `pd.merge(df1, df2, on='key')` | `inner_join(df1, df2, by='key')` |
| Left Join | `LEFT JOIN` | `pd.merge(df1, df2, on='key', how='left')` | `left_join(df1, df2, by='key')` |
| Right Join | `RIGHT JOIN` | `pd.merge(df1, df2, on='key', how='right')` | `right_join(df1, df2, by='key')` |
| Full Outer Join | `FULL OUTER JOIN` | `pd.merge(df1, df2, on='key', how='outer')` | `full_join(df1, df2, by='key')` |
| Cross Join | `CROSS JOIN` | `pd.merge(df1, df2, how='cross')` | `cross_join(df1, df2)` (dplyr >= 1.1.0) |
| Anti Join | Not directly supported | `pd.merge(df1, df2, on='key', how='left', indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)` | `anti_join(df1, df2, by='key')` |
| Semi Join | Not directly supported | `pd.merge(df1, df2, on='key', how='inner', indicator=True).query('_merge == "both"').drop('_merge', axis=1)` | `semi_join(df1, df2, by='key')` |
| Self Join | `INNER JOIN` with the same table | `pd.merge(df, df, on='key')` | `inner_join(df, df, by='key')` |
| Multiple Key Join | `INNER JOIN` with multiple keys | `pd.merge(df1, df2, on=['key1', 'key2'])` | `inner_join(df1, df2, by=c('key1', 'key2'))` |
| Join with Renamed Columns | `INNER JOIN` with renamed columns | `pd.merge(df1.rename(columns={'col1': 'key'}), df2, on='key')` | `inner_join(rename(df1, key = col1), df2, by='key')` |
| Join with Complex (Non-Equi) Condition | `INNER JOIN` with inequality conditions | `pd.merge(df1, df2, on='key').query('col1 > col3')` | `inner_join(df1, df2, by=join_by(key, col1 > col3))` (dplyr >= 1.1.0) |
| Join with Different Key Names | `INNER JOIN` with different key names | `pd.merge(df1, df2, left_on='key1', right_on='key2')` | `inner_join(df1, df2, by=c('key1' = 'key2'))` |
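Because the filtering joins (`semi_join()`, `anti_join()`) have no single pandas verb, a small runnable sketch with hypothetical toy tables may help:

```r
library(dplyr)

orders    <- tibble(order_id = 1:4, customer_id = c(1, 2, 2, 5))
customers <- tibble(customer_id = 1:3, name = c("Ana", "Bo", "Cy"))

inner_join(orders, customers, by = "customer_id")  # matching rows, columns from both
semi_join(orders, customers, by = "customer_id")   # orders with a known customer (no new columns)
anti_join(orders, customers, by = "customer_id")   # orders with no matching customer
```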
## Iteration

| Task / Operation | Python (Pandas) | Tidyverse (dplyr and purrr) |
|---|---|---|
| Iterate Over Rows | `for index, row in df.iterrows():`<br>`    print(row['col1'], row['col2'])` | `data %>% rowwise() %>% mutate(new_col = your_function(col))` |
| Map Function to Column | `df['new_col'] = df['col'].apply(your_function)` | `data %>% mutate(new_col = map_dbl(col, your_function))` |
| Apply Vectorized Function to Column | `df['new_col'] = your_function(df['col'])` | `data %>% mutate(new_col = your_function(col))` |
| Group and Map | `for group, group_df in df.groupby('group_col'):` | `data %>% group_by(group_col) %>% nest() %>% mutate(new_col = map(data, your_function))` |
| Map Over List Column | `df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))` |
| Map with Anonymous Function | `df['new_col'] = df['col'].apply(lambda x: your_function(x))` | `data %>% mutate(new_col = map_dbl(col, ~your_function(.x)))` |
| Map Multiple Columns | `df['new_col'] = df.apply(lambda row: your_function(row['col1'], row['col2']), axis=1)` | `data %>% mutate(new_col = pmap_dbl(list(col1, col2), ~your_function(..1, ..2)))` |
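A minimal runnable sketch of `map_dbl()` and `pmap_dbl()` on toy data (`~ .x * 2` and `..1 + ..2` stand in for `your_function`):

```r
library(dplyr)
library(purrr)

data <- tibble(col1 = c(1, 2, 3), col2 = c(10, 20, 30))

data %>%
  mutate(
    doubled = map_dbl(col1, ~ .x * 2),                 # map over one column
    summed  = pmap_dbl(list(col1, col2), ~ ..1 + ..2)  # map over several columns in parallel
  )
```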
## Iteration Over Lists

| Task / Operation | Python (Pandas) | Tidyverse (dplyr and purrr) |
|---|---|---|
| Map Function Across List Column | `df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))` |
| Nested Map Across Two List Columns | - | `data %>% mutate(new_col = map2(list_col1, list_col2, ~map2(.x, .y, your_function)))` |
| Map Across Rows with Nested Map | - | `data %>% mutate(new_col = pmap(list(col1, col2), ~list(your_function(..1), your_function(..2))))` |
| Nested Map Across a List of Lists | `df['new_col'] = df['list_col'].apply(lambda x: [list(map(your_function, i)) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(.x, function(l) map(l, your_function))))` |
| Nested Map Across Rows and Lists | - | `data %>% mutate(new_col = pmap(list(col1, col2, col3), ~list(your_function(..1), your_function(..2), your_function(..3))))` |
| Map and Reduce Across List | `from functools import reduce`<br>`df['new_col'] = df['list_col'].apply(lambda x: reduce(your_function, x))` | `data %>% mutate(new_col = map(list_col, ~reduce(.x, your_function)))` |
| Map and Reduce Across Rows | `df['new_col'] = df.apply(lambda row: reduce(your_function, row[['col1', 'col2']]), axis=1)` | `data %>% mutate(new_col = pmap(list(col1, col2), ~reduce(list(..1, ..2), your_function)))` |
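A runnable sketch of mapping and reducing over a list column (toy data; `sqrt` and `+` stand in for `your_function`):

```r
library(dplyr)
library(purrr)

data <- tibble(
  id       = 1:2,
  list_col = list(c(1, 4, 9), c(16, 25))
)

data %>%
  mutate(
    roots = map(list_col, ~map_dbl(.x, sqrt)),   # map within each cell
    total = map_dbl(list_col, ~reduce(.x, `+`))  # map and reduce each cell
  )
```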
## String Operations

| Task / Operation | Python (Pandas) | Tidyverse (dplyr and stringr) |
|---|---|---|
| String Length | `df['col'].str.len()` | `data %>% mutate(new_col = str_length(col))` |
| Concatenate Strings | `df['new_col'] = df['col1'] + df['col2']` | `data %>% mutate(new_col = str_c(col1, col2))` |
| Split Strings | `df['col'].str.split(', ')` | `data %>% mutate(new_col = str_split(col, ', '))` |
| Substring | `df['col'].str.slice(0, 5)` | `data %>% mutate(new_col = str_sub(col, 1, 5))` |
| Replace Substring | `df['col'].str.replace('old', 'new')` | `data %>% mutate(new_col = str_replace(col, 'old', 'new'))` |
| Uppercase / Lowercase | `df['col'].str.upper()`<br>`df['col'].str.lower()` | `data %>% mutate(new_col = str_to_upper(col))`<br>`data %>% mutate(new_col = str_to_lower(col))` |
| Strip Whitespace | `df['col'].str.strip()` | `data %>% mutate(new_col = str_trim(col))` (use `str_squish()` to also collapse internal whitespace) |
| Check for Substring | `df['col'].str.contains('pattern')` | `data %>% mutate(new_col = str_detect(col, 'pattern'))` |
| Count Substring Occurrences | `df['col'].str.count('pattern')` | `data %>% mutate(new_col = str_count(col, 'pattern'))` |
| Find First Occurrence of Substring | `df['col'].str.find('pattern')` | `data %>% mutate(new_col = str_locate(col, 'pattern')[, 1])` |
| Extract Substring with Regex | `df['col'].str.extract(r'(\d+)')` | `data %>% mutate(new_col = str_extract(col, '(\\d+)'))` |
| Deduplicate Strings | `df['col'].drop_duplicates()` | `str_unique(data$col)` (stringr >= 1.5.0) |
| Pad Strings | `df['col'].str.pad(width=10, side='right', fillchar='0')` | `data %>% mutate(new_col = str_pad(col, width = 10, side = 'right', pad = '0'))` |
| Truncate Strings | `df['col'].str.slice(0, 10)` | `data %>% mutate(new_col = str_sub(col, 1, 10))` (see also `str_trunc()`) |
| Title Case | `df['col'].str.title()` | `data %>% mutate(new_col = str_to_title(col))` |
| Collapse / Join Strings | `'separator'.join(df['col'])` | `str_flatten(data$col, collapse = 'separator')` |
| Remove Punctuation | `df['col'].str.replace(r'[^\w\s]', '', regex=True)` | `data %>% mutate(new_col = str_remove_all(col, '[[:punct:]]'))` |
| String Encoding Conversion | `df['col'].str.encode('utf-8')` | `data %>% mutate(new_col = iconv(col, to = 'UTF-8'))` (base R; stringr has no encoder) |
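A quick runnable sketch combining several of the stringr verbs above (toy data):

```r
library(dplyr)
library(stringr)

data <- tibble(col = c("  Order #123  ", "order #45", "ORDER #6789"))

data %>%
  mutate(
    clean  = str_squish(str_to_lower(col)),     # trim/collapse whitespace, lowercase
    id     = str_extract(col, "\\d+"),          # pull out the digits
    padded = str_pad(id, width = 4, pad = "0")  # left-pad to fixed width
  )
```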
## Modeling and Machine Learning

| Task / Operation | Python (scikit-learn) | R (various packages) |
|---|---|---|
| Data Preprocessing | `from sklearn.preprocessing import ...`<br>`from sklearn.pipeline import Pipeline`<br>`preprocessor = ...` | `library(caret)`<br>`preprocess <- preProcess(data, ...)` |
| Feature Scaling / Normalization | `StandardScaler()`, `MinMaxScaler()`, etc. | `preProcess(data, method = c('center', 'scale'))`; base `scale()` |
| Feature Selection | `SelectKBest()` | `caret::rfe()` (recursive feature elimination) |
| Data Splitting | `train_test_split()` | `createDataPartition()` |
| Model Initialization | `model = ...()` | `model <- ...()` |
| Model Training | `model.fit(X_train, y_train)` | `model <- train(y ~ ., data = data)` |
| Model Prediction | `y_pred = model.predict(X_test)` | `y_pred <- predict(model, newdata)` |
| Model Evaluation | `accuracy_score(y_test, y_pred)` | `confusionMatrix(y_pred, y_true)` |
| Hyperparameter Tuning | `GridSearchCV()` | `train(..., tuneGrid = expand.grid(...))` |
| Cross-Validation | `cross_val_score()` | `trainControl(method = "cv")` |
| Model Pipelining | `pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])` | `train(y ~ ., data = data, method = ..., preProcess = c('center', 'scale'), trControl = trainControl(method = "cv"))` |
| Feature Engineering | `from sklearn.preprocessing import ...`; custom feature transformers | `library(caret)` or `library(recipes)`; custom step functions |
| Handling Missing Data | `SimpleImputer()` | `preProcess(data, method = 'knnImpute')` |
| Encoding Categorical Data | `OneHotEncoder()` | `dummyVars()` |
| Dimensionality Reduction | `PCA()` | `preProcess(data, method = 'pca')` |
| Model Selection | `GridSearchCV()` | `caret::train()` |
| Ensemble Learning | Various ensemble methods | `caretEnsemble::caretStack()` |
| Regularization | Lasso, Ridge, Elastic Net, etc. | `glmnet()` |
| Feature Selection with L1 Regularization | `SelectFromModel()`, `Lasso()` | `glmnet()`, `cv.glmnet()` |
| Handling Imbalanced Data | `RandomUnderSampler()`, `SMOTE()`, etc. (imbalanced-learn) | `caret::train()` with `weights` or `sampling` |
| Model Evaluation Metrics | `classification_report()`, `confusion_matrix()`, `mean_squared_error()`, etc. | `confusionMatrix()`, `postResample()`, `RMSE()`, etc. |
| Feature Importance | `.feature_importances_` (random forests, etc.) | `varImp()`, `vip()`, etc. |
| Model Persistence / Serialization | `joblib`, `pickle` | `saveRDS()`, `save()`, `serialize()`, etc. |
| Model Interpretability | SHAP, LIME, ELI5, etc. | `DALEX`, `iml`, `iBreakDown`, `lime`, etc. |
| Batch Scoring | Scripting or automation tools | `Rscript` batch processing |
| Time Series Forecasting | `Prophet`, `ARIMA`, `ExponentialSmoothing`, etc. | `forecast`, `prophet`, `auto.arima()`, etc. |
| Natural Language Processing (NLP) | `nltk`, `spaCy`, `textblob`, etc. | `tm`, `quanteda`, `udpipe`, etc. |
| Deep Learning | `Keras`, `TensorFlow`, `PyTorch`, etc. | `keras`, `tensorflow`, `torch`, etc. |
| Model Deployment in Production | Web frameworks, containers, cloud platforms (e.g., Flask, Django, Docker, Kubernetes, AWS SageMaker) | Shiny, Plumber, containers, APIs, cloud platforms |
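A minimal caret sketch of the split / train / predict / evaluate cycle, using the built-in `iris` data and an rpart tree as a stand-in model:

```r
library(caret)

set.seed(42)

# split the data (roughly scikit-learn's train_test_split)
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

# train with 5-fold cross-validation (roughly Pipeline + GridSearchCV)
fit <- train(Species ~ ., data = train_df,
             method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# predict and evaluate (roughly accuracy_score / classification_report)
preds <- predict(fit, newdata = test_df)
confusionMatrix(preds, test_df$Species)
```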
## Network Modeling and Dynamics

| Task / Operation | Python (NetworkX) | R (various packages) |
|---|---|---|
| Network Creation | `G = nx.Graph()`, `G.add_node()`, `G.add_edge()` | `igraph::graph()`, `add_vertices()`, `add_edges()` |
| Node and Edge Attributes | `G.nodes[node]['attribute'] = value`, `G.edges[edge]['attribute'] = value` | `V(graph)$attribute <- value`, `E(graph)$attribute <- value` |
| Network Visualization | `nx.draw(G)`, matplotlib for customization | `plot(graph)`; `igraph`, `ggplot2`, `visNetwork`, etc. |
| Network Measures | `nx.degree_centrality(G)`, `nx.betweenness_centrality(G)`, `nx.clustering(G)`, etc. | `degree()`, `betweenness()`, `transitivity()`, etc. |
| Community Detection | `nx.community.louvain_communities(G)`, `nx.community.girvan_newman(G)`, etc. | `cluster_walktrap()`, `cluster_fast_greedy()`, `cluster_leading_eigen()`, etc. |
| Link Prediction | `nx.jaccard_coefficient(G)`, `nx.adamic_adar_index(G)`, `nx.preferential_attachment(G)` | `igraph::similarity(graph, method = 'jaccard')`, etc. |
| Network Filtering/Selection | `G.subgraph(nodes)` | `induced_subgraph(graph, vertices)` |
| Network Embedding | `node2vec`, `GraphSAGE`, etc. | `igraph::embed_adjacency_matrix()` (spectral embedding) |
| Network Simulation | `nx.erdos_renyi_graph()`, `nx.watts_strogatz_graph()`, etc. | `igraph::sample_gnp()`, `igraph::sample_smallworld()`, etc. |
| Network Analysis Pipelines | Custom pipelines using NetworkX, pandas, and other libraries | Custom pipelines using igraph, tidygraph, dplyr, and other packages |
| Dynamic / Temporal Network Analysis | `dynetx` for dynamic networks | `tsna` for temporal networks, `networkDynamic` for dynamic network data |
| Geospatial Network Analysis | `osmnx` for urban network analysis and routing | `stplanr` for transport planning, `sfnetworks` for spatial network analysis |
| Community Visualization | Visualization of detected communities using network layouts | `plot(communities, graph)` with community coloring |
| Path Analysis | `nx.shortest_path()`, `nx.all_simple_paths()`, etc. | `shortest_paths()`, `all_simple_paths()` |
| Centrality Analysis | `nx.closeness_centrality()`, `nx.eigenvector_centrality()`, `nx.katz_centrality()`, etc. | `closeness()`, `eigen_centrality()`, `alpha_centrality()` (Katz-style), etc. |
| Structural Role Analysis | Structural equivalence, equivalence-based roles | `sna::sedist()` for structural equivalence, the `blockmodeling` package |
| Network Robustness Analysis | Network attack simulations, robustness metrics | `igraph::delete_vertices()` plus recomputed metrics for attack simulations |
| Multiplex Network Analysis | Analyzing multiple layers of networks | `multinet` and `multiplex` packages for multilayer networks |
| Network Alignment | Aligning nodes in two or more networks | `iGraphMatch` package for graph matching |
| Dynamic Community Detection | Detecting evolving communities over time | `DynComm` package for dynamic community detection |
| Network Generative Models | `nx.gnm_random_graph()`, `nx.barabasi_albert_graph()`, etc. | `igraph::sample_gnm()`, `igraph::sample_degseq()`, `igraph::sample_pa()`, etc. |
| Network Modeling for Machine Learning | Integration with scikit-learn, PyTorch, etc. | Combining igraph or custom network features with caret, glmnet, keras, etc. |
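A small runnable igraph sketch covering creation, centrality measures, and community detection:

```r
library(igraph)

set.seed(42)

# simulate a small-world network (cf. nx.watts_strogatz_graph)
g <- sample_smallworld(dim = 1, size = 50, nei = 3, p = 0.05)

# centrality measures (cf. nx.degree_centrality, nx.betweenness_centrality)
head(degree(g))
head(betweenness(g))

# community detection (cf. nx.community.girvan_newman)
communities <- cluster_walktrap(g)
membership(communities)

# plot with community coloring
plot(communities, g)
```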