From 9e832e1ffd2d7e95af7a6c0d816a80a5966cddc6 Mon Sep 17 00:00:00 2001
From: ejiawustl <89940465+ejiawustl@users.noreply.github.com>
Date: Wed, 18 Dec 2024 10:50:04 -0800
Subject: [PATCH] generating results for a sequential modeling pipeline (to
 compare to our current pipeline) (#132)

* modifying main.py to generate results for the sequential pipeline modeling (to compare to the current pipeline) for bootstrap_lassocv method. Note that a slight modification was made to the source code to utilize an additional keyword argument in generate_modeling_data, but this essentially doesn't change the source code at all

* fixed directory issue and updated tutorial to describe alternative pipeline
---
 .../interactor_modeling_workflow.ipynb        |  62 ++++++-
 yeastdnnexplorer/__main__.py                  | 158 +++++++++++++++++-
 yeastdnnexplorer/ml_models/lasso_modeling.py  |   1 +
 3 files changed, 212 insertions(+), 9 deletions(-)

diff --git a/docs/tutorials/interactor_modeling_workflow.ipynb b/docs/tutorials/interactor_modeling_workflow.ipynb
index a323d43..61b8486 100644
--- a/docs/tutorials/interactor_modeling_workflow.ipynb
+++ b/docs/tutorials/interactor_modeling_workflow.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -455,17 +455,29 @@
     "    --response_file ~/htcf_local/lasso_bootstrap/erics_tfs/response_dataframe_20241105.csv \\\n",
     "    --predictors_file ~/htcf_local/lasso_bootstrap/erics_tfs/predictors_dataframe_20241105.csv \\\n",
     "    --method bootstrap_lassocv \\\n",
-    "    --response_tf CBF1\n",
+    "    --response_tf CBF1 \\\n",
+    "    --output_dir /workflow_results\n",
     "```\n",
     "\n",
     "the output, in your `$PWD` would look like this:\n",
     "\n",
     "```raw\n",
-    "├── find_interactors_output\n",
-    "│   └── CBF1_output.json\n",
+    "├── workflow_results\n",
+    "    └── CBF1\n",
+    "        └── bootstrap_coef_df_all.csv\n",
+    "        └── bootstrap_coef_df_top.csv\n",
+    "        └── ci_dict_all.json\n",
+    "        └── ci_dict_top.json\n",
+    "        └── final_output.json\n",
+    "        └── intersection.json\n",
+    "        └── sequential_top_genes_results\n",
+    "            └── bootstrap_coef_df_sequential.csv\n",
+    "            └── ci_dict_sequential.json\n",
+    "            └── final_output_sequential.json\n",
+    "            └── intersection_sequential.json\n",
     "```\n",
     "\n",
-    "Where CBF1_output.json looks like:\n",
+    "Where final_output.json looks like:\n",
     "\n",
     "```json\n",
     "{\n",
@@ -479,6 +491,44 @@
     "}\n",
     "```"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are a number of files in this output that are returned. First, the bootstrap_coef_df csv files contain the results of the lasso bootstrap process on either all genes or the top x% of genes. It is from this data that the ci_dict json files are constructed. These files are dictionaries whose key is a confidence interval (i.e. \"95.0\") and whose value is a nested dictionary whose key is the an interaction term and whose value is the resulting confidence interval corresponding to the original key. The final_output.json file as shown above contains information based on the final models resulting from the current workflow. The intersection.json file is included to verify that only the features deemed significant by both lasso boostrap on all genes and the top x% of genes are the only features used in the pipeline moving forward. \n",
+    "\n",
+    "Lastly, it is important to note the nested directory sequential_top_genes_results. This directory contains results from running an alternative pipeline. Whereas the current pipeline \n",
+    "\n",
+    "1. performs lasso boostrap on all specified input TFs for both \n",
+    "    a. all genes  \n",
+    "    b. the top x% of genes, \n",
+    "2. finds the features from boostrapping whose given confidence intervals do not cross zero\n",
+    "3. takes the intersection of features from both \n",
+    "4. tests main effects\n",
+    "\n",
+    "This alternative pipeline instead \n",
+    "\n",
+    "1. performs lasso boostrap on ONLY all genes using all input TFs first, \n",
+    "2. identifies the significant features using a CI threshold for the bootstrapping results above\n",
+    "3. uses ONLY the features that were found to be significant to be inputs to perform lasso bootstrap on the top x% of genes  \n",
+    "4. identify the remaining significant features using a CI threshold for the bootstrapping results above\n",
+    "5. tests main effects\n",
+    "\n",
+    "You can see how the main difference is essentially that our current workflow performs lasso boostrap independently in parallel, whereas this alternative pipeline performs boostrapping sequentially. This is why we informally refer to the current method as the \"parallel\" workflow and the alternative method as the \"sequential\" workflow. Thus, in the directory sequential_top_genes_results, this uses the bootstrap_coef_df_all.csv and ci_dict_all.json files (since the start of both workflows is identical w.r.t. all genes) and then contains bootstrap_coef_df_sequential.csv and ci_dict_sequential.json which are the results after performing bootstrapping on the top x% of genes. It also contains intersection_sequential.json which serves as a sanity check to ensure that up to and after step 4 that the set of significant features comes only from the set of features that were initially significant from running lasso boostrap on all genes. Lastly, final_output_sequential.json contains results in the same format as the other final output file from above, but of course for this alternative pipeline.\n",
+    "\n",
+    "We chose to examine this alternative pipeline to see if it could potentially result in more main effects / interaction terms being kept. Our end goal is to make a decision to move forward with only one of these two pipelines."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
   }
  ],
  "metadata": {
@@ -497,7 +547,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.9"
+   "version": "3.12.7"
   }
  },
  "nbformat": 4,
diff --git a/yeastdnnexplorer/__main__.py b/yeastdnnexplorer/__main__.py
index 9919d61..e73fa04 100644
--- a/yeastdnnexplorer/__main__.py
+++ b/yeastdnnexplorer/__main__.py
@@ -195,6 +195,9 @@ def run_lasso_bootstrap(args: argparse.Namespace) -> None:
     bootstrap_alphas.to_csv(bootstrap_alphas_path, index=False)
 
 
+# 12/1: new attempt
+
+
 def find_interactors_workflow(args: argparse.Namespace) -> None:
     """
     Run the find_interactors_workflow with the specified arguments.
@@ -205,12 +208,10 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
     output_dirpath = os.path.join(args.output_dir, args.response_tf)
     if os.path.exists(output_dirpath):
         raise FileExistsError(
-            f"File {output_dirpath} already exists. "
+            f"Directory {output_dirpath} already exists. "
             "Please specify a different `output_dir`."
         )
     else:
-        os.makedirs(args.output_dir, exist_ok=True)
-        # Ensure the entire output directory path exists
         os.makedirs(output_dirpath, exist_ok=True)
     if not os.path.exists(args.response_file):
         raise FileNotFoundError(f"File {args.response_file} does not exist.")
@@ -245,6 +246,10 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
     }
 
     # Save the results from bootstrapping for further analysis
+    # also perform the sequential processing where we take the
+    # results from "all" and call get_significant_predictors
+    # using only the TFs from these results on the top x% of data
+    # Sequential analysis
     if args.method == "bootstrap_lassocv":
         # Iterate through "all" and "top"
         for suffix, lasso_results in lasso_res.items():
@@ -260,6 +265,60 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
                 output_dirpath, f"bootstrap_coef_df_{suffix}.csv"
             )
             bootstrap_coef_df.to_csv(bootstrap_coef_df_path, index=False)
+        # Get significant predictors from "all" results
+        assert isinstance(lasso_res["all"]["sig_coefs"], dict)
+        significant_predictors_all = list(lasso_res["all"]["sig_coefs"].keys())
+
+        # Build the formula using these predictors
+        response_variable = f"{args.response_tf}_LRR"
+        predictor_variable = re.sub(r"_rep\\d+", "", args.response_tf)
+
+        # Ensure the main effect of the perturbed TF is included
+        formula_terms = significant_predictors_all.copy()
+        if predictor_variable not in formula_terms:
+            formula_terms.insert(0, predictor_variable)
+
+        # Build the formula string
+        formula = f"{response_variable} ~ {' + '.join(formula_terms)}"
+
+        # Add '-1' to the formula if drop_intercept is True
+        formula += " - 1"
+
+        # Run get_significant_predictors with the custom formula
+        sequential_lasso_res = get_significant_predictors(
+            args.method,
+            args.response_tf,
+            response_df,
+            predictors_df,
+            ci_percentile=args.top_ci_percentile,
+            n_bootstraps=args.n_bootstraps,
+            add_max_lrb=True,
+            quantile_threshold=args.data_quantile,
+            formula=formula,  # Pass the formula via kwargs
+        )
+
+        # Rest of the code remains the same...
+        # Store the results under a new directory
+        sequential_output_dir = os.path.join(
+            output_dirpath, "sequential_top_genes_results"
+        )
+        os.makedirs(sequential_output_dir, exist_ok=True)
+
+        # Save ci_dict and bootstrap_coef_df from sequential_lasso_res
+        ci_dict_seq = sequential_lasso_res["bootstrap_lasso_output"][0]
+        ci_dict_seq_path = os.path.join(
+            sequential_output_dir, "ci_dict_sequential.json"
+        )
+        with open(ci_dict_seq_path, "w") as f:
+            json.dump(ci_dict_seq, f, indent=4)
+
+        bootstrap_coef_df_seq = pd.DataFrame(
+            sequential_lasso_res["bootstrap_lasso_output"][1]
+        )
+        bootstrap_coef_df_seq_path = os.path.join(
+            sequential_output_dir, "bootstrap_coef_df_sequential.csv"
+        )
+        bootstrap_coef_df_seq.to_csv(bootstrap_coef_df_seq_path, index=False)
 
     # Ensure lasso_res["all"]["sig_coefs"] and
     # lasso_res["top"]["sig_coefs"] are dictionaries
@@ -400,6 +459,99 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
     with open(output_path, "w") as f:
         json.dump(output_dict, f, indent=4)
 
+    # If bootstrap_lassocv is the chosen method, need to repeat steps 3 and 4 for the
+    # sequential results
+    if args.method == "bootstrap_lassocv":
+        # Step 3 and 4 for the sequential method
+
+        # Get the significant coefficients from the sequential results
+        sequential_sig_coefs = sequential_lasso_res["sig_coefs"]
+        assert isinstance(sequential_sig_coefs, dict)
+        sequential_final_features = set(sequential_sig_coefs.keys())
+
+        # Save the sequential significant coefficients as a dictionary
+        intersection_seq_path = os.path.join(
+            sequential_output_dir, "intersection_sequential.json"
+        )
+        with open(intersection_seq_path, "w") as f:
+            json.dump(list(sequential_final_features), f, indent=4)
+
+        # Get main effects
+        main_effects_seq = []
+        for term in sequential_final_features:
+            if ":" in term:
+                main_effects_seq.append(term.split(":")[1])
+            else:
+                main_effects_seq.append(term)
+
+        # Combine main effects with final features
+        interactor_terms_and_main_effects_seq = (
+            list(sequential_final_features) + main_effects_seq
+        )
+
+        # Generate model matrix
+        _, full_X_seq = generate_modeling_data(
+            args.response_tf,
+            response_df,
+            predictors_df,
+            formula=f"~ {' + '.join(interactor_terms_and_main_effects_seq)}",
+            drop_intercept=False,
+        )
+
+        # Add max_lrb column
+        full_X_seq["max_lrb"] = max_lrb
+
+        # Use the response and classes from the "all" data (NOT sequential data)
+        response_seq = lasso_res["all"]["response"]
+        classes_seq = lasso_res["all"]["classes"]
+
+        # Step 3: Get interactor importance using "all" data
+        (
+            full_avg_rsquared_seq,
+            interactor_results_seq,
+        ) = get_interactor_importance(
+            response_seq,
+            full_X_seq,
+            classes_seq,
+            sequential_final_features,
+        )
+
+        # Update final features based on interactor results (no change needed here)
+        for interactor_variant in interactor_results_seq:
+            k = interactor_variant["interactor"]
+            v = interactor_variant["variant"]
+            sequential_final_features.remove(k)
+            sequential_final_features.add(v)
+
+        # Step 4: Compare results of final model vs. a univariate model on "all" data
+        assert isinstance(lasso_res["all"]["predictors"], pd.DataFrame)
+        avg_r2_univariate_seq = stratified_cv_r2(
+            response_seq,
+            lasso_res["all"]["predictors"][[model_tf]],
+            classes_seq,
+        )
+
+        final_model_avg_r_squared_seq = stratified_cv_r2(
+            response_seq,
+            full_X_seq[list(sequential_final_features)],
+            classes_seq,
+        )
+
+        # Prepare output dictionary
+        output_dict_seq = {
+            "response_tf": args.response_tf,
+            "final_features": list(sequential_final_features),
+            "avg_r2_univariate": avg_r2_univariate_seq,
+            "final_model_avg_r_squared": final_model_avg_r_squared_seq,
+        }
+
+        # Save the output
+        output_path_seq = os.path.join(
+            sequential_output_dir, "final_output_sequential.json"
+        )
+        with open(output_path_seq, "w") as f:
+            json.dump(output_dict_seq, f, indent=4)
+
 
 class CustomHelpFormatter(argparse.HelpFormatter):
     """Custom help formatter to format the subcommands and general options sections."""
diff --git a/yeastdnnexplorer/ml_models/lasso_modeling.py b/yeastdnnexplorer/ml_models/lasso_modeling.py
index 4af10aa..a98b274 100644
--- a/yeastdnnexplorer/ml_models/lasso_modeling.py
+++ b/yeastdnnexplorer/ml_models/lasso_modeling.py
@@ -589,6 +589,7 @@ def get_significant_predictors(
         predictors_df,
         quantile_threshold=kwargs.get("quantile_threshold", None),
         drop_intercept=True,
+        formula=kwargs.get("formula", None),  # Pass formula from kwargs
     )
 
     # NOTE: fit_intercept is set to `true`