fixed directory issue and updated tutorial to describe alternative pipeline
BrentLab · cmatKhan · Dec 18, 2024 · d392cde
1 parent 7fbca45
commit d392cde
Showing 2 changed files with 57 additions and 9 deletions.
diff --git a/docs/tutorials/interactor_modeling_workflow.ipynb b/docs/tutorials/interactor_modeling_workflow.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -455,17 +455,29 @@
     "    --response_file ~/htcf_local/lasso_bootstrap/erics_tfs/response_dataframe_20241105.csv \\\n",
     "    --predictors_file ~/htcf_local/lasso_bootstrap/erics_tfs/predictors_dataframe_20241105.csv \\\n",
     "    --method bootstrap_lassocv \\\n",
-    "    --response_tf CBF1\n",
+    "    --response_tf CBF1 \\\n",
+    "    --output_dir ./workflow_results\n",
     "```\n",
     "\n",
     "the output, in your `$PWD` would look like this:\n",
     "\n",
     "```raw\n",
-    "├── find_interactors_output\n",
-    "│   └── CBF1_output.json\n",
+    "├── workflow_results\n",
+    "│   └── CBF1\n",
+    "│       ├── bootstrap_coef_df_all.csv\n",
+    "│       ├── bootstrap_coef_df_top.csv\n",
+    "│       ├── ci_dict_all.json\n",
+    "│       ├── ci_dict_top.json\n",
+    "│       ├── final_output.json\n",
+    "│       ├── intersection.json\n",
+    "│       └── sequential_top_genes_results\n",
+    "│           ├── bootstrap_coef_df_sequential.csv\n",
+    "│           ├── ci_dict_sequential.json\n",
+    "│           ├── final_output_sequential.json\n",
+    "│           └── intersection_sequential.json\n",
     "```\n",
     "\n",
-    "Where CBF1_output.json looks like:\n",
+    "Where final_output.json looks like:\n",
     "\n",
     "```json\n",
     "{\n",
@@ -479,6 +491,44 @@
     "}\n",
     "```"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The output contains several files. First, the bootstrap_coef_df csv files contain the results of the lasso bootstrap process on either all genes or the top x% of genes. The ci_dict json files are constructed from this data. Each is a dictionary whose key is a confidence level (e.g. \"95.0\") and whose value is a nested dictionary that maps each interaction term to its confidence interval at that level. The final_output.json file, shown above, contains information based on the final models produced by the current workflow. The intersection.json file is included to verify that only the features deemed significant by lasso bootstrap on both all genes and the top x% of genes are carried forward in the pipeline.\n",
+    "\n",
+    "Lastly, note the nested directory sequential_top_genes_results, which contains results from running an alternative pipeline. The current pipeline\n",
+    "\n",
+    "1. performs lasso bootstrap on all specified input TFs for both\n",
+    "    a. all genes  \n",
+    "    b. the top x% of genes,\n",
+    "2. finds the features from bootstrapping whose given confidence intervals do not cross zero\n",
+    "3. takes the intersection of the features from both\n",
+    "4. tests main effects\n",
+    "\n",
+    "The alternative pipeline instead\n",
+    "\n",
+    "1. performs lasso bootstrap on all genes ONLY, using all input TFs,\n",
+    "2. identifies the significant features by applying a CI threshold to those bootstrap results\n",
+    "3. uses ONLY the significant features as inputs to lasso bootstrap on the top x% of genes\n",
+    "4. identifies the remaining significant features by applying a CI threshold to those bootstrap results\n",
+    "5. tests main effects\n",
+    "\n",
+    "The main difference is that the current workflow performs the two lasso bootstraps independently, in parallel, whereas the alternative pipeline performs them sequentially. This is why we informally refer to the current method as the \"parallel\" workflow and the alternative method as the \"sequential\" workflow. The sequential workflow reuses the bootstrap_coef_df_all.csv and ci_dict_all.json files (since both workflows start identically w.r.t. all genes), so the directory sequential_top_genes_results adds bootstrap_coef_df_sequential.csv and ci_dict_sequential.json, which are the results of bootstrapping on the top x% of genes. It also contains intersection_sequential.json, which serves as a sanity check that the significant features remaining after step 4 come only from the set of features that were initially significant from running lasso bootstrap on all genes. Lastly, final_output_sequential.json contains results in the same format as final_output.json above, but for the alternative pipeline.\n",
+    "\n",
+    "We chose to examine this alternative pipeline to see whether it retains more main effects / interaction terms. Our end goal is to move forward with only one of these two pipelines."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
   }
  ],
  "metadata": {
@@ -497,7 +547,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.9"
+   "version": "3.12.7"
   }
  },
  "nbformat": 4,
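Note on the tutorial text added above: the difference between the "parallel" and "sequential" workflows may be easier to follow with a small sketch. The code below is not taken from yeastdnnexplorer. It assumes each ci_dict maps a confidence level such as "95.0" to a dictionary of `[low, high]` intervals keyed by term, and that "does not cross zero" means the interval lies strictly on one side of zero. The helper `significant_features`, the term names, and all interval values are hypothetical toy data.

```python
def significant_features(ci_dict, level="95.0"):
    """Return the terms whose interval at `level` lies strictly on one side of zero."""
    return {
        feature
        for feature, (low, high) in ci_dict[level].items()
        if low > 0 or high < 0
    }


# Hypothetical stand-ins for ci_dict_all.json and ci_dict_top.json.
ci_all = {
    "95.0": {
        "CBF1:MET28": [0.10, 0.40],
        "CBF1:MET32": [-0.20, 0.30],
        "CBF1:PHO4": [-0.50, -0.10],
    }
}
ci_top = {
    "95.0": {
        "CBF1:MET28": [0.05, 0.60],
        "CBF1:MET32": [0.10, 0.20],
        "CBF1:PHO4": [-0.10, 0.20],
    }
}

# Parallel workflow: bootstrap both gene sets independently, then intersect.
sig_all = significant_features(ci_all)
parallel = sig_all & significant_features(ci_top)

# Sequential workflow: the second bootstrap only sees the terms that were
# significant on all genes, so its ci_dict (a hypothetical stand-in for
# ci_dict_sequential.json) contains only those terms.
ci_sequential = {
    "95.0": {
        "CBF1:MET28": [0.08, 0.50],
        "CBF1:PHO4": [-0.30, -0.02],
    }
}
sequential = significant_features(ci_sequential)

# Sanity check in the spirit of intersection_sequential.json: everything
# kept by the sequential workflow was already significant on all genes.
assert sequential <= sig_all

print("parallel:", sorted(parallel))
print("sequential:", sorted(sequential))
```

With these toy intervals the two workflows keep different sets of terms, but that is a property of the made-up numbers and says nothing about real data.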

diff --git a/yeastdnnexplorer/__main__.py b/yeastdnnexplorer/__main__.py
@@ -208,12 +208,10 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
     output_dirpath = os.path.join(args.output_dir, args.response_tf)
     if os.path.exists(output_dirpath):
         raise FileExistsError(
-            f"File {output_dirpath} already exists. "
+            f"Directory {output_dirpath} already exists. "
             "Please specify a different `output_dir`."
         )
     else:
-        os.makedirs(args.output_dir, exist_ok=True)
-        # Ensure the entire output directory path exists
         os.makedirs(output_dirpath, exist_ok=True)
     if not os.path.exists(args.response_file):
         raise FileNotFoundError(f"File {args.response_file} does not exist.")
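
Note on the `__main__.py` change: `os.makedirs` creates any missing intermediate directories, so the removed call on `args.output_dir` was redundant. The sketch below reproduces the check-then-create logic in isolation. The wrapper `make_output_dir` is a hypothetical name used only for illustration, and the paths live in a temporary directory.

```python
import os
import tempfile


def make_output_dir(output_dir, response_tf):
    """Create <output_dir>/<response_tf>, refusing to reuse an existing directory."""
    output_dirpath = os.path.join(output_dir, response_tf)
    if os.path.exists(output_dirpath):
        raise FileExistsError(
            f"Directory {output_dirpath} already exists. "
            "Please specify a different `output_dir`."
        )
    # makedirs also creates output_dir itself if it does not exist yet.
    os.makedirs(output_dirpath, exist_ok=True)
    return output_dirpath


with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "workflow_results")  # does not exist yet
    created = make_output_dir(base, "CBF1")
    created_ok = os.path.isdir(created)

    # A second run for the same TF hits the guard instead of overwriting results.
    try:
        make_output_dir(base, "CBF1")
        second_call_raised = False
    except FileExistsError:
        second_call_raised = True
```

Because the existence check runs first, `exist_ok=True` only matters if another process creates the directory between the check and the call.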