generating results for a sequential modeling pipeline (to compare to our current pipeline) #132

Merged: 2 commits, Dec 18, 2024
62 changes: 56 additions & 6 deletions docs/tutorials/interactor_modeling_workflow.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -455,17 +455,29 @@
" --response_file ~/htcf_local/lasso_bootstrap/erics_tfs/response_dataframe_20241105.csv \\\n",
" --predictors_file ~/htcf_local/lasso_bootstrap/erics_tfs/predictors_dataframe_20241105.csv \\\n",
" --method bootstrap_lassocv \\\n",
" --response_tf CBF1\n",
" --response_tf CBF1 \\\n",
" --output_dir /workflow_results\n",
"```\n",
"\n",
"The output, in the directory specified by `--output_dir`, would look like this:\n",
"\n",
"```raw\n",
"├── find_interactors_output\n",
"│ └── CBF1_output.json\n",
"└── workflow_results\n",
"    └── CBF1\n",
"        ├── bootstrap_coef_df_all.csv\n",
"        ├── bootstrap_coef_df_top.csv\n",
"        ├── ci_dict_all.json\n",
"        ├── ci_dict_top.json\n",
"        ├── final_output.json\n",
"        ├── intersection.json\n",
"        └── sequential_top_genes_results\n",
"            ├── bootstrap_coef_df_sequential.csv\n",
"            ├── ci_dict_sequential.json\n",
"            ├── final_output_sequential.json\n",
"            └── intersection_sequential.json\n",
"```\n",
"\n",
"Where CBF1_output.json looks like:\n",
"Where final_output.json looks like:\n",
"\n",
"```json\n",
"{\n",
@@ -479,6 +491,44 @@
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This output contains several files. The bootstrap_coef_df CSV files contain the results of the lasso bootstrap process on either all genes or the top x% of genes; the ci_dict JSON files are constructed from this data. Each ci_dict is a dictionary keyed by confidence level (e.g. \"95.0\") whose value is a nested dictionary mapping each interaction term to its confidence interval at that level. The final_output.json file, shown above, contains information on the final models produced by the current workflow. The intersection.json file is included to verify that only the features deemed significant by lasso bootstrap on both all genes and the top x% of genes are used in the pipeline moving forward. \n",
"\n",
"Also important is the nested directory sequential_top_genes_results. This directory contains results from running an alternative pipeline. Whereas the current pipeline \n",
"\n",
"1. performs lasso bootstrap on all specified input TFs for both \n",
" a. all genes \n",
" b. the top x% of genes, \n",
"2. finds the features from bootstrapping whose confidence intervals do not cross zero\n",
"3. takes the intersection of features from both \n",
"4. tests main effects\n",
"\n",
"This alternative pipeline instead \n",
"\n",
"1. performs lasso bootstrap on ONLY all genes using all input TFs first, \n",
"2. identifies the significant features using a CI threshold for the bootstrapping results above\n",
"3. uses ONLY the features found to be significant as inputs when performing lasso bootstrap on the top x% of genes \n",
"4. identifies the remaining significant features using a CI threshold for the bootstrapping results above\n",
"5. tests main effects\n",
"\n",
"The main difference, then, is that our current workflow performs lasso bootstrap independently, in parallel, whereas the alternative pipeline performs bootstrapping sequentially. This is why we informally refer to the current method as the \"parallel\" workflow and the alternative method as the \"sequential\" workflow. The directory sequential_top_genes_results therefore reuses bootstrap_coef_df_all.csv and ci_dict_all.json (the start of both workflows is identical with respect to all genes) and additionally contains bootstrap_coef_df_sequential.csv and ci_dict_sequential.json, the results of performing bootstrapping on the top x% of genes. It also contains intersection_sequential.json, which serves as a sanity check that, through step 4, the set of significant features comes only from the features initially found significant by lasso bootstrap on all genes. Lastly, final_output_sequential.json contains results in the same format as the other final output file above, but for this alternative pipeline.\n",
"\n",
"We chose to examine this alternative pipeline to see if it could potentially result in more main effects / interaction terms being kept. Our end goal is to make a decision to move forward with only one of these two pipelines."
]
},
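The ci_dict construction described above can be sketched in a few lines; this is a minimal, hypothetical re-creation (the column names, confidence levels, and the percentile rule are assumptions — the package's actual implementation lives in lasso_modeling.py):

```python
import numpy as np
import pandas as pd

# Fake bootstrap coefficients: one row per bootstrap sample, one column per term.
rng = np.random.default_rng(0)
bootstrap_coef_df = pd.DataFrame(
    {
        "CBF1": rng.normal(0.8, 0.1, 1000),        # clearly non-zero effect
        "CBF1:MET31": rng.normal(0.0, 0.2, 1000),  # straddles zero
    }
)

# Build {confidence level -> {term -> [lower, upper]}} from percentiles.
ci_dict = {}
for level in (95.0, 99.0):
    alpha = (100.0 - level) / 2.0
    ci_dict[str(level)] = {
        term: [
            float(np.percentile(bootstrap_coef_df[term], alpha)),
            float(np.percentile(bootstrap_coef_df[term], 100.0 - alpha)),
        ]
        for term in bootstrap_coef_df.columns
    }

# A term is "significant" at a given level when its CI does not cross zero.
lo, hi = ci_dict["95.0"]["CBF1"]
print(lo > 0 or hi < 0)  # True: CBF1's interval lies above zero
```

A term like `CBF1:MET31`, whose interval straddles zero, would be dropped by the CI filter at that level.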
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
@@ -497,7 +547,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.7"
}
},
"nbformat": 4,
158 changes: 155 additions & 3 deletions yeastdnnexplorer/__main__.py
@@ -195,6 +195,9 @@ def run_lasso_bootstrap(args: argparse.Namespace) -> None:
bootstrap_alphas.to_csv(bootstrap_alphas_path, index=False)


# 12/1: new attempt


def find_interactors_workflow(args: argparse.Namespace) -> None:
"""
Run the find_interactors_workflow with the specified arguments.
@@ -205,12 +208,10 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
output_dirpath = os.path.join(args.output_dir, args.response_tf)
if os.path.exists(output_dirpath):
raise FileExistsError(
f"File {output_dirpath} already exists. "
f"Directory {output_dirpath} already exists. "
"Please specify a different `output_dir`."
)
else:
os.makedirs(args.output_dir, exist_ok=True)
# Ensure the entire output directory path exists
os.makedirs(output_dirpath, exist_ok=True)
if not os.path.exists(args.response_file):
raise FileNotFoundError(f"File {args.response_file} does not exist.")
@@ -245,6 +246,10 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
}

# Save the results from bootstrapping for further analysis
# also perform the sequential processing where we take the
# results from "all" and call get_significant_predictors
# using only the TFs from these results on the top x% of data
# Sequential analysis
if args.method == "bootstrap_lassocv":
# Iterate through "all" and "top"
for suffix, lasso_results in lasso_res.items():
@@ -260,6 +265,60 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
output_dirpath, f"bootstrap_coef_df_{suffix}.csv"
)
bootstrap_coef_df.to_csv(bootstrap_coef_df_path, index=False)
# Get significant predictors from "all" results
assert isinstance(lasso_res["all"]["sig_coefs"], dict)
significant_predictors_all = list(lasso_res["all"]["sig_coefs"].keys())

# Build the formula using these predictors
response_variable = f"{args.response_tf}_LRR"
predictor_variable = re.sub(r"_rep\d+", "", args.response_tf)

# Ensure the main effect of the perturbed TF is included
formula_terms = significant_predictors_all.copy()
if predictor_variable not in formula_terms:
formula_terms.insert(0, predictor_variable)

# Build the formula string
formula = f"{response_variable} ~ {' + '.join(formula_terms)}"

# Drop the intercept ('- 1'), matching drop_intercept=True downstream
formula += " - 1"

# Run get_significant_predictors with the custom formula
sequential_lasso_res = get_significant_predictors(
args.method,
args.response_tf,
response_df,
predictors_df,
ci_percentile=args.top_ci_percentile,
n_bootstraps=args.n_bootstraps,
add_max_lrb=True,
quantile_threshold=args.data_quantile,
formula=formula, # Pass the formula via kwargs
)

# Store the results under a new directory
sequential_output_dir = os.path.join(
output_dirpath, "sequential_top_genes_results"
Review comment from @cmatKhan (Member), Dec 4, 2024:

in your results_20241202, I see this pattern:

results_20241202/ACE2/ACE2/

Where there is a directory with the same name nested under the other. As far as I know, this wasn't happening before. It isn't immediately clear to me that it is in this line that this occurs, but nowhere else stands out.

On a related note: somewhere in the docs, in the cmd line --help, or even as a README that gets deposited in every results output directory, there needs to be a definition of what "sequential_top_genes_results" are.

I agree with you that keeping this in the main workflow for now (meaning, not creating another cmd line option to turn this feature on and off) is a good idea. It just needs to be documented so that it is clear what these results are.

Review comment from @cmatKhan (Member), Dec 4, 2024:

Other than this, it looks good to me in terms of format. I haven't pulled and run this, and __main__ is where my testing is not strict. My strategy is to implement the logic elsewhere, and do only very simple things that I am quite certain of in __main__ (not a strong strategy). Please do make sure that you go over your added code carefully. Play with it in a console/notebook if you need to be sure of something.
)
os.makedirs(sequential_output_dir, exist_ok=True)

# Save ci_dict and bootstrap_coef_df from sequential_lasso_res
ci_dict_seq = sequential_lasso_res["bootstrap_lasso_output"][0]
ci_dict_seq_path = os.path.join(
sequential_output_dir, "ci_dict_sequential.json"
)
with open(ci_dict_seq_path, "w") as f:
json.dump(ci_dict_seq, f, indent=4)

bootstrap_coef_df_seq = pd.DataFrame(
sequential_lasso_res["bootstrap_lasso_output"][1]
)
bootstrap_coef_df_seq_path = os.path.join(
sequential_output_dir, "bootstrap_coef_df_sequential.csv"
)
bootstrap_coef_df_seq.to_csv(bootstrap_coef_df_seq_path, index=False)

# Ensure lasso_res["all"]["sig_coefs"] and
# lasso_res["top"]["sig_coefs"] are dictionaries
@@ -400,6 +459,99 @@ def find_interactors_workflow(args: argparse.Namespace) -> None:
with open(output_path, "w") as f:
json.dump(output_dict, f, indent=4)

# If bootstrap_lassocv is the chosen method, we need to repeat steps 3 and 4
# for the sequential results
if args.method == "bootstrap_lassocv":
# Step 3 and 4 for the sequential method

# Get the significant coefficients from the sequential results
sequential_sig_coefs = sequential_lasso_res["sig_coefs"]
assert isinstance(sequential_sig_coefs, dict)
sequential_final_features = set(sequential_sig_coefs.keys())

# Save the sequential significant coefficients as a dictionary
intersection_seq_path = os.path.join(
sequential_output_dir, "intersection_sequential.json"
)
with open(intersection_seq_path, "w") as f:
json.dump(list(sequential_final_features), f, indent=4)

# Get main effects
main_effects_seq = []
for term in sequential_final_features:
if ":" in term:
main_effects_seq.append(term.split(":")[1])
else:
main_effects_seq.append(term)

# Combine main effects with final features
interactor_terms_and_main_effects_seq = (
list(sequential_final_features) + main_effects_seq
)

# Generate model matrix
_, full_X_seq = generate_modeling_data(
args.response_tf,
response_df,
predictors_df,
formula=f"~ {' + '.join(interactor_terms_and_main_effects_seq)}",
drop_intercept=False,
)

# Add max_lrb column
full_X_seq["max_lrb"] = max_lrb

# Use the response and classes from the "all" data (NOT sequential data)
response_seq = lasso_res["all"]["response"]
classes_seq = lasso_res["all"]["classes"]

# Step 3: Get interactor importance using "all" data
(
full_avg_rsquared_seq,
interactor_results_seq,
) = get_interactor_importance(
response_seq,
full_X_seq,
classes_seq,
sequential_final_features,
)

# Update final features based on interactor results
for interactor_variant in interactor_results_seq:
k = interactor_variant["interactor"]
v = interactor_variant["variant"]
sequential_final_features.remove(k)
sequential_final_features.add(v)

# Step 4: Compare results of final model vs. a univariate model on "all" data
assert isinstance(lasso_res["all"]["predictors"], pd.DataFrame)
avg_r2_univariate_seq = stratified_cv_r2(
response_seq,
lasso_res["all"]["predictors"][[model_tf]],
classes_seq,
)

final_model_avg_r_squared_seq = stratified_cv_r2(
response_seq,
full_X_seq[list(sequential_final_features)],
classes_seq,
)

# Prepare output dictionary
output_dict_seq = {
"response_tf": args.response_tf,
"final_features": list(sequential_final_features),
"avg_r2_univariate": avg_r2_univariate_seq,
"final_model_avg_r_squared": final_model_avg_r_squared_seq,
}

# Save the output
output_path_seq = os.path.join(
sequential_output_dir, "final_output_sequential.json"
)
with open(output_path_seq, "w") as f:
json.dump(output_dict_seq, f, indent=4)
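The parallel-versus-sequential distinction that intersection.json and intersection_sequential.json record can be illustrated with plain sets; the feature names below are made up, and the "significant" sets stand in for CI-filtered bootstrap results:

```python
# Parallel workflow: bootstrap on all genes and on the top x% of genes
# independently, then intersect the significant features.
sig_all = {"CBF1", "CBF1:MET31", "CBF1:GCN4"}
sig_top = {"CBF1", "CBF1:MET31", "CBF1:TYE7"}
parallel_features = sig_all & sig_top  # what intersection.json records

# Sequential workflow: the significant features from the all-genes pass are
# the only candidates offered to the top-x% pass, so its output is a subset
# of sig_all by construction (the property intersection_sequential.json
# sanity-checks).
candidates = sig_all
sig_top_sequential = {"CBF1", "CBF1:GCN4"}  # pretend result of the second pass
assert sig_top_sequential <= candidates

print(sorted(parallel_features))  # ['CBF1', 'CBF1:MET31']
```

The sequential scheme can keep a feature (here `CBF1:GCN4`) that the parallel intersection would discard, which is exactly the "more terms kept" possibility the comparison is meant to probe.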


class CustomHelpFormatter(argparse.HelpFormatter):
"""Custom help formatter to format the subcommands and general options sections."""
1 change: 1 addition & 0 deletions yeastdnnexplorer/ml_models/lasso_modeling.py
@@ -589,6 +589,7 @@ def get_significant_predictors(
predictors_df,
quantile_threshold=kwargs.get("quantile_threshold", None),
drop_intercept=True,
formula=kwargs.get("formula", None), # Pass formula from kwargs
)

# NOTE: fit_intercept is set to `true`
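The `formula` kwarg threaded through `get_significant_predictors` above is assembled in `__main__.py` as shown earlier in this diff; here is a standalone re-creation of that assembly (the TF and predictor names are illustrative):

```python
import re

# Illustrative inputs: a perturbed TF with a replicate suffix, and the
# features found significant by the all-genes bootstrap pass.
response_tf = "CBF1_rep1"
significant_predictors_all = ["CBF1:MET31", "max_lrb"]

response_variable = f"{response_tf}_LRR"
# Strip any replicate suffix to recover the TF's main-effect name.
predictor_variable = re.sub(r"_rep\d+", "", response_tf)

# Ensure the main effect of the perturbed TF is included.
formula_terms = significant_predictors_all.copy()
if predictor_variable not in formula_terms:
    formula_terms.insert(0, predictor_variable)

# "- 1" drops the intercept, matching drop_intercept=True downstream.
formula = f"{response_variable} ~ {' + '.join(formula_terms)} - 1"
print(formula)  # CBF1_rep1_LRR ~ CBF1 + CBF1:MET31 + max_lrb - 1
```

Note the replicate-stripping pattern must be a single-backslash `\d` inside the raw string; `r"_rep\\d+"` would match a literal backslash and never strip the suffix.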