Allow users to run subsequent rounds of Rosetta for specific channels #318

Closed
1 of 3 tasks
alex-l-kong opened this issue Feb 13, 2023 · 7 comments · Fixed by #319

@alex-l-kong
Contributor

alex-l-kong commented Feb 13, 2023

Relevant background

For channel-specific crosstalk smoothing, Rosetta will need to be run again after the first round. We'll refer to this second round as Rosetta V2, and the first round as Rosetta V1.

Rosetta V2 will need additional custom logic to accommodate this.

Design overview

Note that this process will be run using the original test set from V1.

The following accommodations will need to be made for V2:

  1. (NEW) The user should not generate a new test set for V2, and should instead reuse the test set from V1.
  2. For Rosetta V2, the tiled comparison function will need to receive a custom list of output channels specified by the user. This differs from V1 in that it's assumed that all the test data channels are used; it is not currently possible to limit this to just one channel.
  3. The image compensation process needs to receive a Gaussian radius of 0 and a normalization constant of 1. This differs from V1 where currently, the default values are a Gaussian radius of 1 and a norm constant of 200. For V2, these params can be easily set for the compensation function itself. However, the test image generation function will also need to explicitly take in these parameters since it calls the compensation function under the hood.
  4. The function that adds the source channel to the tiled image needs to now explicitly receive the 'rescaled' folder as an image sub folder. This differs from V1, where no image sub folder was needed. Additionally, because the images have already been normalized for V1, there is no need to renormalize the source row to be in the same range as the current tile.
  5. The default functionality for setting the run names to use should now be to programmatically list them out. Users should still have an option to manually set this, but it should be commented out with clear instructions on what to do if this is desired (similar to how we handle manually setting lists of FOVs for the segmentation notebook in ark).
  6. The "official" call to the compensation function should now take the rescaled folder itself as a data sub folder. This differs from V1, where the call to the final compensation function has no data sub folder specified.
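
As a rough summary of points 3, 4, and 6, the settings that differ between the two rounds could be grouped as below. This is only a sketch: `RosettaRoundParams`, `V1_PARAMS`, and `V2_PARAMS` are hypothetical names that do not exist in toffy, and the V1 values are the defaults described in this design doc.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RosettaRoundParams:
    """Hypothetical container for the settings that differ between rounds."""
    gaus_rad: int                # Gaussian blur radius passed to compensation
    norm_const: int              # normalization constant passed to compensation
    img_sub_folder: str          # sub folder holding the images to load
    percent_norm: Optional[int]  # None skips source row normalization


# V1 values are the defaults described in this doc; V2 follows points 3, 4, and 6
V1_PARAMS = RosettaRoundParams(gaus_rad=1, norm_const=200, img_sub_folder='',
                               percent_norm=98)
V2_PARAMS = RosettaRoundParams(gaus_rad=0, norm_const=1, img_sub_folder='rescaled',
                               percent_norm=None)
```

Keeping the two sets of values side by side may make it easier to document which ones the user has to change for V2.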

Code mockup

4a_compensate_image_data.ipynb

This notebook should now remain mostly unchanged.

The only addition to this notebook will be to allow the user to list all of the runs in extracted_imgs_dir for testing. We can copy the logic contained in section 3 of this notebook to this section:

# default rosetta matrix provided in toffy
default_matrix_path = os.path.join('..', 'files', 'commercial_rosetta_matrix_v1.csv')

rosetta_testing_dir = 'C:\\Users\\Customer.ION\\Documents\\rosetta_testing'
extracted_imgs_dir = 'D:\\Extracted_Images'

# if you would like to process all of the run folders in the image dir instead of just the runs tested, you can use the below line
# run_names = list_folders(extracted_imgs_dir)

# read in toffy panel file
panel = load_panel(panel_path)

This addresses point 5 of the design overview.
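
For reference, listing the run folders programmatically could look roughly like the sketch below. It uses only the standard library; `list_run_folders` is a hypothetical stand-in for the `list_folders` helper used in the notebook, whose exact behavior is not reproduced here.

```python
import os


def list_run_folders(base_dir):
    """Return the sorted names of the immediate sub folders of base_dir.

    Hypothetical helper: plain files in base_dir are ignored.
    """
    with os.scandir(base_dir) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())
```

In the notebook this would be called as `run_names = list_run_folders(extracted_imgs_dir)`, with the manual `run_names = [...]` alternative left commented out.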

V2 functionality will be moved over to a new notebook. At the end of this notebook, add a link that directs the user to the round 2 notebook.

We can probably leave this notebook mostly as is. There's just one thing we may need to change:

The structure of this notebook will need to be modified to accommodate V2.

While certain changes will need to be made, a lot of the underlying logic will remain the same. In section 2, either inside or directly underneath the cell that sets the channel name, multipliers, and folder name, we should allow the user to specify the following:

  • output_channels (defaults to None)
  • rosetta_sub_folder (defaults to '')
  • skip_source_norm (defaults to False, this will be explained in rosetta.add_source_channel_to_tiled_image)

In this way, we can explicitly pass these arguments to their respective functions so that Rosetta V2 can run seamlessly. We should guide the users toward the values they need to set for V2 in the documentation.

When listing out the run names, the default should now be to use os.listdir(path/to/run/folders), although we can provide a comment describing how the user can manually set this if needed.

Because V2 may require changing gaus_rad and norm_const, we'll list these out in their respective function calls so the user can change them if need be. They'll be preset to their current defaults in rosetta.py.

4a_compensate_image_data_round2.ipynb

Accepting better names for this notebook!

This notebook will allow the user to run Rosetta V2. The structure will mostly remain the same as 4a_compensate_image_data.ipynb with a few changes.

  1. Instead of generating a new test set, the user should use the same test set generated for Rosetta V1. Instead of
# copy random fovs from each run
rosetta.copy_image_files(cohort_name, run_names, rosetta_testing_dir, extracted_imgs_dir, fovs_per_run=5)

# copy rosetta matrix
shutil.copyfile(default_matrix_path, 
                os.path.join(rosetta_testing_dir, cohort_name, 'commercial_rosetta_matrix.csv'))

# rescale images to allow direct comparison with rosetta
img_out_dir = os.path.join(rosetta_testing_dir, cohort_name, 'extracted_images')
rosetta.rescale_raw_imgs(img_out_dir)

we should now use

# confirm the V1 test set exists before copying anything into it
if not os.path.exists(os.path.join(rosetta_testing_dir, cohort_name)):
    raise ValueError('Cohort %s does not have testing data in %s: please double check these variables' % (cohort_name, rosetta_testing_dir))

# copy rosetta matrix (this will still need to happen)
shutil.copyfile(default_matrix_path, 
                os.path.join(rosetta_testing_dir, cohort_name, 'commercial_rosetta_matrix_v2.csv'))

This will address point 1 in the design overview.
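
A slightly stricter version of this check could also confirm that the V1 image folder is present. The sketch below assumes the V1 layout shown earlier (an 'extracted_images' folder inside the cohort folder); `check_v1_test_set` is a hypothetical helper, not an existing toffy function.

```python
import os


def check_v1_test_set(rosetta_testing_dir, cohort_name):
    """Return the V1 image dir for a cohort, raising ValueError if it is missing."""
    cohort_dir = os.path.join(rosetta_testing_dir, cohort_name)
    # 'extracted_images' is the folder name used when the V1 test set was created
    img_out_dir = os.path.join(cohort_dir, 'extracted_images')
    for path in (cohort_dir, img_out_dir):
        if not os.path.isdir(path):
            raise ValueError(
                'Expected V1 testing data at %s: please double check '
                'rosetta_testing_dir and cohort_name' % path
            )
    return img_out_dir
```

Returning the image directory means the notebook can set `img_out_dir` and validate the test set in a single step.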

  2. The user will need to explicitly specify which output channel(s) to compensate against. The cell
# pick the channel that you will be optimizing the coefficient for
current_channel_name = 'Noodle'

# set multipliers
multipliers = [0.5, 1, 2]

# pick an informative name
folder_name = 'rosetta_test1'

needs to change to:

# pick the channel that you will be optimizing the coefficient for
current_channel_name = 'Noodle'

# set multipliers
multipliers = [0.5, 1, 2]

# pick an informative name
folder_name = 'rosetta_test1'

# output channel(s) to compensate against
output_channel_names = ['chan_name']  # we'll use a more informative one in the actual notebook

This addresses point 2 in the design overview.

  3. The variable rosetta_mat_path will need to be changed to rosetta_mat_path = os.path.join(rosetta_testing_dir, cohort_name, 'commercial_rosetta_matrix_v2.csv'). This distinguishes it from the existing V1 matrix copied into this folder in 4a_compensate_image_data.ipynb.

  4. The call to rosetta.generate_rosetta_test_imgs needs to be updated to:

# compensate the example fov images
rosetta.generate_rosetta_test_imgs(rosetta_mat_path, img_out_dir, multipliers, folder_path, 
                                   panel, current_channel_name, output_channel_names=output_channel_names,
                                   gaus_rad=0, norm_const=1)

This addresses points 2 and 3 in the design overview. rosetta.generate_rosetta_test_imgs will need to be updated to take gaus_rad and norm_const as parameters; this is described later.

  5. The call to rosetta.create_tiled_comparison needs to be changed to:
rosetta.create_tiled_comparison(input_dir_list=rosetta_dirs, output_dir=stitched_dir, max_img_size=img_size, 
                                channels=output_channel_names)

This addresses point 2 in the design overview. create_tiled_comparison will need to handle both None and explicit lists for its channels argument; this is described later.

  6. The call to rosetta.add_source_channel_to_tiled_image needs to be changed to:
rosetta.add_source_channel_to_tiled_image(raw_img_dir=img_out_dir, tiled_img_dir=stitched_dir,
                                          output_dir=output_dir, source_channel=current_channel_name,
                                          max_img_size=img_size, img_sub_folder='rescaled',
                                          percent_norm=None)

This addresses point 4 in the design overview. add_source_channel_to_tiled_image will need to handle percent_norm=None to skip normalization of the source row; this is described later.

  7. The call to rosetta.compensate_image_data needs to be changed to:
    rosetta.compensate_image_data(raw_data_dir=os.path.join(extracted_imgs_dir, run), 
                                  comp_data_dir=os.path.join(rosetta_image_dir, run), 
                                  comp_mat_path=final_rosetta_path, panel_info=panel,
                                  raw_data_sub_folder='rescaled', batch_size=1,
                                  gaus_rad=0, norm_const=1)

This addresses points 3 and 6 in the design overview.

rosetta.add_source_channel_to_tiled_image

This function should add control flow logic that skips the source row normalization steps on V2. Allow the parameter percent_norm to take None as a value. Add a param called skip_source_norm (analogous to skip_source_norm set in notebook 4a), and add a check prior to the normalization step.

The function will now be changed to:

def add_source_channel_to_tiled_image(raw_img_dir, tiled_img_dir, output_dir, source_channel,
                                      max_img_size, img_sub_folder='', percent_norm=98):
    """Adds the specified source_channel to the first row of previously generated tiled images
    Args:
        raw_img_dir (str): path to directory containing the raw images
        tiled_img_dir (str): path to directory containing the tiled images
        output_dir (str): path to directory where outputs will be saved
        img_sub_folder (str): subfolder within raw_img_dir to load images from
        max_img_size (int): largest fov image size
        source_channel (str): the channel which will be prepended to the tiled images
        percent_norm (int or None): percentile normalization param to enable easy
            visualization, set to None to skip this step"""

    # load source images
    source_imgs = load_utils.load_imgs_from_tree(raw_img_dir, channels=[source_channel],
                                                 img_sub_folder=img_sub_folder,
                                                 max_image_size=max_img_size)

    # convert stacked images to concatenated row
    source_list = [source_imgs.values[fov, :, :, 0] for fov in range(source_imgs.shape[0])]
    source_row = np.concatenate(source_list, axis=1)

    # CHANGE
    # get percentile of source row if percent_norm set, otherwise leave unset
    # (compare against None explicitly so that a percent_norm of 0 is not silently skipped)
    perc_source = np.percentile(source_row, percent_norm) if percent_norm is not None else None

    # confirm tiled images have expected shape
    tiled_images = io_utils.list_files(tiled_img_dir)
    test_file = io.imread(os.path.join(tiled_img_dir, tiled_images[0]))
    if test_file.shape[1] != source_row.shape[1]:
        raise ValueError('Tiled image {} has shape {}, but source image {} has '
                         'shape {}'.format(tiled_images[0], test_file.shape, source_channel,
                                           source_row.shape))

    # loop through each tiled image, prepend source row, and save
    for tile_name in tiled_images:
        current_tile = io.imread(os.path.join(tiled_img_dir, tile_name))

        # CHANGE
        # if percent_norm set, normalize the source row to be in the same range as the current tile
        # otherwise, just leave as is (divide by 1)
        perc_ratio = 1
        if percent_norm is not None:
            perc_tile = np.percentile(current_tile, percent_norm)
            perc_ratio = perc_source / perc_tile

        rescaled_source = source_row / perc_ratio

        # combine together and save
        combined_tile = np.concatenate([rescaled_source, current_tile])
        save_name = tile_name.split('.tiff')[0] + '_source_' + source_channel + '.tiff'
        image_utils.save_image(os.path.join(output_dir, save_name), combined_tile)
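
The normalization step on its own can be sketched as a small standalone function, which may also make the None case easier to unit test. `rescale_source_row` is a hypothetical name used only for illustration, and the sketch assumes NumPy arrays as inputs.

```python
import numpy as np


def rescale_source_row(source_row, current_tile, percent_norm=98):
    """Rescale source_row into the range of current_tile.

    Hypothetical helper: passing percent_norm=None returns source_row unchanged,
    which is the V2 behavior since the images were already normalized in V1.
    """
    if percent_norm is None:
        return source_row

    # match the chosen percentile of the source row to that of the tile
    perc_source = np.percentile(source_row, percent_norm)
    perc_tile = np.percentile(current_tile, percent_norm)
    return source_row / (perc_source / perc_tile)
```

Note that this sketch, like the mockup, does not guard against a tile whose percentile value is 0.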

rosetta.generate_rosetta_test_imgs

This function now needs to explicitly receive gaus_rad and norm_const params so it can pass them to compensate_image_data.

def generate_rosetta_test_imgs(rosetta_mat_path, img_out_dir,  multipliers, folder_path, panel,
                               current_channel_name='Noodle', output_channel_names=None,
                               gaus_rad=1, norm_const=200):
    """ Compensate example FOV images based on given multipliers
    Args:
        rosetta_mat_path (str): path to rosetta compensation matrix
        img_out_dir (str): directory where extracted images are stored
        multipliers (list): list of coefficient multipliers to create different matrices for
        folder_path (str): base dir for testing, image subdirs will be stored here
        panel (pd.DataFrame): the panel containing the masses and channel names
        current_channel_name (str): channel being adjusted, default Noodle
        output_channel_names (list): subset of the channels to compensate for, default None is all
        gaus_rad (int): radius for blurring image data. Passing 0 will result in no blurring
        norm_const (int): constant used for rescaling

    Returns:
        Create subdirs containing rosetta compensated images for each multiplier and stitched imgs
    """
    io_utils.validate_paths([rosetta_mat_path, img_out_dir, folder_path])

    # get mass information
    current_channel_mass = get_masses_from_channel_names([current_channel_name], panel)

    if output_channel_names is not None:
        output_masses = get_masses_from_channel_names(output_channel_names, panel)
    else:
        output_masses = None

    # generate rosetta matrices for each multiplier
    create_rosetta_matrices(default_matrix=rosetta_mat_path, save_dir=folder_path,
                            multipliers=multipliers, current_channel_name=current_channel_name,
                            output_channel_names=output_channel_names, masses=current_channel_mass)

    # define the file prefix used for each compensation matrix file
    matrix_name = io_utils.remove_file_extensions([os.path.basename(rosetta_mat_path)])[0]
    output_chan_str = '_'.join(output_channel_names) if output_channel_names is not None else 'all'
    comp_file_prefix = f'{current_channel_name}_{output_chan_str}_{matrix_name}_mult'

    # loop over each multiplier and compensate the data
    rosetta_dirs = [img_out_dir]
    for multiplier in multipliers:
        print(f'Processing images with multiplier {multiplier}')
        rosetta_mat_path = os.path.join(folder_path, f'{comp_file_prefix}_{multiplier}.csv')
        rosetta_out_dir = os.path.join(folder_path, 'compensated_data_{}'.format(multiplier))
        rosetta_dirs.append(rosetta_out_dir)
        os.makedirs(rosetta_out_dir)
        compensate_image_data(raw_data_dir=img_out_dir, comp_data_dir=rosetta_out_dir,
                              comp_mat_path=rosetta_mat_path, raw_data_sub_folder='rescaled',
                              panel_info=panel, batch_size=1, gaus_rad=gaus_rad,
                              norm_const=norm_const, output_masses=output_masses)
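
The file naming used in the loop above can be isolated as follows. This is a sketch only: `comp_matrix_filenames` is a hypothetical helper, `os.path.splitext` stands in for `io_utils.remove_file_extensions`, and it assumes that `create_rosetta_matrices` writes its files under the same names that the loop looks up.

```python
import os


def comp_matrix_filenames(rosetta_mat_path, multipliers, current_channel_name,
                          output_channel_names=None):
    """Build the per-multiplier compensation matrix file names (hypothetical helper)."""
    # strip the directory and extension from the base matrix path
    matrix_name = os.path.splitext(os.path.basename(rosetta_mat_path))[0]

    # 'all' marks the V1 case where no output channels are specified
    output_chan_str = ('_'.join(output_channel_names)
                       if output_channel_names is not None else 'all')
    comp_file_prefix = f'{current_channel_name}_{output_chan_str}_{matrix_name}_mult'
    return [f'{comp_file_prefix}_{multiplier}.csv' for multiplier in multipliers]
```

Because the output channel names are part of the file name, V1 and V2 matrices generated in the same folder should not overwrite each other.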

rosetta.create_tiled_comparison

To support explicit channel lists passed into this function, we need to add the custom logic for setting channels:

# this will happen for V1, when channels = None
if channels is None:
    channels = test_data.channels.values
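
A standalone sketch of this defaulting logic is shown below, with an extra check that any explicitly requested channels exist. `resolve_channels` is a hypothetical helper, and the validation step is an assumption rather than part of the mockup.

```python
def resolve_channels(channels, available_channels):
    """Return the channels to tile (hypothetical helper).

    None selects every available channel (V1); an explicit list is
    validated against the available channels first (V2).
    """
    available = list(available_channels)
    if channels is None:
        return available

    missing = [chan for chan in channels if chan not in available]
    if missing:
        raise ValueError(f'Channels not found in test data: {missing}')
    return list(channels)
```

Checking `channels is None` rather than `not channels` avoids an ambiguous truth value error if a NumPy array is ever passed in.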

Testing

This should remain the same, as we're not changing the underlying functionality of Rosetta, just utilizing parameters that have otherwise been for niche use cases.

Required inputs

For V2, the user will need to explicitly specify the parameters defined in the Code mockup section (output_channels, rosetta_sub_folder, and skip_source_norm).

Output files

Same as before.

Timeline
Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.

  • A couple days
  • A week
  • Multiple weeks. For large projects, make sure to agree on a plan that isn't just a single monster PR at the end.

Estimated date when a fully implemented version will be ready for review:

02/25/23

Estimated date when the finalized project will be merged in:

03/02/23

@alex-l-kong alex-l-kong added the design_doc Detailed implementation plan label Feb 13, 2023
@alex-l-kong alex-l-kong changed the title Allow users to run subsequent rounds of Rosetta for specific channel Allow users to run subsequent rounds of Rosetta for specific channels Feb 13, 2023
@alex-l-kong
Contributor Author

@HPiyadasa has approved this. @ngreenwald any additional comments before I begin?

@alex-l-kong alex-l-kong self-assigned this Feb 15, 2023
@ngreenwald
Member

4 and 5 look like they are in disagreement. Will the source images be added or not?

Since the user is almost never going to change gaus_rad or norm_const, why do these need to be added to the first part of the notebook?

@alex-l-kong
Contributor Author

Ah yes, the classic "I accidentally updated previous information in a new bullet point, and forgot to delete the old one". Fixed.

I thought it might be easier for users to have gaus_rad and norm_const explicitly specified as variables for V2, but we can leave them as explicit function params instead. @HPiyadasa and others should know to change these there.

@ngreenwald let me know if this looks good, or if you think there's anything else we can avoid explicitly setting.

@ngreenwald
Member

Looks good

@alex-l-kong
Contributor Author

alex-l-kong commented Mar 2, 2023

@HPiyadasa I talked with @ngreenwald at pipeline and we agree that it's best to have the Rosetta V2 as a separate notebook. In that way, after setting parameters, we allow the user to run all cells seamlessly without needing to backtrack or stop before a certain point.

I will update the design doc accordingly then tag you both on a review of it. The functionality will be the same, just that the flow will be different.

@alex-l-kong
Contributor Author

@ngreenwald @HPiyadasa let me know if this updated design doc looks good or if anything else needs to be changed to accommodate the new notebook.

@ngreenwald
Member

This design doc is now a bit too long. The goal is to be able to easily read through the whole thing and identify potential issues. Given that you updated it after you'd already made progress, it makes sense that there are more details than normal, but at this point there's not much of an advantage to reading this versus just looking at the PR itself.
