
Restructure results/checkpoint + New features: (analysis tools, large dataset support, 3D viz and more) #79

Merged Aug 8, 2023 (179 commits)

Conversation

calebweinreb
Contributor

Summary

This PR introduces the following changes/features, which are explained in more detail below.

  • New logic for syllable indexing
  • Altered format for results and checkpoint files
  • New notebook and widgets for statistical analysis of syllables (thanks to @versey-sherry!)
  • Support for modeling large datasets (multiple GPUs + partial serialization)
  • Interactive visualizations for 3D data
  • Minor feature additions (NWB support, syllable similarity plot)

New logic for syllable indexing

Until now, the `extract_results` step of keypoint-MoSeq saved syllable sequences in their original indexing (as they were represented during modeling) along with a "reindexed" version in which syllables were re-labeled by frequency (so syllable "0" was the most frequent, and so on). But this approach had a fatal flaw: when a fitted model was applied to new data, the syllable frequencies could be different, which would lead to a slightly different re-labeling, so that e.g. syllable "0" would refer to one state in a subset of recordings and a different state in another subset.

To prevent this issue, we now reindex syllables directly inside the model object. That way, if the model is used later to generate syllables for new data, the resulting labels will always be consistent. See #72 for details. Concretely, this means that:

  1. The standard modeling pipeline now includes a new step after model fitting but before extracting results:
kpms.reindex_syllables_in_checkpoint(project_dir, model_name)
  2. The results files no longer include separate "syllables" and "syllables_reindexed" fields (see below for more details).
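For intuition, the frequency-based relabeling can be sketched in plain numpy (a toy illustration, not the actual kpms internals):

```python
import numpy as np

# Toy syllable sequence in the model's original indexing
z = np.array([2, 2, 0, 1, 2, 0])

# Count occurrences of each syllable
counts = np.bincount(z)           # syllable 0 -> 2, 1 -> 1, 2 -> 3

# Original labels ordered from most to least frequent
index = np.argsort(-counts)       # [2, 0, 1]

# Map each original label to its frequency rank, so "0" is most frequent
relabeled = np.argsort(index)[z]  # [0, 0, 1, 2, 0, 1]
```

The same `np.argsort(index)` lookup is used in the conversion code below to translate old labels into the new indexing.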

New format for results and checkpoint files

This PR introduces a new format for the results.h5 and checkpoint.p files saved during modeling. This is a breaking change, meaning that results/checkpoints generated with a previous version of the code will no longer work. Below we explain the changes and provide code for converting to the new format.

How the formats have changed

From a user perspective, the main change is that the results.h5 no longer contains separate syllables and syllables_reindexed fields; instead there is a single syllable field containing the frequency-reindexed labels.

For the results.h5 files, we have removed some fields and renamed others. Previously the format was

    results.h5
    ├──session_name1
    │  ├──estimated_coordinates  # denoised coordinates
    │  ├──syllables_reindexed    # syllables reindexed by frequency
    │  ├──syllables              # non-reindexed syllable labels (z)
    │  ├──latent_state           # inferred low-dim pose state (x)
    │  ├──centroid               # inferred centroid (v)
    │  └──heading                # inferred heading (h)
    ⋮

Now the format is

    results.h5
    ├──recording_name1
    │  ├──syllable      # syllable labels (z)
    │  ├──latent_state  # inferred low-dim pose state (x)
    │  ├──centroid      # inferred centroid (v)
    │  └──heading       # inferred heading (h)
    ⋮
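For reference, the new layout can be read with h5py as sketched below (the file name and recording name are placeholders; the group/dataset names match the tree above):

```python
import h5py
import numpy as np

# Write a tiny file in the new layout (for illustration only)
with h5py.File("results_demo.h5", "w") as f:
    rec = f.create_group("recording_name1")
    rec.create_dataset("syllable", data=np.array([0, 0, 1, 2]))
    rec.create_dataset("heading", data=np.zeros(4))

# Read it back: one group per recording, with the fields shown above
with h5py.File("results_demo.h5", "r") as f:
    for recording in f:
        syllables = f[recording]["syllable"][:]
        print(recording, syllables)
```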

The checkpoint.p files have changed more substantively. They are now saved as hdf5 files (rather than joblib) and their internal organization has changed.

Converting to the new format

The following code converts results and checkpoint files to the new format. Given a project directory and model name, a new project directory is generated with the updated files. As part of the reformatting, syllables are reindexed inside the model (see previous section) and a list of the resulting syllable name-changes is printed.

Make sure you are using the most up-to-date version of keypoint_moseq before running.

import keypoint_moseq as kpms
import numpy as np
import os, shutil
import joblib

def update_checkpoint_format(checkpoint):
    model = {k:checkpoint[k] for k in ['seed','noise_prior','params','states','hypparams']}
    model_snapshots = {str(checkpoint['iteration']): model}

    for i,hist in checkpoint['history'].items():
        model_snapshots[str(i)] = {
            'noise_prior': checkpoint['noise_prior'],
            'hypparams': checkpoint['hypparams'],
            'states': hist['states'],
            'params': hist['params'],
            'seed': hist['seed']
        }

    data = {'Y': checkpoint['Y'], 'conf':checkpoint['conf'], 'mask':checkpoint['mask']}
    keys = [l[0] for l in checkpoint['labels']]
    bounds = np.array([l[1:] for l in checkpoint['labels']])
    new_checkpoint = {'data':data, 'metadata':(keys, bounds), 'model_snapshots':model_snapshots}
    return new_checkpoint


def update_results_format(results, index=None):
    for k,v in results.items():
        if 'estimated_coordinates' in v:
            v['est_coords'] = v['estimated_coordinates']
            del v['estimated_coordinates']
       
        if 'syllables' in v:
            if index is None:
                v['syllable'] = v['syllables']
            else:
                v['syllable'] = np.argsort(index)[v['syllables']]
            del v['syllables']
           
        if 'syllables_reindexed' in v:
            del v['syllables_reindexed']  
    return results
  • Setup new project and model directories
old_project_dir = 'path/to/old/project_dir'
new_project_dir = 'path/to/new/project_dir'
model_name = 'name_of_model'

os.makedirs(new_project_dir)
os.makedirs(os.path.join(new_project_dir, model_name))

for filename in ['pcs-xy.pdf', 'pca_scree.pdf', 'config.yml', 'pca.p']:
    src_path = os.path.join(old_project_dir, filename)
    if os.path.exists(src_path):
        shutil.copy(src_path, new_project_dir)
  • Convert saved checkpoint to new format
old_checkpoint_path = os.path.join(old_project_dir, model_name, 'checkpoint.p')
new_checkpoint_path = os.path.join(new_project_dir, model_name, 'checkpoint.h5')

old_checkpoint = joblib.load(old_checkpoint_path)
new_checkpoint = update_checkpoint_format(old_checkpoint)
kpms.save_hdf5(new_checkpoint_path, new_checkpoint)
  • Reindex syllables in the model checkpoint
index = kpms.reindex_syllables_in_checkpoint(new_project_dir, model_name)
for i,j in enumerate(index):
    print(f'Syllable {j} is now labeled {i}')
  • Convert saved results to new format
old_results_path = os.path.join(old_project_dir, model_name, 'results.h5')
new_results_path = os.path.join(new_project_dir, model_name, 'results.h5')

old_results = kpms.load_hdf5(old_results_path)
new_results = update_results_format(old_results, index)
kpms.save_hdf5(new_results_path, new_results)
  • Regenerate visualizations
config = lambda: kpms.load_config(new_project_dir)
keypoint_data_path = 'path/to/data' # modify as needed
coordinates, confidences, bodyparts = kpms.load_keypoints(keypoint_data_path, 'deeplabcut')
results = kpms.load_results(new_project_dir, model_name)
kpms.save_results_as_csv(results, new_project_dir, model_name)
kpms.generate_trajectory_plots(coordinates, results, new_project_dir, model_name, **config())
kpms.generate_grid_movies(results, new_project_dir, model_name, coordinates=coordinates, **config())

New analysis tools

This PR introduces a new set of analysis widgets and a tutorial notebook (analysis.ipynb) for using them. These widgets ingest results in the updated format described above. So make sure to run the conversion code before applying the analysis pipeline to an existing project!

Support for large datasets

Currently it is not possible to model large datasets on a GPU without incurring out-of-memory (OOM) errors. To address this problem, we have created a framework for mixed serial/parallel computation and added multi-GPU support.

Partial serialization

By default, modeling is parallelized across the full dataset. Here we introduce a new option for mixed parallel/serial computation where the data is split into batches that are processed one at a time. To enable this option, run the following code before fitting the model (if you have already initiated model fitting, the kernel must be restarted):

from jax_moseq.utils import set_mixed_map_iters
set_mixed_map_iters(4) # adjust as needed

This will split the data into 4 batches, which should reduce the memory requirements about 4-fold but also result in a 4-fold slow-down. The number of batches can be adjusted as needed.
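As a rough analogy (plain numpy, not the jax_moseq implementation), batching trades peak memory for serial passes while leaving the result unchanged:

```python
import numpy as np

def step(batch):
    # Stand-in for one modeling computation over a batch of recordings
    return batch * 2

data = np.arange(12).reshape(6, 2)  # 6 toy "recordings"

# Fully parallel: one allocation covering all recordings at once
out_parallel = step(data)

# 4 batches processed one at a time: ~4x smaller working set, ~4x slower
out_batched = np.concatenate([step(b) for b in np.array_split(data, 4)])

assert np.array_equal(out_parallel, out_batched)  # results are identical
```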

Multi-GPU support

To use multiple GPUs, run the following code before fitting the model (if you have already initiated model fitting, the kernel must be restarted):

from jax_moseq.utils import set_mixed_map_gpus
set_mixed_map_gpus(2)

This will split the computation across two GPUs.

Additional info on implementation

Both of the above options (multi-GPU support and partial serialization) rely on a new utility called mixed_map that we added to the jax_moseq package. Below is a copy of its docstring:

def mixed_map(fun, in_axes=None, out_axes=None):
    """
    Combine jax.pmap, jax.vmap and jax.lax.map for parallelization.

    This function is similar to `jax.vmap`, except that it mixes together
    `jax.pmap`, `jax.vmap` and `jax.lax.map` to prevent OOM errors and allow
    for parallelization across multiple GPUs. The behavior is determined by
    the global variables `_MIXED_MAP_ITERS` and `_MIXED_MAP_GPUS`, which can be
    set using :py:func:`jax_moseq.utils.set_mixed_map_iters` and
    :py:func:`jax_moseq.utils.set_mixed_map_gpus`, respectively.

    Given an axis size of N to map, the data is padded such that the axis size
    is a multiple of `_MIXED_MAP_ITERS * _MIXED_MAP_GPUS`. The
    data is then processed serially in chunks, where the number of chunks is
    determined by `_MIXED_MAP_ITERS`. Each chunk is processed in parallel
    using jax.pmap to distribute across `_MIXED_MAP_GPUS` devices and jax.vmap
    to parallelize within each device.
    """

3D plotting tools

  • In addition to 2D projections of 3D keypoints, plot_pcs and generate_trajectory_plots now produce interactive 3D visualizations. These are rendered in the notebook and can also be viewed offline in a browser using the saved .html files.

  • It is now possible to generate grid movies for 3D keypoints, although they will only show 2D projections of the keypoints and not the underlying video. To generate grid movies from 3D data, include the flag keypoints_only=True and set the desired projection plane with the use_dims argument, e.g.

# generate grid movies in the x/y plane
kpms.generate_grid_movies(
   results, 
   project_dir, 
   name, 
   coordinates=coordinates, 
   keypoints_only=True, 
   use_dims=[0,1], 
   **config())

@bainro

bainro commented Feb 3, 2024

Is this the same issue @calebweinreb ? I'm running your lab's 3D dataset, but didn't realize I'd need multiple GPUs :(


@calebweinreb
Contributor Author

You shouldn't need multiple GPUs. Just use "mixed_map_iters" as described here: https://keypoint-moseq.readthedocs.io/en/latest/FAQs.html#troubleshooting

@bainro

bainro commented Feb 3, 2024 via email
