MoSeq2 Analysis Visualization Notebook Instructions

Overview

The "MoSeq2 Analysis Visualization Notebook" contains interactive tools for analyzing behavior with MoSeq, such as:

labeling syllables interactively
computing syllable statistics
visualizing how frequently syllables transition to one another

You must ALWAYS run the Load Progress section before running interactive tools in the notebook.

Project Setup

If you installed MoSeq2 via Conda, activate the MoSeq environment and start a Jupyter notebook in your project folder. If you are using the Docker container, make sure your MoSeq container is running and connected to your project folder. In both cases, make sure that the analysis notebook is copied into your project folder.

To run this notebook, you need the following files in your data directory:

progress.yaml (the progress.yaml file that contains all the required MoSeq paths)
model.p (trained AR-HMM to compute statistics from)
moseq2-index.yaml (the generated index file containing paths to the extracted sessions that will be used to generate syllable crowd movies)
config.yaml (configuration file that contains configured parameters throughout the MoSeq pipeline)
_pca/ (PCA-related data generated from the PCA section)
aggregate_results/ (aggregated session data)

At this stage, the base directory should contain the necessary files above, as shown below:

.                   ** current working directory
└── <base_dir>/
    ├── progress.yaml
    ├── config.yaml
    ├── moseq2-index.yaml
    ├── model_session_path/
    │   └── model.p
    ...
    ├── _pca/
    └── aggregate_results/
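
Before opening the notebook, it can help to confirm that the base directory matches the layout above. The following is a minimal sketch using only the Python standard library; the helper name find_missing_entries is hypothetical and not part of MoSeq2, and it assumes the model file is named model.p and sits somewhere below the base directory.

```python
from pathlib import Path

# Entries listed above as required at the top level of the base directory.
REQUIRED_ENTRIES = ["progress.yaml", "config.yaml", "moseq2-index.yaml",
                    "_pca", "aggregate_results"]


def find_missing_entries(base_dir, model_name="model.p"):
    """Return the required files/folders that are absent from base_dir.

    Hypothetical helper for illustration; not a MoSeq2 function.
    """
    base = Path(base_dir)
    missing = [name for name in REQUIRED_ENTRIES if not (base / name).exists()]
    # The model lives in a model-specific subfolder, so search recursively.
    if not any(base.rglob(model_name)):
        missing.append(model_name)
    return missing
```

Calling find_missing_entries(".") from the project folder returns an empty list when everything is in place, and otherwise the names that still need to be created or copied.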

Note: this notebook uses progress.yaml to keep track of all the necessary paths. Please ensure you run the Load Progress cell before running any analysis modules. If the PCA and modeling steps were done using the command line interface, set init = True and overwrite = True in progress_paths = restore_progress_vars(progress_file=progress_filepath, init = False, overwrite = False).

Finding Best Model Fit

Get best model fit is used to determine whether the trained model has captured median syllable durations that match the principal components' changepoints. If there is more than one trained model in progress_paths['base_model_path'], the feature returns the model that best matches the principal components' changepoints.

The command supports two comparison objectives: duration and jsd. duration finds the model whose median syllable duration best matches that of the principal components' changepoints. jsd finds the model whose distribution of syllable durations best matches that of the principal components' changepoints.

If there are multiple models in the input folder, the output figure plots dashed distribution curves for the unselected models and two solid distribution curves for the "Best" (chosen) model and the principal components' changepoint durations.

[Figure: best model fit]
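
To make the two objectives concrete, here is a minimal NumPy sketch of how such a comparison could work. It is not the MoSeq2 implementation: the function names, the dictionary input, and the shared 50 ms binning are all assumptions made for illustration, and durations are taken to be in seconds.

```python
import numpy as np


def jensen_shannon_divergence(p, q):
    """JSD (in bits) between two histograms; each is normalized to sum to 1."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pick_best_model(model_durations, changepoint_durations,
                    objective="duration", bins=None):
    """Return the key of the model whose durations best match the changepoints.

    model_durations: dict mapping a model name to its syllable durations.
    Hypothetical helper; the binning below is an arbitrary choice.
    """
    if bins is None:
        bins = np.linspace(0, 2, 41)  # 40 bins of 50 ms between 0 and 2 s
    cp = np.asarray(changepoint_durations, dtype=float)
    scores = {}
    for name, durations in model_durations.items():
        durations = np.asarray(durations, dtype=float)
        if objective == "duration":
            # Smaller gap between medians is better.
            scores[name] = abs(np.median(durations) - np.median(cp))
        elif objective == "jsd":
            # Smaller divergence between duration distributions is better.
            p, _ = np.histogram(durations, bins=bins)
            q, _ = np.histogram(cp, bins=bins)
            scores[name] = jensen_shannon_divergence(p, q)
        else:
            raise ValueError(f"unknown objective: {objective}")
    return min(scores, key=scores.get)
```

Both objectives reduce each model to a single score and keep the lowest one. The duration objective only compares medians, while jsd is sensitive to the whole shape of the duration distribution.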

Compute Syllable Statistics

This section produces two dataframes: moseq_df and mean_df. The two dataframes are used to generate behavioral summaries, which we call fingerprints, and are used generally for analysis.

`moseq_df`

moseq_df is a vertically stacked dataframe of scalar values measured during the extraction step, aligned with the model_labels and timestamps. The shape would be (sum_of_session_frames, 31). To view all the measured scalars, type print(moseq_df.columns). This dataframe can be used to plot the scalar feature values for any session over time. The cell output in the notebook shows a preview of the top 5 rows in the dataframe, with moseq_df.head().

Note: the rows in the labels columns contain -5 for the first 3 frames of each session's recordings. This is because we use the first 3 frames to initialize the AR-HMM, and thus cannot supply a syllable label to them. We generally remove these frames from the analysis.
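
Removing these frames is a one-line filter in pandas. The sketch below uses a small toy dataframe in place of the real moseq_df; the helper name drop_unlabeled_frames is hypothetical, and the column name is taken from the column table in this section.

```python
import pandas as pd

# Toy stand-in for moseq_df: two sessions of six frames each, where the first
# three frames of every session carry the placeholder label -5.
toy_df = pd.DataFrame({
    "uuid": ["a"] * 6 + ["b"] * 6,
    "labels (usage sort)": [-5, -5, -5, 0, 0, 1, -5, -5, -5, 2, 2, 0],
    "velocity_2d_mm": [0.0, 0.1, 0.2, 1.5, 1.7, 0.3,
                       0.0, 0.2, 0.1, 2.0, 2.1, 1.4],
})


def drop_unlabeled_frames(df, label_col="labels (usage sort)"):
    """Keep only frames that received a syllable label (label >= 0)."""
    return df[df[label_col] >= 0].reset_index(drop=True)


labeled = drop_unlabeled_frames(toy_df)
print(labeled.groupby("uuid").size())  # frames remaining per session
```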

This dataframe contains the following columns:

column name	description	unit
angle	the orientation of the mouse body	radians
area_mm	the area of the mouse	mm^2
area_px	the area of the mouse	pixels
centroid_x_mm	center of the mouse (x coordinate)	mm
centroid_x_px	center of the mouse (x coordinate)	pixels
centroid_y_mm	center of the mouse (y coordinate)	mm
centroid_y_px	center of the mouse (y coordinate)	pixels
height_ave_mm	average height across the entire visible mouse	mm
length_mm	mouse length measured roughly across the spine	mm
length_px	mouse length measured roughly across the spine	pixels
velocity_2d_mm	mouse 2D velocity (x,y velocity)	mm/frame
velocity_2d_px	mouse 2D velocity (x,y velocity)	pixels/frame
velocity_3d_mm	mouse 3D velocity (x,y,z velocity)	mm/frame
velocity_3d_px	mouse 3D velocity (x,y,z velocity)	pixels/frame
velocity_theta	direction/angle of the velocity vector	radians
width_mm	mouse width	mm
width_px	mouse width	pixels
dist_to_center_px	distance between mouse center and arena center	pixels
group	the assigned experimental group	NA
uuid	session uuid assigned during extraction	NA
h5_path	extraction h5 file path	NA
timestamps	frame timestamp	seconds
frame index	index of the frame in the recording	NA
SessionName	name of the session pulled from the metadata.json file	NA
SubjectName	name of the subject/mouse pulled from the metadata.json file	NA
StartTime	time of day the session recording started	NA
labels (original)	original syllable labels used by the AR-HMM	NA
labels (usage sort)	syllable label sorted by usage (low numbers = most used; high numbers = rarely used)	NA
labels (frames sort)	syllable label sorted by frames (low numbers = most used; high numbers = rarely used)	NA
onset	indicates the onset of a syllable (1/True = start of syllable)	NA
syllable index	syllable index	NA

Sorting syllable labels

Syllables are arbitrarily labeled 0-99 (assuming the max-states parameter is set to 100). During the training process, the AR-HMM settles upon a random subset of these 100 labels to describe the data it sees. After training and applying an AR-HMM model to mouse data, we generally re-label syllables to assign meaning to those labels.

We apply two re-labeling schemes to the original syllable labels:

Usage sort: we re-label syllables by how many times a mouse instantiates them, regardless of the syllable's duration. For example, if we have the set of original labels applied to 20 frames: [2, 2, 2, 2, 10, 10, 2, 2, 2, 5, 5, 5, 5, 5, 5, 10, 10, 2, 2, 2] then the usage sort will say syllable 2 was instantiated 3 times, syllable 5 once, and syllable 10 twice. The new mapping would look like the following: 2 -> 0, 10 -> 1, 5 -> 2, and the new sequence would look like: [0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 1, 1, 0, 0, 0].
Frames sort: we re-label syllables by how many frames are assigned each label. Generally, the two sortings result in similar mappings. If we use the same set of original labels as in the usage sort example, syllable 2 is assigned to 10 frames, syllable 5 to 6 frames, and syllable 10 to 4 frames. The new mapping: 2 -> 0, 5 -> 1, 10 -> 2. The new sequence: [0, 0, 0, 0, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0].
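
The two schemes differ only in what gets counted: uninterrupted runs of a label (usage) or individual frames (frames). The following standard-library sketch reproduces both examples above. The function name relabel is hypothetical, and the tie-breaking rule (first-seen label wins) is an assumption rather than documented MoSeq2 behavior.

```python
from collections import Counter
from itertools import groupby


def relabel(labels, count="usage"):
    """Relabel syllables so that 0 is the most-used one under the chosen scheme."""
    if count == "usage":
        # Count each uninterrupted run of a label once, whatever its length.
        counts = Counter(label for label, _ in groupby(labels))
    elif count == "frames":
        # Count every frame assigned to a label.
        counts = Counter(labels)
    else:
        raise ValueError("count must be 'usage' or 'frames'")
    # most_common() orders by descending count; ties keep first-seen order.
    mapping = {old: new for new, (old, _) in enumerate(counts.most_common())}
    return mapping, [mapping[label] for label in labels]


original = [2, 2, 2, 2, 10, 10, 2, 2, 2, 5, 5, 5, 5, 5, 5, 10, 10, 2, 2, 2]
usage_mapping, usage_sequence = relabel(original, count="usage")
frames_mapping, frames_sequence = relabel(original, count="frames")
```

Here usage_mapping is {2: 0, 10: 1, 5: 2} and frames_mapping is {2: 0, 5: 1, 10: 2}, matching the mappings worked out by hand.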

`mean_df`

mean_df (named stats_df from v1.2.0 onward; see Version update notes below) is a dataframe of the average syllable/scalar values for all the features included in moseq_df, grouped by the resorted syllable labels, model groups, and UUIDs. This dataframe will be used to plot mean syllable statistics and perform hypothesis testing. The cell output in the notebook shows a preview of the top 5 rows in the dataframe, with stats_df.head().

Note: in the compute_behavioral_statistics function, the count parameter can be set to either 'usage' or 'frames', which determines how syllables are re-labeled. See above for details about each sorting.

This dataframe contains the following columns:

column name	description	unit
angle	the orientation of the mouse body	radians
area_mm	the area of the mouse	mm^2
area_px	the area of the mouse	pixels
centroid_x_mm	center of the mouse (x coordinate)	mm
centroid_x_px	center of the mouse (x coordinate)	pixels
centroid_y_mm	center of the mouse (y coordinate)	mm
centroid_y_px	center of the mouse (y coordinate)	pixels
height_ave_mm	average height across the entire visible mouse	mm
length_mm	mouse length measured roughly across the spine	mm
length_px	mouse length measured roughly across the spine	pixels
velocity_2d_mm	mouse 2D velocity (x,y velocity)	mm/frame
velocity_2d_px	mouse 2D velocity (x,y velocity)	pixels/frame
velocity_3d_mm	mouse 3D velocity (x,y,z velocity)	mm/frame
velocity_3d_px	mouse 3D velocity (x,y,z velocity)	pixels/frame
velocity_theta	direction/angle of the velocity vector	radians
width_mm	mouse width	mm
width_px	mouse width	pixels
dist_to_center_px	distance between mouse center and arena center	pixels
timestamps	average frame timestamp in the syllable	seconds
frame index	average index of the frames in the syllable	NA
usage	the probability a syllable is used	NA
duration	average syllable duration	seconds
syllable key	sorting of the syllables (see above)	NA
syllable	syllable label	NA
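
The grouping behind this dataframe can be sketched with a pandas groupby. This is an illustration on toy data, not the compute_behavioral_statistics implementation. It assumes that usage is the fraction of syllable onsets within a session, and it reports duration in frames (the duration_frames column is a hypothetical name), whereas the table above lists seconds.

```python
import pandas as pd

# Toy stand-in for moseq_df with one scalar column; real data has many more.
toy_df = pd.DataFrame({
    "group": ["ctrl"] * 6,
    "uuid": ["a"] * 6,
    "syllable": [0, 0, 0, 1, 1, 0],
    "onset": [True, False, False, True, False, True],
    "velocity_2d_mm": [1.0, 2.0, 3.0, 4.0, 6.0, 2.0],
})

keys = ["group", "uuid", "syllable"]
grouped = toy_df.groupby(keys)

# Mean scalar value per (group, session, syllable).
summary = grouped[["velocity_2d_mm"]].mean()
# Number of times each syllable starts (sum of onset flags) ...
n_onsets = grouped["onset"].sum()
# ... converted to a per-session probability of use (assumed definition).
session_totals = n_onsets.groupby(level=["group", "uuid"]).transform("sum")
summary["usage"] = n_onsets / session_totals
# Average duration in frames: frames assigned / number of instances.
summary["duration_frames"] = grouped.size() / n_onsets
summary = summary.reset_index()
print(summary)
```

Dividing duration_frames by the camera frame rate would convert it to seconds.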

Interactive Syllable Labelling

Interactive Syllable Labelling is for assigning behavioral labels and short descriptions to syllables by observing the crowd movies and the Syllable Info table. This widget will automatically generate crowd movies and store them in a folder called crowd_movies in the model-specific subfolder, specified in model_session_path in the progress.yaml file. Note that syllables are relabeled by usage from here on out. A syll_info.yaml will be generated in the model-specific subfolder to record the syllable names and short descriptions. Use the contents of the syll_info.yaml file or the crowd movie file names to find the mapping from the original syllable label to one relabeled by usage.

Note: each time new crowd movies are created, the syll_info.yaml file gets re-written. If you don't want this to happen, you must manually rename the syll_info.yaml file before re-running this widget.
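
One way to protect existing labels is to keep a timestamped copy of the file before re-running the widget. The sketch below copies rather than renames, so the original stays in place. The helper name backup_syll_info is hypothetical, and model_dir stands for the model-specific subfolder mentioned above.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_syll_info(model_dir, filename="syll_info.yaml"):
    """Copy syll_info.yaml to a timestamped file; return the backup path or None.

    Hypothetical helper for illustration; not a MoSeq2 function.
    """
    source = Path(model_dir) / filename
    if not source.exists():
        return None  # nothing to back up yet
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.stem}_{stamp}{source.suffix}")
    shutil.copy2(source, backup)  # copy2 also preserves file timestamps
    return backup
```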

Version update notes

In v1.2.0, min, max, and standard deviation columns were added to the stats_df dataframe (previously mean_df). If you have dataframe parquet files such as syll_df.parquet or moseq_scalar_dataframe.parquet, you may see KeyError: 'velocity_2d_mm_mean' when you run the Crowd Movie Generation and Interactive Syllable Labelling Tool. Please delete the parquet files and rerun the cell.
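
If you are unsure where the cached parquet files are, a short script can locate them first. This is a cautious sketch: the helper name remove_stale_parquet is hypothetical, the two file names are the ones mentioned above, and nothing is deleted unless dry_run is set to False.

```python
from pathlib import Path

# File names taken from the note above.
STALE_CACHES = ("syll_df.parquet", "moseq_scalar_dataframe.parquet")


def remove_stale_parquet(search_dir, dry_run=True):
    """List cached parquet files under search_dir; delete them if dry_run is False."""
    found = sorted(
        path for name in STALE_CACHES for path in Path(search_dir).rglob(name)
    )
    if not dry_run:
        for path in found:
            path.unlink()
    return found
```

Running it once with the default dry_run=True shows what would be removed before anything is changed.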

Interactive Syllable Labeller

Instructions

Run the cell to launch the interactive Syllable Labelling Tool.
Select a syllable from the Syllable dropdown menu to view the associated crowd movie and syllable info.
Use the Playback Speed slider to adjust the crowd movie playback speed to better observe the behavior associated with short/fast syllables.
Enter the syllable label in the Syllable Name field and desired description in Short Description.
Click Save Setting to save the syllable label and description for later analysis.
Use Next and Previous to navigate between syllables. The syllable label and description are saved automatically when you use these buttons.

Interactive Syllable Statistics Graphing

Interactive Syllable Statistics Graphing is for plotting different syllable statistics and their differences across the modeled groups. The dendrogram displayed below the statistics plot represents the hierarchically sorted pairwise distances between the autoregressive matrices that represent the syllables in the given model.

[Figure: Interactive Stats Graphing]

Instructions

Run the cell to launch the Interactive Syllable Statistics Tool.
Select the parameters from the dropdown menus to control the graph.
- If you select Difference from the Sorting dropdown menu, the syllables will be sorted by the value difference between two groups and additional menus will appear for statistical testing to test whether the differences between groups are significant.
- If you select group from Grouping, the mean of all the sessions within each group will be plotted in the graph.
- If you select SessionName or SubjectName, you can select multiple sessions/subjects in the Sessions menu by holding down the [Ctrl]/[Command]/[Shift] key. You can also click on the legend items to selectively hide the corresponding data points.
- If you have labeled the syllables, you can specify the syllables you want to plot in the Syllable to Display field, such as "run", "walk", etc. The text input is not case-sensitive.
Select a threshold criterion from the "Threshold By" dropdown menu. Use the Thresholding Slider to include syllables with statistics within a specific value range.
Hover over the circle data points to display a pop-up window with additional syllable metadata.

Syllable Transition Analysis

The notebook contains two sections that display information on syllable transition statistics. The first plots the transition matrix itself:

[Figure: syllable transition matrix]

The second plots a representation of the transition matrix as a directed graph:

[Figure: transition matrix graph]

Interpretation

Transition matrices (TMs) compactly represent how frequently any syllable transitions into any other syllable. They are one way to describe the average structure in behavior. A transition from one syllable to another can also be referred to as a bigram. The row of the TM represents an incoming syllable, while the column represents the outgoing syllable. The value at a specific row and column position represents the frequency with which the incoming syllable transitions into the outgoing syllable, that is, the frequency of the bigram.

TMs can be normalized in three ways:

bigram normalization: describes the absolute probability a bigram occurs within the dataset
row normalization: describes the probability that one syllable transitions into another (also known as outgoing probability)
column normalization: describes the probability that any syllable transitions into a specific syllable (also known as incoming probability)
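
The three normalizations differ only in what each entry is divided by. The NumPy sketch below builds a count matrix from a frame-by-frame label sequence and normalizes it each way. The function names are hypothetical, and counts[i, j] is taken to be the number of times syllable i is immediately followed by syllable j, following the row/column description above.

```python
import numpy as np


def transition_counts(labels, n_syllables):
    """counts[i, j] = times syllable i is immediately followed by syllable j."""
    labels = np.asarray(labels)
    # Keep only syllable onsets so a syllable lasting many frames is one event.
    onsets = np.insert(np.diff(labels) != 0, 0, True)
    sequence = labels[onsets]
    counts = np.zeros((n_syllables, n_syllables))
    # add.at accumulates correctly even when the same bigram repeats.
    np.add.at(counts, (sequence[:-1], sequence[1:]), 1)
    return counts


def normalize_tm(counts, mode="bigram"):
    """Normalize so the whole matrix, each row, or each column sums to 1."""
    counts = np.asarray(counts, dtype=float)
    if mode == "bigram":
        total = counts.sum()
    elif mode == "rows":
        total = counts.sum(axis=1, keepdims=True)
    elif mode == "columns":
        total = counts.sum(axis=0, keepdims=True)
    else:
        raise ValueError("mode must be 'bigram', 'rows', or 'columns'")
    # Leave all-zero rows/columns at zero instead of dividing by zero.
    return np.divide(counts, total, out=np.zeros_like(counts), where=total != 0)


counts = transition_counts([0, 0, 1, 1, 0, 2, 2, 0], n_syllables=3)
bigram_tm = normalize_tm(counts, "bigram")
row_tm = normalize_tm(counts, "rows")
column_tm = normalize_tm(counts, "columns")
```

In this toy sequence syllable 0 is followed once by syllable 1 and once by syllable 2, so row 0 of row_tm gives each an outgoing probability of 0.5.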

Transition analyses can help visualize gross changes in the structure of behavior between two experimental groups, especially when visualizing TMs in directed graph format. For example, certain syllables that frequently transition into one set of syllables in one experimental condition might transition into a completely different set in another experimental condition.

Using the Syllable transition graph tool

The Interactive Syllable Transition Graph Tool is for exploring the behavioral transition space of your modeled groups. Find sequences of behavior, e.g. bigrams, at different usage/transition probability ranges, and gain a better understanding of the differences across your modeling groups.

[Figure: Interactive Transition]

Instructions

Run the cell to launch the Interactive Syllable Transition Graphing Tool.
Use the Graph Layout dropdown menu to specify the graph layout.
Use the Threshold Edge Weights slider to select a range for syllable transition probabilities to display in the graphs.
Use the Threshold Nodes by Usage slider to select a range for syllable usages to display in the graphs.
Hover over the nodes to display syllable information and the associated crowd movies.

Nodes outside these thresholds will be hidden.
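
The effect of the two sliders can be pictured as a filter over nodes and edges. The sketch below is an assumption about the behavior, not the tool's code: the function name threshold_graph is hypothetical, both ranges are treated as inclusive, and edges touching a hidden node are dropped along with it.

```python
import numpy as np


def threshold_graph(tm, usages, edge_range=(0.0, 1.0), usage_range=(0.0, 1.0)):
    """Return (nodes, edges) that survive the usage and edge-weight ranges.

    tm[i, j] is the transition probability from syllable i to syllable j;
    usages[i] is the usage of syllable i. Edges are (source, target, weight).
    """
    tm = np.asarray(tm, dtype=float)
    usages = np.asarray(usages, dtype=float)
    keep = (usages >= usage_range[0]) & (usages <= usage_range[1])
    nodes = [int(i) for i in np.flatnonzero(keep)]
    edges = [
        (i, j, float(tm[i, j]))
        for i in nodes
        for j in nodes
        # Skip transitions that never occur, then apply the weight range.
        if tm[i, j] > 0 and edge_range[0] <= tm[i, j] <= edge_range[1]
    ]
    return nodes, edges
```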

Fingerprint plots

Fingerprint plots summarize behavior by showing distributions of MoSeq scalar values and MoSeq syllables. The plots are generated using moseq_df and mean_df above.

Interpretation

These plots are useful for getting a gestalt of behavior across sessions, mice, and experimental groups, and can reveal general differences across experimental groups. The rows of each plot contain summary statistics for each session. The left-most plot indicates which experimental group each session is a part of. The four middle columns plot the distributions of scalar information, where larger numbers (brighter colors) indicate greater probability mass. The final right-most column plots syllable usage across all syllables (relabeled by usage or frames).

Instructions

The n_bins variable sets the number of bins used to bin the scalar values; the MoSeq column shows the values by syllable.

If there is no sklearn.preprocessing object passed into the function, the unscaled raw percentage usages within each bin and each MoSeq syllable will be plotted, as shown below.

[Figure: Fingerprint no preprocessor]
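
As a rough picture of what one scalar column of a fingerprint contains, the sketch below bins a scalar per session and converts the counts to percentages. It is an illustration, not the notebook's plotting function: the name scalar_fingerprint and its arguments are hypothetical, bins are assumed to be shared across sessions, and the optional preprocessor is any object exposing fit_transform, as the sklearn.preprocessing scalers do.

```python
import numpy as np


def scalar_fingerprint(values_by_session, n_bins=10, value_range=None,
                       preprocessor=None):
    """Return a (sessions x n_bins) array of percentages of frames per bin.

    values_by_session: dict mapping a session name to its scalar values.
    """
    arrays = [np.asarray(v, dtype=float) for v in values_by_session.values()]
    if value_range is None:
        # Share one range across sessions so that bins are comparable.
        pooled = np.concatenate(arrays)
        value_range = (pooled.min(), pooled.max())
    rows = []
    for values in arrays:
        hist, _ = np.histogram(values, bins=n_bins, range=value_range)
        rows.append(100.0 * hist / hist.sum())  # raw percentage per bin
    fingerprint = np.vstack(rows)
    if preprocessor is not None:
        # Scalers such as sklearn's MinMaxScaler rescale each bin (column)
        # across sessions.
        fingerprint = preprocessor.fit_transform(fingerprint)
    return fingerprint
```

Without a preprocessor every row sums to 100, which corresponds to the unscaled case described above. Passing, for example, sklearn.preprocessing.MinMaxScaler() rescales each bin across sessions instead.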