- Extended qualitative results, including audio, video, and cropping, are available at:
Implicit Neural Representations (INRs) have piqued interest in recent years due to their ability to encode natural signals using neural networks. While INRs allow for useful applications such as interpolating new coordinates and signal compression, their black-box nature makes it difficult to modify them post-training. In this paper we explore the idea of editable INRs, and specifically focus on the widely used cropping operation. To this end, we present Local-Global SIRENs -- a novel INR architecture that supports cropping by design. Local-Global SIRENs are based on combining local and global feature extraction for signal encoding. What makes their design unique is the ability to effortlessly remove specific portions of an encoded signal, with a proportional weight decrease. This is achieved by eliminating the corresponding weights from the network, without the need for retraining. We further show how this architecture can be used to support the straightforward extension of previously encoded signals. Beyond signal editing, we examine how the Local-Global approach can accelerate training, enhance encoding on various signals, improve downstream performance, and be applied to modern INRs such as INCODE, highlighting its potential and flexibility.
- Technological Overview and Sources
- Installation
- Encoding Images, Audio and Video
- Encoding Images with Automatic Partitioning
- Debug Outputs
- Decoding and Cropping Videos
- 3D Shape Encoding
- Local-Global INCODE and Downstream tasks
- Most of our codebase is based on SIREN:
  - SIREN: Implicit Neural Representations with Periodic Activation Functions
  - Released under MIT License
  - Audio and video data samples under the `data` directory are taken from the official SIREN repository.
- INCODE experiments are based on the official implementation:
  - INCODE: Implicit Neural Conditioning with Prior Knowledge Embeddings
  - Released under MIT License
- The DIV2K assets used in the experiments in the paper are under `data/DIV2K`:
  - `data/DIV2K/DIV2K_subset.txt` is the list of 25 randomly selected images, used as a subset for the image encoding experiments.
    - Due to the supplementary material size limitation, we are not able to attach the images themselves.
    - However, the full DIV2K dataset is available at DIV2K, and the subset can be created by choosing the images provided in the list (see the sketch after this list).
  - `data/DIV2K/denoising` contains the image used in the denoising experiment.
  - `data/DIV2K/superresolution` contains the image used in the super-resolution experiment.
  - The license agreement for DIV2K is mentioned on their website (academic use only).
- The image used for CT reconstruction is `data/img_377_ct_reconstruction.png`.
  - The image is taken from the Kaggle Lung Nodule Analysis dataset.
- The Lucy shape used for 3D encoding is under `data/preprocessed_lucy.npy`.
  - The shape is taken from the Stanford 3D Scanning Repository.
- For configuration management, we use `pyrallis`.
- WandB (Weights and Biases) is integrated into the code for experiment tracking and visualization. You can deactivate WandB by setting the `--use_wandb False` option.
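A short script along these lines can assemble the DIV2K subset (a sketch, not part of the released code; it assumes the full DIV2K training images were downloaded to a local directory of your choice and that `DIV2K_subset.txt` lists one image filename per line):

# copy_div2k_subset.py -- a sketch for assembling the 25-image subset.
# Assumes DIV2K_ROOT (hypothetical path) holds the full DIV2K images and that
# data/DIV2K/DIV2K_subset.txt lists one image filename per line.
import shutil
from pathlib import Path

DIV2K_ROOT = Path("/path/to/DIV2K_train_HR")
SUBSET_LIST = Path("data/DIV2K/DIV2K_subset.txt")
TARGET_DIR = Path("data/DIV2K")

for name in SUBSET_LIST.read_text().split():
    shutil.copy(DIV2K_ROOT / name, TARGET_DIR / name)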
Run the following commands to set up the environment:
conda env create -f environment.yml
conda activate lgsirens
Refer to the appendix of the paper for the hyperparameters used for training.
- The type of network is passed using `--mode`. The options are `lg` for Local-Global SIREN, `lc` for SIREN-per-Partition, and `mlp` for SIREN.
- Global hidden features are passed using `--global_hidden_features`.
- Local hidden features are passed using `--hidden_features`.
  - Note that the script expects the number of local hidden features per partition multiplied by the number of partitions. For example, if you want to use 14 local hidden features per partition for a total of 16*16=256 partitions, you should pass `--hidden_features 3584` (14*256=3584). See the sketch after this list.
- The number of partitions is passed using `--downsample`. For example:
  - For an image encoding task, if you want to use 16*16=256 partitions, you should pass `--downsample [16,16]`.
  - For an audio encoding task, if you want to use 32 partitions, you should pass `--downsample [1,32,1]`.
  - For a video encoding task, if you want to use 5*16*16=1280 partitions, you should pass `--downsample [5,16,16]`.
- The number of overlapping coordinates in each dimension is passed using `--overlaps`. For example:
  - For a video encoding task, if you want to sample from 2 adjacent frames and from 1 pixel in each spatial dimension, you should pass `--overlaps [2,1,1]` (this is what we used in the paper).
- For video encoding tasks, the fraction of sampled pixels in each iteration is passed using `--sample_frac`. For example, if you want to sample 2% of the pixels in each iteration, you should pass `--sample_frac 0.02`.
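To make the flag arithmetic concrete, here is a minimal sketch (in Python) of how the values in the image example below are derived; the per-partition count of 14 is taken from the example above:

# A sketch of the flag arithmetic (values taken from the example above).
local_per_partition = 14                 # local hidden features per partition
partitions = 16 * 16                     # --downsample [16,16] -> 256 partitions
print(local_per_partition * partitions)  # 3584 -> pass --hidden_features 3584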
- Experiment scripts are run from the `experiment_scripts` directory:
We present examples for image, audio, and video encoding tasks. The examples are based on the hyperparameters used in the paper.
To train the model to encode an image, run one of the following commands:
# From the experiment_scripts directory
# Local-Global SIREN
PYTHONPATH=../ python train_img.py --experiment_name test_image_lg --lr 5e-4 --num_epochs 1001 --hidden_features 3584 --epochs_til_ckpt 1000 --mode lg --global_hidden_features 84 --downsample [16,16]
# SIREN-per-Partition
PYTHONPATH=../ python train_img.py --experiment_name test_image_lc --lr 5e-4 --num_epochs 1001 --hidden_features 3840 --epochs_til_ckpt 1000 --mode lc --downsample [16,16]
# SIREN
PYTHONPATH=../ python train_img.py --experiment_name test_image_mlp --lr 5e-4 --num_epochs 1001 --hidden_features 256 --epochs_til_ckpt 1000 --mode mlp
To train the model to encode an audio file, run one of the following commands:
# From the experiment_scripts directory
# Local-Global SIREN
PYTHONPATH=../ python train_img.py --experiment_name test_audio_lg --lr 1e-4 --num_epochs 1001 --hidden_features 1344 --epochs_til_ckpt 1000 --mode lg --global_hidden_features 72 --downsample [1,32,1]
# SIREN-per-Partition
PYTHONPATH=../ python train_img.py --experiment_name test_audio_lc --lr 1e-4 --num_epochs 1001 --hidden_features 1440 --epochs_til_ckpt 1000 --mode lc --downsample [1,32,1]
# SIREN
PYTHONPATH=../ python train_img.py --experiment_name test_audio_siren --lr 1e-4 --num_epochs 1001 --hidden_features 256 --epochs_til_ckpt 1000 --mode mlp
To train the model to encode a video file, run one of the following commands:
# From the experiment_scripts directory
# Local-Global SIREN
python train_video.py --experiment_name test_video_lg --num_epochs 5001 --epochs_til_ckpt 1000 --mode lg --dataset cat --downsample [5,8,8] --overlaps [1,2,2] --hidden_features 17600 --global_hidden_features 180 --sample_frac 0.02
# SIREN-per-Partition
python train_video.py --experiment_name test_video_lc --num_epochs 5001 --epochs_til_ckpt 1000 --mode lc --dataset cat --downsample [5,8,8] --overlaps [1,2,2] --hidden_features 17920 --sample_frac 0.02
# SIREN
python train_video.py --experiment_name test_video_siren --num_epochs 5001 --steps_til_summary 1000 --epochs_til_ckpt 5000 --mode mlp --dataset cat --hidden_features 1030
There are three parameters that control the automatic partitioning logic:
- First and foremost, the `--partition_size` argument sets the desired partition size.
- The `--auto_total_number_of_parameters` argument sets the required total number of parameters in the network.
  - It is set by default to 200k, and might deviate by roughly 5-10%.
  - This is equivalent to an MLP with 3 hidden layers and 256 hidden units (see the sketch after this list).
- The `--auto_global_weights_factor` argument sets the ratio of global weights to the entire network.
  - It is set by default to 0.1, and might deviate by roughly 5-10%.
  - We use the same ratio for the global weights when encoding images throughout the paper.
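For intuition, that equivalence can be checked with a short sketch. The assumptions here (2D input coordinates, 3 output channels, and an input layer to 256 features followed by three 256-to-256 layers and a final output layer) are ours, not stated in this README:

# A sketch verifying that the ~200k default matches the reference MLP above.
# Assumed architecture (not from this README): 2 -> 256, three 256 -> 256 layers, 256 -> 3.
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

total = linear_params(2, 256) + 3 * linear_params(256, 256) + linear_params(256, 3)
print(total)  # 198915, i.e. roughly 200k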
To train the model to encode an image with automatic partitioning on a DIV2K image, downsampled by 4 before training, run the following command:
# From the experiment_scripts directory
# Local-Global SIREN
PYTHONPATH=../ python train_img.py --experiment_name test_image_lg --dataset_path <PATH_TO_DIV2K_IMAGE> --image_resolution_factor 4 --num_epochs 2001 --steps_til_summary 200 --epochs_til_ckpt 1000 --mode lg --partition_size [32,32]
During training, debug outputs are logged to the `logs/` directory by default (this can be changed using the `--logging_root` flag).
We log both the metrics and the visualizations using TensorBoard. When `--use_wandb` is passed, we also log the metrics to Weights and Biases.
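For example, a variant of the image encoding command above that redirects the logs and disables WandB (the `../my_logs` path is just an illustration):

# From the experiment_scripts directory
PYTHONPATH=../ python train_img.py --experiment_name test_image_lg --lr 5e-4 --num_epochs 1001 --hidden_features 3584 --epochs_til_ckpt 1000 --mode lg --global_hidden_features 84 --downsample [16,16] --logging_root ../my_logs --use_wandb False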
To decode a video using the trained model, use a similar configuration as before, but pass the `--decode` flag and the `--checkpoint_path` flag. For example:
# From the experiment_scripts directory
# Local-Global SIREN
python decode_video.py --experiment_name test_video_lg --mode lg --dataset cat --downsample 5 8 8 --overlaps 2 1 1 --hidden_features 17600 --global_hidden_features 180 --decode --checkpoint_path model.pth
To crop specific partitions of the video, use any one (or more) of the following flags:
- `--crop_entire_dim_values`: A list of three lists. Crop partitions based on indices in specific dimensions, across the entire signal. For example:
  - To crop the entire spatial border of the video (across all frames), assuming the video was downsampled by a factor of 8 in each spatial dimension, pass `--crop_entire_dim_values [[],[0,7],[0,7]]`.
  - To crop the second and third partitions in the temporal dimension (across all spatial locations), pass `--crop_entire_dim_values [[1,2],[],[]]`.
- `--crop_partition_indices`: A list of lists of size three. Crop specific partitions based on their indices (in this case, each partition is indexed by three coordinates). For example:
  - To crop the partition at the start of the video, in the bottom-right corner (assuming the video was downsampled by a factor of 8 in each spatial dimension), pass `--crop_partition_indices [[0,7,7]]`.
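For example, the decode command above can be combined with a crop flag to remove the entire spatial border while decoding (a sketch reusing the flag values shown above):

# From the experiment_scripts directory
# Local-Global SIREN, decoding while cropping the spatial border partitions
python decode_video.py --experiment_name test_video_lg --mode lg --dataset cat --downsample 5 8 8 --overlaps 2 1 1 --hidden_features 17600 --global_hidden_features 180 --decode --checkpoint_path model.pth --crop_entire_dim_values [[],[0,7],[0,7]]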
The 3D shape encoding code is based on the official INCODE implementation.
There is a notebook available under the `experiment_scripts` directory called `train_3d.ipynb` that can be used to encode Lucy, as seen in the paper.
All downstream task code is available in the `incode_experiments` directory.
Note: there are additional requirements that need to be installed to run the INCODE experiments, listed in the `incode_experiments/requirements.txt` file.
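For instance, with the conda environment activated, they can be installed from the repository root with:

pip install -r incode_experiments/requirements.txt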
The code is based on the official INCODE implementation. There are three notebooks, one for each downstream task:
- `denoising.ipynb`: Denoising experiment
- `superresolution.ipynb`: Super-resolution experiment
- `ct_reconstruction.ipynb`: CT reconstruction experiment
The notebooks are ready to run, and the results can be reproduced by following them.
If you use this code or paper for your research, please cite the following:
@misc{ashkenazi2024croppableimplicitneuralrepresentations,
title={Towards Croppable Implicit Neural Representations},
author={Maor Ashkenazi and Eran Treister},
year={2024},
eprint={2409.19472},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.19472},
}