
Customization

This README describes how to customize your system to work better with Perforated Backpropagation™. It starts by describing additional options that are available. If you are working with anything other than a simple MLP with linear and conv layers, you will likely need some of the later sections of this README to get your system running.

1 Additional Settings

This section is for additional settings that were removed from the main README to simplify initial implementation.

1.1 Alternative Switch Mode

The alternative switch mode is:

PBG.switchMode = PBG.doingFixedSwitch # Switch on a fixed number of epochs rather than after improvement stops
PBG.fixedSwitchNum = 10 # How many epochs to cause a switch
# If you would like the first one to run longer since it's your original training
# you can set this value to be different
PBG.firstFixedSwitchNum = 10  

1.2 CapAtN

The following can be added if you want Dendrite training cycles to be capped at the same number of epochs as the first neuron training cycle. Setting this to True means Dendrite correlation scores will not improve as much, but it can save significant training time. Recommended usage is to set this to True during experimentation and only change it to False when you have a working system and want to get the absolute most out of it for your final version.

PBG.capAtN = True

When this is left at its default of False, you may still want to shorten Dendrite training time without completely stopping it while Dendrites are still improving. To that end you can adjust the following settings:

PBG.pbImprovementThreshold = 0.1
PBG.pbImprovementThresholdRaw = 1e-5

These values specify how much the Dendrites must be improving in order to continue training them. The default settings are that if at least one Dendrite in the entire network has improved its score by at least 10% and at least 1e-5 over the last PBG.pEpochsToSwitch epochs, then Dendrite training will continue. If it seems like the Dendrite training just keeps going up indefinitely, these are the values that should be changed. If these numbers are set too low, some larger models will even keep improving just due to random noise, even with a learning rate of 0.

1.3 - Configuration Values

There are many different configuration settings you can play with. The full list with detailed descriptions can be found in this API repository under perforatedai/pb_globals.py. However, the following are the most important; they do not have default values because they should be considered in every project.

# When to switch between Dendrite learning and neuron learning. 
PBG.switchMode = PBG.doingHistory 
# How many normal epochs to wait for before switching modes, make sure this is higher than your scheduler's patience.
PBG.nEpochsToSwitch = 10  
# Same as above for Dendrite epochs
PBG.pEpochsToSwitch = 10
# The default shape of input tensors
PBG.inputDimensions = [-1, 0, -1, -1]

1.4 - Initial Configuration Values

Every switch to Dendrite learning will increase the size of your network. Because of this we recommend starting with the following setting. This tells the system to add Dendrites every epoch and lets you test how many Perforated AI cycles you will be able to add before running out of memory. It should also be used to confirm quickly that nothing else will go wrong with your configuration, rather than running many wasted epochs before finding out. To ensure maximum efficacy the system should be tested up to 3 Dendrites (Cycle 6). However, it is also reasonable to just test with 1 Dendrite (Cycle 2) if you only want to add a maximum of 1 Dendrite due to memory restrictions.

PBG.testingDendriteCapacity = True

1.5 Initialization Settings

Additional options during initialization include:

PBU.initializePB(model, doingPB=True, saveName='PB', makingGraphs=True, maximizingScore=True)

doingPB can be set to False if you want to run with your current parameters without PB.

makingGraphs can be set to False if you would prefer to make your own graphs for output performance.

maximizingScore can be set to False when the value passed to addValidationScore is a loss value that should be minimized. It's generally better to look at the actual validation score rather than the raw loss values, because loss can sometimes continue to be reduced as correct outputs become "more" correct without actually reducing the number of incorrect outputs. However, using this can get you running quicker. If choosing to minimize loss, a setting that can help mitigate this is raising PBG.improvementThreshold. The default is 1e-4, but setting it to 0.001 will only count a loss reduction if the current cycle is at least 0.1% better than the previous cycle (see the sketch below).
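For example, a minimal sketch of an initialization that minimizes a loss (the save name here is just a placeholder) might look like:

# Only count a cycle as better if the loss improves by at least 0.1%
PBG.improvementThreshold = 0.001
# Lower is better, so tell the tracker not to maximize
PBU.initializePB(model, doingPB=True, saveName='PB_lossRun', makingGraphs=True, maximizingScore=False)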

saveName defaults to 'PB', but if you run multiple experiments at once this must be changed.

1.6 Systems without Simple Optimizer/Scheduler Setups

If there is no scheduler, just leave it out of the call to setupOptimizer entirely. But as a warning, we have run some experiments where PAI does not work without a scheduler, so if you choose to exclude one and PAI does not improve your system, we would encourage you to include the ReduceLROnPlateau scheduler and try again.

optimizer = PBG.pbTracker.setupOptimizer(model, optimArgs)

If your system is using a more complicated trainer where you can't just declare the optimizer outside of your system like this, you are free to call the following instead of all of the above, but it won't work quite as well.

PBG.pbTracker.setOptimizerInstance(trainer.optimizer)

If your system has multiple optimizers, just pick one of them to use. However, when you call addValidationScore you should also reinitialize the other optimizer if restructuring happens (see the sketch below).
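As a rough sketch of that reinitialization (netD and the learning rate are placeholders, and 'restructured' stands in for whatever flag addValidationScore reports, as described in the main README):

import torch

def rebuildSecondaryOptimizer(model, restructured):
    # setupOptimizer manages the primary optimizer; an extra optimizer must be
    # recreated by hand whenever restructuring happens so it points at the
    # restructured model's parameters.
    if restructured:
        return torch.optim.Adam(model.netD.parameters(), lr=2e-4)
    return None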

If you are doing something separately with the scheduler or optimizer that is adjusting the learning rate based on epochs it is best if you can just define this internally in the scheduler rather than taking in epochs as a parameter to a function after the scheduler is initialized.

2 - Network Initialization

Network initialization is the most complicated part of this process and often requires thought and experimentation. This section details what needs to be done and why, but check the "Changes of Note" sections of each of the examples to see descriptions of what we did and when, to get a feel for what you should do with your network. As a general rule, you want to make sure everything other than nonlinearities is contained within PAI modules, so that each Dendrite block performs the same processing as the associated neuron block. However, complexities arise when there are modules within modules and there are multiple ways to do this: you can convert the whole thing, or each sub-module, with the options below.

2.1 - Setting Which Modules to Use for Dendrite Learning

This is often the part that has some complexity. If your network is all simple layers with linear or conv layers and nonlinearities, they will be converted automatically. However, most networks have more complicated learning modules. Performance is often better when these modules are grouped as a single PAI module as opposed to PAI-ifying each module within them. To tell the system that it must convert some blocks, add them with the following option. It can be good to experiment with the level of depth at which you block things off, i.e. many smaller modules or fewer large modules. They can be added with the line below before convertNetwork is called.

PBG.moduleNamesToConvert += ['moduleName']

Using moduleNamesToConvert does require all names to be unique, and it may not work properly if names have '.' in them or if there are multiple types with the same name, such as nn.Linear and lora.layer.Linear. In these cases add the full type to the type-based array instead; note that moduleType below is the type itself, not a string.

PBG.modulesToConvert += [moduleType]

Along the same lines, all normalization layers should be contained in blocks. This always improves performance so it is checked for in the initialization function. If they are not in a module already, simply add them to a PBSequential with whatever is before them. For example:

PBG.PBSequential([normalLayer, normalizationLayer])
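A fuller sketch of this inside a model definition (layer sizes are placeholders) might look like:

import torch.nn as nn
from perforatedai import pb_globals as PBG

class EncoderBlock(nn.Module):
    def __init__(self):
        super(EncoderBlock, self).__init__()
        # Group the linear layer with its normalization so they are converted
        # together as a single PAI block.
        self.block = PBG.PBSequential([nn.Linear(128, 256), nn.LayerNorm(256)])

    def forward(self, x):
        return self.block(x)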

2.1.1 - How to Tell Which Modules Are Not Wrapped

When you first call convertNetwork Perforated AI will print a list of all parameters which have not been wrapped. It is not required that all modules are wrapped, but any that are not wrapped will not benefit from Perforated AI optimization. It is recommended to wrap everything, but if you are having trouble with the processing in the following section for some modules it is ok to just skip them. The only modules that are automatically converted are nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear, and PBSequential. The list will look like this:

The following params are not wrapped.
------------------------------------------------------------------

...

The following params are not tracked or wrapped.
------------------------------------------------------------------

...

------------------------------------------------------------------
Press enter to confirm you do not want them to be refined

You should make sure to track every module even if it's not wrapped. Tracking modules just ensures the proper PB protocol of not adjusting neuron weights while Dendrites are training. To track a module without wrapping it, just append to the following arrays, similar to the wrapping arrays of similar names:

PBG.modulesToTrack
and
PBG.moduleNamesToTrack
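For example, to track a module by name without wrapping it (the name here is a placeholder):

# Keeps neuron weights in this module frozen while Dendrites train,
# without adding Dendrites to it
PBG.moduleNamesToTrack += ['CustomAttention']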

Once you have seen this list and are sure it is correct, you can set it to be ignored in the future with:

PBG.unwrappedModulesConfirmed = True

2.1.2 Building New Modules

Because all functionality in between nonlinearities should be done within converted modules, you may also want to create new modules. If there is computation done outside of converted modules, just make a new module that performs those calculations within a forward function. We generally advise putting these steps within the forward function of the module which contains the subsequent module, as opposed to the prior module, as in the sketch below.
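As a minimal sketch (names and sizes are placeholders), loose math from a parent forward can be moved into a module together with the layer that follows it:

import torch.nn as nn

class PooledLinearBlock(nn.Module):
    def __init__(self, inFeatures, outFeatures):
        super(PooledLinearBlock, self).__init__()
        self.linear = nn.Linear(inFeatures, outFeatures)

    def forward(self, x):
        # This mean used to happen loose in the parent's forward; moving it here
        # keeps it inside the module that will be converted.
        x = x.mean(dim=1)
        return self.linear(x)

# Then mark the new module for conversion before calling convertNetwork
PBG.moduleNamesToConvert += ['PooledLinearBlock']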

2.2 - Setting up Processing for Complex Modules

Finally, if any of the modules you are converting have a custom forward that has more than one tensor as input or output, for example a GRU taking in the previous and current states, you will need to write processing functions. Please check out pb_models for examples of how to create a processing function for a module. Once they are written, add them with the following block of code. Make sure you do this in order: they are just added to two arrays, which assumes they are added in pairs. When writing these they also must not be local, i.e. defined within another class as member functions.

PBG.moduleNamesWithProcessing += ['GRU']
# This processor lets the dendrites keep track of their own hidden state
PBG.moduleByNameProcessingClasses += [PBM.GRUProcessor]

A simpler example below just ignores any outputs after the first. This will generally fix the problem and allow the system to run, but it isn't necessarily correct for your application:

PBG.moduleNamesWithProcessing += ['ModuleWithMultipleOutputs']
# This processor ignores all extra outputs after the first
PBG.moduleByNameProcessingClasses += [PBM.multiOutputProcesser]

You will know this is required if you get an error similar to the following:

AttributeError: 'tuple' object has no attribute 'requires_grad'

Also as a note, if you are using a GRU or LSTM in a non-traditional manner, such as passing the hidden tensor forward rather than the output, you may need to change how these processors are defined rather than using ours from pb_models.

2.2.1 - Understanding Processors

To help visualize what is happening the figure below is provided. To think about designing a processing function, one must understand that the way Dendrites work is by outputting a single connection to each neuron. This is implemented in PyTorch by taking the output tensor of a neuron layer, and adding the output tensor of the Dendrite layer multiplied by the corresponding weights. This means the Dendrite output must be a single tensor with the same dimensionality as the neuron output. This is simple if it is just a linear layer, one tensor in one tensor out, but it gets more complex when there are multiple tensors involved.

In the example below the following steps happen in the following order:

  • The input tensors are received by the PAI module. For a GRU this will mean the input tensor and the hidden tensor, which is all zeros at the first pass.
  • The GRU Neuron receives these tensors directly and outputs the usual output of a GRU layer, a tuple of (output,hidden)
  • The first neuron postprocessor splits off the Neuron Hidden Tensor (NHT) so the single tensor output can be combined with the Dendrite's output
  • The Dendrite Preprocessor receives these inputs but must filter them before getting to the GRU Dendrite module. If it is the first input, it just returns them as usual. But if it is a subsequent input where the hidden tensor is no longer all zeros it returns the Dendrite Hidden Tensor (DHT) rather than the NHT which is what would have been passed in from the training loop.
  • The GRU Dendrite receives these tensors and outputs the Dendrite (output,hidden) tuple.
  • The Dendrite Postprocessor saves the DHT to be used in future timesteps and passes forward the single tensor output that can be combined with the Neuron's output.
  • The neuron and Dendrite outputs are combined.
  • The neuron's second postprocessor creates a new tuple with this combined output and the NHT which was saved from postprocessor one.
  • The new tuple is returned from the PAI module which has the same format as the original module before being converted to a PAI module.

A note about processors

The clear_processor function is called each time the network is saved. This includes automatic saves which happen during the addValidationScore stage or any calls to saveSystem. It should not cause problems in the general case, but if you have reason to call these functions in the middle of training cycles where you don't want processors to be cleared, problems could arise.

[Figure: GRU Processor]
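Tying these steps and the clear_processor note together, below is a hypothetical sketch of a GRU-style processor. The method names (post_n1, post_n2, pre_d, post_d) and the argument handling are assumptions modeled on the description above and PBM.GRUProcessor; check pb_models for the authoritative signatures.

class SketchGRUProcessor():
    # Neuron postprocessor 1: save the Neuron Hidden Tensor (NHT) and pass on
    # the single output tensor so it can be combined with the Dendrite output.
    def post_n1(self, *args, **kwargs):
        output, hidden = args[0]
        self.NHT = hidden
        return output

    # Neuron postprocessor 2: re-attach the saved NHT so the PAI module returns
    # the same (output, hidden) format as the original GRU.
    def post_n2(self, *args, **kwargs):
        combined = args[0]
        return combined, self.NHT

    # Dendrite preprocessor: on the first step pass the inputs through unchanged;
    # afterwards swap in the Dendrite Hidden Tensor (DHT) instead of the NHT.
    def pre_d(self, *args, **kwargs):
        inputTensor, hidden = args[0], args[1]
        if hidden.abs().sum() == 0:
            return args, kwargs
        return (inputTensor, self.DHT), kwargs

    # Dendrite postprocessor: save the DHT for future timesteps and pass forward
    # only the single output tensor.
    def post_d(self, *args, **kwargs):
        output, hidden = args[0]
        self.DHT = hidden
        return output

    # Called whenever the network is saved (see the note above).
    def clear_processor(self):
        for attr in ('NHT', 'DHT'):
            if hasattr(self, attr):
                delattr(self, attr)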

3 Multiple Module Systems

Some deep learning involves components which are not single PyTorch Modules. An example might be a GAN system where the discriminator and generator are separate. If this is the case they still must be converted together. This can be worked around simply by creating a class such as the following:

class Pair(nn.Module):
    def __init__(self, netG, netD):
        super(Pair, self).__init__()
        self.netG = netG
        self.netD = netD

Once the class is defined, simply create one of those objects and run as follows:

model1 = create_model1()
model2 = create_model2()
model = Pair(model1, model2)
model = PBU.convertNetwork(model)
# Then set the networks directly
model1 = model.netG
model2 = model.netD

Important note! If you do the above, make sure to also add the same steps and adjustments to the addValidationScore section.

An alternative is to call convertNetwork twice but that still needs to be tested more thoroughly.

4 - Set Abnormal Input Dimensions

Some complex networks have tensors with different shapes at different points in the network. If yours does, just setting inputDimensions is not enough. In these cases set inputDimensions to be the most typical case in your network. You will then have to manually call module.setThisInputDimensions(new indexes for that module) for any modules that stray from this. This must be called after convertNetwork. Some examples are below. The process is that 0 goes in the place of the node index, and -1 goes at every other dimension.

model.onlyRecurrentModule.setThisInputDimensions([-1,-1, 0])
model.fullyConnectedOutputLayer.setThisInputDimensions([-1, 0])
model.conv3dLayer.setThisInputDimensions([-1,-1,0,-1,-1])

This is based on the output of the layer, not the input. Try starting without any of these and then run your network; we will tell you if there is an error and how to fix it. If you suspect there might be more than one problem, set the following flag and they will all be printed so they can be fixed at once.

PBG.debuggingInputDimensions = 1

We recommend setting this flag, and if there are many problems, changing PBG.inputDimensions in the initial settings first. Then run again; hopefully there will be fewer problems and you can make the remaining per-module changes with the smaller count.

5 Using Pretrained Networks

If you are working with a pretrained network but you need to make some of the changes above to the architecture, what you will have to do is define a new network that takes in the initial network in the init and copies all the values over. Once you define this network you can use it by adding to the following arrays before convertNetwork:

PBG.modulesToReplace = [pretrainedModule]
PBG.replacementModules = [newPAIVersion]

An example of this is ResNetPB in pb_models. Keep in mind, if you want to replace the main module of the network, just do it at the top level in the main function and do not rely on the PAI conversion portion with these two lines of code. As an example for ResNets:

PBG.modulesToReplace = [torchvision.models.resnet.ResNet]
PBG.replacementModules = [PBM.ResNetPB]

6 - DataParallel

For DataParallel to work with Perforated Backpropagation™ we leverage the same idea that allows other modules with buffers to operate, by just using the values from GPU:0. However, one part of the way this code manages to work with so few modifications to your original pipeline causes issues under the hood on multiple GPUs. We have created a simple, but necessary, two-step process to get around these issues.

First run your pipeline on a single GPU. Settings for this run don't matter. Adjust your training loop to have the following two lines after your call to loss.backward():

loss.backward()
#This line sets up multiGPU
PBG.pbTracker.saveTrackerSettings()
exit(0) # exit this run after settings are saved.

By calling this one function the required settings will be saved into the saveName folder you have specified when you initialized the pbTracker. Once the settings have been saved, delete these two lines to go back to your original training loop. The second step is to initialize the tracker settings before you instantiate the DataParallel. This should be done after your calls to PBU.convertNetwork and PBG.pbTracker.initialize:

PBG.pbTracker.initializeTrackerSettings()
net = torch.nn.DataParallel(net, your other settings)

This should work for DataParallel, but we are still working on DistributedDataParallel.

7 Loading

If you need to load a run after something stopped it in the middle you can call:

model = PBU.loadSystem(model, saveName, 'latest', True)  # saveName as passed to initializePB

If you want to load the best model for any reason you can call:

model = PBU.loadSystem(model, saveName, 'best_model', True)

This function should be called after initializePB and setThisInputDimensions, but before setupOptimizer.

If you want to load a PAI model just for inference, using only the open source code from pb_network in this API, you can do so with the following:

model = fullModel()
from perforatedai import pb_network as PBN
model = PBN.loadPAIModel(model, 'name/best_model_pai.pt')

8 Optimization

If you do everything to get a system up and running but do not see improvement, these are the recommended changes to try.

Overfitting

Sometimes adding Dendrite nodes just causes the system to immediately overfit. But these can often be the best scenarios where you will be able to achieve better results with a smaller model as well. Try reducing the width or depth of your network until you start to see a larger drop in accuracy. Often modern architectures are designed to be extremely large because compute can be cheap and worth small accuracy increases. This means you can often reduce the size to a fraction of the original before seeing more than a couple percentage points lost in accuracy. Try running with a smaller model and seeing if the system still just overfits or if improvement can be found that way.

Additionally, if reducing loss is your main goal but the current system is overfitting and you are leveraging early stopping to get best scores, you can try methods like dropout. Add dropout layers throughout your network and adjust the ratios such that the training scores become worse than the validation scores before applying Perforated Backpropagation to improve the training scores, as in the sketch below.
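A minimal sketch (sizes and the dropout ratio are placeholders to tune):

import torch.nn as nn

# Add dropout until training scores fall below validation scores, then apply
# Perforated Backpropagation to bring the training scores back up.
classifier = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)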

Correlation Scores are Low

If a score is above 0.001, we typically take that to mean correlation is being learned correctly. Anything less than that is likely just random noise and something is actually going wrong. In these cases, play around with the options in sections 2.1 and 2.2 above, and see if other wrapping methods or other processing functions are able to achieve better correlation scores.

Suggestions

  • If you have a layer that gets called more than once in the forward pass, that has been seen to cause problems.

  • Sometimes models will have complicated internal mechanisms and you'll have to chunk them into additional sub-modules. A key thing to consider when deciding if things need to be grouped is what happens after them. If there is non-module math that happens between one module and the next, you might need to wrap those steps in a module. This always includes normalization layers, but can also be things like means, applying masks, or changing views. As a rule of thumb, everything other than non-linearities should be contained within a module that is converted.

Model doesn't Seem to Learn at All

  • Make sure that optimizers are being initialized correctly. If you are not just passing in a model in one place, make sure whatever you are doing happens at every restructuring so the new variables are being used.

  • Make sure the scheduler was restarted properly. If your learning rate is extremely low after the restructuring it may seem to not be changing.

  • Make sure the optimizer, scheduler, and model are all the correct variables. Sometimes these are updated within a function that doesn't return them or the real model is self.model but that is not overwritten by addValidationScore.

  • Make sure you didn't wrap something that requires specific output values for future math down the line. Adding the Dendrite output to these values will mess up that math. For example, if a module ends with a Softmax layer going into NLL loss, you need to make sure the Softmax layer is not being wrapped, because the output of Softmax is supposed to be probabilities, so adding "Dendrite probabilities" to them is wrong. For this specific case you can also just remove the Softmax and use CrossEntropyLoss instead, as in the sketch below.
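A minimal sketch of that fix (shapes are placeholders): drop the final Softmax from the wrapped module and hand the raw logits to CrossEntropyLoss, which applies log-softmax internally.

import torch
import torch.nn as nn

logits = torch.randn(8, 10)           # raw, un-normalized model outputs
targets = torch.randint(0, 10, (8,))  # class indices
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)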

Running Multiple Experiments

If you'd like to automate parameter sweeps without typing in your password every time, this can be done with the following method:

CUDA_VISIBLE_DEVICES=0 PAIPASSWORD=YOURPASSWORD python your_script.py 

Similarly, if you are using a token you can do the following:

CUDA_VISIBLE_DEVICES=0 PAIEMAIL=YOUREMAIL PAITOKEN=YOURTOKEN python your_script.py

Good Science

Datasets should always be split into train, test, and validation for machine learning projects, but it is especially important for experiments with Perforated Backpropagation™. The validation scores are used to determine when to add Dendrites, so one could argue they are even part of the training dataset because they are used for more than just the decision to stop training. Without separate splits of data it is possible to overfit to the validation data as well, so be sure to always have a final test dataset to determine the test values before putting a model into production.

This can be done similar to the following example for MNIST:

test_dataset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
test_set, val_set = torch.utils.data.random_split(test_dataset, [5000, 5000])