running with k>2 raises "memory allocation failed" error #30

Open
NealT87 opened this issue Feb 28, 2019 · 7 comments


NealT87 commented Feb 28, 2019

Any value of k > 2 for the transfer_entropy method raises the following error (for k = 1 or 2 it works):

```python
test = pyinform.transfer_entropy(x,y,k=3)
```

```
Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/JerusalemProject/JerusalemProject/ActionActorAnalysis.py", line 279, in <module>
    temp = pyinform.transfer_entropy(x,y,k=3)
  File "C:\Users\user\Anaconda2\envs\Python35\lib\site-packages\pyinform\transferentropy.py", line 179, in transfer_entropy
    error_guard(e)
  File "C:\Users\user\Anaconda2\envs\Python35\lib\site-packages\pyinform\error.py", line 57, in error_guard
    raise InformError(e,func)
pyinform.error.InformError: an inform error occurred - "memory allocation failed"
```

dglmoore commented Mar 21, 2019

Hi @NealT87. Thanks for the new issue! This error, admittedly vague, usually means that the C library couldn't allocate enough memory. The amount of memory necessary depends on:

  1. the base of the time series provided (i.e. the number of distinct states)
  2. the history length k

Could you share the range of values in the x and y time series?
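If it's handy, a quick check along these lines would tell us the base PyInform has to work with (a sketch; it assumes x and y are numpy arrays or lists):

```python
import numpy as np

# PyInform infers the base from the values themselves, so the
# maximum value in each series drives the histogram size.
print(np.min(x), np.max(x))
print(np.min(y), np.max(y))
```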


NealT87 commented Mar 26, 2019 via email

@silviaruiz44

Were you able to solve this? I am running into the same problem.

@dglmoore

@silviaruiz44 Thanks for reviving this issue. I suspect the problem is the range of values in your time series. If that's the case, then there are some workarounds.

If you wouldn't mind providing a sample of the source and target time series, that would be helpful for confirming the issue.

@silviaruiz44

Does the data have to be normalized or within close ranges? Why?

I also have a question regarding the mutual information function. Does it depend on the scaling? I calculated the mutual information of a time series against itself and got a value. When I divide the whole time series by a scalar and calculate the mutual information again, I get a different value, which is strange because it is the same time series, just scaled. I am wondering what the interpretation or explanation of that is.

Thanks in advance for your time.

@dglmoore

@silviaruiz44
To the point of why the "memory allocation failed" error is happening: we use the data you provide to construct histograms. Each bin of the histogram represents a different value that could possibly be observed in your data, and the histogram is stored in a dense form.

Say we're dealing with transfer entropy from X to Y with a history length of k = 3, and that X and Y can take integer (more on that below) values between 0 and 99. Then we'd need arrays that can store 100 future states of Y, 100 past states of X, 100^4 values representing the past and future states of Y, and 100^5 values of the combined past of X, past of Y, and future of Y, for a grand total of about 1.01e10 integers counting the number of times each combination is actually observed. That requires something like 40GB of RAM, hence the allocation failure.

In principle, this information could be stored more efficiently using a sparse representation, e.g. only storing what you actually observe. However, there are performance trade-offs and questions of statistical significance when you get into situations like the one above. Sometimes there are workarounds, so let me know if you are dead-set on applying these methods to data like this.
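To make that arithmetic concrete, here's a back-of-the-envelope sketch of the accounting above (my own illustration, assuming 4-byte counters, not the library's actual internals):

```python
def te_histogram_bytes(base, k, bytes_per_bin=4):
    """Rough memory estimate for dense transfer entropy histograms:
    future of Y, past of X, (past + future) of Y, and the combined state."""
    bins = base + base + base**(k + 1) + base**(k + 2)
    return bins * bytes_per_bin

# base 100, history length k = 3 -> roughly 40 GB, hence the failure
print(te_histogram_bytes(100, 3) / 1e9, "GB")
```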

Now to a bigger issue: PyInform doesn't really support continuously-valued data. The data that you pass into the time series measures, e.g. transferentropy, has to be integer-valued. We're essentially estimating the probabilities of events using frequencies taken from the time series, and that doesn't make much sense with continuously-valued data. There are methods for handling continuous data, but they aren't currently implemented in (Py)Inform. The documentation mentions this, but not emphatically enough (you're not the first person to run into this issue).

I'd wager that the reason the mutual information changes when you scale the values has to do with how C casts values. We use numpy internally to convert the data you provide into arrays with integer values, and numpy doesn't complain when you do something like numpy.asarray([3.0, 4.0, 5.0, 6.0], dtype=np.int32). It just happily passes the input along to C, which then casts the values to integers, so you end up with [3, 4, 5, 6]. However, if you first divide the values by 2 before giving them to pyinform, the resulting array will be [1, 2, 2, 3]. You go from having 4 distinct values to only having 3.
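You can see the truncation directly (a minimal reproduction using the same values as above):

```python
import numpy as np

xs = [3.0, 4.0, 5.0, 6.0]

# Casting to int32 silently truncates toward zero.
print(np.asarray(xs, dtype=np.int32))                # [3 4 5 6] -> 4 distinct states
print(np.asarray(np.array(xs) / 2, dtype=np.int32))  # [1 2 2 3] -> 3 distinct states
```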

Ideally, the time series functions would raise an exception if you provide continuously-valued data; however, we haven't decided exactly how we want to handle that since it requires an additional pass over the data to check the types.

All of that said, you have a couple of options for dealing with continuous data.

Binning

PyInform provides some (primitive) methods for binning continuously-valued data: you can bin using a fixed number of bins, a fixed bin size, or explicit boundaries between bins (see the sketch below). There are lots of different ways of choosing, for example, the width of the bins, e.g. the Freedman-Diaconis rule or Sturges's rule. If you are dealing with data that can be easily thought of as binary, e.g. a neuron is spiking or it isn't, then you can pick a threshold and call any value above it 1 and anything below it 0.
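Here's a rough sketch of those options using pyinform.utils.bin_series (the random series is just for illustration; check the binning docs for the exact return values):

```python
import numpy as np
from pyinform import utils

xs = np.random.rand(100)  # a continuously-valued series in [0, 1)

binned, nbins, width = utils.bin_series(xs, b=5)             # fixed number of bins
binned, nbins, width = utils.bin_series(xs, step=0.2)        # fixed bin size
binned, nbins, bounds = utils.bin_series(xs, bounds=[0.5])   # explicit boundaries

# Binary thresholding, e.g. "spiking or not":
spikes = (xs > 0.5).astype(np.int32)
```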

Most of the data that I deal with personally can be reasonably binned, but that's not always the case and doing so can introduce artifacts and bias. An alternative is to use the continuous data directly.

JIDT

A really good method for estimating mutual information (and transfer entropy, which is just a special case of conditional mutual information) is the Kraskov-Stögbauer-Grassberger (KSG) estimator. Unfortunately, (Py)Inform doesn't implement it at the moment because I just haven't had the time or the energy to implement a KD-tree in C 😄. If this is something that you desperately need, we can see about bumping this issue up the priority list.

In the meantime, I'd recommend considering JIDT if binning your data just won't work for what you want to do. It has just about all of the features of (Py)Inform and then some, including implementations of the KSG estimator (which JIDT calls Kraskov). It's written in Java, but it has tutorials on how to use it from Python.
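For example, something along the lines of JIDT's bundled Python demos, via jpype (the jar path and the toy data here are assumptions; adjust the path to wherever infodynamics.jar lives on your machine):

```python
import numpy as np
from jpype import startJVM, getDefaultJVMPath, JPackage, JArray, JDouble

# Start the JVM with the JIDT jar on the classpath (path is illustrative).
startJVM(getDefaultJVMPath(), "-Djava.class.path=infodynamics.jar")

# KSG ("Kraskov") transfer entropy estimator for continuous data.
TeCalc = JPackage("infodynamics.measures.continuous.kraskov").TransferEntropyCalculatorKraskov
teCalc = TeCalc()
teCalc.setProperty("k", "4")  # 4 nearest neighbors for the KSG estimator
teCalc.initialise(1)          # destination history length of 1

source = np.random.randn(1000)
target = np.roll(source, 1) + 0.1 * np.random.randn(1000)

teCalc.setObservations(JArray(JDouble, 1)(source.tolist()),
                       JArray(JDouble, 1)(target.tolist()))
print(teCalc.computeAverageLocalOfObservations())
```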

@silviaruiz44

Thank you so much for your answer! It helps a lot. One last question: how can we test the significance or accuracy of the mutual information estimates?
