
 
 

Efficiently Computing Autocorrelations; Demonstration in MatLab and Python

 


 

Recently I was faced with the issue of computing the autocorrelation of a time series, call it f(t).


Depending on how you choose to define it, mathematically this corresponds to computing the quantity:

R(m) = \frac{1}{T} \int_0^T f^{*}(t)\, f(t + m)\, dt
where T defines the integration domain which, in the limit of T approaching infinity, spans the entire non-negative real line; the superscripted star denotes the complex conjugate operation (you might have seen this written as a bar over f in other contexts); and m is referred to as the “lag”. From the definition we see that the autocorrelation is a measure of how similar the data f(t) is to itself when shifted by the lag m in time. So, say you have some periodic signal as f(t): the autocorrelation should peak whenever the lag equals a multiple of the period of the signal, because then the signal looks identical to its shifted self (in the limit of infinite time).


But doesn’t the Fourier transform tell us this as well: it peaks whenever a certain period is present in the signal? Yes, and indeed, we will see how this connection helps us implement the autocorrelation in a different way later.
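To make the “peaks at the period” claim concrete, here is a small sketch using numpy (which we will rely on later in this post anyway); the signal and its period of 20 samples are just made-up example values, and the FFT-based computation used here is explained in detail further down:

```python
import numpy as np

# Example signal: a sine with a period of 20 samples (made-up values).
t = np.arange(400)
f = np.sin(2 * np.pi * t / 20)

# FFT-based biased autocorrelation (explained later in the post).
N = len(f)
fvi = np.fft.fft(f, n=2 * N)
acf = np.real(np.fft.ifft(fvi * np.conjugate(fvi))[:N]) / N

# The local maximum near lag 20 recovers the period of the signal.
# We search a window around the expected period to avoid the trivial
# global maximum at lag zero.
period = 10 + np.argmax(acf[10:30])
```

Searching a window around the expected period is just for the demonstration; in practice one would inspect the whole autocorrelation curve.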


Notice the definition of the autocorrelation: statisticians and physicists define it differently, so be aware of this, and even within a given field the definition can change depending on exactly what you need it for. I have seen many cases where just the word “autocorrelation” is used with no definition; stating it explicitly above should avoid confusion. Also notice that the integral runs from zero to infinity and not from minus to plus infinity. If yours runs from minus to plus infinity, only small changes are needed in the following discussion.


The brute-force way of computing the autocorrelation on the computer is to discretize the integral into a (finite) summation, cutting off the integral at some suitable value where the integrand has decayed to a negligible value. The following is one way of computing the autocorrelation at a given lag m. We imagine having taken N measurements of some time series f at corresponding steps i. The measurements are spaced by a time span of dt (so if, e.g., we measured some signal after 1 second, after 2 seconds, after 3 seconds, and so on, then i is 1, 2, 3, ... and the time step is dt = 1 second; we can convert “step i” to “time ti” via the relation ti = i*dt). Given this collected data, the autocorrelation at lag m, corresponding to the integral version above, is, in discrete form (I put a hat on it to indicate “discrete”):

\hat{R}(m) = \frac{1}{N} \sum_{i=0}^{N-1-m} f_i^{*}\, f_{i+m}
Why does the summation stop at N-1-m? The (N-1) part comes simply because we start the summation at i=0 and have N points in total (count from zero to N-1, inclusive; this gives N points). The reason for subtracting m is this: we do not have data beyond step N-1. In other words, we don’t have f_N, f_{N+1}, etc. Just try putting m=4 and N=8; then we have data points f_0, f_1, ..., f_7. If the summation went to N-1 instead of N-1-m we would, with m=4, at some point in the summation be requesting f_{7+4} = f_{11}. But we don’t have that data point! At some point, the researcher decided to stop taking more data; he had to call it a day. So we have a finite amount of data. This is always the case in a practical setting. It was not the case in the integral version because there, in principle, we had infinite data, so the upper limit just went to “T” (which, incidentally, is related to N via T = N*dt).


There are a couple of reasons why you want to implement it in the most brute-force, straightforward way first, with absolutely no optimization whatsoever (premature optimization means mistakes!). First, you force yourself to think about the task at hand and what it is composed of. This will help you understand appropriate optimizations. Second, you get something to compare to that you know must be correct; it is implemented in such a simple way (although extremely slow) that you KNOW it’s correct (at least, make sure you know, by doing the most simple-minded implementation!). Then, as you write optimized versions, you have something to compare against.
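As a sketch of what such a simple-minded implementation might look like in Python (the function name is mine; numpy is only used for array storage here):

```python
import numpy as np

def autocorr_brute(f, M):
    """Brute-force biased autocorrelation for lags 0..M.

    Directly evaluates the discrete sum
    R(m) = (1/N) * sum_{i=0}^{N-1-m} conj(f_i) * f_{i+m}.
    """
    f = np.asarray(f)
    N = len(f)
    R = np.zeros(M + 1, dtype=complex)
    for m in range(M + 1):
        for i in range(N - m):  # i runs from 0 to N-1-m, inclusive
            R[m] += np.conjugate(f[i]) * f[i + m]
        R[m] /= N
    return R.real if np.isrealobj(f) else R
```

Note the nested loops: this costs on the order of N*M operations, which is exactly why it becomes hopeless for millions of data points, but it is trivially checkable against the definition.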


For the impatient this is how to compute the autocorrelation corresponding to the integral above:

MatLab (data is in vector “f” loaded at some earlier point):

c = xcorr(f, M, 'biased');

autocorrelations at lags 0:M = c(M+1:end);


So element k in vector c is the autocorrelation at lag k-1 (MatLab’s arrays start at index 1 remember).


Python (data is in vector “f” loaded at some earlier point):

*You need the numpy Python package*

import numpy as np

# define variable f here as your time series data

N = len(f)

fvi = np.fft.fft(f, n=2*N)

acf = np.real( np.fft.ifft( fvi * np.conjugate(fvi) )[:N] )

acf = acf/N

To get the autocorrelation at lags 0 through M we then do:

autocorrelations at lags 0:M = acf[:M+1]

So element k in this vector is the autocorrelation at lag k (Python’s arrays start at index zero).


Note: We normalized the raw autocorrelation by N above. This is what corresponds to the “biased” version in MatLab’s “xcorr”. If we needed the “unbiased” version we would do the following instead in the code above:

In the last line divide “acf” by

d = N - np.arange(N)

instead of just N.
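Putting the two normalizations side by side, here is a small self-contained sketch of the recipe above (the function name and the toy data are mine):

```python
import numpy as np

def autocorr_fft(f, unbiased=False):
    """FFT-based autocorrelation for lags 0..N-1, biased or unbiased."""
    f = np.asarray(f, dtype=float)
    N = len(f)
    fvi = np.fft.fft(f, n=2 * N)
    acf = np.real(np.fft.ifft(fvi * np.conjugate(fvi))[:N])
    if unbiased:
        acf = acf / (N - np.arange(N))  # divide lag m by N - m
    else:
        acf = acf / N                   # 'biased': divide every lag by N
    return acf
```

For example, autocorr_fft([1.0, 2.0, 3.0, 4.0]) reproduces the same values as the brute-force discrete sum with the 1/N prefactor.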


The full story

In this post I focus on the tools MatLab and Python. But, as you will see, if you can Fourier transform efficiently in your favorite programming language you can also easily compute the autocorrelation.


The brute-force method works, so why consider more advanced ways?

One reason: I had to compute the autocorrelation for tens of millions of data points. The brute-force method is way too slow in this case. We need something much faster.


I am going to show you how to get the same result much faster in both MatLab and Python.


Let us start with MatLab. There is a function called “xcorr(...)” (short for cross-correlation, of which autocorrelation is a special case). Its documentation in MatLab defines the raw cross-correlation estimate, for lags m >= 0, as:

\hat{R}_{xy}(m) = \sum_{n=0}^{N-m-1} x_{n+m}\, y_n^{*}

MatLab refers to the lag as m in its doc page, and R_{xy}(m) is the cross-correlation between the time series x and y. In the case of an autocorrelation, x = y. The star on y refers to the complex conjugate operation. Incidentally, for real data the autocorrelation is even: R_{xx}(m) = R_{xx}(-m).


We are only going to need the result for lags m>=0 (corresponding to the integral we wrote in the beginning extending from zero to plus infinity).


Calling xcorr with some maximum lag M we get an output vector c. This vector contains 2M+1 elements. What are these elements?

c(1) is the autocorrelation for lag -M (=negative value)

c(2) is the autocorrelation for lag -M+1 (=negative value)

:

c(M+1) is the autocorrelation for lag 0

c(M+2) is the autocorrelation for lag 1 (=positive value)

:

c(2M+1) is the autocorrelation for lag M (=positive value)


We need to truncate the output of this MatLab call and only choose non-negative lag values. You do this with:


c(M+1:end) = non-negative autocorrelation values


If you have to compute the autocorrelation from minus to plus infinity then, of course, you don’t need this truncation; just use the entire vector c. But still be aware that the middle element of the vector is lag zero, the leftmost (i.e., first) element corresponds to the most negative lag, -M, and the rightmost (i.e., last) element to the most positive lag, +M. (Incidentally, you can swap the negative- and positive-lag halves quickly using the built-in function “fftshift” in MatLab.)
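numpy has the same reordering helper. As a sketch of the Python analogue (toy data mine), the two-sided FFT-based result can be reordered with np.fft.fftshift so the lags run from negative to positive, similar to the layout of MatLab’s xcorr output:

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
N = len(f)

# Raw two-sided autocorrelation: lag 0 comes first, then the positive
# lags, with the negative lags at the end of the length-2N vector.
fvi = np.fft.fft(f, n=2 * N)
acf_full = np.real(np.fft.ifft(fvi * np.conjugate(fvi)))

# Reorder so lags run from most negative to most positive:
acf_shifted = np.fft.fftshift(acf_full)
```

After the shift, lag zero sits in the middle of the vector, as in MatLab’s convention.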


Now, the computation of the autocorrelation has different “modes”. Each mode normalizes the raw autocorrelation in some way (by default it actually does not normalize at all; you just get the raw autocorrelation, which corresponds to the discrete sum I wrote in the beginning but without the 1/N prefactor).


MatLab can compute the autocorrelation in the following modes:

'none' (default) — the raw, unnormalized autocorrelation
'biased' — the raw autocorrelation divided by N
'unbiased' — the raw autocorrelation at lag m divided by N - |m|
'coeff' — the raw autocorrelation normalized so that the value at lag zero equals 1
The biased mode is what we want.



In “unbiased” mode one divides the autocorrelation at each lag m by the number of data points N less this lag. Yet another mode is “coeff”: this just normalizes the autocorrelation at all lags by the value at lag zero, so you obtain a normalized autocorrelation between -1 and +1.
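The same normalization is easy to get in the FFT-based Python approach discussed below: simply divide by the lag-zero value. A sketch (function name mine):

```python
import numpy as np

def autocorr_coeff(f):
    """FFT-based autocorrelation normalized by the lag-0 value, so the
    result lies in [-1, +1] with acf[0] == 1 (analogous to 'coeff')."""
    f = np.asarray(f, dtype=float)
    N = len(f)
    fvi = np.fft.fft(f, n=2 * N)
    acf = np.real(np.fft.ifft(fvi * np.conjugate(fvi))[:N])
    return acf / acf[0]
```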


So, given some data series f(t) (called simply f as a MatLab vector, say) we compute the autocorrelations for lags zero through some maximum lag M via:

c = xcorr(f, M, 'biased');

autocorrelation = c(M+1:end);


That is it! This will be faster than the brute-force approach, especially for large time series.


What about Python? I did not find any built-in way, even in the numpy and scipy packages. I did find a package called “statsmodels”, but I will not rely on it, just to show you how it can be done with what numpy already has. We will instead use a theorem which relates the autocorrelation of a time series to its Fourier transform. This is most likely also what MatLab’s “xcorr” uses for large time series anyway. The theorem is called the “Wiener-Khinchin theorem” (a special case of the “cross-correlation theorem”) and states that, for lag m:

\hat{R}(m) = \frac{1}{N}\, \mathcal{F}^{-1}_{\nu}\!\left[\, f_{\nu,i}\, f_{\nu,i}^{*} \,\right](m)
where the scripted F_ν^{-1} is the inverse Fourier transform operation (from frequency back to time) and f_{ν,i} is the time series f Fourier transformed (from the time domain to the frequency domain ν) and then evaluated at step i.


Incidentally, if you have the option of choosing how many data points your time series has, I suggest you use a power of 2, since the Fourier transform is typically implemented on the computer as the “fast Fourier transform” (FFT), which is fastest for power-of-2 lengths (so if you aim to take around, and at least, n data points, you should instead take 2^(ceil(log2(n))) points for the fastest performance).
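The rounding-up-to-a-power-of-two formula can be sketched as a tiny helper (the function name is mine):

```python
import math

def next_pow2(n):
    """Smallest power of two that is >= n, i.e. 2**ceil(log2(n))."""
    return 1 if n <= 1 else 2 ** math.ceil(math.log2(n))
```

For example, if you aim for around 1000 data points, next_pow2(1000) suggests taking 1024 instead.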


Now, the point is that Python’s numpy package has the Fourier transforms implemented. So first, you need to import numpy and I assume you import it and call it “np” as in:

import numpy as np


Then, first Fourier transform the time series f(t) (before this, just store the total number of data points in N):

N = len(f)

fvi = np.fft.fft(f, n=2*N)

This is what I called f_{ν,i} above.


The reason for choosing n=2*N in the second argument to np.fft.fft(...) is the same as in the discrete sum version: we only have finite data, so the function pads the time series with zeros where we don’t have measurements. Choosing 2N rather than 3N or 4N is just for efficiency: the maximum overlap between the two shifted copies of the series is reached at 2N, so padding to this value is optimal.

Why is the maximum overlap 2N? Imagine the time series as a matchstick of length N. Replicate the match and put the replica exactly under the original on a table. Each match represents the time series; when they are aligned on top of each other, the lag is zero. Now move them relative to each other (along the axial direction) by a small amount: this is lag 1, say. Move them a bit more: this is lag 2, and so on. At the end, the matches are barely on top of each other because they are shifted by almost the length of a match. What is the total span of the two matches then? Since each match is N long, the two matches shifted as far as possible while still having some overlap span 2N-1 (or 2N at complete misalignment).
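The zero-padding is what prevents the FFT’s inherent periodicity from wrapping the shifted copy around onto the wrong end of the data. A small demonstration (toy data mine):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
N = len(f)

# No padding: the result is a *circular* autocorrelation, where the
# shifted series wraps around and overlaps the wrong end of the data.
F = np.fft.fft(f)
circular = np.real(np.fft.ifft(F * np.conjugate(F)))

# Padding to 2N inserts N zeros, so shifted copies never wrap into data,
# and the first N entries give the ordinary (linear) raw autocorrelation.
Fp = np.fft.fft(f, n=2 * N)
linear = np.real(np.fft.ifft(Fp * np.conjugate(Fp))[:N])
```

The linear result matches the brute-force discrete sum (before normalization); the circular one does not.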

Now, multiply this result by its own complex conjugate (which is also an available operation in numpy):

acf = fvi * np.conjugate( fvi )


And take the inverse Fourier transform of this:

acf = np.fft.ifft( acf )


The “ifft” operation is what I called (scripted) F_ν^{-1}[...] above.

As before: we get the inverse Fourier transform centered around zero extending from minus to plus lags. We just need the plus lags so extract that part:

acf = acf[:N]


Notice that, when calling fft like this, the zero lag is found in the first position of the returned vector, followed by the positive lags and then the negative lags (this ordering is described in the documentation of “ifft”).

These lags were arranged differently in the vector returned by MatLab’s xcorr; just be aware of this! So we extract the first N values to get all non-negative lags up to N-1. The Fourier transform is complex; we need just the real part:

acf = np.real( acf )


To get the autocorrelation at lags 0:M we then extract this via:

autocorrelations at lags zero through M = acf[:M+1]


That is it for Python’s way. Now you see that the autocorrelation boils down to Fourier transforms, so if you have a Fourier transform package in your language (e.g. FFTW++ for C++) then you just do the same as I did in Python above and you’re all set.


I wrote a script which automates all of this and compares the MatLab and Python methods discussed above (and any other language you wish, by implementing a bit more code). The script creates and runs both the MatLab and Python versions and compares the outputs to the brute-force method. The code is commented, but if you have any questions or find bugs, please send me an email:


COMPARE_AUTOCORRELATIONS_dec30.sh.zip

(It takes about 10 seconds to run on my MacBook Pro)

 

Sunday, December 29, 2013

 
 
