Autocorrelation
Autocorrelation is a fundamental statistical method for identifying periodicities. It has been tested empirically on varying types of biological data sets and found to be reliable and accurate in the circumstances tested (Dowse, 2009; 2007; Dowse and Ringo, 1994; Levine et al., 2002; Palmer et al., 1994; Refinetti, 1993; 2004; 2006; Refinetti et al., 2007). Autocorrelation begins by comparing a data set to itself, point by point, from start to end, using a standard correlation analysis. Since each data point is compared against itself, the correlation is perfect, and the resulting correlation coefficient, r, is 1. The two identical sets are then shifted, or lagged, by one point, and the comparison is repeated; at this lag, r will generally be less than 1. The lagging continues, one point at a time, until it reaches the maximum lag, which is specified in CAT as a percentage of the series length. The resulting r values are plotted in sequence as a function of the lag, in what is called a correlogram. If the series is rhythmic, r decreases and increases regularly with a period equivalent to that present in the data series. At each lag, r is normalized by dividing by the data series’ variance, so it takes values between -1 and 1.
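The lagging procedure described above can be sketched in a few lines of Python. This is a minimal illustration using the common sample autocorrelation estimator (mean-centered products divided by the lag-0 sum of squares), not CAT's own code. The function name autocorrelation and the sine-wave test signal are assumptions made for the example.

```python
import math

def autocorrelation(series, max_lag):
    """Return r at lags 0..max_lag, normalized by the lag-0 sum of squares."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)  # N times the variance of the series
    if denom == 0:
        raise ValueError("series has zero variance")
    acf = []
    for lag in range(max_lag + 1):
        # Only the n - lag overlapping pairs contribute at this lag
        num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
        acf.append(num / denom)
    return acf

# Hypothetical test signal: 10 cycles of a sine wave, 24 points per cycle
signal = [math.sin(2 * math.pi * t / 24) for t in range(240)]
r = autocorrelation(signal, max_lag=len(signal) // 3)
```

For this rhythmic test signal, r starts at 1 at lag 0, dips to a negative value near half a cycle (lag 12), and returns to a positive peak near one full cycle (lag 24), which is the regular rise and fall seen in a correlogram.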
Each time the data set is lagged, the values at the two ends have no pair for the correlation calculation and are discarded; hence the power of the test gradually diminishes as the lag increases. For this reason, the usual limit of the autocorrelation computation is a lag of about N/3. The 95% confidence limits, and hence the threshold for judging a peak significant, are approximately ±2/√N, where N is the number of data points (Chatfield, 2003; Dowse, 2009).
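As a small worked example of these limits, the sketch below computes the N/3 maximum lag, the number of pairs still available at that lag, and the 2/√N bound. The function name correlogram_limits and the record dimensions are hypothetical and chosen only for illustration.

```python
import math

def correlogram_limits(n_points, lag_divisor=3):
    """Return (max_lag, pairs_at_max_lag, bound) for a series of n_points."""
    max_lag = n_points // lag_divisor    # usual limit of about N/3
    pairs_at_max_lag = n_points - max_lag  # pairs left after discarding the ends
    bound = 2 / math.sqrt(n_points)      # approximate 95% limit, applied as +/- bound
    return max_lag, pairs_at_max_lag, bound

# Hypothetical record: 20 days in 20-minute bins (72 bins per day)
n = 20 * 72
max_lag, pairs, bound = correlogram_limits(n)
print(f"N={n}, max lag={max_lag} bins, pairs at max lag={pairs}, bound=+/-{bound:.4f}")
```

Stopping at N/3 means that even the last computed coefficient is still based on two thirds of the original pairs.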
Interpretation: Chatfield, a recognized expert in time-series analysis, calls the correlogram "probably the most useful tool in time-series analysis after the time plot." He also says that interpreting a correlogram "is one of the hardest tasks in time-series analysis" (Chatfield, 2003). A very rough rule of thumb is that repeated peaks exceeding the confidence level indicate periodicity (see Figures 11-13). Some authors suggest using the third peak to judge statistical significance: the rhythm is statistically significant if the third peak is above the dotted significance line on the correlogram (Dowse, 2007; 2009; Levine et al., 2002). A great deal of variation between cycles in the data, or a decrease in amplitude, will cause the correlogram to decay more rapidly than it would for a regular series.
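The third-peak rule of thumb can be sketched as follows. This is an illustrative reading of the rule, not CAT's implementation. The text does not say whether the peak at lag 0 counts as the first peak, so the sketch excludes it by default and exposes a hypothetical count_lag_zero flag for the other convention. The demo correlogram is a synthetic damped cosine, and a real, noisy correlogram would likely need smoothing before simple local-maximum detection is usable.

```python
import math

def local_peaks(acf):
    """Lags k >= 1 at which acf[k] is a local maximum."""
    return [k for k in range(1, len(acf) - 1)
            if acf[k - 1] < acf[k] >= acf[k + 1]]

def third_peak_significant(acf, n_points, count_lag_zero=False):
    """True if the third correlogram peak exceeds the 2/sqrt(N) bound."""
    bound = 2 / math.sqrt(n_points)
    # Assumption: lag 0 is not counted as a peak unless count_lag_zero is set
    peaks = ([0] if count_lag_zero else []) + local_peaks(acf)
    if len(peaks) < 3:
        return False
    return acf[peaks[2]] > bound

# Synthetic correlogram: slowly damped cosine with a 24-point period
demo = [math.exp(-k / 200) * math.cos(2 * math.pi * k / 24) for k in range(81)]
peaks = local_peaks(demo)
significant = third_peak_significant(demo, n_points=240)
```

A correlogram that decays quickly, as described above for irregular data, would fail this check because its third peak would fall below the bound or not appear at all.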
Results from the CAT Autocorrelation functions can be seen in Figures 11 through 13. In Figure 11, a time series of 20 days of mouse activity data was correlated with itself. The repeated peaks, well exceeding the dashed 95% confidence lines and showing very little attenuation, indicate a strong rhythm with an estimated period of 23.833 hr. CAT displays the lag of the first maximum peak as an estimated period. An accompanying periodogram, performed on the autocorrelation results, provides the actual estimate of the period. The maximum lag used for the calculation is shown. The maximum lag as a percent of the whole is a parameter selected by the user, N/3 in this case. All parameters are displayed on the plot, including the data filename and timestamp, the number of points in a bin, the bin length in minutes, the bins per day, and the length of the full time series. The solid line is for reference in the figures herein (not in CAT), estimating the (normalized) height and attenuation of subsequent peaks.
Figure 12 is from a related data set from the same animal (#4), but contains only 2 days of data, and demonstrates the problems resulting from short time series. CAT reports an error and does not accept an input file containing fewer than 3 days of data. Because the autocorrelation is computed only up to a lag of N/3 of the submitted data bins (due to the lagging described above), the correlogram spans only 48/3 = 16 hr, so a period near 24 hr cannot appear in it. The resulting estimate of a 15.5-hr period is clearly incorrect.
The plots in Figure 13 use the same data as in Figure 11, but with different bin sizes, which result in different period estimates. Using autocorrelation, point estimates of the period can get closer to the actual period with smaller bin sizes, but period uncertainty remains the same, depending solely on record length. Other factors such as noise and waveform can also affect this relationship. Note that the power and confidence intervals also vary between the correlograms.
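One reason bin size affects the point estimate is that a lag-based period can only be a whole number of bins. The sketch below illustrates this quantization. The function names, the 23.8-hr "true" period, and the bin sizes are hypothetical values chosen for the example, not taken from the figures.

```python
def lag_to_hours(lag_bins, bin_minutes):
    """Convert a lag expressed in bins to hours."""
    return lag_bins * bin_minutes / 60

def nearest_lag_estimate(true_period_hours, bin_minutes):
    """Closest period (in hours) that a whole-bin lag can represent."""
    lag = round(true_period_hours * 60 / bin_minutes)
    return lag_to_hours(lag, bin_minutes)

# Hypothetical true period of 23.8 hr, examined at several bin sizes
for bin_minutes in (60, 20, 10, 2):
    estimate = nearest_lag_estimate(23.8, bin_minutes)
    print(f"{bin_minutes:>2}-min bins: nearest estimate {estimate:.3f} hr")
```

Smaller bins allow the reported lag to land closer to the underlying period, but, as noted above, this does not reduce the period uncertainty, which depends on record length.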
For comparison, Figure 14 shows a plot of random data. Chatfield advises familiarity with correlograms, using model data from known sources as well as real data, as the best way to learn to interpret them. For detailed specifications of the autocorrelation function used in CAT, see Venables and Ripley (2002).