Dienstag, 29. Juli 2014

Careful with Trends in EXCEL

Trendlines are a valuable tool when working with EXCEL.
Among other things, they can be used to determine whether your experimental data follow a certain functional trend (like, linear, polynomial, exponential, etc.).

In order to decide such questions, you can display the data with a sketch of a trendline in charts, display the functional formula and also the regression factor R² (square of Pearson's coefficient).

R² is a real number between 0 and 1 (inclusively). The closer to 1 the better is the indication, that the selected trendline mirrors the distribution of your data.

There also exist worksheet functions to calculate R² and the coefficients of the trendline function directly (PEARSON, GROWTH, LINEST, etc.)

So far, so well known.

Now, here are issues which equally apply to exponential, power and logarithmic trendlines:

  1. In EXCEL charts, the displayed R² is not the one pertaining to your data pairs (x,y). Check this by using the CORREL function in your worksheet!
  2. The calculated trendline is not best fit in all cases. This issue cannot be overcome easily, because not only is the trendline in the chart suboptimal, but also the worksheet functions (GROWTH) yield the same.
  3. Even if your data (y values) are all negative, EXCEL refuses to generate a trendline (exponential and power case). This is a direct consequence of the solution strategy EXCEL uses - mathematically it is absurd.

Issue 1

Issue 1 is documented in EXCEL, though not easy to find nor easy to understand. And Microsoft also takes the position, that everything works as designed ...

But what is really happening? Let us take the exponential trendline case.
If data follow an exponential distribution, then the logarith of these data follow a linear distribution. In other words: if y ~ f(x) = b*EXP(a*x) for the data pairs (x,y), then also
LN(y) ~ LN(f(x)) = a*x + LN(b)
Based on this fact, EXCEL determines a straight line that best fits the data pairs [x, LN(y)].
The algorithm yields a and LN(b).
Now however, R² is being calculated between the two sets {LN(y0), LN(y1), ..., LN(yn)} and {LN(f(x1)), LN(f(x2)), ..., LN(f(xn))}.

In other words:
The R² in a chart with an exponential trend line tells you how well a linear fit is for the logarithms of your data. No more, no less.

If you take those calculated values a and b and generate a data row b*EXP(a*x), and then calculate the R² between these data and your y-data in the data sheet ... that value will be different!
... And as far as I am concerned, this R² value I would be interested in, when I want to decide, whether my data are adequately represented by the function.


Issue 2

But in reality things are worse.
The values a and b calculated by EXCEL are not the best ones (generally).
EXCEL's problem solving strategy (linear approximation of the data logarithms) is fast, simple and suboptimal.


Issue 3

From a mathematical point of view, strictly negative data (y-) values can certainly follow an exponential trend a*EXP(b*x), just with a negative "a"!
But because EXCEL uses logarithms, its solution strategy cannot work in those cases (logarithms of negative real numbers do not exist). A standard least squares algorithm like Levenberg-Marquardt's in contrast has no problem at all.
However, if you have different signs among your y-values, there cannot exist an exponential approximation, because a*EXP(b*x) is either positive or negative for all x - it only depends on the sign of the constant "a".



Summary

If you want to be certain about exponential, power or logarithmic distributions burried in you data, use something else than EXCEL.
I have attached a picture demonstrating what I wrote here.