Technical Discussions: Careful with Trends in EXCEL

Trendlines are a valuable tool when working with EXCEL.
Among other things, they can be used to determine whether your experimental data follow a certain functional trend (like, linear, polynomial, exponential, etc.).

In order to decide such questions, you can display the data with a sketch of a trendline in charts, display the functional formula and also the regression factor R² (square of Pearson's coefficient).

R² is a real number between 0 and 1 (inclusively). The closer to 1 the better is the indication, that the selected trendline mirrors the distribution of your data.

There also exist worksheet functions to calculate R² and the coefficients of the trendline function directly (PEARSON, GROWTH, LINEST, etc.)

So far, so well known.

Now, here are issues which equally apply to exponential, power and logarithmic trendlines:

In EXCEL charts, the displayed R² is not the one pertaining to your data pairs (x,y). Check this by using the CORREL function in your worksheet!
The calculated trendline is not best fit in all cases. This issue cannot be overcome easily, because not only is the trendline in the chart suboptimal, but also the worksheet functions (GROWTH) yield the same.
Even if your data (y values) are all negative, EXCEL refuses to generate a trendline (exponential and power case). This is a direct consequence of the solution strategy EXCEL uses - mathematically it is absurd.

Issue 1

Issue 1 is documented in EXCEL, though not easy to find nor easy to understand. And Microsoft also takes the position, that everything works as designed ...

But what is really happening? Let us take the exponential trendline case.
If data follow an exponential distribution, then the logarith of these data follow a linear distribution. In other words: if y ~ f(x) = b*EXP(a*x) for the data pairs (x,y), then also
LN(y) ~ LN(f(x)) = a*x + LN(b)
Based on this fact, EXCEL determines a straight line that best fits the data pairs [x, LN(y)].
The algorithm yields a and LN(b).
Now however, R² is being calculated between the two sets {LN(y0), LN(y1), ..., LN(yn)} and {LN(f(x1)), LN(f(x2)), ..., LN(f(xn))}.

In other words:
The R² in a chart with an exponential trend line tells you how well a linear fit is for the logarithms of your data. No more, no less.

If you take those calculated values a and b and generate a data row b*EXP(a*x), and then calculate the R² between these data and your y-data in the data sheet ... that value will be different!
... And as far as I am concerned, this R² value I would be interested in, when I want to decide, whether my data are adequately represented by the function.

Issue 2

But in reality things are worse.
The values a and b calculated by EXCEL are not the best ones (generally).
EXCEL's problem solving strategy (linear approximation of the data logarithms) is fast, simple and suboptimal.

Issue 3

From a mathematical point of view, strictly negative data (y-) values can certainly follow an exponential trend aEXP(bx), just with a negative "a"!
But because EXCEL uses logarithms, its solution strategy cannot work in those cases (logarithms of negative real numbers do not exist). A standard least squares algorithm like Levenberg-Marquardt's in contrast has no problem at all.
However, if you have different signs among your y-values, there cannot exist an exponential approximation, because aEXP(bx) is either positive or negative for all x - it only depends on the sign of the constant "a".

Summary

If you want to be certain about exponential, power or logarithmic distributions burried in you data, use something else than EXCEL.
I have attached a picture demonstrating what I wrote here.

2 Kommentare:

Unknown1. August 2014 um 08:12
An interesting question is:
Can it be somehow predicted when a linear approximation of data logarithms will yield a suboptimal exponential function?

I would appreciate your answer!

If one asks:
How can such a seemingly obvious and elegant strategy go wrong at all?

The answer is intuitively easy to give:
Logarithm reduces large quantities drastically - the larger the more. If your data tend to be more erratic for larger values, then the log function will even this out, "disguise" that fact. Whereas in relation scattering for small values is taken more seriously.
In a way, logarithm "distorts" reality and EXCEL will then solve the wrong problem.

Nevertheless I would prefer a more quantitative prediction method for when exactly this is the case.
A possible way seems to be a piece-wise approximation:
Divide your set of data pairs in n subsets. Then let EXCEL calculate the trend for each subset. If the R-squares for each subset follow a trend, then be careful to trust it.

Another way to prevent writing your own Levenberg-Marquardt algorithm:
Experiment with the value where b*EXP(a*x) intersects the y-axis (maybe by using SOLVER?).
AntwortenLöschen
Antworten
Unknown10. August 2014 um 10:54
I have found a way to predict, whether EXCEL's exponential trend is suboptimal:

[For simplicity let us assume the case, that the constants "a" and "b" in the trendline a*EXP(b*x) are positive, i.e. y values increase when x increases.]

(1) Take the constants "a" and "b" that the worksheet function delivers for your exponential trend and successively calculate the variances:

v2 = sum((y1 - a*EXP(b*x1))² + (y2 - a*EXP(b*x2))²)/2
v3 = sum( ... + (y3 - a*EXP(b*x3))²)/3
...
vn = sum ( ... + (yn - a*EXP(b*xn))²)/n

If you see a growing trend for the series of variance values v (use a linear trendline for this) then you can assume, that EXCEL has delivered a suboptimal exponential trend for your data pairs.

(2) Use SOLVER to find a better solution for "a" and "b", but be sure to use a high enough number of iterations (in my case 5000 worked, where 100 did not).
AntwortenLöschen
Antworten

Kommentar hinzufügen

Dienstag, 29. Juli 2014

Careful with Trends in EXCEL