guide:51a7bbf823: Difference between revisions

In [[wikipedia:mathematics|mathematics]] and [[wikipedia:statistics|statistics]], a '''stationary process''' (or a '''strict/strictly stationary process''' or '''strong/strongly stationary process''') is a [[wikipedia:stochastic process|stochastic process]] whose unconditional [[wikipedia:joint probability distribution|joint probability distribution]] does not change when shifted in time.<ref>{{Cite book|title=Markov Chains: From Theory to Implementation and Experimentation|last=Gagniuc|first=Paul A.|publisher=John Wiley & Sons|year=2017|isbn=978-1-119-38755-8|location=USA, NJ|pages=1–256}}</ref> Consequently, parameters such as [[wikipedia:mean|mean]] and [[wikipedia:variance|variance]] also do not change over time. To get an intuition of stationarity, one can imagine a [[wikipedia:Friction|frictionless]] [[wikipedia:Pendulum (mechanics)|pendulum]]. It swings back and forth in an oscillatory motion, yet the [[wikipedia:amplitude|amplitude]] and [[wikipedia:frequency|frequency]] remain constant. Although the pendulum is moving, the process is stationary as its "[[wikipedia:Statistic|statistics]]" are constant (frequency and amplitude). However, if a [[wikipedia:force|force]] were to be applied to the pendulum (for example, friction with the air), either the frequency or amplitude would change, thus making the process ''non-stationary.''<ref>{{Cite journal|last=Laumann|first=Timothy O.|last2=Snyder|first2=Abraham Z.|last3=Mitra|first3=Anish|last4=Gordon|first4=Evan M.|last5=Gratton|first5=Caterina|last6=Adeyemo|first6=Babatunde|last7=Gilmore|first7=Adrian W.|last8=Nelson|first8=Steven M.|last9=Berg|first9=Jeff J.|last10=Greene|first10=Deanna J.|last11=McCarthy|first11=John E.|date=2016-09-02|title=On the Stability of BOLD fMRI Correlations|url=https://doi.org/10.1093/cercor/bhw265|journal=Cerebral Cortex|doi=10.1093/cercor/bhw265|issn=1047-3211|pmc=6248456|pmid=27591147}}</ref>
'''Exponential smoothing''' is a [[wikipedia:rule of thumb|rule of thumb]] technique for smoothing [[wikipedia:time series|time series]] data using the exponential [[wikipedia:window function|window function]]. Whereas in the [[wikipedia:simple moving average|simple moving average]] the past observations are weighted equally, exponential functions are used to assign exponentially decreasing weights over time. It is an easily learned and easily applied procedure for making some determination based on prior assumptions by the user, such as seasonality. Exponential smoothing is often used for analysis of time-series data.


Since stationarity is an assumption underlying many statistical procedures used in [[wikipedia:time series analysis|time series analysis]], non-stationary data are often transformed to become stationary. The most common cause of violation of stationarity is a trend in the mean, which can be due either to the presence of a [[wikipedia:unit root|unit root]] or of a deterministic trend. In the former case of a unit root, stochastic shocks have permanent effects, and the process is not [[wikipedia:mean reversion (finance)|mean-reverting]]. In the latter case of a deterministic trend, the process is called a [[wikipedia:trend-stationary process|trend-stationary process]], and stochastic shocks have only transitory effects after which the variable tends toward a deterministically evolving (non-constant) mean.
Exponential smoothing is one of many [[wikipedia:window functions|window functions]] commonly applied to smooth data in [[wikipedia:signal processing|signal processing]], acting as [[wikipedia:low-pass filter|low-pass filter]]s to remove high-frequency [[wikipedia:noise|noise]]. This method is preceded by [[wikipedia:Siméon Denis Poisson|Poisson]]'s use of recursive exponential window functions in convolutions from the 19th century, as well as [[wikipedia:Kolmogorov–Zurbenko filter|Kolmogorov and Zurbenko's use of recursive moving averages]] from their studies of turbulence in the 1940s.


A trend stationary process is not strictly stationary, but can easily be transformed into a stationary process by removing the underlying trend, which is solely a function of time. Similarly, processes with one or more unit roots can be made stationary through differencing. An important type of non-stationary process that does not include a trend-like behavior is a [[wikipedia:cyclostationary process|cyclostationary process]], which is a stochastic process that varies cyclically with time.
The raw data sequence is often represented by <math>\{x_t\}</math> beginning at time <math>t = 0</math>, and the output of the exponential smoothing algorithm is commonly written as <math>\{s_t\}</math>, which may be regarded as a best estimate of what the next value of <math>x</math> will be.  When the sequence of observations begins at time <math>t = 0</math>, the simplest form of exponential smoothing is given by the formulas:<ref name=NIST />


==Strict-sense stationarity==
<math display = "block">
\begin{align*}
s_0& = x_0\\
s_t & = \alpha x_{t} + (1-\alpha)s_{t-1},\quad t>0
\end{align*}
</math>


===Definition===
where <math>\alpha</math> is the ''smoothing factor'', and <math>0 < \alpha < 1</math>.
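The recursion above can be sketched in a few lines of Python. This is only an illustration: the function name <code>simple_exponential_smoothing</code> is hypothetical, and the initialization <math>s_0 = x_0</math> follows the formula above.

```python
def simple_exponential_smoothing(x, alpha):
    """Return the smoothed sequence s_0, ..., s_t for observations x.

    Hypothetical helper: s_0 = x_0 and s_t = alpha*x_t + (1 - alpha)*s_{t-1}.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if not x:
        return []
    s = [x[0]]  # s_0 = x_0
    for x_t in x[1:]:
        # Weighted average of the new observation and the previous smoothed value.
        s.append(alpha * x_t + (1 - alpha) * s[-1])
    return s


smoothed = simple_exponential_smoothing([3.0, 5.0, 9.0, 20.0], alpha=0.5)
print(smoothed)  # [3.0, 4.0, 6.5, 13.25]
```

With <math>\alpha = 0.5</math> each smoothed value is the midpoint between the new observation and the previous smoothed value, which makes the output easy to check by hand.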


Formally, let <math>\left\{X_t\right\}</math> be a [[wikipedia:stochastic process|stochastic process]] and let <math>F_{X}(x_{t_1 + \tau}, \ldots, x_{t_n + \tau})</math> represent the [[wikipedia:cumulative distribution function|cumulative distribution function]] of the [[wikipedia:marginal distribution|unconditional]] (i.e., with no reference to any particular starting value) [[wikipedia:joint distribution|joint distribution]] of <math>\left\{X_t\right\}</math> at times <math>t_1 + \tau, \ldots, t_n + \tau</math>. Then, <math>\left\{X_t\right\}</math> is said to be '''strictly stationary''', '''strongly stationary''' or '''strict-sense stationary''' if<ref name=KunIlPark>{{cite book | author=Park,Kun Il| title=Fundamentals of Probability and Stochastic Processes with Applications to Communications| publisher=Springer | year=2018 | isbn=978-3-319-68074-3}}</ref>{{rp|p. 155}}
==Basic (simple) exponential smoothing==
The use of the exponential window function is first attributed to [[wikipedia:Siméon Denis Poisson|Poisson]]<ref name="Oppenheim, Alan V. 1975 5">{{cite book |title=Digital Signal Processing |year=1975 |publisher=[[wikipedia:Prentice Hall|Prentice Hall]] |isbn=0-13-214635-5 |author=Oppenheim, Alan V. |author2=Schafer, Ronald W.  |page= 5}}</ref> as an extension of a numerical analysis technique from the 17th century, and later adopted by the [[wikipedia:signal processing|signal processing]] community in the 1940s. Here, exponential smoothing is the application of the exponential, or Poisson, [[wikipedia:window function|window function]]. Exponential smoothing was first suggested in the statistical literature without citation to previous work by [[wikipedia:Robert Goodell Brown|Robert Goodell Brown]] in 1956,<ref>{{cite book|last=Brown|first=Robert G.|title=Exponential Smoothing for Predicting Demand|year=1956|publisher=Arthur D. Little Inc.|location=Cambridge, Massachusetts|pages=15|url=http://legacy.library.ucsf.edu/tid/dae94e00;jsessionid=104A0CEFFA31ADC2FA5E0558F69B3E1D.tobacco03}}</ref> and then expanded by [[wikipedia:Charles C. Holt|Charles C. Holt]] in 1957.<ref>{{cite journal|title=Forecasting Trends and Seasonal by Exponentially Weighted Averages|first=Charles C.|last=Holt|author-link=Charles C. Holt|journal=Office of Naval Research Memorandum|volume=52|year=1957}} reprinted in {{cite journal|title=Forecasting Trends and Seasonal by Exponentially Weighted Averages|first=Charles C.|last=Holt|author-link=Charles C. Holt|journal=[[wikipedia:International Journal of Forecasting|International Journal of Forecasting]] |volume=20 |issue=1 |date=January–March 2004|pages=5–10|doi=10.1016/j.ijforecast.2003.09.015}}</ref> The formulation below, which is the one commonly used, is attributed to Brown and is known as "Brown’s simple exponential smoothing".<ref>{{cite book|title=Smoothing Forecasting and Prediction of Discrete Time Series |last=Brown |first=Robert Goodell |year=1963 |publisher=Prentice-Hall|location=Englewood Cliffs, NJ |url = http://babel.hathitrust.org/cgi/pt?id=mdp.39015004514728;view=1up;seq=1}}</ref> All the methods of Holt, Winters and Brown may be seen as a simple application of recursive filtering, first found in the 1940s<ref name="Oppenheim, Alan V. 1975 5"/> to convert [[wikipedia:finite impulse response|finite impulse response]] (FIR) filters to [[wikipedia:infinite impulse response|infinite impulse response]] filters.


<math display = "block">\begin{equation}\label{sss} F_{X}(x_{t_1+\tau} ,\ldots, x_{t_n+\tau}) = F_{X}(x_{t_1},\ldots, x_{t_n}) \quad \text{for all } \tau,t_1, \ldots, t_n \in \mathbb{R} \text{ and for all } n \in \mathbb{N}\end{equation}</math>
The simplest form of exponential smoothing is given by the formula:


Since <math>\tau</math> does not affect <math>F_X(\cdot)</math>, <math> F_{X}</math> is not a function of time.
<math display = "block">s_t = \alpha x_t + (1-\alpha) s_{t-1} = s_{t-1} + \alpha (x_t - s_{t-1}).</math>


===Examples===
where <math>\alpha</math> is the ''smoothing factor'', and <math>0 \le \alpha \le 1</math>. In other words, the smoothed statistic <math>s_t</math> is a simple weighted average of the current observation <math>x_t</math> and the previous smoothed statistic <math>s_{t-1}</math>. Simple exponential smoothing is easily applied, and it produces a smoothed statistic as soon as two observations are available.
[[File:Stationarycomparison.png|thumb|right|390px|Two simulated time series processes, one stationary and the other non-stationary, are shown above. The [[wikipedia:Augmented Dickey-Fuller test|augmented Dickey–Fuller]] (ADF) [[wikipedia:test statistic|test statistic]] is reported for each process; non-stationarity cannot be rejected for the second process at a 5% [[wikipedia:significance level|significance level]].]]
The term ''smoothing factor'' applied to <math>\alpha</math> here is something of a misnomer, as larger values of <math>\alpha</math> actually reduce the level of smoothing, and in the limiting case with <math>\alpha</math> = 1 the output series is just the current observation. Values of <math>\alpha</math> close to one have less of a smoothing effect and give greater weight to recent changes in the data, while values of <math>\alpha</math> closer to zero have a greater smoothing effect and are less responsive to recent changes.
[[wikipedia:White noise|White noise]] is the simplest example of a stationary process.


An example of a [[wikipedia:Discrete-time stochastic process|discrete-time]] stationary process where the sample space is also discrete (so that the random variable may take one of ''N'' possible values) is a [[wikipedia:Bernoulli scheme|Bernoulli scheme]]. Other examples of a discrete-time stationary process with continuous sample space include some [[wikipedia:autoregressive|autoregressive]] and [[wikipedia:moving average model|moving average]] processes which are both subsets of the [[wikipedia:autoregressive moving average model|autoregressive moving average model]]. Models with a non-trivial autoregressive component may be either stationary or non-stationary, depending on the parameter values, and important non-stationary special cases are where [[wikipedia:unit root|unit root]]s exist in the model.
There is no formally correct procedure for choosing <math>\alpha</math>. Sometimes the statistician's judgment is used to choose an appropriate factor. Alternatively, a statistical technique may be used to ''optimize'' the value of <math>\alpha</math>. For example, the [[wikipedia:least squares|method of least squares]] might be used to determine the value of <math>\alpha</math> for which the sum of the quantities <math>(s_t - x_{t+1})^2</math> is minimized.<ref name=NIST6431>{{cite web|url=http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc431.htm|title=NIST/SEMATECH e-Handbook of Statistical Methods, 6.4.3.1. Single Exponential Smoothing|access-date=2017-07-05|publisher=NIST}}</ref>


====Example 1====
Unlike some other smoothing methods, such as the simple moving average, this technique does not require any minimum number of observations to be made before it begins to produce results. In practice, however, a "good average" will not be achieved until several samples have been averaged together; for example, a constant signal will take approximately <math>3 / \alpha</math> stages to reach 95% of the actual value. To accurately reconstruct the original signal without information loss all stages of the exponential moving average must also be available, because older samples decay in weight exponentially. This is in contrast to a simple moving average, in which some samples can be skipped without as much loss of information due to the constant weighting of samples within the average. If a known number of samples will be missed, one can adjust a weighted average for this as well, by giving equal weight to the new sample and all those to be skipped.
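The <math>3/\alpha</math> rule of thumb can be checked with a short sketch. It assumes, for illustration only, that the smoother starts from 0 and is then fed a constant input of 1 (a unit step); with the initialization <math>s_0 = x_0</math> a constant signal would be reproduced exactly from the first sample. The helper name <code>steps_to_reach</code> is hypothetical.

```python
def steps_to_reach(fraction, alpha):
    """Count updates until the smoothed value reaches `fraction` of a unit step.

    Assumption for illustration: the smoother starts at 0 and the input is
    the constant 1, so after n updates s_n = 1 - (1 - alpha)**n.
    """
    s, n = 0.0, 0
    while s < fraction:
        s = alpha * 1.0 + (1 - alpha) * s
        n += 1
    return n


for alpha in (0.5, 0.1):
    # Compare the exact step count with the 3/alpha approximation.
    print(alpha, steps_to_reach(0.95, alpha), 3 / alpha)
```

For <math>\alpha = 0.1</math> the count is 29 updates against an estimate of 30; the approximation comes from <math>(1-\alpha)^n \approx e^{-\alpha n}</math> and <math>e^{-3} \approx 0.05</math>.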
Let <math>Y</math> be any scalar [[wikipedia:random variable|random variable]], and define a time-series <math>\left\{X_t\right\}</math>, by
<math>X_t=Y</math> for all <math>t</math>. Then <math>\left\{X_t\right\}</math> is a stationary time series, for which realisations consist of a series of constant values, with a different constant value for each realisation. A [[wikipedia:law of large numbers|law of large numbers]] does not apply in this case, as the limiting value of an average from a single realisation takes the random value determined by <math>Y</math>, rather than taking the [[wikipedia:expected value|expected value]] of <math>Y</math>.


The time average of <math>X_t</math> does not converge to the expected value of <math>Y</math>, since the process is not [[wikipedia:Ergodic process|ergodic]].
This simple form of exponential smoothing is also known as an [[wikipedia:moving average#Exponential moving average|exponentially weighted moving average]] (EWMA). Technically it can also be classified as an [[wikipedia:autoregressive integrated moving average|autoregressive integrated moving average]] (ARIMA) (0,1,1) model with no constant term.<ref>{{cite web |url=http://www.duke.edu/~rnau/411avg.htm |title=Averaging and Exponential Smoothing Models |first=Robert |last=Nau |access-date=26 July 2010}}</ref>


====Example 2====
===Time constant===
As a further example of a stationary process for which any single realisation has an apparently noise-free structure, let <math>Y</math> have a [[wikipedia:Uniform distribution (continuous)|uniform distribution]] on <math>(0,2\pi]</math> and define the time series <math>\left\{X_t\right\}</math> by
The [[wikipedia:time constant|time constant]] of an exponential moving average is the amount of time for the smoothed response of a [[wikipedia:unit step function|unit step function]] to reach <math>1-1/e \approx 63.2\,\%</math> of the original signal. The relationship between this time constant, <math> \tau </math>, and the smoothing factor, <math> \alpha </math>, is given by the formula:
<math display="block">X_t=\cos (t+Y) \quad \text{ for } t \in \mathbb{R}. </math>
Then <math>\left\{X_t\right\}</math> is strictly stationary since (<math> (t+ Y) </math>  modulo <math>  2 \pi </math>) follows the same uniform distribution as <math> Y </math> for any <math> t </math>.


==== Example 3 ====
<math display = "block">\alpha = 1 - e^{-\Delta T/\tau}</math>, thus  <math>\tau = - \frac{\Delta T}{\ln(1 - \alpha)}</math> where <math>\Delta T</math> is the sampling time interval of the discrete time implementation. If the sampling time is fast compared to the time constant (<math>\Delta T \ll \tau</math>) then
Keep in mind that a [[wikipedia:white noise|white noise]] is not necessarily strictly stationary. Let <math>\omega</math> be a random variable uniformly distributed in the interval <math>(0, 2\pi)</math> and define the time series <math>\left\{z_t\right\}</math>


<math display = "block">z_t=\cos(t\omega) \quad (t=1,2,...) </math>
<math display = "block">\alpha \approx \frac{\Delta T} \tau </math>
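As a rough illustration of the relationship between <math>\alpha</math>, <math>\tau</math> and <math>\Delta T</math>, the two formulas can be evaluated directly. The function names and the sample values <code>dt = 0.01</code> and <code>tau = 1.0</code> are assumptions chosen for the example.

```python
import math


def alpha_from_time_constant(dt, tau):
    """Smoothing factor for sampling interval dt and time constant tau."""
    return 1.0 - math.exp(-dt / tau)


def time_constant_from_alpha(dt, alpha):
    """Inverse relation: tau = -dt / ln(1 - alpha)."""
    return -dt / math.log(1.0 - alpha)


dt, tau = 0.01, 1.0  # illustrative values with dt much smaller than tau
alpha = alpha_from_time_constant(dt, tau)
print(alpha, dt / tau)                       # exact value vs. small-dt approximation
print(time_constant_from_alpha(dt, alpha))   # recovers tau up to rounding
```

When the sampling interval is small relative to the time constant, the exact value and the approximation <math>\Delta T / \tau</math> agree to within a fraction of a percent.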


===Choosing the initial smoothed value===
Note that in the definition above, <math>s_0</math> is being initialized to <math>x_0</math>. Because exponential smoothing requires that at each stage we have the previous forecast, it is not obvious how to get the method started.  We could assume that the initial forecast is equal to the initial value of demand; however, this approach has a serious drawback. Exponential smoothing puts substantial weight on past observations, so the initial value of demand will have an unreasonably large effect on early forecasts. This problem can be overcome by allowing the process to evolve for a reasonable number of periods (10 or more) and using the average of the demand during those periods as the initial forecast. There are many other ways of setting this initial value, but it is important to note that the smaller the value of <math>\alpha</math>, the more sensitive your forecast will be to the selection of this initial smoother value <math>s_0</math>.<ref>"Production and Operations Analysis" Nahmias. 2009.</ref><ref>Čisar, P., & Čisar, S. M. (2011). "Optimization methods of EWMA statistics." ''Acta Polytechnica Hungarica'', 8(5), 73–87. Page 78.</ref>

Then <math display = "block">
\begin{align*}
\mathbb{E}(z_t) &= \frac{1}{2\pi} \int_0^{2\pi} \cos(t\omega) \,d\omega = 0,\\
\operatorname{Var}(z_t)        &= \frac{1}{2\pi} \int_0^{2\pi} \cos^2(t\omega) \,d\omega = 1/2,\\
\operatorname{Cov}(z_t , z_j)  &= \frac{1}{2\pi} \int_0^{2\pi} \cos(t\omega)\cos(j\omega) \,d\omega = 0 \quad \forall t\neq j.
\end{align*}
</math>
So <math>\{z_t\}</math> is a white noise; however, it is not strictly stationary.
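The moments of <math>z_t = \cos(t\omega)</math> can be checked numerically with a midpoint-rule average over <math>\omega</math>. The sketch below is an illustration rather than part of the cited argument: the helper name <code>average_over_omega</code> is hypothetical, and the comparison of the third-order moments <math>\mathbb{E}(z_1^2 z_2)</math> and <math>\mathbb{E}(z_2^2 z_3)</math> is one possible way to see that the joint distribution changes under a time shift.

```python
import math


def average_over_omega(f, n=10_000):
    """Midpoint-rule approximation of (1 / 2*pi) * integral of f over (0, 2*pi)."""
    h = 2 * math.pi / n
    return sum(f((k + 0.5) * h) for k in range(n)) / n


# First and second moments: identical for every t, uncorrelated across times.
mean_3 = average_over_omega(lambda w: math.cos(3 * w))
var_3 = average_over_omega(lambda w: math.cos(3 * w) ** 2)
cov_2_5 = average_over_omega(lambda w: math.cos(2 * w) * math.cos(5 * w))

# A third-order joint moment that changes when both times are shifted by 1.
m_12 = average_over_omega(lambda w: math.cos(w) ** 2 * math.cos(2 * w))
m_23 = average_over_omega(lambda w: math.cos(2 * w) ** 2 * math.cos(3 * w))

print(mean_3, var_3, cov_2_5)  # approximately 0, 0.5, 0
print(m_12, m_23)              # approximately 0.25 and 0
```

Because <math>\mathbb{E}(z_1^2 z_2) = 1/4</math> while <math>\mathbb{E}(z_2^2 z_3) = 0</math>, the joint distribution of <math>(z_1, z_2)</math> differs from that of <math>(z_2, z_3)</math>, which is incompatible with strict stationarity.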


==<math>N</math><sup>th</sup>-order stationarity==
===Optimization===
For every exponential smoothing method we also need to choose the value for the smoothing parameters. For simple exponential smoothing, there is only one smoothing parameter (''α''), but for the methods that follow there is usually more than one smoothing parameter.


In <math>\ref{sss}</math>, the distribution of <math>n</math> samples of the stochastic process must be equal to the distribution of the samples shifted in time ''for all'' <math>n</math>. <math>N</math><sup>th</sup> order stationarity is a weaker form of stationarity where this is only required for all <math>n</math> up to a certain order <math>N</math>. A random process <math>\left\{X_t\right\}</math> is said to be <math>N</math><sup>th</sup> order stationary if:<ref name=KunIlPark/>{{rp|p. 152}}
There are cases where the smoothing parameters may be chosen in a subjective manner – the forecaster specifies the value of the smoothing parameters based on previous experience. However, a more robust and objective way to obtain values for the unknown parameters included in any exponential smoothing method is to estimate them from the observed data.


<math display = "block">
F_{X}(x_{t_1+\tau} ,\ldots, x_{t_n+\tau}) = F_{X}(x_{t_1},\ldots, x_{t_n}) \quad \text{for all } \tau,t_1, \ldots, t_n \in \mathbb{R} \text{ and for all } n \in \{1,\ldots,N\}
</math>

The unknown parameters and the initial values for any exponential smoothing method can be estimated by minimizing the [[wikipedia:Sum of squared errors of prediction|sum of squared errors]] (SSE). The errors are specified as <math>e_t=y_t-\hat{y}_{t\mid t-1}</math> for <math> t=1, \ldots,T</math> (the one-step-ahead within-sample forecast errors). Hence we find the values of the unknown parameters and the initial values that minimize


==Weak or wide-sense stationarity==
<math display = "block"> \text{SSE} = \sum_{t=1}^T (y_t-\hat{y}_{t\mid t-1})^2=\sum_{t=1}^T e_t^2</math><ref name="otexts.org">{{Cite book | url=https://www.otexts.org/fpp/7/1 | title=7.1 Simple exponential smoothing &#124; Forecasting: Principles and Practice}}</ref>


===Definition===
Unlike the regression case (where we have formulae to directly compute the regression coefficients which minimize the SSE) this involves a non-linear minimization problem and we need to use an [[wikipedia:Mathematical optimization|optimization]] tool to perform this.
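A minimal sketch of this minimization for simple exponential smoothing is given below. It makes several simplifying assumptions that are not part of the text: the initial level is fixed to the first observation instead of being estimated, the one-step-ahead forecast is taken to be the previous smoothed value, and a plain grid search stands in for a general-purpose optimization tool. The names <code>sse</code> and <code>best_alpha</code> and the sample data are hypothetical.

```python
def sse(y, alpha):
    """Sum of squared one-step-ahead forecast errors for a given alpha.

    Assumption: the initial level is y[0], and the forecast of y_t made at
    time t-1 is the smoothed level available at time t-1.
    """
    level = y[0]
    total = 0.0
    for y_t in y[1:]:
        error = y_t - level          # e_t = y_t - forecast made at t-1
        total += error ** 2
        level = alpha * y_t + (1 - alpha) * level
    return total


def best_alpha(y, grid_size=99):
    """Grid search over alpha in (0, 1); a stand-in for a real optimizer."""
    candidates = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return min(candidates, key=lambda a: sse(y, a))


y = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]  # made-up data
alpha_hat = best_alpha(y)
print(alpha_hat, sse(y, alpha_hat))
```

A finer grid or a numerical optimizer would refine the estimate; the grid search is used here only because it needs no external library.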


A weaker form of stationarity commonly employed in [[wikipedia:signal processing|signal processing]] is known as '''weak-sense stationarity''', '''wide-sense stationarity (WSS)''', or '''covariance stationarity'''. WSS random processes only require that the 1st [[wikipedia:moment (mathematics)|moment]] (i.e. the mean) and the [[wikipedia:autocovariance|autocovariance]] do not vary with respect to time and that the 2nd moment is finite for all times. Any strictly stationary process which has a finite [[wikipedia:mean|mean]] and [[wikipedia:covariance|covariance]] is also WSS.<ref name="Florescu2014">{{cite book|author=Ionut Florescu|title=Probability and Stochastic Processes|date=7 November 2014|publisher=John Wiley & Sons|isbn=978-1-118-59320-2}}</ref>{{rp|p. 299}}
==="Exponential" naming===
The name 'exponential smoothing' is attributed to the use of the exponential window function during convolution. It is no longer attributed to Holt, Winters & Brown.


So, a [[wikipedia:continuous time|continuous time]] [[wikipedia:random process|random process]] <math>\left\{X_t\right\}</math> which is WSS has the following restrictions on its mean function <math>m_X(t) \triangleq \operatorname E[X_t]</math> and [[wikipedia:autocovariance|autocovariance]] function <math>K_{XX}(t_1, t_2) \triangleq \operatorname E[(X_{t_1}-m_X(t_1))(X_{t_2}-m_X(t_2))]</math>:
By direct substitution of the defining equation for simple exponential smoothing back into itself we find that


<math display = "block">
\begin{align*}
& m_X(t) = m_X(t + \tau) & & \text{for all } \tau,t \in \mathbb{R} \\
& K_{XX}(t_1, t_2) = K_{XX}(t_1 - t_2, 0) & &  \text{for all } t_1,t_2 \in \mathbb{R} \\
& \operatorname E[|X_t|^2] < \infty & & \text{for all } t \in \mathbb{R}
\end{align*}
</math>

<math display = "block">
\begin{align*}
s_t& = \alpha x_t + (1-\alpha)s_{t-1}\\[3pt]
& = \alpha x_t + \alpha (1-\alpha)x_{t-1} + (1 - \alpha)^2 s_{t-2}\\[3pt]
& = \alpha \left[x_t + (1-\alpha)x_{t-1} + (1-\alpha)^2 x_{t-2} + (1-\alpha)^3 x_{t-3} + \cdots + (1-\alpha)^{t-1} x_1 \right]
+ (1-\alpha)^t x_0.
\end{align*}
</math>
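The expanded form of <math>s_t</math> can be compared against the recursion numerically. The following sketch uses hypothetical function names and made-up data; it only checks that the two expressions agree.

```python
def smooth_recursive(x, alpha):
    """Final smoothed value s_t from the recursion, with s_0 = x_0."""
    s = x[0]
    for x_t in x[1:]:
        s = alpha * x_t + (1 - alpha) * s
    return s


def smooth_expanded(x, alpha):
    """Final smoothed value s_t from the fully expanded weighted sum."""
    t = len(x) - 1
    # alpha * [x_t + (1-alpha) x_{t-1} + ... + (1-alpha)^(t-1) x_1]
    weighted = sum(alpha * (1 - alpha) ** k * x[t - k] for k in range(t))
    # The initial observation keeps the residual weight (1-alpha)^t.
    return weighted + (1 - alpha) ** t * x[0]


x = [2.0, 4.0, 1.0, 7.0, 3.0]  # made-up observations
print(smooth_recursive(x, 0.3), smooth_expanded(x, 0.3))
```

The weights <math>\alpha(1-\alpha)^k</math> together with the residual weight <math>(1-\alpha)^t</math> on <math>x_0</math> sum to one, so the result is a genuine weighted average of the observations.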


The first property implies that the mean function <math>m_X(t)</math> must be constant. The second property implies that the covariance function depends only on the ''difference'' between <math>t_1</math> and <math>t_2</math> and only needs to be indexed by one variable rather than two variables.<ref name=KunIlPark/>{{rp|p. 159}} Thus, instead of writing,
In other words, as time passes the smoothed statistic <math>s_t</math> becomes the weighted average of a greater and greater number of the observations <math>x_t, x_{t-1},\ldots, x_0</math>, and the weights assigned to previous observations are proportional to the terms of the geometric progression


<math display="block">\,\!K_{XX}(t_1 - t_2, 0)\,</math>
<math display = "block">1, (1-\alpha), (1-\alpha)^2,\ldots, (1-\alpha)^n,\ldots</math>


the notation is often abbreviated by the substitution <math>\tau = t_1 - t_2</math>:
A [[wikipedia:geometric progression|geometric progression]] is the discrete version of an [[wikipedia:exponential function|exponential function]], so this is where the name for this smoothing method originated according to [[wikipedia:Statistics|Statistics]] lore.


<math display="block">K_{XX}(\tau) \triangleq K_{XX}(t_1 - t_2, 0)</math>
===Comparison with moving average===
Exponential smoothing and moving average share the defect of introducing a lag relative to the input data. While this can be corrected by shifting the result by half the window length for a symmetrical kernel, such as a moving average or Gaussian, it is unclear how appropriate this would be for exponential smoothing. They also both have roughly the same distribution of forecast error when <math>\alpha = 2/(k+1)</math>. They differ in that exponential smoothing takes into account all past data, whereas moving average only takes into account <math>k</math> past data points. Computationally speaking, they also differ in that moving average requires that the past <math>k</math> data points, or the data point at lag <math>k+1</math> plus the most recent forecast value, be kept, whereas exponential smoothing only needs the most recent forecast value to be kept.<ref>{{cite book|last=Nahmias|first=Steven|title=Production and Operations Analysis|edition=6th|isbn=0-07-337785-6|date=3 March 2008}}{{Page needed|date=September 2011}}</ref>


This also implies that the [[wikipedia:autocorrelation|autocorrelation]] depends only on <math>\tau = t_1 - t_2</math>, that is
In the [[wikipedia:signal processing|signal processing]] literature, the use of non-causal (symmetric) filters is commonplace, and the exponential [[wikipedia:window function|window function]] is broadly used in this fashion, but a different terminology is used: exponential smoothing is equivalent to a first-order [[wikipedia:infinite-impulse response|infinite-impulse response]] (IIR) filter and moving average is equivalent to a [[wikipedia:finite impulse response filter|finite impulse response filter]] with equal weighting factors.


<math display="block">\,\! R_X(t_1,t_2) = R_X(t_1-t_2,0) \triangleq R_X(\tau).</math>
==Double exponential smoothing==
Simple exponential smoothing does not do well when there is a [[wikipedia:Trend estimation|trend]] in the data.<ref name=NIST>{{cite web|url=http://www.itl.nist.gov/div898/handbook/|title=NIST/SEMATECH e-Handbook of Statistical Methods|access-date=2010-05-23|publisher=NIST}}</ref> For such situations, several methods were devised under the name "double exponential smoothing" or "second-order exponential smoothing", so called because an exponential filter is applied recursively twice. This nomenclature is similar to quadruple exponential smoothing, which also references its recursion depth.<ref>{{cite web |url=http://help.sap.com/saphelp_45b/helpdata/en/7d/c27a14454011d182b40000e829fbfe/content.htm |title=Model: Second-Order Exponential Smoothing |author=<!--Staff writer(s); no by-line.--> |publisher=[[wikipedia:SAP AG|SAP AG]] |access-date=23 January 2013}}</ref>
The basic idea behind double exponential smoothing is to introduce a term to take into account the possibility of a series exhibiting some form of trend. This slope component is itself updated via exponential smoothing.


The third property says that the second moments must be finite for any time <math>t</math>.
One method works as follows:<ref>{{cite web |url= http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc433.htm |title=6.4.3.3. Double Exponential Smoothing |work=itl.nist.gov |access-date=25 September 2011}}</ref>


== Differencing ==
Again, the raw data sequence of observations is represented by <math>x_t</math>, beginning at time <math>t=0</math>. We use <math>s_t</math> to represent the smoothed value for time <math>t</math>, and <math>b_t</math> is our best estimate of the trend at time <math>t</math>. The output of the algorithm is now written as <math>F_{t+m}</math>, an estimate of the value of <math>x</math> at time <math>t+m</math>, for <math>m > 0</math>, based on the raw data up to time <math>t</math>. Double exponential smoothing is given by the formulas
One way to make some time series stationary is to compute the differences between consecutive observations. This is known as [[wikipedia:unit root|differencing]]. Differencing can help stabilize the mean of a time series by removing changes in the level of a time series, and so eliminating trends. This can also remove seasonality, if differences are taken appropriately (e.g. differencing observations 1 year apart to remove year-lo).


Transformations such as logarithms can help to stabilize the variance of a time series.
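As an illustration of differencing, the sketch below removes a linear trend from a simulated series. The helper name <code>difference</code>, the slope of 0.5, the noise level and the random seed are all assumptions made for the example.

```python
import random


def difference(series, lag=1):
    """Return series[i] - series[i - lag] for every valid index i."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


random.seed(0)  # fixed seed so the illustration is repeatable
# Simulated series: linear trend with slope 0.5 plus Gaussian noise.
trend = [0.5 * t + random.gauss(0.0, 1.0) for t in range(200)]

diffed = difference(trend)
mean_diff = sum(diffed) / len(diffed)
print(mean_diff)  # close to the slope 0.5; the level no longer grows with t

# A lag equal to the seasonal period removes a repeating pattern.
print(difference([1, 2, 1, 2, 1, 2], lag=2))  # [0, 0, 0, 0]
```

After first differencing, the mean of the series is roughly constant (near the slope of the removed trend), whereas the mean of the original series grows with time.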
<math display = "block">
\begin{align*}
s_0 & = x_0\\
b_0 & = x_1 - x_0\\
\end{align*}
</math>


One way to identify non-stationary time series is the [[wikipedia:Autocorrelation|ACF]] plot. Sometimes, seasonal patterns will be more visible in the ACF plot than in the original time series; however, this is not always the case.<ref>{{Cite web|url=https://www.otexts.org/fpp/8/1|title=8.1 Stationarity and differencing {{!}} OTexts|website=www.otexts.org|access-date=2016-05-18}}</ref> Nonstationary time series can look stationary


Another approach to identifying non-stationarity is to look at the [[wikipedia:Laplace transform|Laplace transform]] of a series, which will identify both exponential trends and sinusoidal seasonality (complex exponential trends). Related techniques from [[wikipedia:signal analysis|signal analysis]] such as the [[wikipedia:wavelet transform|wavelet transform]] and [[wikipedia:Fourier transform|Fourier transform]] may also be helpful.


And for <math>t > 0</math> by
<math display = "block">
\begin{align*}
s_t & = \alpha x_t + (1-\alpha)(s_{t-1} + b_{t-1})\\
b_t & = \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\
\end{align*}
</math>
where <math>\alpha</math> (<math>0 \le \alpha \le 1</math>) is the ''data smoothing factor'', and <math>\beta</math> (<math>0 \le \beta \le 1</math>) is the ''trend smoothing factor''.


'''Autocorrelation''', sometimes known as '''serial correlation''' in the [[wikipedia:discrete time|discrete time]] case, is the [[wikipedia:correlation|correlation]] of a [[wikipedia:Signal (information theory)|signal]] with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a [[wikipedia:periodic signal|periodic signal]] obscured by [[wikipedia:noise (signal processing)|noise]], or identifying the [[wikipedia:missing fundamental frequency|missing fundamental frequency]]  in a signal implied by its [[wikipedia:harmonic|harmonic]] frequencies. It is often used in [[wikipedia:signal processing|signal processing]] for analyzing functions or series of values, such as [[wikipedia:time domain|time domain]] signals.
The forecast beyond <math>x_t</math> is given by the approximation:


Different fields of study define autocorrelation differently, and not all of these definitions are equivalent. In some fields, the term is used interchangeably with [[wikipedia:autocovariance|autocovariance]].
<math display = "block">
F_{t+m} = s_t + m \cdot b_t
</math>
<!-- The use of m in conjunction with b above is confusing with the y = mx + b formula -->


[[wikipedia:Unit root|Unit root]] processes, [[wikipedia:trend-stationary process|trend-stationary process]]es, [[wikipedia:autoregressive process|autoregressive process]]es, and [[wikipedia:moving average process|moving average process]]es are specific forms of processes with autocorrelation.
Setting the initial value <math>b_0</math> is a matter of preference. An option other than the one listed above is <math display="inline">\frac{x_n-x_0} n</math> for some <math>n</math>.


== Autocorrelation of stochastic processes ==
Note that <math>F_0</math> is undefined (there is no estimation for time 0), and according to the definition <math>F_1= s_0 + b_0</math>, which is well defined, thus further values can be evaluated.
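The double exponential smoothing recursion and the forecast <math>F_{t+m} = s_t + m \cdot b_t</math> can be sketched as follows. The function names are hypothetical, and the initialization <math>s_0 = x_0</math>, <math>b_0 = x_1 - x_0</math> is the one given above; other choices of <math>b_0</math> would change the early values.

```python
def holt_double_smoothing(x, alpha, beta):
    """Return the final level s_t and trend b_t for observations x.

    Hypothetical helper using s_0 = x_0 and b_0 = x_1 - x_0.
    """
    if len(x) < 2:
        raise ValueError("at least two observations are needed to set b_0")
    s, b = x[0], x[1] - x[0]
    for x_t in x[1:]:
        s_prev = s
        # Level: blend the new observation with the trend-adjusted previous level.
        s = alpha * x_t + (1 - alpha) * (s_prev + b)
        # Trend: blend the latest change in level with the previous trend.
        b = beta * (s - s_prev) + (1 - beta) * b
    return s, b


def holt_forecast(s, b, m):
    """Forecast m > 0 steps ahead: F_{t+m} = s_t + m * b_t."""
    return s + m * b


x = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly linear data with slope 2
s, b = holt_double_smoothing(x, alpha=0.5, beta=0.5)
print(s, b, holt_forecast(s, b, 2))  # 9.0 2.0 13.0
```

On exactly linear data the level tracks the observations and the trend stays equal to the slope, so the forecast continues the line.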


In [[wikipedia:statistics|statistics]], the autocorrelation of a real or complex [[wikipedia:random process|random process]] is the [[wikipedia:Pearson correlation coefficient|Pearson correlation]] between values of the process at different times, as a function of the two times or of the time lag. Let <math>\left\{ X_t \right\}</math> be a random process, and <math>t</math> be any point in time (<math>t</math> may be an [[wikipedia:integer|integer]] for a [[wikipedia:discrete-time|discrete-time]] process or a [[wikipedia:real number|real number]] for a [[wikipedia:continuous-time|continuous-time]] process). Then <math>X_t</math> is the value (or [[wikipedia:Realization (probability)|realization]]) produced by a given [[wikipedia:Execution (computing)|run]] of the process at time <math>t</math>. Suppose that the process has [[wikipedia:mean|mean]] <math>\mu_t</math> and [[wikipedia:variance|variance]] <math>\sigma_t^2</math> at time <math>t</math>, for each <math>t</math>. Then the definition of the '''auto-correlation function''' between times <math>t_1</math> and <math>t_2</math> is<ref name=Gubner>{{cite book |first=John A. |last=Gubner |year=2006 |title=Probability and Random Processes for Electrical and Computer Engineers |publisher=Cambridge University Press |isbn=978-0-521-86470-1}}</ref>{{rp|p.388}}<ref name=KunIlPark/>{{rp|p.165}}
A second method, referred to as either Brown's linear exponential smoothing (LES) or Brown's double exponential smoothing works as follows.<ref>{{cite web |url= http://www.duke.edu/~rnau/411avg.htm |title=Averaging and Exponential Smoothing Models  |work=duke.edu  |access-date=25 September 2011}}</ref>


<math display = "block">\operatorname{R}_{XX}(t_1,t_2) = \operatorname{E} \left[ X_{t_1} \overline{X}_{t_2}\right]</math>
<math display = "block">
\begin{align*}
s'_0 & = x_0\\
s''_0 & = x_0\\
s'_t & = \alpha x_t + (1-\alpha)s'_{t-1}\\
s''_t & = \alpha s'_t + (1-\alpha)s''_{t-1}\\
F_{t+m} & = a_t + mb_t,
\end{align*}
</math>


where <math>\operatorname{E}</math> is the [[wikipedia:expected value|expected value]] operator and the bar represents [[wikipedia:complex conjugation|complex conjugation]]. Note that the expectation may not be [[wikipedia:well defined|well defined]].
where ''a''<sub>''t''</sub>, the estimated level at time ''t'' and ''b''<sub>''t''</sub>, the estimated trend at time ''t'' are:


Subtracting the mean before multiplication yields the '''auto-covariance function''' between times <math>t_1</math> and <math>t_2</math>:<ref name=Gubner/>{{rp|p.392}}<ref name=KunIlPark/>{{rp|p.168}}
<math display = "block">
\begin{align*}
a_t & = 2s'_t - s''_t\\[5pt]
b_t & = \frac \alpha {1-\alpha} (s'_t - s''_t).
\end{align*}
</math>


<math display = "block">\operatorname{K}_{XX}(t_1,t_2) = \operatorname{E} \left[ (X_{t_1} - \mu_{t_1})\overline{(X_{t_2} - \mu_{t_2})} \right] = \operatorname{E}\left[X_{t_1} \overline{X}_{t_2} \right] - \mu_{t_1}\overline{\mu}_{t_2}</math>
==Triple exponential smoothing (Holt Winters)==
Triple exponential smoothing applies exponential smoothing three times, which is commonly used when there are three high frequency signals to be removed from a time series under study. There are different types of seasonality: 'multiplicative' and 'additive' in nature, much like addition and multiplication are basic operations in mathematics.


Note that this expression is not well defined for all time series or processes, because the mean may not exist, or the variance may be zero (for a constant process) or infinite (for processes with distribution lacking well-behaved moments, such as certain types of [[wikipedia:power law|power law]]).
If every month of December we sell 10,000 more apartments than we do in November the seasonality is ''additive'' in nature. However, if we sell 10% more apartments in the summer months than we do in the winter months the seasonality is ''multiplicative'' in nature. Multiplicative seasonality can be represented as a constant factor, not an absolute amount.
<ref>{{cite web|last1=Kalehar|first1=Prajakta S.|title=Time series Forecasting using Holt–Winters Exponential Smoothing|url=http://www.it.iitb.ac.in/~praj/acads/seminar/04329008_ExponentialSmoothing.pdf|access-date=23 June 2014}}</ref>


=== Definition for wide-sense stationary stochastic process ===
Triple exponential smoothing was first suggested by Holt's student, Peter Winters, in 1960 after reading a signal processing book from the 1940s on exponential smoothing.<ref name="Winters 324–342">{{cite journal|first=P. R.|last=Winters|title=Forecasting Sales by Exponentially Weighted Moving Averages|journal=[[wikipedia:Management Science: A Journal of the Institute for Operations Research and the Management Sciences|Management Science]]|volume=6|issue=3|date=April 1960|pages=324–342|doi=10.1287/mnsc.6.3.324}}</ref> Holt's novel idea was to repeat filtering an odd number of times greater than 1 and less than 5, which was popular with scholars of previous eras.<ref name="Winters 324–342"/> While recursive filtering had been used previously, it was applied twice and four times to coincide with the [[wikipedia:Hadamard conjecture|Hadamard conjecture]], while triple application required more than double the operations of singular convolution. The use of a triple application is considered a [[wikipedia:rule of thumb|rule of thumb]] technique, rather than one based on theoretical foundations and has often been over-emphasized by practitioners.
If <math>\left\{ X_t \right\}</math> is a [[wikipedia:wide-sense stationary process|wide-sense stationary process]] then the mean <math>\mu</math> and the variance <math>\sigma^2</math> are time-independent, and further the autocovariance function depends only on the lag between <math>t_1</math> and <math>t_2</math>: the autocovariance depends only on the time-distance between the pair of values but not on their position in time. This further implies that the autocovariance and auto-correlation can be expressed as a function of the time-lag, and that this would be an [[wikipedia:even function|even function]] of the lag  <math>\tau=t_2-t_1</math>. This gives the more familiar forms for the '''auto-correlation function'''<ref name=Gubner/>{{rp|p.395}}
Suppose we have a sequence of observations <math>x_t</math>, beginning at time <math>t=0</math> with a cycle of seasonal change of length <math>L</math>.


<math display = "block">\operatorname{R}_{XX}(\tau) = \operatorname{E}\left[X_{t+\tau} \overline{X}_{t} \right]</math>
The method calculates a trend line for the data as well as seasonal indices that weight the values in the trend line based on where that time point falls in the cycle of length <math>L</math>.


and the '''auto-covariance function''':
Let <math>s_t</math> represent the smoothed value of the constant part for time <math>t</math>, let <math>b_t</math> be the sequence of best estimates of the linear trend that are superimposed on the seasonal changes, and let <math>c_t</math> be the sequence of seasonal correction factors. We wish to estimate <math>c_t</math> at every time <math>t \bmod L</math> in the cycle that the observations take on. As a rule of thumb, a minimum of two full seasons (or <math>2L</math> periods) of historical data is needed to initialize a set of seasonal factors.


<math display = "block">\operatorname{K}_{XX}(\tau) = \operatorname{E}\left[ (X_{t+\tau} - \mu)\overline{(X_{t} - \mu)} \right] = \operatorname{E} \left[ X_{t+\tau} \overline{X}_{t} \right] - \mu\overline{\mu}</math>
The output of the algorithm is again written as <math>F_{t+m}</math>, an estimate of the value of <math>x_{t+m}</math> at time <math>t+m>0</math> based on the raw data up to time <math>t</math>. Triple exponential smoothing with multiplicative seasonality is given by the formulas<ref name=NIST />


=== Normalization ===
It is common practice in some disciplines (e.g. statistics and [[wikipedia:time series analysis|time series analysis]]) to normalize the autocovariance function to get a time-dependent [[wikipedia:Pearson correlation coefficient|Pearson correlation coefficient]]. However, in other disciplines (e.g. engineering) the normalization is usually dropped and the terms "autocorrelation" and "autocovariance" are used interchangeably.

<math display = "block">
\begin{align*}
s_0 & = x_0\\[5pt]
s_t & = \alpha \frac{x_t}{c_{t-L}} + (1-\alpha)(s_{t-1} + b_{t-1})\\[5pt]
b_t & = \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\[5pt]
c_t & = \gamma \frac{x_t}{s_t}+(1-\gamma)c_{t-L}\\[5pt]
F_{t+m} & = (s_t + mb_t)c_{t-L+1+(m-1)\bmod L},
\end{align*}
</math>


The definition of the auto-correlation coefficient of a stochastic process is<ref name=KunIlPark/>{{rp|p.169}}
where <math>\alpha</math> (<math>0 \le \alpha \le 1</math>) is the ''data smoothing factor'', <math>\beta</math> (<math>0 \le \beta \le 1</math>) is the ''trend smoothing factor'', and <math>\gamma</math> (<math>0 \le \gamma \le 1</math>) is the ''seasonal change smoothing factor''.


<math display=block>\rho_{XX}(t_1,t_2) = \frac{\operatorname{K}_{XX}(t_1,t_2)}{\sigma_{t_1}\sigma_{t_2}} = \frac{\operatorname{E}\left[(X_{t_1} - \mu_{t_1}) \overline{(X_{t_2} - \mu_{t_2})} \right]}{\sigma_{t_1}\sigma_{t_2}} .</math>
The general formula for the initial trend estimate <math>b_0</math> is:


If the function <math>\rho_{XX}</math> is well defined, its value must lie in the range <math>[-1,1]</math>, with 1 indicating perfect correlation and −1 indicating perfect [[wikipedia:anti-correlation|anti-correlation]].
<math display = "block">
\begin{align*}
b_0 & = \frac{1}{L} \left(\frac{x_{L+1}-x_1}{L} + \frac{x_{L+2}-x_2}{L} + \cdots + \frac{x_{L+L}-x_L}{L}\right)
\end{align*}
</math>


For a [[wikipedia:Stationary process#Weak or wide-sense stationarity|weak-sense stationary or wide-sense stationary]] (WSS) process, the definition is
Setting the initial estimates for the seasonal indices <math>c_i</math> for <math>i = 1,2,\ldots,L</math> is a bit more involved. If <math>N</math> is the number of complete cycles present in your data, then:
 
<math display=block>\rho_{XX}(\tau) = \frac{\operatorname{K}_{XX}(\tau)}{\sigma^2} = \frac{\operatorname{E} \left[(X_{t+\tau} - \mu)\overline{(X_{t} - \mu)}\right]}{\sigma^2}</math>


<math display = "block">
c_i = \frac{1}{N} \sum_{j=1}^N \frac{x_{L(j-1)+i}}{A_j} \quad \text{for } i = 1,2,\ldots,L
</math>
where
<math display = "block">
A_j = \frac{\sum_{i=1}^{L} x_{L(j-1)+i}}{L} \quad \text{for } j = 1,2,\ldots,N
</math>
Note that <math>A_j</math> is the average value of <math>x</math> in the <math>j^\text{th}</math> cycle of your data. Triple exponential smoothing with additive seasonality is given by:


<math display=block>\operatorname{K}_{XX}(0) = \sigma^2 .</math>
<math display = "block">
\begin{align*}
s_0 & = x_0\\
s_t & = \alpha (x_t-c_{t-L}) + (1-\alpha)(s_{t-1} + b_{t-1})\\
b_t & = \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\
c_t & = \gamma (x_t-s_{t-1}-b_{t-1})+(1-\gamma)c_{t-L}\\
F_{t+m} & = s_t + mb_t+c_{t-L+1+(m-1) \bmod L},
\end{align*}
</math>

The normalization is important both because the interpretation of the autocorrelation as a correlation provides a scale-free measure of the strength of [[wikipedia:statistical dependence|statistical dependence]], and because the normalization has an effect on the statistical properties of the estimated autocorrelations.

===Properties===
====Symmetry property====
The fact that the auto-correlation function <math>\operatorname{R}_{XX}</math> is an [[wikipedia:even function|even function]] can be stated as<ref name=KunIlPark/>{{rp|p.171}}
<math display=block>\operatorname{R}_{XX}(t_1,t_2) = \overline{\operatorname{R}_{XX}(t_2,t_1)}</math>
respectively for a WSS process:<ref name=KunIlPark/>{{rp|p.173}}
<math display=block>\operatorname{R}_{XX}(\tau) = \overline{\operatorname{R}_{XX}(-\tau)} .</math>
 
====Maximum at zero====
For a WSS process:<ref name=KunIlPark/>{{rp|p.174}}
<math display=block>\left|\operatorname{R}_{XX}(\tau)\right| \leq \operatorname{R}_{XX}(0)</math>
Notice that <math>\operatorname{R}_{XX}(0)</math> is always real.
 
====Cauchy–Schwarz inequality====
The [[wikipedia:Cauchy–Schwarz inequality|Cauchy–Schwarz inequality]] for stochastic processes:<ref name=Gubner/>{{rp|p.392}}
<math display=block>\left|\operatorname{R}_{XX}(t_1,t_2)\right|^2 \leq \operatorname{E}\left[ |X_{t_1}|^2\right] \operatorname{E}\left[|X_{t_2}|^2\right]</math>
 
====Wiener–Khinchin theorem====
The [[wikipedia:Wiener–Khinchin theorem|Wiener–Khinchin theorem]] relates the autocorrelation function <math>\operatorname{R}_{XX}</math> to the [[wikipedia:spectral density|power spectral density]] <math>S_{XX}</math> via the [[wikipedia:Fourier transform|Fourier transform]]:
 
<math display=block>\operatorname{R}_{XX}(\tau) = \int_{-\infty}^\infty S_{XX}(f) e^{i 2 \pi f \tau} \, {\rm d}f</math>
 
<math display=block>S_{XX}(f) = \int_{-\infty}^\infty \operatorname{R}_{XX}(\tau) e^{- i 2 \pi f \tau} \, {\rm d}\tau .</math>
 
For real-valued functions, the symmetric autocorrelation function has a real symmetric transform, so the [[wikipedia:Wiener–Khinchin theorem|Wiener–Khinchin theorem]] can be re-expressed in terms of real cosines only:
 
<math display=block>\operatorname{R}_{XX}(\tau) = \int_{-\infty}^\infty S_{XX}(f) \cos(2 \pi f \tau) \, {\rm d}f</math>
 
<math display=block>S_{XX}(f) = \int_{-\infty}^\infty \operatorname{R}_{XX}(\tau) \cos(2 \pi f \tau) \, {\rm d}\tau .</math>


==References==
{{reflist}}
==Wikipedia References==
*{{cite web |url = https://en.wikipedia.org/w/index.php?title=Stationary_process&oldid=1103403908|title= Stationary process | author = Wikipedia contributors |website= Wikipedia |publisher= Wikipedia |access-date = 17 August 2022 }}
*{{cite web |url = https://en.wikipedia.org/w/index.php?title=Exponential_smoothing&oldid=1104812496|title= Exponential smoothing | author = Wikipedia contributors |website= Wikipedia |publisher= Wikipedia |access-date = 17 August 2022 }}
*{{cite web |url = https://en.wikipedia.org/w/index.php?title=Autocorrelation&oldid=1104745910|title= Autocorrelation | author = Wikipedia contributors |website= Wikipedia |publisher= Wikipedia |access-date = 17 August 2022 }}

Revision as of 18:57, 30 October 2023

Exponential smoothing is a rule of thumb technique for smoothing time series data using the exponential window function. Whereas in the simple moving average the past observations are weighted equally, exponential functions are used to assign exponentially decreasing weights over time. It is an easily learned and easily applied procedure for making some determination based on prior assumptions by the user, such as seasonality. Exponential smoothing is often used for analysis of time-series data.

Exponential smoothing is one of many window functions commonly applied to smooth data in signal processing, acting as low-pass filters to remove high-frequency noise. This method is preceded by Poisson's use of recursive exponential window functions in convolutions from the 19th century, as well as Kolmogorov and Zurbenko's use of recursive moving averages from their studies of turbulence in the 1940s.

The raw data sequence is often represented by [math]\{x_t\}[/math] beginning at time [math]t = 0[/math], and the output of the exponential smoothing algorithm is commonly written as [math]\{s_t\}[/math], which may be regarded as a best estimate of what the next value of [math]x[/math] will be. When the sequence of observations begins at time [math]t = 0[/math], the simplest form of exponential smoothing is given by the formulas:[1]

[[math]] \begin{align*} s_0& = x_0\\ s_t & = \alpha x_{t} + (1-\alpha)s_{t-1},\quad t\gt0 \end{align*} [[/math]]

where [math]\alpha[/math] is the smoothing factor, and [math]0 \lt \alpha \lt 1[/math].

Basic (simple) exponential smoothing (Holt linear)

The use of the exponential window function is first attributed to Poisson[2] as an extension of a numerical analysis technique from the 17th century, and later adopted by the signal processing community in the 1940s. Here, exponential smoothing is the application of the exponential, or Poisson, window function. Exponential smoothing was first suggested in the statistical literature without citation to previous work by Robert Goodell Brown in 1956,[3] and then expanded by Charles C. Holt in 1957.[4] The formulation below, which is the one commonly used, is attributed to Brown and is known as "Brown’s simple exponential smoothing".[5] All the methods of Holt, Winters and Brown may be seen as a simple application of recursive filtering, first found in the 1940s[2] to convert finite impulse response (FIR) filters to infinite impulse response filters.

The simplest form of exponential smoothing is given by the formula:

[[math]]s_t = \alpha x_t + (1-\alpha) s_{t-1} = s_{t-1} + \alpha (x_t - s_{t-1}).[[/math]]

where [math]\alpha[/math] is the smoothing factor, and [math]0 \le \alpha \le 1[/math]. In other words, the smoothed statistic [math]s_t[/math] is a simple weighted average of the current observation [math]x_t[/math] and the previous smoothed statistic [math]s_{t-1}[/math]. Simple exponential smoothing is easily applied, and it produces a smoothed statistic as soon as two observations are available. The term smoothing factor applied to [math]\alpha[/math] here is something of a misnomer, as larger values of [math]\alpha[/math] actually reduce the level of smoothing, and in the limiting case with [math]\alpha[/math] = 1 the output series is just the current observation. Values of [math]\alpha[/math] close to one have less of a smoothing effect and give greater weight to recent changes in the data, while values of [math]\alpha[/math] closer to zero have a greater smoothing effect and are less responsive to recent changes.
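The recursion above can be sketched in a few lines of Python. This is a minimal illustration, assuming the series is a plain list of numbers and that [math]s_0 = x_0[/math]; the function name `simple_exp_smoothing` is hypothetical and not taken from any library.

```python
def simple_exp_smoothing(x, alpha):
    """Return the smoothed series s, with s[0] = x[0] and
    s[t] = s[t-1] + alpha * (x[t] - s[t-1]) for t > 0."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    if not x:
        return []
    s = [x[0]]  # assumed initialization: s_0 = x_0
    for value in x[1:]:
        # Error-correction form of the update; algebraically the same as
        # alpha * x_t + (1 - alpha) * s_{t-1}.
        s.append(s[-1] + alpha * (value - s[-1]))
    return s


data = [3.0, 5.0, 9.0, 20.0]
print(simple_exp_smoothing(data, 0.5))  # [3.0, 4.0, 6.5, 13.25]
print(simple_exp_smoothing(data, 1.0))  # reproduces the input: no smoothing
```

With `alpha = 1.0` the output equals the input, which illustrates why larger values of the smoothing factor mean less smoothing.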

There is no formally correct procedure for choosing [math]\alpha[/math]. Sometimes the statistician's judgment is used to choose an appropriate factor. Alternatively, a statistical technique may be used to optimize the value of [math]\alpha[/math]. For example, the method of least squares might be used to determine the value of [math]\alpha[/math] for which the sum of the quantities [math](s_t - x_{t+1})^2[/math] is minimized.[6]

Unlike some other smoothing methods, such as the simple moving average, this technique does not require any minimum number of observations to be made before it begins to produce results. In practice, however, a "good average" will not be achieved until several samples have been averaged together; for example, a constant signal will take approximately [math]3 / \alpha[/math] stages to reach 95% of the actual value. To accurately reconstruct the original signal without information loss all stages of the exponential moving average must also be available, because older samples decay in weight exponentially. This is in contrast to a simple moving average, in which some samples can be skipped without as much loss of information due to the constant weighting of samples within the average. If a known number of samples will be missed, one can adjust a weighted average for this as well, by giving equal weight to the new sample and all those to be skipped.

This simple form of exponential smoothing is also known as an exponentially weighted moving average (EWMA). Technically it can also be classified as an autoregressive integrated moving average (ARIMA) (0,1,1) model with no constant term.[7]

Time constant

The time constant of an exponential moving average is the amount of time for the smoothed response of a unit step function to reach [math]1-1/e \approx 63.2\,\%[/math] of the original signal. The relationship between this time constant, [math] \tau [/math], and the smoothing factor, [math] \alpha [/math], is given by the formula:

[[math]]\alpha = 1 - e^{-\Delta T/\tau}[[/math]]

Thus [math]\tau = - \frac{\Delta T}{\ln(1 - \alpha)}[/math], where [math]\Delta T[/math] is the sampling time interval of the discrete-time implementation. If the sampling time is fast compared to the time constant ([math]\Delta T \ll \tau[/math]), then

[[math]]\alpha \approx \frac{\Delta T} \tau [[/math]]
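As a small numerical check of these relations, the following sketch converts between the time constant and the smoothing factor. The function names are hypothetical, and the sample values of [math]\Delta T[/math] and [math]\tau[/math] are arbitrary.

```python
import math


def alpha_from_time_constant(dt, tau):
    """alpha = 1 - exp(-dt / tau)."""
    return 1.0 - math.exp(-dt / tau)


def time_constant_from_alpha(dt, alpha):
    """tau = -dt / ln(1 - alpha); requires 0 < alpha < 1."""
    return -dt / math.log(1.0 - alpha)


dt, tau = 1.0, 100.0  # arbitrary example: sampling much faster than tau
alpha = alpha_from_time_constant(dt, tau)
print(alpha, dt / tau)                      # exact value vs. approximation dt / tau
print(time_constant_from_alpha(dt, alpha))  # recovers tau up to rounding
```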

Choosing the initial smoothed value

Note that in the definition above, [math]s_0[/math] is being initialized to [math]x_0[/math]. Because exponential smoothing requires that at each stage we have the previous forecast, it is not obvious how to get the method started. We could assume that the initial forecast is equal to the initial value of demand; however, this approach has a serious drawback. Exponential smoothing puts substantial weight on past observations, so the initial value of demand will have an unreasonably large effect on early forecasts. This problem can be overcome by allowing the process to evolve for a reasonable number of periods (10 or more) and using the average of the demand during those periods as the initial forecast. There are many other ways of setting this initial value, but it is important to note that the smaller the value of [math]\alpha[/math], the more sensitive your forecast will be to the selection of this initial smoothed value [math]s_0[/math].[8][9]

Optimization

For every exponential smoothing method we also need to choose the value for the smoothing parameters. For simple exponential smoothing, there is only one smoothing parameter (α), but for the methods that follow there is usually more than one smoothing parameter.

There are cases where the smoothing parameters may be chosen in a subjective manner – the forecaster specifies the value of the smoothing parameters based on previous experience. However, a more robust and objective way to obtain values for the unknown parameters included in any exponential smoothing method is to estimate them from the observed data.

The unknown parameters and the initial values for any exponential smoothing method can be estimated by minimizing the sum of squared errors (SSE). The errors are specified as [math]e_t=y_t-\hat{y}_{t\mid t-1}[/math] for [math] t=1, \ldots,T[/math] (the one-step-ahead within-sample forecast errors). Hence we find the values of the unknown parameters and the initial values that minimize

[[math]] \text{SSE} = \sum_{t=1}^T (y_t-\hat{y}_{t\mid t-1})^2=\sum_{t=1}^T e_t^2[[/math]]

[10]

Unlike the regression case (where we have formulae to directly compute the regression coefficients which minimize the SSE) this involves a non-linear minimization problem and we need to use an optimization tool to perform this.
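To make the idea concrete, here is a rough sketch for simple exponential smoothing that picks [math]\alpha[/math] by minimizing the SSE of one-step-ahead errors. It makes two simplifying assumptions that are not part of the text above: the initial value is fixed at [math]s_0 = x_0[/math] rather than estimated, and a plain grid search stands in for a proper optimization tool. The names `sse_one_step` and `best_alpha` are hypothetical.

```python
def sse_one_step(x, alpha):
    """Sum of squared one-step-ahead errors, using s_{t-1} as the forecast of x_t."""
    s = x[0]  # assumed fixed initialization s_0 = x_0
    sse = 0.0
    for value in x[1:]:
        error = value - s    # forecast error e_t = x_t - s_{t-1}
        sse += error * error
        s += alpha * error   # smoothing update
    return sse


def best_alpha(x, steps=100):
    """Grid search over alpha in {0, 1/steps, ..., 1}; a stand-in for a real optimizer."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda a: sse_one_step(x, a))


trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(best_alpha(trend))  # 1.0: on a steady trend, the most responsive alpha wins
```

On a steadily trending series the search settles on [math]\alpha = 1[/math], which is consistent with the later remark that simple exponential smoothing does not do well when there is a trend in the data.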

"Exponential" naming

The name 'exponential smoothing' is attributed to the use of the exponential window function during convolution. It is no longer attributed to Holt, Winters & Brown.

By direct substitution of the defining equation for simple exponential smoothing back into itself we find that

[[math]] \begin{align*} s_t& = \alpha x_t + (1-\alpha)s_{t-1}\\[3pt] & = \alpha x_t + \alpha (1-\alpha)x_{t-1} + (1 - \alpha)^2 s_{t-2}\\[3pt] & = \alpha \left[x_t + (1-\alpha)x_{t-1} + (1-\alpha)^2 x_{t-2} + (1-\alpha)^3 x_{t-3} + \cdots + (1-\alpha)^{t-1} x_1 \right] + (1-\alpha)^t x_0. \end{align*} [[/math]]

In other words, as time passes the smoothed statistic [math]s_t[/math] becomes the weighted average of a greater and greater number of the past observations [math]x_t, x_{t-1}, \ldots, x_0[/math], and the weights assigned to previous observations are proportional to the terms of the geometric progression

[[math]]1, (1-\alpha), (1-\alpha)^2,\ldots, (1-\alpha)^n,\ldots[[/math]]

A geometric progression is the discrete version of an exponential function, so this is where the name for this smoothing method originated according to Statistics lore.
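The expansion can be checked numerically. The sketch below, with hypothetical function names, compares the recursive definition against the explicit weighted sum, in which [math]x_{t-k}[/math] receives weight [math]\alpha(1-\alpha)^k[/math] and [math]x_0[/math] receives the remaining weight [math](1-\alpha)^t[/math].

```python
import math


def recursive_smooth(x, alpha):
    """Last smoothed value s_t from the recursion, with s_0 = x_0."""
    s = x[0]
    for value in x[1:]:
        s = alpha * value + (1 - alpha) * s
    return s


def explicit_weighted_sum(x, alpha):
    """Same s_t, written as a weighted sum with geometrically decaying weights."""
    t = len(x) - 1
    total = (1 - alpha) ** t * x[0]  # leftover weight on the initial value
    for k in range(t):               # k = number of steps back from time t
        total += alpha * (1 - alpha) ** k * x[t - k]
    return total


x = [2.0, 4.0, 1.0, 7.0, 3.0]
print(math.isclose(recursive_smooth(x, 0.3), explicit_weighted_sum(x, 0.3)))  # True
```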

Comparison with moving average

Exponential smoothing and moving average share the defect of introducing a lag relative to the input data. While this can be corrected by shifting the result by half the window length for a symmetrical kernel, such as a moving average or Gaussian, it is unclear how appropriate this would be for exponential smoothing. They also both have roughly the same distribution of forecast error when [math]\alpha = 2/(k+1)[/math]. They differ in that exponential smoothing takes into account all past data, whereas moving average only takes into account [math]k[/math] past data points. Computationally speaking, they also differ in that moving average requires the past [math]k[/math] data points, or the data point at lag [math]k+1[/math] plus the most recent forecast value, to be kept, whereas exponential smoothing only needs the most recent forecast value to be kept.[11]

In the signal processing literature, the use of non-causal (symmetric) filters is commonplace, and the exponential window function is broadly used in this fashion, but a different terminology is used: exponential smoothing is equivalent to a first-order infinite-impulse response (IIR) filter and moving average is equivalent to a finite impulse response filter with equal weighting factors.

Double exponential smoothing

Simple exponential smoothing does not do well when there is a trend in the data.[1] In such situations, several methods were devised under the name "double exponential smoothing" or "second-order exponential smoothing", so called because an exponential filter is applied recursively twice. This nomenclature is similar to quadruple exponential smoothing, which also references its recursion depth.[12] The basic idea behind double exponential smoothing is to introduce a term to take into account the possibility of a series exhibiting some form of trend. This slope component is itself updated via exponential smoothing.

One method works as follows:[13]

Again, the raw data sequence of observations is represented by [math]x_t[/math], beginning at time [math]t=0[/math]. We use [math]s_t[/math] to represent the smoothed value for time [math]t[/math], and [math]b_t[/math] is our best estimate of the trend at time [math]t[/math]. The output of the algorithm is now written as [math]F_{t+m}[/math], an estimate of the value of [math]x_{t+m}[/math] at time [math]t+m[/math] for [math]m \gt 0[/math], based on the raw data up to time [math]t[/math]. Double exponential smoothing is given by the formulas

[[math]] \begin{align*} s_0 & = x_0\\ b_0 & = x_1 - x_0\\ \end{align*} [[/math]]

And for [math]t \gt 0[/math] by

[[math]] \begin{align*} s_t & = \alpha x_t + (1-\alpha)(s_{t-1} + b_{t-1})\\ b_t & = \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\ \end{align*} [[/math]]

where [math]\alpha[/math] ([math]0 \le \alpha \le 1[/math]) is the data smoothing factor, and [math]\beta[/math] ([math]0 \le \beta \le 1[/math]) is the trend smoothing factor.

A forecast beyond [math]x_t[/math] is given by the approximation:

[[math]] F_{t+m} = s_t + m \cdot b_t [[/math]]

Setting the initial value [math]b_0[/math] is a matter of preference. An option other than the one listed above is [math]\frac{x_n-x_0} n[/math] for some [math]n[/math].

Note that [math]F_0[/math] is undefined (there is no estimation for time 0), and according to the definition [math]F_1= s_0 + b_0[/math], which is well defined, thus further values can be evaluated.
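A minimal sketch of this first double exponential smoothing method, assuming the initialization [math]s_0 = x_0[/math], [math]b_0 = x_1 - x_0[/math] given above. The function names are hypothetical.

```python
def holt_double_smoothing(x, alpha, beta):
    """Return (levels, trends) for the level/trend recursion described above."""
    if len(x) < 2:
        raise ValueError("need at least two observations to set b_0 = x_1 - x_0")
    s, b = x[0], x[1] - x[0]
    levels, trends = [s], [b]
    for value in x[1:]:
        prev_s = s
        s = alpha * value + (1 - alpha) * (prev_s + b)  # level update
        b = beta * (s - prev_s) + (1 - beta) * b        # trend update
        levels.append(s)
        trends.append(b)
    return levels, trends


def holt_forecast(levels, trends, m):
    """F_{t+m} = s_t + m * b_t, using the last level and trend."""
    return levels[-1] + m * trends[-1]


levels, trends = holt_double_smoothing([1.0, 3.0, 5.0, 7.0, 9.0], 0.5, 0.5)
print(holt_forecast(levels, trends, 1))  # 11.0: a perfectly linear series is extrapolated exactly
```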

A second method, referred to as either Brown's linear exponential smoothing (LES) or Brown's double exponential smoothing works as follows.[14]

[[math]] \begin{align*} s'_0 & = x_0\\ s''_0 & = x_0\\ s'_t & = \alpha x_t + (1-\alpha)s'_{t-1}\\ s''_t & = \alpha s'_t + (1-\alpha)s''_{t-1}\\ F_{t+m} & = a_t + mb_t, \end{align*} [[/math]]

where [math]a_t[/math], the estimated level at time [math]t[/math], and [math]b_t[/math], the estimated trend at time [math]t[/math], are:

[[math]] \begin{align*} a_t & = 2s'_t - s''_t\\[5pt] b_t & = \frac \alpha {1-\alpha} (s'_t - s''_t). \end{align*} [[/math]]
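Brown's method can be sketched as follows. The function name is hypothetical, and the sketch assumes [math]0 \lt \alpha \lt 1[/math], since the trend formula divides by [math]1-\alpha[/math].

```python
def brown_forecast(x, alpha, m=1):
    """Forecast F_{t+m} from Brown's linear exponential smoothing."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    s1 = s2 = x[0]  # s'_0 = s''_0 = x_0
    for value in x[1:]:
        s1 = alpha * value + (1 - alpha) * s1  # first smoothing pass
        s2 = alpha * s1 + (1 - alpha) * s2     # second pass, applied to s'
    level = 2 * s1 - s2                        # a_t
    trend = alpha / (1 - alpha) * (s1 - s2)    # b_t
    return level + m * trend


print(brown_forecast([0.0, 2.0], 0.5))       # 2.0
print(brown_forecast([4.0, 4.0, 4.0], 0.3))  # 4.0: a constant series has zero trend
```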

Triple exponential smoothing (Holt Winters)

Triple exponential smoothing applies exponential smoothing three times, which is commonly used when there are three high frequency signals to be removed from a time series under study. There are different types of seasonality: 'multiplicative' and 'additive' in nature, much like addition and multiplication are basic operations in mathematics.

If every December we sell 10,000 more apartments than we do in November, the seasonality is additive in nature. However, if we sell 10% more apartments in the summer months than we do in the winter months, the seasonality is multiplicative in nature. Multiplicative seasonality can be represented as a constant factor, not an absolute amount.[15]

Triple exponential smoothing was first suggested by Holt's student, Peter Winters, in 1960 after reading a signal processing book from the 1940s on exponential smoothing.[16] Holt's novel idea was to repeat filtering an odd number of times greater than 1 and less than 5, which was popular with scholars of previous eras.[16] While recursive filtering had been used previously, it was applied twice and four times to coincide with the Hadamard conjecture, while triple application required more than double the operations of singular convolution. The use of a triple application is considered a rule of thumb technique, rather than one based on theoretical foundations and has often been over-emphasized by practitioners.

Suppose we have a sequence of observations [math]x_t[/math], beginning at time [math]t=0[/math] with a cycle of seasonal change of length [math]L[/math].

The method calculates a trend line for the data as well as seasonal indices that weight the values in the trend line based on where that time point falls in the cycle of length [math]L[/math].

Let [math]s_t[/math] represent the smoothed value of the constant part for time [math]t[/math], let [math]b_t[/math] be the sequence of best estimates of the linear trend that are superimposed on the seasonal changes, and let [math]c_t[/math] be the sequence of seasonal correction factors. We wish to estimate [math]c_t[/math] at every time [math]t \bmod L[/math] in the cycle that the observations take on. As a rule of thumb, a minimum of two full seasons (or [math]2L[/math] periods) of historical data is needed to initialize a set of seasonal factors.

The output of the algorithm is again written as [math]F_{t+m}[/math], an estimate of the value of [math]x_{t+m}[/math] at time [math]t+m\gt0[/math] based on the raw data up to time [math]t[/math]. Triple exponential smoothing with multiplicative seasonality is given by the formulas[1]

[[math]] \begin{align*} s_0 & = x_0\\[5pt] s_t & = \alpha \frac{x_t}{c_{t-L}} + (1-\alpha)(s_{t-1} + b_{t-1})\\[5pt] b_t & = \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\[5pt] c_t & = \gamma \frac{x_t}{s_t}+(1-\gamma)c_{t-L}\\[5pt] F_{t+m} & = (s_t + mb_t)c_{t-L+1+(m-1)\bmod L}, \end{align*} [[/math]]

where [math]\alpha[/math] ([math]0 \le \alpha \le 1[/math]) is the data smoothing factor, [math]\beta[/math] ([math]0 \le \beta \le 1[/math]) is the trend smoothing factor, and [math]\gamma[/math] ([math]0 \le \gamma \le 1[/math]) is the seasonal change smoothing factor.

The general formula for the initial trend estimate [math]b_0[/math] is:

[[math]] \begin{align*} b_0 & = \frac{1}{L} \left(\frac{x_{L+1}-x_1}{L} + \frac{x_{L+2}-x_2}{L} + \cdots + \frac{x_{L+L}-x_L}{L}\right) \end{align*} [[/math]]

Setting the initial estimates for the seasonal indices [math]c_i[/math] for [math]i = 1,2,\ldots,L[/math] is a bit more involved. If [math]N[/math] is the number of complete cycles present in your data, then:

[[math]] c_i = \frac{1}{N} \sum_{j=1}^N \frac{x_{L(j-1)+i}}{A_j} \quad \text{for } i = 1,2,\ldots,L [[/math]]

where

[[math]] A_j = \frac{\sum_{i=1}^{L} x_{L(j-1)+i}}{L} \quad \text{for } j = 1,2,\ldots,N [[/math]]
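The initialization formulas for [math]b_0[/math] and [math]c_i[/math] can be sketched as below. The formulas index observations from 1, so this sketch assumes that [math]x_1[/math] corresponds to the first list element `x[0]`; that mapping and the function names are assumptions made for illustration. At least [math]2L[/math] observations are required.

```python
def initial_trend(x, L):
    """b_0: average over one cycle of the per-period change between matching seasons."""
    return sum((x[L + i] - x[i]) / L for i in range(L)) / L


def initial_seasonal_indices(x, L):
    """c_1..c_L: each observation divided by its cycle average A_j, averaged over cycles."""
    N = len(x) // L  # number of complete cycles
    averages = [sum(x[L * j:L * (j + 1)]) / L for j in range(N)]  # A_j
    return [
        sum(x[L * j + i] / averages[j] for j in range(N)) / N
        for i in range(L)
    ]


series = [1.0, 3.0, 2.0, 6.0]  # two cycles of length L = 2
print(initial_trend(series, 2))             # 1.0
print(initial_seasonal_indices(series, 2))  # [0.5, 1.5]
```

In this toy series the second cycle is the first cycle doubled, so the multiplicative indices come out the same in both cycles and average to 1.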

Note that [math]A_j[/math] is the average value of [math]x[/math] in the [math]j^\text{th}[/math] cycle of your data. Triple exponential smoothing with additive seasonality is given by:

[[math]] \begin{align*} s_0 & = x_0\\ s_t & = \alpha (x_t-c_{t-L}) + (1-\alpha)(s_{t-1} + b_{t-1})\\ b_t & = \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\ c_t & = \gamma (x_t-s_{t-1}-b_{t-1})+(1-\gamma)c_{t-L}\\ F_{t+m} & = s_t + mb_t+c_{t-L+1+(m-1) \bmod L}, \end{align*} [[/math]]
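The additive recursion can be sketched as follows. The formulas above do not say how to initialize the trend or the additive seasonal terms, so this sketch takes them as caller-supplied arguments: `b0` for the initial trend and `c_init[p]` for the initial seasonal offset of phase [math]p = t \bmod L[/math]. Those arguments and the function name are assumptions made for illustration.

```python
def holt_winters_additive(x, L, alpha, beta, gamma, b0, c_init, m=1):
    """Forecast F_{t+m} with additive seasonality; season[p] holds the
    latest seasonal estimate for phase p = t mod L."""
    if len(c_init) != L:
        raise ValueError("c_init must contain exactly L values")
    season = list(c_init)
    s, b = x[0], b0  # s_0 = x_0; b_0 supplied by the caller (assumption)
    for t in range(1, len(x)):
        c_old = season[t % L]  # plays the role of c_{t-L}
        s_new = alpha * (x[t] - c_old) + (1 - alpha) * (s + b)
        b_new = beta * (s_new - s) + (1 - beta) * b
        # Seasonal update uses the previous level and trend, as in the formula.
        season[t % L] = gamma * (x[t] - s - b) + (1 - gamma) * c_old
        s, b = s_new, b_new
    t = len(x) - 1
    # c_{t-L+1+(m-1) mod L} is the latest estimate for phase (t + m) mod L.
    return s + m * b + season[(t + m) % L]


# Level 10 + t with seasonal offsets 0 and -2 alternating (L = 2).
x = [10.0, 9.0, 12.0, 11.0, 14.0, 13.0]
print(holt_winters_additive(x, 2, 0.5, 0.5, 0.5, b0=1.0, c_init=[0.0, -2.0], m=1))  # 16.0
print(holt_winters_additive(x, 2, 0.5, 0.5, 0.5, b0=1.0, c_init=[0.0, -2.0], m=2))  # 15.0
```

Because the example series follows the assumed level, trend, and seasonal pattern exactly, every update leaves the estimates unchanged and the forecasts reproduce the next values of the pattern.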

References

  1. "NIST/SEMATECH e-Handbook of Statistical Methods". NIST. Retrieved 2010-05-23.
  2. Oppenheim, Alan V.; Schafer, Ronald W. (1975). Digital Signal Processing. Prentice Hall. p. 5. ISBN 0-13-214635-5.
  3. Brown, Robert G. (1956). Exponential Smoothing for Predicting Demand. Cambridge, Massachusetts: Arthur D. Little Inc. p. 15.
  4. Holt, Charles C. (1957). "Forecasting Trends and Seasonal by Exponentially Weighted Averages". Office of Naval Research Memorandum 52. Reprinted in Holt, Charles C. (January–March 2004). "Forecasting Trends and Seasonal by Exponentially Weighted Averages". International Journal of Forecasting 20 (1): 5–10. doi:10.1016/j.ijforecast.2003.09.015.
  5. Brown, Robert Goodell (1963). Smoothing Forecasting and Prediction of Discrete Time Series. Englewood Cliffs, NJ: Prentice-Hall.
  6. "NIST/SEMATECH e-Handbook of Statistical Methods, 6.4.3.1. Single Exponential Smoothing". NIST. Retrieved 2017-07-05.
  7. Nau, Robert. "Averaging and Exponential Smoothing Models". Retrieved 26 July 2010.
  8. "Production and Operations Analysis" Nahmias. 2009.
  9. Čisar, P., & Čisar, S. M. (2011). "Optimization methods of EWMA statistics." Acta Polytechnica Hungarica, 8(5), 73–87. Page 78.
  10. 7.1 Simple exponential smoothing | Forecasting: Principles and Practice.
  11. Nahmias, Steven (3 March 2008). Production and Operations Analysis (6th ed.). ISBN 0-07-337785-6.
  12. "Model: Second-Order Exponential Smoothing". SAP AG. Retrieved 23 January 2013.
  13. "6.4.3.3. Double Exponential Smoothing". itl.nist.gov. Retrieved 25 September 2011.
  14. "Averaging and Exponential Smoothing Models". duke.edu. Retrieved 25 September 2011.
  15. Kalehar, Prajakta S. "Time series Forecasting using Holt–Winters Exponential Smoothing" (PDF). Retrieved 23 June 2014.
  16. Winters, P. R. (April 1960). "Forecasting Sales by Exponentially Weighted Moving Averages". Management Science 6 (3): 324–342. doi:10.1287/mnsc.6.3.324.

Wikipedia References