Statistical Inference I (Unit 2: Cramer-Rao Inequality, Method of Moments, Maximum Likelihood Estimator)
I. Introduction
We have seen that two distinct unbiased estimators of θ can be combined to give infinitely many unbiased estimators of θ; among these, we look for the best estimator of θ by comparing their variances or mean square errors. In some problems, however, several natural estimators are available, as in the following case.
For Normal distribution:
If X1, X2, ..., Xn is a random sample from a normal distribution with mean 𝛍 and variance 𝛔², then T1 = x̄ (the sample mean) and T2 = the sample median are both unbiased estimators of 𝛍. Checking sufficiency, T1 is a sufficient estimator for 𝛍, and hence it is the best estimator of 𝛍.
Thus, to find the best estimator, we check whether an estimator is sufficient or not.
We are also interested in how small the variance of the best estimator can be; this question is taken up in this article.
II. Properties of Probability Mass Function (p.m.f.) or Probability Density Function (p.d.f)
If X1, X2, ..., Xn is a random sample from a distribution with p.d.f. or p.m.f. f(x, θ), θ ∈ Θ, then the following properties hold:
1. \( \int_{-\infty}^{\infty} f(x, \theta) \, dx = 1 \)
Proof: Since \( f(x, \theta) \) is a p.d.f. (for a p.m.f., replace integrals by sums throughout), the total probability is one, i.e. \( \int_{-\infty}^{\infty} f(x, \theta) \, dx = 1 \). ... (1)
2. \( \frac{\partial}{\partial\theta} \int_{-\infty}^{\infty} f(x, \theta) \, dx = 0 \)
Proof: Differentiating both sides of (1) with respect to \(\theta\) gives \( \frac{\partial}{\partial\theta} \int_{-\infty}^{\infty} f(x, \theta) \, dx = \frac{\partial}{\partial\theta}(1) = 0 \). ... (2)
3. \( E\left[\frac{\partial}{\partial\theta} \log f(x, \theta)\right] = 0 \)
Proof: Interchanging differentiation and integration in (2),
\[ 0 = \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(x, \theta)\, dx = \int_{-\infty}^{\infty} \frac{1}{f(x,\theta)}\,\frac{\partial}{\partial\theta} f(x, \theta)\, f(x,\theta)\, dx = \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} \log f(x, \theta)\, f(x,\theta)\, dx = E\left[\frac{\partial}{\partial\theta} \log f(x, \theta)\right]. \quad (3) \]
4. \( E\left[\frac{\partial^2}{\partial\theta^2} \log f(x, \theta)\right] = -E\left[\left(\frac{\partial}{\partial\theta}\log f(x, \theta)\right)^2\right] \)
Proof: Differentiating \( \int \frac{\partial}{\partial\theta} \log f(x, \theta)\, f(x,\theta)\, dx = 0 \) from (3) with respect to \(\theta\),
\[ \int \frac{\partial^2}{\partial\theta^2} \log f(x, \theta)\, f(x,\theta)\, dx + \int \frac{\partial}{\partial\theta} \log f(x, \theta)\,\frac{\partial}{\partial\theta} f(x,\theta)\, dx = 0 \]
Writing \( \frac{\partial}{\partial\theta} f(x,\theta) = \frac{\partial}{\partial\theta}\log f(x,\theta)\, f(x,\theta) \) in the second integral,
\[ E\left[\frac{\partial^2}{\partial\theta^2} \log f(x, \theta)\right] + E\left[\left(\frac{\partial}{\partial\theta}\log f(x, \theta)\right)^2\right] = 0, \]
which gives the result. ... (4)
5. \( \text{Var}\left[\frac{\partial}{\partial\theta} \log f(x, \theta)\right] = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(x, \theta)\right] = E\left[\left(\frac{\partial}{\partial\theta} \log f(x, \theta)\right)^2\right] \)
Proof: We have
\[ \text{Var}\left[\frac{\partial}{\partial\theta} \log f(x, \theta)\right] = E\left[\left(\frac{\partial}{\partial\theta} \log f(x, \theta)\right)^2\right] - \left(E\left[\frac{\partial}{\partial\theta} \log f(x, \theta)\right]\right)^2 \]
\[ = E\left[\left(\frac{\partial}{\partial\theta} \log f(x, \theta)\right)^2\right] - 0 \quad \text{from (3)} \]
\[ = E\left[\left(\frac{\partial}{\partial\theta} \log f(x, \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(x, \theta)\right] \quad \text{from (4)} \]
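These properties can also be checked numerically. The following is a minimal Monte Carlo sketch (an added illustration, not part of the original notes) using NumPy and, as an assumed example, the exponential density \(f(x,\theta) = \frac{1}{\theta}e^{-x/\theta}\) with \(\theta = 2\); it confirms that the average score is close to 0 (property 3) and that the three expressions in property 5 agree.

```python
# Minimal Monte Carlo sketch (assumed example): check properties 3-5 for a single
# observation from f(x, theta) = (1/theta) * exp(-x/theta), whose score is
# d/dtheta log f = -1/theta + x/theta**2.
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
x = rng.exponential(scale=theta, size=1_000_000)   # draws from f(x, theta)

score = -1.0 / theta + x / theta**2                 # d/dtheta log f(x, theta)
second = 1.0 / theta**2 - 2.0 * x / theta**3        # d^2/dtheta^2 log f(x, theta)

print(score.mean())       # property 3: approximately 0
print(score.var())        # Var of the score, approximately 1/theta**2 = 0.25
print((score**2).mean())  # E[(score)^2], approximately 0.25
print(-second.mean())     # -E[d^2/dtheta^2 log f], approximately 0.25
```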
Fisher Information Function:
Definition:
- The Fisher information measure (or the amount of information) about the parameter \(\theta\) contained in a single random variable \(X\) with probability function \(f(x,\theta)\) is denoted by \(I(\theta)\) and is defined as
\[ I(\theta) = \text{Var}\left[\frac{\partial}{\partial\theta} \log f(x, \theta)\right] = E\left[\left(\frac{\partial}{\partial\theta} \log f(x, \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(x, \theta)\right] \]
- The Fisher information measure (or the amount of information) about the parameter \(\theta\) contained in a random sample \(X_1, X_2, \ldots, X_n\) of size \(n\) with likelihood function \(L(\theta)\) is denoted by \(I_n(\theta)\) and is defined as
\[ I_n(\theta) = \text{Var}\left[\frac{\partial}{\partial\theta} \log L(\theta)\right] = E\left[\left(\frac{\partial}{\partial\theta} \log L(\theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2} \log L(\theta)\right] \]
- Let \(X_1, X_2, \ldots, X_n\) be a random sample from the distribution of a random variable \(X\), let \(T = T(X_1, X_2, \ldots, X_n)\) be any statistic for the parameter \(\theta\), and let \(g(t, \theta)\) be its probability function. Then the Fisher information function (or the amount of information) about the parameter \(\theta\) contained in the statistic \(T\) is denoted by \(I_T(\theta)\) and is given as
\[ I_T(\theta) = \text{Var}\left[\frac{\partial}{\partial\theta} \log g(t, \theta)\right] = E\left[\left(\frac{\partial}{\partial\theta} \log g(t, \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2} \log g(t, \theta)\right] \]
Properties of Fisher Information Function
Result 1:
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the distribution \(f(x,\theta)\), then \(I_n(\theta) = n I(\theta)\).
Proof:
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the distribution \(f(x,\theta)\), \(\theta \in \Theta\).
The Fisher Information Measure (or the amount of information) about parameter \(\theta\) obtained in random variable \(x\) is denoted as \(I(\theta)\) and is defined as:
\[ I(\theta) = \text{Var}\left[\frac{\partial}{\partial \theta} \log f(x,\theta)\right] = \text{E}\left[\left(\frac{\partial}{\partial \theta} \log f(x,\theta)\right)^2\right] = -\text{E}\left[\frac{\partial^2}{\partial \theta^2} \log f(x,\theta)\right] \]
Now consider the joint probability function, i.e. the likelihood function, of the random sample \(X_1, X_2, \ldots, X_n\):
\[ L(\theta) = f(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i,\theta) \]
Taking the logarithm on both sides:
\[ \log L(\theta) = \log\left(\prod_{i=1}^{n} f(x_i,\theta)\right) \]
\[ \log L(\theta) = \sum_{i=1}^{n} \log f(x_i,\theta) \]
Differentiating with respect to \(\theta\):
\[ \frac{\partial}{\partial \theta} \log L(\theta) = \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(x_i,\theta) \quad \text{(i)} \]
The Fisher information measure (or the amount of information) about the parameter \(\theta\) contained in the random sample \(X_1, X_2, \ldots, X_n\) is denoted by \(I_n(\theta)\) and is defined as:
\[ I_n(\theta) = \text{Var}\left[\frac{\partial}{\partial \theta} \log L(\theta)\right] = E\left[\left(\frac{\partial}{\partial \theta} \log L(\theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial \theta^2} \log L(\theta)\right] \]
Since \(X_1, X_2, \ldots, X_n\) are independent and identically distributed, taking the variance of both sides of (i) gives:
\[ I_n(\theta) = \text{Var}\left[\sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(x_i,\theta)\right] = \sum_{i=1}^{n} \text{Var}\left[\frac{\partial}{\partial \theta} \log f(x_i,\theta)\right] = n I(\theta) \]
So \(I_n(\theta) = n I(\theta)\); hence proved.
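As a quick worked instance of Result 1 (an added illustration, not from the original notes), consider a random sample from the Bernoulli distribution with parameter \(\theta\), \(f(x,\theta) = \theta^{x}(1-\theta)^{1-x}\), \(x \in \{0,1\}\). Then
\[ \frac{\partial}{\partial\theta}\log f(x,\theta) = \frac{x}{\theta} - \frac{1-x}{1-\theta}, \qquad \frac{\partial^2}{\partial\theta^2}\log f(x,\theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}, \]
so that
\[ I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\right] = \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} = \frac{1}{\theta(1-\theta)}, \qquad I_n(\theta) = nI(\theta) = \frac{n}{\theta(1-\theta)}. \]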
Result 2:
Show that for any statistic \(T\), \(I_n(\theta) \geq I_T(\theta)\).
Proof:
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the distribution of a random variable \(X\), let \(T = T(X_1, X_2, \ldots, X_n)\) be any statistic for the parameter \(\theta\), and let \(g(t,\theta)\) be the probability function of \(T\).
The Fisher information function (or the amount of information) about parameter \(\theta\) contained in statistic \(T\) is given by \(I_T(\theta)\) and is given as:
\[ I_T(\theta) = \text{Var}\left[\frac{\partial}{\partial \theta} \log g(t,\theta)\right] = \text{E}\left[\left(\frac{\partial}{\partial \theta} \log g(t,\theta)\right)^2\right] = -\text{E}\left[\frac{\partial^2}{\partial \theta^2} \log g(t,\theta)\right] \]
Now, consider the joint probability function or the likelihood function of random variables \(X_1, X_2, \ldots, X_n\).
\[ L(\theta) = f(X_1, X_2, \ldots, X_n,\theta) \]
\[ L(\theta) = g(t,\theta) \cdot h(x \mid t, \theta), \]
where \(g(t,\theta)\) is the probability function of \(T\) and \(h(x \mid t, \theta)\) is the conditional probability function of the sample given \(T = t\).
Taking the logarithm of \(L(\theta)\):
\[ \log L(\theta) = \log g(t,\theta) + \log h(x \mid t, \theta) \]
Differentiating with respect to \(\theta\):
\[ \frac{\partial}{\partial \theta} \log L(\theta) = \frac{\partial}{\partial \theta} \log g(t,\theta) + \frac{\partial}{\partial \theta} \log h(x \mid t, \theta) \quad \text{(ii)} \]
By property 3 applied to the conditional distribution, \(E\left[\frac{\partial}{\partial \theta} \log h(x \mid t, \theta) \,\middle|\, T = t\right] = 0\), so the two terms on the right-hand side of (ii) are uncorrelated. Taking the variance of both sides of (ii):
\[ \text{Var}\left[\frac{\partial}{\partial \theta} \log L(\theta)\right] = \text{Var}\left[\frac{\partial}{\partial \theta} \log g(t,\theta)\right] + \text{Var}\left[\frac{\partial}{\partial \theta} \log h(x \mid t, \theta)\right] \geq \text{Var}\left[\frac{\partial}{\partial \theta} \log g(t,\theta)\right] \]
\[ I_n(\theta) \geq I_T(\theta) \]
Remark:
If \(T\) is a sufficient statistic for \(\theta\), then by the factorization theorem \(h(x \mid t, \theta) = h(x)\) does not depend on \(\theta\); the second variance term above vanishes, and \(I_n(\theta) = I_T(\theta)\).
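As a simple illustration of the inequality (an added note, not from the original notes), take \(T = X_1\), the first observation alone. Its probability function is \(g(t,\theta) = f(t,\theta)\), so
\[ I_T(\theta) = I(\theta) \leq n\,I(\theta) = I_n(\theta), \]
with strict inequality whenever \(n > 1\) and \(I(\theta) > 0\): a single observation is typically not sufficient and carries only a fraction \(1/n\) of the information in the whole sample.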
Example 1: Fisher Information Function for Exponential Distribution
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the exponential distribution with mean \(\theta\) (rate parameter \(\frac{1}{\theta}\)). The probability density function is:
\[ f(x,\theta) = \begin{cases} \frac{1}{\theta} e^{-\frac{x}{\theta}}, & x \geq 0, \theta > 0 \\ 0, & \text{otherwise} \end{cases} \]
Likelihood Function
The likelihood function of the sample \(X_1, X_2, \ldots, X_n\) is given by:
\[ L(\theta) = \prod f(x,\theta) = \left(\frac{1}{\theta}\right)^n e^{-\sum \frac{x}{\theta}} \]
Taking the logarithm on both sides:
\[ \log L(\theta) = -n \log(\theta) - \sum \frac{x}{\theta} \]
Derivative with Respect to \(\theta\)
Differentiating with respect to \(\theta\):
\[ \frac{\partial}{\partial \theta} \log L(\theta) = -\frac{n}{\theta} + \frac{\sum x}{\theta^2} \]
Second derivative with respect to \(\theta\):
\[ \frac{\partial^2}{\partial \theta^2} \log L(\theta) = \frac{n}{\theta^2} - \frac{2 \sum x}{\theta^3} \]
Fisher Information Function
By definition of the Fisher information function:
\[ I_n(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2} \log L(\theta)\right] = -E\left[\frac{n}{\theta^2} - \frac{2 \sum x}{\theta^3}\right] = -\frac{n}{\theta^2} + \frac{2E(\sum x)}{\theta^3} \]
Since \(E(x) = \theta\) for each observation, \(E(\sum x) = n\theta\), and hence:
\[ I_n(\theta) = -\frac{n}{\theta^2} + \frac{2n\theta}{\theta^3} = -\frac{n}{\theta^2} + \frac{2n}{\theta^2} = \frac{n}{\theta^2}\]
Answer:
So, \(I_n(\theta) = \frac{n}{\theta^2} \).
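To connect this example with the remark after Result 2 (an added check, not part of the original notes): the statistic \(T = \sum X_i\) is sufficient for \(\theta\) and follows a Gamma distribution with shape parameter \(n\) and scale \(\theta\), so
\[ g(t,\theta) = \frac{t^{n-1}e^{-t/\theta}}{\Gamma(n)\,\theta^{n}},\quad t>0, \qquad \log g(t,\theta) = (n-1)\log t - \frac{t}{\theta} - \log\Gamma(n) - n\log\theta, \]
\[ \frac{\partial}{\partial\theta}\log g(t,\theta) = \frac{t}{\theta^{2}} - \frac{n}{\theta}, \qquad \frac{\partial^{2}}{\partial\theta^{2}}\log g(t,\theta) = -\frac{2t}{\theta^{3}} + \frac{n}{\theta^{2}}, \]
\[ I_T(\theta) = -E\left[\frac{\partial^{2}}{\partial\theta^{2}}\log g(t,\theta)\right] = \frac{2E(t)}{\theta^{3}} - \frac{n}{\theta^{2}} = \frac{2n\theta}{\theta^{3}} - \frac{n}{\theta^{2}} = \frac{n}{\theta^{2}} = I_n(\theta), \]
so the sufficient statistic retains all of the information in the sample.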
Example 2: Fisher Information Function for Poisson Distribution
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the Poisson distribution with parameter \(\lambda\). The probability mass function is:
\[ f(x,\lambda) = \begin{cases} \frac{e^{-\lambda} \lambda^x}{x!}, & x = 0, 1, 2, \ldots;\ \lambda > 0 \\ 0, & \text{otherwise} \end{cases} \]
Likelihood Function
The likelihood function is given by:
\[ L(\lambda) = \prod f(x,\lambda) = \frac{e^{-n\lambda} \lambda^{\sum x}}{\prod x!} \]
Taking the logarithm on both sides:
\[ \log L(\lambda) = -n\lambda + \sum x \log \lambda - \sum \log (x!) \]
Derivative with Respect to \(\lambda\)
Differentiating with respect to \(\lambda\):
\[ \frac{\partial}{\partial \lambda} \log L(\lambda) = -n + \frac{\sum x}{\lambda} \]
Second derivative with respect to \(\lambda\):
\[ \frac{\partial^2}{\partial \lambda^2} \log L(\lambda) = -\frac{\sum x}{\lambda^2} \]
Fisher Information Function
By definition of the Fisher information function:
\[ I_n(\lambda) = -E\left[\frac{\partial^2}{\partial \lambda^2} \log L(\lambda)\right] = -E\left[-\frac{\sum x}{\lambda^2}\right] = \frac{E(\sum x)}{\lambda^2} \]
Since \(E(\sum x) = n\lambda\), we have:
\[ I_n(\lambda) = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda} \]
Answer:
So, \(I_n(\lambda) = \frac{n}{\lambda}\).
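This closed form can also be checked by simulation. The following is a minimal Monte Carlo sketch (an added illustration, not part of the original notes) using NumPy; the values \(n = 20\), \(\lambda = 3\) and the number of replications are assumptions chosen only for the demonstration.

```python
# Minimal Monte Carlo sketch (assumed example): estimate I_n(lambda) for a Poisson
# sample in two ways and compare both with the closed form n/lambda.
import numpy as np

rng = np.random.default_rng(1)
n, lam, reps = 20, 3.0, 200_000

samples = rng.poisson(lam, size=(reps, n))   # reps independent samples of size n
s = samples.sum(axis=1)                      # sum(x) for each sample

score = -n + s / lam                         # d/dlambda log L(lambda)
neg_hessian = s / lam**2                     # -d^2/dlambda^2 log L(lambda)

print((score**2).mean())   # E[(d/dlambda log L)^2], approximately n/lam = 6.67
print(neg_hessian.mean())  # -E[d^2/dlambda^2 log L], approximately n/lam = 6.67
print(n / lam)             # closed-form I_n(lambda)
```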
Example 3: Fisher Information Function for Normal Distribution
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and variance \(\sigma^2\). The probability density function is:
\[ f(x, \theta) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}, \quad -\infty < x < \infty,\ -\infty < \mu < \infty,\ \sigma^2 > 0 \]
We compute \(I(\mu)\), \(I(\sigma)\) and \(I(\sigma^2)\).
1. \(I(\mu)\)
We have:
\[ \begin{align*} \log f(x, \theta) & = \log \left(\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\right) \\ & = -\log(\sqrt{2\pi}) - \log(\sigma) - \frac{1}{2\sigma^2}(x-\mu)^2 \quad \text{(i)} \end{align*} \]
Differentiating with respect to \(\mu\):
\[ \begin{align*} \frac{\partial}{\partial \mu} \log f(x, \theta) & = \frac{\partial}{\partial \mu} \left(-\log(\sqrt{2\pi}) - \log(\sigma) - \frac{1}{2\sigma^2}(x-\mu)^2\right) \\ & = 0 - 0 - \frac{2}{2\sigma^2}(x-\mu)(-1) \\ & = \frac{x-\mu}{\sigma^2} \end{align*} \]
Second derivative with respect to \(\mu\):
\[ \frac{\partial^2}{\partial \mu^2} \log f(x, \theta) = -\frac{1}{\sigma^2} \]
So, \(I(\mu) = -E\left[\frac{\partial^2}{\partial \mu^2} \log f(x, \theta)\right] = -E\left[-\frac{1}{\sigma^2}\right] = \frac{1}{\sigma^2}\).
2. \(I(\sigma)\)
Differentiating equation (i) with respect to \(\sigma\):
\[ \begin{align*} \frac{\partial}{\partial \sigma} \log f(x, \theta) & = \frac{\partial}{\partial \sigma} \left(-\log(\sqrt{2\pi}) - \log(\sigma) - \frac{1}{2\sigma^2}(x-\mu)^2\right) \\ & = 0 - \frac{1}{\sigma} - \frac{-2}{2\sigma^3}(x-\mu)^2 \\ & = \frac{-1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3} \end{align*} \]
Second derivative with respect to \(\sigma\):
\[ \frac{\partial^2}{\partial \sigma^2} \log f(x, \theta) = \frac{1}{\sigma^2} - \frac{3(x-\mu)^2}{\sigma^4} \]
So, since \(E[(x-\mu)^2] = \sigma^2\), \(I(\sigma) = -E\left[\frac{\partial^2}{\partial \sigma^2} \log f(x, \theta)\right] = -E\left[\frac{1}{\sigma^2} - \frac{3(x-\mu)^2}{\sigma^4}\right] = -\frac{1}{\sigma^2} + \frac{3\sigma^2}{\sigma^4} = \frac{2}{\sigma^2}\).
3. \(I(\sigma^2)\)
Let \(\theta = \sigma^2\), and rewrite equation (i):
\[ \log f(x, \theta) = -\log(\sqrt{2\pi}) - \frac{1}{2}\log(\theta) - \frac{1}{2\theta}(x-\mu)^2 \]
Differentiating with respect to \(\theta\):
\[ \begin{align*} \frac{\partial}{\partial \theta} \log f(x, \theta) & = \frac{\partial}{\partial \theta} \left(-\log(\sqrt{2\pi}) - \frac{1}{2}\log(\theta) - \frac{1}{2\theta}(x-\mu)^2\right) \\ & = 0 - \frac{1}{2\theta} - \frac{-1}{2\theta^2}(x-\mu)^2 \\ & = \frac{1}{2\theta} - \frac{(x-\mu)^2}{2\theta^2} \end{align*} \]
Second derivative with respect to \(\theta\):
\[ \frac{\partial^2}{\partial \theta^2} \log f(x, \theta) = \frac{1}{2\theta^2} - \frac{(x-\mu)^2}{\theta^3} \]
So, since \(E[(x-\mu)^2] = \sigma^2 = \theta\), \(I(\sigma^2) = -E\left[\frac{\partial^2}{\partial \theta^2} \log f(x, \theta)\right] = -E\left[\frac{1}{2\theta^2} - \frac{(x-\mu)^2}{\theta^3}\right] = -\frac{1}{2\theta^2} + \frac{1}{\theta^2} = \frac{1}{2\theta^2} = \frac{1}{2\sigma^4}\).
\[ I(\mu) = \frac{1}{\sigma^2} \]
\[ I(\sigma) = \frac{2}{\sigma^2} \]
\[ I(\sigma^2) = \frac{1}{2\sigma^4} \]
Cramer-Rao Inequality
Regularity Conditions:
- The parameter space \(\Theta\) is an open interval.
- The support or range of the distribution is independent of \(\theta\).
- For every \(x\) and \(\theta\), \(\frac{\partial}{\partial\theta} f(x,\theta)\) and \(\frac{\partial^2}{\partial\theta^2} f(x,\theta)\) exist and are finite.
- The statistic \(T\) has finite mean and variance.
- Differentiation and integration are interchangeable, i.e., \(\frac{\partial}{\partial\theta} \int T\, L(x,\theta) \, dx = \int T\, \frac{\partial}{\partial\theta} L(x,\theta) \, dx\).
Cramer-Rao Inequality Statement:
Let \(X_1, X_2, \ldots, X_n\) be a random sample from any probability density function (p.d.f.) or probability mass function (p.m.f.) \(f(x, \theta)\), where \(\theta \in \Theta\). If \(T = T(X_1, X_2, \ldots, X_n)\) is an unbiased estimator of \(\phi(\theta)\), then under the regularity conditions above:
\[ \text{Var}(T) \geq \frac{(\frac{\partial}{\partial\theta}\phi(\theta))^2}{I_n(\theta)} \quad \text{or} \quad \text{Var}(T) \geq \frac{(\phi'(\theta))^2}{nI(\theta)} \]
Proof:
Let \(x\) be a random variable following the p.d.f. or p.m.f. \(f(x,\theta)\), \(\theta \in \Theta\), and \(L(\theta)\) is the likelihood function of a random sample \(X_1, X_2, \ldots, X_n\) from the distribution. Then:
\[L(\theta) = f(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i, \theta)\]
\(\int L(\theta) \, dx = 1 \quad \text{(i)}\)
Differentiating (i) with respect to \(\theta\) and interchanging differentiation and integration (permissible under the regularity conditions):
\(\frac{\partial}{\partial\theta} \int L(\theta) \, dx = 0\)
\(\int \frac{\partial}{\partial\theta} L(\theta) \, dx = 0\)
\(\int \frac{1}{L} \cdot \frac{\partial}{\partial\theta} L(\theta) \cdot L \, dx = 0\)
\(\int \frac{\partial}{\partial\theta} \log L(\theta) \cdot L \, dx = 0\)
\[E\left[\frac{\partial}{\partial\theta}\log L(\theta)\right] = 0 \quad \text{(ii)}\]
And we know that \(T\) is an unbiased estimator of \(\phi(\theta)\), such that:
\[E(T(X)) = \phi(\theta)\]
\(\int T \cdot L(\theta) \, dx = \phi(\theta)\)
Now, differentiating with respect to \(\theta\):
\(\frac{\partial}{\partial\theta} \int T \cdot L(\theta) \, dx = \frac{\partial}{\partial\theta} \phi(\theta)\)
Since \(T\) does not depend on \(\theta\), interchanging differentiation and integration gives:
\(\int T \cdot \frac{\partial}{\partial\theta} L(\theta) \, dx = \phi'(\theta)\)
\(\int T \cdot \frac{1}{L} \cdot \frac{\partial}{\partial\theta} L(\theta) \cdot L \, dx = \phi'(\theta)\)
\(\int T \cdot \frac{\partial}{\partial\theta} \log L(\theta) \cdot L \, dx = \phi'(\theta)\)
\[E\left[T \cdot \frac{\partial}{\partial\theta} \log L(\theta)\right] = \phi'(\theta) \quad \text{(iii)}\]
And we have:
\[\text{Cov}\left(\frac{\partial}{\partial\theta} \log L(\theta),\, T\right) = E\left[T \cdot \frac{\partial}{\partial\theta} \log L(\theta)\right] - E\left[\frac{\partial}{\partial\theta} \log L(\theta)\right] \cdot E(T) = \phi'(\theta) - 0 \cdot E(T) \quad \text{from (ii) and (iii)}\]
\[\text{Cov}\left(\frac{\partial}{\partial\theta} \log L(\theta),\, T\right) = \phi'(\theta) \quad \text{(iv)}\]
By the Cauchy-Schwarz inequality for covariance, we have:
\[\left(\text{Cov}\left(\frac{\partial}{\partial\theta} \log L(\theta),\, T\right)\right)^2 \leq \text{Var}\left(\frac{\partial}{\partial\theta} \log L(\theta)\right) \cdot \text{Var}(T)\]
Using (iv) and \(\text{Var}\left[\frac{\partial}{\partial\theta} \log L(\theta)\right] = I_n(\theta)\):
\[\left(\phi'(\theta)\right)^2 \leq I_n(\theta) \cdot \text{Var}(T)\]
\[\text{Var}(T) \geq \frac{\left(\phi'(\theta)\right)^2}{I_n(\theta)}\]
This is the lower bound given by Cramer-Rao inequality, known as Cramer-Rao Lower Bound.
Remark: If \(T\) is an unbiased estimator of \(\theta\) itself, then \(\phi(\theta) = \theta\) and \(\phi'(\theta) = 1\). In this case, the Cramer-Rao lower bound is:
\[\text{Var}(T) \geq \frac{1}{I_n(\theta)}\]
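As a numerical illustration of the bound (an added sketch, not part of the original notes), consider the exponential distribution of Example 1: the sample mean \(\bar{x}\) is unbiased for \(\theta\), and its variance \(\theta^2/n\) equals the Cramer-Rao lower bound \(1/I_n(\theta) = \theta^2/n\), so the bound is attained. The NumPy simulation below uses the assumed values \(n = 25\) and \(\theta = 2\).

```python
# Minimal simulation sketch (assumed example): the sample mean of an exponential
# sample is unbiased for theta and its variance is close to the Cramer-Rao lower
# bound 1/I_n(theta) = theta**2/n, which it attains exactly.
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 25, 2.0, 200_000

samples = rng.exponential(scale=theta, size=(reps, n))
xbar = samples.mean(axis=1)        # unbiased estimator T = sample mean

crlb = theta**2 / n                # 1 / I_n(theta), with I_n(theta) = n/theta**2
print(xbar.mean())                 # approximately theta = 2.0 (unbiasedness)
print(xbar.var())                  # approximately crlb = 0.16
print(crlb)
```

In general the Cramer-Rao lower bound need not be attained by any unbiased estimator; this example is one of the cases where it is.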