## It’s A Math, Math World (CLT and CIs)

Today’s blog post is really targeted as a “Statistics 101” for the early statistics student or non-statistician. Last week, we summarized basic probability and the normal distribution. This week we look at inferential statistics. Let me know your opinions!

We start with one of the main topics of this post, the Central Limit Theorem (CLT):

Suppose we have a population of 10,000 measurements. We take a random sample of 500 of these measurements and calculate its mean, ybar_1. We then repeat this process 299 more times to get a total of 300 separate estimates of the mean, which we call ybar_1, ybar_2, …, ybar_300. We can treat these 300 sample means as a sample in its own right, with its own mean and variance. We can calculate these 2 quantities and look at the *distribution* of the sample mean. This is a sampling distribution.

*The variability of random samples from the same population is called sampling variability. A probability distribution that characterizes some aspects of sampling variability is a sampling distribution. We make the following conclusions:*

- The mean of the sampling distribution (u_x) is equal to the population mean (**u**).
- The standard deviation of the sampling distribution, or the *standard error* of the mean, is equal to the population standard deviation (sigma) divided by the square root of the size of each subsample (in this case n=500).

i.e., Standard Error (SE) = sigma/sqrt(500)
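The two conclusions above can be checked with a small simulation. This is only a sketch: the exponential population, the seed, and the helper name `draw_sample_means` are assumptions made for illustration, not part of the original example. Because each sample is drawn without replacement from a finite population, the observed spread tends to come out slightly below sigma/sqrt(500).

```python
import math
import random
import statistics

# Hypothetical population: 10,000 right-skewed measurements
# (exponential with mean about 2). Any population would do.
rng = random.Random(42)
population = [rng.expovariate(0.5) for _ in range(10_000)]


def draw_sample_means(population, sample_size, n_samples, rng):
    """Return the mean of each of n_samples random samples.

    Each sample is drawn without replacement from the population.
    """
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


sample_means = draw_sample_means(population, sample_size=500, n_samples=300, rng=rng)

mu = statistics.fmean(population)       # population mean, u
sigma = statistics.pstdev(population)   # population standard deviation, sigma
se_theory = sigma / math.sqrt(500)      # SE = sigma / sqrt(n)

mean_of_means = statistics.fmean(sample_means)  # mean of the sampling distribution
sd_of_means = statistics.stdev(sample_means)    # observed standard error

print(f"population mean u       : {mu:.4f}")
print(f"mean of 300 sample means: {mean_of_means:.4f}")
print(f"sigma / sqrt(500)       : {se_theory:.4f}")
print(f"SD of 300 sample means  : {sd_of_means:.4f}")
```

The first two printed numbers should be close to each other, and so should the last two.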

If we *standardize* the sample mean (for instance ybar_1), then

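(ybar_1 – u_x)/SE → *N(0,1), which is standard normal,*

*as the sample size gets very large, for any population distribution with finite variance (assuming independent observations). Thus we can use the standard normal table to approximate probabilities for the sample mean, whatever the distribution of the underlying data.*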
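To see the standardization at work, here is a minimal sketch that draws samples from a clearly non-normal distribution and standardizes each sample mean. The exponential distribution, the seed, and the number of repeats are assumptions chosen for illustration. If the normal approximation is reasonable, about 95% of the standardized means should fall within ±1.96, the value the standard normal table gives for 95%.

```python
import math
import random
import statistics

rng = random.Random(7)

# Assumed data-generating distribution: exponential with rate 1,
# which is strongly skewed and has mean 1 and standard deviation 1.
MU, SIGMA = 1.0, 1.0
N = 500          # observations per sample
REPEATS = 2000   # number of samples drawn

se = SIGMA / math.sqrt(N)  # standard error of the mean

z_scores = []
for _ in range(REPEATS):
    sample = [rng.expovariate(1.0) for _ in range(N)]
    ybar = statistics.fmean(sample)
    z_scores.append((ybar - MU) / se)  # standardized sample mean

# Share of standardized means inside the central 95% of N(0, 1).
within = sum(abs(z) < 1.96 for z in z_scores) / REPEATS

print(f"mean of z-scores    : {statistics.fmean(z_scores):.3f}")
print(f"SD of z-scores      : {statistics.stdev(z_scores):.3f}")
print(f"share with |z|<1.96 : {within:.3f}")
```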

*We will use this concept in more depth when we look at the next 2 ideas.*

The process of drawing conclusions about the population, based on observations in the sample, is known as *statistical inference*. To make these conclusions, we have to consider how likely it is that the sample is representative of the population, that is, that it closely resembles it.

These decisions can be divided into 2 categories:

- Confidence Intervals (to be examined next)
- Hypothesis Tests (we will look at next time)

A confidence interval is based on probability, which, in simplest terms, is the likelihood of something occurring when a process is repeated a great number of times. Equivalently, it is a long-run relative frequency, often expressed as a percentage. Probabilities have some very intuitive properties, and statistical inference is built on them. Without getting into a big dissertation on probability theory (see last week’s blog post), we can say that every confidence interval has a degree of uncertainty, usually set at 0.05.

The principle underlying a confidence interval is that we want to build a continuous range of values (where the parameter may lie) such that, with repeated sampling, the range contains the parameter 1 - 0.05 = 0.95, i.e., 95%, of the time.

Let’s take an example:

Consider the population mean which we will call **u**

We have a collection of sample data represented by y1, y2, y3,…, y1000

We calculate the sample mean which is ybar

Now, ybar is probably not exactly equal to **u**, but we hope it is close. The *standard error of the mean, SE,* measures how far ybar tends to be from **u**.

The confidence interval: (ybar – 2*SE, ybar + 2*SE) is an approximate 95% confidence interval for **u**. Don’t worry about where the 2 comes from. It is a constant determined from a probability table for the Standard Normal distribution (1.96), rounded up to 2 for our purposes. The interpretation of the CI is as follows:

*If we repeatedly draw samples and build a CI from each one, the true population mean will be contained in about 95% of these CIs.*

*The error rate of 5% means that the true population mean will not be contained in the remaining 5% of CIs when the sampling is repeated a large number of times.*
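The repeated-sampling interpretation can be sketched in code. The setup is assumed for illustration: a normal population with a known true mean of 50, samples of 200, and an SE estimated from each sample as s/sqrt(n), since sigma is rarely known in practice. The helper `approx_ci` is a hypothetical name, not a library function.

```python
import math
import random
import statistics


def approx_ci(sample, multiplier=2.0):
    """Return (ybar - multiplier*SE, ybar + multiplier*SE) for one sample.

    SE is estimated from the sample itself as s / sqrt(n).
    """
    n = len(sample)
    ybar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return ybar - multiplier * se, ybar + multiplier * se


rng = random.Random(2024)
TRUE_MEAN, TRUE_SD = 50.0, 10.0   # assumed population parameters
N, REPEATS = 200, 2000            # sample size and number of repeated samples

hits = 0
for _ in range(REPEATS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    low, high = approx_ci(sample)
    if low < TRUE_MEAN < high:    # did this CI capture the true mean?
        hits += 1

coverage = hits / REPEATS
print(f"share of CIs containing the true mean: {coverage:.3f}")
```

The printed share should land near 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure over many repetitions.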

Example: Confidence Interval for the difference **u1 – u2** of 2 population means.

CI: (ybar_1 – ybar_2) ± 2*SE(ybar_1 – ybar_2)

*The interpretation is that, if the confidence interval includes the point 0, then the data are consistent with no difference between the means u1 and u2, i.e., u1 = u2.*

Examples:

(-1.023 < u1-u2 < 3.25)

Since the interval includes the point 0, we cannot conclude that there is a difference between the means u1 and u2; the data are consistent with **no difference**.

(2.14 < u1-u2 < 4.25)

Since this interval does not include the point 0, we conclude that the means **are not equal** to each other.
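The check for 0 can be sketched as follows. The post does not spell out how SE(ybar_1 – ybar_2) is computed, so this sketch assumes two independent samples and uses sqrt(s1²/n1 + s2²/n2). The function names and the toy data are hypothetical, and the samples are kept tiny only so the arithmetic is easy to follow by hand.

```python
import math
import statistics


def diff_ci(sample_1, sample_2, multiplier=2.0):
    """Approximate 95% CI for u1 - u2, assuming independent samples."""
    diff = statistics.fmean(sample_1) - statistics.fmean(sample_2)
    se = math.sqrt(
        statistics.variance(sample_1) / len(sample_1)
        + statistics.variance(sample_2) / len(sample_2)
    )
    return diff - multiplier * se, diff + multiplier * se


def contains_zero(interval):
    """True if 0 lies inside the interval (endpoints included)."""
    low, high = interval
    return low <= 0 <= high


# Toy data, made up for illustration.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [1.5, 2.5, 3.5, 4.5, 5.5]       # shifted by 0.5 from group_a
group_c = [11.0, 12.0, 13.0, 14.0, 15.0]  # shifted by 10 from group_a

ci_ab = diff_ci(group_a, group_b)
ci_ca = diff_ci(group_c, group_a)

print(ci_ab, "contains 0:", contains_zero(ci_ab))
print(ci_ca, "contains 0:", contains_zero(ci_ca))
```

With samples this small, the multiplier of 2 is only a rough stand-in; it is used here to match the formula in the post.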

Next week, we continue with an examination of hypothesis tests.


Err…I’m sure you know this, but the CLT does not hold for any distribution. Most simply, it demands independence, identical distributions for each sample, and finite variances. Each can be relaxed to some extent, but exactly how far is a messy research question. Even something as innocent as the ratio of two Gaussian distributed random variables does not obey the CLT (it’s Cauchy distributed).

Also, the rate of convergence to Gaussian tends to be slow in the tails, so depending on the CLT to take care of everything tends to go most astray when you’re dealing with crazy outliers…exactly when you most want the mathematical support.

That is pretty helpful. It provided me some ideas and I’ll be posting them on my website soon. I’m bookmarking your site and I’ll be back. Thank you again!

After searching Google I found your site. I think both are great and I am going to be coming back again to you and them in the long term. Thanks

Thank you for the support. Do you have any thoughts on what you might want to see covered here?

I loved this post it’s exactly what I was looking for.

Can I quote some words from this post?

Hi Michael…I got stuck solving a problem on estimation:

In a fund raising event, Sara Gordon hopes to get donations from 36% of 250 alumni of a college. According to past data, alumni donate 4% of their annual salary. Their average salary is $32,000 with a standard deviation of $9,600.

If her expectations are met (36% of alumni donate 4% of their annual salary), what is the probability that the donation would lie between $110,000 and $120,000?

Here, I found the Standard Error as SE = 9,600/sqrt(90) = $1,012

Now, move on to standardizing the normal variable

Z1 = (110000-115200)/1012

Z2 = (120000-115200)/1012

but couldn’t get the right answer… Where am I at fault? Kindly help.

Regards

PriyadarshiS
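One possible reading of the problem above, offered as a sketch rather than the textbook's intended solution: the $1,012 figure is the standard error of the *average salary* of 90 donors, while 115,200 is the mean of the *total donation*, so the two are on different scales. Assuming the 90 donors' salaries are independent with the stated mean and standard deviation, and that the total is approximately normal, the total donation would have mean 90 × 0.04 × 32,000 and standard deviation sqrt(90) × 0.04 × 9,600. The variable names below are made up for illustration.

```python
import math
from statistics import NormalDist

n_donors = round(0.36 * 250)      # 36% of 250 alumni -> 90 donors

# Each donor gives 4% of salary, so scale the salary mean and SD by 0.04.
mean_gift = 0.04 * 32_000         # 1,280 per donor
sd_gift = 0.04 * 9_600            # 384 per donor

# Total of n independent gifts: the mean scales by n, the SD by sqrt(n).
total_mean = n_donors * mean_gift
total_sd = math.sqrt(n_donors) * sd_gift

total = NormalDist(mu=total_mean, sigma=total_sd)

z1 = (110_000 - total_mean) / total_sd
z2 = (120_000 - total_mean) / total_sd
prob = total.cdf(120_000) - total.cdf(110_000)

print(f"mean of total donation: {total_mean:,.0f}")
print(f"SD of total donation  : {total_sd:,.0f}")
print(f"Z1 = {z1:.3f}, Z2 = {z2:.3f}")
print(f"P(110,000 < total < 120,000) = {prob:.3f}")
```

Under these assumptions the probability comes out at roughly 0.83. Working on the scale of the average gift instead (dividing both bounds by 90 and using 384/sqrt(90) as the SE) gives the same Z values.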