Understanding the significance of t-test:
These days, most people have come across the term machine learning, or machine learning models. Models that rest on statistical testing are often called statistical or econometric models. When implemented in software, these models train the computer to learn from data and produce an output based on the underlying statistical study. Let’s take a very simple example: a model that decides whether there is economic disparity. The decision is a simple ‘yes’ or ‘no’, so the model outputs only ‘0’ or ‘1’.
But how can we decide whether a given country has economic disparity? A common approach is to measure the average income. Before accepting average income as a predictor variable for economic disparity, let us first build a hypothesis to test whether average income is a statistically significant predictor.
Hypothesis and t-test
Defining a hypothesis and designing the data collection form the basis of any statistical analysis or machine learning model. In simple words, a hypothesis can be any empirical question. For example, to study unemployment, the hypothesis could be that a lower education rate causes unemployment, or that weaker economic growth leads to unemployment. Any hypothesis needs to be statistically tested for validity, and it is tested on a sample drawn from a given population. For a comparison of averages, this testing is often performed with a ‘t-test’.
Basics of Student’s t-test:
The Student’s t-test is commonly used because it is simple and works well in practice, and simplicity is the reason we use it here. Let us return to our first example, economic disparity, and run a t-test to find out whether there is an underlying disparity among the states of India. Economic disparity can be measured using either the average income or the median income of a state. We take average income as our prospective predictor variable for two states of India, Tamil Nadu and Bihar, and test the hypothesis that there is a significant difference between the average incomes of the two states.
Why Tamil Nadu and Bihar as sample populations?
Why Tamil Nadu and Bihar? Simply because the contrast between them makes the t-test easy to illustrate. The two states differ markedly in economic growth and economic development: Tamil Nadu is one of the economically developed states of India, while Bihar is an economically underdeveloped state. We therefore expect a significant difference between the average incomes of the two states. This is the hypothesis that we are going to test using the Student’s t-test.
The sample data that we need contains the average income, the standard deviation, and the number of observations (n) for each state. Below is a sample table with all the details that we will need to conduct the t-test.
Let us now define our hypothesis and perform t-test…
Null hypothesis: H0: µtn − µb = 0 (no economic disparity)
Alternative hypothesis: H1: µtn − µb ≠ 0 (some significant economic disparity)
µtn represents the average income of Tamil Nadu, and µb represents the average income of Bihar.
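The formula for the t-test:
$$t = \frac{(\bar{x}_{tn} - \bar{x}_{b}) - d_0}{SE}$$
Here $\bar{x}_{tn}$ and $\bar{x}_{b}$ are the sample average incomes of Tamil Nadu and Bihar, which serve as estimates of µtn and µb.
d0 represents the hypothesized value that we want to test. In our case it is 0, as we are testing whether ‘µtn − µb’ is 0 or not; in other words, whether there is an underlying pay gap between the two states.
SE represents the standard error of the difference between the two sample means.
Therefore, with d0 = 0, the formula reduces to:
$$t = \frac{\bar{x}_{tn} - \bar{x}_{b}}{SE}$$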
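Now, let’s calculate the standard error. The formula for the standard error of the difference between two sample means is:
$$SE = \sqrt{\frac{s_{tn}^2}{n_{tn}} + \frac{s_{b}^2}{n_{b}}}$$
where s is each sample’s standard deviation and n is its number of observations.
Substituting the standard deviation and n of both samples into this formula, we get:
$$SE = \sqrt{\frac{(15000)^2}{100} + \frac{(22000)^2}{65}} \approx 3110$$
(about 3113.9 before rounding to three significant figures).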
Therefore, the t-statistic works out to:
t-actual = (38938 − 26831 − 0)/3110 = 3.8929
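The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `two_sample_t` is made up for illustration, and pairing the standard deviation 15000 and n = 100 with Tamil Nadu (and 22000 and 65 with Bihar) is an assumption based on the order of the terms in the standard error calculation.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2, d0=0.0):
    """Return (standard_error, t_statistic) for the difference of two sample means."""
    # Standard error of the difference: sqrt(s1^2/n1 + s2^2/n2)
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    # t = ((mean1 - mean2) - d0) / SE
    t = ((mean1 - mean2) - d0) / se
    return se, t

# Figures from the worked example. Which state owns which (sd, n) pair is an
# assumption; the formula is symmetric in the two variance terms, so SE is
# the same either way.
se, t_actual = two_sample_t(38938, 15000, 100, 26831, 22000, 65)
print(f"SE = {se:.1f}, t = {t_actual:.3f}")
```

Because the code keeps the unrounded standard error (about 3113.9 rather than 3110), it gives t of roughly 3.888 instead of 3.8929. The difference comes only from rounding and does not affect the conclusion.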
Assessing t-actual value:
The t-actual (tact) value is compared with a value from the t-distribution table, which lists threshold values, called t-critical, for different probability levels. The table can be viewed at the t-distribution Table.
How to conclude results based on t-actual and t-critical value?
Along the top row of the t-distribution table, you will see column headings such as 0.25, 0.05, 0.005 and so on. Each is a significance level: the probability of rejecting the null hypothesis when it is actually true. In other words, it is the chance you are willing to accept of wrongly rejecting the null hypothesis based on the given samples.
For our sample data, let us accept a 5% probability of rejecting the null hypothesis even though it is true. You can try a smaller probability level to reduce your risk of making this wrong decision.
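For a two-tailed test at the 5% level, the critical values of t are approximately ±1.96; we fail to reject the null hypothesis when t-actual lies within this range. (Strictly, 1.96 is the large-sample value. For samples of this size the tabulated t value is slightly larger, but still below 2.)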
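To see where a value like 1.96 comes from, here is a small sketch using only Python’s standard library. It relies on the large-sample normal approximation (an assumption; a t-distribution table gives slightly larger values for small samples), and the helper name `two_tailed_critical_value` is made up for illustration.

```python
from statistics import NormalDist

def two_tailed_critical_value(alpha):
    """Large-sample (normal approximation) critical value for a two-tailed test."""
    # Half of alpha sits in each tail, so look up the 1 - alpha/2 quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    crit = two_tailed_critical_value(alpha)
    print(f"alpha = {alpha:.2f}: critical value = +/-{crit:.2f}")
```

A smaller significance level gives a larger critical value, which is why reducing the risk of a wrong rejection makes the null hypothesis harder to reject.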
Our calculated t-actual value, 3.8929, does not lie within [-1.96, +1.96]. Therefore, we can reject the null hypothesis and conclude that there is a significant gap between the average incomes of the two states, which points to an underlying economic disparity. The t-test thus suggests that µtn − µb ≠ 0.
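The decision rule can also be written as a short function. The sketch below again assumes the large-sample normal approximation, and the function name `two_sided_test` is hypothetical. It also reports an approximate p-value, which is not part of the walkthrough above but gives a complementary view of the same decision.

```python
from statistics import NormalDist

def two_sided_test(t_actual, alpha=0.05):
    """Return (reject_null, approximate_p_value) using the normal approximation."""
    z = NormalDist()
    # Critical value for a two-tailed test at significance level alpha
    critical = z.inv_cdf(1 - alpha / 2)
    # Probability of a statistic at least this extreme if the null were true
    p_value = 2 * (1 - z.cdf(abs(t_actual)))
    return abs(t_actual) > critical, p_value

reject, p = two_sided_test(3.8929)
print(f"reject H0: {reject}, approximate p-value: {p:.5f}")
```

For t-actual = 3.8929 the function rejects the null hypothesis at the 5% level, in line with the conclusion above.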
Significance of t-test:
When the sample size is small and you need to draw inferences about the population, or use your hypothesis to build a model, the t-test provides a good way to start your analysis. However, it is only as good as the sample it is run on. In our sample data, the number of observations is very small compared to the actual number of employed people, so the result may not generalize to the entire population.
Even so, the t-test is useful in empirical research, business problems, government policy, and international development projects. It forms the first step in defining a problem and testing whether the problem is statistically significant. So, build your problem and try testing it. If you are looking for data to download, you can check Kaggle.com, US Gov Open Data, or Aus. Gov Open Data. Try downloading data that has a small sample size.
Treat this as a first step towards building a machine learning model, or an Artificial Intelligence model. In the next articles in this series, we will perform the t-test using statistical software such as Stata and R, and programming languages such as Python and Java. We will look at different ways of computing the same t-test so that you can explore the hypotheses that interest you.