What is Statistics?
We hear this field of mathematics referenced on a daily basis, most commonly as "Stats". Statistics can simply be understood as the science and mathematical practice of collecting, analyzing, interpreting, and presenting data.
Statistics can be broken down into two main areas: Descriptive & Inferential Statistics.
In Descriptive Statistics, we summarize our data using measures of central tendency and variability (more on this later) to describe the overall population. This helps us understand and organize our data so that we have a summary of the data and its unique features, either numerically or visually. Descriptive Statistics typically includes: Mean, Mode, Median, Standard Deviation, and Quartiles. One of the key differences between Descriptive and Inferential Statistics is that Descriptive Statistics applies to situations where all the values are known.
The mean is commonly referred to as the "average", although the two are not always the same thing. It is a measure of central tendency for interval and ratio data, and is defined as the arithmetic average of a set of values: the sum of the values divided by how many values there are. Ex: (14 + 9 + 21 + 52 + 7) / 5 = 103 / 5 = 20.6
The mode is simply the number that occurs the most frequently in a range of values.
Ex: 7, 14, 4, 6, 19, 3, 7, 23, 14, 7, 11, 7 Mode = 7
The median is simply the middle value in a range of values when the values are listed in ascending or descending order. When there is an odd number of values it's the number directly in the middle. If there is an even number of values in the range, you simply take the average of the two middle values in the range.
Ex: Odd # of values: 3, 4, 8, 13, 17 Median = 8
Ex: Even # of values: 3, 4, 7, 8, 13, 17 Median = 7.5 ((7 + 8) / 2 = 15 / 2 = 7.5)
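The three worked examples above can be checked with a few lines of Python. This is only a minimal sketch using the standard library's `statistics` module; the variable names are our own and the numbers are the ones from the examples.

```python
import statistics

# The example values used above for each measure.
mean_values = [14, 9, 21, 52, 7]
mode_values = [7, 14, 4, 6, 19, 3, 7, 23, 14, 7, 11, 7]
odd_values = [3, 4, 8, 13, 17]
even_values = [3, 4, 7, 8, 13, 17]

mean_result = statistics.mean(mean_values)    # sum of values / number of values
mode_result = statistics.mode(mode_values)    # most frequent value
median_odd = statistics.median(odd_values)    # the single middle value
median_even = statistics.median(even_values)  # average of the two middle values

print(mean_result)   # 20.6
print(mode_result)   # 7
print(median_odd)    # 8
print(median_even)   # 7.5
```

Note that `statistics.median` sorts the values for you, so the input does not have to be listed in order.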
The standard deviation sounds overly technical, but it is simply a measurement of how much individual values differ from the mean. Said another way, the Standard Deviation measures how spread out the values in a dataset are and attaches a single numerical value to that spread.
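To make the idea of "spread" concrete, here is a small sketch that compares two datasets. It assumes Python's `statistics` module, where `pstdev` treats the values as the full population and `stdev` treats them as a sample. The second list is a made-up dataset added purely for contrast.

```python
import statistics

# Values from the mean example, which include one value (52) far from the rest.
spread_out = [14, 9, 21, 52, 7]

# Made-up values that sit close together, for comparison only.
tightly_packed = [20, 21, 20, 21, 21]

# Population form: use when every value in the population is known.
population_sd = statistics.pstdev(spread_out)

# Sample form: use when the values are only a sample of a larger population.
sample_sd = statistics.stdev(spread_out)

tight_sd = statistics.pstdev(tightly_packed)

print(population_sd, sample_sd, tight_sd)
```

The widely spread dataset produces a much larger standard deviation than the tightly packed one, which is exactly what the measure is meant to capture.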
Most people can understand quartiles very easily. To figure out quartiles, you simply sort your data and divide it into quarters, so that each quarter holds the same number of values. Think of a pizza with 8 slices: each quarter of the pizza would consist of 2 slices.
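The pizza analogy can be sketched in code. The eight numbers below are made up to stand in for the eight slices, and `statistics.quantiles` is just one of several conventions for placing the cut points, so other tools may report slightly different quartile values for the same data.

```python
import statistics

# Eight made-up values, one per "slice" of the pizza.
slices = [2, 4, 6, 8, 10, 12, 14, 16]

# n=4 asks for the three cut points that split the sorted data into quarters.
q1, q2, q3 = statistics.quantiles(slices, n=4)

# Group the values by the quarter they fall into.
quarters = [
    [v for v in slices if v < q1],
    [v for v in slices if q1 <= v < q2],
    [v for v in slices if q2 <= v < q3],
    [v for v in slices if v >= q3],
]

for quarter in quarters:
    print(quarter)  # each quarter holds 2 of the 8 values
```

The middle cut point (`q2`) is the same number as the median.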
With Inferential Statistics we do not have all the data we need to draw an informed conclusion, so we have to take a random sample from the aggregate population in question. The goal of this area of statistics is to gain a better understanding of the overall population from a sample of it. The T-Distribution and Z-Score are common tools in Inferential Statistics.
Basic Statistical Terminology
Data is a term for any value that describes the characteristics and attributes of an item. This item could be a transaction, a person, an event, a result, or a change in the weather.
The population is the parent group from which the experiment's data is collected or, said another way, the entire group of items. Example: all the registered users of an online shopping platform, all of the players in the NBA, or the aggregate population of the city of San Diego.
The population size is represented by a capital: N
A sample is a subset of the population selected for the purpose of an experiment. Defined another way, when a full census is not possible or feasible, a selected subset, called a “Sample”, is taken from the population.
Example: 15% of all NBA players, or 25% of all registered users on an online shopping platform.
The sample size is represented by a lower case: n
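Here is a minimal sketch of how a population (N) and a sample (n) relate in practice. The population is entirely made up (400 randomly generated ages, not real NBA data), and the 15% sampling rate simply mirrors the example above.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same result on every run

# Hypothetical population: ages of 400 made-up players.
population = [random.randint(19, 40) for _ in range(400)]
N = len(population)  # population size

# Draw a random sample of 15% of the population, without replacement.
n = round(0.15 * N)  # sample size
sample = random.sample(population, n)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(N, n)
print(population_mean, sample_mean)
```

The sample mean will usually land near the population mean without matching it exactly, which is the gap that Inferential Statistics tries to quantify.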
A variable is a feature of an item from the population that differs in quantity or quality from one item to another.
Example: The payment method of registered users on an online shopping platform, or the age of NBA players.
The independent variable is represented by (X) and impacts the dependent variable. E.g. the supply of oil (independent variable) impacts the cost of fuel (dependent variable).
The dependent variable is represented by (Y) and is impacted by the independent variable (X). As the value of the independent variable changes, the resulting change in the dependent variable is observed and recorded.
Central Tendency is another important area of statistics that is often overlooked when trying to gain a better understanding of our data. As Data Analysts/Scientists we often want to focus on fancier, more sophisticated metrics that sound intelligent and high-brow, yet Central Tendency measures often have the biggest impact on our understanding of our data and what it means. Central Tendency uses metrics such as the Mean and Median to tell us where the center of our data is. Without a well-defined understanding of our center, we can’t properly understand anomalies and outliers or how they impact our overall dataset. Central Tendency measures tell us where most of our data points fall within our dataset, and from here we can then move into other important metrics such as ANOVA (Analysis of Variance).
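A quick sketch shows why knowing the center matters for spotting outliers. It reuses the values from the mean example, split into the four "typical" values and the one unusually large value (52). The split is our own choice for illustration.

```python
import statistics

typical = [7, 9, 14, 21]        # values that sit fairly close together
with_outlier = typical + [52]   # the same values plus one outlier

mean_before = statistics.mean(typical)
median_before = statistics.median(typical)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # 12.75 11.5
print(mean_after, median_after)    # 20.6 14
```

A single outlier moves the mean by almost 8 but moves the median by only 2.5, which is why it helps to look at both measures together.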
Measures of Skewness
Measures of Skewness tell us whether the distribution of data is symmetric or asymmetric. Said another way: are the values in our dataset spread evenly on both sides of the center, or does one side trail off further than the other?
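One common way to put a number on skewness is the third central moment divided by the cube of the standard deviation. The sketch below assumes that definition (there are others), and the helper name `skewness` and all three datasets are made up for illustration.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment / (standard deviation cubed)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    third_moment = sum((v - mean) ** 3 for v in values) / len(values)
    return third_moment / sd ** 3

symmetric = [1, 2, 3, 4, 5]          # evenly balanced around the center
right_skewed = [1, 2, 3, 4, 20]      # one long tail on the high side
left_skewed = [1, 17, 18, 19, 20]    # one long tail on the low side

print(skewness(symmetric))     # zero for perfectly symmetric data
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```

A value near zero suggests a symmetric distribution, a positive value a longer tail to the right, and a negative value a longer tail to the left.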
Confidence is a measure to express how closely the sample results match the true value of the full population. It is stated as a percentage between 0% and 100%, called the confidence level, and it is attached to a range of plausible values called the Confidence Interval. The closer the confidence level is to 100%, the more confident we are in the experimental results.
A 95% Confidence Interval means that if you were to repeat the experiment numerous times (with the same conditions), the interval calculated from each repetition would contain the true value of the full population in about 95% of cases. A confidence level of 0% expresses that you have no confidence in repeating the results in further experiments.
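As a rough sketch, here is one way to compute a 95% interval around a sample mean. It assumes a normal approximation (for a sample this small, the T-Distribution would usually be the better choice), and the ten sample values are made up for illustration.

```python
import math
import statistics

# Hypothetical sample of 10 observations.
sample = [14, 9, 21, 52, 7, 18, 11, 25, 16, 13]
n = len(sample)

sample_mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n).
standard_error = statistics.stdev(sample) / math.sqrt(n)

# For a 95% level we leave 2.5% in each tail of the normal curve,
# so we look up the 97.5th percentile (roughly 1.96).
z_critical = statistics.NormalDist().inv_cdf(0.975)

margin = z_critical * standard_error
lower = sample_mean - margin
upper = sample_mean + margin

print(f"95% interval: {lower:.2f} to {upper:.2f}")
```

Asking for a higher confidence level (say 99%) makes `z_critical` larger, which widens the interval.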
“P” denotes the probability of a certain event occurring. When the population is too large to measure in full, “P-hat” is used to estimate that probability from a sample: it is the proportion of the sample in which the event occurred.
It’s found by dividing the number of occurrences of the event (x) by the sample size (n), or x/n.
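The x/n calculation takes only a couple of lines. In this sketch the "event" is a shopper paying by card, and the ten payment records are made up for illustration.

```python
# Hypothetical sample of payment methods from 10 registered users.
payment_methods = [
    "card", "cash", "card", "card", "voucher",
    "card", "cash", "card", "card", "cash",
]

n = len(payment_methods)          # sample size
x = payment_methods.count("card") # number of times the event occurred

p_hat = x / n  # sample proportion

print(p_hat)  # 0.6
```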
Type 1 Error
This is a rejection of a null hypothesis (H0) when it should not have been rejected in the first place.
This means that although the data appears to support that a relationship exists, the Covariance (a measurement of how two variables vary together) between the variables occurred purely by chance. This does not prove that a relationship doesn’t exist, only that this is most likely the case. Also known as a “False Positive”.
Type 2 Error: Failing to reject a null hypothesis that should have been rejected, because the covariance of the variables was probably not due to chance. Also known as a “False Negative”.
Standard Error (SE)
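The standard error estimates how much a sample statistic, such as the mean, would vary from one sample to the next. It is used when measuring whether the difference between two populations is larger than chance alone would explain.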
The Z-Score is a way to compare results from a test (experiment) to a normal population.
The standard deviation describes the spread of the whole dataset, but it doesn’t tell us how far an individual data point sits from the mean. The solution to this problem is to apply the Z-Score, which expresses that distance in units of standard deviations.
Z = (Xi - Xbar) / S
Z = Z-Score
Xi = an individual observation from the dataset
Xbar = the mean
S = the standard deviation
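Here is a minimal sketch of the Z-Score calculation, assuming the usual definition in which the distance from the mean is divided by the standard deviation S. It reuses the values from the mean example and treats them as a sample (so S comes from `statistics.stdev`). The helper name `z_score` is our own.

```python
import statistics

values = [14, 9, 21, 52, 7]

x_bar = statistics.mean(values)  # Xbar, the mean
s = statistics.stdev(values)     # S, the sample standard deviation

def z_score(x_i):
    """Distance of one observation from the mean, in standard deviations."""
    return (x_i - x_bar) / s

z_scores = [z_score(v) for v in values]

for value, z in zip(values, z_scores):
    print(value, round(z, 2))
```

A positive Z-Score means the observation sits above the mean and a negative one means it sits below. Here 52 is the only value more than one standard deviation above the mean.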