Learn Data Analytics Terminology and Methods


What is Data?

"Data" is simply the collection of information, facts and statistics. Think of data as individual bits of information that help to form a broader understanding of a particular subject matter. Data is broken down into several substructures: Structured and Unstructured.

Structured (Organized) and Unstructured (Unorganized) Data*

Structured Data

This is data that has structure or organization to it. Think of a traditional tabular format with rows and columns, like you would see in a typical Excel spreadsheet. Most Machine Learning and statistical models use structured data.


Unstructured Data

This type of data has no standard organization or structure to it. Think of free-form text or social media posts.

Qualitative and Quantitative Data*

Qualitative Data

This type of data cannot typically be described numerically, and is generally described using categories or names.

Ex: Describe your mood today: Happy, Ambivalent, Sad, etc.


Quantitative Data

This type of data can be described numerically and mathematically.  

Ex: Checking account balance of $3,000, body weight of 200 lbs, a 2,700 sq ft home, etc.


Quantitative data can be further subdivided into Discrete Data and Continuous Data.

Discrete and Continuous Data*

Discrete Data

This is data that can be counted, meaning it can only take on a limited range of values. Think of the number of ways a die can be rolled, or how many people can fit into a small cafe.


Continuous Data

This type of data is measured, and exists within an infinite range of values. Think of time, or the height of a building, which can take on an infinite scale of decimals.


Data can be broken down even further into: Nominal, Ordinal, Interval, and Ratio.

Nominal, Ordinal, Interval, and Ratio Data*

Nominal Data 

This is data that is typically described by name or category only, and has no natural order to it. Think of gender or nationality. This category of data does not provide any mathematical maneuverability.

Ex: Gender: Male, Female; Citizenship: US, Canadian, Mexican, etc.


Ordinal Data

Data at this level can be ordered, but there is no meaningful mathematical difference between the data points. Numbers can be used to describe the data, and the data has a natural order to it. The most common example of Ordinal data is the frequently used Likert-style scale for rating a patient's pain level.

Ex: Likert-style scale to describe a patient's pain level: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10


Interval Data

This category of data is quantifiable, and can be expressed through simple mathematical operations such as addition and subtraction. Data at this level allows for meaningful, quantifiable differences between the data points. Think of calendar years as an example.

Ex: Calendar years: 2019 - 2014 = a 5-year difference


Ratio Data

This data tends to be the most versatile of the four. At this level the data has a natural order and a natural zero, we can measure differences between the data points, and we can multiply and divide as well as add and subtract. Think of your bank account as a good example. You can have $0 in it (a natural zero starting point), or you can have $40,000 in it, which is twice as much as $20,000.
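
To see how these levels of data show up in practice, here is a minimal pandas sketch; the column values are made up for illustration.

    import pandas as pd

    # Nominal: names or categories with no natural order.
    citizenship = pd.Categorical(["US", "Canadian", "Mexican", "US"])

    # Ordinal: categories with a natural order, so ordering comparisons work.
    pain_level = pd.Categorical([3, 7, 5], categories=list(range(1, 11)), ordered=True)
    print(pain_level.max())  # 7 -- ordering is defined

    # Ratio: numeric with a natural zero; multiplication and division are meaningful.
    balances = pd.Series([0, 20000, 40000])
    print(balances[2] / balances[1])  # 2.0 -- $40,000 is twice $20,000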


Data Analytics: Methods and Practices


Regression Analysis  

Is a statistical method for understanding the relationships between variables of interest. Regression Analysis essentially aims to take a set of data and make a best-guess prediction from it. With predictive analytics, we are trying to make our best guess about the future based on past examples.


What is Simple Linear Regression and What's the Point? 

Simple Linear Regression is a statistical method that allows us to summarize and study the relationship between two continuous variables: the Predictor Variable (which is on the X axis, and is also called the Input or Independent Variable) and the Output or Dependent Variable (which is on the Y axis).

The objective of Simple Linear Regression is to fit a straight line through the data points in a scatterplot. The line has to be built in such a way that the sum of the squared distances from the data points to the line is minimal. Generically expressed as:


                          Y = a*X + b

where a is the slope of the line and b is the Y-intercept.
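
As a minimal sketch of fitting such a line in Python, NumPy's least-squares polynomial fit can recover a and b; the data points below are made up for illustration.

    import numpy as np

    # Hypothetical (X, Y) data points.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Fit Y = a*X + b by minimizing the sum of squared distances to the line.
    a, b = np.polyfit(x, y, deg=1)
    print(f"slope a = {a:.2f}, intercept b = {b:.2f}")

    # Predict the output for a new input value using the fitted line.
    print("prediction at x = 6:", a * 6 + b)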


Non-linear relationship

Is a type of relationship between two variables in which a change in one variable does not correspond to a constant change in the other variable.


ANOVA and Linear Regression

ANOVA (Analysis of Variance) and Linear Regression models are used to analyze the effect of one independent explanatory variable on a dependent response variable. 

If there is more than one explanatory variable, Multiple Linear Regression is used.

One similarity between ANOVA and Linear Regression is that both assume the response variable to be normally distributed.
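
As a minimal sketch, a one-way ANOVA can be run with SciPy's f_oneway; the group measurements below are made up for illustration.

    from scipy.stats import f_oneway

    # Hypothetical response measurements for three levels of one explanatory variable.
    group_a = [23, 25, 21, 24]
    group_b = [30, 32, 29, 31]
    group_c = [22, 24, 23, 25]

    # Tests whether the group means differ more than chance alone would suggest.
    f_stat, p_value = f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")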



Data Mining

Is simply described as finding pertinent patterns within your data. Data mining always begins with the data, so it's important to have a generalized understanding of your data before you start. Data mining uses computational processes to uncover meaningful structures within the data.


Within Data Mining there are multiple categories of methods:

  

Classification, Regression, Association Analysis, Clustering, and Outlier/Anomaly Detection.


Each of these categories has a few dozen different algorithms, and each takes a slightly different approach to solve the problem at hand.

Classification and Regression tasks are predictive techniques because they predict an outcome variable based on one or more input variables.

Predictive algorithms need a known prior data set to learn the model.


Association Analysis and Clustering are descriptive data mining techniques where there is no target variable to predict; hence there is no test data set.
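
To make the predictive/descriptive distinction concrete, here is a minimal scikit-learn sketch; the tiny dataset is made up, and real data mining would use far more records.

    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    # Made-up dataset: two input variables per record.
    X = [[1, 1], [1, 2], [8, 8], [9, 8]]
    y = [0, 0, 1, 1]  # known target values (the known prior data set)

    # Predictive: classification learns a model from known targets.
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(clf.predict([[8, 9]]))  # predicts class 1

    # Descriptive: clustering finds structure with no target variable.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)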


Exploratory Data Analysis (EDA)

Is a method of initially examining and summarizing a dataset so that we can understand its structure and gain valuable quick insights.
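
In practice, EDA often starts with a few quick pandas calls; the file name sales.csv below is hypothetical.

    import pandas as pd

    df = pd.read_csv("sales.csv")  # hypothetical dataset

    print(df.head())      # first few rows: a feel for the raw data
    df.info()             # column types and missing-value counts
    print(df.describe())  # summary statistics for the numeric columns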


Machine Learning

Machine Learning is a hot topic, and a frequently used buzzword in today’s lexicon. 

Simply put, Machine Learning is the ability of computers to learn from our input data without being explicitly programmed to do so. Said another way, computers learn and find patterns within the data.


Training Data Set:

A prepared data set where all the attributes, including the target class attribute, are known, cleaned, and normalized. We use this dataset to build our models.


Test Dataset:

To check the validity of the trained model, we use a test dataset (also called a validation dataset). To facilitate this process, the overall known data set can be split into a training dataset and a test dataset. A standard rule of thumb is for three-fourths of the data to go to training and one-fourth to go to the test dataset.

Training Dataset = 3/4 of the known dataset (or 2/3, depending on the size of the dataset).

Test Dataset = the remaining 1/4 of the data (or 1/3, depending on the size of the dataset).
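
A minimal sketch of this split, using scikit-learn's train_test_split; the X and y arrays are placeholders for a real known dataset.

    from sklearn.model_selection import train_test_split

    # Placeholder known dataset: X holds input attributes, y the target attribute.
    X = [[i] for i in range(100)]
    y = [i % 2 for i in range(100)]

    # Hold out one-fourth of the known data as the test dataset.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    print(len(X_train), len(X_test))  # 75 25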


Analytical Problem Solving Progression

In Data Analytics it's important to understand the specific problem or issue we are addressing. It is critical that the problem be defined clearly so that an effective solution can be adopted.

Without a clear understanding of the problem we are investigating, it is hard to develop an effective solution. So the first step in finding an optimal solution is to define the problem. The second step is to understand the scope of the problem (how wide-reaching the issue might be: size, timeline, etc.). Third, we develop possible solutions for solving the problem. Fourth, we implement the solution(s) we have drafted. The fifth and final step is to revise our solutions, if necessary, and re-implement them. These are the 5 steps of Analytical Problem Solving:


1. Define the challenge/goal/problem.

2. Understand the scope (size) of the challenge/goal/problem.

3. Develop possible solutions for the problem.

4. Implement solutions.

5. Revise, if necessary, and then re-implement solutions.

 

Data Cleansing Methods and Practices:


Data Cleansing Steps (a pandas sketch follows the list):

-Elimination of duplicate records.

-Removal or sectioning off of outlier values (these extreme values can severely distort your data if not properly understood; they can be removed entirely, or retained with proper notation commenting on their value).

-Standardizing attribute values.

-Imputation of missing values (estimating missing values based on surrounding/similar attribute values).
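
Here is a minimal pandas sketch of the first three steps, assuming a hypothetical DataFrame with a single numeric column named amount (imputation is sketched separately under Missing Values below).

    import pandas as pd

    # Hypothetical records with a duplicate, an extreme value, and a missing value.
    df = pd.DataFrame({"amount": [10.0, 10.0, 12.5, 11.0, 500.0, None]})

    # Eliminate duplicate records.
    df = df.drop_duplicates()

    # Flag outliers (here: values more than 3 standard deviations from the mean)
    # instead of silently dropping them.
    mean, std = df["amount"].mean(), df["amount"].std()
    df["is_outlier"] = (df["amount"] - mean).abs() > 3 * std

    # Standardize the attribute to zero mean and unit variance.
    df["amount_standardized"] = (df["amount"] - mean) / std
    print(df)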


Missing Values:

Missing values are one of the most common dataset issues that data analysts deal with.

It’s important to understand the reason behind why the values are missing in the first place before proceeding further.

As mentioned above, you can impute missing values with artificial data using the mean, mode, maximum, or minimum value, depending on the characteristics of the attribute.

We can also ignore records with missing values altogether, though this reduces the size of the dataset.

Some algorithms are good at handling missing values while others are not.

K-nearest neighbors is good at handling missing values, while Neural Networks are not.
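
As a minimal sketch, mean imputation with pandas might look like this; the amount column is hypothetical.

    import pandas as pd

    df = pd.DataFrame({"amount": [10.0, 12.5, None, 11.0]})

    # Impute the missing value with the attribute's mean; the mode, maximum,
    # or minimum can be used instead, depending on the attribute.
    df["amount"] = df["amount"].fillna(df["amount"].mean())
    print(df)

    # Alternatively, drop records with missing values (this shrinks the dataset):
    # df = df.dropna()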


Feature Selection:

Reducing the number of attributes without experiencing a significant loss of model performance.

Not all attributes are equally important or useful in predicting the target value.
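
As a minimal sketch, scikit-learn's SelectKBest ranks attributes by a univariate score and keeps only the strongest k; the example below uses the bundled Iris dataset for convenience.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)  # 150 records, 4 input attributes

    # Keep the 2 attributes with the strongest univariate relation to the target.
    selector = SelectKBest(score_func=f_classif, k=2)
    X_reduced = selector.fit_transform(X, y)
    print(X.shape, "->", X_reduced.shape)  # (150, 4) -> (150, 2)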

*Caveat

The aforementioned data types require discretion in how and when they are used. Understanding the differences between these categories will go a long way in the application of Data Analytics methodologies and practices.