Understanding data simplification using Factor Analysis and Principal Component Analysis

Data analytics has become synonymous with efficient decision-making in commerce. Data sets far too large to handle by hand are routinely turned into usable insight. Contemporary data scientists deal daily with huge volumes of interrelated data, riddled with dependent variables and correlations. Factor analysis is used to reduce the number of variables by identifying a smaller set of factors that account for the variation observed in the correlated variables. In simple terms, factor analysis aims to explain the covariance among observed variables by inferring one or more underlying factors. Factor analysis falls into two major types, depending on its purpose.

  • Exploratory factor analysis 

This process explores the latent variables, or factors, that underlie the covariances in the data, and describes how the observed variables relate to each factor.

  • Confirmatory factor analysis 

This process is a confirmatory test. It checks whether a hypothesized number of factors fits a set of interrelated variables, and it identifies the factor loading of each variable for clearer comprehension.

What are latent variables? 

Latent variables, or factors, summarize the covariances between the observed variables. The goal is to express many small variations in terms of a broader aspect that is easier to comprehend. An example makes this concrete. A consumer satisfaction questionnaire measures several product aspects, each of which is a variable: value proposition, utility, desirability, design, ergonomics, build quality, and endurance. By analyzing these variables, and especially their covariances, further variables can be inferred. These inferred variables are called latent variables.

In this example, design, ergonomics, build quality, and endurance together determine the longevity of the product, while utility, desirability, and value proposition define its commercial relevance. Commercial relevance and longevity are the latent variables, or factors, in this case.
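The idea that hidden factors leave a footprint in the covariances can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the variable names, the loading of 0.8, the noise level of 0.6, and the assignment of variables to factors are all hypothetical, and the data are synthetic rather than taken from any real questionnaire.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_survey(n_respondents=2000, seed=0):
    """Generate hypothetical survey scores driven by two hidden factors."""
    rng = random.Random(seed)
    # Assumed mapping from observed variable to latent factor (illustrative).
    driven_by = {
        "design": "longevity",
        "build_quality": "longevity",
        "utility": "relevance",
        "desirability": "relevance",
    }
    scores = {name: [] for name in driven_by}
    for _ in range(n_respondents):
        # The latent factors are never recorded; only the answers are.
        latent = {"longevity": rng.gauss(0, 1), "relevance": rng.gauss(0, 1)}
        for name, factor in driven_by.items():
            # Observed score = loading * factor + variable-specific noise.
            scores[name].append(0.8 * latent[factor] + 0.6 * rng.gauss(0, 1))
    return scores

scores = simulate_survey()
same_factor = pearson(scores["design"], scores["build_quality"])
different_factor = pearson(scores["design"], scores["utility"])
print(f"design vs build_quality (shared factor): {same_factor:.2f}")
print(f"design vs utility (different factors):   {different_factor:.2f}")
```

Variables that share a factor come out clearly correlated, while variables driven by different factors are close to uncorrelated. Factor analysis works in the opposite direction: it starts from such a correlation pattern and infers the factors.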

The essential traits of data needed for factor analysis

  1. The data should be free of outliers and other anomalous points.
  2. The number of factors should be smaller than the sample size. Ideally, the ratio of sample size to number of factors is at least 5:1.
  3. The variables that are to be inferred by a factor should be interrelated.
  4. The variables should be measurable on a metric (numeric) scale.
  5. The data should be well normalized. Multivariate normality, however, is not required.
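Some of these requirements can be checked mechanically before running an analysis. The following is a minimal sketch, not a definitive procedure: the helper names are hypothetical, the z-score threshold of 3 is one common heuristic for spotting outliers rather than something the requirements above prescribe, and the ratings are made-up numbers. Only the 5:1 ratio comes from the list itself.

```python
import statistics

def flag_outliers(values, z_threshold=3.0):
    """Return the indices of values whose |z-score| exceeds z_threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > z_threshold]

def enough_samples(n_samples, n_factors, min_ratio=5):
    """Check the sample-size-to-factor ratio (5:1 by default)."""
    return n_samples >= min_ratio * n_factors

def standardize(values):
    """Rescale values to zero mean and unit sample variance."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

# Hypothetical ratings; the last entry looks like a data-entry error.
ratings = [7, 8, 6, 7, 9, 8, 7, 6, 8, 7, 7, 8, 6, 7, 95]

suspect = flag_outliers(ratings)
cleaned = [v for i, v in enumerate(ratings) if i not in suspect]
print("suspect indices:", suspect)
print("enough samples for 2 factors:", enough_samples(len(cleaned), 2))
print("first standardized values:", [round(z, 2) for z in standardize(cleaned)[:3]])
```

Flagged points deserve a manual look before removal, since a z-score test can miss outliers in small samples and can flag legitimate extreme answers.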

What is principal component analysis?

Principal component analysis (PCA) aims to simplify a data set by reducing its dimensionality, that is, the number of variables needed to describe it. It is best understood through a simple example. Imagine a collection of cars with different attributes: some are sedans, some are SUVs, and some are crossovers.

To understand the relationships between these three categories, we could plot them against a single attribute, for example wheel size. Every attribute of each category can be compared in this manner, but that requires the data scientist to produce a huge number of plots, an impractical task in the fast-moving post-pandemic world. Instead, the samples are plotted against a few composite variables, each of which stands in for many of the underlying variables. These composites are called principal components, as each is a combination of a multitude of minor components.
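The car example can be sketched in a few lines of code. This is an illustration under assumptions, not a reference implementation: the measurements are invented, and to stay dependency-free the sketch finds only the first principal component, using power iteration on the correlation matrix instead of the full eigendecomposition a statistics library would normally perform.

```python
import math

# Hypothetical cars: (body style, wheel size in inches, length in m, weight in t).
cars = [
    ("sedan", 16, 4.6, 1.4),
    ("sedan", 17, 4.7, 1.5),
    ("crossover", 18, 4.5, 1.6),
    ("crossover", 18, 4.6, 1.7),
    ("SUV", 20, 4.9, 2.1),
    ("SUV", 21, 5.0, 2.3),
]

def standardize_columns(rows):
    """Center each column and scale it to unit sample variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stdevs = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]

def covariance_matrix(rows):
    """Sample covariance matrix of rows whose columns are already centered."""
    n, dims = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]

def first_component(cov, iterations=200):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        product = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    # Rayleigh quotient: the variance captured along vec.
    eigenvalue = sum(
        v * sum(c * w for c, w in zip(row, vec)) for v, row in zip(vec, cov)
    )
    return vec, eigenvalue

features = [list(car[1:]) for car in cars]
z = standardize_columns(features)
cov = covariance_matrix(z)  # equals the correlation matrix after standardizing
component, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# One score per car replaces three separate attribute plots.
scores = [sum(v * w for v, w in zip(row, component)) for row in z]
for car, score in zip(cars, scores):
    print(f"{car[0]:<10} {score:+.2f}")
print(f"share of total variance on this component: {explained:.0%}")
```

Because the three made-up attributes rise and fall together, a single component carries most of the variance, and the cars can be compared along one axis instead of three.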

PCA vs. Factor Analysis

This discussion would be incomplete without examining factor analysis and principal component analysis in light of one another. The main differences between PCA and factor analysis are as follows.

  • Variance considered. PCA: takes the maximum amount of variance in the data into account. Factor analysis: takes only the common variance into account.
  • Matrix analyzed. PCA: works on the complete correlation matrix. Factor analysis: works on the adjusted correlation matrix.
  • Typical use. PCA: used to reduce substantial variance-related measurement error. Factor analysis: used to avoid multicollinearity in regression and to obtain factor loadings.
  • Classification. PCA: classifying the factors is redundant. Factor analysis: factors are classified according to the amount of variability they explain.
  • Number of dimensions. PCA: the number of components is not predetermined but computed. Factor analysis: the number of factors is predetermined.
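The difference between the complete and the adjusted correlation matrix can be written compactly. The notation here is an assumption introduced for illustration and does not appear in the comparison above: R is the correlation matrix of the observed variables, V and D hold its eigenvectors and eigenvalues, L is the matrix of factor loadings, and Ψ is a diagonal matrix of unique (variable-specific) variances.

```latex
% PCA: eigendecomposition of the complete correlation matrix
R = V D V^{\top}

% Factor analysis: common variance plus unique variance
R = L L^{\top} + \Psi
\quad\Longrightarrow\quad
R - \Psi = L L^{\top}
```

Under this standard formulation, PCA decomposes R with ones on its diagonal, so every variable's total variance is analyzed. Factor analysis models R minus Ψ, whose diagonal holds the communalities, so only the variance shared among variables is attributed to the factors.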