
Qualitative Data Analysis (QDA)

 From the econometric point of view: 

At the econometric level, a qualitative (categorical or nominal) variable can be analysed in two ways: either it is treated as an endogenous variable, or it is retained as an exogenous variable and studied within a qualitative econometric model. To this is added the possibility of analysing the link between two qualitative variables through the Chi-square test of independence (not to be confused with the Chi-square goodness-of-fit test). Qualitative econometric models are diverse, and some are considerably more complex than others. To cite a few, from the simplest to the most complex: the binomial Probit model, the binomial Logit, the Gompit (extreme value) model, the simple Tobit, the generalized Tobit of types I, II, III, IV and V, the doubly censored Tobit, the Tobit with multiple censoring (truncated or limited), the Heckit, etc.
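
As a minimal sketch of the simplest of these specifications, the following Python lines fit a binomial Logit and a binomial Probit with statsmodels on synthetic data; the variable names (income, age) and the data-generating process are purely hypothetical, not taken from the text.

# Minimal sketch: binomial Logit and Probit on a synthetic binary outcome.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
income = rng.normal(50, 10, n)            # hypothetical exogenous variable
age = rng.integers(18, 65, n)             # hypothetical exogenous variable
latent = -10 + 0.15 * income + 0.05 * age + rng.logistic(size=n)
y = (latent > 0).astype(int)              # observed binary endogenous variable

X = sm.add_constant(np.column_stack([income, age]))
logit_res = sm.Logit(y, X).fit(disp=False)
probit_res = sm.Probit(y, X).fit(disp=False)
print(logit_res.summary())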

The data structure, although dynamic in its temporal component, may also vary across observations, in which case it takes the form of panel data. Some principles of these models are explained in a later section. The appropriate model can be identified from the domain of definition of the function and from the modalities taken by the endogenous qualitative variable. These variables often come from surveys concerning appreciation, opinion, satisfaction, etc. Others can be computed rather than observed as such. Studying them is as indispensable as studying quantitative variables in order to direct efforts in a specific direction.
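
The mapping from the nature of the endogenous variable to a model family can be summarised in a purely illustrative helper; the function name and its arguments below are my own, not the author's.

# Hypothetical helper: choose a model family from the endogenous variable's nature.
def suggest_model(n_modalities: int, ordered: bool, censored: bool) -> str:
    if censored:
        return "Tobit-type model (simple or generalized, depending on the censoring)"
    if n_modalities == 2:
        return "binomial Logit / Probit / Gompit"
    if ordered:
        return "ordered Logit / Probit"
    return "multinomial Logit / Probit"

print(suggest_model(n_modalities=2, ordered=False, censored=False))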

It is certainly possible for a qualitative endogenous variable to be explained by another qualitative exogenous variable. To study such a case, one of the modalities of the exogenous variable is retained as a reference, against which the others are interpreted. The qualitative endogenous variable, for its part, is expressed in terms of probability. To achieve this, a continuous latent variable is introduced to facilitate the computation of the probabilities of the modalities of the endogenous variable. A positive estimated coefficient is then synonymous with increased chances of the modality occurring. It is also important to check the convergence of the solution after the iterations, in both the concave and convex cases. Beyond the significance of each predictor, the overall significance, or adequacy, of the model is a measure of the quality of its design. Thus the coefficient of determination of McFadden and Heckman, also called the pseudo R-squared, judges the goodness of fit of the model, in other words its explanatory power, the share of the fluctuation explained by the variables retained in the model.
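
As a minimal sketch on assumed data, the McFadden pseudo R-squared can be computed by comparing the fitted log-likelihood with that of an intercept-only model; statsmodels also reports it directly.

# Minimal sketch: McFadden pseudo R-squared for a Logit fit on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = (x + rng.logistic(size=300) > 0).astype(int)

res = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
pseudo_r2 = 1 - res.llf / res.llnull      # McFadden's definition
print(f"McFadden pseudo R2: {pseudo_r2:.3f} (statsmodels reports {res.prsquared:.3f})")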

The remaining percentage, usually less than 50%, corresponds to relevant variables not included in the model. The Hosmer-Lemeshow test goes in the same direction regarding the quality of the fit. It is through the marginal effects, however, that we learn more about the impact of each variable introduced into the model. The estimated coefficients give an idea of the nature of the influence of each exogenous variable on the endogenous one. It is generally acknowledged that the modalities of an explanatory variable are difficult to interpret on their own; it is therefore better to set one modality as a reference and to interpret the others by comparison with it. In most cases, the choice between the Logit, Gompit (extreme value) and Probit specifications is made on the basis of the predictive power of the model, and the best of them is retained for the final modeling.
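
A minimal sketch on assumed data of two of these steps: average marginal effects, and a comparison of Logit and Probit fits by information criterion as one simple proxy for predictive power (a Gompit/extreme-value fit could be compared the same way).

# Minimal sketch: average marginal effects and Logit vs Probit comparison.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = (0.8 * x1 - 0.5 * x2 + rng.logistic(size=n) > 0).astype(int)
X = sm.add_constant(np.column_stack([x1, x2]))

logit_res = sm.Logit(y, X).fit(disp=False)
probit_res = sm.Probit(y, X).fit(disp=False)

print(logit_res.get_margeff(at="overall").summary())   # average marginal effects
print("AIC  Logit:", round(logit_res.aic, 2), " Probit:", round(probit_res.aic, 2))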


  From a statistical point of view: 

In statistics, there is a variety of univariate and multivariate procedures, including the family of factorial methods: correspondence analysis, multiple correspondence analysis, multiple factor analysis and factor analysis of mixed data. Some of these methods handle both qualitative and quantitative variables, in which case we speak of a mixed analysis of the variables. Other techniques study the link between nominal or mixed variables: the Chi-square test, Cramér's V coefficient, the correlation ratio, the analysis of variance (ANOVA), etc. Analysing qualitative variables statistically amounts to the same operation as in econometrics, namely turning each modality into a new variable; this is why some studies refer to a complete disjunctive table or to the Burt table. The term "correspondence" refers to the link between nominal variables. The search for axes giving more meaning to the data is the common denominator of all factorial methods. It is above all in multiple correspondence analysis that there is a massive loss of information, hence the need to treat some results with caution. Correspondence analysis, as its name indicates, highlights the correspondences between two qualitative variables, that is, the link between their modalities, and in particular whether that link is attractive, repulsive or reflects independence. In this sense it is an exploratory, descriptive method, established by Benzécri in the 1970s.
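
A minimal sketch on toy data of three of the elements just mentioned: the Chi-square test of independence, Cramér's V as an association measure, and the complete disjunctive (dummy-coded) table; the variables "opinion" and "region" are invented for illustration.

# Minimal sketch: chi-square test, Cramér's V, complete disjunctive table.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "opinion": ["for", "against", "for", "neutral", "against", "for"],
    "region":  ["north", "south", "south", "north", "north", "south"],
})

table = pd.crosstab(df["opinion"], df["region"])      # contingency table
chi2, p, dof, _ = chi2_contingency(table)

n = table.values.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.3f}, p={p:.3f}, Cramér's V={cramers_v:.3f}")

disjunctive = pd.get_dummies(df)                      # complete disjunctive table
print(disjunctive)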

The idea is to translate the proximity of modalities as a link between the variables and, in particular, to read it as an identical profile for the individuals they describe. Multiple correspondence analysis is a generalization of correspondence analysis, which is itself a double principal component analysis: of the row profiles on the one hand and of the column profiles on the other, in a contingency table. Another point of distinction is that in simple correspondence analysis the raw table is not studied directly, as the raw counts would merely reflect level differences between rows and columns. It is also important, when interpreting, to be wary of modalities with low marginal frequencies, lest they distort the contributions of the others. Discriminant analysis, for its part, requires a qualitative variable together with several quantitative variables.
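
The dual analysis of row and column profiles can be written out directly as a singular value decomposition of the standardized residuals of the contingency table; the table below is an invented example, and this is only a sketch of the computation, not a full correspondence-analysis implementation.

# Minimal sketch: correspondence analysis via SVD of standardized residuals.
import numpy as np

N = np.array([[20, 30, 10],
              [15, 25, 40],
              [35,  5, 20]], dtype=float)   # hypothetical contingency table

P = N / N.sum()                  # correspondence matrix
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]          # principal row coordinates
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]       # principal column coordinates
inertia = sv ** 2                                    # inertia carried by each axis
print("axis inertias:", np.round(inertia, 4))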

The principle is to construct a linear combination of the quantitative variables that best separates the studied population. The discriminant function can be obtained using multiple linear regression. According to a threshold and the modalities of the qualitative variable, the misclassified individuals are identified. Ideally, the number of misclassified individuals should be kept to a minimum. To achieve this, one should consider including other variables in the regression and repeating the procedure over several iterations. The particularity of discriminant analysis is that, beyond its exploratory function, it is a decision-making method.
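
As a minimal sketch on assumed data, a linear discriminant analysis can be fitted with scikit-learn and the misclassified individuals counted; the two groups below are simulated for illustration.

# Minimal sketch: linear discriminant analysis and count of misclassified points.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
group_a = rng.normal([0, 0], 1.0, size=(100, 2))
group_b = rng.normal([2, 2], 1.0, size=(100, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 100 + [1] * 100)      # qualitative variable (two modalities)

lda = LinearDiscriminantAnalysis().fit(X, y)
misclassified = (lda.predict(X) != y).sum()
print(f"misclassified individuals: {misclassified} / {len(y)}")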


 From the point of view of hierarchical classification: 

From the point of view of ascending (agglomerative) or descending (divisive) hierarchical classification, the implementation relies on a multitude of distance metrics and aggregation criteria, among them the Manhattan distance, weighted distances, Ward's method, etc. The idea is to reduce the number of classes iteratively by grouping those that are most similar, i.e. those whose dissimilarity is minimal according to the aggregation index; in other words, we try to minimize the within-class variance. This partitioning derives from the distance matrix in a space of dimension equal to the number of variables (R to the power of the number of variables). Identical profiles reveal individuals with the same preference for a given choice, or the same profile for a characteristic of interest.
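
A minimal sketch on simulated points: agglomerative clustering with scipy using Ward's aggregation criterion (the Manhattan metric and weighted distances can be used with the other linkage methods).

# Minimal sketch: hierarchical clustering with Ward's criterion and a cut into 2 classes.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 3)),
               rng.normal(5, 1, (50, 3))])       # points in R^p, here p = 3

Z = linkage(X, method="ward")                    # aggregation by Ward's criterion
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 classes
print(np.bincount(labels))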



Abdi-Basid ADAN

The Abdi-Basid Courses Institute