From the econometric point of view:
Econometrically, a qualitative (categorical or nominal) variable can be analyzed in two ways: either it is treated as an endogenous variable and studied within a qualitative-response model, or it is retained as an exogenous variable. In addition, the link between two qualitative variables can be examined through the Chi-square test of
independence (not to be confused with the Chi-square goodness-of-fit test).
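The test of independence mentioned above can be sketched in a few lines of Python. The 2x2 table below is entirely made up for illustration; the mechanics (expected counts under independence, then the sum of squared standardized deviations) are the standard ones:

```python
# Chi-square test of independence on a made-up 2x2 contingency table
# (rows: two groups, columns: satisfied yes/no). Illustrative values only.
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under independence: (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

df = (2 - 1) * (2 - 1)  # (rows - 1) * (columns - 1) degrees of freedom
print(round(chi2, 3), df)
```

The statistic is then compared with the Chi-square critical value at `df` degrees of freedom; a large value leads to rejecting independence.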
Qualitative-response models are diverse, and some are quite complex. From the simplest to the most complex, we may cite: the binomial Probit model, the binomial Logit, the Gompit (extreme value), the simple Tobit, the generalized Tobit of types I, II, III, IV and V, the doubly censored Tobit, Tobit models with multiple censoring (truncated or limited), the Heckit, etc.
The data structure, although dynamic along the temporal component, may also vary across observations, in which case it takes the form of panel data. We explain some principles of these models in the following section.
The nature of the appropriate model can be identified from the domain of definition of the function and from the modalities taken by the qualitative endogenous variable. Such variables are often derived from surveys concerning appreciation, opinion, satisfaction, etc.; others are constructed rather than observed as such. Their study is as indispensable as that of quantitative variables for directing efforts in a specific direction.
A qualitative endogenous variable explained by another qualitative exogenous variable is certainly possible. To study such a case, one modality of the explanatory variable must be set aside as a reference, and the others are interpreted relative to it when reading the results. The qualitative endogenous variable, for its part, is expressed in terms of probability. To achieve this, a continuous latent variable is introduced, which facilitates the computation of the probability of each modality of the endogenous variable: a higher probability simply corresponds to greater odds of observing the modality. It is also important to note the convergence of the solution after a number of iterations, in both the concave and convex cases.
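The latent-variable formulation and the iterative convergence just mentioned can be sketched with a minimal binomial Logit fitted by Newton-Raphson. All data are invented for illustration; since the Logit log-likelihood is concave, the iterations converge to the maximum-likelihood estimates:

```python
import math

# Made-up data: x = exogenous variable, y = binary endogenous variable
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0,   0,   0,   1,   0,   1,   1,   1]

b0, b1 = 0.0, 0.0  # intercept and slope, starting at zero
for _ in range(25):  # Newton-Raphson iterations
    # P(y=1|x) under the Logit link: F(b0 + b1*x) = 1/(1+exp(-(b0+b1*x)))
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    # Gradient of the log-likelihood
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
    # Negative Hessian (positive definite because the log-likelihood is concave)
    w = [pi * (1 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    # Newton step: beta <- beta + H^{-1} * gradient
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (-h01 * g0 + h00 * g1) / det

print(round(b0, 2), round(b1, 2))
```

At convergence the score equations are satisfied: the fitted probabilities sum to the number of observed successes.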
Beyond the significance of each predictor, the overall significance or adequacy of the model assesses the quality of the model design. Thus, McFadden's coefficient of determination, also called the pseudo R-square, judges the quality of the fit of the model; in other words, the explanatory power, or the share of the fluctuation explained by the variables retained in the model.
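The pseudo R-square of McFadden compares the log-likelihood of the fitted model with that of a null model containing only a constant. A minimal sketch, on made-up outcomes and fitted probabilities:

```python
import math

# Made-up binary outcomes and fitted probabilities from some qualitative model
y    = [0, 0, 1, 0, 1, 1, 1, 0]
phat = [0.1, 0.3, 0.8, 0.2, 0.7, 0.9, 0.6, 0.4]

# Log-likelihood of the fitted model
ll_model = sum(math.log(p if yi == 1 else 1 - p) for yi, p in zip(y, phat))

# Null model: every probability equals the sample mean of y
pbar = sum(y) / len(y)
ll_null = sum(math.log(pbar if yi == 1 else 1 - pbar) for yi in y)

# McFadden pseudo R-square: 1 - lnL / lnL0
pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 3))
```

A value close to 1 indicates a strong improvement over the constant-only model; values are typically much lower than a classical R-square.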
On the other hand, the remaining share, usually less than 50%, corresponds to relevant variables not included in the model. The Hosmer-Lemeshow test points in the same direction regarding the quality of the fit.
Indeed, it is through the marginal effects that we learn more about the impact of each variable introduced into the model. The estimated coefficients indicate the nature (direction) of the influence of the exogenous variable on the endogenous one, but their magnitude is recognized as difficult to interpret directly. It is therefore better to set one modality as a reference and to interpret the others by comparison with it.
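For a Logit model, the marginal effect of a continuous regressor at a given point is the coefficient multiplied by the density of the link function, F(1-F). The coefficients below are invented for illustration:

```python
import math

# Illustrative (made-up) Logit coefficients: intercept and slope
b0, b1 = -2.0, 1.2

def prob(x):
    """P(y=1|x) under the Logit link."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Marginal effect of x at a point x0: dP/dx = b1 * F * (1 - F)
x0 = 1.5
F = prob(x0)
marginal = b1 * F * (1 - F)
print(round(F, 3), round(marginal, 3))
```

Unlike in the linear model, the marginal effect depends on the point of evaluation, which is why it is often reported at the sample mean or averaged over observations.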
In most cases, the choice among the Logit, Gompit (extreme value) and Probit specifications is made according to the predictive power of the model. The best of them is retained for the final modeling.
From a statistical point of view:
In statistics, there is a variety of univariate and multivariate analysis procedures, including the family of factorial methods: correspondence analysis, multiple correspondence analysis, multiple factor analysis, and factor analysis of mixed data. Some of these methods can handle both qualitative and quantitative variables, in which case we speak of a mixed analysis of the variables. In addition, other techniques study the link between nominal and mixed variables: for example, the Chi-square test, Cramér's V coefficient, the correlation ratio, the analysis of variance (ANOVA), etc.
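Cramér's V normalizes the Chi-square statistic into an association strength between 0 and 1 for two nominal variables. A minimal sketch on an invented 2x3 table:

```python
import math

# Made-up contingency table: 2 opinion groups x 3 response categories
table = [[20, 30, 10],
         [30, 10, 20]]

rows, cols = len(table), len(table[0])
n = sum(sum(r) for r in table)
row_t = [sum(r) for r in table]
col_t = [sum(c) for c in zip(*table)]

# Chi-square statistic against the expected counts under independence
chi2 = sum((table[i][j] - row_t[i] * col_t[j] / n) ** 2 / (row_t[i] * col_t[j] / n)
           for i in range(rows) for j in range(cols))

# Cramer's V: chi2 scaled by sample size and table dimension, in [0, 1]
v = math.sqrt(chi2 / (n * (min(rows, cols) - 1)))
print(round(v, 3))
```

Values near 0 suggest independence; values near 1 a near-perfect association.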
Analyzing qualitative variables statistically amounts to the same operation seen in econometrics: each modality becomes a new variable. This is why some studies speak of a complete disjunctive table or a Burt table. In addition, the name "correspondence" refers to the link between the nominal variables. The search for axes that express the most meaning in the data is the common denominator of all factorial methods. It is in Multiple Correspondence Analysis, however, that the loss of information is largest, hence the need to interpret certain results with care.
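The operation of turning each modality into a new variable is exactly the construction of the complete disjunctive table (one 0/1 column per modality). A minimal sketch on invented survey answers:

```python
# Building the complete disjunctive table: each modality of a qualitative
# variable becomes a 0/1 indicator column. Data are made up for illustration.
answers = ["yes", "no", "yes", "maybe", "no"]  # one qualitative variable

modalities = sorted(set(answers))              # ['maybe', 'no', 'yes']
disjunctive = [[1 if a == m else 0 for m in modalities] for a in answers]

for row in disjunctive:
    print(row)
# Each row sums to 1: an individual takes exactly one modality.
```

Stacking the disjunctive tables of several qualitative variables, and crossing them, yields the Burt table used in Multiple Correspondence Analysis.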
Correspondence analysis, as its name indicates, highlights the correspondences between two qualitative variables: the link in which the modalities are involved and, especially, the nature of that link, which can be attraction, repulsion, or independence. In this sense, it is an exploratory, descriptive method, established by Benzécri in the 1970s. The idea is to translate the proximity of modalities into a link between the variables and, in particular, to read it as an identical profile for the individuals they describe.
Multiple correspondence analysis, by contrast, is a generalization of correspondence analysis, which is itself a double principal component analysis: of the row profiles on the one hand and of the column profiles on the other, in a contingency table. Another distinction is that in simple correspondence analysis the raw table is not studied directly, since raw counts might be misread as differences between rows and columns. When interpreting, it is also important to beware of modalities with low marginal frequencies, lest they unduly influence the contributions of the others.
Discriminant analysis requires a qualitative variable together with several quantitative variables. The principle is to construct a linear combination of the quantitative variables that best separates the studied population. The discriminant function can be obtained using multiple linear regression. According to a threshold and the modalities of the qualitative variable, the misclassified individuals are identified. Ideally, the number of misclassified individuals should be as small as possible. To that end, one may include other variables in the regression and repeat the procedure over several iterations. The particularity of discriminant analysis is that, besides its exploratory function, it is a decision-making method.
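The regression route to the discriminant function described above can be sketched as follows: regress a 0/1 coding of the two groups on the quantitative variables, then classify with a 0.5 threshold and count the misclassified individuals. The data and the threshold are invented for illustration:

```python
# Discriminant function via multiple linear regression on a 0/1 group coding.
# Two made-up, well-separated groups described by two quantitative variables.
x1 = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]
x2 = [2.0, 1.0, 2.5, 1.5, 5.0, 6.0, 5.5, 6.5]
g  = [0,   0,   0,   0,   1,   1,   1,   1]   # group membership

# Normal equations X'X b = X'g with columns [1, x1, x2]
X = [[1.0, a, b] for a, b in zip(x1, x2)]
XtX = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(3)]
       for i in range(3)]
Xtg = [sum(X[k][i] * g[k] for k in range(len(X))) for i in range(3)]

# Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting
A = [row[:] + [rhs] for row, rhs in zip(XtX, Xtg)]
for c in range(3):
    p = max(range(c, 3), key=lambda r: abs(A[r][c]))
    A[c], A[p] = A[p], A[c]
    for r in range(3):
        if r != c:
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
beta = [A[i][3] / A[i][i] for i in range(3)]

# Classify: discriminant score above 0.5 -> group 1; count misclassifications
scores = [beta[0] + beta[1] * a + beta[2] * b for a, b in zip(x1, x2)]
misclassified = sum((s > 0.5) != (gi == 1) for s, gi in zip(scores, g))
print(misclassified)
```

If too many individuals are misplaced, the text's advice applies: add further variables to the regression and repeat.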
From the point of view of hierarchical classification:
From the point of view of ascending or descending hierarchical classification, the implementation relies on a multitude of distance and aggregation algorithms, among which the Manhattan distance, weighted distances, Ward's method, etc. The idea is to reduce the number of classes iteratively by grouping the classes that are most similar, that is, those whose dissimilarity is minimal according to the aggregation index. In other words, we try to minimize the intra-class variance. This partitioning derives from the distance matrix in a space whose dimension equals the number of variables. Identical profiles reveal individuals with the same preference for a given choice, or the same profile on a characteristic of interest.
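The iterative grouping described above can be sketched as a minimal ascending (agglomerative) classification. Single linkage with Euclidean distance is used here for simplicity; Ward's method or the Manhattan distance would follow the same scheme with a different aggregation index. All points are invented:

```python
# Minimal ascending hierarchical classification: repeatedly merge the two
# clusters whose dissimilarity is smallest, until the desired class count.
points = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0)]
clusters = [[p] for p in points]  # start with one cluster per individual

def dissimilarity(c1, c2):
    """Single-linkage aggregation index: smallest pairwise Euclidean distance."""
    return min(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
               for a in c1 for b in c2)

while len(clusters) > 3:  # stop at 3 classes
    # Find the pair of clusters with minimal aggregation index and merge it
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: dissimilarity(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print([len(c) for c in clusters])
```

The two tight pairs are merged first and the isolated point remains its own class, which is exactly the behavior expected from a minimal-dissimilarity aggregation.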
Abdi-Basid ADAN