A data mining approach to study disease presentation patterns in Primary Progressive Aphasia.
The world faces an ageing population and the challenges that come with it, notably in
healthcare, given the current incidence of diseases that are more prevalent in the elderly, such as
neurodegenerative diseases. Primary Progressive Aphasia (PPA) is a neurodegenerative disease
characterized by a gradual dissolution of language abilities. Patients with PPA receive
special attention because they are at higher risk of progressing to dementia. Consequently,
identifying the different subtypes of PPA is fundamental to the timely administration
of pharmacological and therapeutic interventions, which can improve patients' quality of life.
This thesis proposes a data mining approach to extract relevant knowledge from
clinical data, namely to learn the variants of PPA. In a first phase, standard clustering algorithms were
applied to study the number of groups present in the dataset and
to investigate the potential existence of new groups, different from the PPA subtypes
already defined in the literature. In a second phase, supervised learning techniques
were used to analyze patients according to their clinical classification into one of the three PPA
variants and to develop a new and accurate classification model.
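As an illustration of the first phase, the following minimal sketch shows one common way to study how many groups a dataset contains: run a clustering algorithm for several candidate values of k and keep the value with the highest mean silhouette score. The abstract does not name the clustering algorithms or the validity criterion used, so k-means and the silhouette score are assumptions made here. The function names (`kmeans`, `mean_silhouette`) and the two-dimensional points are hypothetical toy material, not the clinical data.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centres = [points[0]]
    while len(centres) < k:
        # Add the point that is farthest from every centre chosen so far.
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        # Update step: each centre becomes the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centres.append(centres[j])  # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette score; singleton clusters contribute 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3),
]
best_k = max(range(2, 5), key=lambda k: mean_silhouette(points, kmeans(points, k)))
print("number of groups suggested by the silhouette score:", best_k)
```

On real clinical attributes, scaling the features and comparing several algorithms and validity indices would normally precede any conclusion about the number of groups.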
The unsupervised learning analysis pointed to the existence of two main groups in the
dataset analyzed in this work. This study included the evaluation of diverse sets of attributes in
order to assess which type or set of attributes produced better results. Finally, two new
methodologies for classifying patients with PPA were developed, reaching good accuracies on
the dataset under study. One of these methodologies enables the identification of instances
that (potentially) do not belong to any of the three PPA subtypes already defined.
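The idea of flagging instances that fit none of the known subtypes can be sketched as a classifier with a reject option. The abstract does not describe how the thesis methodology achieves this, so the example below is only one simple possibility and should not be read as the thesis method: a nearest-centroid classifier that returns no label when an instance lies too far from every class centroid. The function names, the labels `A`, `B` and `C`, the `max_distance` threshold, and the data are all hypothetical.

```python
import math


def fit_centroids(samples, labels):
    """Compute one centroid (feature-wise mean) per class label."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {
        y: tuple(sum(col) / len(xs) for col in zip(*xs))
        for y, xs in groups.items()
    }


def predict_with_reject(centroids, x, max_distance):
    """Return the nearest class label, or None if x is far from all classes."""
    label, d = min(
        ((y, math.dist(x, c)) for y, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if d <= max_distance else None


# Hypothetical toy training set with three placeholder classes.
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0, 5), (0, 6), (1, 5)]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
centroids = fit_centroids(samples, labels)

near = predict_with_reject(centroids, (0.5, 0.5), max_distance=1.5)
far = predict_with_reject(centroids, (10, 0), max_distance=1.5)
print(near, far)
```

Here `None` plays the role of "potentially not one of the defined subtypes". In practice the threshold would have to be chosen on held-out data, and a probabilistic or one-class model might be preferred over a fixed distance cut-off.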