Executive Summary : | The increasing prevalence of sensing capabilities, IoT technologies, smartphones, and social media services has led to the generation of vast amounts of unlabeled, high-dimensional data, making it difficult to understand the underlying characteristics of a dataset. Effective analysis of such data can provide significant insights in fields such as transportation, healthcare, energy, education, and finance. Clustering is an important unsupervised learning approach for mining unlabeled data: it partitions the data into groups of similar objects in order to discover interesting patterns. Deep learning-based clustering techniques have recently gained popularity because of their ability to handle complex, high-dimensional data. However, these algorithms have limitations, such as the need to specify the number of clusters in advance, the lack of information about the inherent cluster structure of the data points, and the difficulty of interpreting the neural network's predictions. To overcome these limitations, researchers have used visual assessment of clustering tendency (VAT) methods to estimate the number of clusters present in the data. However, these methods are impractical and often inconclusive on high-dimensional datasets, which is the case for most real-life data. Therefore, this project aims to develop an explainable, self-supervised learning-based visual-analytical framework for cluster structure assessment that discovers the deep structures present in complex, high-dimensional data when no ground truth is available. The project will advance the VAT family of algorithms by exploiting modern deep learning approaches to automatically estimate the number of clusters and discover inherent cluster structures in complex, high-dimensional data. 
The framework will involve three tasks: (i) representation learning for high-dimensional data using modern deep learning architectures, such that similar points are pulled closer together while dissimilar ones are pushed apart; (ii) using deep learning to automatically determine the number of clusters and the inherent cluster structure in the data from a visualization of the learnt representation; and (iii) explaining each cluster to end users by exploring explainable AI-based approaches, describing the properties of each cluster, and determining the importance of each feature for each cluster obtained in the previous task. Experiments will be performed on a variety of real-world datasets available from data repositories. The proposed framework can help researchers understand the inherent cluster structure present in their experiments and assist non-expert users in interpreting clustering outputs. Overall, the proposed approach aims to provide a comprehensive and interpretable solution for discovering cluster structures in complex, high-dimensional data. |
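
For readers unfamiliar with the VAT family referred to in the summary, the following is a minimal sketch of the classic VAT reordering step only; it is not the deep learning framework proposed in this project. It assumes Euclidean dissimilarities and a small toy dataset, and the names `pairwise_distances`, `vat_order`, and `reorder` are illustrative rather than taken from any library. VAT reorders the dissimilarity matrix with a Prim-like nearest-neighbour sweep so that, when the reordered matrix is shown as an image, well-separated clusters tend to appear as dark blocks along the diagonal.

```python
import math
import random


def pairwise_distances(points):
    """Euclidean dissimilarity matrix as a list of lists (assumed metric)."""
    return [[math.dist(p, q) for q in points] for p in points]


def vat_order(D):
    """Return the VAT ordering of the objects described by dissimilarity matrix D."""
    n = len(D)
    # Start from an object involved in the largest dissimilarity.
    start = max(range(n), key=lambda i: max(D[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    # nearest[j]: smallest dissimilarity from j to any object already ordered.
    nearest = list(D[start])
    while remaining:
        # Append the unordered object closest to the ordered set (Prim-like step).
        nxt = min(remaining, key=lambda j: nearest[j])
        order.append(nxt)
        remaining.remove(nxt)
        for j in remaining:
            nearest[j] = min(nearest[j], D[nxt][j])
    return order


def reorder(D, order):
    """Permute rows and columns of D according to the VAT ordering."""
    return [[D[i][j] for j in order] for i in order]


# Toy data (an assumption for illustration): two tight 2-D blobs, interleaved
# so that the original ordering hides the cluster structure.
random.seed(0)
blob_a = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(20)]
blob_b = [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(20)]
points = [p for pair in zip(blob_a, blob_b) for p in pair]
labels = [0, 1] * 20

D = pairwise_distances(points)
order = vat_order(D)
D_star = reorder(D, order)          # matrix one would display as an image
ordered_labels = [labels[i] for i in order]
```

On this toy input, `ordered_labels` places all points of one blob before any point of the other, which is what produces two dark diagonal blocks in an image of `D_star`. On real high-dimensional data the blocks are often far less clear, which is the limitation the summary describes.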