Permutation-based Text Clustering

Faculty Sponsor: Pavel Oleinikov

Live Poster Session: Zoom Link

Yifei Zhao

Yifei Zhao is a rising junior (Class of ’25) originally from China. During the summer of 2023, he had the opportunity to collaborate with Professor Pavel on a project involving permutation-based text clustering. Beyond his academic pursuits, Yifei finds joy in sports, capturing candid moments through photography, and traveling.

At Wesleyan, he is a triple major in Mathematics, Computer Science, and Philosophy. His areas of interest encompass deep learning, artificial intelligence, and the practical applications of computational techniques to social science.

Looking ahead, Yifei aspires to continue combining computational expertise with qualitative thinking, aiming to apply these innovative approaches to further enrich his research endeavors.

Abstract: A binarized document-term matrix (DTM) can be treated as a black-and-white image. Because the ordering of the terms (columns) and documents (rows) in DTMs is arbitrary, clustering can be viewed as swapping (permutations) of rows and columns, aimed to place the black pixels together in larger blocks. Traditional clustering algorithms rely on sequential convergence and do not parallelize well. Many algorithms require a fixed number of clusters as input. However, a Monte Carlo Tree Search is agnostic to the number of clusters, and as the search tree grows automatically, the Monte Carlo Tree Search can guarantee convergence to the optimal solution (minimax). Based on its design, a Monte Carlo Tree Search will focus on exploring “more promising moves”. If it finds a good move, the Monte Carlo Tree Search will explore it deeply. This can be seen as a combination of breadth-first search and depth-first search, similar to heuristic search. This project will experiment with the Monte Carlo Tree Search for the generation of permutation matrices (random draws vs incremental improvements), heuristics for handling the high dimensionality of DTMs, and methods of parallelizing the search for the best solution. We are interested in parallel clustering algorithms like an image that can utilize GPUs.

YifeiPavel_final