NLP assignment on finding MCDA classmates with similar interests
Word and sentence embeddings are a clever way of turning words and sentences into numerical representations that computers can understand. Word and sentence embeddings make it possible for computers to work with and analyze text, making tasks like language translation, sentiment analysis, and information retrieval much easier and more accurate. A common way to numerically represent words and sentences is to transform them into vectors. Vectors are essentially a bunch of numbers arranged in a specific order, and they represent different aspects of the word or sentence.
For word embeddings, imagine each word as a unique vector. These vectors capture the word's meaning in a multi-dimensional space. Each dimension might represent something like word frequency, context, or similarity to other words. So, when we take a word and transform it into a vector, we're assigning it a specific location in this multi-dimensional space, sort of like plotting points on a map. We expect related words, or words with similar meanings, to be used in similar contexts and to appear with similar frequencies. Therefore, once transformed into vectors, we should see these similarities reflected in vector space. As shown in the plot above [1], related words are clustered close to each other.
Sentence embeddings are like puzzles made up of word vectors. When we form a sentence, we combine the individual word vectors in a specific way to create a new vector that represents the entire sentence. This combined vector captures the overall meaning and context of the sentence by taking into account the relationships between the words.
In the image above [2], SIF (smooth inverse frequency) [3] refers to a sentence embedding method that combines word embeddings to calculate the sentence embedding.
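The combining step can be sketched as follows. This is a minimal toy illustration with made-up 3-dimensional word vectors, not our actual pipeline; full SIF additionally weights each word by a/(a + p(w)) for word probability p(w) and then removes the first principal component of the sentence vectors, while here we show the weighted-average core only:

```python
import numpy as np

# Toy word vectors (in practice these come from a pretrained embedding model)
word_vectors = {
    "i":    np.array([0.1, 0.3, 0.5]),
    "love": np.array([0.7, 0.2, 0.1]),
    "nlp":  np.array([0.4, 0.9, 0.3]),
}

def sentence_embedding(words, word_vectors, word_freq=None, a=1e-3):
    """SIF-style sentence embedding: weighted average of word vectors.
    With word_freq=None this reduces to a plain average."""
    vecs = []
    for w in words:
        weight = 1.0 if word_freq is None else a / (a + word_freq.get(w, 0.0))
        vecs.append(weight * word_vectors[w])
    return np.mean(vecs, axis=0)

print(sentence_embedding(["i", "love", "nlp"], word_vectors))
```

With the plain average, frequent filler words pull every sentence toward the same point; the SIF weighting down-weights them, which is why it beats naive averaging.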
The closer the vectors are, the more similar the words or sentences are in meaning. So, word and sentence embeddings are like a bridge that helps computers make sense of our words and language.
In this experiment, we test how sentence embeddings compare in Sentence Transformers. We've modified sentences from 3 people in our dataset by changing the word order in a sentence, replacing words with synonyms or antonyms, shortening the sentence, rewriting the sentence with the opposite meaning, or replacing the whole sentence with a randomly generated sentence without context. The changes are as follows:
We ran the matchmaking visualization code again and got the following:
As shown, most of Nikita's sentence modifications are still clustered relatively close together despite the changes (with the exception of Nikita_opposite2). Meanwhile, by replacing with a completely random sentence, Greg's and Tao's modified sentences are placed further away from their originals. To understand why, we've calculated cosine similarities between the original sentences and the modified ones. The results for Nikita's sentence modifications are as follows:
And for Greg and Tao's:
The results from the cosine similarity test largely correspond with the placements on the plot above. As expected, replacing the whole sentence has the highest impact on similarity, with both Greg's and Tao's modifications scoring near 0 on the cosine similarity scale; replacing words with synonyms had the least impact, with all of Nikita's synonym modifications scoring near the 0.9 mark. Interestingly, we see that using antonyms does not have a large impact on sentence similarity, even though the sentences have the opposite meaning. Therefore, using this model, people with opposite interests would be placed close together. Note that while Nikita_opposite2 has a higher similarity score than Tao_mod and Greg_mod, it is placed further away from its original on the visualization plot. This likely has more to do with UMAP's dimensionality reduction mechanism. In general, on the plot, sentences with similar meaning are placed closer together than sentences that are not.
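The similarity computation above can be sketched as follows. The vectors here are toy stand-ins; in the real pipeline they come from a Sentence Transformer (e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode([...])`):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors: 1 means identical
    direction, 0 means orthogonal (unrelated)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the original and modified sentence embeddings:
orig = np.array([0.2, 0.9, 0.1])
mod = np.array([0.25, 0.85, 0.12])
print(round(cosine_similarity(orig, mod), 3))
```

Because cosine similarity depends only on the angle between vectors, it is insensitive to embedding magnitude, which is why it is the standard choice for comparing sentence embeddings.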
In this experiment, we test how different transformer models affect our outcome of finding people with interests similar to mine (Haodong Tao). We calculate the embedding similarities with 4 other transformer models (all-mpnet-base-v2, all-distilroberta-v1, all-MiniLM-L12-v2, paraphrase-albert-small-v2) and calculate Spearman's correlation with our base model (all-MiniLM-L6-v2). We also rank-order the people with the closest descriptions to mine for all models and calculate the rank differences. The results are as follows.
As shown, all models have high correlation with our base model, with all-mpnet-base-v2 performing most differently from our base model, at a correlation of 0.728. This suggests that the choice of embedding model could be a significant factor in our task performance. We can demonstrate this further with the following analysis.
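The model-to-model Spearman comparison can be sketched as follows; the similarity scores here are toy numbers, not our real data (in practice each array holds every person's cosine similarity to my description under one model):

```python
import numpy as np
from scipy.stats import spearmanr

# Cosine similarity of each person's description to mine, under two models
# (toy numbers; the real scores come from the respective encoders):
base_scores  = np.array([0.81, 0.42, 0.67, 0.90, 0.15])  # e.g. all-MiniLM-L6-v2
other_scores = np.array([0.78, 0.60, 0.50, 0.88, 0.20])  # e.g. all-mpnet-base-v2

# Spearman correlates the *ranks* of the scores, so it measures whether the
# two models order people the same way, not whether the raw scores match.
rho, p_value = spearmanr(base_scores, other_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A rank-based measure is the right fit here because the downstream task (matchmaking) only uses the ordering of people, not the absolute similarity values.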
Here we rank-order people with the most similar interests to mine (Haodong Tao) based on their descriptions' cosine similarity scores and calculate the difference in ranking between models. Then we count the number of people ranked 10 or more places apart when comparing two embedding models (a difference large enough to displace the number 1 position out of the top 10); the higher this count, the greater the difference in model performance. We also count the number of people ranked within 5 places of each other between two models (meaning a top 1 position would still be within the top 5); the higher this count, the more similar the models' performance. As we can see, the counts correspond to the models' Spearman correlations.
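The counting logic can be sketched as follows; the scores are randomly generated stand-ins for the real per-person similarity scores, and the thresholds of 10 and 5 match the ones described above:

```python
import numpy as np

def rank_by_similarity(scores):
    """Rank people by similarity score (rank 1 = most similar)."""
    order = np.argsort(-np.asarray(scores))
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def rank_diff_counts(scores_a, scores_b, far=10, near=5):
    """Count people ranked >= `far` places apart (models disagree) and
    <= `near` places apart (models agree) between two score lists."""
    diffs = np.abs(rank_by_similarity(scores_a) - rank_by_similarity(scores_b))
    return int((diffs >= far).sum()), int((diffs <= near).sum())

# Toy similarity scores for 12 people under two hypothetical models:
rng = np.random.default_rng(0)
model_a, model_b = rng.random(12), rng.random(12)
far_count, near_count = rank_diff_counts(model_a, model_b)
print(f"ranked >=10 apart: {far_count}, ranked within 5: {near_count}")
```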
We can also visualize ranking differences with the following boxplot:
Lastly, we can take a closer look at the top 5 people whose descriptions are the most similar to mine across all models:
We discussed briefly in the Data Analysis section how UMAP's dimensionality reduction mechanism can affect our visualization. Here we will try to optimize UMAP hyperparameters in order to more accurately reflect our interest rank order. First, let's take a look at the pre-tuned correlation between the cosine distance rank order and the euclidean distance on the UMAP plot.
As shown, the correlation is about 0.38. Let's see if we can improve it using Optuna hyperparameter tuning, optimizing for better correlation between the cosine distance and the euclidean distance. Here we will optimize three parameters that affect visualization: n_neighbors, min_dist, and metric. Information about these parameters can be found in the UMAP documentation [4]. Briefly, n_neighbors controls how UMAP balances local versus global structure in the data; low values of n_neighbors will force UMAP to concentrate on very local structure. min_dist controls how tightly UMAP is allowed to pack points together; it quite literally sets the minimum distance apart that points are allowed to be in the low-dimensional representation. metric controls how distance is computed in the ambient space of the input data; the default metric is 'euclidean', but 'cosine' and many others are also available. Here we will choose from three metrics: euclidean, cosine, and manhattan.
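The quantity the tuner optimizes can be sketched as follows. This computes the Spearman correlation between pairwise cosine distances in embedding space and pairwise euclidean distances in the 2D plot; the embeddings and projection here are random toy data, and the Optuna/UMAP wiring (which this sketch omits) is indicated in the comments:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def projection_quality(embeddings, projection):
    """Spearman correlation between pairwise cosine distances in embedding
    space and pairwise euclidean distances in the 2D projection. Higher
    means the plot better preserves the interest rank order."""
    cos_d = pdist(embeddings, metric="cosine")
    euc_d = pdist(projection, metric="euclidean")
    rho, _ = spearmanr(cos_d, euc_d)
    return rho

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 384))  # toy stand-in for sentence embeddings
projection = rng.normal(size=(30, 2))    # toy stand-in for UMAP's 2D output

# Inside an Optuna objective, the projection would be rebuilt per trial, e.g.:
#   n_neighbors = trial.suggest_int("n_neighbors", 2, 25)
#   min_dist    = trial.suggest_float("min_dist", 0.0, 0.99)
#   metric      = trial.suggest_categorical("metric",
#                     ["euclidean", "cosine", "manhattan"])
#   projection  = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
#                           metric=metric).fit_transform(embeddings)
print(f"correlation: {projection_quality(embeddings, projection):.3f}")
```

A random 2D projection scores near zero on this objective, which is exactly the behavior we want the tuner to push away from.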
After running 100 trials, we get the following results.
We have improved the Spearman correlation from 0.38 to 0.52, a significant improvement. Here is the tuned correlation plot:
We can also compare the UMAP plots to see that after tuning, the plot looks different. Here is the pretuned plot: VS the tuned plot:
It's interesting to note that while the tuned plot has a higher correlation between interest rank and euclidean distance, it also appears less clustered. This actually makes for a worse visualization, since stronger clustering gives better visual information about which people have similar interests.
[1] Nilesh Barla, "The Ultimate Guide to Word Embeddings", Neptune AI MLOps Blog, 2023. https://neptune.ai/blog/word-embeddings-guide
[2] Diogo Ferreira, "What Are Sentence Embeddings and Why Are They Useful?", Medium, 2020. https://engineering.talkdesk.com/what-are-sentence-embeddings-and-why-are-they-useful-53ed370b3f35
[3] Sanjeev Arora et al., "A Simple but Tough-to-Beat Baseline for Sentence Embeddings", ICLR 2017. https://openreview.net/pdf?id=SyK00v5xx
[4] "Basic UMAP Parameters", UMAP-learn documentation. https://umap-learn.readthedocs.io/en/latest/parameters.html