Next, we convert these chunks into embeddings using a Sentence Transformer model. Each chunk becomes a 384-dimensional vector produced by the all-MiniLM-L12-v2 model, and each of those vectors represents the chunk’s meaning in a high-dimensional space. We can then compare embeddings to measure semantic similarity: chunks with similar meanings sit closer together in vector space.
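As a rough sketch of this step (the chunks list below is only a stand-in for the location-related review chunks produced earlier), the embedding might look like this:

```python
# Minimal sketch of the embedding step; `chunks` is a placeholder for the
# review chunks created in the previous step.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L12-v2")

chunks = [
    "The hotel is right next to the train station.",
    "Great location, a short walk from the old town.",
]

# encode() returns one 384-dimensional vector per chunk
embeddings = model.encode(chunks, show_progress_bar=True)
print(embeddings.shape)  # e.g. (2, 384)
```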
Thinking about embeddings in 2D makes this easier to picture. Imagine individual words represented by coordinates (a vector with two numbers), as shown in Figure 7. Words that are close in meaning, such as computer and laptop, sit close together on the grid, while words that are less similar, such as apple and train, are further apart. For our topic model, we transform sentences rather than individual words into coordinates, and we represent them in 384 dimensions instead of two. The same idea still applies: sentences that are close together in this space have more similar meanings.
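To make the analogy concrete, we can embed the four words from Figure 7 and compare them with cosine similarity. The exact scores depend on the model, but the computer/laptop pair should score noticeably higher than apple/train:

```python
# Illustration of "closer in meaning = closer in vector space"
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")
vectors = model.encode(["computer", "laptop", "apple", "train"])

print(util.cos_sim(vectors[0], vectors[1]))  # computer vs. laptop
print(util.cos_sim(vectors[2], vectors[3]))  # apple vs. train
```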
After embedding all the review chunks for our hotel’s location, we can begin clustering them; groups of embeddings that sit close together form clusters. We use BERTopic for this, which by default applies UMAP for dimensionality reduction (compressing the 384-dimensional embeddings down to a handful of dimensions) and HDBSCAN for clustering. The dimensionality reduction step is needed because clustering directly in 384 dimensions is much slower, and density-based methods like HDBSCAN struggle in high-dimensional space.
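A minimal sketch of the clustering step, continuing from the embeddings above, might look like the following. The UMAP and HDBSCAN objects mirror BERTopic's defaults; the parameter values shown here are illustrative, not tuned settings:

```python
# Sketch of clustering with BERTopic, making the default UMAP and HDBSCAN
# components explicit so their parameters can be adjusted later.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model="all-MiniLM-L12-v2",
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)

# Passing the precomputed embeddings avoids re-embedding the chunks
topics, probs = topic_model.fit_transform(chunks, embeddings)
```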
Figure 8 shows one example of a possible cluster visualization with BERTopic. These visualizations aim to preserve the semantic distance between clusters, that is, how far apart their meanings are. Because some topics sit closer together in semantic space than others, we can produce broader or more fine-grained clusters depending on the parameters used for UMAP and HDBSCAN.
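BERTopic ships with built-in Plotly visualizations, so a map like the one in Figure 8 can be generated directly from the fitted model:

```python
# Inter-topic distance map (an interactive Plotly figure)
fig = topic_model.visualize_topics()
fig.write_html("topic_map.html")
```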

From these clusters we can derive topics, which can be represented by groups of keywords as shown on the right of Figure 8. The keywords are pulled from each cluster of reviews by scoring the most important words with c-TF-IDF. Not every chunk can be clustered; these unclustered chunks are treated as outliers and assigned a topic ID of -1. The parameters chosen for BERTopic affect which chunks get clustered and which topics they are assigned to.
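Continuing the sketch, the keyword representations and the -1 outlier topic can be inspected directly on the fitted model:

```python
# Overview of all topics, including topic -1 for unclustered chunks
print(topic_model.get_topic_info().head())

# Top c-TF-IDF keywords and their weights for a single topic, e.g. topic 7
print(topic_model.get_topic(7))
```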
Other topic representations are possible, such as a single phrase generated by a GenAI model. We can also look at the chunks that are most representative of each cluster if we want to explore a single topic in more detail. In Figure 9, topics 9_louis_andrea_concierge_manager and 7_noise_street_noisy seem interesting.
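As a sketch, the most representative chunks can be pulled from the fitted model, and the keyword representation can be swapped for a different one. KeyBERTInspired is used here only as a lightweight stand-in for a GenAI-based labeller:

```python
# Most representative chunks for a topic of interest, e.g. topic 7
print(topic_model.get_representative_docs(7))

# Swap in a different topic representation and refresh the topic labels
from bertopic.representation import KeyBERTInspired

topic_model.update_topics(chunks, representation_model=KeyBERTInspired())
```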

We can now see which shared topics our guests are discussing. We take these topics one step further by analyzing the sentiment of the chunks associated with each topic, which tells us not only what guests are talking about but also how they feel about it. We use a HuggingFace pipeline to gauge the sentiment of each chunk, then join the sentiment results with the topic results to get the overall guest sentiment on each topic.
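A minimal sketch of the sentiment step, assuming the default HuggingFace sentiment-analysis pipeline and a simple pandas join on the chunk order:

```python
# Score each chunk's sentiment and join it with the topic assignments
import pandas as pd
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
results = sentiment(chunks, truncation=True)

df = pd.DataFrame({
    "chunk": chunks,
    "topic": topics,
    "sentiment": [r["label"] for r in results],
    "score": [r["score"] for r in results],
})

# Share of positive vs. negative chunks per topic
print(df.groupby("topic")["sentiment"].value_counts(normalize=True))
```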

Looking at the topics we explored earlier, we can see the overall sentiment for Topic 9 is very positive, while Topic 7 is more negative.

And looking at all the topics, we can see some potential areas of improvement! Guests really enjoy our complimentary beverages and location, but we could improve the checkout process and work on soundproofing the rooms.

