What Are Guests Really Saying? Topic Modeling Hotel Reviews

Author:

Lance Lu

Date Published:
January 26, 2026

In the hotel industry, guest expectations are always evolving, and customer feedback is one of the most valuable tools for staying ahead. Businesses examine feedback left on review aggregators like Yelp, Google, or Tripadvisor to improve service quality. But taking the right action requires a clear understanding of what guests are actually saying, which isn’t easy given the volume and variety of reviews a hotel may receive. By understanding that feedback, hotels can strengthen guest relationships and make their properties more appealing.

Here, we will share how hotels can use topic modeling and language models to efficiently monitor customer reviews. Our modeling methods allow hotels to understand reviews at a deeper level than the overall ratings and identify the factors that attract or repel new guests.

Exploring Hotel Reviews

The original data comes from two text files of JSON objects, one file for reviews and one for hotel offerings. Because the dataset is large, we store the individual reviews in a SQLite database, which also makes querying individual reviews and joining the offerings to the reviews more efficient. After importing, we can perform some exploratory analysis on the dataset. The data is from hotels across the U.S., with New York City and Houston having the most locations (Figure 1).
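As an illustration of the import step, a minimal sketch using Python’s built-in json and sqlite3 modules might look like the following. The file names and field names here (for example, offering_id, ratings, address) are stand-ins and would need to match the actual dataset schema.

```python
import json
import sqlite3

conn = sqlite3.connect("hotel_reviews.db")
conn.execute("""CREATE TABLE IF NOT EXISTS offerings
                (id INTEGER PRIMARY KEY, name TEXT, locality TEXT)""")
conn.execute("""CREATE TABLE IF NOT EXISTS reviews
                (id INTEGER PRIMARY KEY, offering_id INTEGER,
                 rating REAL, text TEXT)""")

# Assumed layout: one JSON object per line in each file.
with open("offerings.txt", encoding="utf-8") as f:
    for line in f:
        o = json.loads(line)
        conn.execute("INSERT OR IGNORE INTO offerings VALUES (?, ?, ?)",
                     (o["id"], o["name"], o["address"]["locality"]))

with open("reviews.txt", encoding="utf-8") as f:
    for line in f:
        r = json.loads(line)
        conn.execute("INSERT OR IGNORE INTO reviews VALUES (?, ?, ?, ?)",
                     (r["id"], r["offering_id"],
                      r["ratings"]["overall"], r["text"]))
conn.commit()

# A single join then pulls every review for hotels in a given city.
rows = conn.execute("""SELECT o.name, r.rating, r.text
                       FROM reviews r
                       JOIN offerings o ON r.offering_id = o.id
                       WHERE o.locality = ?""", ("New York City",)).fetchall()
```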

Figure 1: Hotel locations per city

Most reviews are positive, as shown in Figure 2; ratings of two or lower are rare. Why? A few plausible explanations include:

1. Most hotels are doing a good job.

2. Customers review hotels if their experience is either exceptional or terrible but not if it’s mediocre.

3. Customers are biased toward giving higher-star reviews even if experiences are only average.

4. Hotels with consistently low ratings may go out of business, leaving only highly rated hotels.

Ratings provide a quick and simple way of measuring a hotel’s quality, but they don’t paint the whole picture. Even in 4- and 5-star reviews, there may be elements of guest dissatisfaction. Customers have become accustomed to this kind of ratings inflation and will not take a high rating at face value. Potential guests may read through a few 3- and 4-star reviews to understand a hotel’s quality more completely. So hotel leaders should look carefully at each review’s content to find specific areas needing improvement.

Figure 2: Ratings distribution

Instead of reading each review, which is time-consuming and could miss important trends, we recommend using a machine learning tool to analyze the content of every review. When developing such a tool, we need to recognize that the number of reviews per offering (Figure 3) and the length of individual reviews (Figure 4) both vary widely. We need a solution that works for long and short reviews alike and accommodates hotels with either many or few reviews. In the next section, we will show how topic modeling can analyze a hotel’s reviews, regardless of review length or the number of reviews.

Figure 3: Reviews per offering distribution

Figure 4: Review length distribution

Example Solution: Topic Modeling

Let’s walk through an example informed by patterns in our dataset. Imagine we operate the fictional Hotel Alpha in New York City. To enhance our guest experience, we are looking for potential areas of improvement. We have 253 reviews, with an average rating of 4.38 (Figure 5), collected between 2006 and 2012. We could read through each review individually; however, doing so is tedious, and tracking shared elements across reviews is tricky.

This is where topic modeling with embeddings comes in. Topic modeling surfaces common threads across the set of reviews whenever multiple reviews mention a similar topic. Because embeddings capture the underlying meaning of a sentence, similar ideas tend to be grouped into a cohesive topic even when the wording differs.

Figure 5: Hotel Alpha rating distribution

To perform topic modeling, we first chunk the reviews. Chunking breaks a review into smaller pieces. There are many ways to do this, but for this example we will chunk the reviews along sentence boundaries, which allows us to look for finer-grained topics. Larger chunk sizes, like paragraphs or whole reviews, can instead give us a more holistic view of what guests are commenting on. (Note: In other use cases, chunks can be overlapped so you are less likely to lose context between them.)

Figure 6 shows how the chunking process works on a fictional review.

Figure 6: Sample review chunking
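As a rough sketch, the sentence-level chunking illustrated in Figure 6 can be done with an off-the-shelf sentence tokenizer such as NLTK’s; any reasonable splitter works, and the sample review below is made up for illustration.

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases
from nltk.tokenize import sent_tokenize

def chunk_review(review_text: str) -> list[str]:
    """Split one review into sentence-level chunks."""
    return [s.strip() for s in sent_tokenize(review_text) if s.strip()]

review = ("The front desk staff were friendly and checked us in quickly. "
          "However, the street noise kept us up most of the night.")
print(chunk_review(review))
# ['The front desk staff were friendly and checked us in quickly.',
#  'However, the street noise kept us up most of the night.']
```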

Using Embeddings

Next, we convert these chunks into embeddings using a Sentence Transformer model. This turns each chunk into a vector with 384 numbers generated by the all-MiniLM-L12-v2 model. Each embedding represents a chunk’s meaning in a high-dimensional space. We can then use these embeddings to compare chunks for semantic similarity. Chunks that have similar meaning are closer together in vector space.
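A minimal sketch of this step with the sentence-transformers library, using two illustrative chunks:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L12-v2")

chunks = [
    "The front desk staff were friendly and helpful.",
    "Street noise made it hard to sleep.",
]
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per chunk
```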

Thinking about embeddings in 2D simplifies this concept. Imagine we have individual words that are represented by coordinates (a vector with two numbers) as shown in Figure 7. Words that are closer in meaning such as computer and laptop are close together on the grid. Words that are not as similar such as apple and train are further apart. Now, for our topic model, we are transforming sentences into coordinates instead of individual words. And instead of two dimensions, we are representing them in 384 dimensions. Still, the same idea applies in this higher-dimensional space: If the sentences are close together in this space, their meanings are more similar.
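To make the distance idea concrete, here is a small illustrative check with cosine similarity, a standard way to compare embeddings; the sentences are invented for the example.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")
sentences = [
    "The room was very noisy at night.",
    "Street noise kept us awake.",
    "The free breakfast was delicious.",
]
emb = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities: closer meanings score higher.
scores = util.cos_sim(emb, emb)
print(float(scores[0][1]))  # two noise complaints: relatively high
print(float(scores[0][2]))  # noise vs. breakfast: relatively low
```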

After embedding all the review chunks for our hotel’s location, we can begin clustering them. Groups of embeddings that sit close together form clusters. We use BERTopic for this, which by default uses UMAP for dimensionality reduction (simplifying the data from 384 dimensions down to a handful) and HDBSCAN for clustering. Dimensionality reduction is needed to keep the clustering step fast.
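A minimal BERTopic setup along these lines might look like the sketch below, assuming chunks holds the sentence-level chunks from the earlier step. The UMAP and HDBSCAN parameter values are illustrative starting points rather than the exact settings used for the figures in this post.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

embedding_model = SentenceTransformer("all-MiniLM-L12-v2")

# UMAP reduces the 384-dimensional embeddings to a handful of dimensions;
# HDBSCAN then groups nearby chunks into clusters that become topics.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        prediction_data=True)

topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model)

# `chunks` is the list of sentence-level chunks produced earlier.
topics, probs = topic_model.fit_transform(chunks)
```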

Figure 8 shows one example of a possible cluster visualization with BERTopic. These visualizations aim to preserve the semantic distance—how far apart the meanings are—between clusters. Some topics are closer together in semantic space. This gives us the ability to create more general or more specific clusters depending on the parameters used during the UMAP and HDBSCAN processes.

Figure 8: Example topic visualization
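Continuing the same sketch, BERTopic can generate an inter-topic distance map like the one in Figure 8 directly:

```python
# Interactive inter-topic distance map, similar in spirit to Figure 8.
fig = topic_model.visualize_topics()
fig.write_html("topic_map.html")

# For broader topics, refit with a larger UMAP n_neighbors and HDBSCAN
# min_cluster_size; smaller values produce finer-grained topics.
```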

From these clusters we can find topics. These topics can be represented by groups of keywords, as shown on the right of Figure 8. Keywords are pulled from each cluster of review chunks by identifying the most important words with c-TF-IDF. Not all chunks can be clustered; these unclustered chunks are assigned a topic ID of -1. The parameters chosen for BERTopic affect which chunks get clustered and which topics they are assigned to.
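The topic summary and per-topic keywords can then be inspected like this (topic 7 is used only as an example id):

```python
# One row per topic: id, size, and top c-TF-IDF keywords.
info = topic_model.get_topic_info()
print(info.head())                # topic -1 collects the unclustered chunks

# Keywords and their c-TF-IDF weights for a single topic.
print(topic_model.get_topic(7))   # e.g., [('noise', 0.12), ('street', 0.09), ...]
```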

Other topic representations are possible, such as a single phrase generated by a GenAI model. We can also pull the chunks that are most representative of each cluster if we want to explore a single topic in more detail (see the sketch after Figure 9). In Figure 9, topic 9_louis_andrea_concierge_manager and 7_noise_street_noisy seem interesting.

Figure 9: Example topics
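The representative chunks for a topic are one call away, and a generative model can optionally be plugged in as an alternative topic representation. The sketch below uses a generic flan-t5 model purely for illustration; the exact API may differ across BERTopic versions.

```python
# Chunks most representative of a given topic (id 7 used as an example).
print(topic_model.get_representative_docs(7))

# Optional: ask a text-generation model to propose a short label per topic.
from transformers import pipeline
from bertopic.representation import TextGeneration

generator = pipeline("text2text-generation", model="google/flan-t5-base")
topic_model.update_topics(chunks, representation_model=TextGeneration(generator))
```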

We can now see what shared topics our guests are discussing. We take these topics one step further by analyzing the sentiment of the chunks associated with each topic. With this, we can see both the topics our guests are talking about and how they feel about them. We use a HuggingFace pipeline to gauge the sentiment of each chunk, then join the sentiment results with our topic results to get the overall guest sentiment on each topic, as sketched below.
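A rough sketch of this step, assuming chunks and topics come from the BERTopic fit above and using a general-purpose sentiment model (a domain-specific model could be swapped in):

```python
import pandas as pd
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")
results = sentiment(list(chunks), truncation=True)

df = pd.DataFrame({
    "chunk": chunks,
    "topic": topics,                              # from topic_model.fit_transform
    "sentiment": [r["label"] for r in results],   # POSITIVE / NEGATIVE
})

# Share of positive chunks per topic, ignoring the outlier topic (-1).
topic_sentiment = (df[df["topic"] != -1]
                   .assign(positive=lambda d: d["sentiment"].eq("POSITIVE"))
                   .groupby("topic")["positive"].mean()
                   .sort_values())
print(topic_sentiment)  # low values flag candidate areas for improvement
```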

Table 1: Topics with sentiment

Looking at the topics we explored earlier, we can see the overall sentiment for Topic 9 is very positive, while Topic 7 is more negative.

Figure 10: Overall topic sentiment

And looking at all the topics, we can see some potential areas of improvement! Guests really enjoy our complimentary beverages and location, but we could improve the checkout process and work on soundproofing the rooms.

Table 2: Most positive topics

Table 3: Most negative topics

Conclusion

For a business, topic modeling with embeddings and sentiment analysis provides some useful benefits:

  • Quickly analyze a large set of reviews.
  • Find shared topics that highlight areas to target for improvement.
  • Track recurring and emerging issues to make sure problems are mitigated before future customers are driven away.
  • Find topics guests are responding positively to and use those topics to strengthen current efforts or inspire new ones.

In general, for topic modeling:

  • Choosing chunk size affects how granular your topics will be (sentence chunks vs. the whole review).
  • Embeddings capture the meaning behind sentences, even if the wording is different.
  • UMAP and HDBSCAN parameters affect how general your topics are.1
  • Topics let you see consistent elements in your dataset.
  • Adding sentiment analysis at the chunk and topic level shows how guests feel about each specific factor, not just what they mention.

Future work might include:

  • Fine-tuning the sentiment and embedding models to boost performance.2
  • Analyzing topic presence or sentiment as they change over time.
  • Having an LLM find the best topic name based on extracted topic keywords to improve readability.

Topic modeling is a powerful way to uncover what matters most to guests. If you’d like to learn more about this approach, explore The Observatory for a demo on analyzing customer feedback. Our team at Elder Research is also glad to chat. Learn more about how we serve the hospitality industry.


1 For UMAP, increasing the number of neighbors will make more generalized topics by emphasizing the global structure of the data. For HDBSCAN, increasing min_cluster_size will focus the topic modeling on larger clusters of reviews.

2 Clustering performance can be measured with metrics like Silhouette score or Rand Index. Calculating these metrics would require chunks with classes as a ground truth, which would likely require human annotation. These metrics check if members of the same “class” are clustered together.