Elder Research partnered with a national restaurant chain to refine their customer feedback analysis. Our goal was to enhance their text classification capabilities, enabling quicker, more accurate insight into customer sentiment and opportunities for operational improvement.

The Challenge
Our client collects a large, ongoing stream of text responses to customer surveys, which they regularly monitor as an input to their operations. One aspect of this analysis involves tracking how often various “topics” appear in the customer feedback. We worked with the client’s research and development group to explore methods for improving this text-classification and topic-identification process.
The Solution
Our work focused on applying large language models (LLMs) to classify survey responses according to their sentiment, themes, and topics, and to identify previously unrecognized topics. These models are especially proficient at understanding and interpreting language, which makes them well suited to this kind of analysis. LLMs and natural-language technology have advanced so rapidly in recent years that we had a wide range of language models and prompting techniques to evaluate for performance on our client’s use case.
Making the most of these LLMs required assessing both overall model performance and the prompting techniques used to apply each model to the data. Most LLMs are trained to operate on long, unstructured text input such as sentences or paragraphs. The science (or art) of prompting focuses on how best to provide these inputs, along with direction on the form the results should take, to attain the desired outcome: in this case, determining whether each of a set of topics is present in a given text.
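To make this concrete, the following Python sketch shows one way such a prompt could be assembled and its reply parsed; the topic list, prompt wording, and example feedback are illustrative assumptions rather than the client’s actual taxonomy or data.

```python
import json

# Hypothetical topic list for illustration; the client's actual taxonomy
# is not shown here.
TOPICS = ["food quality", "wait time", "staff friendliness", "cleanliness", "price"]

def build_topic_prompt(feedback: str) -> str:
    """Build a zero-shot prompt that asks which topics appear in one piece
    of feedback and directs the model to answer in JSON so the reply can
    be parsed programmatically."""
    topic_lines = "\n".join(f"- {t}" for t in TOPICS)
    return (
        "You are analyzing customer feedback for a restaurant chain.\n"
        f"Decide which of the following topics the feedback mentions:\n{topic_lines}\n\n"
        "Respond with a JSON object mapping each topic to true or false.\n\n"
        f'Feedback: "{feedback}"'
    )

def parse_topic_response(raw: str) -> dict:
    """Parse the model's JSON reply, treating any missing topic as absent."""
    labels = json.loads(raw)
    return {t: bool(labels.get(t, False)) for t in TOPICS}

print(build_topic_prompt("The burgers were great, but we waited 40 minutes."))
```

Asking for JSON keyed by topic keeps the downstream parsing deterministic, which matters when the prompt is applied to a large, ongoing stream of responses.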
We evaluated several publicly available LLMs, both open source and proprietary, along with a variety of prompting techniques, including zero-shot prompts, which solicit a classification without providing any examples, and few-shot prompts, which provide several examples to guide the model’s output. Google’s Gemini 1.5 Flash model, paired with few-shot prompting, yielded the best results, beating out a variety of medium-sized open-source models from Meta, Mistral, and Google.
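As a sketch of the few-shot setup, the snippet below calls Gemini 1.5 Flash through Google’s google-generativeai Python SDK; the labeled examples, topic labels, and environment-variable name are invented for illustration and are not the client’s data, prompts, or configuration.

```python
import os

import google.generativeai as genai  # pip install google-generativeai

# Illustrative configuration; the environment-variable name is an assumption.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# Invented few-shot examples; in practice these would be drawn from the
# client's annotated survey responses.
FEW_SHOT_EXAMPLES = [
    ('The fries were cold and soggy.',
     '{"food quality": true, "wait time": false}'),
    ('Seated right away, but the burger was bland.',
     '{"food quality": true, "wait time": true}'),
]

def few_shot_prompt(feedback: str) -> str:
    """Prepend labeled examples so the model imitates both the judgments
    and the JSON output format."""
    shots = "\n\n".join(
        f'Feedback: "{text}"\nLabels: {labels}'
        for text, labels in FEW_SHOT_EXAMPLES
    )
    return (
        "Label each piece of restaurant feedback with the topics it mentions, "
        "answering in JSON as in the examples.\n\n"
        f"{shots}\n\nFeedback: \"{feedback}\"\nLabels:"
    )

response = model.generate_content(few_shot_prompt("We waited 40 minutes for a table."))
print(response.text)
```

A zero-shot variant would simply omit the labeled examples and rely on the instruction alone, which is the distinction we evaluated.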
The Results
We found that our best-performing solution would provide at least a 50% cost savings to the client over their current process without a drop in performance. Further, our solution could be developed, managed, and deployed in-house rather than through the third-party vendor the client has been relying on. Bringing the solution in-house would improve data protection, provide additional flexibility, and allow customization of the model itself.
During our experiments, we also identified interesting cases in which the human-provided annotations themselves might be incorrect. Flagging these cases served as a kind of anomaly detection, allowing client staff to double-check those instances and providing value both for this text analysis and for training future models.
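A minimal sketch of that disagreement check, assuming hypothetical field names and a placeholder confidence threshold: records where a confident model prediction contradicts the human label are surfaced for review rather than treated as confirmed errors.

```python
# Sketch of the disagreement check described above. Field names and the
# confidence threshold are illustrative assumptions.
def flag_suspect_annotations(records, confidence_threshold=0.9):
    """Return records whose human label disagrees with a high-confidence
    model prediction; these are candidates for review, not confirmed errors."""
    return [
        rec for rec in records
        if rec["model_label"] != rec["human_label"]
        and rec["model_confidence"] >= confidence_threshold
    ]

example = [
    {"text": "Loved the new menu!", "human_label": "negative",
     "model_label": "positive", "model_confidence": 0.97},
    {"text": "Service was slow.", "human_label": "negative",
     "model_label": "negative", "model_confidence": 0.88},
]
print(flag_suspect_annotations(example))  # flags only the first record
```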
Finally, from an R&D standpoint, we were able to benchmark a variety of state-of-the-art language models using real client data collected from applied use cases. We identified and refined the prompts capable of achieving high performance on our tasks and passed this knowledge on to the client to support future efforts in this area.
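For illustration, benchmarking of this kind amounts to scoring each model’s per-topic predictions against the human annotations; the sketch below computes simple per-topic precision and recall over assumed data structures and does not reflect the client’s actual evaluation code or results.

```python
# Illustrative scoring harness: each item maps topic names to True/False
# for both the human annotation and a model's prediction.
def per_topic_scores(human, predicted, topics):
    """Compute precision and recall for each topic across a labeled set."""
    scores = {}
    for topic in topics:
        tp = sum(h[topic] and p[topic] for h, p in zip(human, predicted))
        fp = sum(not h[topic] and p[topic] for h, p in zip(human, predicted))
        fn = sum(h[topic] and not p[topic] for h, p in zip(human, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[topic] = {"precision": precision, "recall": recall}
    return scores

topics = ["food quality", "wait time"]
human = [{"food quality": True, "wait time": False},
         {"food quality": False, "wait time": True}]
predicted = [{"food quality": True, "wait time": True},
             {"food quality": False, "wait time": True}]
print(per_topic_scores(human, predicted, topics))
```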