Custom Web Crawler Uncovers Web Page Themes

The Challenge

A large international charity with staff around the globe hired Elder Research to provide a data-driven view of the text content shared on the organization’s sprawling corporate intranet. The charity planned to use these text insights in future user experience (UX) work.

The Solution

Elder Research developed a web scraping engine and deployed it to Amazon Web Services (AWS), collecting more than 230,000 documents from the client’s intranet. Using open-source data science tools, Elder Research also developed an HTML-to-text pipeline that filtered out unusable content and unwanted duplicates, extracting text from more than 70,000 unique pages and preparing it for analysis.
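
As an illustration only, an HTML-to-text step with filtering and de-duplication might look like the minimal sketch below. It is not the pipeline Elder Research built: the names `html_to_text` and `build_corpus`, the `min_words` threshold, and the choice of exact-match hashing to detect duplicates are all assumptions made for this example, and it uses only the Python standard library.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def build_corpus(pages, min_words=5):
    """Map url -> text, dropping near-empty pages and exact text duplicates.

    `pages` is a dict of url -> raw HTML. `min_words` is a hypothetical
    threshold for "unusable" content; a real pipeline would tune it.
    """
    seen_digests = set()
    corpus = {}
    for url, html in pages.items():
        text = html_to_text(html)
        if len(text.split()) < min_words:
            continue  # too little text to analyze
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue  # same text already collected under another URL
        seen_digests.add(digest)
        corpus[url] = text
    return corpus
```

Hashing the normalized text catches pages that differ only in markup or letter case. Near-duplicates, such as pages that share a template but differ by a date stamp, would need a fuzzier comparison than this sketch provides.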

After assembling this text corpus, Elder Research applied text mining and network analytics techniques to identify candidate themes within intranet sub-communities.
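
The case study does not name the specific text mining or network analytics methods used. Purely as a sketch of the general idea, the example below groups pages into sub-communities from their links and then ranks terms within each group. Connected components stand in for a real community-detection algorithm, and a smoothed TF-IDF score stands in for the theme-identification step. The function names, the stopword list, and the `links` and `corpus` inputs are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "from", "this", "that", "are"}


def find_communities(links):
    """Group pages into connected components of the undirected link graph.

    `links` is an iterable of (url_a, url_b) pairs. Connected components are
    a simple stand-in for community detection, used only for illustration.
    """
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    visited = set()
    communities = []
    for start in adjacency:
        if start in visited:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        visited |= component
        communities.append(component)
    return communities


def tokenize(text):
    """Lowercase alphabetic tokens, minus stopwords and very short words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def candidate_themes(corpus, communities, top_n=3):
    """Return the top-scoring terms for each community, in the same order.

    Each community's pooled text is treated as one document, and terms are
    scored by term frequency times a smoothed inverse document frequency.
    """
    counts = [
        Counter(t for url in comm if url in corpus for t in tokenize(corpus[url]))
        for comm in communities
    ]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(counts)

    themes = []
    for c in counts:
        scores = {
            term: tf * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        themes.append(ranked[:top_n])
    return themes
```

Terms that appear in every community score lower than terms concentrated in one, which is why the ranked lists tend to describe what is distinctive about each group rather than what is common across the whole intranet.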

Results

Using desktop and AWS-based computing resources, Elder Research identified collections of key themes describing each sub-community and organized these sub-communities into larger groups. These text-based inferences could better inform the client’s future UX decisions.