
The Power of Open Data and Crowdsourcing Analytics

Paul Derstine

April 20, 2018


Crowdsourcing, a combination of “crowd” and “outsourcing” first coined by Wired magazine in 2005 and fueled by the Internet, is a powerful sourcing model that leverages the depth of experience and ideas of a public group rather than an organization’s own employees. In The Importance of CrowdSourcing, Matt H. Evans points out that “Crowdsourcing taps into the global world of ideas, helping companies work through a rapid design process. You outsource to large crowds in an effort to make sure your products or services are right.” The claimed advantages of crowdsourcing include lower costs and greater speed, quality, flexibility, scalability, and diversity. It has been used by start-ups, large corporations, and non-profit organizations, and to create common goods. Wikipedia maintains a list of crowdsourced projects.

Last week the city of Charlottesville hosted the Tom Tom Festival Applied Machine Learning Conference. The conference’s emphasis was on “advancing the understanding of practical issues of applying data science and machine learning techniques to real world problems” and included a crowdsourcing challenge using open data sources. On its journey to become a “Smart City,” Charlottesville is looking to accelerate its evolution through the first City Open Data Challenge, organized by Daniel Bailey, co-founder and CTO of Astraea. The goal of the challenge was to use open data and crowdsourcing to engage the growing data science community within Charlottesville and the surrounding area to help the city better understand pedestrian use of the Downtown Mall. The Mall is one of the most successful pedestrian malls in the nation, with a vibrant collection of more than 120 shops and 30 restaurants, and the city wants to ensure that capital infrastructure plans are prioritized effectively.

The challenge required registered teams of data scientists to analyze a year of anonymized time series data on free WiFi usage, create a predictive model forecasting pedestrian usage, and identify the factors that influence pedestrian usage through visualization. Teams were encouraged to use other open data, including any relevant data found on the City of Charlottesville Open Data Portal.

The challenge was divided into two parts:

Best Predictive Model: Provide three models that generate one-week forecasts for the following time series:

  • Clients Per Day – number of people using city-provided WiFi services
  • Number of Sessions – sessions per day conducted by clients
  • Usage Over Time – kilobytes of usage per 4-hour window

Best Data Storytelling: Craft a narrative and visualizations that explain what is happening in the data (trends, anomalies, outliers, etc.). Enlighten members of the target audience to insights that would not be clear without charts or graphs.
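To make the forecasting task in the first part concrete, here is a minimal sketch of a one-week forecast for a daily series such as Clients Per Day. It uses a seasonal-naive baseline (each forecast day repeats the value from the same weekday one week earlier), which is a common reference point for time series work and is not necessarily what any team submitted. The function name and the client counts are hypothetical, and the sketch uses only the Python standard library so that it is self-contained.

```python
def seasonal_naive_forecast(history, horizon=7, season=7):
    """Forecast each future day with the value seen one season (week) earlier."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    forecast = []
    for step in range(horizon):
        # Look back a whole number of seasons so the index always lands
        # inside the observed history, even when horizon > season.
        offset = season * (step // season + 1)
        forecast.append(history[len(history) - offset + step])
    return forecast


# Hypothetical daily client counts for two weeks (Monday through Sunday, twice).
daily_clients = [120, 135, 128, 140, 180, 210, 95,
                 125, 138, 130, 145, 190, 220, 100]

next_week = seasonal_naive_forecast(daily_clients)
print(next_week)
```

A baseline like this is useful mainly as a yardstick: a more elaborate model is only worth its complexity if it beats the "same day last week" forecast on held-out data.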

More than 30 teams with more than 60 members participated in the challenge. Team HACK’D, a five-member team composed mostly of data scientists from Elder Research, walked away with the award for Best Storytelling. The panel of judges evaluated entries in five categories:

  • Soundness – How robust and rigorous is the analysis behind the storytelling?
  • Explainable – How well does the story explain what is going on in the data?
  • Appeal – How stunning are the visuals?
  • Accessibility – How accessible are the findings to a diverse audience?
  • Engagement – How engaging is the combined narrative and visuals?

I interviewed the HACK’D team [1] to learn more about what motivated them to join the challenge, how they went about developing their winning solution, and challenges they encountered along the way.


What is the value of crowdsourcing data science?

Kazlin Mason: Applying crowdsourcing to data science initiatives, via open data portals, allows for high-value data acquisition at a low cost. In typical research settings, many strategic decisions are made and acted upon based on relatively small sample sizes or surveys. In our connected world, incorporating multiple data sources and allowing capable researchers and analysts to come together in creative ways results in an improved understanding of the data and in broadly applicable insights.

What motivated you to participate in the challenge?

Cory Everington: While the competition took up a lot of our time after work, we thought it would be a good opportunity to learn some new skills. I think we were all also motivated by the chance to participate in a local competition and support the Charlottesville tech community. I think most of us had not participated in a competition like this before and we thought it would be a fun challenge to do in our spare time.

What was the biggest challenge with this project?

Halee Mason: The primary challenge was time. We decided to work on creating submissions for both parts of the challenge, the Best Predictive Model and Best Data Storytelling. Since our team consisted of industry professionals and we were working on this challenge after work, we were limited in the amount of time we could spend on the project. The Best Predictive Model contest had three models to train and tune [2], and we opted to create a website for Best Data Storytelling. Each task had a set of unique challenges where time was the limiting factor for the team.

What are the challenges of working with open source data?

Cory Everington: I think one of the greatest challenges of working with open source data was deciding on what data sources we wanted to use and where to find them. There were so many options and interesting data sets available that it was tempting to spend all of our time expanding datasets and not focusing on building a model or story. We ended up starting small by adding weather and holiday data and then iterating on feature creation to make sure we had enough time to spend understanding the data.
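As an illustration of the "start small" approach described above, here is a minimal sketch that attaches holiday and weather information to daily WiFi counts by date. It is an assumption-laden example, not the team's code: the function name, the field names, and every data value are hypothetical, and it uses only the Python standard library rather than the pandas workflow the team describes later.

```python
from datetime import date


def add_calendar_features(daily_counts, holidays, weather):
    """Return one feature row per day, joined on the calendar date."""
    rows = []
    for day, clients in sorted(daily_counts.items()):
        rows.append({
            "date": day,
            "clients": clients,
            "day_of_week": day.weekday(),     # Monday is 0, Sunday is 6
            "is_holiday": day in holidays,
            "high_temp_f": weather.get(day),  # None when no weather record exists
        })
    return rows


# Hypothetical inputs: WiFi client counts, a holiday calendar, and daily highs.
daily_counts = {date(2017, 7, 3): 150, date(2017, 7, 4): 90, date(2017, 7, 5): 160}
holidays = {date(2017, 7, 4)}
weather = {date(2017, 7, 3): 88.0, date(2017, 7, 4): 91.0}

rows = add_calendar_features(daily_counts, holidays, weather)
for row in rows:
    print(row)
```

Leaving a missing weather reading as None, rather than guessing a value, keeps the gap visible so it can be handled deliberately when the features are fed to a model.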

What, if any, other data sources were used?

Kazlin Mason: Data sources primarily focused on holiday data, Charlottesville event data, and weather data.


From there, the data were analyzed to determine how events and weather impacted WiFi and pedestrian usage. All data were consolidated from publicly available sources.

How did you attack the problem?

Anna Godwin: We started the Open Data Challenge by summarizing, plotting, and analyzing the WiFi data sources provided to us. We began identifying the interesting trends and outliers inherent in the original data sources. 


Figure 1. A monthly average of 3,478 users was observed for the whole data set. The peak time for WiFi visitors on the downtown mall is during the spring when the weather gets warmer (5,176 users in the month of April). Visitors and usage decrease as the year progresses, with an all-time low in November (1,393 users) and December* (1,262 users).

Then we explored what would happen if we layered in additional open data sources such as the weather and events taking place on the Downtown Mall. From there, a clear story began to form around what draws pedestrians to the mall. With the story complete, we were able to summarize our findings into a website and make recommendations to the City of Charlottesville.

What tools and techniques did you use?

Danny Brady: We used the containerization technology Docker in order to develop, and then deliver, our predictive models in a way that would be easy to deploy. We developed our time series models in Python using sklearn and statsmodels, and data was wrangled using the pandas package. Data was stored in a MySQL database and we interacted with the database using the Python SQLAlchemy package. Docker was used for containerized model deployment to enable the contest administrators to verify our model forecasts in a reproducible way.

Kazlin Mason: For Best Data Storytelling we developed a website using JavaScript, Vue, HTML, CSS, and Bootstrap 4. Static graphics and visualizations were made using Adobe Illustrator.

What were the most interesting insights revealed?

Kazlin Mason: The most interesting insight came from the premise of the contest: the idea to use WiFi data as a proxy for downtown mall pedestrian use. From this viewpoint, we broadly identified when visitors were more likely to interact on the mall. We were surprised to see that Sundays were a less popular day for downtown mall visitors. It was also surprising to see a drop-off in WiFi users during the holidays. These insights may identify key time frames for future community involvement and events, as well as ways to draw more visitors to the mall throughout the year.

The HACK’D team’s story led to recommendations that City of Charlottesville planners can use to drive traffic to the Downtown Mall, such as:

  • Hosting events to draw more Apple or tech users to the downtown mall may bring in greater volume for tech business and increase the number of WiFi users.
  • Device repair stores can use this data to better serve their customers and provide tailored services.
  • Consider opening a technology-related store, such as an Apple store, on the mall.
  • Alternatively, bring in these specific vendors during key technology-related events such as the Tom Tom Festival.


According to Mr. Bailey, “This challenge not only raises awareness of the open data initiative undertaken by the Charlottesville city, but also creates a pathway through which the city and its various counties can engage with the growing tech community for the purpose of social good.”



[1] Editor’s note: It took me a while before it dawned on me where their team name came from. (Hint: look at the members.)

[2] Editor’s note: HACK’D did have time to submit good predictions for one time series, and it was the most accurate of the contest. But the accuracy prize averaged accuracy across all three time series.




About the Author

As Director of Marketing, Paul Derstine works with clients to understand their data analytics goals and how Elder Research's vast experience with data engineering, data analytics, and data visualization can deliver solutions that enable return on their analytics investment. Prior to joining Elder Research, Paul worked for 18 years for GE Intelligent Platforms in Charlottesville, where he worked with a broad spectrum of global customers to understand their business needs in order to deliver solution value and optimize profit and growth objectives. Paul has a B.S. degree in Electrical Engineering from The Pennsylvania State University.