A Data Science Approach to Tallying Teslas on the Road

Why Is It Called The ‘Research Triangle’?
The Research Triangle gets its name from Research Triangle Park and three Tier 1 research universities—Duke University, North Carolina State University and University of North Carolina Chapel Hill—located only minutes apart.

The Research Triangle in North Carolina is often called the “Silicon Valley of the East”… just like every interesting city on the East Coast. Several metrics are used to measure a city’s importance to the tech community (number of companies, new jobs, size of research and academia), but there is one metric that I believe captures the most: the number of Teslas on the road. As polarizing as Tesla and Elon Musk can be, the cars do serve as a symbol of innovation and technology, the key ingredients of a tech hub.

As a Cary resident on the edge of Research Triangle Park, I feel like an unusually high number of Teslas are zipping around. My wife and I have made a game out of spotting them first while driving around town (in our very non-Tesla minivan). Is it just our imagination, or is Cary really a hotbed for Tesla activity? How does Cary’s ‘Silicon Valley-ness’ measure up to the real Silicon Valley? To settle this question, I decided to do some detective work. And lucky for me, I was headed to a conference in San Francisco, the tech capital of the world.

The Data

To compare the data, I first needed to collect it. I could of course try finding metrics online, but what’s the fun in that? I decided to collect the data personally. The hotel I was staying at on Lombard Street was roughly one mile from the conference location in Japantown. On that walk I counted all of the passenger cars that I passed: 86 vehicles, 9 of which were Teslas.

The following week when I returned to North Carolina, I counted passenger vehicles on my morning commute to drop the kids off at school: 95 vehicles, 6 of which were Teslas.

Great; now we can compare.

In the most naïve approach, we could simply calculate the rate and compare directly: roughly 10% of the SF cars were Teslas (9 of 86), vs. roughly 6% of the NC cars (6 of 95). SF is clearly more Tesla-dense, right? Not necessarily; I only collected one sample from each city, so there remains quite a bit of uncertainty about their overall proportions. The 10% and 6% are ‘sample proportions’, while we are seeking the population proportions. The difference in sample proportions could be due either to differences in the populations, or just random chance. How sure can we be that San Francisco has more Teslas?
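To put a rough number on that uncertainty, here is a minimal sketch that computes each sample proportion together with its estimated standard error. The function name and the use of the simple binomial standard error are my own choices for illustration, not part of the original analysis.

```python
import math

# Counts reported in the text above: (Teslas, passenger cars counted).
counts = {
    "San Francisco": (9, 86),
    "Cary": (6, 95),
}

def sample_proportion(successes, n):
    """Return the sample proportion and its estimated binomial standard error."""
    p_hat = successes / n
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, std_err

results = {city: sample_proportion(k, n) for city, (k, n) in counts.items()}
for city, (p_hat, std_err) in results.items():
    print(f"{city}: {p_hat:.1%} Teslas, standard error {std_err:.1%}")
```

Each standard error comes out at a few percentage points, which is a sizeable fraction of the gap between the two cities. That is why the raw comparison alone is not conclusive.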

This is the fun part, and we’ll look at a frequentist approach and a Bayesian approach for the answer.

Frequentist Approach

The frequentist approach to comparing sample proportions is one of the most widely used statistical methods. If you took undergraduate statistics, you came across this (and almost certainly forgot about it afterwards). The formula calculates the significance of the difference in the two proportions (that is, how likely a difference at least this large would be if the true proportions were equal and only chance were at work), accounting for the size of the samples. Larger samples make the measured proportions more precise, and thus make it less likely that a given difference arose by chance alone.

To compare the two sample proportions, calculate a test statistic, convert it to a p-value, and then compare that p-value to a predetermined level of significance, which is typically 0.05 (not for any good reason at all).

The math here is simple enough:
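Assuming the standard pooled two-proportion z-test (the symbols below are introduced here for illustration), with x_1 Teslas out of n_1 cars in San Francisco and x_2 Teslas out of n_2 cars in Cary, the test statistic can be written as:

```latex
\hat{p}_1 = \frac{x_1}{n_1}, \qquad
\hat{p}_2 = \frac{x_2}{n_2}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}

z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
```

With the counts above, \hat{p}_1 = 9/86 ≈ 0.105, \hat{p}_2 = 6/95 ≈ 0.063 and \hat{p} = 15/181 ≈ 0.083, which gives z ≈ 1.01. The two-sided p-value is the probability that a standard normal variable exceeds |z| in absolute value.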

Or use statistical software (here, R):

When testing whether the two cities have a different proportion of Teslas, we see that our p-value is 0.31. This is higher than our predetermined level of significance (0.05), so we cannot conclude that the Tesla proportion is different in San Francisco vs. Cary. We must be careful, though, because we also can’t conclude that the rate is the same in the two cities!
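As a cross-check, the same test can be sketched in Python using only the standard library. This assumes a pooled two-sided z-test without continuity correction, which is the variant that matches the p-value quoted above. The function name is hypothetical.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for the equality of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided standard normal tail: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# San Francisco: 9 Teslas of 86 cars; Cary: 6 Teslas of 95 cars.
z, p_value = two_proportion_z_test(9, 86, 6, 95)
print(f"z = {z:.2f}, p-value = {p_value:.2f}")  # z = 1.01, p-value = 0.31
```

Applying a continuity correction, as some statistical packages do by default, would give a larger p-value and the same conclusion.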

Bayesian Approach

Another approach to comparing sample proportions is based on Bayes’ theorem and the principles of Bayesian inference. In contrast to the frequentist approach, the Bayesian approach allows for the incorporation of prior knowledge about the parameter of interest. We start with a prior probability distribution for the parameter (the proportion of Teslas on the road) and update it based on the data we collect.

To use Bayesian analysis, we estimate a distribution for the parameter which reflects our beliefs about the true proportion of Teslas in each area. Notice here that we have a distribution of possible proportions rather than a single estimate. This can help us quantify the uncertainty of our knowledge.

First, we need to figure out our “prior” information about the proportion of Teslas on the road. According to Car and Driver, roughly 1% of the cars on the road are electric. According to Electrek, Tesla owns roughly two-thirds of the EV market in the US. This would mean that roughly 0.67% of cars on the road in the US are Teslas. However, there is uncertainty in those estimates, and they cover the entire United States. We would expect the rate to be higher in major cities. We can choose a relatively wide distribution as our prior since we have a lot of uncertainty.

We’ll assume the mean of the prior distribution is 0.05 (so our prior expectation is that 5% of the cars in either city are Teslas) and the standard deviation is 0.1 (which reflects high uncertainty). We can model this as a Beta distribution with that mean and standard deviation.
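One way to turn a mean and standard deviation into Beta parameters is the method of moments. The sketch below assumes that approach; the function name is hypothetical, and the original analysis may have fitted the prior differently.

```python
def beta_from_mean_sd(mean, sd):
    """Method-of-moments Beta(alpha, beta) parameters for a given mean and sd."""
    variance = sd ** 2
    if not 0 < variance < mean * (1 - mean):
        raise ValueError("sd is too large for a Beta distribution with this mean")
    concentration = mean * (1 - mean) / variance - 1  # equals alpha + beta
    return mean * concentration, (1 - mean) * concentration

prior_alpha, prior_beta = beta_from_mean_sd(0.05, 0.1)
print(f"alpha = {prior_alpha:.4f}, beta = {prior_beta:.4f}")
```

This gives roughly Beta(0.19, 3.56). Because alpha comes out below 1, the prior puts most of its weight close to zero and has a long right tail, which suits a rate that is small nationally but could be much higher in a given city.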

Next, we can create ‘posterior distributions’ of the Tesla proportions in both San Francisco and Cary. We will start with the same prior distribution, and then update using the observed data from my survey.
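With a Beta prior and count data, the update has a closed form: add the Teslas to alpha and the non-Teslas to beta. Here is a minimal sketch, assuming the prior is approximately Beta(0.1875, 3.5625), a method-of-moments fit to mean 0.05 and standard deviation 0.1. The function names are hypothetical.

```python
# Assumed prior: Beta(0.1875, 3.5625), i.e. mean 0.05 and sd 0.1.
PRIOR_ALPHA, PRIOR_BETA = 0.1875, 3.5625

def posterior_params(teslas, total, alpha=PRIOR_ALPHA, beta=PRIOR_BETA):
    """Conjugate Beta-binomial update: Teslas go to alpha, non-Teslas to beta."""
    return alpha + teslas, beta + (total - teslas)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

sf_post = posterior_params(9, 86)    # San Francisco: 9 Teslas of 86 cars
cary_post = posterior_params(6, 95)  # Cary: 6 Teslas of 95 cars
print(f"SF posterior:   Beta{sf_post}, mean {beta_mean(*sf_post):.3f}")
print(f"Cary posterior: Beta{cary_post}, mean {beta_mean(*cary_post):.3f}")
```

The posterior means land close to the raw sample proportions because the prior is weak compared with even these small samples.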

[Figure: Posterior Distribution of Tesla Rate. x-axis: Tesla Rate.]

We can see that generally the rate is expected to be higher in San Francisco than in Cary, but there is still a lot of overlap in the distributions.

Since what we really care about is the difference in Tesla rates between San Francisco and Cary, we can repeatedly draw a value from each of the two posterior distributions and take the difference between them. This gives us a distribution of the differences between the San Francisco Tesla rate and the Cary Tesla rate. Zero indicates no difference; a positive value indicates that San Francisco has a higher rate of Teslas.
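The sampling step can be sketched with the standard library alone. The posterior parameters below assume a Beta(0.1875, 3.5625) prior updated with the survey counts. The exact numbers depend on that assumption and on the random seed, so they may differ slightly from the ones reported in this post.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed posteriors: prior Beta(0.1875, 3.5625) plus the observed counts
# (9 Teslas and 77 other cars in SF; 6 Teslas and 89 other cars in Cary).
SF_POST = (0.1875 + 9, 3.5625 + 77)
CARY_POST = (0.1875 + 6, 3.5625 + 89)

N_DRAWS = 100_000
diffs = sorted(
    random.betavariate(*SF_POST) - random.betavariate(*CARY_POST)
    for _ in range(N_DRAWS)
)

median_diff = diffs[N_DRAWS // 2]
prob_sf_higher = sum(d > 0 for d in diffs) / N_DRAWS
print(f"median difference: {median_diff:.3f}")
print(f"P(SF rate > Cary rate): {prob_sf_higher:.2f}")
```

The share of draws above zero is the probability, under this model, that San Francisco's Tesla rate is the higher of the two.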

[Figure: Difference in Rate of Teslas (San Francisco, CA and Cary, NC). x-axis: Percentage Point Difference.]

With the Bayesian approach we have some more flexibility in the way we think about the output. We don’t simply reject or fail to reject the null hypothesis; we have a distribution of values. This distribution is centered at 0.036, so our best single estimate is that the rate of Teslas is about 3.6 percentage points higher in SF than in Cary. We can also gauge how confident we are that the rate is higher in SF at all. As shown in the graph, zero difference is at the 16th percentile of this distribution, so we can say we are 84% confident the rate of Teslas is higher in San Francisco.

Which should we use?

There’s some debate on this question. The frequentist approach is simpler, faster to calculate, and better supported by off-the-shelf tools. The Bayesian approach is more flexible, gives directly interpretable probabilities, and can incorporate prior knowledge. In a case like this where we’ve collected very little data, a Bayesian approach lets us see a distribution of plausible differences.

With increased computing power, it’s becoming easier to conduct the sampling that drives Bayesian inference, though it is still more complicated to implement. It can be worth it though, if you have strong priors, or want to see the range of possibilities. In most cases, the frequentist and Bayesian approaches have consistent conclusions.

Caution!

This was meant to be a fun exercise rather than a robust measurement of Cary’s standing as a tech hub. Here are the biggest ‘cautions’ to this analysis if we wanted to treat it seriously:

Are we asking the right question?

What does it mean to be a tech hub? Do we need to measure the importance of an entire city, or region, or neighborhood? What will we do with that information and how will it impact our decision making? Which stakeholders care about this?

Are we using the right metric(s)?

Is “Tesla Rate” really a useful metric? Is it measuring what we want it to measure, or does it serve as a useful proxy? If it’s just a proxy, how can we use it to estimate the metrics that we really care about?

Is our data ‘good’?

I only collected data on one street, at one time, on one day. This is certainly not representative of all the passenger vehicles in the city. I’m also a human, and a rather sloppy one, so I’ve almost certainly made errors in counting the vehicles.

Conclusion

I would answer each of the above questions confidently with a ‘no’, so this analysis shouldn’t be used to drive any real decisions. But this framework for how to compare data can be useful if you have the right place to apply it – like comparing the success rates of medical treatments, the performance of marketing campaigns, or the impact of different social policies.

To gain insights into complex phenomena and make informed decisions, we must be mindful of the assumptions and limitations of our data, and combine different statistical methods with qualitative research, expert opinions, and stakeholder feedback. Statistical analysis is just one piece of the puzzle, and we must approach it with caution and a critical eye. By using a well-rounded and holistic approach to data analysis, we can improve the world for all!
