How Do You Begin a Data Science Project?


Amy Snyder

Date Published:
January 9, 2023

Where do you begin?

If you want to use data science to improve your chances of making a well-thought-out decision in a business or mission opportunity, what should you do? Where should you start?

Are the objectives clearly defined?

If life were like a college test, a client might say, “Here is my knowledge gap, carefully formulated as a question, and here is my structured data; please provide a machine learning model that has a true positive rate of at least 85%.”

In real life, often the hardest part of helping a client make a data-based decision is identifying their key knowledge gap and reshaping it into a focused question that addresses their specific business or mission needs.

Let’s look at an example through a story about my friend Jocelyn. Jocelyn has two hobbies: road trips in her RV and line-dancing at dance halls. She recently asked me how a data scientist might help her participate in both activities with the goal of having a great time. However, Jocelyn knew that we didn’t share those hobbies and wondered if I would be able to help her without having the same background knowledge.

What else is hidden in the knowledge gap?

Unlike school or a Kaggle competition[1], in which an expert third party has carefully crafted both the question and data, life is much messier.

One must identify a knowledge gap – an unknown where additional insight would allow the client to make a decision informed by current data – and come up with a question to answer.

But initial questions often contain implicit assumptions and biases. Similarly, the data might have gaps or collection biases. Moreover, we won’t know if our data has problems, or if it’s even the right data to address the issue, until we’ve worked with the client to understand their gap and why it’s important. That work is called Problem Framing.

What is Problem Framing?

After a client has identified a gap, Problem Framing is the process of breaking down what needs to be learned into component pieces and rebuilding key components into a quantifiable question.

It’s crucial for making sure that you and the client are on the same page, and it surfaces mutual assumptions and potential biases.

In our example, the traveling part of the knowledge gap is a good candidate for problem framing. Efficiently traveling between different locations (the traveling salesman problem) is a hard question, but an old one, with a hundred years of history.

But before I can determine Jocelyn’s best route, I must identify the points she wants to visit. For that, we need to have an in-depth conversation.
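To make the routing half of the problem concrete, here is a minimal sketch of the brute-force approach to a round trip: try every visiting order and keep the shortest. The stop names and coordinates are hypothetical, and straight-line distance on a flat grid stands in for real drive times. This is an illustration only, not the method I used for Jocelyn.

```python
from itertools import permutations
from math import dist

# Hypothetical stops on a flat (x, y) grid, in arbitrary units.
# Real planning would use road distances or drive times instead.
HOME = (0.0, 0.0)
STOPS = {
    "Hall A": (3.0, 0.0),
    "Hall B": (3.0, 4.0),
    "Hall C": (0.0, 4.0),
}

def route_length(order, stops=STOPS, home=HOME):
    """Total length of a round trip: home -> stops in the given order -> home."""
    points = [home] + [stops[name] for name in order] + [home]
    return sum(dist(a, b) for a, b in zip(points, points[1:]))

def shortest_round_trip(stops=STOPS, home=HOME):
    """Try every visiting order and return the shortest one."""
    return min(permutations(stops), key=lambda order: route_length(order, stops, home))

best = shortest_round_trip()
print(best, route_length(best))
```

The number of orderings grows factorially with the number of stops, so this only works for a handful of points. That is one more reason to narrow down the list of places worth visiting before worrying about the route.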

How Do We Make a Qualitative Problem Quantitative?

How can we quantify ‘having a great time’? I have my own understanding of the phrase, but I need to understand Jocelyn’s definition. (Many more data-based projects would be implemented if they focused on what the client needs and not what the analyst is most eager to do!) I need to hear about her experience.

We started with dance halls. I asked Jocelyn to describe memorable visits and what she liked or disliked about each part. After building my basic understanding of a visit, I brainstormed about additional features (some of which required drawing in external data), including:

Has she found that halls with higher customer ratings are more enjoyable?

Does she like to start by a specific time?

Does the location need to serve food?

What cover charge is acceptable?

Does the location need to play a specific type of music? Or even a specific song that she loves to dance to?

Then we went into the part of her question that deals with road trips in her RV. Again, I’ve been on road trips before, and that can help my brainstorming, but I needed to understand what a great road trip looks like for Jocelyn. We discussed details, including:

What duration range works best?

Does she care more about efficient driving or about having a picturesque drive?

How far is comfortable to drive each day?

How long does she like to spend in each location?

Does she want time in the schedule to do other things besides visit dance halls?

At which venues can she park her RV?

Does she want options or a full plan?

What is the Result?

For each of these questions, we talked about how important each answer is to Jocelyn. Then, Jocelyn and I narrowed down her gap:

Initial Knowledge Gap Question:

“Where to go line-dancing at dance halls on an RV road trip?”

Initial Criteria:

“Have a great time.”

Revised Question:

“What are two possible itineraries for line-dancing at different dance halls on an RV road trip?”

Revised Dance-hall Criteria:

“Dance halls must be open by 8 pm, cost less than $15 to enter, be rated 3.8 stars or higher, and play The Electric Slide. It would be nice if they serve food, but that is not required.”

Revised Trip Criteria:

“Travel between October 7th and October 28th, start and end the trip at home, drive no longer than 4 hours a day, and have acceptable parking within 30 minutes of each dance hall. It would be nice if the route is pretty, and she is willing to spend an extra 45 minutes a day driving for scenic overlooks or great fall foliage.”

Note that our criteria/question may have qualities we want to optimize (maximize or minimize), as well as hard and soft constraints.

Initial constraints may have to be loosened for a viable solution to be found, so the definition process may need to be revisited after the solution space is explored.
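As a rough illustration of the difference between hard and soft constraints, the revised dance-hall criteria could be sketched as a filter plus a ranking. Everything named here is an assumption made for the example: the `DanceHall` fields, the venues, the reading of “open by 8 pm” as “doors open at or before 8 pm,” and the choice to rank food-serving halls first. It is a sketch, not a finished tool.

```python
from dataclasses import dataclass

@dataclass
class DanceHall:
    name: str
    opens_at_hour: int          # 24-hour clock, e.g. 19 means 7 pm
    cover_charge: float         # dollars
    rating: float               # stars
    plays_electric_slide: bool
    serves_food: bool

def meets_hard_constraints(hall, latest_open=20, max_cover=15.0, min_rating=3.8):
    """Hard constraints: a hall failing any one of these is ruled out."""
    return (
        hall.opens_at_hour <= latest_open
        and hall.cover_charge < max_cover
        and hall.rating >= min_rating
        and hall.plays_electric_slide
    )

def rank_halls(halls, **limits):
    """Filter on hard constraints, then prefer food (soft), then higher rating."""
    viable = [h for h in halls if meets_hard_constraints(h, **limits)]
    return sorted(viable, key=lambda h: (h.serves_food, h.rating), reverse=True)

# Hypothetical venues, invented for illustration only.
halls = [
    DanceHall("Boot Scoot Barn", 19, 10.0, 4.5, True, False),
    DanceHall("Two-Step Tavern", 20, 12.0, 4.0, True, True),
    DanceHall("Neon Spur", 21, 8.0, 4.7, True, True),
    DanceHall("Dusty Heel", 18, 20.0, 4.9, True, True),
]

strict = [h.name for h in rank_halls(halls)]
# Loosening one hard constraint (doors may open as late as 9 pm) widens the options.
loosened = [h.name for h in rank_halls(halls, latest_open=21)]
print(strict)
print(loosened)
```

With these made-up venues, the strict criteria leave two halls, and relaxing the opening time brings a third back in. That is the kind of trade-off worth taking back to the client rather than deciding on her behalf.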


After doing the work together to refine the question, Jocelyn and I have a shared understanding of the problem and which criteria matter most to her. We’ve also learned what data we need to answer her question. For example, we’ll need opening hours, customer reviews, and locations for RV parking.

Breaking down the problem into components ensures that the client and the data scientist are on the same page. This is harder in practice than one would at first think! But if you’re truly going to help someone, the first step is to start with a quantifiable question that clearly defines their objectives.