Tools for Data Science

Zohar Strinka
Wednesday February 28, 2018

In the world of data science, “open” is a big trend: open data, open source tools, and openly shared analysis that helps others on their own data science journeys. In this post we show three different tools for getting “insight” (the job of data science) from data, and how we used them to study a single data set.

Mashey ran an internal event, based on the model started by Atlassian, in which teams within the company had 24 hours to deliver something outside of their normal jobs. For that event, one group of developers set out to analyze public data using free and open source tools. We are sharing that analysis here to demonstrate how data scientists ask and answer different questions in order to ultimately get insight from data.

The code for this project is available in its GitHub repository.

The Data: The city of Boulder has an open data initiative and invites anyone to download and use its data. One of the available data sets is Public Free WiFi usage. It seemed like a good candidate for machine learning because it was large enough (2.6 million recorded connections) and had several interesting features for each record. Looking at the information available, we wondered if we might answer questions like the following:

  • Do Apple users consume more data than others?
  • Are there people using so much WiFi that they are negatively affecting the network?
  • How does network quality impact usage?

After settling on the data and a rough outline of business questions, our team began to explore the data set using Qlik Sense (a data visualization tool), R (a popular tool for statistics), and Python (a popular tool for machine learning). At this stage we discovered issues with the data, including duplicates and rows with zero usage information. After handling these issues, we began our more in-depth analysis.
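As a rough illustration of that cleanup step, here is a minimal Python sketch that drops exact duplicate rows and rows with no recorded usage. The field names (`device`, `bytes_sent`, `bytes_received`) and the sample rows are hypothetical and not taken from the Boulder data set; the team's actual cleaning rules are not described in this post.

```python
def clean_records(records, usage_fields=("bytes_sent", "bytes_received")):
    """Drop exact duplicate rows and rows with no recorded usage.

    The field names are assumptions made for this sketch.
    """
    seen = set()
    cleaned = []
    for row in records:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        # Missing usage fields are treated as zero.
        if sum(row.get(field, 0) for field in usage_fields) == 0:
            continue  # no usage recorded for this connection
        cleaned.append(row)
    return cleaned


# Made-up sample rows for illustration only.
sample = [
    {"device": "Apple", "bytes_sent": 120, "bytes_received": 900},
    {"device": "Apple", "bytes_sent": 120, "bytes_received": 900},  # duplicate
    {"device": "Android", "bytes_sent": 0, "bytes_received": 0},    # zero usage
    {"device": "Windows", "bytes_sent": 40, "bytes_received": 310},
]
cleaned = clean_records(sample)
print(len(cleaned))
```

Whether a zero-usage row is an error or a real (if uninteresting) connection is a judgment call, so it is worth counting how many rows each rule removes before discarding them.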

Visualizations: Using Qlik Sense, we delved into the data and attempted to see if there were patterns we could identify by filtering a subset of data or producing the right graph.

[Figure: Qlik Sense chart of average data usage for Apple and non-Apple users]

In the included graph, for example, there is no clear trend of Apple users consuming more data (on average) than non-Apple users. This helped guide our later analysis by allowing us to focus on questions without obvious answers.
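The numbers behind a chart like this are simple group averages. The sketch below computes them in plain Python; the field names `is_apple` and `total_bytes` and the sample values are invented for illustration and are not results from the actual data.

```python
from collections import defaultdict
from statistics import mean


def average_usage_by_group(records, group_field="is_apple", usage_field="total_bytes"):
    """Return the mean of usage_field for each distinct value of group_field."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_field]].append(row[usage_field])
    return {group: mean(values) for group, values in groups.items()}


# Made-up rows for illustration only.
sample = [
    {"is_apple": True, "total_bytes": 100},
    {"is_apple": True, "total_bytes": 300},
    {"is_apple": False, "total_bytes": 150},
    {"is_apple": False, "total_bytes": 250},
]
averages = average_usage_by_group(sample)
print(averages)
```

In this toy sample both groups average the same usage, which is the kind of "no clear trend" picture described above.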

Statistics: Using RStudio and a Jupyter notebook, we searched for correlations that would help us build a more effective machine learning model in the next phase of the project. We developed a definition of a “High-Power user”: someone who may use enough data to negatively affect other users. With each user labeled as High-Power or not, we were able to see which of the other features in the data were correlated with usage.
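To make the idea concrete, the sketch below flags the heaviest users and measures how strongly that flag correlates with another numeric feature. The cutoff (top 20% by usage), the `session_minutes` feature, and the sample values are all assumptions for illustration; the post does not state the team's actual definition of a High-Power user.

```python
from math import sqrt


def flag_high_power(usages, top_fraction=0.2):
    """Return 1 for users in the top `top_fraction` by usage, else 0.

    Ties at the threshold are all flagged, so slightly more than
    `top_fraction` of users can be marked. The cutoff is an assumption.
    """
    cutoff_index = max(1, round(len(usages) * top_fraction))
    threshold = sorted(usages, reverse=True)[cutoff_index - 1]
    return [1 if usage >= threshold else 0 for usage in usages]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; reported as 0 here
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only.
usages = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
session_minutes = [5, 6, 7, 8, 9, 10, 11, 12, 60, 90]

flags = flag_high_power(usages)
corr = pearson(flags, session_minutes)
print(flags, round(corr, 3))
```

Correlating a 0/1 flag with a numeric column in this way is the point-biserial correlation, which is a special case of Pearson's coefficient.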

Machine learning: Using Python via the Anaconda distribution, we applied machine learning and standard validation techniques to see what might be accurately predictable from our data. We discovered that features describing the user rather than the usage (type of device, connection time, location, signal quality) could not predict who the High-Power users would be any better than guessing.
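One way to check a claim like "no better than guessing" is to compare a model's held-out accuracy with a baseline that always predicts the most common label. The sketch below does this with a simple k-fold split and a deliberately simple per-category classifier. It is not the team's model (the post does not name the algorithm used), and the device values and labels are made up.

```python
from collections import Counter, defaultdict


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_per_category(features, labels):
    """Learn the most common label for each category value."""
    by_category = defaultdict(list)
    for feature, label in zip(features, labels):
        by_category[feature].append(label)
    fallback = majority(labels)  # used for categories unseen in training
    table = {cat: majority(ys) for cat, ys in by_category.items()}
    return lambda feature: table.get(feature, fallback)


def cross_validate(features, labels, k=5):
    """Return (model_accuracy, baseline_accuracy) pooled over k folds."""
    n = len(labels)
    model_hits = baseline_hits = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th row, offset by fold
        train_x = [features[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        predict = fit_per_category(train_x, train_y)
        guess = majority(train_y)  # baseline: always the majority class
        for i in test_idx:
            model_hits += predict(features[i]) == labels[i]
            baseline_hits += guess == labels[i]
    return model_hits / n, baseline_hits / n


# Made-up data: 20% of users are High-Power regardless of device type.
devices = ["Apple", "Android"] * 10
high_power = [0] * 20
for i in (0, 1, 10, 11):
    high_power[i] = 1

model_acc, baseline_acc = cross_validate(devices, high_power)
print(model_acc, baseline_acc)
```

When the two accuracies match, as they do on this toy data, the feature adds nothing beyond the base rate. With rare classes such as High-Power users, a high raw accuracy can therefore be misleading unless it is read against the baseline.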

Parting thoughts: At the end of the analysis, it is easy to be disappointed that we were not able to predict what we wanted to know. However, this is ultimately a snapshot of how data scientists iterate toward an answer. With the knowledge gained, the next cycle might focus on determining whether low signal quality predicts usage, or whether the library attracts a different mix of users than other locations.

Ultimately, we found it interesting to see how different tools help focus your attention on different aspects of a problem. If we had not been planning to do a prediction with machine learning, we might not have thought carefully about which variables are really properties of the user rather than the usage. Similarly, by visualizing the data we were able to identify issues in the original data set more quickly and ask ourselves how including or excluding certain data might affect our analysis. Finally, while machine learning is popular in part because it can accept almost any input, using statistics to focus your efforts can be an asset.

In the end, all of these tools are only useful if there is a human to close the loop and turn the analysis into insight. Here at Mashey we can help clarify the insight you could be getting from your data, and work with you to drive change within your organization.

Mashey is a next generation data and analytics consultancy that designs and implements modern strategies that help transform companies into data-driven and data-informed teams. Through business intelligence, data warehousing, and data science, Mashey provides companies with the advantages of veteran experience and startup agility.