Dan Murray explores three core areas of data and shares insight into how we should think about each in order to work with data more effectively.
During the holidays I spent some time reading about the terms data science, big data and data visualization. This is part of my annual thinking and planning cycle. Because there is so much buzz surrounding these words, I wanted to sort out how to think about the relationship between these topics.
How Buzzy Are These Terms?
Searching the terms data science, big data and data visualization on Google Trends shows their relative search interest over the past 10 years.
Big data has the most buzz, followed by data science and then data visualization. Searching the same terms using the Google Books Ngram Viewer yields a different picture:
More books have been written about data visualization than about the other two terms, which isn’t surprising since data visualization (the term) has been around longer. Data visualization is a less nebulous term as well, which may explain why more people are searching data science and big data.
These Are Big Topics
Because there’s a lot of territory to cover, I’m going to break this into a series of posts, each focusing on one of the terms. I’ll start with data science, then look at big data and finish with data visualization in the context of its relationship to the other terms.
What Is Data Science Exactly?
The term data science first appeared in a 1974 book by Peter Naur called "Concise Survey of Computer Methods." According to Wikipedia, Naur had been using the term since the 1960s as a substitute for computer science.
In 1997, C.F. Jeff Wu gave a lecture entitled "Statistics = Data Science?"
Data science in its current iteration was introduced by Bill Cleveland in a 2001 article in the International Statistical Review.
Who Owns Data Science?
There seems to be a tussle going on within academia regarding data science. Is it part of mathematics, statistics or computer science? My answer is “yes.” The field involves many different disciplines. Data science is to these fields what nanotechnology is to biology, chemistry and physics.
A data scientist who knows math and statistics but doesn’t understand database technologies is crippled. In my opinion, the math and statistics are less important than knowing how to hack and glue disparate data sources together to glean insight into what has already happened.
There is a lot of focus on machine learning and predictive analytics in the tech sector and other industries like insurance, banking and finance because accurate projections can result in competitive advantage.
A great paper by David Donoho entitled "50 Years of Data Science" does a pretty good job of bringing the term into focus. He breaks down the term into six divisions:
- Data Exploration and Preparation
- Data Representation and Transformation
- Computing with Data
- Data Modeling
- Data Visualization and Presentation
- Science About Data Science
Data Exploration and Preparation
This is the starting point for every inquiry. The data must be cleaned and structured to facilitate analysis. In businesses, data quality is the result of process control. Companies that understand how to hone business processes to achieve reliable data quality have a competitive advantage. In more academic or exploratory work, data quality is achieved through the careful interpretation and exploration of novel data sources with appropriate tools and methods.
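As a rough illustration of what "cleaned and structured" can mean in practice, here is a minimal Python sketch using only the standard library. The column names, sample values and cleaning rules are all hypothetical, invented for this example; real rules depend on the business process that produced the data.

```python
import csv
import io

# Hypothetical raw export; column names and values are made up for illustration.
RAW_CSV = """customer,region,revenue
 Acme Corp ,east,1200
acme corp,East,1200
Globex,WEST,
Initech,west,950
"""

def clean_rows(text):
    """Trim whitespace, normalize case, drop rows without revenue, and de-duplicate."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        customer = row["customer"].strip().title()
        region = row["region"].strip().lower()
        revenue = row["revenue"].strip()
        if not revenue:
            continue  # skip records with no revenue value
        key = (customer, region)
        if key in seen:
            continue  # skip duplicates that differ only in spacing or case
        seen.add(key)
        cleaned.append(
            {"customer": customer, "region": region, "revenue": float(revenue)}
        )
    return cleaned

rows = clean_rows(RAW_CSV)
for row in rows:
    print(row)
```

Dropping incomplete rows is only one possible choice here. In many cases it is better to flag them and fix the process upstream.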
Data Representation and Transformation
This section includes the broad topics of modern databases and the use of the appropriate mathematical structures. My “nuts & bolts” interpretation of this area is this: You need to understand the database and the data model. You have to be able to interpret what is available and what is missing in order to achieve a result.
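One simple way to see "what is available and what is missing" is to compare the keys of two sources before joining them. The sketch below assumes two hypothetical tables, `customers` and `orders`, linked by a customer ID. The names and values are illustrative and not taken from any real system.

```python
# Hypothetical tables keyed by customer ID; names and values are illustrative only.
customers = {101: "Acme Corp", 102: "Globex", 103: "Initech"}
orders = [
    {"customer_id": 101, "amount": 250.0},
    {"customer_id": 103, "amount": 75.5},
    {"customer_id": 104, "amount": 40.0},  # no matching customer record
]

def audit_join(customers, orders):
    """Report which keys match and which are missing on either side."""
    order_ids = {order["customer_id"] for order in orders}
    customer_ids = set(customers)
    return {
        "matched": sorted(customer_ids & order_ids),
        "customers_without_orders": sorted(customer_ids - order_ids),
        "orders_without_customer": sorted(order_ids - customer_ids),
    }

report = audit_join(customers, orders)
for name, ids in report.items():
    print(name, ids)
```

A check like this is worth running before any analysis, because unmatched keys on either side quietly change the totals you report.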
Computing with Data
This topic covers the tools you need to understand the data. Those who are technically inclined would use tools like R or Python to analyze and interpret a data set. For less technical people, there are tools like Tableau that can be used to bring large data sets into focus.
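For readers curious what the Python route looks like, here is a small sketch that groups records and summarizes them with the standard library. The `sales` records are hypothetical sample data; a real project would more likely load a file and use a dedicated analysis library.

```python
import statistics
from collections import defaultdict

# Hypothetical (region, revenue) records, made up for illustration.
sales = [
    ("east", 1200.0),
    ("east", 800.0),
    ("west", 900.0),
    ("west", 1050.0),
    ("west", 1200.0),
]

def summarize(records):
    """Group revenue by region and compute count, mean and median for each group."""
    groups = defaultdict(list)
    for region, revenue in records:
        groups[region].append(revenue)
    return {
        region: {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for region, values in groups.items()
    }

summary = summarize(sales)
for region, stats in summary.items():
    print(region, stats)
```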
Data Modeling
This area describes the more technical issues related to applying the tools efficiently on the cluster and cloud computing platforms that are part of modern computing infrastructures. In order to make the data useful, the platform must be fast and efficient.
Data Visualization and Presentation
This area provides the means for people to understand the data. Using the appropriate visualization can improve understanding, while the wrong one can mislead. The market success of [Tableau Software](http://www.tableau.com/) clearly shows the appetite for making data accessible and understandable.
Science About Data Science
David Donoho states that the "science of data analysis" is among the “most complicated of all sciences.” As a practitioner in the business information field, I don’t care who owns what. I’m focused on what works and how it can be implemented quickly and cost-effectively.
How and Where to Study Data Science
The market is bidding up these skills. Data science has been described as the sexiest job of the 21st century, and the education marketplace has started to catch up with the demand. With that in mind, I decided to build a dashboard that visualizes the degree programs and provides links to their content:
Many thanks to Ryan Swanstrom for his GitHub site. I used his file as the data source for my dashboard, making a few updates to some of the websites. I didn’t review every website. If you find any mistakes, please let me know and I’ll update my files.
As you can see, there are 495 degree programs globally, and 390 of them are available in the United States. The majority offer master's degrees as the content is diverse and deep. It is interesting to explore who owns these programs (mathematics, statistics or computer science). The best ones seem to pluck out content from all three.
The University of Michigan recently announced a $100 million data science initiative. I like the approach UC Berkeley takes with its program, but there are many schools offering what seems to be diverse content.
Come back to see next week’s post on big data. I’ll take a look at the universe of data products available and which database tools make sense to use in different situations.