What is data science? A closer look at science’s latest priority dispute

Two popular data science algorithms – naïve Bayes and eigen centrality – are used to examine the difference between data scientists, statisticians, and other occupations.

Authors

Jonathan Auerbach, David Kepplinger, and Nicholas Rios

Published

February 19, 2024

What is data science, and where did it come from? Is data science a new and exciting set of skills, necessary for analyzing 21st-century data? Or is it (as some have claimed) a rebranding of statistics, which has carefully developed time-honored methods for data analysis over the past century?

Priority disputes – disagreements over who deserves credit for a new scientific theory or method – date back to the beginning of science. Famous examples include the invention of calculus and ordinary least squares. But this latest dispute calls into question the novelty of an entire discipline.

In this article, we use two popular data science algorithms to examine the difference between data science, statistics, and other occupations. We find that in terms of the preparation required to become a data scientist, data science reflects both the work of natural sciences managers – individuals who oversee research operations in the natural sciences – and statisticians and mathematicians. This suggests that data science is a shared enterprise among science and math, and thus those trained in the natural sciences have as much claim to data science as those trained in mathematics and statistics.

In terms of the role a data scientist serves relative to other occupations, however, we find that data science is closest to statistics by far. Both occupations are fast growing and central among the occupations that work with data, suggesting a data scientist serves the same function as a statistician. But this function may be changing. While the centrality of statistics has declined over the past decade relative to other occupations, the centrality of data science has grown. In fact, data science has now surpassed statistics as the most central fast-growing occupation.

We examine the role of data science using data science

Everyone seems to agree that data science requires skills traditionally associated with a variety of different occupations. Drew Conway, for example, describes data science as a combination of math and statistics, substantive (domain) expertise, and “hacking” skills (see Figure 1). In dispute is the relative importance of those skills. Some have argued that data science is basically statistics – and that 20th-century statisticians like John Tukey have long possessed the data science skills traditionally associated with computer science and the natural sciences. Others have argued that data science is truly interdisciplinary, and that statistical thinking plays only a small role. But while opinions on data science abound, few appear to be based on data or science.1

Figure 1: Drew Conway describes data science as a combination of math and statistics, substantive (domain) expertise, and “hacking” skills. Conway’s data science Venn diagram, reproduced here, is licensed under Creative Commons Attribution-NonCommercial.

To that end, we use two popular data science algorithms, naïve Bayes and eigen centrality (eigen decomposition), to investigate the question: What is data science? Both algorithms use data listing the training a worker must generally complete to work in an occupation, such as data science. Specifically, we use the CIP SOC Crosswalk provided by the US Bureau of Labor Statistics and US National Center for Education Statistics, which links the Classification of Instructional Programs – the standard classification of educational fields of study into roughly 2,000 instructional programs – with the Standard Occupational Classification – the standard classification of professions into roughly 700 occupations.

Our main assumption is that the skills required to work in an occupation can be represented by the instructional programs that prepare students to work in that occupation. For example, the occupation “data scientists” is associated with 35 instructional programs, such as data science, statistics, artificial intelligence, computational science, mathematical biology, and econometrics. The occupation “statisticians” is associated with 26 instructional programs, including data science, statistics, and econometrics, but not artificial intelligence, computational science, or mathematical biology.

The algorithms we employ consider occupations to be similar if they have many instructional programs in common. Data scientists and statisticians share 14 programs, suggesting they are similar: about half (14 of 26) of the programs that prepare students to be a statistician also prepare students to be a data scientist. In contrast, data scientists and computer programmers share only six programs, suggesting they are less similar: computer programmers have 17 programs overall, so only about a third (6 of 17) of the programs that prepare students to be a computer programmer also prepare students to be a data scientist.2
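
The overlap arithmetic in this paragraph can be checked in a few lines of Python. This is only a sketch: the counts are the ones quoted in the text, and the helper name `share_of_programs` is our own, not part of the authors' published code.

```python
# Counts quoted in the text; the helper name is illustrative only.
def share_of_programs(shared: int, total: int) -> float:
    """Fraction of an occupation's programs that also prepare data scientists."""
    return shared / total


# Statisticians: 14 of their 26 programs are shared with data scientists.
statisticians = share_of_programs(shared=14, total=26)

# Computer programmers: 6 of their 17 programs are shared with data scientists.
programmers = share_of_programs(shared=6, total=17)

print(f"statisticians: {statisticians:.2f}")
print(f"computer programmers: {programmers:.2f}")
```

The two fractions come out at roughly one half and one third, matching the comparison above.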

Data and code to reproduce the analysis and figures are available through GitHub.

Data science is a shared enterprise among science and math

We use naïve Bayes to measure the similarity between each occupation and data science in terms of the preparation required to work in that occupation. Specifically, we first pretend that the occupation “data scientist” does not exist and then use Bayes’ rule to calculate the probability that a hypothetical group of workers with the 35 degrees associated with data science could have come from each of the roughly 700 other occupations. The higher the probability, the more consistent that occupation is with data science.

The use of Bayes’ rule is appealing because the similarity between a given occupation and data science takes into account the similarities between every other occupation and data science. Our use of Bayes’ rule is naïve in the sense that – before collecting the data – we assume these workers are equally likely to have come from any occupation.
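
To make the calculation concrete, here is a minimal Python sketch of a naïve Bayes posterior over occupations. It rests on several assumptions that are not taken from the article: the toy occupations and program lists are hypothetical (they are not CIP SOC Crosswalk entries), and the likelihood is a multinomial with add-one smoothing controlled by a parameter we call `alpha`. Only the uniform prior and the treatment of programs as independent come from the text; the authors' own implementation may differ.

```python
import math

# Hypothetical toy data (NOT the CIP SOC Crosswalk): occupation -> programs.
programs_by_occupation = {
    "statisticians": {"statistics", "data science", "econometrics"},
    "natural sciences managers": {"biology", "computational science", "statistics"},
    "computer programmers": {"computer science", "artificial intelligence"},
}

# Degrees held by the hypothetical group of workers.
observed = ["data science", "statistics", "econometrics", "computational science"]


def posterior(observed, programs_by_occupation, alpha=1.0):
    """Posterior probability of each occupation given the observed programs.

    Assumes a uniform prior and independent programs; the smoothed
    multinomial likelihood is an illustrative modelling choice.
    """
    vocab = set(observed).union(*programs_by_occupation.values())
    log_prior = -math.log(len(programs_by_occupation))  # uniform prior
    log_scores = {}
    for occupation, programs in programs_by_occupation.items():
        denom = len(programs) + alpha * len(vocab)
        log_lik = sum(
            math.log(((p in programs) + alpha) / denom) for p in observed
        )
        log_scores[occupation] = log_prior + log_lik
    # Normalize in log space so the posteriors sum to one.
    top = max(log_scores.values())
    weights = {o: math.exp(s - top) for o, s in log_scores.items()}
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}


post = posterior(observed, programs_by_occupation)
for occupation, prob in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{occupation}: {prob:.2f}")
```

Because the posterior is normalized over all occupations, the score for any one occupation depends on how well every other occupation explains the same degrees, which is the property described above.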

The occupations with the largest probabilities, and thus most related to data science, are summarized in Figure 2. We find that the hypothetical workers have a 50% chance of being natural sciences managers and a 50% chance of being statisticians or mathematicians.3 We conclude that data science is a shared enterprise among science and math, and thus those trained in natural sciences have as much claim to data science as those trained in mathematics and statistics.

Figure 2: We use naïve Bayes to measure the similarity between each occupation and data science in terms of the preparation required to work in that occupation. We find that in terms of the preparation required to become a data scientist, data science is a shared enterprise among science and math.

Data science is closest to statistics in its role among other occupations

We use eigen centrality (eigen decomposition) to measure the similarity of each occupation in terms of its role relative to other occupations. Specifically, we calculate the principal right singular vector of the adjacency matrix denoting whether an instructional program (row) is associated with an occupation (column).4 An occupation has high eigen centrality when the instructional programs that prepare a worker for that occupation also prepare that worker for many other occupations. This suggests that the higher the measure, the more central the role of the occupation relative to other occupations.
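
The computation can be sketched with NumPy on a small, hypothetical adjacency matrix. The programs and occupations below are invented for illustration and are not taken from the crosswalk. The sketch also checks that the principal right singular vector agrees with the principal eigenvector of the matrix of shared-program counts, the alternative formulation given in footnote 4.

```python
import numpy as np

# Hypothetical toy adjacency matrix: rows are instructional programs,
# columns are occupations; 1 means the program prepares for the occupation.
occupations = ["data scientists", "statisticians", "computer programmers", "chefs"]
A = np.array(
    [
        [1, 1, 0, 0],  # statistics
        [1, 1, 0, 0],  # data science
        [1, 0, 1, 0],  # artificial intelligence
        [0, 0, 1, 0],  # computer programming
        [0, 0, 0, 1],  # culinary arts
    ],
    dtype=float,
)

# Principal right singular vector: the first row of Vt, since singular
# values are returned in descending order. The sign is arbitrary, so
# take absolute values.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
centrality = np.abs(Vt[0])

# Equivalent route: principal eigenvector of A.T @ A, whose (i, j) entry
# counts the programs occupations i and j have in common.
_, eigvecs = np.linalg.eigh(A.T @ A)  # eigenvalues in ascending order
centrality_eig = np.abs(eigvecs[:, -1])

for name, score in sorted(zip(occupations, centrality), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

In this toy example the occupation that shares programs with the most other occupations ranks highest, while an occupation with no shared programs scores zero.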

The eigen centrality of each occupation is displayed in Figure 3. Each point represents an occupation, the x-axis denotes the centrality of the occupation, and the y-axis denotes the percentage growth of the occupation as predicted by the US Bureau of Labor Statistics over the next decade. The figure demonstrates that data scientists and statisticians occupy nearly identical positions: Both are fast growing and central to the other occupations that work with data. In contrast, natural sciences managers are central but growing much more slowly, suggesting a role closer to that of managers. We conclude that – though data scientists are prepared similarly to natural sciences managers – a data scientist serves the same function as a statistician.

Figure 3: We use eigen centrality (eigen decomposition) to measure the similarity of each occupation in terms of its role relative to other occupations. We find that in terms of the role a data scientist serves relative to other occupations, a data scientist functions like a statistician.

But this function may be changing. Figure 4 shows the centrality (x-axis) of each occupation (y-axis) in 2010 and 2020. Green bars denote increases from 2010 to 2020 while yellow bars denote decreases. We find that the centrality of statisticians has declined over the past decade relative to other occupations, while the centrality of data scientists has grown. In fact, data science has now surpassed statistics as the most central fast-growing occupation. We conclude that though a data scientist and a statistician serve similar roles today, those roles may change as the workforce changes. Note that the occupation classifications changed in 2018, and we used the crosswalk provided by the US Bureau of Labor Statistics to make these comparisons.

Figure 4: We use eigen centrality (eigen decomposition) to measure the similarity of each occupation in terms of its role relative to other occupations. We find that the centrality of statisticians has declined over the past decade relative to other occupations, while the centrality of data scientists has grown. Data science has now surpassed statistics as the most central fast-growing occupation. (Occupations predicted to grow more than 20% over the next decade shown.)

The findings in this section are based on the adjacency matrix that encodes whether an instructional program (row) is associated with an occupation (column). A more detailed summary of the matrix is provided in Figure 5, which depicts the matrix as a network graph. Larger nodes represent occupations that are growing faster, while nodes closer to the center of the network represent more central occupations. The figure is interactive. You can zoom in to see the similar positions between data scientists and statisticians, which are both large (fast growing) and central.

Figure 5: A visualization of occupations as a network: Occupations are placed according to the instructional programs that train students for that occupation, with occupations closer together sharing more instructional programs in common. We find data scientists and statisticians occupy nearly identical positions at the center of the network. Occupations are colored according to the primary classification of instructional programs that train students for that occupation. Larger nodes represent occupations that are growing faster.

Is data science statistics?

We conclude that individuals trained in managing natural sciences research – a slow-growing occupation – are turning to data science – a much faster-growing occupation, and one that currently serves a role like that of a statistician. But if present trends continue, data science is poised to eclipse the historic role of the statistician as central to the occupations that work with data.

This suggests that while data science may be new and exciting, the role served by the data scientist is not particularly new. This does not mean that data scientists necessarily use the same time-honored methods for data analysis as statisticians. It is the authors’ experience, however, that many data science tools are in fact statistical. Indeed, the two data science algorithms we used in this article are both taught to students as new and exciting, but in reality are centuries-old methods steeped in statistical history.

Regardless of whether data science is or is not statistics, the occupation “data scientist” has proven immensely popular, capturing a zeitgeist that has eluded statistics. This is best evidenced by the fact that data science – and not statistics – has been crowned the sexiest job of the 21st century. But if statistics has not enjoyed the popularity of data science, perhaps the real question in need of answering is: What is statistics?

About the authors
Jonathan Auerbach is an assistant professor in the Department of Statistics at George Mason University. His research covers a wide range of topics at the intersection of statistics and public policy. His interests include the analysis of longitudinal data, particularly for data science and causal inference, as well as urban analytics, open data, and the collection, evaluation, and communication of official statistics.
David Kepplinger is an assistant professor in the Department of Statistics at George Mason University. His research revolves around methods for robust and reliable estimation and inference in the presence of aberrant contamination in high-dimensional, complex data. He has active collaborations with researchers from the medical, biological, and life sciences.
Nicholas Rios is an assistant professor of statistics at George Mason University. He earned his PhD in statistics from Penn State University in 2022, where his dissertation focused on designing optimal mixture experiments. His primary research interests are experimental design and methods for intelligent data collection in the presence of real-world constraints. He is also interested in functional data analysis, computational statistics, compositional data analysis, and the analysis of high-dimensional data.
Copyright and licence
© 2023 Jonathan Auerbach, David Kepplinger, and Nicholas Rios

Text, code, and figures are licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence, except where otherwise noted. Thumbnail photo by Marc Sendra Martorell on Unsplash.

How to cite
Auerbach, Jonathan, David Kepplinger, and Nicholas Rios. 2023. “What is data science? A closer look at science’s latest priority dispute.” Real World Data Science, February 19, 2024. DOI

References

Donoho, David. 2017. “50 Years of Data Science.” Journal of Computational and Graphical Statistics 26 (4): 745–66.
Stigler, Stephen M. 1981. “Gauss and the Invention of Least Squares.” The Annals of Statistics 9 (3): 465–74.

Footnotes

  1. Descriptions of occupations by government agencies are not particularly helpful in differentiating between data science, statistics, and related occupations. For example, according to the Bureau of Labor Statistics, data scientists use “analytical tools and techniques to extract meaningful insights from data.” This description is similar to those of mathematicians and statisticians, who “analyze data and apply computational techniques to solve problems,” and operations research analysts, who use “mathematics and logic to help solve complex issues.”

  2. Our analysis treats all instructional programs as equal and independent. We do not consider, for example, the number of workers who hold a degree from an instructional program or whether two instructional programs are similar or offered by similar academic departments. Our analysis could be adjusted to account for this or related information, although it is unclear to the authors whether such an adjustment would make the results more accurate.

  3. Note that natural sciences managers share 18 instructional programs with data scientists, while statisticians share 14.

  4. Or alternatively, the principal eigenvector of the adjacency matrix denoting the number of instructional programs each occupation (row) has in common with each other occupation (column).