What is Data Science?

What is Data?

David Dalpiaz

January 22, 2024

What is Data Science?

What Data Science is Not

  • “Big Data”
  • “Machine Learning”
  • “Just Statistics”
  • “New”

Dave’s DS Definition

The application of methods from statistics, computer science, and related fields to produce information and knowledge from data in order to solve domain-specific problems and make decisions.

Because it sits at the intersection of methods fields (statistics, CS, etc.) and domain application fields (biology, finance, insurance, sports, etc.), data science is necessarily interdisciplinary.

Wikipedia DS Definition

Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.

Snark DS Definition

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

American Statistical Association DS Definition

While there is not yet a consensus on what precisely constitutes data science, three professional communities, all within computer science and/or statistics, are emerging as foundational to data science: (i) Database Management enables transformation, conglomeration, and organization of data resources, (ii) Statistics and Machine Learning convert data into knowledge, and (iii) Distributed and Parallel Systems provide the computational infrastructure to carry out data analysis.

Is Data Science a Science?

Good question!

What is Data?

Dave’s Data Definition

Data is anything that can be observed, stored (most often digitally as numbers or characters), and recalled.

Here, “anything” could be, but is not limited to: measurements, artifacts, and proxies for the state of nature.
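
As a rough illustration of the "observed, stored, and recalled" definition above, the sketch below writes a few observations out as characters (CSV text) and reads them back. The observations, the column names `city` and `temp_c`, and the choice of CSV are all assumptions made for the example, not part of the definition.

```python
import csv
import io

# Observe: hypothetical measurements of (city, temperature in degrees Celsius).
observations = [("Urbana", -3.5), ("Chicago", -1.0)]

# Store: write the observations as characters, here in CSV form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["city", "temp_c"])
writer.writerows(observations)
stored_text = buffer.getvalue()

# Recall: read the stored characters back into Python values.
reader = csv.DictReader(io.StringIO(stored_text))
recalled = [(row["city"], float(row["temp_c"])) for row in reader]
```

The in-memory buffer stands in for a file or database; the same three steps apply whichever storage is used.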

Merriam-Webster Data Definition

  • factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
  • information in digital form that can be transmitted or processed
  • information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful

Cambridge Dictionary Data Definition

information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer

Wikipedia Data Definition

In common usage data is a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted formally.

What is Big Data?

Data, Information, Knowledge, and Wisdom

DIKW Pyramid

DIKW Pyramid Definitions

  • Data refers to raw, unprocessed facts and figures without context. It is the foundation for all subsequent layers but holds limited value in isolation.

  • Information is organized, structured, and contextualized data. Information is useful for answering basic questions like “who,” “what,” “where,” and “when.”

  • Knowledge is the result of analyzing and interpreting information to uncover patterns, trends, and relationships. It provides an understanding of “how” and “why” certain phenomena occur.

  • Wisdom is the ability to make well-informed decisions and take effective action based on an understanding of the underlying knowledge.

DataCamp: The Data-Information-Knowledge-Wisdom Pyramid
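
One way to picture the four layers defined above is a toy walk-through in code. This is only a sketch: the temperature values, the day labels, and the decision rule are invented for illustration and are not taken from the DataCamp source.

```python
from statistics import mean

# Data: raw numbers with no context (hypothetical values).
data = [61, 64, 70, 75, 79]

# Information: the same values with context attached
# (assumed here to be daily high temperatures in degrees Fahrenheit).
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
information = dict(zip(days, data))

# Knowledge: a pattern found by analyzing the information,
# here the average day-to-day change in temperature.
changes = [later - earlier for earlier, later in zip(data, data[1:])]
average_daily_change = mean(changes)

# Wisdom: a decision informed by that knowledge.
if average_daily_change > 0:
    decision = "pack light clothing"
else:
    decision = "pack a jacket"
```

Each step adds something the previous layer lacked: context, then a pattern, then an action.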

What Do Data Scientists Do?

Make Data Products

What Are Data Products?

Really, anything downstream of data itself that is useful in some way.

  • Summary Statistics
  • Visualizations
  • Statistical Models
  • Machine Learning Models

Good data products will help an end-user solve a problem or make a decision.
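
As a minimal sketch of the simplest product on the list above, summary statistics, the snippet below reduces a list of numbers to a handful of values using Python's standard `statistics` module. The function name `summarize` and the delivery-time data are hypothetical, chosen only for illustration.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Return a small dictionary of summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Hypothetical delivery times, in minutes.
delivery_minutes = [22, 25, 31, 28, 24, 45, 27]
summary = summarize(delivery_minutes)
```

Here the median falls below the mean because one long delivery pulls the mean up, which is the kind of detail an end-user might weigh when deciding what delivery time to promise.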

Data Product Examples

Other Roles

  • Data Engineer
  • Data Analyst
  • Analytics Engineer
  • Statistician
  • Machine Learning Engineer
  • Software Engineer

Domains Using Data Scientists

What fields are not using data?

What Are The Tools of Data Science?

How do data scientists apply statistical and computational methods?

  • Databases
  • Text Editor
  • Scripting Languages (Jupyter: Julia, Python, R)
  • Unix Shell
  • Version Control
  • “The Cloud”

CS 498 Topics

  • The Jupyter Languages
  • Array Programming
  • Data Frames and Tidy Data
  • EDA: Summary Statistics and Tables
  • EDA: Visualization
  • Grammar of Graphics
  • Storage Formats and IO
  • Code Packaging

More CS 498 Topics

  • Statistical Methods for Data Analysis
  • Experimentation
  • Classical Machine Learning
  • Modern Machine Learning
  • Reporting
  • Organization and Reproducibility
  • Orchestration
  • APIs

Even More CS 498 Topics

  • Data Generation and Collection
  • Data Validation
  • Data Dashboards
  • Reactive Programming
  • Communication
  • Ethics