Statistics Fundamentals for Data Science: Part 1

Sachin Dev
3 min read · May 1, 2023

Statistics is a crucial part of data science: it provides the tools and techniques to analyze and interpret data. It is the science of collecting, organizing, and analyzing data, and a solid understanding of it is essential for making informed decisions based on data. Without statistical inference, data science would be incomplete: analysis would be limited to describing the data at hand with basic measures like the mean, median, and mode, with no principled way to generalize beyond it.

Data: Facts or pieces of information that can be recorded and analyzed.

Descriptive and Inferential Statistics

  • Descriptive Statistics: It is the branch of statistics that deals with organizing and summarizing data. It includes measures of central tendency (mean, median, and mode) and measures of variability such as variance and standard deviation. Together these give a compact summary of the data and help in understanding its characteristics. Histograms, pie charts, bar plots, and similar charts are used to present these summaries visually.
  • Inferential Statistics: It uses data collected from a sample to draw conclusions about the population the sample came from. It involves techniques such as hypothesis testing, confidence intervals, and regression analysis. It allows us to make generalizations about a population based on a sample and provides a framework for making informed decisions under uncertainty.
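To make the descriptive measures above concrete, here is a minimal sketch using Python's built-in statistics module. The scores list is made-up data for illustration only, and it is treated as a sample, so the sample variance (dividing by n - 1) is used.

```python
import statistics

# Hypothetical exam scores, invented purely for illustration
scores = [72, 85, 85, 90, 64, 78, 85, 92]

# Measures of central tendency
mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

# Measures of variability (sample versions, which divide by n - 1)
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)

print(f"mean={mean_score}, median={median_score}, mode={mode_score}")
print(f"variance={variance:.2f}, std dev={std_dev:.2f}")
```

If the list held an entire population rather than a sample, statistics.pvariance and statistics.pstdev would be the matching functions.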

Population (N) vs Sample (n)

The main difference between a population and a sample is how much of the group under study is observed: a population contains every element of interest, while a sample contains only a subset of them.

1: A Population includes all the elements of the group being studied. Its size is denoted by N.
E.g. all the students in a school.

2: A Sample consists of one or more observations drawn from a population. Its size is denoted by n.
E.g. the students in one particular class of that school.
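The relationship between a population and a sample can be sketched in a few lines of Python. The heights below are simulated, not real measurements, and the sizes N = 1000 and n = 50 are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: simulated heights (cm) of all students in a school
population = [random.gauss(160, 10) for _ in range(1000)]

# Sample: a subset of students drawn from that population without replacement
sample = random.sample(population, 50)

N = len(population)  # population size
n = len(sample)      # sample size

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"N={N}, population mean={population_mean:.2f}")
print(f"n={n}, sample mean={sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean. Quantifying that gap is the job of inferential statistics.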

Sampling Techniques

Sampling techniques are methods used to select a subset of individuals from a population for a study. There are several types of sampling techniques, including:

  • Simple Random Sampling: Every member of the population (N) has an equal chance of being selected for the sample (n). This technique is useful when the population is fairly homogeneous and there are no specific criteria for selection.
  • Stratified Sampling: The population is divided into smaller non-overlapping groups, called strata, and a sample is then drawn from each stratum. This technique is useful when the population is heterogeneous and contains distinct subgroups.
    Example: Strata based on gender (male or female) or on education level (high school, Masters, Ph.D., etc.).
  • Systematic Sampling: It is a probability sampling method where researchers select members of the population at regular intervals, usually after choosing a random starting point. This technique is useful when the population is large and ordered, and drawing a fully random sample is not practical.
    Example: In a population of 10,000 people, a statistician selects every 100th person, giving a sample of 100.
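  • Convenience Sampling: The sample is made up of whoever is easiest to reach, for example only the people who are interested in a survey and choose to take part. It is a non-probability method, so the results can be biased and may not generalize to the population. It is mainly useful for quick, exploratory studies where the research question is not too complex.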
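The three probability-based techniques above can be sketched with Python's random module. The population, its education field, and the helper function names are all hypothetical and exist only for this illustration. Convenience sampling is left out because it depends on who happens to be available rather than on a selection rule.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed for reproducibility

# Hypothetical population of 10,000 people, each with an education level
levels = ["High School", "Masters", "Ph.D."]
population = [{"id": i, "education": levels[i % 3]} for i in range(10_000)]

def simple_random_sample(pop, n):
    """Every member has an equal chance of being selected."""
    return random.sample(pop, n)

def stratified_sample(pop, key, fraction):
    """Split the population into strata by `key`, then sample each stratum."""
    strata = defaultdict(list)
    for person in pop:
        strata[person[key]].append(person)
    chosen = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation
        chosen.extend(random.sample(members, k))
    return chosen

def systematic_sample(pop, step):
    """Pick a random start, then take every `step`-th member."""
    start = random.randrange(step)
    return pop[start::step]

srs = simple_random_sample(population, 100)
strat = stratified_sample(population, "education", 0.01)
syst = systematic_sample(population, 100)

print(len(srs), len(strat), len(syst))
```

Proportional allocation is only one way to size the strata. Sampling an equal number from each stratum is another common choice.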

Variable and Types of Variables

A variable is a property that can take different values from one observation to another. In statistics, there are two main types of variables:

  • Quantitative Variables: Variables that are measured numerically and on which mathematical operations are meaningful, like age, height, distance, etc. Quantitative variables are further divided into:
    1: Discrete Variables: These take countable values, usually whole numbers, and the same value can occur for many different data points. Example: Number of bank accounts, number of children in a family, etc.
    2: Continuous Variables: These can take on any value within a range or interval, including decimals and fractions. Example: Height of students in a class, weight, speed, etc.
  • Qualitative Variables: Variables that describe a category or characteristic rather than a numeric quantity, like gender, type of flower, movie genre, etc. They are also called categorical variables.
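As a rough illustration of why the variable type matters, the sketch below labels each column of a small made-up dataset and then applies a summary that suits its type. The records and the variable_type helper are hypothetical, and guessing the type from the Python data type is only a heuristic: a zip code stored as an integer, for instance, is still qualitative.

```python
import statistics
from collections import Counter

# Hypothetical student records, invented for illustration
students = [
    {"siblings": 2, "height_cm": 152.4, "favourite_genre": "Comedy"},
    {"siblings": 0, "height_cm": 160.1, "favourite_genre": "Action"},
    {"siblings": 1, "height_cm": 148.9, "favourite_genre": "Comedy"},
    {"siblings": 2, "height_cm": 171.3, "favourite_genre": "Drama"},
]

def variable_type(values):
    """Heuristic guess at the variable type based on the Python types."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "quantitative (discrete)"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "quantitative (continuous)"
    return "qualitative"

# Collect the values of each variable into its own column
columns = {name: [row[name] for row in students] for name in students[0]}
types = {name: variable_type(values) for name, values in columns.items()}

# Quantitative variables support arithmetic, such as a mean
mean_siblings = statistics.mean(columns["siblings"])
mean_height = statistics.mean(columns["height_cm"])

# Qualitative variables are summarized by counting categories
genre_counts = Counter(columns["favourite_genre"])

print(types)
print(mean_siblings, round(mean_height, 2), genre_counts.most_common(1))
```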

Conclusion

In this article, we discussed the basics of statistics in data science, including the difference between a population and a sample, different sampling techniques, and the types of variables. In future articles, I will try to cover more topics related to statistics in data science.

Thanks for reading this article! Leave a comment below if you have any questions. You can follow me on LinkedIn and GitHub.
