Probability and Statistics

Course Overview

This course introduces the core ideas of probability theory and statistical reasoning. We explore random variables, distributions, expectation, variance, independence, sampling, estimation, and hypothesis testing. These concepts form the backbone of data science, machine learning, and stochastic modeling.

As with all courses in this ecosystem, topics here will link to related material in other courses — such as linear algebra, differential equations, and functional analysis — so you can quickly revisit prerequisite ideas whenever needed.

Introduction to Probability

To understand probability, a good place to start is the basic terminology that goes along with it. We generally begin with a collection of facts called data. The data we use to address a question are drawn from some population, and it is important to identify what that population is for your particular problem. For instance, if I wish to examine how U.S. college students perform with various levels of studying, then my population is all U.S. college students. Notice that this population is not all individuals who live in the U.S., because many of them are not pertinent to the question I am examining. The members of a population need not be people: my population could be engines if I am trying to estimate failure rates.

When we examine all members of a population, it is called a census. Often it is not feasible to gather data on every member. For instance, if I want to examine the risk of heart attacks among all U.S. adults, it would be effectively impossible to find the data on every adult, both because of the sheer size of the population and because its membership is constantly changing. We rely instead on a representative sample. Samples are subsets of a particular population, and much care needs to be taken when drawing them so that we do not skew the conclusions we draw about the population.
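The difference between a census and a sample can be illustrated with a short Python sketch. Everything in it is made up for illustration: the population is 10,000 simulated "weekly study hours" values, and the helper name `draw_sample` is hypothetical. It draws a simple random sample, in which every member is equally likely to be chosen. That is one common way to aim for a representative sample, not the only one.

```python
import random
import statistics


def draw_sample(population, sample_size, seed=0):
    """Draw a simple random sample (without replacement) from a population."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return rng.sample(population, sample_size)


# Hypothetical population: simulated weekly study hours for 10,000 students.
population_rng = random.Random(42)
population = [population_rng.uniform(0, 40) for _ in range(10_000)]

# A census uses every member of the population.
census_mean = statistics.mean(population)

# A sample uses only a subset, here 200 members chosen at random.
sample = draw_sample(population, 200, seed=1)
sample_mean = statistics.mean(sample)

print(f"census mean: {census_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
```

With a sample drawn this way, the sample mean should land close to the census mean while using only 2% of the data. A sample chosen carelessly, for example only the students who study the most, would not.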

The characteristics of interest can generally take on different types of values. They may be categorical, such as whether a person has had a heart attack (yes or no, each of which is a category). They may take on discrete values in which the magnitude has meaning, such as the number of earthquakes of magnitude 5 or greater that a region has per year. Or they may take on continuous real values, such as the concentration of salt in a cell.
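As a rough illustration, the three kinds of values call for different summaries. The data below are invented for the example, and the variable names are assumptions rather than anything standard. Categories are counted, while discrete and continuous values can be averaged because their magnitudes carry meaning.

```python
from collections import Counter
import statistics

# Categorical: each observation falls into a category ("yes" or "no").
heart_attack = ["no", "yes", "no", "no", "yes"]

# Discrete: counts whose magnitude has meaning (earthquakes per year).
quakes_per_year = [2, 0, 1, 3, 1]

# Continuous: real-valued measurements (made-up salt concentrations).
salt_concentration = [0.151, 0.148, 0.153, 0.150, 0.149]

# Categories are summarized by counting how often each one occurs.
category_counts = Counter(heart_attack)

# Numeric values can be summarized by an average.
mean_quakes = statistics.mean(quakes_per_year)
mean_salt = statistics.mean(salt_concentration)

print(dict(category_counts))
print(mean_quakes)
print(round(mean_salt, 4))
```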

More to come!