Data Science Resource List

Learning new things has become more accessible thanks to the plethora of material available online. This is particularly the case for Data Science and Machine Learning. Since I got interested in the field, I have come across a huge amount of learning material that I found immensely useful. This is an attempt to put it together and make it accessible to others.
Many professors have put wonderful resources online, and this is an attempt to catalogue them. Prakhar has done something similar on GitHub for Software Engineering, so the list below covers resources pertaining to Data Science, with a focus on the R language. I plan to add more Python material going forward. I hope you find this list useful.

Made by Vikesh. Say Hi!

Content

| Data Science/Statistics Books | Cheatsheets | Courses |
|--- |--- |--- |

Data Science/Statistics Books

Statistics Books

Stats without Tears Stan Brown
Introduction to Probability and Statistics Using R G. Jay Kerns- Youngstown State University
Theory Meets Data Ani Adhikari- Univ. of California Berkeley
Introduction to Statistical Thinking (With R, Without Calculus) Benjamin Yakir, The Hebrew University of Jerusalem
Applied Statistics with R David Dalpiaz - University of Illinois- UC
R for Statistical Learning David Dalpiaz - University of Illinois- UC
R Companion to Statistics: Unlocking the Power of Data Book Lock, Lock, Lock, Lock, and Lock
R Companion to Introduction to Statistical Investigations (Preliminary Edition) Nathan Tintle et al.
Introduction to the Practice of Statistics (6th edition) in R Nicholas Horton and Ben Baumer
Introduction to Data Science This is an open source textbook aimed at introducing undergraduate students to Data Science
Stats: Data and Models (4th edition) in R De Veaux, Velleman, and Bock
ModernDive- An Introduction to Statistical and Data Sciences via R Chester Ismay and Albert Y. Kim - DataCamp and Amherst College
An R-companion for Statistics for Business: Decision Making and Analysis Robert A Stine- UPenn
Principles of Econometrics with R Constantin Colonescu
Introduction to Econometrics with R- using Stock and Watson Christoph Hanck, Martin Arnold, Alexander Gerber and Martin Schmelzer
Forecasting: Principles and Practice Rob J Hyndman and George Athanasopoulos - Monash University
Statistical Rethinking with brms, ggplot2, and the tidyverse A Solomon Kurz
Causal Inference Book- Draft Miguel Hernan and Jamie Robins - Harvard University
Computational and Inferential Thinking- (Python based) Ani Adhikari and John DeNero - UC- Berkeley

Machine Learning Books

An Introduction to Machine Learning with R Laurent Gatto
Introduction to Data Science Rafael A. Irizarry - Harvard University
Data Science Live Book Pablo Casas
R for Data Science Garrett Grolemund and Hadley Wickham - RStudio
Feature Engineering and Selection: A Practical Approach for Predictive Models Max Kuhn and Kjell Johnson- RStudio
Interpretable Machine Learning- A Guide for Making Black Box Models Explainable Christoph Molnar
From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science Prof. Norm Matloff- University of California, Davis
Technical Foundations of Informatics Michael Freeman and Joel Ross - University of Washington
Text Mining with R Julia Silge and David Robinson - StackOverflow
The Little Book of LDA Chris Tufts - StackOverflow
Deep Learning Book Series Hadrien J.- PhD Student

DataViz Books

Fundamentals of Data Visualization Claus O. Wilke
Data Visualization: A practical introduction Kieran Healy- Duke University
R for Social Sciences Data Carpentry
Visual Statistics Alexey Shipunov

R in Other Fields

Applied R for the quantitative social scientist Rense Nieuwenhuis
R and Social Science Michael Clark - Centre for Social Research
List of Books on CRAN Various - CRAN
Geocomputation with R Robin Lovelace, Jakub Nowosad, Jannes Muenchow
Sociospatial Data Science Christopher Prener, Ph.D.
Statistical Thinking for the 21st Century Russell A. Poldrack
Data Science for Startups Ben G Weber

R Tool Books

An Introduction to R W. N. Venables, D. M. Smith and the R Core Team
The R Inferno Patrick Burns
Advanced R Hadley Wickham - RStudio
Statistical Programming Methods with R James Balamuta- UIUC
Mastering Software Development in R Roger D. Peng, Sean Kross, and Brooke Anderson- Johns Hopkins University
Advanced R Course Florian Privé- Grenoble, France
Efficient R programming Colin Gillespie and Robin Lovelace- Newcastle University and Leeds Institute for Transport Studies
Pack YouR Code Gaston Sanchez- University of California Berkeley
Introduction to Open Data Science Ocean Health Index Team

Other R resources

R user group Oxford: Dedicated to bringing together area practitioners of R to exchange knowledge, inspire new users, and spur the adoption of R for innovative research and commercial applications.
Awesome Blogdown: A curated list of blogs built using blogdown.
DALEX: Descriptive mAchine Learning EXplanations: In many applications we need to know, understand or prove how input variables are used in the model and what impact they have on the final model prediction. DALEX is a set of tools that help to understand how complex models work.
aRrgh: a newcomer’s (angry) guide to R

Cheatsheets

Probability cheatsheet Shervine Amidi
Statistics cheatsheet Shervine Amidi
Distribution Tables cheatsheet Shervine Amidi
Key Concepts Explained- Stats Shervine Amidi
Machine Learning tips and tricks cheatsheet Shervine Amidi
Deep Learning cheatsheet Shervine Amidi
ML cheatsheet Rémi Canard
Stats cheatsheet CSE 103
Data Science cheatsheet Maverick Lin
Super VIP ML cheatsheet Afshine Amidi and Shervine Amidi

Courses

R Studio Online Tutorials

Programming with R Software Carpentry Foundation

Courses taught by Hadley Wickham H. Wickham

Statistics courses offered in Harvard Harvard University

PROB 140 Probability for Data Science UC- Berkeley

Prob 140 (formerly Statistics 140 or STAT 140) is a probability course for undergraduates who have taken Data 8, have a math background, and wish to go deeper into the theory of data science. The emphasis on simulation and the bootstrap in Data 8 gives students a concrete sense of randomness and sampling variability. Prob 140 will capitalize on this. Because of the students’ backgrounds, Prob 140 will move swiftly over basics, avoid approximations that are unnecessary when SciPy is at hand, and replace some of the routine calculus by symbolic math done in SymPy. This will create time to focus on the more demanding concepts that are part of the theoretical foundations of data science.
Syllabus
Textbook
Lectures/Slides
Assignments

CS 109 Probability for Computer Scientists Stanford University

The class starts by providing a fundamental grounding in combinatorics, and then quickly moves into the basics of probability theory. We will then cover many essential concepts in probability theory, including particular probability distributions, properties of probabilities, and mathematical tools for analyzing probabilities. Finally, the last third of the class will focus on data analysis and Machine Learning as a means for seeing direct applications of probability in this exciting and quickly growing subfield of computer science.
Syllabus
Textbook
Lectures/Slides
Assignments

DS 101 Data Science 101 Stanford University

The course provides a solid introduction to data science, both exposing students to computational tools they can proficiently use to analyze data and exploring the conceptual challenges of inferential reasoning. Each module/week represents a new “data adventure,” analyzing real datasets, exploring different questions and trying out tools.
Syllabus
Lectures/Slides
Assignments

CME/STATS 195 Introduction to R Stanford University

The goal of this short course is to familiarize students with R’s tools for scientific computing. Class lectures will have interactive elements, and assignments will be application-driven. Topics covered include basic data structures, file I/O, control structures, functions, visualizations, and packages for statistical analysis.
Syllabus
Lectures/Slides
Assignments
Final Project

Stat 48N Riding the data wave Stanford University

How can we make sense of all the information we are acquiring about ourselves? During each week, we will consider a different data set to be summarized with a different goal. We will review analyses of similar problems carried out in the past and explore if and how the same tools can be useful today. We will pay attention to contemporary media (newspapers, blogs, etc.) to identify settings similar to the ones we are examining and critique the displays and summaries documented there.
Syllabus
Lectures/Slides
Assignments

MS&E 226 Small Data Stanford University

This course is about understanding “small data”: these are datasets that allow interaction, visualization, exploration, and analysis on a local machine. The material provides an introduction to applied data analysis, with an emphasis on providing a conceptual framework for thinking about data from both statistical and machine learning perspectives. Topics will be drawn from the following list, depending on time constraints and class interest: approaches to data analysis: statistics (frequentist, Bayesian) and machine learning; binary classification; regression; bootstrapping; causal inference and experimental design; multiple hypothesis testing.
Syllabus
Lectures/Slides
Datasets

DS100 Principles and Techniques of Data Science UC- Berkeley

Combining data, computation, and inferential thinking, data science is redefining how people and organizations solve challenging problems and understand their world. This intermediate level class bridges between Data8 and upper division computer science and statistics courses as well as methods courses in other fields.
Syllabus
Material
Assignments

Stats 200 Introduction to Statistical Inference Stanford University

The class will introduce the students to formal statistical reasoning. Building on knowledge of probability and calculus, we will explore how limited noisy observations can be used to learn general characteristics of a population. We will study the basics of decision theory, including frequentist and Bayesian solutions to the "paradox of induction."
Syllabus
Lectures/Slides
Assignments

INFO 201A Technical Foundations of Informatics University of Washington

This course introduces fundamental tools and technologies necessary to transform data into knowledge. We'll cover skills associated with each component of the information lifecycle, including the collection, storage, analysis, and visualization of data. Core competencies underlying this process, including functional programming, use of databases, data wrangling, version control, and command line proficiency, are acquired through real-world data-driven assignments.
Lectures/Slides
Assignments

STAT 405 Introduction to Data Analysis (using R, 2012) Rice University

This course will teach you to be a data analyst. You will learn how to take a large dataset, break it up into manageable pieces, and use a range of qualitative and quantitative tools to summarise it and learn what it has to tell. You will learn the importance of scepticism and curiosity, and how to communicate your findings. Each section of the course is motivated by a particular dataset, and you will gain experience working with a wide variety of data sources varying in size and quality.
Syllabus
Lectures/Slides
Assignments

STAT 385 Statistics Programming Methods UIUC

MY472 Data for Data Scientists LSE

This course will cover the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest.
Syllabus
Lectures/Slides
Assignments

STAT 149 Generalized Linear Models Harvard University

An introduction to methods for analyzing categorical data. Emphasis will be on understanding models and applying them to datasets. Topics include visualizing categorical data, analysis of contingency tables, odds ratios, log-linear models, generalized linear models, logistic regression, Poisson regression and model diagnostics. Examples drawn from many fields, including biology, medicine and the social sciences.
Syllabus
Lectures/Slides
Assignments

DSO 530 Applied Modern Statistical Learning Techniques Univ. of Southern California

This course aims to go far beyond the classical statistical methods, such as linear regression, that are introduced in GSBA 524.
Syllabus
Lectures/Slides
- The course follows ISLR and provides a succinct summary of the book in the slides
Assignments
Videos

STAT 320 Design and Analysis of Causal Studies Duke University

Presents an overview of methods for estimating causal effects: how to answer the question of “What is the effect of A on B?” Includes discussion of randomized designs, but with more emphasis on alternative designs and methods for when randomization is infeasible: matching methods, propensity scores, longitudinal treatments, regression discontinuity, instrumental variables, and principal stratification. Methods are motivated by examples from social sciences, policy and health sciences.
Syllabus
Lectures/Slides
Assignments
Webpage of Dr. Kari Lock Morgan for other course links

Statistics 585X Data Technologies for Statistical Analysis Iowa State University

Not all data lives in nice, clean spreadsheets, and not all data fits in a computer’s main memory. As statisticians we cannot always rely on other people and sciences to get the data into formats that we can deal with: we will discuss aspects of statistical computing as they are relevant for data analysis. Read and work with data in different formats: flat files, databases, web technologies. Elements of literate programming help us with making our workflow transparent and analyses reproducible. We will discuss communication of results in the form of R packages and interactive web applications.
Syllabus
Lectures/Slides
Assignments
Final Project

STATS 202 Data Mining and Analysis (using R) Stanford University

Stats 202 is an introduction to Data Mining. Students will:
- Understand the distinction between supervised and unsupervised learning and be able to identify appropriate tools to answer different research questions.
- Become familiar with basic unsupervised procedures including clustering and principal components analysis.
- Become familiar with the following regression and classification algorithms: linear regression, ridge regression, the lasso, logistic regression, linear discriminant analysis, K-nearest neighbors, splines, generalized additive models, tree-based methods, and support vector machines.
- Gain a practical appreciation of the bias-variance tradeoff and apply model selection methods based on cross-validation and bootstrapping to a prediction challenge.
- Analyze a real dataset of moderate size using R.
- Develop the computational skills for data wrangling, collaboration, and reproducible research.
- Be exposed to other topics in machine learning, such as missing data, prediction using time series and relational data, non-linear dimensionality reduction techniques, web-based data visualizations, anomaly detection, and representation learning.
Syllabus
Lectures/Slides
Assignments
Final Project- Kaggle

STATS 203 Introduction to Regression Models and Analysis of Variance Stanford University

The course is intended to be a (non-exhaustive) survey of regression techniques from both a theoretical and applied perspective.
Syllabus
Lectures/Slides
Assignments

6.S085 Statistics for Research Projects MIT

This class is a practical introduction to statistical modeling and experimental design, intended to provide essential skills for doing research. We'll cover basic techniques (e.g., hypothesis testing and regression models) for both traditional experiments and newer paradigms such as evaluating simulations. Students with research projects will be encouraged to share their experiences and project-specific questions.
Syllabus
Lectures/Slides
Assignments
Case Study

Statistics 36-350 Statistical Computing: Spring 2018 Carnegie Mellon University

Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify, and write code, so that they can assemble the computational tools needed to solve their data analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to statistically-oriented programming, targeted at statistics majors, without assuming extensive programming background.
Syllabus
Lectures/Slides
Assignments

Statistics 231 Statistical Learning Theory Stanford University

Uncover common statistical principles underlying a diverse array of machine learning techniques.
- Linear algebra
- Probability
- Optimization
Syllabus
Lectures/Slides
Assignments

Sta 323 Statistical Programming (2018) Duke University

STATS 401 Applied Statistical Methods II University of Michigan

An intermediate course in applied statistics, covering a range of topics in modeling and analysis of data including: review of simple linear regression, two-sample problems, one-way analysis of variance; multiple linear regression, diagnostics and model selection; two-way analysis of variance, multiple comparisons, and other selected topics.
Lectures/Slides
Assignments
Lab Material

Stats 531 Analysis of Time Series University of Michigan

This course gives an introduction to time series analysis using time domain methods and frequency domain methods. The goal is to acquire the theoretical and computational skills required to investigate data collected as a time series. The first half of the course will develop classical time series methodology, including auto-regressive moving average (ARMA) models, regression with ARMA errors, and estimation of the spectral density.
Lectures/Slides
Assignments
Projects

AGRON 590RD Data Stewardship for Earth Systems Scientists Iowa State University

Learn how to clearly organize, track, and communicate data-based work, collect and house data through analysis and publication, collaborate in a reproducible way, model data structures and wrangle data, and complete the entire research cycle in a responsible way.
Syllabus
Lectures/Slides
Assignments

MPA 635 Data Visualization Brigham Young University

Become (1) literate in data and graphic design principles, (2) an ethical data communicator, and (3) a collaborative sharer by producing beautiful, powerful, and clear visualizations of your own data.
Syllabus
Lectures/Slides
Assignments

CME 252 Introduction to Optimization Stanford University

This course introduces mathematical optimization and modeling, with a focus on convex optimization. Topics include: varieties of mathematical optimization, convexity of functions and sets, convex optimization modeling with CVXPY, gradient descent and basic distributed optimization, in-depth examples from machine learning, statistics and other fields and applications of bi-convexity and non-convex gradient descent.
Lectures/Slides
Assignments

CSC 321 Intro to Neural Networks and Machine Learning University of Toronto

This course gives an overview of both the foundational ideas and the recent advances in neural net algorithms. Roughly the first 2/3 of the course focuses on supervised learning -- training the network to produce a specified behavior when one has lots of labeled examples of that behavior. The last 1/3 focuses on unsupervised learning and reinforcement learning.
Lectures/Slides
Assignments

EECS 349 Machine Learning- Spring 2018 Northwestern University

Lectures/Slides
- The lecture notes are of very good quality.
Assignments

STAT 365/665 Data Mining and Machine Learning (uses R) Yale University

Note: The lecture notes and assignments of the course are of very good quality
Other course by Taylor Arnold
Syllabus
Lectures/Slides
- The lecture notes are of very good quality.
Assignments

TJ-ML TJHSST Machine Learning Thomas Jefferson High School

TJHSST Machine Learning Club aims to bring the complex and vast topic of machine learning to high school students. We teach a variety of topics, including SVMs, Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, and more.

Note: A great initiative, and from high school students at that. @Mihir Patel

SIGIL Statistical Analysis of Corpus Data with R Potsdam University

CIS 419/519 Applied Machine Learning- Spring 2018 UPenn Engineering

This course will introduce some of the key machine learning methods that have proved valuable and successful in practical applications. We will discuss some of the foundational questions in machine learning in order to get a good understanding of the basic issues in this area, and present the main paradigms and techniques needed to obtain successful performance in application areas such as natural language and text understanding, speech recognition, computer vision, data mining, adaptive computer systems and others. The main body of the course will review several supervised and (semi/un)supervised learning approaches. These include methods for learning linear representations, decision-tree methods, Bayesian methods, kernel based methods and neural networks methods, as well as clustering, dimensionality reduction and reinforcement learning methods.
