Data Science Resource List
Learning new things has become more accessible thanks to the plethora of material available online. This is particularly the case for Data Science and Machine Learning. Since I got interested in the field, I have come across a huge amount of learning material that I found immensely useful. This is an attempt to put it together and make it accessible to others.
There are many wonderful resources that professors have put up online, and this is an attempt to catalogue them. Prakhar has already done this on GitHub with a list suited to Software Engineering, so the list below gathers resources pertaining to Data Science, with more focus on the R software language. I plan to add more Python material going forward. Hope you find this list useful.
Content
| Data Science/Statistics Books | Cheatsheets | Courses |
|--- |--- |--- |
Data Science/Statistics Books
Statistics Books
-
Stats without Tears Stan Brown
-
Introduction to Probability and Statistics Using R G. Jay Kerns- Youngstown State University
-
Theory Meets Data Ani Adhikari- Univ. of California Berkeley
-
Introduction to Statistical Thinking (With R, Without Calculus) Benjamin Yakir, The Hebrew University of Jerusalem
-
Applied Statistics with R David Dalpiaz - University of Illinois- UC
-
R for Statistical Learning David Dalpiaz - University of Illinois- UC
-
R Companion to Statistics: Unlocking the Power of Data Book Lock, Lock, Lock, Lock, and Lock
-
R Companion to Introduction to Statistical Investigations (Preliminary Edition) Nathan Tintle et al.
-
Introduction to the Practice of Statistics (6th edition) in R Nicholas Horton and Ben Baumer
-
Introduction to Data Science This is an open source textbook aimed at introducing undergraduate students to Data Science
-
Stats: Data and Models (4th edition) in R De Veaux, Velleman, and Bock
-
ModernDive- An Introduction to Statistical and Data Sciences via R Chester Ismay and Albert Y. Kim - DataCamp and Amherst College
-
An R-companion for Statistics for Business: Decision Making and Analysis Robert A Stine- UPenn
-
Principles of Econometrics with R Constantin Colonescu
-
Introduction to Econometrics with R- using Stock and Watson Christoph Hanck, Martin Arnold, Alexander Gerber and Martin Schmelzer
-
Forecasting: Principles and Practice Rob J Hyndman and George Athanasopoulos - Monash University
-
Statistical Rethinking with brms, ggplot2, and the tidyverse A Solomon Kurz
-
Causal Inference Book- Draft Miguel Hernan and Jamie Robins - Harvard University
-
Computational and Inferential Thinking- (Python based) Ani Adhikari and John DeNero - UC- Berkeley
Machine Learning Books
-
An Introduction to Machine Learning with R Laurent Gatto
-
Introduction to Data Science Rafael A. Irizarry - Harvard University
-
Data Science Live Book Pablo Casas
-
R for Data Science Garrett Grolemund and Hadley Wickham - RStudio
-
Feature Engineering and Selection: A Practical Approach for Predictive Models Max Kuhn and Kjell Johnson- RStudio
-
Interpretable Machine Learning- A Guide for Making Black Box Models Explainable Christoph Molnar
-
From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science Prof. Norm Matloff- University of California, Davis
-
Technical Foundations of Informatics Michael Freeman and Joel Ross - University of Washington
-
Text Mining with R Julia Silge and David Robinson - StackOverflow
-
The Little Book of LDA Chris Tufts - StackOverflow
-
Deep Learning Book Series Hadrien J.- PhD Student
DataViz Books
-
Fundamentals of Data Visualization Claus O. Wilke
-
Data Visualization: A practical introduction Kieran Healy- Duke University
-
R for Social Sciences Data Carpentry
-
Visual Statistics Alexey Shipunov
R in Other Fields
-
Applied R for the quantitative social scientist Rense Nieuwenhuis
-
R and Social Science Michael Clark - Centre for Social Research
-
List of Books on CRAN Various - CRAN
-
Geocomputation with R Robin Lovelace, Jakub Nowosad, Jannes Muenchow
-
Sociospatial Data Science Christopher Prener, Ph.D.
-
Statistical Thinking for the 21st Century Russell A. Poldrack
-
Data Science for Startups Ben G Weber
R Tool Books
-
An Introduction to R W. N. Venables, D. M. Smith and the R Core Team
-
The R Inferno Patrick Burns
-
Advanced R Hadley Wickham - RStudio
-
Statistical Programming Methods with R James Balamuta- UIUC
-
Mastering Software Development in R Roger D. Peng, Sean Kross, and Brooke Anderson- Johns Hopkins University
-
Advanced R Course Florian Privé- Grenoble, France
-
Efficient R programming Colin Gillespie and Robin Lovelace- Newcastle University and Leeds Institute for Transport Studies
-
Pack YouR Code Gaston Sanchez- University of California Berkeley
-
Introduction to Open Data Science Ocean Health Index Team
Other R resources
-
R user group Oxford: Dedicated to bringing together area practitioners of R to exchange knowledge, inspire new users, and spur the adoption of R for innovative research and commercial applications.
-
Awesome Blogdown : Awesome curated list of blogs built using blogdown.
-
DALEX: Descriptive mAchine Learning EXplanations: In many applications we need to know, understand or prove how input variables are used in the model and what impact they have on the final model prediction. DALEX is a set of tools that help to understand how complex models work.
Cheatsheets
-
Probability cheatsheet Shervine Amidi
-
Statistics cheatsheet Shervine Amidi
-
Distribution Tables cheatsheet Shervine Amidi
-
Key Concepts Explained- Stats Shervine Amidi
-
Machine Learning tips and tricks cheatsheet Shervine Amidi
-
Deep Learning cheatsheet Shervine Amidi
-
ML cheatsheet Rémi Canard
-
Stats cheatsheet CSE 103
-
Data Science cheatsheet Maverick Lin
-
Super VIP ML cheatsheet Afshine Amidi and Shervine Amidi
Courses
Programming with R Software Carpentry Foundation
Courses taught by Hadley Wickham H. Wickham
Statistics courses offered in Harvard Harvard University
PROB 140 Probability for Data Science UC- Berkeley
-
Prob 140 (formally Statistics 140 or STAT 140) is a probability course for undergraduates who have taken Data 8, have a math background, and wish to go deeper into the theory of data science. The emphasis on simulation and the bootstrap in Data 8 gives students a concrete sense of randomness and sampling variability. Prob 140 will capitalize on this. Because of the students’ backgrounds, Prob 140 will move swiftly over basics, avoid approximations that are unnecessary when SciPy is at hand, and replace some of the routine calculus by symbolic math done in SymPy. This will create time to focus on the more demanding concepts that are part of the theoretical foundations of data science.
CS 109 Probability for Computer Scientists Stanford University
-
The class starts by providing a fundamental grounding in combinatorics, and then quickly moves into the basics of probability theory. We will then cover many essential concepts in probability theory, including particular probability distributions, properties of probabilities, and mathematical tools for analyzing probabilities. Finally, the last third of the class will focus on data analysis and Machine Learning as a means for seeing direct applications of probability in this exciting and quickly growing subfield of computer science.
DS 101 Data Science 101 Stanford University
-
The course provides a solid introduction to data science, both exposing students to computational tools they can proficiently use to analyze data and exploring the conceptual challenges of inferential reasoning. Each module/week represents a new “data adventure,” analyzing real datasets, exploring different questions and trying out tools.
CME/STATS 195 Introduction to R Stanford University
-
The goal of this short course is to familiarize students with R’s tools for scientific computing. Class lectures will have interactive elements, and assignments will be application-driven. Topics covered include basic data structures, file I/O, control structures, functions, visualizations, and packages for statistical analysis.
Stat 48N Riding the data wave Stanford University
-
How can we make sense of all the information we are acquiring about ourselves? During each week, we will consider a different data set to be summarized with a different goal. We will review analyses of similar problems carried out in the past and explore if and how the same tools can be useful today. We will pay attention to contemporary media (newspapers, blogs, etc.) to identify settings similar to the ones we are examining and critique the displays and summaries there documented
MS&E 226 Small Data Stanford University
-
This course is about understanding “small data”: these are datasets that allow interaction, visualization, exploration, and analysis on a local machine. The material provides an introduction to applied data analysis, with an emphasis on providing a conceptual framework for thinking about data from both statistical and machine learning perspectives. Topics will be drawn from the following list, depending on time constraints and class interest: approaches to data analysis: statistics (frequentist, Bayesian) and machine learning; binary classification; regression; bootstrapping; causal inference and experimental design; multiple hypothesis testing.
DS100 Principles and Techniques of Data Science UC- Berkeley
-
Combining data, computation, and inferential thinking, data science is redefining how people and organizations solve challenging problems and understand their world. This intermediate level class bridges between Data8 and upper division computer science and statistics courses as well as methods courses in other fields
Stats 200 Introduction to Statistical Inference Stanford University
-
The class will introduce the students to formal statistical reasoning. Building on knowledge of probability and calculus, we will explore how limited noisy observations can be used to learn general characteristics of a population. We will study the basics of decision theory, including frequentist and Bayesian solutions to the "paradox of induction."
INFO 201A Technical Foundations of Informatics University of Washington
-
This course introduces fundamental tools and technologies necessary to transform data into knowledge. We'll cover skills associated with each component of the information lifecycle, including the collection, storage, analysis, and visualization of data. Core competencies underlying this process, including functional programming, use of databases, data wrangling, version control, and command line proficiency, are acquired through real-world data-driven assignments.
STAT 405 Introduction to Data Analysis (using R, 2012) Rice University
-
This course will teach you to be a data analyst. You will learn how to take a large dataset, break it up into manageable pieces, and use a range of qualitative and quantitative tools to summarise it and learn what it has to tell. You will learn the importance of scepticism and curiosity, and how to communicate your findings. Each section of the course is motivated by a particular dataset, and you will gain experience working with a wide variety of data sources varying in size and quality.
STAT 385 Statistics Programming Methods UIUC
MY472 Data for Data Scientists LSE
-
This course will cover the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest
STAT 149 Generalized Linear Models Harvard University
-
An introduction to methods for analyzing categorical data. Emphasis will be on understanding models and applying them to datasets. Topics include visualizing categorical data, analysis of contingency tables, odds ratios, log-linear models, generalized linear models, logistic regression, Poisson regression and model diagnostics. Examples drawn from many fields, including biology, medicine and the social sciences.
DSO 530 Applied Modern Statistical Learning Techniques Univ. of Southern California
-
This course aims to go far beyond the classical statistical methods, such as linear regression, that are introduced in GSBA 524.
- The course follows ISLR and provides a succinct summary of the book in the slides.
STAT 320 Design and Analysis of Causal Studies Duke University
-
Presents an overview of methods for estimating causal effects: how to answer the question of “What is the effect of A on B?” Includes discussion of randomized designs, but with more emphasis on alternative designs and methods for when randomization is infeasible: matching methods, propensity scores, longitudinal treatments, regression discontinuity, instrumental variables, and principal stratification. Methods are motivated by examples from social sciences, policy and health sciences.
Statistics 585X Data Technologies for Statistical Analysis Iowa State University
-
Not all data lives in nice, clean spreadsheets, and not all data fits in a computer’s main memory. As statisticians we cannot always rely on other people and sciences to get the data into formats that we can deal with: we will discuss aspects of statistical computing as they are relevant for data analysis. Read and work with data in different formats: flat files, databases, web technologies. Elements of literate programming help us with making our workflow transparent and analyses reproducible. We will discuss communication of results in the form of R packages and interactive web applications.
STATS 202 Data Mining and Analysis (using R) Stanford University
-
Stats 202 is an introduction to Data Mining. Students will:
- Understand the distinction between supervised and unsupervised learning and be able to identify appropriate tools to answer different research questions.
- Become familiar with basic unsupervised procedures including clustering and principal components analysis.
- Become familiar with the following regression and classification algorithms: linear regression, ridge regression, the lasso, logistic regression, linear discriminant analysis, K-nearest neighbors, splines, generalized additive models, tree-based methods, and support vector machines.
- Gain a practical appreciation of the bias-variance tradeoff and apply model selection methods based on cross-validation and bootstrapping to a prediction challenge.
- Analyze a real dataset of moderate size using R.
- Develop the computational skills for data wrangling, collaboration, and reproducible research.
- Be exposed to other topics in machine learning, such as missing data, prediction using time series and relational data, non-linear dimensionality reduction techniques, web-based data visualizations, anomaly detection, and representation learning.
STATS 203 Introduction to Regression Models and Analysis of Variance Stanford University
-
The course is intended to be a (non-exhaustive) survey of regression techniques from both a theoretical and applied perspective.
6.S085 Statistics for Research Projects MIT
-
This class is a practical introduction to statistical modeling and experimental design, intended to provide essential skills for doing research. We'll cover basic techniques (e.g., hypothesis testing and regression models) for both traditional experiments and newer paradigms such as evaluating simulations. Students with research projects will be encouraged to share their experiences and project-specific questions.
Statistics 36-350 Statistical Computing: Spring 2018 Carnegie Mellon University
-
Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify, and write code, so that they can assemble the computational tools needed to solve their data analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to statistically-oriented programming, targeted at statistics majors, without assuming extensive programming background
Statistics 231 Statistical Learning Theory Stanford University
-
Uncover common statistical principles underlying a diverse array of machine learning techniques.
- Linear algebra
- Probability
- Optimization
Sta 323 Statistical Programming (2018) Duke University
STATS 401 Applied Statistical Methods II University of Michigan
-
An intermediate course in applied statistics, covering a range of topics in modeling and analysis of data including: review of simple linear regression, two-sample problems, one-way analysis of variance; multiple linear regression, diagnostics and model selection; two-way analysis of variance, multiple comparisons, and other selected topics
Stats 531 Analysis of Time Series University of Michigan
-
This course gives an introduction to time series analysis using time domain methods and frequency domain methods. The goal is to acquire the theoretical and computational skills required to investigate data collected as a time series. The first half of the course will develop classical time series methodology, including auto-regressive moving average (ARMA) models, regression with ARMA errors, and estimation of the spectral density.
AGRON 590RD Data Stewardship for Earth Systems Scientists Iowa State University
-
Learn how to clearly organize, track, and communicate data-based work, collect and house data through analysis and publication, collaborate in a reproducible way, model data structures and wrangle data, and complete the entire research cycle in a responsible way.
MPA 635 Data Visualization Brigham Young University
-
Become (1) literate in data and graphic design principles, (2) an ethical data communicator, and (3) a collaborative sharer by producing beautiful, powerful, and clear visualizations of your own data.
CME 252 Introduction to Optimization Stanford University
-
This course introduces mathematical optimization and modeling, with a focus on convex optimization. Topics include: varieties of mathematical optimization, convexity of functions and sets, convex optimization modeling with CVXPY, gradient descent and basic distributed optimization, in-depth examples from machine learning, statistics and other fields and applications of bi-convexity and non-convex gradient descent.
CSC 321 Intro to Neural Networks and Machine Learning University of Toronto
-
This course gives an overview of both the foundational ideas and the recent advances in neural net algorithms. Roughly the first 2/3 of the course focuses on supervised learning -- training the network to produce a specified behavior when one has lots of labeled examples of that behavior. The last 1/3 focuses on unsupervised learning and reinforcement learning.
EECS 349 Machine Learning- Spring 2018 Northwestern University
-
Lectures/Slides
- The lecture notes are of very good quality.
- Assignments
STAT 365/665 Data Mining and Machine Learning (uses R) Yale University
-
Note: The lecture notes and assignments of the course are of very good quality
TJ-ML TJHSST Machine Learning Thomas Jefferson High School
- TJHSST Machine Learning Club aims to bring the complex and vast topic of machine learning to high school students. We teach a variety of topics, including SVMs, Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, and more.
Note: A great initiative, and from high school students at that. @Mihir Patel
SIGIL Statistical Analysis of Corpus Data with R Potsdam University
CIS 419/519 Applied Machine Learning- Spring 2018 UPenn Engineering
This course will introduce some of the key machine learning methods that have proved valuable and successful in practical applications. We will discuss some of the foundational questions in machine learning in order to get a good understanding of the basic issues in this area, and present the main paradigms and techniques needed to obtain successful performance in application areas such as natural language and text understanding, speech recognition, computer vision, data mining, adaptive computer systems and others. The main body of the course will review several supervised and (semi/un)supervised learning approaches. These include methods for learning linear representations, decision-tree methods, Bayesian methods, kernel based methods and neural networks methods, as well as clustering, dimensionality reduction and reinforcement learning methods.