Readings are not mandatory, but they are highly recommended. Post your comments and questions using a “Note” in Piazza with the appropriate title (e.g., Readings Week 1: …). Always add the hashtag #readings. To avoid a proliferation of notes on the same topic, please use the “Followup discussions” feature in Piazza.

Books

You can read O’Reilly books for free with a Harvard login at this web site.

Python for Data Analysis, O’Reilly Media - “Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.”

Machine Learning for Hackers, O’Reilly Media - “If you’re an experienced programmer interested in crunching data, this book will get you started with machine learning—a toolkit of algorithms that enables computers to train themselves to automate useful tasks. Authors Drew Conway and John Myles White help you understand machine learning and statistics tools through a series of hands-on case studies, instead of a traditional math-heavy presentation.”

A translation of the R examples in Machine Learning for Hackers to Python can be found here: http://slendrmeans.wordpress.com/will-it-python/

Probabilistic Programming and Bayesian Methods for Hackers - “The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.”

Basic Data Science Motivation and Introduction

[1] BBC Documentary: The Age of Big Data (58 mins)

[2] Data Science Workflow: Overview and Challenges by Philip Guo

[3] Enterprise Data Analysis and Visualization: An Interview Study, Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer, IEEE Visual Analytics Science & Technology (VAST), 2012

[4] That’s Funny…, Howard Wainer and Shaun Lysen, American Scientist, 2009

Visualization

[1] matplotlib - 2D and 3D plotting in Python, J.R. Johansson, .ipynb

[2] A Gallery of Statistical Graphs in Matplotlib (Matplotlib Defaults), C. Beaumont, .ipynb

[3] A Gallery of Statistical Graphs in Matplotlib, C. Beaumont, .ipynb

[4] Wrangler: Interactive Visual Specification of Data Transformation Scripts, Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer, ACM Human Factors in Computing Systems (CHI), 2011 [Data Wrangler tool web site]

Storytelling and presentations

[1] Narrative Visualization: Telling Stories with Data, Edward Segel, Jeffrey Heer, IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2010

[2] When do stories work? Evidence and illustration in the social sciences, A. Gelman and T. Basboll, 2013

[3] Storytelling, M. Krzywinski & A. Cairo, Nature Methods, 2013 (Rebuttal by Y. Katz, Editorial, Response)

[4] Presentation Zen Tips, Garr Reynolds

[5] Tips for Giving Clear Talks, Kayvon Fatahalian

Data acquisition and cleanup

[1] Web Scraping Demo, C. Beaumont, .ipynb

[2] Data Wrangling Demo, C. Beaumont, .ipynb

PCA

[1] PCA Tutorial, J. Shlens, Princeton University

[2] Principal Components: Mathematics, Example, Interpretation, Cosma Shalizi, CMU

Machine Learning

[1] Chapter 1 of Machine Learning: A Probabilistic Perspective

[2] Cross Validation: The Right and Wrong Way