The coming century is surely the century of data. A combination of blind faith and serious purpose makes our society invest massively in the collection and processing of data of all kinds, on scales unimaginable until recently.

Data analysis today is not an unsophisticated activity carried out by hand; it is much more ambitious, and … an intellectual force to be reckoned with.

David Donoho, High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality, 2000

James later reduced his complaint to a sentence: fielding statistics made sense only as numbers, not as language. Language, not numbers, is what interested him. Words, and the meaning they were designed to convey. “When the numbers acquire the significance of language,” he later wrote, “they acquire the power to do all of the things which language can do: to become fiction and drama and poetry.”

Michael Lewis, Moneyball, writing about Bill James, inventor of sabermetrics

Preface

Welcome to this book!

These are lecture notes for Data Science 701, Tools for Data Science, as taught by me at Boston University.

This course evolved from CS 506, which includes major contributions from Evimaria Terzi, George Kollios, and Lance Galletti. Errors are mine. (Please alert me to errors or, better yet, submit a pull request!)

Format

The notes are in the form of Jupyter notebooks. Demos and most figures are included as executable Python code. All course materials are in the course's GitHub repository.

Each chapter is based on a single Jupyter notebook, and each notebook forms the basis for one lecture (more or less).