Course description

Statistics is the science of extracting knowledge from data. Before the advent of computers, datasets were restricted to what could be measured and painstakingly recorded by hand (think of Mendel's genetics experiments). Now, the pipeline that turns observable phenomena into data has been dramatically broadened by the ubiquity of sensors and input devices, and by the internet, which transfers the extracted data across the world. The Data Sciences represent all of the facets of working with and extracting knowledge from data, such as data extraction, processing, curation, maintenance, representation, visualization, exploration, hypothesis testing, parameter estimation, and prediction.

In this course, we will focus on data extraction, processing, curation, maintenance, and visualization. Because these tasks rarely stand alone, and you are presumably acquiring data to make sense of it, we will revisit some basic statistical tools that you've learned in 141A and other statistics classes. Because statistical tools, data types, and computational frameworks are constantly shifting, we will emphasize the ability to parse and understand the foundations of data science, through the enumeration of principles, standards, and fundamentals.
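As a rough illustration of how these tasks chain together, here is a minimal sketch using only the Python standard library. The data, column names, and function names are all hypothetical and are not taken from the course materials; the point is only the shape of the workflow: extract, clean, then summarize.

```python
import csv
import io
import statistics

# Hypothetical raw data: a tiny CSV with one missing measurement.
RAW = """plant,height_cm
a,10.5
b,
c,12.0
d,11.5
"""


def extract(text):
    """Extraction: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Processing/curation: drop rows with a missing height, convert to float."""
    return [
        {"plant": r["plant"], "height_cm": float(r["height_cm"])}
        for r in rows
        if r["height_cm"].strip()
    ]


def summarize(rows):
    """A basic statistical summary of the cleaned measurements."""
    heights = [r["height_cm"] for r in rows]
    return {"n": len(heights), "mean": statistics.mean(heights)}


rows = clean(extract(RAW))
summary = summarize(rows)
print(summary)
```

Keeping each stage in its own small function makes it easy to swap the data source or the cleaning rule later without touching the rest of the pipeline.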

The Course: 2018

Recommended Reading

Final Project

The final project is 50% of the grade for the course, and it should demonstrate a large proportion of what you have learned from the course. If you do not display proficiency in the key technologies that we have gone over, then you will not receive a good grade. This includes

Some of the general rules and things to think about are...
  1. You can work individually or in groups of up to 4 people. Larger groups will be graded more harshly than smaller groups.
  2. As a group you should begin with a certain curiosity. For example, in my lecture 'What happened in Ohio?' I looked at the presidential election in OH; we then processed the data, visualized it, and asked specific questions.
  3. There is an in-class group presentation in the last two weeks of the course, and a final written portion due TBD (near or during finals week).
  4. Your presentation and final project will be factored into your grade separately. Because many of you will be presenting weeks before the deadline for the written portion, you are encouraged to continue working on your project after you give the presentation.
  5. As a group you will submit code and Jupyter notebooks in addition to a website for the project. You will receive a link on Slack to form a group repository toward the beginning of the course.

Syllabus

Date    Material (required reading, slides, notebooks)  Technology & Principles
M 1-8   Slides                                          Intro to Data Science
W 1-10  Required reading, slides                        Data science workflow, ipython, Jupyter
F 1-12  Required reading, scrabble notebook, slides     git, Versioning systems, python basics
W 1-17  Required reading, notebook                      Python basics, variable scope, modular code
F 1-19  Required reading, lab                           immutable/mutable, data structures, pythonic programming
M 1-22  Required reading, notebook                      Matrices, numpy, Vectorization
W 1-24  Required reading, notebook                      reading data, pandas, processing data in batches
F 1-26  Required reading, notebook                      matplotlib, Plotting sense
M 1-29  Required reading, notebook                      Data Wrangling, pandas, Missing data
W 1-31  Required reading                                Indexing, merging and transforming data
F 2-2   Notebook                                        A data munging and analysis example: OH 2016
M 2-5   Classification notes                            Machine Learning, evaluating performance, sklearn, classification
W 2-7   Reading from Unpingco                           Machine Learning, overfitting and model selection
F 2-9   Reading from Unpingco, notebook                 Machine Learning, logistic regression and SVMs
M 2-12  Reading, notebook                               nltk, From text to numbers, n-grams
W 2-14  Reading, notebook                               requests, APIs, JSON
F 2-16  Reading, notebook                               Database systems, sql, sqlalchemy
W 2-21  TA day I                                        Group work and TA presentations
F 2-23  TA day II                                       Group work and TA presentations
M 2-26  Reading, notebook                               HTML, DOM, Beautiful soup
W 2-28  Notebook, mpld3 demo                            Interactive visualization, mpld3
F 3-2   Reading, notebook                               D3, fixing the choropleth map
M 3-5   Group presentations
W 3-7   Group presentations
F 3-9   Group presentations
M 3-12  Group presentations
W 3-14  Group presentations
F 3-16  Group presentations