Statistics is the science of extracting knowledge from data. Before the advent of computers, datasets were restricted to what could be measured and painstakingly recorded by hand (think of Mendel's genetics experiments). Now, the pipeline that turns observable phenomena into data has been dramatically broadened by the ubiquity of sensors and input devices, and by the internet, which transfers the extracted data across the world. The Data Sciences represent all of the facets of working with and extracting knowledge from data, such as data extraction, processing, curation, maintenance, representation, visualization, exploration, hypothesis testing, parameter estimation, and prediction.
In this course, we will focus on data extraction, processing, curation, maintenance, and visualization. Because these tasks rarely stand alone, and you are presumably acquiring data to make sense of it, we will revisit some basic statistical tools that you've learned in 141A and other statistics classes. Because statistical tools, data types, and computational frameworks are constantly shifting, we will emphasize the ability to parse and understand the foundations of data science through the enumeration of principles, standards, and fundamentals.
The Course: 2018
- 1. All communication should take place on Slack. You should have received an email invitation; email me if you have not.
- 2. There will be 6 assignments (you will get a link on Slack when they are assigned) and one final project. The homeworks combined are 50% of the grade, and the final project is the other 50%.
- 3. You should do the assignments on your own, but you can get advice from others in the class. You should never copy code from someone else in the course on the homeworks.
- 4. Scheduling questions (e.g. waitlist) should be directed to the undergrad program coordinator, Kim McMullen at email@example.com
- 5. Office hours are: Wed. and Thurs. 11am-12pm (Prof. James) and Thurs. 12-2pm (TA Gary) in Academic Surge 2142, starting on Thurs. 11 Jan.
- 6. Be respectful: to your colleagues, your TAs, your Prof, and yourself!
- Python for Probability, Statistics, and Machine Learning by Jose Unpingco
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd Edition) by Wes McKinney
The final project is 50% of the grade for the course, and it should demonstrate a large proportion of what you have learned from the course. If you do not display proficiency in the key technologies that we have gone over, then you will not receive a good grade. This includes
- extracting data through web scraping, API calls, and parsing structured text or other unorthodox data types
- a thorough exploratory data analysis involving visualization, unsupervised learning, and descriptive statistics
- thoughtful data visualization with matplotlib, ggplot, seaborn, basemap, etc., including for high-dimensional data, network data, and geodata
- appropriate use of statistical and machine learning tools, involving feature construction, transformations, and proper evaluation
- good communication of your findings and a good grasp of statistical thinking
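As a rough illustration of the first two items, here is a minimal sketch that parses structured text and computes descriptive statistics. It is not a template for the project: the payload, the field names `county` and `turnout`, and the helper functions `load_records` and `describe` are all hypothetical, and only the Python standard library is used so that it runs anywhere. A real project would more likely fetch the text from an API and work with pandas.

```python
import json
import statistics

# Hypothetical payload, standing in for the body of an API response.
RAW = """
[
  {"county": "A", "turnout": 0.61},
  {"county": "B", "turnout": 0.72},
  {"county": "C", "turnout": null},
  {"county": "D", "turnout": 0.58}
]
"""


def load_records(text):
    """Parse JSON text and drop records whose turnout value is missing."""
    records = json.loads(text)
    return [r for r in records if r["turnout"] is not None]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


records = load_records(RAW)
summary = describe([r["turnout"] for r in records])
print(summary)
```

Dropping the record with a missing value is only one possible choice; whatever you do with missing data, state it explicitly in your write-up.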
- 1. You can work individually or in groups of up to 4 people. Larger groups will be graded more harshly than smaller groups.
- 2. As a group, you should begin with a certain curiosity. For example, in my lecture 'What happened in Ohio?' I looked at the presidential election in OH; we then processed the data, visualized it, and asked specific questions.
- 3. There is an in-class group presentation in the last two weeks of the course, and a final written portion due TBD (near or during finals week).
- 4. Your presentation and final project will be separately factored into your grade. Because many of you will be presenting weeks before the deadline for the written portion, you are encouraged to continue working on your project after you give the presentation.
- 5. As a group you will submit code and Jupyter notebooks in addition to a website for the project. You will receive a link on Slack to form a group repository toward the beginning of the course.
|Date|Material (required reading, slides, notebooks)|Technology & Principles|
|---|---|---|
|M 1-8|Slides|Intro to Data Science|
|W 1-10|Required reading, slides|Data science workflow, ipython, Jupyter|
|F 1-12|Required reading, scrabble notebook, slides|git, Versioning systems, python basics|
|W 1-17|Required reading, notebook|Python basics, variable scope, modular code|
|F 1-19|Required reading, lab|immutable/mutable, data structures, pythonic programming|
|M 1-22|Required reading, notebook|Matrices, numpy, Vectorization|
|W 1-24|Required reading, notebook|reading data, pandas, processing data in batches|
|F 1-26|Required reading, notebook|matplotlib, Plotting sense|
|M 1-29|Required reading, notebook|Data Wrangling, pandas, Missing data|
|W 1-31|Required reading|Indexing, merging and transforming data|
|F 2-2|Notebook|A data munging and analysis example: OH 2016|
|M 2-5|Classification notes|Machine Learning, evaluating performance, sklearn, classification|
|W 2-7|Reading from Unpingco|Machine Learning, overfitting and model selection|
|F 2-9|Reading from Unpingco, notebook|Machine Learning, logistic regression and SVMs|
|M 2-12|Reading, notebook|nltk, From text to numbers, n-grams|
|W 2-14|Reading, notebook|requests, APIs, JSON|
|F 2-16|Reading, notebook|Database systems, sql, sqlalchemy|
|W 2-21|TA day I|Group work and TA presentations|
|F 2-23|TA day II|Group work and TA presentations|
|M 2-26|Reading, notebook|HTML, DOM, Beautiful soup|
|W 2-28|Notebook, mpld3 demo|Interactive visualization, mpld3|
|F 3-2|Reading, notebook|D3, fixing the choropleth map|
|M 3-5|Group presentations||
|W 3-7|Group presentations||
|F 3-9|Group presentations||
|M 3-12|Group presentations||
|W 3-14|Group presentations||
|F 3-16|Group presentations||