Statistics is the science of extracting knowledge from data. Before the advent of computers, datasets were restricted to what could be measured and painstakingly recorded by hand (think of Mendel's genetics experiments). Now, the pipeline for turning observable phenomena into data has been dramatically broadened by the ubiquity of sensors and input devices, and by the internet, which transfers the extracted data across the world. The Data Sciences encompass all facets of working with and extracting knowledge from data: extraction, processing, curation, maintenance, representation, visualization, exploration, hypothesis testing, parameter estimation, and prediction.
In this course, we will focus on data extraction, processing, curation, maintenance, and visualization. Because these tasks rarely stand alone, and you are presumably acquiring data in order to make sense of it, we will revisit some basic statistical tools that you learned in 141A and other statistics classes. Because statistical tools, data types, and computational frameworks are constantly shifting, we will emphasize the ability to parse and understand the foundations of data science through principles, standards, and fundamentals.
The Course: Fall 2018
- 1. All communication with your fellow students should happen on Canvas; to reach me, email firstname.lastname@example.org
- 2. There will be 6-8 assignments on Canvas.
- 3. You should do the assignments on your own, though you may get advice from others in the class. You should never copy code from someone else in the course on the homework.
- 4. Office hours are TBD
- 5. Be respectful: to your colleagues, your TAs, your Prof, and yourself!
Data Science: Principles and Python (in progress) by J. Sharpnack
Other Optional Reading
- Python for Probability, Statistics, and Machine Learning by Jose Unpingco
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd Edition) by Wes McKinney
The final project is worth 40%-50% of the course grade, and it should demonstrate a large proportion of what you have learned in the course. If you do not display proficiency in the key technologies we have covered, you will not receive a good grade. This includes:
- extracting data through web scraping, API calls, and parsing structured text or other unorthodox data types
- a thorough exploratory data analysis involving visualization, unsupervised learning, and descriptive statistics
- thoughtful data visualization with matplotlib, ggplot, seaborn, basemap, etc., including for high-dimensional data, network data, and geographic data
- appropriate use of statistical and machine learning tools, involving feature construction, transformations, and proper evaluation
- good communication of your findings and a good grasp of statistical thinking
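As a minimal, self-contained sketch of the first two bullets (parsing structured text, then computing descriptive statistics), consider the snippet below. The CSV contents, column names, and numbers are entirely made up for illustration; a real project would pull data from the web or an API.

```python
import csv
import io
import statistics

# Hypothetical structured text: county-level vote counts (made-up numbers)
raw = """county,votes_a,votes_b
Adams,1200,1100
Butler,900,1500
Clark,2000,1800
"""

# Parsing step: read the CSV into a list of dicts, one per county
rows = list(csv.DictReader(io.StringIO(raw)))

# Exploratory step: descriptive statistics on the margin in each county
margins = [int(r["votes_a"]) - int(r["votes_b"]) for r in rows]
print(len(rows))                   # number of counties parsed
print(statistics.mean(margins))    # average margin
print(statistics.median(margins))  # median margin
```

In practice you would reach for pandas rather than the standard-library `csv` module, but the shape of the pipeline (extract, parse, summarize) is the same.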
- 1. You can work individually or in a group of up to 4 people. Larger groups will be graded more harshly than smaller groups.
- 2. As a group, you should begin with a curiosity: for example, in my lecture 'What happened in Ohio?' I looked at the presidential election in OH. We then processed the data, visualized it, and asked specific questions.
- 3. There is an in-class group presentation in the last two weeks of the course, and a final written portion due TBD (near or during finals week).
- 4. Your presentation and written project will be factored into your grade separately. Because many of you will present weeks before the deadline for the written portion, you are encouraged to continue working on your project after you give the presentation.
- 5. As a group, you will submit code and Jupyter notebooks in addition to a website for the project. You will receive a link on Slack to form a group repository toward the beginning of the course.