Course description

Statistics is the science of extracting knowledge from data. Before the advent of computers, datasets were restricted to what could be measured and painstakingly recorded by hand (think of Mendel's genetics experiments). Today, the pipeline that turns observable phenomena into data has broadened dramatically, thanks to the ubiquity of sensors and input devices and to the internet, which carries the resulting data across the world. The data sciences encompass every facet of working with and extracting knowledge from data: extraction, processing, curation, maintenance, representation, visualization, exploration, hypothesis testing, parameter estimation, and prediction.

In this course, we will focus on data extraction, processing, curation, maintenance, and visualization. Because these tasks rarely stand alone, and because you are presumably acquiring data in order to make sense of it, we will revisit some basic statistical tools that you learned in 141A and other statistics classes. Because statistical tools, data types, and computational frameworks are constantly shifting, we will emphasize understanding the foundations of data science by laying out its principles, standards, and fundamentals.


1: Data science workflow | Text editors, ipython, Jupyter | Programming basics, data science presentation
2: Collaborating with git | git, command line | Versioning systems
3: Python basics | strings, numbers, lists, dictionaries, looping, def | Sequential programming, variable scope, modular code
4: More Python | built-in functions, dictionaries, iterables, pythonic programming | Object-oriented programming, data structures
5: Matrices in Python | numpy, array | Vectorization
6: Reading data | open, csv, pandas | Data latency, processing data in batches
7: Plotting basics | matplotlib | Plotting sense
8: Data wrangling | pandas, DataFrame, pandas plotting | Slicing, thinking like Pandas
9: Data wrangling 2 | indexing, groupby, join | Indexing, merging and transforming data
10: Web data | api, request, json | Internet basics, data acquisition
11: Learning from text | nltk, scikit-learn, tokenization | From text to numbers, n-grams
12: Processing web data | html, xml, beautifulsoup | Hypertext languages, structured data
13: Databases | sqlalchemy, query | Database systems
14: Geographical database | map, patches | GIS data, map projections
15: Data visualization | plotly, ggplot | Graphics grammar
16: Interactive data visualization | javascript basics, D3 | Server side/client side processing, embedded script