The first thing to know is your computing architecture. Most of you will be using your laptops, which will likely be running Windows or Mac OS. This is fine, although it is not quite as seamless as using a modern distribution of linux. Linux gives you much finer control of your machine, and I will occasionally show you something that is exclusively a linux thing. If you are using a laptop, then your architecture will be a 32- or 64-bit CPU with some number of cores, some number of GB of RAM, and some graphics processing unit (GPU). Maybe you're remotely accessing a virtual machine on a cloud computing system like Amazon's AWS, where you select the architecture that you want and they charge accordingly. Whenever possible I will try to demonstrate things on both a Windows and a linux machine (I typically use Ubuntu). If you are using Mac OS, then you can either try the linux command in the terminal (OS X was built from unix, so many things carry over) or google it. I am currently running Windows on a Microsoft Surface Pro, and if I open "system information" then I can see my computer's specs.
- OS Name: Microsoft Windows 10 Pro
- Version: 10.0.14393
- System Type: x64-based PC
- RAM: 16 GB
- Processor: Intel Core i5-6300U CPU @ 2.40 GHz, 2 Cores
- (Components->Display) Intel HD Graphics 520
Checkpoint 1: Look at the specs for your machine. If you don't know how to find them, then, as always, google it.
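Once python is running on your machine, part of this information can also be read from a script. The following is only a minimal sketch using the standard library; the function name machine_specs is made up for illustration, and the standard library does not report the amount of RAM or the graphics processor, so you still need the system tools for those.

```python
import multiprocessing
import platform


def machine_specs():
    """Collect a few basic facts about the machine running this script."""
    return {
        "os": platform.system(),          # e.g. "Windows", "Linux", "Darwin"
        "os_version": platform.version(),
        "machine": platform.machine(),    # e.g. "AMD64" or "x86_64"
        # Logical cores; this can be larger than the number of physical cores.
        "logical_cores": multiprocessing.cpu_count(),
    }


specs = machine_specs()
for name in sorted(specs):
    print("{}: {}".format(name, specs[name]))
```

The values printed depend entirely on your machine, so compare them against what your system information tool shows.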
What does this tell me? First, I'm using Windows 10, the sports blazer of operating systems (as opposed to the black turtleneck and the free, hole-ridden t-shirt). Second, my machine has a 64-bit CPU, which indicates, roughly, the size of the integers that the CPU works with (a bit is a 0 or 1, so 64 bits can represent 2^64 distinct integers). This is in contrast with a 32-bit architecture, which can only represent 2^32 distinct integers, roughly 4 billion. The CPU has registers, which are small chunks of memory, and it uses them to keep track of where it stores things in RAM. It does this by assigning a number (an address) to each byte of memory, so if an address can only take 4 billion values, then a process can only work with 4 gigabytes at a time. This may seem like a lot, but if you want to store a matrix that is 10^5 by 10^5, then you need 10^10 bytes even at one byte per entry, which is more than 4 GB (this happens more than you think!). Think about a dataset that records the number of times individual A likes a facebook post of individual B. Each day facebook generates 4.5 billion likes, and there are 1.7 billion users on facebook. Suppose we assign each user a 32-bit (4-byte) user ID, similar to a social security number, which is enough to tell 1.7 billion users apart. Then for each like, we can record the two user IDs, the first being the liker and the second being the likee. This comes out to 4.5 billion x 2 x 4 bytes = 36 GB per day, and this only stores who likes whom, not the content, time, length, media type, or comments of the post. On a 32-bit CPU, I cannot have a process that stores and remembers where all of the values are in RAM, so I will have to process the data in chunks.
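The arithmetic in this paragraph is easy to check in python. This is a rough sketch: the function names are made up for illustration, and a gigabyte is taken to be 10^9 bytes. Note that the size of the likes dataset depends on how many bytes each user ID takes. One byte per ID gives 9 GB per day but can only tell 256 users apart, while 1.7 billion users need at least 31 bits, which rounds up to 4 bytes per ID.

```python
GB = 1e9  # one gigabyte, using the decimal convention


def addressable_bytes(address_bits):
    """Number of distinct byte addresses that an address of this width can name."""
    return 2 ** address_bits


def matrix_bytes(rows, cols, bytes_per_entry=1):
    """Memory needed for a dense rows-by-cols matrix."""
    return rows * cols * bytes_per_entry


def likes_bytes(likes_per_day, id_bytes):
    """Each like stores two user IDs: the liker and the likee."""
    return likes_per_day * 2 * id_bytes


likes_per_day = 45 * 10 ** 8   # 4.5 billion
users = 17 * 10 ** 8           # 1.7 billion

# Smallest number of bits (and whole bytes) that gives every user a distinct ID.
id_bits = (users - 1).bit_length()
id_bytes = (id_bits + 7) // 8

print(addressable_bytes(32))                        # 4294967296
print(matrix_bytes(10 ** 5, 10 ** 5) / GB)          # 10.0
print(matrix_bytes(10 ** 5, 10 ** 5, 8) / GB)       # 80.0 with 8-byte floats
print("{} bits, {} bytes".format(id_bits, id_bytes))  # 31 bits, 4 bytes
print(likes_bytes(likes_per_day, 1) / GB)           # 9.0
print(likes_bytes(likes_per_day, id_bytes) / GB)    # 36.0
```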
Third, I have 16 GB of RAM, which means that I can theoretically store about 16 billion bytes (a byte is 8 bits) in RAM. We may talk about the real difference between the types of memory later in the course, but for now just know that it is much, much faster to access RAM than the hard drive, so be grateful that you have it. Fourth, I have two Intel processor cores clocked at 2.4 GHz, which tells me the number of processes that I can run in parallel (2) and the clock speed of each (2.4 GHz). And finally, I am using an Intel HD Graphics 520 graphics processor. A graphics processor (GPU) is like a mini-computer with an entirely different architecture than the rest of the machine. It is designed to do operations that are common in graphics processing, and so typically contains hundreds or thousands of small cores that share memory with each other. You can program these GPUs to run code that takes advantage of this parallel architecture, which is common practice in computer graphics and machine learning.
Tools of the trade: In this class, we will be using python to write scripts that process data. Python is well suited for data science because of the interactive shell via ipython, the interactive computing environment via jupyter, the flexibility and neatness of python as a programming language, and, most importantly, the huge number of open source packages. First of all, you will probably want to store code somewhere, which means having the ability to write files without
Text Editor: Files on your computer are just a bunch of bytes (strings of 0's and 1's) on your hard drive. A plain text editor reads these as characters via an encoding such as ASCII, which is a dictionary that converts bytes to characters (like how ribosomes convert RNA codons to amino acids). So you can start a new file using an editor like emacs, vi, or notepad, write something there like "Hello world.", and then save it as "hello.txt" or "hello" or whatever. The editor then writes the bytes that those characters correspond to on your hard drive. If you do the same thing in Word and save it as a Word file, then Word converts it to a different set of bytes, and this process is proprietary (it is a closely guarded secret, like the recipe for Coca-Cola). Never send me a Word document: when I open it in emacs it will look like \320\317^Q\340\241\261^Z\341^@^@^@^@^@^@^@^@ and you may lose points.
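To see the characters-to-bytes dictionary in action, here is a small sketch that encodes "Hello world." with ASCII, writes the bytes to a file, and reads them back. Writing to a temporary directory is just my choice to avoid cluttering your working directory; a text editor saving hello.txt produces the same bytes.

```python
import os
import tempfile

text = "Hello world."

# Encode the characters to bytes using the ASCII table.
data = text.encode("ascii")

# A bytearray lets us look at each byte as an integer from 0 to 255.
codes = list(bytearray(data))
print(codes[:5])                               # [72, 101, 108, 108, 111]
print([format(c, "08b") for c in codes[:2]])   # ['01001000', '01100101']

# Write the bytes to a file in a temporary directory, then read them back.
path = os.path.join(tempfile.mkdtemp(), "hello.txt")
with open(path, "wb") as f:
    f.write(data)
with open(path, "rb") as f:
    raw = f.read()

print(len(raw))               # 12 bytes, one per character
print(raw.decode("ascii"))    # Hello world.
```

The file is exactly 12 bytes long because ASCII uses one byte per character and a plain text file stores nothing else.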
The best way to write code is to select a text editor that you like and stick with it for all of your coding needs. Common choices are emacs, vim, sublime, notepad++, atom, etc. All of these have syntax highlighting, but you may need to do some work to enable it depending on your install. The most universal editors are vim and emacs, and they have their own hotkeys and interfaces. I use emacs, but that's because my mentor used emacs; I am using it to write this. Vim seems to have a cult following, but it feels willfully obtuse to me. I cannot help you with anything other than emacs, but in general I'll leave it up to you to figure out how to use your text editor.
Checkpoint 2: Install a text editor of your choice and learn to open a new file, copy and paste (kill and yank in emacs), search and replace strings, move the cursor around, save and close the file. (You may want to figure out how to do more advanced things.) Write a file called arch.txt and write your computer specs in it.
Python: You will need to have python installed, but first you should check if python is already on your machine. If you are using linux or Mac OS then open a terminal and type:
python -V. It will throw an error if python is not installed. On a server I get:
$ python -V
Python 3.5.2
So my python version is 3.5.2, which impacts the packages that are available to me. Python 2.7 is a more universal version, and I will be assuming that you will be using 2.7.
On this server, I have this installed as
If you don't have python installed, you should install python 2.7, and should also get a package manager, like pip.
On my Windows machine, I installed python 2.7.12 from this link, and there are also installers for Mac OS.
Look back at your specs file to see if you have a 32-bit or 64-bit machine, and select the appropriate installer.
I have a 64-bit Windows machine, so I selected Windows x86-64, and I also selected the option to add python to my PATH.
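Once the install has finished, you can confirm which version and which build (32-bit or 64-bit) you ended up with from inside python itself. This is a minimal sketch; keep in mind that it reports the python build, not the CPU, since a 32-bit python can run on a 64-bit machine.

```python
import struct
import sys

# Version of the interpreter running this script, e.g. (2, 7, 12).
version = tuple(sys.version_info[:3])

# Size of a pointer in bits: 32 for a 32-bit python build, 64 for a 64-bit one.
bits = struct.calcsize("P") * 8

print("python {}.{}.{} ({}-bit)".format(version[0], version[1], version[2], bits))
print("running from: {}".format(sys.executable))
```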
Checkpoint 3: Run python by typing python in the command prompt/terminal. If it is not recognized, then it's probably not in your PATH, which is an environment variable that lists the directories searched for executables, so that you can run an executable without typing out its location in your filesystem. Type
2+2 and hit Enter in the python shell that should've opened up.
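If you are curious how the PATH lookup works, the sketch below imitates it: it splits PATH into its directories and returns the first one that contains the program. The helper names path_dirs and find_on_path are made up for illustration, and the real lookup done by your shell has more rules than this (python 3 also ships a ready-made version of this as shutil.which).

```python
import os


def path_dirs():
    """Directories listed in the PATH environment variable, in search order."""
    return [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]


def find_on_path(program):
    """Return the first matching executable on PATH, or None if there is none."""
    # On Windows, executables usually carry an extension such as .exe.
    extensions = [""]
    if os.name == "nt":
        extensions += os.environ.get("PATHEXT", ".EXE").split(os.pathsep)
    for directory in path_dirs():
        for ext in extensions:
            candidate = os.path.join(directory, program + ext)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return None


for d in path_dirs():
    print(d)
print(find_on_path("python"))
```

If the last line prints None, then the directory containing python is not in your PATH, which matches the "not recognized" error described above.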
ipython: ipython is an interactive shell for python that has fancy things like tab completion, debugging, and magic commands.
You just used the python shell, and ipython is just a fancier version of this.
You should have pip installed somewhere on your machine and it may already be in your PATH.
On my Windows laptop, it's at C:\Python27\Scripts\pip, so in the Scripts directory I can run
pip install ipython
which installs ipython.
ipython as a calculator (pardon the poor quality, it is my first video)
Checkpoint 4: Run ipython from the command prompt/terminal. Type "pr" and hit TAB, then hit "i" and TAB again. Then finish the line and run
print("Hello World"). Then run the following:
%timeit 2+2
You've just seen what tab autocompletion does, and had an example of the magic command %timeit. I found out that it takes 14.6 nanoseconds to run 2+2.
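Magic commands only work inside ipython, but the standard library has a timeit module that does roughly the same job from a plain python script. The sketch below is an approximation of what %timeit reports, not its exact procedure. Your number will differ from mine, and since python can evaluate 2+2 once ahead of time, it is likely measuring mostly loop overhead.

```python
import timeit

number = 10 ** 6  # how many times to run the statement in each trial

# repeat() returns the total time in seconds for each trial; keep the fastest.
trials = timeit.repeat("2+2", repeat=5, number=number)
best_ns = min(trials) / number * 1e9

print("{:.1f} ns per loop (best of 5)".format(best_ns))
```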
Example 1, opencorporates: Let's start a running example throughout this module so that we can see some of the things that we hope to do. I am winging this, so don't be surprised if it doesn't really go anywhere. Here is a video of me playing around with opencorporates.com.
opencorporates.com API: Let's look at this site.
Remote Computing (Advanced): There is no cloud, it's just someone else's computer. Most of the time, I do my computing on a server so that I don't bog down my laptop with my scripts. The advantage is that a script that would otherwise tie up both of my laptop's cores runs on the server instead, so I can still run my web browser and text editor without loss of performance. I use ssh (putty) to access the server, then use screen to either run ipython and emacs in different attached screens or use emacs with the python-mode plugin (here is a nice config tutorial for emacs on linux). I copy from emacs with screen and paste into ipython. This way I can test and run snippets of code, then modularize the code and make it into a nice python script for reuse, all without ever touching my mouse. Remote computing is not required for this course, but I encourage you to try it by either getting access to a server from UCD, buying a cheap desktop that acts as a server with linux on it, or using Amazon's AWS service, which has a free tier. It will be a requirement in 141C, so you may want to get ahead of the curve.
Example 1, opencorporates: I'm going to walk you through a use of the opencorporates.com API. This will give you a look at something that we can do with python and the packages that we can use. The pie chart that I decided to create is just one arbitrary way to slice and view the data. I would encourage you to play with the code and ask your own questions. It may be easier after you've learned some python, though. You can download the source code.
Reading data from opencorporates.com: Here we use the python packages urllib2 and json to read the data and get a sense for its structure.
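To give a flavor of what reading this data looks like, here is a rough sketch and not the code from the walkthrough. The search URL, the nesting of the response under "results" and "companies", and the field names are my assumptions about the API, so check the opencorporates documentation (the API may also require a token). The sample payload is made up for illustration, so that the parsing step can run without network access.

```python
import json
from collections import Counter

try:
    from urllib2 import urlopen            # python 2
    from urllib import quote
except ImportError:
    from urllib.request import urlopen     # python 3
    from urllib.parse import quote

# ASSUMPTION: endpoint and query format; check the opencorporates API docs.
SEARCH_URL = "https://api.opencorporates.com/v0.4/companies/search?q={query}"


def fetch_companies(query):
    """Download and parse one page of search results (needs network access)."""
    response = urlopen(SEARCH_URL.format(query=quote(query)))
    return json.loads(response.read().decode("utf-8"))


# HYPOTHETICAL sample payload, made up to illustrate the nested structure.
sample = json.loads("""
{"results": {"companies": [
    {"company": {"name": "EXAMPLE BANK LTD", "jurisdiction_code": "gb"}},
    {"company": {"name": "EXAMPLE HOLDINGS INC", "jurisdiction_code": "us_de"}},
    {"company": {"name": "EXAMPLE TRADING LTD", "jurisdiction_code": "gb"}}
]}}
""")


def jurisdiction_counts(payload):
    """Count companies per jurisdiction, e.g. as the input for a pie chart."""
    companies = payload["results"]["companies"]
    return Counter(c["company"]["jurisdiction_code"] for c in companies)


counts = jurisdiction_counts(sample)
print(sorted(counts.items()))   # gb appears twice, us_de once
```

json.loads turns the text of the response into nested python dictionaries and lists, so getting a sense for the structure mostly means printing the keys at each level and drilling down.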
Jupyter notebook: ipython is great for all of the fancy features that it has, but it doesn't help you walk someone through your code. That's why they invented the ipython notebook, which was later turned into jupyter.
To install, run
pip install jupyter and then you can go to a directory with code that you would like to edit and run
jupyter notebook
Then your browser should open and you should see the file directory there.