Data science development environment: pip, jupyter, virtualenv, pycharm, and AWS
I have pinned down my data science/machine learning workflow and wanted to share it. It is based on using pip, virtualenv, and pycharm, and is good for simultaneously writing python packages and doing data analyses in jupyter. My preferred infrastructure solution is AWS. Disclaimer: I work for Amazon AWS.
EC2 setup
I consolidate my EC2 setup using ~/.ssh/config, which greatly reduces my setup time. First, I fire up an EC2 instance, using an AMI such as one of the Deep Learning community images. I generate or reuse a key-pair pem file and add it to my ~/.ssh folder on my local machine. I copy the instance's public DNS name, which looks like ec2-[IP].compute-1.amazonaws.com (the exact form may differ by region). Then I add this to my ~/.ssh/config:
Host EC2dev
    HostName ec2-[IP].compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/[MY-PEM-FILE].pem
Now my EC2 server has the alias EC2dev, which I use for the rest of my setup. Every time I fire up a new instance, or stop and start this one, I only have to update the HostName line.
Python packages and virtualenv on server
Now I am ready to set up my python package, which I usually host on GitHub. For example, if I am working on AutoGluon then I clone the repo on both my server and my local machine. I use SSH keys for authentication, so I need to add the RSA public keys for both machines to my GitHub settings.
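If either machine does not have a key pair yet, a minimal sketch for creating one and registering it (the key path below is the default; adjust if you use a different file):
$ ssh-keygen -t rsa -b 4096
$ cat ~/.ssh/id_rsa.pub
Paste the printed key into GitHub under Settings > SSH and GPG keys, then confirm authentication with
$ ssh -T git@github.com
Once both machines are registered, the clone goes through: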
$ cd ~
$ git clone git@github.com:awslabs/autogluon.git
Alternatively, I can just scp the repo over from my local machine, which is effectively what we will be doing with pycharm deployment anyway. On my EC2 server, I create and activate a virtualenv:
$ python3 -m pip install --upgrade pip
$ python3 -m pip install virtualenv
$ python3 -m virtualenv env
$ source env/bin/activate
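A quick sanity check that the environment is active (this also happens to be the interpreter path PyCharm will need later):
$ which python
/home/ubuntu/env/bin/python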
This way I can install my development packages in a contained environment. I go into the package that I am writing (e.g. cd autogluon/shift) and install it in editable mode with pip install -e .; the -e flag means the install picks up my changes without reinstalling.
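Concretely, assuming the subpackage layout from the example above (autogluon/shift is just the package I happen to be working on):
$ cd ~/autogluon/shift
$ pip install -e .
Any edits I now make under ~/autogluon/shift are picked up the next time the package is imported.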
PyCharm
You should start by installing PyCharm; note that the remote interpreter and deployment features used here are part of the Professional edition. To set up PyCharm to work with your remote installation you have to configure two things:
- Remote interpreter
- Deployment
To set up the remote interpreter, go to File > Preferences > Python Interpreter. In the settings, add a new interpreter and select SSH Interpreter. In Host put EC2dev and in Username put ubuntu. Then point the interpreter at the python inside your virtualenv, i.e. /home/ubuntu/env/bin/python. I will also deploy my project to /home/ubuntu/autogluon, so I add that mapping to the Sync folders.
To set up deployment, I go to Tools > Deployment > Configuration and either find the SSH connection or add it. I make sure the mapping from local path to deployment path matches my project folder on both sides: ~/autogluon -> /home/ubuntu/autogluon. You can test the connection from the same dialog.
With this set up you should be able to debug with breakpoints, run unit tests, and deploy your edits. If you turn on automatic upload then uploads happen for you, but you can also manually upload individual files.
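Even without PyCharm in the loop, the same check can be run by hand through the alias; this is a sketch and assumes the package's tests run under pytest, which may not match your repo:
$ ssh EC2dev "source ~/env/bin/activate && cd ~/autogluon && python -m pytest"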
Jupyter
First you need port forwarding set up on your local machine so that the notebook server on the instance is reachable at localhost. In your ~/.ssh/config file, under the Host EC2dev entry, add
    LocalForward 8888 localhost:8888
which forwards local port 8888 to port 8888 on the instance whenever you connect.
On the remote instance you can launch jupyter and then access it through the forwarded port. To do this, start a screen or tmux session on the remote server, then activate the environment, install jupyter, and run it.
$ ssh EC2dev
$ tmux
$ source env/bin/activate
$ pip install jupyter
Then I have to add the virtual environment to jupyter:
$ pip install ipykernel
$ python -m ipykernel install --user --name=env
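To confirm the kernel was registered (env should show up alongside the default python3 kernel):
$ jupyter kernelspec list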
Then fire up jupyter:
$ jupyter notebook
The jupyter notebook server should start and you should see a link with a token such as
http://localhost:8888/?token=...
If you skipped the LocalForward entry, you can turn on ssh forwarding manually on your local machine with
$ ssh -N -f -L localhost:8888:localhost:8888 EC2dev
Then copy and paste the link into your browser, and you should see your jupyter instance.
Other AWS services
There are a host of AWS services, but the ones that I have used the most are the following.
- SageMaker: this is especially useful for its python SDK with pyspark support, pretrained models, etc.
- Athena: this is for making serverless SQL queries. I often find that Athena is faster than using pyspark on a SageMaker instance (depending on the query). One common use is to generate a subsample of entries from a database and then load it in jupyter on my EC2 instance; then I can do quick data analyses using pandas, scikit-learn, etc. A sketch of this flow follows the list.
- S3: put your data here.
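As a sketch of the Athena-to-notebook flow mentioned above, using the AWS CLI (the database, table, and bucket names are placeholders; query results land in the output location as a CSV named after the query execution id):
$ aws athena start-query-execution \
    --query-string "SELECT * FROM my_table LIMIT 100000" \
    --query-execution-context Database=my_database \
    --result-configuration OutputLocation=s3://my-bucket/athena-results/
$ aws athena get-query-execution --query-execution-id [QUERY-ID]
$ aws s3 cp s3://my-bucket/athena-results/[QUERY-ID].csv ~/sample.csv
The CSV can then be read with pandas inside jupyter.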