Metis Data Science - Week 1 (9/2 - 9/5)


Note: As of this post I am actually already in the 10th week of the Metis Data Science program. I’ve been writing down notes each week, but I haven’t put them anywhere until now. These posts will reflect my thoughts and opinions as I encountered new challenges each week.

First week! Unfortunately we didn’t get off to the greatest of starts: on the very first day we spent close to an hour getting everyone introduced and checked in with keycards. Getting keycards was the longest part, since the WeWork staff seemed disorganized and slow.

Probably the highlight of the first day was the “Hipster Game” we played. As a class, we took a set of images of guys and labeled each one as looking like a hipster or not. Then we split up into groups of three and, using that labeled training set, developed a list of features we could use to decide whether a person was a hipster. After a short break, we tested our set of features on a different test set of images. My group did okay, labeling a little over half of the test set correctly.

I actually enjoyed the exercise quite a bit. It felt silly at first, but it was a pretty good introduction to the concepts of classification and feature selection.
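For anyone who wants to see the idea in code, here is a minimal sketch of what the game boils down to: a hand-written rule built from a few features, scored on a held-out test set. Everything here is an assumption for illustration. The feature names (`beard`, `thick_glasses`, `flannel`), the voting rule, and the four test examples are invented, not the ones we used in class.

```python
# Toy illustration only: features, rule, and labels are invented,
# not the ones our class actually came up with.

FEATURES = ("beard", "thick_glasses", "flannel")


def looks_hipster(person, min_votes=2):
    """Hand-written rule: predict 'hipster' if enough features are present."""
    votes = sum(person[feature] for feature in FEATURES)
    return votes >= min_votes


def accuracy(examples, classify):
    """Fraction of (features, label) pairs that the rule labels correctly."""
    correct = sum(classify(features) == label for features, label in examples)
    return correct / len(examples)


# Held-out test set: (features, true label). The rule was not tuned on these.
test_set = [
    ({"beard": True, "thick_glasses": True, "flannel": False}, True),
    ({"beard": False, "thick_glasses": False, "flannel": False}, False),
    ({"beard": True, "thick_glasses": False, "flannel": False}, True),
    ({"beard": False, "thick_glasses": True, "flannel": True}, False),
]

print(f"test accuracy: {accuracy(test_set, looks_hipster):.2f}")
```

The point of keeping a separate test set is the same as in the game: a rule that fits the labeled training images well can still miss on images it has never seen.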

The rest of the week was a whirlwind tour of basic OS X terminal and Unix shell commands, Python syntax, Python file I/O, IPython Notebook, statistics (resampling, bootstrapping, etc.), and Git and GitHub.
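Of those topics, bootstrapping is the one that is easiest to show in a few lines. The sketch below is my own minimal version, not code from the course: it resamples a small data set with replacement and uses the percentiles of the resampled means as a rough 95% interval for the mean. The function name `bootstrap_mean_ci` and the sample values are made up for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        # Draw a resample of the same size, with replacement.
        resample = rng.choices(sample, k=len(sample))
        means.append(statistics.fmean(resample))
    means.sort()
    # Cut off alpha/2 of the resampled means in each tail.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented measurements, purely for demonstration.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
low, high = bootstrap_mean_ci(sample)
print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"approx. 95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

The appeal is that nothing here depends on a formula for the standard error: the spread of the resampled means stands in for it.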

The latter part of the week was devoted to project “benson”, where the whole class was essentially walked through a full data science project analyzing NYC MTA turnstile data. We worked together as a class through a series of IPython Notebook files that ran through the whole process of collecting, cleaning, analyzing, and finally visualizing the data. At the end of the week, we split into groups and held brainstorming sessions to come up with questions about NYC subway ridership that we could investigate using our MTA data set.

Random Observations in no particular order:

  • I’m totally new to OS X and Unix, and I’m already missing keyboard hotkeys I’m used to on Windows machines (like dedicated home/end buttons).
  • Basic organization of the bootcamp seems to be short lectures in the morning followed by additional topics or challenges for us to work on in the afternoon. This week we worked on selected statistics exercises from Think Stats.
  • I knew people loved IPython and IPython Notebook, but after using them for a few days I really get it now. They’re amazing tools for prototyping Python code.
  • They really emphasized the brainstorming process before beginning any data science project. Before even looking at any data, we should always have a question we are trying to answer. We should also be entertaining the wildest of ideas during initial brainstorming. It’s tempting to start editing your ideas based on what data is available or possible to obtain, but wild unfeasible ideas often can be built upon to produce really great project ideas. You really want to go for quantity here.

Note: Thinking back on this, project benson was a good introduction to a full data science project, from the initial data cleaning steps to the final analysis. I finished the first day of it with a somewhat negative view, since the whole day was spent passively listening to a lecture over an IPython Notebook, but this let us focus on the project brainstorming without getting lost in the weeds of cleaning MTA turnstile data.

Written on November 3, 2014