Metis Data Science - Week 2 (9/8 - 9/12)
Note: As of this post I am actually already in the 10th week of the Metis Data Science program. I’ve been writing down notes each week, but I haven’t put them anywhere until now. These posts will reflect my thoughts and opinions as I encountered new challenges each week.
Week 2 started with a github crash course. Version control, branching, etc. This was followed up with a crash course in best practices for Phythonic coding. We were encouraged to look up and import python packages instead of rewriting code. The rest of the week focused on learning our first set of tools: webscraping with BeautifulSoup4, data wrangling with Pandas, and building predictive models using linear regression in scikit-learn and statsmodels. All of this lead up to our group brainstorming session and developing Project Luther.
At this point I have settled on scraping movie data from the-numbers because it has a pretty simplistic html page layout with a lot of data conveniently in one page. I pulled domestic and worldwide box office revenue, DVD/Blu-ray sales, budget, rating, and rotten tomatoes scores from their list of movies released each year. My current idea is to use a simple linear regression trying to predict box office sales as a function of budget, scores, etc.
General thoughts:
- Pandas is amazing. Seriously. Hands down my favorite library for handling data.
- By now the basic daily structure for the bootcamp seems to have been established. We typically get a short lecture in the morning introducing a topic. Afternoons are reserved for working on challenges or working on our own personal project.
- A lot of people in the Metis data science program have been experiencing problems scraping other sites like imdb or boxofficemojo due to inconsistent html page layouts.