Roobyz Ramblings


A responsive site for sharing thoughts and lessons learned on data science, programming, and technology.


Titanic: Learning Data Science with RStudio

So we are aspiring data scientists and want to dip our toes into RStudio. How do we get started? We dive into the into the waters of the Kaggle Titanic "Competition", of course!

Our Objective

Use this Kaggle exercise to:

  • learn to reason and problem solve like a data scientist

  • get somewhat comfortable with RStudio

  • predict whether a passenger would survive the sinking of the Titanic

  • enter a Kaggle submission file for evaluation

  • have fun!

Kaggle Basics

Kaggle is a community of data scientists and a platform for facilitating data science journeys. One way to participate, is by entering data science competitions. Similar to other competitions, Kaggle provides two Titanic datasets containing passenger attributes:

  • a training set, complete with the outcome (target) variable for training our predictive model(s)

  • a test set, for predicting the unknown outcome variable based on the passenger attributes provided in both datasets.

After training and validating our predictive model(s), we can then enter the submission file to Kaggle for evaluation. As we iterate, we can submit more files and assess our progress on the leaderboard. Subtle model improvements can lead to significant leaps on the leaderboard.

lightbulb.png

About That: Predictive models are trained using attributes (variables), right? How does that work?

Some attributes are correlated: as they vary, to some degree other attributes may also vary. Machine learning leverages that interdependence to model the predicted outcomes. For accurate model performance, we need to:

  • maximize the number of explanatory variables: those that are correlated with the outcome variable, and

  • compensate for the correlation of explanatory variables to each other (multicollinearity).

In other words, we need to find the fewest quantity of variables that can explain almost everything that is going on with the outcome that we want to predict.

Titanic History Lesson

The Titanic was a British passenger liner that sank after colliding with an iceberg in the Atlantic on its maiden voyage en route to New York City. It was the largest ship of its time with 10 decks, 8 of which were for passengers.

There were 2,224 passengers and crew aboard. Of the 1,317 passengers, there were: 324 in First Class (including some of the wealthiest people of the time), 284 in Second Class, and 709 in Third Class. Of these, 869 (66%) were male and 447 (34%) female. There were 107 children aboard, the largest number of which were in Third Class.

The ship had enough lifeboats for about 1,100 people, and more than 1,500 died. Due to the "women and children first" protocol, men were disproportionately left aboard. Also, not all lifeboats were completely filled during the evacuation. The 705 surviving passengers were rescued by the RMS Carpathia around 2 hours after the catastrophe.

lightbulb.png

About That: So the different variables like gender and class could influence whether someone survived, correct?

Yes, for example, someone may not have been able to get to the lifeboats because they were in a lower deck, which is correlated with third class. Also, children were disporptionately in third class, but they were also impacted by the "children first" protocol.

Next Steps…​

We’ll approach this in multiple parts. This is still a work in progress, but roughly speaking it should look like:

  • Part 1: Basic Setup

  • Part 2: Feature Engineering

  • Part 3: Prediction

  • Part 4: Conclusion

comments powered by Disqus