Investigate Titanic Dataset


Dataset chosen: Titanic

The main question will be analyzed through more detailed questions, such as:

  1. Does gender have any impact on the survival rate?
  2. Are passengers in a higher class more likely to survive?
  3. Which age range has the highest survival rate?
import pandas as pd
import numpy as np
import matplotlib
import csv
import matplotlib.pyplot as plt
from IPython.display import display
%matplotlib inline
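def display_piechart(survive, death, xlabel):
    """Draw a pie chart of survivors vs. deaths, captioned with xlabel."""
    # Data to plot
    labels = 'Survive', 'Death'
    sizes = [survive, death]
    colors = ['yellowgreen', 'gold']
    explode = (0.1, 0)  # pull the 'Survive' slice out slightly
    # Plot
    plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
    plt.axis('equal')   # keep the pie circular
    plt.xlabel(xlabel)  # caption shown under the chart
    plt.show()          # finish this figure so the next call starts a new one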


  • Survived: Survival status (0: No, 1: Yes)
  • Pclass: Passenger class (1: First Class, 2: Second Class, 3: Third Class)
  • Name: Name of the passenger
  • Sex: Gender of the passenger
  • Age: Age of the passenger
  • Fare: Passenger Fare
  • Cabin: Cabin number of the passenger
  • Embarked: Embarkation Port (C = Cherbourg, Q = Queenstown, S = Southampton)
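df = pd.read_csv('titanic-data.csv')
print("Data information")
df.info()  # column names, dtypes, and non-null counts
print("\nFigure out missing data")
print(df.isnull().sum())  # number of missing values per column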

Data decided to be dropped

  • PassengerId – Just an ID for each passenger.
  • Cabin – Dropped because of the high amount of NaN data.

Data that does not relate to the questions

  • SibSp – Number of siblings or spouses aboard.
  • Parch – Number of parents or children aboard.
  • Ticket – Ticket number.
  • Fare – Passenger fare.
  • Embarked – Embarkation port.
df = pd.read_csv('titanic-data.csv')
# Drop unnecessary columns
df = df.drop(columns=["PassengerId", "SibSp", "Parch", "Ticket", "Cabin", "Embarked", "Fare"])
rows, columns = df.shape
print("The dataset consists of {} rows of records and {} columns of variables.".format(rows, columns))

Does gender have any impact on the survival rate?

num_passenger = df.shape[0]
num_male = df.loc[df['Sex'] == "male"].shape[0]
num_female = df.loc[df['Sex'] == "female"].shape[0]
print ("We had total number of {} record with {} male and {} female.\n".format(num_passenger, num_male, num_female))
print(df.groupby('Sex').size(), '\n')
print(pd.crosstab(df['Sex'], df['Survived']))
# Visualize Survivability
table = pd.crosstab(df['Survived'],df['Sex'])
axes = table.plot.pie(subplots=True, labels=['Death','Survived'], autopct='%1.1f%%');
plt.suptitle('Survivability across sex',y=0.8)
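for ax in axes:
    ax.set_aspect('equal')  # assumed loop body (missing in the source): keep each pie circular
plt.show()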
We had total number of 891 record with 577 male and 314 female.

female    314
male      577
dtype: int64 

Survived    0    1
female     81  233
male      468  109
[Figure: Survivability across sex]

Answer: Yes. Females had a 74.2% survival rate (233 of 314), while males had only 18.9% (109 of 577).
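The percentages in the answer above can be reproduced directly from the crosstab counts. The following is a minimal sketch that rebuilds the table from the printed output instead of reading `titanic-data.csv`; the names `counts` and `survival_rate` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the crosstab printed above (rows: Sex, columns: Survived).
counts = pd.DataFrame(
    {0: [81, 468], 1: [233, 109]},
    index=pd.Index(["female", "male"], name="Sex"),
)
counts.columns.name = "Survived"

# Survival rate per sex = survivors / all passengers of that sex, in percent.
survival_rate = (counts[1] / counts.sum(axis=1) * 100).round(1)
print(survival_rate)
```

With the full dataset loaded, `pd.crosstab(df['Sex'], df['Survived'], normalize='index')` should give the same proportions in one call.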

Are passengers in a higher class more likely to survive?

num_passenger = df.shape[0]
num_Pclass_1 = df.loc[df['Pclass'] == 1].shape[0]
num_Pclass_2 = df.loc[df['Pclass'] == 2].shape[0]
num_Pclass_3 = df.loc[df['Pclass'] == 3].shape[0]
print ("We had total number of {} record with {} for first class and {} for second class and {} for third class.".format(num_passenger, num_Pclass_1, num_Pclass_2,num_Pclass_3))
survived_Pclass_1 = (df['Pclass'] == 1) & (df['Survived'] == 1)
survived_Pclass_2 = (df['Pclass'] == 2) & (df['Survived'] == 1)
survived_Pclass_3 = (df['Pclass'] == 3) & (df['Survived'] == 1)
# Calculate number of survival and death
num_survived_Pclass_1 = df.loc[survived_Pclass_1].shape[0]
num_survived_Pclass_2 = df.loc[survived_Pclass_2].shape[0]
num_survived_Pclass_3 = df.loc[survived_Pclass_3].shape[0]
num_not_survived_Pclass_1 = num_Pclass_1 - num_survived_Pclass_1
num_not_survived_Pclass_2 = num_Pclass_2 - num_survived_Pclass_2
num_not_survived_Pclass_3 = num_Pclass_3 - num_survived_Pclass_3
print("Total of {} in first class: {} survived and {} did not survive.".format(num_Pclass_1, num_survived_Pclass_1, num_not_survived_Pclass_1))
print("Total of {} in second class: {} survived and {} did not survive.".format(num_Pclass_2, num_survived_Pclass_2, num_not_survived_Pclass_2))
print("Total of {} in third class: {} survived and {} did not survive.".format(num_Pclass_3, num_survived_Pclass_3, num_not_survived_Pclass_3))
display_piechart(num_survived_Pclass_1, num_not_survived_Pclass_1, 'Survival Rate for First Class')
display_piechart(num_survived_Pclass_2, num_not_survived_Pclass_2, 'Survival Rate for Second Class')
display_piechart(num_survived_Pclass_3, num_not_survived_Pclass_3, 'Survival Rate for Third Class')
We had total number of 891 record with 216 for first class and 184 for second class and 491 for third class.
Total of 216 in first class: 136 survived and 80 did not survive.
Total of 184 in second class: 87 survived and 97 did not survive.
Total of 491 in third class: 119 survived and 372 did not survive.
[Figure: Survival Rate for First Class]
[Figure: Survival Rate for Second Class]
[Figure: Survival Rate for Third Class]

Answer: Yes. According to the pie charts, first class has the highest survival rate at 63.0%. Second class reaches 47.3%, and third class only 24.2%, so the survival rate drops with each step down in passenger class.
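The per-class variables above can be condensed into a single `groupby`. The sketch below is one possible way to do it, not the notebook's own method: it expands the printed counts into one row per passenger so that it runs without `titanic-data.csv`. The names `class_counts`, `passengers`, and `rate_by_class` are assumptions made for illustration.

```python
import pandas as pd

# (survived, did not survive) per class, copied from the printed output above.
class_counts = {1: (136, 80), 2: (87, 97), 3: (119, 372)}

# Expand the counts into one row per passenger so groupby can be used.
rows = []
for pclass, (survived, died) in class_counts.items():
    rows += [(pclass, 1)] * survived + [(pclass, 0)] * died
passengers = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate_by_class = (passengers.groupby("Pclass")["Survived"].mean() * 100).round(1)
print(rate_by_class)
```

On the real DataFrame the same line would presumably work as `df.groupby('Pclass')['Survived'].mean()`, which avoids maintaining separate variables for each class.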

Which age range has the highest survival rate?

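# Drop records with a missing passenger age
df = df.dropna(subset=['Age'])
# Assumed plot call (missing in the source): age histograms by outcome
plt.hist([df.loc[df['Survived'] == 1, 'Age'], df.loc[df['Survived'] == 0, 'Age']], bins=range(0, 90, 10), label=['Survived', 'Death'])
plt.legend()
plt.xlabel("Age distribution of passengers who survived and died")
[Figure: Age distribution of passengers who survived and died]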

Answer: The 0-10 age range has the highest survival rate. However, the 20-30 age range carries the highest risk.
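A histogram shows counts, so it can help to also compute the survival rate per age range. The following sketch bins ages into 10-year groups with `pd.cut`. The `sample` data is hypothetical and not taken from `titanic-data.csv`; it only illustrates the technique, and the bin edges are an assumption.

```python
import pandas as pd

# Hypothetical sample, NOT taken from titanic-data.csv.
sample = pd.DataFrame({
    "Age":      [4, 8, 22, 25, 28, 35, 47, 63],
    "Survived": [1, 1, 0, 0, 1, 1, 0, 0],
})

# Group ages into 10-year bins: [0, 10), [10, 20), ..., [80, 90).
bins = list(range(0, 100, 10))
sample["AgeGroup"] = pd.cut(sample["Age"], bins=bins, right=False)

# Per bin: number of passengers ("count") and share that survived ("mean").
summary = sample.groupby("AgeGroup", observed=True)["Survived"].agg(["count", "mean"])
print(summary)
```

Reading `count` and `mean` side by side separates "the group with the most deaths" from "the group with the lowest survival rate", which are not necessarily the same age range.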


According to the above analysis, several factors affect the survival rate. Of the variables analyzed, age, class, and gender all show a strong association with survival.


This analysis has some limitations:

  • Missing data: The Titanic dataset has missing values for passenger age, and most of the cabin data is missing.
  • Ignored data: I dropped the entire Cabin column because of its large amount of missing data, along with the columns that were unnecessary for the questions.
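To make the missing-data limitation concrete, the share of missing values per column can be measured before deciding what to drop. This is a minimal sketch on a hypothetical `sample` frame (not the real dataset); the 50% cut-off is an arbitrary assumption for illustration, not a rule used in this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps, NOT taken from titanic-data.csv.
sample = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex":   ["male", "female", "female", "male"],
})

# isna() marks missing cells; the column mean of that mask is the missing share.
missing_share = (sample.isna().mean() * 100).round(1)
print(missing_share)

# Assumed rule of thumb: keep only columns with less than 50% missing values.
keep = missing_share[missing_share < 50].index.tolist()
print(keep)
```

Reporting these percentages alongside the results lets a reader judge how much the dropped rows and columns might bias the conclusions.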

Other variables

  • Passenger Career
  • Lifeboat number
  • Passenger Health Status
  • Incorrectly recorded data
  • Passenger Reputation (Maybe some of them are superstars?)
  • Ship maintenance report
  • Passenger Background

It would be interesting to see how these variables relate to survival if they were available!