 # Investigate Titanic Dataset

#### Dataset chosen: Titanic

The latter question will be analyzed through more detailed sub-questions, such as:

1. Does gender have any impact on the survival rate?
2. Are passengers in a higher class more likely to survive?
3. Which age range has the highest survival rate?
```python
import pandas as pd
import numpy as np
import matplotlib
import csv
import matplotlib.pyplot as plt
from IPython.display import display
%matplotlib inline

def display_piechart(survive, death, xlabel):
    """Draw a pie chart of survivors versus deaths, captioned with xlabel."""
    # Data to plot
    labels = 'Survive', 'Death'
    sizes = [survive, death]
    colors = ['yellowgreen', 'gold']
    explode = (0.1, 0)  # pull the 'Survive' slice out slightly
    # Plot
    plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
    plt.axis('equal')
    plt.xlabel(xlabel)
    plt.show()
```

• Survived: (0:No, 1:Yes)
• Pclass: Passenger class (1:First Class, 2:Second Class, 3:Third Class)
• Name: Name of the passenger
• Sex: Gender of the passenger
• Age: Age of the passenger
• Fare: Passenger Fare
• Cabin: Cabin of the passenger
• Embarked: Embarkation Port (C = Cherbourg, Q = Queenstown, S = Southampton)
```python
df = pd.read_csv('titanic-data.csv')
print("Data information")
df.info()
print("\nFigure out missing data")
display(df.isnull().sum())
```

#### Data to be dropped

• PassengerId – Just an identifier for the passenger.
• Cabin – Dropped because of the large number of NaN values.

##### Data not related to the questions

• SibSp – Number of siblings or spouses aboard.
• Parch – Number of parents or children aboard.
• Ticket – Ticket number.
• Fare – Passenger fare.
• Embarked – Embarkation port.
```python
df = pd.read_csv('titanic-data.csv')
# Drop unnecessary columns
df.drop(["PassengerId", "SibSp", "Parch", "Ticket", "Cabin", "Embarked", "Fare"], axis=1, inplace=True)
rows = df.shape[0]     # number of records
columns = df.shape[1]  # number of remaining variables
print("The dataset consists of", rows, "rows of records and", columns, "columns of variables.")
display(df.describe())
```

#### Does gender have any impact on the survival rate?

```python
# shape[0] is the row count, i.e. the number of passengers
num_passenger = df.shape[0]
num_male = df.loc[df['Sex'] == "male"].shape[0]
num_female = df.loc[df['Sex'] == "female"].shape[0]
print("We had total number of {} record with {} male and {} female.\n".format(num_passenger, num_male, num_female))
print(df.groupby('Sex').size(), '\n')
print(pd.crosstab(df['Sex'], df['Survived']))
# Visualize survivability: one pie chart per sex
table = pd.crosstab(df['Survived'], df['Sex'])
axes = table.plot.pie(subplots=True, labels=['Death', 'Survived'], autopct='%1.1f%%')
plt.suptitle('Survivability across sex', y=0.8)
for ax in axes:
    ax.legend_.remove()
    ax.set_aspect('equal')
```
```
We had total number of 891 record with 577 male and 314 female.

Sex
female    314
male      577
dtype: int64

Survived    0    1
Sex
female     81  233
male      468  109
```

Answer: Wow! The survival rate is 74.2% for female passengers (233 of 314) and only 18.9% for male passengers (109 of 577).
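The percentages can also be computed directly instead of being read off the pie charts. This is a minimal sketch, assuming a small frame rebuilt from the crosstab counts printed above. The names `counts`, `sample`, and `rate_by_sex` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Crosstab counts copied from the output above: (Sex, Survived) -> passengers
counts = {("female", 0): 81, ("female", 1): 233, ("male", 0): 468, ("male", 1): 109}

# Expand the counts into one row per passenger (illustrative stand-in for df)
rows = [(sex, survived) for (sex, survived), n in counts.items() for _ in range(n)]
sample = pd.DataFrame(rows, columns=["Sex", "Survived"])

# Survived is coded 0/1, so its mean within each group is the survival rate
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
print((rate_by_sex * 100).round(1))
```

On the real data, the same `groupby` call on `df` should give the same rates without the rebuilding step.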

#### Are passengers in a higher class more likely to survive?

```python
# shape[0] is the row count of each selection
num_passenger = df.shape[0]
num_Pclass_1 = df.loc[df['Pclass'] == 1].shape[0]
num_Pclass_2 = df.loc[df['Pclass'] == 2].shape[0]
num_Pclass_3 = df.loc[df['Pclass'] == 3].shape[0]
print("We had total number of {} record with {} for first class and {} for second class and {} for third class.".format(num_passenger, num_Pclass_1, num_Pclass_2, num_Pclass_3))
# Boolean masks selecting the survivors in each class
survived_Pclass_1 = (df['Pclass'] == 1) & (df['Survived'] == 1)
survived_Pclass_2 = (df['Pclass'] == 2) & (df['Survived'] == 1)
survived_Pclass_3 = (df['Pclass'] == 3) & (df['Survived'] == 1)
# Calculate number of survivors and deaths
num_survived_Pclass_1 = df.loc[survived_Pclass_1].shape[0]
num_survived_Pclass_2 = df.loc[survived_Pclass_2].shape[0]
num_survived_Pclass_3 = df.loc[survived_Pclass_3].shape[0]
num_not_survived_Pclass_1 = num_Pclass_1 - num_survived_Pclass_1
num_not_survived_Pclass_2 = num_Pclass_2 - num_survived_Pclass_2
num_not_survived_Pclass_3 = num_Pclass_3 - num_survived_Pclass_3
print("Total of {} in first class, {} survive and {} not survive.".format(num_Pclass_1, num_survived_Pclass_1, num_not_survived_Pclass_1))
print("Total of {} in second class, {} survive and {} not survive.".format(num_Pclass_2, num_survived_Pclass_2, num_not_survived_Pclass_2))
print("Total of {} in third class, {} survive and {} not survive.".format(num_Pclass_3, num_survived_Pclass_3, num_not_survived_Pclass_3))
display_piechart(num_survived_Pclass_1, num_not_survived_Pclass_1, 'Survival Rate for First Class')
display_piechart(num_survived_Pclass_2, num_not_survived_Pclass_2, 'Survival Rate for Second Class')
display_piechart(num_survived_Pclass_3, num_not_survived_Pclass_3, 'Survival Rate for Third Class')
```

```
We had total number of 891 record with 216 for first class and 184 for second class and 491 for third class.
Total of 216 in first class, 136 survive and 80 not survive.
Total of 184 in second class, 87 survive and 97 not survive.
Total of 491 in third class, 119 survive and 372 not survive.
```

Answer: Wow! According to the pie charts, first class has the highest survival rate at 63%. Second class has 47.3%, and third class has only 24.2%.
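The per-class rates can be obtained in one step with a row-normalized crosstab. This is a sketch under the assumption that the printed counts are expanded back into one row per passenger. The names `counts`, `sample`, and `rates` are hypothetical.

```python
import pandas as pd

# Counts copied from the output above: (Pclass, Survived) -> passengers
counts = {(1, 1): 136, (1, 0): 80, (2, 1): 87, (2, 0): 97, (3, 1): 119, (3, 0): 372}

# Expand the counts into one row per passenger (illustrative stand-in for df)
rows = [(pclass, survived) for (pclass, survived), n in counts.items() for _ in range(n)]
sample = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# normalize="index" divides each row by its total, giving rates per class
rates = pd.crosstab(sample["Pclass"], sample["Survived"], normalize="index") * 100
print(rates.round(1))
```

Column `1` of the result holds the survival rate for each class, and column `0` holds the death rate.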

#### Which age range has the highest survival rate?

```python
# Drop records with a missing passenger age
df = df.dropna(subset=['Age'])
# Overlay two histograms on the same 10-year bins: survivors in blue, deaths in red
df[df.Survived==1].Age.plot.hist(bins=range(0,81,10),alpha=0.5,color="blue",figsize=(6,4),label='Survived')
df[df.Survived==0].Age.plot.hist(bins=range(0,81,10),alpha=0.5,color="red",figsize=(6,4),label='Death')
plt.legend()
plt.xlabel("Age distribution of passengers who survived and died")
```

Answer: The 0-10 age range has the highest survival rate. The 20-30 age range looks the riskiest in the histogram, but note that the histogram shows counts, not rates.
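Because a histogram compares counts, a per-bin survival rate is a more direct way to answer the question. The sketch below uses a small made-up frame called `toy` (hypothetical values, not the real Titanic records) to show how `pd.cut` could group ages into the same 10-year bins. The names `AgeGroup` and `rate_by_age` are also assumptions.

```python
import pandas as pd

# Hypothetical toy records, NOT the real Titanic data
toy = pd.DataFrame({
    "Age":      [4, 8, 22, 25, 28, 35, 47, 63],
    "Survived": [1, 1, 0, 0, 1, 1, 0, 0],
})

# Same 10-year bin edges as the histogram; right=False makes bins [0, 10), [10, 20), ...
bins = list(range(0, 81, 10))
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, right=False)

# Mean of the 0/1 Survived column per bin is the survival rate of that bin
rate_by_age = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate_by_age)
```

With `right=False`, an age of exactly 80 falls outside the last bin, so the final edge may need to be raised when running this on the real data.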

### Conclusion

According to the analysis above, several factors are associated with the survival rate. Age, passenger class, and gender all show a strong relationship with survival in this dataset.

#### Limitations

This analysis has some limitations:

• Missing data: The Titanic dataset has missing values for passenger age, and most of the cabin data is missing.
• Ignored data: I dropped the entire Cabin column because of the large amount of missing data, along with the columns that do not relate to the questions.
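To make the missing-data limitation concrete, the share of missing values per column can be measured before deciding what to drop. This is a minimal sketch on a made-up frame called `toy` (hypothetical values, not the real dataset). The name `missing_share` is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the real Titanic data
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Sex":   ["male", "female", "female", "male"],
})

# isnull() gives True/False per cell; the column mean is the fraction missing
missing_share = toy.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```

A column with a very high share, like Cabin here, is a candidate for dropping, while a column with a moderate share may be worth keeping and filtering row by row.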

#### Other variables

• Passenger Career
• Lifeboat number
• Passenger Health Status
• Incorrectly recorded data
• Passenger Reputation (Maybe some of them are superstars?)
• Ship maintenance report
• Passenger Background

It would be interesting to analyze these variables if the data existed!

### Reference

https://www.kaggle.com/c/titanic/data