Titanic Survival Prediction
Would you have survived the Titanic if you were onboard?
The RMS Titanic was billed as unsinkable and was the largest, most luxurious passenger ship of its time. Sadly, the British ocean liner sank on April 15, 1912, killing over 1,500 people; only about 705 survived.
In this article, I will take you through a very famous case study for machine learning practitioners: predicting Titanic survival with Machine Learning.
Predict Titanic Survival with Machine Learning
First, we will import the required libraries. Then we will read the data from the titanic.csv file with the read_csv() function, supplying the attribute names through its names parameter, and store the result in the DataFrame df. A DataFrame is a two-dimensional data structure that stores data in tabular form, i.e. rows and columns. Now let's have a look at the data: head() displays the first five rows of the DataFrame by default.
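A minimal sketch of this step, assuming the CSV has no header row and uses the usual Kaggle-style column names (the exact names in the notebook may differ):

```python
import pandas as pd

# Column names are an assumption; adjust them to match your titanic.csv
cols = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
        "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Read the CSV into a DataFrame and attach the attribute names
df = pd.read_csv("titanic.csv", names=cols)

# Display the first five rows
df.head()
```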
The shape attribute tells us the number of rows and columns in the dataset, and describe() shows statistical details of the DataFrame such as count, mean, standard deviation and percentiles.
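For example:

```python
# Number of rows and columns
print(df.shape)

# Summary statistics: count, mean, std, min, percentiles, max
print(df.describe())
```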
The info() function gives a concise summary of the DataFrame. Scikit-learn's algorithms generally cannot handle missing data, so we'll check the columns to see whether any attribute contains missing values.
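One way to check this:

```python
# Concise summary: column dtypes and non-null counts
df.info()

# Count missing values per column
print(df.isnull().sum())
```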
We will drop the attributes that are not useful for training our model.
axis='columns' means we are specifically targeting the column axis, and inplace=True means the data is modified in place: the call returns nothing and df is updated directly.
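A sketch of the drop step; the particular columns removed here are an assumption, so drop whichever attributes you consider uninformative:

```python
# Drop columns that are unlikely to help the model (illustrative choice)
df.drop(["PassengerId", "Name", "Ticket", "Cabin", "Embarked"],
        axis="columns", inplace=True)
```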
Now, we will use LabelEncoder from sklearn.preprocessing. It converts each value in a column to a number.
Scikit-learn's algorithms generally cannot work with object or string values; the inputs have to be floats or integers, which is exactly what the LabelEncoder provides.
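Here is how the encoding might look, assuming the Sex column is the remaining string attribute:

```python
from sklearn.preprocessing import LabelEncoder

# Encode the 'Sex' column (strings) into integers, e.g. female -> 0, male -> 1
le = LabelEncoder()
df["Sex"] = le.fit_transform(df["Sex"])
```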
Here, 'X' is the input variable, containing the attributes required for training the model, while 'y' is the target variable we want to predict.
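A minimal sketch of the split into features and target, assuming the target column is named Survived:

```python
# Input features (everything except the target) and the target variable
X = df.drop("Survived", axis="columns")
y = df["Survived"]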
Next, the Age column is checked for null values. If any are present, we fill them with the mean of the column.
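For example:

```python
# Check for missing ages and fill them with the column mean
print(X["Age"].isnull().sum())
X["Age"] = X["Age"].fillna(X["Age"].mean())
```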
To train the model, I imported train_test_split from sklearn.model_selection and used a Random Forest Classifier.
The test_size I used is 0.2, i.e. 20%. Generally the test size varies from 20% to 30%, and the remaining 70% to 80% of the data is used for training the model.
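A sketch of the split and training step; the random_state value is an arbitrary choice for reproducibility:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a Random Forest on the training split
model = RandomForestClassifier()
model.fit(X_train, y_train)
```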
We can now take a look at the X_test and y_test data.
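```python
# Inspect the held-out features and labels
print(X_test.head())
print(y_test.head())
```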
The predict_proba function returns the class probabilities for the target as an array.
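A short sketch of how the probabilities and predictions could be inspected; the accuracy check is an extra step not mentioned above:

```python
# Probabilities for each class: column 0 = did not survive, column 1 = survived
probs = model.predict_proba(X_test)
print(probs[:5])

# Hard class predictions and test accuracy
print(model.predict(X_test)[:5])
print(model.score(X_test, y_test))
```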
Thanks for visiting…
You can take a closer look at the Jupyter Notebook in my GitHub repository.