Practicing Exploratory Data Analysis (EDA)

Shahzaib Ali
Oct 16, 2021

Figure 1: Pair Plot for the Heart Disease Dataset

One of the first steps in any data analysis is to explore the data: to get a sense of the variables present, their types, and their scales.

1. One example: exploration shows whether the variables are on different scales. If they are, you will need to normalize them to bring them to a comparable scale, which many analyses and ML models need in order to work accurately.

2. Another example is getting a sense of the standard deviation or variance of each variable, and of how the variables relate to one another. If you are planning a regression, specifically a linear one, and the relationships between the variables are not linear, that is an issue, and you need to rethink the type of regression you are going to use.

3. One final example: the correlation matrix helps you check for multicollinearity, that is, whether some of the predictors are strongly correlated with each other. All of this underlines the importance of exploratory analysis.

For this example I will be using Python; however, R or anything else you like can be used, and the steps are the same. Don’t worry, exploratory analysis is pretty simple. The dataset used is called Heart.csv from Kaggle.
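The first point above, bringing variables to a common scale, can be sketched as follows. This is a minimal sketch that assumes pandas is installed; the source only says "normalize", so z-score standardization (mean 0, standard deviation 1) is just one common choice, and the function name `standardize` is made up for illustration.

```python
import pandas as pd


def standardize(df, columns):
    """Return a copy of df with the given numeric columns rescaled
    to mean 0 and (sample) standard deviation 1."""
    out = df.copy()
    for col in columns:
        # z-score: subtract the column mean, divide by the column std
        out[col] = (df[col] - df[col].mean()) / df[col].std()
    return out
```

With the Kaggle file loaded via `df = pd.read_csv("heart.csv")` (the file name and location are assumptions), a call such as `standardize(df, ["Age", "RestingBP", "Cholesterol"])` would leave the original frame untouched and return the rescaled copy.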

Note: I will put the link at the bottom so you can obtain the dataset and conduct your own EDA.

Summary of Data

The mean, standard deviation, variance, and quartiles give a good indication of how much spread there is in the data for a given variable. In Python, the pandas describe function reports the count, mean, standard deviation, minimum, maximum, and quartiles for each numeric column; the variance is simply the square of the standard deviation. Why look at this? If you were to fit a linear regression, you would want to check that there is not too much variability. It is worth looking at this summary both before and after you have dropped missing values. Missing values can come in the form of NA, NaN, or even zero. In this dataset they appear as zeros: many rows have a zero in the Cholesterol column, which is physiologically impossible and so must stand for a missing measurement.

Table 1: Summary for Each Variable
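A before-and-after summary along these lines could be produced as below. This is a sketch under assumptions: pandas and NumPy are installed, zeros are treated as missing only in the columns you name (Cholesterol by default, following the text), and the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_with_zero_as_missing(df, zero_is_missing=("Cholesterol",)):
    """Return describe() before and after treating zeros as missing.

    Zeros in the listed columns are replaced with NaN and the affected
    rows are dropped before the second summary is computed.
    """
    before = df.describe()

    cleaned = df.copy()
    for col in zero_is_missing:
        cleaned[col] = cleaned[col].replace(0, np.nan)
    cleaned = cleaned.dropna(subset=list(zero_is_missing))

    after = cleaned.describe()
    return before, after
```

Comparing the `count`, `mean`, and `min` rows of the two tables shows how much the zero entries were distorting the summary.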

Outlier Analysis

The best way to inspect outliers is to visualize them, and you can do this by producing a boxplot. The boxplot is also a good plot for comparing two classes, such as males and females, or healthy and unhealthy individuals. In this case, it is worth checking whether RestingBP and Cholesterol differ noticeably between males and females, and between individuals with and without heart disease. Interestingly, cholesterol is higher in females.

Figure 2: Boxplots for Cholesterol and RestingBP by Heart-disease Class and Sex.
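Grouped boxplots of this kind can be drawn roughly as follows. This sketch assumes pandas and matplotlib are installed; it calls matplotlib directly, which is not necessarily how the figure above was produced, and the function name is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def boxplots_by_group(df, value_cols, group_col):
    """Draw one panel per value column, with one box per group."""
    fig, axes = plt.subplots(
        1, len(value_cols), figsize=(5 * len(value_cols), 4), squeeze=False
    )
    for ax, col in zip(axes[0], value_cols):
        names, data = [], []
        # Collect the non-missing values of this column for every group
        for name, values in df.groupby(group_col)[col]:
            names.append(str(name))
            data.append(values.dropna().to_numpy())
        ax.boxplot(data)            # points beyond the whiskers are drawn as outliers
        ax.set_xticklabels(names)   # label each box with its group
        ax.set_xlabel(group_col)
        ax.set_ylabel(col)
    fig.tight_layout()
    return fig, axes[0]
```

For example, `boxplots_by_group(df, ["Cholesterol", "RestingBP"], "Sex")` followed by `plt.show()` would give one panel per variable, split by sex; passing `"HeartDisease"` as the group column would split by disease class instead.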

Correlation Analysis

You can construct a correlation matrix between the variables to see which pairs of variables are related. This matters when building a linear regression model, because you want to know which variables are driving the prediction. If two variables are related, and one influences the other, you cannot tell which of them is actually important, which is a real problem in disease analysis. Suppose blood pressure and cholesterol were related: how would you know whether both are contributing to heart disease, or only one? On a larger scale, if you were investigating the impact of a medicine and two factors were related, how would you know whether the medicine affects both, or affects one while the other simply follows as a result? Correlation is important, especially in prediction models. A pair plot gives a different set of visuals for investigating collinearity (Figure 1).

Figure 3: Correlation Matrix between Features of Heart-Disease Dataset.
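One way to compute the matrix and list the strongly related pairs is sketched below. It assumes pandas is installed and uses the default Pearson correlation; the 0.7 cutoff is an arbitrary illustration rather than a value from this analysis, and the function name is hypothetical.

```python
import pandas as pd


def correlation_report(df, threshold=0.7):
    """Return the correlation matrix of the numeric columns and the
    column pairs whose absolute correlation is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr()

    strong_pairs = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:      # upper triangle only, skip the diagonal
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                strong_pairs.append((a, b, float(r)))
    return corr, strong_pairs
```

Any pair returned in `strong_pairs` is a candidate for the multicollinearity problem described above and deserves a closer look before both variables go into the same model.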

Distribution Visuals

Histograms are a good way to visualize the distribution of a variable and the presence of outliers.

Figure 4: Distributions of Age, Cholesterol, and RestingBP Variables.
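Histograms like these can be produced with pandas' built-in plotting, roughly as follows. The sketch assumes pandas and matplotlib are installed; the bin count of 20 is an arbitrary choice, and the function name is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_distributions(df, columns, bins=20):
    """Draw one histogram per column, side by side, and return the axes."""
    axes = df[columns].hist(
        bins=bins,
        layout=(1, len(columns)),
        figsize=(4 * len(columns), 3),
    )
    # DataFrame.hist returns a 2-D array of axes; flatten it for convenience
    return axes.ravel()
```

Calling `plot_distributions(df, ["Age", "Cholesterol", "RestingBP"])` and then `plt.show()` gives one panel per variable. A spike at zero in the Cholesterol panel would be the missing-value issue discussed earlier showing up visually.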

Observations Noted

Females, in general, have higher cholesterol than males.
Females with asymptomatic chest pain or atypical angina have higher cholesterol than males and females with non-anginal pain or typical angina.
Males with atypical angina or asymptomatic chest pain have higher cholesterol than males with typical angina or non-anginal pain.
Females with cholesterol greater than 300 are very likely to have heart disease and to suffer from atypical angina.
Cholesterol levels in males with heart disease do not differ significantly from those in females without heart disease (p-value = 0.7888).
MaxHR is negatively correlated with Age, and Oldpeak is positively correlated with heart disease.
The Age variable is approximately normally distributed, which suggests there is no obvious sampling bias with respect to age.

Figure 5: Barplots of Average Cholesterol, Age, Resting BP by Sex and Chest Pain.
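The group averages behind a chart like this can be computed with a pandas groupby. This is a minimal sketch assuming pandas is installed and that the column names match those in the Kaggle file; the function name is hypothetical.

```python
import pandas as pd


def mean_by_group(df, group_cols, value_cols):
    """Return the mean of each value column for every combination
    of the grouping columns."""
    return df.groupby(list(group_cols))[list(value_cols)].mean()
```

For example, `mean_by_group(df, ["Sex", "ChestPainType"], ["Cholesterol", "Age", "RestingBP"])` returns one row per sex and chest-pain type, and calling `.plot.bar()` on the result would draw bars similar in spirit to Figure 5.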

Dataset Link: https://www.kaggle.com/fedesoriano/heart-failure-prediction
