- Data description
- Missing values
- Basic imputation
- Binary Features
- Categorical features
- Remaining features
- Do the features follow a normal distribution ?
- Correlation
- Binary-Binary Correlation (Phi Coefficient)
- Binary-Binary Correlation (Chi-Square test / CramerV)
- Continuous - Binary (Kruskal-Wallis Rank Sum Test)
- Summary / Key Learnings
Porto Seguro is a Brazilian insurance company. In November 2017, it hosted a Kaggle competition to predict the probability that a driver will initiate an auto insurance claim in the next year. This was the first Kaggle competition I participated in. In this post, I do an exploratory analysis of the training data and try some statistical inference tests.
Data description
The training set has around 600K observations and 59 features (including the target and an id feature), and the test set has around 900K rows. Below is the description provided with the data:
#train file stored outside the blog files
root <- '/Mac Backup/R-Large-Items/porto/'
DT_train <- fread(file=paste(root,'train.csv',sep=''))
DT_test <- fread(file=paste(root,'test.csv',sep=''))
# dim(DT_train)
# dim(DT_test)
In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
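To get a quick feel for this naming scheme, we can tally the features by group prefix and type suffix (a quick sketch using data.table, which is already loaded for fread above):
# tally features by group (ind/reg/car/calc) and type (bin/cat/other)
DT_meta <- data.table(feature = setdiff(names(DT_train), c("id", "target")))
DT_meta[, group := tstrsplit(feature, "_")[[2]]]
DT_meta[, type := fifelse(grepl("bin$", feature), "bin",
                  fifelse(grepl("cat$", feature), "cat", "other"))]
dcast(DT_meta, group ~ type, fun.aggregate = length, value.var = "feature")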
Missing values
- The first thing that I look at in a dataset is whether there are any missing values.
- Missing values and their imputation form an interesting topic that I hope to cover in the future. Here I have used the `plot_missing` function from the excellent `DataExplorer` package to get an idea of the missing values.
- The missing values were coded as -1, which I have converted to NA, and I am plotting only the features that had missing values.
- There are 13 features with missing values, and four of them (`ps_car_14`, `ps_reg_03`, `ps_car_05_cat` and `ps_car_03_cat`) have a significant volume of missingness.
- The distribution of the missing values is shown in the following plot.
DT_train[DT_train == -1] <- NA
#lapply(DT_train,function(x) sum(is.na(x))/length(x))
cols_NA <- names(which(lapply(DT_train,function(x) sum(is.na(x))/length(x)) >0))
DataExplorer::plot_missing(DT_train[,..cols_NA]
,theme_config = theme_darklightmix(color_theme = ggplot_color_theme)
,title = "Missing values")
visdat::vis_dat(dplyr::sample_n(DT_train[,..cols_NA],60000),warn_large_data = F) +
theme_darklightmix(color_theme = ggplot_color_theme,legend_position = "right") +
labs(title="Missing value distribution (sample size 60K)"
,caption = varPlotCaption) +
theme(axis.text.x = element_text(angle =45,hjust=0.2))
Basic imputation
- Missing value imputation is an extensive topic, and there are many advanced algorithms and packages (mice, Amelia, expectation maximisation, etc.) that can be used.
- But for this analysis, I will rely on the simplest methods:
  - Mode imputation for the categorical features
  - Median imputation for the continuous features
- I could have hand-coded the functions, but they are available in the `na.tools` package; a hand-coded version is sketched after the imputation code below.
DataExplorer::plot_histogram(DT_train[,..cols_NA]
,theme_config = theme_darklightmix(color_theme = ggplot_color_theme)
,title = "Distribution of the features with missing values")
cols_NA_cat <- c("ps_ind_02_cat","ps_ind_04_cat","ps_ind_05_cat","ps_car_01_cat","ps_car_02_cat","ps_car_03_cat","ps_car_05_cat","ps_car_07_cat","ps_car_09_cat","ps_car_11")
DT_train[,(cols_NA_cat) := lapply(.SD,na.tools::na.most_freq), .SDcols = cols_NA_cat ]
cols_NA_cont <- c("ps_car_14","ps_reg_03","ps_car_12")
DT_train[,(cols_NA_cont) := lapply(.SD,na.tools::na.median), .SDcols = cols_NA_cont ]
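For reference, a minimal hand-coded version of the same two imputations might look like this (a sketch; impute_mode and impute_median are hypothetical helpers, not part of na.tools):
# hand-coded equivalents of na.tools::na.most_freq and na.tools::na.median
impute_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  mode_val <- ux[which.max(tabulate(match(x, ux)))]  # most frequent value, keeps type
  x[is.na(x)] <- mode_val
  x
}
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)             # ignore NAs in the median
  x
}
# DT_train[, (cols_NA_cat) := lapply(.SD, impute_mode), .SDcols = cols_NA_cat]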
Binary Features
- There are 17 features postfixed with `bin`. These are binary features and as such have values of 0 and 1.
- 99% of observations have zero values for the features `ps_ind_10_bin`, `ps_ind_11_bin`, `ps_ind_12_bin` and `ps_ind_13_bin` (verified in the sketch below). For predictive modeling, these features would not add much value, and I would probably look at dropping them.
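That 99% figure can be checked directly with a one-liner (a quick sketch):
# share of zero values in each binary feature
cols_bin <- grep("bin", names(DT_train), value = TRUE)
DT_train[, lapply(.SD, function(x) round(mean(x == 0), 4)), .SDcols = cols_bin]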
cols_bin <- grep("bin",names(DT_train),value = T)
DT_train[,..cols_bin] %>%
melt(measure.vars = 1:ncol(.)) %>%
.[,.(freq=.N),.(variable,value)] %>%
.[,.(value,freq,freqPct=freq/sum(freq)),variable] %>%
ggplot(aes(x=factor(value),y=freq/1e3,fill=factor(value))) +
geom_col() +
facet_wrap(~variable,scales="free",ncol=3) +
scale_y_continuous(labels = scales::unit_format(unit = "K")) +
# scale_x_discrete(breaks = c(0,1)) +
geom_text(aes(label = paste(round(100*freqPct,1),"%",sep="")), position = position_stack(vjust = 0.5), size = 2) +
theme_darklightmix(color_theme = ggplot_color_theme,legend_position = "none") +
scale_fill_brewer(palette = "Set3") +
labs(x="value"
,y="count"
,title="Distribution of bin variables"
,caption = varPlotCaption)
Categorical features
- There are 14 features postfixed with `cat`. These are categorical features.
- While the majority of them have fewer than 10 levels, some features such as `ps_car_06_cat` and `ps_car_11_cat` have a higher number of levels (tallied in the sketch below).
- As these are column plots, the variability is not clear, so I will include `ps_car_11_cat` in the following histogram plots.
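The number of levels per categorical feature can be tallied directly (a sketch using data.table's uniqueN):
# count of distinct levels per categorical feature, smallest first
cols_cat <- grep("cat", names(DT_train), value = TRUE)
sort(DT_train[, sapply(.SD, uniqueN), .SDcols = cols_cat])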
cols_cat <- grep("cat",names(DT_train),value = T)
DT_train[,..cols_cat] %>%
melt(measure.vars = 1:ncol(.)) %>%
.[,.(freq=.N),.(variable,value)] %>%
.[,.(value,freq,freqPct=freq/sum(freq,na.rm = T)),variable] %>%
ggplot(aes(x=factor(value),y=freq/1e3,fill=factor(..x..))) +
geom_col() +
facet_wrap(~variable,scales="free",ncol=3) +
scale_y_continuous(labels = scales::unit_format(unit = "K")) +
# scale_x_discrete(breaks = c(0,1)) +
geom_text(aes(label = paste(round(100*freqPct,1),"%",sep="")), position = position_stack(vjust = 0.5), size = 2) +
theme_darklightmix(color_theme = ggplot_color_theme,legend_position = "none") +
scale_fill_brewer(palette = "Set3") +
labs(x="value"
,y="count"
,title="Distribution of cat variables"
,caption = varPlotCaption)
Remaining features
The remaining 26 features are either continuous or ordinal and are plotted below.
#remove id and target and add one variable from cat
cols_others <- names(DT_train)[!names(DT_train) %in% c(cols_bin,cols_cat,"id","target")]
cols_others <- c(cols_others,"ps_car_11_cat")
fn_extraPolate <- colorRampPalette(brewer.pal(6, "Set3"))
DT_train[,..cols_others] %>%
melt(measure.vars = 1:ncol(.)) %>%
ggplot(aes(x=value,y=..count../1e3,fill=variable)) +
geom_histogram(na.rm = T) +
facet_wrap(~variable,scales="free",ncol=3) +
scale_y_continuous(labels = scales::unit_format(unit = "K")) +
# scale_x_discrete(breaks = c(0,1)) +
theme_darklightmix(color_theme = ggplot_color_theme,legend_position = "none") +
# scale_fill_continuous()
scale_fill_manual(values = fn_extraPolate(length(cols_others))) +
# scale_fill_brewer(palette = "Set3") +
labs(x="value"
,y="count"
,title="Distribution of remaining variables"
,caption = varPlotCaption)
Do the features follow a normal distribution?
- Some of the prediction algorithms require the continuous features to follow a normal distribution. This can be verified in different ways:
  - Histogram
  - QQ plots
  - Shapiro-Wilk test
  - Kolmogorov-Smirnov test
  - Anderson-Darling test
- We already know that the binary and categorical features do not follow a normal distribution.
- Let's look at a few of the features that looked normally distributed in the histograms plotted above.
- Only `ps_calc_10`, `ps_calc_14` and, to a certain extent, `ps_calc_11` show a clearly normal shape. `ps_reg_03`, `ps_car_14` and `ps_car_13` (not shown) are influenced by outliers and will need to be treated.
cols_others_normal <- c("ps_ind_03", "ps_ind_15", "ps_reg_03","ps_car_14", "ps_car_15", "ps_calc_06", "ps_calc_10", "ps_calc_11", "ps_calc_14")
DT_train[1:100,..cols_others_normal] %>%
melt(measure.vars = 1:ncol(.)) %>%
ggplot(aes(sample = value,color = variable)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~variable, scales = "free", ncol = 3) +
theme_darklightmix(color_theme = ggplot_color_theme,legend_position = "none") +
scale_color_manual(values = fn_extraPolate(length(cols_others_normal))) +
labs(x="value"
,y="count"
,title="QQ Plots - Normality Testing"
,caption = varPlotCaption)
How about using statistical tests for normality?
- There are a couple of them: the Shapiro-Wilk and the Kolmogorov-Smirnov test.
- The null hypothesis for these tests is that the distribution is normal.
- Hence a p-value << 0.05 means we reject the null hypothesis.
- From the table below we can see that none of the features are normal.
- This might not be 100% correct, as these tests don't work well for large datasets.
- And they expect the distribution to be perfectly normal.
shap <- DT_train[,lapply(.SD,function(x) shapiro.test(sample(x,5000))$p.value)
,.SDcols = cols_others_normal] %>% melt(value.name ="Shapiro pValue")
# KS test against a normal CDF parameterised by each feature's mean and sd
ks <- DT_train[,lapply(.SD,function(x) ks.test(x,"pnorm",mean(x),sd(x))$p.value)
,.SDcols = cols_others_normal] %>% melt(value.name ="KS pValue")
batchtools::ijoin(shap,ks,by="variable") %>% fn_kable()
variable | Shapiro pValue | KS pValue |
---|---|---|
ps_ind_03 | 0 | 0 |
ps_ind_15 | 0 | 0 |
ps_reg_03 | 0 | 0 |
ps_car_14 | 0 | 0 |
ps_car_15 | 0 | 0 |
ps_calc_06 | 0 | 0 |
ps_calc_10 | 0 | 0 |
ps_calc_11 | 0 | 0 |
ps_calc_14 | 0 | 0 |
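The checklist above also mentioned the Anderson-Darling test; as a cross-check, a quick sketch (assuming the nortest package, which is not used elsewhere in this post) tells the same story:
# Anderson-Darling normality test on one candidate feature (sampled to keep it quick)
nortest::ad.test(sample(DT_train$ps_calc_10, 5000))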
Correlation
- I have used the R `cor` function to see if there is correlation among the features, and it does not look like there is much.
- The exception is `ps_ind_12_bin` and `ps_ind_14`, which have a high correlation coefficient of 0.89.
- This gives a nice overall picture of the feature space, even though it violates some of the assumptions of Pearson correlation, i.e. normality, linearity and homoscedasticity.
correl <- cor(DT_train,method="pearson")
ggcorrplot::ggcorrplot(correl, hc.order = FALSE, type = "lower",lab=F,lab_size = 2,outline.color = "black") +
labs(title="Pearson correlation of all features"
,caption = varPlotCaption) +
theme_darklightmix(color_theme = ggplot_color_theme
,legend_position = "bottom") +
theme(axis.text.x = element_text(angle =90,hjust=1,vjust = 0.5)
,axis.title = element_blank()
,panel.grid.major =element_line(size=0.8)
,panel.grid.major.x = element_blank())
- Can the R `cor` function be used to find the correlation between different variable types, i.e. binary, continuous, etc.?
- As per this article, it looks like there are other statistical tests available for checking the correlation or the independence of features, depending on the types of the variables.
- I have used the `datapasta` package to create the table below by copy-pasting the values from Excel.
#datapasta::tribble_paste()
cor_types <- tibble::tribble(
~Test, ~Notation, ~`Variable types`, ~Assumptions,
"Pearson’s", "r", "Both continuous", "Normality, Linearity and Homoscedasticity",
"Phi", "f", "Both binary", "TBD",
"Contingency", "C", "Both categorical", "TBD",
"Point-Biserial", "rpbis", "Continuous - Binary", "No outliers, Approximately normal, equal variance for each category",
"Biserial", "rbis", "Both continuous but converted to Categorical", "TBD",
"Tetrachoric", "rt", "Both continuous but converted to Binary", "TBD",
"Eta", "h", "Both are continuous and are used to detect curvilinear relationships.", "TBD",
"Rank Order", "r", "Both are rank (ordinal)", "TBD"
)
cor_types %>% fn_kable
Test | Notation | Variable types | Assumptions |
---|---|---|---|
Pearson’s | r | Both continuous | Normality, Linearity and Homoscedasticity |
Phi | f | Both binary | TBD |
Contingency | C | Both categorical | TBD |
Point-Biserial | rpbis | Continuous - Binary | No outliers, Approximately normal, equal variance for each category |
Biserial | rbis | Both continuous but converted to Categorical | TBD |
Tetrachoric | rt | Both continuous but converted to Binary | TBD |
Eta | h | Both are continuous and are used to detect curvilinear relationships. | TBD |
Rank Order | r | Both are rank (ordinal) | TBD |
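A side note on the table above: for a continuous-binary pair, the point-biserial coefficient is numerically just Pearson's r computed with the 0/1 dummy, so `cor` already covers that row (a sketch; feature pair chosen arbitrarily):
# point-biserial correlation = Pearson's r on a continuous variable and a 0/1 dummy
cor(DT_train$ps_reg_02, DT_train$ps_ind_18_bin)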
Binary-Binary Correlation (Phi Coefficient)
- As per the above article, we can use the Phi coefficient to determine the correlation of binary features.
- Is the Pearson correlation different from the Phi coefficient? I don't think so, as I am getting the same results. In fact, for 0/1 variables Pearson's formula reduces algebraically to the Phi coefficient, so the `cor` function handles binary variables out of the box (a single-pair check follows this list).
- I have used the `PairApply` function from the `DescTools` package for pair-wise column computations.
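A quick single-pair sanity check of this equivalence on simulated 0/1 data (a sketch; uses the same psych::phi as below):
# Pearson's r and the phi coefficient coincide for 0/1 variables
set.seed(42)
x <- rbinom(1000, 1, 0.3)
y <- rbinom(1000, 1, 0.6)
cor(x, y)                             # Pearson on the raw 0/1 vectors
psych::phi(table(x, y), digits = 8)   # phi from the 2x2 table; same value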
cor_mat <- PairApply(DT_train[,..cols_bin],FUN = cor)
p1 <- ggcorrplot::ggcorrplot(cor_mat, hc.order = FALSE, type = "lower",lab=T,lab_size = 2,outline.color = "black"
,legend.title = "coefficient"
,digits = 2) +
labs(title="Correlation"
,subtitle = "High indicates stronger relationship"
,caption = varPlotCaption) +
theme_darklightmix(color_theme = ggplot_color_theme
,legend_position = "bottom") +
theme(axis.text.x = element_text(angle =90,hjust=1,vjust = 0.5)
,axis.title = element_blank()
,panel.grid.major =element_line(size=0.8)
,panel.grid.major.x = element_blank())
phi <- PairApply(DT_train[,..cols_bin],FUN = function(x,y) psych::phi(table(x,y)))
p2 <- ggcorrplot::ggcorrplot(phi, hc.order = FALSE, type = "lower",lab=T,lab_size = 2,outline.color = "black"
,legend.title = "p-value"
,digits = 2) +
labs(title="Phi Coefficient"
,subtitle = "Correlation coefficient for binary variables"
,caption = varPlotCaption) +
theme_darklightmix(color_theme = ggplot_color_theme
,legend_position = "bottom") +
theme(axis.text.x = element_text(angle =90,hjust=1,vjust = 0.5)
,axis.title = element_blank()
,panel.grid.major =element_line(size=0.8)
,panel.grid.major.x = element_blank())
gridExtra::grid.arrange(p1, p2, ncol=2)
Binary-Binary Correlation (Chi-Square test / CramerV)
- Continuing with the assumption that Pearson's correlation cannot be used for binary features, I thought of checking the results against the Chi-Square test.
- The Chi-square test is useful in two scenarios:
  - To check if the sample comes from a population (goodness-of-fit test)
  - To check if two categorical variables are independent
- A low chi-square p-value implies a relationship exists.
- A high p-value implies no evidence of a relationship.
- CramerV gives the strength of the relationship; a hand computation follows this list.
- But these results also seem to be in line with what I got from Pearson's correlation.
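Cramér's V rescales the chi-square statistic by the sample size: V = sqrt(chisq / (n * (k - 1))), where k is the smaller of the two level counts. A sketch verifying this against DescTools::CramerV on one arbitrarily chosen pair:
# hand-computed Cramér's V from the (uncorrected) chi-square statistic
x <- DT_train$ps_ind_16_bin
y <- DT_train$ps_ind_17_bin
chi2 <- unname(chisq.test(x, y, correct = FALSE)$statistic)
k <- min(length(unique(x)), length(unique(y)))
sqrt(chi2 / (length(x) * (k - 1)))
DescTools::CramerV(x, y)   # should agree with the value above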
cols_bin <- grep("bin",names(DT_train),value = T)
chiSqPValue <- PairApply(DT_train[,..cols_bin],FUN = function(x,y) chisq.test(x,y)$p.value)
p1<- ggcorrplot::ggcorrplot(chiSqPValue, hc.order = FALSE, type = "lower",lab=T,lab_size = 2,outline.color = "black"
,legend.title = "p-value"
,digits = 2) +
labs(title="ChiSq Test p-values"
,subtitle = "Low values indicates relationship exists"
,caption = varPlotCaption) +
theme_darklightmix(color_theme = ggplot_color_theme
,legend_position = "bottom") +
theme(axis.text.x = element_text(angle =90,hjust=1,vjust = 0.5)
,axis.title = element_blank()
,panel.grid.major =element_line(size=0.8)
,panel.grid.major.x = element_blank())
cramerV <- PairApply(DT_train[,..cols_bin],FUN = DescTools::CramerV)
p2 <- ggcorrplot::ggcorrplot(cramerV, hc.order = FALSE, type = "lower",lab=T,lab_size = 2,outline.color = "black"
,legend.title = "CramerV"
,digits = 2) +
labs(title="CramerV"
,subtitle = "High indicates stronger relationship"
,caption = varPlotCaption) +
theme_darklightmix(color_theme = ggplot_color_theme
,legend_position = "bottom") +
theme(axis.text.x = element_text(angle =90,hjust=1,vjust = 0.5)
,axis.title = element_blank()
,panel.grid.major =element_line(size=0.8)
,panel.grid.major.x = element_blank())
gridExtra::grid.arrange(p1, p2, ncol=2)
Continuous - Binary (Kruskal-Wallis Rank Sum Test)
- There are multiple ways in which we can test the dependence of a continuous-categorical variable pair:
  - One-way ANOVA
  - Logistic regression
  - Point-biserial correlation
  - Kruskal-Wallis rank sum test
- The Kruskal-Wallis rank sum test is non-parametric and hence can be used for non-normal distributions (unlike ANOVA, which expects the continuous variable to be normal); a single-pair sketch follows this list.
- The null hypothesis is that the samples in all groups come from the same distribution, i.e. the continuous variable does not depend on the binary one.
- Hence, a p-value << 0.05 rejects the null hypothesis and indicates dependence.
- A few combinations result in a high p-value, and those same pairs also have a near-zero Pearson correlation coefficient; both findings consistently point towards independence for those pairs.
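Before running the full grid, a single-pair sketch of the mechanics (pair chosen arbitrarily):
# Kruskal-Wallis on one continuous/binary pair: ranks the continuous values
# and asks whether the rank distribution differs between the two groups
kruskal.test(x = DT_train$ps_reg_03, g = factor(DT_train$ps_calc_20_bin))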
cols_bin <- grep("bin",names(DT_train),value = T)
cols_others <- names(DT_train)[!names(DT_train) %in% c(cols_bin,cols_cat,"id","target")]
grid<- expand.grid(cols_bin,cols_others,stringsAsFactors = F,KEEP.OUT.ATTRS = F)
out_cor <-
map2(.x=grid$Var1,.y=grid$Var2,.f = function(v1,v2){
cols <- c(v1,v2)
list(v1,v2,cor(DT_train[,..cols],use="pairwise.complete.obs")[1,2])
})
out_cor <- rbindlist(out_cor) %>% .[order(V3)]
setnames(out_cor,"V3","Correlation")
out_kruskal <-map2(.x=grid$Var1,.y=grid$Var2,.f = function(v1,v2){
cols <- c(v1,v2)
k <- kruskal.test(x=DT_train[[v2]],g=DT_train[[v1]])
list(v1,v2,statistic=k$statistic,pvalue=round(k$p.value,10))
})
out_kruskal <- rbindlist(out_kruskal) %>% .[order(-pvalue)]
setnames(out_kruskal,"statistic","KW - Statistic")
out <- batchtools::ijoin(out_cor,out_kruskal,by=c("V1","V2"))
out[order(-pvalue)][1:10] %>% fn_kable(position = "float_left",caption = "Top 10 p-value")
V1 | V2 | Correlation | KW - Statistic | pvalue |
---|---|---|---|---|
ps_calc_20_bin | ps_reg_03 | 0.0002519 | 0.0003591 | 0.9848819 |
ps_calc_20_bin | ps_calc_11 | -0.0001643 | 0.0005204 | 0.9817993 |
ps_calc_20_bin | ps_car_15 | 0.0002138 | 0.0005890 | 0.9806376 |
ps_calc_15_bin | ps_calc_02 | 0.0000345 | 0.0007271 | 0.9784874 |
ps_calc_18_bin | ps_reg_03 | 0.0002220 | 0.0007388 | 0.9783160 |
ps_calc_19_bin | ps_calc_05 | 0.0002651 | 0.0015041 | 0.9690632 |
ps_ind_12_bin | ps_calc_11 | 0.0001290 | 0.0015448 | 0.9686480 |
ps_ind_09_bin | ps_calc_09 | 0.0000841 | 0.0016159 | 0.9679351 |
ps_ind_13_bin | ps_calc_11 | -0.0001858 | 0.0016465 | 0.9676327 |
ps_calc_19_bin | ps_car_12 | -0.0003419 | 0.0021382 | 0.9631184 |
p1 <- ggplot(out,aes(x=Correlation,y=`KW - Statistic`)) +
geom_point() +
coord_cartesian(ylim = c(0,11000)) +
theme_darklightmix(color_theme = ggplot_color_theme,legend_position = "none") +
labs(subtitle="KW Statistics vs Correlation Coefficient"
,caption = varPlotCaption)
p2 <- ggplot(out,aes(x=Correlation,y=pvalue)) +
geom_point(position = "jitter") +
theme_darklightmix(color_theme = ggplot_color_theme,legend_position = "none") +
labs(subtitle="pValue of KW vs Correlation Coefficient"
,caption = varPlotCaption)
gridExtra::grid.arrange(p1, p2, ncol=2)
This post has already crossed a 10-minute read, so I will stop here and write another post to continue with the EDA.
Summary / Key Learnings
- The features in the dataset were encoded and their meanings were not provided. Hence, I have not attempted to relate the insights to the real-world business.
- The objective was to run through the R code and further my understanding of key statistical concepts.
- New packages and functions that I learnt while writing this post:
  - The `na.tools` package for basic missing value imputation
  - In the histogram plot, I used the special variable `..count..` to show the count of the respective features
    - As per the documentation, these are special variables computed from the aesthetics
    - An alternative way of using them is by calling them within `ggplot2::stat(count)`: `ggplot(aes(x=value,y=..count../1e3,fill=variable))`
- Ran the Shapiro-Wilk (`shapiro.test`) and Kolmogorov-Smirnov (`ks.test`) tests for normality
  - The Shapiro-Wilk test works only for sample sizes < 5000
  - Both tests expect the distribution to be perfectly normal
- The results of Pearson's correlation (`cor`) were not different from the results of some other tests
  - The Phi coefficient is used for binary-binary variables: `psych::phi(table(x,y))`
  - The Kruskal-Wallis test is used for categorical-continuous variables, but I was not able to come to a conclusion and will need further reading
- References/Further Reading
Low variance features
This was a quick overview of the feature set. The next step is to see how these features are correlated.
But before that, I will remove the features that show low variance. For example, `ps_ind_10_bin` has almost exclusively 0 values and will not have much predictive value. We can use the `var` function in R to find these features, but it is always good practice to look at the plots.
DT_train[,lapply(.SD,var)] %>% melt() %>% .[order(value)] %>% .[1:10] %>%
kable(format.args = list(decimal.mark = '.', big.mark = ",")) %>%
kable_styling(bootstrap_options = "condensed"
,full_width = FALSE
,position = "left"
,font_size = 10)
lowVarianceFeatures <- c("ps_ind_10_bin","ps_ind_13_bin","ps_ind_11_bin","ps_car_10_cat","ps_ind_12_bin")
DT_train[,(lowVarianceFeatures) := NULL]
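An alternative to eyeballing variances (a sketch, assuming the caret package, which is not used elsewhere in this post) is caret::nearZeroVar, which flags features by frequency ratio and percentage of unique values:
# flag near-zero-variance features; saveMetrics returns the diagnostics table
nzv <- caret::nearZeroVar(DT_train, saveMetrics = TRUE)
nzv[nzv$nzv, ]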
#Bin-Bin
cor(DT_train$ps_ind_18_bin,DT_train$ps_calc_15_bin)
CramerV(factor(DT_train$ps_ind_18_bin),factor(DT_train$ps_calc_15_bin))
Phi(DT_train$ps_ind_18_bin,DT_train$ps_calc_15_bin)
#Bin-Continuous
cor(DT_train$ps_ind_18_bin,DT_train$ps_calc_14)
summary(aov(ps_calc_14~ps_ind_18_bin,data=DT_train))
var(DT_train$ps_calc_14)
aggregate(formula = ps_calc_14~ps_ind_18_bin,data=DT_train, FUN = var)
kruskal.test(DT_train$ps_calc_14,DT_train$ps_ind_18_bin)
#Cat-Continuous
cor(DT_train$ps_car_02_cat,DT_train$ps_car_13,use="pairwise.complete.obs")
summary(aov(ps_car_13~ps_car_02_cat,data=DT_train))
var(DT_train$ps_car_13)
aggregate(formula = ps_car_13~ps_car_02_cat,data=DT_train, FUN = var)
K <- kruskal.test(DT_train$ps_car_13,DT_train$ps_car_02_cat)
cols_bin <- grep("bin",names(DT_train),value = T)
cols_cat <- grep("cat",names(DT_train),value = T)
cols_others <- names(DT_train)[!names(DT_train) %in% c(cols_bin,cols_cat,"id","target")]
correl <- cor(DT_train[,..cols_bin])
ggcorrplot::ggcorrplot(correl, hc.order = FALSE, type = "lower",lab=TRUE,lab_size = 2) +
ggtitle("bin cor")
cols_cat <- grep("cat",names(DT_train),value = T)
correl <- cor(DT_train[,..cols_cat] )
ggcorrplot::ggcorrplot(correl, hc.order = FALSE, type = "lower",lab=TRUE,lab_size = 2) +
ggtitle("cat cor")
cols_others <- names(DT_train)[!names(DT_train) %in% c(cols_bin,cols_cat,"id","target")]
cols_others <- c(cols_others,"ps_car_11_cat")
correl <- cor(DT_train[,..cols_others],use = "complete.obs")
ggcorrplot::ggcorrplot(correl, hc.order = FALSE, type = "lower",lab=TRUE,lab_size = 2) +
ggtitle("others cor")