Association and Correlation Hypothesis Tests

Let's conduct some association and correlation tests by forming relevant hypotheses from our village data.

Let's download it again. Alternatively, you can use read.csv to import it from your computer.

download.file(url="https://dataverse.harvard.edu/api/access/datafile/4805891",destfile = "df.csv",cacheOK=TRUE)#Download the file and save it as "df.csv"

village<-read.csv("df.csv",header=TRUE,sep="\t",stringsAsFactors = FALSE)#Import the tab-separated file into your RStudio environment

Association and correlation

Association

From the above data we can test the following hypothesis:

\[H_0:\ \text{Land ownership and gender are independent}\] In other words, no relationship exists between land ownership and gender.

chisq.test(village$Land_own, village$Gender)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  village$Land_own and village$Gender
## X-squared = 1.6876, df = 1, p-value = 0.1939
#store the result in an object so its parts can be reused
asso<-chisq.test(village$Land_own, village$Gender)

Notice the type of object we just used to store the chi-square result: it is a list (of class "htest"), so its components can be extracted with $.

#Let's explore the items in the list
asso$observed
##                  village$Gender
## village$Land_own  Female Male
##   Marginal farmer     15   44
##   Small farmer         3   25
asso$expected
##                  village$Gender
## village$Land_own     Female    Male
##   Marginal farmer 12.206897 46.7931
##   Small farmer     5.793103 22.2069
asso$p.value
## [1] 0.1939156
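
To see everything the stored object holds, here is a minimal sketch that rebuilds the test from the observed counts printed above instead of the downloaded file. The names counts and res are made up for this illustration, and the "expected counts of at least 5" check is a commonly quoted rule of thumb, not something the original analysis states.

```r
# Observed counts copied from the asso$observed output above
counts <- matrix(c(15, 3, 44, 25), nrow = 2,
                 dimnames = list(Land_own = c("Marginal farmer", "Small farmer"),
                                 Gender   = c("Female", "Male")))

# For a 2 x 2 table chisq.test applies Yates' continuity correction by default
res <- chisq.test(counts)

class(res)      # class of the stored object
is.list(res)    # the object is a list underneath
names(res)      # every component that can be pulled out with $

# Rule-of-thumb check (an assumption of this sketch): all expected counts >= 5
all(res$expected >= 5)
```

Running names() on the stored result is a quick way to find out which pieces are available before deciding what to export.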

We want to extract the observed and expected value tables from the object and store them in a meaningful, presentable way.

#extract the observed values
chi_obs<-asso$observed
chi_obs<-data.frame(chi_obs)
write.csv(chi_obs, "chi_result.csv")
#extract the expected values
chi_exp<-asso$expected
chi_exp<-data.frame(chi_exp)
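
If you would rather have the observed and expected counts side by side in one file, the following sketch shows one possible layout. It again uses the counts printed above; the names counts, res, chi_table and out_file are hypothetical, and the file is written to a temporary folder so nothing in your working directory is overwritten.

```r
# Observed counts copied from the output above
counts <- matrix(c(15, 3, 44, 25), nrow = 2,
                 dimnames = list(Land_own = c("Marginal farmer", "Small farmer"),
                                 Gender   = c("Female", "Male")))
res <- chisq.test(counts)

# One row per cell of the table, with the count in a column called "observed"
chi_table <- as.data.frame(as.table(res$observed), responseName = "observed")

# Both objects are laid out column by column, so the cell order matches
chi_table$expected <- as.vector(res$expected)

# Write to a temporary location (an assumption of this sketch)
out_file <- file.path(tempdir(), "chi_table.csv")
write.csv(chi_table, out_file, row.names = FALSE)
chi_table
```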

Correlation

\[H_0:\ \text{Education and income are not correlated}\]

Education<-village$Edu_highest
Income<-village$Income

cor.test(Education, Income)
## 
##  Pearson's product-moment correlation
## 
## data:  Education and Income
## t = 2.3767, df = 85, p-value = 0.01972
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04113943 0.43727927
## sample estimates:
##       cor 
## 0.2496245
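
To see where the t value and p-value come from, the sketch below recomputes them from the printed correlation using the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2). Only the numbers shown in the output above are used; the variable names are made up for this illustration.

```r
r  <- 0.2496245   # sample correlation from the output above
df <- 85          # degrees of freedom from the output above (n - 2, so n = 87)

# t statistic implied by r and the degrees of freedom
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

round(t_stat, 4)
round(p_val, 5)
```

Both values should agree with the t and p-value that cor.test printed.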

Is our result reliable? Let's check the assumptions of the Pearson correlation test and learn how the distribution matters.

According to the cor.test documentation, the test statistic follows a t distribution with length(x)-2 degrees of freedom if the samples follow independent normal distributions.

#check our distribution
hist(village$Income)
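
A histogram is a visual check. As a complement, the sketch below runs a Shapiro-Wilk test and a Spearman rank correlation, which does not rely on the normality assumption. Neither technique is part of the original analysis, and the data here are simulated (education and income are hypothetical vectors, not the village columns), so treat it only as an illustration of the calls.

```r
set.seed(42)                                   # reproducible simulated data
n <- 87                                        # same sample size as above (df = 85)
education <- sample(0:15, n, replace = TRUE)   # hypothetical years of schooling
income <- rlnorm(n, meanlog = 9, sdlog = 1)    # hypothetical right-skewed income

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(income)
sw$p.value

# Spearman's rank correlation; exact = FALSE because education has ties
sp <- cor.test(education, income, method = "spearman", exact = FALSE)
sp$estimate
sp$p.value
```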

Let's bring the distribution of income closer to normal by taking its natural log.

HH_income<-log(village$Income)
hist(HH_income)

cor.test(Education, HH_income)#Not the expected result
## 
##  Pearson's product-moment correlation
## 
## data:  Education and HH_income
## t = NaN, df = 85, p-value = NA
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  NaN NaN
## sample estimates:
## cor 
## NaN

We get NaN results because of an infinite value in our data. A quick look at HH_income shows a “-Inf” value in the 24th row.

One respondent reported an income of zero, and the log of zero is negative infinity in R. With an infinite value in the data the correlation cannot be computed, hence the NaN results.
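
The same behaviour can be reproduced on a few made-up numbers. In this sketch the vectors income and education are hypothetical; it shows how a single zero turns into -Inf and how to find its position.

```r
income <- c(1200, 0, 3500, 800)     # hypothetical incomes, one of them zero
education <- c(5, 8, 10, 12)        # hypothetical years of schooling

log_income <- log(income)
log_income[2]                       # the zero income after the log transformation

# Position(s) of values that are not finite (Inf, -Inf, NaN or NA)
bad_rows <- which(!is.finite(log_income))
bad_rows

# The correlation is not a usable number while the infinite value is present
r_bad <- cor(education, log_income)
r_bad
```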

Let's eliminate the -Inf value, either by dropping that respondent from the analysis or by replacing the original zero with “1”, so that its log becomes zero in our final correlation analysis. Here we take the second approach.

village$Income[village$Income==0]<-1
HH_income<-log(village$Income)
#hist(HH_income)
#store the result under a name that does not mask the base function cor()
cor_res<-cor.test(Education, HH_income)
#convert the list object to data frame (table)
cor_table<-data.frame(statistic=cor_res$statistic)
cor_table$p_value<-cor_res$p.value
cor_table$estimate<-cor_res$estimate
#we do not want 't' in row name while exporting the table
write.csv(cor_table, "cor_result.csv", row.names=FALSE)
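
The code above takes the second route, replacing zero with 1. The first route, dropping the zero-income respondent, could look like the sketch below. The data are hypothetical, and keep and res_drop are names invented for this example. Whichever route you choose, report it, because dropping a respondent changes the sample size.

```r
education <- c(0, 5, 8, 10, 12, 15)               # hypothetical years of schooling
income <- c(0, 1200, 3000, 2500, 6000, 9000)      # hypothetical incomes, one zero

# Keep only respondents with a positive income so that log() stays finite
keep <- income > 0

# Apply the same filter to both variables so the pairs stay aligned
res_drop <- cor.test(education[keep], log(income[keep]))

sum(keep)            # number of respondents left in the analysis
res_drop$parameter   # degrees of freedom, now based on the reduced sample
res_drop$estimate
```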

To know more about R, please refer to https://cran.r-project.org/doc/manuals/r-release/R-intro.html