Data Summary

Summarizing your Data

Let's make sure we have the village data file. You can load the file from your computer.

For easier reproducibility, I am again downloading the file from the website.

download.file(url = "https://dataverse.harvard.edu/api/access/datafile/4805891", destfile = "df.csv", cacheOK = TRUE) # Download the file and save it as "df.csv"

village <- read.csv("df.csv", header = TRUE, sep = "\t", stringsAsFactors = FALSE) # Import the tab-separated file into your R session
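Note that the file is saved under a .csv name but is read with sep="\t", that is, as a tab-separated file. The following is a minimal sketch of why the separator matters. It uses a tiny made-up file written to a temporary location; the values and the names tmp, ok and wrong are illustrative assumptions, not part of the village data.

```r
# Write a tiny tab-separated file to a temporary location (illustrative values only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Gender\tIncome", "Male\t6300", "Female\t3500"), tmp)

# Reading with the tab separator gives one column per field
ok <- read.csv(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Reading with the default comma separator collapses each row into one column
wrong <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)

str(ok)      # structure of the correctly parsed data
ncol(wrong)  # number of columns when the separator is wrong
```

If your own import shows a single wide column, the separator is the first thing to check.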

Let's use some basic statistical summary operations.

mean(village$Income)

## [1] 7901.724

median(village$Income)

## [1] 6300
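The mean (about 7902) is well above the median (6300), which usually means that a few large incomes pull the mean upward. Here is a small sketch of that effect, and of how na.rm = TRUE handles missing values. The vector below is made up for illustration and is not the village data.

```r
# Made-up incomes with one very large value and one missing value (illustration only)
income <- c(0, 3500, 6300, 9900, 55000, NA)

mean(income)                  # NA, because one value is missing
mean(income, na.rm = TRUE)    # mean of the five observed values
median(income, na.rm = TRUE)  # middle observed value, less affected by the large income
```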

Freq <- table(village$Gender)
# Not readable? Let's change it into a data frame
freq_table <- data.frame(Freq) # notice the data.frame function
names(freq_table)[names(freq_table) == 'Var1'] <- 'Gender' # recall renaming
# Let's see a two-variable table
table(village$Land_own, village$Gender)

##                  
##                   Female Male
##   Marginal farmer     15   44
##   Small farmer         3   25

freq_table2 <- table(village$Land_own, village$Gender)
freq_table2 <- data.frame(freq_table2)
names(freq_table2)[names(freq_table2) == 'Var1'] <- 'Land ownership'
names(freq_table2)[names(freq_table2) == 'Var2'] <- 'Gender'
# Using summary
summary(village$Income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3500    6300    7902    9900   55000

summary(village$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   41.00   55.00   53.09   62.00   85.00

summary(village$HH_size)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   6.000   6.241   7.000  21.000

# Try summary with a qualitative variable
summary(village$Gender)

##    Length     Class      Mode 
##        87 character character

# Not the expected result! Takeaway: summary gives a numeric summary only for quantitative variables; for a character variable it reports just the length, class and mode.
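To get counts per category from a qualitative variable, you can convert it to a factor before calling summary, or use table as before. A minimal sketch with a made-up vector (the values are illustrative, not the village data):

```r
# Made-up gender values (illustration only)
gender <- c("Male", "Female", "Male", "Male")

summary(gender)          # only length, class and mode for a character vector
summary(factor(gender))  # counts per category once it is a factor
table(gender)            # the same counts without converting
```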

Visualizing your results

HH_size<-village$HH_size
Income<-village$Income
plot(HH_size, Income)

#change data point type
plot(HH_size, Income, type="l")#does not make sense? try type="p"

#Add color
plot(HH_size, Income, type="p", col="blue")

#customize scale
plot(HH_size, Income, type="p", col="blue", log="y")

## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot
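The warning appears because the logarithm of zero is undefined, and the minimum income in the data is 0, so that point is dropped from the plot. One way to make the omission explicit is to filter the data yourself before plotting. The sketch below uses made-up vectors (hh_size, income and keep are illustrative names and values, not the village data).

```r
# Made-up household sizes and incomes, including one zero income (illustration only)
hh_size <- c(2, 4, 6, 7, 21)
income  <- c(0, 3500, 6300, 9900, 55000)

# Keep only observations with a strictly positive income
keep <- income > 0
sum(!keep)  # number of points that a log scale cannot show

plot(hh_size[keep], income[keep], type = "p", col = "blue", log = "y",
     xlab = "Household size", ylab = "Income (log scale)")
```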

Explore more about “plot” in the Help search bar, or run ?plot in the console.

Let's try a bar plot with categorical variables.

barplot(table(village$Land_own, village$Gender)) # by default it is stacked

# Change the appearance
barplot(table(village$Land_own, village$Gender), beside = TRUE)

# Add colour
barplot(table(village$Land_own, village$Gender), beside = TRUE, col = c("green", "blue"))

# Adding labels (the y-axis shows counts; land ownership is shown by the bar colours)
barplot(table(village$Land_own, village$Gender), beside = TRUE, col = c("green", "blue"), ylab = "Count", xlab = "Gender")

# Adding legend
barplot(table(village$Land_own, village$Gender), beside = TRUE, col = c("green", "blue"), ylab = "Count", xlab = "Gender", legend.text = TRUE, args.legend = list(x = "bottomright"))

# Add title
barplot(table(village$Land_own, village$Gender), beside = TRUE, col = c("green", "blue"), ylab = "Count", xlab = "Gender", legend.text = TRUE, args.legend = list(x = "bottomright"), main = "Land Ownership")

You can do more visualization and customization with the package “ggplot2”. It is the successor to the original “ggplot” package.
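As a rough sketch of what the grouped bar plot might look like in ggplot2, the example below uses the counts from the two-way table shown earlier, entered by hand in long format. The data frame name counts and the axis labels are assumptions for illustration, and the plot is drawn only if ggplot2 is installed.

```r
# Counts taken from the two-way table shown earlier, in long format
counts <- data.frame(
  Land_own = c("Marginal farmer", "Small farmer", "Marginal farmer", "Small farmer"),
  Gender   = c("Female", "Female", "Male", "Male"),
  Freq     = c(15, 3, 44, 25)
)

# Draw the plot only if ggplot2 is available; install it with install.packages("ggplot2")
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = Gender, y = Freq, fill = Land_own)) +
    geom_col(position = "dodge") +  # side-by-side bars, like beside = TRUE in barplot
    labs(title = "Land Ownership", x = "Gender", y = "Count", fill = "Land ownership")
  print(p)
}
```

With the real data, as.data.frame(table(village$Land_own, village$Gender)) gives a similar long-format table, with columns named Var1, Var2 and Freq.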

You can now go to the next section of the tutorial, on association and correlation.

Association and Correlation testing

You can go back to the previous section about handling your data.

data handling

Happy learning!