Let's make sure we have our village data file. You can load the file from your computer.
For easier reproducibility, I am downloading the file from the website again.
download.file(url="https://dataverse.harvard.edu/api/access/datafile/4805891",destfile = "df.csv",cacheOK=TRUE)#Download the file and save it as "df.csv"
village<-read.csv("df.csv",header=TRUE,sep="\t",stringsAsFactors = FALSE)#Import the tab-separated file into your R session
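After importing, it helps to confirm that the tab separator was picked up and that each column has the type you expect. The sketch below is only an illustration: `raw_text` and `demo` are hypothetical names, and the three rows are made up in the same tab-separated layout rather than taken from the village file.

```r
# Hypothetical three-row extract; the values are invented for illustration
raw_text <- "Gender\tIncome\tHH_size\nMale\t6300\t6\nFemale\t3500\t4\nMale\t9900\t7"

# read.csv() can read from a string through its text argument
demo <- read.csv(text = raw_text, header = TRUE, sep = "\t",
                 stringsAsFactors = FALSE)

print(dim(demo))   # number of rows and columns
str(demo)          # column names and types
print(head(demo))  # first few rows
```

The same three checks (`dim`, `str`, `head`) can be run on `village` right after the import.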
Let's use some basic statistical summary operations.
mean(village$Income)
## [1] 7901.724
median(village$Income)
## [1] 6300
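The mean above is noticeably larger than the median, which usually points to a few large values pulling the mean upward. A minimal sketch with made-up numbers (the vector `income_demo` is hypothetical, not the village data) shows this effect, and also how `na.rm = TRUE` deals with missing values:

```r
# Hypothetical incomes, not taken from the village data
income_demo <- c(3000, 4000, 5000, 6000, 52000)

mean_demo   <- mean(income_demo)    # pulled upward by the single large value
median_demo <- median(income_demo)  # middle value, unaffected by the large value

# With a missing value, mean() returns NA unless na.rm = TRUE is set
income_na  <- c(income_demo, NA)
mean_na    <- mean(income_na)
mean_na_rm <- mean(income_na, na.rm = TRUE)

print(c(mean = mean_demo, median = median_demo))
print(c(with_NA = mean_na, NA_removed = mean_na_rm))
```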
Freq<-table(village$Gender)
#Not readable? Let's convert it into a data frame
freq_table<-data.frame(Freq)#notice data.frame function
names(freq_table)[names(freq_table) == 'Var1'] <- 'Gender'#Recall renaming
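The conversion and renaming can also be combined with shares of the total. This is one possible sketch, not the only way to do it; `gender_demo` and `freq_demo` are hypothetical names and the values are invented:

```r
# Hypothetical gender column, not the village data
gender_demo <- c("Male", "Female", "Male", "Male", "Female")

counts <- table(gender_demo)

# as.data.frame() on a table lets you name the count column via responseName
freq_demo <- as.data.frame(counts, responseName = "Count",
                           stringsAsFactors = FALSE)
names(freq_demo)[1] <- "Gender"

# prop.table() converts the counts into shares of the total
freq_demo$Share <- as.numeric(prop.table(counts))

print(freq_demo)
```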
#Let's look at a two-variable table
table(village$Land_own, village$Gender)
##
##                   Female Male
##   Marginal farmer     15   44
##   Small farmer         3   25
freq_table2<-table(village$Land_own, village$Gender)
freq_table2<-data.frame(freq_table2)
names(freq_table2)[names(freq_table2) == 'Var1'] <- 'Land ownership'
names(freq_table2)[names(freq_table2) == 'Var2']<- 'Gender'
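Totals and within-row shares often make a two-way table easier to read. The sketch below rebuilds the table from the four counts printed above, so it runs without the data file; the object names `tab`, `with_totals` and `row_shares` are only illustrative.

```r
# Counts copied from the two-way table printed above
tab <- as.table(matrix(c(15, 3, 44, 25), nrow = 2,
                       dimnames = list(Land_own = c("Marginal farmer", "Small farmer"),
                                       Gender   = c("Female", "Male"))))

# addmargins() appends row and column totals
with_totals <- addmargins(tab)

# prop.table(margin = 1) gives shares within each land-ownership class
row_shares <- round(prop.table(tab, margin = 1), 3)

print(with_totals)
print(row_shares)
```

With the real data, the same two calls can be applied directly to `table(village$Land_own, village$Gender)`.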
#Using Summary
summary(village$Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0    3500    6300    7902    9900   55000
summary(village$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   30.00   41.00   55.00   53.09   62.00   85.00
summary(village$HH_size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   4.000   6.000   6.241   7.000  21.000
#Try summary with a qualitative variable
summary(village$Gender)
##    Length     Class      Mode
##        87 character character
#Not the expected result! Takeaway: for a character variable, summary only reports length, class and mode. It is most informative for quantitative variables.
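If you do want counts for a qualitative variable, one option is to convert it to a factor first, or to use `table()` as earlier. A small sketch with invented values (`gender_demo` is hypothetical, not the village data):

```r
# Hypothetical gender values, not the village data
gender_demo <- c("Male", "Female", "Male", "Male")

# On a character vector, summary() reports only Length, Class and Mode
print(summary(gender_demo))

# On a factor, summary() returns one count per level
gender_counts <- summary(factor(gender_demo))
print(gender_counts)

# table() gives the same counts without converting first
print(table(gender_demo))
```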
HH_size<-village$HH_size
Income<-village$Income
plot(HH_size, Income)
#change the plot type
plot(HH_size, Income, type="l")#a line plot does not make sense here; try type="p" for points
#Add color
plot(HH_size, Income, type="p", col="blue")
#customize scale
plot(HH_size, Income, type="p", col="blue", log="y")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot
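The warning appears because the smallest income is 0 (see the summary above), and zero cannot be placed on a logarithmic axis. One way to handle this, sketched here with invented values (`hh_demo`, `income_demo` and `keep` are hypothetical names), is to plot only the positive incomes. Whether dropping the zero is appropriate depends on your analysis.

```r
# Hypothetical values, not the village data; one income is zero
hh_demo     <- c(2, 4, 6, 7, 9)
income_demo <- c(0, 3500, 6300, 9900, 55000)

# log(0) is -Inf, so zero incomes cannot be drawn on a log axis
keep <- income_demo > 0

# Plot only the positive incomes; no warning is raised
plot(hh_demo[keep], income_demo[keep], type = "p", col = "blue", log = "y",
     xlab = "HH_size", ylab = "Income (log scale)")
```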
Explore more about “plot” by searching for it in the Help pane, or by running ?plot in the console.
Let's try a bar plot with categorical variables.
barplot(table(village$Land_own, village$Gender))#by default it is stacked
#change the appearance
barplot(table(village$Land_own, village$Gender),beside=T)
#add colour
barplot(table(village$Land_own, village$Gender),beside=T,col=c("green","blue"))
#Adding labels (the bar heights are counts, so the y-axis is labelled "Count")
barplot(table(village$Land_own, village$Gender),beside=T,col=c("green","blue"), ylab="Count", xlab="Gender")
#Adding legend
barplot(table(village$Land_own, village$Gender),beside=T,col=c("green","blue"), ylab="Count", xlab="Gender",legend.text = T,args.legend =list(x="bottomright"))
#Add title
barplot(table(village$Land_own, village$Gender),beside=T,col=c("green","blue"), ylab="Count", xlab="Gender",legend.text = T,args.legend =list(x="bottomright"),main="Land Ownership")
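The calls above rebuild the same table each time. A possible tidier variant stores the table once and uses the bar positions that `barplot()` returns to print each count above its bar. This sketch rebuilds the table from the counts printed earlier so that it runs on its own; the `"topleft"` legend position, the `ylim` value and the names `tab`, `bar_pos` and `counts` are my own choices, not part of the original example.

```r
# Counts copied from the two-way table printed earlier in this section
tab <- as.table(matrix(c(15, 3, 44, 25), nrow = 2,
                       dimnames = list(c("Marginal farmer", "Small farmer"),
                                       c("Female", "Male"))))

# barplot() returns the x positions of the bars (one column per gender)
bar_pos <- barplot(tab, beside = TRUE, col = c("green", "blue"),
                   xlab = "Gender", ylab = "Count", ylim = c(0, 50),
                   legend.text = TRUE, args.legend = list(x = "topleft"),
                   main = "Land Ownership")

# Write each count just above its bar
counts <- as.vector(tab)
text(x = as.vector(bar_pos), y = counts, labels = counts, pos = 3)
```

With the real data, `tab <- table(village$Land_own, village$Gender)` can replace the first statement.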
You can do more visualization and customization with the “ggplot2” package. It is the successor to the earlier “ggplot” package.
You can now go to the next section of the tutorial, on association and correlation.
You can go back to the previous section, on handling your data.
Happy learning!