The Power of R
My second blog post is about R and the power of free statistical software. To be honest, I am brand new to RStudio; I have only recently finished the RStudio training on codeschool.com. At first I was a bit sceptical, but I really enjoyed it. At first glance R looks easy: the syntax feels natural and can be mastered with more practice.
For this assignment, I decided to produce a graph of how movie genres change over time. I am sure that with the right graph we can find real variation between the various movie genres. I got the data set from http://grouplens.org/datasets/movielens/. The data set needed some tidying up (extracting the release year, separating the genres for each movie, etc.). Data scrubbing is one of the most important parts of any project: the data must be in a uniform state and consistent across the whole data set. In my case, some of the release years were missing. The biggest problem I had was that all of a movie's genres were listed on one line:
|6365||Matrix Reloaded, The (2003)||Action|Adventure|Sci-Fi|Thriller|IMAX|
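I did this tidying in Excel, but as a rough sketch the same two steps (pulling the year out of the title and splitting the genres) could also be done in R. The column names movieId and title, the second record and the regular expression are assumptions for illustration, not part of my actual workflow; Year_ and genres match the column names used in the plotting code later in this post.

```r
# Two raw records: the first is the example row from this post, the
# second is a made-up record without a year (hypothetical, for illustration).
raw <- data.frame(
  movieId = c(6365, 99999),
  title   = c("Matrix Reloaded, The (2003)", "Some Movie Without A Year"),
  genres  = c("Action|Adventure|Sci-Fi|Thriller|IMAX", "Drama"),
  stringsAsFactors = FALSE
)

# Assumption: a title ends in a four-digit year in parentheses.
year_pattern <- "[[:space:]]*\\(([0-9]{4})\\)[[:space:]]*$"
has_year <- grepl(year_pattern, raw$title)

# Pull out the year where one exists; leave NA where it is missing.
year_text <- sub(paste0("^.*", year_pattern), "\\1", raw$title)
year_text[!has_year] <- NA
raw$Year_ <- as.integer(year_text)

# Strip the year from the title now that it has its own column.
raw$title <- sub(year_pattern, "", raw$title)

# Split the pipe-separated genres and build one row per movie/genre pair.
genre_list <- strsplit(raw$genres, "|", fixed = TRUE)
n_genres   <- lengths(genre_list)
movies <- data.frame(
  title  = rep(raw$title, n_genres),
  Year_  = rep(raw$Year_, n_genres),
  genres = unlist(genre_list),
  stringsAsFactors = FALSE
)
print(movies)
```

Keeping the missing years as NA rather than guessing them makes it easy to count or drop those rows later.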
Once I had finished working in Excel, the data had the following form.
I saved this spreadsheet as a .csv file and imported it into RStudio. After some research and many failed attempts, I finally got a result: a chart displaying movie releases in 1990-2000.
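For anyone following along, the import step might look roughly like this. It is only a sketch: the column names title, Year_ and genres are assumed from the plotting code, and the file is a small temporary stand-in rather than my real export.

```r
# Write a tiny stand-in CSV so the sketch runs on its own; in practice
# csv_path would point at the file exported from the spreadsheet.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  'title,Year_,genres',
  '"Matrix Reloaded, The",2003,Action',
  '"Matrix Reloaded, The",2003,Sci-Fi',
  '"Matrix Reloaded, The",2003,Thriller'
), csv_path)

# Read the file into a data frame, keeping text columns as plain strings.
movies <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick checks: column types, and the genre-by-year counts that a
# bar chart of releases would be drawn from.
str(movies)
counts <- table(movies$genres, movies$Year_)
print(counts)
```

Looking at the table of counts before plotting is a cheap way to spot typos in genre names or unexpected years.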
All this was achieved with simple code:
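# Grouped bars from a pre-built count matrix `sci`, with one legend entry per row
barplot(sci, main = "Movie distribution", xlab = "Xlabel", col = c("lightblue", "red"), legend.text = rownames(sci), beside = TRUE)
# Stacked bars: one bar per year, one colour per genre
barplot(table(movies$genres, movies$Year_), col = rainbow(19), main = "Movie releases by genre 2000 – 2010")
# table() orders genres alphabetically, so the labels are sorted to match the colours; "topright" is an arbitrary position
legend("topright", legend = sort(unique(movies$genres)), fill = rainbow(19))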
I have only used R for a few days, but from my experience I have two observations:
- R is very powerful, with a lot of built-in features and great community support.
- The software has almost no GUI; everything is code-based. This is by no means a bad thing, but for users with no coding experience it adds unnecessary complexity.