Hadoop – Parallel computing and distributed data storage

Hi,

For my next blog post I will write about Hadoop: what it is, why it is used by many organizations, and why it is becoming the standard in data storage and processing.

Hadoop has many advantages over traditional database computing platforms: it is very scalable and offers rapid data analysis.

The way I understand Hadoop, it offers decentralized storage and computing power to organizations. With Hadoop, organizations can spread computing tasks among many nodes in a cluster. Each task is divided into smaller, more manageable pieces, and each piece is completed by a node in the cluster where the relevant data is stored. This offers increased performance. Hadoop also has HDFS (Hadoop Distributed File System), which allows the data to be stored across multiple nodes for better file read and write speeds. It also has data redundancy built in: by default it creates 2 additional copies of the data on separate nodes in the cluster. When the data needs to be processed, it can be processed from any of the three copies, and the changes are synced back across them.
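
To make the "divide the task into smaller pieces" idea a bit more concrete, here is a minimal Python sketch of the map and reduce pattern. It is not Hadoop code and does not use any Hadoop API: the input lines are made up, the function names are my own, and the "nodes" are simply function calls on one machine.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(records, block_size):
    """Split the input into fixed-size blocks, as a stand-in for HDFS blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_block(block):
    """Work done by one node: count the words in its own block only."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Combine the partial results produced by two nodes."""
    return left + right


# Hypothetical input data, made up for illustration
lines = [
    "big data needs storage",
    "hadoop offers storage",
    "hadoop offers computing",
    "big data needs computing",
]

blocks = split_into_blocks(lines, block_size=2)
partials = [map_block(block) for block in blocks]  # each call could run on a separate node
totals = reduce(reduce_counts, partials, Counter())
print(totals.most_common(3))
```

Each block is processed independently, so adding more nodes lets more blocks be processed at the same time. Only the small partial results need to be combined at the end.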

More to be updated…

 

Big Data – new data era

Hello dear Blog followers,

Since you are reading my fourth blog post, it means you are liking my posts and hopefully learning something new. This is what blogging is all about: writing about stuff that matters and that readers find useful.

Today's blog post will be about Big Data. If you are interested in data analytics, I am sure you have come across this term quite often. To fully understand what “Big Data” is, we need to look at its three main aspects, which are called the 3Vs.

Volume – The sheer size of the data that is available and needs to be stored and analyzed.

Variety – The data can be harnessed from many sources: sensors, the web, Facebook, audio/video files, transaction applications.

Velocity – The rate at which the data is generated.

Big Data enables organizations to better understand their customers and their needs, build better services and products, and gain a competitive advantage.

For many years people relied on traditional, structured relational databases, but since then a new era has emerged: Big Data.

It is hard to describe what Big Data is. If you ask 100 professionals, many will give different definitions, but in general all will agree on the main 3Vs.

The way I understand Big Data, it is a mixture of internal (company) data and external data. The internal data is structured, normalized and optimized, while the external data is “chaotic”: it is massive (Volume), varied (Variety) and never-ending (Velocity). The combination of the two is Big Data.

One of the main challenges of Big Data is the storage and computing requirements. Traditional database mechanisms are not well suited to Big Data. One of the solutions is Hadoop. I will cover it in my next blog post.

Thanks again for reading my blog.

Evaldas

 

Excel – analytic platform

Hello,

My dear readers, for my third blog post I will write about Excel and why I think it is one of the best and most versatile analytic tools available. I know what some of you will be thinking: Excel is just good for spreadsheets and quick calculations. But it is so much more.

I have been an Excel power user for a number of years. In my professional life I use it every day and I still learn something new. When I was given the CA1 and CA2 assignments, I did most of the work (data extraction and validation) in Excel. Only the end result was produced with a different software package: for CA1 it was Google Fusion Tables and for CA2 it was R-Studio.

For CA1 a similar result could be achieved in Excel. In Excel the user can load additional add-ins, also referred to as Apps for Office. For presenting the data on a map I used the app “Bing Maps”. It is straightforward to use: just load the map reference data into the app and it will produce the report.

Excel map report

The result is not exactly the same, but if the aim of the assignment is to represent the population of each town in Ireland, Excel is more than capable of the job.

CA2

When it comes to data visualization and summaries, Excel has excellent features for the task. The same chart that I created in R-Studio can be done with an Excel Pivot Chart. There are many chart types to choose from.

2000 - 2010 movies
2000 - 2010 movies chart2

These features are standard and not new, but what is relatively new is Power Pivot. The concept is similar to an ordinary pivot table, but the main difference is that the user can create data models from multiple data sources. The available relationships are 1-to-1 or 1-to-many.

In the movies scenario we can easily link the movies data table to another table holding decade information. These two data sources can be linked via the Year column with a one-to-many relationship.

One to Many connection

This database relationship model enables users to get and transform data from multiple sources with ease. The data can be presented in more meaningful ways, and more insight can be gained from it. In the CA2 scenario we can easily represent movie releases by decade.
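
Outside Excel, the same one-to-many link can be sketched in a few lines of Python. This is only an illustration of the idea, not what Power Pivot does internally: the movie rows and the decade lookup table are made up, and the names `decades`, `movies` and `releases_by_decade` are my own.

```python
from collections import Counter

# Hypothetical lookup table: one row per year (the "one" side of the relationship)
decades = {year: f"{year // 10 * 10}s" for year in range(1990, 2011)}

# Hypothetical movie rows as (title, year): many movies can share a year (the "many" side)
movies = [
    ("Movie A", 1995),
    ("Movie B", 1999),
    ("Movie C", 2003),
    ("Movie D", 2003),
    ("Movie E", 2010),
]


def releases_by_decade(movie_rows, decade_lookup):
    """Follow the Year link from each movie to its decade and count the releases."""
    counts = Counter()
    for _title, year in movie_rows:
        counts[decade_lookup[year]] += 1
    return counts


by_decade = releases_by_decade(movies, decades)
for decade, count in sorted(by_decade.items()):
    print(decade, count)
```

The lookup table holds each year once, so every movie row finds exactly one decade, while one decade collects many movies.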

PowerPivot

With Excel 2016 we can get data from multiple sources, not just conventional CSV files or spreadsheets. We can use Excel to query databases such as MySQL, Oracle and SQL Server, as well as other data sources such as SharePoint lists, Facebook, Salesforce and many more.

For me Excel is a lightweight DBMS: it can retrieve the data, transform it and load it. To sum up, Excel is great for rapid report prototyping, data validation and extraction, and it can combine multiple data streams into one report. The only limitation is the local computer it is running on.

 

Thanks for reading my blog, I hope you learned something new :)

 

Evaldas

 

 

R Studio, the power of R

 

The Power of R,

 

My second blog post will be about R, the power of free statistical software. To be honest, I am brand new to R-Studio. I have just recently finished the R-Studio training on codeschool.com/ and, although I was a bit sceptical at first, I really enjoyed it. At first glance it looks easy; the syntax feels natural and can be mastered with more practice.

For this assignment I decided to produce a graph of the changes in movie genres. I am sure that with the right graph we can find real variation between the various genres. I got the data set from http://grouplens.org/datasets/movielens/ and it needed some tidying up (extracting the year of release, separating the genres from the movies, etc.). Data scrubbing is one of the most important parts of any project. The data must be in a uniform state and consistent across the whole data set. In my case, some of the release years were missing. The biggest problem I had was that the movie genres were listed on one line:

movieId title genres
6365 Matrix Reloaded, The (2003) Action|Adventure|Sci-Fi|Thriller|IMAX

 

When I had finished working on it in Excel, the data had the following form:

Year Genres % Year
1900 Romance 100.00%
1901 Documentary 100.00%
1902 Action 25.00%
1902 Adventure 25.00%
1902 Fantasy 25.00%
1902 Sci-Fi 25.00%
1903 Crime 50.00%
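
I did this tidying by hand in Excel, but the same steps can be sketched in Python. This is a minimal sketch under assumptions: the first sample row is the example shown above, the second is made up, and I am assuming the "% Year" column is each genre's share of all genre entries for that year.

```python
import re
from collections import Counter, defaultdict

# First row is the example from the post; the second row is made up for illustration
raw_rows = [
    (6365, "Matrix Reloaded, The (2003)", "Action|Adventure|Sci-Fi|Thriller|IMAX"),
    (1, "Hypothetical Movie (2003)", "Action|Drama"),
]

YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the release year from a title like 'Name (2003)', or None if it is missing."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def genre_share_by_year(rows):
    """Return {year: {genre: share}} where the shares for each year sum to 1."""
    counts = defaultdict(Counter)
    for _movie_id, title, genres in rows:
        year = extract_year(title)
        if year is None:
            continue  # rows without a release year are skipped
        for genre in genres.split("|"):
            counts[year][genre] += 1
    return {
        year: {genre: n / sum(per_year.values()) for genre, n in per_year.items()}
        for year, per_year in counts.items()
    }


shares = genre_share_by_year(raw_rows)
for genre, share in sorted(shares[2003].items()):
    print(2003, genre, f"{share:.2%}")
```

Splitting on the "|" character turns one row per movie into one row per movie and genre, which is the shape the chart needs.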

 

I saved this spreadsheet as a .CSV file and imported it into R-Studio. After some research and many failed attempts, I finally got a result: a chart displaying movie releases in 2000-2010.

Movie RChart

All this was achieved with simple code:

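# Load the tidied data that was exported from Excel
movies <- read.csv("movies_csv.csv")

# Earlier attempt: 'sci' is not defined in this post, it is assumed to be a count matrix built beforehand
barplot(sci, main = "Movie distribution", xlab = "Xlabel", col = c("lightblue", "red"), legend.text = rownames(sci), beside = TRUE)

# Count the movies per genre (rows) and year (columns), then draw a stacked bar chart
counts <- table(movies$genres, movies$Year_)
barplot(counts, col = rainbow(nrow(counts)), main = "Movie releases by genre 2000 - 2010")

# The legend follows the row order of 'counts' so that the colours match the bars
legend("topright",
       legend = rownames(counts),
       fill = rainbow(nrow(counts)))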

I have only used R for a few days, but from my experience I have two observations:

  • R is very powerful and has a lot of built-in features and great community support.
  • The software has almost no GUI; everything is code based. This is by no means a bad thing, but for users with no coding experience it adds complexity.

MAP Data Visualization

For my 1st blog post I will cover data visualization in Excel and Google Fusion Tables (heat maps). A data analyst must be acquainted with many software tools. Each project should start with these steps:

  1. What question needs to be answered with the project
  2. What data is available, what data will be needed
  3. What software tools are best for the project
  4. What is the best way to present the project

For the Irish Census project, the objective is to produce a report showing population by county and to produce an Irish population heat map.

As a data source I had a table with the 2011 Census data:

http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/

At first glance this table is fine, but only at first glance. As a report it is fine, but as a data source it needs to be transformed. Just to name a few of the data issues:

  • The first column contains multiple data types: City, County, Province
  • County names are not consistent
  • Males and Females are in the same row
  • Total persons is a static field

I chose Excel to transform the data. The final data table looks like this:

County Province Gender QTY
Carlow Leinster Male 27431
Carlow Leinster Female 27181
Cavan Ulster Male 37013
Cavan Ulster Female 36170
Clare Munster Male 58298
Clare Munster Female 58898

The data is broken down by County, Province, Gender and QTY of persons.

The second data source is the Google Maps coordinates for the Irish counties:
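
The reshaping from "Males and Females in the same row" to one row per county and gender can also be sketched in Python. The figures below come from the table above. The wide layout and the column names "Males" and "Females" are my assumption about how the source table looked, and the function names are my own.

```python
# Figures are from the table in this post; the wide layout is an assumed starting point
wide_rows = [
    {"County": "Carlow", "Province": "Leinster", "Males": 27431, "Females": 27181},
    {"County": "Cavan", "Province": "Ulster", "Males": 37013, "Females": 36170},
    {"County": "Clare", "Province": "Munster", "Males": 58298, "Females": 58898},
]


def to_long(rows):
    """Turn one row per county into one row per county and gender."""
    long_rows = []
    for row in rows:
        for column, gender in (("Males", "Male"), ("Females", "Female")):
            long_rows.append({
                "County": row["County"],
                "Province": row["Province"],
                "Gender": gender,
                "QTY": row[column],
            })
    return long_rows


def total_by_county(long_rows):
    """Calculate total persons from the detail rows instead of storing a static total."""
    totals = {}
    for row in long_rows:
        totals[row["County"]] = totals.get(row["County"], 0) + row["QTY"]
    return totals


long_rows = to_long(wide_rows)
totals = total_by_county(long_rows)
print(totals)
```

With the data in this long form, the total is always calculated from the Male and Female rows, so it cannot drift out of step with them.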

http://www.independent.ie/editorial/test/map_lead.kml

To achieve the goal of the project, these two data sources need to be merged. Google Fusion Tables is one of the best tools for this. To create the heat map, a few simple steps need to be taken:

  • Upload the county border data to Google Fusion Tables
  • Upload the Excel file with the county population data
  • Merge these two (the common field is the county name)
  • Tweak the report presentation. Add the “buckets” to colour code the counties by the population quantity in each county.
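
The merge and the "buckets" from the steps above can be sketched in Python as well. This is not how Google Fusion Tables works internally, just the same idea in code. The population figures are from this post, while the border values are placeholder strings, and the bucket limits and colours are made up for illustration.

```python
def normalise(name):
    """Make county names comparable: trim spaces and use consistent capitalisation."""
    return name.strip().title()


# Population figures are from the tables in this post
population = {"Carlow": 54612, "Clare": 117196, "Dublin": 1273069}

# Placeholder strings stand in for the KML border polygons (hypothetical values)
borders = {"CARLOW": "polygon-1", " clare": "polygon-2", "Dublin ": "polygon-3"}

# Hypothetical bucket limits as (upper limit, colour)
BUCKETS = [(100_000, "yellow"), (500_000, "orange"), (float("inf"), "red")]


def colour_for(count):
    """Return the colour of the first bucket whose upper limit is above the count."""
    for upper, colour in BUCKETS:
        if count < upper:
            return colour


def merge_on_county(population_by_county, border_by_county):
    """Join the two sources on the cleaned county name and attach a bucket colour."""
    clean_borders = {normalise(name): shape for name, shape in border_by_county.items()}
    merged = {}
    for county, count in population_by_county.items():
        key = normalise(county)
        if key in clean_borders:
            merged[key] = {
                "population": count,
                "border": clean_borders[key],
                "colour": colour_for(count),
            }
    return merged


merged = merge_on_county(population, borders)
for county, row in merged.items():
    print(county, row["population"], row["colour"])
```

Cleaning the county names before the merge matters, because the merge only matches rows whose names are exactly the same in both sources.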

 

A heat map is one of the best ways for the customer to consume a map data report. The same report can be presented as a table summary. Excel is in a league of its own when it comes to reporting.

County Population % Share
Carlow 54,612 1.21%
Cavan 73,183 1.63%
Clare 117,196 2.61%
Cork 519,032 11.54%
Donegal 161,137 3.58%
Dublin 1,273,069 28.30%
Galway 250,653 5.57%
Kerry 145,501 3.23%
Kildare 210,312 4.68%
Kilkenny 95,419 2.12%
Laois 80,559 1.79%
Leitrim 31,798 0.71%
Limerick 191,809 4.26%
Longford 39,000 0.87%
Louth 122,897 2.73%
Mayo 130,638 2.90%
Meath 184,135 4.09%
Monaghan 60,483 1.34%
Offaly 76,687 1.70%
Roscommon 64,065 1.42%
Sligo 65,393 1.45%
Tipperary 70,322 1.56%
Waterford 111,795 2.49%
Westmeath 86,164 1.92%
Wexford 145,320 3.23%
Wicklow 136,640 3.04%
Grand Total 4,497,819 100.00%

Province Population % of Population
Connacht 542,547 12.06%
Leinster 2,504,814 55.69%
Munster 1,155,655 25.69%
Ulster 294,803 6.55%
Grand Total 4,497,819 100.00%
Irish population map according to 2011 census
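
The % share columns in the summary are simple arithmetic: each population divided by the grand total. As a small Python sketch, using figures from the summary table (the function name `percent_share` is my own):

```python
# Figures are from the summary table in this post
county_population = {"Carlow": 54612, "Cavan": 73183, "Clare": 117196}
grand_total = 4497819


def percent_share(part, total):
    """Return the share of 'part' in 'total' as a percentage with two decimals."""
    return round(part / total * 100, 2)


for county, count in county_population.items():
    print(county, f"{percent_share(count, grand_total):.2f}%")
```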

Other uses of this data

 

With the current data we could calculate the population density of each county. The data can also be used in conjunction with other data sources, for instance the number of graduates in each county, to compare the % of third-level graduates across counties. With the number of schools in each county we could calculate the student distribution per school and the % of recent graduates attending third level.

We could also get the coordinates of the international airports to find out how far, on average, people need to travel to the nearest airport.
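
For the airport idea, a possible starting point is the haversine formula, which gives the straight-line distance between two coordinates on a sphere. This is only a sketch under assumptions: the airport coordinates are approximate values I typed in for illustration, the distance is "as the crow flies" rather than by road, and the function names are my own.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate airport coordinates, for illustration only
airports = {
    "Dublin": (53.42, -6.27),
    "Cork": (51.84, -8.49),
    "Shannon": (52.70, -8.92),
}


def nearest_airport(lat, lon):
    """Return (airport name, distance in km) for the closest airport to a point."""
    distances = (
        (name, haversine_km(lat, lon, a_lat, a_lon))
        for name, (a_lat, a_lon) in airports.items()
    )
    return min(distances, key=lambda pair: pair[1])


name, distance = nearest_airport(52.0, -8.5)  # hypothetical location
print(name, round(distance, 1))
```

To get the average for the whole population, each county's distance could be weighted by its population from the table above.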