Hadoop – Parallel computing and distributed data storage


For my next blog post I will write about Hadoop: what it is, why it is used by many organizations, and why it is becoming a standard in data storage and processing.

Hadoop has many advantages over traditional database and computing platforms: it is very scalable and offers rapid data analysis.

The way I understand Hadoop, it offers organizations decentralized storage and computing power. With Hadoop, an organization can spread computing work across many nodes in a cluster. Each job is divided into smaller, more manageable tasks, and each task is completed by a node in the cluster where the relevant data is stored. This offers increased performance. Hadoop also has HDFS (Hadoop Distributed File System), which allows the data to be stored across multiple nodes for better file read and write speeds. It also has data redundancy built in: by default it creates 2 additional copies of the data on separate nodes in the cluster. When the data needs to be processed, it can be processed from any of the three copies, and changes are replicated across the copies.
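To make the idea more concrete, here is a minimal sketch in plain Python. It is not real Hadoop code and does not use any Hadoop API; the node names, the sample text and the helper functions (`place_blocks`, `map_block`, `run_job`) are all made up for illustration. It only shows the two ideas described above: every block of data is kept on three nodes, and the work is split into small pieces whose results are combined at the end.

```python
from collections import Counter

REPLICATION = 3  # one copy plus 2 additional copies (assumed default)


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` different nodes, round-robin."""
    placement = {}
    for i in range(len(blocks)):
        # Start at a different node for each block and take the next ones in order.
        placement[i] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def map_block(block):
    """The small task: count the words in a single block."""
    return Counter(block.split())


def run_job(blocks, placement):
    """Run one small task per block on a node holding it, then merge the results."""
    total = Counter()
    work_log = []
    for i, block in enumerate(blocks):
        node = placement[i][0]  # any of the three copies would do
        work_log.append((node, i))
        total += map_block(block)  # combine the partial counts
    return total, work_log


# Hypothetical cluster and data, for illustration only.
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["big data big cluster", "data node data", "cluster node big"]

placement = place_blocks(blocks, nodes)
counts, work_log = run_job(blocks, placement)

print(placement)
print(dict(counts))
```

In a real cluster the small tasks run at the same time on different machines. Here they run one after another, which is enough to show how the work is divided and combined.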

More to be updated…


Big Data – new data era

Hello dear Blog followers,

Since you are reading my fourth blog post, it means you like my posts and are hopefully learning something new. This is what blogging is all about: writing about stuff that matters and that readers find useful.

Today's blog post is about Big Data. If you are interested in data analytics, I am sure you have come across this term quite often. To fully understand what “Big Data” is, we need to look at its three main aspects, known as the 3Vs.

Volume – The sheer size of the data that is available and needs to be stored and analyzed.

Variety – The data can be harnessed from many sources: sensors, the web, Facebook, audio/video files, transactional applications.

Velocity – The rate at which the data is generated.

Big Data enables organizations to better understand their customers and their needs, build better services and products, and gain a competitive advantage.

For many years people relied on traditional, structured relational databases, but a new era has since emerged: Big Data.

It is hard to describe what Big Data is. If you ask 100 professionals, many will give different definitions, but in general all will agree on the main 3Vs.

The way I understand it, Big Data is a mixture of internal (company) data and external data. The internal data is structured, normalized and optimized, while the external data is “chaotic”: it is massive (Volume), varied (Variety) and never-ending (Velocity). The combination of the two is Big Data.

One of the main challenges of Big Data is its storage and computing requirements. Traditional database mechanisms are not well suited to Big Data. One of the solutions is Hadoop, which I will cover in my next blog post.

Thanks again for reading my blog.



Excel – analytic platform


My dear readers, for my third post I will write about Excel and why I think it is one of the best and most versatile analytic tools available. I know what some of you will be thinking: Excel is only good for spreadsheets and quick calculations. But it is so much more.

I have been an Excel power user for a number of years. In my professional life I use it every day, and I still learn something new. When I was given the CA1 and CA2 assignments, I did most of the work (data extraction and validation) in Excel. Only the end result was produced with a different software package: Google Fusion Tables for CA1 and RStudio for CA2.

A similar result for CA1 could be achieved in Excel. In Excel, users can load additional add-ins, also referred to as Apps for Office. To present the data on a map I used the “Bing Maps” app. It is straightforward to use: just load the map reference data into the app and it will produce the report.

Excel map report

The result is not exactly the same, but if the aim of the assignment is to represent the population of each town in Ireland, Excel is more than capable of doing the job.


When it comes to data visualization and summaries, Excel has excellent features for the task. The same chart that I created in RStudio can be done as an Excel Pivot Chart. There are many chart types to choose from.

2000 - 2010 movies charts

These features are standard and not new, but what is relatively new is Power Pivot. The concept is similar to an ordinary pivot table, but the main difference is that the user can create data models from multiple data sources. The available relationships are 1-to-1 or 1-to-many.

In the movies scenario we can easily link the movies data table to another table holding decade information. These two data sources can be linked via the Year column with a one-to-many relationship.

One to Many connection

This relationship model enables users to get and transform data from multiple sources with ease. The data can be presented in more meaningful ways, and more insight can be gained from it. In the CA2 scenario we can easily represent movie releases by decade.
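For readers who prefer code to screenshots, the same one-to-many idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how Power Pivot works internally. The movie titles and years are made-up sample rows, not the CA2 data, and the table and column names are my own assumptions.

```python
from collections import Counter

# Hypothetical lookup table: one row per year (the "one" side of the relationship).
decades = {year: f"{year // 10 * 10}s" for year in range(1990, 2020)}

# Hypothetical movies table: many rows can share the same year (the "many" side).
movies = [
    {"title": "Movie A", "year": 1994},
    {"title": "Movie B", "year": 1999},
    {"title": "Movie C", "year": 2003},
    {"title": "Movie D", "year": 2003},
    {"title": "Movie E", "year": 2010},
]

# Link the two tables through the Year column, like a relationship in the data model.
joined = [{**movie, "decade": decades[movie["year"]]} for movie in movies]

# Summarise like a pivot table: number of releases per decade.
releases_by_decade = Counter(row["decade"] for row in joined)

print(dict(releases_by_decade))
```

Each year appears only once in the decades table but can appear many times in the movies table, which is exactly what the one-to-many relationship describes.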


With Excel 2016 we can get data from multiple sources, not just conventional CSV files or spreadsheets. We can use Excel to query databases such as MySQL, Oracle and SQL Server, as well as other data sources such as SharePoint lists, Facebook, Salesforce and many more.

For me, Excel is a lightweight DBMS: it can retrieve the data, transform it and load it. To sum up, Excel is great for rapid report prototyping, data validation and extraction, and it can combine multiple data streams into one report. The only limitation is the local computer it is running on.
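The retrieve, transform and load steps can also be sketched outside Excel. Below is a minimal Python example using only the standard library. Everything in it is an assumption for illustration: the CSV text, the town and county names, the figures, and the in-memory SQLite database, which simply stands in for a database such as MySQL or SQL Server.

```python
import csv
import io
import sqlite3

# Source 1: a CSV export (made-up sample data held in a string).
csv_text = "town,population\nTown A,1200\nTown B,n/a\nTown C,300\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Source 2: a database table (in-memory SQLite used as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE towns (town TEXT, county TEXT)")
db.executemany(
    "INSERT INTO towns VALUES (?, ?)",
    [("Town A", "County X"), ("Town C", "County Y")],
)
county_by_town = dict(db.execute("SELECT town, county FROM towns"))
db.close()

# Validate and transform the CSV rows, then combine both sources into one report.
report = []
for row in csv_rows:
    if not row["population"].isdigit():
        continue  # validation: skip rows without a usable number
    report.append(
        {
            "town": row["town"],
            "county": county_by_town.get(row["town"], "unknown"),
            "population": int(row["population"]),
        }
    )

for line in report:
    print(line)
```

The row for Town B is dropped by the validation step, and the two remaining rows pick up their county from the second source.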


Thanks for reading my blog, I hope you learned something new :)