Sunday, January 19, 2014

Hadoop: A Framework for Data-Intensive Distributed Computing – part I

Understanding what "Big Data" is
Dealing with Big Data requires inexpensive, reliable storage and new tools for analyzing structured and unstructured data. Today, we're surrounded by big data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. We live in the data age.

It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one million petabytes, or one billion terabytes. That's roughly the same order of magnitude as one disk drive for every person in the world.

Thus, "Big Data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.
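To put those units in perspective, here is a minimal Python sketch of the arithmetic behind the figures above. The world-population and per-drive figures are assumptions added purely for illustration; only the zettabyte estimates come from the IDC numbers quoted above.

```python
# Minimal sketch of the unit arithmetic behind the figures above.
# World population and drive size are illustrative assumptions.

ZETTABYTE = 10**21   # bytes
PETABYTE  = 10**15   # bytes
TERABYTE  = 10**12   # bytes

digital_universe_2006 = 0.18 * ZETTABYTE   # IDC estimate for 2006
digital_universe_2011 = 1.8  * ZETTABYTE   # tenfold growth forecast for 2011

print(ZETTABYTE // PETABYTE)   # 1,000,000  -> one million petabytes per zettabyte
print(ZETTABYTE // TERABYTE)   # 1,000,000,000 -> one billion terabytes per zettabyte

# Rough "one disk drive per person" comparison, assuming ~6.6 billion people.
world_population = 6.6e9
bytes_per_person = digital_universe_2011 / world_population
print(f"{bytes_per_person / 1e9:.0f} GB per person")   # ~273 GB, about one consumer disk
```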
What is Hadoop?
Consider the example of Facebook. By 2011, Facebook's data had grown to about 15 TB per day, and in the future it will produce data of a much higher magnitude. They have many web servers and huge MySQL servers (profiles, friends, etc.) to hold the user data.
Now they need to run various reports on this huge data set: for example, the ratio of men vs. women users over a period, or the number of users who commented on a particular day. Their original solution was a set of Python scripts driving ETL processes, but as the data grew to this size those scripts no longer worked. A sketch of how such a report maps onto Hadoop follows below.
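To make that kind of report concrete, here is a minimal sketch of how a "comments per day" count could be expressed as a Hadoop Streaming job in Python. The input layout (tab-separated lines with the date in the first field) is an assumption for illustration, not Facebook's actual schema.

```python
#!/usr/bin/env python
# Minimal sketch: "number of users who commented on a particular day"
# as a Hadoop Streaming job. Assumes each input line is tab-separated
# with the comment date in the first field, e.g.
#   2011-03-15<TAB>user123<TAB>some comment text
# This layout is illustrative only.
import sys

def mapper():
    # Emit "<date>\t1" for every comment record read from stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so all counts for one date
    # arrive consecutively; sum them and emit "<date>\t<total>".
    current_date, count = None, 0
    for line in sys.stdin:
        date, value = line.rstrip("\n").split("\t")
        if date != current_date:
            if current_date is not None:
                print(f"{current_date}\t{count}")
            current_date, count = date, 0
        count += int(value)
    if current_date is not None:
        print(f"{current_date}\t{count}")

if __name__ == "__main__":
    # Run as: comments_per_day.py map   (mapper side)
    #         comments_per_day.py reduce (reducer side)
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

In practice such a script is typically submitted through the hadoop-streaming jar, passing it as both the -mapper and -reducer commands along with -input and -output paths on HDFS; the point is that the same logic the Python ETL scripts could no longer handle now runs in parallel across the cluster.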
