Sunday, January 19, 2014

Hadoop: A Framework for Data-Intensive Distributed Computing – part I

Understanding what "Big Data" is
Dealing with Big Data requires inexpensive, reliable storage and new tools for analyzing structured and unstructured data. Today, we're surrounded by big data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. We live in the data age.

It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one million petabytes, or one billion terabytes. That's roughly the same order of magnitude as one disk drive for every person in the world.

Thus, "Big Data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.
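To put those units in perspective, here is a minimal Python sketch of the arithmetic behind the figures above. The world-population and per-drive figures are assumptions added purely for illustration; only the zettabyte estimates come from the IDC numbers quoted above.

```python
# Minimal sketch of the unit arithmetic behind the figures above.
# World population and drive size are illustrative assumptions.

ZETTABYTE = 10**21   # bytes
PETABYTE  = 10**15   # bytes
TERABYTE  = 10**12   # bytes

digital_universe_2006 = 0.18 * ZETTABYTE   # IDC estimate for 2006
digital_universe_2011 = 1.8  * ZETTABYTE   # tenfold growth forecast for 2011

print(ZETTABYTE // PETABYTE)   # 1,000,000  -> one million petabytes per zettabyte
print(ZETTABYTE // TERABYTE)   # 1,000,000,000 -> one billion terabytes per zettabyte

# Rough "one disk drive per person" comparison, assuming ~6.6 billion people.
world_population = 6.6e9
bytes_per_person = digital_universe_2011 / world_population
print(f"{bytes_per_person / 1e9:.0f} GB per person")   # ~273 GB, about one consumer disk
```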
What is Hadoop?
Consider the example of Facebook. By 2011, Facebook's data had grown to about 15 TB per day, and in the future it will produce data of a much higher magnitude. They have many web servers and huge MySQL servers (profiles, friends, etc.) to hold the user data.
Now they need to run various reports on this huge data set: for example, the ratio of men vs. women users over a period, or the number of users who commented on a particular day. Their original solution was a set of Python scripts driving ETL processes, but as the data grew to this size those scripts no longer worked. A sketch of how such a report maps onto Hadoop follows below.
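To make that kind of report concrete, here is a minimal sketch of how a "comments per day" count could be expressed as a Hadoop Streaming job in Python. The input layout (tab-separated lines with the date in the first field) is an assumption for illustration, not Facebook's actual schema.

```python
#!/usr/bin/env python
# Minimal sketch: "number of users who commented on a particular day"
# as a Hadoop Streaming job. Assumes each input line is tab-separated
# with the comment date in the first field, e.g.
#   2011-03-15<TAB>user123<TAB>some comment text
# This layout is illustrative only.
import sys

def mapper():
    # Emit "<date>\t1" for every comment record read from stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so all counts for one date
    # arrive consecutively; sum them and emit "<date>\t<total>".
    current_date, count = None, 0
    for line in sys.stdin:
        date, value = line.rstrip("\n").split("\t")
        if date != current_date:
            if current_date is not None:
                print(f"{current_date}\t{count}")
            current_date, count = date, 0
        count += int(value)
    if current_date is not None:
        print(f"{current_date}\t{count}")

if __name__ == "__main__":
    # Run as: comments_per_day.py map   (mapper side)
    #         comments_per_day.py reduce (reducer side)
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

In practice such a script is typically submitted through the hadoop-streaming jar, passing it as both the -mapper and -reducer commands along with -input and -output paths on HDFS; the point is that the same logic the Python ETL scripts could no longer handle now runs in parallel across the cluster.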
