Saturday, April 27, 2013

Hadoop: A Framework for Data-Intensive Distributed Computing – part I


1. INTRODUCTION
Understanding what “Big Data” is
Dealing with “Big Data” requires inexpensive, reliable storage and new tools for analyzing structured and unstructured data. Today, we’re surrounded by big data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one million petabytes, or one billion terabytes. Spread across the world’s population, 1.8 zettabytes works out to a few hundred gigabytes per person, roughly the same order of magnitude as one disk drive for every person in the world.

Thus “big data” is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.
The exponential growth of data presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Existing tools were becoming inadequate for processing such large data sets. Google was the first to publicize MapReduce, a system it had used to scale its data processing needs. The system aroused a lot of interest because many other businesses were facing similar scaling challenges, and it wasn’t feasible for everyone to reinvent a proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an open source version of this MapReduce system, called Hadoop. Soon after, Yahoo rallied around to support the effort. Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter.
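
To make the MapReduce idea concrete, the sketch below shows the canonical word-count job written against Hadoop’s Java MapReduce API (the org.apache.hadoop.mapreduce classes). This is a minimal illustration, not code from any of the companies above: the mapper emits a (word, 1) pair for every word in its input split, the framework groups those pairs by word, and the reducer sums the counts for each word. The class name WordCount and the command-line input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, the job can be launched with something like: hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar name here is illustrative). Because map tasks run wherever the input blocks live and the grouping by key happens inside the framework, the same small program scales from a single machine to thousands of nodes without change.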

