Introduction to BIG DATA

BIG DATA is “the next frontier for innovation”.

Introduction: What is Big Data?

Data that exceeds the storage and processing capacity of conventional database management systems is called “Big Data”. Petabytes of data are generated every day, and the rate of data generation is increasing rapidly.


Characterization of "BIG-DATA" by "4V's":

Volume: Enterprises commonly run storage systems that hold terabytes or petabytes. (Volume is simply the size of the data: MB, TB, PB, EB, ZB, YB…)
Velocity: The speed at which data is generated and moves through the network for processing.
Variety: Structured, semi-structured, and unstructured data.
Veracity: The uncertainty of the data.

Sources of Big Data:

The data comes from various sources: transactions, social media, sensors, digital images, CCTV cameras, online shopping, airline black boxes, videos, audio, search engines, and click-streams, across domains including healthcare, retail, energy, and utilities. About 90% of all the data available in the world was generated in the last decade. Ex. New York Stock Exchange – 1 TB/day, Facebook – 1 PB/day, Internet Archive – 20 TB/month, Large Hadron Collider near Geneva – 15 PB/year.
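
The example rates above use different units and time periods, so they are hard to compare directly. A minimal sketch of the arithmetic, assuming decimal (SI) units, a 365-day year, and a 12-month year; the function and variable names are made up for illustration:

```python
# Decimal (SI) units: 1 TB = 10**12 bytes, 1 PB = 10**15 bytes (an assumption;
# the text does not say whether decimal or binary units are meant).
UNITS = {"TB": 10**12, "PB": 10**15}

DAYS_PER_YEAR = 365
MONTHS_PER_YEAR = 12

# (source, amount, unit, periods per year), using the figures quoted in the text.
SOURCES = [
    ("New York Stock Exchange", 1, "TB", DAYS_PER_YEAR),
    ("Facebook", 1, "PB", DAYS_PER_YEAR),
    ("Internet Archive", 20, "TB", MONTHS_PER_YEAR),
    ("Large Hadron Collider", 15, "PB", 1),
]


def bytes_per_year(amount, unit, periods_per_year):
    """Convert a rate such as '20 TB per month' into bytes per year."""
    return amount * UNITS[unit] * periods_per_year


def to_petabytes(num_bytes):
    """Express a byte count in petabytes."""
    return num_bytes / UNITS["PB"]


for name, amount, unit, periods in SOURCES:
    yearly_pb = to_petabytes(bytes_per_year(amount, unit, periods))
    print(f"{name}: {yearly_pb:.3f} PB/year")
```

Putting every source on the same PB/year scale makes the spread visible: the quoted figures differ by several orders of magnitude.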

Where is "BIG-DATA" used?

1. Understanding and Targeting Customers.
2. Understanding and Optimizing Business Processes.
3. Improving Healthcare and Public Health.
4. Improving Science and Research.
5. Optimizing Machine and Device Performance.
6. Financial Trading, and many other fields.

Different types of Data:

1. Structured Data:
Data that can be stored in a database in row-and-column format, i.e. in a relational database. It is very simple to manage. Structured data makes up only 5-10% of all data.
2. Semi-structured Data:
Semi-structured data doesn’t reside in an RDBMS but has some organizational properties that make it easier to analyze. Ex. log files, CSV, XML.
3. Unstructured Data:
All remaining data is considered unstructured. It includes video, images, email, photos, audio, web pages, and much more, and it doesn’t fit neatly into a database. Unstructured data accounts for 80% of all data and is growing exponentially, faster than the other types. This data is either machine-generated or human-generated.
Machine-generated data: satellite images, scientific data, photographs, videos, radar or sensor data, and more.
Human-generated unstructured data: mobile data, website data, social media data, text data, and more.
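
The difference between the three types is easiest to see in how a program has to read them. The following is a rough sketch, assuming an in-memory SQLite table as the relational example, a small hand-written XML snippet as the semi-structured example, and a sentence of free text as the unstructured example; all table names, tags, and values are invented for illustration:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows and columns in a relational table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 30.0), (2, "bob", 45.5)],
)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
conn.close()

# Semi-structured: tags describe the data, but records need not share a shape.
# The second order carries a <note> element that the first one lacks.
xml_text = (
    "<orders>"
    "<order id='1'><item>book</item></order>"
    "<order id='2'><item>pen</item><note>gift</note></order>"
    "</orders>"
)
root = ET.fromstring(xml_text)
items = [order.findtext("item") for order in root.iter("order")]
notes = [order.findtext("note") for order in root.iter("order")]  # None if absent

# Unstructured: free text with no schema; the program can only apply generic
# operations (here, a naive whitespace word count) until it is analyzed further.
review = "Great service, the delivery was fast and the book arrived in perfect shape."
word_count = len(review.split())

print(total, items, notes, word_count)
```

The structured query can rely on every row having an `amount` column, the XML code has to tolerate a missing `note`, and the free text offers no fields at all.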

Challenges with "BIG-DATA":

1) Capturing & Storing the data.
2) Understanding and analysis of the data.
3) Synchronization across the Data Sources.
4) Getting and displaying meaningful Information out of that data.

Limitations of RDBMS to support “BIG-DATA”:

1) An RDBMS cannot handle huge data volumes well; to grow, the database management system has to be scaled up vertically.
2) The majority of the data comes in a semi-structured or unstructured format, whereas an RDBMS is designed primarily for structured data.
3) Big Data is generated at very high velocity. An RDBMS struggles with high velocity because it is designed for steady data retention rather than rapid growth. Even if an RDBMS is used to handle and store “big data,” it turns out to be very expensive.
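
Point 2 can be illustrated with a small experiment. This is only a sketch, assuming SQLite stands in for an RDBMS and a plain Python list of dicts stands in for a schema-less store; the `events` table and the `insert_event` helper are hypothetical names used for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, username TEXT)")


def insert_event(record):
    """Try to insert a dict into the fixed-schema table; return True on success.

    Column names are interpolated into the SQL only because the records below
    are hard-coded; never do this with untrusted input.
    """
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    sql = f"INSERT INTO events ({columns}) VALUES ({placeholders})"
    try:
        conn.execute(sql, tuple(record.values()))
        return True
    except sqlite3.OperationalError:
        # Raised when the record has a field the table has no column for.
        return False


fixed_record = {"id": 1, "username": "alice"}
extra_field_record = {"id": 2, "username": "bob", "location": "Geneva"}

ok_fixed = insert_event(fixed_record)        # matches the schema
ok_extra = insert_event(extra_field_record)  # has an unknown "location" field

# A schema-less store accepts both shapes without any prior declaration.
document_store = []
document_store.append(fixed_record)
document_store.append(extra_field_record)

print(ok_fixed, ok_extra, len(document_store))
```

The relational table rejects the record with the extra field until its schema is altered, while the schema-less store takes records of any shape and leaves the interpretation to whoever reads them.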

Tools for "BIG-DATA":

NoSQL: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, ZooKeeper
MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Oozie
Storage: S3, Hadoop Distributed File System.
Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
Processing: R, Yahoo! Pipes, Mechanical Turk, Elasticsearch, Datameer, BigSheets, TinkerPop
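
Several of the tools listed above are built around the MapReduce model. The following is a single-process sketch of that model using the classic word-count example, not the Hadoop API; the function names are chosen for illustration, and a real framework would run the map and reduce steps in parallel across many machines:

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread over many nodes, which is what lets this model scale out instead of up.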

Applications of "BIG-DATA":

Recommendation
Online advertising
Stock exchange analysis
Social networking analysis
Spam filtering
Telecommunication network monitoring and much more.
