Data is largely classified as Structured, Semi-Structured and Unstructured.
If we know the fields as well as their datatypes, we call the data structured. The data in relational databases such as MySQL, Oracle or Microsoft SQL Server is an example of structured data.
If we know the fields or columns but not their datatypes, we call it semi-structured data. For example, data in a CSV (comma-separated values) file is considered semi-structured.
If our data doesn't contain columns or fields, we call it unstructured data. Plain text files or logs generated on a server are examples of unstructured data.
The process of translating unstructured data into structured data is known as ETL - Extract, Transform and Load.
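For illustration, here is a minimal Python sketch of this idea: it parses unstructured web-server log lines into structured records and loads them into a CSV file. The log format, field names and file names are assumptions made up for this example.

import csv
import re

# Assumed (hypothetical) log format: <ip> [<timestamp>] "<url>" <status>
LOG_PATTERN = re.compile(r'(\S+) \[(.+?)\] "(\S+)" (\d{3})')

def extract(lines):
    """Extract: read the raw, unstructured log lines."""
    for line in lines:
        yield line.rstrip("\n")

def transform(lines):
    """Transform: turn each matching line into a structured record."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            ip, timestamp, url, status = match.groups()
            yield {"ip": ip, "timestamp": timestamp, "url": url, "status": int(status)}

def load(records, path):
    """Load: write the structured records into a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["ip", "timestamp", "url", "status"])
        writer.writeheader()
        writer.writerows(records)

raw = ['203.0.113.7 [2024-01-01 10:00:00] "/index.html" 200']
load(transform(extract(raw)), "structured_logs.csv")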
What is a Distributed System?
When networked computers are utilized to achieve a common goal,
it is known as a distributed system. The work gets distributed amongst many
computers. The branch of computing that studies distributed systems is known as
distributed computing. The purpose of distributed computing is to get the work
done faster by utilizing many computers.
Most, but not all, tasks can be performed using distributed computing.
What is Big Data?
In very simple words, Big Data is data of very big size which cannot be processed with the usual tools, and to process such data we need a distributed architecture. This data could be structured or unstructured.
Generally, we classify the problems related to the handling of data into three buckets.
Volume: When the problem we are solving is related to how we would store such huge data, we call it a problem of Volume. For example, Facebook handles more than 500 TB of data per day and has 300 PB of data in storage.
Velocity: When we are trying to handle many requests per second, we call this characteristic Velocity. Handling the number of requests received by Facebook or Google every second is an example of Big Data due to Velocity.
Variety: If the problem at hand is complex, or the data we are processing is complex, we call such problems related to Variety.
Imagine you have to find the fastest route on a map. Since the problem involves enumerating many possibilities, it is a complex problem even though the map's size would not be too huge.
Data could be termed Big Data if any of Volume, Velocity or Variety becomes impossible to handle using traditional tools.
Why do we need Big Data now?
Paper, tapes, etc. are analog storage, while CDs, DVDs and hard disk drives are considered digital storage.
This graph shows that digital storage started increasing exponentially after 2002, while analog storage remained practically the same.
The year 2002 is called the beginning of the digital age.
Why so? The answer is twofold: devices and connectivity. On one hand, devices became cheaper, faster and smaller; the smartphone is a great example. On the other hand, connectivity improved: we have WiFi, 4G, Bluetooth, NFC, etc.
This led to a lot of very useful applications such as a very vibrant World Wide Web, social networks, and the Internet of Things, leading to huge data generation.
Roughly, a computer is made of four components.
1. CPU - which executes instructions. A CPU is characterized by its speed: the more instructions it can execute per second, the faster it is considered.
2. RAM (random access memory) - while processing, we load data into RAM. If we can load more data into RAM, the CPU can perform better. So, RAM has two main attributes which matter: its size and its speed of reading and writing.
3. Disk - to permanently store data, we need a hard disk drive or a solid state drive. The SSD is faster but smaller and costlier. The faster and bigger the disk, the faster we can process data.
4. Network - another component that we frequently forget while thinking about the speed of computation. Why? Often our data is stored on different machines and we need to read it over the network to process it.
While processing Big Data, at least one of these four components becomes the bottleneck. That's where we need to move to multiple computers, or a distributed computing architecture.
So far we have tried to establish that while handling humongous data we would need a new set of tools which can operate in a distributed fashion.
But who would be generating such data, or who would need to process such humongous data? The quick answer is: everyone.
Now, let us take a few examples.
In the e-commerce industry, recommendation is a great example of Big Data processing. Recommendation, also known as collaborative filtering, is the process of suggesting a product to someone based on their preferences or behavior.
The e-commerce website gathers a lot of data about the customers' behavior. A very simplistic algorithm would basically try to find similar users and then cross-suggest products to them. So, the more the users, the better the results.
As per Amazon, a major chunk of their sales happens via recommendations on the website and over email. The other big example of Big Data processing was the Netflix one-million-dollar competition to generate movie recommendations.
As of today, generating recommendations has become pretty simple. Engines such as MLlib or Mahout have made it very easy to generate recommendations on humongous data. All you have to do is format the data into three columns: user id, movie id, and rating.
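For instance, here is a minimal PySpark sketch of generating recommendations with MLlib's ALS (alternating least squares) algorithm; the file name ratings.csv and its column names are assumptions for this example.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommendations").getOrCreate()

# Assumed input: a CSV with the three columns userId, movieId, rating
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Train a collaborative-filtering model on the (user, movie, rating) triples
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)

# Produce the top 5 movie recommendations for every user
model.recommendForAllUsers(5).show(truncate=False)

The trained model can then be used to cross-suggest items that similar users rated highly.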
A/B testing is a process used in drug testing to identify whether a drug is effective or not.
A similar process is used to identify whether a particular feature is effective or not. As you can see in the diagram, a randomly selected half of the users is shown variation A and the other half is shown variation B. We can clearly see that variation A is very effective because it gives double the conversions. This method is effective only if we have a significant number of users. Also, the ratio of users need not be 50-50.
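As a rough sketch of how one might check whether the difference between variation A and variation B is statistically significant, the following Python snippet computes a two-proportion z-test; the visitor and conversion numbers are made up for this example.

import math

# Hypothetical results of an A/B test
visitors_a, conversions_a = 10000, 420   # variation A
visitors_b, conversions_b = 10000, 210   # variation B

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled conversion rate and standard error under the null hypothesis
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

# z statistic and two-sided p-value (normal approximation)
z = (p_a - p_b) / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"conversion A = {p_a:.2%}, conversion B = {p_b:.2%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")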
So, A/B testing utilizes the data in product development. At Amazon, we have many such experiments running all the time. Every feature is launched via A/B testing: it is first shown to, say, 1 percent of users and if it is performing well, we increase the percentage.
To manage so many variations on such a high number of users, we
generally need Big Data platforms.
Big Data Customers:
Government ==
Since governments have huge data about their citizens, any analysis would be a Big Data analysis. The applications are many.
First is fraud detection. Be it anti-money laundering or user identification, the amount of data processing required is really high.
In cyber security, welfare and justice, Big Data is being generated and Big Data tools are getting adopted.
Telecom ==
Telecom companies can use Big Data to understand why their customers are leaving and how they can prevent customers from leaving. This is known as customer churn prevention. The data that could help in customer churn prevention includes: 1. How many calls did the customer make to the call center? 2. For how long were they out of the coverage area? 3. What was their usage pattern?
The other use-case is network performance optimization. Based on the past history of traffic, telecom companies can forecast the network traffic and optimize the performance accordingly.
The third most common use-case of Big Data in the telecommunication industry is call data record analysis. Since a telecom company has millions of users and each user makes hundreds of calls per day, analysing the call data records is a Big Data problem.
It is very much possible to predict the failure of hardware based on all the data points from when previous failures occurred. A seemingly impossible task becomes possible because of the sheer volume of data.
Healthcare ==
Healthcare inherently has humongous data and complex problems to solve. Such problems can be solved with the new Big Data technologies, since supercomputers could not solve most of these problems.
A few examples of such problems are health information exchange, gene sequencing, healthcare improvements and drug safety.
There are many Big Data Solution stacks.
The first and most powerful stack is Apache Hadoop and Spark together. While Hadoop provides storage for structured and unstructured data, Spark provides the computational capability on top of Hadoop.
The second way could be to use Cassandra or MongoDB. The third could be to use Google Compute Engine or Microsoft Azure. In such cases, you would have to upload your data to Google or Microsoft, which may not always be acceptable to your organization.
Hadoop was created by Doug Cutting in order to build his search
engine called Nutch. He was joined by Mike Cafarella. Hadoop was based on the
three papers published by Google: Google File System, Google MapReduce, and
Google Big Table.
It is named after the toy elephant of Doug Cutting's son.
Hadoop is under Apache license which means you can use it
anywhere without having to worry about licensing.
It is quite powerful, popular and well supported.
It is a framework to handle Big Data.
Started as a single project, Hadoop is now an umbrella of projects. All of the projects under the Apache Hadoop umbrella should have the following three characteristics: 1. Distributed - they should be able to utilize multiple machines in order to solve a problem. 2. Scalable - if needed, it should be very easy to add more machines. 3. Reliable - if some of the machines fail, it should still work fine. These are the three criteria for all the projects or components to be under Apache Hadoop.
Hadoop is written in Java so that it can run on all kinds of devices.
Apache Hadoop is a suite of components. Let us take a look at each of these components briefly. We will cover the details in depth during the full course.
HDFS, or Hadoop Distributed File System, is the most important component because the entire ecosystem depends upon it. It is based on the Google File System.
It is basically a file system which runs on many computers to provide humongous storage. If you want to store petabytes of data in the form of files, you can use HDFS.
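As a hedged sketch, assuming a pyarrow installation with libhdfs available, and a namenode host and path that are purely hypothetical, HDFS can be accessed from Python roughly like this:

from pyarrow import fs

# Connect to the HDFS namenode (host and port are assumptions for this sketch)
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Write a small file into HDFS
with hdfs.open_output_stream("/user/demo/hello.txt") as f:
    f.write(b"hello from HDFS\n")

# List the directory to confirm the file is there
for info in hdfs.get_file_info(fs.FileSelector("/user/demo")):
    print(info.path, info.size)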
YARN, or Yet Another Resource Negotiator, keeps track of all the resources (CPU, memory) of the machines in the network and runs the applications. Any application which wants to run in a distributed fashion would interact with YARN.
HBase provides humongous storage in the form of a database table. So, to manage a humongous number of records, you would like to use HBase. HBase is a kind of NoSQL datastore.
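A minimal sketch of talking to HBase from Python, assuming the happybase client library and an HBase Thrift server; the host, table and column names are made up for this example:

import happybase

# Connect to HBase via its Thrift gateway (host is an assumption for this sketch)
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("users")

# Store and read back a single row; the "profile" column family is hypothetical
table.put(b"user-1001", {b"profile:name": b"Alice", b"profile:country": b"IN"})
print(table.row(b"user-1001"))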
MapReduce is a framework for distributed computing. It utilizes
YARN to execute programs and has a very good sorting engine.
You write your programs in two parts: Map and Reduce. The Map part transforms the raw data into key-value pairs and the Reduce part groups and combines data based on the key. We will learn MapReduce in detail later.
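To make the model concrete, here is a tiny pure-Python word-count sketch of the same two phases. It only illustrates the programming model; it is not Hadoop's actual Java API.

from itertools import groupby
from operator import itemgetter

lines = ["big data is big", "data needs distributed computing"]

# Map: transform raw lines into (key, value) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: bring identical keys together (Hadoop does this between the phases)
mapped.sort(key=itemgetter(0))

# Reduce: combine all values for each key
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)   # e.g. {'big': 2, 'data': 2, ...}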
Spark is another computational framework similar to MapReduce
but faster and more recent. It uses similar constructs as MapReduce to solve
big data problems.
Spark has its own huge stack. We will cover it in detail soon.
Writing code in MapReduce is very time-consuming. So, Apache Hive makes it possible to write your logic in SQL, which it internally converts into MapReduce. So, you can process humongous structured or semi-structured data with simple SQL using Hive.
Pig Latin is a simplified, SQL-like language to express your ETL needs in a stepwise fashion. Pig is the engine that translates Pig Latin into MapReduce and executes it on Hadoop.
Mahout is a library of machine learning algorithms that run in a distributed fashion. Since machine learning algorithms are complex and time-consuming, Mahout breaks down the work such that it gets executed on MapReduce running on many machines.
Apache Zookeeper is an independent component which is used by
various distributed frameworks such as HDFS, HBase, Kafka, YARN. It is used for
the coordination between various components. It provides a distributed
configuration service, synchronization service, and naming registry for large
distributed systems.
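As a small example of the kind of coordination ZooKeeper provides, here is a hedged sketch using the kazoo Python client; the ensemble address and znode path are assumptions for this example.

from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is an assumption for this sketch)
zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

# Store a small piece of shared configuration in a znode
zk.ensure_path("/demo/config")
zk.set("/demo/config", b"max_workers=8")

# Any other process connected to the same ensemble can now read it
value, stat = zk.get("/demo/config")
print(value, stat.version)

zk.stop()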
Flume makes it possible to continuously pump unstructured data from many sources into a central store such as HDFS.
If you have many machines continuously generating data such as webserver logs, you can use Flume to aggregate the data at a central place such as HDFS.
Sqoop is used to transport data between Hadoop and SQL
Databases. Sqoop utilizes MapReduce to efficiently transport data using many
machines in a network.
Since a project might involve many components, there is a need for a workflow engine to execute the work in sequence.
For example, a typical project might involve importing data from SQL Server, running some Hive queries, doing predictions with Mahout, and saving the data back to an SQL Server.
This kind of workflow can be easily accomplished with Oozie.
A user can talk to the various components of Hadoop using the command line interface, the web interface, APIs, or Oozie.
We will cover each of these components in detail later.
Apache Spark is a fast and general
engine for large-scale data processing.
It is around 100 times faster than MapReduce when using only RAM and 10 times faster when using the disk.
It builds upon similar paradigms as MapReduce.
It is well integrated with Hadoop as it can run on top of YARN
and can access HDFS.
A cluster resource manager, or simply resource manager, is a software component which manages the various resources, such as memory, disk and CPU, of the machines connected in a cluster.
Apache Spark can run on top of many cluster resource managers such as YARN, Amazon EC2 or Mesos. If you don't have any resource manager yet, you can use Apache Spark in standalone mode.
Instead of building its own file or data storage, Apache Spark makes it possible to read from all kinds of data sources: Hadoop Distributed File System, HBase, Hive, Tachyon and Cassandra.
Apache Spark comes with a great set of libraries. Data frames provide a generic way to represent data in a tabular structure. Data frames make it possible to query data using R or SQL instead of writing tonnes of code.
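A brief PySpark sketch of what this looks like in practice; the file sales.csv and its columns are assumptions for the example.

from pyspark.sql import SparkSession

# Standalone/local mode is enough to try this out
spark = SparkSession.builder.master("local[*]").appName("dataframes").getOrCreate()

# Load a CSV into a data frame; column names are taken from the header
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Query it with the DataFrame API ...
sales.groupBy("country").sum("amount").show()

# ... or with plain SQL
sales.createOrReplaceTempView("sales")
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()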
The Streaming library makes it possible to process fast-incoming streams of huge data using Spark.
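For example, here is a hedged Structured Streaming sketch that counts words arriving on a socket; the host and port are assumptions, and the older DStream API follows similar lines.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming").getOrCreate()

# Read a continuous stream of text lines from a socket (e.g. started with `nc -lk 9999`)
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Count words in the stream as new data keeps arriving
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()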
MLLib is a very rich machine learning library.
It provides very sophisticated algorithms which run in distributed fashion.
GraphX makes it very simple to represent huge data as a graph. It provides a library of algorithms to process graphs using multiple computers.
Spark and its libraries can be used with Scala,
Java, Python, R, and SQL. The only exception is GraphX
which can only be used with Scala and Java.
With this set of libraries, it is possible to do ETL, machine learning, real-time data processing and graph processing on Big Data.
We will cover each component in detail as we go forward.