Big Data Introduction

Data is broadly classified as structured, semi-structured, and unstructured.

If we know the fields as well as their datatypes, we call the data structured. The data in relational databases such as MySQL, Oracle, or Microsoft SQL Server is an example of structured data.

If we know the fields or columns but not their datatypes, we call the data semi-structured. For example, data in CSV (comma-separated values) files is semi-structured.

If our data doesn't have columns or fields, we call it unstructured data. Plain text files or logs generated on a server are examples of unstructured data.

The process of translating unstructured data into structured data is known as ETL - Extract, Transform, and Load.

What is a Distributed System?

When networked computers are utilized to achieve a common goal, it is known as a distributed system. The work gets distributed amongst many computers. The branch of computing that studies distributed systems is known as distributed computing. The purpose of distributed computing is to get the work done faster by utilizing many computers.

Most, but not all, tasks can be performed using distributed computing.

What is Big Data?

In very simple words, Big Data is data of such a large size that it cannot be processed with the usual tools. To process such data, we need a distributed architecture. This data could be structured or unstructured.

Generally, we classify the problems related to handling data into three buckets.

Volume: When the problem we are solving is about how to store huge amounts of data, we call it a Volume problem. For example, Facebook handles more than 500 TB of data per day and has 300 PB of data in storage.

Velocity: When we are trying to handle many requests per second, we call this characteristic Velocity. The number of requests received by Facebook or Google per second is an example of Big Data due to Velocity.

Variety: If the problem at hand is complex, or the data we are processing is complex, we call it a Variety problem.

Imagine you have to find the fastest route on a map. Since the problem involves enumerating many possibilities, it is a complex problem even though the map itself is not very large.

Data can be termed Big Data if any of Volume, Velocity, or Variety becomes impossible to handle using traditional tools.

Why do we need big data now?

Paper, tapes, etc. are analog storage, while CDs, DVDs, and hard disk drives are considered digital storage.

Digital storage started increasing exponentially after 2002, while analog storage remained practically the same.

The year 2002 is therefore called the beginning of the digital age.

Why so? The answer is twofold: devices and connectivity. On one hand, devices became cheaper, faster, and smaller; the smartphone is a great example. On the other hand, connectivity improved: we now have Wi-Fi, 4G, Bluetooth, NFC, and so on.

This led to a lot of very useful applications such as a vibrant World Wide Web, social networks, and the Internet of Things, all of which generate huge amounts of data.

Roughly, a computer is made of four components. The first is the CPU, which executes instructions. A CPU is characterized by its speed: the more instructions it can execute per second, the faster it is considered.

Then comes RAM, or random access memory. While processing, we load data into RAM. If we can load more data into RAM, the CPU can perform better. So, RAM has two main attributes that matter: its size and its speed of reading and writing.

To permanently store data, we need a hard disk drive or a solid state drive. An SSD is faster but smaller and costlier. The faster and bigger the disk, the faster we can process data.

Another component that we frequently forget while thinking about the speed of computation is the network. Why? Often our data is stored on different machines, and we need to read it over the network to process it.

While processing Big Data, at least one of these four components becomes the bottleneck. That's when we need to move to multiple computers, that is, a distributed computing architecture.

 

Big Data applications - recommendations

So far we have tried to establish that while handling humongous data, we need a new set of tools which can operate in a distributed fashion.

But who would be generating such data, and who would need to process it? The quick answer is: everyone.

Now, let us take a few examples.

In the e-commerce industry, recommendations are a great example of Big Data processing. Recommendation, also known as collaborative filtering, is the process of suggesting a product to someone based on their preferences or behavior.

The e-commerce website gathers a lot of data about customer behavior. In a very simplistic algorithm, we would basically try to find similar users and then cross-suggest products between them. So, the more the users, the better the results.

As per Amazon, a major chunk of their sales happens via recommendations on the website and in email. The other big example of Big Data processing was the Netflix one-million-dollar competition to generate movie recommendations.

As of today, generating recommendations has become pretty simple. Engines such as MLlib or Mahout make it easy to generate recommendations on humongous data. All you have to do is format the data in a three-column format: user id, movie id, and rating.
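For instance, here is a minimal sketch of training a collaborative-filtering model with Spark MLlib's ALS algorithm; the ratings.csv file, the user id, and the rank and iteration values are hypothetical.

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="RecommendationSketch")

# Hypothetical input file with lines like "userId,movieId,rating".
lines = sc.textFile("ratings.csv")
ratings = (lines.map(lambda line: line.split(","))
                .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2]))))

# Train a collaborative-filtering model with ALS (rank and iterations are illustrative).
model = ALS.train(ratings, rank=10, iterations=10)

# Recommend 5 products for a hypothetical user with id 42.
for rec in model.recommendProducts(42, 5):
    print(rec)
```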

 

Big Data applications - A/B testing

A/B testing is a process used in drug trials to identify whether a drug is effective or not.

A similar process is used to identify whether a particular product feature is effective or not. A randomly selected half of the users is shown variation A and the other half is shown variation B. If variation A gives, say, double the conversions, we can conclude that it is more effective. This method works only if we have a significant number of users. Also, the ratio of users need not be 50-50.
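To decide whether a difference in conversions is real rather than noise, a standard check is a two-proportion z-test. Here is a minimal sketch in plain Python; the conversion and user counts are hypothetical.

```python
import math

def two_proportion_z(conversions_a, users_a, conversions_b, users_b):
    """Return the z-statistic for the difference between two conversion rates."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p = (conversions_a + conversions_b) / (users_a + users_b)
    se = math.sqrt(p * (1 - p) * (1 / users_a + 1 / users_b))
    return (p_a - p_b) / se

# Hypothetical numbers: variation A converts 200 of 10,000 users, B converts 100 of 10,000.
z = two_proportion_z(200, 10_000, 100, 10_000)
print(z)  # |z| > 1.96 roughly corresponds to significance at the 5% level
```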

So, A/B testing utilizes data in product development. At Amazon, we have many such experiments running all the time. Every feature is launched via A/B testing: it is first shown to, say, 1 percent of users, and if it performs well, we increase the percentage.

To manage so many variations on such a high number of users, we generally need Big Data platforms.

Big Data Customers

Government

Since governments have huge amounts of data about citizens, any analysis would be a Big Data analysis. The applications are many.

First is fraud detection. Be it anti-money laundering or user identification, the amount of data processing required is really high.

In cyber security, welfare, and justice, Big Data is being generated and Big Data tools are being adopted.

Telecom

Telecom companies can use Big Data to understand why their customers are leaving and how to prevent them from leaving. This is known as customer churn prevention. The data that could help in churn prevention includes: 1. How many calls did the customer make to the call center? 2. For how long were they out of the coverage area? 3. What was their usage pattern?
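As a rough illustration (not any specific telecom's pipeline), here is a minimal sketch of aggregating such signals into per-customer features that a churn model could consume; the customer ids, event names, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical event records: (customer_id, event_type, value).
events = [
    ("c1", "call_center_calls", 1),
    ("c1", "out_of_coverage_minutes", 30),
    ("c1", "data_usage_mb", 1200),
    ("c2", "call_center_calls", 1),
    ("c2", "call_center_calls", 1),
    ("c2", "data_usage_mb", 50),
]

# Aggregate the raw events into a per-customer feature vector.
features = defaultdict(lambda: defaultdict(float))
for customer_id, event_type, value in events:
    features[customer_id][event_type] += value

for customer_id, feats in features.items():
    print(customer_id, dict(feats))
```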

The other use case is network performance optimization. Based on the history of traffic, telecom companies can forecast network traffic and optimize performance accordingly.

The third most common use case of Big Data in the telecommunications industry is call data record analysis. Since a telecom company has millions of users and each user makes hundreds of calls per day, analysing the call data records is a Big Data problem.

It is also very much possible to predict hardware failures based on the data points recorded when previous failures occurred. A seemingly impossible task becomes possible because of the sheer volume of data.

Healthcare

Healthcare inherently has humongous data and complex problems to solve. Such problems can now be addressed with the new Big Data technologies, as supercomputers alone could not solve most of them.

A few examples of such problems are health information exchange, gene sequencing, healthcare improvements, and drug safety.

Big Data solutions

There are many Big Data Solution stacks.

The first and most powerful stack is Apache Hadoop and Spark together. While Hadoop provides storage for structured and unstructured data, Spark provides the computational capability on top of Hadoop.

The second way could be to use Cassandra or MongoDB.

The third could be to use Google Compute Engine or Microsoft Azure. In such cases, you would have to upload your data to Google or Microsoft, which may not always be acceptable to your organization.

What is Apache Hadoop?

Hadoop was created by Doug Cutting in order to build his search engine called Nutch. He was joined by Mike Cafarella. Hadoop was based on three papers published by Google: the Google File System, Google MapReduce, and Google Bigtable.

It is named after the toy elephant of Doug Cutting's son.

Hadoop is under the Apache license, which means you can use it anywhere without having to worry about licensing.

It is quite powerful, popular and well supported.

It is a framework to handle Big Data.

Started as a single project, Hadoop is now an umbrella of projects. All of the projects under the Apache Hadoop umbrella should have three characteristics:

1. Distributed - they should be able to utilize multiple machines in order to solve a problem.
2. Scalable - if needed, it should be very easy to add more machines.
3. Reliable - if some of the machines fail, it should still work fine.

These are the three criteria for all the projects or components under Apache Hadoop.

Hadoop is written in Java so that it can run on all kinds of devices.

Overview of Apache Hadoop ecosystem

Apache Hadoop is a suite of components. Let us take a brief look at each of these components. We will cover them in depth during the full course.


HDFS

HDFS, or the Hadoop Distributed File System, is the most important component because the entire ecosystem depends upon it. It is based on the Google File System.

It is basically a file system which runs on many computers to provide humongous storage. If you want to store petabytes of data in the form of files, you can use HDFS.
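To get a feel for how data lands in HDFS, here is a minimal sketch that shells out to the standard hdfs dfs command-line tool from Python; the local file and the HDFS directory are hypothetical.

```python
import subprocess

# Create a target directory in HDFS, copy a local file into it, then list it.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo/logs"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "server.log", "/user/demo/logs/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo/logs"], check=True)
```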

YARN

YARN, or Yet Another Resource Negotiator, keeps track of all the resources (CPU, memory) of the machines in the network and runs the applications. Any application which wants to run in a distributed fashion would interact with YARN.

HBase

HBase provides humongous storage in the form of database tables. So, if you need to manage humongous numbers of records, you would want to use HBase.

HBase is a kind of NoSQL datastore.
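To see what talking to HBase looks like from code, here is a minimal sketch using the third-party happybase Python client; the host, table name, column family, and row key are hypothetical, and the table is assumed to already exist with a column family named "info".

```python
import happybase

# Connect to the HBase Thrift server (hypothetical host).
connection = happybase.Connection("hbase-host")
table = connection.table("users")  # hypothetical table with column family "info"

# Write one row and read it back by row key.
table.put(b"user-42", {b"info:name": b"Alice", b"info:city": b"Pune"})
row = table.row(b"user-42")
print(row[b"info:name"])

connection.close()
```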

MapReduce

MapReduce is a framework for distributed computing. It utilizes YARN to execute programs and has a very good sorting engine.

You write your programs in two parts: map and reduce. The map part transforms the raw data into key-value pairs, and the reduce part groups and combines the data based on the keys. We will learn MapReduce in detail later.
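To make the map and reduce parts concrete, here is a minimal word-count sketch written as two small Python scripts in the Hadoop Streaming style, where data flows through stdin and stdout; the file names are hypothetical, and a real job would be submitted through the Hadoop Streaming jar.

```python
# mapper.py - reads raw text lines from stdin and emits tab-separated (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - Hadoop Streaming sorts mapper output by key, so all counts for a word
# arrive consecutively; sum them and emit one line per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```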

Spark

Spark is another computational framework similar to MapReduce but faster and more recent. It uses constructs similar to MapReduce to solve Big Data problems.

Spark has its own huge stack, which we will cover in detail soon.
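For comparison with the MapReduce sketch above, here is roughly the same word count expressed with Spark's RDD API in PySpark; the input path is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCountSketch")

# The same map/reduce idea as before, expressed as RDD transformations.
counts = (sc.textFile("hdfs:///user/demo/logs/server.log")  # hypothetical path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)
```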

Hive

Writing code in MapReduce is very time-consuming. So, Apache Hive makes it possible to write your logic in SQL, which it internally converts into MapReduce. This way, you can process humongous structured or semi-structured data with simple SQL using Hive.
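Hive queries are normally run from the Hive shell or Beeline. As a rough illustration of the same SQL-on-Hadoop idea, here is a sketch that runs a HiveQL-style query from PySpark with Hive support enabled; the web_logs table and its columns are hypothetical.

```python
from pyspark.sql import SparkSession

# Spark with Hive support can read tables registered in the Hive metastore.
spark = (SparkSession.builder
         .appName("HiveQuerySketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical Hive table "web_logs" with columns url and status.
top_errors = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    WHERE status = 500
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
top_errors.show()
```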

Pig (Latin)

Pig Latin is a simplified, SQL-like language for expressing your ETL needs in a stepwise fashion. Pig is the engine that translates Pig Latin into MapReduce and executes it on Hadoop.

Mahout

Mahout is a library of machine learning algorithms that run in a distributed fashion. Since machine learning algorithms are complex and time-consuming, Mahout breaks the work down so that it gets executed via MapReduce on many machines.

ZooKeeper

Apache ZooKeeper is an independent component which is used by various distributed frameworks such as HDFS, HBase, Kafka, and YARN. It is used for coordination between the various components. It provides a distributed configuration service, a synchronization service, and a naming registry for large distributed systems.
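To make the idea of a distributed configuration service and naming registry concrete, here is a minimal sketch using the third-party kazoo Python client for ZooKeeper; the ensemble address and znode paths are hypothetical.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (hypothetical address).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a small piece of shared configuration under a znode path.
zk.ensure_path("/demo/config")
zk.create("/demo/config/db_url", b"jdbc:mysql://db-host/app")

# Any other process connected to the same ensemble can read it back.
value, stat = zk.get("/demo/config/db_url")
print(value.decode(), stat.version)

zk.stop()
```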

Flume

Flume makes it possible to continuously pump unstructured data from many sources to a central store such as HDFS.

If you have many machines continuously generating data such as web server logs, you can use Flume to aggregate the data at a central place such as HDFS.

Sqoop

Sqoop is used to transport data between Hadoop and SQL databases. It utilizes MapReduce to efficiently transfer data using many machines in the network.
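A typical Sqoop import is launched from the command line. Here is a minimal sketch that shells out to it from Python; the MySQL connection string, credentials, table, and HDFS target directory are hypothetical.

```python
import subprocess

# Import a hypothetical "customers" table from MySQL into HDFS using 4 parallel mappers.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/shop",
    "--username", "demo",
    "--password", "secret",        # for real jobs, prefer --password-file
    "--table", "customers",
    "--target-dir", "/user/demo/customers",
    "--num-mappers", "4",
], check=True)
```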

Oozie

Since a project might involve many components, there is a need for a workflow engine to execute the work in sequence.

For example, a typical project might involve importing data from SQL Server, running some Hive queries, doing predictions with Mahout, and saving the data back to SQL Server.

This kind of workflow can be easily accomplished with Oozie.

User Interaction

A user can talk to the various components of Hadoop using the command-line interface, the web interface, APIs, or Oozie.

We will cover each of these components in detail later.

Apache Spark ecosystem walkthrough

Apache Spark is a fast and general engine for large-scale data processing.

It is around 100 times faster than MapReduce when using only RAM, and around 10 times faster when using the disk.

It builds upon paradigms similar to MapReduce.

It is well integrated with Hadoop as it can run on top of YARN and can access HDFS.

Resource Managers

A cluster resource manager, or simply resource manager, is a software component which manages the resources such as memory, disk, and CPU of the machines connected in a cluster.

Apache Spark can run on top of many cluster resource managers such as YARN, Amazon EC2, or Mesos. If you don't have any resource manager yet, you can use Apache Spark in standalone mode.
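As a rough illustration, the resource manager is usually picked with the --master flag when submitting a Spark application; the application script below is hypothetical.

```python
import subprocess

# Submit a (hypothetical) PySpark application; the --master flag picks the resource manager.
# Examples: "local[*]" (no cluster), "yarn", "spark://master-host:7077" (standalone),
# "mesos://master-host:5050".
subprocess.run([
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "recommend.py",                # hypothetical application script
], check=True)
```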

Sources

Instead of building its own file or data storage, Apache Spark makes it possible to read from all kinds of data sources: the Hadoop Distributed File System, HBase, Hive, Tachyon, and Cassandra.

Libraries

Apache Spark comes with a great set of libraries. DataFrames provide a generic way to represent data in a tabular structure. DataFrames make it possible to query data in an R- or SQL-like way instead of writing tonnes of code.
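Here is a minimal sketch of the DataFrame-plus-SQL workflow in PySpark, assuming a hypothetical orders.json file with country and amount columns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

# Load a hypothetical JSON file into a DataFrame and query it with SQL.
orders = spark.read.json("orders.json")        # columns assumed: country, amount
orders.createOrReplaceTempView("orders")

totals = spark.sql("""
    SELECT country, SUM(amount) AS total
    FROM orders
    GROUP BY country
    ORDER BY total DESC
""")
totals.show()
```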

The streaming library makes it possible to process fast-arriving streams of huge data using Spark.
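A minimal Spark Streaming sketch, assuming text lines arriving on a hypothetical TCP socket (for example, one fed by nc -lk 9999):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, batchDuration=5)    # process the stream in 5-second batches

# Count words in each micro-batch arriving on localhost:9999 (hypothetical source).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```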

MLlib is a very rich machine learning library. It provides sophisticated algorithms which run in a distributed fashion.

GraphX makes it very simple to represent huge data as a graph. It provides a library of algorithms to process graphs using multiple computers.

Spark and its libraries can be used with Scala, Java, Python, R, and SQL. The only exception is GraphX, which can only be used with Scala and Java.

With this set of libraries, it is possible to do ETL, machine learning, real-time data processing, and graph processing on Big Data.

We will cover each component in detail as we go forward.

 

