Big Data Analytics Options on AWS
January 2016
© 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Notices
This document is provided for informational purposes only. It represents AWS’s
current product offerings and practices as of the date of issue of this document,
which are subject to change without notice. Customers are responsible for
making their own independent assessment of the information in this document
and any use of AWS’s products or services, each of which is provided “as is”
without warranty of any kind, whether express or implied. This document does
not create any warranties, representations, contractual commitments, conditions
or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities
and liabilities of AWS to its customers are controlled by AWS agreements, and
this document is not part of, nor does it modify, any agreement between AWS
and its customers.
Contents
Abstract
Introduction
The AWS Advantage in Big Data Analytics
Amazon Kinesis Streams
AWS Lambda
Amazon EMR
Amazon Machine Learning
Amazon DynamoDB
Amazon Redshift
Amazon Elasticsearch Service
Amazon QuickSight
Amazon EC2
Solving Big Data Problems on AWS
Example 1: Enterprise Data Warehouse
Example 2: Capturing and Analyzing Sensor Data
Example 3: Sentiment Analysis of Social Media
Conclusion
Contributors
Further Reading
Document Revisions
Notes
Abstract
This whitepaper helps architects, data scientists, and developers understand the
big data analytics options available in the AWS cloud by providing an overview of
services, with the following information:
• Ideal usage patterns
• Cost model
• Performance
• Durability and availability
• Scalability and elasticity
• Interfaces
• Anti-patterns
This paper concludes with scenarios that showcase the analytics options in use, as
well as additional resources for getting started with big data analytics on AWS.
Introduction
As we become a more digital society, the amount of data being created and
collected is growing and accelerating significantly. Analysis of this ever-growing
data becomes a challenge with traditional analytical tools. We require innovation
to bridge the gap between data being generated and data that can be analyzed
effectively.
Big data tools and technologies offer opportunities and challenges in being able
to analyze data efficiently to better understand customer preferences, gain a
competitive advantage in the marketplace, and grow your business. Data
management architectures have evolved from the traditional data warehousing
model to more complex architectures that address more requirements, such as
real-time and batch processing; structured and unstructured data; high-velocity
transactions; and so on.
Amazon Web Services (AWS) provides a broad platform of managed services to
help you build, secure, and seamlessly scale end-to-end big data applications
quickly and with ease. Whether your applications require real-time streaming or
batch data processing, AWS provides the infrastructure and tools to tackle your
next big data project. There is no hardware to procure and no infrastructure to
maintain and scale; you pay for only what you need to collect, store, process, and analyze big data. AWS
has an ecosystem of analytical solutions specifically designed to handle this
growing amount of data and provide insight into your business.
The AWS Advantage in Big Data Analytics
Analyzing large data sets requires significant compute capacity that can vary in
size based on the amount of input data and the type of analysis. This
characteristic of big data workloads is ideally suited to the pay-as-you-go cloud
computing model, where applications can easily scale up and down based on
demand. As requirements change, you can easily resize your environment
(horizontally or vertically) on AWS to meet your needs, without having to wait for
additional hardware or being required to over invest to provision enough
capacity.
For mission-critical applications on a more traditional infrastructure, system
designers have no choice but to over-provision, because a surge in additional data
due to an increase in business need must be something the system can handle. By
contrast, on AWS you can provision more capacity and compute in a matter of
minutes, meaning that your big data applications grow and shrink as demand
dictates, and your system runs as close to optimal efficiency as possible.
In addition, you get flexible computing on a global infrastructure with access to
the many different geographic regions1 that AWS offers, along with the ability to
use other scalable services that augment it to help you build sophisticated big data
applications. These other services include Amazon Simple Storage Service
(Amazon S3)2 to store data, AWS Data Pipeline3 to orchestrate jobs to move
and transform that data easily, and AWS IoT,4 which lets connected devices interact
with cloud applications and other connected devices.
In addition, AWS has many options to help get data into the cloud, including
secure devices like AWS Import/Export Snowball5 to accelerate petabyte-scale
data transfers, Amazon Kinesis Firehose6 to load streaming data, and scalable
private connections through AWS Direct Connect.7 As mobile continues to
rapidly grow in usage, you can use the suite of services within the AWS Mobile
Hub8 to collect and measure app usage and data or export that data to another
service for further custom analysis.
These capabilities of the AWS platform make it an ideal fit for solving big data
problems, and many customers have implemented successful big data analytics
workloads on AWS. For more information about case studies, see Big Data &
HPC. Powered by the AWS Cloud.9
The following services are described in the order in which they are typically used
to collect, process, store, and analyze big data:
• Amazon Kinesis Streams
• AWS Lambda
• Amazon Elastic MapReduce
• Amazon Machine Learning
• Amazon DynamoDB
• Amazon Redshift
• Amazon Elasticsearch Service
• Amazon QuickSight
In addition, Amazon EC2 instances are available for self-managed big data
applications.
Amazon Kinesis Streams
Amazon Kinesis Streams10 enables you to build custom applications that process
or analyze streaming data in real time. Amazon Kinesis Streams can continuously
capture and store terabytes of data per hour from hundreds of thousands of
sources, such as website clickstreams, financial transactions, social media feeds,
IT logs, and location-tracking events.
With the Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis
applications and use streaming data to power real-time dashboards, generate
alerts, and implement dynamic pricing and advertising. You can also emit data
from Amazon Kinesis Streams to other AWS services such as Amazon Simple
Storage Service (Amazon S3), Amazon Redshift, Amazon Elastic MapReduce
(Amazon EMR), and AWS Lambda.
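As a minimal producer sketch, assuming Python with the boto3 SDK and a hypothetical stream named clickstream-events, an application can push individual records into a stream as events occur (a real producer would add error handling and could use the PutRecords batch call where appropriate):

import json
import boto3

# Hypothetical stream name; the stream must already exist in your account and region.
STREAM_NAME = "clickstream-events"

kinesis = boto3.client("kinesis")

def put_click_event(user_id, page):
    """Write one clickstream event to the stream as it happens."""
    event = {"user_id": user_id, "page": page}
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),  # record payload (up to 1 MB)
        PartitionKey=user_id,                    # determines which shard receives the record
    )

put_click_event("user-123", "/checkout")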
Provision the level of input and output required for your data stream, in blocks of
1 megabyte per second (MB/sec), using the AWS Management Console, API,11 or
SDKs.12 The size of your stream can be adjusted up or down at any time without
restarting the stream and without any impact on the data sources pushing data to
the stream. Within seconds, data put into a stream is available for analysis.
Stream data is stored across multiple Availability Zones in a region for 24 hours.
During that window, data is available to be read, re-read, backfilled, and
analyzed, or moved to long-term storage (such as Amazon S3 or Amazon
Redshift). The KCL enables developers to focus on creating their business
applications while removing the undifferentiated heavy lifting associated with
load-balancing streaming data, coordinating distributed services, and fault-
tolerant data processing.
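For production consumers the KCL is the recommended approach; purely to illustrate the underlying API, the following sketch (again assuming Python with boto3 and the same hypothetical stream) polls a single shard with GetShardIterator and GetRecords:

import time
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "clickstream-events"   # hypothetical stream name

# Look up one shard and start reading from the oldest available record
# (TRIM_HORIZON). A production consumer would use the KCL instead, which
# handles multiple shards, load balancing, and checkpointing for you.
stream = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]
shard_id = stream["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while True:
    result = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in result["Records"]:
        print(record["Data"])            # process each record here
    iterator = result["NextShardIterator"]
    time.sleep(1)                        # stay within the per-shard read limits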
Ideal Usage Patterns
Amazon Kinesis Streams is useful wherever there is a need to move data rapidly
off producers (data sources) and continuously process it. That processing can
transform the data before emitting it to another data store, drive real-time
metrics and analytics, or derive and aggregate multiple streams into more
complex streams for downstream processing. The following are typical scenarios
for using Amazon Kinesis Streams for analytics.
• Real-time data analytics – Amazon Kinesis Streams enables real-time
data analytics on streaming data, such as analyzing website clickstream
data and customer engagement analytics.
• Log and data feed intake and processing – With Amazon Kinesis
Streams, you can have producers push data directly into an Amazon
Kinesis stream. For example, you can submit system and application logs
to Amazon Kinesis Streams and access the stream for processing within
seconds. This prevents the log data from being lost if the front-end or
application server fails, and reduces local log storage on the source.
Amazon Kinesis Streams provides accelerated data intake because you are
not batching up the data on the servers before you submit it for intake.
• Real-time metrics and reporting – You can use data ingested into
Amazon Kinesis Streams for extracting metrics and generating KPIs to
power reports and dashboards at real-time speeds. This enables data-
processing application logic to work on data as it is streaming in
continuously, rather than waiting for data batches to arrive (a minimal
sketch of this pattern follows this list).
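As a simple illustration of the real-time metrics scenario, the sketch below (plain Python, with an assumed record format that carries an epoch-seconds timestamp field) keeps a rolling per-minute count that is updated as each record arrives rather than when a batch lands:

from collections import Counter

# Rolling per-minute event counts, updated as records stream in rather than
# after a batch lands. The record format (a dict with a 'timestamp' field in
# epoch seconds) is assumed purely for illustration.
counts_per_minute = Counter()

def update_metrics(record):
    """Fold one incoming record into the rolling KPI."""
    minute = int(record["timestamp"]) // 60
    counts_per_minute[minute] += 1
    return counts_per_minute[minute]     # current count for that minute

A dashboard or alerting job can then read counts_per_minute continuously instead of waiting for a periodic batch report.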
Cost Model
Amazon Kinesis Streams has simple pay-as-you-go pricing, with no up-front
costs or minimum fees, and you pay only for the resources you consume. An
Amazon Kinesis stream is made up of one or more shards. Each shard gives you a
capacity of 5 read transactions per second, up to a maximum total of 2 MB of data
read per second, and can support up to 1,000 write transactions per
second, up to a maximum total of 1 MB of data written per second.
The data capacity of your stream is a function of the number of shards that you
specify for the stream. The total capacity of the stream is the sum of the capacity
of each shard. There are just two pricing components: an hourly charge per shard
and a charge for each 1 million PUT transactions. For more information, see
Amazon Kinesis Streams Pricing.13 Applications that run on Amazon EC2 and
process Amazon Kinesis streams also incur standard Amazon EC2 costs.
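As a rough sizing illustration based on the per-shard figures above, the following sketch (Python; the function name and parameters are hypothetical, and prices should come from the pricing page rather than being hard-coded) derives the shard count from the target write throughput and estimates the two pricing components:

import math

def estimate_stream(write_mb_per_sec, write_records_per_sec,
                    shard_price_per_hour, price_per_million_puts,
                    puts_per_month):
    """Rough shard count and monthly cost estimate for a stream.

    Capacity figures reflect the per-shard limits described above
    (1 MB/sec and 1,000 write transactions/sec per shard); plug in the
    current prices from the Amazon Kinesis Streams pricing page.
    """
    shards = max(
        int(math.ceil(write_mb_per_sec / 1.0)),          # 1 MB/sec written per shard
        int(math.ceil(write_records_per_sec / 1000.0)),  # 1,000 PUTs/sec per shard
    )
    monthly_shard_cost = shards * shard_price_per_hour * 24 * 30
    monthly_put_cost = (puts_per_month / 1000000.0) * price_per_million_puts
    return shards, monthly_shard_cost + monthly_put_cost

# For example, ingesting 100 MB/sec at 50,000 records/sec needs
# max(100, 50) = 100 shards.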
Performance
Amazon Kinesis Streams allows you to choose the throughput capacity you require in
terms of shards. With each shard in an Amazon Kinesis stream, you can capture
up to 1 megabyte per second of data at 1,000 write transactions per second. Your
Amazon Kinesis applications can read data from each shard at up to 2 megabytes
per second. You can provision as many shards as you need to get the throughput
capacity you want; for instance, a 1 gigabyte per second data stream would
require 1,024 shards.
Durability and Availability
Amazon Kinesis Streams synchronously replicates data across three Availability
Zones in an AWS Region, providing high availability and data durability.
Additionally, you can store a cursor in DynamoDB to durably track what has been
read from an Amazon Kinesis stream. In the event that your application fails in
the middle of reading data from the stream, you can restart your application and
use the cursor to pick up from the exact spot where the failed application left off.
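The KCL manages such a checkpoint table for you; purely to illustrate the idea, the following sketch (Python with boto3 and a hypothetical DynamoDB table named stream-checkpoints, keyed by shard ID) stores and retrieves the last-processed sequence number per shard:

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table keyed by 'shard_id'; the KCL maintains an equivalent
# lease/checkpoint table for you automatically, so this is only illustrative.
checkpoints = dynamodb.Table("stream-checkpoints")

def save_checkpoint(shard_id, sequence_number):
    """Record the last sequence number successfully processed for a shard."""
    checkpoints.put_item(Item={"shard_id": shard_id,
                               "sequence_number": sequence_number})

def load_checkpoint(shard_id):
    """Return the saved sequence number, or None if none has been stored."""
    item = checkpoints.get_item(Key={"shard_id": shard_id}).get("Item")
    return item["sequence_number"] if item else None

On restart, the saved sequence number can be supplied to GetShardIterator with the AFTER_SEQUENCE_NUMBER iterator type so that reading resumes exactly where the failed application left off.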
Scalability and Elasticity
You can increase or decrease the capacity of the stream at any time according to
your business or operational needs, without any interruption to ongoing stream
processing. By using API calls or development tools, you can automate scaling of