Big Data Analytics Options on AWS January 2016
Amazon Web Services – Big Data Analytics Options on AWS January 2016

© 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS’s current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

Contents

Abstract
Introduction
The AWS Advantage in Big Data Analytics
Amazon Kinesis Streams
AWS Lambda
Amazon EMR
Amazon Machine Learning
Amazon DynamoDB
Amazon Redshift
Amazon Elasticsearch Service
Amazon QuickSight
Amazon EC2
Solving Big Data Problems on AWS
Example 1: Enterprise Data Warehouse
Example 2: Capturing and Analyzing Sensor Data
Example 3: Sentiment Analysis of Social Media
Conclusion
Contributors
Further Reading
Document Revisions
Notes

Abstract

This whitepaper helps architects, data scientists, and developers understand the big data analytics options available in the AWS cloud by providing an overview of services, with the following information:

• Ideal usage patterns
• Cost model
• Performance
• Durability and availability
• Scalability and elasticity
• Interfaces
• Anti-patterns

This paper concludes with scenarios that showcase the analytics options in use, as well as additional resources for getting started with big data analytics on AWS.

Introduction

As we become a more digital society, the amount of data being created and collected is growing at an accelerating rate. Analyzing this ever-growing data is a challenge for traditional analytical tools. We require innovation to bridge the gap between the data being generated and the data that can be analyzed effectively.

Big data tools and technologies bring both opportunities and challenges: analyzing data efficiently helps you better understand customer preferences, gain a competitive advantage in the marketplace, and grow your business. Data management architectures have evolved from the traditional data warehousing model to more complex architectures that address more requirements, such as real-time and batch processing; structured and unstructured data; high-velocity transactions; and so on.

Amazon Web Services (AWS) provides a broad platform of managed services to help you build, secure, and seamlessly scale end-to-end big data applications quickly and with ease. Whether your applications require real-time streaming or batch data processing, AWS provides the infrastructure and tools to tackle your
next big data project. There is no hardware to procure and no infrastructure to maintain and scale; you use only what you need to collect, store, process, and analyze big data. AWS has an ecosystem of analytical solutions specifically designed to handle this growing amount of data and provide insight into your business.

The AWS Advantage in Big Data Analytics

Analyzing large data sets requires significant compute capacity that can vary in size based on the amount of input data and the type of analysis. This characteristic of big data workloads is ideally suited to the pay-as-you-go cloud computing model, where applications can easily scale up and down based on demand. As requirements change, you can easily resize your environment (horizontally or vertically) on AWS to meet your needs, without having to wait for additional hardware or over-invest to provision enough capacity.

For mission-critical applications on a more traditional infrastructure, system designers have no choice but to over-provision, because the system must be able to handle a surge in data when business needs increase. By contrast, on AWS you can provision more capacity and compute in a matter of minutes, meaning that your big data applications grow and shrink as demand dictates, and your system runs as close to optimal efficiency as possible.

In addition, you get flexible computing on a global infrastructure with access to the many different geographic regions [1] that AWS offers, along with the ability to use other scalable services that help you build sophisticated big data applications. These other services include Amazon Simple Storage Service (Amazon S3) [2] to store data, AWS Data Pipeline [3] to orchestrate jobs that move and transform that data easily, and AWS IoT [4], which lets connected devices interact with cloud applications and other connected devices.
In addition, AWS has many options to help get data into the cloud, including secure devices like AWS Import/Export Snowball [5] to accelerate petabyte-scale data transfers, Amazon Kinesis Firehose [6] to load streaming data, and scalable private connections through AWS Direct Connect [7]. As mobile usage continues to grow rapidly, you can use the suite of services within the AWS Mobile
Hub [8] to collect and measure app usage and data or export that data to another service for further custom analysis.

These capabilities of the AWS platform make it an ideal fit for solving big data problems, and many customers have implemented successful big data analytics workloads on AWS. For case studies, see “Big Data & HPC. Powered by the AWS Cloud.” [9]

The following services are described in order, from collecting and processing to storing and analyzing big data:

• Amazon Kinesis Streams
• AWS Lambda
• Amazon Elastic MapReduce
• Amazon Machine Learning
• Amazon DynamoDB
• Amazon Redshift
• Amazon Elasticsearch Service
• Amazon QuickSight

In addition, Amazon EC2 instances are available for self-managed big data applications.

Amazon Kinesis Streams

Amazon Kinesis Streams [10] enables you to build custom applications that process or analyze streaming data in real time. Amazon Kinesis Streams can continuously capture and store terabytes of data per hour from hundreds of thousands of sources, such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.

With the Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis applications and use streaming data to power real-time dashboards, generate alerts, and implement dynamic pricing and advertising. You can also emit data from Amazon Kinesis Streams to other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR), and AWS Lambda.

Provision the level of input and output required for your data stream, in blocks of 1 megabyte per second (MB/sec), using the AWS Management Console, API [11], or SDKs [12]. The size of your stream can be adjusted up or down at any time without restarting the stream and without any impact on the data sources pushing data to the stream. Within seconds, data put into a stream is available for analysis.

Stream data is stored across multiple Availability Zones in a region for 24 hours. During that window, data is available to be read, re-read, backfilled, and analyzed, or moved to long-term storage (such as Amazon S3 or Amazon Redshift). The KCL enables developers to focus on creating their business applications while removing the undifferentiated heavy lifting associated with load-balancing streaming data, coordinating distributed services, and fault-tolerant data processing.

Ideal Usage Patterns

Amazon Kinesis Streams is useful wherever there is a need to move data rapidly off producers (data sources) and continuously process it. That processing can transform the data before emitting it into another data store, drive real-time metrics and analytics, or derive and aggregate multiple streams into more complex streams or feed downstream processing. The following are typical scenarios for using Amazon Kinesis Streams for analytics.

• Real-time data analytics – Amazon Kinesis Streams enables real-time data analytics on streaming data, such as analyzing website clickstream data and customer engagement.
• Log and data feed intake and processing – With Amazon Kinesis Streams, you can have producers push data directly into an Amazon Kinesis stream. For example, you can submit system and application logs to Amazon Kinesis Streams and access the stream for processing within seconds.
This prevents the log data from being lost if the front-end or application server fails, and reduces local log storage on the source. Amazon Kinesis Streams provides accelerated data intake because you are not batching up the data on the servers before you submit it for intake.
• Real-time metrics and reporting – You can use data ingested into Amazon Kinesis Streams to extract metrics and generate KPIs that power reports and dashboards in real time. This enables data-processing application logic to work on data as it streams in continuously, rather than waiting for data batches to arrive.
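
The log-intake scenario above can be sketched in Python. This is a minimal sketch under stated assumptions, not a reference producer: it assumes a client object that exposes the `put_record(StreamName=..., Data=..., PartitionKey=...)` call of the AWS SDK for Python (boto3) Kinesis client. The stream name `app-logs`, the JSON record layout, and the names `build_record`, `send_log_line`, and `RecordingClient` are hypothetical and exist only for illustration.

```python
import json
import time


def build_record(host, line, now=None):
    """Serialize one log line as a UTF-8 JSON payload (hypothetical layout)."""
    payload = {
        "host": host,
        "ts": time.time() if now is None else now,
        "line": line.rstrip("\n"),
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def send_log_line(client, stream_name, host, line, now=None):
    """Push a single log line to the stream as soon as it is produced.

    `client` is assumed to behave like boto3.client("kinesis"). The host
    name is used as the partition key, so one host's lines map to the same
    shard; that is an illustrative choice, not a recommendation.
    """
    return client.put_record(
        StreamName=stream_name,
        Data=build_record(host, line, now),
        PartitionKey=host,
    )


class RecordingClient:
    """Stand-in client so the sketch runs without AWS credentials."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        # Remember the request and return a made-up response for the demo.
        self.calls.append(kwargs)
        return {
            "ShardId": "shardId-000000000000",
            "SequenceNumber": str(len(self.calls)),
        }


if __name__ == "__main__":
    client = RecordingClient()  # swap in boto3.client("kinesis") for a real stream
    send_log_line(client, "app-logs", "web-01", "GET /index.html 200\n")
    print(client.calls[0]["PartitionKey"])
```

Sending each line as it is written, rather than accumulating a local batch, is what keeps the log data off the server if that server later fails.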

Cost Model

Amazon Kinesis Streams has simple pay-as-you-go pricing, with no up-front costs or minimum fees, and you pay only for the resources you consume. An Amazon Kinesis stream is made up of one or more shards. Each shard gives you a capacity of 5 read transactions per second, up to a maximum total of 2 MB of data read per second. Each shard can support up to 1,000 write transactions per second, up to a maximum total of 1 MB of data written per second.

The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacity of each shard. There are just two pricing components: an hourly charge per shard and a charge for each 1 million PUT transactions. For more information, see Amazon Kinesis Streams Pricing [13]. Applications that run on Amazon EC2 and process Amazon Kinesis streams also incur standard Amazon EC2 costs.

Performance

Amazon Kinesis Streams allows you to choose the throughput capacity you require in terms of shards. With each shard in an Amazon Kinesis stream, you can capture up to 1 megabyte per second of data at 1,000 write transactions per second. Your Amazon Kinesis applications can read data from each shard at up to 2 megabytes per second. You can provision as many shards as you need to get the throughput capacity you want; for instance, a 1 gigabyte per second data stream would require 1,024 shards (1,024 MB per second at 1 MB per second per shard).

Durability and Availability

Amazon Kinesis Streams synchronously replicates data across three Availability Zones in an AWS Region, providing high availability and data durability. Additionally, you can store a cursor in DynamoDB to durably track what has been read from an Amazon Kinesis stream.
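
The cursor idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the KCL's implementation: a Python list stands in for one shard's records, a dictionary stands in for the DynamoDB table that would hold the cursor, and the names `CheckpointStore` and `process_shard` are hypothetical. It shows only the pattern of recording the position of the last processed record and resuming from the next one.

```python
class CheckpointStore:
    """In-memory stand-in for a durable cursor table (DynamoDB in practice)."""

    def __init__(self):
        self._cursors = {}

    def load(self, shard_id):
        # Index of the last processed record, or -1 if nothing was processed.
        return self._cursors.get(shard_id, -1)

    def save(self, shard_id, position):
        self._cursors[shard_id] = position


def process_shard(shard_id, records, store, handle, fail_at=None):
    """Process records after the saved cursor, checkpointing after each one.

    `fail_at` simulates a crash just before the record at that index is handled.
    """
    start = store.load(shard_id) + 1
    for position in range(start, len(records)):
        if position == fail_at:
            raise RuntimeError("simulated worker crash")
        handle(records[position])
        store.save(shard_id, position)


if __name__ == "__main__":
    store = CheckpointStore()
    seen = []
    records = ["r0", "r1", "r2", "r3"]
    try:
        process_shard("shard-0", records, store, seen.append, fail_at=2)
    except RuntimeError:
        pass  # the worker died after handling r0 and r1
    process_shard("shard-0", records, store, seen.append)  # restarted worker
    print(seen)
```

In this sketch the cursor is saved after every record. If a worker were to fail between handling a record and saving the cursor, that record would be handled again after the restart, so a handler written this way should tolerate duplicates.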
In the event that your application fails in the middle of reading data from the stream, you can restart your application and use the cursor to pick up from the exact spot where the failed application left off.

Scalability and Elasticity

You can increase or decrease the capacity of the stream at any time according to your business or operational needs, without any interruption to ongoing stream processing. By using API calls or development tools, you can automate scaling of