Simplified Big Data Analytics in AWS

Understanding what’s there, and why.

Written by David Linthicum exclusively for Nelson Hilliard

Big data is…well…big. The massive size of these systems, typically over a petabyte, will bog down most traditional approaches to data management. Building them on-premises is priced out of the budgets of most of the Global 2000, as well as government agencies.

Then cloud comes along. Cloud providers such as AWS offer far more cost-effective and powerful ways to support big data and the analytical services that typically surround big data systems. Typically priced on a per-use basis, these cloud services are set to revolutionize the ways that we can understand our own businesses.

This is not just data formatted and restructured to drive some useful reporting, but operational data that provides a real-time look at the state of the business. We also have the ability to link these analytics to dynamic business processes that can make the enterprise more “self healing” or “self optimizing.” That is where the true value exists.

In looking at what AWS has to offer, in terms of supporting big data analytics, the services are somewhat confusing. I spend a great deal of my travel time explaining my understanding of AWS’s big data analytics services line-up to my clients: What’s there, and why. Here is a brain dump that you may find useful.

Data integration is the first problem you need to consider when doing big data analytics on a public cloud, whether it be AWS or any other. Your data needs to flow from operational data stores within your enterprise to your big data systems, which will most likely exist in the cloud.

AWS supports data transfer options, such as AWS Direct Connect, a dedicated network connection between your premises and AWS that can move big data into and out of the cloud. (Keep in mind that all inbound data traffic into AWS is free.) This is a fine approach when higher latency is not an issue with your big data analytics system, meaning that the data does not need to arrive in real time.
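To get a feel for what "not real time" means at this scale, here is a back-of-the-envelope sketch of how long a bulk transfer takes over a dedicated link. The 1 Gbps and 10 Gbps link speeds and the `utilization` parameter are illustrative assumptions, not figures from this article, and real transfers will see protocol overhead on top of this.

```python
PETABYTE = 10**15  # one decimal petabyte, in bytes


def transfer_days(data_bytes: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Estimate the days needed to push data_bytes over a link.

    link_gbps is the nominal link speed in gigabits per second (assumed).
    utilization is the fraction of that speed actually achieved (assumed).
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for gbps in (1, 10):
        print(f"{gbps} Gbps: {transfer_days(PETABYTE, gbps):.1f} days per petabyte")
```

Even at an ideal 10 Gbps, a full petabyte takes more than a week to move, which is why the initial load is usually planned as a batch job rather than something that happens on demand.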

Another middleware-type service to consider for big data analytics is Amazon Kinesis, which is built for real-time work. This is a cloud service for real-time processing of streaming big data. It supports throughput from megabytes to gigabytes of data per second, and it can handle streams from hundreds of thousands of different data sources. Think of running several data streams from multiple data sources within your enterprise to your big data database of choice on AWS.
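One way to picture how a stream absorbs that many sources: each record carries a partition key, Kinesis hashes the key with MD5 into a 128-bit number, and that number decides which shard receives the record. The sketch below imitates that routing locally. It assumes the shards split the hash space evenly, and `shard_for_key` and the `sensor-N` keys are hypothetical names for illustration, not part of any AWS SDK.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2**128  # MD5 digests are 128 bits wide


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index in [0, shard_count).

    Assumes shards own equal, contiguous slices of the hash space.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


if __name__ == "__main__":
    # Simulate 10,000 hypothetical data sources writing to a 4-shard stream.
    load = Counter(shard_for_key(f"sensor-{i}", 4) for i in range(10_000))
    for shard, records in sorted(load.items()):
        print(f"shard {shard}: {records} records")
```

Because the same key always lands on the same shard, records from one source stay in order, while many distinct keys spread the load across shards.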

Moving from the middleware to the actual databases, we have a mix of SQL and NoSQL database technology within the AWS services catalog. Amazon DynamoDB is a managed NoSQL database service that many enterprises have found both valuable and easy to use for big data analytics processing where interactive response time is valued. DynamoDB offers guaranteed throughput and single-digit millisecond latency, which makes it a good fit for big data projects where quick interaction with the data is a must-have, such as supporting mobile computing.
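Guaranteed throughput in DynamoDB is something you size up front, in capacity units. As a rough sketch of the sizing arithmetic under the provisioned capacity model: one read unit covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one write unit covers one write per second of an item up to 1 KB. The function names and the example workload are hypothetical.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one read unit covers up to 4 KB per read
WRITE_UNIT_BYTES = 1024      # one write unit covers up to 1 KB per write


def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """Read units needed to sustain reads_per_second for items of this size."""
    units_per_read = math.ceil(item_size_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total = total / 2  # eventually consistent reads cost half as much
    return math.ceil(total)


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """Write units needed to sustain writes_per_second for items of this size."""
    return math.ceil(item_size_bytes / WRITE_UNIT_BYTES) * writes_per_second


if __name__ == "__main__":
    # Hypothetical mobile backend: 6,000-byte items, 100 reads/s, 50 writes/s.
    print("read units:", read_capacity_units(6000, 100))
    print("write units:", write_capacity_units(6000, 50))
```

Item sizes round up to the next unit boundary, so trimming an item from just over 4 KB to just under it halves the read capacity you have to buy.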

If you’re looking for simplicity, then Amazon RDS is a well-designed managed relational database service that can scale within the AWS cloud. RDS is a good fit for big data systems that need to stay in the relational model and won’t reach petabyte scale (most won’t). If you do need petabyte scale, look at Amazon Redshift (take that, Oracle), a database designed and built specifically to support big data analytics and traditional data warehousing.

Redshift leverages columnar storage technology and distributed queries, both of which should be very familiar to those who manage on-premises data warehouses. Unlike most of those warehouses, however, Redshift costs less than $1,000 per terabyte, per year.
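If columnar storage is new to you, the toy sketch below shows the basic idea: the same records are stored column by column, so an aggregate over one field reads only that field's values. This is an illustration of the concept only, not Redshift's actual storage format, and the table and column names are made up.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 42.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_sum(columns, name):
    """Aggregate one column without touching any of the others."""
    return sum(columns[name])


columns = to_columns(rows)

if __name__ == "__main__":
    # Only the "amount" list is read to answer this query.
    print("total amount:", column_sum(columns, "amount"))
```

Analytic queries tend to scan a few columns across many rows, which is why this layout pays off for data warehousing.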

Finally, Amazon Elastic MapReduce (EMR) is a managed Hadoop framework running on Amazon EC2 that provides map/reduce processing and takes advantage of the core Hadoop tools. This is the solution if you want Hadoop in the cloud to support your big data analytics project.
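For readers who have not worked with map/reduce, here is a minimal single-machine sketch of the pattern using the classic word count. It shows the three phases (map, shuffle, reduce) in plain Python. It does not use Hadoop or EMR, and the function names are illustrative only.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over a collection of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["big data is big", "data in the cloud"]))
```

On a real cluster the map and reduce steps run in parallel across many nodes, and the shuffle moves data between them over the network. The logic you write stays about this simple.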

As you can see, AWS provides several nice options for big data analytics in the public cloud. Most requirements can be met with AWS technology, but AWS is not the only cloud that provides big data technology on-demand. Google and Microsoft have competitive systems, and some of the smaller players have some interesting offerings as well. However, AWS offers one-stop shopping for those who build big data systems, and the catalogue of database services and middleware is rather compelling.



David S. Linthicum is a managing director and chief cloud strategy officer. David is internationally recognized as the world’s No. 1 cloud computing industry expert, pundit and thought leader.

(Disclosure: David Linthicum’s views in the blogs, video shows and podcasts are his OWN and are NOT financially sponsored by Nelson Hilliard)

