Big data analytics holds the key to unlocking the age of ubiquitous information and pervasive intelligence. As the number of sensors and web services grows tremendously, the real challenge is to employ the data at hand and extract actionable insights from it. While something like a Python script may be enough to analyze a few spreadsheets, you need something much more powerful when you are dealing with big data. Big data analytics can be defined as a collection of technologies and processes used to crunch through complex data sets in order to discover market trends, correlations and hidden data patterns. It enables organizations to make informed business decisions and aids researchers in verifying their scientific models.
Big data analytics has become an essential part of any business, be it financial analysis, retail, advertising or healthcare. The amount of data worldwide has been growing exponentially and is estimated to jump from 33ZB this year to 175ZB in 2025. Not only does this create immense opportunities, it also raises a huge demand for infrastructure to run big data analytics on and poses new challenges for data engineers.
Is there anything unique about an analytics workload?
Let’s agree on what defines an analytics workload in the first place. According to Curt Monash, “Analytics is the antonym of transactional”. While transactional processing (OLTP) is characterized by short sets of discrete operations, a high number of transactions per second and strict data integrity, analytics workloads are typically distinguished by fewer users making much more complex and resource-intensive queries against the data source. There is massive parallelism going on behind the curtains, and data movement is kept to a minimum by performing computations as close to the data as possible. The data volume is large, the model is complex and the computation is done by a distributed system, all of which places a real burden on the infrastructure that carries out these tasks.
Considering a transition to the cloud?
It’s no joke to build up and maintain your big data analytics stack, so companies often choose to migrate their analytics workloads to the cloud in order to reduce complexity and increase operational efficiency. As a rule of thumb, there are two main points to consider when preparing for a transition to the cloud: data storage and data processing.
Mountains of data to store
Distributed data storage is what you need to consider first and foremost for your big data project. According to Brewer’s theorem, also known as the CAP theorem, it is impossible for a distributed data store to simultaneously provide more than two of three guarantees: consistency, availability and partition tolerance. So pick two and you are good to go. The choice, as always, depends on your application.
To keep the big data wheel spinning, highly scalable, efficient and cost-effective storage is required. Almost always it is going to be some type of NoSQL database, and there are plenty to choose from, with more than 225 NoSQL databases available. Remember Brewer’s theorem? This is when you start making sacrifices. If the risk of some data becoming unavailable is tolerable for you (sacrifice availability), a highly flexible and easily scalable document database with straightforward querying, like MongoDB, may be the way to go. If it’s no big deal that your clients may read inconsistent data (sacrifice consistency), you may want to choose a fault-tolerant and linearly scalable database like Cassandra. There are even some niche use cases where you may consider using a traditional relational database management system like MySQL or PostgreSQL and sacrifice partition tolerance. Although this may validate your hipster identity, it would probably involve database sharding and make working with unstructured data nearly impossible. Let’s just leave SQL for querying data warehouses, shall we?
Regardless of which database you choose, most of them run really well on commodity hardware. Although all hyperscale cloud providers offer managed database services these days, and some of them are not shy about giving open source the middle finger, there’s no need to get locked into their ecosystem when there are superior open source products out there. For instance, you can run a MongoDB cluster on bare metal cloud with directly attached HDD, SSD or NVMe storage to skyrocket the I/O operations on each node. And if you are a true speed fan, setting up an in-memory database like Ignite or Redis might be the thing for you.
Let’s process data! Wait, but how?
Data is called the new oil for a reason. We love data, since it helps us understand things better and reveals actionable insights. To get there, we have to process our data one way or another.
First, there was Hadoop with its batch processing framework built upon the MapReduce computing paradigm. Life was good and songs were sung while engineers scaled their big data clusters horizontally and employed massive parallelism. Each node applied the given map function to the chunk of data it had been assigned, and the mapped results were then grouped by key and combined by reduce functions. This way enormous data chunks were processed like a breeze. The paradigm itself came from Google, which used MapReduce to build its search index. In time the Hadoop ecosystem expanded rapidly and introduced additional layers of abstraction to address new issues as the big data industry matured. It remains one of the most prominent and widely used tools in the data industry today, and you can run it smoothly on simple commodity servers. Just make sure you have fast directly attached storage on your nodes, since Hadoop MapReduce is disk-bound.
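The map, shuffle and reduce steps behind this paradigm can be sketched in a few lines of plain Python. This is only an illustration of the idea on a single machine, not Hadoop's actual API: the function names and the sample text chunks are made up for the example, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # On a cluster each chunk would be mapped on the node that stores it.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two chunks standing in for blocks of a large file.
chunks = ["big data is big", "data moves fast"]
counts = word_count(chunks)
print(counts)
```

Because every chunk is mapped independently, adding more nodes lets more chunks be processed at the same time, which is where the horizontal scaling comes from.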
While batch processing is a really powerful concept, the data first needs to be stored before it can be processed. This creates difficulties when you want to start making real-time predictions with data continuously streaming in. For something like algorithmic stock trading or wildfire monitoring to work, your data has to be processed in the blink of an eye. Obviously, we need a different paradigm here, and Apache Spark is at the forefront of innovation when it comes to stream processing. The project was originally intended to address the weaknesses of Hadoop MapReduce, which writes intermediate results to disk between steps. Spark has no file management system of its own, so it relies on HDFS or any other storage cluster. It reads data from the cluster, performs its operations in a single step and then writes the data back to the cluster. This can be up to 100x faster than Hadoop MapReduce for some workloads, since Spark operates in memory by default. When choosing the right infrastructure for your Spark cluster, look for something powerful RAM-wise. On bare metal cloud we usually recommend Intel Xeon Gold 6230R servers as your Spark nodes.
Like most great technologies, Spark has evolved and changed a lot. It is now a unified analytics engine that supports interactive queries, graph processing and iterative algorithms. For instance, you can easily build machine learning workflows and employ some of the most popular algorithms on Spark to iterate over your data set and build machine learning models. It can even process batch jobs these days. And the best thing about Spark? It’s completely free of charge.
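The difference between waiting for a stored batch and processing records as they arrive can be illustrated with a small plain-Python sketch. It does not use Spark's API; the generator below is a hypothetical stand-in for a live sensor feed, and the rolling average is just one example of a result that stays up to date after every new reading.

```python
from collections import deque


def rolling_average(stream, window=3):
    """Yield the mean of the last `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # old readings fall out automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)


def sensor_readings():
    """Hypothetical stand-in for a live feed of temperature readings."""
    for value in [10.0, 20.0, 30.0, 40.0]:
        yield value


# Each average is available as soon as its reading arrives,
# without storing the whole data set first.
averages = list(rolling_average(sensor_readings(), window=3))
print(averages)
```

Only the last few readings are held in memory at any time, which is the property that makes this style of processing suitable for data that never stops arriving.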
To cloud or not to cloud
OK, you have all these powerful open source tools that we have just discussed in your pocket. Now you need to choose the right infrastructure for your project. If you bought bare metal servers and hosted them on-premises, you would be able to squeeze the most out of raw infrastructure, but at the price of a huge upfront capital investment and ongoing maintenance costs. Although this option is still worth considering for large enterprises, small and medium businesses must be much more agile. Renting infrastructure, on the other hand, is a more convenient option, since you pay per usage and need not invest in hardware. Let’s say you have eventually decided to move to the cloud. But which one to choose?
The big boys
Every hyperscale provider, be it AWS, Azure or GCP, has a wide portfolio of managed services to offer to the big data community, ranging from managed databases to integrated machine learning frameworks. It might seem like a one-size-fits-all solution, but most of their managed services have roughly the same functionality as their open source counterparts that are available free of charge. If you still want to abstract away from the infrastructure entirely and are fine with being locked in to a single provider, get ready to receive ever-increasing invoices and use complex pricing calculators. Funnily enough, there’s even a role of Cloud Economist to help you make sense of your AWS invoice.
Alright, so you want to keep full control over your cloud stack and always have the freedom to choose where it resides. As mentioned earlier, open source technologies like Hadoop and Spark work really well on commodity hardware, so the main question is which infrastructure-as-a-service provider to choose. Historically, a typical cloud offering included virtual machines with overbooked hardware resources, which often resulted in fluctuating performance and increased security risks. Although the cloud service ecosystem has expanded greatly, the underlying infrastructure services still rely heavily on the hypervisor.
Bare metal cloud on the block
Bare metal cloud is different. You still have fully automated infrastructure provisioning, just with no underlying virtualization layer. This is great for several reasons. First, all servers are strictly single-tenant and you’re the only owner of the entire machine. Being single is not much fun in life, but it’s great in the cloud: you have no noisy neighbors, no hardware overbooking, no hypervisor overhead and fewer security risks. Simply put, bare metal cloud is a much cleaner way to host your resource-intensive applications. For data analytics workloads that require robust infrastructure and enhanced security, bare metal cloud is hard to beat.
Raw horsepower of bare metal
Running your big data cluster on bare metal machines gives you an extra edge. Servers can be easily scaled up and down in minutes via a RESTful API, which is critical when running a distributed system. With no virtualization and no hardware overbooking, you can run your application at maximum capacity and still have smooth and steady performance. If that’s not enough, you can customize server hardware as you like. Simply add a GPU accelerator when building your machine learning model, increase RAM to scale your in-memory database or put NVMe storage into your servers to speed up your Hadoop cluster. Where else can you get specialized hardware that easily? By eliminating hypervisor overhead and introducing custom hardware, bare metal cloud gives you the most efficient infrastructure on demand. Raw and simple.
Enhanced privacy & security
Security is taken very seriously in the big data world, and it’s understandable that you need to choose your infrastructure accordingly. Dealing with sensitive data often means that you have to store and process personally identifiable information (PII) and stay in line with legal regulations. Making sure your vendor is GDPR compliant or has industry-recognized certifications like ISO 27001 is always a good idea. In addition to that, you may also be legally obliged to have a private and isolated infrastructure. Bare metal cloud is inherently single-tenant, and you don’t pay a cent extra for that.
Legal compliance is important, yet system security is no less crucial. On a distributed big data computing cluster, data should be moved between your nodes privately. To make this happen you typically need a private network subnet. On bare metal cloud a private network interface is assigned automatically to every server. This way you can process your data internally on a fast and secure LAN with 10G bandwidth.
100 times cheaper data transfer
Sure, you try to keep your computations as close as possible to where the data resides. This is especially true when your data set is huge, since moving computation is cheaper than moving data. Nevertheless, you still need to move your data in and out of the cluster. While hyperscale providers boast about their low data transfer prices, which may vary from $50 to more than $100 per terabyte, on bare metal cloud you can transfer data to and from the Internet for as little as $1 per terabyte. Quite a difference, right?
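To put these per-terabyte prices into perspective, here is a quick back-of-the-envelope calculation in Python. The prices come from the paragraph above, while the monthly volume of 20 TB is purely an assumption chosen for illustration.

```python
def transfer_cost(terabytes, price_per_tb):
    """Return the data transfer cost in dollars for a given volume."""
    return terabytes * price_per_tb


# Assumption for illustration only: 20 TB transferred per month.
volume_tb = 20

hyperscale_low = transfer_cost(volume_tb, 50)    # $50 per TB
hyperscale_high = transfer_cost(volume_tb, 100)  # $100 per TB
bare_metal = transfer_cost(volume_tb, 1)         # $1 per TB

print(f"Hyperscale: ${hyperscale_low} to ${hyperscale_high} per month")
print(f"Bare metal: ${bare_metal} per month")
print(f"Ratio: {hyperscale_low // bare_metal}x to {hyperscale_high // bare_metal}x")
```

At these prices the ratio stays between 50x and 100x whatever the volume, so the absolute saving grows linearly with the amount of data you move.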
It’s up to you now
There are still only a handful of bare metal cloud providers on the market, and even fewer that can offer fully automated AND easily customizable infrastructure. We at Cherry Servers have developed an exclusive cloud platform to address the limitations of traditional cloud vendors. So if you are about to conquer the world with the next big thing, consider choosing a more flexible, efficient and cost-effective cloud platform to level up your big data application.