Real-time business intelligence is going mainstream, thanks in part to the Storm and Spark open source projects. Here’s how to choose between them.
The idea of real-time business intelligence has been around for a while (see the Wikipedia page on the topic begun in 2006). But while people have been talking about the idea for years, I haven’t seen many enterprises actually embrace the vision, much less realize the benefits it enables.
At least part of the reason has been the lack of tooling for implementing BI and analytics in real time. Traditional data-warehousing environments were heavily oriented toward batch operations with extremely high latencies, were incredibly expensive, or both.
A number of powerful, easy-to-use open source platforms have emerged to change this. Two of the most notable ones are Apache Storm and Apache Spark, which offer real-time processing capabilities to a much wider range of potential users. Both are projects within the Apache Software Foundation, and while the two tools provide overlapping capabilities, they each have distinctive features and roles to play.
Storm, a distributed computation framework for event stream processing, began life as a project of BackType, a marketing intelligence company bought by Twitter in 2011. Twitter soon open-sourced the project and put it on GitHub, but Storm ultimately moved to the Apache Incubator and became an Apache top-level project in September 2014.
Storm has sometimes been referred to as the Hadoop of real-time processing. The Storm documentation appears to agree: “Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.”
To meet this end, Storm is designed for massive scalability, supports fault-tolerance with a “fail fast, auto restart” approach to processes, and offers a strong guarantee that every tuple will be processed. Storm defaults to an “at least once” guarantee for messages, but offers the ability to implement “exactly once” processing as well.
Storm is written primarily in Clojure and is designed to support wiring “spouts” (think input streams) and “bolts” (processing and output modules) together as a directed acyclic graph (DAG) called a topology. Storm topologies run on clusters and the Storm scheduler distributes work to nodes around the cluster, based on the topology configuration.
You can think of topologies as roughly analogous to a MapReduce job in Hadoop, except that given Storm’s focus on real-time, stream-based processing, topologies default to running forever or until manually terminated. Once a topology is started, the spouts bring data into the system and hand the data off to bolts (which may in turn hand data to subsequent bolts) where the main computational work is done. As processing progresses, one or more bolts may write data out to a database or file system, send a message to another external system, or otherwise make the results of the computation available to the users.
One of the strengths of the Storm ecosystem is a rich array of available spouts specialized for receiving data from all types of sources. While you may have to write custom spouts for highly specialized applications, there’s a good chance you can find an existing spout for an incredibly large variety of sources — from the Twitter streaming API to Apache Kafka to JMS brokers to everything in between.
Adapters exist to make it straightforward to integrate with HDFS file systems, meaning Storm can easily interoperate with Hadoop if needed. Another strength of Storm is its support for multilanguage programming. While Storm itself is based on Clojure and runs on the JVM, spouts and bolts can be written in almost any language, including non-JVM languages that take advantage of a protocol for communicating between components using JSON over stdin/stdout.
In short, Storm is a very scalable, fast, fault-tolerant open source system for distributed computation, with a special focus on stream processing. Storm excels at event processing and incremental computation, calculating rolling metrics in real time over streams of data. While Storm also provides primitives to enable generic distributed RPC and can theoretically be used to assemble almost any distributed computation job, its strength is clearly event stream processing.
Spark, another project suited to real-time distributed computation, started out as a project of AMPLab at the University of California at Berkeley before joining the Apache Incubator and ultimately graduating as a top-level project in February 2014. Like Storm, Spark supports stream-oriented processing, but it’s more of a general-purpose distributed computing platform.
As such, Spark can be seen as a potential replacement for the MapReduce functions of Hadoop, while Spark has the ability to run on top of an existing Hadoop cluster, relying on YARN for resource scheduling. In addition to Hadoop YARN, Spark can layer on top of Mesos for scheduling or run as a stand-alone cluster using its built-in scheduler. Note that if Spark is not used with Hadoop, some type of network/distributed file system (NFS, AFS, and so on) is still required if running on a cluster, so each node will have access to the underlying data.
Spark is written in Scala and, like Storm, supports multilanguage programming, although Spark provides specific API support only for Scala, Java, and Python. Spark does not have the specific abstraction of a “spout,” but includes adapters for working with data stored in numerous disparate sources, including HDFS files, Cassandra, HBase, and S3.
Where Spark shines is in its support for multiple processing paradigms and the supporting libraries. Yes, Spark supports a streaming model, but this support is provided by only one of several Spark modules, including purpose-built modules for SQL access, graph operations, and machine learning, along with stream processing.
Spark also provides an extremely handy interactive shell that allows quick-and-dirty prototyping and exploratory data analysis in real time using the Scala or Python APIs. Working in the interactive shell, you quickly notice another major difference between Spark and Storm: Spark has more of a “functional” flavor, where working with the API is driven more by chaining successive method calls to invoke primitive operations — as opposed to the Storm model, which tends to be driven by creating classes and implementing interfaces. Neither approach is better or worse, but the style you prefer may influence your decision on which system is better suited to your needs.
Like Storm, Spark is designed for massive scalability, and the Spark team has documented users of the system running production clusters with thousands of nodes. In addition, Spark won the recent 2014 Daytona GraySort contest, turning in the best time for a shouldering workload consisting of sorting 100TB of data. The Spark team also documents Spark ETL operations with production workloads in the multiple Petabyte range.
Spark is a fast, scalable, and flexible open source distributed computing platform, compatible with Hadoop and Mesos, which supports several computational models, including streaming, graph-centric operations, SQL access, and distributed machine learning. Spark has been documented to scale exceptionally well and, like Storm, is an excellent platform on which to build a real-time analytics and business intelligence system.
How do you choose between Storm and Spark?
If your requirements are primarily focused on stream processing and CEP-style processing and you are starting a greenfield project with a purpose-built cluster for the project, I would probably favor Storm — especially when existing Storm spouts that match your integration requirements are available. This is by no means a hard and fast rule, but such factors would at least suggest beginning with Storm.
On the other hand, if you’re leveraging an existing Hadoop or Mesos cluster and/or if your processing needs involve substantial requirements for graph processing, SQL access, or batch processing, you might want to look at Spark first.
Another factor to consider is the multilanguage support of the two systems. For example, if you need to leverage code written in R or any other language not natively supported by Spark, then Storm has the advantage of broader language support. By the same token, if you must have an interactive shell for data exploration using API calls, then Spark offers you a feature that Storm doesn’t.
In the end, you’ll probably want to perform a detailed analysis of both platforms before making a final decision. I recommend using both platforms to build a small proof of concept — then run your own benchmarks with a workload that mirrors your anticipated workloads as closely as possible before fully committing to either.
Of course, you don’t need to make an either/or decision. Depending on your workloads, infrastructure, and requirements, you may find that the ideal solution is a mixture of Storm and Spark — along with other tools like Kafka, Hadoop, Flume, and so on. Therein lies the beauty of open source.
Whichever route you choose, these tools demonstrate that the real-time BI game has changed. Powerful options once available only to an elite few are now within the reach of most, if not all, midsize-to-large organizations. Take advantage of them.
Andrew C. Oliver is a professional cat herder who moonlights as a software consultant. He is president and founder of Open Software Integrators, a big data consulting firm based in Durham, N.C. Originally published at www.infoworld.com