Apache Storm 2.0 Improvements

By Kishor Patil, Principal Software Systems Engineer at Verizon Media and PMC member of Apache Storm, and Bobby Evans, Apache Member and PMC member of Apache Hadoop, Spark, Storm, and Tez

We are excited to be part of the new release of Apache Storm 2.0.0. The open source community has been working on this major release, Storm 2.0, for quite some time. At Yahoo we had a long-standing, strong commitment to using and contributing to Storm, a commitment we continue as part of Verizon Media. Together with the Apache community, we've added more than 1,000 fixes and improvements to this new release. These improvements benefit use cases ranging from sending real-time infrastructure alerts to the DevOps folks running Storm to augmenting ingested content with related content, thereby giving users a deeper understanding of any one piece of content.

Performance

Performance and utilization are very important to us, so we developed a benchmark to evaluate various stream processing platforms, and the initial results showed Storm to be among the best. We expect to release new numbers by the end of June 2019, but in the interim we ran some smaller Storm-specific tests that we'd like to share.

Storm 2.0 has a built-in load generation tool under examples/storm-loadgen. It comes with the requisite word count test, which we used here, but it can also capture a statistical representation of the bolts and spouts in a running production topology and replay that load on another topology, or on another version of Storm. For this test, we backported that code to Storm 1.2.2. We then ran the ThroughputVsLatency test on both code bases at various throughputs and different numbers of workers to see what impact Storm 2.0 would have. These tests were run out of the box with no tuning of the default parameters, except to set topology.max.spout.pending to 1,000 sentences; in the past that has proven to be a good balance between throughput and latency while providing flow control in the 1.2.2 version, which lacks backpressure.
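
For reference, here is a minimal sketch of how that one setting is applied when submitting a topology. It uses the TestWordSpout and TestWordCounter helpers that ship with Storm's testing utilities as stand-ins for the word count components that storm-loadgen wires up; the topology name is hypothetical.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.testing.TestWordCounter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class WordCountFlowControl {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new TestWordSpout(), 1);
            builder.setBolt("count", new TestWordCounter(), 2)
                   .fieldsGrouping("words", new Fields("word"));

            Config conf = new Config();
            // The one non-default knob used in the benchmark: cap the
            // number of un-acked tuples in flight per spout task.
            conf.setMaxSpoutPending(1000); // topology.max.spout.pending

            StormSubmitter.submitTopology("wordcount-test", conf,
                                          builder.createTopology());
        }
    }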

In general, for a WordCount topology, we saw 50% to 80% improvements in the latency of processing a full sentence. Moreover, the 99th-percentile latency in most cases is lower than the mean latency of the 1.2.2 version. We also saw the maximum throughput on the same hardware more than double.

[Charts: WordCount latency vs. throughput, Storm 1.2.2 vs. 2.0.0]

Why did this happen? STORM-2306 redesigned the threading model in the workers, replaced the Disruptor queues with JCTools queues, added a true backpressure mechanism, and optimized many code paths to reduce the overhead of the system. The impact on system resources is very promising: memory usage was essentially unchanged, but the CPU story is a bit more nuanced.

[Charts: CPU usage vs. throughput, Storm 1.2.2 vs. 2.0.0]

At low throughput (< 8,000 sentences per second) the new system uses more CPU than before. This can be tuned, as the system does not yet auto-tune itself. At higher rates the slope of the line is much lower, which means Storm has less overhead than before and can process more data with the same hardware. It also means we were able to max out each of these configurations at more than 100,000 sentences per second on 2.0.0, over 2x the maximum of 45,000 sentences per second that 1.2.2 could do with the same setup. Note that we did nothing to tune these topologies on either setup. Furthermore, because true backpressure now provides flow control, we could disable the event tracking feature entirely and still run in a stable way; doing so let the same WordCount topology consistently process over 230,000 sentences per second, which equates to over 2 million messages per second on a single node.
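
Assuming that the event tracking feature refers to Storm's tuple acking framework (our reading of it), disabling it is a one-line configuration change. A minimal sketch:

    import org.apache.storm.Config;

    public class NoTrackingConfig {
        public static Config build() {
            Config conf = new Config();
            // Run zero acker executors so tuples are no longer tracked
            // end-to-end. In 1.2.2 this also removes the max.spout.pending
            // flow control, destabilizing the topology under load; in 2.0.0
            // the built-in backpressure keeps it stable anyway.
            conf.setNumAckers(0); // topology.acker.executors = 0
            return conf;
        }
    }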

Scalability

In 2.0, we have laid the groundwork to make Storm even more scalable. Workers and supervisors can now heartbeat directly to Nimbus instead of going through ZooKeeper, resulting in the ability to run much larger clusters out of the box.

Developer Friendly

Prior to 2.0, Storm was written primarily in Clojure. Clojure is a wonderful language with many advantages over pure Java, but its prevalence in Storm became a hindrance for many developers who weren't familiar with it and didn't have the time to learn it. Because of this, the community decided to port all of the daemon processes over to pure Java. We still maintain a backward-compatible storm-clojure package for those who want to continue using Clojure for topologies.

Split Classpath

In older versions, Storm shipped as a single jar that included the code for the daemons as well as everything needed by user code. We have now split this up: storm-client provides everything your topology needs to run. storm-core can still be used as a dependency for tests that want to run a local-mode cluster, but it will pull in more dependencies than you might expect.

To upgrade your topology to 2.0, you’ll just need to switch your dependency from storm-core-1.2.2 to storm-client-2.0.0 and recompile.  
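
In a Maven build, for example, the change amounts to swapping one artifact (the provided scope assumes the cluster supplies the Storm classes at runtime):

    <dependency>
        <groupId>org.apache.storm</groupId>
        <artifactId>storm-client</artifactId>
        <version>2.0.0</version>
        <!-- The cluster provides the Storm classes at runtime. -->
        <scope>provided</scope>
    </dependency>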

Backward Compatible

Even though Storm 2.0 is API compatible with older versions, upgrading can still be difficult when running a hosted multi-tenant cluster, because coordinating the cluster upgrade with recompiling all of the topologies can be a massive task. Starting in 2.0.0, Storm has the option to run workers for topologies submitted with an older version using the classpath of a compatible older release of Storm. This feature, which was developed by our team, allows you to upgrade your cluster to 2.0 while letting topology owners recompile against the newer dependencies on their own schedule.
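
As a sketch, on the supervisors this is driven by configuration that maps a topology's Storm version to a locally installed older release; the key name reflects our recollection of the setting, and the install path is hypothetical:

    # storm.yaml on the supervisors: workers for 1.2.2-era topologies
    # launch with the classpath of the matching local install.
    # (Key name and path are illustrative assumptions.)
    supervisor.worker.version.classpath.map:
        "1.2.2": "/opt/apache-storm-1.2.2/lib/*"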

Generic Resource Aware Scheduling

With the new generic resource aware scheduling strategy, it is now possible to specify, along with CPU and memory, generic resources such as network bandwidth, GPUs, or any other cluster-level resource. This allows topologies to declare such generic resource requirements for their components, resulting in better scheduling and stability.
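
Here is a sketch of what per-component resource requests look like under our reading of the 2.0 API: setCPULoad and setMemoryLoad are the established resource aware scheduler calls, while addResource stands in for a generic resource request against a hypothetical resource name that the supervisors would have to advertise.

    import org.apache.storm.generated.StormTopology;
    import org.apache.storm.testing.TestWordCounter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class GenericResourceTopology {
        public static StormTopology build() {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new TestWordSpout(), 1)
                   .setCPULoad(20)      // request 20% of one core
                   .setMemoryLoad(256); // request 256 MB of on-heap memory
            builder.setBolt("count", new TestWordCounter(), 2)
                   .fieldsGrouping("words", new Fields("word"))
                   .setCPULoad(50)
                   .setMemoryLoad(512)
                   // Generic resource request: "gpu.count" is a
                   // hypothetical, cluster-defined resource name.
                   .addResource("gpu.count", 1.0);
            return builder.createTopology();
        }
    }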

More To Come

Storm is a secure, enterprise-ready stream processing platform, but there is always room for improvement, which is why we're adding support for running workers in isolated, locked-down containers so there is less chance of malicious code using a zero-day exploit in the OS to steal data.

We are working on redesigning metrics and heartbeats to scale even better and, more importantly, to automatically adjust your topology so it can run optimally on the available hardware. We are also exploring running Storm on other systems, to provide a clean base to run not just on Mesos but also on YARN and Kubernetes.

If you have any questions or suggestions, please feel free to reach out via email.

P.S. We’re hiring! Explore the Big Data Open Source Distributed System Developer opportunity here.