Microservices are gaining popularity as an architectural style for achieving extreme agility. The application is functionally decomposed into a set of loosely coupled, collaborating services that interact through well-defined (REST) APIs. Adopting these design principles allows teams of developers to continuously evolve individual microservices independently, at an extremely fast pace. Organizations using this development model are known to update their deployments anywhere from 50 to 300 times a day!
One of the biggest “complaints” against microservices is that while we gain agility for individual microservices, gaining insight into the overall operation of the system (consisting of tens of interacting microservices) becomes more difficult. As shown in Figure 1, multiple networked services work in coalition to generate a response to the user’s request; an end-to-end view of the application execution is vital to quickly diagnose and address performance degradation issues in production deployments. In applications with tens of microservices (and hundreds of instances of each), it becomes increasingly hard to understand how information flows across the various services, where the choke points are, and whether the delay experienced by a user is an artifact of the network or of a microservice in the call chain.
Given the rising need for performance profiling tools for microservice-based applications running in a cloud environment, at IBM Research we are experimenting with the idea of baking a real-time performance profiler into the platform substrate itself, akin to services like auto scaling, load balancing, etc. The service would work in a non-intrusive fashion by capturing and analyzing network communication between microservices in the application. Operating at cloud scale, the analytics aspect of the service needs to process vast volumes of communication traces from tenant applications in real time, discover the application topologies, track individual requests as they flow across microservices over the network, and so on. Since we needed to run both batch and real-time analytics applications, we decided to use Spark as our big-data analytics platform.
Figure 2 illustrates a simple experiment that we set up to understand how we can leverage Spark for operational analytics. Our setup consists of an OpenStack cloud, a set of microservice-based applications operating in different tenant networks, and a small Spark cluster. Software network taps are installed on every Nova compute host to capture network packets transiting inside tenant networks. Wire data captured from the tenant network is pushed into a Kafka bus. We wrote connectors in our Spark applications to pull packets out of Kafka and analyze them in real time.
We wrote Spark applications to attempt to answer the following questions:
- How does information flow across services while generating the response to a particular end user’s request? In the IT Operational Analytics world, this specific type of analytical operation is commonly known as “transaction tracing”.
- Given a time window, what is the caller/callee relationship between various microservices in the application?
- Given a time window, what were the response times for various microservices in the application?
We developed two Spark applications to answer these questions: a near-real-time transaction tracing application and a batch analytics application to generate the application’s communication graph and latency statistics. The former was built on top of Spark’s streaming abstractions, while the latter was a set of batch processing jobs managed by the Spark job server.
Tracing transactions (or request flows) across microservices requires establishing causality between request-response pairs across microservices in the application. In order to be completely agnostic to the application, we decided to treat the application as a black box. We designed our system under the assumption that the application did not use any globally unique request identifiers to track a user’s request across various microservices.
To track causality, we took a slightly modified version of the approach taken by Aguilera et al. in their 2003 SOSP paper on performance profiling a distributed system of black boxes. For synchronous web services, the paper proposes a nesting algorithm that represents the distributed application as a graph of nodes (services), with edges representing interactions between nodes. The nesting algorithm examines the timestamps of calls between services to deduce causal relationships. Simply put, if service A calls service B, and service B talks to service C before returning a response to A, then the call from B to C is said to be caused by the call from A to B. By analyzing a large set of messages, we can derive call chains across services with statistical confidence measures and eliminate less likely alternatives. The original algorithm published in the paper is designed to operate in an offline fashion on large trace sets. We modified the algorithm to operate on a moving window of packet streams and progressively refine the topology inference over time.
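The core of the nesting heuristic can be illustrated in a few lines of plain Python. This is a minimal sketch, not our actual Spark implementation: the `Call` record type and its fields are hypothetical, and the sketch ignores the statistical confidence scoring that the full algorithm applies over many observations.

```python
from collections import namedtuple

# Hypothetical record: one observed request/response pair between services.
Call = namedtuple("Call", ["src", "dst", "start", "end"])

def nest_calls(calls):
    """Attribute a call B->C to a call A->B when B->C starts at A->B's
    destination and its time interval nests inside A->B's interval."""
    children = {}
    for outer in calls:
        for inner in calls:
            if inner is outer:
                continue
            # inner is "caused by" outer if it originates at outer's
            # destination and fits entirely within outer's time window.
            if (inner.src == outer.dst
                    and outer.start <= inner.start
                    and inner.end <= outer.end):
                children.setdefault(outer, []).append(inner)
    return children

# Example: A calls B; B calls C before replying to A.
a_b = Call("A", "B", 0.0, 10.0)
b_c = Call("B", "C", 2.0, 6.0)
# The B->C call is attributed to the enclosing A->B call.
print(nest_calls([a_b, b_c]))
```

On real traces, many candidate parents may enclose a given call; the full algorithm scores the alternatives across a large message set and keeps the statistically most likely chains.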
Figure 3 shows a partial workflow of jobs in the transaction tracing application. Figure 4 shows the transaction trace in a tenant application, as deduced by the Spark application. Packet streams arrive in chunks, encapsulated in the PCAP format. Individual flows are extracted from the packet streams and grouped into sliding windows, i.e., DStreams. Within a given time window, HTTP requests and corresponding responses are extracted by comparing the standard five-tuple (src_ip, src_port, dest_ip, dest_port, protocol), forming the next DStream, which is then shipped off to the rest of the processing chain that implements the nesting algorithm (not shown in the figure). We stored the output of the transaction tracing application in a time-series data store (InfluxDB).
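The request/response pairing step can be sketched in plain Python. This is an illustrative simplification, not the Spark Streaming code itself: the packet dictionaries and their keys (`ts`, `kind`, etc.) are hypothetical stand-ins for parsed HTTP messages, and responses are matched by reversing the five-tuple of the request.

```python
from collections import defaultdict

def pair_http_messages(packets):
    """Match HTTP requests with responses within one time window.
    A response shares the request's five-tuple with src/dst reversed."""
    pending = defaultdict(list)   # five-tuple -> queue of open requests
    pairs = []
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        key = (pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        if pkt["kind"] == "request":
            pending[key].append(pkt)
        else:
            # Reverse the tuple: the response flows back to the caller.
            rkey = (pkt["dst_ip"], pkt["dst_port"],
                    pkt["src_ip"], pkt["src_port"], pkt["proto"])
            if pending[rkey]:
                pairs.append((pending[rkey].pop(0), pkt))
    return pairs
```

In the streaming pipeline, an equivalent pairing would run per window over each DStream batch before the paired calls are handed to the nesting stage.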
The second Spark application is a standard batch analytics application that generates a service call graph as well as call latency statistics for a given time window. The application is submitted as a standard batch job to the Spark job server. As illustrated in Figure 5, the batch analytics application pulls individual transaction traces out of InfluxDB and transforms them into a list of &lt;vertex, edge&gt; pairs, per transaction trace. The lists are then aggregated to form two RDDs, one containing a list of vertices and the other a list of edges. The vertex list is further de-duplicated based on the vertex name. Finally, the application’s call graph is computed in the form of a directed graph, along with statistics about the latencies on each edge in the graph. This graph is an instance of the application’s time-evolving graph, representing its state for a given time period. Figures 6 and 7 show the call graph and latency statistics of a tenant application, as output by the batch analytics job.
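The aggregation logic of the batch job can be sketched in plain Python, abstracting away the RDD machinery. This is a simplified illustration under assumed inputs: each trace is taken to be a list of hypothetical `(caller, callee, latency_ms)` hops rather than the actual InfluxDB records.

```python
from collections import defaultdict
from statistics import mean

def build_call_graph(traces):
    """Aggregate per-trace hops into a de-duplicated vertex set and a
    directed edge map carrying simple latency statistics."""
    vertices = set()                       # de-duplicated by service name
    edge_latencies = defaultdict(list)     # (caller, callee) -> latencies
    for trace in traces:
        for caller, callee, latency in trace:
            vertices.update((caller, callee))
            edge_latencies[(caller, callee)].append(latency)
    edges = {edge: {"count": len(lats),
                    "mean_ms": mean(lats),
                    "max_ms": max(lats)}
             for edge, lats in edge_latencies.items()}
    return vertices, edges

# Two hypothetical traces through a small application.
traces = [[("frontend", "auth", 12.0), ("frontend", "catalog", 30.0)],
          [("frontend", "auth", 18.0)]]
vertices, edges = build_call_graph(traces)
```

In the actual job, the vertex and edge lists are Spark RDDs and the final directed graph is built with Spark's graph abstractions, but the per-edge aggregation follows the same shape.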
The Spark platform has enabled us to build different types of analytics applications, such as batch processing, streaming, and graph processing, on a single unified big-data platform. Our next step is to investigate the scalability aspects of our system, such as ingesting data at line rate and dealing with thousands of tenant application traces simultaneously. We will continue to report on our progress along this front.