New Relic has released a new real-time profiling capability in New Relic One that streams JVM performance data from production applications with extremely low overhead. The feature works through Java Flight Recorder, a JVM feature which was open-sourced by Oracle in Java 11.
The real-time profiler expands beyond Java Flight Recorder's single-desktop GUI, Java Mission Control, to offer visibility across many JVMs over a long period of time. The capability is similar to Datadog's Continuous Profiler offering that was announced in August. Both offerings leverage the same JFR feature, which won Oracle’s Java "March Madness" bracket for popularity. Flight Recorder and Mission Control are well-developed capabilities, having been in the commercial JRockit JVM since before 2010 and a commercial part of Oracle’s Java SE 8 release. All of the code is now free and open source. Before continuous profilers, many profilers relied on instrumentation, log analysis, or natively loaded code.
Monitoring tools that leverage JFR gain deeper insight into the JVM and its operations, such as analysis of Thread-Local Allocation Buffer (TLAB) events that can pinpoint which threads are allocating which object types. This can often be used in connection with garbage collection analysis to reveal not only what is being thrown away, but also where it is coming from.
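As a rough illustration of the allocation data described above, the sketch below uses the JDK's own `jdk.jfr` API (JDK 11 or later) to record `jdk.ObjectAllocationInNewTLAB` events and tally bytes by thread and object type. It does not show how New Relic's agent works. The class name `TlabTally` and its methods are hypothetical, chosen only for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedClass;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedThread;
import jdk.jfr.consumer.RecordingFile;

// Hypothetical helper: sums TLAB sizes per "thread -> object type" pair.
public class TlabTally {
    private final Map<String, Long> bytesByKey = new HashMap<>();

    public void add(String thread, String type, long bytes) {
        bytesByKey.merge(thread + " -> " + type, bytes, Long::sum);
    }

    public long bytesFor(String thread, String type) {
        return bytesByKey.getOrDefault(thread + " -> " + type, 0L);
    }

    public Map<String, Long> snapshot() {
        return Map.copyOf(bytesByKey);
    }

    // Reads a finished .jfr file and tallies every new-TLAB allocation event.
    public static TlabTally fromRecording(Path jfrFile) throws IOException {
        TlabTally tally = new TlabTally();
        for (RecordedEvent e : RecordingFile.readAllEvents(jfrFile)) {
            if (!"jdk.ObjectAllocationInNewTLAB".equals(e.getEventType().getName())) {
                continue;
            }
            RecordedThread t = e.getThread();
            RecordedClass c = e.getClass("objectClass");
            String thread = (t == null || t.getJavaName() == null) ? "unknown" : t.getJavaName();
            String type = (c == null) ? "unknown" : c.getName();
            // tlabSize is the size of the freshly handed-out TLAB; objectClass is the
            // type whose allocation triggered it, so this is a sample, not an exact count.
            tally.add(thread, type, e.getLong("tlabSize"));
        }
        return tally;
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempFile("tlab", ".jfr");
        try (Recording r = new Recording()) {
            r.enable("jdk.ObjectAllocationInNewTLAB");
            r.start();
            // Allocate some garbage so that new TLABs are requested.
            byte[][] junk = new byte[2_000][];
            for (int i = 0; i < junk.length; i++) {
                junk[i] = new byte[16 * 1024];
            }
            r.stop();
            r.dump(out);
        }
        fromRecording(out).snapshot()
                .forEach((key, bytes) -> System.out.println(key + ": " + bytes + " bytes"));
        Files.deleteIfExists(out);
    }
}
```

A continuous profiler would stream such events off the host rather than dump them to a local file, but the fields read here (`objectClass`, `tlabSize`, and the event thread) are the same ones that make the thread-to-type attribution possible.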
New Relic One offers developers a consistent view over a fleet of flight recorders that can incorporate additional contextual information, such as logs. The addition of logging information helps those who interpret results to move beyond the data to see what the application was doing at a given time, rather than reading pure technical metrics alone.
Another benefit of context comes from combining the deep analysis of JFR with information about the infrastructure: noisy neighbor detection. When many applications are hosted on the same systems and consume the same compute and storage capacity, performance degradation often appears. When tracking individual metrics or application metrics alone, operators may be misled into investigating performance problems in the application when the cause is a neighbor fighting over the same resources. With its level of observation, the continuous profiling capability can help distinguish whether tracked metrics stem from an application issue or an infrastructure issue.
A drawback from the monitoring and observability angle is the reliance on human attention and reaction time to respond to issues that appear on the dashboard. More data or continuous data does not automatically produce faster resolutions. AWS recently published a "best practices guide for operational dashboards," with the observation that, "we have found that any operational process that requires a manual review of dashboards will fail due to human error, no matter how frequently the dashboards are reviewed." The guidance from AWS drives towards automatic alarms, while other organizations, such as Turbonomic, drive towards automation of action.
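To make the difference between a dashboard and an automatic alarm concrete, here is a minimal sketch of an alarm rule that fires only after a metric has breached a threshold for several consecutive samples. It is not taken from AWS, New Relic, or Turbonomic. The class `ThresholdAlarm`, the 90% CPU threshold, and the sample values are all assumptions made for illustration.

```java
// Hypothetical alarm rule: fires after N consecutive samples above a threshold.
public class ThresholdAlarm {
    private final double threshold;
    private final int requiredBreaches;
    private int consecutive = 0;

    public ThresholdAlarm(double threshold, int requiredBreaches) {
        if (requiredBreaches < 1) {
            throw new IllegalArgumentException("requiredBreaches must be at least 1");
        }
        this.threshold = threshold;
        this.requiredBreaches = requiredBreaches;
    }

    // Feeds one sample; returns true while the alarm condition holds.
    public boolean observe(double value) {
        // A single sample at or below the threshold resets the streak.
        consecutive = (value > threshold) ? consecutive + 1 : 0;
        return consecutive >= requiredBreaches;
    }

    public static void main(String[] args) {
        ThresholdAlarm alarm = new ThresholdAlarm(0.90, 3);
        double[] cpuSamples = {0.95, 0.97, 0.40, 0.92, 0.93, 0.99}; // made-up data
        for (double sample : cpuSamples) {
            if (alarm.observe(sample)) {
                System.out.println("ALARM: CPU above 90% for 3 consecutive samples");
            }
        }
    }
}
```

Requiring consecutive breaches is one common way to avoid paging on a single spike. The point of the sketch is only that the rule, not a person watching a chart, decides when attention is needed.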
An example is the noisy neighbor problem of competition for CPU on a host, where a common action is to vMotion one of the virtual machines to a different, less-utilized host in the same cluster. In this scenario, a dashboard would provide the "viewing" capability to show the issue. An alerting system would provide a similar view while actively reaching out to gain attention. An automated control system would provide the "doing," using its understanding of the infrastructure to perform the vMotion, correct the issue, and then inform administrators that an issue occurred in case any further attention is due. The benefit of automated action is that continuous understanding is turned into decisions far faster than a human could manage, across a large infrastructure in finite time.
Developers looking for continuous metrics from production applications or services at scale can sign up for free accounts at New Relic One.