Using IBM Turbonomic for Monitoring Cloud Pak for Data — Part 2

Advanced report-based monitoring

Yongli An
IBM Data Science in Practice

--

Overview

In part 1 of this article, we introduced the basic UI navigation for viewing the built-in metrics and shared our experience with IBM Turbonomic in Cloud Pak for Data (CP4D) environments. That experience focuses on using the monitoring capabilities to help our development teams analyze resource usage and identify performance and resource optimizations.

Because the basic UI navigation has limitations, in this part we turn to the report service for help. We share more details on how to use the report service and the predefined reports in Grafana dashboards, where the reports appear as customized graphs for the most important resource usage metrics.

Advanced report-based monitoring

For advanced monitoring with Turbonomic, you can create and customize reports to get exactly what you need, based on your own interests and established practices. Because the reports are based on Grafana, you get all the benefits of Grafana when working with the graphs. For example, you can zoom into part of a graph by selecting a time range, or hover over a data point to show its value.

The Turbonomic Evaluation Edition entitles you to use the report feature. Based on our experience from using Turbonomic for CP4D, we have created some predefined reports to track the key metrics as part of our recommended best practices. We provide report configuration details on how to set up the customized container and pod usage reports in dashboards.

After the administrator sets up the reports, you should be able to log in to the Turbonomic server with a user ID that has permission to access the reports. The predefined Container and Pod usage reports support four different levels of filtering.

When you are on the default Turbonomic home page, you should see the REPORTS option in the left navigation. Click it to open a new Dashboards page. You might see a screen like the following example. Click to expand the CP4D section, then click the Kubernetes Container Usage — with tag selection row to see the graphs. It might take some time to populate the graphs, depending on the amount of data that the server manages.

By default, the customized CP4D report dashboards have four filters:

  • All clusters
  • All namespaces on each cluster
  • All tags (labels) used by all the pods and containers
  • All containers

If the Turbonomic server manages many clusters, the default view can be messy. We strongly suggest the following filter sequence so that you can focus on the right scope.

  1. Cluster — Enter or search for your cluster name (required).
  2. Namespace — Select your namespace on your cluster (recommended).
  3. Tag — Enter or search for the addOnId of your service to filter to your service only (as needed).
  4. Containers — Start with the default, which shows all containers. To check a specific container, apply the filter for that container (as needed).

There are five metrics that we always check when running performance tests or investigating performance issues in any environment. They are the same metrics that were covered earlier in this article.

To confirm whether the application performance is constrained by the pod CPU or memory capacity, focus on these metrics:

  • vCPU (virtual CPU usage vs limit, %)
  • vMem (virtual memory usage vs limit, %)

To check whether the request settings (for reserving resources) are too high relative to the real usage, refer to these metrics:

  • vCPU (virtual CPU usage vs request, %)
  • vMem (virtual memory usage vs request, %)

These metrics are useful for identifying custom tuning specific to a customer environment. They also help identify generic optimization opportunities for product owners to evaluate further, as illustrated by the sketch below.
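
Here is a minimal example of how these four percentages are computed, using made-up request, limit, and usage values (not from a real CP4D deployment); the real numbers come from the data behind the Turbonomic graphs:

```python
# Hypothetical container resource settings and observed usage.
cpu_request_mcores = 500    # 0.5 cores requested (reserved)
cpu_limit_mcores = 2000     # 2 cores limit
cpu_usage_mcores = 1300     # observed average CPU usage

mem_request_mb = 1024       # 1 GiB requested (reserved)
mem_limit_mb = 2048         # 2 GiB limit
mem_usage_mb = 900          # observed average memory usage

def pct(usage, reference):
    """Usage expressed as a percentage of a reference value (request or limit)."""
    return 100.0 * usage / reference

print(f"vCPU usage vs limit:   {pct(cpu_usage_mcores, cpu_limit_mcores):.0f}%")    # 65%
print(f"vCPU usage vs request: {pct(cpu_usage_mcores, cpu_request_mcores):.0f}%")  # 260%
print(f"vMem usage vs limit:   {pct(mem_usage_mb, mem_limit_mb):.0f}%")            # 44%
print(f"vMem usage vs request: {pct(mem_usage_mb, mem_request_mb):.0f}%")          # 88%
```

Usage approaching 100% of the limit is what signals a potential capacity constraint, while usage well above the request is usually acceptable, as discussed later in this article.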

Finally, check any containers with high CPU throttling. If a container shows high CPU throttling and its CPU usage vs limit is also high, there might be a concern. Otherwise, it should be fine as long as the application performance is still satisfactory (in terms of response time and throughput).
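
If you want to cross-check high CPU throttling outside of Turbonomic, the raw data is available from the standard cAdvisor counters that most Kubernetes monitoring stacks scrape (container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total). The sketch below uses hypothetical counter deltas; it is a generic Kubernetes calculation, not a Turbonomic API:

```python
# Deltas of the cAdvisor CPU CFS counters for one container over an interval
# (hypothetical values for illustration).
cfs_periods_delta = 6000            # CPU scheduling periods observed in the interval
cfs_throttled_periods_delta = 900   # periods in which the container hit its CPU quota

throttling_pct = 100.0 * cfs_throttled_periods_delta / cfs_periods_delta
print(f"CPU throttling: {throttling_pct:.1f}% of periods")  # 15.0% of periods
```

A high percentage here means the container frequently hit its CPU quota during the interval.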

The following two metrics are the most important ones, and we recommend that you always pay attention to them. They are the must-check metrics when you see performance issues under load.

Metric 1: vCPU (virtual CPU usage vs limit, %)

A typical application performance problem is that request response times become too long as the load on the system increases. First, confirm whether the cluster or node capacity is the constraining factor. If it is not, the container limit setting is the next suspect.

The default dashboards show all the predefined reports as individual graphs on the same UI page. Filtering by cluster must be done first: type in part of the name of your cluster (the name used to register it with the Turbonomic server), then click the cluster name in the drop-down list. The report graphs refresh with the data for your cluster only.

To see this vCPU vs limit percentage graph more clearly on its own, click the overflow menu, then click View.

Once you have the report graph on a page by itself, you see the graph at the top and a table with detailed values at the bottom.

Without applying any further filters on namespace, tag, or container, you can get a good impression of the overall usage as compared to the limit for all the containers in all the namespaces.

You can click the Mean or Max column headers to sort the table so that you have a better matching order between the lines in the graph and the rows in the table. The ones at the top are likely to be interesting. However, this view is probably still too messy to see anything clearly.

By default, the graph shows the data points for the last 24 hours. You can change this by clicking Last 24 hours to see a list of time-range options. Constant high usage or frequent spikes that reach 100% in this graph are a cause for concern.

Now zoom into the CP4D-related namespaces. The most important one is the CP4D instance namespace, which is named zen in our environment. There are a few others that are useful to check, including the cpd-operator and ibm-certificate-manager namespaces.

First, let’s use the ibm-certificate-manager namespace as an example. As it has few pods, there is no need to filter further. You can clearly see that the cert-manager-cainjector container shows high spikes periodically. Before we conclude anything, we need to check the memory usage vs limit metric, which is discussed in the next section.

Next, let’s filter on the zen namespace. It’s where all the installed CP4D services are running. Because there are many pods in this namespace, it’s necessary to filter further by using a tag. The most useful one is addOnId. Enter addOn in the Tag field to see a list of the services and components. If there is a particular component that you are interested in, such as wkc, you can select it from the list.

Click the checkbox for icpdsupport/addOnId: wkc to see the following graph. To make the graph more interesting, select a time range on the initial graph by clicking, holding, and dragging over the range, then releasing the mouse (basic Grafana functionality). Now, the graph shows some real workload activities that drive some container CPU usage up to 65% of the container limit.
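
If you want to confirm outside of Turbonomic which pods carry a given addOnId tag, the same label can be queried directly from Kubernetes. Here is a minimal sketch using the Kubernetes Python client (wkc is just the example value used above; adjust the namespace and label value for your environment):

```python
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

# The Turbonomic "tag" filter maps to this Kubernetes pod label.
pods = core.list_namespaced_pod(
    namespace="zen",
    label_selector="icpdsupport/addOnId=wkc",
)
for pod in pods.items:
    print(pod.metadata.name)
```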

If you are interested in a particular pod, you can continue by typing part of the container name into the Container filter to see that container on its own. We show this step in the next section by using the second most important metric.

Metric 2: vMem (virtual memory usage vs limit, %)

Let’s navigate back to the home page where we have all the metric reports. The cluster should still be filtered to your own cluster and the namespace to ibm-certificate-manager. Click the view option of the vMem report to see the report graph on a page by itself.

Do you remember the cert-manager-cainjector container that showed periodic high CPU usage in the section above? Now the graph shows that the memory usage vs limit for the same pod (the green line) spikes as well, but stays under 50%. This means that there is no risk of running out of memory. By checking the graph called vMem Limit to the right of the same graph, you can see that the limit of the container is 1.02 GB. This container used to be a problematic one under investigation: it originally had a memory limit of 0.5 GB. Because memory is not a compressible resource, insufficient memory capacity for the container might lead to application failures when the container runs out of memory and restarts. The optimization was to double the memory limit to handle peak load. The CPU setting was kept the same because CPU usage is low most of the time, and CPU is a compressible resource that doesn’t cause immediate failures.
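
For illustration, the sketch below shows roughly what such a memory-limit change looks like when applied with the Kubernetes Python client. The deployment and container names are assumptions based on the container discussed above, and in a CP4D cluster a change like this would normally be made through the service’s own sizing or custom-resource mechanism so that the operator does not revert it; treat this only as a generic Kubernetes example:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Assumed deployment/container names; adjust to your environment.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "cert-manager-cainjector",
                        "resources": {
                            # Double the memory limit from 512Mi to 1Gi to absorb
                            # the periodic spikes; CPU settings stay unchanged.
                            "limits": {"memory": "1Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="cert-manager-cainjector",
    namespace="ibm-certificate-manager",
    body=patch,
)
```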

As mentioned earlier, the graphs on the basic UI pages don’t show any dynamic pods that come and go while the workloads are running, unless those pods are running when you are browsing the graphs.

The good news is that the reports in the dashboards are able to show such dynamic pods, even if they completed some time ago. In other words, the graph can show you the history of both the dynamic pods and the static pods (which are always running as long as the CP4D services are installed on the cluster).

The graph below is an example of the vMem graph for the Spark service, produced by filtering the namespace to zen and the tag to the Spark addOnId. The broken lines turn out to be the Spark clean-up cron jobs that run periodically. They are dynamic pods, as they run for short periods and only from time to time.

The graph is also zoomed into a time range to show the usage spikes related to one Spark pod that was of interest while we were investigating a problem in our tests.

Now, we can filter by container to see a subset of the pods, leaving out the cron job pods. You can also click any row in the table to highlight one line in the graph.

The graph below is another example that shows the dynamic pods from running Jupyter notebooks.

The graph also serves another purpose. Those pods are visible only as dots or short line segments in the graph. The reason is a limitation in the current Turbonomic implementation: the monitoring sampling interval is 1 minute, while the data points in the graphs are values averaged over a 10-minute interval. As a result:

  • If a dynamic pod runs for less than 1 minute, you might not see it at all.
  • If a dynamic pod is captured by only one or two of the 1-minute samples, then depending on the timing you see a single dot, or two dots that form a short line segment.
  • If a dynamic pod runs longer, you get a good view of its usage from multiple data points that form a longer line.

This situation is not ideal, but seeing these pods with at least some data points is still an improvement. Unfortunately, the sampling interval and averaging interval settings are not yet configurable. Hopefully, future Turbonomic enhancements will address these limitations.
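
As a back-of-the-envelope illustration of this limitation, the sketch below estimates how many chart data points a short-lived pod can contribute, using the 1-minute sampling and 10-minute averaging intervals described above (the helper itself is purely illustrative and ignores exact alignment with the averaging windows, which can shift the count by a point):

```python
import math

SAMPLE_MIN = 1    # monitoring sampling interval, in minutes
BUCKET_MIN = 10   # each chart data point averages a 10-minute window

def approx_chart_points(pod_runtime_min: float) -> int:
    """Rough estimate of the data points a short-lived pod contributes to the graph."""
    samples = int(pod_runtime_min // SAMPLE_MIN)   # pods shorter than 1 minute may get no samples
    if samples == 0:
        return 0
    # The samples spread over at most ceil(runtime / bucket) 10-minute windows,
    # and there can never be more points than samples.
    return min(samples, math.ceil(pod_runtime_min / BUCKET_MIN))

for runtime in (0.5, 1.5, 8, 25, 120):
    print(f"pod running {runtime:>5} min -> ~{approx_chart_points(runtime)} data point(s)")
```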

Other reports and metrics on the dashboard graphs

We covered two main reports that are used for identifying any constraints that are related to the pod CPU or memory capacity.

There are two more metrics, related to CPU and memory usage vs the request settings. They are used for understanding the optimal request settings based on your needs.

In general, to improve cluster resource efficiency, it’s recommended not to reserve too much through the request setting. It’s okay to see usage at 200% or 300% of the request, or even higher. The benefit is that when the services are not active, usage drops back toward the small reserved (requested) amount, leaving more resources available for other services to use. It’s a balance between not over-reserving and the need to have some guaranteed resources. Usually, CP4D products favor cluster resource efficiency while still reserving a reasonable level of the resources that a service needs.

On the right side of the home page, there are four more report graphs, each next to its sibling graph on the left. They are:

  • vCPU Request: shows the container CPU request setting values
  • vCPU Limit: shows the container CPU limit setting values
  • vMem Request: shows the container memory request setting values
  • vMem Limit: shows the container memory limit setting values

All those graphs follow the same filters, which makes for easy reference when you inspect the usage details in the graphs to their left.

Recap: Other best practices and reminders

This is a recap of what’s covered already in part 1 of this article. For more details, please refer to part 1.

  • Most customers start with preproduction load testing, using realistic data and realistic load levels, to evaluate and ensure quality and performance before moving to production with a change control process.
  • In those preproduction test environments, Turbonomic is very useful for understanding whether there are any potential node capacity issues or pod-level constraints.
  • In the production environment, rule-based automatic tuning or automatic scaling most likely isn’t acceptable when such actions cause service disruption. But using Turbonomic for monitoring should still be very helpful.
  • The recommendations made by Turbonomic can be validated in the test environments. Any approved changes can then be promoted to production following your standard change control and promotion process.
  • Be aware that some environment and infrastructure issues might cause Turbonomic to stop working properly. Try the typical practices of restarting pods and/or waiting for the environmental issues to settle down. If Turbonomic still does not recover, it’s better to re-create the setup, assuming there is no need to keep and recover the old data. There is no formal support with the Evaluation Edition.

Summary

Turbonomic is a powerful tool that you can use to gain much better visibility into Cloud Pak for Data at the cluster, service, and pod levels, which also means much-enhanced observability. The built-in views covered in part 1 of this article span a wide range of entities and metrics. Further, the advanced report dashboards make tracking the most important metrics much easier. You can focus on analyzing the trends and patterns because the key metrics are constantly and continuously available in the Grafana dashboards that are pre-built for Cloud Pak for Data clusters.

Acknowledgment

The author would like to thank Judy Liu (judyliu@ca.ibm.com) from the Cloud Pak for Data platform performance team, and Eva Tuczai (eva.tuczai@ibm.com) and her colleagues in the Turbonomic organization, for their continued collaboration and support to improve the integration between Turbonomic and Cloud Pak for Data.


--

Yongli An
IBM Data Science in Practice

Senior Technical Staff Member, Performance architect, IBM Data and AI. Love sports, playing or watching. See more at https://www.linkedin.com/in/yonglian/