How to identify performance bottlenecks in a ROS 2 node using perf

In a previous blog post I described an algorithm, implemented as a ROS node, that dynamically removes lidar scan echoes which result not from the environment but from the robot body itself. In practical terms this requires receiving a lidar scan (sensor_msgs/PointCloud2), iterating over all points, removing those within an axis-aligned bounding box and finally publishing a filtered lidar scan message.
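The filtering step can be sketched without any ROS dependency. The following is a minimal illustration, not the node's actual implementation: Point, AxisAlignedBox and remove_robot_body are hypothetical names, and the sketch assumes that the box encloses the robot body and that points inside it are the ones to discard.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a single lidar point (not the node's real type).
struct Point { float x, y, z; };

// Hypothetical axis-aligned bounding box described by two opposite corners.
struct AxisAlignedBox
{
  Point min;  // corner with the smallest x, y and z
  Point max;  // corner with the largest x, y and z

  // True if p lies inside the box or on its surface.
  bool contains(Point const & p) const
  {
    return p.x >= min.x && p.x <= max.x &&
           p.y >= min.y && p.y <= max.y &&
           p.z >= min.z && p.z <= max.z;
  }
};

// Returns a copy of `scan` without the points that lie inside `robot_body`
// (assumption: echoes inside the box stem from the robot itself).
std::vector<Point> remove_robot_body(
  std::vector<Point> const & scan, AxisAlignedBox const & robot_body)
{
  // Precondition: the corners must be ordered along every axis.
  assert(robot_body.min.x <= robot_body.max.x &&
         robot_body.min.y <= robot_body.max.y &&
         robot_body.min.z <= robot_body.max.z);

  std::vector<Point> filtered;
  filtered.reserve(scan.size());  // upper bound, avoids reallocations
  std::copy_if(scan.begin(), scan.end(), std::back_inserter(filtered),
               [&robot_body](Point const & p) { return !robot_body.contains(p); });
  return filtered;
}
```

In the real node the points would be read from and written back to a sensor_msgs/PointCloud2 message instead of a plain std::vector.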

perf is a Linux profiling tool that samples CPU activity and call stacks to reveal where a program spends its time. Before you can use perf you have to temporarily allow unprivileged users access to performance monitoring events (PMU events):

sudo bash -c "echo 1 > /proc/sys/kernel/perf_event_paranoid"

To get an initial feeling for where the robot body filter node (executable name: l4bo) is spending most of its time, live profiling of an already running process using perf top is the ideal approach.

sudo perf top -p $(pidof l4bo)

Results of perf top for robot body filter

This initial analysis already shows that a significant share of CPU cycles is spent in the functions l4xz::to_point_cloud_vect and l4xz::to_point_cloud_msg. To gather detailed performance data for later offline analysis you can use perf record. It is usually run via perf record [perf parameter] ./your_program, but since most ROS nodes are started via ros2 launch I recommend using the prefix argument of the node's launch description to automatically launch perf together with your node.

         emulate_tty=True,
+        prefix=['perf record --call-graph lbr --freq 4000 --output robot_body_filter.perf.data'],
         parameters=[

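For perf to resolve symbols instead of reporting [unknown] frames, the binary needs to be built with debug information. The RelWithDebInfo build type keeps optimisations enabled and adds debug symbols, but it does not by itself disable frame pointer omission. That only matters for frame-pointer-based unwinding (--call-graph fp), which would additionally require the compiler flag -fno-omit-frame-pointer. The --call-graph lbr mode used above relies on hardware branch records instead (Last Branch Record on recent Intel CPUs); where that is not available, --call-graph dwarf is a common alternative. With colcon the package can be built with debug information as follows: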

colcon build --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo

The resulting robot_body_filter.perf.data can be analysed with a variety of performance analysis tools; I recommend KDAB/hotspot.

Results of hotspot for robot body filter before optimisation

Again the same functions show up on the hot path, consuming the most CPU cycles. Let us take a closer look at l4xz::to_point_cloud_vect:

std::vector<Position>
to_point_cloud_vect(
  sensor_msgs::msg::PointCloud2::SharedPtr const cloud_in)
{
  std::vector<Position> point_cloud_vect;
  
  sensor_msgs::PointCloud2ConstIterator<float> iter_x(*cloud_in, "x");
  sensor_msgs::PointCloud2ConstIterator<float> iter_y(*cloud_in, "y");
  sensor_msgs::PointCloud2ConstIterator<float> iter_z(*cloud_in, "z");

  for (; iter_x != iter_x.end(); ++iter_x, ++iter_y, ++iter_z)
    point_cloud_vect.push_back(Position(
      static_cast<double>(*iter_x) * m,
      static_cast<double>(*iter_y) * m,
      static_cast<double>(*iter_z) * m));

  return point_cloud_vect;
}

The main inefficiency in this implementation is the way the output vector is built. point_cloud_vect is default-constructed with zero capacity and then grown one element at a time through push_back. Each time the current capacity is exhausted, std::vector has to allocate a new, larger backing buffer, move every Position that was already inserted into that new buffer, and release the old one. The fix is simple: the final element count is known in advance from cloud_in->width * cloud_in->height, so the required capacity can be reserved once before the loop. Replacing push_back with emplace_back additionally constructs each Position directly inside the vector, which avoids creating a temporary and moving it into the container for every point.

std::vector<Position>
to_point_cloud_vect(
  sensor_msgs::msg::PointCloud2::SharedPtr const cloud_in)
{
  std::vector<Position> point_cloud_vect;
  point_cloud_vect.reserve(cloud_in->width * cloud_in->height);

  sensor_msgs::PointCloud2ConstIterator<float> iter_x(*cloud_in, "x");
  sensor_msgs::PointCloud2ConstIterator<float> iter_y(*cloud_in, "y");
  sensor_msgs::PointCloud2ConstIterator<float> iter_z(*cloud_in, "z");

  for (; iter_x != iter_x.end(); ++iter_x, ++iter_y, ++iter_z)
    point_cloud_vect.emplace_back(
      static_cast<double>(*iter_x) * m,
      static_cast<double>(*iter_y) * m,
      static_cast<double>(*iter_z) * m);

  return point_cloud_vect;
}
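The effect of both changes can be made visible outside of ROS with a small counting experiment. This is only a sketch under stated assumptions: CountingPosition is a hypothetical stand-in for Position that counts how often it is move-constructed, fill is a hypothetical helper, and reallocations are detected by watching capacity() change. The exact numbers for the unreserved case depend on the growth strategy of the standard library in use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Global counter of move constructions (kept simple for the sketch).
std::size_t g_moves = 0;

// Hypothetical stand-in for Position that records every move construction.
struct CountingPosition
{
  double x, y, z;

  CountingPosition(double x_, double y_, double z_) : x(x_), y(y_), z(z_) {}
  CountingPosition(CountingPosition const &) = default;
  CountingPosition(CountingPosition && other) noexcept
  : x(other.x), y(other.y), z(other.z)
  {
    ++g_moves;
  }
};

struct Stats
{
  std::size_t reallocations;  // number of times capacity() changed while filling
  std::size_t moves;          // number of move constructions while filling
};

// Fills a vector with n elements and reports what it cost.
Stats fill(std::size_t n, bool reserve_first, bool use_emplace)
{
  g_moves = 0;
  std::vector<CountingPosition> v;
  if (reserve_first) {
    v.reserve(n);
  }

  std::size_t reallocations = 0;
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t const capacity_before = v.capacity();
    double const c = static_cast<double>(i);
    if (use_emplace) {
      v.emplace_back(c, c, c);                // constructed in place
    } else {
      v.push_back(CountingPosition(c, c, c)); // temporary, then moved in
    }
    if (v.capacity() != capacity_before) {
      ++reallocations;                        // backing buffer was replaced
    }
  }

  assert(v.size() == n);
  return {reallocations, g_moves};
}
```

Calling fill(1000, false, false) reproduces the original pattern, while fill(1000, true, true) corresponds to the optimised version and should report neither reallocations nor moves. For a small, trivially movable Position the per-element move is cheap, so the avoided reallocations are likely the larger part of the gain.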

With the change in place, the node has to be rebuilt and the perf record run from above repeated so that the recorded data reflects the optimised implementation. Writing the second run to a differently named --output file keeps both traces available for comparison. Loading the new trace into hotspot side by side with the original confirms that the reallocation overhead is gone.

Results of hotspot for robot body filter after optimisation

The workflow shown here is intentionally simple and works well as a default starting point for performance investigations on ROS 2 nodes: get a first impression with perf top on the running process, capture a detailed trace with perf record, inspect the result in hotspot, apply a targeted fix, and then repeat the measurement on the rebuilt binary.

For further reading I recommend Brendan Gregg’s perf examples as a reference for the tool itself, and the hotspot documentation for getting the most out of the visual analysis.



Alexander Entinger is a senior embedded robotics engineer who helps teams build reliable firmware and software for mobile robotic systems. With over a decade of experience spanning microcontroller firmware, real-time communication, Linux, and ROS, he bridges the gap between the embedded and robotics software worlds.