In a previous blog post I described an algorithm, implemented as a ROS node, that dynamically removes lidar scan echoes which originate from the robot body itself rather than from the environment. In practical terms this means receiving a lidar scan (sensor_msgs/PointCloud2), iterating over all points, removing those inside an axis-aligned bounding box around the robot body, and finally publishing the filtered lidar scan message.
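The filtering step can be sketched without any ROS dependency. This is only an illustration, not the node's actual code: Point, BoundingBox and remove_body_points are hypothetical names, a plain std::vector stands in for the sensor_msgs/PointCloud2 payload, and the box is assumed to enclose the robot body, so points inside it are the ones that get dropped.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical stand-in for one lidar point; the real node reads its
// points from a sensor_msgs/PointCloud2 message instead.
struct Point
{
  float x, y, z;
};

// Hypothetical axis-aligned bounding box, assumed to enclose the robot body.
struct BoundingBox
{
  Point min, max;

  // True if p lies inside the box or on its surface.
  bool contains(Point const & p) const
  {
    return p.x >= min.x && p.x <= max.x &&
           p.y >= min.y && p.y <= max.y &&
           p.z >= min.z && p.z <= max.z;
  }
};

// Returns a copy of `cloud` without the points that fall inside `body`.
std::vector<Point> remove_body_points(std::vector<Point> const & cloud,
                                      BoundingBox const & body)
{
  std::vector<Point> filtered;
  filtered.reserve(cloud.size()); // upper bound: at most every point is kept
  std::copy_if(cloud.begin(), cloud.end(), std::back_inserter(filtered),
               [&body](Point const & p) { return !body.contains(p); });
  return filtered;
}
```

Every point costs one containment test of six comparisons, so the filter itself is cheap; the conversions around it are where a profiler is worth consulting.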
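perf is a Linux profiling tool that samples CPU activity and call stacks to reveal where a program spends its time. Before you can use perf you have to temporarily allow unprivileged users access to performance monitoring events (PMU events). This is governed by the kernel.perf_event_paranoid sysctl; lowering it with sysctl -w (for example to -1, the most permissive setting) stays in effect until the next reboot.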
To get an initial feeling for where the robot body filter node (executable name: l4bo) spends most of its time, live profiling of the already running process with perf top is the ideal approach; perf top -p followed by the process ID attaches to that one process.

This initial analysis already shows that a significant share of CPU cycles is spent in the functions l4xz::to_point_cloud_vect and l4xz::to_point_cloud_msg. To gather detailed performance data for later offline analysis you can use perf record. It is usually run as perf record [perf options] ./your_program, but since most ROS nodes are started via ros2 launch I recommend using the prefix parameter of the node's launch description, which prepends the perf record command line to the node executable and thereby launches perf together with your node. The -o option of perf record sets the name of the output file, here robot_body_filter.perf.data.
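For perf to produce meaningful call graphs and resolve symbols instead of [unknown] frames, the binary needs to be built with debug information and with frame pointers preserved. With colcon this can be achieved by passing the corresponding settings to CMake through colcon build --cmake-args, for example CMAKE_BUILD_TYPE=RelWithDebInfo together with -fno-omit-frame-pointer in the compiler flags.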
The resulting robot_body_filter.perf.data can be analysed with a variety of performance analysis tools; I recommend KDAB/hotspot.

Again the same functions show up on the hot path, consuming the most CPU cycles. Let us take a closer look at l4xz::to_point_cloud_vect, which builds a std::vector of Position elements from the incoming cloud_in message.
The main inefficiency in this implementation is the way the output vector is built up. point_cloud_vect is default-constructed with zero capacity and then grown one element at a time through push_back. Each time the current capacity is exhausted, std::vector has to allocate a new, larger backing buffer, move every Position that was already inserted into that new buffer, and release the old one. The fix is trivial: the final element count is already known from cloud_in->width * cloud_in->height and can be passed to reserve before the loop, so the buffer is allocated exactly once. Replacing push_back with emplace_back additionally constructs each Position directly inside the vector, which avoids the temporary-to-container move on every inserted point.
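The difference between the two patterns can be reproduced in isolation. The following is a minimal sketch, not the node's code: Position, FillResult and both fill functions are hypothetical stand-ins, and num_points plays the role of cloud_in->width * cloud_in->height. Each function counts how often the vector's capacity changes while it is being filled, which corresponds to the number of buffer allocations during the loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the element type of the output vector.
struct Position
{
  float x, y, z;
  Position(float x_, float y_, float z_) : x{x_}, y{y_}, z{z_} { }
};

// The filled vector plus the number of capacity changes observed while
// filling it (each change means a new backing buffer was allocated).
struct FillResult
{
  std::vector<Position> points;
  std::size_t capacity_changes;
};

// Slow pattern: no reserve, a temporary Position is passed to push_back.
FillResult fill_with_push_back(std::size_t const num_points)
{
  FillResult r{{}, 0};
  for (std::size_t i = 0; i < num_points; i++)
  {
    auto const capacity_before = r.points.capacity();
    r.points.push_back(Position(static_cast<float>(i), 0.0f, 0.0f));
    if (r.points.capacity() != capacity_before) r.capacity_changes++;
  }
  return r;
}

// Fixed pattern: reserve once, then construct each element in place.
FillResult fill_with_reserve(std::size_t const num_points)
{
  FillResult r{{}, 0};
  r.points.reserve(num_points); // single allocation before the loop
  for (std::size_t i = 0; i < num_points; i++)
  {
    auto const capacity_before = r.points.capacity();
    r.points.emplace_back(static_cast<float>(i), 0.0f, 0.0f);
    if (r.points.capacity() != capacity_before) r.capacity_changes++;
  }
  return r;
}
```

With reserve the counter stays at zero no matter how many points are inserted. Without it the exact count depends on the growth factor of the standard library in use, but it increases with the size of the cloud, and every one of those steps moves all elements inserted so far.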
With the change in place the node has to be rebuilt and the perf record run from above has to be repeated, so that the resulting robot_body_filter.perf.data reflects the optimised implementation. Loading the new trace into hotspot side by side with the original confirms that the reallocation overhead is gone.

The workflow shown here is intentionally simple and works well as a default starting point for performance investigations on ROS 2 nodes: get a first impression with perf top on the running process, capture a detailed trace with perf record, inspect the result in hotspot, apply a targeted fix, and then repeat the measurement on the rebuilt binary.
For further reading I recommend Brendan Gregg’s perf examples as a reference for the tool itself, and the hotspot documentation for getting the most out of the visual analysis.
