Planet ROS
Planet ROS - http://planet.ros.org
ROS Discourse General: QERRA-v2 Classical — Explainable Ethical Scoring Engine with ROS 2 Bridge (open for feedback)
Hi everyone,
I’ve built QERRA-v2 Classical — a 100% classical, fully explainable ethical evaluation engine based on 12 immutable human-centred vectors (SEMEV-12). It returns traceable scores + reasoning with no neural networks.
The repo includes a ready-to-use ROS 2 bridge (ros2_bridge.py) that runs standalone or as a full node (subscribes to /qerra/situation_input, publishes score, decision, and full SEMEV-12 result).
Live API + full documentation:
I would be very grateful for any feedback, especially:
- Does the topic structure and message types fit typical robotics pipelines?
- What would make the bridge more useful in real Behaviour Trees?
Happy to adapt based on real use cases.
Thank you!
1 post - 1 participant
ROS Discourse General: 3we: AI-First Python API for Mobile Robot Navigation (Open Source, $300 BOM)
Hi everyone,
I’d like to share **3we** — an open-source platform I’ve been building that provides an AI-First Python API on top of ROS2/Nav2, targeting Embodied AI researchers who want to focus on algorithms rather than ROS2 infrastructure.
## The Problem
AI researchers (especially those working with VLMs/VLAs) often want to deploy models on real robots but face:
- Steep ROS2 learning curve (launch files, topics, services, actions)
- No clean path from simulation to hardware
- Existing platforms are either too expensive (TurtleBot 4: $1,200+) or simulation-only (Habitat, Isaac Lab)
## What 3we Does
```python
from threewe import Robot

async with Robot(backend="gazebo") as robot:
    image = robot.get_camera_image()   # (H,W,3) uint8
    scan = robot.get_lidar_scan()      # LaserScan
    await robot.move_to(x=5.0, y=3.0)  # Nav2 under the hood
```
Change `backend="gazebo"` to `backend="real"` — the same code runs on physical hardware. The ROS2/Nav2 stack is fully transparent to the user.
**Four backends with identical API:**
- `mock` — zero-dependency 2D kinematics (no ROS2 needed, runs anywhere)
- `gazebo` — Gazebo Harmonic with full physics
- `isaac_sim` — NVIDIA Isaac Sim for GPU-accelerated RL training
- `real` — Physical hardware via ROS2 topics
## Architecture
```
┌─────────────────────────────────────────┐
│ AI-First Python API (user layer) │ ← Researchers write code here
├─────────────────────────────────────────┤
│ 3we-core (middleware) │ ← Backend dispatch, sensor fusion
├─────────────────────────────────────────┤
│ ROS2 / micro-ROS (infrastructure) │ ← Transparent to users
│ Nav2, slam_toolbox, ESP32 drivers │
└─────────────────────────────────────────┘
```
This is NOT a replacement for ROS2 — it’s a layer on top that makes ROS2 accessible to ML researchers while preserving full ROS2 compatibility for roboticists who want low-level access.
## VLM-Controlled Navigation
The killer feature for AI researchers: GPT-4o (or any OpenAI-compatible VLM) can directly control the robot through natural language:
```python
async with Robot(backend="gazebo") as robot:
    result = await robot.execute_instruction(
        "find the red bottle and stop near it"
    )
    print(f"Success: {result.success}")
```
Internally this runs a perception-action loop: capture image → send to VLM → parse JSON action → execute → repeat until done. Works with GPT-4o, Qwen-VL, or local LLaVA.
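The loop described above can be sketched in a few lines. Here `robot` and `vlm` are hypothetical stand-ins, not the actual 3we API: `robot.get_camera_image()` returns an image and `vlm.query(image, prompt)` returns a JSON string like `{"action": "move_to", "args": {...}, "done": false}`.

```python
import json

def perception_action_loop(robot, vlm, instruction, max_steps=20):
    """Sketch of the capture -> VLM -> parse -> execute loop.

    `robot` and `vlm` are hypothetical stand-ins for illustration only.
    """
    for _ in range(max_steps):
        image = robot.get_camera_image()          # capture current view
        reply = vlm.query(image, instruction)     # ask the VLM for next action
        action = json.loads(reply)                # parse the JSON action
        if action.get("done"):                    # VLM decides task is complete
            return True
        # dispatch the named action with its arguments
        getattr(robot, action["action"])(**action.get("args", {}))
    return False
```

A real implementation would also need to validate the VLM's JSON against an action schema before dispatching, since model output is not guaranteed to be well-formed.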
## Hardware ($300 BOM)
Fully open reference hardware under CERN-OHL-P v2:
| Component | Selection |
|-----------|-----------|
| Compute | Raspberry Pi 5 (8GB) |
| AI Accelerator | Hailo-8L (13 TOPS) |
| MCU | ESP32-S3 + micro-ROS |
| LiDAR | LD06 (360°, 2D) |
| IMU | BNO055 (9-axis) |
| Drive | 4× N20 motors + Mecanum wheels + DRV8833 |
| Safety | Dual-channel relay (ISO 13850 E-stop) |
KiCad 8 PCB files, DXF mechanical drawings, and assembly docs all included.
## ROS2 Integration Details
For ROS2 developers who want to know what’s under the hood:
- **Navigation**: Nav2 with DWB planner, parameters tuned for mecanum kinematics
- **SLAM**: slam_toolbox (online async) with LD06
- **MCU bridge**: micro-ROS on ESP32-S3 via USB-C serial transport
- **Sensor fusion**: robot_localization EKF (IMU + wheel odometry)
- **Launch**: Composable nodes, configurable via YAML profiles
You can always drop down to raw ROS2 topics/services if needed — the Python API doesn’t hide or lock you out.
## Benchmark Suite
7 standardized scenes with reproducible baselines:
```bash
threewe benchmark run --task pointnav --scene office_v2 --episodes 100
```
Gymnasium-compatible environments for RL:
```python
import gymnasium as gym
env = gym.make("3we/Navigation-v1", scene="office_v2", backend="mock")
```
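Since the environments follow the standard Gymnasium API, a rollout is the usual reset/step pattern. This generic sketch assumes nothing beyond that API (the five-tuple step return introduced in Gymnasium):

```python
def run_episode(env, policy, max_steps=500):
    """Roll out one episode with the standard Gymnasium API:
    reset, then step until terminated or truncated."""
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```

Any Gymnasium-compatible RL library (e.g. Stable-Baselines3) should work against such environments without modification.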
## Demo

Autonomous point-to-point navigation in office_v2 scene with 360° LiDAR visualization.
## Links
- **Documentation**: https://3we.org
- **Paper**: Paper - 3we
- **PyPI**: `pip install threewe`
Feedback welcome — especially from Nav2 users on whether the API abstraction makes sense, and from AI researchers on what’s missing for their workflows.
Software: Apache 2.0 | Hardware: CERN-OHL-P v2 | Docs: CC-BY-SA 4.0
3 posts - 3 participants
ROS Discourse General: What patterns of logs or warnings should an automated bridge promote to structured faults in ROS 2?
We have been working on ros2_medkit, an Apache 2.0 fault aggregation gateway for ROS 2 that follows the SOVD model (ISO 17978-3). All of it lives at github.com/selfpatch/ros2_medkit — a diagnostics gateway for ROS 2 robots: faults, live data, operations, scripts, locking, triggers, and OTA updates via REST API; no SSH, no custom tooling.
Two integration paths today:
- `/diagnostics` topic — drop-in, no code changes on the publisher side. Works for any package already using `diagnostic_updater`.
- Native `FaultReporter` instrumentation — each failure surface emits a structured fault code directly. We tried this on a manymove fork to see how invasive it is to add per-action-node fault reporting. PR is here for reference: selfpatch/manymove#1 ("Feat/medkit integration" by mfaferek93). The integration itself was small (one mixin + a fault-codes header), but that fork is a fairly clean codebase. Most production stacks have a much messier `RCLCPP_ERROR`/`RCLCPP_WARN` history that nobody is going to retroactively convert.
Native FaultReporter is the right answer when you control the codebase end-to-end - structured codes from day zero, lowest friction long-term. The painful case is the long tail of existing ROS 2 packages that already work fine and never emitted /diagnostics. For those, the drop-in bridge has nothing to subscribe to, and asking maintainers to instrument every node won’t happen. If the goal is to make structured diagnostics adoptable across the ecosystem, plug-and-play needs to mean more than “use /diagnostics”.
That gap is what we want to validate with you.
We are considering a third path: a logs-to-faults bridge that watches /rosout (or arbitrary log streams) and promotes selected patterns to structured fault events, with configurable rules (severity mapping, dedup, rate limiting). Goal: a team can adopt structured diagnostics without touching their existing code. If it works out, it ships in the same open repo as the rest of medkit.
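To make the proposal concrete, here is a minimal sketch of what such a rules engine could look like — pure Python, hypothetical names, not part of ros2_medkit. Each rule maps a log pattern to a fault code, with a per-code rate limit serving as the dedup mechanism:

```python
import re
import time
from dataclasses import dataclass, field

@dataclass
class PromotionRule:
    pattern: str                 # regex matched against the log message
    fault_code: str              # structured code to emit on match
    severity: str = "ERROR"      # severity mapping for the fault event
    min_interval_s: float = 5.0  # rate limit: suppress repeats in this window

@dataclass
class LogsToFaultsBridge:
    rules: list
    _last_emit: dict = field(default_factory=dict)  # fault_code -> last emit time

    def process(self, log_line, now=None):
        """Return a structured fault event dict if a rule matches and is
        not rate-limited, else None."""
        now = time.monotonic() if now is None else now
        for rule in self.rules:
            if re.search(rule.pattern, log_line):
                last = self._last_emit.get(rule.fault_code)
                if last is not None and now - last < rule.min_interval_s:
                    return None  # dedup: same fault fired too recently
                self._last_emit[rule.fault_code] = now
                return {"code": rule.fault_code,
                        "severity": rule.severity,
                        "message": log_line}
        return None
```

In a real bridge the `process` call would be fed from a `/rosout` subscription and the returned events published as medkit faults; the allowlist-only question from below maps to whether unmatched lines are silently dropped (as here) or escalated.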
Three questions where your experience would help more than ours:
- What log patterns in your stack would you actually want auto-promoted to structured faults? (specific examples > taxonomies)
- What blocks your team from using `/diagnostics` more widely today?
- For a logs-to-faults bridge to be useful and not noisy, what would have to be true? (rules engine, allowlist-only, ML, something else?)
Curious what others have tried, especially on the failure-modes side.
- /diagnostics (DiagnosticArray)
- Custom error/event topics
- Logs (RCLCPP_ERROR / RCLCPP_WARN)
- Action results / service error codes
- Behavior tree / lifecycle state changes
- Tracing / OpenTelemetry
- No consistent pattern yet
4 posts - 3 participants
ROS Discourse General: The accountability gap in ROS2: where does "why did the robot do that?" get answered?
A question I keep running into and don’t have a clean answer for: when a
ROS2-based autonomous system makes a consequential decision — a mobile
robot reroutes around a person, an arm stops mid-motion, a drone aborts —
we can answer what it did. ros2 bag captures the topics. But why it
did that, in a form a safety officer, an insurance adjuster, or a regulator
can read, is almost always reconstructed after the fact, by hand.
Four specific observations, curious where I’m wrong:
1. Rule provenance is invisible. When a BehaviorTree node fires, we
log the node, not the human-authored policy that made the node legal. No
first-class link from “robot stopped” to “rule §3.2 of safety policy v4
triggered.”
2. Guardrails are one-way safety, not auditable downgrades. Most ROS2
safety layers I’ve seen are kill switches or velocity caps. They prevent
harm but produce no signed record of “planner wanted X, guardrail
downgraded to Y, here’s the chain.”
3. LLM-in-the-loop adds a new failure mode. With VLA stacks plugging
into task planning, the “why” gets harder. Did the model suggest the
action? Was it followed, overridden, sanitized? I don’t see standard hooks
for any of this in the stack.
4. EU AI Act Article 12 and 14 are now in force for high-risk autonomous
systems. Most teams I talk to plan to handle “logging” and “human
oversight” with ros2 bag plus a spreadsheet. That will not survive a
regulator audit, and CE marking deadlines for some categories hit in 2027.
Three questions for people deeper in this than me:
- Is there an active REP or working group on decision provenance that I missed? I found scattered threads, no spec.
- For Nav2 + BehaviorTree.CPP teams: how do you currently answer "why did the robot decide that?" for non-engineer stakeholders?
- Has anyone added cryptographic signing to the rosbag pipeline, or is everyone trusting the filesystem and timestamps?
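On the signing question: one common pattern is a hash chain, where each audit record is tagged with an HMAC over the previous tag plus the record, so tampering with any earlier entry invalidates everything after it. A minimal sketch (hypothetical, not tied to any rosbag API; a production system would use asymmetric signatures rather than a shared HMAC key):

```python
import hashlib
import hmac
import json

def append_record(chain, record, key):
    """Append a record to a hash-chained audit log.

    Each entry stores an HMAC over (previous entry's tag + record), so later
    tampering with an earlier record invalidates every subsequent tag.
    """
    prev_tag = chain[-1]["tag"] if chain else "genesis"
    payload = json.dumps(record, sort_keys=True)   # canonical serialization
    tag = hmac.new(key, (prev_tag + payload).encode(), hashlib.sha256).hexdigest()
    chain.append({"record": record, "tag": tag})
    return chain

def verify_chain(chain, key):
    """Recompute every tag from the start; any mismatch means tampering."""
    prev_tag = "genesis"
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hmac.new(key, (prev_tag + payload).encode(),
                            hashlib.sha256).hexdigest()
        if entry["tag"] != expected:
            return False
        prev_tag = entry["tag"]
    return True
```

The records themselves could be the guardrail-downgrade events described above ("planner wanted X, guardrail downgraded to Y"), giving an auditable chain rather than a bare bag file.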
I’ve been building an opinionated implementation of some of this — rule
provenance, signed audit chain, guardrail-downgrade-only pattern, LLM
sanitization — outside of ROS2, and I’m trying to figure out if the pieces
that generalize are worth porting and open-sourcing.
If this resonates, drop a reply or DM. Looking for both “you’re missing
existing work X” and “yes this is broken in our deployment, here’s how.”
3 posts - 2 participants
ROS Discourse General: [Open Source] rviz_2d_plot_plugin: Live 2D Plotting Inside RViz 2
Hi everyone,
I’m happy to share an open-source RViz 2 plugin I have been working on:
rviz_2d_plot_plugin
The plugin provides live 2D plotting directly inside RViz 2 as a screen-space overlay. The goal is to make it easier to monitor ROS 2 topic data, controller signals, odometry-related values, diagnostics, and other runtime signals without leaving the RViz environment.
Some of the current features include:
- Runtime discovery of plottable ROS 2 topic fields
- Time-series plotting
- XY plotting
- Multi-series plotting from different topics
- Reference lines, limits, setpoints, and tolerance bands
- Axis control, auto-scaling, grid, legend, and styling options
- QoS configuration
- Pause, clear, and history preservation
The plugin is currently supported and tested on ROS 2 Humble. Support for additional ROS 2 distributions is planned and will be released soon.
The project is released under the MIT license.
I would be happy to receive feedback from the ROS community, especially regarding:
- API/design improvements
- Compatibility with other ROS 2 distributions
- Useful plotting features for robotics debugging workflows
- Packaging and release suggestions
Repository:
Thanks, and I hope this can be useful for others working with RViz 2 and ROS 2 system visualization.
3 posts - 2 participants
ROS Discourse General: How long did your first cross-device ROS2 setup take?
Setting up ROS2 across multiple devices (different boards, distros, DDS config) for the first time — how many hours or days did it take before you had nodes talking reliably?
Specifically curious about:
- Device combo (RPi + Jetson, x86 + ARM, etc.)
- What broke (DDS discovery, distro mismatch, network config?)
- Rough time lost before it worked
Building a scaffolding tool and want real data, not estimates.
1 post - 1 participant
ROS Discourse General: MBF - My quadruped robot dog
Building a quadruped robot dog has been a personal goal of mine ever since I started engineering. Over the past months (or even years), I’ve been working on MBF, an open-source quadruped robotics platform designed and built entirely from the ground up using affordable and accessible hardware.
This project has pushed me to learn across multiple engineering domains simultaneously — from mechanical design and fabrication to embedded communication, robotics middleware, and locomotion software architecture.
Everything on the robot was independently designed and integrated by myself, including:
- Mechanical design and CAD
- 3D-printed structural components
- Parts sourcing and assembly
- Electrical wiring and CAN bus communication
- ROS2-based software architecture and control stack
**Current Hardware Highlights:**
- Affordable quadruped platform with predominantly 3D-printed components
- GIM6010-8 planetary drive actuators running FOC control over CAN bus
- Design-for-Assembly (DFA) considerations such as standardized screw sizing and modular assembly layout
- Entire robotics stack running on a single Raspberry Pi 4 without additional microcontrollers (selectable inference using either ONNX or Torch C++)
Current Software Stack:
- Custom ROS2 joint impedance controller
- Custom ros2_control hardware interface over CAN bus
- Integration with the CHAMP framework
- Extensible reinforcement learning inference node
- ROS2/Gazebo simulation workflow for sim-to-sim locomotion deployment, with an easy switch to real hardware via a simple argument change
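For readers unfamiliar with joint impedance control: the textbook joint-space law is tau = Kp(q_des − q) + Kd(qd_des − qd) + feedforward. The sketch below is that generic law, not the actual MBF controller:

```python
import numpy as np

def joint_impedance_torque(q_des, q, qd_des, qd, kp, kd, tau_ff=None):
    """Generic joint-space impedance law (textbook sketch, not MBF's code):
    tau = Kp*(q_des - q) + Kd*(qd_des - qd) [+ feedforward torque]."""
    tau = (kp * (np.asarray(q_des, dtype=float) - np.asarray(q, dtype=float))
           + kd * (np.asarray(qd_des, dtype=float) - np.asarray(qd, dtype=float)))
    if tau_ff is not None:
        tau = tau + np.asarray(tau_ff, dtype=float)
    return tau
```

Tuning Kp/Kd per joint sets how "stiff" the leg feels, which is what lets a quadruped comply with ground contact instead of fighting it.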
One thing this project taught me is that modern robotics is deeply built on open-source collaboration. A lot of the knowledge, frameworks, and tools used throughout MBF came from developers and researchers who chose to make their work publicly accessible.
While I’m proud of the progress made so far, this project is also a reminder that meaningful robotics development has become more accessible than ever. With enough dedication and willingness to learn, it’s genuinely possible for individuals to build complex systems today thanks to the open-source community. This project is simply my attempt to return that favor.
The robot is still actively being developed, but seeing the full software stack communicate reliably with custom hardware has been incredibly rewarding so far.
For more info, please check:
github.com/adwng/mbf_ros2
1 post - 1 participant
ROS Discourse General: Polka - Your everything pointcloud node Release v0.2
Polka is a low-latency pointcloud merger that also publishes laser scans and gives you granular control over filtering, and even deskewing. Here are the features and bug fixes in the latest release.
New Features
- Per-source IMU topic override: each LiDAR source can specify its own `imu_topic` for robots with multiple IMUs on different body segments. Falls back to the global `motion_compensation.imu_topic` when unset.
- Gravity subtraction in deskew: linear acceleration corrected by removing gravity using IMU orientation, improving motion compensation accuracy.
- Configurable output QoS: full QoS control (reliability, durability, history depth, liveliness, deadline, lifespan) via the `outputs.cloud.qos` / `outputs.scan.qos` parameters.
- Multi-LiDAR deskew example config in `config/example_params.yaml`.
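For context on the gravity-subtraction feature: the usual correction rotates the world gravity vector into the body frame via the IMU orientation estimate and subtracts it from the raw accelerometer reading. A generic sketch (assuming a z-up world frame and a body-to-world unit quaternion; not Polka's actual code):

```python
import numpy as np

def quat_to_rot(w, x, y, z):
    """Rotation matrix (body -> world) from a unit quaternion (w, x, y, z)."""
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def remove_gravity(accel_body, orientation_wxyz, g=9.81):
    """Subtract gravity from a body-frame accelerometer reading using the
    IMU orientation estimate (generic sketch of the deskew correction)."""
    R = quat_to_rot(*orientation_wxyz)
    # At rest the accelerometer measures +g along world z (reaction force);
    # rotate that expectation into the body frame and subtract it.
    gravity_world = np.array([0.0, 0.0, g])
    return np.asarray(accel_body, dtype=float) - R.T @ gravity_world
```

This is also why the degenerate-quaternion fix below matters: if the orientation estimate is invalid, subtracting a wrongly rotated gravity vector is worse than reporting zero acceleration.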
Bug Fixes
- Fix IMU-to-sensor frame rotation in deskew: angular velocity and acceleration are now rotated from the IMU frame into each sensor's frame via TF. Previously only sensors aligned with the IMU got correct deskewing. (fixes #3)
- Fix degenerate quaternion fallthrough: zeroes acceleration instead of passing raw gravity through when IMU orientation is degenerate.
- Fix thread safety in SourceAdapter: mutex protection for frame_id/timestamp during concurrent deskewing.
- Fix stale IMU timestamps: removed dead `average_imu()`, simplified to an atomic snapshot pattern.
- Fix duplicate missing-intensity warning.
- Add CUDA error checking in merge engine kernels.
Improvements
- Default build mode set to Release.
- Throttled warnings for IMU-to-sensor TF lookup failures and missing intensity fields.
- Extracted an `ImuBuffer` class for cleaner IMU management.
- Eliminated config duplication between `load()` and `reload()`.

Please visit and star the repository: github.com/Pana1v/polka ("A drop-in, clean and efficient replacement for your messy lidar pre-processing"). A look at `config/example_params.yaml` on the humble branch would really help you understand the capabilities!
1 post - 1 participant
ROS Discourse General: jros2 Cellphone Sensor Bridge: Native ROS 2 nodes on Android
Hey everyone,
I was recently reading a book on ROS 2 and went looking for ros2_java. After finding out that project essentially died, I discovered jros2 and decided to just build an app to test its limits.
The result is the jros2 Cellphone Sensor Bridge. It is an Android (Jetpack Compose) application that exports live phone sensor telemetry to ROS 2 using standard sensor_msgs, std_msgs message types, and custom mobile_sensor_msgs definitions over the IHMC jros2-android stack (Fast DDS + JavaCPP JNI).
How it works:
- It initializes a fully compliant ROS 2 node (`phone_sensor_node`) directly on your physical Android smartphone.
- It streams real-time, high-frequency telemetry from 11+ hardware/software sensors and input devices (IMU, magnetometer, GPS, touch screen, dual cameras, etc.).
- It employs an Android multicast Wi-Fi lock to ensure robust, real-time discovery of DDS participants directly on the local network without intermediary servers.
Here is a quick demo showing the phone acting as a dual-joystick controller and streaming data: https://www.youtube.com/watch?v=skNQdbO8yrw
Repo, architecture details, and the v1.1.0 APK are available here: https://github.com/SinfonIAUniandes/jros2_cellphone_interface/tree/main
1 post - 1 participant
ROS Discourse General: Where do robot arm specs usually break down in real deployments?
I’m interested in examples from people who have deployed, integrated, or specified robot arms in real environments.
When a robot arm looks good on paper but struggles in the actual cell, what is usually the limiting factor?
Some examples I’m curious about:
- payload-at-reach
- wrist torque
- EOAT / tooling weight
- stiffness or compliance
- thermal limits
- continuous duty cycle
- safe speed vs. required cycle time
- dust, water, shock, or harsh environments
- integration cost / cell complexity
Have you seen projects where the arm was technically close, but the system had to be oversized, slowed down, redesigned, or abandoned?
I’m especially interested in concrete examples:
- What was the task?
- What constraint showed up?
- What did the team do instead?
Looking for field lessons, not brand debates.
1 post - 1 participant
ROS Discourse General: GitHub & Code Hosting
I just want to open a discussion here regarding GitHub.
These days, GitHub's availability has been in the gutter. I have been finding it difficult to get any work done whenever I need to do something as simple as reading code to see how certain things are implemented, because GitHub just won't load pages sometimes.
Lately I haven’t been doing much maintainer work, but I can imagine that the experience is even worse for anyone who is.
I am genuinely considering hosting my own code forge just so I can easily mirror repos to browse them.
I think it would be a good idea to start at least looking at migration options for alternatives to github.
I would personally recommend going with either codeberg or hosting a forgejo instance (the software codeberg is based on), however gitlab is also a viable option.
Obviously, I don’t expect anything to happen soon, nor do I expect that my post will necessarily incite the change — this would be a massive undertaking and requires a lot of thought and work before it can happen. I just want to bring it up to get some people thinking about it.
4 posts - 2 participants
ROS Discourse General: Preparing for State of Cloud Robotics Survey | Cloud Robotics WG Meeting 2026-05-18
Please come and join us for this coming meeting on Mon, May 18, 2026, 4:00–5:00 PM UTC, where we plan to write the questions for a new State of Cloud Robotics survey. The last survey was in 2024 (see https://cloudroboticshub.github.io/survey), and we’d like to refresh the results as of this year. The meeting will therefore go over the previous results and update the questions ready for release.
Last session, we continued our Transitive Robotics tryout by writing a custom capability. We were able to get a working setup by the end of the meeting and create custom code running using the Transitive Robotics framework. If you’re interested in watching, the meeting recording is available on YouTube.
The meeting link for next meeting is here, and you can sign up to our calendar or our Google Group for meeting notifications or keep an eye on the Cloud Robotics Hub.
Hopefully we will see you there!
1 post - 1 participant
ROS Discourse General: Official state and timeline of ROS2 Bazel build
Hi all
This is a follow up question to Will intrinsic supports ros2 on bazel with bzlmod? .
There has been recent work by Intrinsic to create a ROS2 Bazel build (github.com/intrinsic-opensource/ros-central-registry — ROS packages as Bazel modules).
Also, the “Open Robotics Technology Strategy for 2026” ( Open Robotics Technology Strategy 2026 — Open Robotics ) explicitly states
supporting new infrastructure tools such as Bazel
However, development seems to have stopped since March, and there are no updates on the "Will intrinsic supports ros2 on bazel with bzlmod?" thread or the tracking issue (ros2/ros2#1726, "Bazel integration in ROS").
So, my question, is there a plan for Bazel to be officially supported, and if yes, is there a timeline for it?
Best
Jonathan
2 posts - 2 participants
ROS Discourse General: Handling high hardware/communication latency (~360ms)
Hello,
I am developing an industrial Autonomous Mobile Robot (AMR) using ROS 2 Humble and a B&R PLC (communicating via OPC UA PubSub/UDP). I have measured a significant round-trip latency between ROS 2 cmd_vel and the actual wheel odometry feedback.
Hardware Setup:
Processor: Intel i5-14500 (PC-side)
PLC: B&R (Automation Studio)
Communication: OPC UA (UDP)
Sensors: SICK Safety Lidars (connected directly to ROS 2)
The Problem
Using a custom latency logger, I’ve measured a ~364ms delay from the moment cmd_vel is published until the PLC-based odometry reflects the physical movement. This includes:
Network serialization (OPC UA/UDP)
Mechanical inertia and electromagnetic brake release.
When using CLOSED_LOOP feedback in the velocity_smoother, the robot exhibits significant jitter and oscillations because the smoother reacts to “old” odometry data.
Current Workaround
Questions
Is a ~360ms round-trip latency considered "within acceptable limits" for heavy industrial AMRs in the Nav2 ecosystem?
Is the "negative timestamping" approach (now - delay) considered a safe practice for temporal synchronization between high-speed Lidars and high-latency PLC Odometry?
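For reference, the "negative timestamping" idea amounts to back-dating the message header stamp by the measured delay, so that downstream fusion (e.g. a robot_localization EKF) aligns the high-latency odometry with low-latency LiDAR data. A minimal sketch in plain Python (hypothetical helper, not a ROS API):

```python
def backdated_stamp(now_s, measured_delay_s):
    """Compute a back-dated (sec, nanosec) header stamp: the time the
    measurement physically occurred, i.e. now minus the measured latency.
    Sketch only; assumes the delay estimate is reasonably stable."""
    stamp_s = now_s - measured_delay_s
    sec = int(stamp_s)
    nanosec = int(round((stamp_s - sec) * 1e9))
    return sec, nanosec
```

The main risk with this approach is that the ~364 ms figure is an average: if the actual delay jitters, the back-dated stamp introduces its own error, which is worth bounding before trusting it for fusion.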
1 post - 1 participant
ROS Discourse General: [Demo] Bridging ROS 2 and VDA5050-style Fleet Telemetry: Command Integrity via NARH
Hi everyone,
Following the recent discussion around /cmd_vel, timeouts, stamped commands, and ros2_control, I realized there is a useful missing layer between vehicle-internal command execution and fleet-level observability.
I’ve updated ros2_kinematic_guard to explore that layer.
The project is no longer only a local /cmd_vel guard. It now also exposes command-execution-integrity telemetry.
The core idea is:
ROS 2 executes robot behavior.
Fleet systems coordinate robot behavior.
But something needs to report whether the recent command window is still trustworthy.
What changed
I added a reporter_node.py that translates the NARH Guard state into:
- ROS-native diagnostics: `/diagnostics` (`diagnostic_msgs/DiagnosticArray`)
- VDA5050-style fleet telemetry: `/command_integrity/vda5050_state`
- Compact fleet-readable summary: `/command_integrity/summary`
Example output during a Wi-Fi collapse stress test
state=RESYNCING latency=CRITICAL R_NAR=5.478 vehicle_response=RESYNC_REQUIRED fleet_action=HOLD_NEW_ORDERS
state=RESYNCING latency=CRITICAL R_NAR=1412.078 vehicle_response=RESYNC_REQUIRED fleet_action=HOLD_NEW_ORDERS
state=RECOVERED latency=NORMAL R_NAR=0.000 vehicle_response=NONE fleet_action=NONE
This means the local command stream is not yet trusted again, and a fleet/orchestration layer could avoid assigning new orders, intersection-heavy tasks, or timing-critical maneuvers until the vehicle recovers.
Why this is different from a timeout
Timeouts answer:
Did a command arrive recently?
The NARH Guard asks:
Is the recent command/feedback window still trustworthy?
This is not meant to replace ros2_control controller-side mechanisms such as stamped references, speed limiting, timeouts, or smooth stop behavior.
Instead, the goal is to expose a higher-level signal that can be consumed by:
- ROS diagnostics
- bag / MCAP post-analysis
- fleet managers
- VDA5050-style orchestration layers
- heterogeneous robots that do not all run the same controller stack
| Failure Mode | Heartbeat / Timeout | NARH Kinematic Guard |
|---|---|---|
| Packet loss | Detects silence only | Detects silence + local predictive braking |
| Stale command | Often ignored | Detected via timing + kinematic drift |
| Burst command | Hidden by buffer | Detected via residual spikes |
| Replay/Out-of-order | Hard to catch | Caught via phase-continuity check |
| Command/Odom Conflict | Blind to physics | Directly measured consistency |
| Recovery Logic | Binary (On/Off) | State-aware resync gate |
| Fleet Visibility | Internal failure only | Structured telemetry (ROS/VDA5050) |
Repository:
I’d be very interested in feedback from anyone working on VDA 5050 bridges, mixed-fleet integration, ROS diagnostics, or fleet-level degraded-mode reporting.
Does a quantitative command-integrity signal like this fit any real orchestration or post-incident analysis needs you have seen?
1 post - 1 participant
ROS Discourse General: How to Build a Robot Arm IK Solver in ROS2 | NERO Arm Parametric Inverse Kinematics
Complete Tutorial on Nero Arm Angle Parametric IK
Reference paper (Tsinghua University): "Inverse kinematic optimization for 7-DoF serial manipulators with joint limits"
Part 1. Overview
This document provides a complete mathematical tutorial on parameterized inverse kinematics (IK) for the NERO 7-DoF robotic arm.
The content mainly corresponds to:
- Tsinghua University paper: Inverse Kinematics Solution for 7-DoF Robotic Arms with Joint Limit Optimization
- Implementation: `ik_solver.py`
- ROS2 real-time runtime node: `ik_joint_state_publisher.py`
Part 2. Algorithmic Background and Core Concepts
2.1 Fundamental Characteristics of 7-DoF Redundant Robot Arms
A 7-DoF robotic arm with an S-R-S configuration (Spherical Shoulder – Revolute Elbow – Spherical Wrist) introduces one additional redundant degree of freedom compared with a conventional 6-DoF manipulator.
This means that:
- When the end-effector pose is fixed, the joint configuration may still have infinitely many solutions, and the arm can still move internally while keeping the end-effector stationary.
This type of motion, where the end-effector remains fixed while the robot reconfigures itself, is referred to as null-space motion.
Redundancy provides several important advantages:
- Joint limit avoidance
- Obstacle avoidance
- Elbow posture optimization
- Smoother trajectory generation
2.2 Elbow Angle Parameterization (Core Contribution of the Paper)
The core idea of the paper is:
Use a single parameter to represent the entire redundant degree of freedom. This parameter is called the elbow angle \psi (written as \theta in the code implementation).
Geometric Definition of the Elbow Angle
When the end-effector pose is fixed, both points S and point W are fixed in space.
The elbow point E then traces a circle in 3D space.
The rotational angle within the plane of this circle is defined as the elbow angle \psi.
- S: Shoulder center (intersection point of the first 3 joint axes)
- E: Elbow center (location of Joint 4)
- W: Wrist center (intersection point of the last 3 joint axes)
- Points S–E–W form a triangle with fixed side lengths
- The elbow angle \psi determines the position of point E on the circle.
In one sentence:
- \psi → elbow posture changes → joint angles change → end-effector remains unchanged
2.3 Differences Between This Method and Traditional Numerical IK Solvers
| Comparison Aspect | Numerical Iterative Methods (Jacobian / Damped Least Squares) | Elbow-Angle Parameterized Analytical IK |
|---|---|---|
| Solution Strategy | Iterative convergence, dependent on initialization | Geometric derivation with closed-form solution |
| Computational Speed | Slow (ms–10 ms) | Extremely fast (<0.1 ms) |
| Convergence | May fail to converge; susceptible to local minima | Globally optimal and divergence-free |
| Joint Limit Handling | Passive constraint handling; easy to violate limits | Active feasible-region control; never exceeds limits |
| Null-Space Control | Requires projection operators; prone to instability | Direct control through \psi; naturally stable |
Part 3. Complete Algorithm Workflow
The entire algorithm consists of four core stages:
- Extract S, W, and θ_4 from the target pose.
- Compute the elbow point E from the elbow angle ψ, and analytically solve q_1–q_3 and q_5–q_7
- Compute the feasible region of the elbow angle under all joint-limit constraints
- Optimize the elbow angle within the feasible region using a weighted quadratic objective function
The following sections correspond directly to the equations in the paper and the implementation in code.
3.1 Step 1: Solving for S, W, and \theta_4 from the Target Pose
Theory from the Paper
Given the end-effector pose T_{07}, we first solve for:
- Shoulder point S
- Wrist point W (obtained by offsetting the end-effector frame backward by d_6)
- Elbow joint angle \theta_4 (uniquely determined from the S–E–W triangle using the law of cosines)
- As illustrated in the figure, the construction is defined by points S, W, E, and D
Law of Cosines
\cos\theta_4 = \frac{\|SW\|^2 - \|SE\|^2 - \|EW\|^2}{2\,\|SE\|\,\|EW\|}
Code Implementation: _compute_swe_from_target
```python
import math
from typing import List, Optional, Tuple

import numpy as np

def _compute_swe_from_target(T07: np.ndarray, p: NeroParams) -> Tuple[np.ndarray, np.ndarray, Optional[float], np.ndarray]:
    R = T07[:3, :3]
    p_target = T07[:3, 3]
    z7 = R[:, 2]
    d6 = float(p.d_i[6])
    d1 = float(p.d_i[0])
    # End-effector flange center
    O7 = p_target - p.post_transform_d8 * z7
    # Wrist center W: offset backward from the flange by d6
    W = O7 - d6 * z7
    # Shoulder center S: fixed at height d1 above the base
    S = np.array([0.0, 0.0, d1], dtype=float)
    # Solve the absolute value of θ4 using the law of cosines
    q4_abs = _solve_theta4_from_triangle(S, W, p)
    # Unit vector from shoulder to wrist
    v_sw = W - S
    n_sw = np.linalg.norm(v_sw)
    u_sw = v_sw / n_sw if n_sw > 1e-12 else np.array([0.0, 0.0, 1.0])
    return S, W, q4_abs, u_sw
```
Helper Function: _solve_theta4_from_triangle
```python
def _solve_theta4_from_triangle(S: np.ndarray, W: np.ndarray, p: NeroParams) -> Optional[float]:
    l_sw = np.linalg.norm(W - S)
    l_se = abs(p.d_i[2])
    l_ew = abs(p.d_i[4])
    # Law of cosines; clip to guard against numerical noise outside [-1, 1]
    c4 = (l_sw**2 - l_se**2 - l_ew**2) / (2.0 * l_se * l_ew)
    c4 = np.clip(c4, -1.0, 1.0)
    return math.acos(c4)
```
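A quick numeric sanity check of this step, standalone and using hypothetical link lengths rather than the real NERO parameters. With this sign convention, θ₄ = 0 corresponds to a fully straight arm, and ‖SW‖ can be reconstructed from θ₄:

```python
import math

def theta4_from_lengths(l_se, l_ew, l_sw):
    """Standalone law-of-cosines step: elbow joint angle from the three
    side lengths of the S-E-W triangle (theta4 = 0 means fully straight)."""
    c4 = (l_sw**2 - l_se**2 - l_ew**2) / (2.0 * l_se * l_ew)
    return math.acos(max(-1.0, min(1.0, c4)))

# Hypothetical link lengths in metres (not the real NERO parameters)
l_se, l_ew = 0.30, 0.25
q4 = theta4_from_lengths(l_se, l_ew, 0.45)
# Reconstruct ||SW|| from q4: it must round-trip to 0.45
l_sw_check = math.sqrt(l_se**2 + l_ew**2 + 2 * l_se * l_ew * math.cos(q4))
```

The clip to [-1, 1] matters in practice: when the target lies exactly at full reach, floating-point noise can push the cosine fractionally above 1 and make a bare `acos` raise a domain error.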
Key Insight
The elbow joint angle θ_4 depends only on the geometric link lengths and is completely independent of the arm angle ψ.
3.2 Step 2: Solving the Elbow Point E from the Arm Angle \psi (Core Geometry)
Theory from the Paper
The elbow point E lies on a circle whose chord is defined by the segment SW:
E = C + r(\cos\psi \, e_1 + \sin\psi \, e_2)
Where:
- C: circle center
- r: circle radius
- e_1,e_2: orthonormal basis vectors spanning the circle plane
Code Implementation: _elbow_from_arm_angle
```python
def _elbow_from_arm_angle(S: np.ndarray, W: np.ndarray, theta0: float, p: NeroParams) -> Optional[np.ndarray]:
    l_se = abs(p.d_i[2])
    l_ew = abs(p.d_i[4])
    sw = W - S
    l_sw = np.linalg.norm(sw)
    u_sw = sw / l_sw
    # Projection of circle center C onto line SW
    x = (l_se**2 - l_ew**2 + l_sw**2) / (2.0 * l_sw)
    r2 = l_se**2 - x**2
    r = math.sqrt(max(0.0, r2))
    C = S + x * u_sw
    # Construct circle-plane coordinate system e1, e2
    os_vec = S.copy()
    t = np.cross(os_vec, u_sw)
    e1 = t / np.linalg.norm(t)
    e2 = np.cross(u_sw, e1)
    e2 = e2 / np.linalg.norm(e2)
    # Compute elbow point E from arm angle theta0
    E = C + r * (math.cos(theta0) * e1 + math.sin(theta0) * e2)
    return E
```
This is the geometric core of the entire algorithm.
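A self-contained way to convince yourself of the construction (with hypothetical link lengths and a non-degenerate S/W pair; the real values come from NeroParams and the target pose) is to check that the elbow preserves both link lengths for every arm angle:

```python
import math
import numpy as np

# Hypothetical geometry, for illustration only.
L_SE, L_EW = 0.40, 0.40
S = np.array([0.0, 0.0, 0.3])
W = np.array([0.3, 0.2, 0.5])

def elbow(psi: float) -> np.ndarray:
    # Same construction as _elbow_from_arm_angle above.
    sw = W - S
    l_sw = np.linalg.norm(sw)
    u = sw / l_sw
    x = (L_SE**2 - L_EW**2 + l_sw**2) / (2.0 * l_sw)
    r = math.sqrt(max(0.0, L_SE**2 - x**2))
    C = S + x * u
    t = np.cross(S, u)
    e1 = t / np.linalg.norm(t)
    e2 = np.cross(u, e1)
    return C + r * (math.cos(psi) * e1 + math.sin(psi) * e2)

# For every arm angle, the elbow keeps both link lengths exactly.
for psi in np.linspace(-math.pi, math.pi, 9):
    E = elbow(float(psi))
    assert abs(np.linalg.norm(E - S) - L_SE) < 1e-9
    assert abs(np.linalg.norm(E - W) - L_EW) < 1e-9
```

This holds because e1 and e2 are orthogonal to u, so ||E − S||² = x² + r² = l_se² by construction.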
3.3 Step 3: Analytically Solving All Joint Angles from S–E–W
3.3.1 Shoulder Joints: q1,q2,q3
The paper derives a direct closed-form solution using geometric projection:
- q1 is obtained from the projection of point E onto the base plane
- q2 is determined by the height of E
- q3 is solved from the direction of the wrist relative to the elbow
Code: _solve_q123_from_swe
```python
def _solve_q123_from_swe(E: np.ndarray, W: np.ndarray, q4: float, p: NeroParams) -> List[np.ndarray]:
    d0 = p.d_i[0]
    d2 = p.d_i[2]
    d4 = p.d_i[4]
    Ex, Ey, Ez = E
    # q2
    c2 = (Ez - d0) / d2
    c2 = np.clip(c2, -1.0, 1.0)
    s2_abs = math.sqrt(max(0.0, 1.0 - c2**2))
    s4 = math.sin(q4)
    c4 = math.cos(q4)
    sols = []
    # Traverse both positive and negative s2 configurations
    for s2 in (s2_abs, -s2_abs):
        # q1
        c1 = -Ex / (d2 * s2)
        s1 = -Ey / (d2 * s2)
        n1 = math.hypot(c1, s1)
        c1 /= n1
        s1 /= n1
        q1 = math.atan2(s1, c1)
        q2 = math.atan2(s2, c2)
        # q3
        v = W - E
        col2 = -v / d4
        u1, u2, u3 = col2
        b1 = (s2 * c1 * c4 - u1) / s4
        b2 = (u2 - s1 * s2 * c4) / s4
        s3 = s1 * b1 + c1 * b2
        c2c3 = -c1 * b1 + s1 * b2
        c3 = c2c3 / c2 if abs(c2) > 1e-8 else (u3 + c2 * c4) / (s2 * s4)
        n3 = math.hypot(s3, c3)
        s3 /= n3
        c3 /= n3
        q3 = math.atan2(s3, c3)
        sols.append(np.array([q1, q2, q3]))
    return sols
```
3.3.2 Wrist Joints: q5,q6,q7
The paper analytically extracts the wrist joint angles directly from the transformation matrix T_{47}
- cos \theta_6 = T_{47}[1,2]
- \theta_5 and \theta_7 are computed from neighboring matrix element ratios
Code: _extract_567_from_T47_paper
```python
def _extract_567_from_T47_paper(T47: np.ndarray) -> List[np.ndarray]:
    sols = []
    c6 = np.clip(T47[1, 2], -1.0, 1.0)
    for sgn in (1.0, -1.0):
        s6 = sgn * math.sqrt(max(0.0, 1.0 - c6**2))
        if abs(s6) < 1e-8:
            continue
        th6 = math.atan2(s6, c6)
        th5 = math.atan2(T47[2, 2] / s6, T47[0, 2] / s6)
        th7 = math.atan2(T47[1, 1] / s6, -T47[1, 0] / s6)
        sols.append(np.array([th5, th6, th7]))
    return sols
```
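To illustrate the two-branch sign ambiguity, here is a hedged round-trip sketch: it builds a synthetic matrix populated only in the entries the extractor reads, following the element relations stated above (the remaining entries are assumed irrelevant to the extraction), and checks that one of the two branches recovers the original angles:

```python
import math
import numpy as np

def make_T47(th5: float, th6: float, th7: float) -> np.ndarray:
    # Fill only the entries the extractor reads, per the relations above.
    s6, c6 = math.sin(th6), math.cos(th6)
    T = np.zeros((4, 4))
    T[1, 2] = c6
    T[2, 2] = math.sin(th5) * s6
    T[0, 2] = math.cos(th5) * s6
    T[1, 1] = math.sin(th7) * s6
    T[1, 0] = -math.cos(th7) * s6
    return T

def extract(T47: np.ndarray):
    # Same two-branch extraction as _extract_567_from_T47_paper.
    sols = []
    c6 = float(np.clip(T47[1, 2], -1.0, 1.0))
    for sgn in (1.0, -1.0):
        s6 = sgn * math.sqrt(max(0.0, 1.0 - c6**2))
        if abs(s6) < 1e-8:
            continue
        sols.append((math.atan2(T47[2, 2] / s6, T47[0, 2] / s6),
                     math.atan2(s6, c6),
                     math.atan2(T47[1, 1] / s6, -T47[1, 0] / s6)))
    return sols

target = (0.3, 1.0, -0.7)
sols = extract(make_T47(*target))
# Two wrist solutions come out; one branch matches the input angles.
assert any(all(abs(a - b) < 1e-9 for a, b in zip(s, target)) for s in sols)
```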
3.4 Step 4: Joint Limits → Feasible Region of the Arm Angle
Theory from the Paper
Each joint's limit interval [q_{min}, q_{max}] rules out a corresponding region of the arm angle.
The intersection of all per-joint valid intervals yields the feasible arm-angle region \Psi_F.
Only arm angles within this feasible region guarantee that all joints remain inside their limits.
Code: _get_theta0_feasible_region
```python
def _get_theta0_feasible_region(T07: np.ndarray, p: NeroParams, step: float = 0.01) -> List[float]:
    feasible = []
    for theta0 in np.arange(-math.pi, math.pi, step):
        if _ik_one_arm_angle(T07, theta0, p):
            feasible.append(float(theta0))
    return feasible
```
Internally, the function calls _ik_one_arm_angle, which performs the following steps:
- Substitute the arm angle \psi
- Solve the complete joint configuration
- Check whether all joints satisfy their limits
- If valid → add the arm angle to the feasible region
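Since _ik_one_arm_angle itself isn't shown in the post, here is a self-contained sketch of the scan pattern using a stub solver (the stub configuration and the joint limits below are invented purely for illustration):

```python
import math
import numpy as np

# Hypothetical joint limits (rad); the real solver reads p.joint_limits.
LIMITS = [(-2.0, 2.0)] * 7

def ik_one(psi: float):
    # Stand-in for _ik_one_arm_angle: a toy configuration that varies
    # with the arm angle, used only to illustrate the scan pattern.
    return [psi, psi / 2, -psi / 3, 1.0, psi / 4, -psi / 5, psi / 6]

def feasible_region(step: float = 0.01):
    out = []
    for psi in np.arange(-math.pi, math.pi, step):
        q = ik_one(float(psi))
        if all(lo <= qi <= hi for qi, (lo, hi) in zip(q, LIMITS)):
            out.append(float(psi))
    return out

region = feasible_region()
print(round(min(region), 2), round(max(region), 2))  # roughly -2.0 .. 2.0
```

Here the first joint binds tightest, so the feasible region is approximately the interval where |ψ| ≤ 2.0.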
3.5 Step 5: Optimal Arm-Angle Selection (Weighted Quadratic Objective Function)
Theory from the Paper
The objective function is defined as:
f(\psi) = \sum w_i(q_i(\psi)-q_{i,prev})^2
- w_i: weight coefficient, which increases as the corresponding joint approaches its mechanical limit.
- Objective: To minimize the overall joint motion while keeping all joints as far as possible from their limits.
Weighting Function (Equation 20 in the Paper)
- w_i = \frac{bx}{e^{a(1-x)} - 1}, \quad x \ge 0
- w_i = \frac{-bx}{e^{a(1+x)} - 1}, \quad x < 0
Where
- a=2.28
- b=2.28
Code: _weight_limits
```python
def _weight_limits(q: float, q_min: float, q_max: float) -> float:
    span = q_max - q_min
    x = 2.0 * (q - (q_min + q_max) * 0.5) / span
    a = 2.38
    b = 2.28
    if x >= 0:
        den = math.exp(a * (1 - x)) - 1
        return b * x / den
    else:
        den = math.exp(a * (1 + x)) - 1
        return -b * x / den
```
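To see how this weighting shapes the cost, here is a small usage sketch (reimplementing the same formula; the a/b values are the ones used in the code above, and the joint range is hypothetical):

```python
import math

def weight(q: float, q_min: float, q_max: float,
           a: float = 2.38, b: float = 2.28) -> float:
    # Same shape as _weight_limits above.
    x = 2.0 * (q - (q_min + q_max) / 2.0) / (q_max - q_min)
    if x >= 0:
        return b * x / (math.exp(a * (1 - x)) - 1)
    return -b * x / (math.exp(a * (1 + x)) - 1)

# The weight is ~0 at mid-range and grows sharply near a limit:
print(round(weight(0.0, -2.0, 2.0), 4))                  # 0.0
print(weight(1.9, -2.0, 2.0) > weight(1.0, -2.0, 2.0))   # True
```

Because the denominator approaches zero as the joint approaches its limit (x → ±1), the penalty diverges, which is what pushes the optimizer to keep joints away from their limits.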
Optimal Arm-Angle Search
```python
def _optimal_theta0(feasible_theta0, T07, p, q_prev):
    best_cost = math.inf
    best_t = feasible_theta0[0]
    for t in feasible_theta0:
        sols = _ik_one_arm_angle(T07, t, p)
        for q_full in sols:
            q = q_full[:7]
            cost = 0.0
            for i in range(7):
                lo, hi = p.joint_limits[i]
                w = _weight_limits(q[i], lo, hi)
                dq = abs(q[i] - q_prev[i])
                cost += w * dq * dq
            if cost < best_cost:
                best_cost = cost
                best_t = t
    return best_t
```
This is the optimal solution selection strategy proposed in the paper.
In essence, it transforms the problem into:
One-dimensional minimization of a quadratic cost over the arm angle → a globally optimal solution over the sampled feasible region → no iterative solving and no local minima.
Part 4. Null-Space Motion Principle (Naturally Embedded)
For a 7-DoF manipulator, the null space is directly controlled by the arm angle \psi.
The principle is straightforward:
- The end-effector pose T_{07} remains unchanged
- Only the arm angle ψ is varied
- The robot joints automatically perform self-reconfiguration while keeping the end-effector fixed
This is known as null-space motion.
In the implementation, null-space motion can be generated simply by sweeping the arm angle:
```python
for psi in np.linspace(-np.pi, np.pi, 100):
    q = _q_from_theta0(psi, T07, p)
```
No Jacobian matrix is required,
no projection operator is needed,
and the motion remains smooth and stable without oscillation.
Part 5. Code Structure Overview (Clean Version)
Core Functions in ik_solver.py (link)
Part 6. Quick Start Guide
```python
import numpy as np
from ik_solver import ik_arm_angle, NeroParams

# Define target end-effector pose
T = np.eye(4)
T[:3, 3] = [0.5, 0.0, 0.5]

# Solve inverse kinematics
q_best, feasible_set = ik_arm_angle(T)
print("Optimal joint configuration:", q_best)
print("Number of feasible arm angles:", len(feasible_set))
```
Part 7. Summary
This method presents a closed-form inverse kinematics solver for a 7-DoF S–R–S robotic manipulator, combined with a 1D quadratic optimization over the arm-angle null space.
Key characteristics:
1. Pure geometric closed-form solution
- No iterative optimization
- No Jacobian-based numerical solving
2. Automatic joint limit compliance
- Feasible region explicitly constrained
3. Optimality guaranteed via quadratic cost function
- Efficient 1D optimization over arm angle
4. Natural support for null-space motion
- Arm angle acts as redundancy parameter
5. Real-time performance
- Extremely fast computation suitable for control loops and embodied systems
2 posts - 1 participant
ROS Discourse General: Introducing an Rviz Alternative that runs SUPER FAST
A new ROS2-native, SUPERFAST visualizer written in Rust — `fastviz`
Hi everyone,
I’ve been working on a project called **fastviz**: a Rust-based 3D visualizer that runs as a native ROS2 node, built on `wgpu` and `egui`. RViz has been the workhorse of the community for many years and isn’t going anywhere — fastviz is just an experiment to see how much smoothness and headroom we can get out of a pure-Rust + GPU-native pipeline, and I wanted to share where it’s at in case it’s useful to others.
It’s at a preliminary stage — only a handful of message types are wired up so far — but the core architecture is in place and it already renders things like TurtleBot 4 in Gazebo end-to-end.
**Repo:** https://github.com/ksatyaki/fastviz
---
## The bits I’m most excited about
### 1. It IS a ROS2 node
No bridge, no middleware, no separate process. fastviz subscribes directly to topics via `r2r`, so there’s nothing extra to wire up between your robot and the visualizer.
### 2. The render thread never touches ROS2
The `r2r` executor runs on a dedicated thread; the renderer talks to it through an `Arc<RwLock>` with brief, write-only handoffs. The UI never blocks on DDS — frames stay smooth even when a noisy topic is flooding the graph.
### 3. GPU-accelerated via `wgpu`
Vulkan on Linux, Metal on macOS, DX12 on Windows, and WebGPU is on the menu too. Same renderer everywhere.
### 4. Revision-cached render passes
A `revision()` counter on the scene graph drives pass-level caching, so an idle scene costs ~zero CPU. Walking away from the visualizer doesn’t pin a core.
### 5. GPU-side per-entity transforms for point clouds
The point-cloud pipeline is instanced, per-entity transforms happen on the GPU, and the prepare step is revision-cached with buffer reuse. PointCloud2 streams stay cheap.
### 6. TF tree reimplemented in Rust
No `tf2` C++ dependency — TF maintenance lives in pure Rust alongside the rest of the ingestion layer.
### 7. TOML config as the source of truth
Layouts are declared in a TOML file — diff-friendly, version-controllable, and easy to commit alongside your robot’s launch config.
### 8. Polled wildcard topic discovery
Drop `"*"` into a topic list and every matching message type in the ROS graph gets auto-subscribed within about a second. Handy when you’re exploring an unfamiliar bag or sim and don’t want to enumerate topics by hand.
### 9. Per-topic QoS overrides in config
`reliability`, `durability`, and `depth` are all settable per topic from the same TOML file.
### 10. URDF support with STL / OBJ / DAE meshes
URDF parsing via `urdf-rs`; mesh loading covers STL, OBJ, and Collada. `package://` URIs resolve through `AMENT_PREFIX_PATH`, and `JointState` drives the FK.
### 11. Dev container + release Docker image
The `.devcontainer/` ships an Ubuntu 24.04 + ROS2 Jazzy image with `r2r` build deps, the Vulkan loader, and NVIDIA passthrough already wired up. A root `Dockerfile` also builds a release image you can `docker run`.
---
## What’s supported today (early days!)
This is very preliminary — only a few message types are supported right now:
| Topic kind | Message |
| -------------- | -------------------------------- |
| `[map]` | `nav_msgs/OccupancyGrid` |
| `[poses]` | `geometry_msgs/PoseStamped` |
| `[pose_arrays]`| `geometry_msgs/PoseArray` |
| `[paths]` | `nav_msgs/Path` |
| `[scans]` | `sensor_msgs/LaserScan` |
| `[points]` | `sensor_msgs/PointCloud2` |
| `[tf]` | `tf2_msgs/TFMessage` |
| `[urdf]` | `std_msgs/String` + `JointState` |
`MarkerArray`, `Image`, `Imu`, `Odometry`, and friends are on the near-term roadmap. ROS2 Jazzy is the only distro currently tested.
---
## Try it
```sh
git clone https://github.com/ksatyaki/fastviz.git
cd fastviz
source /opt/ros/jazzy/setup.bash
cargo build --release
cargo run -p app -- --config configs/turtlebot4.toml
```
Or via the dev container — open the folder in VS Code / Cursor and pick “Reopen in Container”.
---
## Help wanted
If you give it a spin, I’d genuinely love to hear:
- which message types you’d want supported next,
- what kinds of bags would make good benchmarks,
- any architectural input on plugins, MCAP playback, or multi-window layouts.
Issues, PRs, and “this completely broke on my robot” reports are all very welcome.
Hopefully this can grow into something useful for the community. Thanks for taking a look!
**GitHub:** https://github.com/ksatyaki/fastviz
3 posts - 2 participants
ROS Discourse General: ROSCon Global 2026 Registration Now Open! Workshop and exhibitor info now available!
ROSCon Global 2026 Registration Now Open!
Workshop and exhibitor info now available
Hi Everyone,
I am happy to announce that registration for ROSCon Global in Toronto is now open! We highly encourage you to register as soon as possible, as ROSCon often sells out and our most popular workshops fill up fast. Early bird ticket prices will be available until July 12th, 2026. Our early bird rates are quite generous and effectively make workshop registration free! Even if you don’t plan to attend a workshop, we recommend you join us for all three days of the event, as there will be a number of other activities, like birds-of-a-feather sessions, happening on the first day of ROSCon Global. Given what I’ve heard from other community members, there will likely be a number of other events happening immediately before and after the official ROSCon Global event (I might be cooking something up for the Friday after the conference).
ROSCon Workshops
Due to the incredible demand for ROSCon workshops last year we’ve expanded our workshop capacity for 2026! We’re excited to announce that this year we will be offering eight half-day workshops and two full-day workshops. ROSCon Global is now officially a three day event, and even if you choose not to attend workshops there will be Birds of a Feather sessions and other events during the first day of the event. We recommend that you plan to be in Toronto for the entire week as we have a couple other big announcements coming out in the next few weeks!
I’ve summarized our ROSCon Global workshops below, but a full list is available on the website.
- [Half-day] From URDF to USD: A Complete Pipeline for High-Fidelity ROS 2 Simulation in NVIDIA Isaac Sim with Ji Yuan Feng and Ayush Ghosh. Build a complete robot simulation pipeline from raw URDF to ROS 2 integration using NVIDIA Isaac Sim.
- [Half-day] Train and Deploy Contact-Rich Robot Manipulation Skills With Isaac Lab and Isaac ROS, with Raffaello Bonghi, Rishabh Chadha, Ashwin Varghese Kuruttukulam, and Ayusman Saha. Develop and deploy contact-rich manipulation policies for tasks like gear assembly using Isaac Lab and Isaac ROS.
- [Half-day] Train, Simulate, Deploy: Agentic AI from Cloud to Robot with Ken O’Brien, Graham Schelle, Sarunas Kalade, Mehdi Saeedi, and Adam Dąbrowski. Explore practical pathways for training and deploying embodied AI models across cloud environments and on-device NPUs.
- [Half-day] Scaling ros2_control: From Async Hardware Drivers to RL Inference Engines with Sai Kishor Kothakota, Bence Magyar, Christoph Fröhlich, and Denis Štogl. Architect asynchronous hardware interfaces to prevent I/O bottlenecks and deploy reinforcement learning models efficiently.
- [Full-day] Advanced Aerial Robotics with PX4 and ROS 2: Custom Flight Modes and Beyond with Ramon Hernan Roche Quintana, Beniamino Pozzan, and Patrik Dominik Pordi. Build custom flight modes and sequence complex autonomous behaviors for aerial robots entirely in ROS 2.
- [Half-day] Motion Planning Fundamentals with MoveIt with Yara Shahin and Timotej Gaspar. Learn the core concepts of path planning and collision avoidance by configuring MoveIt 2 for a real robot from scratch.
- [Half-day] Introduction to ROS and Building Robots with Open-Source Software with Geoff Biggs and Katherine Scott. Master the fundamentals of building robots using open-source software.
- [Half-day] Declarative ROS workspaces with Pixi and RoboStack: A hands-on workshop for reproducible ROS development with Ruben Arts, Wolf Vollprecht, and Bas Zalmstra. Use Pixi and RoboStack to declare your entire ROS environment in a single file to guarantee completely reproducible development setups.
- [Half-day] Mastering the Jazzy RMW: A Performance-Driven Framework for ROS 2 Middleware Selection and Tuning with Nathan Van Heyst, Tony Baltovski, Luis Camero, and Jose Mastrangelo. Optimize ROS 2 middleware performance through systematic tuning, benchmarking, and high-scale stress testing.
- [Full-day] Navigation University with David Lu and Binit Shah. Configure a simulated mobile robot step-by-step to successfully navigate its environment using maps, localization, and planning.
ROSCon Global Sponsors
ROSCon Global wouldn’t be possible without our wonderful sponsors. Below you will find a list of the initial batch of ROSCon Global Sponsors. Make sure to check them out at our ROSCon Global Expo Hall. Many of our sponsors will be holding exclusive demonstrations during ROSCon Global that you’ll want to check out.
Gold Sponsors
- Clearpath
- Dexory
- Intel
- Intrinsic
- Locus Robotics
- National Robotics Program of Singapore
- NEXTCHIP
- NVIDIA
- QNX
- Realsense
- Roboto AI
Silver Sponsors
Bronze Sponsors
Startup Alley Sponsors
3 posts - 2 participants
ROS Discourse General: Built an Autonomous Mobile Robot (AMR) for warehouse automation - from CAD to code
Designed the chassis in Fusion 360, exported to URDF, and built the full stack using ROS 2.
Stack:
- Nav2 for navigation & path planning
- ArUco-based visual docking for precise alignment
- Custom waypoint sequencing for multi-shelf tasks
- Gazebo + RViz for simulation & visualization
Challenge:
LiDAR point cloud rotated with the robot in RViz, breaking the mapping and navigation.
Root cause:
odom/TF mismatch during turns.
Fix:
Developed a ground-truth odometry node that uses Gazebo pose data to publish a stable /odom and consistent TF, including handling ROS–Gazebo timestamp issues.
In the video: robot autonomously services requests for Shelf B and Shelf C and delivers them to the drop-off zone.
Happy to discuss the system or challenges!
#ros2 #robotics #AMR #nav2 #Gazebo #urdf #WarehouseAutomation #OpenRobotics #opencv #ComputerVision

1 post - 1 participant
ROS Discourse General: RoboInfra is the missing infrastructure layer for robotics development
Robot models defined in URDF act as the “source code” of robotic systems, but the tooling around them is fragmented, slow, and dependent on full ROS installations.
RoboInfra solves this by providing a unified API platform that enables developers to:
- Validate URDF files instantly with 9+ structural checks
- Analyze robot kinematics including degrees of freedom and chain structure
- Compare URDF versions semantically instead of raw XML diffs
- Convert URDF to simulation-ready formats like SDF (Gazebo) and MJCF (MuJoCo)
- Preview robots in 3D directly in the browser
- Integrate validation into CI/CD pipelines using a lightweight GitHub Action
- Automate workflows via a Python SDK
All of this works without installing ROS, reducing setup time from hours to seconds.
RoboInfra is built for:
- Robotics developers
- Simulation engineers
- Research labs (RL, control, autonomy)
- DevOps teams managing robot pipelines
1 post - 1 participant
ROS Discourse General: ROS 2 Launch YAML/XML Schema + VS Code Integration (Validation, Completion, Substitutions)
Hi everyone,
I’ve been working on a project to improve the developer experience when writing ROS 2 Jazzy launch files.
It provides JSON Schema for ROS 2 launch YAML, XSD for launch XML, and VS Code integration including auto-completion, validation, hover docs, and substitution snippets.
Repository
https://github.com/ok-tmhr/ros2_awesome
Features
1. YAML Schema for ROS 2 Launch Files
Provides full validation and auto-completion for:
- launch: structure
- actions (node, group, arg, set_env, etc.)
- parameters
- substitutions
- conditions (if, unless)
Schema URL:
https://ok-tmhr.github.io/ros2_awesome/schema/launch.yaml
You can enable it by adding this at the top of your .launch.yaml:
# yaml-language-server: $schema=https://ok-tmhr.github.io/ros2_awesome/schema/launch.yaml
2. XML Schema (XSD) for launch XML
https://ok-tmhr.github.io/ros2_awesome/schema/launch_ros.xsd
Usage:
<launch
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="https://ok-tmhr.github.io/ros2_awesome/schema/launch_ros.xsd">
3. Substitution Snippets for VS Code
The repo includes a snippet extension (launch_substitution.json) that can be installed via:
It provides auto-completion for:
$(var ...)
$(env ...)
$(not ...)
$(eval ...)
Typing $ triggers suggestions.
4. Substitutions also work in parameter files
When a parameter YAML is loaded via a launch file, substitutions are evaluated:
my_node:
ros__parameters:
use_sim_time: $(var use_sim_time)
robot_name: $(env ROBOT_NAME)
This is supported by the schema and by the snippet extension.
5. Sample directory included
The sample/ directory contains working examples for both YAML and XML launch files.
2 posts - 2 participants
ROS Discourse General: ros2_lingua: A safe, dependency-aware grounding engine for LLMs
Hi everyone,
Like many of us, I’ve been experimenting with giving LLMs control over robot hardware. However, I quickly ran into the classic problems: LLMs hallucinate actions, assume prerequisites that haven’t been met (e.g., trying to drive a humanoid before stabilizing it), and most existing integrations are just tightly coupled, hardcoded scripts.
To solve this, I built ros2_lingua — an open-source bridge that introduces a structured capability contract between ROS 2 nodes and LLMs.
Instead of letting the LLM guess what topics or actions to call, ros2_lingua forces the LLM to output a plan based only on explicitly registered capabilities, and uses a backward-chaining planner to automatically inject missing prerequisite steps.
How it works:
- Capability Advertisement: Any ROS 2 node can inherit from LinguaMixin to self-advertise its capabilities at boot. It defines its name, ROS action/service, parameters, preconditions, and postconditions.
- Backward-Chaining Planner: When a user gives a natural language instruction (e.g., “go to the table and pick up the bottle”), the Grounding Engine checks the robot’s current state against the capability schema. If the robot isn’t balanced, the planner automatically injects a stabilize_robot capability before the navigation step.
- Safe Dispatch: The DispatcherNode safely executes the validated plan over standard ROS 2 actions and services.
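The prerequisite-injection idea can be illustrated with a toy backward-chaining sketch (the capability names and schema below are hypothetical, not the actual ros2_lingua API): a missing precondition is satisfied by scheduling the capability whose postconditions provide it.

```python
# Toy capability registry: each capability declares its pre- and
# postconditions as sets of state facts.
CAPABILITIES = {
    "navigate_to":     {"pre": {"balanced"}, "post": {"at_goal"}},
    "stabilize_robot": {"pre": set(),        "post": {"balanced"}},
}

def plan(goal_cap: str, state: set) -> list:
    steps = []

    def resolve(cap: str) -> None:
        for cond in CAPABILITIES[cap]["pre"]:
            if cond not in state:
                # Find a capability whose postconditions provide `cond`
                # and schedule it first (backward chaining).
                provider = next(c for c, d in CAPABILITIES.items()
                                if cond in d["post"])
                resolve(provider)
        steps.append(cap)
        state.update(CAPABILITIES[cap]["post"])

    resolve(goal_cap)
    return steps

print(plan("navigate_to", set()))  # ['stabilize_robot', 'navigate_to']
```

If the robot is already balanced, the stabilization step is simply skipped; the plan depends only on the current state and the declared contracts.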
Decoupled Architecture
One of my main goals was to ensure the core logic was highly testable. The project is split into two layers:
ros2_lingua_core: A pure Python library containing the schema, registry, planner, and LLM backends (Ollama, OpenAI, Anthropic). It has zero ROS 2 dependencies, meaning the grounding engine can be unit-tested purely in Python.
ros2_lingua: The ROS 2 interface layer containing the GroundingNode, DispatcherNode, and mixins.
Links & Demo
You can see a demo of the engine running with a local Ollama model and a mock humanoid setup, along with the full architecture documentation here:
Documentation & Architecture: ros2_lingua — Documentation
GitHub Repository: GitHub - purahan/ros2_lingua: Natural language to ROS2 actions — a structured LLM grounding engine for any robot. · GitHub
What’s Next & Feedback Request
The project is currently a working prototype in Python. My immediate roadmap includes taking this to a release-ready state and building a C++ bridge so native controller nodes can easily advertise their capabilities.
Since this is early development, I would love to get feedback from the community on the architecture—specifically on the schema design for the capability registry and how best to handle complex, long-running action pre-emptions within the Dispatcher.
Thanks for your time, and I’d love to hear your thoughts!
1 post - 1 participant
ROS Discourse General: Control Algorithm Dominance Survey
Hey guys, I’m running a survey to gauge the dominance of different control engineering paradigms in industry: has there been a noticeable shift from classical control to more modern algorithms, or do modern algorithms, while looking good on paper, remain stuck in research papers for the most part?
I would love everyone’s inputs, from student to seasoned researcher.
You’re still welcome to contribute if you don’t work directly in controls, or if your work is controls-adjacent, like SWE or mechanical design.
2 posts - 2 participants
ROS Discourse General: Where does latency in WebRTC video streaming come from? An analysis
We analyzed the glass-to-glass latency of streaming video from robots to the web using WebRTC. Typical total latency for remote streaming is 150–180 ms, but how does this break down?
Tl;dr:
- The vast majority of latency actually comes from the camera itself and the USB bus (~100 ms).
- H264 encoding and decoding add around 10 ms each (or less).
- WebRTC only adds around 10 ms of latency for remote streaming (jitter buffers).
- The rest is due to static network delay (“ping timing”, speed of light).
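A quick back-of-envelope check on the numbers above: summing the quoted per-stage latencies leaves roughly 20–50 ms of the 150–180 ms total for static network delay.

```python
# Per-stage latencies as quoted in the breakdown above (ms).
stages_ms = {
    "camera + USB bus": 100,
    "H264 encode": 10,
    "H264 decode": 10,
    "WebRTC jitter buffer": 10,
}
accounted = sum(stages_ms.values())
print(accounted)                          # 130
print(150 - accounted, 180 - accounted)   # 20 50
```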
Read the full analysis here:
1 post - 1 participant
ROS Discourse General: ROS2 + Gazebo Harmonic on macOS 26
Hello,
It seems the installation guide for ROS 2 on macOS 26 is this: Installing ROS 2 on macOS — ROS 2 Documentation: Crystal documentation
The Gazebo Harmonic installation guide for macOS 26 is this: Binary Installation on macOS — Gazebo harmonic documentation
I’ve read online that there might be some incompatibility issues. I’d like to understand the current picture of ROS 2 + Gazebo Harmonic on macOS. Would those two guides work, and are there any expected issues?
1 post - 1 participant