RLWRLD releases RLDX-1 for more dexterous robot hands

RLWRLD has released RLDX-1, a foundation model built specifically for highly dexterous robot hands, with capabilities that combine vision, motion, memory, and force sensing. The model aims to close gaps in current models on precise industrial tasks such as picking from a moving conveyor, handling changing weight, and manipulating objects with the fingers.

AI summary

RLDX-1 is an important step toward closing the "last mile" of industrial automation, where robots need to understand context and physical sensation, not just vision, to carry out precise human tasks.
RLWRLD said that with RLDX-1 it aimed to include capabilities such as context memorization and force sensing, which existing models often lack.

RLWRLD said real-world interaction requires recognizing what to do, maintaining relevant state over time, and grounding decisions in physically meaningful signals. | Source: RLWRLD

RLWRLD last week presented RLDX-1, a new dexterity-first foundation model. The company built the model to tackle complex real-world industrial tasks using high degree-of-freedom (DoF) robotic hands.

Existing foundation models often lack essential capabilities, such as context memorization or force sensing, required for seamless real-world deployment, according to RLWRLD. To address this, RLDX-1 encompasses the complete robotics lifecycle. It integrates a scalable data-collection pipeline, a versatile architecture design, robust training methodologies, and optimized deployment strategies, said the company.

As a result, RLDX-1 achieves state-of-the-art performance, claimed RLWRLD. The model showcases precision and generalization across both simulated environments and physical industrial applications. RLWRLD designed the RLDX-1 foundation model from the ground up for dexterous robot hands. Every component exists because a specific failure mode on a real task required it.

The result is a single model that can see, feel, remember, and adapt, deployable across single-arm, dual-arm, and humanoid embodiments with high-DoF hands.

RLWRLD identifies five regimes of dexterity

The last mile of industrial automation is dexterity. Today’s robots still cannot reliably pour coffee as the pot grows lighter, pick a moving object off a conveyor, or rotate a hex nut with fingertips, noted Seoul, South Korea-based RLWRLD.

RLWRLD distilled these recurring customer needs into DexBench, a benchmark that organizes them along five regimes of dexterity, where each regime is a specific failure mode of today’s robots. These five regimes are:

Grasp diversity: Five-fingered hands are the prerequisite every regime below assumes. RLWRLD has run more than 10 of them in-house. It uses two data pipelines to diversify grasping. Synthetic robot data expands a small set of teleoperation demonstrations, while Human Data covers the high-DoF in-hand dexterity that teleoperation cannot reach.

Spatial precision: The policy must capture sufficient scene structure to place contact correctly before contact is made. RLDX-1 strengthens this capability with a robot-specialized vision language model (VLM) fine-tuned on robot visual question answering (VQA), where the questions explicitly target the geometric relationship between the robot end-effector and the target object. This training encourages the VLM to better ground object locations and spatial relations that are critical for precise contact placement.

Temporal precision: A single-frame policy commits to where objects were; by the time the hand arrives, the conveyor object has moved. To address this, the Motion Module extracts motion features from space-time visual correspondences and amortizes multi-frame context into a compact representation. It lets the policy see where and how fast objects are going.

Contact precision: A coffee pot growing lighter is visually invariant; the signal is in wrist torque. The Physics Module gives tactile and torque signals their own streams and predicts future contact states alongside actions, so the policy anticipates contact transitions before they happen.

Context awareness: This is task-level reasoning that wraps around the three precisions. Without it, even a perfectly executed motion is stranded at the single step it was planned for, said RLWRLD. The policy needs memory, recovery, and progress-awareness.

RLDX is built on a multi-stream action transformer

The full RLDX architecture. | Source: RLWRLD

Each regime enters the model as a fundamentally different modality: Torque is a high-rate continuous stream, video is sparse high-dimensional frames, and memory is stateful. In a single conventional transformer, whichever modality dominates the gradient absorbs all the capacity while the rest become decorative. The architectural answer is the Multi-Stream Action Transformer (MSAT).

Each modality gets its own processing stream, and cognition tokens compress the VLM output into a fixed-size interface. What follows unpacks each layer: the architecture that holds these modalities together, the data engine that trains it, and the post-training that makes it deployable.

RLDX is built on MSAT, an architecture where each modality gets its own processing stream, and joint self-attention lets them interact.

Existing vision-language-action models (VLAs) fuse modalities inside a single transformer stream, where whichever modality dominates the gradient absorbs all the capacity. MSAT gives each modality its own dedicated processing stream, then lets the streams communicate through joint self-attention without being forced into a shared representation prematurely.
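The idea of separate streams that still attend to one another can be illustrated with a toy block. This is only a minimal sketch, not RLWRLD's implementation: the names `msat_block` and `stream_weights`, the token counts, and the use of plain single-head attention without learned query, key, and value projections are all assumptions made for illustration.

```python
import math
import random


def linear(x, w):
    """Project vector x (length d_in) with weight matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(tokens):
    """Single-head attention in which every token attends to every token.

    Queries, keys, and values are the tokens themselves (an assumption that
    keeps the sketch short; a real model would learn separate projections).
    """
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        mixed.append([sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)])
    return mixed


def msat_block(streams, stream_weights):
    """Hypothetical multi-stream block: per-modality weights, shared attention."""
    # 1. Each modality is projected with its own weights (its "stream").
    projected = {
        name: [linear(tok, stream_weights[name]) for tok in tokens]
        for name, tokens in streams.items()
    }
    # 2. Tokens from all streams are concatenated and attend to each other.
    order = list(projected)
    flat = [tok for name in order for tok in projected[name]]
    mixed = joint_self_attention(flat)
    # 3. The result is split back so each modality keeps its own token set.
    out, start = {}, 0
    for name in order:
        count = len(projected[name])
        out[name] = mixed[start:start + count]
        start += count
    return out


random.seed(0)
DIM = 4
streams = {
    "vision": [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)],
    "torque": [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(2)],
    "memory": [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1)],
}
stream_weights = {
    name: [[random.gauss(0, 0.5) for _ in range(DIM)] for _ in range(DIM)]
    for name in streams
}
mixed_streams = msat_block(streams, stream_weights)
print({name: len(tokens) for name, tokens in mixed_streams.items()})
```

Because the weights are per stream while attention is shared, information can flow between modalities without any one of them having to share parameters with the others.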

Early blocks keep modalities in parallel streams; later blocks fuse them for action decoding, explained RLWRLD.

RLDX-1 uses a robot-specialized VLM

General-purpose VLMs are strong at visual reasoning, but they do not automatically understand what matters for robot control, RLWRLD asserted. To close this gap, RLDX-1 fine-tunes Qwen3-VL 8B on a robot-trajectory VQA dataset targeting three action-relevant abilities.

First, it targets spatial reasoning about the geometric relationship between the end-effector and target objects. Second is task understanding, which identifies the intermediate subtask implied by the current observation. Third is action grounding, which reasons about the low-level action associated with the current frame. The fine-tuned model, RLDX-1-VLM, serves as the visual reasoning backbone for action generation. It gains 3.42 percentage points (%p) over the vanilla VLM on RoboCasa.

A single-frame policy is always one step behind the scene, noted RLWRLD. By the time the hand arrives, the conveyor object has moved. The Motion Module has two complementary pieces. A video token compression layer feeds multi-frame observations through the VLM, compressing past frames into motion tokens via average pooling, so the model efficiently sees where things are going.
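One possible reading of the video token compression step is sketched below. The source only says that past frames are compressed into motion tokens via average pooling, so the details here are assumptions: pooling is applied over groups of consecutive tokens within each past frame, the current frame keeps all of its tokens, and the names `average_pool_tokens`, `build_visual_sequence`, and `pool_size` are hypothetical.

```python
def average_pool_tokens(frame_tokens, pool_size):
    """Average each group of pool_size consecutive token vectors into one vector."""
    pooled = []
    for start in range(0, len(frame_tokens), pool_size):
        group = frame_tokens[start:start + pool_size]
        dim = len(group[0])
        pooled.append([sum(tok[j] for tok in group) / len(group) for j in range(dim)])
    return pooled


def build_visual_sequence(frames, pool_size):
    """Compress every past frame, keep the latest frame at full resolution."""
    *past, current = frames
    motion_tokens = [
        tok for frame in past for tok in average_pool_tokens(frame, pool_size)
    ]
    return motion_tokens + current


# Three frames of eight 2-dimensional tokens each (toy values).
frames = [[[float(f * 10 + t), 1.0] for t in range(8)] for f in range(3)]
sequence = build_visual_sequence(frames, pool_size=4)
print(len(sequence))  # 2 past frames x 2 pooled tokens + 8 current tokens = 12
```

In this toy case the two past frames shrink from 16 tokens to 4, so the sequence the model has to process grows much more slowly than the number of frames it can look back on.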

A motion learning layer in the vision encoder models spatio-temporal self-similarities (STSS), capturing rotation, velocity, and interaction dynamics directly from visual features. Together, the two pieces deliver +37.5%p over GR00T N1.6 and π₀.₅ on a conveyor-belt pick-and-place task.

RLDX-1’s Physics Module serves two key functionalities

The Physics Module integrates tactile and torque feedback into RLDX as native modalities.

These physical signals are crucial for tasks that require contact-rich object manipulation, primarily serving two key functionalities: weight estimation and contact detection. For weight estimation, when a robot pours coffee, the module captures weight shifts across both hands to inform RLDX precisely when to stop. For contact detection, a robot needs to identify the exact moment of contact to transition from approaching to picking.

While joint angles provide ambiguous information regarding contact timing, torque signals offer distinct, sharp changes at the point of contact. To fully leverage this, RLDX employs a dedicated stream that not only processes these signals but also predicts future torque states, allowing the policy to possess informative physical embeddings.
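The contrast between joint angles and torque can be made concrete with a toy detector. This is a hand-written threshold heuristic, not the learned torque stream RLWRLD describes; the function name `detect_contact`, the sample values, and the threshold are all assumptions chosen for illustration.

```python
def detect_contact(signal, threshold):
    """Return the index of the first sample-to-sample jump above threshold, or None."""
    for i in range(1, len(signal)):
        if abs(signal[i] - signal[i - 1]) > threshold:
            return i
    return None


# Made-up readings: torque jumps when the fingers touch the object,
# while the joint angle keeps changing smoothly through the same moment.
torque = [0.10, 0.11, 0.10, 0.12, 0.85, 0.90]
joint_angle = [0.00, 0.10, 0.20, 0.30, 0.40, 0.40]

print(detect_contact(torque, threshold=0.3))
print(detect_contact(joint_angle, threshold=0.3))
```

With these numbers the torque trace yields a clear contact index while the joint-angle trace yields none, which is the ambiguity the paragraph above refers to.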

Furthermore, when such sensors are unavailable, the sensory stream automatically deactivates for graceful degradation to vision-only, allowing a single model to support various hardware setups.

Inside RLDX-1’s cognition interface and memory module

The VLM produces a rich scene understanding, but passing all of its tokens to the action model can be slow and wasteful, said RLWRLD.

The Cognition Interface appends 64 learnable cognition tokens to the VLM’s input. Through attention, they compress the full sequence into a fixed-size representation that carries exactly the information the action model needs. The speed win: roughly 35% faster inference (16.3→22.1 Hz). But these tokens d
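The compression idea and the reported speed figures can be sketched as follows. This is a simplified illustration under stated assumptions: the source describes cognition tokens appended to the VLM input, whereas this sketch models them as queries that attend over an already computed token sequence, and the random values stand in for learned parameters. The name `compress_with_cognition_tokens` and the dimensions are hypothetical.

```python
import math
import random


def compress_with_cognition_tokens(sequence, cognition_tokens):
    """Each cognition token attends over the sequence and returns one summary vector."""
    dim = len(sequence[0])
    summaries = []
    for query in cognition_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in sequence]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        summaries.append(
            [sum((e / total) * tok[j] for e, tok in zip(exps, sequence)) for j in range(dim)]
        )
    return summaries


random.seed(0)
DIM = 8
NUM_COGNITION_TOKENS = 64  # the count stated in the article
cognition_tokens = [
    [random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_COGNITION_TOKENS)
]

# The output size stays fixed no matter how long the input sequence is.
long_sequence = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
short_sequence = long_sequence[:10]
compressed_long = compress_with_cognition_tokens(long_sequence, cognition_tokens)
compressed_short = compress_with_cognition_tokens(short_sequence, cognition_tokens)
print(len(compressed_long), len(compressed_short))

# Relative speedup implied by the reported control rates.
relative_speedup = (22.1 - 16.3) / 16.3
print(f"{relative_speedup:.1%}")
```

The action model therefore always receives 64 vectors, and the reported rates of 16.3 Hz and 22.1 Hz correspond to a relative gain of a little over 35%.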

Tags: #robot hands #dexterity #force sensing #foundation models #industrial automation

Original source: The Robot Report
