What Open X-Embodiment Is
Open X-Embodiment (OXE) is a unified dataset of robot manipulation demonstrations collected across more than 22 different robot embodiments, including arms from Franka Emika, Trossen Robotics (WidowX, ViperX), Universal Robots (UR5), KUKA, Google's own robot fleet, and many others. The dataset totals over one million episodes covering hundreds of distinct manipulation tasks: picking, placing, opening drawers and cabinets, pouring liquids, wiping surfaces, stacking objects, and more.
The project was a collaboration of over 30 research institutions, led by Google DeepMind with major contributions from Stanford, UC Berkeley, Carnegie Mellon, MIT, Columbia, ETH Zurich, and others. Each lab contributed its existing demonstration datasets, which were then standardized into the RLDS (Reinforcement Learning Datasets) format. The full dataset is hosted on Google Cloud Storage and is freely available for research use.
The "X" in the name stands for cross-embodiment. The defining ambition of OXE is not just to create a large dataset, but to demonstrate that training on data from many different robots produces better policies than training on data from a single robot, even for that single robot. This hypothesis turned out to be correct, and the evidence has reshaped how the field thinks about robot data.
Why It Matters: The RT-X Results
The landmark finding from the OXE paper (Padalkar et al., 2023) was the performance of RT-X models, specifically RT-1-X and RT-2-X, trained on the full multi-embodiment dataset.
RT-1-X (a smaller, efficient model) trained on OXE data outperformed single-robot specialist RT-1 models by approximately 50% on held-out evaluation tasks across multiple robot platforms. This was the headline result: a single generalist model, trained on data from 22 different robots, performed better on any individual robot than a model trained only on that robot's own data. The mechanism is that cross-embodiment data forces the model to learn embodiment-agnostic manipulation representations, effectively providing a strong prior for visual understanding and task concepts.
RT-2-X (built on a much larger pretrained vision-language backbone, following the RT-2 recipe) showed even stronger cross-embodiment transfer, with particularly impressive zero-shot generalization to robot embodiments not present in the training set. When evaluated on a held-out robot type, RT-2-X achieved meaningful task completion rates without any fine-tuning, something that would be impossible for a model trained only on a single robot's data.
These results validated a core hypothesis: robot manipulation knowledge is partially embodiment-agnostic. A policy that has seen a Franka arm open a drawer and a WidowX arm pick up a cup has learned something about drawers and cups that transfers to a UR5, even though the UR5 has completely different kinematics.
Key Findings from the Paper
What transfers across embodiments: Visual scene understanding (recognizing objects, understanding spatial relationships) transferred most strongly. High-level task semantics (the concept of "pick up," "open," "place on") transferred well. Pre-grasp approach trajectories (moving toward an object before contact) transferred moderately.
What does not transfer well: Precise grasp configurations (exact finger positions relative to object surfaces) required embodiment-specific data. Contact dynamics (grip force modulation, insertion forces) did not transfer. Fine motor control (sub-centimeter precision movements) required per-embodiment fine-tuning.
Data distribution matters: The OXE dataset is not uniformly distributed across embodiments and tasks. Some labs contributed tens of thousands of episodes, others contributed hundreds. The task distribution is heavily skewed toward tabletop pick-and-place. Despite this imbalance, the cross-embodiment benefit was robust, though the largest benefits accrued to under-represented embodiments (which gained the most from cross-embodiment transfer) rather than to the dominant embodiments (which had enough data to train strong specialists).
Scale helps, but diversity helps more: Ablation studies varying the number of embodiments in the training set while holding total episode count constant showed that adding a new embodiment with fewer episodes consistently outperformed adding more episodes from an already-represented embodiment. This diversity-over-volume finding has become one of the most cited and most practically important results in robot learning.
How to Access and Use the Dataset
OXE is hosted on Google Cloud Storage and can be downloaded using the tensorflow_datasets (TFDS) API. The dataset uses the RLDS format, where each episode is a sequence of steps containing observation dictionaries (images, joint states, gripper state), action vectors, reward signals, and natural language task annotations.
Getting started:
- Install tensorflow_datasets: `pip install tensorflow-datasets`
- Browse available sub-datasets at the OXE GitHub repository or the TFDS catalog
- Load a specific sub-dataset: `tfds.load('fractal20220817_data')` (for the RT-1 Fractal dataset, one of the largest components)
- For PyTorch users: use LeRobot's conversion utilities to transform RLDS data into LeRobot Parquet format, or use the oxe_torch_dataloader for direct PyTorch loading
Practical usage patterns:
- Pre-training a foundation model: Download the full OXE dataset (or a diverse subset covering 10+ embodiments). Train your model on this data to learn general manipulation representations. Then fine-tune on your task-specific data. This consistently requires 5-10x fewer task-specific demonstrations than training from scratch.
- Augmenting a small dataset: If you have 100-200 demonstrations on your specific robot, add relevant OXE sub-datasets to your training mixture. Focus on sub-datasets from similar embodiments (same gripper type, similar arm geometry) and similar task categories.
- Evaluating cross-embodiment transfer: Use OXE's standard evaluation protocol and held-out task sets to benchmark your model's generalization capability against published baselines.
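The sub-dataset selection step in the patterns above can be sketched as a simple metadata filter. The toy catalog below is distilled from this article's composition table; the field names and `select_subsets` helper are illustrative, not an official OXE API:

```python
# Toy sub-dataset catalog drawn from the composition table in this article.
# In practice, read this metadata from the TFDS catalog or the OXE repository.
CATALOG = {
    "bridge_v2":     {"gripper": "parallel_jaw", "tasks": {"pick_place", "sweeping"}, "episodes": 60_000},
    "fractal":       {"gripper": "parallel_jaw", "tasks": {"pick_place", "kitchen"},  "episodes": 130_000},
    "jaco_play":     {"gripper": "three_finger", "tasks": {"pick_place", "drawer"},   "episodes": 1_000},
    "cable_routing": {"gripper": "parallel_jaw", "tasks": {"deformable"},             "episodes": 2_000},
}

def select_subsets(gripper, task):
    """Pick sub-datasets matching gripper type and task category,
    largest first, and report the total episode count."""
    matches = sorted(
        (name for name, meta in CATALOG.items()
         if meta["gripper"] == gripper and task in meta["tasks"]),
        key=lambda n: -CATALOG[n]["episodes"],
    )
    total = sum(CATALOG[n]["episodes"] for n in matches)
    return matches, total

subsets, total = select_subsets("parallel_jaw", "pick_place")
# subsets -> ["fractal", "bridge_v2"], total -> 190000
```

This keeps the filtering criteria (gripper type, task category) explicit and auditable before any data is downloaded.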
Limitations: What OXE Does Not Cover
OXE is transformative, but it has real limitations that teams should understand before relying on it.
Task diversity is skewed. The majority of episodes are tabletop pick-and-place, with smaller fractions covering drawer/cabinet opening, wiping, and pouring. Complex multi-step tasks, bimanual tasks, and mobile manipulation tasks are under-represented. If your deployment task is not well-covered by OXE's task distribution, the pre-training benefit will be limited.
Hardware is dated. Many contributing labs used hardware that was current in 2020-2023 but is now outdated: low-resolution cameras, older RealSense models, and arm configurations that differ from the ViperX/Franka setups most commonly used in 2026. The visual features from older cameras may not perfectly match the visual distribution of modern camera setups, reducing transfer efficiency.
Dexterity is limited. Almost all OXE data uses parallel-jaw grippers. Dexterous manipulation with multi-finger hands is essentially absent from the dataset. If your application involves dexterous hand manipulation, OXE provides limited direct benefit, though the visual understanding component still transfers.
Annotation quality varies. Language annotations range from careful, specific descriptions ("pick up the red cup from the left side of the table") to generic labels ("pick up object"). This inconsistency limits the effectiveness of language-conditioned training on the raw dataset without post-processing.
No force-torque data. The vast majority of OXE episodes contain only joint positions and camera images. Force-torque sensor data, which is critical for contact-rich tasks, is absent from most sub-datasets. This limits the usefulness of OXE for training policies that need to modulate grip force or handle compliant objects.
Loading OXE Data with LeRobot (Python)
```python
# Load an OXE sub-dataset via LeRobot's HuggingFace integration
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load the Bridge V2 subset (WidowX data, 60K+ episodes)
dataset = LeRobotDataset("lerobot/bridge_orig")

# Inspect dataset structure
print(f"Number of episodes: {dataset.num_episodes}")
print(f"Keys per frame: {dataset[0].keys()}")
# Typical keys: observation.images.image_0, observation.state, action, ...

# Collect the frames of a specific episode (each frame carries an episode_index)
episode = [frame for frame in dataset if frame["episode_index"] == 42]
for frame in episode:
    obs_image = frame["observation.images.image_0"]  # image tensor
    state = frame["observation.state"]               # joint positions
    action = frame["action"]                         # action vector
    # Process as needed for your training pipeline

# For TFDS-native loading (alternative):
# import tensorflow_datasets as tfds
# ds = tfds.load('fractal20220817_data', split='train')
# for episode in ds.take(10):
#     for step in episode['steps']:
#         image = step['observation']['image']
#         action = step['action']
```
Fine-Tuning a Foundation Model on OXE + Your Data
The standard pipeline for using OXE data to improve your task-specific policy involves three stages:
Stage 1: Select relevant OXE sub-datasets. Not all OXE data is equally useful for your task. Select sub-datasets based on: similar robot type (same gripper type is more important than same arm kinematics), similar task category (pick-place data helps pick-place; it does not help assembly), and data quality (prefer sub-datasets with language annotations and consistent camera setups). For a WidowX-based pick-place project, the Bridge V2 and Berkeley Cable Routing datasets are most relevant. For a Franka-based project, the Fractal and TOTO datasets are strongest.
Stage 2: Pre-train or use pre-trained weights. If using Octo or OpenVLA, the pre-trained weights already incorporate OXE data. Start from these weights and proceed to fine-tuning. If training a custom architecture, pre-train on your selected OXE sub-datasets for 100-200 epochs (typically 12-48 hours on 4x A100 GPUs depending on data volume). Monitor the validation loss on a held-out portion of your task-specific data to detect overfitting to OXE distribution at the expense of your target task.
Stage 3: Fine-tune on task-specific data. Fine-tune the pre-trained model on your task-specific demonstrations using a lower learning rate (typically 10x lower than pre-training: 1e-5 for Octo, 5e-6 for OpenVLA). Use all of your task-specific data plus a 10-20% mixture of the most relevant OXE data to prevent catastrophic forgetting of the general manipulation knowledge. Fine-tuning typically requires 50-200 epochs (2-8 hours on a single A100). Evaluate on held-out task-specific test episodes every 10 epochs and select the best checkpoint.
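The two Stage 3 knobs above, a roughly 10x lower learning rate and a 10-20% OXE mixture, can be computed with a small helper. The `stage3_config` function is an illustrative sketch, not part of any framework:

```python
def stage3_config(pretrain_lr, num_task_demos, oxe_fraction=0.15):
    """Fine-tuning hyperparameters following the Stage 3 recipe above.

    oxe_fraction is the share of OXE data in the fine-tuning mixture
    (the article recommends 10-20%).
    """
    finetune_lr = pretrain_lr / 10  # ~10x lower than pre-training
    # Solve oxe / (oxe + task) == oxe_fraction for the OXE episode count
    oxe_episodes = round(num_task_demos * oxe_fraction / (1 - oxe_fraction))
    return {"lr": finetune_lr, "oxe_episodes": oxe_episodes}

cfg = stage3_config(pretrain_lr=1e-4, num_task_demos=200)
# -> lr of 1e-5, with ~35 OXE episodes mixed into 200 task demos
```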
Benchmark Results: OXE-Trained Models vs. Specialists
| Model | Pre-Training Data | Fine-Tune Data | In-Dist. Success | Novel Object Success |
|---|---|---|---|---|
| ACT (from scratch) | None | 200 task demos | 82% | 28% |
| RT-1-X | Full OXE | 200 task demos | 88% | 52% |
| Octo (fine-tuned) | Full OXE | 200 task demos | 86% | 48% |
| OpenVLA (fine-tuned) | Full OXE | 200 task demos | 90% | 58% |
The pattern is consistent: OXE pre-training provides a modest in-distribution improvement (5-10%) and a large novel-object generalization improvement (20-30 percentage points). The in-distribution advantage narrows with more fine-tuning data but the OOD advantage persists even with 500+ task-specific demonstrations.
OXE Dataset Statistics: By the Numbers
Understanding the composition of OXE helps teams select the most relevant sub-datasets for their use case. Here are the key statistics as of the latest dataset revision.
| Statistic | Value | Notes |
|---|---|---|
| Total episodes | ~1.1 million | Growing as labs continue to contribute |
| Robot embodiments | 22+ | Franka, WidowX, UR5, KUKA, Google fleet, etc. |
| Contributing institutions | 33 | Led by Google DeepMind, Stanford, UC Berkeley |
| Distinct task categories | ~500 | Heavily skewed toward pick-place (~60% of episodes) |
| Total storage size | ~12 TB (raw) | Most sub-datasets are 5-200 GB individually |
| Largest sub-dataset | Fractal (RT-1): ~130K episodes | Google's proprietary robot fleet data |
| Bridge V2 (WidowX) | ~60K episodes | 24 environments; most popular for WidowX fine-tuning |
| Language annotations | ~70% of episodes | Quality varies significantly across sub-datasets |
| Camera resolution range | 128x128 to 640x480 | Most sub-datasets standardize to 256x256 or 224x224 |
| Force-torque data | < 5% of episodes | Major gap; limits contact-learning applications |
SVRC Data Format Compatibility
SVRC data collection outputs are designed to be interoperable with OXE and the major training frameworks. Here is how SVRC data maps to the ecosystem.
- Native format: HDF5 (LeRobot-compatible). SVRC's primary output format is HDF5, organized in LeRobot's episode structure. Each episode contains: `/observations/images/top` (RGB, 640x480, 30fps), `/observations/images/wrist` (RGB, 640x480, 30fps), `/observations/state` (joint positions + velocities + gripper), `/observations/ft` (force-torque, 6-axis, 500Hz downsampled to 30Hz), `/actions` (joint position targets or delta EEF), and `/language_instruction` (natural language string).
- RLDS export. For OXE compatibility and TFDS-based training pipelines, SVRC provides a one-command RLDS conversion that maps HDF5 episodes to RLDS format with the standardized observation, action, and language fields. This export is what you need to contribute SVRC-collected data to OXE or to mix with existing OXE sub-datasets.
- LeRobot Parquet export. For Hugging Face-native workflows, SVRC exports to LeRobot's Parquet format with associated metadata YAML. This integrates directly with `lerobot.common.datasets` for ACT, Diffusion Policy, and VLA training.
- Raw video + CSV. For custom pipelines, SVRC can export raw MP4 video per camera and CSV files with timestamped joint states, F/T readings, and gripper states. This is the most flexible format but requires the team to write their own data loading code.
The key differentiator between SVRC data and typical OXE sub-datasets: every SVRC episode includes synchronized force-torque data (when F/T sensors are present on the hardware), calibrated camera intrinsics and extrinsics, and language annotations following a standardized protocol. These features make SVRC data particularly valuable as high-quality fine-tuning data on top of OXE pre-training.
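The HDF5-to-RLDS mapping described above amounts to a key remapping per timestep. The sketch below uses the HDF5 path names listed in this article; the RLDS field names and the `to_rlds_step` helper are a simplified illustration, not SVRC's actual converter:

```python
# Maps SVRC HDF5 dataset paths to RLDS step fields (simplified sketch).
# A sub-key of None means the value becomes a top-level step field.
HDF5_TO_RLDS = {
    "/observations/images/top":   ("observation", "image"),
    "/observations/images/wrist": ("observation", "wrist_image"),
    "/observations/state":        ("observation", "state"),
    "/observations/ft":           ("observation", "ft"),
    "/actions":                   ("action", None),
    "/language_instruction":      ("language_instruction", None),
}

def to_rlds_step(hdf5_frame):
    """Convert one timestep's {hdf5_path: value} dict to an RLDS-style step."""
    step = {"observation": {}}
    for path, value in hdf5_frame.items():
        top, sub = HDF5_TO_RLDS[path]
        if sub is not None:
            step[top][sub] = value
        else:
            step[top] = value
    return step

frame = {"/observations/state": [0.1, 0.2], "/actions": [0.0, 0.1],
         "/language_instruction": "pick up the red cup"}
step = to_rlds_step(frame)
# step["observation"]["state"] -> [0.1, 0.2]; step["action"] -> [0.0, 0.1]
```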
How SVRC Data Complements OXE
OXE provides breadth: many embodiments, many tasks, many environments. What it lacks is depth in specific domains and consistency in data quality. SVRC data collection fills this gap by providing: consistent camera setups with calibrated intrinsics and extrinsics across all episodes, force-torque sensor data synchronized with visual and proprioceptive streams (absent from almost all OXE data), systematic object diversity within target task categories (30+ objects per category, not the 5-10 typical in OXE sub-datasets), and language annotations following a standardized protocol rather than the variable quality across OXE sub-datasets.
The recommended approach for teams with a specific deployment target: use OXE-pretrained foundation model weights for the visual backbone and general manipulation knowledge, then fine-tune with SVRC-collected task-specific data that provides the depth and quality needed for deployment-grade reliability.
Dataset Composition Deep Dive: What's Actually in OXE
Not all OXE sub-datasets are created equal. Understanding the composition helps teams select the right subsets for pre-training and avoid wasting compute on irrelevant data.
| Sub-Dataset | Robot | Episodes | Tasks | Best For |
|---|---|---|---|---|
| Fractal (RT-1) | Google Everyday Robot | ~130K | Kitchen manipulation, pick-place | Visual diversity, mobile manipulation |
| Bridge V2 | WidowX 250 | ~60K | 24 environments, pick-place, sweeping | WidowX fine-tuning, env diversity |
| TOTO | Franka Panda | ~1K | Tabletop manipulation | Franka fine-tuning baseline |
| Kuka | KUKA IIWA | ~3K | Grasping, stacking | Industrial arm benchmarks |
| BC-Z | Google Robot | ~25K | 100+ tasks with language | Language conditioning, multi-task |
| Cable Routing | WidowX / Franka | ~2K | Deformable object manipulation | Contact-rich, deformable tasks |
| Jaco Play | Kinova Jaco | ~1K | Pick-place, drawer open | Kinova fine-tuning, 3-finger gripper data |
For SVRC customers using OpenArm 101 (6-DOF with parallel jaw gripper), the most relevant OXE sub-datasets for pre-training mixtures are Bridge V2 (similar end-effector type) and BC-Z (language annotations for language-conditioned policies). Adding Fractal data provides additional visual diversity even though the robot morphology differs significantly.
Practical Data Mixing: Balancing OXE and Task-Specific Data
When combining OXE pre-training data with task-specific fine-tuning data, the mixing ratio matters significantly. Too much OXE data during fine-tuning can dilute task-specific learning; too little allows catastrophic forgetting of general manipulation knowledge.
Recommended mixing schedules:
- Phase 1 (epochs 1-50): 80% OXE, 20% task-specific. The model retains general knowledge while beginning to adapt.
- Phase 2 (epochs 51-150): 40% OXE, 60% task-specific. The model specializes while maintaining broad visual features.
- Phase 3 (epochs 151-200): 10% OXE, 90% task-specific. Final refinement on the target task with minimal OXE as a regularizer.
This graduated schedule consistently outperforms fixed-ratio mixing by 5-8% on novel object generalization in our evaluations. The OXE data in Phase 3 acts as a regularizer that prevents overfitting to the fine-tuning distribution without interfering with task-specific action learning.
```python
# Graduated data mixing for OXE + task-specific fine-tuning
from torch.utils.data import ConcatDataset, WeightedRandomSampler

def get_sampler(epoch, oxe_dataset, task_dataset, total_epochs=200):
    """Return a combined dataset and a weighted sampler with an
    epoch-dependent OXE/task mixing ratio."""
    progress = epoch / total_epochs
    if progress < 0.25:
        oxe_weight, task_weight = 0.8, 0.2  # Phase 1
    elif progress < 0.75:
        oxe_weight, task_weight = 0.4, 0.6  # Phase 2
    else:
        oxe_weight, task_weight = 0.1, 0.9  # Phase 3
    # Per-sample weights so each source contributes its target fraction
    weights = ([oxe_weight / len(oxe_dataset)] * len(oxe_dataset) +
               [task_weight / len(task_dataset)] * len(task_dataset))
    combined = ConcatDataset([oxe_dataset, task_dataset])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined))
    return combined, sampler
```
How to Contribute Your Own Data
Contributing to OXE strengthens the community dataset and provides a mechanism for your data to be cited and used by the broader research community. The contribution process involves several steps.
- Format your data in RLDS. Each episode must contain observations (images and proprioception), actions, and language annotations in the RLDS schema. The rlds_creator library provides conversion utilities.
- Add per-step language annotations. Every step should have a natural language description of the current task. These annotations are used by language-conditioned models and are a requirement for inclusion.
- Document your dataset. Provide a dataset card with: robot type and configuration, camera specifications and placement, collection environment description, task descriptions, operator count and training, and episode count per task.
- Submit a pull request. The OXE GitHub repository accepts dataset contributions through pull requests. The review process checks format compliance, data quality (no corrupted episodes, no extreme outliers), and documentation completeness.
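The format and annotation requirements above lend themselves to an automated pre-submission check. The sketch below validates the per-step fields described in this article; the exact RLDS schema checks are handled by the rlds_creator tooling, so treat this as a first-pass filter:

```python
REQUIRED_STEP_FIELDS = {"observation", "action", "language_instruction"}

def validate_episode(episode):
    """Return a list of (step_index, problems) for steps that fail
    the contribution requirements described above."""
    problems = []
    for i, step in enumerate(episode["steps"]):
        missing = REQUIRED_STEP_FIELDS - step.keys()
        if missing:
            problems.append((i, sorted(missing)))
        elif not step["language_instruction"].strip():
            problems.append((i, ["empty language_instruction"]))
    return problems

episode = {"steps": [
    {"observation": {}, "action": [0.0], "language_instruction": "open the drawer"},
    {"observation": {}, "action": [0.1], "language_instruction": ""},
]}
issues = validate_episode(episode)
# issues -> [(1, ["empty language_instruction"])]
```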
If your demonstrations were collected through SVRC's data services, our platform can generate RLDS-compatible exports with standardized metadata, simplifying the contribution process. Contact the SVRC team for guidance on preparing your data for OXE submission.
Common Pitfalls When Using OXE Data
Teams adopting OXE for the first time frequently encounter issues that waste compute and produce suboptimal results. Here are the most common pitfalls and how to avoid them.
- Training on the full dataset without filtering. OXE contains over 1M episodes, and many are irrelevant to your specific task. Training on everything adds noise and increases training time by 10-50x without proportional benefit. Always filter to relevant sub-datasets first. A good starting point: select 3-5 sub-datasets with similar gripper type and task category, totaling 50K-200K episodes.
- Ignoring action space normalization. Different OXE sub-datasets use different action representations: absolute joint positions, delta joint positions, absolute end-effector poses, delta end-effector poses, and various gripper action formats. Loading multiple sub-datasets without normalizing to a common action space produces garbage policies. Use the OXE action normalization utilities or LeRobot's built-in conversion layer.
- Assuming uniform camera placement. Camera positions, angles, and resolutions vary dramatically across OXE sub-datasets. A model pre-trained on Bridge V2 (top-down camera, 256x256) may not transfer well to your wrist camera at 640x480. Match your pre-training sub-datasets to your deployment camera configuration as closely as possible.
- Overweighting the largest sub-datasets. Fractal (130K episodes) dominates OXE by volume. If you sample uniformly by episode, 60%+ of your training batches will be Fractal data from Google's robot fleet, which uses a unique mobile manipulator that may not be relevant to your arm. Use balanced sampling across sub-datasets or up-weight smaller datasets that are more relevant to your target embodiment.
- Neglecting the language annotation quality issue. About 30% of OXE episodes have generic or missing language annotations ("pick up object" rather than "pick up the red cup from the left side"). For language-conditioned policies, filter to sub-datasets with high-quality annotations (Bridge V2, BC-Z) or post-process annotations with an LLM relabeling step.
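The annotation-quality filter from the last pitfall can start as a crude heuristic before any LLM relabeling step. The generic-label set and word-count threshold below are illustrative assumptions, not values from the OXE tooling:

```python
# Known-generic labels to reject outright (illustrative list)
GENERIC_LABELS = {"pick up object", "move object", "do task", ""}

def is_specific(annotation, min_words=4):
    """Crude heuristic: reject known-generic labels and very short strings."""
    text = annotation.strip().lower()
    return text not in GENERIC_LABELS and len(text.split()) >= min_words

annotations = [
    "pick up the red cup from the left side",
    "pick up object",   # generic label -> rejected
    "open drawer",      # too short -> rejected
]
kept = [a for a in annotations if is_specific(a)]
# kept -> ["pick up the red cup from the left side"]
```

A filter like this quickly identifies which sub-datasets are worth keeping for language-conditioned training and which need relabeling first.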
Action Space Normalization: A Practical Guide
The action space inconsistency across OXE sub-datasets is one of the biggest practical challenges. Here is how to handle it.
```python
# Action space normalization for mixed OXE training
import numpy as np

class ActionNormalizer:
    """Normalize actions from different OXE sub-datasets to a common space."""

    def __init__(self, target_space="delta_eef"):
        self.target_space = target_space
        # Per-dataset statistics (computed from dataset metadata)
        self.stats = {}

    def register_dataset(self, dataset_name, action_mean, action_std, action_type):
        """Register per-dataset normalization statistics."""
        self.stats[dataset_name] = {
            "mean": np.array(action_mean),
            "std": np.array(action_std),
            "type": action_type,  # "abs_joint", "delta_joint", "abs_eef", "delta_eef"
        }

    def normalize(self, action, dataset_name):
        """Normalize action to zero-mean, unit-variance in the target space."""
        s = self.stats[dataset_name]
        # Step 1: Convert to the target action type (if needed)
        converted = self._convert_action_type(action, s["type"], self.target_space)
        # Step 2: Standardize using per-dataset statistics
        normalized = (converted - s["mean"]) / (s["std"] + 1e-8)
        return np.clip(normalized, -5.0, 5.0)  # Clip outliers

    def _convert_action_type(self, action, source_type, target_type):
        """Convert between action representations using FK/IK."""
        if source_type == target_type:
            return action
        # Conversion requires robot-specific FK/IK -- use a URDF-based solver.
        # This is dataset-specific and must be implemented per embodiment.
        raise NotImplementedError(
            f"Conversion from {source_type} to {target_type} "
            f"requires a robot-specific FK/IK model"
        )
```
The action normalization challenge is a key reason why foundation models like Octo and OpenVLA are so valuable: they handle the cross-dataset action normalization internally, saving teams from implementing it from scratch. If you are training a custom architecture on raw OXE data, budget 1-2 weeks of engineering time for proper action space handling.
Evaluation Protocol: How to Benchmark Against OXE Baselines
To meaningfully compare your fine-tuned model against published OXE baselines, use the standardized evaluation protocol from the RT-X papers.
- Define 10-20 evaluation configurations spanning in-distribution (training objects in training positions) and out-of-distribution (novel objects in novel positions). Keep these fixed across all model evaluations.
- Run 3 trials per configuration to account for stochastic policy behavior and environmental noise. Report mean and standard deviation.
- Use the standard success criteria: object is within 3cm of target position for pick-place tasks; drawer/cabinet angle within 10 degrees of target for articulated tasks; task completed within 2x the median human demonstration duration.
- Report both in-distribution and OOD success rates separately. Many papers only report in-distribution numbers, which can be misleading. The OOD evaluation is where cross-embodiment pre-training shows its real value.
- Compare against the published baselines: ACT from scratch (lower bound), Octo fine-tuned (mid-range), and OpenVLA fine-tuned (upper bound for current models).
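Steps 2 and 4 of the protocol reduce to a small aggregation over per-configuration trials. A sketch using only the standard library (the trial dictionary layout is an assumption for illustration):

```python
from statistics import mean, stdev

def summarize(trials):
    """trials: {config_name: {"ood": bool, "successes": [0/1 per trial]}}.
    Returns mean success rate and between-config std dev, split by
    in-distribution vs out-of-distribution, as the protocol requires."""
    report = {}
    for split in ("in_dist", "ood"):
        rates = [mean(t["successes"]) for t in trials.values()
                 if t["ood"] == (split == "ood")]
        report[split] = {"mean": mean(rates),
                        "std": stdev(rates) if len(rates) > 1 else 0.0}
    return report

trials = {
    "red_cup_center":  {"ood": False, "successes": [1, 1, 1]},
    "blue_cup_center": {"ood": False, "successes": [1, 0, 1]},
    "novel_mug_edge":  {"ood": True,  "successes": [0, 1, 0]},
    "novel_bowl_edge": {"ood": True,  "successes": [1, 0, 0]},
}
report = summarize(trials)
# in_dist mean ~0.83, ood mean ~0.33 -- reported separately, per step 4
```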
SVRC provides standardized evaluation sets for common task categories (tabletop pick-place, drawer opening, stacking) that include physical objects, position templates, and scoring rubrics. Contact us for access to these evaluation kits if you want apples-to-apples comparison with other teams using SVRC infrastructure.
What Comes Next: DROID, Bridge V2, and Beyond
OXE established the principle. The next generation of datasets is extending it in specific directions.
DROID (Khazatsky et al., 2024) focuses on environmental diversity: roughly 76,000 demonstrations across 564 scenes and 86 tasks, collected by a large multi-institution consortium and specifically designed to test how environment diversity affects policy generalization. DROID is complementary to OXE: where OXE maximizes robot embodiment diversity, DROID maximizes scene and environment diversity.
Bridge V2 (Walke et al., 2023) provides a focused, high-quality dataset for WidowX-based manipulation. 60,000+ demonstrations across 24 environments with careful quality control. Bridge V2 is the go-to fine-tuning dataset for teams deploying on WidowX hardware because it provides the volume and environmental diversity needed for robust deployment, specifically for one embodiment.
Open-Anything datasets. The community is working toward OXE-style aggregation for domains currently under-represented: dexterous manipulation with multi-finger hands, bimanual tasks, mobile manipulation, and outdoor/field robotics. SVRC is actively contributing data from our bimanual and dexterous manipulation collection campaigns to these emerging aggregation efforts.
The broader trajectory is toward a robotics equivalent of the web-scale text corpora that enabled large language models. OXE was the proof of concept. The question now is whether the community can achieve the diversity and scale needed to train truly generalist robot foundation models, and how long that will take. SVRC's data collection infrastructure is designed to contribute to this effort while providing immediate practical value to teams building today's robot systems.
Related Reading
- Scaling Laws for Robot Learning: What We Know in 2026
- LeRobot Framework: Getting Started Guide
- The Generalization Challenge: Why Robot Policies Still Fail
- Robot Learning from Video: State of the Art in 2026
- Imitation Learning for Robots: From Demonstrations to Deployment
- SVRC Data Collection Services
- SVRC Datasets