Training World Models on Scene Semantics, Not Pixels

A different recipe for training robot world models: compose pre-trained AI modules with classical computer vision to extract scene semantics from ordinary monocular video — no domain data, no synthetic frames.

Introduction

Today’s recipe for training robot AI looks the same almost everywhere: feed a giant neural network billions of pixels paired with text instructions and motor commands, and hope generalization emerges. Vision-Language-Action (VLA) models like RT-2, Octo, and π0 each ingest millions of teleoperated robot demonstrations or billions of image-text-action triples. A growing share of that data is synthetic (generated in simulators, rendered from digital twins, or hallucinated by generative video models), a response to the cost of collecting real robot data at scale.

This recipe works, up to a point. But it has three structural problems. First, it confuses two very different things both labeled “zero-shot” (more on that in a moment). Second, it inherits a sim-to-real gap that grows with every percentage point of synthetic data. Third, it is opaque – when a VLA fails, there is no per-module explanation, only a billion-parameter black box.

If you’ve ever chosen between a monolithic application and a microservices architecture, the tradeoff in this post will feel familiar. The dominant VLA approach is the monolith: one giant model that does everything end-to-end. The approach we describe here, F.R.A.N.T.Z. (Feature-Rich Reverse Kinematic Analysis for Neural Topology in Zero-Shot Scene), is the microservices version: a handful of independently trained models, each doing one job well, composed together with classical computer-vision glue. The output is scene semantics – objects, planes, affordances, hand-arm kinematics, and a metric 3D point cloud – extracted directly from ordinary monocular video as ground truth, with zero task-specific training.

This post is a companion to four earlier posts on the AWS Physical AI Blog and HPC Blog:

A maturity level framework for industrial inspection report automation – which defines the four stages of automation that any computer-vision pipeline progresses through.
End-to-end scalable vision intelligence pipeline using LiDAR 3D point clouds on AWS – which covers the cloud infrastructure layer for processing LiDAR (Light Detection and Ranging) sequences at scale. Most of the world is not yet instrumented with LiDAR; where LiDAR data is not available, F.R.A.N.T.Z. reconstructs 3D scene structure from 2D monocular video using a novel composition of pre-trained generative AI models and classical computer vision.
Accelerating OpenCV on Graviton — the COOL framework – which describes our work to accelerate the OpenCV (Open Source Computer Vision) library on AWS Graviton. F.R.A.N.T.Z. is OpenCV-heavy by design, and scales with COOL on Graviton instances at AWS-native cost-performance.
Building Inspection Intelligence with AWS Spatial Data – which introduces Spatial Data Management on AWS (SDMA), a managed framework for organizing and governing spatially-referenced data. As we’ll discuss, before you scale either VLA training or F.R.A.N.T.Z. across thousands of videos and sites, you need somewhere disciplined to put the data. SDMA is that somewhere.

First, what “zero-shot” really means

You’ll have heard “zero-shot” thrown around in AI announcements. It deserves unpacking, because the term is used to mean two very different things, and the difference is the whole point of this post.

The strict definition (from machine learning textbooks): a model is zero-shot on a task if it has seen no training examples for that task. None. Zero.

The looser definition (how the term gets used today): a model is zero-shot on an instruction if it has never seen that exact sentence before, even if it was trained on millions of similar instructions.

A useful analogy: imagine a chef who has cooked 10,000 different recipes. Hand them a new recipe they have never read. They follow it competently. Is that “zero-shot cooking”? Not really. They are applying massively trained skills to a new prompt. The novelty is in the recipe, not in the cooking. That is how today’s VLAs are “zero-shot”: novel instructions, but the underlying skills – recognizing a cup, picking it up, putting it down – came from billions of training examples.

F.R.A.N.T.Z. is zero-shot in the older, stricter sense. It has seen zero examples from the kitchen-manipulation domain. Not a single training video, not a single labeled demonstration. Its ability to understand a kitchen scene emerges from composing modules that were each trained on completely unrelated tasks – depth estimation on internet photos, object detection on a generic object dataset, hand landmarks on generic hand photos. The kitchen-understanding capability is not in any module; it is in the composition.

We call this data-level zero-shot to distinguish it from VLAs’ prompt-level zero-shot. Both are valid uses of the term, but they imply very different cost structures: prompt-level zero-shot requires large-scale training investment (typically billions of examples, based on published model cards); data-level zero-shot requires no domain-specific data collection.

What you will learn

Why training world models on raw pixels and synthetic data hits diminishing returns for embodied AI
How F.R.A.N.T.Z. extracts ground-truth scene semantics from ordinary monocular video without any task-specific training
How affordances and scene priors transfer across new scenes with one human demonstration, not millions
Where classical computer vision (OpenCV on Graviton via COOL) and generative AI complement each other
A path toward JEPA (Joint Embedding Predictive Architecture)-style world models that predict scene semantics, not pixels

The pixel-vs-semantic tradeoff

A typical end-to-end VLA pipeline looks like this: raw pixels → tokenizer → giant transformer → action commands. Every stage of perception, language grounding, and motor planning is learned end-to-end from a single massive dataset. For that to work, the dataset has to cover the joint space of (visual scenes × language instructions × motor actions). That space is enormous, which is why VLAs need so much data.

F.R.A.N.T.Z. inverts the architecture. Each perceptual job is handled by a separate pre-trained model – none of them trained on the target domain:

MiDaS – estimates depth from a single RGB image, trained on generic internet photos.
YOLO (You Only Look Once) – detects objects and draws bounding boxes, trained on generic object datasets.
MediaPipe – detects hand landmarks (21 points per hand), trained on generic hand photos.
DeepLab – produces per-pixel semantic segmentation.

These off-the-shelf models are glued together with classical computer-vision algorithms – ORB (Oriented FAST and Rotated BRIEF) feature matching, optical flow, RANSAC (Random Sample Consensus) plane fitting, two-bone inverse kinematics (IK), and point cloud fusion. The output is a coherent 3D scene with object identities, hand trajectories, and a posed human skeleton.
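
To make the composition idea concrete, here is a minimal sketch in Python. It assumes a hypothetical stage interface in which every module reads a shared per-frame dictionary and returns the keys it contributes. The stage functions are stand-ins that return canned values instead of calling MiDaS or YOLO, and none of the names below come from the F.R.A.N.T.Z. codebase.

```python
from typing import Any, Callable, Dict, List

# Hypothetical interface: a stage reads the shared frame state and returns
# the new keys it contributes. This is an illustration, not the real API.
FrameState = Dict[str, Any]
Stage = Callable[[FrameState], FrameState]

def depth_stub(state: FrameState) -> FrameState:
    # Stand-in for a monocular depth network such as MiDaS.
    h, w = state["frame_shape"]
    return {"depth": [[1.0] * w for _ in range(h)]}

def detector_stub(state: FrameState) -> FrameState:
    # Stand-in for an object detector such as YOLO; box is (x0, y0, x1, y1).
    return {"detections": [{"label": "bottle", "box": (2, 1, 4, 3)}]}

def lift_to_3d(state: FrameState) -> FrameState:
    # Classical glue: read the depth value at the centre of each detection box.
    objects = []
    for det in state["detections"]:
        x0, y0, x1, y1 = det["box"]
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        objects.append({"label": det["label"], "z": state["depth"][cy][cx]})
    return {"objects": objects}

def run_pipeline(stages: List[Stage], state: FrameState) -> FrameState:
    # Every intermediate output stays in `state`, so it can be inspected later.
    for stage in stages:
        state = {**state, **stage(state)}
    return state

result = run_pipeline([depth_stub, detector_stub, lift_to_3d],
                      {"frame_shape": (4, 6)})
```

Because each stage only depends on keys in the shared state, swapping one detector for another means replacing a single entry in the stage list.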

The table below shows how this stacks up against the VLA approach:

Dimension	Pixel + synthetic VLA (RT-2, Octo, π0)	Scene-semantic composition (F.R.A.N.T.Z.)
Training data required	Typically, millions to billions of image-text-action triples	None for the target domain
Zero-shot type	Prompt-level (novel instruction)	Data-level (zero domain examples)
Sim-to-real gap	Significant; grows with synthetic %	None on perception side (real video in)
Output representation	Implicit action tokens	Explicit 3D point cloud + tracked objects + IK skeleton
Interpretability	Low – monolithic	Full – each module produces inspectable intermediates
Failure diagnosis	Global, opaque	Per-module, local, independently fixable
Inference compute	Ranges from under 0.1B parameters (Octo) to 55B (RT-2); the largest models need datacenter GPUs	A handful of ONNX (Open Neural Network Exchange) models totaling tens of MB, approximately 5 W on an NVIDIA Jetson-class edge device
Module upgrade	Retrain the entire model	Swap one stage
Post-hoc editing of a scene	Not possible	Yes – object repositioning, IK skeleton repose, viewer-driven
3DGS / NeRF export	No	Yes – COLMAP-format output for 3D Gaussian Splatting

We call this the zero-shot inversion: VLAs need maximum data to deliver minimum prompt novelty (a sentence the model has not seen before). F.R.A.N.T.Z. needs minimum data to deliver maximum scene novelty (any kitchen, any person, any video). The VLA’s “zero-shot” lives in its language interface. F.R.A.N.T.Z.’s zero-shot lives in its architecture.

What scene semantics look like, concretely

For a single 15-minute YouTube cooking tutorial, F.R.A.N.T.Z. produces a single file (3d_structures.npz) containing:

A metric 3D point cloud – tens to hundreds of thousands of raw points (capped at 500,000 in our current implementation), filtered and downsampled to approximately 15,000 for visualization, with the camera ego-motion compensated and hands/persons masked out via semantic filtering.
A list of detected objects – each with class label (from the COCO-80 (Common Objects in Context) ontology, optionally extended), 3D bounding box, and timestamped position track.
A reconstructed plane structure – counter plane, wall planes, floor plane, with per-class object-to-plane bindings (e.g., bottles tend to sit on the counter).
Per-hand 3D wrist trajectories with phase annotations (idle, approach, contact, retreat) and inferred elbow positions via analytic two-bone IK.
A COLMAP-format export – directly usable to train a 3D Gaussian Splatting model of the same scene, with no further annotation.
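
The analytic two-bone IK used for the elbow positions can be sketched with the law of cosines. The version below is a minimal illustration, not the pipeline's implementation: the function name, the bone lengths, and the `pole` hint (a direction the elbow is assumed to bend toward, here downward in a y-up frame) are all assumptions.

```python
import numpy as np

def two_bone_elbow(shoulder, wrist, upper_len, fore_len, pole=(0.0, -1.0, 0.0)):
    """Place the elbow given shoulder and wrist positions and two bone lengths."""
    shoulder = np.asarray(shoulder, dtype=float)
    wrist = np.asarray(wrist, dtype=float)
    to_wrist = wrist - shoulder
    dist = np.linalg.norm(to_wrist)
    if dist < 1e-9:
        raise ValueError("shoulder and wrist coincide")
    axis = to_wrist / dist

    # Clamp the reach so an out-of-range wrist gives a fully extended
    # (or fully folded) arm instead of an invalid square root.
    d = np.clip(dist, abs(upper_len - fore_len) + 1e-9, upper_len + fore_len)

    # Distance from the shoulder, along the shoulder-to-wrist axis, to the
    # foot of the elbow, and the elbow's height above that axis.
    a = (upper_len**2 - fore_len**2 + d**2) / (2.0 * d)
    h = np.sqrt(max(upper_len**2 - a**2, 0.0))

    # Bend direction: the pole hint with its component along the axis removed.
    pole = np.asarray(pole, dtype=float)
    perp = pole - np.dot(pole, axis) * axis
    norm = np.linalg.norm(perp)
    if norm < 1e-9:
        # Pole is parallel to the arm axis: fall back to any perpendicular.
        helper = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 0.0, 1.0])
        perp = helper - np.dot(helper, axis) * axis
        norm = np.linalg.norm(perp)
    perp = perp / norm
    return shoulder + a * axis + h * perp

shoulder = np.array([0.0, 0.0, 0.0])
wrist = np.array([0.4, 0.0, 0.0])
elbow = two_bone_elbow(shoulder, wrist, upper_len=0.3, fore_len=0.3)
```

With a reachable wrist, the returned elbow sits exactly one upper-arm length from the shoulder and one forearm length from the wrist. The pole hint only selects which of the infinitely many valid bend directions is used.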

That is the ground-truth scene semantics of the video. It was generated with zero training examples from the kitchen domain, and it is small enough to ship as a single artifact.
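
For readers unfamiliar with the .npz container, the sketch below shows how an artifact of this kind can be written and read back with NumPy. The key names and array shapes are hypothetical placeholders chosen to mirror the list above; the actual layout of 3d_structures.npz is not specified in this post.

```python
import io
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical keys and shapes, for illustration only.
artifact = {
    "points_xyz": rng.normal(size=(1000, 3)).astype(np.float32),
    "object_labels": np.array(["bottle", "cup"]),
    "object_boxes_3d": np.zeros((2, 6), dtype=np.float32),   # min xyz, max xyz
    "plane_coeffs": np.array([[0.0, 1.0, 0.0, -0.9]], dtype=np.float32),  # a, b, c, d
    "wrist_tracks": np.zeros((2, 50, 3), dtype=np.float32),  # hand, frame, xyz
}

# Write to an in-memory buffer; a file path works the same way.
buffer = io.BytesIO()
np.savez_compressed(buffer, **artifact)
buffer.seek(0)

# Arrays are decompressed lazily, one key at a time, on access.
with np.load(buffer) as loaded:
    keys = sorted(loaded.files)
    n_points = loaded["points_xyz"].shape[0]
    labels = loaded["object_labels"].tolist()
```

Keeping every output in one compressed archive is what makes the result easy to ship and index as a single artifact.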

When LiDAR is not available: 3D from 2D monocular video

For LiDAR-instrumented domains – open-pit mining, surveyed infrastructure, autonomous-vehicle fleets – the LiDAR 3D point cloud pipeline provides dense sub-centimeter geometry at the cost of dedicated sensors. The vast majority of the world’s video, however, is monocular and uninstrumented. YouTube alone hosts millions of hours of cooking, repair, assembly, and craft tutorials with no depth information whatsoever.

F.R.A.N.T.Z. bridges this gap with three novel elements that recover usable 3D structure from a single moving camera:

AI-driven monocular depth – Intel’s MiDaS network estimates relative depth from each RGB frame at 256×256 input resolution. Back-projected through the camera intrinsics, this yields a per-frame 3D point cloud; because MiDaS predicts relative rather than metric depth, the cloud needs a scale reference before it can be treated as metric. In our implementation, depth inference runs adaptively, roughly every 10 frames when hands are visible and every 20 frames when the scene is idle, balancing GPU load against tracking continuity.
Semantically filtered global motion compensation – In a typical kitchen video, hand motion dominates raw optical flow even though hands occupy only a small fraction of the frame area. Without filtering, the camera-motion estimate would be biased by hand motion and the point cloud would drift “with the hands.” F.R.A.N.T.Z. solves this by combining YOLO person detections and an MOG2 (Mixture of Gaussians) background-subtraction mask to exclude all deformable regions from feature tracking. The surviving Lucas-Kanade optical-flow tracks on static background then yield a clean partial-affine camera-motion estimate.
Incremental structure-from-motion (SfM) with quality-selected keyframes – A parallel SfM pipeline triangulates sparse 3D points from ORB correspondences across keyframes selected on sharpness, optical-flow band, and feature count. The resulting sparse cloud is exportable in COLMAP format for downstream 3D Gaussian Splatting or NeRF training.
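
The first two elements can be illustrated with a short NumPy sketch. It rests on several assumptions: the depth map has already been converted to metres (MiDaS itself outputs relative depth), the pinhole intrinsics `fx, fy, cx, cy` are known, and feature tracks are supplied as arrays rather than computed with Lucas-Kanade. The motion fit is a plain least-squares solve standing in for a robust estimator, and the function names are illustrative.

```python
import numpy as np

def backproject_depth(depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of an (H, W) depth map to (H*W, 3) camera-frame points."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row grids
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)

def masked_partial_affine(prev_pts, next_pts, deformable_mask):
    """Fit rotation + uniform scale + translation using only tracks outside the mask."""
    prev_pts = np.asarray(prev_pts, dtype=float)
    next_pts = np.asarray(next_pts, dtype=float)
    rows = np.clip(np.round(prev_pts[:, 1]).astype(int), 0, deformable_mask.shape[0] - 1)
    cols = np.clip(np.round(prev_pts[:, 0]).astype(int), 0, deformable_mask.shape[1] - 1)
    keep = ~deformable_mask[rows, cols]          # drop tracks on hands or persons
    p, q = prev_pts[keep], next_pts[keep]
    if len(p) < 2:
        raise ValueError("need at least two static tracks")

    # Model: x' = a*x - b*y + tx,  y' = b*x + a*y + ty  (linear in a, b, tx, ty).
    ones, zeros = np.ones(len(p)), np.zeros(len(p))
    A = np.zeros((2 * len(p), 4))
    A[0::2] = np.column_stack([p[:, 0], -p[:, 1], ones, zeros])
    A[1::2] = np.column_stack([p[:, 1], p[:, 0], zeros, ones])
    a, b, tx, ty = np.linalg.lstsq(A, q.reshape(-1), rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])

points = backproject_depth(np.full((2, 3), 2.0), fx=1.0, fy=1.0, cx=1.0, cy=0.5)

# Four static background tracks shift by (2, 0); two "hand" tracks move by (30, 10).
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True
prev_pts = np.array([[10, 10], [90, 10], [10, 90], [90, 90], [45, 45], [50, 55]], dtype=float)
motion = np.array([[2, 0]] * 4 + [[30, 10]] * 2, dtype=float)
camera_motion = masked_partial_affine(prev_pts, prev_pts + motion, mask)
```

In this toy example the recovered transform is the pure background translation, because the two fast-moving hand tracks fall inside the mask and never enter the fit. In practice a RANSAC-based estimator such as OpenCV's `estimateAffinePartial2D` would replace the least-squares solve to tolerate remaining outliers.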

The combination produces a globally consistent point cloud that evolves smoothly over the duration of a video. Scene cuts are detected via HSV (Hue-Saturation-Value) histogram differences and trigger pose resets and new “views” that are later aligned in the offline viewer.
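
A scene-cut test of this kind can be sketched as a histogram comparison between consecutive frames. The version below assumes the frames are already in HSV with OpenCV's 8-bit ranges (hue in [0, 180), saturation in [0, 256)). The bin counts, the use of histogram intersection as the distance, and the 0.5 threshold are illustrative choices, not values taken from the pipeline.

```python
import numpy as np

def hs_histogram(hsv, bins=(30, 32)):
    """Normalized 2D hue-saturation histogram of an (H, W, 3) HSV frame."""
    hue = hsv[..., 0].ravel()
    sat = hsv[..., 1].ravel()
    hist, _, _ = np.histogram2d(hue, sat, bins=bins, range=[[0, 180], [0, 256]])
    return hist / hist.sum()

def scene_cut(prev_hsv, curr_hsv, threshold=0.5):
    """Return (is_cut, distance), with distance = 1 - histogram intersection."""
    p = hs_histogram(prev_hsv)
    q = hs_histogram(curr_hsv)
    distance = 1.0 - float(np.minimum(p, q).sum())
    return distance > threshold, distance

# Synthetic frames: b is a brightness-only change of a, c has a different hue.
frame_a = np.zeros((8, 8, 3), dtype=np.uint8)
frame_a[..., 0], frame_a[..., 1], frame_a[..., 2] = 20, 200, 128
frame_b = frame_a.copy()
frame_b[..., 2] = 140
frame_c = frame_a.copy()
frame_c[..., 0] = 100

cut_ab, dist_ab = scene_cut(frame_a, frame_b)
cut_ac, dist_ac = scene_cut(frame_a, frame_c)
```

Leaving the value channel out of the histogram makes the test less sensitive to exposure changes within a shot, which is one reason to compare in HSV rather than RGB.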

Where F.R.A.N.T.Z. sits in the maturity framework

Our maturity-level framework for industrial inspection report automation defines four stages: Stage 0 (basic 3D reconstruction), Stage 1 (asset detection), Stage 2 (differential scene understanding across successive captures), and Stage 3 (automated AI-driven report generation).

The mining-site pipeline described in our companion post implements Stages 2 and 3 on LiDAR sequences. F.R.A.N.T.Z. brings the same progression – reconstruction, detection, differential understanding – to environments where the input is only monocular video. It does this without any training data from the target domain, which means the pipeline can be pointed at a new domain (kitchens, workshops, retail floors, warehouses) with no data-collection investment.

Before you scale: organizing the spatial data with SDMA

Everything we have described so far is what happens for a single video. The reason this approach is interesting is that it scales: point F.R.A.N.T.Z. at a thousand videos, ten thousand videos, an entire site or facility’s worth of footage, and you get a thousand or ten thousand structured 3D scene reconstructions out the other end. The same multiplier applies if you are training a VLA – the input corpus is millions of teleoperated trajectories, each spatially anchored to a specific robot and workspace.

That volume of spatial data is the production problem. Each artifact – a point cloud, an object list with 3D poses, a COLMAP export, an affordance library – carries spatial coordinates, a timestamp, and a provenance chain (which sensor, which capture session, which processing version), and it is referenced by downstream consumers that may not exist yet. Without governance, this data drifts: metadata becomes inconsistent across teams, spatial identifiers diverge across capture sessions, and the ability to compare a scene captured today against the same scene captured six months ago quietly disappears.

This is the gap that Spatial Data Management on AWS (SDMA) is built to close. SDMA is a managed framework of architectural patterns and AWS services for organizing and governing spatially-referenced data over the full lifecycle of a physical asset. It provides:

Centralized storage with governance and access controls – every artifact lives in Amazon Simple Storage Service (S3) with consistent metadata and provenance.
Stable spatial identifiers that tie every image, observation, point cloud, and derived artifact to an explicit physical location.
An API layer (Amazon SageMaker inference endpoints, AWS Lambda orchestration) that decouples how data is consumed from how it is stored.
Decision lineage – model outputs, reviewer actions, and downstream interpretations are all preserved as part of the same spatial record.

For F.R.A.N.T.Z., SDMA is the natural home for the 3d_structures.npz output of every video, the COLMAP export of every keyframe set, and the affordance library extracted from every demonstration. Each artifact arrives spatially indexed and is queryable across captures. Comparing how a kitchen was used today vs. last month becomes a query rather than a re-processing job.

For VLA training, the same applies. As the SDMA post puts it: “Many analytics and machine learning workflows depend on consistent and well-structured input data… With a shared spatial framework in place, teams can… train models on data that already reflects how physical assets are organized.” That is the right substrate for training a world model whose prediction target is scene semantics, not pixels.

The order matters. Before you spin up AWS Graviton-accelerated F.R.A.N.T.Z. pipelines on thousands of videos in parallel, decide where the output lives and how it is indexed. The compute scales easily; the data organization, retrofitted later, does not.

The classical-AI ↔ generative-AI complementarity, and why Graviton matters

A point that is easy to lose in the current AI discourse is that most of F.R.A.N.T.Z. is classical computer vision. The generative AI components (MiDaS, YOLO, MediaPipe, DeepLab) handle what neural networks are demonstrably best at: pixel-to-semantic mapping. The geometric and temporal glue – ORB feature matching, sparse optical flow, RANSAC plane fitting, Procrustes alignment, two-bone IK, quaternion SLERP (Spherical Linear Interpolation), COLMAP export – is all OpenCV, NumPy, and SciPy. There is no learned 3D representation. There is no monolithic model to retrain.
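
As one example of how little machinery this glue needs, here is a minimal NumPy sketch of RANSAC plane fitting on a point cloud. The iteration count, the 1 cm inlier tolerance, and the SVD refinement step are assumptions made for illustration, and the synthetic "counter" data is invented for the example.

```python
import numpy as np

def fit_plane_ransac(points, n_iters=200, inlier_tol=0.01, seed=0):
    """Return (unit normal, offset d, inlier mask) for the plane normal . x + d = 0."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iters):
        # Hypothesize a plane from three random points.
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:
            continue  # degenerate (collinear) sample
        normal = normal / norm
        # Score the hypothesis by counting points within the tolerance.
        inliers = np.abs((points - sample[0]) @ normal) < inlier_tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers is None:
        raise ValueError("no valid plane hypothesis found")

    # Refine on the consensus set: the plane normal is the direction of
    # least variance of the centred inliers.
    inlier_pts = points[best_inliers]
    centroid = inlier_pts.mean(axis=0)
    _, _, vt = np.linalg.svd(inlier_pts - centroid)
    normal = vt[-1]
    return normal, -float(normal @ centroid), best_inliers

# Synthetic counter at height y = 0.9 m with 2 mm noise, plus scattered clutter.
rng = np.random.default_rng(1)
counter = np.column_stack([rng.uniform(0, 2, 300),
                           0.9 + rng.normal(0, 0.002, 300),
                           rng.uniform(0, 1, 300)])
clutter = rng.uniform(-1, 2, size=(100, 3))
normal, d, inliers = fit_plane_ransac(np.vstack([counter, clutter]))
```

The sign of the returned normal is arbitrary, so downstream code should orient it (for example, toward the camera) before using it to bind objects to the plane.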

This matters in production. F.R.A.N.T.Z. spends the majority of its CPU time inside OpenCV, not inside neural networks. Accelerating those OpenCV calls is the single biggest knob for throughput and cost-per-video. That is exactly what the COOL framework addresses for AWS Graviton. Running F.R.A.N.T.Z. on Graviton with COOL-accelerated OpenCV gives the best price-performance for batch processing of large video corpora – for example, when ingesting thousands of YouTube tutorials in parallel to bootstrap training data for a robot policy.

A useful mental model: generative AI does the perception; classical computing does the composition; Graviton does it cheaply.

From scene semantics back into a world model

Extracting scene semantics is only useful if we can use them. F.R.A.N.T.Z.’s output drives a robot retargeting layer that generalizes a single human demonstration to many synthetic environments. Two extractions are central:

Affordances – for each object class encountered, we recover an approach direction, contact geometry, dwell time, and retreat vector from the hand-object interaction phases. Averaged across instances, this produces a compact per-class record (approximately 12 numeric values per class in our current implementation) encoding “how does a bottle tend to be approached, grasped, and released in this demo.” Affordances are reusable across scenes.
Scene priors – from the reconstructed planes and wrist-reach envelope, we extract counter height, wall positions, object-to-plane bindings, and the workspace volume the human’s hands actually used. These describe what a kitchen looks like to a manipulator – independent of the specific geometry of any one kitchen.
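
A per-class affordance record and its reuse in a new scene can be sketched as follows. The field layout is an assumption: it covers only the four quantities named above and packs to 10 numbers, so the roughly 12 values per class in the current implementation would include fields not modelled here. Representing contact geometry as a 3D offset from the object centre, and the 10 cm stand-off, are also illustrative choices.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Affordance:
    approach_dir: np.ndarray    # unit vector, direction of hand travel before contact
    contact_offset: np.ndarray  # wrist position relative to object centre at contact (m)
    retreat_vec: np.ndarray     # wrist displacement after release (m)
    dwell_s: float              # time spent in contact (s)

    def as_vector(self) -> np.ndarray:
        return np.concatenate([self.approach_dir, self.contact_offset,
                               self.retreat_vec, [self.dwell_s]])

def average_affordances(instances):
    """Collapse several observed interactions with one class into one record."""
    approach = np.mean([a.approach_dir for a in instances], axis=0)
    norm = np.linalg.norm(approach)
    if norm < 1e-9:
        raise ValueError("approach directions cancel out")
    return Affordance(
        approach_dir=approach / norm,  # re-normalize the averaged direction
        contact_offset=np.mean([a.contact_offset for a in instances], axis=0),
        retreat_vec=np.mean([a.retreat_vec for a in instances], axis=0),
        dwell_s=float(np.mean([a.dwell_s for a in instances])),
    )

def ground_in_scene(aff, object_pos, standoff_m=0.10):
    """Turn a class-level affordance into waypoints for one object in a new scene."""
    contact = np.asarray(object_pos, dtype=float) + aff.contact_offset
    pregrasp = contact - standoff_m * aff.approach_dir  # back off against the approach
    retreat = contact + aff.retreat_vec
    return pregrasp, contact, retreat

observed = [
    Affordance(np.array([0.0, -1.0, 0.0]), np.array([0.0, 0.05, 0.0]),
               np.array([0.0, 0.2, 0.0]), 1.0),
    Affordance(np.array([0.0, -1.0, 0.0]), np.array([0.0, 0.07, 0.0]),
               np.array([0.0, 0.1, 0.0]), 2.0),
]
bottle = average_affordances(observed)
pregrasp, contact, retreat = ground_in_scene(bottle, object_pos=[1.0, 0.9, 0.5])
```

Because the record is expressed relative to the object, the same affordance yields valid waypoints wherever the object is placed, which is what makes it reusable across scenes.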

The system procedurally synthesizes fresh kitchens with different counter heights, wall positions, and a mix of seen and novel object classes, then composes new trajectories by grounding the affordances in each new scene and verifying them against the synthetic point cloud. A small MLP (Multi-Layer Perceptron) policy (approximately 20,000 parameters – two hidden layers of 128 units each) trained on procedurally perturbed kitchens, using the planner as an oracle, generalizes across arbitrary counter heights and object placements with no robot data at all.
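
The parameter figure can be sanity-checked with a few lines of arithmetic. The hidden sizes come from the text; the input width of 20 and output width of 7 are assumptions picked only to show that plausible dimensions land near 20,000.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# The 128 -> 128 hidden-to-hidden layer dominates the budget.
hidden_to_hidden = mlp_param_count([128, 128])

# Assumed 20-dimensional scene/affordance input and 7-dimensional action output.
total = mlp_param_count([20, 128, 128, 7])
```

Under these assumed dimensions the hidden-to-hidden layer accounts for 16,512 of the 20,103 parameters, so the total is only weakly sensitive to the exact input and output widths.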

The total scale is striking: one monocular video → an affordance library + scene prior → arbitrarily many synthetic environments → a cross-kitchen policy. The supervision is synthesized on demand from the structural regularities extracted from a single video, with no teleoperation, no motion capture, and no depth sensing. Compare this to the millions of teleoperated trajectories that go into a VLA pre-training corpus.

Next steps: JEPA-style world models that predict semantics, not pixels

A natural next step would be to lift the F.R.A.N.T.Z. recipe from a fixed pipeline to a learned world model – one whose internal representation is the scene semantics, not the raw pixels.

Two research directions point the way:

Yann LeCun’s Joint Embedding Predictive Architecture (JEPA) is a class of self-supervised models that predict in a learned representation space rather than at the pixel level. The idea is that pixel-by-pixel prediction wastes most of its capacity on perceptually irrelevant detail – textures, lighting, motion blur – while the meaningful structure of a scene lives in a much more compact latent space. JEPA leaves what that latent space should represent largely unspecified – it is a meta-architecture, not a prescription.
Eric Xing’s work on structured foundation models argues, in the same spirit, that the right inductive bias for embodied AI is structured representations – entities, relations, affordances, kinematics – rather than undifferentiated pixel streams.

F.R.A.N.T.Z. offers a concrete answer to the question JEPA leaves open: the prediction target should be scene semantics – objects with 3D poses, planes with bindings, affordances with phase structure, IK-posed kinematics – extracted as ground truth from real video. A JEPA-style predictor trained to predict the next scene-semantic state from the current one (rather than the next frame from the current frame) would inherit three useful properties:

Data-efficient, because a structured scene-semantic representation is orders of magnitude more compact than a raw video frame (a scene descriptor of tens of floats vs. millions of pixel values).
Interpretable, because the latent state has semantic meaning by construction (you can read off “the cup is here; the hand is approaching it”).
Transferable, because scene semantics generalize across visual styles in a way pixels do not (a cup looks different in every kitchen; its affordance is the same).
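
The compactness claim in the first item is easy to quantify. The sketch below compares the number of values in one Full HD RGB frame with a scene descriptor whose size, 64 floats, is an assumed stand-in for "tens of floats".

```python
# One 1920x1080 RGB frame, counted as raw pixel values.
frame_values = 1920 * 1080 * 3

# Assumed scene-descriptor size; the post only says "tens of floats".
descriptor_values = 64

compression_ratio = frame_values / descriptor_values
```

With these numbers a frame holds about 6.2 million values, roughly 97,000 times more than the descriptor, which is the sense in which the semantic target is orders of magnitude more compact.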

Affordance and scene-prior extraction is the entry point. The longer-term target is a world model that predicts scene semantics, with F.R.A.N.T.Z.-style ground-truth extraction as the supervision signal.

Conclusion

The case for training world models on scene semantics from ground-truth data, rather than on pixels and synthetic data, comes down to three observations. First, the current pixel-plus-synthetic recipe is hitting diminishing returns in cost, opacity, and sim-to-real gap. Second, the scene semantics an embodied agent needs (object identities and 3D poses, planar scene structure, affordances, kinematics) can be extracted as ground truth from ordinary monocular video by composing pre-trained generative AI modules with classical computer vision. Third, those scene semantics compose downstream into affordance libraries, scene priors, and learned policies that generalize from a single human demonstration to arbitrarily many synthesized environments.

The combination (pre-trained generative AI for perception, classical CV for composition, SDMA for spatial data governance, AWS Graviton with COOL for cost-effective scale, and a LiDAR-pipeline-style cloud back-end for batch processing) gives an end-to-end path from raw video to deployable robot behavior at a fraction of the data and compute cost of monolithic VLA training. Within the maturity framework, this is what Stage 3 looks like when the input is consumer video rather than LiDAR.

We are exploring a learned world model whose internal state is scene semantics, trained with JEPA-style latent-prediction objectives, as a natural evolution of this work.

Call to Action

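Read the companion posts: the four posts listed in the introduction cover the maturity framework, the LiDAR point-cloud pipeline, the COOL framework for OpenCV on Graviton, and SDMA.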

Try the recipe on your own data: any monocular video, no LiDAR required, no domain-specific training, COLMAP export ready for 3D Gaussian Splatting.

Decide your spatial data substrate before you scale: explore SDMA to organize and govern the spatial outputs F.R.A.N.T.Z. (or any equivalent pipeline) will produce in volume.

Have questions or want to discuss how this approach applies to your domain? Leave a comment below or reach out to the AWS Physical AI team.

AWS Physical AI Blog