AWS Machine Learning Blog

Labeling data for 3D object tracking and sensor fusion in Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth now supports labeling 3D point cloud data. For more information about the launched feature set, see this AWS News Blog post. In this blog post, we specifically cover how to perform the required data transformations of your 3D point cloud data to create a labeling job in SageMaker Ground Truth for 3D object tracking use cases.

Autonomous vehicle (AV) companies typically use LiDAR sensors to generate a 3D understanding of the environment around their vehicles. For example, they mount a LiDAR sensor on their vehicles to continuously capture point-in-time snapshots of the surrounding 3D environment. The LiDAR sensor output is a sequence of 3D point cloud frames (a typical capture rate is 10 frames per second). To build a perception system that can automatically track objects of interest around the vehicle (such as other vehicles or pedestrians), these companies first manually label the objects in the 3D point cloud frames and then train machine learning (ML) models using the labeled 3D frames.

In building perception systems, a common practice is to use inputs from multiple sensors to mitigate the shortcomings of any individual sensor. For example, a video camera can provide important contextual information like whether a traffic light is red, yellow, or green, but generally has limited perception in dark conditions. On the other hand, a LiDAR sensor can’t provide specific contextual understanding like traffic signal color, but can provide depth perception across 360 degrees regardless of whether it’s light or dark outside.

SageMaker Ground Truth makes it easy to label objects across a sequence of 3D point cloud frames for building ML training datasets, and supports sensor fusion of LiDAR data with up to eight video camera inputs. It requires upfront synchronization of the video frames with 3D point cloud frames. Once enabled, human labelers can view the 3D point cloud frame side-by-side with the synchronized video frame. In addition to providing more visual context for labeling, the sensor fusion capability also projects any label drawn in the 3D point cloud to the video frame, and label adjustments made in one frame appear in the other.

In this post, we demonstrate how to prepare your 3D point cloud data and synchronized video data for SageMaker Ground Truth to consume. We start with the KITTI Vision Benchmark Suite, which is a popular AV dataset.[1] It includes 3D point cloud data generated using a Velodyne LiDAR sensor in addition to video data. Specifically, we cover the following steps:

  • Discuss Ground Truth 3D point cloud labeling job input data format and requirements.
  • Convert the KITTI dataset from a local coordinate system to a world coordinate system.
  • Associate the synchronized video data with the LiDAR data for sensor fusion.
  • Prepare a SageMaker Ground Truth input manifest file.
  • Create a labeling job for 3D point cloud object detection and tracking across a sequence of frames.
  • Use the assistive labeling tools in the worker labeling UI.

To complete this walkthrough, use the notebook 3D-point-cloud-input-data-processing.ipynb in the Amazon SageMaker Examples tab of a notebook instance, under Ground Truth Labeling Jobs. You can also find the notebook on GitHub.

3D point cloud labeling job input data

This section gives you an overview of Ground Truth input data concepts and requirements for 3D point cloud labeling jobs.

3D point cloud frame versus 3D point cloud sequence

A point cloud frame is defined as a collection of 3D points describing a 3D scene. Each point is described using three coordinates: x, y, and z. To add color and variations in point intensity to the point cloud, points may have additional attributes, such as i for intensity or values for the red (r), green (g), and blue (b) color channels (8-bit). All the positional coordinates (x, y, z) are in meters. When you create a Ground Truth 3D point cloud labeling job, you can use point cloud data in an ASCII or compact binary format.
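
For illustration, a minimal ASCII frame might contain one point per line with space-separated x, y, z, and i values. The numbers below are made up; this is also the layout the sequence file code later in this post writes out:

8.21 4.37 -1.64 0.32
8.25 4.39 -1.63 0.28
8.30 4.41 -1.63 0.30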

A sequence is defined as a temporal sequence of 3D point cloud frames captured while the LiDAR moves (for example, on top of a car). SageMaker Ground Truth defines a sequence file format as an ordered sequence of frames, where each frame is associated with a timestamp.

In this post, we demonstrate how to create a SageMaker Ground Truth sequence file from the KITTI dataset in preparation for creating a labeling job for 3D object tracking.

World coordinate system

In object tracking, you track the movement of an object (such as a pedestrian on the sidewalk) while your point of reference (the autonomous vehicle) is moving. Your point of reference has the sensors mounted on it to sense the surrounding environment and is typically called the ego vehicle.

When performing object tracking, you need to use a world coordinate system (also called a global frame of reference) because the ego vehicle itself is moving in the world. In general, SageMaker Ground Truth expects that you have already transformed your point cloud data into a reference coordinate system of your choice. You usually do such transformations by multiplying each point in a 3D frame with the extrinsic matrix for the LiDAR sensor. The extrinsic matrix of a sensor is a homogeneous transformation matrix used to transform the perspective of the data from the sensor’s own coordinate system to a world coordinate system. A homogeneous transformation is a sequence of three rotations (about the x, y, and z axes) followed by a translation of the origin point. The rotation matrix is the 3×3 matrix defining that sequence of rotations.

This post walks through how to transform KITTI 3D frames from a local coordinate system to a world coordinate system using extrinsic matrices. The KITTI dataset provides an extrinsic matrix for each 3D point cloud frame; alternatively, you can derive the extrinsic matrix yourself from the ego vehicle’s GPS data. You use the NumPy matrix multiplication function to multiply this matrix with each point in the frame, translating that point into the world coordinate system the KITTI dataset uses.

With your own setup, you can also compute the extrinsic transformation matrix independently using the GPS/IMU position and orientation (latitude/longitude/altitude/roll/pitch/yaw) with respect to the LiDAR sensor on the ego vehicle. For example, you can compute the pose from the KITTI raw data using pose = convertOxtsToPose(oxts), which transforms the oxts data into local Euclidean poses specified by 4×4 rigid transformation matrices. You can then transform these pose matrices into a global frame of reference using the reference frame’s transformation matrix in the world coordinate system.
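
If you don’t use the KITTI devkit, the following is a rough Python sketch (not the devkit’s convertOxtsToPose) of assembling a 4×4 homogeneous pose matrix in the form [R T; 0 0 0 1] from illustrative roll/pitch/yaw and position values obtained from a GPS/IMU unit:

import numpy as np
from scipy.spatial.transform import Rotation as R

# illustrative roll/pitch/yaw (radians) and position (meters) from a GPS/IMU unit
roll, pitch, yaw = 0.01, -0.02, 1.57
position = [10.0, 5.0, 0.9]

# assemble the 4x4 homogeneous pose matrix [R T; 0 0 0 1]
pose = np.eye(4)
pose[0:3, 0:3] = R.from_euler('xyz', [roll, pitch, yaw]).as_matrix()
pose[0:3, 3] = position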

Sensor fusion

The LiDAR sensor and each of the cameras have their own extrinsic matrices, and SageMaker Ground Truth uses them to enable the sensor fusion feature. To project a label from the 3D point cloud to the camera image plane, SageMaker Ground Truth needs to transform 3D points from the LiDAR coordinate system to the camera coordinate system. This is typically done by first transforming 3D points from the LiDAR sensor’s own coordinate system to a world coordinate system using the LiDAR extrinsic matrix. Then you use the camera inverse extrinsic (world to camera) to transform the 3D points from the world coordinate system obtained in the previous step into the camera image plane. If your 3D point cloud data is already transformed into the world coordinate system, the first transformation is not necessary, and the translation between 3D and 2D depends only on the camera’s extrinsic matrix.
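
The following is a minimal sketch of that projection chain for a single LiDAR point, assuming lidar_extrinsic_matrix and camera_extrinsic_matrix are both 4×4 sensor-to-world transforms (placeholder identity matrices here; the KITTI-specific code later in this post computes the same composition from pykitti calibration data):

import numpy as np

# placeholder sensor-to-world transforms; in practice these come from your calibration
lidar_extrinsic_matrix = np.eye(4)
camera_extrinsic_matrix = np.eye(4)

# homogeneous LiDAR point (x, y, z, 1); illustrative values
point_h = np.array([12.4, -3.1, 0.8, 1.0])

# LiDAR coordinates -> world coordinates -> camera coordinates
point_world = np.matmul(lidar_extrinsic_matrix, point_h)
point_camera = np.matmul(np.linalg.inv(camera_extrinsic_matrix), point_world)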

If you have a rotation matrix (made up of the axis rotations) and translation vector (or origin) in the world coordinate system instead of a single extrinsic transformation matrix, you can directly use rotation and translation to compute pose. See the following code:

import numpy as np
from scipy.spatial.transform import Rotation as R

rotation = [[ 9.96714314e-01, -8.09890350e-02,  1.16333982e-03],
            [ 8.09967396e-02,  9.96661051e-01, -1.03090934e-02],
            [-3.24531964e-04,  1.03694477e-02,  9.99946183e-01]]

origin = [1.71104606e+00, 5.80000039e-01, 9.43144935e-01]

# position is the origin
position = origin
r = R.from_matrix(np.asarray(rotation))
# heading in WCS as a quaternion, using scipy
heading = r.as_quat()

If you have a 4×4 extrinsic matrix, the matrix is in the form [R T; 0 0 0 1], where R is the rotation matrix and T is the origin translation vector. That means you can extract the rotation matrix and translation vector from the matrix as follows:

import numpy as np
from scipy.spatial.transform import Rotation as R

transformation = np.array(
    [[ 9.96714314e-01, -8.09890350e-02,  1.16333982e-03, 1.71104606e+00],
     [ 8.09967396e-02,  9.96661051e-01, -1.03090934e-02, 5.80000039e-01],
     [-3.24531964e-04,  1.03694477e-02,  9.99946183e-01, 9.43144935e-01],
     [              0,               0,               0,              1]])

# top-left 3x3 block is the rotation matrix; the last column is the translation vector
rotation = transformation[0:3, 0:3]
translation = transformation[0:3, 3]

# position is the origin translation
position = translation
r = R.from_matrix(np.asarray(rotation))
# heading in WCS as a quaternion, using scipy
heading = r.as_quat()
print(f"position: {position}\nheading: {heading}")

For sensor fusion, you provide each sensor’s extrinsic matrix as a pose: an origin position (the translation) and a heading expressed as a quaternion (the rotation about the three axes). The following code is an example of the pose JSON you use in the input manifest file:

{
    "position": {
        "y": -152.77584902657554,
        "x": 311.21505956090624,
        "z": -10.854137529636024
    },
    "heading": {
        "qy": -0.7046155108831117,
        "qx": 0.034278837280808494,
        "qz": 0.7070617895701465,
        "qw": -0.04904659893885366
    }
}

All the positional coordinates (x, y, z) are in meters. All the pose headings (qx, qy, qz, qw) express spatial orientation as a quaternion. For each camera, you provide pose data extracted from that camera’s extrinsic matrix.
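
As a quick illustration (assuming the position and heading variables computed in the preceding extrinsic-matrix example), you could build this pose structure in Python as follows; note that SciPy’s as_quat() returns the quaternion in (qx, qy, qz, qw) order:

# build the pose JSON from the position and heading computed above
# float() keeps the values JSON-serializable if they are NumPy scalars
pose = {
    "position": {"x": float(position[0]), "y": float(position[1]), "z": float(position[2])},
    "heading": {"qx": float(heading[0]), "qy": float(heading[1]), "qz": float(heading[2]), "qw": float(heading[3])}
}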

Camera calibrations, intrinsic, and distortion

Geometric camera calibration, also referred to as camera resectioning, estimates the lens and image sensor parameters of an image or video camera. You can use these parameters to correct for lens distortion, measure the size of an object in world units, or determine the location of the camera in the scene. Additionally, if your camera images are distorted, you can provide additional calibration parameters (such as intrinsic and distortion) to correct the images.

Camera parameters include intrinsic matrix parameters, extrinsic matrix parameters, and distortion coefficients. See the following code:

# intrinsic parameters
fx (float) - focal length in x direction.
fy (float) - focal length in y direction.
cx (float) - x coordinate of principal point.
cy (float) - y coordinate of principal point.
# Radial distortion parameters
k1 (float) - Radial distortion coefficient.
k2 (float) - Radial distortion coefficient.
k3 (float) - Radial distortion coefficient.
k4 (float) - Radial distortion coefficient.
# Tangential distortion parameters
p1 (float) - Tangential distortion coefficient.
p2 (float) - Tangential distortion coefficient.

The camera intrinsic transform is defined as the following equation, where * is matrix multiplication.

|x|   |f_x,   0, c_x|   |X|
|y| = |  0, f_y, c_y| * |Y|
|z|   |  0,   0,   1|   |Z|
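
A brief sketch of applying that intrinsic transform in Python (the intrinsic values and the 3D point are illustrative; dividing the result by z gives pixel coordinates):

import numpy as np

# illustrative intrinsic parameters
fx, fy, cx, cy = 721.5, 721.5, 609.6, 172.9
K = np.array([[fx,  0, cx],
              [ 0, fy, cy],
              [ 0,  0,  1]])

# 3D point in the camera coordinate system (meters)
X, Y, Z = 2.0, 1.0, 10.0
x, y, z = np.matmul(K, np.array([X, Y, Z]))

# normalize by z to obtain pixel coordinates
u, v = x / z, y / z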

Input manifest file

Ground Truth takes an input manifest where each line of the manifest describes a unit of task to be completed by annotators (or by auto labeling for some built-in task types). The format of your input manifest file depends on your task type:

  • 3D point cloud object detection or semantic segmentation labeling job – Each line in your input manifest file contains information about a single 3D point cloud frame and associated sensor fusion data. This manifest file is called a 3D point cloud frame input manifest.
  • 3D point cloud object detection and tracking labeling job – Each line of your input manifest file contains a link to a sequence file defining a sequence of 3D point cloud frames and associated sensor fusion data. Each sequence is called a 3D point cloud sequence. This manifest is called a point cloud sequence input manifest.

In this walkthrough, you create a point cloud sequence manifest file. You can also modify the solution to create a point cloud frame input manifest. Details are provided in the sample notebook.
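
For reference, each line of a point cloud sequence input manifest is a single JSON object whose source-ref field points to a sequence file in Amazon S3. A minimal example line (with a placeholder bucket and key) looks like the following:

{"source-ref": "s3://your-bucket/kitti-sequences/seq1.json"}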

Transforming the KITTI dataset to a world coordinate system

You can use the notebook to run the code blocks in this section.

The world coordinate system depends on the dataset. Some datasets use the LiDAR position in the first frame as the origin, so that the vehicle’s position and heading sit at the origin in the first frame and all subsequent 3D frames use that first frame as the reference. Other datasets start the device position at a point other than the origin. The KITTI dataset uses the location of the first frame as the reference for its world coordinate system.

This post shows how to transform the KITTI dataset with respect to the LiDAR sensor origin in the first frame as the global frame of reference so SageMaker Ground Truth can consume it. The KITTI raw dataset includes a rotation matrix and translation vectors for each frame, and you can compute the extrinsic matrix for each frame using this data. Working with the raw data directly can be challenging; it is easier to use a development package like the pykitti Python module to work with the KITTI dataset.

In the dataset, dataset.oxts[i].T_w_imu gives the LiDAR extrinsic transform for the ith frame, and you can multiply it with the points of the frame to convert it to a world frame using the NumPy matrix multiplication function matmul: matmul(lidar_transform_matrix, points). In general, multiplying a point in a LiDAR frame with a LiDAR extrinsic matrix transforms it into world coordinates. The T_w_imu convention refers to the transformation from IMU to the world coordinate system (so notation is T_destinationFrame_originFrame).

The following code example demonstrates how you can transform KITTI point cloud frames into the world coordinate system:

import pykitti
import numpy as np

basedir = '/Users/nameofuser/kitti-data'
date = '2011_09_26'
drive = '0079'

# The 'frames' argument is optional - default: None, which loads the whole dataset.
# Calibration, timestamps, and IMU data are read automatically. 
# Camera and velodyne data are available via properties that create generators
# when accessed, or through getter methods that provide random access.
data = pykitti.raw(basedir, date, drive, frames=range(0, 50, 5))

# i is frame number
i = 0

# customer computes lidar extrinsics
lidar_extrinsic_matrix = data.oxts[i].T_w_imu

# velodyne raw point cloud in lidar scanners own coordinate system
points = data.get_velo(i)

# transform points from the LiDAR frame to the global (world) frame using lidar_extrinsic_matrix
def generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix):
    tps = []
    for point in points:
        # convert to homogeneous coordinates and apply the 4x4 extrinsic matrix
        transformed_point = np.matmul(
            lidar_extrinsic_matrix,
            np.array([point[0], point[1], point[2], 1], dtype=np.float32).reshape(4, 1)
        ).tolist()
        # KITTI Velodyne points carry intensity as the fourth value; keep it alongside x, y, z
        if len(point) > 3 and point[3] is not None:
            tps.append([transformed_point[0][0], transformed_point[1][0], transformed_point[2][0], point[3]])

    return tps
    
# customer transforms points from lidar to global frame using lidar_extrinsic_matrix
transformed_pcl = generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix)

Associating video data with LiDAR data for sensor fusion

KITTI provides both the LiDAR extrinsic and the camera extrinsic matrices. You use them to extract pose data. You then format this pose data using the JSON format required for the 3D point cloud sequence input manifest.

For the KITTI dataset, you can use the pykitti Python module to load the KITTI data. In the dataset, .oxts[i].T_w_imu gives the LiDAR extrinsic matrix (lidar_extrinsic_transform) for the i’th frame. You can convert this transform into translation and quaternion values, which become the position and heading of the LiDAR sensor in the JSON format. See the following code:

import numpy as np
from scipy.spatial.transform import Rotation as R

# utility to convert extrinsic matrix to pose heading quaternion and position
def convert_extrinsic_matrix_to_trans_quaternion_mat(lidar_extrinsic_transform):
    position = lidar_extrinsic_transform[0:3, 3]
    rot = np.linalg.inv(lidar_extrinsic_transform[0:3, 0:3])
    quaternion= R.from_matrix(np.asarray(rot)).as_quat()
    trans_quaternions = {
        "translation": {
            "x": position[0],
            "y": position[1],
            "z": position[2]
        },
        "rotation": {
            "qx": quaternion[0],
            "qy": quaternion[1],
            "qz": quaternion[2],
            "qw": quaternion[3]
            }
    }
    return trans_quaternions

Similarly, you can use the camera extrinsic to extract camera pose data. You can calculate Camera_extrinsic_transform for cam0 in the i’th frame by inv(matmul(dataset.calib.T_cam0_velo, inv(dataset.oxts[i].T_w_imu))), which you can convert into heading and position for cam0. See the following code:

def convert_camera_inv_extrinsic_matrix_to_trans_quaternion_mat(camera_extrinsic_transform):
    position = camera_extrinsic_transform[0:3, 3]
    rot = np.linalg.inv(camera_extrinsic_transform[0:3, 0:3])
    quaternion= R.from_matrix(np.asarray(rot)).as_quat()
    trans_quaternions = {
        "translation": {
            "x": position[0],
            "y": position[1],
            "z": position[2]
        },
        "rotation": {
            "qx": quaternion[0],
            "qy": quaternion[1],
            "qz": quaternion[2],
            "qw": -quaternion[3]
            }
    }
    return trans_quaternions

Camera calibrations: Intrinsic and distortions

The KITTI dataset provides calibration parameters for each camera. For example, data.calib.K_cam0 contains the following camera intrinsic matrix:

            fx 0  cx
            0  fy cy
            0  0   1
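
A short sketch of pulling the individual intrinsic parameters out of this matrix (this is the same indexing the sequence file code in the next section uses; it assumes the pykitti data object loaded earlier in this post):

# assumes the pykitti 'data' object loaded earlier
K = data.calib.K_cam0
fx, fy = K[0][0], K[1][1]
cx, cy = K[0][2], K[1][2]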

Creating the input manifest file

After you convert a sequence of frames in the KITTI dataset to the world coordinate system and extract pose information from the LiDAR and camera extrinsic matrices, you can create a 3D point cloud sequence manifest file that includes sensor fusion data. You can use the following function to automatically create a sequence file for a sequence input manifest file:

import csv
import json
import os
from datetime import timezone
from shutil import copyfile

def convert_to_gt():
    # The 'frames' argument is optional - default: None, which loads the whole dataset.
    # Calibration, timestamps, and IMU data are read automatically. 
    # Camera and velodyne data are available via properties that create generators
    # when accessed, or through getter methods that provide random access.
    data = pykitti.raw(basedir, date, drive, frames=range(0, 50, 5))
    image_paths  = [data.cam0_files,  data.cam1_files, data.cam2_files, data.cam3_files]
    camera_extrinsic_calibrations = [data.calib.T_cam0_velo, data.calib.T_cam1_velo, data.calib.T_cam2_velo, data.calib.T_cam3_velo]
    camera_intrinsics = [data.calib.K_cam0, data.calib.K_cam1, data.calib.K_cam2, data.calib.K_cam3]
    seq_json = {}
    seq_json["seq-no"] = 1
    seq_json["prefix"] = 's3://pdx-groundtruth-lidar-test-bucket/pdx-groundtruth-sequences/kittiseq2/frames/'
    seq_json["number-of-frames"] = len(data)
    seq_json["frames"] = []
    default_position = {"x": 0, "y": 0, "z": 0}
    default_heading = {"qx": 0, "qy": 0, "qz": 0, "qw": 1}
    for i in range(len(data)):
        # customer computes lidar extrinsics
        lidar_extrinsic_matrix = data.oxts[i].T_w_imu        
        # velodyne raw point cloud in lidar scanners own coordinate system
        points = data.get_velo(i)
        # customer transforms points from lidar to global frame using lidar_extrinsic_matrix
        transformed_pcl = generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix)    
        # Customer computes rotation quaternion and translation from LiDAR Extrinsic
        trans_quaternions = convert_extrinsic_matrix_to_trans_quaternion_mat(lidar_extrinsic_matrix)
        # Customer uses trans_quaternions to populate the GT ego vehicle pose
        ego_vehicle_pose = {}
        ego_vehicle_pose['heading'] = trans_quaternions['rotation']
        ego_vehicle_pose['position'] = trans_quaternions['translation']
        # open file to write the transformed pcl
        with open(output_base+"/"+str(i)+'.txt', "w") as out:
            writer = csv.writer(out, delimiter=' ', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            for point in transformed_pcl:
                writer.writerow((point[0], point[1], point[2], point[3]))
        frame = {}
        frame["frame-no"] = i
        frame["frame"] = str(i)+'.txt'
        frame["unix-timestamp"] = data.timestamps[i].replace(tzinfo=timezone.utc).timestamp()
        frame["ego-vehicle-pose"] = ego_vehicle_pose
        images = []
        image_dir_path = os.path.join(output_base, 'images')
        if not os.path.exists(image_dir_path):
            os.makedirs(image_dir_path)
        for j in range(len(image_paths)):
            # copy image
            image_path = image_paths[j][i]
            image_suffix_path = 'images/frame_'+str(i)+'_camera_'+str(j)+'.jpg'
            copyfile(image_path, os.path.join(output_base, image_suffix_path))
            # If customer has the camera extrinsics then they use them to compute the camera transform
            camera_transform= np.linalg.inv(np.matmul(camera_extrinsic_calibrations[j], np.linalg.inv(data.oxts[i].T_w_imu)))
            # Customer computes rotation quaternion and translation from camera inverse Extrinsic
            cam_trans_quaternions = convert_camera_inv_extrinsic_matrix_to_trans_quaternion_mat(camera_transform)
            image_json = {}
            image_json["image-path"] = image_suffix_path
            image_json["unix-timestamp"] = frame["unix-timestamp"]
            image_json['heading'] = cam_trans_quaternions['rotation']
            image_json['position'] = cam_trans_quaternions['translation']
            image_json['fx'] = camera_intrinsics[j][0][0]
            image_json['fy'] = camera_intrinsics[j][1][1]
            image_json['cx'] = camera_intrinsics[j][0][2]
            image_json['cy'] = camera_intrinsics[j][1][2]
            image_json['k1'] = 0
            image_json['k2'] = 0
            image_json['k3'] = 0
            image_json['k4'] = 0
            image_json['p1'] = 0
            image_json['p2'] = 0
            image_json['skew'] = 0
            images.append(image_json)
        frame["images"]=images
        seq_json["frames"].append(frame)
    with open(output_base+'/mykitti-seq2.json', 'w') as outfile:
        json.dump(seq_json, outfile)
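
After the sequence file and transformed frames are written, the input manifest itself is just one JSON line per sequence pointing to the sequence file. The following is a rough sketch (not taken from the notebook) of writing that manifest and uploading the sequence file with boto3; the bucket name and keys are placeholders, and output_base is the same local directory used in the function above. The frame .txt files and camera images also need to be uploaded under the prefix you set in seq_json["prefix"].

import json
import boto3

bucket = "your-bucket"  # placeholder bucket name

# each line of the input manifest points to one sequence file in S3
manifest_line = {"source-ref": f"s3://{bucket}/kitti-sequences/mykitti-seq2.json"}
with open(output_base + "/mykitti-seq2.manifest", "w") as f:
    f.write(json.dumps(manifest_line) + "\n")

s3 = boto3.client("s3")
s3.upload_file(output_base + "/mykitti-seq2.json", bucket, "kitti-sequences/mykitti-seq2.json")
s3.upload_file(output_base + "/mykitti-seq2.manifest", bucket, "kitti-sequences/mykitti-seq2.manifest")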

Creating a labeling job

After you create an input manifest file, you can create a labeling job using the notebook. Use a private work team when you create your labeling job (using the instructions in the notebook) so that you can view the worker task in the worker portal and interact with the worker labeling UI.
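
The notebook walks you through the exact API call, but for orientation, the following is a hedged outline of creating the job programmatically with the SageMaker create_labeling_job API. All names and ARNs below are placeholders; use the human task UI ARN and the pre-annotation and annotation consolidation Lambda ARNs for 3D point cloud object tracking that the notebook and the Ground Truth documentation list for your Region.

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_labeling_job(
    LabelingJobName="kitti-3d-object-tracking",            # placeholder job name
    LabelAttributeName="ot-labels-ref",                    # 3D point cloud jobs use a -ref suffix
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://your-bucket/kitti-sequences/mykitti-seq2.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://your-bucket/gt-output/"},
    RoleArn="arn:aws:iam::111122223333:role/your-sagemaker-execution-role",  # placeholder
    LabelCategoryConfigS3Uri="s3://your-bucket/label-categories.json",       # placeholder
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/your-team",  # placeholder
        "UiConfig": {"HumanTaskUiArn": "<3D point cloud object tracking UI ARN from the notebook>"},
        "PreHumanTaskLambdaArn": "<pre-annotation Lambda ARN from the notebook>",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "<annotation consolidation Lambda ARN from the notebook>"
        },
        "TaskTitle": "3D point cloud object tracking",
        "TaskDescription": "Draw cuboids around objects and track them across frames",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 28800,
    },
)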

The preprocessing time of your labeling job is the time it takes for the tasks to start appearing in the worker portal. It depends on the size of your input data, resolution of your point clouds, and data used for sensor fusion (if added). Therefore, when using the notebook, it may take a few minutes for your labeling job to appear in the worker portal. When a task appears, select it and choose Start Working.

For more information about the worker UI, see 3D Point Cloud Object Tracking. The following are some of the assistive labeling tools available in the worker labeling UI:

  • Label auto-fill – When a worker adds a cuboid to a frame, the cuboid is automatically added to all frames in the sequence. These auto-filled labels automatically adjust as the worker manually adjusts annotations for the same object in other frames.
  • Label interpolation – After a worker labels a single object in two frames, Ground Truth uses those annotations to interpolate the movement of that object between those two frames.
  • Snapping – Workers can add a cuboid around an object and use a keyboard shortcut or menu option to have Ground Truth’s auto-fitting tool snap the cuboid tightly around the object’s boundaries.
  • Fit to ground – After a worker adds a cuboid to the 3D scene, the worker can automatically snap the cuboid to the ground. For example, the worker can use this feature to snap a cuboid to the road or sidewalk in the scene.
  • Multi-view labeling – After a worker adds a 3D cuboid to the 3D scene, a side panel displays a front and two side perspectives to help the worker adjust the cuboid tightly around the object. Workers can adjust the label in the 3D point cloud or the side panel, and the adjustments appear in the other views in real time.
  • Sensor fusion – If you provide data for sensor fusion, workers can adjust annotations in the 3D scenes and in 2D images, and the annotations are projected into the other view in real time.

The following video demonstrates interpolation. The worker has added only two cuboids, to the first and last frames of the sequence. These manually added cuboids are represented by red segments in the navigation bar at the bottom left. All labels in between (in the blue area of the navigation bar) are interpolated by Ground Truth.

The following images demonstrate multi-view labeling and sensor fusion. Once a cuboid has been added to the 3D point cloud visualization, it can be adjusted in the three side views and in the camera image.

In addition to labels, you can add label attributes to gather additional metadata on each object. In the following video, you’ll see that a cuboid is added with the label “Car”. This label has two label attributes associated with it: Occlusion and Color. Workers can select a single value for each label attribute.

Ground Truth automatically saves your annotations every 15 minutes. When you are done working, choose Submit. When the task is complete, the output data is in the Amazon Simple Storage Service (Amazon S3) bucket you specify in HumanTaskConfig.

To learn more about the output data format, see 3D Point Cloud Object Tracking on the Ground Truth Output Data page.

Conclusion

In this walkthrough, you learned about the input data requirements and options for a Ground Truth 3D point cloud labeling job, and created an object tracking labeling job. For more information about the other task types that you can create for 3D point cloud labeling jobs, see 3D Point Cloud Task types. Also, we would like to thank the KITTI team for letting us use this dataset to demonstrate how to prepare your 3D point cloud data for use in SageMaker Ground Truth.

[1] The KITTI dataset is subject to its own license.  Please make sure that any use of the dataset conforms to the license terms and conditions.


About the Authors

Vikram Madan is the Product Manager for Amazon SageMaker Ground Truth. He focuses on delivering products that make it easier to build machine learning solutions. In his spare time, he enjoys running long distances and watching documentaries.

Zahid Rahman is a Senior SDE in AWS AI, where he builds complex machine learning services and large-scale distributed systems. He is interested in computer vision, artificial intelligence, and big data technologies. In his spare time, he enjoys reading history.

Talia Chopra is a Technical Writer in AWS specializing in machine learning and artificial intelligence. She works with multiple teams in AWS to create technical documentation and tutorials for customers using Amazon SageMaker, MxNet, and AutoGluon. In her free time, she enjoys meditating and taking walks in nature.