
High-definition map automatic annotation system based on active learning

Chao Zheng, Xu Cao, Kun Tang, Zhipeng Cao, Elena Sizikova, Tong Zhou, Erlong Li, Ao Liu, Shengtao Zou, Xinrui Yan, Shuqi Mei

First published: 21 November 2023

Abstract

As autonomous vehicle technology advances, high-definition (HD) maps have become essential for ensuring safety and navigation accuracy. However, creating HD maps with accurate annotations demands substantial human effort, leading to a time-consuming and costly process. Although artificial intelligence (AI) and computer vision (CV) algorithms have been developed for prelabeling HD maps, a significant gap remains in accuracy and robustness between AI-based methods and traditional manual pipelines. Additionally, building large-scale annotated datasets and advanced machine learning algorithms for AI-based HD map labeling systems can be resource-intensive. In this paper, we present and summarize the Tencent HD Map AI (THMA) system, an innovative end-to-end, AI-based, active learning HD map labeling system designed to produce HD map labels for hundreds of thousands of kilometers while employing active learning to enhance product iteration. Utilizing a combination of supervised, self-supervised, and weakly supervised learning, THMA is trained directly on massive HD map datasets to achieve the high accuracy and efficiency required by downstream users. Deployed by the Tencent Map team, THMA serves over 1000 labeling workers and generates more than 30,000 km of HD map data per day at its peak. With over 90% of Tencent Map’s HD map data labeled automatically by THMA, the system accelerates traditional HD map labeling processes by more than tenfold, significantly reducing manual annotation burdens and paving the way for more efficient HD map production.

INTRODUCTION

With the rapid development of intelligent transportation systems, environmental perception has become a crucial aspect of autonomous driving. In response to this demand, various deep neural networks (DNNs) have been developed to automatically understand traffic scenes, employing segmentation-based and object detection-based methods (Fernandes et al. 2021; Tang, Li, and Liu 2021; Yan et al. 2020). However, creating a robust framework that is suitable for level 3–5 autonomous driving remains a significant challenge. Real-world environments often exhibit extreme weather variations and obstacles, which can greatly impact the accuracy of detection results derived from real-time data. Additionally, the necessity of real-time analysis exacerbates these challenges.

To address these issues, the current industry standard relies on high-definition (HD) maps, a type of centimeter-level imagery collected using laser sensors. HD maps offer more detailed representations (Máttyus et al. 2016; Elhousni et al. 2020; Fan et al. 2018; Bao et al. 2022) and true ground-absolute accuracy while being less affected by driving environments compared to conventional RGB real-time traffic scene imagery. HD maps provide users with permanent road elements, such as lane marking types within annotated 3D point clouds. In contrast to real-time road images, HD maps deliver offline centimeter-level location services and valuable prior knowledge about traffic scenes, enabling self-driving vehicles to better avoid environmental interference.

The HD map production process, as depicted in Figure 1, consists of four main steps: (1) data sourcing, (2) backend automation, (3) map making and validation, and (4) map compilation and release. Data are sourced from various sensors mounted on a surveying car, including the Global Positioning System (GPS), Inertial Measurement Unit (IMU), LiDAR, and camera (Bao et al. 2022). GPS and IMU provide precise absolute localization of tracks, while LiDAR, the most essential sensor for HD maps, gathers object location information with centimeter-level precision. Cameras are employed to capture RGB images, which are then used to detect attributes of the HD map data.

FIGURE 1 Basic production process of HD maps: (1) data sourcing; (2) backend automation; (3) map making and validation; (4) map compilation and release. Example shown uses the Tencent HD Map production.

The raw point cloud and image data collected from sensors are processed by the mid-process system, which encompasses point cloud fusion and an automatic labeling system. This system utilizes AI and computer vision (CV) analysis techniques for both point clouds and images (Elhousni et al. 2020; Pannen et al. 2020). Following prelabeling, the point cloud and prelabeling data are verified by the HD map maker during the map-making process. Finally, the HD map data are compiled and released.

The map-making stage is the most resource-intensive step in the process. Researchers have attempted to use DNNs to develop automatic AI systems for the labeling process of HD maps (Jiao 2018; Elhousni et al. 2020; Zhou et al. 2021; Kim, Cho, and Chung 2021; Li et al. 2022). These methods have yielded relatively good results for simple 2D tasks, such as lane marking and road detection. However, the primary challenge for existing automated AI solutions lies in creating HD maps with 2D ground and 3D aerial element annotations in densely populated urban environments. In these areas, maps often contain noise and numerous overlapping 3D objects, making accurate annotations more difficult. Moreover, users of HD maps demand precise maps from these complex urban environments for the maps to be widely applicable and useful.

In this work, we introduce the THMA system (Tang et al. 2023), an innovative AI-based solution for rapidly labeling large collections of HD maps via active learning. Deployed by the Tencent Map team since 2021, THMA has served over 1000 users to date. Tencent Map’s smart city applications have employed the products, and the automatically labeled HD maps have been provided to downstream self-driving companies. THMA has significantly improved map makers’ operational efficiency and reduced HD map annotation costs. To the best of our knowledge, THMA is among the industry’s most advanced tools for creating HD map annotations, offering the following advantages:

  • Low cost: THMA effectively reduces the need for large-scale manual annotation of HD maps, accelerating traditional HD map labeling processes by more than tenfold.
  • End-to-end active learning training pipeline: In contrast to existing HD map automatic labeling systems, THMA establishes an active learning closed loop between generating annotations and model training. It can generate annotations for new 2D ground and 3D aerial elements for next-generation HD map development.
  • Modular design: With its modular design, THMA can easily update each model component in the inference pipeline. It can consistently meet the needs of downstream users, providing a comprehensive, ready-for-integration solution.

THMA SYSTEM DESIGN

In this section, we provide a detailed description of the THMA system workflow. THMA was specifically designed to annotate hundreds of thousands of kilometers of high-density urban environments, such as China’s densely populated cities of Beijing, Shanghai, and Shenzhen (each with a population exceeding 10 million people), which presents an extremely challenging task. Consequently, THMA features a modular workflow, with the key components depicted in Figure 2.

FIGURE 2 Detailed overview of the proposed Tencent HD Map AI (THMA) labeling system. The system is modular and is designed to accommodate the challenges of labeling large volumes of HD maps of high-density urban environments.

The model inference step in the THMA pipeline, illustrated in Figure 3, involves a divide-and-conquer smart data processing approach that identifies objects in 3D multiscan fusion point clouds, 2.5D BEV images, and RGB images. The 3D detection algorithm automatically labels 3D points (for traffic lights, poles, tunnels, and traffic signs) and lines (for barriers and curbs) from the multiscan fusion point cloud. The 2.5D segmentation algorithm detects ground elements, such as lane markings, on the multichannel BEV image. The 2D segmentation algorithm identifies other attributes, like lane marking color, in the RGB image. All generated annotations are combined into the final HD map product. In the next three subsections, we first describe HD map data acquisition and a single round of the THMA training pipeline under active learning; we then introduce the full active learning pipeline.
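To make this divide-and-conquer flow concrete, the sketch below shows one way such a modular inference pipeline could be organized. The class and module names (HDMapInferencePipeline, Annotation, and the three stage callables) are hypothetical illustrations for this article, not the actual THMA code.

```python
# Hypothetical sketch of a THMA-style divide-and-conquer inference pipeline:
# each stage consumes one data modality and emits annotations that are merged
# into a single HD map record at the end.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Annotation:
    element_type: str                      # e.g., "pole", "lane_marking", "stop_line"
    geometry: Any                          # points, polylines, or polygons
    attributes: Dict[str, Any] = field(default_factory=dict)
    confidence: float = 0.0


class HDMapInferencePipeline:
    def __init__(self,
                 detector_3d: Callable,    # 3D aerial elements from fused point cloud
                 segmenter_bev: Callable,  # ground elements from 2.5D BEV image
                 segmenter_rgb: Callable): # appearance attributes from RGB images
        self.detector_3d = detector_3d
        self.segmenter_bev = segmenter_bev
        self.segmenter_rgb = segmenter_rgb

    def run(self, point_cloud, bev_image, rgb_images) -> List[Annotation]:
        aerial = self.detector_3d(point_cloud)       # poles, lights, signs, tunnels ...
        ground = self.segmenter_bev(bev_image)       # lane markings, stop lines ...
        attributes = self.segmenter_rgb(rgb_images)  # e.g., lane marking color
        return self._merge(aerial, ground, attributes)

    @staticmethod
    def _merge(aerial, ground, attributes) -> List[Annotation]:
        # Attach RGB-derived attributes to spatially matching ground elements
        # (matching logic omitted), then concatenate with the aerial elements.
        return list(aerial) + list(ground)
```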

FIGURE 3 Smart data processing in THMA detects objects in 3D point clouds, 2.5D Bird’s-Eye-View (BEV) images, and RGB images, fusing resulting aerial and ground object detection results for higher accuracy. The 2D segmentation framework is based on Tao, Sapra, and Catanzaro (2020).

Data acquisition

We first introduce the collection of raw 3D point cloud data and the generation process for 2.5D bird’s-eye-view (BEV) images. The raw data of the Tencent HD map comprise RGB images, GPS position and attitude data, and laser-generated 3D point clouds. For our model training, we utilize the latest laser scanner, mounted at the vehicle’s tail at a 45° angle and primarily focused on scanning the road surface. Our dataset surpasses other HD map datasets in point cloud density, clear differentiation between light and dark reflection intensities, and distinct visual features of ground elements.

Moreover, our 3D point cloud dataset is gathered from complex traffic scenes in densely populated Chinese cities like Beijing, Shanghai, Shenzhen, Guangzhou, Hangzhou, and Wuhan, each with a population exceeding 10 million. These traffic scenarios encompass highways, urban expressways, ordinary urban roads, feeder roads, rural roads, tunnels, and interchanges, which are currently underrepresented in other HD map systems such as nuScenes (Caesar et al. 2020), Waymo (Sun et al. 2020), and Argoverse (Wilson et al. 2021). Our point cloud scanning emphasizes capturing road features with high density, high resolution, and significant visual features of reflection intensity, which accentuates the refined detection of traffic attributes in accordance with HD map production requirements. Consequently, THMA generates data that represents diverse traffic conditions, making it the optimal source for next-generation HD maps.

For the detection of 3D aerial elements, the optimal approach is to analyze (segment and detect objects in) 3D point cloud data. However, when it comes to detecting ground elements, 2.5D BEV images, that is, top–down parallel projections of the 3D point cloud, provide better accuracy and inference speed. One of the key innovations of THMA is its ability to efficiently combine 2.5D BEV images and 3D point clouds. Sample data, consisting of 3D point clouds and 2.5D BEV images, is illustrated in Figure 4.

FIGURE 4 Qualitative results for the deployed THMA system: the annotations generated from 3D object detection and 2.5D/2D segmentation and object detection are merged into the HD map system and published to downstream map makers for use.

The 2.5D BEV projection images we utilize are generated from the 3D laser point cloud using a top–down parallel projection with minor modifications, such as car removal. For the original 3D point cloud data, we select a resolution of 0.05 m and calculate the center coordinates, image range, and point cloud range of each projection image according to the trajectory to determine the coordinate conversion parameters. Next, we convert the point cloud within the selected range to the Mercator coordinate system and perform elevation filtering on the 3D point cloud, retaining only the points near the ground. For the points falling within each pixel, we assign the reflection intensity value, the highest elevation value, and the lowest elevation value to the three channels of the 2.5D BEV output, respectively, and normalize the pixel range to 0–255.

The 2.5D BEV projection images generated through this process contain rich texture information. Each image is rotated in the direction of vehicle travel. The resulting image size is 1024 × 1024, and the pixel resolution is 0.05 m. Thanks to the quality and grayscale enhancement of the original point cloud, the reflection intensity channel of the BEV image clearly reflects the texture characteristics of the road surface. Semantic information such as lane markings, ground signs, and zebra crossings in traffic scenes can be distinguished according to the light and dark changes of the reflection intensity. Additionally, each pixel records the highest and lowest elevation values, which helps differentiate the ground from curbs and guardrails, elements that are challenging to detect in single-channel 2D BEV images.
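As an illustration of this projection step, the following minimal sketch builds a 3-channel BEV image (reflection intensity, highest elevation, lowest elevation) at 0.05 m resolution. The function name, the `z_window` ground-filtering parameter, and the per-channel normalization details are assumptions made for this example, not the exact THMA implementation.

```python
import numpy as np

def project_to_bev(points, intensity, center_xy, resolution=0.05, size=1024,
                   ground_z=0.0, z_window=2.5):
    """Sketch: project elevation-filtered points into a 3-channel 2.5D BEV image
    (intensity, max elevation, min elevation), normalized to 0-255.
    `points` is an N x 3 array already converted to the Mercator system and
    rotated to the direction of travel; `z_window` is an assumed parameter."""
    half = size * resolution / 2.0
    # Keep only points near the ground and inside the image footprint.
    keep = (np.abs(points[:, 2] - ground_z) < z_window) & \
           (np.abs(points[:, 0] - center_xy[0]) < half) & \
           (np.abs(points[:, 1] - center_xy[1]) < half)
    pts, inten = points[keep], intensity[keep]

    cols = ((pts[:, 0] - center_xy[0] + half) / resolution).astype(int).clip(0, size - 1)
    rows = ((pts[:, 1] - center_xy[1] + half) / resolution).astype(int).clip(0, size - 1)

    bev = np.zeros((size, size, 3), dtype=np.float32)
    z_max = np.full((size, size), -np.inf)
    z_min = np.full((size, size), np.inf)
    for r, c, z, i in zip(rows, cols, pts[:, 2], inten):
        bev[r, c, 0] = max(bev[r, c, 0], i)   # reflection intensity channel
        z_max[r, c] = max(z_max[r, c], z)     # highest elevation channel
        z_min[r, c] = min(z_min[r, c], z)     # lowest elevation channel
    occupied = np.isfinite(z_max)
    bev[occupied, 1] = z_max[occupied]
    bev[occupied, 2] = z_min[occupied]

    # Normalize each channel to the 0-255 pixel range.
    for ch in range(3):
        v = bev[..., ch]
        if v.max() > v.min():
            bev[..., ch] = (v - v.min()) / (v.max() - v.min()) * 255.0
    return bev.astype(np.uint8)
```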

Smart data processing—3D point cloud

3D Point Cloud Object Detection

3D objects exhibit a wide range of shapes and sizes. Typically, 2D and 3D object detection algorithms are based on bounding box detection. However, these algorithms are only suitable for objects with known orientations and aspect ratios. For objects without defined directions, it becomes challenging or infeasible to define the corners and size of the bounding box. Even if the label is forcibly defined, conflicts between different training samples may arise, leading to non-convergence of training or degradation of algorithm performance. In our case, we need a unified framework that can accommodate the diversity of object shapes, sizes, and distributions.

Taking into account the above considerations and previous work in HD map labeling (Yang, Liang, and Urtasun 2018a; Yang, Luo, and Urtasun 2018b), we propose a new unified end-to-end 3D model. The schematic diagram is shown in the 3D point cloud branch of Figure 3 and in Figure 5. The backbones of the model include 2D and 3D convolutions. The output is a universal descriptor that provides information on the detected objects, rather than just the bounding boxes. In cases where the object direction cannot be identified, the output descriptor can offer a unique description without ambiguity. For example, the descriptor does not explicitly define the yaw of a pole. Instead, it provides the apex and bottom points of the pole, and the yaw can be calculated from these quantities. Another example is the traffic cone, which is described using the vertex, center, and radius of the bottom. The corners of a traffic sign can likewise be computed following the same logic. Finally, the detected object is not required to be thick, flat, rectangular, or even planar.

FIGURE 5 Our solution with knowledge distillation for 3D point cloud object detection.

The resulting model framework is compatible with the diversity of object shapes. Further, we can detect multiple objects (multi-objects) appearing at the same location. Without loss of generality, the output descriptor for multi-objects can be expressed as

$$\mathbf{D} = \begin{bmatrix} p_0 & p_1 & \cdots & p_n & \vec{d}_0 & \vec{d}_1 & \cdots & \vec{d}_n \end{bmatrix},$$

where $p_i$ is the activation probability of the corresponding description vector $\vec{d}_i$, and $\vec{d}_i$ represents the descriptor for a single object.
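As a toy illustration of this descriptor idea, the snippet below pairs an activation probability with a shape-specific description vector; the field layout and the pole example are hypothetical, chosen only to mirror the apex/bottom description given in the text.

```python
# Hypothetical illustration of a "universal descriptor": each slot carries an
# activation probability p and a class-dependent description vector d, so
# orientation-free objects need no bounding-box yaw.
from dataclasses import dataclass
import numpy as np


@dataclass
class ObjectDescriptor:
    p: float          # activation probability of this slot
    d: np.ndarray     # description vector; its meaning depends on the object class


def pole_yaw_from_descriptor(d: np.ndarray) -> float:
    """Example only: if a pole descriptor stores apex and bottom points (x, y, z),
    the lean direction in the ground plane can be derived instead of stored."""
    apex, bottom = d[:3], d[3:6]
    return float(np.arctan2(apex[1] - bottom[1], apex[0] - bottom[0]))
```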

Knowledge distillation

Object labels in 3D point clouds often contain significant noise and labeling errors. These confounding factors influence performance, especially when using focal loss to address the class imbalance problem during training. To address this challenge, we adopt knowledge distillation in our 3D object detection framework. Knowledge distillation has been proven to yield significant performance improvements for complex 3D point cloud object detection and segmentation tasks (Hou et al. 2022). Specifically, we construct two training paths. The upper path, shown in Figure 5, is the basic model for 3D object detection, including a point feature extraction module, a point-to-voxel transformation module, an encoder–decoder model, and so forth. The refined ground truth generated by the basic model is combined with the original ground truth and then used as the supervision target for the lower training path. We adopt the output confidence of a positive sample to calculate the difference.

Let $G_t$ be the set of ground truth (for example, the ground truth bounding boxes) and $G_{\mathrm{out}}$ the output of the deep model. The refined ground truth $G_r$ can be computed as

$$G_r = (G_t \cap G_l) \cup G_h,$$
$$G_l = \{\, g \mid g \in G_{\mathrm{out}},\ \mathrm{Confidence}(g) > t_{\mathrm{low}} \,\},$$
$$G_h = \{\, g \mid g \in G_{\mathrm{out}},\ \mathrm{Confidence}(g) > t_{\mathrm{high}} \,\},$$

where $G_l$ is the low-confidence result and $G_h$ is the high-confidence result.
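A minimal sketch of this refinement rule in code follows; the matching function (e.g., an IoU test) and the threshold values are assumptions, since the text specifies only the set-level relation.

```python
def refine_ground_truth(gt_boxes, pred_boxes, confidences, match_fn,
                        t_low=0.3, t_high=0.7):
    """Sketch of the refinement rule G_r = (G_t ∩ G_l) ∪ G_h.
    `match_fn` (e.g., an IoU test) and the thresholds are illustrative assumptions."""
    g_low = [p for p, c in zip(pred_boxes, confidences) if c > t_low]
    g_high = [p for p, c in zip(pred_boxes, confidences) if c > t_high]

    # G_t ∩ G_l: keep ground-truth objects that the basic model also detects
    # with at least low confidence (matched, e.g., by IoU).
    kept_gt = [g for g in gt_boxes if any(match_fn(g, p) for p in g_low)]

    # ∪ G_h: add high-confidence detections, which may recover missed labels.
    return kept_gt + g_high
```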

3D point cloud annotation visualization

Sample 3D detection results from production are shown in Figure 6A,B. In Figure 6A, the red arrow indicates the result of our algorithm in a challenging example: the pole is detected correctly even though it stands between trees. In some circumstances, the auto-labeling algorithm even performs better than the human annotator (see Figure 6B). In this example, a part of the pole is occluded by the trees, and the human annotator labeled only the visible part. However, our algorithm correctly labels the missing right top point. These results demonstrate that the 3D object detection algorithm in THMA is robust and accurate.

FIGURE 6 Sample label outputs of our system: (A) pole detection results (in yellow); (B) difficult example of pole detection where auto-labeling exceeds human labeling ability, correctly labeling the part of the pole (red) that was not annotated by the human annotator.

We also present additional detection results for traffic lights in Figure 7. The 3D algorithm detects the traffic lights correctly, although they are small and sometimes densely arranged. In Figure 8, the detection result for a tunnel is shown. Unlike traffic lights, tunnels spread widely in space, and a large receptive field is required for their detection. Finally, Figure 9 shows that our model maintains good results even in concentrated and complicated traffic sign scenes.

FIGURE 7 (A) Multi-adjacent traffic light detection results. (B) Traffic light detection results across diverse angle distributions.
FIGURE 8 Sample tunnel detection results. Our system is robust to extreme view-point variations.
FIGURE 9 Sample traffic sign detection results. Our system addresses challenging detection scenarios, such as closely positioned signs.

Smart data processing—2.5D BEV

2.5D BEV images provide valuable information for ground object detection, enabling improved detection and segmentation of lane markings, ground signs, and zebra crossings. We outline the key model training steps for 2.5D BEV images below. Training includes three parts: (1) model weight initialization using masked self-supervised learning; (2) weakly supervised pretraining on noisy labels; and (3) model finetuning for each active learning loop.

Self-supervised pretraining

As discussed earlier, 2.5D BEV images often have missing labels and noise. To address these issues, we incorporate the latest self-supervised learning methods into our framework. Self-supervised learning focuses on designing auxiliary tasks that enable the model to learn meaningful representations from large-scale unlabeled data. We integrated the masked autoencoder (MAE) technique (He et al. 2022) into a CNN-based model and developed a new strategy for masked self-supervised pretraining by augmenting MAE with channel-based learning. We call this new strategy the masked channel autoencoder (MCAE); it is shown in Figure 10.

FIGURE 10 MCAE self-supervised learning mask recovery task. The model can learn how to solve the inpainting problem for going straight, turning left, and turning right landmarks with a very high input mask ratio.

Given a 2.5D BEV image of size H × W × 3, we first divide the input image into regular nonoverlapping patches of size 4 × 4. Next, we randomly sample a mask channel following a uniform distribution. We propose a positional-encoding-free technique using a MaskConv layer. The MaskConv layer can be implemented as an extension of standard convolution, where an additional channel is introduced and filled with the mask. In the early stage of the model, the mask information is concatenated channel-wise with the input representation, resulting in an H × W × 4 input.

The fundamental structure of MCAE consists of a deep CNN-transformer encoder and a lightweight decoder. The encoder is only fed with unmasked patches, while the decoder processes learnable masked tokens for image in-painting. Similar to other 3D space and video representation learning tasks (Bao, Dong, and Wei 2021; Feichtenhofer et al. 2022; Tong et al. 2022), we found MCAE to be both time-efficient and effective.
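A minimal PyTorch sketch of the MaskConv idea described above is given below, under the assumption that masked patches are zeroed and the binary patch mask is appended as a fourth input channel; the layer sizes and the 0.75 mask ratio are illustrative, not the THMA configuration.

```python
import torch
import torch.nn as nn


class MaskConv(nn.Module):
    """Sketch of a positional-encoding-free MaskConv: the binary patch mask is
    appended as a fourth channel, so a standard convolution sees H x W x 4 input."""

    def __init__(self, out_channels=64, kernel_size=4, stride=4):
        super().__init__()
        self.conv = nn.Conv2d(4, out_channels, kernel_size, stride)

    def forward(self, bev, mask):
        # bev: (B, 3, H, W) 2.5D BEV image; mask: (B, 1, H, W), 1 = masked patch.
        x = bev * (1.0 - mask)            # zero out masked patches
        x = torch.cat([x, mask], dim=1)   # concatenate mask as the 4th channel
        return self.conv(x)


def random_patch_mask(batch, height, width, patch=4, ratio=0.75):
    """Uniformly sample a per-patch mask at the 4 x 4 patch granularity."""
    gh, gw = height // patch, width // patch
    keep = (torch.rand(batch, 1, gh, gw) > ratio).float()
    mask = 1.0 - keep
    return mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
```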

Weakly supervised pretraining

During manual production, annotation experts integrate data from multiple data collections according to production operation specifications and job anti-counterfeiting requirements. As a result, large volumes of temporary datasets are generated during production. These datasets may contain only one class of annotations or may lack the job anti-counterfeiting process, leading to significant label noise. Using these images directly to generate training data would result in inadequate model performance.

In THMA, we believe these unclean data can also be utilized for pretraining our model. To address the label noise limitation, we generate a large number of improved, albeit incomplete, training samples through data mining, nonlast-night area filtering, and high-reliability area discrimination of the true value of the BEV images for weakly supervised learning (Zhou 2018). By using these large-scale uncleaned training samples together with limited annotations for pretraining, and then finetuning the model on the finely labeled sample set, we can significantly improve robustness in highly complex urban scenes.

Finetuning segmentation model for ground element detection

For all 2D and 2.5D ground elements, we selected SegFormer (Xie et al. 2021), a segmentation-based vision transformer architecture, as the backbone for 2.5D BEV image detection. The key advantage of a transformer-based method is that the attention maps of vision transformer encoders have larger receptive fields than traditional CNN encoders (see Figure 11). Different from the original vision transformer (ViT) (Dosovitskiy et al. 2020), SegFormer uses a sequence reduction process to reduce the amount of computation and accelerate convergence during model training (Xie et al. 2021; Wang et al. 2021). The overall structure of SegFormer consists of a hierarchical transformer encoder and a lightweight MLP decoder, which takes advantage of the transformer’s ability to produce both highly local and non-local attention, yielding powerful representations. Another advantage of SegFormer is that it can be embedded in our MCAE masked self-supervised learning pipeline, utilizing encoder weights pretrained via self-supervised and weakly supervised learning.
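The following sketch illustrates how such a pretrained encoder might be loaded before finetuning on the finely labeled BEV samples; the checkpoint layout, the `model.encoder` attribute, and the hyperparameters are assumptions for this example rather than THMA's actual training code.

```python
import torch
import torch.nn as nn

def load_pretrained_encoder(model: nn.Module, ckpt_path: str) -> nn.Module:
    """Hypothetical sketch: initialize a SegFormer-style encoder with weights from
    MCAE self-supervised / weakly supervised pretraining before finetuning.
    The checkpoint layout ({"encoder": state_dict}) is an assumption."""
    state = torch.load(ckpt_path, map_location="cpu")
    missing, unexpected = model.encoder.load_state_dict(state["encoder"], strict=False)
    print(f"encoder init: {len(missing)} missing / {len(unexpected)} unexpected keys")
    return model

def finetune(model: nn.Module, loader, epochs: int = 20, lr: float = 6e-5):
    """One finetuning pass on finely labeled BEV samples; hyperparameters are illustrative."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    criterion = nn.CrossEntropyLoss(ignore_index=255)   # 255 marks unlabeled pixels
    model.train()
    for _ in range(epochs):
        for bev, target in loader:
            logits = model(bev)                # (B, num_classes, H, W) segmentation logits
            loss = criterion(logits, target)   # target: (B, H, W) class indices
            optim.zero_grad()
            loss.backward()
            optim.step()
```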

FIGURE 11 Attention interactions between a selected token in a broken lane marking and the other visual tokens in the last encoder layer of SegFormer.

2.5D BEV segmentation visualization

Sample prediction results of the BEV SegFormer are shown in Figure 12. The scenario shown in Figure 12 is very complex, including lane marking type changes, lane number changes, and stop lines. We need to identify not only the geometric position of the lane markings accurately but also their attributes and the positions at which the number of lanes changes. Benefiting from self-supervised learning, weakly supervised pretraining, and the vision transformer backbone, our multitask model solves these tasks well.

FIGURE 12 Sample results of lane marking detection in BEV imagery from urban areas.

How does THMA work under active learning?

Active learning is a training data selection method for deep learning that leverages a trained model to process unlabeled raw datasets and annotate simple data. Simultaneously, it records failed cases where detection is challenging and sends them to humans. Human annotators can modify the failed annotations and add them to the training data, enhancing the model’s accuracy for perceiving objects in difficult conditions. Active learning automates the selection process while focusing on valuable and rarely seen data points, significantly improving the safety and recall of HD map annotation.

In the field of active learning, the confidence score of a model’s output is commonly used to distinguish between high-confidence and low-confidence labels. Specifically, a high-confidence threshold is set, and any model output above this threshold is considered a positive sample with a high degree of certainty. By identifying these positive samples, we can discover missed annotations. Similarly, a low threshold can be used to obtain negative samples and detect mislabeled data. However, for samples and positions that cannot be clearly determined, manual verification is necessary to ensure the quality of the data.
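A minimal sketch of this confidence-based triage is shown below; the threshold values and the dictionary layout are illustrative assumptions, not the THMA production code.

```python
def triage_predictions(predictions, t_high=0.9, t_low=0.1):
    """Sketch of confidence-based triage in an active learning loop.
    High-confidence outputs can go straight to the production line, very low
    scores flag potentially mislabeled data, and the uncertain middle band is
    routed to human annotators. Thresholds here are illustrative."""
    auto_accept, needs_review, likely_negative = [], [], []
    for pred in predictions:
        if pred["confidence"] >= t_high:
            auto_accept.append(pred)       # positive sample; can reveal missed labels
        elif pred["confidence"] <= t_low:
            likely_negative.append(pred)   # candidate mislabeled or false detection
        else:
            needs_review.append(pred)      # ambiguous: send to annotation experts
    return auto_accept, needs_review, likely_negative
```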

THMA relies on AI to scale annotations to extremely large volume datasets. The resulting active learning production pipeline, depicted in Figures 2 and 13, provides not only scalable infrastructure for training and inference but also a centralized data platform for metadata access. Once we have the confidence results of AI output, we can save the positive samples into the HD map production line and send the negative samples to annotation experts for review and relabeling. The updated labels for negative samples are then reintegrated into the HD maps and used to retrain the AI models in the next iteration. In this way, the THMA AI components form an end-to-end active learning loop (Jiao 2018; Haussmann et al. 2020).

FIGURE 13 Active learning loop in THMA production.

In practice, we typically update high-confidence positive and negative samples automatically only once a month. This approach strikes a balance between the benefits of active learning and the cost of computation. In a month, we can accumulate 1000 km of training data for annotation and for automatic model updates from the previous round. This allows us to maintain a high level of accuracy while minimizing the need for manual intervention.

Overall, this approach to active learning has proven effective in improving the efficiency and accuracy of data annotation, while minimizing the cost and effort required for manual verification.

THMA DEVELOPMENT AND DEPLOYMENT

The THMA framework, developed by the application research team (T Lab) at Tencent Map since 2020, took 1.5 years to build and improve. Initially, this system was an open loop, requiring numerous labeling workers to manually annotate training data and resulting in an update cycle of more than 5 weeks. After upgrading to the closed-loop active learning framework, the update cycle improved to 1–2 weeks. The modular design of THMA, combined with the closed-loop active learning framework, allows individual models for selected elements in the HD map to be tested separately and added to the production pipeline upon updates.

As discussed in previous sections, the entire architecture can be considered as a multitask framework, as different models are used for different elements. While each task can be tested individually, connections exist between tasks. For example, when detecting the change points of lane marking attributes, we must use the lane marking position and attribute information obtained by other model branches.

Python is the programming language used for deployment. To evaluate the model in each version, we use an independent subset of accurately annotated HD maps reviewed by map makers for validation. This subset includes 1000 km of HD map 3D point clouds, corresponding 3D aerial elements, and 2D ground element annotations. We update the evaluation results for each released version in the Tencent inner product documentation.

We evaluate overall system performance in terms of automation ratio, throughput, and labeling speed acceleration. Comparing the labeling results from THMA with human labeling results, the overall automation ratio exceeds 90%, accelerating labeling by more than 10 times. With its compact design, the system’s throughput is over 30,000 km/day.

THMA IMPACT

Next generation HD maps

THMA generates annotations for next-generation HD maps, which provide highly accurate, up-to-date, and realistic representations of traffic scenery. These next-generation HD maps are explicitly designed for Level 4 and Level 5 autonomous vehicles and include more abundant and fine-grained traffic scene information than currently deployed HD maps. They are expected to be widely adopted by advanced self-driving systems in the near future. In the following paragraphs, we explain the newest scene attributes of the THMA-produced next-generation HD maps in detail.

Previous work primarily applied older semantic segmentation deep learning algorithms, such as FCN (Long, Shelhamer, and Darrell 2015) and U-Net (Ronneberger, Fischer, and Brox 2015), to only identify commonly seen ground elements in HD map systems (Elhousni et al. 2020), while ignoring the class imbalance problem present in HD maps. THMA is one of the first methods to address this issue using active learning loops. In addition to the 20 types of lane markings and ground elements, we propose the detection of other rarely seen lane marking attribute change points, road waiting areas, and stop lines. Furthermore, THMA incorporates road separating facility modules to detect guardrails, curbs, and natural boundaries using 2.5D BEV images. These elements contribute to enhancing the safety of automated driving systems.

Another advantage of THMA is its ability to label a broader range of 3D aerial elements. This includes large-scale objects such as tunnels and small-scale objects like traffic lights. In terms of shape, there are linear objects like straight poles and curved ones, planar objects such as traffic signs, and thick objects like traffic lights and tunnels. We adopt a unified end-to-end framework when developing THMA. This framework generates a unified descriptor that adapts to the diversity of object shapes and the varying number of objects at specific locations.

Advantages of AI-driven annotation

THMA applies AI to HD map annotation. The advantages of this active learning-based, AI-driven annotation system are:

  • Massive data collection and cloud computation: Hundreds of thousands of kilometers of HD map raw data are collected by several data-collecting vehicles, allowing the HD map to be updated quickly. To process more data and train new models, Tencent Cloud, a secure, reliable, and high-performance cloud computing service provided by Tencent, is used.
  • Diverse training database: Due to the novelty of the framework, training relies on over 400,000 km of HD map data, which contains partially incomplete and inaccurate annotations for self-supervised and weakly supervised learning, further enabling model generalization.
  • Data and workflow management platform: With the powerful Tencent Cloud platform, high parallelism, traceability, caching workflows, and PB-level data management are enabled.
  • Complete labeling platform: The labeling platform consists of the HD map label platform, which produces the 3D HD map format data, and traditional labeling tools, which produce detection and segmentation data for the 2D and 3D models.
  • Powerful model zoo: The model zoo contains up-to-date 2D and 3D detection and segmentation models implemented in PyTorch. To make the system end-to-end and generalizable, the model decoders have been redesigned to adapt to the HD map data format. Many self-supervised and weakly supervised methods have been implemented, trained on Tencent Map’s GPU clusters, and deployed in the cloud.

Application payoff

THMA has been developed and deployed for 2 years and is used by thousands of annotation experts. To date, our system has produced over 400,000 km of HD map data. It has a record of serving almost 1000 workers to produce 30,000 km of HD map data per day, which, to the best of our knowledge, is among the most advanced in the industry. Over these 2 years of usage, the system has achieved the following business improvements:

  • 1. Efficiency: In a traditional auto-labeling system, efficiently developing a model for massive traffic data scenarios such as China requires at least several kilometers of point cloud data and tens of thousands of images, an annotation effort that would take a whole year. Due to end-to-end data recycling, intelligent data mining, and weakly supervised and self-supervised techniques, THMA reduces the required labeling time by an order of magnitude.
  • 2. Model generalization ability: By processing and learning from hundreds of thousands of kilometers of challenging HD map data, the labeling system achieves high precision and recall, as well as generalization ability in challenging urban scenery. As a result, THMA has set a record of serving 1000 map makers and producing several tens of thousands of kilometers per day.
  • 3. Iterative and incremental development: New requests from downstream smart city applications and self-driving companies are added over time. Since THMA follows a modular design approach around different subtasks, product updates can be performed without affecting the overall AI solution. Since deployment in 2021, we have regularly released the latest annotations to customers.

CONCLUSION

In this paper, we introduce the THMA system, a novel, end-to-end, and fully automatic AI system designed to label hundreds of thousands of kilometers of HD maps of densely populated urban environments for autonomous driving applications. The system has been designed and deployed in production by the Tencent Map T Lab team and their users since 2021, generating 30,000 km of HD map data per day and serving over 1000 labeling workers. To the best of our knowledge, the resulting system is one of the largest in the world to date. The core active learning algorithm propagates model weights from existing Tencent large-scale HD map datasets to newly acquired data, allowing fully automatic labeling and human-in-the-loop labeling to be used together and significantly reducing the time and cost associated with traditional manual annotation techniques.

In future work, we plan to expand the existing system focused on lane detection to auto-labeling more complex label relationships. We also aim to continually leverage iterative and incremental development to further improve robustness.

CONFLICT OF INTEREST STATEMENT

The authors declare that there is no conflict of interest.

Biographies

  • Chao Zheng leads the Computer Vision Research team at Tencent Map, with a long-standing dedication to the field of autonomous driving. His research interests span across artificial intelligence, computer vision, and machine learning, with a particular focus on 3D perception and reconstruction within autonomous driving. His research achievements have been published in multiple top-tier conferences, including AAAI, ICCV, ECCV, and WACV, with one of his co-authored papers earning the IAAI Application Innovation Award.
  • Xu Cao received his M.S. in Computer Science from New York University in 2022 and his B.S. from Fudan University in 2020. He is the Co-founder of PediaMed.AI Lab. His research interests include AI for healthcare, AI for social good, and autonomous driving. His research achievements have been published in multiple top-tier AI conferences, including AAAI, IJCAI, ICASSP, UAI, and BIBM, with one of his co-authored papers earning the IAAI Application Innovation Award.
  • Kun Tang is a Machine Learning Researcher at Tencent Maps T.Lab. He is engaged in the development of Tencent’s automated high-definition maps. He obtained an M.S. in Mathematics and Applied Mathematics from Peking University in 2015. His main research areas are lane marking detection and segmentation of 3D point clouds.
  • Zhipeng Cao received his Ph.D. in 2015. During his Ph.D., his research directions were computer vision and partial differential equation-based image processing. Currently, his research fields include 3D point cloud detection and segmentation, image detection, AR/VR, face recognition, image deblurring, image super-resolution, and text-based image generation. He has led or participated in projects such as face recognition on the Huawei Mate Pro series, Huawei Cyberverse, and Tencent HD Map.
  • Elena Sizikova received her Ph.D. from the Graphics/Vision Lab in the Princeton Computer Science Department. She is interested in problems at the intersection of artificial intelligence (AI), regulatory science, medical imaging, and computer vision. Specifically, her research addresses challenges associated with training and evaluating neural networks for medical imaging problems with only limited access to large-scale datasets.
  • Tong Zhou is a Machine Learning Engineer at Tencent Maps T.Lab, primarily focusing on high-definition map ground feature detection, which includes the identification and geometric topology generation of elements such as lane markings and diversion zones. Additionally, Zhou is involved in the generation and management of high-precision massive sample data.
  • Erlong Li is a Machine Learning Engineer at Tencent Maps T.Lab, specializing in the recognition of key elements such as lanes and the structured topology of lane markings in high-definition maps. He has published several papers and patents related to autonomous driving.
  • Ao Liu is a Machine Learning Engineer at Tencent Maps T.Lab. His main work involves the recognition and topology of key elements in high-definition maps, including point cloud segmentation and RGB image segmentation.
  • Shengtao Zou is a Machine Learning Engineer at Tencent Maps T.Lab. His main fields of study are Artificial Intelligence and Computer Graphics. His recent research has focused on 3D point cloud detection and sample generation. He has many papers and patents in the above fields.
  • Xinrui Yan completed her M.S. in Control Science and Engineering in 2022. Her main interest during her graduate studies was the application of point cloud-based perception in the field of autonomous driving. She is now a Machine Learning Engineer at Tencent Maps T.Lab, focusing on applying point cloud-based perception algorithms to the automatic generation of high-precision maps.
  • Shuqi Mei is the Director of Computer Vision Research team at Tencent Map. He earned his Ph.D. in 2008, with a focus on Computer Vision and Visual Servo Control of Mobile Robot. He has since worked at esteemed companies such as SONY, Alibaba, and Tencent. Currently, Dr Mei leads a team at Tencent Map, where he is responsible for algorithm development and application for building maps with high efficiency and quality.