As a fabless chip-design enterprise, Canaan Creative hopes to devote as much energy and resources as possible to the chip design and development work we excel at. Through cooperation with AWS, we have obtained the world’s leading IT infrastructure rapidly with lower human and resource investment to support multiple chip design projects, which has obviously improved the progress of our chip design work with more predictable project cycle and more than 30 percent total cost reduction.
Wu Jingjie, Technology Vice President of Canaan Creative

Founded in 2013, Canaan Creative released the world’s first blockchain computing device based on ASIC chips the same year, leading the industry into the ASIC era and accumulating rich experience in chip mass production along the way. In 2016, the mass production of 16nm products indicated that Canaan Creative had become one of the first companies in mainland China to use advanced process technology. Since 2018, Canaan Creative has achieved mass production of the world’s first 7nm chips based on in-house developed technology, along with Kendryte K210, a commercial RISC-V based intelligent edge computing chip developed in-house. Its KPU, an AI neural network accelerator, is completely based on independent research and development. Currently, Canaan Creative has achieved the mass production of tens of millions of chips per month, with its products and services covering more than 60 countries and regions around the globe. It has established a strong customer base in the United States, Canada, Sweden, Iceland, Bosnia and Herzegovina, Malaysia, South Korea, Russia, Armenia, Hong Kong, and other countries and regions. In the future, building on its chip R&D and high-performance computing experience, Canaan Creative will work with its business partners to promote the adoption of AI in various fields to make life better.

With the gradual evolution of the semiconductor manufacturing process, the modern chip design industry increasingly relies on a variety of EDA (Electronic Design Automation) software tools to assist designers. In practice, however, Canaan Creative found that these design tools set high barriers to entry for an enterprise’s IT infrastructure. The large amount of manpower and material resources required to build a data center that meets these demands often places an additional burden on the design workflow.

First of all, designers need to use different software tools at different stages of chip design, and these tools rarely place the same requirements on IT infrastructure. For example, some tools depend heavily on CPU performance and stability, some need large amounts of memory, and others need file system storage with high IOPS and throughput. When planning an on-premises data center, it is difficult for chip design companies to balance these different performance requirements within a reasonable architecture and budget. In addition, a wide variety of devices increases the difficulty of deployment, operation, and maintenance.

Secondly, chip design is one of the application scenarios of high-performance computing, and the performance requirements of modern chip design software set high barriers for IT infrastructure. It is not uncommon for a single computing task to schedule hundreds of CPU cores, occupy terabytes of memory, and run continuously for days. At the same time, workloads involving tens of millions of small files, or single files of potentially tens of terabytes, also have to be handled. For chip-design companies, it is not easy to design, operate, and maintain a high-performance computing cluster environment of this scale and keep it running stably. Seemingly minor errors and failures may cause major risks such as calculation task failure, data loss, and time delay.

Finally, due to the nature of the semiconductor industry supply chain, the workloads for chip design are usually very spiky. Whether the spike is a short-term one caused by designers working intensively at the same point in a project or a long-term one caused by overall project scheduling, deploying a large amount of high-specification equipment to meet peak resource demand makes substantial idle waste nearly unavoidable, with an annual utilization rate of less than ten percent.
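The effect of low utilization on cost can be sketched with a little arithmetic. The snippet below is only an illustration: the two prices are hypothetical figures chosen for the example, not Canaan Creative's or AWS's numbers, and `cost_per_used_core_hour` is a helper name introduced here.

```python
def cost_per_used_core_hour(hourly_cost_per_core: float, utilization: float) -> float:
    """Effective cost of each core-hour that is actually used on owned hardware.

    Owned hardware costs money whether it is busy or idle, so the cost of the
    useful hours grows as utilization falls.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost_per_core / utilization


# Hypothetical prices, for illustration only.
OWNED_COST = 0.02      # amortized cost per core-hour of owned hardware
ON_DEMAND_COST = 0.05  # pay-as-you-go price per core-hour

# At the ten percent utilization mentioned above, every useful core-hour
# carries the cost of nine idle ones.
owned_effective = cost_per_used_core_hour(OWNED_COST, 0.10)

# Utilization at which owning and paying as you go cost the same.
breakeven_utilization = OWNED_COST / ON_DEMAND_COST

print(f"owned hardware at 10% utilization: {owned_effective:.2f} per used core-hour")
print(f"break-even utilization: {breakeven_utilization:.0%}")
```

Under these assumed prices, owned hardware is cheaper per core-hour on paper but becomes the more expensive option once utilization drops below the break-even point.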

In addition to the above technical difficulties, Canaan Creative faced many pain points in project management. Because the company was limited by the size of its local data center, multiple projects or teams working in parallel often had to queue serially for IT infrastructure resources, and the unpredictable progress of each project made tasks difficult to schedule. Further, with different teams and projects sharing the same IT infrastructure, it was difficult to gather statistics for benefit assessments, such as resource utilization rates and cost allocation. In addition, when projects were at their peak, sudden equipment procurement disrupted financial planning, and the long and uncontrollable procurement and deployment cycle raised the risk of project delays. Finally, branch offices in different regions that built IT infrastructure separately were subject to decentralized management and increased security risks, while connecting them all to the same local data center strained the performance, stability, and configuration flexibility of the company's network infrastructure.

After years of relying on its local data center, Canaan Creative turned its attention to the cloud, hoping to solve its problems with the help of the flexibility and powerful features of cloud computing. When asked why it chose AWS as its cloud provider, Wu Jingjie, vice president of Canaan Creative, said, "Because of the uncertainty of innovation itself, we hoped our exploration could be based on a more stable platform. There is no doubt about the reputation and position of AWS in the global cloud computing market. At the same time, the AWS Cloud's attention to security, architecture, and tools; deep understanding of the high-performance computing and semiconductor industries' needs; and many successful cases in the industry strengthened our confidence in choosing AWS."

Security topped the list of priorities for Canaan Creative's chip-design business. By choosing AWS services and solutions, Canaan Creative has built a comprehensive system covering data security, network security, operation security, and audit review. Using the AWS Direct Connect service, Canaan Creative has established a dedicated line connection between its on-premises data center and several AWS regions, which not only achieves better network connection performance but also ensures the safe transmission of data through encrypted communication. For different projects and teams, the company created multiple Amazon Virtual Private Cloud (Amazon VPC) networks to build logically isolated, cloud-based network environments, establishing a multi-cluster security boundary, isolating key resources from external networks with private subnets, and controlling internal traffic access permissions through security groups. Using the AWS Identity and Access Management (IAM) API, the system integrates with local directory management and authentication systems to authorize and authenticate the personnel who provision resources from the cloud. For sensitive data, Canaan Creative uses AWS Key Management Service (KMS) to encrypt and protect its storage services: Amazon Elastic File System (Amazon EFS), Amazon FSx for Lustre, and Amazon Elastic Block Store (Amazon EBS). In addition, an encrypted VPN connection is established from each branch office to the cloud, and resource and operation logs are collected through AWS CloudTrail and Amazon CloudWatch for future audit purposes. Finally, the company uses encrypted Amazon Simple Storage Service (Amazon S3) for centralized storage and remote archive backup on the cloud.
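The idea of giving each project its own logically isolated network can be illustrated with a small address-planning sketch. It uses only Python's standard `ipaddress` module and makes no AWS calls. The address range, prefix sizes, project names, and the `plan_project_networks` helper are all assumptions made for the example; the article does not describe Canaan Creative's actual address plan.

```python
import ipaddress


def plan_project_networks(supernet, projects, vpc_prefix=16, subnet_prefix=24):
    """Give each project its own VPC CIDR block and one private subnet inside it."""
    vpc_blocks = ipaddress.ip_network(supernet).subnets(new_prefix=vpc_prefix)
    plan = {}
    for project, vpc in zip(projects, vpc_blocks):
        # The first subnet of each block stands in for the private subnet.
        private_subnet = next(vpc.subnets(new_prefix=subnet_prefix))
        plan[project] = {"vpc": vpc, "private_subnet": private_subnet}
    if len(plan) != len(projects):
        raise ValueError("not enough address space, or duplicate project names")
    return plan


# Hypothetical projects and address range.
plan = plan_project_networks("10.0.0.0/8", ["project-a", "project-b"])
for name, net in plan.items():
    print(name, net["vpc"], net["private_subnet"])

# Non-overlapping blocks are what keeps the project networks logically separate.
vpcs = [net["vpc"] for net in plan.values()]
isolated = not vpcs[0].overlaps(vpcs[1])
print("isolated:", isolated)
```

Address separation is only one layer of the design described above; security groups, IAM integration, and KMS encryption would sit on top of a plan like this.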

In addition to its basic network and authentication systems, Canaan Creative uses AWS ParallelCluster to deploy and manage SGE-based HPC clusters in the AWS Cloud. By writing different AWS CloudFormation templates, the company can build the infrastructure environments required for different design phases within minutes. For example, for computationally intensive tasks, it chooses Amazon Elastic Compute Cloud (Amazon EC2) Z1d instances or the compute-optimized C5 instances with core frequencies of up to 4.0GHz. For memory-intensive tasks, Canaan Creative opts for X1e instances with up to 3.9TB of memory or memory-optimized R5 instances. To meet the demand for high IOPS and high throughput in file storage at different stages of computing tasks, Canaan Creative chose Amazon FSx for Lustre, a fully managed high-performance file system, to obtain throughput of up to hundreds of gigabytes per second and millions of read/write IOPS, while also meeting high data availability requirements. In areas where Amazon FSx for Lustre is not yet available, a GlusterFS cluster is deployed on I3 instances to build the high-performance shared file system the software requires. In addition, when high network bandwidth between instances is required, placement groups are used to achieve low network latency and high network throughput.
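The matching of task profiles to instance families described above could be expressed as a simple rule, sketched below. The instance families are the ones named in the text, but the `Task` fields, the thresholds, the task names, and the `choose_placement` helper are hypothetical and do not reflect Canaan Creative's actual selection logic, which the article says is driven by benchmark tests.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cores: int
    memory_gib: int
    single_thread_bound: bool = False  # limited by per-core clock speed
    tightly_coupled: bool = False      # needs low-latency links between nodes


def choose_placement(task, high_mem_per_core_gib=8, very_large_memory_gib=1024):
    """Pick an instance family from a task profile (thresholds are assumptions)."""
    if task.memory_gib >= very_large_memory_gib:
        family = "x1e"   # very large memory footprint
    elif task.memory_gib / task.cores >= high_mem_per_core_gib:
        family = "r5"    # memory-heavy relative to core count
    elif task.single_thread_bound:
        family = "z1d"   # favors high clock speed
    else:
        family = "c5"    # general compute-intensive work
    return {"family": family, "placement_group": task.tightly_coupled}


# Hypothetical EDA-style workloads.
timing = Task("timing-analysis", cores=8, memory_gib=32, single_thread_bound=True)
signoff = Task("full-chip-signoff", cores=64, memory_gib=3000)
print(choose_placement(timing))
print(choose_placement(signoff))
```

In practice a rule like this would be one input to the cluster templates rather than a replacement for benchmarking each tool.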

To control costs, Canaan Creative benchmarks its different computing tasks and selects the most cost-effective service and instance type for each deployment. Each resource is checked regularly for idleness, and idle resources are released to reduce waste. For long-term stable load and short-term predictable burst load, the company also uses Amazon EC2 Reserved Instances and Spot Instances to take advantage of service discounts.
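A minimal sketch of the idle-release check and the purchasing choice might look like the following. The ten-minute idle limit, the node names, and both helper functions are hypothetical. The mapping of steady load to Reserved Instances and burst load to Spot Instances is an assumption read from the sentence above, and the On-Demand fallback is added only to make the example complete.

```python
from datetime import datetime, timedelta


def nodes_to_release(last_busy, now, idle_limit=timedelta(minutes=10)):
    """Return the nodes that have been idle for at least idle_limit, sorted by name."""
    return sorted(node for node, seen in last_busy.items() if now - seen >= idle_limit)


def purchase_option(load_kind):
    """Map a load pattern to a purchasing option (assumed mapping, see lead-in)."""
    options = {"steady": "reserved", "burst": "spot"}
    return options.get(load_kind, "on-demand")


# Fixed timestamps keep the example deterministic.
now = datetime(2020, 1, 1, 12, 0)
last_busy = {
    "compute-1": now - timedelta(minutes=45),
    "compute-2": now - timedelta(minutes=3),
    "compute-3": now - timedelta(minutes=10),
}

idle = nodes_to_release(last_busy, now)
print("release:", idle)
print("steady load ->", purchase_option("steady"))
print("burst load ->", purchase_option("burst"))
```

A real deployment would obtain the last-busy times from the scheduler and call the cloud API to terminate the selected nodes; both steps are left out here.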

Figure 1: Schematic diagram of Canaan Creative's AWS-based architecture

 

By transferring its chip-design workload to AWS, Canaan Creative can obtain nearly unlimited infrastructure expansion capability in just minutes, no longer needs to worry about shortages of specific resources, and has more flexibility in trading off time cost against hard cost. At the same time, teams and projects can work in a multi-cluster manner, which saves a significant amount of "queuing" time and thus shortens the overall time-to-market for new chip R&D. After a given task is completed, now-idle resources can be released promptly to save costs and truly achieve "pay-as-you-go" pricing.

"By using AWS services, we have objectively improved our overall security control level. Moreover, the AWS Cloud's infrastructure operation and maintenance management level is far higher than ours, and the service operation of AWS is more stable than our self-built data center. Since the semiconductor industry has been able to accept the authorization of IP manufacturers to carry out production through contract-manufacturing organizations, it shouldn't be far from widely accepting cloud computing services to improve its IT support capabilities," says Wu Jingjie.

To learn how to accelerate innovation, optimize production, and provide advanced products and services through the AWS Cloud, please visit the Semiconductor and Electronics details page.