Broadcom reduces workbench testing time and cost with AWS Lambda and AWS Glue

Introduction

Workbench testing is a key step in the manufacturing of semiconductor chips. Testing consists of sending electrical signals to semiconductor devices and comparing output signals to expected values. It is essential to ensure that the produced devices work according to specifications.

Workbench testing is also time-consuming and costly, so we do not want to run more tests than we have to. Our data analysis team at Broadcom Wireless Semiconductor Division designed a scalable process that analyzes test result files and identifies the tests that we can remove because they carry redundant information. In this post, we show how we implemented this process on AWS using AWS Lambda and AWS Glue, with the help of the Amazon Machine Learning Solutions Lab.

Solution

We call our system DPR, for “Dynamic Parameter Reduction”. The diagram below (Figure 1) describes its architecture.

In the following sections we go deeper into some of the key steps:

File preprocessing with AWS Lambda
Correlation computing with AWS Glue
Optimal tests computing (combinatorial optimization)
Result consolidation

Preprocessing the test files with AWS Lambda

The test files follow a proprietary format, which we preprocess into well-formed comma-separated values (CSV) files for data analysis. Each file can be preprocessed independently of the others and needs little compute or memory, so we deploy the preprocessing function as a simple AWS Lambda function, using a Python package. Once the function is deployed, the Controller invokes it asynchronously on each test file. For each file, the Lambda function downloads the file from Amazon S3 into memory, transforms it into a well-formed tabular format, and uploads it back to S3 as a CSV file. The Lambda function also updates its status in Amazon Aurora so that the Controller can monitor progress.
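To make the download, transform, and upload cycle concrete, here is a minimal sketch of what such a handler could look like. It is not our production code: the "name=value;name=value" input format, the event fields "bucket" and "key", and the returned status dictionary are all hypothetical stand-ins, and the Aurora status update is left out. The S3 client is passed in so the sketch can run without AWS access.

```python
import csv
import io


def to_csv(raw_text):
    """Convert a made-up 'name=value;name=value' line format into CSV text.

    The real test-file format is proprietary; this parser is a placeholder.
    """
    records = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        records.append(dict(field.split("=", 1) for field in line.split(";")))
    # Use the union of all field names, in first-seen order, as the header.
    header = list(dict.fromkeys(name for rec in records for name in rec))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


def handler(event, context, s3=None):
    """Lambda-style entry point: read one file from S3, write one CSV back."""
    if s3 is None:
        import boto3  # imported lazily so the sketch can run without AWS
        s3 = boto3.client("s3")
    bucket, key = event["bucket"], event["key"]  # hypothetical event fields
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    csv_key = key.rsplit(".", 1)[0] + ".csv"
    s3.put_object(Bucket=bucket, Key=csv_key, Body=to_csv(raw).encode("utf-8"))
    return {"status": "DONE", "output_key": csv_key}
```

Because the whole file is held in memory, the memory setting of the function has to match the file size.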

Because the test files vary in size, we go a step further to optimize Lambda's use of processing power and memory. We classify test files into tiers based on the number of data samples, and we configure a Lambda function for each tier accordingly, as shown in Table 1 below. Our AWS account has a concurrency quota of 2,500 in the Singapore Region, so we reserve 2,000 for file processing and leave 500 for other uses. We split the 2,000 across the Lambda tiers based on our analysis of the file size distribution. For example, we cap the concurrency of the 512 MB tier at 310: when more than 310 files for that tier arrive, the Controller queues the excess and invokes a new Lambda function for the next queued file each time a running one completes. To date, we have processed all our files with 3 GB of RAM or less.
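The tiering and queueing logic can be sketched as follows. This is an illustration only: the sample-count thresholds, the 1,024 MB and 3,072 MB tiers, and their caps are invented for the example (only the 512 MB cap of 310 comes from our setup), and the names pick_tier and TierDispatcher are hypothetical. The invoke callable stands in for an asynchronous Lambda invocation.

```python
from collections import deque

# Hypothetical tiers: (max samples, Lambda memory in MB, concurrency cap).
# Only the 512 MB cap of 310 comes from the text; the rest is made up,
# chosen so that the caps add up to the 2,000 reserved for file processing.
TIERS = [
    (10_000, 512, 310),
    (100_000, 1024, 690),
    (float("inf"), 3072, 1000),
]


def pick_tier(sample_count):
    """Return the memory size (MB) of the smallest tier that fits the file."""
    for max_samples, memory_mb, _cap in TIERS:
        if sample_count <= max_samples:
            return memory_mb
    raise ValueError("no tier configured for this file")  # defensive only


class TierDispatcher:
    """Tracks running invocations per tier and queues the overflow."""

    def __init__(self, invoke):
        self.invoke = invoke  # callable(file_key, memory_mb)
        self.caps = {mem: cap for _max, mem, cap in TIERS}
        self.running = {mem: 0 for mem in self.caps}
        self.queues = {mem: deque() for mem in self.caps}

    def submit(self, file_key, sample_count):
        """Invoke immediately if the tier has capacity, otherwise queue."""
        tier = pick_tier(sample_count)
        if self.running[tier] < self.caps[tier]:
            self.running[tier] += 1
            self.invoke(file_key, tier)
        else:
            self.queues[tier].append(file_key)

    def on_complete(self, tier):
        """Call when one invocation in `tier` finishes."""
        if self.queues[tier]:
            # The freed slot goes straight to the next queued file.
            self.invoke(self.queues[tier].popleft(), tier)
        else:
            self.running[tier] -= 1
```

Keeping one queue per tier means that a burst of small files never blocks the larger tiers, and the reverse.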

Computing the correlations with AWS Glue

To ensure a scalable implementation in terms of the number and size of test files, we deployed the correlation function to AWS Glue as a Spark job, using a custom script. The custom script, again in Python, performs four tasks: loading the CSV files into a Spark DataFrame; removing outliers; computing the correlations, mean, and standard deviation; and saving the result as a Parquet file in S3.

Because our data varies in scale, we use Spark's MinMaxScaler to rescale it to the 0-1 range before computing correlations. To tune the Glue job, we adjusted the number of workers, spark.driver.maxResultSize, and spark.rpc.message.maxSize for better performance.
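For readers who want to see the arithmetic behind the scaling and correlation steps, the following plain-Python sketch works through it on two made-up parameters. It is not the Glue script and does not use Spark; the function names and sample values are assumptions chosen for illustration, and the choice of 0.5 for a constant column is a convention of this sketch.

```python
from math import sqrt


def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so use the midpoint of the range.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Two made-up parameters measured on the same five devices,
# recorded on very different scales.
vout_mv = [3300.0, 3310.0, 3290.0, 3320.0, 3305.0]
gain_db = [12.0, 12.3, 11.8, 12.4, 12.0]

r = pearson_r(min_max_scale(vout_mv), min_max_scale(gain_db))
r_squared = r * r  # compared against the correlation threshold later on
```

In the real job the same quantities are computed for every pair of parameters across millions of devices, which is why the computation runs on Spark.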

Once the Controller detects that Lambda has preprocessed all files, it triggers the Glue job directly. The Glue job then uploads its output to S3. The Controller periodically polls the S3 bucket for the Glue result to check whether the work is done.
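The periodic check can be sketched as a simple polling loop. The function name wait_for_result, the one-minute interval, and the four-hour timeout are assumptions for illustration; result_exists is a hypothetical callable that, with boto3, could wrap an S3 head_object call on the expected output key and treat a not-found error as "not ready yet".

```python
import time


def wait_for_result(result_exists, interval_s=60.0, timeout_s=4 * 3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `result_exists()` until it returns True or the timeout expires.

    Returns True if the result appeared, False if the timeout was reached.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if result_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A timeout matters here: if the Glue job fails without writing output, the Controller should stop waiting and report the failure rather than poll forever.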

Computing the optimal set of tests

Using an open-source integer programming solver, we wrote a Python library to solve the variant of the dominating set problem that we face here. The library takes as inputs the cost of the physical tests, the dependencies between physical tests and test parameters, and the correlations between parameters. Its output is the optimal set of physical tests. The library is open source and written generically so that you can easily reuse it (see our GitHub). We deployed the library directly on the Optimizer.

When the Controller detects that the correlations are ready, it downloads the correlation results from S3, and it calls the Optimizer’s solver locally.
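To illustrate the kind of problem the Optimizer solves, the sketch below finds the cheapest set of tests by exhaustive search on a tiny made-up instance. It is not our library and does not use integer programming: brute force is swapped in here only because it fits in a few lines, and it is practical for a handful of tests at most. The coverage rule (a parameter is covered if it is measured, or if its r-squared with a measured parameter exceeds the threshold) is our reading of the problem stated as an assumption, and all names and numbers are hypothetical.

```python
from itertools import combinations


def cheapest_covering_tests(test_cost, test_params, r_squared, threshold=0.7):
    """Exhaustively find the cheapest set of tests that covers every parameter.

    A parameter is covered if a kept test measures it, or if its r-squared
    with some measured parameter exceeds `threshold`.
    """
    all_params = set().union(*test_params.values())

    def covered(kept_tests):
        measured = set().union(*(test_params[t] for t in kept_tests))
        for p in all_params - measured:
            if not any(r_squared.get(frozenset((p, q)), 0.0) > threshold
                       for q in measured):
                return False
        return True

    best, best_cost = None, float("inf")
    tests = sorted(test_cost)
    for size in range(len(tests) + 1):
        for subset in combinations(tests, size):
            cost = sum(test_cost[t] for t in subset)
            if cost < best_cost and covered(subset):
                best, best_cost = set(subset), cost
    return best, best_cost


# Made-up instance: three tests, four parameters.
test_cost = {"T1": 5.0, "T2": 3.0, "T3": 4.0}
test_params = {"T1": {"a", "b"}, "T2": {"c"}, "T3": {"d"}}
r_squared = {
    frozenset(("a", "c")): 0.9,
    frozenset(("c", "d")): 0.8,
    frozenset(("b", "d")): 0.75,
}

kept_tests, total_cost = cheapest_covering_tests(test_cost, test_params, r_squared)
```

In this instance neither of the two cheaper tests covers every parameter on its own, so the search settles on T1 alone; an integer programming solver reaches the same kind of answer without enumerating every subset.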

Consolidating into a human-friendly result

To provide the final result to engineers, the Controller translates the Optimizer output, together with other input metadata, into a human-friendly result in JSON and CSV formats. The Controller consolidates the result by presenting the removed test parameters alongside their correlated parameters, then uploads the JSON and CSV files to S3.

We call this custom format a “linktable” (see Table 2 below). Column A lists the dropped parameters. For each dropped parameter, engineers can check every kept parameter (column C) whose correlation (r² value) exceeds the threshold (0.7 in this case). In this example, we drop 9 parameters (9 groups); these dropped parameters and groups are well represented by 7 kept parameters (4 kept groups). As a result, engineers can test fewer parameters and groups, which reduces cost.
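The core of the consolidation step, pairing each dropped parameter with the kept parameters that represent it, can be sketched as follows. The column names, function names, and sample values are hypothetical and simpler than the real linktable, which also carries group information.

```python
import csv
import io

COLUMNS = ["dropped_parameter", "kept_parameter", "r2"]  # hypothetical names


def build_linktable(dropped, kept, r_squared, threshold=0.7):
    """List, for each dropped parameter, the kept parameters that represent it."""
    rows = []
    for d in dropped:
        for k in kept:
            r2 = r_squared.get(frozenset((d, k)), 0.0)
            if r2 > threshold:
                rows.append({"dropped_parameter": d, "kept_parameter": k, "r2": r2})
    # Group by dropped parameter, strongest links first within each group.
    rows.sort(key=lambda row: (row["dropped_parameter"], -row["r2"]))
    return rows


def to_csv_text(rows):
    """Serialize linktable rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up example: two dropped parameters, two kept parameters.
r_squared = {
    frozenset(("c", "a")): 0.9,
    frozenset(("c", "b")): 0.72,
    frozenset(("d", "b")): 0.75,
    frozenset(("d", "a")): 0.5,  # below the threshold, so not listed
}
rows = build_linktable(dropped=["c", "d"], kept=["a", "b"], r_squared=r_squared)
csv_text = to_csv_text(rows)
```

The same rows can be written out with json.dumps to produce the JSON version of the result.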

Performance

The achieved cost reduction depends on the level of redundancy in the initial set of tests; it can be as high as 30%. Importantly, our algorithm is optimal, which means that it identifies 100% of the potential savings, regardless of the amount.

We also estimated the cost of running the Glue jobs. Table 3 below shows a sample cost table from peak weeks of Glue jobs. For the highest-volume job, we used 30 Glue workers to complete a “5,242 parameters x 5,743,232 DUT (Device Under Test)” computation in 116 minutes, at a cost of $23.5.

Conclusion

We showed how you can use Amazon Elastic Container Service, AWS Lambda, AWS Glue, and combinatorial optimization to reduce the cost of semiconductor testing while maintaining quality. The approach we took is general and can be applied to other analytics and optimization problems in manufacturing or elsewhere. To dive deeper into the optimization algorithm, see our GitHub repository.

For more general information about optimization on AWS, we recommend these two other blog posts:

https://aws.amazon.com/blogs/machine-learning/solving-numerical-optimization-problems-like-scheduling-routing-and-allocation-with-amazon-sagemaker-processing/

https://aws.amazon.com/blogs/architecture/emerging-solutions-for-operations-research-on-aws/

For more information and resources for running semiconductor design on AWS, please visit our Semiconductor Page and our Semiconductor Resources Page.

AWS for Industries