AWS DevOps Blog

Python support GA: improving Python code quality using Amazon CodeGuru Reviewer

We are pleased to announce the GA launch of Python support in Amazon CodeGuru Reviewer, a service that helps you improve source code quality by automatically detecting hard-to-find defects. CodeGuru Reviewer is powered by program analysis and machine learning, and trained on best practices and hard-learned lessons across millions of code reviews on open-source projects and internally at Amazon.

Python is a widely used language for many use cases, including data science, web application development, and DevOps. We analyzed large code corpora and Python documentation to source hard-to-find coding issues and trained our detectors to provide best practice recommendations. We expect these recommendations to benefit beginners as well as expert Python programmers. We launched the Python support preview on December 3, 2020, and have been improving the feature since then in preparation for general availability (GA). This GA launch extends the coverage of CodeGuru Reviewer, increasing the number of recommendations from the existing detectors and adding new detectors that have been validated internally. In this post, we discuss the recommendations generated by these new detectors.

For CodeGuru Reviewer, three detectors have been added or improved:

  • Code maintainability: new detector
  • Input validation: new detector
  • Resource leaks: improved detector

In the following sections, we provide real-world examples of bugs or code defects that can be detected in each of the categories.

Code maintainability detector

As a Python developer, you want visibility into the quality metrics of your codebase so you can maintain it over the long term. These metrics cover various aspects of code structure, including size, complexity, coupling, and cohesion. They are measurable, and they matter to managers who plan resources for testing, maintenance, or refactoring in the long run. The metrics are derived from CodeGuru Reviewer’s in-house knowledge base. Violating good practices in terms of these metrics often leads to code with low readability, interpretability, or maintainability, which increases technical debt over the software development cycle.

For example, CodeGuru Reviewer identifies high fan-out for the following function. This example is patterned after actual code.

import os
import pickle
from datetime import datetime

import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# args is assumed to be parsed elsewhere (for example, via argparse)
def main():
    positive_data = pd.read_csv(args.positive_data_file)[args.columns].dropna().reset_index(drop=True)
    negative_data = (
        pd.read_csv(args.negative_data_file)[args.columns].dropna().reset_index(drop=True)
    )
    training_data = pd.concat([positive_data, negative_data], ignore_index=True)
    target = training_data[["label"]]
    features = training_data.drop(["label"], axis=1)
    x_train, x_test, y_train, y_test = train_test_split(
        features, target, test_size=0.3, random_state=0
    )
    print("Features: {}".format(args.columns))
    scaler_train = preprocessing.StandardScaler().fit(x_train)
    x_test = scaler_train.transform(x_test)
    x_train = scaler_train.transform(x_train)
    model = LogisticRegression(penalty="l2", class_weight="balanced", C=0.1).fit(x_train, y_train)
    pred_prob = model.predict_proba(x_test)[:, 1]
    predictions = [round(value) for value in pred_prob]
    print(confusion_matrix(y_test, predictions))
    predicted = pred_prob > args.threshold
    print(metrics.classification_report(y_test, predicted, digits=3))
    accuracy = model.score(x_test, y_test)
    print(accuracy)
    model_feat_imp = pd.Series(abs(model.coef_[0]), features.columns).sort_values(ascending=False)
    print(model_feat_imp)
    # save the trained model with a timestamp in the file name
    timestamp = datetime.now().strftime("%d%m%Y%H%M%S")
    pickle_file = "model_{}.pkl".format(timestamp)
    pickle_full_path = os.path.join(args.save, pickle_file)
    with open(pickle_full_path, "wb") as file:
        pickle.dump(model, file)
        print("Model is saved as {}".format(pickle_full_path))

In this main function, the developer prepares the data, transforms features, trains a machine learning model, evaluates its performance, and saves the model. Wrapping the whole process into a single function couples the code tightly to many other functions, which makes it prone to errors when any referenced function changes, and hard to test and maintain. The following screenshot shows the CodeGuru Reviewer recommendation.

CodeGuru recommendation indicating high fanout of the method

The developer accepted this recommendation and refactored the code to extract data preparation and transformation into a separate function.

def prepare_data():
    positive_data = pd.read_csv(args.positive_data_file)[args.columns].dropna().reset_index(drop=True)
    ...
    return x_train, y_train, x_test, y_test, features

def main():
    x_train, y_train, x_test, y_test, features = prepare_data()
    model = LogisticRegression(penalty="l2", class_weight="balanced", C=0.1).fit(x_train, y_train)
    ...

Input validation detector

Input validation is important to ensure software reliability and security. Processing unexpected inputs can lead to slowdowns, crashes, sensitive data exposure, or execution along unintended code paths, which in certain cases gives attackers an opportunity to compromise the code and, through it, the users whose data the system stores and processes. CodeGuru Reviewer performs careful analysis of the codebase, taking broad context into account, to determine code locations where input validation is insufficient or missing altogether. As part of this analysis, CodeGuru Reviewer accounts for validation patterns at the granularity of the entire codebase, as well as in the specific context of the function and parameters being checked. Until now, recommendations to add parameter validation were restricted to Java code. We are extending this detection category to Python, where the lack of static typing increases the risk of processing unexpected inputs.

Here is an example of a publicly accessible Python function where CodeGuru identified the need for more validation for the arn parameter:

def get_milestones_tasks_successors(arn: str, option: RmsEndpoint):
    logger = get_logger()
    ...
    fetched_data = None
    if option == RmsEndpoint.MT:
        ...
        fetched_data = [get_rms_data(AWS_RMSV2_HOST, rmsv2_auth, endpoint, predecessor=arn) for endpoint in endpoints]
        logger.info(f"Grabbed the milestone/task successors for ARN: {arn}:\n{fetched_data}")
    elif option == RmsEndpoint.MILESTONES:
        fetched_data = [get_rms_data(AWS_RMSV2_HOST, rmsv2_auth, RmsEndpoint.MILESTONES, predecessor=arn)]
        logger.info(f"Grabbed the milestone successors for ARN: {arn}:\n{fetched_data}")
    elif option == RmsEndpoint.TASKS:
        fetched_data = [get_rms_data(AWS_RMSV2_HOST, rmsv2_auth, RmsEndpoint.TASKS, predecessor=arn)]
        logger.info(f"Grabbed the task successors for ARN: {arn}:\n{fetched_data}")
    return fetched_data

The following screenshot shows the CodeGuru Reviewer recommendation.

CodeGuru recommendation indicating need for input validation

The developer acknowledged the need for additional validation and posted a new revision of their code with the following check added at the start of the function:

def get_milestones_tasks_successors(arn: str, option: RmsEndpoint):
    logger = get_logger()
    if arn is None or arn == "":
        logger.error(f"Invalid arn input: {arn}")
    ...
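The check above logs the invalid input, but depending on the elided code, execution may still continue with a bad value. A stricter variant is to fail fast. The following sketch is ours, not part of the developer’s change, and the regular expression is a simplified illustration of ARN shape rather than the full ARN grammar:

import re

# Simplified ARN shape for illustration only; real ARNs have more variants
ARN_PATTERN = re.compile(r"^arn:aws[\w-]*:[\w-]+:[\w-]*:\d*:.+$")

def validate_arn(arn: str) -> None:
    # Raise instead of only logging, so callers cannot proceed with bad input;
    # "not arn" covers both None and the empty string
    if not arn:
        raise ValueError("arn must be a non-empty string")
    if not ARN_PATTERN.match(arn):
        raise ValueError(f"arn does not look like a valid ARN: {arn}")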

Resource leak detector

The improved resource leak detector for Python increases the coverage of the detector that launched with the public preview. In addition to detecting leaks of open file descriptors, CodeGuru Reviewer now generates recommendations about leaks of a more comprehensive set of resources, including connections, sessions, sockets, and multiprocessing thread pools. This detector is driven by the same technology as the resource leak detector for Java programs. To learn more about the Java resource leak detector, refer to this earlier blog post.
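To illustrate the broader coverage with a minimal example of our own (not taken from a customer codebase), consider a multiprocessing thread pool: if it is never closed, its worker threads linger, whereas using the pool as a context manager releases them deterministically.

from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

# Leaky version: the pool's worker threads are never released
def leaky_squares(values):
    pool = ThreadPool(processes=4)
    return pool.map(square, values)

# Fixed version: the with block terminates the pool on exit, even if map raises
def safe_squares(values):
    with ThreadPool(processes=4) as pool:
        return pool.map(square, values)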

Let us now illustrate the Python resource leak detector through an example. Consider the following code snippet. The method get_issuer_for_endpoint is called in a separate thread for each URL endpoint. It opens a socket that connects to the endpoint, retrieves the certificate of the other side of the connection, extracts the issuer of that certificate, and adds it to the shared set issuers.

import ssl
import socket
import threading

# TIMEOUT and get_issuer_from_x509 are defined elsewhere in the codebase
def get_issuer_for_endpoint(url, port, issuers):
    s = socket.socket()
    s.settimeout(TIMEOUT)
    s.connect((url, port))
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    cert = ctx.wrap_socket(s, server_hostname=url).getpeercert(True)
    pem = bytes(ssl.DER_cert_to_PEM_cert(cert), 'utf-8')
    issuers.add(get_issuer_from_x509(pem))

def get_issuers_for_endpoints(urls):
    ...
    issuers = set()
    for endpoint in urls:
        thread = threading.Thread(target=get_issuer_for_endpoint, args=(endpoint, 443, issuers))
        thread.start()
    ...

In the above code snippet, the developer has forgotten to close the socket before returning from the method. Because these sockets are created within separate threads, they can remain open long after get_issuer_for_endpoint has returned. CodeGuru Reviewer generates the following recommendation.

CodeGuru recommendation indicating potential resource leak

The developer agreed this is a critical finding and fixed the code as shown below.

def get_issuer_for_endpoint(url, port, issuers):
    with socket.socket() as s:
        s.settimeout(TIMEOUT)
        s.connect((url, port))
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        cert = ctx.wrap_socket(s, server_hostname=url).getpeercert(True)
        pem = bytes(ssl.DER_cert_to_PEM_cert(cert), 'utf-8')
        issuers.add(get_issuer_from_x509(pem))
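The with statement guarantees that the socket is closed even if connect or the TLS handshake raises an exception. For resource types that expose a close() method but don’t support the context-manager protocol directly, the standard library’s contextlib.closing gives the same guarantee. The sketch below uses a hypothetical make_legacy_connection helper as a stand-in for such an API:

from contextlib import closing

# make_legacy_connection is a hypothetical API that returns an object
# with a close() method but no __enter__/__exit__ support
with closing(make_legacy_connection(host)) as conn:
    conn.send(payload)
# conn.close() has run by this point, even if send raised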

Conclusion

In this post, we discussed the new and improved code quality detection categories added since the Python support preview launched on December 3, 2020, and showed how to use these detection categories of CodeGuru Reviewer to identify hard-to-find code defects. This Python GA launch can significantly help you improve the code quality of your Python applications. CodeGuru Reviewer is now available for you to try, with a new lower and more predictable pricing model that offers price reductions of up to 90%. For more pricing information, see Amazon CodeGuru pricing.


About the Authors

Hangqi Zhao is a Data Scientist in the Amazon CodeGuru team. He is passionate about building data-driven solutions and intelligent tools powered by machine learning. Outside of work, he is a sports fan and also enjoys playing poker.


Omer Tripp is a Senior Applied Scientist in the Amazon CodeGuru team. His research work is at the intersection of programming languages, machine learning and security. Outside of work, Omer likes to stay physically active (through tennis, basketball, skiing and various other activities) as well as tour the US and the world with his family.


Pranav Garg is a Senior Applied Scientist in the Amazon CodeGuru team. His research is at the intersection of programming languages and machine learning, and he currently works on ML applications for code. Outside of work, you may find him spending time with his newborn baby.


Ran Fu is a Senior Product Manager in the Amazon CodeGuru team. He has deep customer empathy and loves exploring who the customers are, what their needs are, and why those needs matter. Outside of work, you may find him snowboarding in Keystone or Vail, Colorado.