Build high-quality machine learning models for financial services with Amazon SageMaker Autopilot

Machine learning (ML) is used throughout the financial services industry for tasks such as fraud detection, market surveillance, portfolio optimization, loan prediction, direct marketing, and many others.

These use cases require lines of business to build high-quality, performant models with minimal coding. This shortens the time from concept to production and delivers business value sooner. In this post, you learn how to use Amazon SageMaker Autopilot for common use cases in the financial services industry.

Autopilot automatically builds pipelines and trains and tunes the best ML models for classification or regression tasks on tabular data, while you maintain full control and visibility. It lets you build ML models automatically without prior ML experience: Autopilot analyzes the dataset, processes the data into features, and trains multiple optimized ML models.

Data scientists in financial services often work with highly imbalanced datasets, where one class heavily outnumbers the other. Examples include credit card fraud (only a small fraction of transactions are fraudulent) and bankruptcy (only a few companies file for bankruptcy). This post demonstrates how Autopilot handles class imbalance automatically, without additional input from the user.

Autopilot has announced the ability to tune models using the Area Under a Curve (AUC) metric, specifically the area under the Receiver Operating Characteristic (ROC) curve, as an objective metric in addition to F1, which is the default objective for binary classification tasks. F1 is the harmonic mean of precision (the fraction of predicted positives that are actually positive) and recall (the fraction of actual positives that the model detects). This post explains how to use AUC as the model evaluation metric on imbalanced data so that Autopilot generates the best-performing model.

The first use case is detecting credit card fraud from anonymized attributes. The dataset is highly imbalanced: more than 99% of transactions are not fraudulent. The second use case is predicting the bankruptcy of companies in Poland. Bankruptcy is represented as a binary response variable (bankrupt = 1, not bankrupt = 0), and more than 96% of the companies are not bankrupt.

Prerequisites

To follow these steps in your own environment, complete the following prerequisites:

Create an AWS Identity and Access Management (IAM) role that allows the Amazon SageMaker notebook to access Amazon Simple Storage Service (Amazon S3), where the data is stored.
Create an Amazon SageMaker notebook instance.
Create an S3 bucket to store the outputs of the machine learning models and other data.

Credit card fraud detection

In fraud detection, companies want to keep the false positive rate low while correctly identifying as many fraudulent transactions as possible. A false positive can lead to a legitimate customer's card transaction being canceled or put on hold, which makes for a poor customer experience. For this reason, and because a model can score high accuracy on imbalanced data while missing most of the fraud, accuracy is not the best metric for this problem. AUC and the F1 score are better choices.
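To make the point about accuracy concrete, here is a minimal sketch using the class counts of the credit card dataset printed later in this post. The "always legitimate" baseline is a hypothetical model introduced only for illustration.

```python
# Class counts taken from the credit card dataset shown later in this post.
n_legit, n_fraud = 284315, 492

# Hypothetical baseline: a "model" that labels every transaction as legitimate.
true_negatives = n_legit   # every legitimate transaction is labeled correctly
false_negatives = n_fraud  # every fraudulent transaction is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / (n_legit + n_fraud)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9983
print(f"recall   = {recall:.4f}")    # recall   = 0.0000
```

A model that never flags fraud is right 99.83% of the time and still catches nothing, which is why threshold-independent metrics such as AUC, or metrics that focus on the positive class such as F1, are more informative here.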

The following code loads and displays the data for the credit card fraud task:

import pandas as pd 
fraud_df = pd.read_csv('creditcard.csv') 
fraud_df.head(5)

Class 0 and Class 1 correspond to No Fraud and Fraud, respectively. In the output, every column except Amount has an anonymized name. A key strength of Autopilot is that it processes raw data directly: for example, it converts categorical features to numeric values, handles missing values (shown in the second example), and processes simple text.

Using the AWS boto3 API or the AWS Command Line Interface (AWS CLI), upload the data to Amazon S3 in CSV format:

import boto3

# file_name, bucket, and object_name are placeholders for your local CSV path,
# your S3 bucket name, and the destination object key.
s3 = boto3.client('s3')
s3.upload_file(file_name, bucket, object_name)

fraud_df = pd.read_csv(<your S3 file location>)

Next, select every column except Class as the features, and Class as the target:

X = fraud_df.drop(columns=['Class'])  # all feature columns
y = fraud_df['Class']                 # target column
print(y.value_counts())
0    284315
1       492

The binary label column Class is highly imbalanced, which is typical in financial use cases. We want to verify how well Autopilot handles such highly imbalanced data.
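The code later in this post uses X_train, X_test, y_train, and y_test, but the split itself is not shown. The following is a minimal sketch of one way to produce them, assuming scikit-learn's train_test_split with stratification. The test_size and random_state values and the small synthetic dataframe are assumptions made here so that the example runs on its own; they are not taken from the post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for fraud_df so the sketch runs on its own.
demo_df = pd.DataFrame({
    "V1": range(20),
    "Amount": [10.0 * i for i in range(20)],
    "Class": [0] * 16 + [1] * 4,
})

X_demo = demo_df.drop(columns=["Class"])
y_demo = demo_df["Class"]

# stratify keeps the fraud ratio the same in both splits;
# test_size and random_state are assumed values, not from the post.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Stratifying matters with so few positive examples: an unstratified split could leave the test set with almost no fraud cases, which would make the evaluation metrics unstable.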

The following code demonstrates how to configure Autopilot in a Jupyter notebook. You provide the training and test files and set TargetAttributeName to Class, the target column (the column we want to predict):

import boto3
import sagemaker

auto_ml_job_name = 'automl-creditcard-fraud'
sm = boto3.client('sagemaker')
session = sagemaker.Session()

prefix = 'sagemaker/' + auto_ml_job_name
bucket = session.default_bucket()

# The training data includes the target column; the test data holds features only.
training_data = pd.DataFrame(X_train)
training_data['Class'] = list(y_train)
test_data = pd.DataFrame(X_test)

train_file = 'train_data.csv'
training_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

# The test file is written without a header row because it is used for batch inference.
test_file = 'test_data.csv'
test_data.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)

# Input configuration in the format of the low-level CreateAutoMLJob API.
input_data_config = [{
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://{}/{}/train'.format(bucket, prefix)
        }
    },
    'TargetAttributeName': 'Class'
}]

Next, we create the Autopilot job. For this post, we set problem_type='BinaryClassification' and job_objective={'MetricName': 'AUC'}. If you don't set these fields, Autopilot determines the type of supervised learning problem by analyzing the data and uses the default metric for that problem type. The default metric for binary classification is F1, so we set the parameters explicitly because we want to optimize for AUC.

from sagemaker.automl.automl import AutoML
from time import gmtime, strftime, sleep
from sagemaker import get_execution_role

timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
base_job_name = 'automl-card-fraud' 

target_attribute_name = 'Class'
role = get_execution_role()
automl = AutoML(role=role,
                target_attribute_name=target_attribute_name,
                base_job_name=base_job_name,
                sagemaker_session=session,
                problem_type='BinaryClassification',
                job_objective={'MetricName': 'AUC'},
                max_candidates=100)

For more information about the job configuration parameters, see create-auto-ml-job.

After creating the Autopilot job, call the fit() function to run it:
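For readers who prefer the low-level API, the following sketch shows how the same settings might be assembled into a request for the boto3 create_auto_ml_job call. The helper name build_automl_request, the bucket name, the output path, and the role ARN are placeholders invented for illustration. The call itself is left commented out because it needs AWS credentials.

```python
def build_automl_request(job_name, bucket, prefix, role_arn):
    """Assemble keyword arguments for the low-level CreateAutoMLJob call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{prefix}/train",
                }
            },
            "TargetAttributeName": "Class",
        }],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/{prefix}/output"},
        "ProblemType": "BinaryClassification",
        "AutoMLJobObjective": {"MetricName": "AUC"},
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": 100}},
        "RoleArn": role_arn,
    }

# Placeholder values for illustration only.
request = build_automl_request(
    job_name="automl-creditcard-fraud",
    bucket="my-example-bucket",
    prefix="sagemaker/automl-creditcard-fraud",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
print(sorted(request))

# To submit the job (requires AWS credentials):
# import boto3
# boto3.client("sagemaker").create_auto_ml_job(**request)
```

Building the request as a plain dictionary keeps the configuration easy to inspect and to change before anything is sent to AWS.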

automl.fit(train_file, job_name=base_job_name, wait=False, logs=False)
describe_response = automl.describe_auto_ml_job()
print (describe_response)
job_run_status = describe_response['AutoMLJobStatus']
    
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = automl.describe_auto_ml_job()
    job_run_status = describe_response['AutoMLJobStatus']
    print (job_run_status)
    sleep(30)
print ('job finished with status : ' + job_run_status)
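The polling loop above runs until the job reaches a terminal state, however long that takes. As a hedged variation, the sketch below wraps the same idea in a helper with a timeout. The names wait_for_status and fake_describe are hypothetical; fake_describe stands in for automl.describe_auto_ml_job so the example runs without AWS.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")

def wait_for_status(describe, status_key, poll_seconds=30, timeout_seconds=3600):
    """Call describe() until the status is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()[status_key]
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)

# Stand-in for automl.describe_auto_ml_job so the sketch runs without AWS.
_fake_responses = iter(["InProgress", "InProgress", "Completed"])

def fake_describe():
    return {"AutoMLJobStatus": next(_fake_responses)}

final_status = wait_for_status(fake_describe, "AutoMLJobStatus", poll_seconds=0)
print(final_status)  # Completed
```

With a real job, the call would look like wait_for_status(automl.describe_auto_ml_job, 'AutoMLJobStatus'). The same helper could be reused for the batch transform job by passing a describe function and 'TransformJobStatus'.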

When the job is complete, select the best candidate based on the AUC objective metric:

best_candidate = automl.describe_auto_ml_job()['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))
CandidateName: tuning-job-1-7e8f6c9dffe840a0bf-009-636d28c2
FinalAutoMLJobObjectiveMetricName: validation:auc
FinalAutoMLJobObjectiveMetricValue: 0.9890000224113464

Next, we create an Autopilot model object from the model artifacts that the Autopilot job stored in Amazon S3 and the inference containers of the best candidate from the tuning job. In addition to the predicted label, we are interested in the probability of each prediction, which we later use to plot the ROC and precision-recall curves.

model_name = 'automl-cardfraud-model-' + timestamp_suffix
inference_response_keys = ['predicted_label', 'probability']
# Return both the predicted label and its probability for each row.
model = automl.create_model(name=model_name,
                            candidate=best_candidate,
                            inference_response_keys=inference_response_keys)

After creating the model, we can generate inferences for the test set with the following code. At inference time, Autopilot orchestrates the deployment of the inference pipeline, including feature engineering and the ML algorithm, on the inference machine.

s3_transform_output_path = 's3://{}/{}/inference-results/'.format(bucket, prefix)
output_path = s3_transform_output_path + best_candidate['CandidateName'] + '/'
transformer = model.transformer(instance_count=1,
                                instance_type='ml.m5.xlarge',
                                assemble_with='Line',
                                output_path=output_path)
transformer.transform(data=test_data_s3_path, split_type='Line', content_type='text/csv', wait=False)

# Look up the name of the transform job that was just started.
transform_job_name = transformer.latest_transform_job.name
describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
job_run_status = describe_response['TransformJobStatus']
print(job_run_status)

# Poll every 30 seconds until the transform job reaches a terminal state.
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print(job_run_status)
    sleep(30)
print('transform job completed with status : ' + job_run_status)

Finally, we load the inference results (predicted labels and probabilities) into a dataframe:

import json
import io
from urllib.parse import urlparse

def get_csv_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:].strip('/')
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
    return obj.get()["Body"].read().decode('utf-8')    
pred_csv = get_csv_from_s3(transformer.output_path, '{}.out'.format(test_file))
data_auc=pd.read_csv(io.StringIO(pred_csv), header=None)
data_auc.columns= ['label', 'proba']
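The helper above relies on two conventions: an S3 URI splits into a bucket and a key prefix, and batch transform writes each result to an object named after the input file with .out appended. The following standalone sketch shows just that URI handling. The function name split_s3_uri, the bucket name, and the path are hypothetical.

```python
from urllib.parse import urlparse

def split_s3_uri(s3uri):
    """Return (bucket, key_prefix) for an s3:// URI, without a trailing slash."""
    parsed = urlparse(s3uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3uri!r}")
    return parsed.netloc, parsed.path.strip("/")

# Hypothetical output location, shaped like the transformer output path above.
bucket_name, key_prefix = split_s3_uri(
    "s3://my-example-bucket/sagemaker/automl-creditcard-fraud/inference-results/candidate-1/"
)
object_key = f"{key_prefix}/test_data.csv.out"
print(bucket_name)
print(object_key)
```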

Model metrics

Common tools for comparing classifiers are the ROC curve and the precision-recall curve. The ROC curve plots the true positive rate against the false positive rate across a range of thresholds. The higher the prediction quality of the classification model, the more the ROC curve is skewed toward the top left.
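One way to build intuition for the area under the ROC curve: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes AUC directly from that definition on a four-point toy dataset. It is an illustration of the metric, not the routine that Autopilot or scikit-learn uses internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N), fine for a demo.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly (only 0.35 versus 0.4 is not), so the AUC is 0.75. Because the value depends only on the ranking of scores, it is unaffected by how rare the positive class is.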

The precision-recall curve shows the trade-off between precision and recall. The best models have a precision-recall curve that stays flat initially and drops steeply as recall approaches 1. The higher the precision and recall, the more the curve is skewed toward the top right.
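Each point on a precision-recall curve comes from one decision threshold. The following sketch computes precision and recall at two thresholds on a toy dataset to show the trade-off. The function name and the convention of reporting precision as 1.0 when nothing is flagged are choices made for this illustration.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for threshold in (0.3, 0.5):
    print(threshold, precision_recall_at(labels, scores, threshold))
```

Lowering the threshold from 0.5 to 0.3 raises recall from 0.5 to 1.0 but lets in one false positive, so precision falls from 1.0 to about 0.67.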

To optimize for the F1 score, we repeat the earlier steps with job_objective={'MetricName': 'F1'} and rerun the Autopilot job. Because the steps are identical, we don't repeat them in this section; the predictions from that run are held in data_f1 in the code below. Note that F1 is the default objective for binary classification problems. The following code plots the ROC curves:

import matplotlib.pyplot as plt
from sklearn import metrics

colors = ['blue', 'green']
model_names = ['Objective : AUC', 'Objective : F1']
# data_f1 holds the predictions from the Autopilot job rerun with the F1 objective.
models = [data_auc, data_f1]

for i in range(len(models)):
    fpr, tpr, _ = metrics.roc_curve(y_test, models[i]['proba'])
    auc_score = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label='Auto Pilot {:.2f} {}'.format(auc_score, model_names[i]), color=colors[i])

plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.title('ROC Curve')

The following figure shows the resulting plot.

In the preceding ROC plot, the Autopilot models achieve a high AUC when optimizing either objective metric. We didn't select a model or tune any hyperparameters; Autopilot did that for us.

Finally, we plot the precision-recall curves for the trained Autopilot models:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score, precision_score, recall_score

colors = ['blue', 'green']
model_names = ['Objective : AUC', 'Objective : F1']
models = [data_auc, data_f1]

print('model ', 'F1 ', 'precision ', 'recall ')
for i in range(len(models)):
    # Curve points come from the probabilities; the summary scores use the predicted labels.
    precision, recall, _ = precision_recall_curve(y_test, models[i]['proba'])
    print(model_names[i],
          f1_score(y_test, np.array(models[i]['label'])),
          precision_score(y_test, models[i]['label']),
          recall_score(y_test, models[i]['label']))
    plt.plot(recall, precision, color=colors[i], label=model_names[i])

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='upper right')
plt.show()

model            F1      precision  recall
Objective : AUC  0.8164  0.872      0.7676
Objective : F1   0.7968  0.8947     0.7183
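As a quick consistency check, F1 can be recomputed from the precision and recall columns, since it is their harmonic mean. The sketch below does this for both rows; the helper name f1_from is introduced here for illustration.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall values copied from the table above.
for name, p, r in [("Objective : AUC", 0.872, 0.7676),
                   ("Objective : F1", 0.8947, 0.7183)]:
    print(name, round(f1_from(p, r), 3))
```

Both recomputed values agree with the reported F1 scores to about three decimal places. The small differences are expected because the precision and recall shown in the table are themselves rounded.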

The following figure shows the resulting plot.

As the plot shows, the Autopilot models deliver good precision and recall, because the curves are skewed strongly toward the top-right corner.

Autopilot outputs

In addition to handling the heavy lifting of building and training models, Autopilot provides visibility into the steps taken to build them by generating two notebooks: CandidateDefinitionNotebook and DataExplorationNotebook.

You can use the candidate definition notebook to rerun the steps that Autopilot used to arrive at the best candidate. You can also use this notebook to override runtime parameters such as parallelism, the hardware used, the algorithms explored, the feature engineering scripts, the hyperparameter tuning ranges, and more.

You can download the notebook from its location in Amazon S3:

automl.describe_auto_ml_job()['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']

The notebook also summarizes the feature engineering steps used while building the models. The models are grouped by model type and feature engineering pipeline. For example, as the tuning job output shows, the best model corresponds to the pipeline dpp1-xgboost:

best_candidate_name = best_candidate['CandidateName']
print(best_candidate)
print(describe_response)

From there, if we search for ModelDataUrl, we find the pipeline that Autopilot used:

dpp1-xgboost 'ModelDataUrl': 's3://sagemaker-us-east-1-<ACCOUNT-NUM>/automl-card-fraud-7/tuning/automl-car-dpp1-xgb/tuning-job-1-7e8f6c9dffe840a0bf-009-636d28c2/output/model.tar.gz'

dpp1-xgboost is a data transformation strategy that transforms the numeric features using RobustImputer, merges the generated features, and applies RobustPCA followed by RobustStandardScaler. The transformed data is then used to tune an XGBoost model.
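To give a feel for the shape of such a preprocessing pipeline, here is a rough analogue built from standard scikit-learn components. This is only a sketch: SimpleImputer, PCA, and StandardScaler are swapped in for RobustImputer, RobustPCA, and RobustStandardScaler, they are not the classes Autopilot uses, and the median strategy, the number of components, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-ins for the Autopilot transformers, NOT the classes Autopilot uses.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # in place of RobustImputer
    ("pca", PCA(n_components=2)),                  # in place of RobustPCA
    ("scale", StandardScaler()),                   # in place of RobustStandardScaler
])

# Small synthetic numeric matrix with one missing value.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
X_demo[3, 1] = np.nan

X_out = preprocess.fit_transform(X_demo)
print(X_out.shape)
```

The order mirrors the description above: missing values are filled first, the features are projected onto fewer components, and the result is standardized before it would be passed to the model.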

From the candidate definition notebook, we can see that Autopilot automatically applied up-weighting to the minority class through scale_pos_weight. This improves prediction quality on imbalanced datasets, where the model would otherwise see too few examples of the minority class during training. You can change scale_pos_weight to a different value:

STATIC_HYPERPARAMETERS = {
    'xgboost': {
        'objective': 'binary:logistic',
        'scale_pos_weight': 568.6114285714285,
    },
}
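A common heuristic for scale_pos_weight, suggested in the XGBoost documentation, is the ratio of negative to positive examples. The sketch below computes that ratio for the class counts of the full credit card dataset. The post does not say how Autopilot derived its value, so treat the connection as an assumption; a ratio computed on a training subset rather than the full data would be one plausible reason for the slightly different number in the notebook.

```python
def negative_to_positive_ratio(labels):
    """Ratio of negative to positive examples, a common scale_pos_weight choice."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive examples")
    return n_neg / n_pos

# Class counts of the full credit card dataset printed earlier in the post.
full_dataset_labels = [0] * 284315 + [1] * 492
print(round(negative_to_positive_ratio(full_dataset_labels), 2))  # 577.88
```

With a weight of this size, each fraudulent example contributes to the training loss roughly as much as several hundred legitimate ones, which counteracts the imbalance.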

The data exploration notebook generates a report that gives insight into the input dataset, such as missing values or the data types of the different features:

automl.describe_auto_ml_job()['AutoMLJobArtifacts']['DataExplorationNotebookLocation']

Having covered the use of Autopilot for credit card fraud detection in detail, we now briefly discuss the second task: predicting company bankruptcy.

Predicting bankruptcy of Polish companies

For this post, we explore the economic attributes in the Polish companies bankruptcy dataset, which has 64 features and one target attribute, class. We renamed the class column to bankrupt (not bankrupt = 0, bankrupt = 1) for readability. As mentioned at the start, this dataset is imbalanced: more than 96% of the records are in the non-bankrupt category.

We followed the same process to configure and run Autopilot as in the credit card fraud use case. Unlike that dataset, this one contains missing values. Because Autopilot handles missing values automatically, we can pass the raw data to it directly.
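Even though Autopilot accepts raw data with gaps, it can be useful to know how much is missing before starting a job. The following is a minimal pandas sketch on a synthetic dataframe. The column names Attr1 and Attr2 and all the values are hypothetical and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; the column names are hypothetical, not the real attributes.
demo_df = pd.DataFrame({
    "Attr1": [0.12, np.nan, 0.05, 0.31],
    "Attr2": [1.4, 2.2, np.nan, np.nan],
    "bankrupt": [0, 0, 1, 0],
})

# Fraction of missing values per column, highest first.
missing_share = demo_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The data exploration notebook that Autopilot generates reports similar information, so this check is optional.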

We don't repeat the code in this section and instead show the ROC and precision-recall curves. Autopilot produces a high-quality model, as evidenced by the AUC and by the ROC and precision-recall curves. For bankruptcy prediction, a wrong prediction can lead to poor investment decisions.

To improve model performance, Autopilot also automatically up-weights the minority class label and penalizes the model for misclassifying the minority class during training. The following figure shows the precision-recall curve.

The following figure shows the ROC curve.

As the plots show, for bankruptcy the AUC objective performs better than F1. Autopilot generates accurate predictions for a complex event such as bankruptcy without any manual feature engineering.

Cleaning up

The Autopilot job creates many artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. After you finish testing, you can delete these resources with the following code (uncomment it to run):

#s3 = boto3.resource('s3')
#bucket = s3.Bucket(bucket)
 
#job_outputs_prefix = '{}/output/{}'.format(prefix,auto_ml_job_name)
#bucket.objects.filter(Prefix=job_outputs_prefix).delete()

Conclusion

In this post, we demonstrated how to build ML models with Autopilot without prior knowledge of the underlying algorithms. For imbalanced data, which is common in financial use cases, we showed that using objective metrics such as AUC and F1 together with automatic up-weighting of the minority class yields high-quality models. Autopilot also offers the agility of AutoML with control and visibility: you can inspect and rerun each step yourself, and you can see the metadata and code used to preprocess the data and train the models. Importantly, Autopilot works with datasets of all sizes, from a few MBs to hundreds of GBs, without requiring you to set up any infrastructure. Finally, Amazon SageMaker Studio provides a UI for building, training, and deploying models with Autopilot with minimal coding. To learn more about tuning, training, and deploying Autopilot models, see the hands-on workshop Build machine learning models automatically with Amazon SageMaker Autopilot.

References

[1] Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

[2] Zieba, M., Tomczak, S. K., & Tomczak, J. M. (2016). Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction. Expert Systems with Applications.

AWS Thai Blog
