MareArts Computer Vision Study.: sklearn

Showing posts with label sklearn. Show all posts

4/30/2023

MLFlow example, using kmean + iris dataset in sklearn.

refer to code:

import mlflow
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.cluster import KMeans
from sklearn import datasets
import json
import matplotlib.pyplot as plt


try:
    mlflow.create_experiment(name="iris_experiment")
except mlflow.exceptions.MlflowException:
    print("The experement already exist")

with mlflow.start_run(experiment_id=mlflow.get_experiment_by_name("iris_experiment").experiment_id):
    
    iris = datasets.load_iris()

    length, width = iris['data'].shape
    mlflow.log_dict( {"length":length, "width":width }, "shape.json")

    params_dict = {"length":iris['data'].shape[0], "width":iris['data'].shape[1]}
    params_json = json.dumps(params_dict)
    mlflow.log_param(key="param", value=params_json)


    BestK = 0
    BestScore = 0
    best_predict_labels = 0
    best_centroids = 0
    for step, k in enumerate(range(2,10)): 
        print(f'---- f:{k}')
        # Fit the model to the data
        kmeans = KMeans(n_clusters=k, random_state=0)

        X = iris['data']
        kmeans.fit(X)
        centroids = kmeans.cluster_centers_

        # Get the cluster assignments for each data point
        predicted_labels = kmeans.labels_
        true_labels = iris['target']

        # Calculate the Adjusted Rand Index
        ari = adjusted_rand_score(true_labels, predicted_labels)
        print("Adjusted Rand Index:", ari)

        # Calculate the Silhouette Score
        sil_score = silhouette_score(X, predicted_labels)
        print("Silhouette Score:", sil_score)

        if BestScore < (ari+sil_score)/2 :
            BestScore = (ari+sil_score)/2
            BestK = k
            best_predict_labels = predicted_labels
            best_centroids = centroids
        
        #log metric
        # mlflow.log_metric(key="kmean score", value={"ARI":ari, "Sil":sil_score})
        
        mlflow.log_metric(key="kmean_ari", value=ari, step=step)
        mlflow.log_metric(key="silhouette_score", value=sil_score, step=step)



    print('+++++++++')
    print(f'best score:{BestScore}, k:{BestK}')


    fig, ax = plt.subplots()
    ax.hist(predicted_labels)
    mlflow.log_figure(fig, artifact_file="lables_hist.png")

The code you provided is a Python script that uses the MLflow library to perform K-means clustering on the Iris dataset, logs the model parameters and metrics, and identifies the best K value (number of clusters) based on the Adjusted Rand Index (ARI) and Silhouette Score. It also generates a histogram plot of the predicted labels for the best K value and saves it as an artifact.

Here's an overview of the code:

Import the necessary libraries, including MLflow, scikit-learn, and matplotlib.

Create a new MLflow experiment called "iris_experiment" if it doesn't already exist.

Start an MLflow run in the context of the "iris_experiment".

Load the Iris dataset and log its shape as a dictionary.

Iterate through different K values (number of clusters) from 2 to 9 and perform the following steps:

a. Instantiate and fit a KMeans model with the current K value.

b. Obtain the predicted cluster assignments and compare them with the true labels using ARI and Silhouette Score.

c. Log the ARI and Silhouette Score metrics for each K value using MLflow.

d. Keep track of the best K value based on the average of ARI and Silhouette Score.

Print the best K value and its corresponding score.

Create a histogram plot of the predicted labels for the best K value and save it as an artifact using MLflow.

Thank you.

www.marearts.com

🙇🏻‍♂️

5/30/2022

find optimal clustering number using silhouette evaluation

To Find optimal clustering number using silhouette metrics

It evaluate clustering resulting in every k number of KMean algorithm.

And show it as figure.

Lager value is better result.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import numpy as np
import matplotlib.pyplot as plt
silhouette_vals = []
sk,ek = 2,20
for i in range(sk, ek):
    kmeans_plus = KMeans(n_clusters=i, init='k-means++')
    pred = kmeans_plus.fit_predict(cluster_df)
    silhouette_vals.append(np.mean(silhouette_samples(cluster_df, pred, metric='euclidean')))
plt.plot(range(sk, ek), silhouette_vals, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette')
plt.show()

For example here, 20 k is best clustering result.

Thank you.

11/12/2021

compute_class_weight() takes 1 positional argument but 3 were given

put a explicit parameter name like:

from sklearn.utils import class_weight

class_weights=class_weight.compute_class_weight(class_weight='balanced',classes=np.unique(y),y=y)

Thank you.

www.marearts.com

9/21/2021

write classification_report to txt file

Refer to code ^^

from sklearn.metrics import classification_report

report = classification_report(y_test_true_list, y_test_pred_list, labels=labels, target_names=target_names)

fp = open('report.txt', 'w')
fp.write(report)
fp.close()

Thank you.

www.marearts.com

10/09/2020

draw roc curve using python sklearn, Matplotlib

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn import metrics


gt = [1, 0, 1, 0, 1, 1] #origin 
pre = [0.9, 0.5, 0.8, 0.4, 0.5, 0.8] #predict
fpr, tpr, thresholds = metrics.roc_curve(gt, pre)
roc_auc = metrics.auc(fpr, tpr)

fig, ax = plt.subplots(figsize=(10,7))
ax.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
ax.plot(np.linspace(0, 1, 100),
         np.linspace(0, 1, 100),
         label='baseline',
         linestyle='--')
plt.title('Receiver Operating Characteristic Curve', fontsize=18)
plt.ylabel('TPR', fontsize=16)
plt.xlabel('FPR', fontsize=16)
plt.legend(fontsize=12) 

9/20/2020

split train test dataset

import random

from sklearn.model_selection import train_test_split

random.shuffle(pkl_list)

pkl_train, pkl_test = train_test_split(pkl_list, test_size=0.2)

Pages

4/30/2023

MLFlow example, using kmean + iris dataset in sklearn.

5/30/2022

find optimal clustering number using silhouette evaluation

11/12/2021

compute_class_weight() takes 1 positional argument but 3 were given

9/21/2021

write classification_report to txt file

10/09/2020

draw roc curve using python sklearn, Matplotlib

9/20/2020

split train test dataset