5/30/2022

find optimal clustering number using silhouette evaluation

 

To find the optimal number of clusters, we use the silhouette metric.

It evaluates the clustering result for every k of the KMeans algorithm

and shows the scores as a figure.

A larger value indicates a better clustering result.

..

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import numpy as np
import matplotlib.pyplot as plt

# cluster_df: your feature DataFrame (or array) to be clustered
silhouette_vals = []
sk, ek = 2, 20
for i in range(sk, ek + 1):  # try every k from sk to ek inclusive
    kmeans_plus = KMeans(n_clusters=i, init='k-means++')
    pred = kmeans_plus.fit_predict(cluster_df)
    # mean of per-sample silhouette values = overall silhouette score for this k
    silhouette_vals.append(np.mean(silhouette_samples(cluster_df, pred, metric='euclidean')))

plt.plot(range(sk, ek + 1), silhouette_vals, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette')
plt.show()

..

In the example figure here, k = 20 gives the best clustering result.
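
To pick the best k programmatically instead of reading it off the plot, a minimal sketch (reusing the silhouette_vals list computed above):

..

best_k = sk + int(np.argmax(silhouette_vals))  # position of the highest mean silhouette
print('best k:', best_k)

..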


Thank you.


5/24/2022

t-SNE visualisation example code in Python

 

Refer to the code below.


..


from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
import seaborn as sns
import pandas as pd


# load the iris data: x is (150, 4) features, y is the class label (0, 1, 2)
iris = load_iris()
x = iris.data
y = iris.target


# optional: shuffle the samples
# from sklearn.utils import shuffle
# x, y = shuffle(x, y)


# project the 4-D features down to 2 components
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x)


# collect the projection and the labels into a DataFrame for plotting
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:, 0]
df["comp-2"] = z[:, 1]


sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                palette=sns.color_palette("hls", 3),
                data=df).set(title="Iris data T-SNE projection")



..

Output: a 2-D scatter plot of the iris samples colored by class (figure not shown here).
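
The original snippet also imported keras's mnist dataset; the same recipe works there too. A sketch, assuming keras is installed (t-SNE is slow, so only a subset is used; imports from the block above are reused):

..

from keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()
x_sub = x_train[:3000].reshape(3000, 28 * 28) / 255.0  # flatten the 28x28 images
y_sub = y_train[:3000]

z = TSNE(n_components=2, random_state=123).fit_transform(x_sub)
df = pd.DataFrame({"y": y_sub, "comp-1": z[:, 0], "comp-2": z[:, 1]})
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                palette=sns.color_palette("hls", 10),  # 10 digit classes
                data=df).set(title="MNIST data T-SNE projection")

..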




5/14/2022

pathlib, Path, pathlib.PosixPath

 

Make a path using pathlib.

Refer to the code below.

..

from pathlib import Path

image_dir = 'dataset/images'
images = '1.png'
# the / operator joins path segments and returns a Path object
print(str(Path(image_dir) / images))
print(type(Path(image_dir) / images))
# dataset/images/1.png
# <class 'pathlib.PosixPath'>  (WindowsPath on Windows)

..
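
A few more Path conveniences that often go together with this pattern (a quick sketch; the file itself does not need to exist):

..

p = Path('dataset/images') / '1.png'
print(p.name)     # 1.png
print(p.stem)     # 1
print(p.suffix)   # .png
print(p.parent)   # dataset/images
print(p.exists())  # False unless the file is really there

..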


www.marearts.com


yaml to dict, adding argparse to dict (easydict, yaml)

 

Simple code showing how to read a YAML file and convert it to a dict.

One more thing: we add argparse parameters to the dict made from the YAML.

We use easydict for this.

Refer to the code below; you will understand it at a glance.


..


from easydict import EasyDict
import yaml

# test.yaml contains (reconstructed from the printed output below):
#   V1: abc
#   V2:
#     sub: [1, 2, 3]

def load_setting(setting):
    with open(setting, 'r') as f:
        cfg = yaml.load(f, Loader=yaml.FullLoader)
    return EasyDict(cfg)

#----------------------------
cfg = load_setting('test.yaml')
print(cfg)
#{'V1': 'abc', 'V2': {'sub': [1, 2, 3]}}
#----------------------------

#----------------------------
import argparse
parser = argparse.ArgumentParser()
args = parser.parse_args([])  # empty list: parse no CLI arguments
args.batch_size = 10
args.epoch = 10
#----------------------------

cfg.update(vars(args))
print(cfg, type(cfg))
#{'V1': 'abc', 'V2': {'sub': [1, 2, 3]}, 'batch_size': 10, 'epoch': 10} <class 'easydict.EasyDict'>
#----------------------------

..
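
The point of EasyDict over a plain dict is attribute-style access; a small check (values taken from the run above):

..

print(cfg.V1)          # abc
print(cfg.V2.sub)      # [1, 2, 3]  (nested dicts become EasyDicts too)
print(cfg.batch_size)  # 10

..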


Thank you.

www.marearts.com


5/13/2022

convert simple transformer ner model to onnx
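
Export the fine-tuned NER checkpoint to ONNX with the transformers.onnx command-line tool: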

 

..

!python -m transformers.onnx --model=./checkpoint-21-epoch-11 --feature=token-classification onnx/

..
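
To sanity-check the exported model, a minimal sketch with onnxruntime (assuming the export wrote the CLI's default file name onnx/model.onnx and the tokenizer files sit in the checkpoint directory):

..

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./checkpoint-21-epoch-11")
session = ort.InferenceSession("onnx/model.onnx")

inputs = tokenizer("This is a test", return_tensors="np")
logits = session.run(None, dict(inputs))[0]  # shape: (1, seq_len, num_labels)
print(logits.shape)

..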



tokens to word, transformer

 

Refer to the code to figure out

how the tokens are composed for each word.

The code shows you the token list for every word.


..

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

example = "This is a tokenization example"

print('input sentence: ', example)
print('---')
print('tokens :')
print(tokenizer.encode(example, add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False))
print('---')
print('word and tokens :')
print({x: tokenizer.encode(x, add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False) for x in example.split()})
print('---')

# group token positions per word: one sub-list per word, 1-based positions
idx = 1
enc = [tokenizer.encode(x, add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False) for x in example.split()]
desired_output = []
for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)

print('tokens in grouped list')
print(desired_output)
print('---')

..


input sentence:  This is a tokenization example
---
tokens :
[713, 16, 10, 19233, 1938, 1246]
---
word and tokens :
{'This': [713], 'is': [354], 'a': [102], 'tokenization': [46657, 1938], 'example': [46781]}
---
tokens in grouped list
[[1], [2], [3], [4, 5], [6]]
---
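
As an aside, fast tokenizers expose this word-to-token mapping directly via word_ids(); a short sketch (AutoTokenizer returns a fast tokenizer for roberta-base by default):

..

enc = tokenizer(example)
print(enc.word_ids())
# [None, 0, 1, 2, 3, 3, 4, None]
# None marks special tokens; a repeated index marks a word split into several tokens

..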


Thank you.
www.marearts.com

5/11/2022

python dict order shuffle
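
Shuffle the key order of a dict by shuffling its item list and rebuilding the dict: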

 


..

import random

d = {'a':1, 'b':2, 'c':3, 'd':4}
l = list(d.items())  # dict -> list of (key, value) pairs
random.shuffle(l)    # shuffle the pairs in place
d = dict(l)          # rebuild; dicts keep insertion order (Python 3.7+)
print(d)

..

{'a': 1, 'c': 3, 'b': 2, 'd': 4}




5/09/2022

BERT Tokenizer, string to token, token to string

 

Tokenizer token understanding examples: string to tokens, encoding, and back to string.

..

text = "I am e/mail"
# text = "I am a e-mail"
tokens = tokenizer.tokenize(text)
print(f'Tokens: {tokens}')
print(f'Tokens length: {len(tokens)}')
encoding = tokenizer.encode(text)
print(f'Encoding: {encoding}')
print(f'Encoding length: {len(encoding)}')
tok_text = tokenizer.convert_tokens_to_string(tokens)
print(f'token to string: {tok_text}')

..

output:

Tokens: ['I', 'Ġam', 'Ġe', '/', 'mail']
Tokens length: 5
Encoding: [0, 100, 524, 364, 73, 6380, 2]
Encoding length: 7
token to string: I am e/mail
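
Going the other way, decode() turns the id list back into a string, including the special tokens (a quick check against the encoding above):

..

print(tokenizer.decode(encoding))
# <s>I am e/mail</s>
print(tokenizer.decode(encoding, skip_special_tokens=True))
# I am e/mail

..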

--
Thank you.
www.marearts.com

5/04/2022

Python: convert a list into a space-separated string

 

Refer to the code below.

..

lst = ['I', 'am', 'a', 'human']
strlist = ' '.join(lst)  # join the items with a single space
print(strlist, type(strlist))

..

I am a human <class 'str'>
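
One gotcha: join() only accepts strings, so convert other types first (a tiny sketch):

..

nums = [1, 2, 3]
print(' '.join(map(str, nums)))  # str.join() raises TypeError on ints, so map to str
# 1 2 3

..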

Thank you.