Showing posts with label Deep learning.

9/30/2024

How gradient calculation works with batch size.

 Let's use a simplified example with just 2 data points and walk through the process with actual numbers. This will help illustrate how gradients are calculated and accumulated for a batch.

Let's assume we have a very simple model with one parameter w, currently set to 1.0. Our loss function is the squared error, and we're using basic gradient descent with a learning rate of 0.1.

Data points:

  1. x1 = 2, y1 = 4
  2. x2 = 3, y2 = 5

Batch size = 2 (both data points in one batch)

Step 1: Forward pass

  • For x1: prediction = w * x1 = 1.0 * 2 = 2
  • For x2: prediction = w * x2 = 1.0 * 3 = 3

Step 2: Calculate losses

  • Loss1 = (prediction1 - y1)^2 = (2 - 4)^2 = 4
  • Loss2 = (prediction2 - y2)^2 = (3 - 5)^2 = 4
  • Total batch loss = (Loss1 + Loss2) / 2 = (4 + 4) / 2 = 4

Step 3: Backward pass (calculate gradients)

  • Gradient1 = 2 * (prediction1 - y1) * x1 = 2 * (2 - 4) * 2 = -8
  • Gradient2 = 2 * (prediction2 - y2) * x2 = 2 * (3 - 5) * 3 = -12

Step 4: Accumulate gradients

  • Total gradient = (Gradient1 + Gradient2) / 2 = (-8 + -12) / 2 = -10

Step 5: Update weight (once for the batch)

  • New w = old w - learning_rate * total gradient
  • New w = 1.0 - 0.1 * (-10) = 2.0

So, after processing this batch of 2 data points:

  • We calculated 2 individual gradients (-8 and -12)
  • We accumulated these into one total gradient (-10)
  • We performed one weight update, changing w from 1.0 to 2.0

This process would then repeat for the next batch. In this case, we've processed all our data, so this completes one epoch.
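
The same numbers can be reproduced with a few lines of PyTorch. This is only a sketch of the example above (one parameter w, squared-error loss averaged over the batch, learning rate 0.1):

.

import torch

w = torch.tensor(1.0, requires_grad=True)  # single parameter, w = 1.0
x = torch.tensor([2.0, 3.0])               # batch of 2 inputs
y = torch.tensor([4.0, 5.0])               # targets

pred = w * x                               # forward pass: [2.0, 3.0]
loss = ((pred - y) ** 2).mean()            # batch loss: (4 + 4) / 2 = 4

loss.backward()                            # backward pass accumulates the averaged gradient
print(loss.item())                         # 4.0
print(w.grad.item())                       # -10.0 (average of -8 and -12)

with torch.no_grad():
    w -= 0.1 * w.grad                      # one update for the whole batch
print(w.item())                            # 2.0

..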

7/15/2023

Combining a custom FC layer with a Hugging Face model; good to remember and adapt when making modifications.

 refer to code:


.

    def model_forward(self, pixel_values, labels):
        # Original ViT encoder-decoder outputs
        outputs = self.model(pixel_values=pixel_values, labels=labels, output_hidden_states=True)
        # Get the last hidden state
        last_hidden_state = outputs.decoder_hidden_states[-1]  # batch_size, seq_len, hidden_size, ex) 5, 15, 768
        return last_hidden_state

    def fc_part(self, last_hidden_state):
        # Reshape the last hidden state
        reshaped_logits = last_hidden_state.view(-1, self.model.config.decoder.hidden_size)  # batch_size*seq_len, hidden_size
        # Apply the fully connected layer
        new_logits = self.custom_decoder_fc(reshaped_logits)  # batch_size*seq_len, vocab_size
        return new_logits

    def compute_loss(self, new_logits, labels):
        # Reshape labels to match the logits dimension
        reshaped_labels = labels.view(-1)  # batch_size, seq_len -> batch_size*seq_len
        # Calculate loss
        # [batch_size*seq_len, vocab_size] vs [batch_size*seq_len], ex) [70, 13] vs [70]
        loss = self.loss_f(new_logits, reshaped_labels)  # scalar tensor
        return loss

    def forward_pass(self, pixel_values, labels):
        last_hidden_state = self.model_forward(pixel_values, labels)  # batch_size, seq_len, hidden_size
        new_logits = self.fc_part(last_hidden_state)  # batch_size*seq_len, vocab_size
        loss = self.compute_loss(new_logits, labels)  # scalar tensor
        # Reshape new_logits to match the labels dimension
        new_logits = new_logits.view(labels.shape[0], labels.shape[1], -1)  # batch_size, seq_len, vocab_size

        return {'logits': new_logits, 'loss': loss}

..


forward_pass runs these steps one by one.

At the end it returns the logits computed from the last hidden state, together with the loss.
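
For reference, the attributes this code relies on (self.model, self.custom_decoder_fc, self.loss_f) might be set up roughly like below. This is only a sketch based on how they are used above, not the original class; the class name and checkpoint argument are placeholders.

.

import torch.nn as nn
from transformers import VisionEncoderDecoderModel

class CustomVED(nn.Module):  # hypothetical wrapper class
    def __init__(self, model_name, vocab_size):
        super().__init__()
        # Any ViT encoder-decoder checkpoint could be loaded here
        self.model = VisionEncoderDecoderModel.from_pretrained(model_name)
        hidden_size = self.model.config.decoder.hidden_size
        # Custom FC head applied on top of the decoder's last hidden state
        self.custom_decoder_fc = nn.Linear(hidden_size, vocab_size)
        # -100 is the usual Hugging Face convention for padded label positions
        self.loss_f = nn.CrossEntropyLoss(ignore_index=-100)

..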


Thank you.

www.marearts.com

🙇🏻‍♂️

7/13/2023

Beam search function for image-to-text or NLP inference.

  refer to code first.

.

# This beam search only deals with batch size 1
def beam_search(self, pixel_value, max_length):
    beam_size = self.cfg.num_beams
    alpha = self.cfg.beam_alpha  # Length normalization coefficient
    temperature = self.cfg.beam_temp  # Temperature for softmax

    # Initialize input ids as bos_token_id
    first_sequence = torch.full((pixel_value.shape[0], 1), self.model.config.decoder_start_token_id).to(pixel_value.device)
    # ic(first_sequence)  # tensor([[1]])

    # Predict the second token id
    outputs = self.forward_pass(pixel_value, first_sequence)
    # ic(outputs.keys())  # dict_keys(['logits', 'loss'])
    # We only need the logits corresponding to the last prediction
    next_token_logits = outputs['logits'][:, -1, :]
    # ic(outputs['logits'].shape)  # [1, 1, 13] batch, seq, vocab_size
    # ic(outputs['logits'][:, -1, :].shape)  # [1, 13] batch, vocab_size

    # Apply temperature
    # ic(next_token_logits)
    # [-5.0641, 32.7805, -2.6743, -4.6459, 0.8130, -1.3443, -1.2016, -4.0770,
    #  -3.5401, 0.2425, -5.3685, -1.8074, -5.2606]],
    # next_token_logits /= temperature
    # ic(next_token_logits)
    # [-7.2344, 46.8292, -3.8204, -6.6370, 1.1614, -1.9205, -1.7166, -5.8243,
    #  -5.0573, 0.3464, -7.6693, -2.5820, -7.5152]],

    # Select top k tokens
    next_token_probs = F.softmax(next_token_logits, dim=-1)
    top_k_probs, top_k_ids = torch.topk(next_token_probs, beam_size)
    # ic(F.softmax(next_token_logits, dim=-1))
    # tensor([[3.3148e-24, 1.0000e+00, 1.0072e-22, 6.0241e-24, 1.4680e-20, 6.7340e-22,
    #          8.2570e-22, 1.3579e-23, 2.9239e-23, 6.4976e-21, 2.1458e-24, 3.4751e-22,
    #          2.5034e-24]]
    # ic(top_k_probs, top_k_ids)
    # top_k_probs: tensor([[1.]], grad_fn=<TopkBackward0>)
    # top_k_ids: tensor([[1]])

    # Prepare the next sequences. Each top-k token is appended to the first_sequence
    # ic(first_sequence.shape)  # [1, 1]
    next_sequences = first_sequence.repeat_interleave(beam_size, dim=0)
    # ic(next_sequences.shape)  # [10, 1] 10 is beam size, 1 is seq length
    next_sequences = torch.cat([next_sequences, top_k_ids.view(-1, 1)], dim=-1)
    # ic(next_sequences.shape)  # [10, 2] 10 is beam size, 2 is seq length
    # ic(next_sequences)

    # Also prepare a tensor to hold the cumulative score of each sequence,
    # i.e. the sum of the log probabilities of each token in the sequence
    sequence_scores = (torch.log(top_k_probs).view(-1))  # / (1 + 1) ** alpha
    # ic(sequence_scores)  # [ 0.0000, -15.9837]

    # We'll need to repeat the pixel_values for each sequence in each beam
    pixel_value = pixel_value.repeat_interleave(beam_size, dim=0)
    # ic(pixel_value.shape)  # [10, 3, 224, 224], 10 is beam size, 3 is channel, 224 is image size

    for idx in range(max_length - 1):  # We already generated one token
        # ic(idx, '--------------------')
        outputs = self.forward_pass(pixel_value, next_sequences)
        next_token_logits = outputs['logits'][:, -1, :]
        # ic(outputs['logits'].shape, outputs['logits'])  # [2, 2, 13], batch, seq, vocab_size
        # ic(next_token_logits.shape, next_token_logits)

        # Apply temperature
        # next_token_logits /= temperature

        # Convert logits to probabilities and calculate new scores
        next_token_probs = F.softmax(next_token_logits, dim=-1)
        # ic(next_token_probs.shape, next_token_probs)  # [2, 13], batch, vocab_size
        next_token_scores = torch.log(next_token_probs)
        # ic(next_token_scores.shape, next_token_scores)  # [2, 13], batch, vocab_size

        new_scores = sequence_scores.unsqueeze(1) + next_token_scores
        # ic(sequence_scores.unsqueeze(1))
        # ic(new_scores.shape, new_scores)  # [2, 13], batch, vocab_size

        # Select top k sequences
        # ic(new_scores.view(-1), new_scores.view(-1).shape)
        top_k_scores, top_k_indices = torch.topk(new_scores.view(-1), beam_size)

        # ic(top_k_scores, top_k_indices)

        # Get the beam and token that each of the top k sequences comes from
        beams_indices = top_k_indices // self.cfg.num_tokens
        token_indices = top_k_indices % self.cfg.num_tokens
        # ic(beams_indices, token_indices)

        # Update pixel values, sequences, and scores
        # pixel_value = pixel_value[beams_indices]
        # ic(next_sequences)
        next_sequences = next_sequences[beams_indices]
        # ic(next_sequences)
        next_sequences = torch.cat([next_sequences, token_indices.unsqueeze(1)], dim=-1)
        # ic(next_sequences)
        sequence_scores = top_k_scores  # / (idx + 3) ** alpha

        # ic('-------------------')
        # if idx > 2: break

    # Select the best sequence
    max_score, max_score_idx = torch.max(sequence_scores, 0)
    # Select the sequence with the highest score
    best_sequence = next_sequences[max_score_idx]

    # ic(best_sequence, max_score)
    return best_sequence, max_score

..


This is a portion of my class.

Some code is omitted, in particular forward_pass, but it will work properly if you adapt it carefully.

You can also pick up some ideas from it.
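
The key bookkeeping trick in the loop is taking topk over the flattened (beam x vocab) score matrix and then recovering which beam and which token each winner came from with integer division and modulo. Here is a tiny standalone sketch of just that step, with dummy scores, beam size 2 and vocab size 5:

.

import torch

beam_size, vocab_size = 2, 5
# Dummy cumulative log-prob scores: one row per beam, one column per vocab token
new_scores = torch.tensor([[-0.1, -2.0, -3.0, -4.0, -5.0],
                           [-1.5, -0.2, -2.5, -3.5, -4.5]])

# Top-k over the flattened (beam * vocab) scores
top_k_scores, top_k_indices = torch.topk(new_scores.view(-1), beam_size)

# Recover the beam index and token index of each winner
beams_indices = top_k_indices // vocab_size
token_indices = top_k_indices % vocab_size

print(top_k_scores)   # tensor([-0.1000, -0.2000])
print(beams_indices)  # tensor([0, 1]) -> winners came from beam 0 and beam 1
print(token_indices)  # tensor([0, 1]) -> token 0 on beam 0, token 1 on beam 1

..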

Thank you.

🙇🏻‍♂️

www.marearts.com


6/29/2023

Graph Neural Network Study Tutorial

 

Stanford CS224W Tutorials

https://data.pyg.org/img/cs224w_tutorials.png

The Stanford CS224W course has collected a set of graph machine learning tutorial blog posts, fully realized with PyG. Students worked on projects spanning all kinds of tasks, model architectures and applications. All tutorials also link to a Google Colab notebook with the code in the tutorial for you to follow along with as you read it!

PyTorch Geometric Tutorial Project

The PyTorch Geometric Tutorial project provides video tutorials and Colab notebooks for a variety of different methods in PyG:

  1. Introduction [YouTube, Colab]

  2. PyTorch basics [YouTube, Colab]

  3. Graph Attention Networks (GATs) [YouTube, Colab]

  4. Spectral Graph Convolutional Layers [YouTube, Colab]

  5. Aggregation Functions in GNNs [YouTube, Colab]

  6. (Variational) Graph Autoencoders (GAE and VGAE) [YouTube, Colab]

  7. Adversarially Regularized Graph Autoencoders (ARGA and ARGVA) [YouTube, Colab]

  8. Graph Generation [YouTube]

  9. Recurrent Graph Neural Networks [YouTube, Colab (Part 1), Colab (Part 2)]

  10. DeepWalk and Node2Vec [YouTube (Theory), YouTube (Practice), Colab]

  11. Edge analysis [YouTube, Colab (Link Prediction), Colab (Label Prediction)]

  12. Data handling in PyG (Part 1) [YouTube, Colab]

  13. Data handling in PyG (Part 2) [YouTube, Colab]

  14. MetaPath2vec [YouTube, Colab]

  15. Graph pooling (DiffPool) [YouTube, Colab]

6/02/2023

torch tensor padding example code:

 refer to code:


.

import torch
import torch.nn.functional as F
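
# Note: F.pad reads the pad tuple in pairs starting from the LAST dimension:
# (left, right) pads dim -1, the next pair pads dim -2, and so on.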

tensor = torch.randn(2, 3, 4) # Original tensor
print("Original tensor shape:", tensor.shape)

# Case 1: Pad the last dimension (dimension -1) -> resulting shape: [2, 3, 8]
padding_size = 4
padded_tensor = F.pad(tensor, (padding_size, 0)) # Add padding to the left of the last dimension
print("Case 1 tensor shape:", padded_tensor.shape)

# Case 2: Pad the second-to-last dimension (dimension -2) -> resulting shape: [2, 8, 4]
padding_size = 5
padded_tensor = F.pad(tensor, (0, 0, padding_size, 0)) # Add padding to the left of the second-to-last dimension
print("Case 2 tensor shape:", padded_tensor.shape)

# Case 3: Pad the first dimension (dimension 0) -> resulting shape: [7, 3, 4]
padding_size = 5
padded_tensor = F.pad(tensor, (0, 0, 0, 0, padding_size, 0)) # Add padding to the left of the first dimension
print("Case 3 tensor shape:", padded_tensor.shape)

..


www.marearts.com

Thank you. 🙇🏻‍♂️

5/23/2023

Simple code to create a custom tokenizer.

In the sample code, the vocabulary is "0", "1", "2", "3", "4" (plus the special tokens) and the max length is 20.


.

from typing import List, Union

class CustomTokenizer:
    def __init__(self, vocab: Union[str, List[str]], pad_token="<PAD>", cls_token="<BOS>", sep_token="<SEP>", max_len=20):
        if isinstance(vocab, str):
            with open(vocab, 'r') as f:
                self.vocab = {word.strip(): i for i, word in enumerate(f.readlines())}
        elif isinstance(vocab, list):
            self.vocab = {word: i for i, word in enumerate(vocab)}
        else:
            raise ValueError("vocab must be either a filepath (str) or a list of words")
        print('vocab: ', self.vocab)
        self.pad_token = pad_token
        self.cls_token = cls_token
        self.sep_token = sep_token
        self.max_len = max_len
        self.inv_vocab = {v: k for k, v in self.vocab.items()}

    def tokenize(self, text: str):
        tokens = [c for c in text if c in self.vocab]
        tokens = tokens[:self.max_len]
        padding_length = self.max_len - len(tokens)
        return [self.cls_token] + tokens + [self.sep_token] + [self.pad_token] * padding_length

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(token, self.vocab.get(self.pad_token)) for token in tokens]

    def convert_ids_to_tokens(self, ids):
        return [self.inv_vocab.get(id, self.pad_token) for id in ids]



vocab = ["<PAD>", "<BOS>", "<SEP>", "0", "1", "2", "3", "4"]
with open('vocab.txt', 'w') as f:
    for token in vocab:
        f.write(token + '\n')

# Initialize your custom tokenizer
tokenizer = CustomTokenizer(vocab='vocab.txt')

# Now you can use this tokenizer to tokenize your data, study.marearts.com
tokenized_text = tokenizer.tokenize('22342')
print("tokenized_text: ", tokenized_text)

# Convert tokens to ids
token_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
print("token_ids: ", token_ids)

# Convert ids back to tokens, marearts.com
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print("tokens: ", tokens)

..


Thank you.

🙇🏻‍♂️


5/02/2023

Yolo V7 vs V8

 

V7 vs V8 comparison

https://youtu.be/k1dOZFcLOek

https://youtu.be/tpOGDclq7KY

https://youtu.be/u5qxN2ACEP4

https://youtu.be/85SH08jN4dY

These are comparison videos between YOLOv7 and YOLOv8.

Here is the test environment information:

Testing Computer :

  • Intel(R) Core(TM) i7-9800X CPU @ 3.80GHz
  • RTX 4090

Some code that might be useful:

  • YOLOv8: video writer for the detection results
import cv2
import time
from ultralytics import YOLO

def process_video(model, video_path, output_path):
    cap = cv2.VideoCapture(video_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = int(cap.get(cv2.CAP_PROP_FPS))

    # Create a VideoWriter object to save the annotated video
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))

    while cap.isOpened():
        success, frame = cap.read()

        if success:
            start_time = time.time()
            results = model(frame)
            end_time = time.time()
            processing_time = end_time - start_time
            fps = 1 / processing_time
            # Visualize the results on the frame
            annotated_frame = results[0].plot()
            # Display the processing time on the annotated frame
            cv2.putText(annotated_frame, f"Processing time: {processing_time:.4f} seconds / {fps:.4f} fps",
                        (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)

            # Write the annotated frame to the output video
            out.write(annotated_frame)

            # cv2.imshow("YOLOv8 Inference", annotated_frame)
            # if cv2.waitKey(1) & 0xFF == ord("q"):
            #     break
        else:
            break

    cap.release()
    out.release()

def main():
    # Load the YOLO model
    model = YOLO('yolov8x.pt')

    # List of video files
    video_paths = [
        "../video/videoplayback-1.mp4",
        "../video/videoplayback-2.mp4",
        "../video/videoplayback-3.mp4",
        "../video/videoplayback-4.mp4",
    ]

    # Loop through video files and process them
    for i, video_path in enumerate(video_paths):
        output_path = f"../video/yolo_88_output_{i+1}.mp4"
        process_video(model, video_path, output_path)

    cv2.destroyAllWindows()

if __name__ == '__main__':
    main()
  • combine two videos side by side (see the sketch after the link below)

Combine Two Videos Side by Side with OpenCV python
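
As a rough idea (not the exact code from the linked post), a minimal sketch that reads two videos and writes them side by side could look like this; the file names are placeholders:

import cv2

# Minimal sketch: stack frames from two videos horizontally (file names are placeholders)
cap1 = cv2.VideoCapture("left.mp4")
cap2 = cv2.VideoCapture("right.mp4")

width = int(cap1.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap1.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap1.get(cv2.CAP_PROP_FPS)

fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter("side_by_side.mp4", fourcc, fps, (width * 2, height))

while True:
    ok1, frame1 = cap1.read()
    ok2, frame2 = cap2.read()
    if not (ok1 and ok2):
        break
    # Resize the second frame to match the first, then concatenate horizontally
    frame2 = cv2.resize(frame2, (width, height))
    out.write(cv2.hconcat([frame1, frame2]))

cap1.release()
cap2.release()
out.release()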

Thank you! 😺

4/25/2023

ViT encoder + Transformer decoder model - ONNX export example

refer to this code:

.



# If you want to combine a Vision Transformer (ViT) as an encoder with a Transformer-based decoder,
# you can follow the steps below.
# We will use the Hugging Face Transformers library and PyTorch.

# Install the required libraries:
# pip install torch torchvision transformers onnx

# Define the combined model:
# -----------------------------------------
import torch
import torch.nn as nn
from transformers import ViTModel, ViTConfig, AutoModelForSeq2SeqLM

class ViTTransformer(nn.Module):
    def __init__(self, vit_model, transformer_decoder):
        super(ViTTransformer, self).__init__()
        self.vit = vit_model
        self.transformer_decoder = transformer_decoder

    def forward(self, x, decoder_input_ids, **kwargs):
        encoder_outputs = self.vit(x)
        outputs = self.transformer_decoder(decoder_input_ids, encoder_outputs=encoder_outputs, **kwargs)
        return outputs
# -----------------------------------------

# Load the ViT and Transformer decoder models:
# Assuming you have a pre-trained ViT model and a pre-trained Transformer decoder model, load them as follows:

# -----------------------------------------
vit_config = ViTConfig()
vit_model = ViTModel(vit_config)
transformer_decoder = AutoModelForSeq2SeqLM.from_pretrained("your-pretrained-transformer-decoder")


# Create the combined model and load the checkpoint if you have one:
# -----------------------------------------
combined_model = ViTTransformer(vit_model, transformer_decoder)
# -----------------------------------------

# # If you have a checkpoint, load it as follows:
# # checkpoint = torch.load('path/to/checkpoint.pth')
# # combined_model.load_state_dict(checkpoint['model_state_dict'])
# Export the combined model to ONNX format:
# The process of exporting the combined model to ONNX is more complicated due to the dynamic nature of the Transformer-based decoder.
# You might need to modify the export code depending on your specific use case.
# However, here is a general example:

# -----------------------------------------
# # Set the combined model to evaluation mode
combined_model.eval()
# Create dummy input tensors with the correct dimensions
# (B x C x H x W) for image input and (B x seq_len) for decoder input
dummy_image_input = torch.randn(1, 3, 224, 224)
dummy_decoder_input = torch.randint(0, transformer_decoder.config.vocab_size, (1, 5))

# Export the combined model to ONNX format
torch.onnx.export(
    combined_model,
    (dummy_image_input, dummy_decoder_input),
    "vit_transformer.onnx",
    input_names=["image_input", "decoder_input"],
    output_names=["output"],
    dynamic_axes={
        "image_input": {0: "batch_size"},
        "decoder_input": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size", 1: "sequence_length"},
    },
    opset_version=12,
)
# -----------------------------------------

# This code will create an ONNX file (vit_transformer.onnx) containing the combined ViT and Transformer decoder model.
# Note that you might need to adjust the code according to the specific needs of your application.

..
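
After exporting, you can sanity-check the ONNX file with onnxruntime, roughly like below. This is a sketch; the dummy vocab size of 1000 used for the decoder ids is an arbitrary placeholder.

.

import numpy as np
import onnxruntime as ort

# Load the exported model on CPU
session = ort.InferenceSession("vit_transformer.onnx", providers=["CPUExecutionProvider"])

# Dummy inputs matching the shapes and dtypes used for the export
image_input = np.random.randn(1, 3, 224, 224).astype(np.float32)
decoder_input = np.random.randint(0, 1000, size=(1, 5)).astype(np.int64)

outputs = session.run(None, {"image_input": image_input, "decoder_input": decoder_input})
print([o.shape for o in outputs])

..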

Thank you.🙇🏻‍♂️

2/17/2023

Overview of Image Retrieval Applications for Finding Images by Visual and Text Features

 Here are a few examples of image retrieval applications:

  1. Google Images: A popular image search engine that allows you to search for images using keywords and filters, such as color, size, and type. Google Images uses a combination of text and visual features to match images to search queries.

  2. TinEye: A reverse image search engine that allows you to find where an image appears online or to search for similar images based on visual features. TinEye uses image recognition technology to analyze the content of images and identify matches.

  3. Clarifai: An image and video recognition platform that allows you to search for images based on visual features such as color, texture, and object category, as well as text features such as captions and tags. Clarifai uses deep learning models to extract and analyze visual and textual features from images.

  4. Microsoft Bing Visual Search: A search engine that allows you to search for images using visual and text features, such as color, object category, and image similarity. Bing Visual Search uses deep learning models to analyze visual features and search algorithms to find similar images.

  5. Amazon Rekognition: An image and video analysis service that allows you to search for images based on visual features such as faces, objects, and scenes, as well as text features such as captions and tags. Amazon Rekognition uses deep learning models to extract and analyze visual and textual features from images.



Thank you.

www.marearts.com

🙇🏻‍♂️

2/16/2023

Example of replacing the ConvNext classifier layer

 refer to code


..

import torch.nn as nn
import torchvision

# Load the pre-trained ConvNext model
model = torchvision.models.convnext_base(pretrained=True, stochastic_depth_prob=0.1, layer_scale=1e-4)

# Define a new linear layer with 10 output channels
new_linear_layer = nn.Linear(1024, 10)

# Replace the last linear layer in the classifier with the new one
classifier = model.classifier
classifier[-1] = new_linear_layer

# Set the modified classifier as the new classifier for the model
model.classifier = classifier

# Print the modified model architecture
print(model)

..



In this code, we first load the pre-trained ConvNext model using torchvision.models.convnext_base. 
We then define a new linear layer with 10 output channels using nn.Linear. 
Next, we extract the existing classifier from the model using model.classifier, and replace the last linear layer in the classifier with the new one using indexing ([-1]). 
Finally, we set the modified classifier as the new classifier for the ConvNext model using model.classifier = classifier.

This code should print out the modified model architecture, which will have a classifier layer that is identical to the original ConvNext classifier except for the number of output channels in the last linear layer.
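
If you want to confirm the change, here is a small sanity-check sketch. Pretrained weights are skipped here because only the output shape matters:

..

import torch
import torch.nn as nn
import torchvision

# Rebuild the modified model without downloading weights and check the new output size
model = torchvision.models.convnext_base(pretrained=False)
model.classifier[-1] = nn.Linear(1024, 10)

model.eval()
with torch.no_grad():
    dummy = torch.randn(2, 3, 224, 224)  # batch of 2 RGB 224x224 images
    out = model(dummy)

print(out.shape)  # expected: torch.Size([2, 10])

..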

Thank you.