MareArts Computer Vision Study.: 2024

12/31/2024

Persion Detection dataset

# The Ultimate Guide to Person Detection Datasets (2024 Edition)

Are you working on a computer vision project involving person detection? Choosing the right dataset can make or break your model's performance. In this comprehensive guide, we'll explore the best person detection datasets available in 2024, from industry standards to exciting new releases.

## Table of Contents

- [Industry Standard Datasets](#industry-standard-datasets)

- [Specialized Datasets](#specialized-datasets)

- [New Datasets for 2024](#new-datasets-for-2024)

- [How to Choose the Right Dataset](#how-to-choose-the-right-dataset)

## Industry Standard Datasets

### COCO (Common Objects in Context)

**The Gold Standard for Computer Vision**

- **Size**: 200,000+ images with 250,000+ person instances

- **What Makes It Special**:

- Diverse scenarios and lighting conditions

- High-quality annotations including segmentation masks

- Regular updates and strong community support

- **Best For**: General-purpose detection and benchmarking

### CrowdHuman

**Your Go-To for Crowded Scenes**

- **Size**: 15,000 images containing 470,000 person instances

- **Standout Features**:

- Average of 22.6 people per image

- Multiple annotation types (full body, visible body, head)

- Real-world crowd scenarios

- **Best For**: Surveillance systems and crowd monitoring

### MOT20

**Perfect for Video Applications**

- **Size**: 2.2M+ annotated boxes across video sequences

- **Key Strengths**:

- Temporal information

- Challenging crowd scenarios

- Moving camera situations

- **Best For**: Multi-object tracking and surveillance

## Specialized Datasets

### CityPersons

**Urban Environment Specialist**

- **Size**: 35,000 person instances

- **Resolution**: Crisp 2048x1024 images

- **Perfect For**:

- Autonomous driving

- Urban surveillance

- Street-level analysis

### SCUT-HEAD

**Head Detection Expert**

- **Size**: 4,500 images with 111,000 head annotations

- **Unique Features**:

- Specialized for head detection

- Various viewing angles

- Crowd density information

- **Best For**: Head counting and crowd analysis

## New Datasets for 2024

### HumanFlow

**Revolutionary Crowd Analysis**

- **Focus**: Dense crowd movement patterns

- **Size**: 50,000+ tracked trajectories

- **Unique Offering**: Group behavior analysis and flow patterns

### NightPersons

**Low-Light Detection Champion**

- **Specialty**: Night-time and low-light scenarios

- **Size**: 25,000 annotated instances

- **Extra Value**: Multi-spectrum data including thermal imaging

### MultiViewPeople

**Multi-Camera Innovation**

- **Size**: 1M+ synchronized frames

- **Highlight Features**:

- Multiple synchronized camera views

- Indoor and outdoor scenarios

- Activity labels

## How to Choose the Right Dataset

### 1. Consider Your Application

- **General Detection**: Start with COCO

- **Crowd Analysis**: CrowdHuman is your friend

- **Urban/Traffic**: CityPersons won't disappoint

- **Night Operations**: NightPersons is essential

- **Multi-Camera Setup**: MultiViewPeople has you covered

### 2. Check Your Resources

- **Storage Capacity**: Larger datasets need more space

- **Computing Power**: Consider your training infrastructure

- **Time Constraints**: Smaller datasets might be sufficient for prototyping

### 3. Evaluate Data Quality

- Look for consistent annotations

- Check update frequency

- Consider community support and available tools

### 4. Think About Your Environment

- Indoor vs. outdoor requirements

- Lighting conditions

- Camera angles and positions

- Scene complexity

## Conclusion

The perfect dataset for your person detection project depends on your specific needs. While COCO remains the industry standard, specialized datasets like CrowdHuman or the new NightPersons might better suit your particular use case. Don't be afraid to combine multiple datasets for better results!

### Pro Tips

1. Start with a smaller subset for initial testing

2. Consider data augmentation to enhance diversity

3. Check licensing terms before using in commercial projects

4. Look for datasets with similar conditions to your deployment environment

Need help getting started? Drop a comment below, and I'll be happy to help you choose the right dataset for your project!

---

*Last updated: December 2024*

hipblas, cublas algorithm

The HipBLASLt tuning process and algorithm selection is based on these factors in your data:

```
dev_cap,m,n,k,trans_a,trans_b,type_a,type_b,type_d,bias_type,lda,ldb,ldd,epi,comp,scale,ws_min,ws_max,algo_id,aidx
```

Key parameters:
1. Matrix Dimensions:
- `m,n,k`: Matrix dimensions for GEMM operations
- Example: `904,8192,2048,8192` = matrix sizes

2. Data Types:
- `type_a,type_b`: Input types (float8e4m3, bfloat16)
- `type_d`: Output type (bfloat16)
- `comp`: Computation type (f32)

3. Memory Layout:
- `trans_a,trans_b`: Matrix transposition (T=transposed, N=not)
- `lda,ldb,ldd`: Leading dimensions

4. Algorithm Selection:
- `algo_id`: Specific algorithm identifier
- `aidx`: Algorithm variant index
- workspace limits: `ws_min,ws_max`

The tuning process (`TE_HIPBLASLT_TUNING_RUN_COUNT=30` and `TE_HIPBLASLT_TUNING_ALGO_COUNT=100`) tests different combinations and selects the best based on:
1. Performance (speed)
2. Numerical stability
3. Memory usage
4. Hardware compatibility (dev_cap=904)

This tuning happens in the Tensor Engine (TE) library during the GEMM operations.

Insight about "fsdp_config.activation_checkpointing" option

The `fsdp_config.activation_checkpointing` does come with a computational overhead since it recomputes activations during the backward pass, but it's generally the most efficient option for large models like LLaMA 70B for several reasons:

1. Alternative memory saving options usually have bigger throughput impacts:

- `activation_cpu_offload=true`: Much slower due to CPU-GPU data transfer

- `fp32_cpu_offload=true`: Also involves slow CPU-GPU transfers

- Reducing batch size: Directly reduces throughput

- Reducing model size: Changes model behavior

2. Activation checkpointing's overhead is typically around 20-30% compute time, but:

- Only affects backward pass

- Modern GPUs often have compute headroom

- Memory savings (50-70%) usually outweigh the computation cost

3. The recompute strategy is smart:

- Only recomputes necessary activations

- Can utilize GPU compute while other operations are happening

- Modern implementations are optimized to minimize recomputation

So while it's not zero-cost, it's usually the minimum-impact option that allows large models to fit in memory while maintaining reasonable throughput.

12/30/2024

What is "GPU_MAX_HW_QUEUES"?

GPU_MAX_HW_QUEUES controls the maximum number of hardware queues that can be used for GPU operations in ROCm/HIP environment (AMD GPUs).

Specifically:

Each hardware queue can handle GPU operations (kernels) independently
Default value is typically 8
In your script, it's set to 4: export GPU_MAX_HW_QUEUES=4

This setting affects:

Parallel kernel execution
Memory transfers
Overall GPU utilization
Resource allocation

Lower values (like 4) can:

Reduce overhead
Improve memory usage
Lead to more predictable performance
Be better for large models like LLaMA 70B

Higher values can:

Allow more parallel operations
Potentially improve throughput for smaller models
Use more GPU resources

In your case, setting GPU_MAX_HW_QUEUES=4 is a conservative choice that helps manage memory and scheduling overhead when training the large LLaMA 70B model.

12/28/2024

fashion ai dataset, ordered by years

2024-2023:

DeepFashion2 (2023 Update)

491K images, 801K clothing items
13 clothing categories
Paired cross-pose images
High resolution (1024x768)
Style, occlusion, landmarks annotations

FashionAI (2023)

180K+ images
Hierarchical attribute system
Focus on e-commerce applications
Multi-label classification
Fine-grained attribute annotations

ACGPN Dataset (2023)

40K high-resolution images
Detailed semantic parsing maps
Virtual try-on ready
Human pose annotations included

2022-2021:

VITON-HD (2022)

13,679 front-view pairs
High resolution (1024x768)
Clean background images
Precise segmentation masks

LIP Dataset (2022 Version)

50K images
19 semantic parts
Instance-level human parsing
Multiple viewpoints

Fashion-MNIST+ (2021)

Enhanced version of Fashion-MNIST
70K images
Additional attribute annotations
Higher resolution than original

2020-2019:

DeepFashion2 (Original 2019)

191K images
13 clothing categories
Commercial-consumer image pairs
Landmark detection

FashionGen (2019)

325K images
Multi-modal fashion dataset
Text descriptions included
Attribute annotations

2018-2017:

ModaNet (2018)

55K street-style images
13 clothing categories
Pixel-level segmentation
Built on Paperdoll dataset

DeepFashion (2017)

800K images
50 clothing categories
Multiple tasks (category/attribute prediction)
Landmark detection

2016-2015:

Clothing Co-Parsing (CCP)

2,098 images
59 clothing categories
Pixel-level annotations
Early benchmark dataset

Fashion10000 (2015)

32K images
Basic attribute labels
Focus on style classification

Key Trends Over Time:

Resolution: Steady increase from 224x224 to 1024x768+
Dataset Size: Growing from thousands to hundreds of thousands
Annotation Quality: Moving from basic labels to multi-task annotations
Real-world Applicability: More focus on practical use cases
Diversity: Including more poses, styles, and demographics
Task Coverage: From simple classification to complex parsing/virtual try-on

Fashion ai dataset.

ModaNet (2018) was groundbreaking but there have been several more recent datasets and models for fashion segmentation and analysis. Here are some notable recent ones:

DeepFashion2 (2023 Update)

491K images with 801K clothing items
13 clothes categories (similar to ModaNet)
More detailed annotations including style, occlusion, zoom-in
Higher quality annotations and more diverse images
Link: https://github.com/switchablenorms/DeepFashion2

VITON-HD (2022)

High resolution virtual try-on dataset
13,679 front-view woman/clothing image pairs
High quality segmentation masks
Particularly good for virtual try-on applications

FashionAI Dataset (2023)

From Alibaba
Over 180K images
Focus on attribute recognition
Detailed hierarchical attribute annotations
More modern fashion styles and better image quality

LIP (Look Into Person) Dataset (2022 version)

50,000 images with pixel-level annotations
19 semantic human part labels
Multiple viewpoints and poses
Human parsing focused but includes detailed clothing segmentation

ACGPN Dataset (2023)

40,000 high-resolution person images
Detailed parsing maps
Semantic segmentation for clothes
Focuses on both parsing and virtual try-on

Key Improvements in Recent Datasets:

Higher resolution images
Better annotation quality
More diverse poses and viewpoints
More modern fashion styles
Better handling of occlusion and layering
More detailed attribute annotations
Multi-task annotations (segmentation + attributes + landmarks)

For your specific use case, I would recommend:

DeepFashion2 as your primary dataset - it's the most comprehensive and recent
Augment with ACGPN if you need higher resolution images
Consider FashionAI if you need very detailed attribute recognition

"fashion attribute recognition" or "clothing parsing"

AI model for clothing analysis and attribute extraction from person images. This is often called "fashion attribute recognition" or "clothing parsing" in computer vision.

For this task, you'll want to consider several components:

Person/Clothing Segmentation

First, you'll need to segment different clothing items
Models like DeepFashion2 or ModaNet provide good architectures for this
You can use Mask R-CNN or similar instance segmentation models as a base

Attribute Recognition For each segmented clothing item, you'll need to recognize:

Category (top, pants, hat, etc.)
Color
Material
Pattern
Style/type
Specific attributes (collar type, sleeve length, etc.)

Available Datasets:

DeepFashion Dataset

Over 800,000 images
50 clothing categories
Multiple attributes per item
Includes landmarks and segmentation
Good for both segmentation and attribute recognition

ModaNet

About 55,000 fully annotated images
13 clothing categories
Instance segmentation masks
Strong street-style focus

Fashion-MNIST

Simpler dataset, good for initial testing
70,000 grayscale images
10 clothing categories
Limited attributes

Clothing Co-Parsing (CCP) Dataset

2,098 fashion images
59 clothing categories
Pixel-level annotations
Good for fine-grained parsing

Recommended Approach:

Model Architecture:

Use a two-stage approach: a. First stage: Mask R-CNN or YOLOv8 for segmentation b. Second stage: ResNet or EfficientNet backbone with attribute-specific heads

Training Strategy:

Pre-train on large datasets like DeepFashion
Fine-tune on your specific use case
Use multi-task learning for different attributes

Implementation Frameworks:

PyTorch or TensorFlow
Consider using MMFashion (open-source fashion analysis toolbox)
HuggingFace Transformers for recent vision models

12/25/2024

Installing cuDNN on Ubuntu 22.04

Step 1: Download cuDNN

Go to https://developer.nvidia.com/cudnn
Sign in to your NVIDIA Developer account (or create one if needed)
Navigate to Downloads
Find and download cuDNN v9.6.0 for Ubuntu 22.04 (.deb package)

Step 2: Install cuDNN

Run these commands in order:

# Install the downloaded package
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.6.0_1.0-1_amd64.deb

# Copy the keyring
sudo cp /var/cudnn-local-repo-ubuntu2204-9.6.0/cudnn-*-keyring.gpg /usr/share/keyrings/

# Update package list
sudo apt-get update

# Install cuDNN
sudo apt-get -y install cudnn

# Install CUDA 12 specific package
sudo apt-get -y install cudnn-cuda-12

Step 3: Verify Installation

# Check if cuDNN is installed correctly
find /usr -name "libcudnn.so*"

Note: Direct download links won't work - you must download through NVIDIA's website after logging in.

12/11/2024

Pedestrian and human attribute dataset.

For Pedestrian Detection:

CityPersons - High-quality pedestrian detection dataset with diverse urban scenes from multiple European cities
Caltech Pedestrian Dataset - Contains approximately 250,000 frames with 350,000 bounding boxes and 2,300 unique pedestrians
INRIA Person Dataset - Includes full-body pedestrians in various poses and backgrounds
MOT (Multiple Object Tracking) Dataset - Contains pedestrians in crowded scenes

For Human Attribute Analysis:

RAP (Richly Annotated Pedestrian) Dataset - Over 40 attributes including clothing types, colors, and accessories
PETA Dataset - Large-scale surveillance person attribute dataset with 19,000 images
Market-1501 Attribute Dataset - Contains 27 attributes for clothing and personal items
DeepFashion Dataset - Focuses on clothing items with detailed annotations

Some considerations when choosing a dataset:

Make sure to check the license terms for each dataset
Consider the image quality and diversity needed for your specific use case
Check if the annotations match your requirements (bounding boxes, attributes, etc.)
Verify that the dataset size is sufficient for your model training needs

11/19/2024

Print detail Model structure

Refer to code

def print_model_structure(model, indent=0):
        for name, child in model.named_children():
            print(' ' * indent + f'└─ {name}: {child.__class__.__name__}')
            if list(child.children()):
                print_model_structure(child, indent + 2)

    print_model_structure(composer_model)

This is Llama 3.1 8b Model Structure

└─ model: LlamaForCausalLM
  └─ model: LlamaModel
    └─ embed_tokens: Embedding
    └─ layers: ModuleList
      └─ 0: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 1: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 2: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 3: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 4: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 5: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 6: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 7: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 8: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 9: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 10: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 11: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 12: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 13: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 14: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 15: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 16: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 17: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 18: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 19: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 20: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 21: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 22: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 23: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 24: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 25: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 26: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 27: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 28: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 29: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 30: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 31: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
    └─ norm: LlamaRMSNorm
    └─ rotary_emb: LlamaRotaryEmbedding
  └─ lm_head: Linear

11/17/2024

Hook Llama 3.1 8b layer and print dimension

refer to code

def register_dimension_hooks(model, rank):
    if rank != 0:
        return
        
    print('\n------------------- Model Structure -------------------')
    print("Model type:", type(model))
    
    # Get the actual model through the wrapper layers
    if hasattr(model, 'model'):
        model = model.model
        if hasattr(model, 'model'):
            model = model.model
    
    print("Base model type:", type(model))
    
    def make_hook(name, rank):
        def hook(module, input, output):
            print(f"\n--------------- Hook: {name} ---------------")
            if hasattr(module, 'weight'):
                weight = module.weight
                print(f"GPU {rank} - {name}:")
                print(f"Input shape: {input[0].shape}")
                if hasattr(weight, '_local_tensor'):
                    local_weight = weight._local_tensor
                    print(f"Local weight shape: {local_weight.shape}")
                print(f"Global weight shape: {weight.shape}")
                if hasattr(weight, 'device_mesh'):
                    print(f"Device mesh: {weight.device_mesh}")
                    print(f"Placement: {weight.placements}")
                print(f"Output shape: {output.shape}")
            print("-" * 50)
        return hook

    # Register hooks for embedding layer
    if hasattr(model, 'embed_tokens'):
        print("Found embed_tokens")
        model.embed_tokens.register_forward_hook(make_hook('embed_tokens', rank))

    # Register hooks for all transformer layers
    if hasattr(model, 'layers'):
        for i, layer in enumerate(model.layers):
            # Attention blocks
            layer.self_attn.q_proj.register_forward_hook(
                make_hook(f'layer_{i}_q_proj', rank))
            layer.self_attn.k_proj.register_forward_hook(
                make_hook(f'layer_{i}_k_proj', rank))
            layer.self_attn.v_proj.register_forward_hook(
                make_hook(f'layer_{i}_v_proj', rank))
            layer.self_attn.o_proj.register_forward_hook(
                make_hook(f'layer_{i}_o_proj', rank))
            
            # MLP blocks
            layer.mlp.gate_proj.register_forward_hook(
                make_hook(f'layer_{i}_mlp_gate_proj', rank))
            layer.mlp.up_proj.register_forward_hook(
                make_hook(f'layer_{i}_mlp_up_proj', rank))
            layer.mlp.down_proj.register_forward_hook(
                make_hook(f'layer_{i}_mlp_down_proj', rank))
            
            # Layer norms
            layer.input_layernorm.register_forward_hook(
                make_hook(f'layer_{i}_input_layernorm', rank))
            layer.post_attention_layernorm.register_forward_hook(
                make_hook(f'layer_{i}_post_attention_layernorm', rank))

    # Register hook for final layer norm
    if hasattr(model, 'norm'):
        model.norm.register_forward_hook(make_hook('final_layernorm', rank))

    # Register hook for LM head
    if hasattr(model, 'lm_head'):
        print("Found lm_head")
        model.lm_head.register_forward_hook(make_hook('lm_head', rank))

    # Print model structure to debug
    print("\nModel attributes:", dir(model))

Thank you.

11/03/2024

Auto Number Plate Recognition (ANPR), SDK source code

# install

pip install marearts-anpr

# code

# pip install marearts-anpr

import cv2

from PIL import Image

from marearts_anpr import ma_anpr_detector

from marearts_anpr import ma_anpr_ocr

from marearts_anpr import marearts_anpr_from_pil

from marearts_anpr import marearts_anpr_from_image_file

from marearts_anpr import marearts_anpr_from_cv2

if __name__ == '__main__':

    #################################

    ## Initiate MareArts ANPR

    print("EU ANPR")

    user_name = "your_email"

    serial_key = "your_serial_key"

    detector_model_version = "middle" # Options: refer to detector model table

    ocr_model_version = "eu" # Options: refer to ocr model table

    # MareArts ANPR Detector Inference

    anpr_d = ma_anpr_detector(detector_model_version, user_name, serial_key, conf_thres=0.3, iou_thres=0.5)

    # MareArts ANPR OCR Inference

    anpr_r = ma_anpr_ocr(ocr_model_version, user_name, serial_key)

    #################################

    #################################

    # Routine Task 1 - Predict from File

    image_path = './sample_images/eu_test1.jpg'

    output = marearts_anpr_from_image_file(anpr_d, anpr_r, image_path)

    print(output)

    # Routine Task 2 - Predict from cv2

    img = cv2.imread(image_path)

    output = marearts_anpr_from_cv2(anpr_d, anpr_r, img)

    print(output)

    # Routine Task 3 - Predict from Pillow

    pil_img = Image.open(image_path)

    output = marearts_anpr_from_pil(anpr_d, anpr_r, pil_img)

    print(output)

    #################################

    #################################

    ## Initiate MareArts ANPR for Korea

    print("ANPR Korean")

    # user_name, serial_key are already defined

    # anpr_d is also already initiated before

    ocr_model_version = "kr"

    # MareArts ANPR OCR Inference

    anpr_r = ma_anpr_ocr(ocr_model_version, user_name, serial_key)

    #################################

    # Routine Task 1 - Predict from File

    image_path = './sample_images/kr_test2.jpg'

    output = marearts_anpr_from_image_file(anpr_d, anpr_r, image_path)

    print(output)

    # Routine Task 2 - Predict from cv2 

    img = cv2.imread(image_path)

    output = marearts_anpr_from_cv2(anpr_d, anpr_r, img)

    print(output)

    # Routine Task 3 - Predict from Pillow

    pil_img = Image.open(image_path)

    output = marearts_anpr_from_pil(anpr_d, anpr_r, pil_img)

    print(output)

    #################################

# Ask license is here: https://study.marearts.com/p/anpr-lpr-solution.html

# Live Test is here: https://live.marearts.com

10/30/2024

brief explain about "Audio → Spectrogram → Mel-spectrogram → MFCC"

Audio → Spectrogram → Mel-spectrogram → MFCC

Spectrogram = raw photo

Mel-spectrogram = photo adjusted for human vision

MFCC = compressed, essential features extracted from that photo

Spectrogram

Raw time-frequency representation
Shows energy at each frequency over time
Doesn't account for human perception

Mel-spectrogram

Spectrogram mapped to mel scale
Mimics human frequency perception
Still maintains all frequency band information

MFCC

Derived FROM the mel-spectrogram
Additional step: DCT (Discrete Cosine Transform) is applied
Keeps only lower coefficients (dimensionality reduction)
Decorrelates features

Audio → Spectrogram
- Start with raw audio waveform
- Apply pre-emphasis to boost higher frequencies
- Frame the signal into short segments (typically 20-40ms with overlap)
- Apply window function (usually Hamming) to reduce edge effects
- Perform FFT on each frame
- Calculate power spectrum (|FFT|²)
Spectrogram → Mel-spectrogram
- Create mel filter banks (triangular overlapping windows)
- Convert frequencies to mel scale using formula: mel = 2595 * log10(1 + f/700)
- Apply mel filter banks to power spectrum
- Sum up the energy in each mel band
Mel-spectrogram → MFCC
- Take logarithm of mel filter bank energies (to match human perception)
- Apply Discrete Cosine Transform (DCT)
- Keep first N coefficients (typically 13-39)
- Optionally:
  - Calculate delta (velocity) features
  - Calculate delta-delta (acceleration) features
  - Apply cepstral mean normalization (CMN)

10/26/2024

Download Youtube Video as best Quality

code..

import yt_dlp
import os
from typing import Optional

def format_size(bytes):
    """Convert bytes to human readable format"""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if bytes < 1024:
            return f"{bytes:.2f} {unit}"
        bytes /= 1024
    return f"{bytes:.2f} TB"

def download_video(url: str, output_path: Optional[str] = None) -> str:
    """
    Download a YouTube video in the best quality using yt-dlp.
    
    Args:
        url (str): The URL of the YouTube video
        output_path (str, optional): Directory to save the video
    """
    try:
        if not output_path:
            output_path = os.getcwd()
        
        os.makedirs(output_path, exist_ok=True)
        
        # Configure yt-dlp options for best quality
        ydl_opts = {
            'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best',  # Best video + audio quality
            'outtmpl': os.path.join(output_path, '%(title)s.%(ext)s'),
            'merge_output_format': 'mp4',  # Merge to MP4
            'progress_hooks': [lambda d: print(f"\rDownloading: {d['_percent_str']} of {d['_total_bytes_str']}", end="") if d['status'] == 'downloading' else None],
            'postprocessor_hooks': [lambda d: print("\nMerging video and audio...") if d['status'] == 'started' else None],
            'quiet': False,
            'no_warnings': False,
            # Additional options for best quality
            'format_sort': ['res:2160', 'res:1440', 'res:1080', 'res:720'],
            'video_multistreams': True,
            'audio_multistreams': True,
            'prefer_free_formats': True,
            'postprocessors': [{
                'key': 'FFmpegVideoConvertor',
                'preferedformat': 'mp4',
            }],
        }
        
        print(f"Fetching video information...")
        
        # Create yt-dlp object and download the video
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            # Get video info first
            info = ydl.extract_info(url, download=False)
            video_title = info.get('title', 'video')
            duration = info.get('duration')
            formats = info.get('formats', [])
            
            # Find best quality format
            best_video = max(
                (f for f in formats if f.get('vcodec') != 'none'),
                key=lambda f: (
                    f.get('height', 0),
                    f.get('filesize', 0)
                ),
                default=None
            )
            
            # Print video details
            print(f"\nVideo details:")
            print(f"Title: {video_title}")
            print(f"Duration: {duration//60}:{duration%60:02d}")
            if best_video:
                print(f"Best quality available: {best_video.get('height', 'N/A')}p")
                if best_video.get('filesize'):
                    print(f"Approximate size: {format_size(best_video['filesize'])}")
            
            print("\nStarting download in best quality...")
            # Download the video
            ydl.download([url])
            
            # Get the output filename
            output_file = os.path.join(output_path, f"{video_title}.mp4")
            
            print(f"\nDownload completed successfully!")
            print(f"Saved to: {output_file}")
            
            return output_file
            
    except Exception as e:
        print(f"\nError: {str(e)}")
        print("\nTroubleshooting steps:")
        print("1. Check if the video URL is correct")
        print("2. Check your internet connection")
        print("3. Make sure yt-dlp is up to date: pip install -U yt-dlp")
        print("4. Install or update ffmpeg (required for best quality):")
        print("   - On macOS: brew install ffmpeg")
        print("   - On Ubuntu/Debian: sudo apt-get install ffmpeg")
        print("   - On Windows: download from https://ffmpeg.org/download.html")
        return ""

def main():
    """
    Main function to handle user input for video download.
    """
    print("YouTube Video Downloader (Best Quality)")
    print("-------------------------------------")
    print("This will download videos in the highest available quality")
    print("Note: Higher quality downloads may take longer and use more disk space")
    
    while True:
        url = input("\nEnter the YouTube video URL (or 'q' to quit): ").strip()
        
        if url.lower() == 'q':
            print("Goodbye!")
            break
            
        if not url:
            print("Please enter a valid URL")
            continue
            
        download_video(url)
        
        choice = input("\nWould you like to download another video? (y/n): ").strip().lower()
        if choice != 'y':
            print("Goodbye!")
            break

if __name__ == "__main__":
    main()

That's it.

but install this

pip install yt-dlp

Thank you!!!

Pages