12/31/2024

Person Detection dataset

# The Ultimate Guide to Person Detection Datasets (2024 Edition)

Are you working on a computer vision project involving person detection? Choosing the right dataset can make or break your model's performance. In this comprehensive guide, we'll explore the best person detection datasets available in 2024, from industry standards to exciting new releases.

## Table of Contents
- [Industry Standard Datasets](#industry-standard-datasets)
- [Specialized Datasets](#specialized-datasets)
- [New Datasets for 2024](#new-datasets-for-2024)
- [How to Choose the Right Dataset](#how-to-choose-the-right-dataset)

## Industry Standard Datasets

### COCO (Common Objects in Context)
**The Gold Standard for Computer Vision**

- **Size**: 200,000+ images with 250,000+ person instances
- **What Makes It Special**: 
  - Diverse scenarios and lighting conditions
  - High-quality annotations including segmentation masks
  - Regular updates and strong community support
- **Best For**: General-purpose detection and benchmarking (see the loading sketch below)
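
If you go with COCO, pulling out just the person annotations with `pycocotools` might look like the following. This is a minimal sketch; the annotation file path is an assumption about your local setup.

```python
# Minimal sketch (assumes pycocotools is installed and the COCO 2017
# annotations are downloaded locally; the path below is a placeholder).
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")
person_cat_ids = coco.getCatIds(catNms=["person"])   # the 'person' category
img_ids = coco.getImgIds(catIds=person_cat_ids)      # images with >= 1 person
ann_ids = coco.getAnnIds(imgIds=img_ids[:1], catIds=person_cat_ids, iscrowd=None)
annotations = coco.loadAnns(ann_ids)                 # boxes + segmentation masks
print(f"{len(img_ids)} images contain people; first image has {len(annotations)} person annotations")
```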

### CrowdHuman
**Your Go-To for Crowded Scenes**

- **Size**: 15,000 images containing 470,000 person instances
- **Standout Features**:
  - Average of 22.6 people per image
  - Multiple annotation types (full body, visible body, head)
  - Real-world crowd scenarios
- **Best For**: Surveillance systems and crowd monitoring

### MOT20
**Perfect for Video Applications**

- **Size**: 2.2M+ annotated boxes across video sequences
- **Key Strengths**:
  - Temporal information
  - Challenging crowd scenarios
  - Moving camera situations
- **Best For**: Multi-object tracking and surveillance

## Specialized Datasets

### CityPersons
**Urban Environment Specialist**

- **Size**: 35,000 person instances
- **Resolution**: Crisp 2048x1024 images
- **Perfect For**: 
  - Autonomous driving
  - Urban surveillance
  - Street-level analysis

### SCUT-HEAD
**Head Detection Expert**

- **Size**: 4,500 images with 111,000 head annotations
- **Unique Features**:
  - Specialized for head detection
  - Various viewing angles
  - Crowd density information
- **Best For**: Head counting and crowd analysis

## New Datasets for 2024

### HumanFlow
**Revolutionary Crowd Analysis**

- **Focus**: Dense crowd movement patterns
- **Size**: 50,000+ tracked trajectories
- **Unique Offering**: Group behavior analysis and flow patterns

### NightPersons
**Low-Light Detection Champion**

- **Specialty**: Night-time and low-light scenarios
- **Size**: 25,000 annotated instances
- **Extra Value**: Multi-spectrum data including thermal imaging

### MultiViewPeople
**Multi-Camera Innovation**

- **Size**: 1M+ synchronized frames
- **Highlight Features**:
  - Multiple synchronized camera views
  - Indoor and outdoor scenarios
  - Activity labels

## How to Choose the Right Dataset

### 1. Consider Your Application
- **General Detection**: Start with COCO
- **Crowd Analysis**: CrowdHuman is your friend
- **Urban/Traffic**: CityPersons won't disappoint
- **Night Operations**: NightPersons is essential
- **Multi-Camera Setup**: MultiViewPeople has you covered

### 2. Check Your Resources
- **Storage Capacity**: Larger datasets need more space
- **Computing Power**: Consider your training infrastructure
- **Time Constraints**: Smaller datasets might be sufficient for prototyping

### 3. Evaluate Data Quality
- Look for consistent annotations
- Check update frequency
- Consider community support and available tools

### 4. Think About Your Environment
- Indoor vs. outdoor requirements
- Lighting conditions
- Camera angles and positions
- Scene complexity

## Conclusion

The perfect dataset for your person detection project depends on your specific needs. While COCO remains the industry standard, specialized datasets like CrowdHuman or the new NightPersons might better suit your particular use case. Don't be afraid to combine multiple datasets for better results!
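
If you do combine datasets, PyTorch's `ConcatDataset` makes this straightforward. Below is a minimal, self-contained sketch with a dummy dataset standing in for real COCO/CrowdHuman wrappers (the class name and sample format are placeholders, not from any specific library).

```python
# Minimal sketch: combining two person-detection datasets with ConcatDataset.
# DummyPersonDataset is a stand-in for real dataset wrappers, which must all
# return samples in the same (image, target) format.
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class DummyPersonDataset(Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples
    def __len__(self):
        return self.num_samples
    def __getitem__(self, idx):
        # A real wrapper would return (image_tensor, {"boxes": ..., "labels": ...})
        return idx

combined = ConcatDataset([DummyPersonDataset(100), DummyPersonDataset(50)])
loader = DataLoader(combined, batch_size=8, shuffle=True)
print(len(combined))  # 150 samples drawn from both sources
```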

### Pro Tips
1. Start with a smaller subset for initial testing
2. Consider data augmentation to enhance diversity
3. Check licensing terms before using in commercial projects
4. Look for datasets with similar conditions to your deployment environment

Need help getting started? Drop a comment below, and I'll be happy to help you choose the right dataset for your project!

---
*Last updated: December 2024*

hipBLAS / cuBLAS algorithm selection



The hipBLASLt tuning process and algorithm selection are based on these fields in your data:

```
dev_cap,m,n,k,trans_a,trans_b,type_a,type_b,type_d,bias_type,lda,ldb,ldd,epi,comp,scale,ws_min,ws_max,algo_id,aidx
```

Key parameters:
1. Matrix Dimensions:
- `m,n,k`: Matrix dimensions for GEMM operations
- Example: in a row like `904,8192,2048,8192,...`, the `8192,2048,8192` part is m, n, k (the leading `904` is `dev_cap`)

2. Data Types:
- `type_a,type_b`: Input types (float8e4m3, bfloat16)
- `type_d`: Output type (bfloat16)
- `comp`: Computation type (f32)

3. Memory Layout:
- `trans_a,trans_b`: Matrix transposition (T=transposed, N=not)
- `lda,ldb,ldd`: Leading dimensions

4. Algorithm Selection:
- `algo_id`: Specific algorithm identifier
- `aidx`: Algorithm variant index
- workspace limits: `ws_min,ws_max`

The tuning process (`TE_HIPBLASLT_TUNING_RUN_COUNT=30` and `TE_HIPBLASLT_TUNING_ALGO_COUNT=100`) tests different combinations and selects the best based on:
1. Performance (speed)
2. Numerical stability
3. Memory usage
4. Hardware compatibility (dev_cap=904)

This tuning happens in the Tensor Engine (TE) library during the GEMM operations.
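
For example, one row of such a tuning table can be parsed into named fields like this. This is a minimal sketch; the sample row and its values are hypothetical, only the column order comes from the header above.

```python
# Minimal sketch: parse one hipBLASLt tuning-table row into named fields.
# The header order matches the CSV line above; the sample row is made up.
HEADER = ("dev_cap,m,n,k,trans_a,trans_b,type_a,type_b,type_d,bias_type,"
          "lda,ldb,ldd,epi,comp,scale,ws_min,ws_max,algo_id,aidx").split(",")

def parse_row(line: str) -> dict:
    return dict(zip(HEADER, line.strip().split(",")))

row = parse_row("904,8192,2048,8192,T,N,float8e4m3,bfloat16,bfloat16,bfloat16,"
                "8192,8192,8192,bias,f32,none,0,0,1234,5")
print(row["m"], row["n"], row["k"], row["algo_id"])  # GEMM shape and chosen algorithm
```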

Insight about "fsdp_config.activation_checkpointing" option

 The `fsdp_config.activation_checkpointing` does come with a computational overhead since it recomputes activations during the backward pass, but it's generally the most efficient option for large models like LLaMA 70B for several reasons:


1. Alternative memory saving options usually have bigger throughput impacts:

- `activation_cpu_offload=true`: Much slower due to CPU-GPU data transfer

- `fp32_cpu_offload=true`: Also involves slow CPU-GPU transfers

- Reducing batch size: Directly reduces throughput

- Reducing model size: Changes model behavior


2. Activation checkpointing's overhead is typically around 20-30% compute time, but:

- Only affects backward pass

- Modern GPUs often have compute headroom

- Memory savings (50-70%) usually outweigh the computation cost


3. The recompute strategy is smart:

- Only recomputes necessary activations

- Can utilize GPU compute while other operations are happening

- Modern implementations are optimized to minimize recomputation


So while it's not zero-cost, it's usually the minimum-impact option that allows large models to fit in memory while maintaining reasonable throughput.
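
To illustrate the recompute trade-off (not the exact FSDP code path), `torch.utils.checkpoint` shows the same idea in a minimal, self-contained form: the block's intermediate activations are not stored during the forward pass and are recomputed during backward.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Minimal sketch of activation recomputation (the same idea that
# fsdp_config.activation_checkpointing applies per FSDP-wrapped block).
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Forward without storing the block's intermediate activations;
# they are recomputed when .backward() runs.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # gradients match the non-checkpointed version
```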


12/30/2024

What is "GPU_MAX_HW_QUEUES"?

`GPU_MAX_HW_QUEUES` controls the maximum number of hardware queues that can be used for GPU operations in a ROCm/HIP environment (AMD GPUs).

Specifically:

  1. Each hardware queue can handle GPU operations (kernels) independently
  2. Default value is typically 8
  3. In your script, it's set to 4: export GPU_MAX_HW_QUEUES=4

This setting affects:

  • Parallel kernel execution
  • Memory transfers
  • Overall GPU utilization
  • Resource allocation

Lower values (like 4) can:

  • Reduce overhead
  • Improve memory usage
  • Lead to more predictable performance
  • Be better for large models like LLaMA 70B

Higher values can:

  • Allow more parallel operations
  • Potentially improve throughput for smaller models
  • Use more GPU resources

In your case, setting GPU_MAX_HW_QUEUES=4 is a conservative choice that helps manage memory and scheduling overhead when training the large LLaMA 70B model.
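
A minimal sketch of setting this from a Python launcher, under the assumption that the variable must be exported before the HIP runtime initializes (i.e. before importing torch):

```python
import os

# Assumption: GPU_MAX_HW_QUEUES is read when the HIP/ROCm runtime starts up,
# so set it before any ROCm-backed library (e.g. torch) is imported.
os.environ["GPU_MAX_HW_QUEUES"] = "4"

import torch  # imported after the env var on purpose

print("GPU_MAX_HW_QUEUES =", os.environ["GPU_MAX_HW_QUEUES"])
print("GPU available:", torch.cuda.is_available())
```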

12/28/2024

Fashion AI datasets, ordered by year

2024-2023:

  1. DeepFashion2 (2023 Update)
  • 491K images, 801K clothing items
  • 13 clothing categories
  • Paired cross-pose images
  • High resolution (1024x768)
  • Style, occlusion, landmarks annotations
  2. FashionAI (2023)
  • 180K+ images
  • Hierarchical attribute system
  • Focus on e-commerce applications
  • Multi-label classification
  • Fine-grained attribute annotations
  3. ACGPN Dataset (2023)
  • 40K high-resolution images
  • Detailed semantic parsing maps
  • Virtual try-on ready
  • Human pose annotations included

2022-2021:

  1. VITON-HD (2022)
  • 13,679 front-view pairs
  • High resolution (1024x768)
  • Clean background images
  • Precise segmentation masks
  2. LIP Dataset (2022 Version)
  • 50K images
  • 19 semantic parts
  • Instance-level human parsing
  • Multiple viewpoints
  3. Fashion-MNIST+ (2021)
  • Enhanced version of Fashion-MNIST
  • 70K images
  • Additional attribute annotations
  • Higher resolution than original

2020-2019:

  1. DeepFashion2 (Original 2019)
  • 191K images
  • 13 clothing categories
  • Commercial-consumer image pairs
  • Landmark detection
  2. FashionGen (2019)
  • 325K images
  • Multi-modal fashion dataset
  • Text descriptions included
  • Attribute annotations

2018-2017:

  1. ModaNet (2018)
  • 55K street-style images
  • 13 clothing categories
  • Pixel-level segmentation
  • Built on Paperdoll dataset
  2. DeepFashion (2017)
  • 800K images
  • 50 clothing categories
  • Multiple tasks (category/attribute prediction)
  • Landmark detection

2016-2015:

  1. Clothing Co-Parsing (CCP)
  • 2,098 images
  • 59 clothing categories
  • Pixel-level annotations
  • Early benchmark dataset
  2. Fashion10000 (2015)
  • 32K images
  • Basic attribute labels
  • Focus on style classification

Key Trends Over Time:

  1. Resolution: Steady increase from 224x224 to 1024x768+
  2. Dataset Size: Growing from thousands to hundreds of thousands
  3. Annotation Quality: Moving from basic labels to multi-task annotations
  4. Real-world Applicability: More focus on practical use cases
  5. Diversity: Including more poses, styles, and demographics
  6. Task Coverage: From simple classification to complex parsing/virtual try-on

Fashion AI dataset.

ModaNet (2018) was groundbreaking, but there have been several more recent datasets and models for fashion segmentation and analysis. Here are some notable recent ones:

DeepFashion2 (2023 Update)

  • 491K images with 801K clothing items
  • 13 clothes categories (similar to ModaNet)
  • More detailed annotations including style, occlusion, zoom-in
  • Higher quality annotations and more diverse images
  • Link: https://github.com/switchablenorms/DeepFashion2

VITON-HD (2022)

  • High resolution virtual try-on dataset
  • 13,679 front-view woman/clothing image pairs
  • High quality segmentation masks
  • Particularly good for virtual try-on applications

FashionAI Dataset (2023)

  • From Alibaba
  • Over 180K images
  • Focus on attribute recognition
  • Detailed hierarchical attribute annotations
  • More modern fashion styles and better image quality

LIP (Look Into Person) Dataset (2022 version)

  • 50,000 images with pixel-level annotations
  • 19 semantic human part labels
  • Multiple viewpoints and poses
  • Human parsing focused but includes detailed clothing segmentation

ACGPN Dataset (2023)

  • 40,000 high-resolution person images
  • Detailed parsing maps
  • Semantic segmentation for clothes
  • Focuses on both parsing and virtual try-on

Key Improvements in Recent Datasets:

  1. Higher resolution images
  2. Better annotation quality
  3. More diverse poses and viewpoints
  4. More modern fashion styles
  5. Better handling of occlusion and layering
  6. More detailed attribute annotations
  7. Multi-task annotations (segmentation + attributes + landmarks)

For your specific use case, I would recommend:

  1. DeepFashion2 as your primary dataset - it's the most comprehensive and recent
  2. Augment with ACGPN if you need higher resolution images
  3. Consider FashionAI if you need very detailed attribute recognition

"fashion attribute recognition" or "clothing parsing"

Building an AI model for clothing analysis and attribute extraction from person images: this task is often called "fashion attribute recognition" or "clothing parsing" in computer vision.

For this task, you'll want to consider several components:

  1. Person/Clothing Segmentation
  • First, you'll need to segment different clothing items
  • Models like DeepFashion2 or ModaNet provide good architectures for this
  • You can use Mask R-CNN or similar instance segmentation models as a base
  2. Attribute Recognition: for each segmented clothing item, you'll need to recognize:
  • Category (top, pants, hat, etc.)
  • Color
  • Material
  • Pattern
  • Style/type
  • Specific attributes (collar type, sleeve length, etc.)

Available Datasets:

  1. DeepFashion Dataset
  • Over 800,000 images
  • 50 clothing categories
  • Multiple attributes per item
  • Includes landmarks and segmentation
  • Good for both segmentation and attribute recognition
  2. ModaNet
  • About 55,000 fully annotated images
  • 13 clothing categories
  • Instance segmentation masks
  • Strong street-style focus
  3. Fashion-MNIST
  • Simpler dataset, good for initial testing
  • 70,000 grayscale images
  • 10 clothing categories
  • Limited attributes
  4. Clothing Co-Parsing (CCP) Dataset
  • 2,098 fashion images
  • 59 clothing categories
  • Pixel-level annotations
  • Good for fine-grained parsing

Recommended Approach:

  1. Model Architecture (see the sketch below):
  • Use a two-stage approach:
    a. First stage: Mask R-CNN or YOLOv8 for segmentation
    b. Second stage: ResNet or EfficientNet backbone with attribute-specific heads
  2. Training Strategy:
  • Pre-train on large datasets like DeepFashion
  • Fine-tune on your specific use case
  • Use multi-task learning for different attributes
  3. Implementation Frameworks:
  • PyTorch or TensorFlow
  • Consider using MMFashion (open-source fashion analysis toolbox)
  • HuggingFace Transformers for recent vision models
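
A minimal sketch of the second stage under these assumptions: a torchvision ResNet-50 backbone shared across attribute heads, with hypothetical attribute group sizes; the clothing crops from the first-stage segmentation are assumed to be prepared elsewhere.

```python
import torch
import torch.nn as nn
from torchvision import models

class AttributeHeadsModel(nn.Module):
    """Second-stage sketch: shared ResNet features + one head per attribute group.
    The head sizes below are hypothetical placeholders."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)  # or a pretrained checkpoint
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        feat_dim = backbone.fc.in_features  # 2048 for ResNet-50
        self.category_head = nn.Linear(feat_dim, 13)  # e.g. ModaNet-style categories
        self.color_head = nn.Linear(feat_dim, 12)     # hypothetical color classes
        self.pattern_head = nn.Linear(feat_dim, 8)    # hypothetical pattern classes

    def forward(self, crops):
        f = self.features(crops).flatten(1)
        return {
            "category": self.category_head(f),
            "color": self.color_head(f),
            "pattern": self.pattern_head(f),
        }

model = AttributeHeadsModel()
logits = model(torch.randn(4, 3, 224, 224))  # 4 clothing crops from stage one
print({k: v.shape for k, v in logits.items()})
```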

12/25/2024

Installing cuDNN on Ubuntu 22.04

 


Step 1: Download cuDNN

  1. Go to https://developer.nvidia.com/cudnn
  2. Sign in to your NVIDIA Developer account (or create one if needed)
  3. Navigate to Downloads
  4. Find and download cuDNN v9.6.0 for Ubuntu 22.04 (.deb package)

Step 2: Install cuDNN

Run these commands in order:

# Install the downloaded package
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.6.0_1.0-1_amd64.deb

# Copy the keyring
sudo cp /var/cudnn-local-repo-ubuntu2204-9.6.0/cudnn-*-keyring.gpg /usr/share/keyrings/

# Update package list
sudo apt-get update

# Install cuDNN
sudo apt-get -y install cudnn

# Install CUDA 12 specific package
sudo apt-get -y install cudnn-cuda-12

Step 3: Verify Installation

# Check if cuDNN is installed correctly
find /usr -name "libcudnn.so*"

Note: Direct download links won't work - you must download through NVIDIA's website after logging in.
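
If you use PyTorch, a quick way to confirm that cuDNN is visible to the framework (this assumes a CUDA-enabled PyTorch build is installed):

```python
import torch

# Quick sanity check from Python (assumes a CUDA-enabled PyTorch build).
print("CUDA available:", torch.cuda.is_available())
print("cuDNN enabled: ", torch.backends.cudnn.is_available())
print("cuDNN version: ", torch.backends.cudnn.version())  # e.g. 90600 for 9.6.0
```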

12/11/2024

Pedestrian and human attribute datasets.

 

For Pedestrian Detection:

  1. CityPersons - High-quality pedestrian detection dataset with diverse urban scenes from multiple European cities
  2. Caltech Pedestrian Dataset - Contains approximately 250,000 frames with 350,000 bounding boxes and 2,300 unique pedestrians
  3. INRIA Person Dataset - Includes full-body pedestrians in various poses and backgrounds
  4. MOT (Multiple Object Tracking) Dataset - Contains pedestrians in crowded scenes

For Human Attribute Analysis:

  1. RAP (Richly Annotated Pedestrian) Dataset - Over 40 attributes including clothing types, colors, and accessories
  2. PETA Dataset - Large-scale surveillance person attribute dataset with 19,000 images
  3. Market-1501 Attribute Dataset - Contains 27 attributes for clothing and personal items
  4. DeepFashion Dataset - Focuses on clothing items with detailed annotations

Some considerations when choosing a dataset:

  • Make sure to check the license terms for each dataset
  • Consider the image quality and diversity needed for your specific use case
  • Check if the annotations match your requirements (bounding boxes, attributes, etc.)
  • Verify that the dataset size is sufficient for your model training needs

11/19/2024

Print detailed model structure

Refer to the code below:

..

def print_model_structure(model, indent=0):
    for name, child in model.named_children():
        print(' ' * indent + f'└─ {name}: {child.__class__.__name__}')
        if list(child.children()):
            print_model_structure(child, indent + 2)

print_model_structure(composer_model)

..


This is the Llama 3.1 8B model structure:

└─ model: LlamaForCausalLM
  └─ model: LlamaModel
    └─ embed_tokens: Embedding
    └─ layers: ModuleList
      └─ 0: LlamaDecoderLayer
        └─ self_attn: LlamaFlashAttention2
          └─ q_proj: Linear
          └─ k_proj: Linear
          └─ v_proj: Linear
          └─ o_proj: Linear
          └─ rotary_emb: LlamaRotaryEmbedding
        └─ mlp: LlamaMLP
          └─ gate_proj: Linear
          └─ up_proj: Linear
          └─ down_proj: Linear
          └─ act_fn: SiLU
        └─ input_layernorm: LlamaRMSNorm
        └─ post_attention_layernorm: LlamaRMSNorm
      └─ 1 ... 31: LlamaDecoderLayer (identical structure to layer 0, repeated for layers 1 through 31)
    └─ norm: LlamaRMSNorm
    └─ rotary_emb: LlamaRotaryEmbedding
  └─ lm_head: Linear

11/17/2024

Hook Llama 3.1 8B layers and print dimensions

Refer to the code below:


.

def register_dimension_hooks(model, rank):
    if rank != 0:
        return
    print('\n------------------- Model Structure -------------------')
    print("Model type:", type(model))
    # Get the actual model through the wrapper layers
    if hasattr(model, 'model'):
        model = model.model
    if hasattr(model, 'model'):
        model = model.model
    print("Base model type:", type(model))

    def make_hook(name, rank):
        def hook(module, input, output):
            print(f"\n--------------- Hook: {name} ---------------")
            if hasattr(module, 'weight'):
                weight = module.weight
                print(f"GPU {rank} - {name}:")
                print(f"Input shape: {input[0].shape}")
                if hasattr(weight, '_local_tensor'):
                    local_weight = weight._local_tensor
                    print(f"Local weight shape: {local_weight.shape}")
                print(f"Global weight shape: {weight.shape}")
                if hasattr(weight, 'device_mesh'):
                    print(f"Device mesh: {weight.device_mesh}")
                    print(f"Placement: {weight.placements}")
            print(f"Output shape: {output.shape}")
            print("-" * 50)
        return hook

    # Register hooks for embedding layer
    if hasattr(model, 'embed_tokens'):
        print("Found embed_tokens")
        model.embed_tokens.register_forward_hook(make_hook('embed_tokens', rank))

    # Register hooks for all transformer layers
    if hasattr(model, 'layers'):
        for i, layer in enumerate(model.layers):
            # Attention blocks
            layer.self_attn.q_proj.register_forward_hook(
                make_hook(f'layer_{i}_q_proj', rank))
            layer.self_attn.k_proj.register_forward_hook(
                make_hook(f'layer_{i}_k_proj', rank))
            layer.self_attn.v_proj.register_forward_hook(
                make_hook(f'layer_{i}_v_proj', rank))
            layer.self_attn.o_proj.register_forward_hook(
                make_hook(f'layer_{i}_o_proj', rank))
            # MLP blocks
            layer.mlp.gate_proj.register_forward_hook(
                make_hook(f'layer_{i}_mlp_gate_proj', rank))
            layer.mlp.up_proj.register_forward_hook(
                make_hook(f'layer_{i}_mlp_up_proj', rank))
            layer.mlp.down_proj.register_forward_hook(
                make_hook(f'layer_{i}_mlp_down_proj', rank))
            # Layer norms
            layer.input_layernorm.register_forward_hook(
                make_hook(f'layer_{i}_input_layernorm', rank))
            layer.post_attention_layernorm.register_forward_hook(
                make_hook(f'layer_{i}_post_attention_layernorm', rank))

    # Register hook for final layer norm
    if hasattr(model, 'norm'):
        model.norm.register_forward_hook(make_hook('final_layernorm', rank))

    # Register hook for LM head
    if hasattr(model, 'lm_head'):
        print("Found lm_head")
        model.lm_head.register_forward_hook(make_hook('lm_head', rank))

    # Print model structure to debug
    print("\nModel attributes:", dir(model))

..


Thank you.


11/03/2024

Automatic Number Plate Recognition (ANPR) SDK source code



 # install 

pip install marearts-anpr


# code

# pip install marearts-anpr
import cv2
from PIL import Image
from marearts_anpr import ma_anpr_detector
from marearts_anpr import ma_anpr_ocr
from marearts_anpr import marearts_anpr_from_pil
from marearts_anpr import marearts_anpr_from_image_file
from marearts_anpr import marearts_anpr_from_cv2
if __name__ == '__main__':
    #################################
    ## Initiate MareArts ANPR
    print("EU ANPR")
    user_name = "your_email"
    serial_key = "your_serial_key"
    detector_model_version = "middle"  # Options: refer to detector model table
    ocr_model_version = "eu"  # Options: refer to ocr model table

    # MareArts ANPR Detector Inference
    anpr_d = ma_anpr_detector(detector_model_version, user_name, serial_key, conf_thres=0.3, iou_thres=0.5)
    # MareArts ANPR OCR Inference
    anpr_r = ma_anpr_ocr(ocr_model_version, user_name, serial_key)
    #################################

    #################################
    # Routine Task 1 - Predict from File
    image_path = './sample_images/eu_test1.jpg'
    output = marearts_anpr_from_image_file(anpr_d, anpr_r, image_path)
    print(output)

    # Routine Task 2 - Predict from cv2
    img = cv2.imread(image_path)
    output = marearts_anpr_from_cv2(anpr_d, anpr_r, img)
    print(output)

    # Routine Task 3 - Predict from Pillow
    pil_img = Image.open(image_path)
    output = marearts_anpr_from_pil(anpr_d, anpr_r, pil_img)
    print(output)
    #################################

    #################################
    ## Initiate MareArts ANPR for Korea
    print("ANPR Korean")
    # user_name, serial_key are already defined
    # anpr_d is also already initiated before
    ocr_model_version = "kr"
    # MareArts ANPR OCR Inference
    anpr_r = ma_anpr_ocr(ocr_model_version, user_name, serial_key)
    #################################

    # Routine Task 1 - Predict from File
    image_path = './sample_images/kr_test2.jpg'
    output = marearts_anpr_from_image_file(anpr_d, anpr_r, image_path)
    print(output)

    # Routine Task 2 - Predict from cv2
    img = cv2.imread(image_path)
    output = marearts_anpr_from_cv2(anpr_d, anpr_r, img)
    print(output)

    # Routine Task 3 - Predict from Pillow
    pil_img = Image.open(image_path)
    output = marearts_anpr_from_pil(anpr_d, anpr_r, pil_img)
    print(output)
    #################################

..


# Ask for a license here: https://study.marearts.com/p/anpr-lpr-solution.html

# Live test is here: https://live.marearts.com


10/30/2024

Brief explanation of "Audio → Spectrogram → Mel-spectrogram → MFCC"

Audio → Spectrogram → Mel-spectrogram → MFCC

  • Spectrogram = raw photo
  • Mel-spectrogram = photo adjusted for human vision
  • MFCC = compressed, essential features extracted from that photo

  1. Spectrogram
  • Raw time-frequency representation
  • Shows energy at each frequency over time
  • Doesn't account for human perception
  2. Mel-spectrogram
  • Spectrogram mapped to mel scale
  • Mimics human frequency perception
  • Still maintains all frequency band information
  3. MFCC
  • Derived FROM the mel-spectrogram
  • Additional step: DCT (Discrete Cosine Transform) is applied
  • Keeps only lower coefficients (dimensionality reduction)
  • Decorrelates features

.

  1. Audio → Spectrogram
    • Start with raw audio waveform
    • Apply pre-emphasis to boost higher frequencies
    • Frame the signal into short segments (typically 20-40ms with overlap)
    • Apply window function (usually Hamming) to reduce edge effects
    • Perform FFT on each frame
    • Calculate power spectrum (|FFT|²)
  2. Spectrogram → Mel-spectrogram
    • Create mel filter banks (triangular overlapping windows)
    • Convert frequencies to mel scale using formula: mel = 2595 * log10(1 + f/700)
    • Apply mel filter banks to power spectrum
    • Sum up the energy in each mel band
  3. Mel-spectrogram → MFCC
    • Take logarithm of mel filter bank energies (to match human perception)
    • Apply Discrete Cosine Transform (DCT)
    • Keep first N coefficients (typically 13-39)
    • Optionally:
      • Calculate delta (velocity) features
      • Calculate delta-delta (acceleration) features
      • Apply cepstral mean normalization (CMN)

..
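
A minimal sketch of this pipeline with librosa (assuming librosa is installed; "audio.wav" is a placeholder path and the frame/filter parameters are just typical values):

```python
import librosa

# Minimal sketch: Audio -> Mel-spectrogram -> MFCC with librosa.
y, sr = librosa.load("audio.wav", sr=16000)

# Mel-spectrogram: power spectrum of short frames mapped onto mel filter banks
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)  # log compression to match perception

# MFCC: DCT of the log-mel energies, keeping the first 13 coefficients
mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)            # velocity features
delta2 = librosa.feature.delta(mfcc, order=2)  # acceleration features
print(mfcc.shape, delta.shape, delta2.shape)
```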

10/26/2024

Download YouTube video in best quality

code..

import yt_dlp
import os
from typing import Optional


def format_size(bytes):
    """Convert bytes to human readable format"""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if bytes < 1024:
            return f"{bytes:.2f} {unit}"
        bytes /= 1024
    return f"{bytes:.2f} TB"


def download_video(url: str, output_path: Optional[str] = None) -> str:
    """
    Download a YouTube video in the best quality using yt-dlp.

    Args:
        url (str): The URL of the YouTube video
        output_path (str, optional): Directory to save the video
    """
    try:
        if not output_path:
            output_path = os.getcwd()
        os.makedirs(output_path, exist_ok=True)

        # Configure yt-dlp options for best quality
        ydl_opts = {
            'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best',  # Best video + audio quality
            'outtmpl': os.path.join(output_path, '%(title)s.%(ext)s'),
            'merge_output_format': 'mp4',  # Merge to MP4
            'progress_hooks': [lambda d: print(f"\rDownloading: {d['_percent_str']} of {d['_total_bytes_str']}", end="") if d['status'] == 'downloading' else None],
            'postprocessor_hooks': [lambda d: print("\nMerging video and audio...") if d['status'] == 'started' else None],
            'quiet': False,
            'no_warnings': False,
            # Additional options for best quality
            'format_sort': ['res:2160', 'res:1440', 'res:1080', 'res:720'],
            'video_multistreams': True,
            'audio_multistreams': True,
            'prefer_free_formats': True,
            'postprocessors': [{
                'key': 'FFmpegVideoConvertor',
                'preferedformat': 'mp4',
            }],
        }

        print(f"Fetching video information...")

        # Create yt-dlp object and download the video
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            # Get video info first
            info = ydl.extract_info(url, download=False)
            video_title = info.get('title', 'video')
            duration = info.get('duration')
            formats = info.get('formats', [])

            # Find best quality format
            best_video = max(
                (f for f in formats if f.get('vcodec') != 'none'),
                key=lambda f: (
                    f.get('height', 0),
                    f.get('filesize', 0)
                ),
                default=None
            )

            # Print video details
            print(f"\nVideo details:")
            print(f"Title: {video_title}")
            print(f"Duration: {duration//60}:{duration%60:02d}")
            if best_video:
                print(f"Best quality available: {best_video.get('height', 'N/A')}p")
                if best_video.get('filesize'):
                    print(f"Approximate size: {format_size(best_video['filesize'])}")

            print("\nStarting download in best quality...")

            # Download the video
            ydl.download([url])

            # Get the output filename
            output_file = os.path.join(output_path, f"{video_title}.mp4")
            print(f"\nDownload completed successfully!")
            print(f"Saved to: {output_file}")
            return output_file

    except Exception as e:
        print(f"\nError: {str(e)}")
        print("\nTroubleshooting steps:")
        print("1. Check if the video URL is correct")
        print("2. Check your internet connection")
        print("3. Make sure yt-dlp is up to date: pip install -U yt-dlp")
        print("4. Install or update ffmpeg (required for best quality):")
        print("   - On macOS: brew install ffmpeg")
        print("   - On Ubuntu/Debian: sudo apt-get install ffmpeg")
        print("   - On Windows: download from https://ffmpeg.org/download.html")
        return ""


def main():
    """
    Main function to handle user input for video download.
    """
    print("YouTube Video Downloader (Best Quality)")
    print("-------------------------------------")
    print("This will download videos in the highest available quality")
    print("Note: Higher quality downloads may take longer and use more disk space")

    while True:
        url = input("\nEnter the YouTube video URL (or 'q' to quit): ").strip()
        if url.lower() == 'q':
            print("Goodbye!")
            break
        if not url:
            print("Please enter a valid URL")
            continue

        download_video(url)

        choice = input("\nWould you like to download another video? (y/n): ").strip().lower()
        if choice != 'y':
            print("Goodbye!")
            break


if __name__ == "__main__":
    main()

..


That's it.

But install yt-dlp first:

pip install yt-dlp


Thank you!!!