MareArts Computer Vision Study.
Computer Vision & Machine Learning Research Laboratory
2/03/2025
Select the best kernel using KNN (refer to the image).
2/02/2025
Find which webcams are online on your computer.
python code:
.
..
The output looks like this:
Starting camera detection...
Checking camera indices 0-9...
----------------------------------------
✓ Camera 0 is ONLINE:
Resolution: 640x480
FPS: 30.0
Backend: V4L2
Frame shape: (480, 640, 3)
Format: YUYV
Stability test: 5/5 frames captured successfully
[ WARN:0@0.913] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video1): can't open camera by index
[ERROR:0@0.972] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 1: Not available
✓ Camera 2 is ONLINE:
Resolution: 640x480
FPS: 30.0
Backend: V4L2
Frame shape: (480, 640, 3)
Format: YUYV
Stability test: 5/5 frames captured successfully
[ WARN:0@1.818] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video3): can't open camera by index
[ERROR:0@1.820] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 3: Not available
[ WARN:0@1.820] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video4): can't open camera by index
[ERROR:0@1.822] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 4: Not available
[ WARN:0@1.822] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video5): can't open camera by index
[ERROR:0@1.823] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 5: Not available
[ WARN:0@1.824] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video6): can't open camera by index
[ERROR:0@1.825] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 6: Not available
[ WARN:0@1.825] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video7): can't open camera by index
[ERROR:0@1.828] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 7: Not available
[ WARN:0@1.828] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video8): can't open camera by index
[ERROR:0@1.830] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 8: Not available
[ WARN:0@1.830] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video9): can't open camera by index
[ERROR:0@1.831] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 9: Not available
----------------------------------------
Summary:
Working camera indices: [0, 2]
----------------------------------------
Camera check complete!
So you can tell which camera indices are online.
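Here is a minimal sketch of such a detection script, assuming OpenCV (`cv2`) is installed; it is not the original code, just the same idea of probing indices 0-9:
```python
import cv2

def check_cameras(max_index=10, test_frames=5):
    """Probe camera indices 0..max_index-1 and report which ones are online."""
    working = []
    print("Starting camera detection...")
    print(f"Checking camera indices 0-{max_index - 1}...")
    for idx in range(max_index):
        cap = cv2.VideoCapture(idx)
        if not cap.isOpened():
            print(f"✗ Camera {idx}: Not available")
            continue
        ok, frame = cap.read()
        if not ok or frame is None:
            print(f"✗ Camera {idx}: Opened but returned no frame")
            cap.release()
            continue
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        fps = cap.get(cv2.CAP_PROP_FPS)
        # Simple stability test: try to grab a few consecutive frames
        good = sum(1 for _ in range(test_frames) if cap.read()[0])
        print(f"✓ Camera {idx} is ONLINE: {width}x{height} @ {fps:.1f} FPS, "
              f"stability {good}/{test_frames}, frame shape {frame.shape}")
        working.append(idx)
        cap.release()
    print("Working camera indices:", working)
    return working

if __name__ == "__main__":
    check_cameras()
```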
Thank you!
1/19/2025
FP32 vs FP8 with tiny NN model.
I'll create a simple example of a tiny neural network to demonstrate fp8 vs fp32 memory usage. Let's make a small model with these layers:
1. Input: 784 features (like MNIST image 28x28)
2. Hidden layer 1: 512 neurons
3. Hidden layer 2: 256 neurons
4. Output: 10 neurons (for 10 digit classes)
Let's calculate the memory needed for weights:
1. First Layer Weights:
```
784 × 512 = 401,408 weights
+ 512 biases
= 401,920 parameters
```
2. Second Layer Weights:
```
512 × 256 = 131,072 weights
+ 256 biases
= 131,328 parameters
```
3. Output Layer Weights:
```
256 × 10 = 2,560 weights
+ 10 biases
= 2,570 parameters
```
Total Parameters: 535,818
Memory Usage:
```
FP32: 535,818 × 4 bytes = 2,143,272 bytes ≈ 2.14 MB
FP8: 535,818 × 1 byte = 535,818 bytes ≈ 0.54 MB
```
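These counts can be reproduced with a few lines of PyTorch (a quick sketch assuming `torch` is installed; the fp8 figure is simply parameter count × 1 byte):
```python
import torch.nn as nn

# Tiny MLP matching the layer sizes above: 784 -> 512 -> 256 -> 10
model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params:,}")            # 535,818
print(f"FP32 memory: {n_params * 4 / 1e6:.2f} MB")  # ~2.14 MB
print(f"FP8 memory:  {n_params * 1 / 1e6:.2f} MB")  # ~0.54 MB
```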
Let's demonstrate this with some actual matrix multiplication:
```
# Example of one batch of inference
Input size: 32 images (batch) × 784 features
32 × 784 = 25,088 numbers
For first layer multiplication:
(32 × 784) × (784 × 512) → (32 × 512)
```
During computation:
1. With fp32:
```
Weights in memory: 401,920 × 4 = 1,607,680 bytes
Input in memory: 25,088 × 4 = 100,352 bytes
Output in memory: 16,384 × 4 = 65,536 bytes
Total: ≈ 1.77 MB
```
2. With fp8:
```
Weights in memory: 401,920 × 1 = 401,920 bytes
Input in memory: 25,088 × 1 = 25,088 bytes
Output in memory: 16,384 × 1 = 16,384 bytes
Total: ≈ 0.44 MB
```
During actual computation:
```
1. Load a tile/block of the weight matrix (let's say 128×128)
fp8: 128×128 = 16,384 bytes
2. Convert this block to fp32: 16,384 × 4 = 65,536 bytes
3. Perform multiplication in fp32
4. Convert result back to fp8
5. Move to next block
```
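A rough PyTorch sketch of this store-low/compute-high pattern (low-precision storage is simulated here with int8 plus a scale factor, since native fp8 tensor support depends on the build):
```python
import torch

def blockwise_matmul(x_fp32, w_low, scale, block=128):
    """Multiply fp32 activations by weights stored in a 1-byte format,
    upcasting one column block at a time to fp32 for the actual math."""
    out = torch.empty(x_fp32.shape[0], w_low.shape[1])
    for j in range(0, w_low.shape[1], block):
        tile = w_low[:, j:j + block]              # 1. take a 1-byte tile
        tile_fp32 = tile.float() * scale          # 2. convert the tile to fp32
        out[:, j:j + block] = x_fp32 @ tile_fp32  # 3. multiply in fp32
    return out

# Simulated 1-byte weights for the first layer (784 x 512)
w_fp32 = torch.randn(784, 512)
scale = w_fp32.abs().max() / 127
w_int8 = (w_fp32 / scale).round().clamp(-127, 127).to(torch.int8)

x = torch.randn(32, 784)        # one batch of 32 images
y = blockwise_matmul(x, w_int8, scale)
print(y.shape)                  # torch.Size([32, 512])
```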
This shows that even though we compute in fp32, keeping the model in fp8:
1. Uses 1/4 of the memory for storage
2. Only needs small blocks in fp32 temporarily
3. Can process larger batches or models with the same memory
1/17/2025
hipBLASLt type definition explanation
1. About Output Types (D):
The output D is not limited to fp32/int32. Looking at the table, D can be:
- fp32
- fp16
- bf16
- fp8
- bf8
- int8
2. Input/Output Patterns:
When A is fp16, you have two options:
```
Option 1:
A: fp16 → B: fp16 → C: fp16 → D: fp16 → Compute: fp32
Option 2:
A: fp16 → B: fp16 → C: fp16 → D: fp32 → Compute: fp32
```
The compute/scale is always higher precision (fp32 or int32) to maintain accuracy during calculations, even if inputs/outputs are lower precision.
3. Key Patterns in the Table:
- Inputs A and B must always match in type
- C typically matches A and B, except with fp8/bf8 inputs
- When using fp8/bf8 inputs, C and D can be higher precision (fp32, fp16, or bf16)
- The compute precision is always fp32 for floating point types
- For integer operations (int8), the compute precision is int32
4. Why Different Combinations?
- Performance: Lower precision (fp16, fp8) = faster computation + less memory
- Accuracy: Higher precision (fp32) = better accuracy but slower
- Memory Usage: fp16/fp8 use less memory than fp32
- Mixed Precision: Use lower precision for inputs but higher precision for output to balance speed and accuracy
Example Use Cases:
```
High Accuracy Needs:
A(fp32) → B(fp32) → C(fp32) → D(fp32) → Compute(fp32)
Balanced Performance:
A(fp16) → B(fp16) → C(fp16) → D(fp32) → Compute(fp32)
Maximum Performance:
A(fp8) → B(fp8) → C(fp8) → D(fp8) → Compute(fp32)
```
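To see concretely why the compute type stays at fp32 even when A and B are fp16, here is a small PyTorch toy (just an illustration of accumulation error, not hipBLASLt itself): summing many small fp16 products drifts, while an fp32 accumulator stays close to the reference.
```python
import torch

torch.manual_seed(0)
a = torch.rand(10_000) * 1e-3
b = torch.rand(10_000) * 1e-3

ref = torch.dot(a.double(), b.double()).item()   # high-precision reference

# fp16 storage, fp16 accumulator (simulated with an explicit loop)
acc16 = torch.tensor(0.0, dtype=torch.float16)
for x, y in zip(a.half(), b.half()):
    acc16 = acc16 + x * y

# fp16 storage, fp32 accumulator
acc32 = torch.dot(a.half().float(), b.half().float()).item()

print(f"reference: {ref:.6f}  fp16 accumulate: {acc16.item():.6f}  fp32 accumulate: {acc32:.6f}")
```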
1/15/2025
GEMM, Triton, hipBLASLt, and Transformer Engine concepts
1. GEMM (General Matrix Multiplication):
- This is the basic operation: C = A × B (matrix multiplication)
- Fundamental operation in deep learning, especially transformers
- Core computation in attention mechanisms, linear layers, etc.
2. Triton:
- A programming language for writing GPU kernels
- Lets you write your own custom GEMM implementation
- You control memory layout, tiling, etc.
- Example use: When you need a very specific matrix operation
3. hipBLASLt:
- A specialized library just for matrix operations
- Pre-built, highly optimized GEMM implementations
- Focuses on performance for common matrix sizes
- Example use: When you need fast, standard matrix multiplication
4. Transformer Engine:
- NVIDIA's specialized library for transformer models
- Automatically handles precision switching (FP8/FP16/FP32)
- Optimizes GEMM operations specifically for transformer architectures
- Includes specialized kernels for attention and linear layers
- Example use: When building large language models
The relationship:
```
Transformer Model
↓
Transformer Engine
↓
GEMM Operations (can be implemented via:)
↓
hipBLASLt / Triton / Other libraries
↓
GPU Hardware
```
The same matrix multiplication can be implemented using different approaches:
1. Basic GEMM Operation (what we want to compute):
```python
# C = A × B
# Where A is (M×K) and B is (K×N)
```
2. Using Triton (Custom implementation):
```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,    # Pointers to matrices
    M, N, K,                # Matrix dimensions
    stride_am, stride_ak,   # Memory strides for A
    stride_bk, stride_bn,   # Memory strides for B
    stride_cm, stride_cn,   # Memory strides for C
    BLOCK_SIZE: tl.constexpr,
):
    # Get program ID
    pid = tl.program_id(0)
    # Calculate block indices
    block_i = pid // (N // BLOCK_SIZE)
    block_j = pid % (N // BLOCK_SIZE)
    # Load blocks from A and B
    a = tl.load(a_ptr + ...)  # Load block from A
    b = tl.load(b_ptr + ...)  # Load block from B
    # Compute block multiplication
    c = tl.dot(a, b)  # Matrix multiply
    # Store result
    tl.store(c_ptr + ..., c)
```
3. Using hipBLASLt:
```cpp
// Sketch only: error checks, device allocations of A/B/C/D, and the
// heuristic query are omitted; enum names follow recent hipBLASLt versions.
hipblasLtHandle_t handle;
hipblasLtCreate(&handle);

// Operation descriptor: accumulate in fp32, fp32 alpha/beta
hipblasLtMatmulDesc_t matmulDesc;
hipblasLtMatmulDescCreate(&matmulDesc, HIPBLAS_COMPUTE_32F, HIP_R_32F);

// Matrix layouts: fp16 storage, column-major, leading dimension = rows
hipblasLtMatrixLayout_t matA, matB, matC;
hipblasLtMatrixLayoutCreate(&matA, HIP_R_16F, M, K, M);
hipblasLtMatrixLayoutCreate(&matB, HIP_R_16F, K, N, K);
hipblasLtMatrixLayoutCreate(&matC, HIP_R_16F, M, N, M);

// Execute GEMM: D = alpha * (A x B) + beta * C
float alpha = 1.0f, beta = 0.0f;
hipblasLtMatmul(
    handle,
    matmulDesc,
    &alpha,         // Scale factor for A x B
    A, matA,        // Input matrix A
    B, matB,        // Input matrix B
    &beta,          // Scale factor for C
    C, matC,        // Input matrix C
    D, matC,        // Output matrix D (same layout as C here)
    nullptr,        // Algorithm (NULL -> library heuristics)
    workspace,      // Temporary workspace
    workspaceSize,  // Workspace size in bytes
    stream          // HIP stream
);
```
4. Using Transformer Engine:
```python
import transformer_engine.pytorch as te
# Create TE layers
linear = te.Linear(in_features, out_features)
# Automatic precision handling
with te.fp8_autocast():
    output = linear(input)  # Internally uses optimized GEMM
```
Key differences:
1. Triton: You control everything (memory, blocks, compute)
2. hipBLASLt: Pre-optimized, you just call it
3. Transformer Engine: High-level, handles precision automatically
Performance comparison (general case):
```
Speed: hipBLASLt > Transformer Engine > Custom Triton
Flexibility: Triton > hipBLASLt > Transformer Engine
Ease of use: Transformer Engine > hipBLASLt > Triton
```
1/11/2025
FSDP: the difference between fsdp_config.activation_checkpointing and fsdp_config.activation_checkpointing_reentrant
Here are the key differences between these two FSDP (Fully Sharded Data Parallel) configuration parameters:
`fsdp_config.activation_checkpointing`:
- This is the main switch that enables/disables activation checkpointing
- When set to `true`, it saves memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass
- In your command, it's set to `false`, meaning no activation checkpointing will be performed
`fsdp_config.activation_checkpointing_reentrant`:
- This is a more specific setting that controls HOW activation checkpointing is implemented
- When set to `true` (as in your command), it uses PyTorch's reentrant checkpointing implementation, the older `torch.autograd.Function`-based approach
- When set to `false`, the newer non-reentrant implementation is used; it lifts several restrictions of the reentrant one (for example, it supports nested checkpointing and inputs that do not all require gradients)
- This setting only has an effect if `activation_checkpointing` is enabled
In your specific case, since `activation_checkpointing=false`, the `activation_checkpointing_reentrant=true` setting won't have any actual effect on the training process.
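For reference, the same reentrant vs. non-reentrant distinction exists at the plain PyTorch level through the `use_reentrant` argument (a standalone illustration, not the FSDP code path itself):
```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024, requires_grad=True)

y_reentrant = checkpoint(layer, x, use_reentrant=True)       # older, autograd.Function-based
y_non_reentrant = checkpoint(layer, x, use_reentrant=False)  # newer, fewer restrictions
(y_reentrant.sum() + y_non_reentrant.sum()).backward()
```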
A typical memory-optimized configuration would be:
```yaml
fsdp_config:
activation_checkpointing: true
activation_checkpointing_reentrant: true
```
This would give you maximum memory efficiency at the cost of some computation overhead. However, your configuration seems to be optimized for speed rather than memory usage, which makes sense for a performance-focused training setup (as suggested by your YAML filename containing "performance").
1/10/2025
EEG dataset and approaches
Recent EEG datasets and papers from the last 5 years:
- OpenNeuro EEG Datasets (2020-Present)
- DS003190: High-density EEG during motor tasks (2021)
- 128 participants
- 256-channel EEG recordings
- Recent papers:
- (2023) "Spatiotemporal Deep Learning for High-Density Motor EEG Classification" - 91.2% accuracy
- (2024) "Self-Supervised Learning on Large-Scale Motor EEG Data" - 92.8% accuracy
- BCIAUT-P300 Dataset (2021)
- Focuses on P300 responses in autism spectrum disorder
- 15 ASD participants and 15 controls
- High-quality 16-channel recordings
- Key papers:
- (2022) "Vision Transformer for P300 Detection in ASD" - 89.5% accuracy
- (2023) "Multi-head Attention Networks for P300 Classification" - 91.3% accuracy
- Cognitive Load EEG Dataset (2022)
- 100 participants performing cognitive tasks
- 64-channel EEG
- Mental workload classification
- Notable research:
- (2023) "Graph Neural Networks for Cognitive Load Assessment" - 87.9% accuracy
- (2024) "Hybrid CNN-Transformer for Mental Workload Classification" - 89.1% accuracy
- Sleep-EDF Database Expanded (2020 version)
- 197 sleep recordings
- Modern sleep stage classification
- Recent papers:
- (2023) "Attention-based Sleep Stage Classification" - 88.7% accuracy
- (2024) "Contrastive Learning for Sleep EEG Analysis" - 90.2% accuracy
- BEETL Dataset (2023)
- Brain-Environment-Engagement Through Learning
- 200+ participants
- Educational task-based EEG
- Emerging research:
- (2023) "Learning State Classification using Deep Networks" - 85.6% accuracy
- (2024) "Multi-task Learning for Educational EEG Analysis" - 87.3% accuracy
Recent Trends in EEG Classification (2023-2024):
- Self-supervised learning approaches
- Transformer-based architectures
- Multi-modal fusion (EEG + other biosignals)
- Explainable AI methods
- Few-shot learning techniques
Current Benchmark Standards:
- Use of cross-validation (usually 5 or 10-fold)
- Reporting confidence intervals
- Statistical significance testing
- Ablation studies
- Computational efficiency metrics
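A minimal example of the cross-validation reporting style (random features stand in for EEG data here; scikit-learn assumed):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))   # 200 trials x 64 channels/features
y = rng.integers(0, 2, size=200)     # binary class labels

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```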
- OpenNeuro EEG Datasets:
- Main Repository: https://openneuro.org/
- DS003190 Dataset: https://openneuro.org/datasets/ds003190/
- Associated paper repository: https://github.com/OpenNeuroDatasets/ds003190
- BCIAUT-P300 Dataset:
- Official Repository: https://www.kaggle.com/datasets/disbeat/bciaut-p300
- Dataset Documentation: http://www.ieee-dataport.org/documents/bciaut-p300-dataset-p300-based-brain-computer-interface-autism
- Sleep-EDF Database:
- PhysioNet Link: https://physionet.org/content/sleep-edfx/1.0.0/
- GitHub Repository with Processing Tools: https://github.com/akaraspt/deepsleepnet
- BEETL Dataset:
- Project Page: https://beetl.ai/data
- Documentation: https://beetl.ai/documentation
Important Data Repositories for EEG Research:
- PhysioNet:
- https://physionet.org/about/database/#neuro
- Contains multiple EEG collections
- OpenNeuro:
- https://openneuro.org/
- Filter by "EEG" modality
- Brain Signals Data Repositories:
- IEEE DataPort: https://ieee-dataport.org/
- Search for "EEG" datasets
Popular Code Repositories for Recent Papers:
- EEGNet Implementation:
- Deep Learning for EEG:
Research Paper Collections:
- Papers with Code - EEG Section:
- Google Scholar Collections:
Note: When accessing these resources:
- Always check the dataset's license terms
- Verify any usage restrictions
- Cite the original dataset papers
- Check for updated versions of the datasets
- Review the documentation for preprocessing steps