.screenrc
This guide explains how to access data from Computer A (data server) on Computer B (GPU machine) for machine learning training workflows.
When training machine learning models, you often need to keep large datasets on a storage server (Computer A) while running training on a separate GPU machine (Computer B).
This tutorial will show you how to securely connect these computers using SSH, allowing the GPU machine to access data without copying everything locally.
If you don't already have an SSH key on your GPU computer (Computer B):
# On Computer B (GPU)
ssh-keygen -t rsa -b 4096
Press Enter to accept default locations and add a passphrase if desired.
# On Computer B (GPU)
# View your public key
cat ~/.ssh/id_rsa.pub
# Copy the output to clipboard
Now transfer this key to Computer A (data server):
# Option 1: Using ssh-copy-id (easiest)
ssh-copy-id username@computerA
# Option 2: Manual setup
# First, SSH into Computer A
ssh username@computerA
# Then on Computer A, create .ssh directory if it doesn't exist
mkdir -p ~/.ssh
chmod 700 ~/.ssh
# Add your public key to authorized_keys
echo "ssh-rsa AAAA...your key here..." >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Exit back to Computer B
exit
Ensure you can connect without a password:
# On Computer B (GPU)
ssh username@computerA
If successful, you should connect without entering a password.
Install SSHFS on your GPU computer:
# On Computer B (GPU)
# For Ubuntu/Debian
sudo apt-get update
sudo apt-get install sshfs
# For CentOS/RHEL/Fedora
sudo dnf install fuse-sshfs
Create a mount point and mount the remote directory:
# On Computer B (GPU)
# Create mount directory
mkdir -p ~/data_mount
# Mount the remote directory
sshfs username@computerA:/path/to/data ~/data_mount
# Verify the mount worked
ls ~/data_mount
Now you can access the data in your training scripts as if it were local:
# Example PyTorch script
import os
import torch
from torch.utils.data import Dataset, DataLoader
# Point to your mounted data directory ("~" is not expanded automatically in Python)
data_dir = os.path.expanduser("~/data_mount/dataset")
# Your training code...
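For a slightly fuller sketch, here is a custom Dataset that reads images straight off the mount; the folder layout, file extensions, and transform are illustrative assumptions, not part of this guide's setup:

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class MountedImageDataset(Dataset):
    """Reads images from the SSHFS mount as if it were a local folder."""
    def __init__(self, root):
        self.root = os.path.expanduser(root)
        self.files = sorted(
            f for f in os.listdir(self.root) if f.lower().endswith((".jpg", ".png"))
        )
        self.transform = transforms.ToTensor()

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img = Image.open(os.path.join(self.root, self.files[idx])).convert("RGB")
        return self.transform(img)

# num_workers > 0 helps hide SSHFS network latency during training
loader = DataLoader(MountedImageDataset("~/data_mount/dataset"), batch_size=32, num_workers=4)
```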
To automatically mount the remote directory when your GPU computer starts:
Edit your fstab file:
sudo nano /etc/fstab
Add this line (all on one line):
username@computerA:/path/to/data /home/username/data_mount fuse.sshfs defaults,_netdev,user,idmap=user,follow_symlinks,identityfile=/home/username/.ssh/id_rsa,allow_other,reconnect 0 0
Save and exit
To unmount the remote directory:
# On Computer B (GPU)
fusermount -u ~/data_mount
For better performance with large datasets, try these SSHFS options:
sshfs username@computerA:/path/to/data ~/data_mount -o Compression=no,big_writes,cache=yes,kernel_cache
If you experience frequent disconnections, add reconnect options:
sshfs username@computerA:/path/to/data ~/data_mount -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3
For production setups with large datasets, consider using NFS instead of SSHFS for better performance.
If you have trouble connecting or mounting, first verify that the SSH service is running on Computer A:
# On Computer A (data server)
sudo systemctl status sshd
For any issues or questions, please contact your system administrator.
Let me explain how a specific normalized feature value is calculated using one concrete example.
Let's take the feature "GroupSize", with the minimum and maximum values you reported.
These values are post-normalization, but we can work backwards to understand how they were calculated.
The normalization function you're using is:
normalized_features = (features - mean) / std
Where:
- `features` are the original, raw values
- `mean` is the average of all values for that feature in the training set
- `std` is the standard deviation of all values for that feature in the training set

Given a set of raw GroupSize values from the training set, we first calculate the mean, then the standard deviation, and finally normalize each value with the formula above, as in the worked example below.
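For example, suppose the training set contained the GroupSize values 1, 1, 2, 2, and 14 (made-up numbers, purely for illustration):

```python
import numpy as np

# Made-up GroupSize values, for illustration only
raw = np.array([1, 1, 2, 2, 14], dtype=float)

mean = raw.mean()                 # 4.0
std = raw.std()                   # ~5.02 (population standard deviation)
normalized = (raw - mean) / std
print(normalized)                 # [-0.60 -0.60 -0.40 -0.40  1.99]
```

Most values land slightly below the mean (negative normalized values), while the single outlier is pushed far to the positive side, which is the same pattern your GroupSize feature shows at a much larger scale.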
Going back to your data:
For GroupSize, this extreme range suggests:
If we assume the mean of raw GroupSize is μ and its standard deviation is σ, then inverting the formula gives raw = μ + normalized × σ, so the largest raw value sits at roughly μ + 103σ. In other words, your maximum raw value is over 103 standard deviations away from the mean, which is extremely far. This confirms that your raw data has a heavily skewed distribution with significant outliers.
The fact that most normalized values for GroupSize are close to the minimum (-0.045121) suggests that the most common value is slightly below the mean, while a few extreme outliers are pulling the mean upward.
This type of skewed distribution is exactly why techniques like masking and autoencoder approaches are beneficial - they can help the model learn robust representations even with such extreme distributions.
Very Nice Convolution (Korean)
https://gaussian37.github.io/dl-concept-covolution_operation/
✌️
This guide provides step-by-step instructions for setting up LabelMe with a custom login system and proper dataset management. We'll cover the entire workflow: login → annotation → save → logout.
We'll build a system with the following components: a Dockerized LabelMe server, a PHP login portal served by Apache, and shared dataset and annotation directories on the host.
# Update package lists
sudo apt update
sudo apt upgrade -y
# Install necessary packages
sudo apt install -y docker.io docker-compose apache2 php libapache2-mod-php php-json
sudo systemctl enable docker
sudo systemctl start docker
# Add your user to docker group to avoid using sudo with docker commands
sudo usermod -aG docker $USER
# Log out and log back in for this to take effect
# Create main project directory
mkdir -p ~/labelme-project
cd ~/labelme-project
# Create directories for different components
mkdir -p docker-labelme
mkdir -p web-portal
mkdir -p datasets/{project1,project2}
mkdir -p annotations
# Add some sample images to project1 (optional)
# You can replace this with your own dataset copying commands
mkdir -p datasets/project1/images
# Copy some sample images if you have them
# cp /path/to/your/images/*.jpg datasets/project1/images/
Create a file `docker-labelme/docker-compose.yml`:
cd ~/labelme-project/docker-labelme
nano docker-compose.yml
Add the following content:
version: '3'
services:
  labelme:
    image: wkentaro/labelme
    container_name: labelme-server
    ports:
      - "8080:8080"
    volumes:
      - ../datasets:/data
      - ../annotations/.labelmerc:/home/developer/.labelmerc  # mount the config file created in the next step
    environment:
      - LABELME_SERVER=1
      - LABELME_PORT=8080
      - LABELME_HOST=0.0.0.0
    command: labelme --server --port 8080 --host 0.0.0.0 /data
    restart: unless-stopped
This step ensures annotations are saved in the proper format and location:
cd ~/labelme-project
nano annotations/.labelmerc
Add the following content:
{
"auto_save": true,
"display_label_popup": true,
"store_data": true,
"keep_prev": false,
"flags": null,
"flags_2": null,
"flags_3": null,
"label_flags": null,
"labels": ["person", "car", "bicycle", "dog", "cat", "tree", "building"],
"file_search": true,
"show_label_text": true
}
Customize the `labels` list according to your annotation needs.
cd ~/labelme-project/docker-labelme
docker-compose up -d
Verify it's running:
docker ps
You should see the labelme-server container running and listening on port 8080.
cd ~/labelme-project/web-portal
nano index.php
Add the following content:
<?php
// Display errors during development (remove in production)
ini_set('display_errors', 1);
error_reporting(E_ALL);
// Start session
session_start();
// Check if there's an error message
$error_message = isset($_SESSION['error_message']) ? $_SESSION['error_message'] : '';
// Clear error message after displaying it
unset($_SESSION['error_message']);
?>
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LabelMe Login</title>
<style>
body {
font-family: Arial, sans-serif;
background-color: #f4f4f4;
margin: 0;
padding: 0;
display: flex;
justify-content: center;
align-items: center;
height: 100vh;
}
.login-container {
background-color: white;
padding: 30px;
border-radius: 8px;
box-shadow: 0 2px 10px rgba(0, 0, 0, 0.1);
width: 350px;
}
h2 {
text-align: center;
color: #333;
margin-bottom: 20px;
}
input[type="text"],
input[type="password"] {
width: 100%;
padding: 12px;
margin: 8px 0;
display: inline-block;
border: 1px solid #ccc;
box-sizing: border-box;
border-radius: 4px;
}
button {
background-color: #4CAF50;
color: white;
padding: 14px 20px;
margin: 10px 0;
border: none;
cursor: pointer;
width: 100%;
border-radius: 4px;
font-size: 16px;
}
button:hover {
opacity: 0.8;
}
.error-message {
color: #f44336;
text-align: center;
margin-top: 10px;
}
.logo {
text-align: center;
margin-bottom: 20px;
}
</style>
</head>
<body>
<div class="login-container">
<div class="logo">
<h2>LabelMe Annotation</h2>
</div>
<form id="loginForm" action="auth.php" method="post">
<div>
<label for="username"><b>Username</b></label>
<input type="text" placeholder="Enter Username" name="username" required>
</div>
<div>
<label for="password"><b>Password</b></label>
<input type="password" placeholder="Enter Password" name="password" required>
</div>
<div>
<label for="project"><b>Select Project</b></label>
<select name="project" style="width: 100%; padding: 12px; margin: 8px 0; display: inline-block; border: 1px solid #ccc; box-sizing: border-box; border-radius: 4px;">
<option value="project1">Project 1</option>
<option value="project2">Project 2</option>
</select>
</div>
<button type="submit">Login</button>
<?php if (!empty($error_message)): ?>
<div class="error-message"><?php echo htmlspecialchars($error_message); ?></div>
<?php endif; ?>
</form>
</div>
</body>
</html>
cd ~/labelme-project/web-portal
nano auth.php
Add the following content:
<?php
// Start session management
session_start();
// Display errors during development (remove in production)
ini_set('display_errors', 1);
error_reporting(E_ALL);
// Configuration - Store these securely in production
$users = [
'admin' => [
'password' => password_hash('admin123', PASSWORD_DEFAULT), // Use hashed passwords
'role' => 'admin'
],
'user1' => [
'password' => password_hash('user123', PASSWORD_DEFAULT),
'role' => 'annotator'
],
'user2' => [
'password' => password_hash('user456', PASSWORD_DEFAULT),
'role' => 'annotator'
]
];
// Base path to the LabelMe application
$labelme_base_url = 'http://localhost:8080'; // Change this to your LabelMe server address
// Handle login form submission
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
$username = isset($_POST['username']) ? $_POST['username'] : '';
$password = isset($_POST['password']) ? $_POST['password'] : '';
$project = isset($_POST['project']) ? $_POST['project'] : 'project1';
// Validate credentials
if (isset($users[$username]) && password_verify($password, $users[$username]['password'])) {
// Set session variables
$_SESSION['logged_in'] = true;
$_SESSION['username'] = $username;
$_SESSION['role'] = $users[$username]['role'];
$_SESSION['project'] = $project;
$_SESSION['last_activity'] = time();
// Redirect to LabelMe
header("Location: labelme.php");
exit;
} else {
// Failed login
$_SESSION['error_message'] = "Invalid username or password";
header("Location: index.php");
exit;
}
}
// For logout
if (isset($_GET['logout'])) {
// Log this logout
$log_file = 'user_activity.log';
$log_message = date('Y-m-d H:i:s') . " - User: " . ($_SESSION['username'] ?? 'unknown') .
" - Action: Logged out\n";
file_put_contents($log_file, $log_message, FILE_APPEND);
// Clear session data
session_unset();
session_destroy();
// Redirect to login page
header("Location: index.php");
exit;
}
?>
cd ~/labelme-project/web-portal
nano labelme.php
Add the following content:
<?php
// Start session management
session_start();
// Display errors during development (remove in production)
ini_set('display_errors', 1);
error_reporting(E_ALL);
// Check if user is logged in
if (!isset($_SESSION['logged_in']) || $_SESSION['logged_in'] !== true) {
// Not logged in, redirect to login page
header("Location: index.php");
exit;
}
// Security: Check for session timeout (30 minutes)
$timeout = 30 * 60; // 30 minutes in seconds
if (isset($_SESSION['last_activity']) && (time() - $_SESSION['last_activity'] > $timeout)) {
// Session has expired
session_unset();
session_destroy();
header("Location: index.php?timeout=1");
exit;
}
// Update last activity time
$_SESSION['last_activity'] = time();
// Handle project switching here, before any HTML is sent, so the redirect header still works
if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['project'])) {
    $_SESSION['project'] = $_POST['project'];
    // Log project switch
    $log_message = date('Y-m-d H:i:s') . " - User: " . $_SESSION['username'] .
        " - Action: Switched to project " . $_POST['project'] . "\n";
    file_put_contents('user_activity.log', $log_message, FILE_APPEND);
    // Redirect to refresh the page with the new project
    header("Location: labelme.php");
    exit;
}
// Configuration
$labelme_base_url = 'http://localhost:8080'; // Change this to your LabelMe server address
$project = $_SESSION['project'] ?? 'project1';
$labelme_url = $labelme_base_url . '/' . $project;
// Log user activity
$log_file = 'user_activity.log';
$log_message = date('Y-m-d H:i:s') . " - User: " . $_SESSION['username'] .
" - Role: " . $_SESSION['role'] .
" - Project: " . $project .
" - Action: Accessed LabelMe\n";
file_put_contents($log_file, $log_message, FILE_APPEND);
?>
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LabelMe Annotation Tool</title>
<style>
body, html {
margin: 0;
padding: 0;
height: 100%;
overflow: hidden;
}
.header {
background-color: #333;
color: white;
padding: 10px;
display: flex;
justify-content: space-between;
align-items: center;
}
.user-info {
font-size: 14px;
}
.logout-btn {
background-color: #f44336;
color: white;
border: none;
padding: 5px 10px;
cursor: pointer;
border-radius: 3px;
text-decoration: none;
margin-left: 10px;
}
.logout-btn:hover {
background-color: #d32f2f;
}
.project-selector {
margin-left: 20px;
}
iframe {
width: 100%;
height: calc(100% - 50px);
border: none;
}
</style>
</head>
<body>
<div class="header">
<div>
<h3 style="margin:0;">LabelMe Annotation Tool</h3>
<span>Project: <strong><?php echo htmlspecialchars($project); ?></strong></span>
</div>
<div class="user-info">
Logged in as: <strong><?php echo htmlspecialchars($_SESSION['username']); ?></strong>
(<?php echo htmlspecialchars($_SESSION['role']); ?>)
<form method="post" action="" style="display:inline-block">
<select name="project" class="project-selector" onchange="this.form.submit()">
<option value="project1" <?php echo $project == 'project1' ? 'selected' : ''; ?>>Project 1</option>
<option value="project2" <?php echo $project == 'project2' ? 'selected' : ''; ?>>Project 2</option>
</select>
</form>
<a href="auth.php?logout=1" class="logout-btn">Logout</a>
</div>
</div>
<iframe src="<?php echo $labelme_url; ?>" allow="fullscreen"></iframe>
</body>
</html>
sudo nano /etc/apache2/sites-available/labelme-portal.conf
Add the following configuration:
<VirtualHost *:80>
    # Change ServerName to your domain or IP, and update the paths below to your actual location
    ServerName labelme.yourdomain.com
    DocumentRoot /home/username/labelme-project/web-portal

    <Directory /home/username/labelme-project/web-portal>
        Options Indexes FollowSymLinks
        AllowOverride All
        Require all granted
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/labelme-error.log
    CustomLog ${APACHE_LOG_DIR}/labelme-access.log combined
</VirtualHost>
Update the paths to match your actual user and directory structure.
sudo a2ensite labelme-portal.conf
sudo systemctl restart apache2
# Set appropriate permissions for the web files
cd ~/labelme-project
sudo chown -R www-data:www-data web-portal
sudo chmod -R 755 web-portal
# Ensure the annotation directory is writable
sudo chown -R www-data:www-data annotations
sudo chmod -R 777 annotations
# Ensure datasets are accessible
sudo chmod -R 755 datasets
Structure your dataset directories as follows:
datasets/
├── project1/
│ ├── images/
│ │ ├── image1.jpg
│ │ ├── image2.jpg
│ │ └── ...
│ └── annotations/ # LabelMe will save annotations here
├── project2/
│ ├── images/
│ │ ├── image1.jpg
│ │ └── ...
│ └── annotations/
└── ...
Create a script to add new projects:
cd ~/labelme-project
nano add-project.sh
Add the following content:
#!/bin/bash
# Script to add a new project to the LabelMe setup
# Check if a project name was provided
if [ -z "$1" ]; then
echo "Usage: $0 <project_name>"
exit 1
fi
PROJECT_NAME="$1"
PROJECT_DIR="$HOME/labelme-project/datasets/$PROJECT_NAME"
# Create project directory structure
mkdir -p "$PROJECT_DIR/images"
mkdir -p "$PROJECT_DIR/annotations"
# Set permissions
chmod -R 755 "$PROJECT_DIR"
# Update the web portal to include the new project
# (This is a simplified approach - you'll need to manually edit index.php and labelme.php)
echo "Project directory created at: $PROJECT_DIR"
echo "Now copy your images to: $PROJECT_DIR/images/"
echo "Remember to manually update index.php and labelme.php to include the new project"
Make the script executable:
chmod +x add-project.sh
Open your browser and navigate to:
http://your-server-ip/
or http://labelme.yourdomain.com/
sudo apt install certbot python3-certbot-apache
sudo certbot --apache -d labelme.yourdomain.com
Edit the `auth.php` file to use a database instead of hardcoded users.
If LabelMe doesn't load in the iframe:
docker ps
docker logs labelme-server
If you encounter permission issues with annotations:
sudo chmod -R 777 ~/labelme-project/annotations
sudo chown -R www-data:www-data ~/labelme-project/datasets
If annotations aren't saving properly, check the `.labelmerc` configuration file and review the Apache error log:
sudo tail -f /var/log/apache2/error.log
You now have a complete LabelMe annotation system with a custom login portal, per-project datasets, and user activity logging.
This setup allows your team to collaborate on annotation projects while maintaining control over who can access the system and what projects they can work on.
Thank you.
LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework developed by Microsoft that uses tree-based learning algorithms. It's designed to be efficient, fast, and capable of handling large-scale data with high dimensionality.
Here's a visualization of how LightGBM works:
Key features of LightGBM that make it powerful:
Leaf-wise Tree Growth: Unlike traditional algorithms that grow trees level-wise, LightGBM grows trees leaf-wise, focusing on the leaf that will bring the maximum reduction in loss. This creates more complex trees but uses fewer splits, resulting in higher accuracy with the same number of leaves.
Gradient-based One-Side Sampling (GOSS): This technique retains instances with large gradients (those that need more training) and randomly samples instances with small gradients. This allows LightGBM to focus computational resources on the more informative examples without losing accuracy.
Exclusive Feature Bundling (EFB): For sparse datasets, many features are mutually exclusive (never take non-zero values simultaneously). LightGBM bundles these features together, treating them as a single feature. This reduces memory usage and speeds up training.
Gradient Boosting Framework: Like other boosting algorithms, LightGBM builds trees sequentially, with each new tree correcting the errors of the existing ensemble.
These properties make LightGBM particularly well-suited for your solver selection task.
When properly tuned, LightGBM can often achieve better performance than neural networks for tabular data, especially with the right hyperparameters and sufficient boosting rounds.
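As a quick, hedged illustration of how it is typically used (synthetic data and hyperparameters here, not your solver-selection features):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_classes=3)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMClassifier(
    objective="multiclass",
    num_leaves=63,        # leaf-wise growth: more leaves allow more complex trees
    learning_rate=0.05,
    n_estimators=500,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("Best iteration:", model.best_iteration_)
```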
# Find the largest directories in your home
du -h --max-depth=1 ~ | sort -rh | head -20
# Find the largest files
find ~ -type f -exec du -h {} \; | sort -rh | head -20
# Alternatively for a cleaner view of largest files
find ~ -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh | head -20
Thank you
checkgpu.py
Thank you!
Python code:
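A minimal sketch along these lines (the exact OpenCV properties queried and the 5-frame stability test are inferred from the output below, so treat the details as assumptions):

```python
import cv2

def check_camera(index, test_frames=5):
    """Probe one camera index and print its basic properties."""
    cap = cv2.VideoCapture(index)
    if not cap.isOpened():
        print(f"✗ Camera {index}: Not available")
        return False
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    backend = cap.getBackendName()
    fourcc = int(cap.get(cv2.CAP_PROP_FOURCC))
    fmt = "".join(chr((fourcc >> 8 * i) & 0xFF) for i in range(4))
    ret, frame = cap.read()
    ok_frames = sum(1 for _ in range(test_frames) if cap.read()[0])
    print(f"✓ Camera {index} is ONLINE:")
    print(f"    Resolution: {width}x{height}")
    print(f"    FPS: {fps}")
    print(f"    Backend: {backend}")
    if ret:
        print(f"    Frame shape: {frame.shape}")
    print(f"    Format: {fmt}")
    print(f"    Stability test: {ok_frames}/{test_frames} frames captured successfully")
    cap.release()
    return True

print("Starting camera detection...")
print("Checking camera indices 0-9...")
print("-" * 40)
working = [i for i in range(10) if check_camera(i)]
print("-" * 40)
print("Summary:")
print(f"Working camera indices: {working}")
print("-" * 40)
print("Camera check complete!")
```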
The output looks like:
Starting camera detection...
Checking camera indices 0-9...
----------------------------------------
✓ Camera 0 is ONLINE:
Resolution: 640x480
FPS: 30.0
Backend: V4L2
Frame shape: (480, 640, 3)
Format: YUYV
Stability test: 5/5 frames captured successfully
[ WARN:0@0.913] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video1): can't open camera by index
[ERROR:0@0.972] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 1: Not available
✓ Camera 2 is ONLINE:
Resolution: 640x480
FPS: 30.0
Backend: V4L2
Frame shape: (480, 640, 3)
Format: YUYV
Stability test: 5/5 frames captured successfully
[ WARN:0@1.818] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video3): can't open camera by index
[ERROR:0@1.820] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 3: Not available
[ WARN:0@1.820] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video4): can't open camera by index
[ERROR:0@1.822] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 4: Not available
[ WARN:0@1.822] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video5): can't open camera by index
[ERROR:0@1.823] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 5: Not available
[ WARN:0@1.824] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video6): can't open camera by index
[ERROR:0@1.825] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 6: Not available
[ WARN:0@1.825] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video7): can't open camera by index
[ERROR:0@1.828] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 7: Not available
[ WARN:0@1.828] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video8): can't open camera by index
[ERROR:0@1.830] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 8: Not available
[ WARN:0@1.830] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video9): can't open camera by index
[ERROR:0@1.831] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
✗ Camera 9: Not available
----------------------------------------
Summary:
Working camera indices: [0, 2]
----------------------------------------
Camera check complete!
This way you can tell which camera indices are online.
Thank you!
I'll create a simple example of a tiny neural network to demonstrate fp8 vs fp32 memory usage. Let's make a small model with these layers:
1. Input: 784 features (like MNIST image 28x28)
2. Hidden layer 1: 512 neurons
3. Hidden layer 2: 256 neurons
4. Output: 10 neurons (for 10 digit classes)
Let's calculate the memory needed for weights:
1. First Layer Weights:
```
784 × 512 = 401,408 weights
+ 512 biases
= 401,920 parameters
```
2. Second Layer Weights:
```
512 × 256 = 131,072 weights
+ 256 biases
= 131,328 parameters
```
3. Output Layer Weights:
```
256 × 10 = 2,560 weights
+ 10 biases
= 2,570 parameters
```
Total Parameters: 535,818
Memory Usage:
```
FP32: 535,818 × 4 bytes = 2,143,272 bytes ≈ 2.14 MB
FP8: 535,818 × 1 byte = 535,818 bytes ≈ 0.54 MB
```
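To double-check those storage numbers, here is a small PyTorch sketch of the same model; fp8 is simply counted as 1 byte per parameter here, since actual fp8 dtype support depends on your PyTorch version:

```python
import torch.nn as nn

# The tiny MLP described above: 784 -> 512 -> 256 -> 10
model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")              # 535,818
print(f"fp32 storage: {total_params * 4 / 1e6:.2f} MB")   # ~2.14 MB
print(f"fp8 storage:  {total_params * 1 / 1e6:.2f} MB")   # ~0.54 MB
```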
Let's demonstrate this with some actual matrix multiplication:
```
# Example of one batch of inference
Input size: 32 images (batch) × 784 features
32 × 784 = 25,088 numbers
For first layer multiplication:
(32 × 784) × (784 × 512) → (32 × 512)
```
During computation:
1. With fp32:
```
Weights in memory: 401,920 × 4 = 1,607,680 bytes
Input in memory: 25,088 × 4 = 100,352 bytes
Output in memory: 16,384 × 4 = 65,536 bytes
Total: ≈ 1.77 MB
```
2. With fp8:
```
Weights in memory: 401,920 × 1 = 401,920 bytes
Input in memory: 25,088 × 1 = 25,088 bytes
Output in memory: 16,384 × 1 = 16,384 bytes
Total: ≈ 0.44 MB
```
During actual computation:
```
1. Load a tile/block of the weight matrix (let's say 128×128)
fp8: 128×128 = 16,384 bytes
2. Convert this block to fp32: 16,384 × 4 = 65,536 bytes
3. Perform multiplication in fp32
4. Convert result back to fp8
5. Move to next block
```
This shows how even though we compute in fp32, keeping the model in fp8:
1. Uses 1/4 the memory for storage
2. Only needs small blocks in fp32 temporarily
3. Can process larger batches or models with the same memory, as the sketch below illustrates
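Here is a toy NumPy sketch of that tiled flow; since NumPy has no fp8 dtype, int8 plus a single scale factor stands in for the 1-byte storage format, and the tile size is arbitrary:

```python
import numpy as np

M, K, N, BLOCK = 32, 784, 512, 128
rng = np.random.default_rng(0)

scale = 0.01                                               # shared dequantization scale
W_q = rng.integers(-127, 128, size=(K, N), dtype=np.int8)  # 1 byte per weight in memory
X = rng.standard_normal((M, K)).astype(np.float32)

out = np.zeros((M, N), dtype=np.float32)
for k0 in range(0, K, BLOCK):
    for n0 in range(0, N, BLOCK):
        # Dequantize only one tile to fp32, multiply, accumulate, then move on
        tile = W_q[k0:k0 + BLOCK, n0:n0 + BLOCK].astype(np.float32) * scale
        out[:, n0:n0 + BLOCK] += X[:, k0:k0 + BLOCK] @ tile

# Reference: dequantize the whole weight matrix at once (4x the weight memory)
ref = X @ (W_q.astype(np.float32) * scale)
print(np.allclose(out, ref, rtol=1e-4, atol=1e-4))         # True
```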
1. About Output Types (D):
No, the output D is not limited to fp32/int32. Looking at the table, D can be:
- fp32
- fp16
- bf16
- fp8
- bf8
- int8
2. Input/Output Patterns:
When A is fp16, you have two options:
```
Option 1:
A: fp16 → B: fp16 → C: fp16 → D: fp16 → Compute: fp32
Option 2:
A: fp16 → B: fp16 → C: fp16 → D: fp32 → Compute: fp32
```
The compute/scale is always higher precision (fp32 or int32) to maintain accuracy during calculations, even if inputs/outputs are lower precision.
3. Key Patterns in the Table:
- Inputs A and B must always match in type
- C typically matches A and B, except with fp8/bf8 inputs
- When using fp8/bf8 inputs, C and D can be higher precision (fp32, fp16, or bf16)
- The compute precision is always fp32 for floating point types
- For integer operations (int8), the compute precision is int32
4. Why Different Combinations?
- Performance: Lower precision (fp16, fp8) = faster computation + less memory
- Accuracy: Higher precision (fp32) = better accuracy but slower
- Memory Usage: fp16/fp8 use less memory than fp32
- Mixed Precision: Use lower precision for inputs but higher precision for output to balance speed and accuracy
Example Use Cases:
```
High Accuracy Needs:
A(fp32) → B(fp32) → C(fp32) → D(fp32) → Compute(fp32)
Balanced Performance:
A(fp16) → B(fp16) → C(fp16) → D(fp32) → Compute(fp32)
Maximum Performance:
A(fp8) → B(fp8) → C(fp8) → D(fp8) → Compute(fp32)
```
1. GEMM (General Matrix Multiplication):
- This is the basic operation: C = A × B (matrix multiplication)
- Fundamental operation in deep learning, especially transformers
- Core computation in attention mechanisms, linear layers, etc.
2. Triton:
- A programming language for writing GPU kernels
- Lets you write your own custom GEMM implementation
- You control memory layout, tiling, etc.
- Example use: When you need a very specific matrix operation
3. hipBLASLt:
- A specialized library just for matrix operations
- Pre-built, highly optimized GEMM implementations
- Focuses on performance for common matrix sizes
- Example use: When you need fast, standard matrix multiplication
4. Transformer Engine:
- NVIDIA's specialized library for transformer models
- Automatically handles precision switching (FP8/FP16/FP32)
- Optimizes GEMM operations specifically for transformer architectures
- Includes specialized kernels for attention and linear layers
- Example use: When building large language models
The relationship:
```
Transformer Model
↓
Transformer Engine
↓
GEMM Operations (can be implemented via:)
↓
hipBLASLt / Triton / Other libraries
↓
GPU Hardware
```
Here is how the same matrix multiplication would be implemented using different approaches:
1. Basic GEMM Operation (what we want to compute):
```python
# C = A × B
# Where A is (M×K) and B is (K×N)
```
2. Using Triton (Custom implementation):
```python
@triton.jit
def matmul_kernel(
a_ptr, b_ptr, c_ptr, # Pointers to matrices
M, N, K, # Matrix dimensions
stride_am, stride_ak, # Memory strides for A
stride_bk, stride_bn, # Memory strides for B
stride_cm, stride_cn, # Memory strides for C
BLOCK_SIZE: tl.constexpr,
):
# Get program ID
pid = tl.program_id(0)
# Calculate block indices
block_i = pid // (N // BLOCK_SIZE)
block_j = pid % (N // BLOCK_SIZE)
# Load blocks from A and B
a = tl.load(a_ptr + ...) # Load block from A
b = tl.load(b_ptr + ...) # Load block from B
# Compute block multiplication
c = tl.dot(a, b) # Matrix multiply
# Store result
tl.store(c_ptr + ..., c)
```
3. Using hipBLASLt:
```cpp
// Initialize hipBLASLt
hipblasLtHandle_t handle;
hipblasLtCreate(&handle);
// Define matrix layout
hipblasLtMatrixLayout_t matA, matB, matC;
hipblasLtMatrixLayoutCreate(&matA, HIPBLAS_LT_R_16F, M, K, M);
hipblasLtMatrixLayoutCreate(&matB, HIPBLAS_LT_R_16F, K, N, K);
hipblasLtMatrixLayoutCreate(&matC, HIPBLAS_LT_R_16F, M, N, M);
// Execute GEMM
hipblasLtMatmul(
handle,
matmulDesc,
&alpha, // Scale factor
A, matA, // Input matrix A
B, matB, // Input matrix B
&beta, // Scale factor
C, matC, // Output matrix C
workspace, // Temporary workspace
streams // CUDA stream
);
```
4. Using Transformer Engine:
```python
import transformer_engine.pytorch as te
# Create TE layers
linear = te.Linear(in_features, out_features)
# Automatic precision handling
with te.fp8_autocast():
output = linear(input) # Internally uses optimized GEMM
```
Key differences:
1. Triton: You control everything (memory, blocks, compute)
2. hipBLASLt: Pre-optimized, you just call it
3. Transformer Engine: High-level, handles precision automatically
Performance comparison (general case):
```
Speed: hipBLASLt > Transformer Engine > Custom Triton
Flexibility: Triton > hipBLASLt > Transformer Engine
Ease of use: Transformer Engine > hipBLASLt > Triton
```
Let me explain the key differences between these two FSDP (Fully Sharded Data Parallel) configuration parameters:
`fsdp_config.activation_checkpointing`:
- This is the main switch that enables/disables activation checkpointing
- When set to `true`, it saves memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass
- In your command, it's set to `false`, meaning no activation checkpointing will be performed
`fsdp_config.activation_checkpointing_reentrant`:
- This is a more specific setting that controls HOW activation checkpointing is implemented
- When set to `true` (as in your command), it uses PyTorch's original reentrant checkpointing implementation; when `false`, the newer non-reentrant implementation is used
- The non-reentrant implementation generally supports a wider range of cases (for example, nested checkpointing and more complex model architectures), while the reentrant path is the older, more conservative one
- This setting only has an effect if `activation_checkpointing` is enabled
In your specific case, since `activation_checkpointing=false`, the `activation_checkpointing_reentrant=true` setting won't have any actual effect on the training process.
A typical memory-optimized configuration would be:
```yaml
fsdp_config:
activation_checkpointing: true
activation_checkpointing_reentrant: true
```
This would give you maximum memory efficiency at the cost of some computation overhead. However, your configuration seems to be optimized for speed rather than memory usage, which makes sense for a performance-focused training setup (as suggested by your YAML filename containing "performance").
Recent EEG datasets and papers from the last 5 years:
Recent Trends in EEG Classification (2023-2024):
Current Benchmark Standards:
Important Data Repositories for EEG Research:
Popular Code Repositories for Recent Papers:
Research Paper Collections:
Note: When accessing these resources: