This guide explains how to access data from Computer A (data server) on Computer B (GPU machine) for machine learning training workflows.
When training machine learning models, you often need to keep large datasets on a dedicated data server while running training on a separate machine with GPUs.
This tutorial will show you how to securely connect these computers using SSH, allowing the GPU machine to access data without copying everything locally.
If you don't already have an SSH key on your GPU computer (Computer B):
# On Computer B (GPU)
ssh-keygen -t rsa -b 4096
Press Enter to accept default locations and add a passphrase if desired.
# On Computer B (GPU)
# View your public key
cat ~/.ssh/id_rsa.pub
# Copy the output to clipboard
Now transfer this key to Computer A (data server):
# Option 1: Using ssh-copy-id (easiest)
ssh-copy-id username@computerA
# Option 2: Manual setup
# First, SSH into Computer A
ssh username@computerA
# Then on Computer A, create .ssh directory if it doesn't exist
mkdir -p ~/.ssh
chmod 700 ~/.ssh
# Add your public key to authorized_keys
echo "ssh-rsa AAAA...your key here..." >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Exit back to Computer B
exit
Ensure you can connect without a password:
# On Computer B (GPU)
ssh username@computerA
If successful, you should connect without entering a password.
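Optionally, you can add a host alias on Computer B so later commands don't have to spell out the user and key every time. The alias name below (dataserver) is just a placeholder:
# On Computer B (GPU), add to ~/.ssh/config
Host dataserver
    HostName computerA
    User username
    IdentityFile ~/.ssh/id_rsa
With this in place, ssh dataserver and sshfs dataserver:/path/to/data ~/data_mount both work without repeating the username or key path.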
Install SSHFS on your GPU computer:
# On Computer B (GPU)
# For Ubuntu/Debian
sudo apt-get update
sudo apt-get install sshfs
# For CentOS/RHEL/Fedora
sudo dnf install fuse-sshfs
Create a mount point and mount the remote directory:
# On Computer B (GPU)
# Create mount directory
mkdir -p ~/data_mount
# Mount the remote directory
sshfs username@computerA:/path/to/data ~/data_mount
# Verify the mount worked
ls ~/data_mount
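To double-check that reads really go through the network mount, you can also inspect the filesystem entry itself:
# On Computer B (GPU)
df -h ~/data_mount
mount | grep data_mount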
Now you can access the data in your training scripts as if it were local:
# Example PyTorch script
import os

import torch
from torch.utils.data import Dataset, DataLoader

# Point to your mounted data directory; expand "~" explicitly,
# since Python does not expand it in plain string paths
data_dir = os.path.expanduser("~/data_mount/dataset")

# Your training code...
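As a rough sketch of how the mount can plug into a data pipeline (the flat folder of .pt tensor files below is just an assumption; adapt it to your dataset layout), a Dataset might look like this:
import os

import torch
from torch.utils.data import Dataset, DataLoader

class MountedTensorDataset(Dataset):
    """Reads .pt tensor files from the SSHFS-mounted directory on demand."""
    def __init__(self, root):
        self.root = os.path.expanduser(root)
        self.files = sorted(f for f in os.listdir(self.root) if f.endswith(".pt"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Each item is fetched over SSHFS when requested, so nothing is copied up front
        return torch.load(os.path.join(self.root, self.files[idx]))

# Several workers help hide the network latency of remote reads
loader = DataLoader(MountedTensorDataset("~/data_mount/dataset"),
                    batch_size=32, num_workers=4)
Because every item read goes over the network, batch size and the number of DataLoader workers usually matter more here than they do with local storage.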
To automatically mount the remote directory when your GPU computer starts:
Edit your fstab file:
sudo nano /etc/fstab
Add this line (all on one line):
username@computerA:/path/to/data /home/username/data_mount fuse.sshfs defaults,_netdev,user,idmap=user,follow_symlinks,identityfile=/home/username/.ssh/id_rsa,allow_other,reconnect 0 0
Save and exit
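Note that the allow_other option only works for non-root mounts if user_allow_other is enabled in /etc/fuse.conf. You can test the new entry without rebooting; the path below assumes the mount point from the fstab line above:
# On Computer B (GPU)
# If you use allow_other, first uncomment user_allow_other in /etc/fuse.conf
sudo nano /etc/fuse.conf
# Then mount the new entry by its mount point to test it
sudo mount /home/username/data_mount
ls /home/username/data_mount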
To unmount the remote directory:
# On Computer B (GPU)
fusermount -u ~/data_mount
For better performance with large datasets, try these SSHFS options:
sshfs username@computerA:/path/to/data ~/data_mount -o Compression=no,big_writes,cache=yes,kernel_cache
If you experience frequent disconnections, add reconnect options:
sshfs username@computerA:/path/to/data ~/data_mount -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3
For production setups with large datasets, consider using NFS instead of SSHFS for better performance.
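As a rough sketch of what an NFS setup could look like on Ubuntu/Debian (the package names, export path, and mount point are assumptions; check your distribution's documentation):
# On Computer A (data server): install the NFS server and export the data directory
sudo apt-get install nfs-kernel-server
echo "/path/to/data computerB(ro,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On Computer B (GPU): install the NFS client and mount the export
sudo apt-get install nfs-common
sudo mkdir -p /mnt/data_nfs
sudo mount -t nfs computerA:/path/to/data /mnt/data_nfs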
If the mount fails or keeps dropping, a first troubleshooting step is to check that the SSH service is running on Computer A:
# On Computer A (data server)
sudo systemctl status sshd
For any issues or questions, please contact your system administrator.
Let me explain how a specific normalized feature value is calculated using one concrete example.
Let's take the feature "GroupSize", which has a minimum normalized value of -0.045121 and a maximum normalized value of over 103.
These values are post-normalization, but we can work backwards to understand how they were calculated.
The normalization function you're using is:
normalized_features = (features - mean) / std
where features are the original, raw values, mean is the average of all values for that feature in the training set, and std is the standard deviation of all values for that feature in the training set.
Now suppose we have a set of raw values for GroupSize in the training set.
First we calculate the mean of those values, then their standard deviation, and finally we normalize each value by subtracting the mean and dividing by the standard deviation.
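Here is a minimal sketch of that computation in Python; the raw GroupSize values are made up purely for illustration and are not your actual data:
import numpy as np

# Hypothetical raw GroupSize values (illustrative only)
raw = np.array([1, 1, 1, 2, 2, 3, 50], dtype=float)

mean = raw.mean()                 # average over the training set
std = raw.std()                   # standard deviation over the training set
normalized = (raw - mean) / std   # same formula as above

print(mean, std)
print(normalized)  # the outlier (50) maps to a large positive value,
                   # while the common small values sit below zero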
Going back to your data, the extreme range of normalized GroupSize values suggests the following:
If we assume the mean of the raw GroupSize values is μ and the standard deviation is σ, then a maximum normalized value above 103 implies raw_max > μ + 103σ.
In other words, your maximum raw value is more than 103 standard deviations away from the mean, which is extremely far. This confirms that your raw data has a heavily skewed distribution with significant outliers.
The fact that most normalized values for GroupSize are close to the minimum (-0.045121) suggests that the most common value is slightly below the mean, while a few extreme outliers are pulling the mean upward.
This type of skewed distribution is exactly why techniques like masking and autoencoder approaches are beneficial - they can help the model learn robust representations even with such extreme distributions.