
ML Workloads on WSL2 with RTX 4060

PyTorch

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Verify:

python3 -c "
import torch
print('CUDA available:', torch.cuda.is_available())
print('Device:', torch.cuda.get_device_name(0))
print('VRAM:', torch.cuda.get_device_properties(0).total_memory / 1024**3, 'GB')
"
# Expected: CUDA available: True
#           Device: NVIDIA GeForce RTX 4060
#           VRAM: ~16.0 GB
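Beyond the property check, a quick matrix multiply confirms that CUDA kernels actually execute. This sketch falls back to CPU when no GPU is visible, so it is safe to run anywhere:

```python
import torch

# Use the GPU when present; fall back to CPU so the check runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 512, device=device)
y = x @ x.T
print(f"ran matmul on {device}, result shape {tuple(y.shape)}")
```

On a healthy WSL2 + driver setup this should print `ran matmul on cuda, ...`.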

TensorFlow

pip install tensorflow[and-cuda]

Verify:

python3 -c "
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
"
# Expected: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Docker + NVIDIA Container Toolkit

Install Docker Engine

curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Log out and back in for the group change to take effect (or run `newgrp docker`)

Install NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Configure Docker Runtime

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
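The `nvidia-ctk` command rewrites /etc/docker/daemon.json. After it runs, the file should contain a runtime entry roughly like the following (exact formatting may differ between toolkit versions):

```json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
```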

Verify

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Limitation

Only --gpus all is supported in WSL2; selecting specific GPUs by index (e.g., --gpus '"device=0"') does not work. With a single RTX 4060 this limitation is moot.

RTX 4060 16GB -- ML Capacity

Workload                                 Fits in 16GB VRAM?   Notes
7B models (LLaMA 2, Mistral) at FP16     Yes                  ~14 GB, comfortable
13B models at 8-bit (GPTQ/AWQ)           Yes                  ~13 GB quantized
13B models at 4-bit                      Yes                  ~7 GB, room to spare
LoRA/QLoRA fine-tuning of 7B             Yes                  Practical for research
Full fine-tuning of 7B                   Tight                May need gradient checkpointing
Training from scratch >7B                No                   Need multi-GPU or larger card
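The VRAM figures above follow from simple arithmetic: model weights alone take parameters × bits-per-parameter / 8 bytes, with KV cache, activations, and framework overhead on top. A back-of-envelope helper (`model_vram_gb` is a hypothetical name, not a library function):

```python
def model_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Decimal GB needed for model weights only; excludes KV cache,
    activations, optimizer state, and framework overhead."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(model_vram_gb(7, 16))   # FP16 7B   -> 14.0
print(model_vram_gb(13, 8))   # 8-bit 13B -> 13.0
print(model_vram_gb(13, 4))   # 4-bit 13B -> 6.5
```

These weights-only numbers match the table; leave a few GB of headroom for runtime overhead before declaring a fit.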

Performance Notes

  • Near-native performance for compute-bound workloads (matrix ops, training loops).
  • ~5-33% overhead vs bare-metal Linux, depending on workload type.
  • I/O-bound workloads (data loading) see more overhead due to WSL2 filesystem.
  • Store training data on the Linux filesystem (/home/...), not on /mnt/c/ -- the Windows filesystem mount is significantly slower.
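A small guard in a training script can catch the /mnt/<drive> case before a slow run starts. A minimal sketch (the helper name and path heuristic are illustrative):

```python
from pathlib import PurePosixPath

def on_windows_mount(path: str) -> bool:
    """True if the path lives under a WSL2 drive mount (/mnt/<letter>/...),
    where file I/O is far slower than the native ext4 filesystem."""
    parts = PurePosixPath(path).parts
    return len(parts) >= 3 and parts[1] == "mnt" and len(parts[2]) == 1

print(on_windows_mount("/mnt/c/Users/me/dataset"))  # True  -> warn, move the data
print(on_windows_mount("/home/me/dataset"))         # False -> fast native path
```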