Python¶
Overview¶
Python is an extremely versatile programming language widely used for data science, machine learning, and scientific computing. On the cluster, you can load different versions of Python and create virtual environments to manage packages in isolation.
This guide shows you how to configure and use Python efficiently in the HPC environment.
Check Available Versions¶
Before starting, check which Python versions are available as modules:
# List available Python modules
module avail python
# See details of a specific module
module spider python/3.10
Create Virtual Environment¶
Step by Step¶
1. Load Python
2. Choose where to create the environment
Where to create virtual environments:
- Small environments (< 5 GB): /home/$USER/venvs/
- Large or shared environments: /scratch/projetos/<your_project>/envs/
Example:
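Steps 1 and 2 together might look like this (the module version is an example; check module avail python for what your site actually provides):

```shell
# Step 1: load a Python module (version shown is an example)
module load python/3.10

# Step 2: create a directory to hold personal environments
mkdir -p ~/venvs
```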
3. Create the virtual environment
# Personal environment
python -m venv ~/venvs/my_env
# Or for project
python -m venv /scratch/projetos/<my_project>/envs/my_env
4. Activate the environment
# Personal environment
source ~/venvs/my_env/bin/activate
# Or for project
source /scratch/projetos/<my_project>/envs/my_env/bin/activate
When active, you'll see the environment name at the start of the prompt, e.g. (my_env) user@login:~$
5. Install packages
# Update pip first
pip install --upgrade pip
# Install packages
pip install numpy pandas matplotlib scikit-learn
# Install from requirements.txt file
pip install -r requirements.txt
# List installed packages
pip list
6. Create requirements.txt
For reproducibility, save your packages:
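With the environment activated, one command captures the exact installed versions:

```shell
# Write the currently installed packages and versions to requirements.txt
pip freeze > requirements.txt
```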
7. Deactivate the environment
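Run deactivate in the same shell where the environment is active. A self-contained round trip, using a throwaway path (/tmp/demo_env, purely illustrative):

```shell
# Create and activate a throwaway environment, then leave it
python3 -m venv /tmp/demo_env
source /tmp/demo_env/bin/activate
deactivate
```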
Use Python in SLURM Jobs¶
Simple job¶
#!/bin/bash
#SBATCH --job-name=python_job
#SBATCH --output=/scratch/projetos/<my_project>/logs/job_%j.out
#SBATCH --error=/scratch/projetos/<my_project>/logs/job_%j.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
# Load Python
module load python/3.10
# Activate virtual environment
source ~/venvs/my_env/bin/activate
# Run script
python my_script.py
Job with parallel processing¶
#!/bin/bash
#SBATCH --job-name=python_parallel
#SBATCH --output=/scratch/projetos/<my_project>/logs/parallel_%j.out
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
module load python/3.10
source ~/venvs/data_science/bin/activate
# Cap NumPy/SciPy thread counts (OpenMP, MKL, OpenBLAS) at the allocated CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Run with multiprocessing
python parallel_processing.py --workers $SLURM_CPUS_PER_TASK
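A sketch of what parallel_processing.py might contain (the script name and --workers flag come from the job above; the squaring workload is a placeholder for real per-item computation):

```python
import argparse
from multiprocessing import Pool


def work(x):
    # Placeholder for the real per-item computation
    return x * x


def run(workers):
    # Distribute the items across the worker pool
    with Pool(processes=workers) as pool:
        return pool.map(work, range(10))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=1)
    args = parser.parse_args()
    print(run(args.workers))
```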
Job array¶
To process multiple files or parameters:
#!/bin/bash
#SBATCH --job-name=python_array
#SBATCH --output=/scratch/projetos/<my_project>/logs/array_%A_%a.out
#SBATCH --array=1-10
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2
module load python/3.10
source ~/venvs/my_env/bin/activate
# Use SLURM_ARRAY_TASK_ID as parameter
python process_file.py --id $SLURM_ARRAY_TASK_ID
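A minimal sketch of how process_file.py could map the array index to an input (the data_N.csv naming scheme is illustrative):

```python
import argparse


def input_for(task_id):
    # Map the array index to an input file; the naming scheme is an example
    return f"data_{task_id}.csv"


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--id", type=int, default=1)
    args = parser.parse_args()
    print(f"Processing {input_for(args.id)}")
```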
See complete SLURM job examples for more options.
Best Practices¶
1. Environment Organization¶
# Create organized structure
~/venvs/
├── data_analysis/ # For data analysis
├── machine_learning/ # For ML
└── web_scraping/ # For scraping
# Or by project
/scratch/projetos/<my_project>/envs/
├── production/
└── development/
2. Manage Quotas¶
Python environments can grow quickly. Monitor size:
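A simple way to watch the footprint with standard tools (replace ~/venvs with wherever your environments live):

```shell
mkdir -p ~/venvs   # no-op if the directory already exists
du -sh ~/venvs     # total size of all environments in the directory
```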
3. Use requirements.txt¶
Always maintain an updated requirements.txt file:
# Create/update
pip freeze > requirements.txt
# Reinstall in new environment
pip install -r requirements.txt
4. Specify Versions¶
In requirements.txt, specify exact versions for reproducibility:
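A pinned requirements.txt might look like this (packages and version numbers are illustrative):

```text
numpy==1.26.4
pandas==2.2.2
scikit-learn==1.4.2
```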
5. Use Local Pip Cache¶
To avoid repeated downloads, configure local cache:
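pip honors the PIP_CACHE_DIR environment variable; pointing it at scratch keeps downloaded wheels out of your /home quota and lets environments share them (the path below is an example):

```shell
# Reuse downloaded packages across environments; path is an example
export PIP_CACHE_DIR=/scratch/projetos/<my_project>/.pip_cache
```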
Common Packages for Data Science¶
Data Analysis¶
Machine Learning¶
Parallel Processing¶
Visualization¶
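A requirements.txt starting point covering the four categories above (package choices are common defaults, not a site-maintained list; trim to what you actually use):

```text
# Data analysis
numpy
pandas
# Machine learning
scikit-learn
# Parallel processing
joblib
dask
# Visualization
matplotlib
seaborn
```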
Jupyter Notebooks¶
To use Jupyter on the cluster:
# Install in environment
pip install jupyter jupyterlab
# Start (in interactive job or as job)
jupyter notebook --no-browser --port=8888
See the complete Jupyter guide for SSH tunnel configuration.
Common Problems¶
ImportError after installing package¶
Problem: Package installed but Python can't find it.
Solution:
# Check if environment is activated
which python
# Should show environment path, not /usr/bin/python
# If not, activate environment again
source ~/venvs/my_env/bin/activate
Out of space in /home¶
Problem: Quota exceeded when installing packages.
Solution:
# Move environment to /scratch
mv ~/venvs/my_env /scratch/projetos/<my_project>/envs/
# Create symbolic link
ln -s /scratch/projetos/<my_project>/envs/my_env ~/venvs/my_env
Packages require compilation¶
Problem: Error installing packages that need compilation (e.g., numpy, scipy).
Solution:
# Load compiler
module load gcc/11.2.0
# Or use pre-compiled binaries
pip install --only-binary :all: numpy
Incompatible Python version¶
Problem: Package requires specific Python version.
Solution:
# List available versions
module avail python
# Load appropriate version
module load python/3.11
# Create new environment with this version
python -m venv ~/venvs/my_env_py311
Remove Virtual Environment¶
When you no longer need the environment:
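Deactivate it first if it is active, then delete its directory (the path is the example used throughout this guide):

```shell
# Permanently delete the environment; there is no undo
rm -rf ~/venvs/my_env
```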
Additional Resources¶
- SLURM Job Examples with Python
- File Management
- Conda Guide (alternative to venv)
- Official Python Documentation
- Pip Guide
Support¶
If you encounter problems using Python on the cluster:
- Check if the module is loaded: module list
- Check if the environment is activated: which python
- See our support page
- Contact us: hpc@fieb.org.br