Python

Overview

Python is an extremely versatile programming language widely used for data science, machine learning, and scientific computing. On the cluster, you can load different versions of Python and create virtual environments to manage packages in isolation.

This guide shows you how to configure and use Python efficiently in the HPC environment.

Check Available Versions

Before starting, check which Python versions are available as modules:

# List available Python modules
module avail python

# See details of a specific module
module spider python/3.10

Create Virtual Environment

Step by Step

1. Load Python

module load python/3.10

2. Choose where to create the environment

  • Small environments (< 5 GB): /home/$USER/venvs/
  • Large or shared environments: /scratch/projetos/<my_project>/envs/

Example:

# For personal environment
mkdir -p ~/venvs

# For project environment
mkdir -p /scratch/projetos/<my_project>/envs

3. Create the virtual environment

# Personal environment
python -m venv ~/venvs/my_env

# Or for project
python -m venv /scratch/projetos/<my_project>/envs/my_env

4. Activate the environment

# Personal environment
source ~/venvs/my_env/bin/activate

# Or for project
source /scratch/projetos/<my_project>/envs/my_env/bin/activate

When active, you'll see the environment name in the prompt:

(my_env) [user@login0 ~]$

5. Install packages

# Update pip first
pip install --upgrade pip

# Install packages
pip install numpy pandas matplotlib scikit-learn

# Install from requirements.txt file
pip install -r requirements.txt

# List installed packages
pip list

6. Create requirements.txt

For reproducibility, save your packages:

pip freeze > requirements.txt

7. Deactivate the environment

deactivate

Use Python in SLURM Jobs

Simple job

#!/bin/bash
#SBATCH --job-name=python_job
#SBATCH --output=/scratch/projetos/<my_project>/logs/job_%j.out
#SBATCH --error=/scratch/projetos/<my_project>/logs/job_%j.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Load Python
module load python/3.10

# Activate virtual environment
source ~/venvs/my_env/bin/activate

# Run script
python my_script.py
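If my_script.py should adapt to its allocation, it can read the environment variables SLURM exports inside every job. A minimal sketch (the variable names are standard SLURM exports; the script body is a placeholder):

```python
import os

# Inside a job, SLURM exports the allocation as environment variables;
# outside a job these are unset, so sensible defaults are provided
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
job_id = os.environ.get("SLURM_JOB_ID", "interactive")
print(f"Job {job_id}: {cpus} CPU(s) allocated")
```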

Job with parallel processing

#!/bin/bash
#SBATCH --job-name=python_parallel
#SBATCH --output=/scratch/projetos/<my_project>/logs/parallel_%j.out
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G

module load python/3.10
source ~/venvs/data_science/bin/activate

# Configure environment variables for NumPy/SciPy libraries
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run with multiprocessing
python parallel_processing.py --workers $SLURM_CPUS_PER_TASK
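The job above assumes a parallel_processing.py that fans work out over the allocated CPUs. A minimal sketch using the standard library's multiprocessing.Pool, matching the --workers flag from the job script (process_item is a placeholder for the real per-item work):

```python
import argparse
import multiprocessing as mp

def process_item(x):
    """Placeholder per-item work; replace with the real computation."""
    return x * x

def run(workers, items):
    # Pool starts `workers` processes; map distributes items across them
    with mp.Pool(processes=workers) as pool:
        return pool.map(process_item, items)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=1)
    args = parser.parse_args()
    print(run(args.workers, range(10)))
```

Sizing the pool from $SLURM_CPUS_PER_TASK (rather than os.cpu_count()) keeps the script from oversubscribing a shared node, since SLURM may allocate fewer cores than the node physically has.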

Job array

To process multiple files or parameters:

#!/bin/bash
#SBATCH --job-name=python_array
#SBATCH --output=/scratch/projetos/<my_project>/logs/array_%A_%a.out
#SBATCH --array=1-10
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2

module load python/3.10
source ~/venvs/my_env/bin/activate

# Use SLURM_ARRAY_TASK_ID as parameter
python process_file.py --id $SLURM_ARRAY_TASK_ID
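Inside process_file.py, the task ID can be mapped onto one input per array element. A minimal sketch (the file list and naming scheme are hypothetical; in practice you might list a data directory):

```python
import argparse
import os

# Hypothetical inputs, one per array task (--array=1-10 in the job script)
INPUT_FILES = [f"data/sample_{i}.csv" for i in range(1, 11)]

def select_input(task_id):
    """Map a 1-based SLURM array task ID onto one input file."""
    return INPUT_FILES[task_id - 1]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Fall back to the environment variable when --id is not passed
    parser.add_argument("--id", type=int,
                        default=int(os.environ.get("SLURM_ARRAY_TASK_ID", "1")))
    args = parser.parse_args()
    print(f"Task {args.id}: would process {select_input(args.id)}")
```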

See complete SLURM job examples for more options.

Best Practices

1. Environment Organization

# Create organized structure
~/venvs/
├── data_analysis/      # For data analysis
├── machine_learning/   # For ML
└── web_scraping/       # For scraping

# Or by project
/scratch/projetos/<my_project>/envs/
├── production/
└── development/

2. Manage Quotas

Python environments can grow quickly. Monitor size:

# Check environment size
du -sh ~/venvs/my_env

# Clear pip cache
pip cache purge

3. Use requirements.txt

Always maintain an updated requirements.txt file:

# Create/update
pip freeze > requirements.txt

# Reinstall in new environment
pip install -r requirements.txt

4. Specify Versions

In requirements.txt, specify exact versions for reproducibility:

numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0

5. Use Local Pip Cache

To avoid repeated downloads, configure local cache:

# Add to ~/.bashrc
export PIP_CACHE_DIR=/scratch/projetos/<my_project>/.pip_cache

Common Packages for Data Science

Data Analysis

pip install numpy pandas scipy matplotlib seaborn

Machine Learning

pip install scikit-learn tensorflow torch

Parallel Processing

# multiprocessing is part of the Python standard library; no installation needed
pip install dask joblib

Visualization

pip install plotly bokeh altair

Jupyter Notebooks

To use Jupyter on the cluster:

# Install in environment
pip install jupyter jupyterlab

# Start (in interactive job or as job)
jupyter notebook --no-browser --port=8888

See the complete Jupyter guide for SSH tunnel configuration.

Common Problems

ImportError after installing package

Problem: Package installed but Python can't find it.

Solution:

# Check if environment is activated
which python

# Should show environment path, not /usr/bin/python
# If not, activate environment again
source ~/venvs/my_env/bin/activate
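You can also confirm from inside Python which interpreter and environment are active (sys.prefix differs from sys.base_prefix only when a virtual environment is in use):

```python
import sys

print(sys.executable)   # the interpreter actually running
print(sys.prefix)       # root of the active environment
# True inside a venv, False for the system/module Python
print("venv active:", sys.prefix != sys.base_prefix)
```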

Out of space in /home

Problem: Quota exceeded when installing packages.

Solution:

# Move environment to /scratch
mv ~/venvs/my_env /scratch/projetos/<my_project>/envs/

# Create symbolic link
ln -s /scratch/projetos/<my_project>/envs/my_env ~/venvs/my_env

Packages require compilation

Problem: Error installing packages that need compilation (e.g., numpy, scipy).

Solution:

# Load compiler
module load gcc/11.2.0

# Or use pre-compiled binaries
pip install --only-binary :all: numpy

Incompatible Python version

Problem: Package requires specific Python version.

Solution:

# List available versions
module avail python

# Load appropriate version
module load python/3.11

# Create new environment with this version
python -m venv ~/venvs/my_env_py311

Remove Virtual Environment

When you no longer need the environment:

# Deactivate if active
deactivate

# Remove directory
rm -rf ~/venvs/my_env

Support

If you encounter problems using Python on the cluster:

  • Check if module is loaded: module list
  • Check if environment is activated: which python
  • See our support page
  • Contact us: hpc@fieb.org.br