Python

Overview

Python is an extremely versatile programming language widely used for data science, machine learning, and scientific computing. On the cluster, you can load different versions of Python and create virtual environments to manage packages in isolation.

This guide shows you how to configure and use Python efficiently in the HPC environment.

Check Available Versions

Before starting, check which Python versions are available as modules:

# List available Python modules
module avail python

# See details of a specific module
module spider python/3.10

Create Virtual Environment

Step by Step

1. Load Python

module load python/3.10

2. Choose where to create the environment

  • Small environments (< 5 GB): /home/$USER/venvs/
  • Large or shared environments: /scratch/projetos/<my_project>/envs/

Example:

# For personal environment
mkdir -p ~/venvs

# For project environment
mkdir -p /scratch/projetos/<my_project>/envs

3. Create the virtual environment

# Personal environment
python -m venv ~/venvs/my_env

# Or for project
python -m venv /scratch/projetos/<my_project>/envs/my_env

4. Activate the environment

# Personal environment
source ~/venvs/my_env/bin/activate

# Or for project
source /scratch/projetos/<my_project>/envs/my_env/bin/activate

When active, you'll see the environment name in the prompt:

(my_env) [user@login0 ~]$

5. Install packages

# Update pip first
pip install --upgrade pip

# Install packages
pip install numpy pandas matplotlib scikit-learn

# Install from requirements.txt file
pip install -r requirements.txt

# List installed packages
pip list

6. Create requirements.txt

For reproducibility, save your packages:

pip freeze > requirements.txt

7. Deactivate the environment

deactivate

Use Python in SLURM Jobs

Simple job

#!/bin/bash
#SBATCH --job-name=python_job
#SBATCH --output=/scratch/projetos/<my_project>/logs/job_%j.out
#SBATCH --error=/scratch/projetos/<my_project>/logs/job_%j.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Load Python
module load python/3.10

# Activate virtual environment
source ~/venvs/my_env/bin/activate

# Run script
python my_script.py
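If my_script.py should adapt to its allocation, it can read the environment variables SLURM exports inside every job. A minimal sketch (the variable names are standard SLURM exports; the script body is a placeholder):

```python
import os

# Inside a job, SLURM exports the allocation as environment variables;
# outside a job these are unset, so sensible defaults are provided
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
job_id = os.environ.get("SLURM_JOB_ID", "interactive")
print(f"Job {job_id}: {cpus} CPU(s) allocated")
```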

Job with parallel processing

#!/bin/bash
#SBATCH --job-name=python_parallel
#SBATCH --output=/scratch/projetos/<my_project>/logs/parallel_%j.out
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G

module load python/3.10
source ~/venvs/data_science/bin/activate

# Configure environment variables for NumPy/SciPy libraries
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run with multiprocessing
python parallel_processing.py --workers $SLURM_CPUS_PER_TASK
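The job above assumes a parallel_processing.py that fans work out over the allocated CPUs. A minimal sketch using the standard library's multiprocessing.Pool, matching the --workers flag from the job script (process_item is a placeholder for the real per-item work):

```python
import argparse
import multiprocessing as mp

def process_item(x):
    """Placeholder per-item work; replace with the real computation."""
    return x * x

def run(workers, items):
    # Pool starts `workers` processes; map distributes items across them
    with mp.Pool(processes=workers) as pool:
        return pool.map(process_item, items)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=1)
    args = parser.parse_args()
    print(run(args.workers, range(10)))
```

Sizing the pool from $SLURM_CPUS_PER_TASK (rather than os.cpu_count()) keeps the script from oversubscribing a shared node, since SLURM may allocate fewer cores than the node physically has.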

Job array

To process multiple files or parameters:

#!/bin/bash
#SBATCH --job-name=python_array
#SBATCH --output=/scratch/projetos/<my_project>/logs/array_%A_%a.out
#SBATCH --array=1-10
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2

module load python/3.10
source ~/venvs/my_env/bin/activate

# Use SLURM_ARRAY_TASK_ID as parameter
python process_file.py --id $SLURM_ARRAY_TASK_ID
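Inside process_file.py, the task ID can be mapped onto one input per array element. A minimal sketch (the file list and naming scheme are hypothetical; in practice you might list a data directory):

```python
import argparse
import os

# Hypothetical inputs, one per array task (--array=1-10 in the job script)
INPUT_FILES = [f"data/sample_{i}.csv" for i in range(1, 11)]

def select_input(task_id):
    """Map a 1-based SLURM array task ID onto one input file."""
    return INPUT_FILES[task_id - 1]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Fall back to the environment variable when --id is not passed
    parser.add_argument("--id", type=int,
                        default=int(os.environ.get("SLURM_ARRAY_TASK_ID", "1")))
    args = parser.parse_args()
    print(f"Task {args.id}: would process {select_input(args.id)}")
```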

See complete SLURM job examples for more options.

Best Practices

1. Environment Organization

# Create organized structure
~/venvs/
├── data_analysis/      # For data analysis
├── machine_learning/   # For ML
└── web_scraping/       # For scraping

# Or by project
/scratch/projetos/<my_project>/envs/
├── production/
└── development/

2. Manage Quotas

Python environments can grow quickly. Monitor size:

# Check environment size
du -sh ~/venvs/my_env

# Clear pip cache
pip cache purge

3. Use requirements.txt

Always maintain an updated requirements.txt file:

# Create/update
pip freeze > requirements.txt

# Reinstall in new environment
pip install -r requirements.txt

4. Specify Versions

In requirements.txt, specify exact versions for reproducibility:

numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0

5. Use Local Pip Cache

To avoid repeated downloads, configure local cache:

# Add to ~/.bashrc
export PIP_CACHE_DIR=/scratch/projetos/<my_project>/.pip_cache

Common Packages for Data Science

Data Analysis

pip install numpy pandas scipy matplotlib seaborn

Machine Learning

pip install scikit-learn tensorflow torch

Parallel Processing

# multiprocessing is part of the Python standard library; no installation needed
pip install dask joblib

Visualization

pip install plotly bokeh altair

Jupyter Notebooks

To use Jupyter on the cluster:

# Install in environment
pip install jupyter jupyterlab

# Start (in interactive job or as job)
jupyter notebook --no-browser --port=8888

See the complete Jupyter guide for SSH tunnel configuration.

Common Problems

ImportError after installing package

Problem: Package installed but Python can't find it.

Solution:

# Check if environment is activated
which python

# Should show environment path, not /usr/bin/python
# If not, activate environment again
source ~/venvs/my_env/bin/activate
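You can also confirm from inside Python which interpreter and environment are active (sys.prefix differs from sys.base_prefix only when a virtual environment is in use):

```python
import sys

print(sys.executable)   # the interpreter actually running
print(sys.prefix)       # root of the active environment
# True inside a venv, False for the system/module Python
print("venv active:", sys.prefix != sys.base_prefix)
```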

Out of space in /home

Problem: Quota exceeded when installing packages.

Solution:

# Move environment to /scratch
mv ~/venvs/my_env /scratch/projetos/<my_project>/envs/

# Create symbolic link
ln -s /scratch/projetos/<my_project>/envs/my_env ~/venvs/my_env

Packages require compilation

Problem: Error installing packages that need compilation (e.g., numpy, scipy).

Solution:

# Load compiler
module load gcc/11.2.0

# Or use pre-compiled binaries
pip install --only-binary :all: numpy

Incompatible Python version

Problem: Package requires specific Python version.

Solution:

# List available versions
module avail python

# Load appropriate version
module load python/3.11

# Create new environment with this version
python -m venv ~/venvs/my_env_py311

Remove Virtual Environment

When you no longer need the environment:

# Deactivate if active
deactivate

# Remove directory
rm -rf ~/venvs/my_env

Support

If you encounter problems using Python on the cluster:

  • Check if module is loaded: module list
  • Check if environment is activated: which python
  • See our support page
  • Contact us: hpc@fieb.org.br