Marbec-GPU Documentation
Welcome to the Marbec-GPU cluster documentation. This document provides an overview of the cluster, its capabilities, and how to get started with using it.
The Marbec-GPU cluster is designed to provide high-performance computing resources for computational workloads such as Python and R code. It runs Ubuntu Linux and features a Jupyter interface for ease of use. Several common tools are installed, including Python, R, Git, Conda, CUDA, and RStudio.
Features
- Resources
- 2 NVIDIA A40 GPUs
- 2 Intel Xeon Platinum 8380 CPUs, 2x40 cores, 2x80 threads
- 1.48 TB of RAM
- MARBEC-DATA Interconnections
Registration
To start using the Marbec-GPU cluster, you will need to join the Marbec-DEN group. For more details, contact the DEN administrators.
Documentation
For detailed instructions on how to use the Marbec-GPU cluster, please refer to the following sections:
- Initiation Guide (coming soon)
- Useful Linux Commands Guide
- Basic Script Execution (via SLURM)
- R Script Execution
Support
If you encounter any issues or have questions, reach out on the RocketChatIRD channel.
FAQ
- What resources do I need to allocate?
Good question! It depends on your input data (size and type), your model (stochastic, statistical, neural network, etc.), your task, and, most importantly, the packages you are using. For example, some packages do not support GPU computation, while others cannot parallelize across multiple CPUs. Research the packages you are using to avoid allocating resources that won't be utilized, and adapt your scripts accordingly. Here are some examples of resource allocations: training a PyTorch YOLO model: `--mem=64G`, `-c 16`, and `--gres=gpu:1`; running HSMC (TensorFlow): `--mem=64G`, `--cpus-per-task=30`, and `--gpus-per-node=1`. A minimal batch script putting these options together is sketched below.
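For instance, a batch script requesting the resources quoted above for a single-GPU training job could look like the following sketch (the job name, time limit, and script path are illustrative assumptions, not cluster defaults):

```bash
#!/bin/bash
#SBATCH --job-name=yolo-train   # illustrative name; pick your own
#SBATCH --mem=64G               # 64 GB of RAM
#SBATCH -c 16                   # 16 CPU cores, e.g. for data loading
#SBATCH --gres=gpu:1            # one of the two A40 GPUs
#SBATCH --time=04:00:00         # assumed time limit; adjust to your job

# train.py is a placeholder for your own training script
python train.py
```

Submit it with `sbatch`; the job ID it prints is the one referenced in the questions below.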
- Is my script GPU-capable?
No, not directly. However, some libraries are GPU-capable. If your framework or script does not specifically use the GPU, your code will NOT utilize GPU hardware. Main examples of GPU-capable libraries: PyTorch, TensorFlow, Keras, Theano, Caffe, etc.
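As a quick check (assuming PyTorch is installed in your environment), you can ask the library itself whether it can see a GPU:

```bash
# Prints True only if PyTorch was built with CUDA support
# and a GPU is visible to the job (requires a GPU allocation)
python -c "import torch; print(torch.cuda.is_available())"
```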
- How do I cancel a submitted job?
Use the command `scancel JOBID`, where `JOBID` is the ID of the job you want to cancel. You can find the job ID in the output of the `sbatch` command when you submit a job, or by using the `squeue` command as described in the next question; for more details, see the SLURM scancel documentation.
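For example, assuming a job ID of 4391 (an illustrative value, reused in the dependency example below):

```bash
# Cancel a single job by its ID
scancel 4391

# Cancel all of your own jobs at once
scancel -u $USER
```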
- How do I access the job queue?
Use the following command: `squeue -O NAME,UserName,TimeUsed,tres-per-node,state,JOBID`. This command displays a detailed list of jobs in the queue, including the job name (e.g., `spawner-jupyterhub` for a "job-session"; otherwise, the name specified in the `#SBATCH --job-name` argument), username, running time, resources per node (e.g., `gres:gpu:1` for a GPU allocation, `gres:gpu:0` for a CPU-only allocation), job state (e.g., `PENDING` for jobs waiting to start due to resource availability or scheduling, or `RUNNING` for jobs currently being executed), and the JOBID (a unique identifier for each job). Refer to the SLURM squeue documentation for more details.
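To narrow the listing down to your own jobs, `squeue` also accepts a user filter (the format string is the same one used above):

```bash
# Show only your own jobs, with the same detailed columns
squeue -u $USER -O NAME,UserName,TimeUsed,tres-per-node,state,JOBID
```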
- How do I submit multiple jobs without blocking other users?
Thank you from the entire MarbecGPU community for using resources in a cooperative and friendly manner. You can use the `#SBATCH --dependency=afterany:JOBID` parameter, where `JOBID` is the ID of the job you want to wait for (e.g., 4391). You can find the job ID in the output of the `sbatch` command when you submit a job, or by using the `squeue` command as mentioned in the previous question. According to the SLURM sbatch documentation, this parameter ensures that the start of your job is deferred until the specified dependency is satisfied. For file-based dependencies or more complex cases, you can explore other mechanisms to further delay or sequence your job execution as needed.
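A minimal sketch of chaining two jobs this way (the script names are illustrative; `--parsable` makes `sbatch` print just the job ID so it can be captured in a shell variable):

```bash
# Submit the first job and capture its job ID
JOBID=$(sbatch --parsable first_job.sh)

# The second job starts only after the first terminates,
# regardless of whether it succeeded or failed (afterany)
sbatch --dependency=afterany:$JOBID second_job.sh
```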