Marbec-GPU Documentation
Welcome to the Marbec-GPU cluster documentation. This document provides an overview of the cluster, its capabilities, and how to get started with using it.
The Marbec-GPU cluster is designed to provide high-performance computing resources for computational workloads such as Python and R code. It runs Ubuntu Linux and features a Jupyter interface for ease of use. Several common tools are installed, including Python, R, Git, Conda, CUDA, and RStudio.
Features
- Resources
- 2 NVIDIA A40 GPUs
- 2 Intel Xeon Platinum 8380 CPUs, 2x40 cores, 2x80 threads
- 1.48 TB of RAM
- MARBEC-DATA Interconnections
Registration
To start using the Marbec-GPU cluster, you will need to join the Marbec-DEN group. For more details, contact the DEN administrators.
Documentation
For detailed instructions on how to use the Marbec-GPU cluster, please refer to the following sections:
- Initiation Guide (coming soon)
- Useful Linux Commands Guide
- Basic Script Execution (via SLURM)
- R Script Execution
Support
If you encounter any issues or have questions, reach out on the RocketChatIRD channel.
FAQ
- What resources do I need to allocate?
Good question! It depends on your input data (size and type), your model (stochastic, statistical, neural network, etc.), your task, and, most importantly, the packages you are using. For example, some packages do not support GPU computation, while others cannot parallelize across multiple CPUs. Research the packages you are using to avoid allocating resources that won't be utilized, and adapt your scripts accordingly. Here are some examples of resource allocations: training a PyTorch YOLO model: `--mem=64G`, `-c 16`, and `--gres=gpu:1`; running HSMC (TensorFlow): `--mem=64G`, `--cpus-per-task=30`, and `--gpus-per-node=1`. A minimal batch script putting these options together is sketched below.
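For instance, a batch script requesting the resources quoted above for a single-GPU training job could look like the following sketch (the job name, time limit, and script path are illustrative assumptions, not cluster defaults):

```bash
#!/bin/bash
#SBATCH --job-name=yolo-train   # illustrative name; pick your own
#SBATCH --mem=64G               # 64 GB of RAM
#SBATCH -c 16                   # 16 CPU cores, e.g. for data loading
#SBATCH --gres=gpu:1            # one of the two A40 GPUs
#SBATCH --time=04:00:00         # assumed time limit; adjust to your job

# train.py is a placeholder for your own training script
python train.py
```

Submit it with `sbatch`; the job ID it prints is the one referenced in the questions below.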
- Is my script GPU-capable?
No, not directly. However, some libraries are GPU-capable. If your framework or script does not specifically use the GPU, your code will NOT utilize GPU hardware. Main examples of GPU-capable libraries: PyTorch, TensorFlow, Keras, Theano, Caffe, etc.
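As a quick check (assuming PyTorch is installed in your environment), you can ask the library itself whether it can see a GPU:

```bash
# Prints True only if PyTorch was built with CUDA support
# and a GPU is visible to the job (requires a GPU allocation)
python -c "import torch; print(torch.cuda.is_available())"
```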
- How do I cancel a submitted job?
Use the command `scancel JOBID`, where `JOBID` is the ID of the job you want to cancel. You can find the job ID in the output of the `sbatch` command when you submit a job, or by using the `squeue` command as described in the next question; for more details, see the SLURM scancel documentation.
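For example, assuming a job ID of 4391 (an illustrative value, reused in the dependency example below):

```bash
# Cancel a single job by its ID
scancel 4391

# Cancel all of your own jobs at once
scancel -u $USER
```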
- How do I access the job queue?
Use the following command: `squeue -O NAME,UserName,TimeUsed,tres-per-node,state,JOBID`. This command displays a detailed list of jobs in the queue, including the job name (e.g., `spawner-jupyterhub` for a "job-session"; otherwise, the name specified in the `#SBATCH --job-name` argument), username, running time, resources per node (e.g., `gres:gpu:1` for a GPU allocation, `gres:gpu:0` for a CPU-only allocation), job state (e.g., `PENDING` for jobs waiting to start due to resource availability or scheduling, or `RUNNING` for jobs currently being executed), and the JOBID (a unique identifier for each job). Refer to the SLURM squeue documentation for more details.
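To narrow the listing down to your own jobs, `squeue` also accepts a user filter (the format string is the same one used above):

```bash
# Show only your own jobs, with the same detailed columns
squeue -u $USER -O NAME,UserName,TimeUsed,tres-per-node,state,JOBID
```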
- How do I submit multiple jobs without blocking other users?
Thank you from the entire MarbecGPU community for using resources in a cooperative and friendly manner. You can use the `#SBATCH --dependency=afterany:JOBID` parameter, where `JOBID` is the ID of the job you want to wait for (e.g., 4391). You can find the job ID in the output of the `sbatch` command when you submit a job, or by using the `squeue` command as mentioned in the previous question. According to the SLURM sbatch documentation, this parameter ensures that the start of your job is deferred until the specified dependency is satisfied. For file-based dependencies or more complex cases, you can explore other mechanisms to further delay or sequence your job execution as needed.
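A minimal sketch of chaining two jobs this way (the script names are illustrative; `--parsable` makes `sbatch` print just the job ID so it can be captured in a shell variable):

```bash
# Submit the first job and capture its job ID
JOBID=$(sbatch --parsable first_job.sh)

# The second job starts only after the first terminates,
# regardless of whether it succeeded or failed (afterany)
sbatch --dependency=afterany:$JOBID second_job.sh
```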