# Monitoring Resources

Constantly monitoring your resource usage on all servers where you have running jobs is _fundamental_. In general, you should always be aware of your scripts':

* CPU and RAM usage
* GPU usage (for servers with GPUs)
* Storage usage

## Monitoring CPU and RAM Usage

The best way to monitor your CPU and RAM usage is through the `htop` command-line utility, available on all servers. It shows, in a text-based graphical view, the current load of the entire server and the resource utilization of every running process. By default, the processes of all users are shown. If you only want to see your own, run:

```plaintext
htop -u <user>
```

where `<user>` is your username. From inside the `htop` window, press `q` to exit.

Please check htop's [man page](https://man7.org/linux/man-pages/man1/htop.1.html) for all details on the output format and the various display options.
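
When an interactive view is not practical (for example, when you want to paste a snapshot into a log), a one-off listing can be produced with `ps`. The following is a minimal sketch, not part of the servers' tooling: the function name `my_top_processes` is hypothetical, and the `--sort` option assumes the procps-ng version of `ps` found on most Linux distributions.

```shell
# Print a one-off snapshot of your own processes, heaviest RAM users first.
# $1: number of processes to list (defaults to 10).
# Assumption: procps-ng `ps` (needed for the `--sort` option).
my_top_processes() {
    ps -u "$(id -un)" -o pid,pcpu,pmem,rss,etime,comm --sort=-pmem |
        head -n "$(( ${1:-10} + 1 ))"   # +1 keeps the header row
}

# Example: show your 5 most memory-hungry processes.
# my_top_processes 5
```

The `rss` column is the resident memory in KiB, and `pcpu` is averaged over the lifetime of the process, so it can differ from the instantaneous value shown by `htop`.
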
## Monitoring GPU Usage

> :warning: Thesis students only have access to `philae.polito.it`, which does not have a GPU installed (see ["Servers Information"](/servers)). This sub-section is therefore relevant only to other group members.

You can monitor the GPU utilization and GPU memory occupation of your scripts using the `nvidia-smi` command. A nicer-looking version of the same information can also be obtained with the custom script `nvidia-htop.py`, which is only available on `icaro`.

Unlike `htop`, these two commands print their output once and then terminate immediately. If you want a continuously updating output, similar to `htop`, you can launch them through the `watch` command-line utility. For example:

```plaintext
watch nvidia-htop.py
```

Check the man page of the `watch` command for details on its usage.

**IMPORTANT:** most programs and scripts that use the GPU will **crash** if they run out of memory. This means that, if another user is currently taking 40% of one GPU's memory, and you launch a script that requires more than the remaining 60%, either your script or the other user's will crash. This can cause serious damage to other people, who may lose hours or days of compute because of your misbehavior. Notice also that some libraries (e.g., older versions of PyTorch) do not let you set a hard limit on the GPU memory requested by your scripts. For all these reasons, you should _really_ pay attention to these monitoring tools at all times.
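
One way to reduce the risk described above is to check how much GPU memory is free right before launching a job. The sketch below uses the `--query-gpu` interface of `nvidia-smi`. The function name `gpu_has_free_mib`, the 8000 MiB threshold, and the script name `train.py` are hypothetical and only serve as illustration.

```shell
# Succeed only if GPU $1 currently reports at least $2 MiB of free memory.
# Assumption: `nvidia-smi` is installed and supports --query-gpu.
gpu_has_free_mib() {
    free_mib="$(nvidia-smi --id="$1" --query-gpu=memory.free \
                           --format=csv,noheader,nounits | tr -d '[:space:]')"
    [ -n "$free_mib" ] && [ "$free_mib" -ge "$2" ]
}

# Example (hypothetical script name and threshold):
# gpu_has_free_mib 0 8000 && python train.py
```

Keep in mind that this is only a snapshot: another user may allocate memory right after the check, so it complements the monitoring tools above rather than replacing them.
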
## Monitoring Storage Usage

To monitor your total disk usage you can use the `quota` command, as explained in the ["Storage Management and Quotas"](/storage) page.

You can also compute the total storage occupied by a directory (including all its content) with the following command:

```plaintext
du -hs <path-to-directory>
```

Finally, if you want to know the total storage space available on (all disks of) a server, you can use the following command:

```plaintext
df -h
```

Check the man pages of `quota`, `du` and `df` for further options and details.
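
When a directory turns out to be larger than expected, it is often useful to see which of its sub-directories takes most of the space. A possible sketch, assuming GNU coreutils (for `du --max-depth` and `sort -h`); the function name `largest_subdirs` is hypothetical:

```shell
# List the size of directory $1 (default: current directory) and of each of
# its immediate sub-directories, largest first.
# $2: maximum number of lines to print (defaults to 10).
# Assumption: GNU coreutils (`du --max-depth`, `sort -h`).
largest_subdirs() {
    du -h --max-depth=1 "${1:-.}" 2>/dev/null | sort -rh | head -n "${2:-10}"
}

# Example: inspect your home directory.
# largest_subdirs "$HOME"
```

The line for the directory itself is included in the listing, since it is the sum of everything below it.
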
## Forgetting to Monitor

You should always monitor _all_ your processes carefully. When you run a new script for the first time, _always_ use the monitoring tools described above to make sure that you are using a reasonable amount of resources. If the script is a long-running one, repeat this check periodically to ensure that it doesn't have memory leaks or other resource-related issues.
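
For long-running scripts, the periodic check can be partly automated by logging the memory of a process at regular intervals. A steadily growing value usually points to a leak. This is a minimal sketch under the assumption that a procps-style `ps` is available; the function name `log_rss` and the log file name are hypothetical.

```shell
# Log the resident memory (RSS, in KiB) of process $1 every $2 seconds
# (default: 60) until the process exits.
log_rss() {
    while kill -0 "$1" 2>/dev/null; do
        rss="$(ps -o rss= -p "$1" | tr -d ' ')"
        # An empty or zero value means the process is gone (or is a zombie).
        case "$rss" in ''|0) break ;; esac
        printf '%s pid=%s rss_kib=%s\n' "$(date '+%F %T')" "$1" "$rss"
        sleep "${2:-60}"
    done
}

# Example: sample process 12345 every 5 minutes, in the background.
# log_rss 12345 300 >> rss.log &
```

This does not replace looking at `htop` yourself, because it only tracks a single process and only its RAM.
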
#### What is a reasonable amount of resources?

Except for storage, we do not impose hard limits.

However, you should keep in mind that, typically, more than 10 people are actively running jobs on each server at all times (day and night, 7 days a week). So, if your processes alone take (say) 70% of all cores and RAM, you are clearly not being respectful of others. Even worse, if you take 100% of the resources, you could render the server completely unreachable, making it impossible for other users (or even for sysadmins who want to kill your processes) to connect.
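
To get a rough idea of your own share of a server, you can sum the CPU and RAM percentages of all your processes and compare them with the number of cores. The following sketch is only an approximation: the function name `my_share` is hypothetical, `ps` reports CPU usage averaged over each process's lifetime, and summing `%MEM` counts memory shared between processes more than once.

```shell
# Print an approximate summary of the CPU cores and RAM used by your processes.
# Assumptions: procps-style `ps` and coreutils `nproc` are available.
my_share() {
    ps -u "$(id -un)" -o pcpu= -o pmem= |
        awk -v cores="$(nproc)" '
            { cpu += $1; mem += $2 }
            END { printf "CPU: %.1f of %d cores busy, RAM: %.1f%%\n",
                         cpu / 100, cores, mem }'
}
```

If the reported values are a large fraction of the whole machine, consider reducing the number of parallel workers or the batch size of your script.
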
#### Why are you allowed to do damage?

We chose not to set up the servers in a way that makes this kind of damage completely _impossible_ for users, because that would require mechanisms (e.g., private VMs, hard per-user limits on cores/RAM, etc.) that would reduce the resources available to everyone under normal conditions (of positive cooperation).

#### What happens if you do?

Being able to behave badly does not mean that doing so has no consequences. In fact, when a server is overloaded, sysadmins automatically receive a notification with the resource usage details of all processes, and any misbehavior (especially if repeated) will be reported directly to Prof. Enrico Macii.

Remember: _"errare humanum est, perseverare autem diabolicum"_