# Monitoring Resources

Constantly monitoring your resource usage on all servers where you have running jobs is *fundamental*. In general, you should always be aware of your jobs':

* CPU and RAM usage
* GPU usage (for servers with GPUs)
* Storage usage
## Monitoring CPU and RAM Usage

The best way to monitor your CPU and RAM usage is the `htop` command-line utility, available on all servers. It shows, in a graphical way, the current load of the entire server and the resource utilization of every running process. By default, the processes of all users are shown. To see only your own, run:

```
htop -u <user>
```

where `<user>` is your username. See htop's [man page](https://man7.org/linux/man-pages/man1/htop.1.html) for details on the output format and the various display options.
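When a non-interactive snapshot is more convenient than the `htop` screen (for example, to save it to a log), similar per-process figures can be printed with `ps`. The following is a minimal sketch, assuming the procps version of `ps` found on most Linux distributions; the function name `my_procs` is hypothetical and not part of any server setup. Note that the `%CPU` column of `ps` is an average over the lifetime of each process, not the instantaneous value shown by `htop`.

```shell
# List the current user's processes, heaviest CPU consumers first.
# Columns: PID, %CPU, %MEM, resident memory (KiB), elapsed time, command name.
# Assumes procps ps (the --sort option is not available in every ps variant).
my_procs() {
    ps -u "$(id -un)" -o pid,pcpu,pmem,rss,etime,comm --sort=-pcpu
}

# Header line plus the ten heaviest processes.
my_procs | head -n 11
```

Sorting by memory instead only requires replacing `--sort=-pcpu` with `--sort=-rss`.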
## Monitoring GPU Usage

TODO

## Monitoring Storage Usage

TODO
## Forgetting to Monitor

You should always monitor *all* your processes carefully. When you run a new script for the first time, *always* use the monitoring tools described above to make sure that you are using a reasonable amount of resources. For long-running scripts, repeat this check periodically to make sure that they do not have memory leaks or other resource-related issues.
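One way to carry out the periodic check on a long-running job is to log its resident memory at fixed intervals and look at the trend: a value that keeps growing without levelling off suggests a leak. The sketch below assumes a Linux server (it reads `VmRSS` from `/proc/<pid>/status`); the function name `log_rss` and its arguments are hypothetical. It could be started in the background with, e.g., `log_rss <pid> 300 >> rss.log &`.

```shell
# Print "timestamp rss_kib" for process $1 every $2 seconds (default 60).
# Stops after $3 samples if given, or when the process no longer exists.
log_rss() {
    local pid=$1 interval=${2:-60} max=${3:-0} n=0 rss
    while [ "$max" -eq 0 ] || [ "$n" -lt "$max" ]; do
        # VmRSS is the resident set size in KiB; empty once the process is gone.
        rss=$(awk '/^VmRSS:/ { print $2 }' "/proc/$pid/status" 2>/dev/null)
        [ -n "$rss" ] || break
        printf '%s %s\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$rss"
        n=$((n + 1))
        sleep "$interval"
    done
}
```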

What is a reasonable amount of resources? Except for storage, we do not impose hard limits. However, keep in mind that, typically, more than 10 people are actively running jobs on each server at all times (day and night, 7 days a week). So, if your processes alone take (say) 70% of all cores and RAM, you are clearly not being respectful of others. Even worse, if you take 100% of the resources, you could render the server completely unreachable, making it impossible for other users (or even for sysadmins who want to kill your processes) to connect.
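To put a number on your own share, the per-process figures can be summed and compared with the server totals. The sketch below is an illustration under stated assumptions: the helper name `usage_share` is hypothetical, it expects `%CPU` and RSS (KiB) pairs such as those printed by procps `ps`, and summing RSS counts memory shared between processes more than once, so the RAM figure is an upper bound. As with any `ps` output, `%CPU` is averaged over the lifetime of each process.

```shell
# Read "%CPU RSS_KiB" pairs on stdin and print the combined share of the machine.
# $1 = number of cores, $2 = total RAM in KiB.
usage_share() {
    awk -v cores="$1" -v mem_kib="$2" '
        { cpu += $1; rss += $2 }
        END {
            # ps reports %CPU relative to one core, so divide by the core count.
            printf "CPU: %.1f%% of %d cores, RAM: %.1f%% of total\n",
                   cpu / cores, cores, 100 * rss / mem_kib
        }'
}

# Example for the current user on a Linux server (assumes procps ps and /proc).
ps -u "$(id -un)" -o pcpu= -o rss= |
    usage_share "$(nproc)" "$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)"
```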

We chose not to set up the servers in a way that would make this kind of damage completely *impossible*, because that would require a mechanism (e.g., cgroups, private VMs, etc.) that would reduce the resources available to everyone under normal conditions (of positive cooperation).

So, while you have the possibility of behaving badly, this does not mean that there will be no consequences if you do. In fact, when a server is overloaded, sysadmins automatically receive a notification with the resource usage details of all processes, and any misbehavior (especially if repeated) will be reported directly to Prof. Enrico Macii. Remember: *"errare humanum est, perseverare autem diabolicum"*.