Create monitoring, authored May 06, 2021 by Daniele Jahier Pagliari
# Monitoring Resources
Constantly monitoring your resource usage on all servers where you have running jobs is *fundamental*. In general, you should always be aware of your jobs':
* CPU and RAM usage
* GPU usage (for servers with GPUs)
* Storage usage
## Monitoring CPU and RAM Usage
The best way to monitor your CPU and RAM usage is the `htop` command-line utility, available on all servers. It shows, in a graphical way, the current load of the entire server and the resource utilization of every running process. By default, the processes of all users are shown. If you only want to see your own, run:
```
htop -u <user>
```
where `<user>` is your username. Please check htop's [man page](https://man7.org/linux/man-pages/man1/htop.1.html) for all details on the output format and the various display options.
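If you prefer a one-shot, non-interactive snapshot (for example to paste into a message or a log), the standard `ps` tool can give a rough equivalent. The following is only a sketch: it assumes the Linux procps version of `ps`, and the function name `my_procs` is made up for illustration. Note that the `%CPU` column of `ps` is averaged over each process's lifetime, so `htop` remains the better view of the *current* load.

```shell
# Hypothetical helper (not part of the servers' setup): list your own
# processes, heaviest CPU consumers first. Assumes procps `ps` on Linux.
# Columns: PID, CPU %, memory %, resident memory (KiB), elapsed time, command.
my_procs() {
    ps -u "$(id -un)" -o pid,pcpu,pmem,rss,etime,comm --sort=-pcpu |
        awk 'NR <= 15'   # header line plus the top 14 processes
}
```

Calling `my_procs` prints a small table that you can compare with what `htop -u <user>` shows.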
## Monitoring GPU Usage
TODO
## Monitoring Storage Usage
TODO
## Forgetting to Monitor
You should always monitor *all* your processes carefully. When you run a new script for the first time, *always* use the monitoring tools described above to make sure that you are using a reasonable amount of resources. If the scripts are long-running ones, do this check periodically to ensure that they don't have memory leaks or other resource-related issues.
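For long-running scripts, one simple way to do the periodic check is to record the resident memory of the process at regular intervals and see whether it keeps growing. This is a minimal sketch, assuming procps `ps`; the function `log_rss` and its default interval and sample count are illustrative choices, not something provided on the servers.

```shell
# Hypothetical helper: print a timestamp and the resident memory (KiB)
# of one process at regular intervals.
# Usage: log_rss PID [INTERVAL_SECONDS] [SAMPLES]
log_rss() {
    pid="$1"
    interval="${2:-60}"   # assumed default: one sample per minute
    samples="${3:-10}"    # assumed default: ten samples
    i=0
    # Stop after the requested number of samples or when the process exits.
    while [ "$i" -lt "$samples" ] && kill -0 "$pid" 2>/dev/null; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$(ps -o rss= -p "$pid" | tr -d ' ')"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, `log_rss 12345 300 12` would sample the (hypothetical) process 12345 every five minutes for an hour. A value that rises steadily without levelling off is a hint of a memory leak.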
What is a reasonable amount of resources? Except for storage, we do not impose hard limits.
However, keep in mind that, typically, more than 10 people are actively running jobs on each server at all times (day and night, 7 days a week). So, if your processes alone take (say) 70% of all cores and RAM, you are clearly not being respectful of others. Even worse, if you take 100% of the resources, you could render the server completely unreachable, making it impossible for other users (or even for sysadmins who want to kill your processes) to connect.
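To get a rough idea of your own share of a server, you can sum the CPU and memory percentages of all your processes. The sketch below assumes procps `ps` and coreutils `nproc`; the name `my_share` is hypothetical. The result is only an approximation: `ps` reports CPU usage averaged over each process's lifetime, and memory shared between processes is counted more than once.

```shell
# Hypothetical helper: approximate the fraction of the server's CPU and RAM
# used by all of your processes together.
my_share() {
    cores="$(nproc)"   # number of available cores
    ps -u "$(id -un)" -o pcpu=,pmem= |
        awk -v cores="$cores" '
            { cpu += $1; mem += $2 }   # 100% CPU in ps means one full core
            END { printf "CPU: %.1f%% of %d cores, RAM: %.1f%%\n", cpu / cores, cores, mem }'
}
```

If `my_share` reports values anywhere near the 70% mentioned above, it is time to scale your jobs down.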
We chose not to set up the servers so that this kind of damage would be completely *impossible* for users, because this would require a mechanism (e.g. cgroups, private VMs, etc.) that would reduce the resources available to everyone under normal conditions (of positive cooperation).
So, while you have the possibility of behaving badly, this does not mean that there will be no consequences if you do. In fact, when a server is overloaded, sysadmins automatically receive a notification with the resource usage details of all processes, and any misbehavior (especially if repeated) will be reported directly to Prof. Enrico Macii. Remember: *"errare humanum est, perseverare autem diabolicum"*.