In addition to the information contained on this page, there are a number of tutorials written by Alex Knudson discussing how to install specific versions of R and Python using miniconda and how to access the department server using Jupyter notebooks. You may also wish to consult the slides for the talk given by Grant Schissler on March 31, 2021 in the graduate student seminar. Finally, the slides for the talk given by Eric Olson on October 4, 2023 are also available.
Since this page has become rather long, please use the following table of contents to navigate to the section you are interested in:
You should be able to access either server by name or by IP number from on campus. From off campus you will need the UNR VPN. More information about VPN access is available in the section How to Connect to Okapi.
The goal of Okapi is to make a server available to all graduate students and department faculty for small computational runs and as a convenient software development environment for learning. This improves the research environment in the department and supports our advanced degree programs. Feel free to try things out.
Medium-sized computational runs are now supported on Caprine using the slow batch queue. Note this queue is hidden, so it will not appear unless specified explicitly with -pslow or with -a to view all queues. For example,
$ sinfo -a
displays the current state of all queues and
$ squeue -a
views information about all jobs in the system. See Running Jobs in the Slow Queue for more information.
For very large computations please use the UNR Pronghorn supercomputer.
Note that the Mathematics and Statistics Department owns a fully-paid 32-core CPU node on Pronghorn. This node may be reserved by contacting Mihye Ahn and filling out the appropriate paperwork. While an okapi is arguably cuter than a pronghorn, please use Pronghorn if you have a very large computation to finish.
After connecting through the VPN, access to Okapi and Caprine will be the same as from on campus. Please let me know if you have any difficulty getting the VPN set up.
Here is an example using Singularity to download a container for the Julia programming language. A similar procedure can be followed for R and many other programs used in the mathematical sciences.
First search the package repository for julia.
$ singularity search julia
Found 1 users for 'julia'
	library://julian

No collections found for 'julia'

Found 5 containers for 'julia'
	library://sylabs/examples/julia.sif
		Tags: latest
	library://sylabs/examples/julia
		Tags: latest
	library://dtrudg-utsw/demo/julia
		Tags: 20190319 latest
	library://sebastian_mc/default/julia
		Tags: julia1.1.0
	library://crown421/default/juliabase
		Tags: 1.3.1 1.4.2 latest
Choose a version, and pull the container.
$ singularity pull library://crown421/default/juliabase:latest
INFO:    Downloading library image
 137.2MiB / 137.2MiB [=======================================] 100 % 5.6 MiB/s 0s
WARNING: Container might not be trusted; run 'singularity verify juliabase_latest.sif' to show who signed it
Verify the container with
$ singularity verify juliabase_latest.sif
Container is signed by 1 key(s):

Verifying partition: FS:
69FC410C07D1F59F435D3E4D8987BB3E9255805E [REMOTE] Steffen Ridderbusch (xps-linux)
[OK] Data integrity verified

INFO:    Container verified: juliabase_latest.sif
And finally run the container by typing
$ singularity exec juliabase_latest.sif julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.2 (2020-05-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> 1+1
2

julia> exit();
A similar procedure can be used to install specific versions of R, or anything else in the repository. It is also possible to create your own containers.
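While the details are beyond the scope of this page, a container is built from a definition file. The following is a minimal sketch, assuming a hypothetical file myr.def describing an Ubuntu-based container with R installed from the Ubuntu package repository:
Bootstrap: docker
From: ubuntu:20.04

%post
    # commands run inside the container at build time
    apt-get -y update
    apt-get -y install r-base
Building a container requires root privileges, so the build is typically done on your own machine with
$ sudo singularity build myr.sif myr.def
and the resulting myr.sif file copied to the server.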
that is accessible only from on campus. Note that even though the web interface doesn't use a TLS-encrypted socket, login credentials are encrypted using RSA public-key cryptography. Please request any additional R packages from the Ubuntu repository that need to be installed to run your code.
Batch GPU jobs can also be run from the default queue by adding
#SBATCH --gres=gpu:1
to your job configuration file. Note that both GPUs can be scheduled. Generally the first job will run on the V100 and the second on the P100. If you want your program to wait until a particular GPU is available use
#SBATCH --gres=gpu:volta:1
or
#SBATCH --gres=gpu:pascal:1
to specify the V100 and P100 GPUs respectively. More information about using the batch queues is provided in the section Running Programs in the Batch Queue.
The CUDA compiler nvcc is installed and the resulting executables can be launched from the command prompt. If you simultaneously run multiple interactive GPU programs, they will run in shared mode on the V100. Set the environment variable CUDA_VISIBLE_DEVICES=1 to run an interactive GPU program on the P100.
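For example, assuming your CUDA source is in a hypothetical file mygpucalc.cu, it can be compiled at the command prompt with
$ nvcc -o mygpucalc mygpucalc.cu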
For example, supposing your program is called mygpucalc, then you could run it on the P100 using the commands
$ export CUDA_VISIBLE_DEVICES=1
$ ./mygpucalc
Similarly, if you want your interactive program to run on the V100, then type
$ export CUDA_VISIBLE_DEVICES=0 $ ./mygpucalc
A more convenient way to run a series of computations is using the Slurm batch scheduling system. CPUs, GPUs and RAM are all configured as consumable resources. CPU limits are strictly enforced using Linux cgroups, while GPU and RAM limits are advisory in nature. Thus, your program can exceed the memory limits without being killed. However, if multiple jobs use more memory than expected, the system may experience a performance problem called thrashing, in which swap usage dominates the total execution time. The default for a single job is 1 CPU core and 4GB RAM. Additional resources up to 24 cores and 256GB of memory can be specified.
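For example, a hypothetical job that needs 8 cores and 32GB of memory could be requested with a configuration file of the form
#!/bin/bash
#SBATCH -n 8
#SBATCH --mem=32GB
./a.out
Here ./a.out stands in for your own program.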
Please don't submit GPU jobs without adding
#SBATCH --gres=gpu:1
Otherwise, the GPUs will run in shared mode with a resulting loss of performance.
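For example, a minimal configuration file for a batch job that uses one GPU looks like
#!/bin/bash
#SBATCH --gres=gpu:1
./mygpucalc
where ./mygpucalc again stands in for your own program.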
$ squeue
JOBID     NAME    USER ST  TIME  MIN CPU REASON
  109 dfft.slm ejolson  R  0:02   4G   5 None
$ scancel 109
If you want to cancel all your jobs type
$ scancel -u <your_net_id>
For example, to run a single-processor job create a configuration file onecpujob.slm of the form
#!/bin/bash
./a.out
Then, in the terminal window type
$ sbatch onecpujob.slm
When there are available resources on the Okapi server, it will execute your program and place the output in a file of the form slurm-X.out where X is the job sequence number displayed when you submitted the job. To check the status of the queue type
$ sinfo
$ squeue
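To list only your own jobs, you can also type
$ squeue -u <your_net_id>
with <your_net_id> replaced by your actual UNR NetID.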
To submit a job to the slow queue add the line
#SBATCH -p slow
to your configuration file.
For example, to run a single processor job in the slow queue create a configuration file oneslowjob.slm of the form
#!/bin/bash
#SBATCH -p slow
./a.out
Then, in the terminal window type
$ sbatch oneslowjob.slm
This job and others submitted in the same way will then be scheduled using a separate queue that will not block other users from getting their work done.
Note that the slow queue is hidden and does not appear unless you explicitly specify -pslow or use the -a option. For example, to view information about jobs in the slow queue type
$ squeue -pslow
To run an R script called mycalc.R as a batch job, create a configuration file mycalc.slm of the form
#!/bin/bash
Rscript mycalc.R
Then, in the terminal window type
$ sbatch mycalc.slm
Again, if you plan to submit many R script jobs at the same time please use the slow queue by changing mycalc.slm to read
#!/bin/bash
#SBATCH -p slow
Rscript mycalc.R
Note that the slow queue is hidden. To view information about jobs in the slow queue type
$ squeue -pslow
More information about the slow queue may be found in the section Running Jobs in the Slow Queue.
After downloading demo.R, create a Slurm configuration file called demo.slm of the form
#!/bin/bash
#SBATCH --mem=16GB
Rscript demo.R
Then, in the terminal window type
$ sbatch demo.slm
The demo job will take about 2 minutes to run. You can monitor its progress by typing
$ squeue
to see whether the job has started running or for how long it has been running. Another way to check what calculations are running on the system is with an interactive program called top. Run this program as
$ top
To exit the top program type "q" on the keyboard. After the demo script has finished, you may view the results of the calculation using the command
$ less slurm-X.out
where X is the job sequence number that was assigned when you submitted the job using the sbatch command. Note that less is an interactive text-file viewer that can page up and page down while reading a file. Type "h" for help and "q" to exit the less viewer.
A batch configuration file for an R script that uses 4 CPU cores and 32GB of memory takes the form
#!/bin/bash
#SBATCH -n 4
#SBATCH --mem=32GB
Rscript mycalc.R
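Note that requesting 4 cores does not by itself parallelize the R code; the script is presumed to start its own worker processes, for example using a parallel library, matching the number of cores requested.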
Similarly, to run the script flops.m create a configuration file flops.slm of the form
#!/bin/bash
Mscript flops.m
Then, in the terminal window type
$ sbatch flops.slm
Then, in the terminal window type
$ ./mkslm
$ for i in *.slm; do sbatch $i; done
More information about this set of batch jobs was presented in the Fall 2020 graduate student seminar.
$ python3 pi_dartboard.py
Number of random points to include in each trial = 100
Number of trials to run = 10000
Doing 10000 trials of 100 points each
Executed trial 0 using random seed 23173 with result 85
Executed trial 1000 using random seed 28951 with result 81
Executed trial 2000 using random seed 22201 with result 79
Executed trial 3000 using random seed 54954 with result 74
Executed trial 4000 using random seed 65485 with result 74
Executed trial 5000 using random seed 53049 with result 78
Executed trial 6000 using random seed 10095 with result 86
Executed trial 7000 using random seed 26957 with result 80
Executed trial 8000 using random seed 48687 with result 82
Executed trial 9000 using random seed 39487 with result 81
The value of Pi is estimated to be 3.14905600000000 using 1000000 points
To run this program as a batch job the inputs must be specified in the configuration file. This can be done with a configuration file named pi_dartboard.slm of the form
#!/bin/bash
printf '100\n10000\n' | python3 pi_dartboard.py
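Here printf writes the two input values, 100 points per trial and 10000 trials, each followed by a newline, and the pipe feeds them to the Python program exactly as if they had been typed at the prompts.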
Then, in the terminal window submit the batch job by typing
$ sbatch pi_dartboard.slm
A traditional MPI parallel job using 12 ranks and 18GB of memory requires a batch configuration file mpijob.slm of the form
#!/bin/bash
#SBATCH -n 12
#SBATCH --mem=18GB
mpirun ./mympiprog
Then in the terminal window type
$ sbatch mpijob.slm
Since Slurm will execute each batch job only when resources are available, you can queue many different jobs and they will run one after the other, perhaps overnight, until all the computations are finished.
A multi-threaded MPI parallel job using 4 ranks each with 3 threads requires 12 available CPU cores and a batch configuration file of the form
#!/bin/bash
#SBATCH -n 4
#SBATCH -c 3
export OMP_NUM_THREADS=3
export CILK_NWORKERS=3
mpirun ./hybridprog
Details for submitting a multi-threaded MPI job to the Slurm batch scheduler are the same as for a traditional MPI parallel job.
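For example, assuming the configuration file above is saved as hybridjob.slm (a hypothetical name), submit it by typing
$ sbatch hybridjob.slm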
You may also obtain funding (either external or through an internal grant sponsored, for example, by the graduate school) for pay-as-you-go access to the entire cluster. This would be useful if you want to use significantly more than 32 cores for a short period of time.
Since Pronghorn does not support remote desktop but only secure shell, it may be easier to develop programs on Okapi and later move them to Pronghorn to perform the final computations. In this section we describe how to move files back and forth between the two machines as well as some modifications that need to be made to the batch submission files.
First create a mount point in your home directory on Okapi by typing
$ cd
$ mkdir pronghorn
Next mount your Pronghorn home directory by typing
$ cd
$ sshfs pronghorn.rc.unr.edu: pronghorn
At this point it should be possible to see your home directory and files from Pronghorn as if they were local files on Okapi. In particular, you can use the file explorer to drag and drop files and directories from Okapi to Pronghorn.
Before logging out, please unmount your Pronghorn home directory from Okapi. Do this with the command
$ cd
$ fusermount -u pronghorn
This will free up resources on both machines.
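If you only need to copy a few files, an alternative to mounting the remote directory is a direct copy with scp. For example, the following sketch, in which mycalc.R is the script from the earlier examples, copies a file from Okapi into your home directory on Pronghorn:
$ scp mycalc.R pronghorn.rc.unr.edu: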
Suppose you want to run an R script called mycalc.R which has already been transferred to Pronghorn. This must be done using the batch system. First, log into Pronghorn using a command such as
$ slogin pronghorn.rc.unr.edu -l <your_net_id>
with <your_net_id> replaced by your actual UNR NetID. More information about how to log into Pronghorn may be found here. Like on Okapi, create a mycalc.slm configuration file. The contents, however, are slightly different because Pronghorn deploys R through a Singularity container. For example, a batch configuration file to run R on Pronghorn might look like
#!/bin/bash
#SBATCH -p cpu-s1-ahn-0
#SBATCH -A cpu-s1-ahn-0
#SBATCH --mem=4GB
singularity exec /apps/R/r-base-3.4.3.simg R -q -f mycalc.R
Note that the above configuration file explicitly specifies that 4GB of RAM are needed to run the job. If you do not specify a memory limit, Pronghorn (unlike Okapi) assumes the job needs all available memory on the node. Such a job cannot start until the node is completely empty, and while it runs it blocks all other jobs, even jobs using only one core, until it is finished. If you specify the memory option properly, then it is possible to run 32 single-core jobs on the node at the same time.
Submitting to the batch queue is the same as on Okapi and done by typing
$ sbatch mycalc.slm
When cancelling a job it can be difficult to find your job hidden among all the other jobs on the system. To view only your jobs type
$ squeue -u <your_net_id>
with <your_net_id> replaced by your actual UNR NetID. Once you locate the job number you wish to cancel, it can be cancelled using the command "scancel X" where X is the sequence number of the job to cancel. For example, if you wish to cancel job 798644 then type
$ scancel 798644
As with Okapi, you may cancel all your jobs with
$ scancel -u <your_net_id>
Additional help on using Pronghorn is available at https://unrrc.slack.com from the UNR Research Computing team.
------------------------------------------------------------------------
Dual Intel Xeon 6126 Gold 2.60GHz/384GB               okapi.math.unr.edu

     @__@           Welcome to the UNR Mathematics and
    (oo)\           Statistics Department server!
    |_/\ \________
    \    ===)       This system based on Void Linux.
    |=-----==|\     Unauthorized use prohibited.
    || ||  || ||
------------------------------------------------------------------------