Frequently Asked Questions
Running Jobs
If you need to run many instances of a serial code (as in a typical parameter sweep study for instance), we highly recommend using Eden. Eden is a simple script-based master-worker framework for running multiple serial jobs within a single PBS job.
Python's multiprocessing module parallelizes within a single node, much like threading, so use the following in your Darter batch script to launch the Python script on a single node:

module load python
aprun -d 16 python script.py

This makes all 16 cores on the node available to the Python script. Please note: whether or not the cores are fully utilized is up to the programming of the script.
If the python script is parallelized using MPI (e.g. with mpi4py which is
available on Darter), then it should be run just like any other MPI program using the following syntax in your batch script:
module load python
aprun -n numproc python parallel_script.py

If there is no MPI in the python script, use the following syntax in your batch script:

module load python
aprun python serial.py
Listed below are the limits on the compute nodes of NICS-operated resources. Here are the results from some basic tests that were run on our resources to check the real maximum values for allocatable memory and open files:
System | MaxMem | MaxOpenFiles |
---|---|---|
Darter | 31.1 GB | 1015 |
Nautilus | 32.1 GB | 48 |
Keeneland | 32.1 GB | >2048 |
Request more time for interactive jobs by providing a specific number of hours/minutes/seconds using
qsub -I -l walltime=hh:mm:ss
Note that 24 hours is the maximum that can be requested. If you need an extension, send an email to help@nics.utk.edu along with any job ids that need to run for more than 24 hours.
They should be stored at $TMPDIR/mic#, and they need to be copied to either the home or Lustre filesystem before the submitted job completes.
If a user wants to use:
#PBS -l size=192 ### Assuming you want to use 24 MPI tasks
aprun -n 24 -N 2 -S 1
Here's what the above aprun command means. You are asking for 24 MPI tasks, 2 MPI tasks per node, and 1 MPI task per socket.
At 1 task per socket, it is 2 tasks per node, so it will use 12 nodes (24/2), and the size would be 12*16 = 192. It is best to start with the aprun command to figure out how many nodes will be used, then multiply by 16 to get the value of size. Now, if you want to leave one socket empty on each node (use every other socket), then you would use aprun -n 24 -N 1, which tells it to put one MPI process per node.
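The bookkeeping above can be sketched in shell arithmetic (the task counts here are just the example's values):

```shell
# Given a total MPI task count and tasks per node, compute how many
# nodes aprun will use and the corresponding PBS size on Darter.
ntasks=24          # aprun -n
tasks_per_node=2   # aprun -N
cores_per_node=16  # Darter nodes have 16 cores

nodes=$(( (ntasks + tasks_per_node - 1) / tasks_per_node ))
size=$(( nodes * cores_per_node ))
echo "nodes=$nodes size=$size"
```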
Each Darter node has 16 cores and 32 Gbytes of memory: about 2 GB per core if all cores are used. Sometimes it is necessary to leave some cores idle to make more memory available per core. For example, if you use 8 cores per node, each core has access to about 4 Gbytes of memory.
#PBS -l walltime=01:00:00,size=1500

aprun -n 1500 -S 4 $PBS_O_WORKDIR/a.out
The above aprun command won't work. The nodes on Darter have 2 sockets and each socket has 8 cores. That makes a total of 16 cores per node. Your size should be a multiple of 16. To make a long story short, use the following formula to get close to a multiple of 16 with what you want to do.
cores per socket on Darter * number of MPI tasks / MPI processes per socket = 8 * 1500 / 4 = 3000
The next number that is a multiple of 16 is 3008. Change your size = to 3008 in your pbs option and you should be fine.
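The round-up to a multiple of 16 can also be done in the script itself; a sketch using the example's numbers:

```shell
# Apply the formula (cores per socket * MPI tasks / tasks per socket),
# then round up to the next multiple of 16, the cores per Darter node.
computed=$(( 8 * 1500 / 4 ))
size=$(( (computed + 15) / 16 * 16 ))
echo "size=$size"
```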
Array jobs are not supported on NICS systems. The submission filter will reject jobs which make use of job arrays (i.e. #PBS -t or qsub -t). These jobs (if submitted) will not run and should be deleted.
The shell initiation line in PBS scripts is not guaranteed to be used to determine the interpreting shell. The default behavior is to use the user's default login shell or the value of the PBS option -S (i.e. #PBS -S /bin/bash or qsub -S /bin/bash). If you are using a shell for a PBS script which is different than your default shell, please use the PBS -S option.
When trying to run some java code (a statistical modeling code called maxent) for the Nimbios project on Nautilus, we were seeing that one instance of the code would spawn ~1200 threads. I thought initially that maxent was the culprit--until I ran a simple 'hello world' java program and it too spawned 1200 threads.
Turns out that the java virtual machine spawns garbage collecting threads in accordance with the number of processors that it detects. It also turns out that you can have a say in this process with the following flags:
-XX:ParallelGCThreads=2 -XX:+UseParallelGC
Adding these flags when running the maxent code brought the thread count down to around 16, which seems to be around the baseline of the number of startup threads needed by the jvm. I think any java code run on Nautilus should benefit from using these flags. I haven't done any specific tests on how the value of ParallelGCThreads affects performance. At least with the maxent code, I did notice faster startup times for the jvm.
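As a sketch, the flags go on the java command line ahead of the application (the jar name here is just a placeholder):

```shell
# Cap the JVM at 2 parallel GC threads instead of one per detected
# core, which matters on large shared-memory nodes like Nautilus.
java -XX:ParallelGCThreads=2 -XX:+UseParallelGC -jar maxent.jar
```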
The MPI_IN_PLACE option causes communication on an intra-communicator to happen in place, rather than being copied into separate buffers. This reduces the required number of copy operations (the in-place optimization applies within a node, not between nodes). In order to use this option with MPI_Alltoall, you need to disable Cray's optimization for that call:
export MPICH_COLL_OPT_OFF=mpi_alltoall
The -V option tells the batch system to remember all of your environment variables. For example, if I want to set OMP_NUM_THREADS to 4 and then submit the job, I need this flag so that OMP_NUM_THREADS is still set in the batch script. You can use it as a flag, such as qsub -V ..., or in your batch script like:

#PBS -V

While this can be convenient, it is best practice not to use -V. Why?
- It makes jobs more self-contained. If the script itself must set all the environment variables it needs, the script can be shared between people without confusion. Additionally, when debugging an issue, it is clear from looking at the script what variables are set.
- This option, when used often, can create additional load for the scheduler, and in rare cases cause a crash (particularly if used in jobs which resubmit themselves).
Using -V occasionally is not a problem, and it may be reasonable for something like an interactive job, but it is best not to include it in every job script as a matter of habit.
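For example, instead of qsub -V, a self-contained Darter batch script can set what it needs directly (a sketch; the size, walltime, and executable are placeholders):

```shell
#PBS -l size=16,walltime=00:30:00
#PBS -S /bin/bash

# Set the environment inside the script rather than inheriting it
# from the submitting shell with -V.
export OMP_NUM_THREADS=4

cd $PBS_O_WORKDIR
aprun -n 4 -d 4 ./a.out
```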
Dynamic shared libraries are supported on Darter, but using them may have a performance impact. See Dynamic Linking on Darter.
Cray's MPICH has a number of settings (changed using environment variables) that affect what algorithms are used, buffer space, etc. For a list of these variables and their default settings, you can set the following prior to calling aprun:
export MPICH_ENV_DISPLAY=1
This causes rank 0 to display all MPICH environment variables and their current settings at MPI initialization time. If two or more nodes are used, MPICH/GNI environment settings are also included in the listing.
For more information about some of these settings, please see the man page for intro_mpi
. You can also find that information on Cray's documentation page (under "Introduction to MPI man pages").
When you connect to a system, your environment is set up with default limits for stack size, core file size, number of open files, etc. The system sets both soft and hard limits for these parameters. The soft limit is the actual limit imposed by the system; for example, the soft stack limit is the maximum stack size the system will allow a process to use. Users cannot increase their hard limits. Hard limits can be decreased, but it's not recommended.
While it is rarely necessary to change shell limits on Darter or Nautilus, there may be times when limits must be changed to get your program to run properly. This is where the hard limit becomes important. The system allows users to increase their soft limits, but it uses the hard limit as the upper bound, so users cannot raise a soft limit above the corresponding hard limit.
The command to modify limits varies by shell. The C shell (csh) and its derivatives (such as tcsh) use the limit command to modify limits. The Bourne shell (sh) and its derivatives (such as ksh and bash) use the ulimit command. The syntax for these commands varies slightly and is shown below. More detailed information can be found in the man page for the shell you are using.
Limit commands
Operation | sh/ksh/bash command | csh/tcsh command |
---|---|---|
View soft limits | ulimit -S -a | limit |
View hard limits | ulimit -H -a | limit -h |
Set stack size to 128 MB | ulimit -S -s 131072 | limit stacksize 128m |
With any shell, you can always reset both soft and hard limits to their default values by logging out and back in.
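For example, a bash job script can raise its own soft stack limit (up to the hard limit) before launching the program; this sketch just prints the change:

```shell
# Raise the soft stack limit to 128 MB (131072 kB) for this shell
# and any processes it starts; the hard limit is the upper bound.
soft_before=$(ulimit -S -s)
ulimit -S -s 131072
soft_after=$(ulimit -S -s)
echo "soft stack limit: $soft_before kB -> $soft_after kB"
```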
On the Cray XT, both RLIMIT_CORE and RLIMIT_CPU limits are always forwarded to the compute nodes. If you wish to set any other user resource limits, you must set the APRUN_XFER_LIMITS environment variable to 1, along with the new limits, within the job script before the aprun call:
export APRUN_XFER_LIMITS=1 #or setenv APRUN_XFER_LIMITS 1
Default user resource limits
The default user resource limits in the compute nodes are:
Resource | Limit |
---|---|
time (seconds) | unlimited |
file (blocks) | unlimited |
data (kbytes) | unlimited |
stack (kbytes) | unlimited |
coredump (blocks) | 0 |
memory (kbytes) | unlimited |
locked memory (kbytes) | 512 |
process | unlimited |
nofiles | 1024 |
vmemory (kbytes) | unlimited |
locks | unlimited |
No, users cannot log in directly to a compute node. However, by submitting an interactive batch job, users can get access to an aprun node, from which they can execute aprun commands to run on the compute nodes. For more information on how to run interactive batch jobs, please view the information found at Interactive Batch Jobs.
This message always occurs when running C-shell style job scripts. It is not really an error message; it is a friendly reminder that this is a remote batch job which cannot be acted upon interactively (such as with ^C to interrupt or ^Z to suspend).
If you submit your job, it only executes for an instant, gets terminated without any error messages and the output files are empty, it may be that you have a customized login script that changes your shell interpreter at login time by explicitly executing another shell. For example, sometimes users whose default shell is Bash will change it to the C-Shell by doing the following in their .bashrc file:
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# User specific aliases and functions
exec csh
If you do want to change your default shell, use the NICS User Portal . To log into the portal, you need to use your RSA SecurID.
Linux uses "virtual memory" for each process, which creates the illusion of a contiguous memory block when a process starts, even if physical memory is fragmented or residing on a hard disk. When a process calls malloc, it is given a pointer to an address in this virtual memory. When the virtual memory is first used, it is then mapped to physical memory.
Optimistic memory allocation means that Linux is willing to allocate more virtual memory than there is physical memory, based on the assumption that a program may not need to use all the memory it asks for. When a node has used all its physical memory and there is another call to malloc, instead of returning a null pointer, the program will receive a seemingly good pointer to virtual memory. When that memory is used, the kernel will try to map the virtual memory to physical memory, fail, and enter an "Out of Memory" condition. To recover, the kernel will kill one (or more) processes. On Darter, this will almost certainly be your executable, and you should see "OOM killer terminated this process."

For more information, see O'Reilly's writeup or man malloc under "Bugs".
Sometimes some of the basic error messages (such as reading past the EOF) are suppressed because a shell interpreter is not specified in the PBS script. Make sure that the first line of the PBS script contains a shell interpreter: #!/bin/bash, for example.
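A minimal script with the interpreter specified on the first line might look like this (a sketch; the size and executable are placeholders):

```shell
#!/bin/bash
#PBS -l size=16,walltime=00:10:00

cd $PBS_O_WORKDIR
aprun -n 16 ./a.out
```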
There are a couple of easy ways to find out what nodes are assigned to your batch job. The easiest is to use the checkjob command. Part of the output will return a list of nodes like the following:
Allocated Nodes: [84:1][85:1][86:1][87:1][88:1][89:1][90:1][91:1]
This method returns a logical numbering of the nodes. A physical numbering of the nodes, as well as the pid layout, can be obtained by setting the PMI_DEBUG variable to 1.
> setenv PMI_DEBUG 1
> aprun -n4 ./a.out
Detected aprun CNOS interface
MPI rank order: Using default aprun rank ordering
rank 0 is on nid00015 pid 76; originally was on nid00015 pid 76
rank 1 is on nid00015 pid 77; originally was on nid00015 pid 77
rank 2 is on nid00016 pid 69; originally was on nid00016 pid 69
rank 3 is on nid00016 pid 70; originally was on nid00016 pid 70
From within your code, you can call PMI_CNOS_Get_nid to get the physical number for each process.
#include <stdio.h>
#include "mpi.h"

int main ( int argc, char *argv[] )
{
    int rank, nproc, nid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    PMI_CNOS_Get_nid(rank, &nid);
    printf(" Rank: %10d NID: %10d Total: %10d \n", rank, nid, nproc);
    MPI_Finalize();
    return 0;
}
The output with four cores would be as follows:
aprun -n4 ./hello-mpi.x
 Rank:          1 NID:         15 Total:          4
 Rank:          0 NID:         15 Total:          4
 Rank:          2 NID:         16 Total:          4
 Rank:          3 NID:         16 Total:          4
Application 13390 resources: utime 0, stime 0
aprun can be used to run Unix commands on the compute nodes that display the node names, as shown below.
> aprun -n4 /bin/hostname
nid00015
nid00015
nid00016
nid00016
Or
> aprun -n4 /bin/cat /proc/cray_xt/nid
15
15
16
16
Debugging
To find out how to use the performance tools on Darter, enter the following commands on the login node:
module load perftools
man intro_perftools
While using tools is a preferable method of debugging to simply using print statements, sometimes the latter option is the only method to find the bug. In this case, the most effective way to isolate the error in your code is through the method of bisection, which is an iterative process for tracing the program manually.
Step 1: In the main routine of your code, comment out the second half of the code (or approximately the second half).
Step 2: Compile and run the code. Did it crash as before?
Step 3A: If yes, return to step one and comment out the second half of the part of the main routine that ran successfully. Repeat until you have narrowed it down to the line/routine causing issues, which may include following this same tack within a subroutine.
Step 3B: If no, then swap out the half which was commented and try compiling and running again. Then, go to Step 3A.
Additionally, the use of print statements to see variable values can give insight into some earlier piece of code that might have been run through just fine but is creating an errant, unacceptable value that causes a later routine to crash.
Finally, if there is any way to duplicate the error in serial, this makes the print statements more consistent (as far as being ordered chronologically, since they are not all coming from different processors' buffers).
Now, while this might sound like a lot of work, and it is non-trivial, here is a tip to make your burden lighter: Have three sessions open on Darter simultaneously.
1. One session to edit the code.
2. Another session to compile the code.
3. Another session in which you submit for an Interactive Job so that you do not have to submit your job every time and wait in the queue.
Sometimes a code will work fine in many cases and circumstances but there will be a bug which only rears its head when a certain perfect storm of case and job size occurs. This causes the code to die in a strange spot and it is not obvious exactly why or where. In cases like this, Cray's ATP (Abnormal Termination Processing) can likely help!
Simply do
module load atp
and re-compile your code without optimization (use the "-g" flag for debugging) using any backend compiler (PrgEnv) with the Cray wrappers (ftn, cc, or CC). This simultaneously helps ensure that the error was not brought on by compiler optimization mistakes and creates the instrumented executable.
Now, you are ready to use ATP to generate a backtrace to the line where the code died.
Add the following to your PBS script to make sure that the ATP module is loaded into your aprun environment and that the ATP environment variable is set to collect information:
module load atp
export ATP_ENABLED=1
If a backtrace file appears in your directory upon run termination, search through it to find the line your code died on. If the code completes successfully, lower the compiler optimization level so that the compiler does not optimize your code into incorrect results.
Also, you may go back and add "-traceback", an Intel compiler flag, to the compilation, which may assist in producing a traceback file as well. This only works when "PrgEnv-intel" is loaded, but you can pass it to the Cray wrappers "cc", "CC", or "ftn" and they will pass it to the backend Intel compiler.
If you are still unable to find the problem, stepping through with a debugger like DDT or Totalview may be helpful.
In order to determine memory usage for a given process on a compute node, one would normally simply issue the command "top" and look at the memory usage of the process in question. However, this cannot be done on a Darter compute node, since compute nodes are not accessible to the user. In addition, OOM (Out of Memory) errors can occur even when a problem has been discretized finely enough, because memory leaks in the code gradually consume memory until, in the worst case, the program crashes.
This crashing behavior means that the user needs to instrument their code and fix the memory leaks. To help, the Scientific Computing staff at NICS have created a simple method that can be added to a program in spots where memory usage is suspect. This can assist with finding potential memory leaks, as well as diagnosing situations where memory is growing in a manner not commensurate with what the user expected. While tools like Valgrind and Electric Fence exist, they often slow code execution to the point where the memory issue cannot be found within the prescribed wall time, making the run a waste of SUs and user time.
The following is a C function, "GetMemoryUsage", which can be added into the source tree and compiled along with the rest of the user code. This function returns a program's memory usage on the compute node at the point in the program at which it is called. The idea is that one can insert GetMemoryUsage calls at different places in the source, recompile, and run to observe memory leaks. To test whether a function or subroutine has a memory leak, one can call GetMemoryUsage at the beginning and end of the function and check if there is a noticeable difference in memory usage. If there is, that means there is some memory leak in that function, unless it is allocating memory of its own. If the latter is true, then the user should be able to note that the growth was by the exact amount allocated; otherwise a memory leak still exists. Regardless, the user should be able to see how much memory is allocated by a given function and determine whether that is commensurate with what they were expecting. Through repeated insertion of the GetMemoryUsage function call, one can narrow down which part of a large code is contributing to the memory leak.
The sample program "memusage_test.c" shows how the function can be used, and running it should help the user become familiar with how it works before using it in a larger code base. In the sample program, a memory leak is created intentionally, so GetMemoryUsage will keep returning higher and higher memory usage as the program continues. A sample makefile is also provided for convenience.
GetMemoryUsage.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEMORY_INFO_FILE "/proc/self/status"
#define BUFFER_SIZE 1024

void GetMemoryUsage ( double *HWM, double *RSS )
{
    FILE *fp;
    size_t n;
    char buffer [ BUFFER_SIZE ], scratch [ BUFFER_SIZE ];
    char *loc;

    fp = fopen ( MEMORY_INFO_FILE, "r" );
    while ( fgets ( buffer, BUFFER_SIZE, fp ) ) {
        if ( strncmp ( buffer, "VmHWM:", 6 ) == 0 ) {
            loc = strchr ( &buffer [ 7 ], 'k' );
            n = loc - &buffer [ 7 ];
            strncpy ( scratch, &buffer [ 7 ], n );
            scratch [ n ] = '\0';  /* terminate before strtod */
            *HWM = strtod ( scratch, NULL );
        }
        if ( strncmp ( buffer, "VmRSS:", 6 ) == 0 ) {
            loc = strchr ( &buffer [ 7 ], 'k' );
            n = loc - &buffer [ 7 ];
            strncpy ( scratch, &buffer [ 7 ], n );
            scratch [ n ] = '\0';
            *RSS = strtod ( scratch, NULL );
        }
    }
    fclose ( fp );
}

memusage_test.c
#include <stdio.h>
#include <stdlib.h>

void GetMemoryUsage ( double *HWM, double *RSS );

int main ( int argc, char **argv )
{
    int i, j;
    double HWM, RSS;
    double *Array;

    GetMemoryUsage ( &HWM, &RSS );
    printf ( "Initial Usage: \nHWM : %f kB \nRSS : %f kB\n\n", HWM, RSS );

    // Create leaky code: the allocation is never freed before the
    // pointer is overwritten
    for ( j = 1; j < 100; j++ ) {
        Array = malloc ( sizeof ( double ) * 100000 );
        for ( i = 0; i < 100000; i++ )
            Array [ i ] = 0.0;
        Array = NULL;
        GetMemoryUsage ( &HWM, &RSS );
        printf ( "Usage at j = %d \nHWM : %f kB \nRSS : %f kB\n\n", j, HWM, RSS );
    }
    return 0;
}

Makefile
all:
	cc -c GetMemoryUsage.c
	cc -o memusage_test.exe memusage_test.c GetMemoryUsage.o

clean:
	rm -f *.o *.exe
In order to enable the creation of a coredump file when a program crashes on a compute node of a Cray system like Darter, the following command should be added to the job script before the aprun call:
Bourne shell | ulimit -c unlimited |
---|---|
C shell | limit coredumpsize unlimited |
For example, a Bourne-style job script will look like:

#PBS MY_PROJECT
#PBS -l size=16,walltime=00:05:00
#PBS -S /bin/bash

cd $PBS_O_WORKDIR
ulimit -c unlimited
aprun -n 4 ./helloWorld
In the previous example, if the program 'helloWorld' crashes (for example, due to a segmentation fault), a coredump file named 'core' will be created in the same directory where the program is located.
Note: Using the compiler option '-g' at compile time will add debugging information to the executable that will make it easier to figure out the location in the source code where the program crashed.
Compiling
If you want to see what would happen while compiling your code, without any files being created or overwritten, use the -dryrun option flag with the Cray wrappers. This option shows the commands built by the driver but does not actually compile.
For example, "cc -dryrun hello.C -o hello.exe".
Unlike Darter's compute nodes, its login nodes have modest hardware specs: a single quad-core processor with 8 gigabytes of memory. However, each of the Darter login nodes may have up to 30 user login sessions active at any given time. As a result, a single user who runs a very processor- or memory-intensive task on a Darter login node can affect the work of several dozen other users. As a result, NICS recommends that concurrent makes ("make -j N") on Darter be done with an N of 2 or less.
- Replace all compiler commands (mpicc, mpif90, icc, ifort, pgCC, pgf90, etc.) with the following: cc (C), CC (C++), or ftn (Fortran).
- Remove all references to MPI libraries and environment variables related to third-party libraries within the makefile.
- Any references to the BLAS, LAPACK, BLACS, and ScaLAPACK libraries should be removed from your makefiles. The system will automatically link with the most highly optimized versions of these libraries. (For a complete list of libraries, enter: man libsci)
- References to MKL can often be removed because their function is replaced by libsci.
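For example, a link line that explicitly named an MPI compiler and math libraries on another cluster (the "before" line is hypothetical) reduces to just the wrapper on Darter:

```shell
# Before, on a generic Linux cluster (hypothetical):
#   mpif90 -o prog prog.f90 -llapack -lblas
# After, with the Cray wrapper, which links in MPI and libsci itself:
ftn -o prog prog.f90
```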
Before you compile your code, load any relevant modules for third-party libraries. For example:
module load cray-hdf5-parallel
The documentation will tell you how to use environment variables in your makefile. In the hdf5 example, this is documented in HDF5.
cc -o hdf5example.x hdf5example.c
There are three advantages to using the module with the environment variable instead of the pathname:
- If you change versions of hdf5, you only need to load a different module. The makefile does not have to be modified.
- If you change to a different compiler and then reload the hdf5 module, the system will load a version of hdf5 that is compiled with the other compiler.
- Libraries loaded via module are automatically included and linked to the program without the need for additional compile and linking flags.
For a list of libraries and other software available for Darter see Darter Software. Also see Compiling on Darter Example 3.
For Darter, consult the “Cray online documentation” (http://docs.cray.com).
For C, search for the Cray “C and C++ Reference Manual” and for Fortran, consult the “Cray Fortran Compiler Commands and Directives Reference Manual”.
This error message indicates that the node has run out of memory. This could be the result of a bug in the code, or of the memory requirements for the given input. Note that due to optimistic memory allocation, you probably will not get a null pointer even if you are out of memory; the program will be killed at the point the memory is actually used.
One quick solution might be to run with only four MPI processes per socket so each process gets a larger share of the memory on the node:

aprun -n numproc -S 4 ./a.out

(Similarly, -S 2 would place only two MPI processes per socket.) The best solution may be to identify the memory requirements in the code and make any necessary changes there, in terms of memory parameters, domain decomposition, etc.
#include <iostream> is the Standard C++ way to include header files. The 'iostream' is an identifier that maps to the file iostream.h. In older C++ versions you had to specify the file name of the header file, hence #include <iostream.h>. Older compilers may not recognize the modern method, but newer compilers will accept both methods even though the old method is obsolete. Likewise, fstream.h became fstream, vector.h became vector, string.h became string, etc.

So although the <iostream.h> library was deprecated for several years, many C++ users still use it in new code instead of using the newer, standard-compliant <iostream> library. What are the differences between the two? First, the .h notation of the standard header files was deprecated more than 5 years ago; using deprecated features in new code is never a good idea. Second, in terms of functionality, <iostream> contains a set of templatized I/O classes which support both narrow and wide characters, whereas the <iostream.h> classes are confined to char exclusively. Third, the C++ standard specification of iostream's interface was changed in many subtle aspects, so the interfaces and implementation of <iostream> differ from those of <iostream.h>. Finally, <iostream> components are declared in the std namespace, whereas <iostream.h> components are declared in the global scope. Because of these substantial differences, you cannot mix the two libraries in one program. As a rule, use <iostream> in new code and stick to <iostream.h> in legacy code that is incompatible with the new <iostream> library.

The following error message:
#error "SEEK_SET is #defined but must not be for the C++ binding of MPI"
is the result of a name conflict between stdio.h and the MPI C++ binding. Users should place the mpi.h include before the stdio.h and iostream includes.
Users may also see the following error messages as a result of including stdio or iostream before mpi:
#error "SEEK_CUR is #defined but must not be for the C++ binding of MPI" #error "SEEK_END is #defined but must not be for the C++ binding of MPI"
When profiling with TAU, you may get this message regardless of the order. In this case, you can add -DMPICH_IGNORE_CXX_SEEK to the compile line to remove the error (this fix should work generally).
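For example, with the Cray C++ wrapper (the source file name here is a placeholder):

```shell
# Define MPICH_IGNORE_CXX_SEEK so the MPI C++ binding does not
# complain about SEEK_SET/SEEK_CUR/SEEK_END from stdio.h.
CC -DMPICH_IGNORE_CXX_SEEK prog.cpp -o prog.exe
```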
Under the 1.5 programming environments used under Catamount, ftn linked in libC.a. Under the 2.x programming environments used under CNL, ftn does not link in libC.a. Fortran codes that link in libraries containing C++ objects will need to add -lC to the link line.
Access
To enable X11 forwarding on NICS resources that are a part of XSEDE's Single Sign On Hub, do the following:
Mac/Linux terminals:
- Type "ssh -Y username@login.xsede.org", where 'username' is your XSEDE username, then hit Return/Enter.
- Enter your XSEDE portal password and hit Return/Enter.
To enable X11 forwarding using Putty, you must first download and install an X server (Xming). For more detailed installation and configuration instructions, refer to the Xming Installation Instructions. Once installed, do the following:
Go to the start menu and click on Xlaunch. (This brings up a window with different configuration options).
- Multiple windows (default)
- Display number 0 (default)
- Start "no client" (default)
- Select Clipboard (default)
- Remote Font Server option (if any), leave blank
- Additional parameters for Xming, leave blank
- Save configuration (if you like).
Open Putty window:
- Click on SSH (on the left panel of Putty).
- Click on X11.
- Click Enable X11 forwarding.
- In the box to the right of X display location type: 0.0
- Go back up to the top of the left menu (Category) and click on session.
- Enter the host name (darter, nautilus, keeneland, etc.).
- Click "Open".
Once you have logged into the Single Sign-On (SSO) Hub via Mac, Linux or Putty, you will see some text that resembles the following on your screen:
Welcome to the XSEDE Single Sign-On (SSO) Hub!

Your storage on this machine is limited to 100MB. You may connect from here to any XSEDE resource on which you have an account. To view a list of sites where you actually have an account, visit: https://portal.xsede.org/group/xup/accounts.

Here are the login commands for common XSEDE resources:

Blacklight: gsissh blacklight.psc.xsede.org
Darter: gsissh gsissh.darter.nics.xsede.org
Gordon Compute Cluster: gsissh gordon.sdsc.xsede.org
Gordon ION: gsissh gordon.sdsc.xsede.org
Keeneland: gsissh gsissh.keeneland.gatech.xsede.org
Lonestar: gsissh lonestar.tacc.xsede.org
Mason: gsissh mason.iu.xsede.org
Maverick: gsissh -p 2222 maverick.tacc.xsede.org
Nautilus: gsissh gsissh.nautilus.nics.xsede.org
Open Science Grid: gsissh submit-1.osg.xsede.org
Stampede: gsissh -p 2222 stampede.tacc.xsede.org
Trestles: gsissh trestles.sdsc.xsede.org

Contact help@xsede.org for any assistance that may be needed.
When logging into a NICS system (Darter, Nautilus, or Keeneland), add the "-Y" or "-X" option to enable X11 forwarding. After successfully logging into a system, test that X11 is forwarding correctly by typing the xclock or xeyes command. You should see either a graphical display of an analog clock or non-creepy cartoon eyes in a small window on your computer's display.
NICS provides different methods for logging into their resources. To log in via the XSEDE portal, please click on the link below for step-by-step instructions.
To share read access to home directories and top-level scratch directories, each member of the group should enter the following commands:
chmod 750 $HOME
chmod 750 /lustre/medusa/$USER
To provide write access to the members of the group, use "chmod 770". This should be used on a subdirectory and not the top-level directory.
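The resulting permissions can be verified with stat or ls -ld; for example, on a hypothetical shared subdirectory:

```shell
# Create a subdirectory that the owner and group can read and write,
# with no access for others, then print its octal mode.
mkdir -p shared_data
chmod 770 shared_data
mode=$(stat -c %a shared_data)
echo "mode=$mode"
```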
Apple handles the local DISPLAY variable necessary for X11 connections differently than Linux/Unix; therefore, older versions of gsissh had trouble parsing the variable, yielding this error:

$ xlogo
/tmp/launch-xZ1piK/: unknown host. (no address associated with name)
X connection to localhost:14.0 broken (explicit kill or server shutdown).
Note that if the remote DISPLAY variable had been broken, it would have given the error "Can't open display". Updating to the most recent release of gsissh should resolve this issue.
Your SSH client may not be set up to use the keyboard-interactive authentication method. You will need to use a client that supports the keyboard-interactive authentication method to connect to the NICS computers. Different SSH clients will have different ways of setting the preferred authentication methods, so you may need to contact your system administrator to get your client set correctly.
If your ssh client seems to be set up correctly, it may be that the resource you are trying to connect to is unavailable. You may want to check our announcements.
For instructions on activating and using your RSA SecurID, see the connecting page.
If backspace produces ^? instead of what you expect, fix it at the command prompt by typing stty erase followed by a press of your backspace key. You can put this in your .profile (ksh) or .login (csh) file so it is set automatically upon login. This stty command should be executed only for interactive shells, not batch.
Another tactic is to change the configuration of your SSH client. For instance, if you are using PuTTY SSH from a Windows system, the default backspace key is Control-? (127). This can be changed by going to the Keyboard category and changing backspace to Control-H.
HPSS
You can use the split command to split an archive into multiple files. Please follow the steps and examples provided below.
"cd" into your /lustre/medusa/ directory where your data is temporarily stored and run the following command. Make sure the file striping (https://www.nics.tennessee.edu/computing-resources/file-systems/lustre-s...) in the directory is appropriate for what is being done.
NOTE: The syntax is very important. Please pay close attention to the "." at the end of the filename (i.e. myarchive.tar.).
If you want to combine multiple files into an archive, then split them into 1 GB files, do the following:
$ tar -cvf - file1 file2 file3 | split --bytes=1G --suffix-length=4 --numeric-suffixes - myarchive.tar.
When the files need to be recombined and untarred:
$ cat myarchive.tar.* | tar xvf -
If you already have a single tar file and you want to split it into 10 GB files, do the following:
$ split --bytes=10G --suffix-length=4 --numeric-suffixes lustre.scratch.Cray_Tests.tar lustre.scratch.Cray_Tests.tar.split.
If you have a directory you want to tar up, then split into 10MB files (in this case an "applications" directory) you would do the following:
$ tar -cvf - applications | split --bytes=10M --suffix-length=4 --numeric-suffixes - applications.tar.
The size of the split files is determined by the --bytes option.
When the command finishes executing (which could be a while), you will end up with files applications.tar.0000, applications.tar.0001, and so on. See example output below.
$ ls -l applications.tar* -rw-r--r-- 1 you 10485760 Jul 24 13:49 applications.tar.0000 -rw-r--r-- 1 you 10219520 Jul 24 13:49 applications.tar.0001
After splitting your archives, type hsi put *.tar.* to start uploading the files to HPSS. This could also take a while, so feel free to use the nohup command with this.
When you are ready to retrieve the files for use, type hsi get *.tar.*. After all the files have been transferred to your /lustre/medusa/$USER area, if you want to combine the split files and extract their contents run the following command:
$ cat applications.tar.* | tar xvf -
Wait a bit and all the files should join and one file called applications.tar will be extracted.
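The whole split-and-recombine cycle can be rehearsed locally with small sizes before committing to a multi-gigabyte transfer. The sketch below uses 1 MB pieces and throwaway files; all names and sizes are illustrative:

```shell
# Create some sample data to archive.
mkdir -p demo demo/restore
dd if=/dev/zero of=demo/file1 bs=1k count=1500 2>/dev/null   # ~1.5 MB
dd if=/dev/zero of=demo/file2 bs=1k count=700  2>/dev/null   # ~0.7 MB

# tar to stdout, split into 1 MB pieces named demo/myarchive.tar.0000, ...
tar -cf - -C demo file1 file2 | \
    split --bytes=1M --suffix-length=4 --numeric-suffixes - demo/myarchive.tar.

# Recombine and extract into a separate directory to verify the round trip.
cat demo/myarchive.tar.* | tar -xf - -C demo/restore
cmp demo/file1 demo/restore/file1 && echo "round-trip OK"
```

The trailing "." on the split prefix is what produces the numbered suffixes the document warns about (myarchive.tar.0000, myarchive.tar.0001, and so on).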
To check your usage on HPSS, enter the "hsi" command. Then use the HPSS "du" command in your top-level directory, or the "du -s" option for a summary of the entire directory only.
For example:
O:[/home/username]: du
   2137614    4 /home/std00/
 305920049    1 /home/std00/directory1/
  86648223    1 /home/std00/directory2/
 211942420    1 /home/std00/directory3/
 156677661   47 /home/std00/directory4/
6455083743    1 /home/std00/directory5/
         0    0 /home/std00/
-----------------------
7218409709 total 512-byte blocks, 55 Files (3,695,825,770,765 bytes)
Unfortunately, no. There are no backups of HPSS. Even if a file is written with "copies=2", the overwrite will affect both files (a recovery might technically be possible, but not without significant system interruption).
The ~ is appended if the user has "autobackup=on" in their .hsirc file. Otherwise, the file is simply overwritten. Another option is to use "hsi cput" instead of "hsi put". Using cput will cause hsi to give a warning message if the file exists. The file that the user is attempting to store won't be written to HPSS, but the old one won't be overwritten. (The user also needs to pay careful attention to the output from hsi so that they'll notice the file wasn't stored.)
When storing files with similar names, it is best to append a date (and time if necessary) in 4-digit year, 2-digit month, 2-digit day, 2-digit hour, 2-digit minute form to the filename. This provides a unique name but also causes the files to be automatically sorted by ls based on the date for which they contain information (which might not always be the date/time they were written). An example might be file.tar.201212032250 for a date at 10:50 PM.
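A quick way to generate such a suffix in a script, using the date format described above:

```shell
# Build a sortable YYYYMMDDHHMM timestamp so "ls" orders archives by date.
stamp=$(date +%Y%m%d%H%M)
echo "file.tar.${stamp}"
```

The fixed-width numeric suffix is what makes plain lexicographic sorting coincide with chronological order.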
Your file transfer has caused a Lustre storage server (OST) to become full, resulting in an error like:
pthread_cond_timedwait() return error 22, errno=0 OUT OF SPACE condition detected while writing local file
This usually happens because the stripe count is too small (often 1). To solve this issue, remove the partially transferred file and change the stripe count of the directory before transferring the file. To change the stripe count of the directory, first cd
to that directory. Second, type the following command:
lfs setstripe . -c 8
where 8 is the new stripe count, meaning that any new files in that directory will be striped across 8 OSTs. The larger the stripe count, the more OSTs the file will be striped across. Typically, a stripe count that results in the file using less than 100 GB per OST will work.
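Following the 100 GB-per-OST guideline, a suitable stripe count can be estimated by ceiling-dividing the file size by the per-OST target. The arithmetic sketch below uses illustrative numbers; on the real system you would pass the result to lfs setstripe:

```shell
# Pick a stripe count so no single OST holds more than ~100 GB of the file.
file_gb=750                       # hypothetical file size in GB
per_ost_gb=100                    # target maximum per OST
count=$(( (file_gb + per_ost_gb - 1) / per_ost_gb ))   # ceiling division
echo "suggested stripe count: $count"
```

For a 750 GB file this suggests 8 OSTs, comfortably under the 90-OST maximum on Medusa.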
To find out what groups you are a member of on HPSS use the groups command.
K:[/home/username]: groups
K:HPSS Group List:
1045: nsf008
1928: nsf008q4s
This shows the user is a member of groups nsf008 and nsf008q4s.
If other members of your team are listed in the same group you can simply log into HPSS using HSI and change the group and permissions to share the files or directories.
For example, if both you and other members are all in nsf008q4s you will simply need to do a chgrp.
K:[/home/yourusername]: chgrp nsf008q4s filename
Then you will need to do a chmod to make the file group readable.
K:[/home/yourusername]: chmod 750 filename
The other members of the group should then be able to access your files on HPSS.
If you are unsure which HPSS group corresponds to the NICS project, or the other members of your team are not in the same group, submit a ticket to help@xsede.org and request that they be added to the group on HPSS. Please reference this FAQ in your request.
HTAR provides the “-Hverify=option[,option...]” command line option, which causes HTAR to first create the archive file normally, and then go back and check its work by performing a series of checks on the archive file. You choose the types of checks by specifying one or more comma-separated options. The options can be individual items, the keyword “all”, or a numeric level of 0, 1, or 2. Each numeric level includes all of the checks for lower levels and adds additional checks. The verification options are:
all | Enables all possible verification options except “paranoid” |
info | Reads and verifies the tar-format headers that precede each member file in the archive |
crc | Reads each member file and recalculates the Cyclic Redundancy Checksum (CRC), and verifies that it matches the value that is stored in the index file. |
compare | This option directs HTAR to compare each member file in the archive with the original local file. |
paranoid | This option is only meaningful if “-Hrmlocal” is specified, which causes HTAR to remove any local files or symbolic links that have been successfully copied to the archive file. |
If “paranoid” is specified, then HTAR makes one last check before removing local files or symlinks to verify that:
a. For files, the modification time has not changed since the member file was copied into the archive
b. The object type has not changed, for example, if the original object was a file, it has not been deleted and recreated as a symlink or directory, etc.
It is also possible to specify a verification option such as “all”, or a numeric level, such as 0, 1 or 2, and then selectively disable one or more options. In practice, this is rarely, if ever, useful, but the following options are provided:
0 | Same as “info” |
1 | Same as “info,crc” |
2 | Same as “info,crc,compare” |
nocompare | Disables comparison of member files with their original local files |
nocrc | Disables CRC checking |
noparanoid | Disables checking of modification time and object type changes |
htar -cvf TEST_VERIFY.TAR /lustre/medusa/$USER -Hcrc -Hverify=2
htar -Hcrc -tvf TEST_VERIFY.TAR
In the example above, (1) the archive file is created (-c) with verification level 2, including CRC generation and checking. The verbose option (-v) causes HTAR to display information about each file as it is added during the create phase and then verified during the verification phase. (2) The archive file is then listed (-t) using the "-Hcrc" option to display the CRC value for each member file.
Use hsi ls to show the tar file in HPSS:
> hsi ls -l file.tar
...
-rw------- 1 username username 12800 Oct 2 2008 file.tar
Use "htar" to list the contents of the tar file:
> htar -tvf file.tar
HTAR: drwxr-xr-x username/nicsstaff 0 2008-10-02 10:47 dir2/
HTAR: -rw-r--r-- username/nicsstaff 1492 2008-10-02 10:47 dir2/data.pbs
HTAR: -rw-r--r-- username/nicsstaff 1924 2008-10-02 10:47 dir2/mpi.pbs
Use "htar" to extract a single file (the name must match what is listed by the command above):
> htar -xvf file.tar dir2/data.pbs
HTAR: -rw-r--r-- username/nicsstaff 1492 2008-10-02 10:47 dir2/data.pbs
To retrieve a single directory from HPSS, use the -R option. For example:
>hsi
>get -R dir1
Administrators may disable users for archiving too many small files at a time. Archiving many small files introduces a lot of overhead, and this archiving system is not designed to handle them. Please use htar to tar together your files. Documentation can be found here.
We should contact you if this happens, but if you are concerned that your access to HPSS has been disabled, contact us at help@xsede.org. We can re-enable your HPSS access provided that it is used correctly.
One easy way to increase file size on HPSS is to use htar, which for the most part works the same as the regular tar. We would prefer that you perform htar on ~10 GB chunks. After you confirm that you will be using htar from now on, we will proceed to restore your access to HPSS. Our system staff would also like you to remove all of your archived small files from HPSS and archive them again using htar.
There is nothing that physically prevents a script from creating multiple simultaneous connections to HPSS. However, the HPSS system administrator recommends that you create no more than 1 or 2 connections at a time, since every additional instance degrades the performance of the overall system.
Because HSI is a third-party package, clients may be available for your system; however, NICS currently supports access to HPSS only through HSI clients on Darter and Nautilus.
If you log into Darter or Nautilus using your passcode from your OTP token, you can run HSI without entering your passcode each time. You can also run batch scripts that use HSI in the "hpss" queue. If you logged in using GSI authentication, you will be prompted for your passcode each time you use HSI.
HPSS performance is greatly improved when the transfer size is between 8 GB and 256 GB. For that reason, users with large numbers of relatively small files should combine those files into one or a few 8 GB to 256 GB files and then transfer the larger files. The files can be combined with tar on the HPC system, or they can be created on the fly with a command similar to tar cvf - some_dir | hsi put - : somedir.tar. This command will tar all files in the some_dir subdirectory into a file named somedir.tar on HPSS. HPSS also supports the htar command.
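The shape of that pipeline can be sanity-checked on any Linux machine by substituting a local file for the hsi put stage (hsi itself is only available on the HPC systems; the directory and file names below are illustrative):

```shell
# Same pipeline shape as "tar cvf - some_dir | hsi put - : somedir.tar",
# but writing to a local file instead of HPSS for illustration.
mkdir -p some_dir && echo "hello" > some_dir/data.txt
tar cf - some_dir > somedir.tar      # stand-in for "| hsi put - : somedir.tar"
tar tf somedir.tar                   # lists the archived contents
```

Because tar writes to stdout ("-"), the archive never has to exist on the Lustre filesystem; it streams straight into HPSS.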
The HSI utility allows automatic authentication and provides a user-friendly command line and interactive interface to HPSS.
Users may access HPSS from any NICS high-performance computing (HPC) system with the Hierarchical Storage Interface (HSI) utility. An OTP token is required upon entry. Access to HPSS is enabled by typing the command hsi in your Linux environment. To exit, simply type quit.
Software Environment
If the Intel compilers and programming environment are still desired, you need only execute:
module swap intel-mpi $otherMPIModule
However, if you wish to completely remove the Intel programming environment in order to use another compiler, then you must remove the mpi module first:
module unload intel-mpi
Then, you can unload the compilers, which will automatically unload the Programming Environment (PE-intel):
module unload intel-compilers
Some sites recommend using the .modulerc file to set your default modules. Do not do so on NICS systems: the .modulerc file is read every time module is called. This causes issues with some of the Cray software on Darter and with the global default module list, and can lead to unexpected results (if you unload a module in the .modulerc file, it will be re-loaded the next time you use the module command). Instead, set your default environment in your .bashrc file (or its analogue). It is best to send the output (stderr in particular) to a log or /dev/null so that .bashrc does not print anything, which may cause errors.
For information on modules, see the modules page.
Different operating systems use different methods of indicating the end of a line in a text file. UNIX uses only a new line, whereas Windows uses a carriage return and a line feed. If you have transferred text files from your PC to a UNIX machine, you may need to remove the carriage-return characters. (The line-feed character in Windows is the same as the new-line character under UNIX, so it doesn’t need to be changed.) Some systems provide a dos2unix command that can perform the translation. However, it can also be done with a simple perl command. In the following example, win.txt is the file transferred from your PC, and unix.txt is the new file in UNIX text format:
perl -p -e 's/\r$//' win.txt > unix.txt
If vi appears to hang, but other commands (ls, cat, etc.) work normally, try renaming the .viminfo file:
mv ~/.viminfo ~/.viminfo.bak
This file saves the state of vim, but can sometimes become corrupted due to incompatibilities between different versions of vim.
Users may change their default shell in the NICS User Portal. To log into the portal, you need to use your RSA SecurID.
If you haven't already, please check out the other Darter resource pages on compiling, file systems, batch jobs, open issues, parallel I/O tips, the CrayPAT overview, and other reports and presentations related to Darter.
Another good resource (without Darter-specific information) is the documentation that Cray provides at CrayDocs.
Lustre
The default stripe count on the Lustre Medusa filesystem is 2. Lustre Medusa has 90 OSTs (Object Storage Targets), so the maximum possible stripe count is 90. You can list the OSTs with:
lfs osts | grep medusa
In the event of a Lustre slowdown, there are many things to consider, as Lustre has many working parts and is shared by all users on the system. NICS continually monitors Lustre's performance and seeks to improve researchers' data communications. If you notice that your code's I/O performance or the Lustre filesystem is slower than usual, please answer the following questions to the best of your knowledge and email your answers to the XSEDE Help Desk.
- When did you first notice the slowdown? How long did it last?
- Which login node were you on?
- Can you estimate the magnitude of the slowdown? (ex - "It took 2 min instead of 3 secs", "batch job exceeded walltime limit of 10 hours, but normally finishes in 8 hours")
- What were you doing? Interactive command (like "ls")? Batch job?
- For interactive commands:
- Which host were you using?
- Did you see the same behavior on other hosts?
- Can you provide the exact command that was run and the directory in which it was run?
- For batch jobs:
- Can you supply the job IDs for jobs that were affected?
- Can you provide any details about the IO pattern for your job?
Yes! A basic ls only has to contact the metadata server (MDS), not the object storage servers (OSSes), where the bottleneck often occurs. In general, ls is aliased to give additional information, which requires the OSSes. You can bypass this by using /bin/ls. When there are many files in the same directory and you don't need the output sorted, /bin/ls -U is even faster.
You can also use the Lustre utility lfs to look for files. For example, the syntax to emulate a regular ls in any directory is:
lfs find -D 0 *
For convenience, you may want to add an alias definition to your login config files. For example, Bash users can add the following line to their ~/.bashrc to create an alias called lls:
alias lls="/bin/ls -U"
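The unsorted listing can be tried anywhere, although the speed benefit only shows up on Lustre. A small local illustration (directory and file names are arbitrary):

```shell
# /bin/ls -U lists entries without sorting, skipping the per-file metadata
# lookups that an aliased "ls" triggers on Lustre.
mkdir -p lsdemo && touch lsdemo/b lsdemo/a lsdemo/c
/bin/ls -U lsdemo      # unsorted: order reflects on-disk directory layout
/bin/ls lsdemo         # sorted: a b c
```

Both commands list the same entries; only the ordering (and, on Lustre, the cost) differs.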
A user can change the striping settings for a file or directory in Lustre using the lfs command. The usage is
lfs setstripe -s <size> -i <index> -c <count> <file|directory>
where
size - the number of bytes on each OST (0 indicates the default of 1 MB), specified with k, m, or g to indicate units of KB, MB, or GB, respectively.
index - the OST index of the first stripe (-1 indicates the default)
count - the number of OSTs to stripe over (0 indicates the default of 4; -1 indicates all OSTs [limit of 160]).
NOTE: If you change the settings for existing files, the file will get the new settings only if it is recreated.
To change the settings for an existing directory, you will need to rename the directory, create a new directory with the proper settings, and then copy (not move) the files to the new directory to inherit the new settings.
If your application is the type in which each separate process writes to its own file, then we believe that the best option is to not use striping. This can be set by using this command:
> lfs setstripe -c 1 testdirectory
Then we see that
> lfs find -v testdirectory
OBDS:
0: ost1_UUID ACTIVE
--snip--
testdirectory/
default stripe_count: 1 stripe_size: 0 stripe_offset: -1
This shows we have a stripe count of 1 (no striping), the stripe size is set to 0 (which means use the default), and the stripe offset is set to -1 (which means to round-robin the files across the OSTs).
NOTE: You should always use -1 for stripe_offset
.
The stripe count and stripe size are something you can tweak for performance. If your application writes very large files, then we believe that the best option is to stripe across all or a subset of the OSTs on the file system. Striping across all OSTs can be set by using this command:
> lfs setstripe -c -1 <directory>
Caution: Not striping large files may cause a write error if the file's size is larger than the space on a single OST. Each OST has a finite size which is smaller than the total Lustre area of all OSTs.
A file's striping is inherited from its parent directory. The lfs getstripe command can be used to determine the striping for a file, or the default striping for a directory. Note that each file and directory can have its own striping pattern, which means a user can set striping patterns for his own files and/or directories. The default stripe count for a user may be 1 or 4; you can determine it by running lfs getstripe /lustre/medusa/$USER.
The following command will also give you striping information for a directory or file:
lfs find -v <directory|file>
The Lustre file system is made up of an underlying set of file systems called Object Storage Targets (OST's), which are essentially a set of parallel IO servers. A file is said to be striped when read and write operations access multiple OST's concurrently. File striping is a way to increase IO performance since writing or reading from multiple OST's simultaneously increases the available IO bandwidth.
Striping will likely have little impact for the following codes:
- Serial IO, where a single processor or node performs all of the IO for an application.
- Multiple nodes perform IO but access files at different times.
- Multiple nodes perform IO simultaneously to different files that are small.
Lustre allows users to set file striping at the file or directory level. As mentioned above, striping will not improve IO performance for all files. For example, in a parallel application, if each processor writes its own file then file striping will not provide any benefit. Each file will already be placed in its own OST and the application will be using OST's concurrently. File striping, in this case, could lead to a performance decrease due to contention between the processors as they try to write (or read) pieces of their files spread across multiple OST's.
For MPI applications with parallel IO, multiple processors accessing multiple OST's can provide large IO bandwidths. Using all the available OST's will provide maximum performance.
There are a few disadvantages to striping. Interactive commands such as ls -l will be slower for striped files. Additionally, striped files are more likely to suffer data loss from a hardware failure, since the file is spread across multiple OST's.
Please see also: Scratch Space (Lustre) and I/O and Lustre Tips.
Runtime Errors
Use micssh instead of ssh; the necessary SSH keys are provided through the micssh script.
Use micmpiexec instead of mpiexec; the necessary SSH keys are provided through the micmpiexec script.
Use
micmpiexec -n 1 ./program
instead of
micmpiexec -n 1 program
Your Fortran program may be writing a large file with a stripe count of 1, resulting in an error like:
forrtl: No space left on device, forrtl: severe (38): error during write, unit 12, file /lustre/medusa/$USER/...
Move the partially transferred file elsewhere or delete it. Then cd to the directory where the partially transferred file was, and issue the following command to change the striping of the directory:
lfs setstripe . -c 8
Accounts
Users can use the "showusage" command to see how many SUs have been used.
$ showusage
Project        Resource               StartDate   EndDate     Allocation   Remaining    Usage
-------------+-------------------------+----------+----------+------------+------------+------------
TG-STA110018S  darter.nics.teragrid   09-21-2013  09-21-2014  300000.00    271140.22    28859.78
When the C shell (or one of its derivatives, such as tcsh) is starting up and encounters an error in one of its initialization files, it stops processing its initialization files. So any aliases, environment settings, etc., that occur after the line that caused the error will not be processed. For help in troubleshooting the startup files, contact the User Assistance Center.
This message usually means that you are at or near your home directory quota and that some part of the login process was trying to write there. This is often caused when the modules utility is loaded because it needs to write files to your home directory. You will need to reduce the usage in your home directory to log in successfully.
You may also notice that after getting this message, some commands cannot be found. This is due to the way C shell handles errors.
You can use the showusage utility to view all projects of which you are a member.
Users can view their allocation and usage on allocated systems using the showusage utility. showusage returns year-to-date usage and allocation for the calling user’s allocated project(s). Usage is calculated from the first day of the fiscal year through midnight of the day before the request.
You can also check charges for individual jobs using the glsjob utility. The sum of jobs within a project should equal the showusage total, within rounding error.
File Transfer
Examples of this error are:
File transfer server could not be started or it exited unexpectedly. Exit value 0 was returned. Most likely the sftp-server is not in the path of the user on the server-side.
or
Received message too long 1500476704
These errors are usually caused by commands in a shell run-control file (.cshrc, .profile, .bashrc, etc.) that produce output to the terminal. This output interferes with the communication between the SSH daemon and the SFTP-server subsystem. Examples of such commands are date and echo. If you use the mail command to check for mail, it can also cause the error.
You can check to see if this is likely the problem. If you are unable to SFTP to a machine, try to connect via SSH. If you are able to SSH, and you receive output to your terminal other than the standard login banner (for example, “You have mail”), then you need to check your run-control files for commands that might be producing the output.
To solve this problem, you should place any commands that will produce output in a conditional statement that is executed only if the shell is interactive. For C shell users, a sample test to put in your .cshrc file would be
if ($?prompt) then
    date
endif
The equivalent command for your .profile file (ksh/bash) would be
if [[ -n $PS1 ]]; then
    date
fi
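You can confirm that the guard keeps non-interactive shells silent with a quick local test (the script name is arbitrary):

```shell
# A guarded command prints nothing when the shell is non-interactive
# (PS1 unset), so sftp's protocol stream stays clean.
cat > guard_demo.sh <<'EOF'
if [[ -n $PS1 ]]; then
    date
fi
echo "script ran"
EOF
env -u PS1 bash guard_demo.sh    # prints only "script ran"
```

SFTP runs your run-control files in just such a non-interactive shell, which is why unguarded output breaks the transfer.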
The SSH-based SCP and SFTP utilities can be used to transfer files to and from NICS systems.
For larger files, the multistreaming transfer utility BBCP may be used (not available on Darter or Beacon). The BBCP utility is capable of breaking up your transfer into multiple simultaneously transferring streams, thereby transferring data faster than single-streaming utilities such as SCP and SFTP.
For more information on data transfers, see the remote data section of the data management page.
Globus is unable to find the correct certificate to authenticate: most likely, you have a ~/.globus/certificates directory that is overriding the system defaults. If this directory exists, rename it and try again. Within the TeraGrid, these certificates are managed for you, so you should not need a certificates directory.
If you do need it for regular transfers to non-TeraGrid sites, you can generally get the certificate from the /etc/grid-security/certificates directory. The name of the file is given by the error message ("Untrusted self-signed certificate in chain with hash <hash>"):
cp /etc/grid-security/certificates/<hash>.0 ~/.globus/certificates/
These certificates may be changed without notice, so you will periodically have to remove and replace expired certificates.
NFS
It depends on where the files were and how recently they were created. Scratch directories (/lustre/medusa/$USER) are not backed up at all, so any files deleted from those directories cannot be recovered. Home directories are different; please contact NICS support if you have inadvertently deleted a file in your home area.