Using Condor to submit to NorduGrid










Getting and installing Condor as a standalone client:

Condor can be downloaded from http://www.cs.wisc.edu/condor/downloads/v6.8.license.html. You will need to enter your name and email address.

If you chose a tarball, unpack it and run the installation script, then follow the instructions on the screen. Most of the options can be left as they are, but you should answer yes to the question "Would you like to setup this host as a submit-only machine?".




Submitting jobs:

This section only covers how to construct the JDL file, so start by preparing your job as you normally would.

In the JDL file, first specify that you want to submit to a grid resource. This is done by using the grid universe:

universe=grid

Then specify that you want a NorduGrid resource and which one to use. Condor does not do brokering of NG resources, so you must name the resource yourself:

grid_resource = nordugrid morpheus.dcgc.dk

That done, you can use standard JDL notation to specify the executable, stdout, stderr and log:

executable = myjob
output = my.out
error = my.err
log = my.log

If you need to pass arguments to the job, use the JDL args attribute. Condor translates args to ARC 0.4 xRSL arguments, i.e. it prepends the executable name to the argument list.

args = -l (this is translated to: (arguments= myjob -l))
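
The translation can be illustrated with a few lines of Python. This is only a sketch of the ARC 0.4 convention described above (executable name first, then the arguments). The function name to_arc04_arguments is hypothetical, and the exact spacing and quoting that Condor emits in the real xRSL may differ.

```python
def to_arc04_arguments(executable, args):
    """Illustrate the ARC 0.4 convention: the executable name comes first.

    Hypothetical helper for illustration; not part of Condor or ARC.
    """
    words = [executable] + args.split()
    return "(arguments=" + " ".join(words) + ")"


# Same values as the example above: executable = myjob, args = -l
print(to_arc04_arguments("myjob", "-l"))
```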

To transfer files other than the executable, stdout and stderr, add the following attributes:

transfer_input_files = benchmark.pov
transfer_output_files = benchmark.png

And to tell Condor when to transfer the files:

WhenToTransferOutput = ON_EXIT

Because of a bug in Condor (versions up to and including 6.8.2; see Bug 1 below), you must name the executable once more in the nordugrid_rsl attribute. Here you can also add any other xRSL attributes you need to set, such as runtimeenvironment.

nordugrid_rsl = (executable=runpov.sh)(runtimeenvironment=APPS/GRAPH/POVRAY-3.6)

Finally, add queue to the JDL file:

queue

Example job:

universe = grid
executable = runpov.sh
WhenToTransferOutput = ON_EXIT
transfer_input_files = benchmark.pov
transfer_output_files = benchmark.png
output = pov.$(Cluster).out
error = pov.$(Cluster).err
log = pov.$(Cluster).log
grid_resource = nordugrid interop.dcgc.dk
nordugrid_rsl = (executable=runpov.sh)(runtimeenvironment=APPS/GRAPH/POVRAY-3.6)
queue

This job can then be submitted with:

condor_submit test.jdl
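
If you submit many similar jobs, the JDL file can be generated by a small script. The following Python sketch assembles the example job from the attributes described above and can pass the result to condor_submit. The helper names render_jdl and submit are hypothetical, and the sketch assumes condor_submit is on your PATH.

```python
import subprocess


def render_jdl(settings, rsl):
    """Return JDL text from (key, value) pairs plus extra xRSL attributes."""
    lines = [f"{key} = {value}" for key, value in settings]
    # The executable is repeated inside nordugrid_rsl, as the text requires.
    rsl_text = "".join(f"({name}={value})" for name, value in rsl)
    lines.append(f"nordugrid_rsl = {rsl_text}")
    lines.append("queue")
    return "\n".join(lines) + "\n"


def submit(path, text):
    """Write the JDL file and hand it to condor_submit."""
    with open(path, "w") as handle:
        handle.write(text)
    subprocess.run(["condor_submit", path], check=True)


jdl = render_jdl(
    [
        ("universe", "grid"),
        ("grid_resource", "nordugrid interop.dcgc.dk"),
        ("executable", "runpov.sh"),
        ("WhenToTransferOutput", "ON_EXIT"),
        ("transfer_input_files", "benchmark.pov"),
        ("transfer_output_files", "benchmark.png"),
        ("output", "pov.$(Cluster).out"),
        ("error", "pov.$(Cluster).err"),
        ("log", "pov.$(Cluster).log"),
    ],
    [("executable", "runpov.sh"),
     ("runtimeenvironment", "APPS/GRAPH/POVRAY-3.6")],
)
print(jdl)
```

Calling submit("test.jdl", jdl) would then write the file and run the same condor_submit command shown above.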




Tests:

The following jobs have been run to test Condor NG submission.

Test 1:


Test 2:


Test 3:


Test 4:


Test 5:

Test 6:




Test conclusion:

It is possible to run standard NG jobs using Condor as the submission platform, although Condor does have a few shortcomings.


Bugs and feature shortcomings:

Bug 1:

As the tests showed, standard jobs fail to submit to an ARC server. Analysis of the submitted xRSL script revealed that Condor translates the JDL statement executable = job.sh into the xRSL attribute (executables=job.sh). While this is valid ARC xRSL, it only tells ARC which files to grant execute permission; it does not tell ARC which file contains the actual job. This also makes it impossible to run a job with multiple executables.

Workaround:

Since Condor does not set the executable attribute, we can do so ourselves by adding the following line to our JDL file:

nordugrid_rsl= (executable=job.sh)

Replace job.sh with whatever you set as the JDL executable.

To run a job with multiple executables, the script named in the executable attribute should run chmod to set execute permission on the other executables.
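
As a rough illustration, the job named in the executable attribute could set the permissions and then start the other programs along the following lines. The examples in this document use shell scripts; Python is used here only as a sketch and assumes a Python interpreter is available on the execution node. The helper names and the file name helper.bin are hypothetical.

```python
import os
import stat
import subprocess


def make_executable(path):
    """Add execute permission for the owner, like `chmod u+x path`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)


def run_helpers(paths):
    """Make each transferred helper executable, then run them in order."""
    for path in paths:
        make_executable(path)
        # Run the helper from the job's working directory.
        subprocess.run([os.path.join(".", path)], check=True)
```

A job script would call, for example, run_helpers(["helper.bin"]) after listing helper.bin in transfer_input_files.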

This bug has been fixed, and the fix should be included in the next release of Condor.

Bug 2:

Using directory names in the file statements causes Condor to try to retrieve incorrect filenames from the NG CE.

Workaround:

Do not use directory names in output files. For example, output=/tmp/out.std will not work.
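
A JDL file can be checked for this problem before submission. The following Python sketch flags file statements whose value contains a directory part. The function name directory_offenders is hypothetical, and the set of attributes checked (output, error, transfer_output_files) is an assumption about which statements are affected.

```python
import os

# Assumption: these are the file statements affected by the bug.
FILE_KEYS = {"output", "error", "transfer_output_files"}


def directory_offenders(jdl_text):
    """Return (key, filename) pairs whose filename has a directory part."""
    offenders = []
    for line in jdl_text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip().lower()
        if not sep or key not in FILE_KEYS:
            continue
        # transfer_output_files may hold a comma-separated list of names.
        for name in value.replace(",", " ").split():
            if os.path.dirname(name):
                offenders.append((key, name))
    return offenders


print(directory_offenders("output=/tmp/out.std\nerror = my.err\n"))
```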

Bug/feature 3:

Arguments are handled in a fashion compatible with v. 0.4.. of ARC.

Workaround:

Remember to treat arguments in a 0.4-compatible way and, if necessary, specify (middleware<=0.4.5) in nordugrid_rsl.

Data management:

At the moment, there is no way to specify non-local files in the JDL. That is, there is no equivalent of the xRSL statement (outputfiles=(in.txt "gsiftp://..../in.txt")).

Jobs can only take local input, and output files are either retrieved to the local disk or left on the execution cluster.




Christian Ulrik Søttrup, NBI