-
Submitting Single GEO600 Jobs
-
This section aims to demonstrate the GEO600 workflow in practice. We will use the
submission host to submit a single GEO600 job to a
grid resource. We will also discuss the location of output during each stage of the workflow.
-
If you are logged into the submission host, you can submit a single GEO600
job using the following command:
-
perl contact.pl -config=../etc/grid-run.conf -host=supergrid.aei.mpg.de -submit -timeout=0.00:10:00 -number-jobs=1
USER                                  Active  Pending  Execution  Suspended  cputime[h]  submissions  FLOPS[G]
------------------------------------------------------------------------------------------------------------------------------------
/O=GermanGrid/OU=AEI/CN=Robert Engel       0        0          0          2      2726.8          155       0.0
HOST                                  Active  Pending  Execution  Suspended  cputime[h]  submissions  FLOPS[G]
------------------------------------------------------------------------------------------------------------------------------------
supergrid.aei.mpg.de                       0        0          0          2      2726.8          155       0.0
------------------------------------------------------------------------------------------------------------------------------------
                                           0        0          0          2      2726.8          155       0.0
#  id    STATUS   RC  state  time[s]  shost             ghost                 ehost
------------------------------------------------------------------------------------------------------------------------------------
1  7745  Unsubmi  0   P      0        buran.aei.mpg.de  supergrid.aei.mpg.de
-
Here supergrid.aei.mpg.de must be listed in the grid run configuration
file, otherwise an error will be reported. As discussed in depth earlier, the
run section for supergrid.aei.mpg.de in the
grid run configuration file defines all relevant options used during job
submission. Arguments specified on the command line override the settings in the configuration file.
Here the timeout was reduced to 10 minutes and
the number of jobs to be submitted was fixed at one.
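-
The override rule and the timeout argument can be illustrated with a short Python sketch. It assumes that the value 0.00:10:00 follows a days.hours:minutes:seconds layout (consistent with the ten minutes mentioned above, but not confirmed here), and the function names parse_timeout and effective_settings as well as the dictionary keys are hypothetical. contact.pl itself is written in Perl and may handle this differently.
-
```python
import re


def parse_timeout(text):
    """Convert a 'D.HH:MM:SS' string (assumed layout) into seconds."""
    match = re.fullmatch(r"(\d+)\.(\d{1,2}):(\d{2}):(\d{2})", text)
    if match is None:
        raise ValueError(f"unrecognised timeout: {text!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def effective_settings(config, cli):
    """Return the configuration with command-line values taking precedence."""
    merged = dict(config)
    # Only options that were actually given on the command line override.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


# Hypothetical values mirroring the example: the configuration file asks for
# 1200 s, the command line reduces the timeout to ten minutes.
config = {"TIMEOUT": 1200, "JOBS_RUNNING_MAX": 1}
cli = {"TIMEOUT": parse_timeout("0.00:10:00"), "JOBS_RUNNING_MAX": None}
settings = effective_settings(config, cli)
print(settings)
```
-
Options that are absent from the command line (modelled as None) keep their value from the configuration file, so only the timeout changes in this example.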
-
The contact.pl script itself invokes the GRAM-WS service (command globusrun-ws -submit)
to submit a single Grid job. In order to optimise job submission for multiple jobs to the same machine, the script
internally creates a credential delegation file (using globus-credential-generate) and passes that
on as an option to globusrun-ws -submit. For globus-credential-generate to work
properly, make sure you have the shell environment variable X509_USER_PROXY set to point to your
local grid proxy file (typically something like /tmp/x509up_u1003).
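-
Before submitting, it can be useful to verify that the proxy variable is set and points to an existing file. The following Python sketch shows one way to do this. The function name find_proxy is hypothetical, and the check only covers the existence of the file, not whether the proxy is still valid.
-
```python
import os


def find_proxy(environ=None):
    """Return the path in X509_USER_PROXY, or raise if it is unusable."""
    environ = os.environ if environ is None else environ
    path = environ.get("X509_USER_PROXY")
    if not path:
        raise RuntimeError("X509_USER_PROXY is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"proxy file not found: {path}")
    return path


if __name__ == "__main__":
    try:
        print("using proxy", find_proxy())
    except RuntimeError as error:
        print("cannot submit:", error)
```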
-
The USER table in the output shown above summarizes the current state and
statistics of the jobs run by the grid user on the specified resource
prior to submitting the job. Recalling the APES model for the application
state introduced earlier, we recognize the individual states (Active, Pending,
Execution, Suspended) in columns two to five. The next two columns list the accumulated
computational time in hours and the number of jobs submitted to the resource. The last column
shows the floating point performance currently achieved by the resource, as measured by BOINC.
-
The HOST table is identical to the USER table for the simple
case demonstrated above. We will discuss more complicated scenarios in the following sections where
the two tables will differ.
-
The last table lists all jobs that have been submitted at this time. Since we only submitted one
job, the table has only one row. The columns named STATUS and RC
refer to the status and return code as reported by Globus, in contrast to the
state column, which refers to the application state in the APES
model.
-
The temporal order of events taking place on different hosts is reflected by the order of the host
columns in the last table: first the submission host (shost) used for submitting the job,
followed by the grid host (ghost) that accepted the job submission, and finally the
execution host (ehost) where the job will be executed. At this time the execution
host is not known. It will become known once the application running on the execution host has started
BOINC and sent a message home to the MySQL database.
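-
When post-processing the job table, the missing execution host has to be taken into account. The Python sketch below splits one row into named fields and fills ehost with None when it is absent. It assumes whitespace-separated fields without embedded spaces, as in the output shown above; parse_job_row is a hypothetical helper, not part of contact.pl.
-
```python
FIELDS = ["#", "id", "STATUS", "RC", "state", "time[s]", "shost", "ghost", "ehost"]


def parse_job_row(line):
    """Split one row of the job table into a dictionary keyed by column name."""
    values = line.split()
    # The last column (ehost) is empty until the job has reported home.
    if len(values) not in (len(FIELDS) - 1, len(FIELDS)):
        raise ValueError(f"unexpected number of fields: {line!r}")
    values += [None] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


pending = parse_job_row("1 7745 Unsubmi 0 P 0 buran.aei.mpg.de supergrid.aei.mpg.de")
print(pending["state"], pending["ehost"])
```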
-
Time to check how our application is progressing using the command line option
-list-running|-lr to contact.pl:
-
perl contact.pl -config=../etc/grid-run.conf -host=supergrid.aei.mpg.de -lr
# id STATUS RC state time[s] shost ghost ehost
------------------------------------------------------------------------------------------------------------------------------------
1 7745 Active 0 E 306 buran.aei.mpg.de supergrid.aei.mpg.de supergrid
-
To find out about the internals of contact.pl it is worth having a look at the detailed
log located in GEO600-devel/log/contact.log. The above command
queries the database for all jobs currently running on supergrid.aei.mpg.de which have
been submitted by the grid user. A complete record for each job includes the
job contact (EPR) assigned by Globus. This EPR is then
used to query the STATUS and the RC of the application.
-
The application running on supergrid.aei.mpg.de also sent a message to the database announcing
its new state: it has been in execution for the last 306 seconds on
the execution host supergrid.
-
When the timeout of ten minutes was reached, the application sent the signal
TERM to boinc. After waiting for boinc to exit cleanly, the working directory of the
job was archived and finally the job was deannounced in the database. After a while
no running jobs will be found:
-
perl contact.pl -config=../etc/grid-run.conf -host=supergrid.aei.mpg.de -lr
2008/03/20 12:10:48 INFO> no running jobs found!
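-
The shutdown sequence described above (send TERM, wait for a clean exit, archive the working directory) can be sketched in Python as follows. This is only an illustration under stated assumptions: the child process is a stand-in for boinc, the function name stop_and_archive and the grace period are hypothetical, the fallback to kill is an addition not described above, and the final database update is left out.
-
```python
import os
import subprocess
import sys
import tarfile
import tempfile


def stop_and_archive(proc, workdir, archive_path, grace=30):
    """Terminate proc, wait for it to exit, then archive workdir as a tar file."""
    proc.terminate()  # sends SIGTERM on POSIX systems
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # Assumption: fall back to a hard kill if the process ignores TERM.
        proc.kill()
        proc.wait()
    name = os.path.basename(os.path.normpath(str(workdir)))
    with tarfile.open(str(archive_path), "w") as tar:
        tar.add(str(workdir), arcname=name)
    # A real job would now report its new state to the database.
    return proc.returncode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = os.path.join(tmp, "7745")
        os.mkdir(workdir)
        with open(os.path.join(workdir, "stdout.txt"), "w") as handle:
            handle.write("demo\n")
        # Stand-in for the long-running client process.
        worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        code = stop_and_archive(worker, workdir, os.path.join(tmp, "7745.tar"))
        print("worker exit code:", code)
```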
-
Congratulations! You just donated ten minutes of computational time to the
search for gravitational waves. Perhaps it took more than ten minutes of your time to get
to this point, but you will soon break even when we scale up the job submission by a few
orders of magnitude. Running 10 or 10,000 jobs at a time
will only be limited by the size of your grid.
-
It is worth repeating the job submission, this time taking a look at the output produced
at various locations. GEO600 uses the popular Log::Log4perl
library for logging events. The logging behavior is very flexible and can be configured as
needed outside the source code in
GEO600-devel/main/etc/log.conf. Depending on
your Log::Log4perl configuration you may see the following events in
contact.log:
-
# the distinguished name of the grid user
2008/03/20 14:19:31 DEBUG environment::grid_proxy_info> DN = "/O=GermanGrid/OU=AEI/CN=Robert Engel"
# the settings for supergrid found in the grid run configuration file
2008/03/20 14:19:31 DEBUG main::grid_run_settings> -- supergrid.aei.mpg.de --------------------------------------------------------------------------
2008/03/20 14:19:31 DEBUG main::grid_run_settings> GEO600_HOME = ${GLOBUS_USER_HOME}/GEO600-devel
2008/03/20 14:19:31 DEBUG main::grid_run_settings> TIMEOUT = 1200s
2008/03/20 14:19:31 DEBUG main::grid_run_settings> GRAMWS_PORT = 8443
2008/03/20 14:19:31 DEBUG main::grid_run_settings> FT = Fork
2008/03/20 14:19:31 DEBUG main::grid_run_settings> QUEUE = default
2008/03/20 14:19:31 DEBUG main::grid_run_settings> JOBS_RUNNING_MAX = 1
2008/03/20 14:19:31 DEBUG main::grid_run_settings> JOBS_QUEUE_MAX = 1
2008/03/20 14:19:31 DEBUG main::grid_run_settings> JOBS_QUEUE_MIN = 0
2008/03/20 14:19:31 DEBUG main::grid_run_settings> PRESTAGE = 0
2008/03/20 14:19:31 DEBUG main::grid_run_settings> ALLDAYS =
2008/03/20 14:19:31 DEBUG main::grid_run_settings> WEEKDAYS =
2008/03/20 14:19:31 DEBUG main::grid_run_settings> WEEKENDS =
2008/03/20 14:19:31 DEBUG main::grid_run_settings> MAXWALLTIME = 42m
2008/03/20 14:19:31 DEBUG main::grid_run_settings> MAXMEMORY = 1024MB
2008/03/20 14:19:31 DEBUG main::grid_run_settings> MINMEMORY = 256MB
2008/03/20 14:19:31 DEBUG main::grid_run_settings> PREFIX = ${GLOBUS_USER_HOME}/GEO600-devel/build/boinc_5.8.16_i686-pc-linux-gnu
2008/03/20 14:19:31 DEBUG main::grid_run_settings> CHECK_ARCHIVE = 1
2008/03/20 14:19:31 DEBUG main::grid_run_settings> SUSPEND_ALL = 0
2008/03/20 14:19:31 DEBUG main::grid_run_settings> USE_BETA = 0
2008/03/20 14:19:31 DEBUG main::grid_run_settings> USE_TMP =
# the current status of jobs on supergrid
2008/03/20 14:19:31 DEBUG main::grid_run_status> -- supergrid.aei.mpg.de --------------------------------------------------------------------------
2008/03/20 14:19:31 DEBUG main::grid_run_status> status: (A=0|P=0|E=0|S=2)
# the submission strategy
2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> -- supergrid.aei.mpg.de --------------------------------------------------------------------------
2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> JOBS_QUEUE_MIN : 0
2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> JOBS_QUEUE_MAX : 1
2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> JOBS_RUNNING_MAX : 1
2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> PENDING : 0
2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> RUNNING : 0
2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> SUBMIT : 1
2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> -- supergrid.aei.mpg.de 1/1 ------------------------------------------------------
# fetching a task from the database
2008/03/20 14:19:31 DEBUG database::taskid> taskid(dn=>/O=GermanGrid/OU=AEI/CN=Robert Engel, ghost=>supergrid.aei.mpg.de)
2008/03/20 14:19:31 DEBUG database::taskid> received task 7745 from database buran.aei.mpg.de:24999
# checking the job archive of the task
2008/03/20 14:19:31 DEBUG database::archive_check> 7745: gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks/7745.tar
2008/03/20 14:19:31 DEBUG database::archive_check> process storage host = astrodata09.gac-grid.org ...
2008/03/20 14:19:31 DEBUG database::archive_check> astrodata09.gac-grid.org: /store/01/aei/GEO600/tasks/7745.tar
2008/03/20 14:19:31 DEBUG database::archive_check> gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks/7745.tar = 0
# creating the RSL, submitting the job, saving the EPR on disk and in database
2008/03/20 14:19:32 DEBUG main::submit> RSL = /home/robert/GEO600-devel/log/7745.rsl
2008/03/20 14:19:32 DEBUG main::submit> EPR = /home/robert/GEO600-devel/log/7745.epr
# checking the Globus status
2008/03/20 14:19:33 DEBUG main::globus_status> EPR = /home/robert/GEO600-devel/log/7745.epr found!
2008/03/20 14:19:33 DEBUG main::globus_status> ret_status == 0
2008/03/20 14:19:33 DEBUG main::globus_status> gstatus: Unsubmitted
# update job record in database
2008/03/20 14:19:33 DEBUG database::connect> connect to DBI:mysql:database=eah;host=buran.aei.mpg.de;port=24999 ...
2008/03/20 14:19:33 DEBUG database::connect> connected!
2008/03/20 14:19:33 DEBUG database::update> update jobstate
2008/03/20 14:19:33 DEBUG database::disconnect> disconnect from mysql eah robert@buran.aei.mpg.de:24999
2008/03/20 14:19:33 DEBUG main::submit> job submission completed
-
GEO600 first creates a job description (RSL) for the job. When the
job has been submitted to the grid resource, the GRAM-WS service returns the
job contact (EPR) to the user; it is saved on disk and in the database.
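-
The status line in the log excerpt above, status: (A=0|P=0|E=0|S=2), uses the four letters of the APES model. The following Python sketch extracts the counts from such a line. The function name parse_status is hypothetical and the regular expression only assumes the layout seen in this one log line.
-
```python
import re

# Letters of the APES model, named as in the USER and HOST tables.
APES = {"A": "Active", "P": "Pending", "E": "Execution", "S": "Suspended"}


def parse_status(line):
    """Extract the per-state job counts from a grid_run_status log line."""
    match = re.search(r"status:\s*\(([^)]*)\)", line)
    if match is None:
        raise ValueError("no status summary found")
    counts = {}
    for item in match.group(1).split("|"):
        letter, _, value = item.partition("=")
        counts[APES[letter.strip()]] = int(value)
    return counts


line = "2008/03/20 14:19:31 DEBUG main::grid_run_status> status: (A=0|P=0|E=0|S=2)"
counts = parse_status(line)
print(counts)
```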
-
For a simple display of the state of your job you can call contact.pl, providing the
task id with -id=7745 on the command line:
-
perl contact.pl -config=../etc/grid-run.conf -debug -id=7745
key value
---------------------------------------------------------------------------------------
id 7745
state E
dn /O=GermanGrid/OU=AEI/CN=Robert Engel
shost buran.aei.mpg.de
suser robert
stime Thu Mar 20 14:19:33 2008
ghost supergrid.aei.mpg.de
guser engro
ehost supergrid.aei.mpg.de
etime Thu Mar 20 14:17:09 2008
ztime
proc 1
walltime 236765
gstatus Unsubmitted
gexit 0
cputime 236765
archive gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks/7745.tar
snumber 9
cnumber 7
stdout gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.out
stderr gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.err
log gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.log
-
The debugging view contains all the information needed to be certain about the
state of a GEO600 job.
-
Let's start at the top with the submission host buran.aei.mpg.de. The grid user, known by
his distinguished name /O=GermanGrid/OU=AEI/CN=Robert Engel, used his login
robert at Thu Mar 20 14:19:33 2008 to
submit the GEO600 task with id 7745 to the grid resource
supergrid.aei.mpg.de.
-
The grid resource supergrid.aei.mpg.de accepted the job and put it into execution
using the local login engro. The execution started on the execution host
supergrid.aei.mpg.de at Thu Mar 20 14:17:09 2008 and is
utilizing one CPU core.
-
How can the job start executing before it was submitted? The explanation is the clock skew between
the two machines: the two timestamps are taken from the clocks of different hosts. It is recommended
to use the Network Time Protocol (NTP) to keep machines
in sync, but this may not always be the case.
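-
The apparent skew can be computed directly from the two timestamps in the debug view. The Python sketch below does this for the values shown above. The function name apparent_skew is hypothetical, and the format string assumes English day and month abbreviations as printed in the output.
-
```python
from datetime import datetime

# Layout of timestamps such as "Thu Mar 20 14:19:33 2008".
FORMAT = "%a %b %d %H:%M:%S %Y"


def apparent_skew(stime, etime):
    """Seconds by which the submission time lies after the execution start."""
    submitted = datetime.strptime(stime, FORMAT)
    started = datetime.strptime(etime, FORMAT)
    return (submitted - started).total_seconds()


skew = apparent_skew("Thu Mar 20 14:19:33 2008", "Thu Mar 20 14:17:09 2008")
print(f"execution appears to start {skew:.0f} s before submission")
```
-
Since a job cannot really start before it has been submitted, a positive result is a lower bound on the difference between the two clocks.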
-
On the statistical side, the number of job submissions (9) versus the number of times
the job completed successfully (7) acts as a measure of success. The task
has spent a total of 236765 seconds, or about 2.7 days, being executed.
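-
These figures can be checked with a few lines of arithmetic. The Python sketch below converts the walltime from seconds to days and computes the fraction of submissions that completed. The function name task_statistics is hypothetical.
-
```python
SECONDS_PER_DAY = 24 * 60 * 60


def task_statistics(walltime_s, submissions, completions):
    """Summarize walltime in days and the share of completed submissions."""
    ratio = completions / submissions if submissions else 0.0
    return {
        "walltime_days": walltime_s / SECONDS_PER_DAY,
        "success_ratio": ratio,
    }


# Values taken from the debug view of task 7745.
stats = task_statistics(walltime_s=236765, submissions=9, completions=7)
print(f"{stats['walltime_days']:.2f} days, {stats['success_ratio']:.0%} completed")
```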
-
The locations of stdout, stderr and
the logfile always point to where these files actually reside.
This can be used to stage the files for debugging purposes while the job is running:
-
globus-url-copy gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.log file://buran.aei.mpg.de/tmp/7745.log
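-
If the debug view is captured as text, the file locations can be picked out programmatically and turned into a staging command like the one above. The Python sketch below parses the key/value lines and builds the argument list without executing it. Both function names are hypothetical, and the parser only assumes that the key is the first word on each line.
-
```python
def parse_debug_view(text):
    """Turn the key/value lines of the debug view into a dictionary."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines, separator lines and the "key value" header.
        if not parts or parts[0].startswith("---") or line.split() == ["key", "value"]:
            continue
        record[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return record


def staging_command(record, key, destination):
    """Build the globus-url-copy argument list for one file of the job."""
    return ["globus-url-copy", record[key], destination]


sample = """key value
---------------------------------------
id 7745
dn /O=GermanGrid/OU=AEI/CN=Robert Engel
ztime
log gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.log
"""
record = parse_debug_view(sample)
command = staging_command(record, "log", "file://buran.aei.mpg.de/tmp/7745.log")
print(" ".join(command))
```
-
Values that contain spaces, such as the distinguished name, are kept intact, and keys without a value, such as ztime for a running job, map to an empty string.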
-
After the timeout has been reached the job will create an archive of its working directory
on the grid resource in GEO600-devel/tasks/7745.tar. This archive will be
staged by Globus to the specified storage resource and be deleted afterwards.
The location of stdout, stderr and the logfile previously located at the grid resource in
GEO600-devel/log will now point to their location on the output storage resource
and the task will be suspended. For details just execute:
-
perl contact.pl -config=../etc/grid-run.conf -debug -id=7745
key value
---------------------------------------------------------------------------------------
id 7745
state S
dn /O=GermanGrid/OU=AEI/CN=Robert Engel
shost buran.aei.mpg.de
suser robert
stime Thu Mar 20 14:19:33 2008
ghost supergrid.aei.mpg.de
guser engro
ehost supergrid.aei.mpg.de
etime Thu Mar 20 14:17:09 2008
ztime Thu Mar 20 14:37:15 2008
proc 1
walltime 237971
gstatus Unsubmitted
gexit 0
cputime 237971
archive gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks/7745.tar
snumber 9
cnumber 8
stdout gsiftp://buran.aei.mpg.de/store/GEO600/tasks/7745.out
stderr gsiftp://buran.aei.mpg.de/store/GEO600/tasks/7745.err
log gsiftp://buran.aei.mpg.de/store/GEO600/tasks/7745.out
-
Note how the completion number has increased by one (from 7 to 8), while the submission number shown is still 9. Also the
walltime has increased by about twenty minutes (1206 seconds), and the task is ready for resubmission at this time.
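-
Comparing the two debug views field by field makes the changes explicit. The Python sketch below lists the fields that differ between the two snapshots. Only a subset of the fields is reproduced here, and changed_fields is a hypothetical helper.
-
```python
def changed_fields(before, after):
    """Return {field: (old, new)} for every field whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


# Selected values from the two debug views of task 7745.
running = {"state": "E", "ztime": "", "walltime": 236765, "snumber": 9, "cnumber": 7}
suspended = {"state": "S", "ztime": "Thu Mar 20 14:37:15 2008",
             "walltime": 237971, "snumber": 9, "cnumber": 8}

changes = changed_fields(running, suspended)
for field, (old, new) in changes.items():
    print(f"{field}: {old!r} -> {new!r}")
print("walltime delta [s]:", suspended["walltime"] - running["walltime"])
```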