GEO600
The Search for Gravitational Waves using the GRID
  • Content

    • Introduction
    • Single Jobs
    • Multiple Jobs
    • Automatization
    • FAQ
    • Introduction
    • This section introduces the lifetime of a GEO600 task in grid environments, guided by examples that demonstrate the process step by step. Starting with the submission of a single job to a grid workstation, we proceed to submit multiple jobs to larger grid resources. Finally, we discuss the mechanisms used for automation, followed by hints on troubleshooting.
    • At this point we assume that GEO600 has been successfully deployed on at least one grid resource and that you have adapted the configuration files to reflect your grid environment.
    • The lifetime of a GEO600 task can be divided into stages that occur in temporal order:
    • requesting a task from the database
    • preparing the task for submission
    • submitting the task as a job to a grid resource
    • executing the task on the grid resource
    • putting the task back into the database
    • The first three stages all take place on the submission host, while the last two stages take place on the execution host. The execution host may be a grid workstation or the internal compute element of a grid cluster.
    • A GEO600 task is created whenever the grid user requests a GEO600 task from the database for execution on a new grid resource. Here, new means that the grid user, known by his or her distinguished name (DN), has never submitted a job to the grid resource before. A GEO600 task is also created when all tasks belonging to the grid user and the grid resource are in execution and the grid user requests yet another task from the database for submission.
    • A GEO600 task is bound to the distinguished name of the grid user and to the grid resource it was first submitted to! Individual tasks cannot be shared between grid resources or between grid users!
    • A GEO600 task consists of several elements that manage the lifetime of the task. An internal STATE variable points to the current state of the task:
    • 	
      Active    (A) : when received from database and prepared for submission
      Pending   (P) : when successfully submitted to a grid resource but not yet in execution
      Execution (E) : when the task is in execution on a grid resource
      Suspended (S) : when the task finished and is ready for resubmission
      						
    • This four-state APES model describes all possible states of a GEO600 task in temporal order. One complete cycle through the state model corresponds to one complete job submission, as described in the five stages above. A GEO600 job is a GEO600 task completing one full cycle through the state model.
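    • The cycle through the four states can be sketched in a few lines of Python. This is an illustration only, not the GEO600 implementation: the names State, NEXT_STATE, advance and one_job are hypothetical, and the transition from Suspended back to Active on resubmission is inferred from the state descriptions above.

```python
from enum import Enum


class State(Enum):
    """The four APES states, using the one-letter codes from the listing."""
    ACTIVE = "A"     # received from database and prepared for submission
    PENDING = "P"    # submitted to a grid resource, not yet in execution
    EXECUTION = "E"  # in execution on a grid resource
    SUSPENDED = "S"  # finished and ready for resubmission


# Each state has exactly one successor (assumption: S leads back to A
# when the task is requested again for resubmission).
NEXT_STATE = {
    State.ACTIVE: State.PENDING,
    State.PENDING: State.EXECUTION,
    State.EXECUTION: State.SUSPENDED,
    State.SUSPENDED: State.ACTIVE,
}


def advance(state: State) -> State:
    """Return the state that follows `state` in the APES cycle."""
    return NEXT_STATE[state]


def one_job(start: State = State.SUSPENDED) -> list[State]:
    """Walk one complete cycle (one job) and return the visited states."""
    visited = []
    state = start
    for _ in range(len(State)):
        state = advance(state)
        visited.append(state)
    return visited


print("".join(s.value for s in one_job()))  # prints "APES"
```

    • Starting from a suspended task, one job visits the states in the order that gives the model its name.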
    • It is possible to inspect the content of the database using tools distributed with GEO600. Let's assume we are interested in a specific task known to us by its numeric ID. We could then execute:
    • robert@buran:~/GEO600-devel/main/scripts> perl contact.pl -config=../etc/grid-run.conf -id=8134 -debug
      2008/03/18 13:17:24 ERROR> job with id = 8134 belongs to /O=GermanGrid/OU=AEI/CN=Thomas Radke!
      2008/03/18 13:17:24 INFO> STATUS        = Unsubmitted
      2008/03/18 13:17:24 INFO> RC            = 0
      2008/03/18 13:17:24 INFO> connection to database established!
      
            key value
      ---------------------------------------------------------------------------------------
             id 8134
          state E
             dn /O=GermanGrid/OU=AEI/CN=Thomas Radke
          shost buran.aei.mpg.de
          suser tradke
          stime Mon Mar 17 22:24:14 2008
          ghost red.unl.edu
          guser ligo
          ehost node066.unl.edu
          etime Mon Mar 17 22:26:37 2008
          ztime
           proc 1
       walltime 403635
        gstatus Unsubmitted
          gexit 0
        cputime 403635
        archive gsiftp://red.unl.edu/opt/osg/data/LIGO/GEO600/data/tasks/8134.tar
        snumber 8
        cnumber 7
         stdout gsiftp://red.unl.edu/mnt/nfs04/opt/osg/app/ligo/GEO600-devel/log/8134.out
         stderr gsiftp://red.unl.edu/mnt/nfs04/opt/osg/app/ligo/GEO600-devel/log/8134.err
            log gsiftp://red.unl.edu/mnt/nfs04/opt/osg/app/ligo/GEO600-devel/log/8134.log
      
      						
    • In the case above we passed the specific, but otherwise arbitrary, -id=8134 on the command line. The first line informs us that the task belongs to a different grid user, known by the distinguished name /O=GermanGrid/OU=AEI/CN=Thomas Radke. We therefore could not query the Globus status of the task, which would otherwise be reported correctly in the next two lines.
    • The above table lists all variables associated with a GEO600 task. The state variable reports the task to be in execution. It was submitted from buran.aei.mpg.de at Mon Mar 17 22:24:14 2008 to red.unl.edu. A few minutes later, at Mon Mar 17 22:26:37 2008, the task was put into execution and has been running on node066.unl.edu ever since. Some other variables used for statistical purposes and the locations of the output files are also reported. We will discuss these variables in more detail later in this description.
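    • If you want to process such a listing in your own scripts, the key/value table is easy to parse. The following Python sketch is not part of GEO600. The function name parse_task_record is hypothetical, and it assumes the layout shown above: a dashed rule followed by one "key value" pair per line.

```python
def parse_task_record(text: str) -> dict[str, str]:
    """Parse the 'key value' table into a dict; empty values map to ''."""
    record = {}
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            in_table = True  # the dashed rule marks the start of the table
            continue
        if not in_table or not stripped:
            continue  # skip log lines before the table and blank lines
        key, _, value = stripped.partition(" ")
        record[key] = value.strip()
    return record


# A shortened copy of the listing shown above.
sample = """\
      key value
---------------------------------------
       id 8134
    state E
    shost buran.aei.mpg.de
    stime Mon Mar 17 22:24:14 2008
    ztime
     proc 1
"""

task = parse_task_record(sample)
print(task["state"], task["shost"], repr(task["ztime"]))
```

    • Values that contain spaces, such as the timestamps, are kept intact because only the first space separates key and value.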
    • Submitting Single GEO600 Jobs
    • This section aims to demonstrate the GEO600 workflow in practice. We will use the submission host to submit a single GEO600 job to a grid resource. We will also discuss the location of output during each stage of the workflow.
    • If you are logged into the submission host, you can submit a single GEO600 job using the following command:
    • perl contact.pl -config=../etc/grid-run.conf -host=supergrid.aei.mpg.de -submit -timeout=0.00:10:00 -number-jobs=1
      
       USER                                                   Active    Pending  Execution  Suspended  cputime[h] submissions    FLOPS[G]
      ------------------------------------------------------------------------------------------------------------------------------------
       /O=GermanGrid/OU=AEI/CN=Robert Engel                        0          0          0          2      2726.8        155         0.0
      
       HOST                                                   Active    Pending  Execution  Suspended  cputime[h] submissions    FLOPS[G]
      ------------------------------------------------------------------------------------------------------------------------------------
       supergrid.aei.mpg.de                                        0          0          0          2      2726.8        155         0.0
      ------------------------------------------------------------------------------------------------------------------------------------
                                                                   0          0          0          2      2726.8        155         0.0
      
      
            #     id  STATUS      RC   state   time[s] shost                              ghost                              ehost
      ------------------------------------------------------------------------------------------------------------------------------------
            1   7745 Unsubmi       0       P         0 buran.aei.mpg.de                   supergrid.aei.mpg.de
      						
    • Here supergrid.aei.mpg.de must be listed in the grid run configuration file, otherwise an error will be reported. As discussed in depth earlier, the run section for supergrid.aei.mpg.de in the grid run configuration file defines all relevant options used during job submission. Arguments specified on the command line override the settings in the configuration file. Here the timeout was reduced to 10 minutes and the number of jobs to be submitted was fixed at one.
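    • The precedence rule can be illustrated with a short Python sketch. It is not taken from contact.pl. The function names are hypothetical, the reading of the timeout value as days.hours:minutes:seconds is an assumption that merely agrees with the ten minutes stated above, and the setting names are borrowed from the log output shown later in this section.

```python
def parse_timeout(value: str) -> int:
    """Convert 'D.HH:MM:SS' (assumed layout) into seconds."""
    days, _, clock = value.partition(".")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds


def effective_settings(config: dict, cli: dict) -> dict:
    """Command-line values take precedence over the configuration file."""
    merged = dict(config)
    # Options not given on the command line (None) keep their config value.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


# Hypothetical run section and command-line arguments.
config_section = {"TIMEOUT": 1200, "JOBS_RUNNING_MAX": 1, "QUEUE": "default"}
cli_arguments = {"TIMEOUT": parse_timeout("0.00:10:00"), "QUEUE": None}

settings = effective_settings(config_section, cli_arguments)
print(settings["TIMEOUT"], settings["QUEUE"])  # prints "600 default"
```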
    • The contact.pl script itself invokes the GRAM-WS service (command globusrun-ws -submit) to submit a single Grid job. In order to optimise job submission for multiple jobs to the same machine, the script internally creates a credential delegation file (using globus-credential-generate) and passes that on as an option to globusrun-ws -submit. For globus-credential-generate to work properly, make sure you have the shell environment variable X509_USER_PROXY set to point to your local grid proxy file (typically something like /tmp/x509up_u1003).
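    • Before submitting, it can be useful to check the proxy setting. The following Python sketch is a convenience check only and not part of GEO600. The function name proxy_problem is hypothetical, and the sketch only tests that the variable is set and points to an existing file; it does not verify that the proxy is still valid.

```python
import os
from pathlib import Path


def proxy_problem(environ=None):
    """Return None if the proxy setting looks usable, else a description."""
    environ = os.environ if environ is None else environ
    value = environ.get("X509_USER_PROXY")
    if not value:
        return "X509_USER_PROXY is not set"
    if not Path(value).is_file():
        return f"X509_USER_PROXY points to a missing file: {value}"
    return None


problem = proxy_problem()
print(problem or "proxy file found")
```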
    • The USER table in the output shown above summarizes the state and statistics of the jobs run by the grid user on the specified resource prior to submitting the job. Recalling the APES model introduced earlier, we recognize the individual states in columns two to five. The following two columns list the accumulated computational time and the number of jobs submitted to the resource. The last column shows the floating point performance currently achieved by the resource, as measured by BOINC.
    • The HOST table is identical to the USER table for the simple case demonstrated above. We will discuss more complicated scenarios in the following sections where the two tables will differ.
    • The last table lists all jobs that have been submitted at this time. Since we submitted only one job, the table has only one row. The columns named STATUS and RC refer to the status and return code reported by Globus, in contrast to the state column, which refers to the application state in the APES model.
    • The temporal order of events taking place on different hosts is reflected by the order of the host columns in the last table: first the submission host used for submitting the job, then the grid host that accepted the job submission, and finally the execution host where the job will be executed. At this time the execution host is not yet known. It becomes known once the application running on the execution host has started BOINC and sent a message home to the MySQL database.
    • Time to check how our application is progressing using the command line option -list-running|-lr to contact.pl:
    • perl contact.pl -config=../etc/grid-run.conf -host=supergrid.aei.mpg.de -lr
            #     id  STATUS      RC   state   time[s] shost                              ghost                              ehost
      ------------------------------------------------------------------------------------------------------------------------------------
            1   7745  Active       0       E       306 buran.aei.mpg.de                   supergrid.aei.mpg.de               supergrid
      						
    • To find out about the internals of contact.pl it is worth having a look at the detailed log located in GEO600-devel/log/contact.log. The above command queries the database for all jobs currently running on supergrid.aei.mpg.de that have been submitted by the grid user. The complete record of each job includes the job contact (EPR) assigned by Globus. This EPR is then used to query the STATUS and the RC of the application.
    • The application running on supergrid.aei.mpg.de has also sent a message to the database announcing that it has been in execution for the last 306 seconds on the execution host supergrid.
    • When the timeout of ten minutes was reached, the application sent the TERM signal to boinc. After waiting for boinc to exit cleanly, the working directory of the job was archived and finally the job was deregistered from the database. After a while no running jobs will be found:
    • perl contact.pl -config=../etc/grid-run.conf -host=supergrid.aei.mpg.de -lr
      2008/03/20 12:10:48 INFO> no running jobs found!
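    • The shutdown sequence described above (timeout, TERM signal, wait for a clean exit, archive the working directory) can be sketched in Python as follows. This is an illustration under stated assumptions, not the GEO600 wrapper: the function name run_until_timeout is hypothetical, a sleeping child process stands in for boinc, the timeout is shortened to half a second, and the final database update is left out.

```python
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path


def run_until_timeout(command, timeout_s, workdir, archive_path):
    """Run `command` in `workdir`; once `timeout_s` seconds have passed,
    send SIGTERM, wait for the exit, then archive the working directory."""
    process = subprocess.Popen(command, cwd=workdir)
    try:
        process.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        process.terminate()  # sends SIGTERM on POSIX systems
        process.wait()       # block until the child has really exited
    with tarfile.open(archive_path, "w") as archive:
        archive.add(workdir, arcname=Path(workdir).name)
    return process.returncode


with tempfile.TemporaryDirectory() as scratch:
    workdir = Path(scratch) / "7745"
    workdir.mkdir()
    (workdir / "result.txt").write_text("partial result\n")
    archive_file = Path(scratch) / "7745.tar"
    # Stand-in for the long-running client: a child that sleeps 30 s.
    child = [sys.executable, "-c", "import time; time.sleep(30)"]
    code = run_until_timeout(child, 0.5, workdir, archive_file)
    with tarfile.open(archive_file) as tar:
        names = tar.getnames()
    print("exit code:", code, "archived:", names)
```

    • The archive is written outside the working directory so that it does not end up containing itself.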
      						
    • Congratulations! You just donated ten minutes of computational time to the search for gravitational waves. Perhaps it took more than ten minutes of your time to get to this point, but you will soon break even when we scale up the job submission by a few orders of magnitude. Running 10 or 10,000 jobs at a time will only be limited by the size of your grid.
    • It is worth repeating the job submission, this time taking a look at the output produced at various locations. GEO600 uses the popular Log::Log4perl library for logging events. The logging behavior is very flexible and can be configured as needed outside the source code in GEO600-devel/main/etc/log.conf. Depending on your Log::Log4perl configuration you may see the following events in contact.log:
    • # the distinguished name of the grid user
      2008/03/20 14:19:31 DEBUG environment::grid_proxy_info> DN = "/O=GermanGrid/OU=AEI/CN=Robert Engel"
      
      # the settings for supergrid found in the grid run configuration file
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> -- supergrid.aei.mpg.de --------------------------------------------------------------------------
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> GEO600_HOME      = ${GLOBUS_USER_HOME}/GEO600-devel
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> TIMEOUT          = 1200s
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> GRAMWS_PORT      = 8443
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> FT               = Fork
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> QUEUE            = default
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> JOBS_RUNNING_MAX = 1
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> JOBS_QUEUE_MAX   = 1
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> JOBS_QUEUE_MIN   = 0
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> PRESTAGE         = 0
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> ALLDAYS          = 
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> WEEKDAYS         = 
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> WEEKENDS         = 
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> MAXWALLTIME      = 42m
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> MAXMEMORY        = 1024MB
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> MINMEMORY        = 256MB
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> PREFIX           = ${GLOBUS_USER_HOME}/GEO600-devel/build/boinc_5.8.16_i686-pc-linux-gnu
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> CHECK_ARCHIVE    = 1
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> SUSPEND_ALL      = 0
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> USE_BETA         = 0
      2008/03/20 14:19:31 DEBUG main::grid_run_settings> USE_TMP          = 
      
      # the current status of jobs on supergrid
      2008/03/20 14:19:31 DEBUG main::grid_run_status> -- supergrid.aei.mpg.de --------------------------------------------------------------------------
      2008/03/20 14:19:31 DEBUG main::grid_run_status>  status: (A=0|P=0|E=0|S=2)
      
      # the submission strategy
      2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> -- supergrid.aei.mpg.de --------------------------------------------------------------------------
      2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> JOBS_QUEUE_MIN   : 0
      2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> JOBS_QUEUE_MAX   : 1
      2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> JOBS_RUNNING_MAX : 1
      2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> PENDING          : 0
      2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> RUNNING          : 0
      2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> SUBMIT           : 1
      2008/03/20 14:19:31 DEBUG main::grid_run_submit_host> -- supergrid.aei.mpg.de 1/1 ------------------------------------------------------
      
      # fetching a task from the database
      2008/03/20 14:19:31 DEBUG database::taskid> taskid(dn=>/O=GermanGrid/OU=AEI/CN=Robert Engel, ghost=>supergrid.aei.mpg.de)
      2008/03/20 14:19:31 DEBUG database::taskid> received task 7745 from database buran.aei.mpg.de:24999
      
      # checking the job archive of the task 
      2008/03/20 14:19:31 DEBUG database::archive_check> 7745: gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks/7745.tar
      2008/03/20 14:19:31 DEBUG database::archive_check> process storage host = astrodata09.gac-grid.org ...
      2008/03/20 14:19:31 DEBUG database::archive_check> astrodata09.gac-grid.org: /store/01/aei/GEO600/tasks/7745.tar
      2008/03/20 14:19:31 DEBUG database::archive_check> gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks/7745.tar = 0
      
      # creating the RSL, submitting the job, saving the EPR on disk and in database
      2008/03/20 14:19:32 DEBUG main::submit> RSL		= /home/robert/GEO600-devel/log/7745.rsl
      2008/03/20 14:19:32 DEBUG main::submit> EPR		= /home/robert/GEO600-devel/log/7745.epr
      
      # checking the Globus status
      2008/03/20 14:19:33 DEBUG main::globus_status> EPR	= /home/robert/GEO600-devel/log/7745.epr found!
      2008/03/20 14:19:33 DEBUG main::globus_status> ret_status == 0
      2008/03/20 14:19:33 DEBUG main::globus_status> gstatus: Unsubmitted
      
      # update job record in database
      2008/03/20 14:19:33 DEBUG database::connect> connect to DBI:mysql:database=eah;host=buran.aei.mpg.de;port=24999 ...
      2008/03/20 14:19:33 DEBUG database::connect> connected!
      2008/03/20 14:19:33 DEBUG database::update> update jobstate
      2008/03/20 14:19:33 DEBUG database::disconnect> disconnect from mysql eah robert@buran.aei.mpg.de:24999
      2008/03/20 14:19:33 DEBUG main::submit> job submission completed
      						
    • GEO600 first creates a job description (RSL) for the job. Once the job has been submitted to the grid resource, the local GRAM-WS service returns the job contact (EPR) to the user. The EPR is saved on disk and in the database.
    • For a simple display of the state of your job you can call contact.pl providing the task -id=7745 on the command line:
    • perl contact.pl -config=../etc/grid-run.conf -debug -id=7745
      
            key value
      ---------------------------------------------------------------------------------------
             id 7745
          state E
             dn /O=GermanGrid/OU=AEI/CN=Robert Engel
          shost buran.aei.mpg.de
          suser robert
          stime Thu Mar 20 14:19:33 2008
          ghost supergrid.aei.mpg.de
          guser engro
          ehost supergrid.aei.mpg.de
          etime Thu Mar 20 14:17:09 2008
          ztime
           proc 1
       walltime 236765
        gstatus Unsubmitted
          gexit 0
        cputime 236765
        archive gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks/7745.tar
        snumber 9
        cnumber 7
         stdout gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.out
         stderr gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.err
            log gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.log
            					
    • The debugging view contains all the information needed to be certain about the state of a GEO600 job.
    • Let's start at the top with the submission host buran.aei.mpg.de. The grid user known by his distinguished name /O=GermanGrid/OU=AEI/CN=Robert Engel used his login robert at Thu Mar 20 14:19:33 2008 to submit the GEO600 task with id 7745 to the grid resource supergrid.aei.mpg.de.
    • The grid resource supergrid.aei.mpg.de accepted the job and put it into execution using the local login engro. The execution started on the execution host supergrid.aei.mpg.de at Thu Mar 20 14:17:09 2008 and is utilizing one CPU core.
    • How can the job have started executing before it was submitted? The execution time (14:17:09) precedes the submission time (14:19:33) because the clocks of the two machines are skewed. It is recommended to use the Network Time Protocol (NTP) to keep machines in sync, but this may not always be the case.
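    • The apparent offset can be computed directly from the two timestamps in the listing. The following Python sketch is not part of GEO600, the function name apparent_offset is hypothetical, and parsing the day and month names assumes an English locale.

```python
from datetime import datetime

# Timestamp layout used in the listing, e.g. "Thu Mar 20 14:19:33 2008".
FORMAT = "%a %b %d %H:%M:%S %Y"


def apparent_offset(stime: str, etime: str) -> float:
    """Return etime - stime in seconds; a negative value means the
    execution appears to start before the submission."""
    submitted = datetime.strptime(stime, FORMAT)
    started = datetime.strptime(etime, FORMAT)
    return (started - submitted).total_seconds()


print(apparent_offset("Thu Mar 20 14:19:33 2008",
                      "Thu Mar 20 14:17:09 2008"))  # prints -144.0
```

    • Since a job can only start after it has been submitted, the clock of the execution host must be at least 144 seconds behind the clock of the submission host.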
    • On the statistical side, the number of job submissions (9) versus the number of times the job completed successfully (7) acts as a measure of success. The task has spent a total of 236765 seconds, or roughly 2.7 days, in execution.
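    • Both figures are easy to derive from the listing. A minimal Python sketch, with the hypothetical helper summarize and the assumption that walltime is given in seconds:

```python
from datetime import timedelta


def summarize(walltime_s: int, snumber: int, cnumber: int) -> str:
    """Format the accumulated walltime and the completion ratio of a task."""
    elapsed = timedelta(seconds=walltime_s)
    ratio = cnumber / snumber if snumber else 0.0
    return (f"walltime {elapsed} ({walltime_s / 86400:.2f} days), "
            f"completed {cnumber}/{snumber} ({ratio:.0%})")


print(summarize(236765, 9, 7))
# prints "walltime 2 days, 17:46:05 (2.74 days), completed 7/9 (78%)"
```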
    • The locations of stdout, stderr and the logfile always point to the actual locations of these files. This can be used to stage the files for debugging purposes while the job is running:
    • globus-url-copy gsiftp://supergrid.aei.mpg.de/home/engro/GEO600-devel/log/7745.log file://buran.aei.mpg.de/tmp/7745.log
      						
    • After the timeout has been reached the job will create an archive of its working directory on the grid resource in GEO600-devel/tasks/7745.tar. This archive will be staged by Globus to the specified storage resource and be deleted afterwards. The location of stdout, stderr and the logfile previously located at the grid resource in GEO600-devel/log will now point to their location on the output storage resource and the task will be suspended. For details just execute:
    • perl contact.pl -config=../etc/grid-run.conf -debug -id=7745
      
            key value
      ---------------------------------------------------------------------------------------
             id 7745
          state S
             dn /O=GermanGrid/OU=AEI/CN=Robert Engel
          shost buran.aei.mpg.de
          suser robert
          stime Thu Mar 20 14:19:33 2008
          ghost supergrid.aei.mpg.de
          guser engro
          ehost supergrid.aei.mpg.de
          etime Thu Mar 20 14:17:09 2008
          ztime Thu Mar 20 14:37:15 2008
           proc 1
       walltime 237971
        gstatus Unsubmitted
          gexit 0
        cputime 237971
        archive gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks/7745.tar
        snumber 9
        cnumber 8
         stdout gsiftp://buran.aei.mpg.de/store/GEO600/tasks/7745.out
         stderr gsiftp://buran.aei.mpg.de/store/GEO600/tasks/7745.err
            log gsiftp://buran.aei.mpg.de/store/GEO600/tasks/7745.out
      						
    • Note how the completion number has increased from 7 to 8 now that the job has completed, while the submission number already stood at 9 while the job was running. The walltime has also increased by twenty minutes (1206 seconds), and the task is now ready for resubmission.
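    • The differences between the two listings can be checked with a few lines of Python. This sketch is not part of GEO600. The helper names are hypothetical, the values are copied from the two listings above, and parsing the timestamps assumes an English locale.

```python
from datetime import datetime

FORMAT = "%a %b %d %H:%M:%S %Y"

# Selected fields of task 7745 while running (E) and after suspension (S).
running = {"state": "E", "walltime": 236765, "snumber": 9, "cnumber": 7}
suspended = {"state": "S", "walltime": 237971, "snumber": 9, "cnumber": 8}


def changed_fields(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every field whose value differs."""
    return {key: (before[key], after[key])
            for key in before if before[key] != after[key]}


def run_length(etime: str, ztime: str) -> int:
    """Seconds between start (etime) and end (ztime) of the execution."""
    delta = datetime.strptime(ztime, FORMAT) - datetime.strptime(etime, FORMAT)
    return int(delta.total_seconds())


print(changed_fields(running, suspended))
print(suspended["walltime"] - running["walltime"])   # prints 1206
print(run_length("Thu Mar 20 14:17:09 2008",
                 "Thu Mar 20 14:37:15 2008"))        # prints 1206
```

    • The increase in walltime matches the interval between etime and ztime, both of which are taken from the clock of the execution host.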
    • Frequently Asked Questions
    • Robert Engel, Max-Planck Institut for Gravitational Physics