GEO600
The Search for Gravitational Waves using the GRID
    • Storage
    • At this point we assume that you have successfully completed the deployment of GEO600 on a grid resource. All of the following configuration steps must be completed on the grid resource that will be used for GEO600 job submissions.
    • Storage is an essential component of GEO600 running in grid environments. Each GEO600 task analyzes a certain amount of detector data while running on a grid resource. The application stores this data in its runtime directory together with frequent checkpoints saving the state of the application.
    • After each run of the application on a grid resource the runtime directory will be archived and transferred to a grid storage location where it will occupy approx. 100MB. Upon resubmission of the application the archive will be staged in to the grid resource from the storage resource and the data analysis will continue from the last checkpoint.
    • Each GEO600 task running on a grid resource will also produce output in the form of stdout, stderr and a log file, which will be transferred to a central storage location for convenient inspection.
    • This section aims to describe the necessary configuration of storage resources used by GEO600.
    • Remote Storage Locations
    • In a general grid scenario, output produced by an application should be transferred to a grid storage location after the application has exited. In fact, many grid resources are computational resources which do not provide long-term storage.
    • It will therefore be necessary to configure a remote storage location for GEO600 archives which Globus can access using gridftp. The storage configuration file is located in GEO600-devel/main/etc/storage.conf.
    • A storage location is specified by the keyword storage followed by the fully qualified domain name of the storage resource. It is necessary to define the full URI pointing to the storage LOCATION which must support gsiftp:
    • storage astrodata09.gac-grid.org {
      	LOCATION     = gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks
      }
      						
    • To quickly check the state of the resource and whether you have sufficient rights to access the storage location, execute the following in GEO600-devel/main/scripts:
    • robert@buran:~/GEO600-devel/main/scripts> perl contact.pl -storage -host=astrodata09.gac-grid.org
      
       HOST                                        load[5min]    permissions    usage[G]  access
      ------------------------------------------------------------------------------------------------------
       astrodata09.gac-grid.org                          2.35             OK         5.7  public
      						
    • Notice the column named access, which says that the storage location is available to the public. This does not mean that all grid users may access the storage location, but that all grid resources may use it to stage out data. Use the keyword ACCESS to tie a specific grid resource running GEO600 jobs to a specific storage location:
    • storage gridgk01.racf.bnl.gov {
      	LOCATION     = gsiftp://gridgk01.racf.bnl.gov/~/data/tasks
      	ACCESS       = gridgk01.racf.bnl.gov
      }
      						
    • To determine the state of the storage resource and to resolve ~, a GSISSH call to the resource is issued. If GSISSH is not available on the storage resource, you can use GRAMWS instead:
    • storage gridgk01.racf.bnl.gov {
      	LOCATION    = gsiftp://gridgk01.racf.bnl.gov/~/data/tasks
      	ACCESS      = gridgk01.racf.bnl.gov
      	GT4         = gridgk01.racf.bnl.gov
      	GRAMWS_PORT = 9443
      }
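    • The block syntax used in the examples above (a keyword, a host name, and KEY = VALUE lines between braces) can be read with a few lines of Python. This is only a minimal sketch for illustration: the function parse_blocks and the regular expression are assumptions and are not part of GEO600, and the real parser in contact.pl may accept syntax that this sketch does not.

```python
import re

# Hypothetical reader for "keyword host { KEY = VALUE ... }" blocks.
# Assumption: block bodies never contain nested braces.
BLOCK_RE = re.compile(r"(\w+)\s+(\S+)\s*\{(.*?)\}", re.DOTALL)


def parse_blocks(text):
    """Return a list of (keyword, host, options) tuples."""
    blocks = []
    for keyword, host, body in BLOCK_RE.findall(text):
        options = {}
        for line in body.splitlines():
            if "=" not in line:
                continue  # skip blank lines
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
        blocks.append((keyword, host, options))
    return blocks


EXAMPLE = """
storage gridgk01.racf.bnl.gov {
    LOCATION    = gsiftp://gridgk01.racf.bnl.gov/~/data/tasks
    ACCESS      = gridgk01.racf.bnl.gov
    GT4         = gridgk01.racf.bnl.gov
    GRAMWS_PORT = 9443
}
"""

for keyword, host, options in parse_blocks(EXAMPLE):
    print(keyword, host, options["LOCATION"], options.get("GRAMWS_PORT"))
```

    • All values are kept as strings; converting GRAMWS_PORT to an integer is left to the caller.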
      						
    • Local Storage Locations
    • A local storage location is, by our definition, just an ordinary grid resource that provides storage space on the side.
    • Local storage is not only convenient but necessary for large grid resources. The network traffic created by a large number of GEO600 jobs, each staging in 100MB from the same remote storage location, will easily exceed the capabilities of any storage resource.
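    • To get a feeling for the numbers, the total stage-in volume can be estimated by multiplying the number of jobs by the archive size of approx. 100MB. The following sketch is a back-of-the-envelope calculation only; the function name stage_in_volume_gb and the job counts are illustrative assumptions, and decimal units (1GB = 1000MB) are used.

```python
def stage_in_volume_gb(jobs, archive_mb=100):
    """Estimate the data volume (in GB) staged in by `jobs` GEO600 tasks.

    Assumes every task stages in one archive of `archive_mb` megabytes.
    """
    return jobs * archive_mb / 1000


for jobs in (10, 100, 700):
    print(jobs, "jobs ->", stage_in_volume_gb(jobs), "GB")
```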
    • Local storage locations are configured in much the same way as remote storage locations: add an entry to GEO600-devel/main/etc/storage.conf and set the ACCESS option to the fully qualified domain name of the grid resource where the GEO600 jobs are running:
    • storage gridgk01.racf.bnl.gov {
      	LOCATION    = gsiftp://gridgk01.racf.bnl.gov/~/data/tasks
      	ACCESS      = gridgk01.racf.bnl.gov
      	GT4         = gridgk01.racf.bnl.gov
      	GRAMWS_PORT = 9443
      }
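    • The relationship between ACCESS and the resource running the jobs can be sketched as a simple lookup. This is an illustration under stated assumptions, not the logic of contact.pl: the function storage_for is hypothetical, and the rule "prefer a storage location tied to the resource, otherwise fall back to a public one" is an assumed reading of the description above.

```python
def storage_for(resource, storages):
    """Pick a storage LOCATION for jobs running on `resource`.

    `storages` maps a storage host to its options, as in storage.conf.
    Assumption: an entry without ACCESS is public, an entry with ACCESS
    is private to the named resource.
    """
    private = [opts["LOCATION"] for opts in storages.values()
               if opts.get("ACCESS") == resource]
    if private:
        return private[0]
    public = [opts["LOCATION"] for opts in storages.values()
              if "ACCESS" not in opts]
    return public[0] if public else None


STORAGES = {
    "astrodata09.gac-grid.org": {
        "LOCATION": "gsiftp://astrodata09.gac-grid.org/store/01/aei/GEO600/tasks",
    },
    "gridgk01.racf.bnl.gov": {
        "LOCATION": "gsiftp://gridgk01.racf.bnl.gov/~/data/tasks",
        "ACCESS": "gridgk01.racf.bnl.gov",
    },
}

print(storage_for("gridgk01.racf.bnl.gov", STORAGES))
print(storage_for("gavo2.aip.de", STORAGES))
```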
      						
    • At this point you may want to change GEO600-devel/main/etc/storage.conf to reflect your grid configuration. It is not necessary to create the path on each storage location by calling mkdir; it may be sufficient to execute:
    • robert@buran:~/GEO600-devel/main/scripts> perl contact.pl -storage
      
       HOST                                        load[5min]    permissions    usage[G]  access
      ------------------------------------------------------------------------------------------------------
       a01.hlrb2.lrz-muenchen.de                       269.54             OK         0.0  private
       arminius-grid.uni-paderborn.de                    0.00             OK         5.5  private
       astrodata09.gac-grid.org                          3.05             OK         5.7  public
       damiana.aei.mpg.de                                0.98             OK         0.1  private
       dgrid-globus.rz.rwth-aachen.de                    0.12             OK        17.3  private
       gate01.aglt2.org                                 17.51             OK         0.0  private
       gcwn60.d-grid.uni-hannover.de                     0.17             OK        80.7  private
       gramd1.d-grid.uni-hannover.de                     0.30             OK        80.7  private
       grid3.aset.psu.edu                                0.00             NO         0.0  private
       gridgk01.racf.bnl.gov                            41.17             OK        52.0  private
       gt4-fzk.gridka.de                                 0.14             OK        95.4  private
       hector.zih.tu-dresden.de                         18.04             OK         2.1  private
       hydra.ari.uni-heidelberg.de                       0.70             OK         0.0  private
       iwrgt4.fzk.de                                     0.21             OK         9.3  private
       juggle-glob.fz-juelich.de                         0.01             OK        46.1  private
       juggle-inter.fz-juelich.de                        0.00             OK        46.1  private
       lx32ia1.lrz-muenchen.de                           0.00             NO         0.0  private
       lx64ia2.lrz-muenchen.de                           0.00             NO         0.0  private
       mardschana.zib.de                                 0.09             OK        14.7  private
       medigrid-srv.gwdg.de                              0.12             OK        25.4  private
       nest.phys.uwm.edu                                 0.00             NO         0.0  private
       osg.rcac.purdue.edu                               2.70             OK         0.0  private
       othello.zih.tu-dresden.de                        18.19             OK         3.4  private
       srvgrid01.offis.uni-oldenburg.de                  0.00             OK         0.0  private
       udo-gt01.grid.uni-dortmund.de                     0.14             OK        29.2  private
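    • With many storage locations it can help to filter this table automatically, for example to list all hosts where the permission check failed. The sketch below assumes the five-column layout shown above; the function parse_status is hypothetical and not part of GEO600, and the sample rows are copied from the output above.

```python
def parse_status(output):
    """Parse the table printed by contact.pl into a list of dicts.

    Assumption: data rows have exactly five whitespace-separated fields
    and the third field is either OK or NO.
    """
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[2] not in ("OK", "NO"):
            continue  # prompt, header, separator or blank line
        host, load, permissions, usage, access = fields
        rows.append({
            "host": host,
            "load": float(load),
            "permissions": permissions,
            "usage_gb": float(usage),
            "access": access,
        })
    return rows


SAMPLE = """
 HOST                                        load[5min]    permissions    usage[G]  access
------------------------------------------------------------------------------------------------------
 astrodata09.gac-grid.org                          3.05             OK         5.7  public
 grid3.aset.psu.edu                                0.00             NO         0.0  private
 gridgk01.racf.bnl.gov                            41.17             OK        52.0  private
 nest.phys.uwm.edu                                 0.00             NO         0.0  private
"""

failed = [row["host"] for row in parse_status(SAMPLE)
          if row["permissions"] == "NO"]
print(failed)
```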
      						
    • Note that the previous command will likely take some time to complete, depending on the number of storage locations you have set up. The log file of contact.pl is located at GEO600-devel/log/contact.log.
    • Output Storage Location
    • An output storage location is just an ordinary remote storage location that all GEO600 tasks will use to stage out stdout, stderr and the log file.
    • This is especially convenient while debugging GEO600 tasks, since all files will be available on one resource, which may well be the submission host.
    • At this point you may want to change GEO600-devel/main/etc/output.conf to define one remote storage location for collecting GEO600 job output:
    • storage buran.aei.mpg.de {
      	LOCATION  = gsiftp://buran.aei.mpg.de/store/GEO600/tasks
      }
      						
    • To quickly check the state of the resource and whether you have sufficient rights to access the output location, execute the following in GEO600-devel/main/scripts:
    • robert@buran:~/GEO600-devel/main/scripts> perl contact.pl -output 
      
       HOST                                        load[5min]    permissions    usage[G]  access
      ------------------------------------------------------------------------------------------------------
       buran.aei.mpg.de                                  0.44             OK         0.2  public
      						
    • Grid Run Configuration
    • The GEO600 grid application is functionally split into two independent parts. The first part will always be executed on compute elements of grid resources. This part enhances BOINC to allow carefree execution on clusters. The application is located in GEO600-devel/main/scripts and is called eah.pl. This application:
    • adds the possibility to set a user defined runtime limit
    • adds support for migrating the application between different hosts
    • adds support for managing a large number of GEO600 tasks in grid environments
    • adds support for accounting and statistics in grid environments
    • The second part allows the execution of BOINC in grid environments by providing interfaces to the Globus Toolkit. This part will only be executed on a special job submission host. The application is located in GEO600-devel/main/scripts and called contact.pl. This application:
    • adds support to submit GEO600 tasks to a grid resource
    • adds support to monitor GEO600 tasks on grid resources
    • manages storage locations used by GEO600 tasks
    • provides live and statistical information about GEO600 tasks
    • The grid run configuration file specifies the capabilities and interfaces of grid resources. The configuration file is located in GEO600-devel/main/etc/grid-run.conf. A grid host entry consists of the keyword run followed by the fully qualified domain name of the grid resource:
    • run gridgk01.racf.bnl.gov {
      	GEO600_HOME = GEO600-devel
      }
      						
    • The keyword GEO600_HOME points to the installation path of GEO600 relative to $HOME on the grid resource. At this point contact.pl will assume many default values that are suitable only for standard Globus installations on grid workstations. It is therefore necessary to discuss some more options used in the grid run configuration file.
    • GEO600_HOME      = GEO600                               : default GEO600 installation path
      TIMEOUT          = 0.00:10:00                           : default runtime of the application (d.hh:mm:ss)
      FT               = Fork                                 : default factory type of GRAMWS  
      JOBS_RUNNING_MAX = 1                                    : default maximum number of tasks running simultaneously
      PRESTAGE         = GLOBUS                               : by default Globus stages files, no MANUAL staging
      ALLDAYS          = [0-24]                               : by default run on all days and all times of the day
      WEEKDAYS         = [0-24]                               : by default run on all weekdays and all times of the day
      WEEKENDS         = [0-24]                               : by default run on all weekends and all times of the day
      CHECK_ARCHIVE    = NO                                   : check job archive prior to job submission
      SUSPEND_ALL      = NO                                   : suspend all GEO600 tasks on node upon exit of application
      USE_BETA         = NO                                   : by default don't use BETA application of EAH
      USE_TMP          = NO                                   : by default don't use $TMP nor /tmp to run
      GRAMWS_PORT      = 8443                                 : default port of GRAMWS
      PREFIX           = build/boinc_5.8.16_i686-pc-linux-gnu : default BOINC installation to use
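    • The TIMEOUT value in the listing above uses the format d.hh:mm:ss. As a minimal sketch of how such a value translates into a duration, the following Python function converts it into a timedelta. The function parse_timeout and its strictness (for example, requiring two-digit minutes and seconds) are assumptions made for illustration; the parser used by GEO600 may be more lenient.

```python
import re
from datetime import timedelta

# d.hh:mm:ss, for example 0.00:10:00 (ten minutes) or 1.00:00:00 (one day)
TIMEOUT_RE = re.compile(r"^(\d+)\.(\d{1,2}):(\d{2}):(\d{2})$")


def parse_timeout(value):
    """Convert a TIMEOUT string of the form d.hh:mm:ss into a timedelta."""
    match = TIMEOUT_RE.match(value.strip())
    if match is None:
        raise ValueError("expected d.hh:mm:ss, got %r" % value)
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


for value in ("0.00:10:00", "0.10:00:00", "1.00:00:00"):
    print(value, "->", parse_timeout(value).total_seconds(), "seconds")
```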
      						
    • At first it will be necessary to specify some details concerning the Globus installation on the grid resource. The factory type FT specifies how jobs are put into execution by GRAMWS on the grid resource.
    • If the grid resource in question is a simple workstation, the default factory type Fork does not need to be set. In this case GRAMWS will simply fork your job on the resource.
    • In case the grid resource is the frontend to a cluster using a local queuing system, the factory type must be set to the factory type supported by the resource. This might be PBS, SGE, CCS, LSF or Condor. If unsure contact the local grid resource administrator for details.
    • Some grid resources not only provide access to the local queuing system using one of the factory types mentioned above, but provide access to Fork as well. This does not mean that you may run tasks on the frontend, but that you may use Fork for administrative purposes. This is very convenient and can be specified using the keyword FT_FORK. The default value is NO.
    • The default port GRAMWS uses for connections is 8443. If the grid resource in question uses a different port, you can specify this port using the keyword GRAMWS_PORT:
    • run gridgk01.racf.bnl.gov {
      	GEO600_HOME = GEO600-devel
      	FT          = Condor
      	FT_FORK     = YES
      	GRAMWS_PORT = 9443
      }
      						
    • Grid Workstation Configuration
    • A number of options are mostly of interest for grid workstations. You can adjust the maximum number of jobs running simultaneously by setting JOBS_RUNNING_MAX. The default is 1, and the value should not be greater than the number of CPU cores installed. If you set JOBS_RUNNING_MAX = 0, no jobs will be allowed to run at the resource.
    • Some grid workstations might be in daily use and only available for heavy computation at night or at weekends. You can limit the time GEO600 tasks are allowed to run at the resource using the options ALLDAYS, WEEKDAYS and WEEKENDS. For example, consider the following configuration:
    • run gavo2.aip.de {
      	GEO600_HOME      = GEO600-devel
      	TIMEOUT          = 1.00:00:00
      	JOBS_RUNNING_MAX = 2
      	WEEKDAYS         = [22-7]
      }
      						
    • In the example shown above, 2 GEO600 tasks will be submitted on weekdays only after 10pm, and the configured runtime of 1 day will be cut to at most 9 hours in order to free the workstation at 7am the next day. At weekends no limitations exist.
    • It is convenient to use one remote storage location for all grid workstations. In this case it is recommended to check the availability of the job archive located at the storage resource before job submission. This can be done using the keyword CHECK_ARCHIVE:
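    • The window [22-7] wraps around midnight, which is easy to get wrong when reasoning about it. The sketch below shows one way to test whether an hour falls inside such a window and how many hours remain until it closes. The function names are hypothetical, the reading of [start-end] as "from start o'clock up to, but not including, end o'clock" is an assumption, and the distinction between weekdays and weekends is ignored.

```python
import re


def parse_window(spec):
    """Parse a window such as "[22-7]" into a (start, end) tuple of hours."""
    match = re.fullmatch(r"\[(\d{1,2})-(\d{1,2})\]", spec.strip())
    if match is None:
        raise ValueError("expected [start-end], got %r" % spec)
    return int(match.group(1)), int(match.group(2))


def in_window(hour, start, end):
    """Return True if `hour` (0-23) lies inside the window."""
    if start <= end:
        return start <= hour < end
    # The window wraps around midnight, e.g. 22 -> 7.
    return hour >= start or hour < end


def hours_until_close(hour, start, end):
    """Hours left before the window closes, or 0 outside the window."""
    if not in_window(hour, start, end):
        return 0
    return (end - hour) % 24 or 24


start, end = parse_window("[22-7]")
for hour in (12, 22, 3):
    print(hour, in_window(hour, start, end), hours_until_close(hour, start, end))
```

    • For a task started at 22:00 the sketch yields 9 remaining hours, which matches the 9 hour limit described above.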
    • run gavo2.aip.de {
      	GEO600_HOME      = GEO600-devel
      	TIMEOUT          = 1.00:00:00
      	JOBS_RUNNING_MAX = 2
      	WEEKDAYS         = [22-7]
      	CHECK_ARCHIVE    = YES
      }
      						
    • This will slow down job submission a bit, but ensures that the job archive exists on the remote storage resource and that you have sufficient rights to access the archive. It does not ensure that the gridftp service provided by the remote storage resource is working properly, nor that the grid workstation will succeed in staging the job archive.
    • Grid Cluster Configuration
    • Grid clusters have capabilities beyond those of simple grid workstations, and these need to be configured. Most prominent is the existence of a local queuing system distributing tasks to a number of compute nodes. Fortunately Globus handles the details of how to access the queuing system as long as you specify the type, also known as factory type, correctly. GEO600 currently supports PBS, SGE, CCS, LSF and Condor.
    • run gridgk01.racf.bnl.gov {
      	GEO600_HOME = GEO600-devel
      	FT          = Condor
      	FT_FORK     = YES
      	GRAMWS_PORT = 9443
      }
      						
    • Some Grid Clusters may provide several queues for you to choose from. You can tell GEO600 to use a specific queue instead of the default queue by providing its name using the keyword QUEUE:
    • run othello.zih.tu-dresden.de {
      	GEO600_HOME = GEO600-devel
      	FT          = PBS
      	FT_FORK     = NO
      	QUEUE       = gridbatch
      }
      						
    • At this point it will be possible to submit one job to the grid cluster using the default values:
    • JOBS_RUNNING_MAX = 1  : maximum number of jobs allowed to run simultaneously
      JOBS_QUEUE_MAX   = 1  : maximum number of jobs allowed to be pending in queue simultaneously
      JOBS_QUEUE_MIN   = 0  : minimum number of jobs pending before new jobs will be submitted
      						
    • Here JOBS_RUNNING_MAX acts as an upper limit on the number of jobs running simultaneously at the resource. This is a hard limit that should never be exceeded. Therefore the sum of the number of jobs currently running and the number of jobs pending will never be greater than this limit.
    • JOBS_QUEUE_MAX acts as an upper limit regarding the number of jobs pending simultaneously in the queue. It takes care that the grid cluster is filled only slowly with GEO600 tasks and assures a fair use of the resource at the same time.
    • JOBS_QUEUE_MIN acts as a lower limit regarding the number of GEO600 tasks pending in the queue. New GEO600 tasks will only be submitted if the number of jobs pending falls below this limit. As an illustrating example take a look at the grid resource configuration below:
    • run gridgk01.racf.bnl.gov {
      	GEO600_HOME        = GEO600-devel
      	FT                 = Condor
      	FT_FORK            = YES
      	GRAMWS_PORT        = 9443
      	TIMEOUT            = 0.10:00:00
      	JOBS_RUNNING_MAX   = 700
      	JOBS_QUEUE_MAX     = 64
      	JOBS_QUEUE_MIN     = 32
      }		
      						
    • The number of JOBS_RUNNING_MAX is based on the observation that this resource is capable of supporting one user running on 700 cores simultaneously.
    • Job submission to the resource begins whenever the number of GEO600 tasks pending in queue falls below the limit JOBS_QUEUE_MIN. In this case tasks will be submitted till the number of tasks pending either reaches JOBS_QUEUE_MAX or the sum of tasks pending and running reaches JOBS_RUNNING_MAX.
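    • The interplay of the three limits can be sketched as a small function that returns how many tasks to submit in one pass. This is an illustration, not the code of contact.pl; the function tasks_to_submit is hypothetical. One assumption deserves attention: the text says submission starts when the pending count falls below JOBS_QUEUE_MIN, but with the default JOBS_QUEUE_MIN = 0 a first job could then never be submitted, so the sketch treats a pending count equal to the limit as a trigger as well.

```python
def tasks_to_submit(running, pending, running_max, queue_max, queue_min):
    """Number of new tasks to submit in one pass.

    Assumption: submission is triggered when pending <= queue_min.
    """
    if pending > queue_min:
        return 0
    room_in_queue = queue_max - pending
    room_in_total = running_max - (running + pending)
    return max(0, min(room_in_queue, room_in_total))


# Limits from the gridgk01.racf.bnl.gov example above.
LIMITS = dict(running_max=700, queue_max=64, queue_min=32)

print(tasks_to_submit(running=100, pending=10, **LIMITS))  # queue limit applies
print(tasks_to_submit(running=690, pending=5, **LIMITS))   # total limit applies
print(tasks_to_submit(running=100, pending=40, **LIMITS))  # enough tasks pending
```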
    • It is very easy to submit hundreds of GEO600 tasks to a grid resource. Be aware that it is not as easy to stop these tasks once submitted. Even more difficult might be explaining the situation to the resource administrator and to other users confronted with far too many GEO600 tasks in the queue. Make sure that you start with reasonably small limits for these settings and that you increase them slowly, watching carefully how the resource is able to keep up! If unsure, contact your resource administrator and ask for reasonable limits on the resource utilization!
    • During the execution of GEO600 each task requires approximately 100MB of disk space for its runtime directory. By default it will be located below your GEO600 installation path in GEO600-devel/tasks. With a large number of GEO600 tasks running simultaneously on the grid cluster, the disk space used in GEO600-devel/tasks may easily grow beyond 50GB. If sufficient disk space cannot be provided at this location, it is possible to use disk space on the compute nodes for the same purpose by using the keyword USE_TMP:
    • run gridgk01.racf.bnl.gov {
      	GEO600_HOME        = GEO600-devel
      	FT                 = Condor
      	FT_FORK            = YES
      	GRAMWS_PORT        = 9443
      	TIMEOUT            = 0.10:00:00
      	JOBS_RUNNING_MAX   = 700
      	JOBS_QUEUE_MAX     = 64
      	JOBS_QUEUE_MIN     = 32
      	USE_TMP            = YES
      }
      						
    • The runtime directory of the task will then be located below $TMP/$LOGIN, or in /tmp/$LOGIN if $TMP was not set. If neither /tmp nor $TMP exists, the default location GEO600-devel/tasks will be used.
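    • The fallback order just described can be written down as a short function. This is a sketch for illustration only: the function runtime_dir and its parameters are hypothetical, and falling back to /tmp when $TMP is set but does not exist is an assumption that the text does not spell out.

```python
import os
import posixpath


def runtime_dir(geo600_home, login, use_tmp, env, exists=os.path.isdir):
    """Return the directory below which a task's runtime directory is created.

    `env` is a mapping of environment variables; `exists` can be replaced
    for testing.
    """
    default = posixpath.join(geo600_home, "tasks")
    if not use_tmp:
        return default
    tmp = env.get("TMP")
    if tmp and exists(tmp):
        return posixpath.join(tmp, login)
    if exists("/tmp"):
        return posixpath.join("/tmp", login)
    return default  # neither $TMP nor /tmp is usable


print(runtime_dir("GEO600-devel", "robert", False, {}))
print(runtime_dir("GEO600-devel", "robert", True, {"TMP": "/scratch"},
                  exists=lambda path: True))
```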
    • Upon finishing the execution of a GEO600 task, the runtime directory will be archived and written to GEO600-devel/tasks. There is no workaround for using this location, since Globus will not be able to access the local filesystem on compute nodes to stage out the job archive.
    • This workaround comes with a serious drawback. In general you will be able to access neither the compute nodes nor their local filesystems. In case of problems you will not be able to take a look at the runtime directory, nor will you be able to clean up. If unsure, ask the grid resource administrator for permission to use /tmp or $TMP!
    • Troubleshooting Configuration
    • The options described for grid workstations and grid clusters are usually sufficient to get GEO600 running on any resource. Nevertheless there are resources that will not cooperate and where special options might be needed.
    • The default BOINC application might not be the best choice for every grid resource. In this case you can use the keyword PREFIX to point to the installation path of an alternative BOINC version, given relative to your GEO600 installation path:
    • PREFIX       = build/boinc_5.4.11_i686-pc-linux-gnu
      						
    • In many cases the home partition will be located on a network file system. Some versions of BOINC have problems checkpointing to and recovering from such a file system. You may want to test whether this is really the case by temporarily using a different runtime directory below the /tmp directory or any other arbitrary directory:
    • USE_TMP     = /tmp
      						
    • In this case the GEO600 working directory will be located on the compute node in /tmp/$LOGIN/$ID. It is also possible to use environment variables at this point as long as they have been defined:
    • USE_TMP     = $OSG_WN_TMP
      						
    • Sometimes it is helpful to define or change the environment on the target machine. This can be done by defining ENV as a comma-separated list of key = value pairs:
    • ENV         = ( TMP = /tmp, PROJECT = "AstroGrid")					
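    • To make the shape of the ENV value concrete, the following sketch turns it into a dictionary of environment variables. The function parse_env is hypothetical and not part of GEO600; it assumes that values contain no commas and that surrounding double quotes are not part of the value.

```python
def parse_env(value):
    """Parse '( KEY = value, KEY = "value" )' into a dict of strings."""
    inner = value.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    env = {}
    for pair in inner.split(","):
        if "=" not in pair:
            continue  # ignore empty fragments
        key, val = pair.split("=", 1)
        env[key.strip()] = val.strip().strip('"')
    return env


print(parse_env('( TMP = /tmp, PROJECT = "AstroGrid")'))
```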
      						
    • Some queuing systems are simply misconfigured. They might support the sharing of one compute node by several jobs, but will cleanup all user processes as soon as one of these jobs exits. Setting the keyword SUSPEND_ALL to YES will assure that all GEO600 tasks sharing a compute node will exit cleanly together whenever one task is about to exit.
    • SUSPEND_ALL = YES
      						
    • Some Globus installations have persistent problems staging in files from gridftp servers. Changing the keyword PRESTAGE from its default value GLOBUS to MANUAL will delegate the staging of files to the submission host.
    • PRESTAGE    = MANUAL
      						
    • Some Globus installations seem to be unable to detect that GEO600 requires only a single CPU core to run, and provide an entire compute node instead. You can explicitly tell Globus to use only one CPU core by setting the keyword JOB_TYPE to single.
    • JOB_TYPE    = single
      						
    • Some Globus installations don't use reasonable default values for the memory request of your job. This might prevent your job from being put into execution. In this case you can provide these values to Globus using the following settings (values in MB):
    • MAXMEMORY   = 900
      MINMEMORY   = 256
      						
    • If the grid resource relies on the project name for accounting purposes or in order to route your job to a special set of compute nodes, you can set the PROJECT name.
    • PROJECT     = astrogrid
      						
    • If none of the options listed above helps you run GEO600 on a difficult machine, the Einstein@Home developers might be able to provide you with a beta application. The beta application must be installed manually into the archive located in GEO600-devel/src/initial-beta.tar.gz. Setting the keyword USE_BETA to YES will make sure that the beta application is used instead of the stable application.
    • USE_BETA    = YES
      						
    • Frequently Asked Questions
    • Robert Engel, Max-Planck Institut for Gravitational Physics