Data Summary Set
S. Marka, B. Mours, V. Sannibale
Draft
October 4, 2000
1. Motivations
The purpose of the Data Summary Set
(DSS) is to provide a small data set (around 10 kBytes/s per site,
or about 1 GByte/day) containing the main GW channel(s) plus a description
of the running conditions. The target size is chosen so that the data are
easy to transfer via the Internet for real-time burst searches
and network analysis. Such a size also makes it possible to keep months of data
on spinning media for detector investigations or astrophysical
searches. This data set is close to the Level 3 data type defined in the
LSC White Paper on Data Analysis, but in addition to the GW channel, it
contains monitoring information which should account for the same volume
of data (around 5 kBytes/s). This additional information will be critical
in the first stage of the data analysis, when we will be searching for the
(detector) origin of the observed events.
The monitoring information will
consist of three sets:
-
The Channel Summary Information. This
is the largest data volume. A few statistical parameters (like mean values,
r.m.s., power per band, spectrum stability,...) have to be selected to
describe all monitoring channels.
-
The Quality Information. This is a qualitative description
(flags) of the running conditions. This information will be provided for each
detector part (such as the PSL or the seismic isolation) and for the full detector.
-
Some Snapshots of useful
monitoring channels when they behave in a non-standard way. This is optional
but may be useful at the beginning, when the characterization of the detector
is not yet complete. The target for the snapshot data
is to stay below the volume of the GW channel.
The design of the Data Summary Set
should support multiple interferometers. This is done through a naming convention
(prefix).
Please note that this document is only an
attempt to define such a data set. The initial goal is to prototype it
in order to check the validity of this approach.
2. File Organization
To be compatible with the various software environments,
data will be stored in frames. Since storage is a key issue, the files
will contain many frames. This allows us to use static data in an efficient
way: small frames plus global information. To be efficient, the
file should be at least several minutes long.
If we use 1000 frames per file (about 15 minutes), we
will have 10 MByte files and about 85 files per day. For 10000 frames per file
(about 3 hours), the file is still manageable (100 MBytes) and we have
only about 8 files per day.
If we succeed in keeping
the data rate below 8 kBytes/s, one day of data will fit on one CD-ROM
and one week of data will fit on a DVD.
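The file sizes quoted above follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated in this section: each frame lasts one second, and 1 MByte is taken as 1000 kBytes. The CD-ROM estimate is read here as a rate of 8 kBytes per second. The function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def file_stats(rate_kbytes_per_s, frames_per_file, frame_length_s=1.0):
    """Return (file size in MBytes, file duration in minutes, files per day).

    Assumes one-second frames by default and 1 MByte = 1000 kBytes.
    """
    duration_s = frames_per_file * frame_length_s
    size_mbytes = rate_kbytes_per_s * duration_s / 1000.0
    files_per_day = SECONDS_PER_DAY / duration_s
    return size_mbytes, duration_s / 60.0, files_per_day


def daily_volume_mbytes(rate_kbytes_per_s):
    """Data volume accumulated over one day, in MBytes."""
    return rate_kbytes_per_s * SECONDS_PER_DAY / 1000.0


if __name__ == "__main__":
    for n_frames in (1000, 10000):
        size, minutes, per_day = file_stats(10.0, n_frames)
        print(f"{n_frames} frames: {size:.0f} MBytes, "
              f"{minutes:.0f} minutes, {per_day:.1f} files/day")
    print(f"8 kBytes/s gives {daily_volume_mbytes(8.0):.0f} MBytes/day")
```

With these assumptions, a rate of 8 kBytes/s gives roughly 690 MBytes per day, which is the order of magnitude of one CD-ROM.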
The frame file will be labeled X-GPSTIME.DSS,
where X is the site (H or L) and GPSTIME is the GPS time of the first frame
in the file. In the case of frames containing data from several sites, the
prefix will be N.
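As an illustration of this labeling rule, a file name could be built as follows. This is only a sketch: the function name is hypothetical, and writing GPSTIME as an integer number of seconds is an assumption not fixed by this document.

```python
def dss_file_name(sites, gps_start):
    """Build an X-GPSTIME.DSS label.

    sites: iterable of site letters ("H" and/or "L") present in the file.
    gps_start: GPS time of the first frame (written as integer seconds,
    which is an assumption of this sketch).
    """
    unique = sorted(set(sites))
    if not unique:
        raise ValueError("at least one site is required")
    for site in unique:
        if site not in ("H", "L"):
            raise ValueError(f"unknown site: {site!r}")
    # A single site keeps its own letter; several sites use the prefix N.
    prefix = unique[0] if len(unique) == 1 else "N"
    return f"{prefix}-{int(gps_start)}.DSS"


if __name__ == "__main__":
    print(dss_file_name(["H"], 654321000))
    print(dss_file_name(["H", "L"], 654321000))
```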
3. Channel Summary Information
There are two types of channels: the fast ones, with a sampling
rate of 256 Hz or above, and the slow ones (16 Hz or below). We also define
a third type of channel: the key channels, for which we want a better
description than only 5 numbers.
Fast Channels. For each fast channel, the following
values will be computed and stored:
-
Mean value. For each frame a
mean value ('value')
is computed. Instead of storing the floating point result directly,
we store an integer: the variation ('variation')
around a more stable mean, expressed in percent of an rms value:
variation = 100 * (value - <value>) / rms(value)
This second mean ('<value>')
and the rms ('rms(value)') are computed over
a longer time period and stored as static data. The
period must be long enough to give a good estimate of the mean and rms
and to reduce the impact of the static data on the total data volume. However,
the period must be short enough to allow one to track larger drifts,
after lock acquisition for instance. A typical length of one minute seems
appropriate.
There are two reasons to store the data this way. First,
it takes less space. Second, it gives right away a feeling of how standard
the current value is, without having to remember the usual values.
-
RMS. This is the rms over one
frame. As for the mean value, we do not store the raw rms
value in each frame but the variation
in percent. The mean rms and the rms of the frame rms are computed over a
longer period and stored in static data.
-
Power in Three bands. This is the power in a given
frequency band for one frame. Typical band boundaries could be 32Hz-128Hz,
128Hz-1kHz, 1kHz-8kHz. Other values could be defined. Of course, not all
fast channels will have the power computed for all bands. As
for the mean value, we do not store the raw power value in each frame
but the variation
in percent. The mean power and the rms of the power are computed over a longer
period and stored in static data.
-
Spectrum Chisquare. This quantity is designed to detect
changes in the spectrum. It is computed by taking the bin-by-bin difference
between the current spectrum and an average one. The differences are then
normalized by the rms of the bin fluctuation and quadratically summed to
produce a chisquare. Finally, this chisquare is normalized per degree of
freedom and multiplied by 100 before being stored.
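The spectrum chisquare described in the last item can be sketched as follows. This is a minimal illustration, not the prototype code: the function name is hypothetical, the number of degrees of freedom is assumed to be the number of bins, and rounding to the nearest integer is an assumption about how the value is stored.

```python
def spectrum_chisquare(current, average, bin_rms):
    """Return 100 * chisquare per degree of freedom, as an integer.

    current: spectrum of the current frame (one value per bin).
    average: reference (average) spectrum, same binning.
    bin_rms: rms of the fluctuation of each bin, same binning.
    Assumption: one degree of freedom per bin.
    """
    if not current or not (len(current) == len(average) == len(bin_rms)):
        raise ValueError("the three spectra must be non-empty and equal in length")
    chi2 = 0.0
    for cur, avg, rms in zip(current, average, bin_rms):
        # Bin-by-bin difference, normalized by the bin fluctuation.
        chi2 += ((cur - avg) / rms) ** 2
    ndf = len(current)
    return int(round(100.0 * chi2 / ndf))


if __name__ == "__main__":
    # Each bin is exactly one rms away from the average.
    print(spectrum_chisquare([2.0, 4.0], [1.0, 2.0], [1.0, 2.0]))
```

With this scaling, a spectrum compatible with the average gives values around 100, and much larger values indicate a change of the spectrum.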
Slow Channels. For each slow channel we compute the
mean value (average of the 16 values). As for the fast channels, we store
the variation of the mean value expressed in percent of the rms value.
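The variation encoding used for the fast and slow channel quantities can be sketched as below. Several points are assumptions of this sketch rather than statements of the design: rms(value) is taken as the standard deviation of the per-frame values over the reference period, the result is rounded to the nearest integer, and it is clipped to the range of a signed 16-bit integer. The function names are illustrative.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def reference_stats(values):
    """Mean and rms (standard deviation) of the per-frame values
    collected over the longer reference period."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def encode_variation(value, ref_mean, ref_rms):
    """variation = 100 * (value - <value>) / rms(value), as an integer."""
    if ref_rms == 0:
        return 0  # assumption: a constant channel is stored as 0
    variation = round(100.0 * (value - ref_mean) / ref_rms)
    # Assumption: clip to the signed 16-bit range.
    return max(INT16_MIN, min(INT16_MAX, variation))


def decode_variation(variation, ref_mean, ref_rms):
    """Recover an approximate value from the stored integer."""
    return ref_mean + variation * ref_rms / 100.0


if __name__ == "__main__":
    mean, rms = reference_stats([1.0, 3.0])
    stored = encode_variation(2.5, mean, rms)
    print(stored, decode_variation(stored, mean, rms))
```

A stored value of 100 then means that the frame value is one rms above its usual mean, which is what makes the number readable without knowing the usual values.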
Key Channels: For a limited number of fast key
channels (<40 channels, list TBD), we compute more parameters. The preliminary
list is
-
MajorFreq. The frequency bin of the largest spectral
component.
-
MajorFreqA. The amplitude of the major frequency component.
-
MajorFreqP. The phase of the major frequency component.
-
Poly0. 0-order polynomial fit parameter (after
removing the MajorFreq component).
-
Poly1. 1-order polynomial fit parameter (after removing
the MajorFreq component).
-
Poly2. 2-order polynomial fit parameter (after removing
the MajorFreq component).
-
RmsRes. rms of the residual distribution
-
Chi2Res. chi2 of gaussian fit to
the residual distribution
-
CrossN. Number of samples outside some narrow threshold
-
CrossL. Number of samples outside some large
threshold
-
Ilarge. Index of the 'largest' residual
-
Min. This is the minimum value over one frame.
-
Max. This is the maximum value over one frame.
-
Delta. This is the maximum value of the difference
between two consecutive samples: delta = max(sample(i+1) - sample(i)).
-
WorstSpectrum. If several spectra are computed
during one frame, this is the index of the spectrum with the largest chisquare.
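A few of the simpler key channel parameters can be sketched as follows. This is an illustration only: the residuals are passed in as an already computed list (the removal of the major frequency and of the polynomial fit is not shown), the thresholds are compared against the absolute value of the residual, and the function names and threshold values are hypothetical.

```python
def frame_extremes(samples):
    """Min, Max and Delta over the samples of one frame."""
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    # Delta as defined in the text: max(sample(i+1) - sample(i)).
    delta = max(b - a for a, b in zip(samples, samples[1:]))
    return {"Min": min(samples), "Max": max(samples), "Delta": delta}


def residual_counters(residuals, narrow, large):
    """CrossN, CrossL and Ilarge from a list of residuals.

    narrow, large: thresholds (assumed here to apply to |residual|).
    """
    cross_n = sum(1 for r in residuals if abs(r) > narrow)
    cross_l = sum(1 for r in residuals if abs(r) > large)
    ilarge = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
    return {"CrossN": cross_n, "CrossL": cross_l, "Ilarge": ilarge}


if __name__ == "__main__":
    print(frame_extremes([0.0, 1.0, 4.0, 2.0]))
    print(residual_counters([0.1, -2.0, 0.5, 3.5], narrow=1.0, large=3.0))
```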
4. Quality flags
Quality flags could be defined and computed in many different
ways. The main goal here is to define flags which tell us whether the data
can be used safely for data analysis or whether there are some doubts.
In this design we foresee three steps/levels: Channel level, Group level, and
Frame (IFO) level. Each of these flags is computed on a one-second frame basis.
4.1 Channel level
For each selected channel one or several tests are performed.
These could be:
-
Check that the mean value is within some bounds
-
Check that the band-limited power is within some bounds
-
Check that the spectrum chisquare is less than some value
-
Check that the mean spectrum is within a given band.
The output of the test is one of the following flags:
-
Gold : perfect channel.
-
Faire : some minor non standard behavior probably due to
some statistical fluctuation.
-
Suspicious : non standard behavior not compatible with a
statistical fluctuation, such as a deviation of 5 sigma or more.
-
Fatal : data are unreliable for any analysis.
The output of the channel test is bit encoded by group of
channels. There are three 32-bit words for each group: one for the channels
tagged Faire, one for the channels tagged Suspicious, and one for the channels
tagged Fatal.
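The bit encoding of the channel flags can be sketched as follows. The mapping of channel number i to bit i within each word is an assumption of this sketch, as are the function names; a Gold channel simply has no bit set in any of the three words.

```python
WORD_NAMES = ("Faire", "Suspicious", "Fatal")
MAX_CHANNELS = 32  # one bit per channel in a 32-bit word


def encode_group(channel_flags):
    """Pack the flags of one group into three 32-bit words.

    channel_flags: list of "Gold", "Faire", "Suspicious" or "Fatal".
    Returns (faire_word, suspicious_word, fatal_word).
    Assumption: channel i uses bit i.
    """
    if len(channel_flags) > MAX_CHANNELS:
        raise ValueError("a group holds at most 32 channels")
    words = {name: 0 for name in WORD_NAMES}
    for bit, flag in enumerate(channel_flags):
        if flag == "Gold":
            continue
        if flag not in words:
            raise ValueError(f"unknown flag: {flag!r}")
        words[flag] |= 1 << bit
    return words["Faire"], words["Suspicious"], words["Fatal"]


def decode_group(faire, suspicious, fatal, n_channels):
    """Recover the per-channel flags from the three words."""
    flags = []
    for bit in range(n_channels):
        mask = 1 << bit
        if fatal & mask:
            flags.append("Fatal")
        elif suspicious & mask:
            flags.append("Suspicious")
        elif faire & mask:
            flags.append("Faire")
        else:
            flags.append("Gold")
    return flags


if __name__ == "__main__":
    flags = ["Gold", "Faire", "Fatal", "Suspicious", "Faire"]
    words = encode_group(flags)
    print(words, decode_group(*words, n_channels=len(flags)))
```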
4.2 Group Level
Channel information is collected by group to build
quality flags per logical detector part (the PSL, a mirror and its seismic
suspension,...) . There are about 20 such groups per interferometer. The
output of this test is similar to the channel test. It is one of the following
flags :
-
Gold : no more than one channel tagged Faire.
-
Faire : two or more channels tagged Faire.
-
Suspicious : at least one channel Suspicious or more than
5 channels Faire.
-
Fatal : at least one Fatal channel or more than 3 channels
Suspicious.
The result of the Group test is bit encoded by interferometer.
There are three 32-bit words for each interferometer: one for the groups tagged Faire,
one for the groups tagged Suspicious, and one for the groups tagged Fatal.
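The group-level rules listed above can be written as a short decision function. This is a sketch under one assumption: the rules are evaluated from the worst flag to the best, so that a group satisfying several conditions receives the most severe flag. The function name is illustrative.

```python
def group_flag(channel_flags):
    """Derive the flag of a group from the flags of its channels.

    channel_flags: list of "Gold", "Faire", "Suspicious" or "Fatal".
    Assumption: the most severe matching rule wins.
    """
    n_faire = channel_flags.count("Faire")
    n_suspicious = channel_flags.count("Suspicious")
    n_fatal = channel_flags.count("Fatal")

    if n_fatal >= 1 or n_suspicious > 3:
        return "Fatal"
    if n_suspicious >= 1 or n_faire > 5:
        return "Suspicious"
    if n_faire >= 2:
        return "Faire"
    return "Gold"  # no more than one channel tagged Faire


if __name__ == "__main__":
    print(group_flag(["Gold", "Faire", "Gold"]))
    print(group_flag(["Faire", "Faire", "Gold"]))
    print(group_flag(["Suspicious", "Gold"]))
    print(group_flag(["Fatal", "Gold"]))
```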
The proposed groups are (listed only for Hanford):
-
Laser per IFO: H1:PSL, H2:PSL
-
Mode Cleaner per IFO: H2:MC
-
Suspended Optical Element (suspension, optical lever, OSEM,
corresponding accelerometer): H2:RM, H2:BS, H2:FM1, H2:FM2...
-
LSC per IFO: H1:LSC,
H2:LSC
-
WSC per IFO: H1:WSC, H2:WFS
-
Acoustic sensors per building: H0:MIC-LV,
H0:MIC-MX, H0:MIC-MY, H0:MIC-EX, H0:MIC-EY
-
Seismometer and tiltmeter per building:
H0:GROUND-LV, H0:GROUND-MX, H0:GROUND-MY, H0:GROUND-EX, H0:GROUND-EY
-
Magnetometer per building: H0:MAG-LV,
H0:MAG-EX, H0:MAG-EY
-
Vacuum (?): H0:VAC
4.3 The IFO level
The group information is collected to form the interferometer
quality information. The IFO Quality Flag will be:
-
Gold : if no more than one group is tagged Faire.
-
Faire : if there is no Suspicious group and no more than 5
groups are tagged Faire.
-
Suspicious : if there is at least one group Suspicious or
more than 5 groups Faire.
-
Fatal : if there is one Fatal channel or more than 3 groups
Suspicious.
The result of the IFO quality flag will be stored in the
frame header with 2 bits per IFO (instead of 1 as described in the frame
spec.) (GOLD = 3, Faire = 2, Suspicious = 1, Fatal = 0).
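The 2-bit encoding of the IFO quality flag can be sketched as below. The numerical codes are the ones given above; the position of each interferometer within the header word (IFO number i in bits 2i and 2i+1) and the function names are assumptions of this sketch.

```python
FLAG_CODE = {"Gold": 3, "Faire": 2, "Suspicious": 1, "Fatal": 0}
CODE_FLAG = {code: name for name, code in FLAG_CODE.items()}


def pack_ifo_flags(flags):
    """Pack one quality flag per IFO into an integer, 2 bits per IFO.

    Assumption: IFO number i occupies bits 2*i and 2*i + 1.
    """
    word = 0
    for i, flag in enumerate(flags):
        word |= FLAG_CODE[flag] << (2 * i)
    return word


def unpack_ifo_flags(word, n_ifo):
    """Recover the list of IFO flags from the packed word."""
    return [CODE_FLAG[(word >> (2 * i)) & 0b11] for i in range(n_ifo)]


if __name__ == "__main__":
    word = pack_ifo_flags(["Gold", "Suspicious", "Faire"])
    print(word, unpack_ifo_flags(word, 3))
```

With this choice of codes, a word of zero means that every interferometer is Fatal, so an unfilled header is never mistaken for good data.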
4.4 Quality Flag usage
To be useful, these quality flags should not only tag obvious
problems. The thresholds therefore have to be set in such a way that we get a chance
to see potential problems. We can probably afford up to a few percent of
suspicious frames without losing too many good events. In that case, the
typical use would be that, for short bursts and for the end of a binary
coalescence, we ask for Gold or Faire frames. Since some low-mass inspirals
may last many frames, we may tolerate a Suspicious
frame at the beginning of the inspiral, if it does not carry a large
fraction of the signal-to-noise ratio, in order to limit the inefficiency due
to quality checks.
The problem for CW searches is different, since the signal
is very weak and statistical tests can be performed on the main output
itself. Such analyses will probably care only about Fatal flags.
5. Frame Content:
The frames will contain several parts:
-
The summary information itself. It
is a set of vectors which describe all the channels. Some associated information,
such as the channel names and sampling rates, is stored in static data.
-
The quality information, stored on a
frame-by-frame basis.
-
The calibration information, stored
in the static data.
-
The GW channel.
-
As an option, it is possible to store
snapshots of channels with strange behavior. The total amount of snapshot data
should be less than that of the main channel. This is equivalent to storing one or
two full frames per hour.
Channel Name (X should be H or L) | Frame Structure Type | Data Type | Sampling rate/size | Total size (MB) for a 1000-second file* | Comments
X:Raw-h | FrProcData | INT_4S | 2048 Hz | 4 | Main channel filtered down to 2 kHz
X:QcValues | FrSummary | INT_4U | ~3*70 values | 0.3 | Quality flags values
X:CsFMean | FrSummary | INT_2S | ~500 channels | 0.6 | mean value for fast channels**
X:CsFRms | FrSummary | INT_2S | ~500 channels | 0.6 | rms value for one channel (on a frame basis)**
X:CsFPwr32-128 | FrSummary | INT_2S | ~500 channels | 0.6 | power in the 32Hz-128Hz band**
X:CsFPwr128-1K | FrSummary | INT_2S | ~400 real channels | 0.5 | power in the 128Hz-1kHz band**
X:CsFPwr1K-4K | FrSummary | INT_2S | ~70 real channels | 0.2 | power in the 1kHz-8kHz band**
X:CsFChi2 | FrSummary | INT_2S | ~500 channels | 0.6 | chisquare for fast channels
X:CsSMean | FrSummary | INT_2S | ~700 channels | 0.6 | Slow channel mean values**
X:CsKey | FrSummary | FLOAT | 15 values for 40 channels | 1.4 | Table of parameters for the key channels
Original channel name | FrAdcData | - | on average, no more than one channel every two frames | 2 | Channel with strange behavior
Table 1: Information changing every frame
* Including the structure overhead and using a typical compression
factor of 2. This value has been measured during the first
tests.
** We store the relative variation of these parameters.
Channel Name | Data Type | Sampling rate/size | Update frequency | Total size (in MBytes) in a 1000 frames file | Comments
X:QcNames | STRING | ~60 values | once per file | .001 | Quality flags names (see table 2)
X:TF | REAL | ?? | once per file | ?? | Overall Calibration/Transfer Function (TBD)
X:CsFName | STRING | ~500 channels | once per file | .01 | Fast channel (>16Hz) names
X:CsFRates | INT_2U | ~500 channels | once per file | .001 | Fast channel rates
X:CsSNames | STRING | ~700 channels | once per file | .015 | Slow channel (<= 16Hz) names
X:CsSMean-<> | FLOAT | ~700 channels | every 50 frames | .06 | Mean of Mean Value for Slow channels
X:CsSMean-rms | FLOAT | ~700 channels | every 50 frames | .06 | rms of the Mean Value for Slow channels
X:CsFMean-<> | FLOAT | ~500 channels | every 50 frames | .04 | Mean of Mean Value for Fast channels
X:CsFMean-rms | FLOAT | ~500 channels | every 50 frames | .04 | rms of the Mean Value for Fast channels
X:CsFRms-<> | FLOAT | ~500 channels | every 50 frames | .04 | Mean of rms Value for Fast channels
X:CsFrms-rms | FLOAT | ~500 channels | every 50 frames | .04 | rms of the rms Value for Fast channels
X:CsFPwr32-128<> | FLOAT | ~500 channels | every 50 frames | .04 | Mean of the power in the 32-128 band
X:CsFPwr32-128-rms | FLOAT | ~500 channels | every 50 frames | .04 | rms of the power in the 32-128 band
X:CsFPwr128-1k-<> | FLOAT | ~500 channels | every 50 frames | .04 | Mean of the power in the 128-1k band
X:CsFPwr128-1k-rms | FLOAT | ~500 channels | every 50 frames | .04 | rms of the power in the 128-1k band
X:CsFPwr1k-8k-<> | FLOAT | ~500 channels | every 50 frames | .04 | Mean of the power in the 1k-8k band
X:CsFPwr1k-8k-rms | FLOAT | ~500 channels | every 50 frames | .04 | rms of the power in the 1k-8k band
Table 2: Static Data (Frame type: FrStatData)
Remark: Other information available in the FrAdcData could
be copied into the static data if needed.
Type of Data | Size for a 1000 frames file (MBytes)
GW data | 4
Channel Summary (Slow channels) | 0.6
Channel Summary (Fast channels) | 3.1
Channel Summary (Key channels) | 1.4
Detailed Quality Information | 0.3
Static Data | 0.56
Snapshots | 2
Frame Header, History, TOC | 0.4
Total | 12.4
Table 3: File Size
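The total of Table 3 and the corresponding data rate can be cross-checked with a few lines of arithmetic. The sketch below assumes one-second frames and 1 MByte = 1000 kBytes; the dictionary simply copies the entries of Table 3, and the function names are illustrative.

```python
# Sizes in MBytes for a 1000 frames file, copied from Table 3.
BUDGET_MBYTES = {
    "GW data": 4.0,
    "Channel Summary (Slow channels)": 0.6,
    "Channel Summary (Fast channels)": 3.1,
    "Channel Summary (Key channels)": 1.4,
    "Detailed Quality Information": 0.3,
    "Static Data": 0.56,
    "Snapshots": 2.0,
    "Frame Header, History, TOC": 0.4,
}

FRAMES_PER_FILE = 1000
SECONDS_PER_DAY = 86_400


def total_mbytes(budget):
    """Sum of all entries, in MBytes per file."""
    return sum(budget.values())


def rate_kbytes_per_s(budget, frames_per_file=FRAMES_PER_FILE, frame_length_s=1.0):
    """Average data rate, assuming one-second frames by default."""
    return total_mbytes(budget) * 1000.0 / (frames_per_file * frame_length_s)


def daily_gbytes(budget):
    """Volume per day in GBytes (1 GByte = 10**6 kBytes)."""
    return rate_kbytes_per_s(budget) * SECONDS_PER_DAY / 1.0e6


if __name__ == "__main__":
    print(f"total: {total_mbytes(BUDGET_MBYTES):.2f} MBytes per file")
    print(f"rate:  {rate_kbytes_per_s(BUDGET_MBYTES):.2f} kBytes/s")
    print(f"daily: {daily_gbytes(BUDGET_MBYTES):.2f} GBytes/day")
```

Under these assumptions the budget corresponds to a little over 12 kBytes/s, which is somewhat above the target of around 10 kBytes/s quoted in the motivations; the optional snapshots account for 2 MBytes of the total.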
6. Prototype results
A prototype version of a DSS builder has been set up and
run online on one of the DMT computers at Hanford. This allows us to
have access to an almost unlimited amount of data for tests. It is a preliminary
step before deciding which parts of this work need to be integrated within
LDAS.
Here are some plots of the channels
taken on October 4. Each figure contains two plots. The top one shows the
parameter value for all channels (horizontal axis) for the last frame. The
second plot has time as the horizontal scale (about half an hour of data)
and the channel number as the vertical scale. There is an entry in this plot
if the value is above some threshold.
A vertical line on the scatter plot
indicates strange running conditions which could be tagged as bad data. A
horizontal line corresponds to a channel which has a non-stationary behavior.
All these plots are preliminary and are
shown to give an idea of what we can do.
