===================
GAVO DaCHS Tutorial
===================
.. contents::
  :depth: 2
  :backlinks: entry
  :class: toc

Ingesting Data
==============
Starting the RD
---------------
To ingest data, you will have to write a resource descriptor (RD). We
recommend keeping everything handled by a specific RD together in one
directory that is a direct child of your inputs directory (see
installation and configuration), though you could group resources in
deeper subdirectories. So, go to your inputs directory and say::
mkdir lmcextinct
The directory name will (normally) appear in URLs, so it's a good idea
to choose something descriptive and short. This directory is called the
resource directory.
We recommend to put the RD in the root of this directory. A good
default name for the RD is "q.rd"; the "q" will appear in the default
URLs as well and usually looks good in there::
cd lmcextinct
vi q.rd
(where you can substitute vi with your favourite editor, of course).
Writing resource descriptors is what most of the operation of a data
center is about. Let's start slowly by giving some metadata::
  <?xml version="1.0" encoding="iso-8859-1"?>

  <resource schema="lmcextinct">
    <meta name="title">Extinction within the LMC</meta>
    <meta name="creationDate">2009-06-02T08:42:00Z</meta>
    <meta name="description">Extinction values in the area of the LMC...</meta>
    <meta name="copyright">Free to use.</meta>
    <meta name="creator.name">S. Author</meta>
    <meta name="subject">Large Magellanic Cloud</meta>
    <meta name="subject">Interstellar medium, nebulae</meta>
    <meta name="subject">Extinction</meta>
  </resource>
You need to adapt the encoding attribute in the XML prefix to match what
you are actually using if you plan on using non-ASCII characters; you
may want to use utf-8 instead of the iso-8859-1 given above, depending
on your computer's setup.
The schema attribute on resource gives the (database) schema that tables
for this resource will turn up in. You should, in general, use the
subdirectory name. If you don't, you have to give the subdirectory name
in a resdir attribute. This attribute must be the name of the resource
directory relative to the inputs directory specified in the
configuration.
In general, you should have exactly one RD per database
schema. This is not enforced, but sharing schemata between RDs will
cause many undesirable behaviours. An example is permissions: when
importing a table, the schema access rights are adapted. If you have
one RD A defining an ADQL-queriable table in schema X and another RD B
that has no ADQL-queriable table, importing A will make schema X
readable to untrusted queries, whereas importing B will make it
unreadable again; this would lead to query failures (which could, in
this case, be fixed by adding untrusted to B's readRoles manually, but
you get the idea).
Otherwise, there is only meta information so far. This metadata is
crucial for later registration of the service. In HTML forms, it is
displayed in a sidebar.
See also `More on Metadata`_.
Once you are here, you should "validate" your RD. This is, in general,
a good idea before doing anything with the RD, since it catches errors
more conveniently than the (in all likelihood even more byzantine) error
messages that may arise when something goes wrong later.
So, say::
gavo val q.rd
and read the output. If you don't understand what ``gavo val`` tells
you, complain to gavo@ari.uni-heidelberg.de -- the command is really
intended to help you catch errors, and if it doesn't do so, it's
either a bug in ``gavo val`` or the documentation, and in either case
we'd like to fix it.
You can also pass an RD id to ``gavo val``, and you can specify more
than one RD.
Defining Target Tables
----------------------
Within the DC, data is represented in database tables, while metadata is
mostly kept within the resource descriptors. A major part of this
metadata is the table structure. It is defined in table elements, which
usually are direct children of the resource element. A resource
element may contain multiple table definitions.
Such a table definition might look like this::

  <table id="lmcextinct" onDisk="True" adql="True">
    <meta name="description">Extinction values within certain areas
      on the sky.</meta>
    ...
  </table>
In a table definition, you must give id, which will double as the table
name within the database. The onDisk attribute specifies that the table
is to reside on the disk as opposed to in memory (in-memory tables have
applications in advanced operations). The adql attribute specifies that
no access restrictions are to be placed on the table; if you run an ADQL
or TAP service, users can access this table.
Table elements may contain metadata. You do not need to repeat metadata
given for the resource, because (in most cases) the DC performs metadata
inheritance. This means that if a table is asked for a piece of
metadata it does not have, it forwards that request to the embedding
resource.
Defining Columns
''''''''''''''''
The main content of table is a sequence of column elements. These
contain the definition of a single table column. The name attribute is
central in that it will be the column name in the database, the key for
the column's value in record dictionaries that the software uses
internally, and it is usually used to reference the column from the
outside. In DaCHS, column names must be legal identifiers for both
python and SQL; quoted SQL identifiers are thus not allowed.
The type attribute defaults to real, and can otherwise take values in
valid SQL datatypes. The DC software knows how to handle, in addition
to real,
* text -- a string. You can also use types like char(7) and the like,
but since that does not help postgres (or much anything else within
the DC), this is not recommended.
* double precision (or double) -- a floating point number. You should
  use doubles if you need to keep more than about 7 digits of mantissa.
* integer (or int) -- typically a 32-bit integer
* bigint -- typically a 64-bit integer
* smallint -- typically a 16-bit integer
* timestamp -- a combination of date and time. While postgres can
  process a very large range of dates, the DC stores timestamps in
  datetime.datetime objects, which means that for "astronomical" times
  (like 10000 B.C. or 10000 A.D.) you may need to use custom
  representations. Also, the DC assumes all times to be without time
  zones. Further time metadata (like distinguishing TT from UT) is
  given through STC specifications.
* date -- a date. See timestamp.
* time -- a time. See timestamp.
* box -- a rectangle.
* spoint, scircle, sbox, spoly -- objects of spherical geometry, taken
from pgSphere. Ask for documentation...
Some more types (like raw and file) are available to tables in service
definitions, but they should, in general, not appear in database tables.
Further metadata on columns includes:
* unit -- the unit the column values are in. The syntax is that defined
by Vizier, but that may change pending further standardization in the
VO. Unit is left out for unitless values.
* tablehead -- a very short string designating the content. This string
is typically used for display purposes, e.g., as table headings or
labels on input fields.
* description -- a longer string characterizing the content. This may
  show up in bubble help or VOTable descriptions. Since descriptions
  can be longer, you may want to put them in a child element rather than
  an attribute; in both cases, whitespace is normalized, so you can
  enter line breaks and similar for readability in the source; they will
  always be rendered as a single blank. For even longer, note-like
  material, see Notes_. An example for a long description::

    <description>The aperture is the full-width-half-mean of the
      response function of our sage 3000 hyper-detector.</description>
* ucd -- a Unified Content Descriptor as defined by IVOA. To figure out
"good" UCDs, the UCD resolver at
http://dc.zah.uni-heidelberg.de/ucds/ui/ui/form can help.
* required -- True if a value must be set in order for the record to be
  valid. By default, NULL (which in python is None) is a valid value
  for any column. For required columns, that is no longer the case.
  This is particularly important in connection with foreign keys.
Column elements may have a child element
`values <./ref.html#element-values>`_. This lets you specify
metadata like maximum or minimum, or enumerate possible values. The
most common use, though, is the definition of null literals. This is not
necessary for floats, and usually not even for strings, because these have
useful (and actually non-overridable) null values in the VOTable
representation (where this sort of thing counts most). It is, however,
highly recommended to give null literals when defining integral types
(including chars) that may have null values. DaCHS will try to pick
useful null values for those automatically when possible, but when
streaming tables, this is impossible, and errors will be raised during
VOTable rendering when NULLs are encountered in such a situation.
So, just define null values whenever you define a non-required integral
column, like this::
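
  <!-- a sketch; the column name is made up, values/nullLiteral is
  the part that matters -->
  <column name="nobs" type="integer"
      tablehead="#Obs" description="Number of observations">
    <values nullLiteral="-1"/>
  </column>
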
After you have imported a table, it is a good idea to run ``gavo info``
with the id of the freshly imported table, e.g.,::
gavo info myres/q#thistable
This will output several properties (min, max, avg) of numeric columns
that may help spot import errors; it will also say which columns contain
NULLs. Use this to either mark every column containing integers as
``required="True"`` (to tell other people that no NULLs are possible
here) or add an explicit null literal. Everyone will be grateful.
Parsing Input Data
------------------
After you have defined the table, you will want to fill it. You will
usually have one or more input files with "raw" data.
We recommend putting such input data files into a subdirectory of their
own named "data". Let's assume we have one input file for the table
above, called lmc_extinction_values.txt. Suppose it looks like this,
where tabs in the input are shown as "\\t"::
RA_min\\tRA_max\\tDEC_min\\tDEC_max\\tE(V-I)\\tA_V\\tA_I
78.910625\\t78.982146\\t-69.557417\\t-69.480639\\t0.04\\t0.092571\\t0.123429
78.910625\\t78.982146\\t-69.480639\\t-69.403861\\t0.05\\t0.115714\\t0.154286
78.910625\\t78.982146\\t-69.403861\\t-69.327083\\t0.05\\t0.115714\\t0.154286
The first step for ingestion is lexical analysis. In the DC software,
this is performed by grammars. There are many grammars defined, e.g.,
for getting values from FITS files, VOTables, or using column-based
formats; you can also write specialized grammars in python.
All grammars read "something" and emit a mapping from names to (mostly)
string values.
reGrammars
''''''''''
In this case the easiest grammar to use probably is the `reGrammar
<./ref.html#element-regrammar>`_. The idea here is that you give two
regular expressions to separate the file into records and the records
into fields, and that you simply enumerate the names used in the
mapping.
For the file given above, the RE grammar definition could look like
this::
  <reGrammar topIgnoredLines="1">
    <names>raMin, raMax, decMin, decMax, EVI, AV, AI</names>
  </reGrammar>
The names given are values of the name attribute in the table
definition.
If you checked the documentation on reGrammar, you will have noticed
that "names" is an "atomic child" of reGrammar. Atomic children are
usually written as attributes, since their values can always be
represented as strings. However, if strings become larger, it's more
convenient to write them in elements. The DC software allows you to
do just that in general: All attributes can be written as elements with
tags named like the attribute. So,
::

  <reGrammar topIgnoredLines="1"
    names="raMin, raMax, decMin, decMax, EVI, AV, AI"/>

would have worked just fine, as would::

  <reGrammar>
    <topIgnoredLines>1</topIgnoredLines>
    <names>raMin, raMax, decMin, decMax, EVI, AV, AI</names>
  </reGrammar>
Structured children, in contrast, cannot be written as plain strings and
thus can only be written in element notation.
Though grammars can be direct children of resource, they are usually
written as children of data elements (see below).
columnGrammars
''''''''''''''
Another grammar frequently useful when reading from text tables is the
`columnGrammar <./ref.html#element-columngrammar>`_. It allows a rather
direct translation of VizieR-like "byte-by-byte" descriptions.

Column grammars define ``col`` elements. Each of these has a ``key``
attribute that gives a name. This could be the ``name`` of a target
column in the simplest case, or it can be an auxiliary identifier that
you process in a rowmaker::

  <columnGrammar topIgnoredLines="1">
    <col key="raMin">1-9</col>
    <col key="raMax">10-18</col>
    ...
  </columnGrammar>
The first column has the index 1, and -- contrary to python slices --
the last index is included in the selection. No expansion of tabs or
similar is performed.
As potential column names, the keys must be valid python identifiers.
Mapping data
------------
A grammar produces a sequence of mappings from names to strings, the
rawdicts. The database, on the other hand, wants typed values, i.e.,
integers, time stamps, etc., internally represented as dictionaries
mapping column names to values, called rowdicts. Also, data in input
tables is frequently given in inconvenient formats (e.g., sexagesimal
angles), units not suited to further processing, or distributed over
multiple columns (e.g., date and time of an observation when we want a
single timestamp). It is the job of row makers to transform the rough
data coming from a grammar to whatever the table defines.
Basically, a row maker consists of

* `var <./ref.html#element-var>`_ s -- assignments of expression values
  to names in the rawdict.
* procedure applications (see `apply <./ref.html#element-apply>`_) --
procedural manipulations of both rawdicts and rowdicts.
* maps -- rowdict definition.
When building a rowdict for ingestion into the database, a rowmaker first
binds var names, then applies procedures and finally performs the mappings.
For simple cases, maps will suffice; you may actually even be able to do
without them. Maps must specify a dest attribute giving the rowdict key
that is defined. To specify the value, they can
* either give a src attribute specifying a rawdict key that will then be
converted to a typed value using "sane" defaults (e.g., integers will
be converted by python's int constructor, where empty strings are
mapped to None)
* or give a python expression in the character content, the value of
which is then directly used as value for dest. No implicit
conversions are performed.
In python expressions, you can access the data handed over by the
grammar as ``vars["key"]``; equivalently, you can use the abbreviation
``@key``. This notion is supported throughout rowmakers where
applicable; e.g., you can use it in late bindings of procedure
applications.
In the case above, you could start by saying::

  <rowmaker id="build_lmcextinct">
    <map dest="EVI" src="EVI"/>
    <map dest="AV" src="AV"/>
    <map dest="AI" src="AI"/>
  </rowmaker>

to copy over the rawdict (grammar) keys that directly map to table
column names. Since this is a bit unwieldy, the DC provides a
shortcut::

  <rowmaker id="build_lmcextinct" simplemaps="EVI:EVI,AV:AV,AI:AI"/>
which expands to exactly what is written above. The keys in each pair do not
need to be identical; the first item of each pair is the table column
name, the second the rawdict key.
The case where the names of rawdict and rowdict keys are identical is so
common (since the RD author controls both) that there is yet another
shortcut for this::

  <rowmaker id="build_lmcextinct" idmaps="EVI,AV,AI"/>

Idmaps sets up one map element, with both dest and src set to the same
value, for every name in the comma separated list idmaps.
You can abbreviate this further to::
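
  <rowmaker id="build_lmcextinct" idmaps="*"/>
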
idmaps values can contain shell patterns. They will be matched to the
column names in the target table. For every column for which there is
no explicit mapping, an identity mapping (with type conversion) will be
set up.
This leaves the bbox, centerAlpha, and centerDelta keys to be defined.
No literals for those appear in the rawdicts since they are not part of
the input data. We need to compute them.
To facilitate computations, we first turn the bounds to floats; this can
be done using vars::
  <var name="raMin">float(@raMin)</var>
  <var name="raMax">float(@raMax)</var>
  <var name="decMin">float(@decMin)</var>
  <var name="decMax">float(@decMax)</var>
No shortcut is available here, since this is a relatively rare thing.
You could use procDef/apply to save on keystrokes if you find yourself
having to do such simple conversions more frequently. Note the
@-notation. As mentioned above, you could equivalently have written
``vars["raMin"]``. Both spellings evaluate to the value of the given
name in the rawdict coming from the grammar.
The remaining computations can be performed in mappings::
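
  <!-- a sketch; check the coords.Box signature in the API
  documentation before relying on the argument order -->
  <map dest="bbox">coords.Box(@raMin, @raMax, @decMin, @decMax)</map>
  <map dest="centerAlpha">(@raMin+@raMax)/2.</map>
  <map dest="centerDelta">(@decMin+@decMax)/2.</map>
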
``coords.Box`` is the internal type for SQL Box
values; you will not usually see those. Still, you can access basically
the whole DC code in this mapping. At some point we will define an API
of "safe operations" that you can use without having to fear changes in
the DC code.
`Some functions useful for such mappings <./ref.html#functions-available-for-row-makers>`_
are listed in the reference manual.
Of course, you can have values that do not even depend on grammar
output::
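
  <!-- an illustrative constant; dest must of course be one of your
  table's columns -->
  <map dest="source">"LMC extinction survey"</map>
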
Null values are always troublesome. Within DaCHS, the null value
(almost) always is python's None. There is the rowmaker function
``parseWithNull`` to help you come up with those; say, some joker used
99.99 as a null value for a magnitude, you could say::
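
  <map dest="Vmag">parseWithNull(@Vmag, float, "99.99")</map>
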
If you need to scale this (or if null values are chosen such that they
are invalid literals to begin with), a feature that lets you null out a
value when a specific type of exception is raised comes in handy.
This is map's ``nullExcs`` attribute, which is just a comma separated list of
exceptions that should be caught and interpreted as "this is null". If,
in the example above, the source would give the magnitude in millimags
to save a comma, you could use::
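
  <map dest="Vmag" nullExcs="TypeError"
    >parseWithNull(@Vmag, float, "99999")/1000.</map>
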
If parseWithNull here returns None, a TypeError will be raised and caught,
and Vmag will be None.
You can turn more than one exception into None. For example, if
magicOffset has been parsed before and could be None, while
magicLit is to be parsed and has the empty string as a null literal, you
could write::
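
  <!-- a sketch: a None magicOffset raises TypeError, an empty
  magicLit raises ValueError; both are turned into NULLs -->
  <map dest="magic" nullExcs="TypeError,ValueError"
    >float(@magicLit)+@magicOffset</map>
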
If magicOffset is None, magic will be None via the TypeError, whereas
empty magicLits will result in Nones via a ValueError.
Data elements
-------------
We now have a table definition, a grammar, and a rowmaker. For purposes
of importing, these three come together in a data element. These
elements define what could be seen as the equivalent of a VOTable
resource together with a recipe of how to build it. For onDisk tables,
a side effect of building the data is that the tables are created in the
database; in that sense, data elements also define operations, a notion
that will become more pronounced as we discuss incremental processing.
Let us assemble the pieces we have so far::
  <resource schema="lmcextinct">
    <meta name="title">Extinction within the LMC</meta>
    <meta name="creationDate">2009-06-02T08:42:00Z</meta>
    <meta name="description">Extinction values in the area of the LMC...</meta>
    <meta name="copyright">Free to use.</meta>
    <meta name="creator.name">S. Author</meta>
    <meta name="subject">Large Magellanic Cloud</meta>
    <meta name="subject">Interstellar medium, nebulae</meta>
    <meta name="subject">Extinction</meta>

    <table id="lmcextinct" onDisk="True" adql="True">
      <meta name="description">Extinction values within certain areas
        on the sky.</meta>
      <!-- column definitions as discussed above -->
    </table>

    <data id="import">
      <sources pattern="data/*.txt"/>
      <reGrammar topIgnoredLines="1">
        <names>raMin, raMax, decMin, decMax, ev_i, a_v, a_i</names>
      </reGrammar>
      <make table="lmcextinct">
        <rowmaker idmaps="*">
          <var name="raMin">float(@raMin)</var>
          <var name="raMax">float(@raMax)</var>
          <var name="decMin">float(@decMin)</var>
          <var name="decMax">float(@decMax)</var>
          <!-- plus the maps for bbox, centerAlpha, and centerDelta
          shown above -->
        </rowmaker>
      </make>
    </data>
  </resource>
There are two new elements in data. For one, there's sources. Sources
specify where the data will find its input files in its pattern
attribute. This contains shell patterns that are interpreted relative
to the resource directory. You can give multiple patterns if necessary
like this::
  <sources>
    <pattern>inp2/*.txt</pattern>
    <pattern>inp1/*.txt</pattern>
  </sources>
There also is a recurse boolean attribute you can use when your sources
are distributed over subdirectories of the path part of the pattern.
The second new element is ``make``. It ties together a destination
table and the rowmaker using id references. You may want to define the
rowmaker as a direct child of make, which saves you some referencing.
Though make looks quite innocuous here, it is the element that drives
the action. You can have multiple make elements in a single data
element to build multiple tables (using different row makers) from the
same grammar output.
Makes can also carry scripts in SQL or python. For details, see
`Scripting <./ref.html#scripting>`_.
As you can see, we have put the grammar and the rowmaker into a data
element. They could also be direct children of resource, which might be
a good idea if they are used in more than one data; you would then give
the rowmaker an id (make_table, say) and say something like ``<make
table="lmcextinct" rowmaker="make_table"/>``.

Defining Indexes
----------------

Columns that are frequently used in queries should be indexed. Indexes
are defined using the `index <./ref.html#element-index>`_ element. It
is a child of table. In
general, index specifications can be rather involved, but simple cases
remain simple. If you just wanted to define an index on EVI, you could
say::
  <table id="lmcextinct" onDisk="True" adql="True">
    ...
    <index columns="EVI"/>
  </table>
(the columns attribute would be "A_V,EVI" if you wanted an index on both
columns).
However, indices are not always that simple. For example, for a spatial
index on centerAlpha, centerDelta, with the q3c scheme used by the DC
software you would have to write something like::
  <index columns="centerAlpha,centerDelta">
    q3c_ang2ipix(centerAlpha,centerDelta)</index>
The DC software has a mechanism that helps in this case: `Mixins
<./ref.html#mixins>`_. A mixin conceptually is a guarantee of certain
table properties, typically of the presence of certain columns; here, it
is just the presence of an index.
So, all you need to do to have a spatial index on the table is::
  <table id="lmcextinct" onDisk="True" adql="True"
      mixin="//scs#q3cindex">
    ...
  </table>
This is UCD magic at work -- q3cindex selects the columns with
pos.eq.*;meta.main as index columns. If you are curious how it does
this, check scs.rd in the system RD directory.
Mixins actually do much more than just help with indexing. Their main
purpose is the definition of interfaces that can be relied upon. For
example, an image table must have a certain structure determined by
the SIA protocol. The mixins ``//siap#pgs`` and ``//siap#bbox`` make
sure that tables have this structure, and they make sure that the table
containing information on all the files in the data center is updated
when the table is filled.
Starting the Ingestion
----------------------
At this point, you can run the ingestion::
gavo imp q
By default, ``gavo imp`` creates all data defined in a resource. If this is
not what you want, you can explicitly specify a data id to process::
gavo imp q content
For larger data sets, it may be wise to first try a couple of rows::
gavo imp --stop-after=300 q
Try ``gavo imp --help`` to see more options (most of which are probably
irrelevant to you now).
By the way, the ``gavo`` command has lots of subcommands. The
subcommand here has the full name ``import``; you could have said ``gavo
import`` or even ``gavo im``, since any unique prefix into the command
list is ok. Try ``gavo --help`` to see the commands available.
Note that gavo imp interprets the RD argument as a file first and then as
an RD id. An RD id is the inputs-relative path of the RD with the
extension stripped. Our example RD thus has the RD id lmcextinct/q, and
you could have said::
gavo imp lmcextinct/q
from anywhere in the file system.
Images
------
Images have relatively rich metadata. Partly these are covered in a
mixin called "products", but astronomical images have even more
metadata, like position on the sky or bandpass. To cope with them, use
the ``//siap#pgs`` mixin (and ignore the old bbox-based ``//siap#bbox``
mixin).
To define a table that carries images, simply mix in the appropriate
mixin::
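
  <!-- a sketch; the mixin provides all the SIAP-required columns -->
  <table id="images" onDisk="True" mixin="//siap#pgs"/>
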
(of course, you can add more columns if you need them).
Filling this table requires the use of a rowfilter and two procedure
applications. Let's look at a data element for this table (in outline;
adapt ids and header names to your data)::

  <data id="import_images">
    <sources pattern="data/*.fits"/>
    <fitsProdGrammar qnd="True">
      <mapKeys>
        <map key="dateObs">DATE-OBS</map>
      </mapKeys>
      <rowfilter procDef="//products#define">
        <bind name="table">"cars.images"</bind>
      </rowfilter>
    </fitsProdGrammar>
    <make table="images">
      <rowmaker idmaps="*">
        <apply procDef="//siap#computePGS"/>
        <apply procDef="//siap#setMeta">
          <bind name="title">vars["imageTitle"]</bind>
          <bind name="instrument">"%s, %s"%(vars["OBSERVAT"],
            vars["TELESCOP"])</bind>
          <bind name="dateObs">vars["dateObs"]+vars["startTime"]+(
            vars["endTime"]-vars["startTime"])/2</bind>
          <bind name="bandpassId">vars["FILTER"]</bind>
        </apply>
      </rowmaker>
    </make>
  </data>
So, step-by-step:
* When ingesting images, you will almost always read from FITS images,
i.e., FITS primary headers. A ``fitsProdGrammar`` delivers the
key-value-pairs from a header as a rawdict.
* The ``qnd`` attribute of the grammar is recommended. It makes
some (weak) assumptions to yield significant speedups with large
images.
* The ``fitsProdGrammar`` will map keys with hyphens to names with
  underscores, which is required to make them accessible in rowmakers.
  The ``map`` example above therefore is superfluous wherever it merely
  spells out that default behaviour. You may need other (non-automatic)
  name mappings, though, which would work analogously.
* The grammar further needs a rowfilter. These are procedure
  applications working on rawdicts. The //products#define rowfilter
  lets you add keys on owners and embargo in case you want password
  protection for images, but most importantly it defines what table the
  data is destined for. This is crucial information, and if you ever
  get it wrong, you need to manually connect to the database and issue
  a command like ``DELETE FROM products WHERE
  sourcetable='<your table>'``. So, always bind table. Make sure to
  include the quotes; this is supposed to be a valid python expression
  yielding a string.
* You then need to define a rowmaker that must apply two procs. For
  one, you need ``//siap#computePGS`` (if you mixed in ``//siap#pgs``).
  No bindings are required here.
* The second proc application required is ``//siap#setMeta``. Try to
  give all its keys somewhat sensible values; you will make your users'
  lives much easier.
Warning: Do *not* use idmaps="*" with SIAP, since the auto-generated
mappings will clobber the work of the xSIAP procs.
Debugging
---------
If nothing else helps you can watch what the software actually sends to
the database. To do that, set the GAVO_SQL_DEBUG environment variable
to any value. This could look like this::
env GAVO_SQL_DEBUG=1 gavo imp q create
The first couple of requests are for internal use (like checking that some
meta tables are present).
Publishing Data
===============
Once a table is in the database, it needs to get out again. Within
DaCHS, there are three parties involved in delivering data to the user:
* The core; it actually does the computation
* The renderer; it formats the result in some way requested by the user
and delivers it. There are renderers for web forms, VO protocols,
imges, etc.
* The service; it holds together the core and the renderer, can
reformat core results, controls the metadata, etc.
You will usually use pre-specified renderers, so these are not defined
in resource descriptors. What you have to define are cores and
services.
For core, you will usually use the `dbCore <./ref.html#element-dbcore>`_
in custom services, though `many other cores
<./ref.html#cores-available>`_ are predefined -- e.g., to run ADQL
queries, to upload files, or to do feedback queries --, and you can `define your
own <./ref.html#writing-custom-cores>`_ when you need special
functionality.
The dbCore generates a (single-table) query from condition descriptors
and returns a table that you describe through an output table. Cores
are defined as direct children of the resource. For the lmcextinction
table above, it could look like this::
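
  <!-- a sketch; the ids and the custom condDesc are made up -->
  <dbCore id="lmcsearch" queriedTable="lmcextinct">
    <condDesc original="//scs#humanInput"/>
    <condDesc buildFrom="EVI"/>
  </dbCore>
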
Cores always need an id. dbCores need a queriedTable attribute, the
value of which must be a table reference. This is the table the query
will run against.
CondDescs define input fields (for the form renderer, these are actually
form items people can fill in). Most commonly, you will either define
them using the ``original`` attribute or using ``buildFrom``. The
first case is typically used in connection with protocols and on tables
having mixins; such condDescs result in zero or more input fields, and
they typically inspect the queried table. For example, the humanScs
core in the example locates the "main" positions as identified by UCDs
and generates queries against them using two input fields, one it tries
to guess a position from, and another for the search radius.
When you define your condDesc using buildFrom, the result is almost
always a single input field that allows posing restrictions against the
column referred to in the buildFrom attribute, which in turn usually is
the name of a column in the table queried (though you could use any
field using id-based referencing). The software tries to make some
useful input definition from that column, which in particular means that
the types are "up-valued". String columns can be queried against using
Vizier-like string expressions, real and double precision columns using
Vizier-like float expressions, and so on. You can suppress that
behaviour using more verbose forms explained elsewhere.
Renderers other than form will expose the input fields in some other way
than form items. In all cases, however, the condDescs of
the dbCore define what fields can be queried.
The service now ties the core together with a renderer. It might look
like this::
  <service id="cone" core="lmcsearch">
    <meta name="shortName">lmcext_web</meta>
  </service>
While a service can run without a ``shortName``, not having one can lead
to trouble later, so you
should make a habit of assigning short names. See `the data checklist
<./data_checklist.html>`_ for more information on short names.
A service must have an id as well, and its core attribute must contain
the id of a core.
With this minimal specification, the service exposes a web form-based
interface. To try this, run a server::
gavo serve debug
and point a browser to http://localhost:8080/lmcextinct/q/cone/form (the
host part, of course, depends on your configuration; if you did not
change anything there, you should find the data at the given URL).
More on Tables
==============
Notes
-----
Frequently, you need to say more about a column than is appropriate in
the few-phrase description. Historically, such situations have been
handled using notes. Since notes can be reused for multiple columns, we
chose to follow that precedent rather than attach longish information
onto the columns themselves.
The notes themselves are kept in meta elements belonging to tables.
Since the notes tend to be markup-heavy, their default format is
restructured text. When entering notes in RDs, there is an attribute
``tag`` on these meta items::
  ...
  <meta name="note" tag="1">
    The meaning of the flag is as follows:

    ===== ==========
    value meaning
    ===== ==========
    1     value is 2
    2     value is 1
    ===== ==========
  </meta>
  ...
To associate a column with a note, use the column's note attribute::
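
  <column name="flag" type="smallint" note="1"/>
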
As tag, you may use basically any string, but it's a good idea to keep
it to numbers or at least characters not requiring URL encoding.
The notes will be exposed in HTML table heads, table and service
descriptions, etc. If you need to link to one, there is the built-in
tablenote renderer that takes the table and the note from its query path.
The most convenient way to use it is through the
built-in vanity name tablenote, where you would access the note above
using a URL like ``http://your.server/tablenote/demoschema.demo/1``.
STC
---
As soon as you have coordinates, you will want to define coordinate
systems on them. In the introductory example, that was not necessary
because SCS mandates that the coordinates you export are in ICRS, so
either your coordinates are in ICRS or you are violating the SCS
protocol -- in either case, nothing to declare.
In the more general case, you will want to say what is what in your
tables. DaCHS uses a language called STC-S to declare systems,
reference points, etc. The `STC-S Note`_ is a bit terse, but the good
news is that you will get by with a few features most of the time.
STC is defined in children of table elements, with references to table
columns in quoted strings::
Position ICRS "ra" "dec" Error "e_ra" "e_dec"
Position FK4 J1950.0 "ra_orig" "dec_orig"
You do not need to change anything in the column definitions themselves,
since the machinery will resolve your column references. If you refer
to non-existing columns, RD parse errors will be thrown.
More on Grammars
================
Row Generators
--------------
TBD
Source Fields
-------------
Grammars can have a sourceFields element. It contains a standard
procedure definition (i.e., you could predefine those and bind
parameters), but usually you will just fill in the code.
This code is called once for each source processed, and receives the
sourceToken as argument. It must return a dictionary, the key/value
pairs of which will be added to all rows returned by the row iterator.
The purpose of sourceFields is to precompute values that depend on the
source ("file") and are constant for all rows within it. An example for
where you need this is when you want to create backlinks to the file a
piece of data came from::
  <sourceFields>
    <code>
      srcKey = utils.getRelativePath(sourceToken,
        base.getConfig("inputsDir"))
      return locals()
    </code>
  </sourceFields>
You can then retrieve the path to the source file via the srcKey key in
rawdicts (and then, using render functions and static renderers, turn
this into links).
In addition to the sourceToken, you also have access to the data that
will be fed from the grammar. This can be used to, e.g., retrieve the
resource directory (``data.dd.rd.resdir``) or data descriptor properties
(``data.dd.getProperty("whatever")``).
Sometimes you want to do database queries from within sourceFields.
This is tricky when you access the table being written or otherwise
being accessed. This is because sourceFields code runs in the midst of
a transaction updating the table. So, something like::
base.SimpleQuerier().query(...)
will wait for the transaction to finish. But the transaction is waiting
for data that will only come when the query finishes -- this is a
deadlock, and gavo imp will just sit there and wait (see also the
material on deadlocks in the reference documentation).
To get around this, you need to query using the data's connection. So,
instead write::
base.SimpleQuerier(connection=data.connection).query(...)
More on Services
================
Custom Templates
----------------
Within the data center, most pages are generated from templates [XXX
TODO: write something about them generically]. This is true for the
pages the form renderer on services displays as well. To achieve special
effects, you may want to override them (though in general, it is a much
better idea to work within the standard template since that will give
your service all kind of automatic updates and would make, e.g., changes
much easier if your institution undergoes the yearly reorganization).
The default response template can be found in
resources/templates/defaultresponse.html in the installed tree. To
obtain the plainest output conceivable, try::

  <html xmlns:n="http://nevow.com/ns/nevow/0.1">
    <head>
      <title>No title</title>
    </head>
    <body>
      <!-- a sketch: pull in the query form and the result here using
      the n:render directives found in the default template -->
    </body>
  </html>
Save this to a file within the resource directory, let's say
"res/plain.html". Then, say (a sketch; check the template element in
the reference documentation)::

  <template key="form">res/plain.html</template>

in your service; this should give you a minimally decorated page.
Of course, this will display a severely degraded page. To get at least
the standard style sheet and the standard javascript, say::
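
  <!-- a sketch: "commonhead" is assumed to be the render directive
  the default template uses to pull in CSS and javascript -->
  <head n:render="commonhead">
    <title>No title</title>
  </head>
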
instead of the plain head.
More on Cores
=============
CondDescs
---------
dbCores and cores derived from them take most of their power from
condition descriptors or CondDescs. These combine inputKeys, which are
basically column objects with some additional presentation-related
information, with code generating SQL conditions.
A condDesc can contain zero or more input keys (though having zero input
keys makes no sense for user-defined condDescs since they would never
"fire"). Having more than one input key is useful when input quantities
can only be interpreted when present as a group. An example is the
standard cone search, where you need both a position and a search
radius.
Automatic and manual control
''''''''''''''''''''''''''''
However, most condDescs correspond to one input key, and the input key
is mostly derived from a table column. This is the standard idiom::
where somecol is a column in the table queried by the core. This
construct will cause an input key to be built from somecol. While
doing this, the type will be mapped automatically. The primary rules
are:
* Numeric types will get mapped to numeric vizier-like expressions
* Datetimes will get mapped to date vizier-like expressions
* text and chars will get mapped to string vizier-like expressions
* enumerated values (i.e., columns with value elements giving options)
will not become vizier-like expressions but input keys that yield
selection widgets.
To have more control (e.g., if you do not want to allow vizier-like
expressions), give the input key yourself::
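
  <!-- a sketch of overriding the generated input key -->
  <condDesc>
    <inputKey original="somecol" required="False"/>
  </condDesc>
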
(which would make a column required in the table optional in the query),
or::
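
  <condDesc>
    <inputKey original="somecol" type="text"/>
  </condDesc>
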
(which creates an input key matching everything literally), or even::
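
  <!-- a sketch; values/option is what produces a selection widget -->
  <condDesc required="True">
    <inputKey original="somecol">
      <values>
        <option title="low state">1</option>
        <option title="high state">2</option>
      </values>
    </inputKey>
  </condDesc>
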
-- if the input key is required, queries not giving it will be rejected.
The title attribute on option gives the label of an option in the HTML
input widget; if it's missing, a string representation of the value will
be used.
In all those cases, the SQL generated from the condDesc is a conjunction
of the input key's individual SQL expressions. Those, in turn, are
simply comparisons for equality for plain types and more or less
arbitrary expressions for vizier expression types.
Incidentally, two properties on inputKeys are defined to only show
inputs for certain renderers, viz., ``onlyForRenderer`` and
``notForRenderer``. Both have single strings as values. This is
intended mainly for cases like SIAP and SCS where there are
"human-oriented" versions of the input fields available. The built-in
SCS and SIAP conditions already do that, so you can give both scs and
humanSCS conditions in a core. Here is how you would define an input
key that is only used for the form renderer::
  <inputKey original="someColumn">
    <property name="onlyForRenderer">form</property>
  </inputKey>
Phrase makers
'''''''''''''
For complete control over what SQL is generated, condDescs may contain
code called a phrase maker. This, again, is a procedure application,
quite like with rowmaker procs, except that the signature of condDesc
code is different.
Phrase maker code has the following names available:
* inputKeys -- the list of input keys for the parent CondDesc
* inPars -- a dictionary mapping inputKey names to the values
provided by the user
* outPars -- a dictionary that is later used as the parameter
dictionary to the query.
The code should amend the outPars dictionary with the keys mentioned in
the conditions. The conditions themselves are yielded. So, a very
simple condDesc with generated SQL could look like this::

  <condDesc>
    <inputKey original="val"/>
    <phraseMaker>
      <code>
        outPars["xxyy"] = "x"*inPars.get("val", 20)
        yield "someColumn=%(xxyy)s"
      </code>
    </phraseMaker>
  </condDesc>
However, using fixed names in outPars is not recommended, if only
because condDescs could be used multiple times. The recommended way
uses the vizierexprs.getSQLKey function. It takes a name, a value, and
the outPars dictionary. It will return a key unique to the query in
question and enter the value into the outPars dictionary under that key.
While that sounds complicated, it is actually rather harmless, as shown in
the following real-world example that lets users input date, time and an
interval in split-up form (e.g., when you cannot hope anyone will try to
write the equivalent vizier-like expressions)::
  <phraseMaker>
    <code>
      baseTS = datetime.datetime.combine(inPars["date"], inPars["time"])
      dt = datetime.timedelta(minutes=inPars["within"])
      yield "date BETWEEN %%(%s)s AND %%(%s)s"%(
        vizierexprs.getSQLKey("date", baseTS-dt, outPars),
        vizierexprs.getSQLKey("date", baseTS+dt, outPars))
    </code>
  </phraseMaker>
More on Metadata
================
In general, most metadata for services and resources rather closely
follows what's defined in `Resource Metadata for the Virtual
Observatory`_; see also the `Reference Manual on RMI-style metadata`_.
Coverage
--------
One tricky spot is coverage, i.e., the parts of the STC space covered
by what's in the resource. In general, you will define coverage more
or less like this::
  <meta name="coverage">
    <meta name="profile">AllSky ICRS</meta>
    <meta name="waveband">Optical</meta>
  </meta>
The easy part is the waveband. Values here are from a fixed set of
strings, viz., Radio, Millimeter, Infrared, Optical, UV, EUV, X-ray,
Gamma-ray; capitalization is important, and you may give multiple
elements (the software doesn't enforce this selection, but your registry
documents will become invalid if you use anything else).
The coverage.profile meta item has STC-S strings as values. See the
`STC-S Note`_ as well as the `STC library documentation`_ for more
information on the STC-S understood by DaCHS. In principle, you can
get fancy here; for example, you could write::
TimeInterval TT BARYCENTER 1999-10-01T20:30:00 1999-10-02T20:30:10
unit s Error 10 Resolution 1 2
Circle FK5 J1980.0 GEOCENTER 0.13 0.45 0.03 unit rad
PixSize 0.0001 0.0001
SpectralInterval HELIOCENTER 2000 6000 unit Angstrom Error 1
RedshiftInterval TOPOCENTER VELOCITY RELATIVISTIC -10 10 unit km/s
However, the registries probably evaluate not very much of this
information as yet, and you most certainly should try to give positions
in ICRS.
Copyright
---------
Within the astronomical community, licensing issues have traditionally
played a minor role – if you referenced properly, using data from other
people was not only ok, it was encouraged. We should keep it that way,
even in the days of easy reproducibility. Still, formal statements
about how your data may be used may be useful. These statements are
called licenses.
RMI has the copyright meta for this purpose. Right now, DaCHS doesn't do much
with this information; it includes it in VOResource records, and the
default response template shows it below the query form. We recommend
either specifying something like "The data is in the public domain" or,
if you want to use something that's more in line with scientific habits,
the `Creative Commons Attribution`_ ("CC-BY"). To support this, DaCHS
includes a macro that can be used in meta elements that are direct
children of the resource element. Use it like this::
  <meta name="copyright" format="rst">\RSTccby{Image metadata}
    Usage conditions for individual images could differ. See the
    COPYING FITS header.</meta>
The advantage of using the macro is that you get a nice image, and in
the future we may expand this to a formal, machine-readable declaration.
.. _Creative Commons Attribution: http://creativecommons.org/licenses/by/3.0/
.. _Reference Manual on RMI-style metadata: ./ref.html#rmi-style-metadata
.. _STC library documentation: ./stc.html
.. _STC-S Note: http://www.ivoa.net/Documents/Notes/STC-S/
Active Tags
===========
Active "tags" delemit elements within resource descriptor XML that
do not directly contribute to result tree. Their typical use is to
"record" event sequences and replay them later. Much of this is used
internally. However, some applications of active tags are interesting
for RD writers, too. Active tags always have names in all upper-case.
LOOP
----
Loop lets you create multiple elements by rules. The simplest way to
use it is by giving a space-separated list of "items"::
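
  <!-- a sketch of LOOP; \item is replaced by each listItems entry -->
  <LOOP listItems="U B V R">
    <events>
      <column name="mag\item" type="real"
        description="Magnitude in the \item band"/>
    </events>
  </LOOP>
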
The ``events`` child of the ``LOOP`` element creates a list of events
(think "begin column element", "value for name attribute", "end column
element"). These events are then replayed to the parser for each item
in the LOOP's ``listItems`` attribute. Each occurrence of the
``\\item`` macro is replaced with the current item. So, in the
resulting RD tree, the fragment above will have the same result as::
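
  <column name="magU" type="real" description="Magnitude in the U band"/>
  <column name="magB" type="real" description="Magnitude in the B band"/>
  <column name="magV" type="real" description="Magnitude in the V band"/>
  <column name="magR" type="real" description="Magnitude in the R band"/>
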
Sometimes the list items are used in multiple places in the same
document. To avoid having to maintain multiple lists, you can define
macros using RD's ``macDef`` element; this could look like this::
  <macDef name="bands">U B V R</macDef>
  ...
  <LOOP listItems="\bands">
    ...
  </LOOP>
Note that macro names must be at least two characters long.
Frequently, the loop variable should not just take on a single string.
For such cases you can feed in tuples. The most convenient way to do
this is ``csvItems``. The content of this element is a string literal
containing comma separated values *with labels*, i.e., parsable with
python's csv.DictReader. In your events, you can then refer to the
labeled items using macros. For example::
  <LOOP>
    <csvItems>
      band,source
      U,10-12
      V,13-16
    </csvItems>
    <events>
      <col key="mag\band">\source</col>
    </events>
  </LOOP>
TODO: EDIT actives?
Publishing DAL Services
=======================
DAL is VO-speak for "Data Access Layer", the standard protocols the VO
uses to allow remote querying of data. To support such a protocol, you
usually need to arrange things in three places:
* The table queried needs a certain set of columns
* The core must support certain input and output fields
* The renderer must exhibit specified behaviour as regards, e.g., the
formatting of error messages, and it may require protocol-specific
metadata
This section discusses the individual protocols in turn.
SCS
---
SCS, the simple cone search, is the simplest IVOA DAL protocol -- it
is just HTTP with RA, DEC, and SR parameters plus a special way to
encode errors (in a way somewhat different from what has been specified for
later DAL protocols).
Tables
''''''
In principle, SCS can expose any table that has exactly one column each
with the UCDs pos.eq.ra;meta.main and pos.eq.dec;meta.main. The query
is then run against the position specified in this way.
However, you almost always want to have a spatial index on these
columns. To do that, use the ``//scs#q3cindex`` mixin on the tables, like
this::
  <table id="data" onDisk="True" adql="True" mixin="//scs#q3cindex">
    ...
  </table>
Cores
'''''
The SCS core simply is a dbCore. You must include the SCS condDesc,
like this::
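
  <!-- a sketch; //scs#protoInput provides the RA, DEC, SR fields -->
  <dbCore id="scscore" queriedTable="data">
    <condDesc original="//scs#protoInput"/>
  </dbCore>
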
There is an alternative condDesc more suitable for humans. The two can
be used in parallel. The form renderer will then use the human-oriented
one, the DAL renderer the protocol one. Thus, you will usually write::
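
  <!-- a sketch; magV stands in for a custom query field -->
  <dbCore id="scscore" queriedTable="data">
    <condDesc original="//scs#protoInput"/>
    <condDesc original="//scs#humanInput"/>
    <condDesc buildFrom="magV"/>
  </dbCore>
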
The example also shows how to add a custom query field. If you want to
add a larger number of them, you would use an active tag::
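
  <LOOP listItems="magV magB magR">
    <events>
      <condDesc buildFrom="\item"/>
    </events>
  </LOOP>
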
Service
'''''''
To expose that core through a service, just allow the scs.xml renderer
on it. As the core is built, you can have a web-based form interface
for free::
  <service id="cone" core="scscore" allowed="form,scs.xml">
    <meta name="title">Nice Catalog Cone Search</meta>
    <meta name="shortName">NC Cone</meta>
    <meta name="testQuery.ra">10</meta>
    <meta name="testQuery.dec">10</meta>
    <meta name="testQuery.sr">0.01</meta>
  </service>
The meta information given is used when generating registration
records. In particular, you should make sure that a query with the
given ra, dec, and sr actually returns some data.
SIAP
----
DaCHS' SIAP implementation right now assumes you are publishing FITS
files with WCS headers. Other arrangements are of course possible, but
you'd have to write your own computeXXX procDef.
Tables
''''''
SIAP-capable tables should mix in ``//siap#pgs`` (the older
``//siap#bbox`` is deprecated; you could still use it if for some reason
you have no pgSphere).
When building them, use the ``//siap#computePGS`` and ``//siap#setMeta``
applys. Since SIAP tables contain products, you also need the
``//products#define`` row filter in the input grammar (which, of
course, needs to be a fitsProdGrammar).
Cores
'''''
TBD.
For the SIAP cutout core, the SIAP human condDesc must have ``required``
True, since the core will retrieve the default cutout size from the field
size. The SIAP protocol condDesc is required anyway.
Service
'''''''
TBD.
SSAP
----
Tables
''''''
Currently, we only support "homogeneous" data collections, i.e., tables
for which every data set was generated by the same instrument, code, or
similar. Those mix in ``//ssap#hcd``. This mixin has lots of
parameters that define the instrument; see
`the SSAP HCD mixin in the ref doc <./ref.html#the-ssap-hcd-mixin>`_.
For example, you could say::
  <table id="data" onDisk="True">
    <!-- bind the mixin's parameters (instrument, calibration, and so
    on) as attributes on the mixin element; see the reference -->
    <mixin>//ssap#hcd</mixin>
  </table>
To fill such a table, it is recommended to use the ``//products#define``
rowfilter and the ``//ssap#setMeta`` rowmaker apply. This could look
like this::
"\schema.data"@FILENAME"ivo://org.gavo.dc/ccd700/q#"+@FILENAME
Caution: In the ssa table, we force the spectral axis to be a wavelength
in meters. You must convert all values manually if necessary. For the
spectra themselves you could use different units, but in our experience
that's more confusing than helpful.
In contrast to images where delivering FITS is likely all you need,
there's a plethora of formats spectra are delivered in. To help a bit,
you should make sure one of the formats you offer are VOTables
conforming to the spectral data model (see `Making SDM Tables`_). If
you want to deliver the "native" format as well, you'll have to have two
rows for each spectrum. The standard way to achieve that is through a
rowfilter in the grammar importing the spectra, like this::
  <rowfilter>
    <code>
      baseAccref = os.path.splitext(row["prodtblPath"])[0]
      row["prodtblAccref"] = baseAccref+".txt"
      row["prodtblMime"] = "text/plain"
      # this is the file as delivered from upstream
      yield row
      row["prodtblAccref"] = baseAccref+".vot"
      row["prodtblPath"] = "dcc://\rdIdDotted/mksdm?"+baseAccref+".txt"
      row["prodtblMime"] = "application/x-votable+xml"
      # this is our processed SDM VOTable
      yield row
    </code>
  </rowfilter>
SSAP's FORMAT parameter lets clients select what they want. The way
the default FORMAT argument works, only application/x-votable+xml
records are considered compliant.
Cores
'''''
Use the ssapCore for SSAP services. You must manually feed in
the condition descriptors for the SSAP parameters. For homogeneous data
collections, this is::
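
  <!-- a sketch; FEED replays the distributed condDescs into the core -->
  <ssapCore id="ssacore" queriedTable="data">
    <FEED source="//ssap#hcd_condDescs"/>
  </ssapCore>
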
The ``hcd_condDescs`` includes condition descriptors for all mandatory
and optional parameters meaningful in the case of homogeneous data
collections (i.e., excluding those that match against constant values).
Some of them may not be relevant to your service because your table
never has values for them. For example, theoretical spectra will
typically not give information on positions. The SSAP spec says that
such a service should ignore POS rather than returning the empty set.
If you think you must ignore certain conditions, you can use the PRUNE
active tag. This looks like this::
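
  <ssapCore id="ssacore" queriedTable="data">
    <FEED source="//ssap#hcd_condDescs">
      <PRUNE id="coneCond"/>
    </FEED>
  </ssapCore>
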
Do not do this just because you don't have position information -- this
would mean that you would dump your complete archive for (typical)
queries with a position, and that is neither required by the spec (even
if you might think so at first reading) nor desirable.
Here is a table of parameter names and ids; you can always check them
in ``$gavo_installed/resources/inputs/__system__/ssap.rd``.
============== ===========
Parameter name condDesc id
============== ===========
POS, SIZE      coneCond
BAND           bandCond
TIME           timeCond
============== ===========
For APERTURE, SNR, REDSHIFT, TARGETNAME, TARGETCLASS, PUBDID,
CREATORDID, and MTIME, the condDesc id simply is the parameter name
with ``_cond`` appended, e.g., ``APERTURE_cond``.
To have custom parameters, simply add condDesc elements as usual::
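
  <condDesc buildFrom="t_eff"/>
  <condDesc buildFrom="log_g"/>
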
For SSAP cores, ``buildFrom`` will enable "PQL"-like query syntax such
that users can post arguments like ``20000/30000,35000`` to ``t_eff``.
Service
'''''''
To expose SSAP services, use the `ssap.xml renderer`_. The metadata
keys required for registration of these are documented in the reference
manual. A complete declaration of a published SSAP service would then
look like this::
  <service id="ssa" core="ssacore" allowed="form,ssap.xml">
    <meta name="shortName">mydata SSAP</meta>
    <meta name="ssap.dataSource">theory</meta>
    <meta name="ssap.creationType">archival</meta>
    <meta name="ssap.testQuery">MAXREC=1</meta>
    <publish render="ssap.xml" sets="ivo_managed"/>
  </service>
This service will expose all standard SSAP query parameters, and
additionally condDescs built from the ``t_eff`` and ``log_g`` columns in
the source table (see above).
Incidentally, in web versions of such services, you may want to have
specview-based "quick-view" links based on the ``run`` system rd that
exposes the specview template. Here's an example of an ``outputTable``
(that would reside in the service element)::
Some less cody approach would be welcome, but we'd need to collect some
experience what people expect there. Also note that specview is (or
possibly was, when you're reading this) very picky in what it accepts as
VOTables; in the example, the ``dm=sed`` parameter is used to instruct
DaCHS' SDM-making machinery to come up with a table palatable by current
specviews.
.. _ssap.xml renderer: ./ref.html#the-ssap-xml-renderer
Making SDM Tables
'''''''''''''''''
Compared to images, the formats situation with spectra is a mess.
Therefore, in all likelihood, you will need some sort of conversion
service to VOTables compliant to the spectral data model. DaCHS has a
facility built in to support you with this.
First, you will have to define the "instance table", i.e., a table
definition that will contain a DC-internal representation of the
spectrum according to the data model. There's a mixin for that::
  <table id="instance" onDisk="False">
    <!-- a sketch; ssaTable is assumed to name the SSA table the
    metadata is taken from -- check the mixin's parameter names -->
    <mixin ssaTable="data">//ssap#sdm-instance</mixin>
  </table>
In addition to adding lots and lots of params, the mixin also defines
two columns, ``spectral`` and ``flux``; these have units and ucds as
taken from the SSA metadata. You can add additional columns (e.g., a
flux error depending on the spectral coordinate) as required.
The actual spectral instances get built by sdmCores. These cores,
while potentially useful with common services, are intended to be used
by the product renderer for dcc product table paths. They contain a
data item that must yield a primary table that is basically sdm
compliant. Most of this is done by the //ssap#feedSSAToSDM apply
proc, but obviously you need to yield the spectral/flux pairs (plus
potentially more stuff like errors, etc., if your spectrum table has
more columns). This comes from the data item's grammar, which probably must
always be an embedded grammar, since its sourceToken is an SSA row in
a dictionary. Here's an example::
labels = ("spectral", "flux")
relPath = self.sourceToken["accref"].split("?")[-1]
with self.grammar.rd.openRes(relPath) as inF:
for ln in inF:
yield dict(zip(labels,ln.split()))
The sdmCores are always combined with the sdm renderer. It passes an
accref into the core that gets turned into a row from the queried table;
this must be an "ssa" table (i.e., right now something that mixes in
``//ssap#hcd``). This row is the input to the embedded data descriptor.
Hence, this has no sources element, and you must have either a custom
or embedded grammar to deal with this input.
The actual data have to be located in the grammar; if they are in a text
file, you could have a grammar for parsing those somewhere in the RD
(TODO: example), or you could have the actual spectral data in the
database. Whatever – the grammar has to return spectral and flux
values. Also make sure that what you return actually has the
units promised by the metadata.
To set the params from the ssa row, use the ``//ssap#feedSSAToSDM``
apply procDef in a ``parmaker``; this should mostly suffice in terms of
metadata definition. When you have no additional columns, the default
rowmaker (with ``idmaps="*"``) will do in the ``make`` of the spectrum
table.
ObsTAP
------
ObsTAP is basically a single table, ivoa.ObsCore. In DaCHS, this is a
view generated from input tables. To include the products within a
table, you must use one of the mixins from the //obscore RD and fill out
some of the mixin's parameters. There is some documentation on what to
put where in the mixin documentation, but frankly, as a publisher, you
should have at least passing knowledge of the obscore data model. See
the corresponding IVOA document [XXX TODO: add link when there's a WD
out].
In the simplest case, a SIAP table, you could get by simply adding::
mixin="//obscore#publishSIAP"
to the table definition's start tag. You do not have to re-import a
table to publish it to obscore after the fact – ``gavo imp -m <rd-id>
&& gavo imp //obscore create`` will include an existing table in the
obscore view.
Even for SIAP, you will usually want to add metadata not contained
in DaCHS' SIAP meta. To do this, add a mixin element to the table
definition's body::
  <!-- a sketch; calibLevel stands in for whatever mixin parameters
  you need to set -->
  <mixin
      calibLevel="2">//obscore#publishSIAP</mixin>
On a table import, the obscore table will automatically be recreated to
include the data. If you retrofit ObsCore support to large tables, you
can avoid having to re-import everything by adding the mixin clause and
then updating the metadata. In that case, you must manually remake the
obscore table::
gavo imp -m path/to/my/rd
gavo imp //obscore create
Publishing DaCHS-managed tables via TAP
---------------------------------------
In the simplest form, all you need to do to publish a table through
the TAP endpoint is to add an ``adql="True"`` attribute to the table
definition and update the metadata (by saying ``gavo imp -m <rd-id>``).
You should, however, take particular care that there's a useful
description of the table, usually as a direct meta on the table.
Keep in mind that people will stumble across the table in some sort of
registry and need to be able to figure out whether the table contains
useful data by that description and the column metadata alone.
The TAP endpoint only exposes rather limited metadata. At least when
there is no published service on the table, you may want to just publish
the data to the registry, too. This leads to a much richer set of
metadata, increasing people's chances of locating the data.
To publish a nonservice (usually a table definition, but you can
register data descriptors containing multiple tables, too), use
the `register Element <./ref.html#element-register>`_. For a simple
table, just writing ``<register/>`` is enough, since the set name
defaults to ``ivo_managed`` and ADQL-accessible tables are automatically
related to the TAP service.
When ``register`` is the child of a data item, you need to manually
declare that child tables are TAP-accessible, like this::
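
  <register services="//tap#run"/>
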
Another thing you might want to do when publishing tables to TAP is add
sample queries for them. As an extension to the usual tap_schema, DaCHS
has an example table giving a name, a query and a description. TAP
clients may exploit these examples to help users figure out what to
usefully do with more arcane tables, and of course you can explain
more interesting features of your server or data here.
To add an example, create a file with a name ending in ``.sample`` in
``$GAVO_INPUTS/__system__/adqlsamples/``. The grammar for these files is
defined in ``//tap#import_examples``. You write the three keys, viz.,
name, query, and description, in this sequence, each followed by a
double colon and any material you want in the field. The keys must
start at the beginning of a line. You must add a double period at the
end of the file, and it's one file per example. Here's what this should
look like::
name::katkat bibliography
query::
select *
from katkat.katkat
where gavo_hasword('variable', source)
and minEpoch<1900
description::To search for title (or other) words in katkat's source field in
some sort of bibliographic query, use the gavo_hasword locally defined
function. This basically works a bit like you'd expect from search engines:
case-insensitive, and oblivious to any context.
..
After adding an example, run ``gavo imp //tap import_examples`` to
update the database table.
Publishing existing tables via TAP
----------------------------------
If you already have a database table and now want to use DaCHS to
publish it via TAP, just write an RD as described above, except that the
data element is trivial. Here's a sketch of what that could look like
(ids and names are, of course, yours to choose)::
  <resource schema="myschema">
    <meta name="title">My great table</meta>
    ... (more metadata)

    <table id="main" onDisk="True" adql="True">
      <column name="id" type="text"
        description="id of object covered here"/>
      ...
    </table>

    <data id="import">
      <make table="main"/>
    </data>
  </resource>
Then, say ``gavo imp -m <rd-id>``; make sure you don't forget the
``-m``, because without it, ``gavo imp`` will drop the existing table if it
can, i.e., if gavoadmin has write access to the schema in question, and
it should have that for reasons explained in the next paragraph.
This adds the metadata you've given to all kinds of administrative
tables DaCHS keeps but does not touch the data. It will also try to fix
the permissions of the table such that DaCHS's untrusted user can read
it. To let DaCHS manage the permissions, in psql say (assuming standard
profiles)::
  GRANT ALL PRIVILEGES ON SCHEMA <schema> TO gavoadmin
    WITH GRANT OPTION;
  GRANT SELECT ON <schema>.<table> TO gavoadmin
    WITH GRANT OPTION;
If you have local users accessing the table, you should declare
them in either the allRoles or readRoles attributes of the table
definition. It may even make sense to adapt the profiles in
``GAVO_ROOT/etc`` to match your existing infrastructure.
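For instance, to let the (hypothetical) local database roles ``alice``
and ``bob`` read a table directly, you could write something like::

  <table id="data" onDisk="True" adql="True"
      readRoles="alice,bob">
    ...
  </table>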
The Registry Interface
======================
Introduction
------------
Conceptually, the VO's Registry is a set of resource records (i.e.,
descriptions of services, data, or other entities) to let users locate
resources relevant to them (e.g., look for a service giving surface
temperatures for OB stars). Whatever has a resource record is called a
*VO resource* in the following, to keep it apart from what DaCHS
resource descriptors describe; DaCHS RDs may describe zero, one, or
multiple VO resources. We apologize for the confused nomenclature.
Physically, there are several services that keep and update this set and
let people query them (a "full registry"), e.g., the `VAO registry`_,
the `ESAVO registry`_, or the Astrogrid registry. All these should
harvest each other and thus have identical content (this is currently
not always true).
To be part of the VO, you have to register your services. DaCHS makes
this fairly easy since it contains a publishing registry. This is again
a service that exposes a standard interface defined by the Open
Archives Initiative. There is a renderer for the OAI harvesting protocol
(`OAI-PMH`_) called ``pubreg.xml`` that goes together with
``registryCore``. The service ``//services#registry`` with this
renderer has a vanity name of ``/oai.xml``, which is your data center's
publishing registry "endpoint". Full registries obtain the resource
records present on your data center from there.
Each VO resource has a unique identifier of the form::

  ivo://<authority>/<resource key>

The resource key is defined by the DaCHS software (to be
``<rd path>/<service id>``), whereas the authority is a globally unique
string. It is recommended that you use your DNS name (or some
appropriate part of it), which will provide some uniqueness. The
authority is declared in your gavorc (see below). Details on VO
identifiers can be found in `IVOA Identifiers`_.
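For example, with the (made-up) authority ``org.example.dc``, a service
with the id ``cone`` in the RD ``lmcextinct/q.rd`` would end up with the
identifier ``ivo://org.example.dc/lmcextinct/q/cone``.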
To claim an authority, you have to define who you -- as an organization
-- are. For this, DaCHS will create a resource record for your
organization, too, where "your organization" for DaCHS means whatever
you give as creator.name in defaultmeta (see below), which in general
should be something like "My Institute Data Center" rather than "My
Institute". You can register "My Institute" as well, if you want, but,
the way things are written now, not as the entity running managing the
authority.
To make the VO aware of the existence of your data center, you will need
to tell the `RofR`_ (Registry of Registries) about your data center.
Before you can do this, you need to fill in quite a bit of information
in gavorc. The next section explains how this is done.
.. _ESAVO registry: http://esavo.esa.int/registry/index.jsp
.. _VAO registry: http://nvo.stsci.edu/vor10/index.aspx
.. _IVOA identifiers: http://www.ivoa.net/Documents/REC/Identifiers/Identifiers-20070302.html
.. _RofR: http://rofr.ivoa.net/
.. _OAI-PMH: http://www.openarchives.org/OAI/openarchivesprotocol.html
DaCHS' Registry Interface
-------------------------
As explained in the introduction, you must provide
enough data to allow the VO to tell who you are
before you can include your data center into the VO.
The first step is to define your authority (i.e., something like
org.g-vo.dc) in your config (``/etc/gavo.rc``), in the
``[ivoa]authority`` item.
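In other words, the relevant part of ``/etc/gavo.rc`` might read (with
a made-up authority)::

  [ivoa]
  authority: org.example.dc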
Then, add metadata about yourself in ``GAVO_ROOT/etc/defaultmeta.txt``.
It has a `simple format`_, basically ``<key>: <value>`` pairs. In it,
you must give basic information on your authority and some fallbacks
for services (a complete sample file follows the list):
* authority.creationDate -- A UTC datetime (with trailing Z);
technically, it should be the date the resource record is created, but
realistically, just use "now". Example: ``2007-12-19T12:00:00Z``.
* authority.title -- A human-readable descriptor of what the authority
corresponds to. Example: ``The GAVO data center``
* authority.description -- A sentence or two on what you're up to.
Example: ``The GAVO data center provides VO publication services to
all interested parties on behalf of the German Astrophysical Virtual
Observatory.`` (use backslashes at the end of the lines to break long
lines).
* authority.referenceURL -- A URL at which people can learn more about
your organisation. Example: ``http://www.g-vo.org``.
* publisher -- A short, human-readable name for you
* publisher.ivoId -- An IVOA id for yourself; set this to
``ivo://<authority>/org`` unless you know what you are doing
* contact.name -- A human-readable name for some entity people should
write to. This is not necessarily different from publisher, but
ideally people can write "Dear <contact.name>" in their mails.
* contact.address -- A contact address for surface mail
* contact.email -- An email address. It should be spam-proof.
* contact.telephone -- A telephone number people can call if things
really look bad.
* creator.name -- A name to use when you give no creator in your
resource descriptors. Could be some error sentinel ("we forgot to
give credit, please complain") or just contact.name if you produce
resources yourself.
* creator.logo -- A URL for a logo to use when none is given in the
resource metadata. Use a small PNG here.
.. _simple format: ref.html#meta-stream-format
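Putting these together, a ``defaultmeta.txt`` might look like the
following sketch; all values are, of course, placeholders you must
replace::

  authority.creationDate: 2012-03-01T10:00:00Z
  authority.title: The Example data center
  authority.description: The Example data center publishes data\
    on behalf of the Example observatory.
  authority.referenceURL: http://www.example.org
  publisher: Example DC
  publisher.ivoId: ivo://org.example.dc/org
  contact.name: Jane Doe
  contact.address: 1 Example Road, 12345 Example City
  contact.email: jdoe at example.org
  contact.telephone: +00 555 0100
  creator.name: Example DC team
  creator.logo: http://www.example.org/logo_tiny.png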
Registering DaCHS-external services
-----------------------------------
The registry interface of DaCHS can be used to register entities
external to DaCHS; actually, you're already doing this when you're
claiming an authority.
To register a non-service "resource", you can fill out a resRec RD
element. You could reserve an RD (say, ``GAVOROOT/inputs/ext.rd``) to
collect such external registrations, or you could put them alongside
internal services into their respective RDs. You will then usually
just use the resRec's id attribute to determine the IVORN of the
resource record. It will then be
``ivo://<authority>/<rd path>/<id>``.
In all likelihood, however, you will want to register services. To
do that, use a normal service definition with a nullCore. You
probably need to manually give an accessURL. The most common case is
that of a service with a ``WebBrowser`` capability. These result from
``external`` or ``static`` renderers. Thus, the pattern here usually
is something like::
  <service id="web" allowed="external">
    <nullCore/>
    <meta>
      shortName: My external service
      description: This service does wonderful things, even though\
        it's not based on GAVO's DaCHS software.
    </meta>
    <meta name="url">http://wherever.else/svc</meta>
    <publish render="external" sets="ivo_managed"/>
  </service>
Of course, you will normally need to add further metadata as discussed
above.
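Once the record validates, say ``gavo publish <rd-id>`` as for any
other publication, so that your publishing registry hands the new
record out to harvesters.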
Running the DC Server
=====================
You will probably want to run the DC server software from a start
script. The ``gavo`` interface already works a bit like a SYSV
initscript -- you can say ``gavo start``, ``gavo stop``, ``gavo
restart`` and ``gavo reload``; you may want to define an init script
anyway (instead of just linking) in order to define metadata.
TBD.
Updates to metadata are not usually picked up by the running software.
If you change an RD, you may have to say ``gavo imp -m <rd-id>`` if
the column metadata or permissions changed, or ``gavo publish <rd-id>``
if you changed the publication status.
Even then, the home page will not show new local publications since
those lists are currently cached. To update them, you could run ``gavo
reload``, but it's better to just reload the services data. To do this,
log in as gavoadmin and go to the service overview at
``/__system__/services/overview/form`` (the standard root has that in a
small [s] link at the very bottom of the page). Choose "Admin me" from
the sidebar and then use the "Reload RD" button.
Similarly, if you edit anything in an RD, you will have to reload it,
preferably through "Admin me", before the changes will be reflected.
Note that right now, if your RD is invalid, any services on it will stop
working on a reload.
The Vanity Map
--------------
DaCHS' URL scheme leads to somewhat clunky URLs that, in particular,
reflect the file system underneath. While this doesn't matter to the VO
registry, it is possibly unwelcome when publishing URLs outside of the
VO. To overcome it, you can define "vanity names", single path elements
that are mapped to paths.
These mappings are read from the file ``GAVO_ROOT/etc/vanitymap.txt``.
The file contains lines of the format::
  <target path> <vanity name> [<flag>]
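For example, to make a (hypothetical) form-based service in the
lmcextinct resource reachable as ``/lmc``, you would add a line like::

  lmcextinct/q/cone/form lmc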