===================
GAVO DaCHS Tutorial
===================
.. contents::
  :depth: 2
  :backlinks: entry
  :class: toc
Ingesting Data
==============
Starting the RD
---------------
To ingest data, you will have to write a resource descriptor (RD). We
recommend keeping everything handled by a specific RD together in one
directory that is a direct child of your inputs directory (see
installation and configuration), though you could group resources in
deeper subdirectories. So, go to your inputs directory and say::

  mkdir lmcextinct

The directory name will (normally) appear in URLs, so it's a good idea
to choose something descriptive and short. This directory is called the
resource directory.
We recommend putting the RD in the root of this directory. A good
default name for the RD is "q.rd"; the "q" will appear in the default
URLs as well and usually looks good in there::

  cd lmcextinct
  vi q.rd

(where you can substitute vi with your favourite editor, of course).
Writing resource descriptors is what most of the operation of a data
center is about. Let's start slowly by giving some metadata::
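
  <?xml version="1.0" encoding="iso-8859-1"?>

  <resource schema="lmcextinct">
    <meta name="title">Extinction within the LMC</meta>
    <meta name="creationDate">2009-06-02T08:42:00Z</meta>
    <meta name="description">Extinction values in the area of the LMC...</meta>
    <meta name="copyright">Free to use.</meta>
    <meta name="creator.name">S. Author</meta>
    <meta name="subject">Large Magellanic Cloud</meta>
    <meta name="subject">Interstellar medium, nebulae</meta>
    <meta name="subject">Extinction</meta>
  </resource>
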
If you plan on using non-ASCII characters, you need to adapt the
encoding attribute in the XML prefix to match the encoding you are
actually using. Depending on your computer's setup, you may want to use
utf-8 instead of the iso-8859-1 given above.
The schema attribute on resource names the database schema in which the
tables of this resource will turn up. You should, in general, use the
subdirectory name. If you don't, you have to give the subdirectory name
in a resdir attribute. This attribute must be the name of the resource
directory relative to the inputs directory specified in the
configuration.
Otherwise, there is only meta information so far. This metadata is
crucial for later registration of the service. In HTML forms, it is
displayed in a sidebar. See `RMI-style metadata
<./ref.html#rmi-style-metadata>`_ in the reference documentation.
Defining Target Tables
----------------------
Within the DC, data is represented in database tables, while metadata is
mostly kept within the resource descriptors. A major part of this
metadata is the table structure. It is defined in table elements, which
usually are direct children of the resource element. A resource
element may contain multiple table definitions.
Such a table definition might look like this::
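
  <table id="exts" onDisk="True" adql="True">
    <meta name="description">Extinction values within certain areas
      on the sky.</meta>
    <column name="bbox" type="box"/>
    <column name="centerAlpha" ucd="pos.eq.ra;meta.main"/>
    <column name="centerDelta" ucd="pos.eq.dec;meta.main"/>
    <column name="EVI"/>
    <column name="AV"/>
    <column name="AI"/>
  </table>
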
In a table definition, you must give id, which will double as the table
name within the database. The onDisk attribute specifies that the table
is to reside on the disk as opposed to in memory (in-memory tables have
applications in advanced operations). The adql attribute specifies that
no access restrictions are to be placed on the table; if you run an ADQL
or TAP service, users can access this table.
Table elements may contain metadata. You do not need to repeat metadata
given for the resource, because (in most cases) the DC performs metadata
inheritance. This means that if a table is asked for a piece of
metadata it does not have, it forwards that request to the embedding
resource.
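
The inheritance rule can be pictured with a small Python sketch. The
class and method names used here (``MetaCarrier``, ``get_meta``) are
hypothetical and are not the DaCHS API; they only illustrate how a lookup
falls back to the embedding resource.

```python
class MetaCarrier:
    """Hypothetical stand-in for an RD element that carries metadata."""

    def __init__(self, meta=None, parent=None):
        self.meta = dict(meta or {})
        self.parent = parent  # the embedding element, e.g. the resource

    def get_meta(self, key):
        # Use our own value if we have one ...
        if key in self.meta:
            return self.meta[key]
        # ... otherwise forward the request to the embedding element.
        if self.parent is not None:
            return self.parent.get_meta(key)
        return None


resource = MetaCarrier({"title": "Extinction within the LMC",
                        "creator.name": "S. Author"})
table = MetaCarrier(
    {"description": "Extinction values within certain areas on the sky."},
    parent=resource)
```

Asking ``table`` for ``creator.name`` yields the value set on
``resource``, while ``description`` is answered by the table itself.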
Defining Columns
''''''''''''''''
The main content of a table is a sequence of column elements. Each
contains a description of a single table column. The name attribute is
central in that it will be the column name in the database, the key for
the column's value in record dictionaries that the software uses
internally, and it is usually used to reference the column from the
outside. In DaCHS, column names must be legal identifiers in both Python
and SQL. SQL quoted identifiers thus are not allowed.
The type attribute, if not given, defaults to real, and can otherwise
take any valid SQL datatype. In addition to real, the DC software knows
how to handle:

* text -- a string. You can also use types like char(7) and the like,
  but since that does not help postgres (or much of anything else within
  the DC), this is not recommended.
* double precision (or double) -- a floating point number. You should
  use doubles if you need to keep more than about 7 digits of mantissa.
* integer (or int) -- typically a 32-bit integer.
* bigint -- typically a 64-bit integer.
* smallint -- typically a 16-bit integer.
* timestamp -- a combination of date and time. While postgres can
  process a very large range of dates, the DC stores timestamps in
  datetime.datetime objects, which means that for "astronomical" times
  (like 10000 B.C. or 10000 A.D.) you may need to use custom
  representations. Also, the DC assumes all times to be without time
  zones. Further time metadata (like distinguishing TT from UT) is
  given through STC specifications.
* date -- a date. See timestamp.
* time -- a time. See timestamp.
* box -- a rectangle.

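
The limitation mentioned for timestamp comes from Python's standard
``datetime`` module, which only covers the years 1 through 9999. The
following sketch checks that; the helper name ``fits_in_datetime`` is
made up for this illustration.

```python
import datetime

# datetime objects only cover years MINYEAR..MAXYEAR (1 to 9999).
YEAR_RANGE = (datetime.MINYEAR, datetime.MAXYEAR)


def fits_in_datetime(year):
    """Return True if a date in the given year can be a datetime object."""
    try:
        datetime.datetime(year, 1, 1)
        return True
    except ValueError:
        return False
```

A year like 2009 is fine, whereas 10000 A.D. or any year B.C. (zero or
negative) makes the constructor raise ``ValueError``.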
Some more types (like raw and file) are available to tables in service
definitions, but they should, in general, not appear in database tables.
Further metadata on columns includes:

* unit -- the unit the column values are in. The syntax is that defined
  by Vizier, but that may change pending further standardization in the
  VO. Unit is left out for unitless values.
* tablehead -- a very short string designating the content. This string
  is typically used for display purposes, e.g., as table headings or
  labels on input fields.
* description -- a longer string characterizing the content. This may
  be shown in bubble help or VOTable descriptions.
* ucd -- a Unified Content Descriptor as defined by IVOA. To figure out
  "good" UCDs, the UCD resolver at
  http://dc.zah.uni-heidelberg.de/ucds/ui/ui/form can help.
* required -- True if a value must be set in order for the record to be
  valid. By default, NULL (which in Python is None) is a valid value
  for any column. For required columns, that is no longer the case.
  This is particularly important in connection with foreign keys.

Parsing Input Data
------------------
After you have defined the table, you will want to fill it. You will
usually have one or more input files with "raw" data.
We recommend putting such input data files into a subdirectory of their
own named "data". Let's assume we have one input file for the table
above, called lmc_extinction_values.txt. Suppose it looks like this,
where tabs in the input are shown as ``\t``::

  RA_min\tRA_max\tDEC_min\tDEC_max\tE(V-I)\tA_V\tA_I
  78.910625\t78.982146\t-69.557417\t-69.480639\t0.04\t0.092571\t0.123429
  78.910625\t78.982146\t-69.480639\t-69.403861\t0.05\t0.115714\t0.154286
  78.910625\t78.982146\t-69.403861\t-69.327083\t0.05\t0.115714\t0.154286

The first step for ingestion is lexical analysis. In the DC software,
this is performed by grammars. There are many grammars defined, e.g.,
for getting values from FITS files, VOTables, or using column-based
formats; you can also write specialized grammars in python.
All grammars read "something" and emit a mapping from names to (mostly)
string values.
reGrammars
''''''''''
In this case the easiest grammar to use probably is the `reGrammar
<./ref.html#element-regrammar>`_. The idea here is that you give two
regular expressions to separate the file into records and the records
into fields, and that you simply enumerate the names used in the
mapping.
For the file given above, the RE grammar definition could look like
this::

  <reGrammar topIgnoredLines="1" fieldSep="\t">
    <names>raMin, raMax, decMin, decMax, EVI, AV, AI</names>
  </reGrammar>

The names given are values of the name attribute in the table
definition.
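
What such a grammar does can be sketched in a few lines of Python. This
is not the DaCHS implementation; the function ``iter_rawdicts`` and its
parameter names are assumptions made for this illustration, loosely
modelled on the idea of a record separator, a field separator, and a
list of names.

```python
import re

SAMPLE = (
    "RA_min\tRA_max\tDEC_min\tDEC_max\tE(V-I)\tA_V\tA_I\n"
    "78.910625\t78.982146\t-69.557417\t-69.480639\t0.04\t0.092571\t0.123429\n"
    "78.910625\t78.982146\t-69.480639\t-69.403861\t0.05\t0.115714\t0.154286\n"
)
NAMES = ["raMin", "raMax", "decMin", "decMax", "EVI", "AV", "AI"]


def iter_rawdicts(text, names, record_sep=r"\n", field_sep=r"\t",
                  top_ignored=0):
    """Yield one dict of strings per record (a sketch, not DaCHS code)."""
    # First regular expression: split the input into records.
    records = [rec for rec in re.split(record_sep, text) if rec]
    # Skip leading records such as a header line.
    for record in records[top_ignored:]:
        # Second regular expression: split each record into fields.
        fields = re.split(field_sep, record)
        yield dict(zip(names, fields))


rawdicts = list(iter_rawdicts(SAMPLE, NAMES, top_ignored=1))
```

All values in the resulting dictionaries are still strings; turning them
into typed values is left to a later step.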
If you checked the documentation on reGrammar, you will have noticed
that "names" is an "atomic child" of reGrammar. Atomic children are
usually written as attributes, since their values can always be
represented as strings. However, if strings become larger, it's more
convenient to write them in elements. The DC software allows you to
do just that in general: All attributes can be written as elements with
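tags named like the attribute. So,

::

  <reGrammar topIgnoredLines="1" fieldSep="\t"
    names="raMin, raMax, decMin, decMax, EVI, AV, AI"/>

would have worked just fine, as would::

  <reGrammar fieldSep="\t">
    <topIgnoredLines>1</topIgnoredLines>
    <names>raMin, raMax, decMin, decMax, EVI, AV, AI</names>
  </reGrammar>
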
Structured children, in contrast, cannot be written as plain strings and
thus can only be written in element notation.
Though grammars can be direct children of resource, they are usually
written as children of data elements (see below).
columnGrammars
''''''''''''''
Another grammar frequently useful when reading from text tables is the
`columnGrammar <./ref.html#element-columngrammar>`_. It allows a rather
direct translation of VizieR-like "byte-by-byte" descriptions.

Column grammars define ``col`` elements, the ``key`` attributes of which
give the names of the target columns (or auxiliary identifiers that
you work with in your rowmaker), like this::

  <columnGrammar>
    <col key="raMin">1-9</col>
    <col key="raMax">10-18</col>
    ...
  </columnGrammar>

The first column has the index 1, and -- contrary to python slices --
the last index is included in the selection. No expansion of tabs or
similar is performed.
As potential column names, the keys must be valid python identifiers.
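
The index convention is easy to get wrong, so here is a minimal Python
sketch of it. The helpers ``parse_range`` and ``split_columns`` are
hypothetical and only mimic the rule stated above (1-based, last index
included, no tab expansion); they are not part of DaCHS.

```python
def parse_range(spec):
    """Turn a 1-based, inclusive range like "10-18" into a Python slice."""
    if "-" in spec:
        first, last = (int(part) for part in spec.split("-"))
    else:
        first = last = int(spec)
    # Python slices are 0-based and exclude the end: shift the start only.
    return slice(first - 1, last)


def split_columns(line, cols):
    """Cut a fixed-width line into a dict; cols maps keys to range specs."""
    return {key: line[parse_range(spec)] for key, spec in cols.items()}


# Two 9-character fields without any separator between them.
line = "78.91062578.982146"
row = split_columns(line, {"raMin": "1-9", "raMax": "10-18"})
```

Here, ``"1-9"`` becomes ``slice(0, 9)`` and ``"10-18"`` becomes
``slice(9, 18)``.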
Mapping data
------------
A grammar produces a sequence of mappings from names to strings, the
rawdicts. The database, on the other hand, wants typed values (integers,
time stamps, etc.). Internally, these are represented as rowdicts, i.e.,
dictionaries mapping column names to values. Also, data in input
tables is frequently given in inconvenient formats (e.g., sexagesimal
angles), units not suited to further processing, or distributed over
multiple columns (e.g., date and time of an observation when we want a
single timestamp). It is the job of row makers to transform the rough
data coming from a grammar to whatever the table defines.
Basically, a row maker consists of

* `var <./ref.html#element-var>`_\ s -- assignments of expression values
  to names in the rawdict.
* procedure applications (see `apply <./ref.html#element-apply>`_) --
  procedural manipulations of both rawdicts and rowdicts.
* maps -- rowdict definition.

When building a rowdict for ingestion into the database, a rowmaker first
binds var names, then applies procedures and finally runs the mappings.
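
The order of these steps can be illustrated with a plain Python
function. This is only a sketch of the sequence (vars, then procedures,
then maps); the function ``make_row`` is hypothetical, and the key names
are taken from the extinction example used in this tutorial.

```python
def make_row(rawdict):
    """Build a rowdict from a rawdict in the order described above."""
    vars_ = dict(rawdict)  # work on a copy of the grammar output

    # 1. Bind vars: here, turn the bounds into floats.
    for key in ("raMin", "raMax", "decMin", "decMax"):
        vars_[key] = float(vars_[key])

    # 2. Procedure applications would run here (none in this sketch).

    # 3. Run the mappings to fill the rowdict.
    return {
        "EVI": float(vars_["EVI"]),
        "centerAlpha": (vars_["raMin"] + vars_["raMax"]) / 2,
        "centerDelta": (vars_["decMin"] + vars_["decMax"]) / 2,
    }


row = make_row({"raMin": "78.0", "raMax": "79.0", "decMin": "-70.0",
                "decMax": "-69.0", "EVI": "0.04"})
```

The input dictionary holds strings only, while the returned rowdict
holds typed values ready for the database.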
For simple cases, maps will suffice; you may actually even be able to do
without them. Maps must specify a dest attribute giving the rowdict key
that is defined. To specify the value, you can

* either give a src attribute specifying a rawdict key that will then be
  converted to a typed value using "sane" defaults (e.g., integers will
  be converted by Python's int constructor, where empty strings are
  mapped to None)
* or give a Python expression in the character content, the value of
  which is then directly used as value for dest. No implicit
  conversions are performed.

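In the case above, you could start by saying::

  <rowmaker>
    <map dest="EVI" src="EVI"/>
    <map dest="AV" src="AV"/>
    <map dest="AI" src="AI"/>
  </rowmaker>

to copy over the rawdict (grammar) keys that directly map to table
column names. Since this is a bit unwieldy, the DC provides a
shortcut::

  <rowmaker>
    <simplemaps>EVI:EVI,AV:AV,AI:AI</simplemaps>
  </rowmaker>
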
which expands to exactly what is written above. The keys in each pair do not
need to be identical; the first item of each pair is the table column
name, the second the rawdict key.
The case where the names of rawdict and rowdict keys are identical is so
common (since the RD author controls both) that there is yet another
shortcut for this::

  <rowmaker>
    <idmaps>EVI,AV,AI</idmaps>
  </rowmaker>

For every name in its comma-separated list, idmaps sets up one map
element with both dest and src set to that name.

You can abbreviate this further to::

  <rowmaker idmaps="*"/>

idmaps values can contain shell patterns. They will be matched to the
column names in the target table. For every column for which there is
no explicit mapping, an identity mapping (with type conversion) will be
set up.
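
The pattern expansion can be pictured with Python's ``fnmatch`` module.
This is a sketch under the assumption that explicit maps take precedence
over idmaps, as described above; ``expand_idmaps`` is a made-up helper,
not a DaCHS function.

```python
from fnmatch import fnmatchcase


def expand_idmaps(patterns, column_names, explicit_dests=()):
    """Return {dest: src} identity maps for columns matching the patterns."""
    pattern_list = [pat.strip() for pat in patterns.split(",")]
    maps = {}
    for name in column_names:
        if name in explicit_dests:
            continue  # an explicit map wins over idmaps
        if any(fnmatchcase(name, pat) for pat in pattern_list):
            maps[name] = name
    return maps


columns = ["bbox", "centerAlpha", "centerDelta", "EVI", "AV", "AI"]
simple = expand_idmaps("EVI,AV,AI", columns)
everything_else = expand_idmaps(
    "*", columns, explicit_dests=("bbox", "centerAlpha", "centerDelta"))
```

Both calls produce identity mappings for EVI, AV, and AI only.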
This leaves the bbox, centerAlpha, and centerDelta keys to be defined.
No literals for those appear in the rawdicts since they are not part of
the input data. We need to compute them.
To facilitate computations, we first turn the bounds to floats; this can
be done using vars::
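
  <var name="raMin">float(raMin)</var>
  <var name="raMax">float(raMax)</var>
  <var name="decMin">float(decMin)</var>
  <var name="decMax">float(decMax)</var>
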
No shortcut is available here, since this is a relatively rare thing.
You could use procDef/apply to save on keystrokes if you find yourself
having to do such simple conversions more frequently.
As you can see, var elements have a name attribute that gives the name
in the rawdict the value is to be bound to. Their character content is
a python expression in which you can access the rawdict values by their
names.
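The remaining computations can be performed in mappings::

  <map dest="bbox">coords.Box((raMin, decMin), (raMax, decMax))</map>
  <map dest="centerAlpha">(raMin+raMax)/2.</map>
  <map dest="centerDelta">(decMin+decMax)/2.</map>
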
As in vars, the rawdict values can be accessed by their keys in the
mapping expressions. coords.Box is the internal type for SQL Box
values; you will not usually see those. Still, you can access basically
the whole DC code in this mapping. At some point we will define an API
of "safe operations" that you can use without having to fear changes in
the DC code.
Data elements
-------------
We now have a table definition, a grammar, and a rowmaker. For purposes
of importing, these three come together in a data element. These
elements define what could be seen as the equivalent of a VOTable
resource together with a recipe of how to build it. For onDisk tables,
a side effect of building the data is that the tables are created in the
database; in that sense, data elements also define operations, a notion
that will become more pronounced as we discuss incremental processing.
Let us assemble the pieces we have so far::
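
  <?xml version="1.0" encoding="iso-8859-1"?>

  <resource schema="lmcextinct">
    <meta name="title">Extinction within the LMC</meta>
    <meta name="creationDate">2009-06-02T08:42:00Z</meta>
    <meta name="description">Extinction values in the area of the LMC...</meta>
    <meta name="copyright">Free to use.</meta>
    <meta name="creator.name">S. Author</meta>
    <meta name="subject">Large Magellanic Cloud</meta>
    <meta name="subject">Interstellar medium, nebulae</meta>
    <meta name="subject">Extinction</meta>

    <table id="exts" onDisk="True" adql="True">
      <meta name="description">Extinction values within certain areas
        on the sky.</meta>
      <column name="bbox" type="box"/>
      <column name="centerAlpha" ucd="pos.eq.ra;meta.main"/>
      <column name="centerDelta" ucd="pos.eq.dec;meta.main"/>
      <column name="ev_i"/>
      <column name="a_v"/>
      <column name="a_i"/>
    </table>

    <data id="import">
      <sources pattern="data/lmc_extinction_values.txt"/>
      <reGrammar topIgnoredLines="1" fieldSep="\t">
        <names>raMin, raMax, decMin, decMax, ev_i, a_v, a_i</names>
      </reGrammar>
      <rowmaker id="build_exts" idmaps="*">
        <var name="raMin">float(raMin)</var>
        <var name="raMax">float(raMax)</var>
        <var name="decMin">float(decMin)</var>
        <var name="decMax">float(decMax)</var>
        <map dest="bbox">coords.Box((raMin, decMin), (raMax, decMax))</map>
        <map dest="centerAlpha">(raMin+raMax)/2.</map>
        <map dest="centerDelta">(decMin+decMax)/2.</map>
      </rowmaker>
      <make table="exts" rowmaker="build_exts"/>
    </data>
  </resource>
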
As you can see, we have put the grammar and the rowmaker into a data
element. While this is not exactly necessary (they could be direct
children of resource as well, which might be a good idea if they are
used in more than one data), this is good practice since they, in some
sense, belong to that data element.
There are two new elements in data. For one, there's sources. Sources
specify where the data will find its input files in its pattern
attribute. This contains shell patterns that are interpreted relative
to the resource directory. You can give multiple patterns if necessary
like this::

  <sources>
    <pattern>inp2/*.txt</pattern>
    <pattern>inp1/*.txt</pattern>
  </sources>

There also is a recurse boolean attribute you can use when your sources
are distributed over subdirectories of the path part of the pattern.
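
How shell patterns relative to the resource directory select files can
be sketched with Python's ``glob`` module. The function ``find_sources``
is hypothetical, and the way it handles ``recurse`` (descending below
the directory part of the pattern) is an assumption about the behaviour
described above, not DaCHS code.

```python
import glob
import os


def find_sources(resdir, patterns, recurse=False):
    """Return sorted resdir-relative paths matching any shell pattern."""
    found = set()
    for pattern in patterns:
        if recurse:
            # Assumption: also look into subdirectories of the path part.
            dir_part, name_part = os.path.split(pattern)
            pattern = os.path.join(dir_part, "**", name_part)
        full_pattern = os.path.join(resdir, pattern)
        for path in glob.glob(full_pattern, recursive=recurse):
            found.add(os.path.relpath(path, resdir))
    return sorted(found)
```

With a pattern list like ``["inp2/*.txt", "inp1/*.txt"]``, files from
both directories end up in one list of sources.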
Indices and Mixins
------------------
Now, let's assume the input table is large. You will want to define
indices on the table. To do this, use the `index
<./ref.html#element-index>`_ element. It is a child of table. In
general, index specifications can be rather involved, but simple cases
remain simple. If you just wanted to define an index on EVI, you could
say::

  <table id="exts" onDisk="True" adql="True">
    <index columns="EVI"/>
    ...
  </table>

(the columns attribute would be "AV,EVI" if you wanted an index on both
columns).
However, indices are not always that simple. For example, for a spatial
index on centerAlpha, centerDelta, with the q3c scheme used by the DC
software you would have to write something like::

  <index columns="centerAlpha,centerDelta">
    q3c_ang2ipix(centerAlpha,centerDelta)
  </index>

The DC software has a mechanism that helps in this case: `Mixins
<./ref.html#mixins>`_. A mixin conceptually is a guarantee of certain
table properties, typically of the presence of certain columns; here, it
is just the presence of an index.
So, all you need to do to have a spatial index on the table is::

  <table id="exts" onDisk="True" adql="True" mixin="//scs#q3cindex">
    ...
  </table>

This is UCD magic at work -- q3cindex selects the columns with
pos.eq.*;meta.main as index columns. If you are curious how it does
this, check scs.rd in the system RD directory.
Starting the Ingestion
----------------------
At this point, you can run the ingestion::

  gavoimp q

By default, gavoimp creates all data defined in a resource. If this is
not what you want, you can explicitly specify a data id to process::

  gavoimp q import

For larger data sets, it may be wise to first try a couple of rows::

  gavoimp --stop-after=300 q

Try ``gavoimp --help`` to see more options (most of which are probably
irrelevant to you now).

Note that gavoimp interprets the RD argument as a file first and then as
an RD id. An RD id is the inputs-relative path of the RD with the
extension stripped. Our example RD thus has the RD id lmcextinct/q, and
you could have said::

  gavoimp lmcextinct/q

from anywhere in the file system.
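
The rule for deriving an RD id can be written down in a few lines of
Python. The function ``rd_id_for`` and the inputs directory used in the
usage note are assumptions for illustration only; your inputs directory
is whatever your configuration says.

```python
import os


def rd_id_for(path, inputs_dir):
    """Return the inputs-relative path of an RD file without extension."""
    rel = os.path.relpath(os.path.abspath(path), os.path.abspath(inputs_dir))
    # Strip the extension and use forward slashes in the id.
    return os.path.splitext(rel)[0].replace(os.sep, "/")
```

For an RD at ``/var/gavo/inputs/lmcextinct/q.rd`` and an inputs
directory of ``/var/gavo/inputs``, this gives ``lmcextinct/q``.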
Debugging
---------
If nothing else helps, you can watch what the software actually sends to
the database. To do that, set the GAVO_SQL_DEBUG environment variable
to any value. This could look like this::

  env GAVO_SQL_DEBUG=1 gavoimp q create

The first couple of requests are for internal use (like checking that some
meta tables are present).
Publishing Data
===============
Once a table is in the database, it needs to get out again. Within
DaCHS, there are three parties involved in delivering data to the user:

* The core; it actually does the computation.
* The renderer; it formats the result in some way requested by the user
  and delivers it. There are renderers for web forms, VO protocols,
  images, etc.
* The service; it holds together the core and the renderer, can
  reformat core results, controls the metadata, etc.

You will usually use pre-specified renderers, so these are not defined
in resource descriptors. What you have to define are cores and
services.
For core, you will usually use the `dbCore <./ref.html#element-dbcore>`_
in custom services, though `many other cores
<./ref.html#cores-available>`_ are predefined and you can `define your
own <./ref.html#writing-custom-cores>`_.
The dbCore generates a (single-table) query from condition descriptors
and returns a table that you describe through an output table. Cores
are defined as direct children of the resource. For the lmcextinction
table above, it could look like this::
Cores always need an id. dbCores need a queriedTable attribute, the
value of which must be a table reference. This is the table the query
will run against.
CondDescs can be defined in all kinds of ways. The most common modes,
however, are using predefined condDescs (which mostly come from
protocols; in this case, humanScs comes from SCS and lets you do cone
searches), and just deriving condDescs from table columns. You can
refer to columns from your table definition by name in the buildFrom
attribute, and the software tries to make some useful input definition
from that column.
In web forms, these input definitions become form items; other renderers
will expose them differently. In all cases, however, the condDescs of
the dbCore define what fields can be queried.
The service now ties the core together with a renderer. It might look
like this::

  <service id="cone" core="...">
    <meta name="shortName">lmcext_web</meta>
  </service>

While services can run without short names, omitting them can lead to
trouble later, so you should make a habit of assigning them. See `the
data checklist <./data_checklist.html>`_ for more information on short
names.
A service must have an id as well, and its core attribute must contain
the id of a core.
With this minimal specification, the service exposes a web form-based
interface. To try this, run a server::

  gavoserve debug

and point a browser to http://localhost:8080/lmcextinct/q/cone/form (the
host part, of course, depends on your configuration. If you did not
change anything there, you should find the data at the given URL).
More on Tables
==============
Notes
-----
Frequently, you need to say more about a column than is appropriate in
the few-phrase description. Historically, such situations have been
handled using notes. Since notes can be reused for multiple columns, we
chose to follow that precedent rather than attach longish information
onto the columns themselves.
The notes themselves are kept in meta elements belonging to tables.
Since the notes tend to be markup-heavy, their default format is
restructured text. When entering notes in RDs, there is an attribute
``tag`` on these meta items::

  <table id="demo">
    ...
    <meta name="note" tag="1">
      The meaning of the flag is as follows:

      ===== ==========
      value meaning
      ===== ==========
      1     value is 2
      2     value is 1
      ===== ==========
    </meta>
    ...
  </table>

To associate a column with a note, use the column's note attribute::

  <column name="flag" note="1"/>

As tag, you may use basically any string, but it's a good idea to keep
it to numbers or at least characters not requiring URL encoding.
The notes will be exposed in HTML table heads, table and service
descriptions, etc. If you need to link to one, there is the built-in
tablenote renderer that takes the table and the note from its query path.
The most convenient way to use it is through the
built-in vanity name tablenote, where you would access the note above
using a URL like ``http://your.server/tablenote/demoschema.demo/1``.
STC
---
As soon as you have coordinates, you will want to define coordinate
systems on them. In the introductory example, that was not necessary
because SCS mandates that the coordinates you export are in ICRS, so
either your coordinates are in ICRS or you are violating the SCS
protocol -- in either case, nothing to declare.
In the more general case, you will want to say what is what in your
tables. DaCHS uses a language called STC-S to declare systems,
reference points, etc. The STC-S description [TODO: Link to IVOA note]
is a bit terse, but the good news is that you will get by with a few
features most of the time.
STC is defined in children of table elements, with references to table
columns in quoted strings::

  <stc>Position ICRS "ra" "dec" Error "e_ra" "e_dec"</stc>
  <stc>Position FK4 J1950.0 "ra_orig" "dec_orig"</stc>

You do not need to change anything in the column definitions themselves,
since the machinery will resolve your column references. If you refer
to non-existing columns, RD parse errors will be thrown.
More on Grammars
================
Row Generators
--------------
TBD
Source Fields
-------------
Grammars can have a sourceFields element. It contains a standard
procedure definition (i.e., you could predefine those and bind
parameters), but usually you will just fill in the code.
This code is called once for each source processed, and receives the
sourceToken as argument. It must return a dictionary, the key/value
pairs of which will be added to all rows returned by the row iterator.
The purpose of sourceFields is to precompute values that depend on the
source ("file") and are constant for all rows within it. An example for
where you need this is when you want to create backlinks to the file a
piece of data came from::

  <sourceFields>
    <code>
      srcKey = utils.getRelativePath(sourceToken,
        base.getConfig("inputsDir"))
      return locals()
    </code>
  </sourceFields>

You can then retrieve the path to the source file via the srcKey key in
rawdicts (and then, using render functions and static renderers, turn
this into links).
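
The effect of sourceFields on the rows can be sketched in plain Python.
This is not how DaCHS implements it; ``source_fields`` and ``iter_rows``
are hypothetical names, and ``os.path.relpath`` stands in for the
``utils.getRelativePath`` call shown above.

```python
import os


def source_fields(source_token, inputs_dir):
    """Compute values that are constant for every row of one source."""
    return {"srcKey": os.path.relpath(source_token, inputs_dir)}


def iter_rows(source_token, inputs_dir, rows):
    """Yield each row with the per-source fields merged in."""
    extra = source_fields(source_token, inputs_dir)  # once per source
    for row in rows:
        merged = dict(row)
        merged.update(extra)
        yield merged


rows = list(iter_rows("/data/inputs/lmcextinct/data/a.txt", "/data/inputs",
                      [{"EVI": "0.04"}, {"EVI": "0.05"}]))
```

Every resulting row carries the same ``srcKey`` value, computed only
once for the source.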
In addition to the sourceToken, you also have access to the data that
will be fed from the grammar. This can be used to, e.g., retrieve the
resource directory (``data.dd.rd.resdir``) or data descriptor properties
(``data.dd.getProperty("whatever")``).
Sometimes you want to do database queries from within sourceFields.
This is tricky when you access the table being written or otherwise
being accessed. This is because sourceFields code runs in the midst of
a transaction updating the table. So, something like::

  base.SimpleQuerier().query(...)

will wait for the transaction to finish. But the transaction is waiting
for data that will only come when the query finishes -- this is a
deadlock, and gavoimp will just sit there and wait (see also the
documentation on deadlocks).

To get around this, you need to query using the data's connection. So,
instead write::

  base.SimpleQuerier(connection=data.connection).query(...)

More on Services
================
Custom Templates
----------------
Within the data center, most pages are generated from templates [XXX
TODO: write something about them generically]. This is true for the
pages the form renderer on services displays as well. To achieve special
effects, you may want to override them (though in general, it is a much
better idea to work within the standard template since that will give
your service all kinds of automatic updates and would make, e.g., changes
much easier if your institution undergoes the yearly reorganization).
The default response template can be found in
resources/templates/defaultsresponse.html in the installed tree. To
obtain the plainest output conceivable, try::
No title
Save this to a file within the resource directory, let's say
"res/plain.html". Then, say::

  <template key="response">res/plain.html</template>

in your service; this should give you a minimally decorated page.
Of course, this will display a severely degraded page. To get at least
the standard style sheet and the standard javascript, say::
instead of the plain head.
More on Cores
=============
CondDescs
---------
dbCores and cores derived from them take most of their power from
condition descriptors or CondDescs. These combine inputKeys, which are
basically column objects with some additional presentation-related
information, with code generating SQL conditions.
A condDesc can contain zero or more input keys (though having zero input
keys makes no sense for user-defined condDescs since they would never
"fire"). Having more than one input key is useful when input quantities
can only be interpreted when present as a group. An example is the
standard cone search, where you need both a position and a search
radius.
Automatic and manual control
''''''''''''''''''''''''''''
However, most condDescs correspond to one input key, and the input key
is mostly derived from a table column. This is the standard idiom::

  <condDesc buildFrom="somecol"/>

where somecol is a column in the table queried by the core. This
construct will cause an input key to be built from somecol. While
doing this, the type will be mapped automatically. The primary rules
are:

* Numeric types will get mapped to numeric vizier-like expressions.
* Datetimes will get mapped to date vizier-like expressions.
* text and chars will get mapped to string vizier-like expressions.
* Enumerated values (i.e., columns with value elements giving options)
  will not become vizier-like expressions but input keys that yield
  selection widgets.

To have more control (e.g., if you do not want to allow vizier-like
expressions), give the input key yourself::
(which would make a column required in the table optional in the query),
or::
(which creates an input key matching everything literally), or even::
-- if the input key is required, queries not giving it will be rejected.
The title attribute on option gives the label of an option in the HTML
input widget; if it's missing, a string representation of the value will
be used.
In all those cases, the SQL generated from the condDesc is a conjunction
of the input key's individual SQL expressions. Those, in turn, are
simply comparisons for equality for plain types and more or less
arbitrary expressions for vizier expression types.
Incidentally, two properties on inputKeys are defined to only show
inputs for certain renderers, viz., ``onlyForRenderer`` and
``notForRenderer``. Both have single strings as values. This is
intended mainly for cases like SIAP and SCS where there are
"human-oriented" versions of the input fields available. The built-in
SCS and SIAP conditions already do that, so you can give both scs and
humanSCS conditions in a core. Here is how you would define an input
key that is only used for the form renderer::

  <inputKey ...>
    <property name="onlyForRenderer">form</property>
  </inputKey>

Phrase makers
'''''''''''''
For complete control over what SQL is generated, condDescs may contain
code called a phrase maker. This, again, is a procedure application,
quite like with rowmaker procs, except that the signature of condDesc
code is different.
Phrase maker code has the following names available:
* inputKeys -- the list of input keys for the parent CondDesc
* inPars -- a dictionary mapping inputKey names to the values
provided by the user
* outPars -- a dictionary that is later used as the parameter
dictionary to the query.
The code should amend the outPars dictionary with the keys mentioned in
the conditions. The conditions themselves are yielded. So, a very
simple condDesc with generated SQL could look like this::

  outPars["xxyy"] = "x"*inPars.get("val", 20)
  yield "someColumn=%(xxyy)s"

However, using fixed names in outPars is not recommended, if only
because condDescs could be used multiple times. The recommended way
uses the vizierexprs.getSQLKey function. It takes a name, a value, and
the outPars dictionary. It will return a key unique to the query in
question and enter the value into the outPars dictionary under that key.
While that sounds complicated, it is actually rather harmless, as shown in
the following real-world example that lets users input date, time and an
interval in split-up form (e.g., when you cannot hope anyone will try to
write the equivalent vizier-like expressions)::

  baseTS = datetime.datetime.combine(inPars["date"], inPars["time"])
  dt = datetime.timedelta(minutes=inPars["within"])
  yield "date BETWEEN %%(%s)s AND %%(%s)s"%(
    vizierexprs.getSQLKey("date", baseTS-dt, outPars),
    vizierexprs.getSQLKey("date", baseTS+dt, outPars))

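
To see why generated keys avoid clashes, here is a stand-alone Python
sketch of a function with the same calling convention (name, value,
parameter dictionary). It is a hypothetical re-implementation written
for this illustration; the real ``vizierexprs.getSQLKey`` may well name
its keys differently.

```python
def get_sql_key(name, value, out_pars):
    """Store value in out_pars under a key not used yet; return the key."""
    key, suffix = name, 0
    # Append a running number until the key is free.
    while key in out_pars:
        key = "%s%d" % (name, suffix)
        suffix += 1
    out_pars[key] = value
    return key


out_pars = {}
condition = "date BETWEEN %%(%s)s AND %%(%s)s" % (
    get_sql_key("date", "2009-06-02", out_pars),
    get_sql_key("date", "2009-06-03", out_pars))
```

Although both calls ask for the name ``date``, the two values end up
under different keys, so the condition can refer to each of them.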
.. _`Resource Metadata for the Virtual Observatory`: http://www.ivoa.net/Documents/latest/RM.html