===================
GAVO DaCHS Tutorial
===================

:Author: Markus Demleitner
:Email: gavo@ari.uni-heidelberg.de

.. contents::
  :depth: 2
  :backlinks: entry
  :class: toc


Ingesting Data
==============

Starting the RD
---------------

To ingest data, you will have to write a resource descriptor (RD). We
recommend to keep everything handled by a specific RD together in one
directory that is a direct child of your inputs directory (see
installation and configuration), though you could group resources in
deeper subdirectories.

So, go to your inputs directory and say::

  mkdir lmcextinct

The directory name will (normally) appear in URLs, so it's a good idea
to choose something descriptive and short. This directory is called the
resource directory.

We recommend to put the RD in the root of this directory. A good default
name for the RD is "q.rd"; the "q" will appear in the default URLs as
well and usually looks good in there::

  cd lmcextinct
  vi q.rd

(where you can substitute vi with your favourite editor, of course).

Writing resource descriptors is what most of the operation of a data
center is about. Let's start slowly by giving some metadata::

  <?xml version="1.0" encoding="iso-8859-1"?>
  <resource schema="lmcextinct">
    <meta name="title">Extinction within the LMC</meta>
    <meta name="creationDate">2009-06-02T08:42:00Z</meta>
    <meta name="description">
      Extinction values in the area of the LMC...
    </meta>
    <meta name="copyright">Free to use.</meta>
    <meta name="creator.name">Author, S.; Other, A.</meta>
    <meta name="subject">Large Magellanic Cloud</meta>
    <meta name="subject">Interstellar medium, nebulae</meta>
    <meta name="subject">Extinction</meta>
  </resource>

You need to adapt the encoding attribute in the prefix to match what you
are actually using if you plan on using non-ASCII-characters. You may
want to use utf-8 instead of the iso-8859-1 used here, depending on your
computer's setup.

The schema attribute on resource gives the (database) schema that tables
for this resource will turn up in. You should, in general, use the
subdirectory name. If you don't, you have to give the subdirectory name
in a resdir attribute. This attribute must be the name of the resource
directory relative to the inputs directory specified in the
configuration.

In general, you should have exactly one RD per database schema. This is
not enforced, but sharing schemata between RDs will cause many
undesirable behaviours. An example is permissions: When importing a
table, the schema access rights are adapted. If you have one RD A
defining an ADQL-queriable table in schema X and another RD B that has
no ADQL-queriable table, importing A will make schema X readable to
untrusted queries, whereas importing B will make it unreadable again;
this would lead to query failures (which could, in this case, be fixed by
adding untrusted to B's readRoles manually, but you get the idea).

Otherwise, there is only meta information so far. This metadata is
crucial for later registration of the service. In HTML forms, it is
displayed in a sidebar. See also `More on Metadata`_.

Another hint: There's a fairly large body of RDs at
http://svn.ari.uni-heidelberg.de/svn/gavo/hdinputs, and most of them are
free for inspection and blatant stealing (if you need a license on any of
this, let us know). Most of those can be seen live on
http://dc.g-vo.org.

Once you are here, you should "validate" your RD. This is, in general, a
good idea before doing anything with the RD, since it lets you catch
errors now rather than through the (in all likelihood far more byzantine)
error messages that may arise when something goes wrong later. So, say::

  gavo val q.rd

and read the output.
If you don't understand what ``gavo val`` tells you, complain to
gavo@ari.uni-heidelberg.de -- the command is really intended to help you
catch errors, and if it doesn't do so, it's either a bug in ``gavo val``
or the documentation, and in either case we'd like to fix it.

You can also pass an RD id to ``gavo val``, and you can specify more than
one RD.


Defining Target Tables
----------------------

Within the DC, data is represented in database tables, while metadata is
mostly kept within the resource descriptors. A major part of this
metadata is the table structure. It is defined in table elements, which
usually are direct children of the resource element. A resource element
may contain multiple table definitions.

Such a table definition might look like this::
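
  <table id="extvals" onDisk="True" adql="True">
    <!-- a minimal sketch only: the table id and the column metadata
    given here are assumptions; the column names are the ones used
    throughout the rest of this tutorial. -->
    <meta name="description">
      Extinction values within certain areas on the sky.
    </meta>
    <column name="raMin" unit="deg" tablehead="RA (min)"
      description="Minimum right ascension of the area"/>
    <column name="raMax" unit="deg" tablehead="RA (max)"
      description="Maximum right ascension of the area"/>
    <column name="decMin" unit="deg" tablehead="Dec (min)"
      description="Minimum declination of the area"/>
    <column name="decMax" unit="deg" tablehead="Dec (max)"
      description="Maximum declination of the area"/>
    <column name="EVI" unit="mag" tablehead="E(V-I)"
      description="Colour excess E(V-I) within the area"/>
    <column name="AV" unit="mag" tablehead="A_V"
      description="V band extinction within the area"/>
    <column name="AI" unit="mag" tablehead="A_I"
      description="I band extinction within the area"/>
    <column name="centerAlpha" unit="deg" tablehead="RA (center)"
      description="Right ascension of the area's center"/>
    <column name="centerDelta" unit="deg" tablehead="Dec (center)"
      description="Declination of the area's center"/>
    <column name="bbox" type="box" tablehead="Bounding box"
      description="Bounding box of the area"/>
  </table>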
In a table definition, you must give id, which will double as the table
name within the database. The onDisk attribute specifies that the table
is to reside on the disk as opposed to in memory (in-memory tables have
applications in advanced operations). The adql attribute specifies that
no access restrictions are to be placed on the table; if you run an ADQL
or TAP service, users can access this table.

Table elements may contain metadata. You do not need to repeat metadata
given for the resource, because (in most cases) the DC performs metadata
inheritance. This means that if a table is asked for a piece of metadata
it does not have, it forwards that request to the embedding resource.


Defining Columns
''''''''''''''''

The main content of table is a sequence of column elements. These
contain the definition of a single table column.

The name attribute is central in that it will be the column name in the
database, the key for the column's value in record dictionaries that the
software uses internally, and it is usually used to reference the column
from the outside. Column names must be legal identifiers for both python
and SQL in DaCHS. SQL delimited identifiers thus are not allowed (this is
not the whole truth, but it's true enough).

The type attribute defaults to real, and can otherwise take values in
valid SQL datatypes. The DC software knows how to handle, in addition to
real,

* text -- a string. You can also use types like char(7) and the like, but
  since that does not help postgres (or much anything else within the
  DC), this is not recommended.
* double precision (or double) -- a floating point number. You should use
  doubles if you need to keep more than about 7 digits of mantissa.
* integer (or int) -- typically a 32-bit integer
* bigint -- typically a 64-bit integer
* smallint -- typically a 16-bit integer
* timestamp -- a combination of date and time. While postgres can process
  a very large range of dates, the DC stores timestamps in
  datetime.datetime objects, which means that for "astronomical" times
  (like 10000 B.C. or 10000 A.D.) you may need to use custom
  representations. Also, the DC assumes all times to be without time
  zones. Further time metadata (like distinguishing TT from UT) is given
  through STC specifications.
* date -- a date. See timestamp.
* time -- a time. See timestamp.
* box -- a rectangle.
* spoint, scircle, sbox, spoly -- objects of spherical geometry, taken
  from pgSphere. Ask for documentation...

Some more types (like raw and file) are available to tables in service
definitions, but they should, in general, not appear in database tables.

Further metadata on columns includes:

* unit -- the unit the column values are in. The syntax is that defined
  by Vizier, but that may change pending further standardization in the
  VO. Unit is left out for unitless values.
* tablehead -- a very short string designating the content. This string
  is typically used for display purposes, e.g., as table headings or
  labels on input fields.
* description -- a longer string characterizing the content. This may
  appear in bubble help or VOTable descriptions. Since these could be
  longer, you may want to put them in a child element rather than an
  attribute; in both cases, whitespace is normalized, so you can enter
  line breaks and similar for readability in the source; they will always
  be rendered as a single blank. For even longer, note-like material, see
  Notes_.

An example for a long description::
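
  <column name="aperture" unit="arcsec" tablehead="Aperture">
    <!-- a sketch: name, unit, and tablehead are made up here; the point
    is giving the description in a child element rather than in an
    attribute. -->
    <description>
      The aperture is the full width at half maximum of the response
      function of our sage 3000 hyper-detector.
    </description>
  </column>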
* ucd -- a Unified Content Descriptor as defined by IVOA. To figure out "good" UCDs, the UCD resolver at http://dc.zah.uni-heidelberg.de/ucds/ui/ui/form can help. * required -- True if value must be set in order for the record to be valid. By default, NULL (which in python is None) is a valid value for any column. For required columns, that is no longer the case. This is particularly important in connection with foreign keys. * verbLevel – A measure for the "importance" of the column. Various protocols have the notion of "verbosity", where higher verbosity means you get to see more columns with more esoteric content. Within DaCHS, verbLevel is a number between (usefully) 1 and 30, with columns with verbLevel 1 always given and those with verbLevel 30 only given if someone really wants to see all columns. Technically, in SCS, a column is part of the output table if its verbLevel is smaller or equal to ten times the query's VERB parameter. Column elements may have a child element `values <./ref.html#element-values>`_. This lets you specify metadata like maximum or minimum, or enumerate possible values. The most common use is the definition of null literals though. This is not necessary for floats, and usually not even strings, because these have useful (and actually non-overridable) null values in the VOTable representation (where this sort of thing counts most). It is, however, highly recommended to give null literals when defining integral types (including chars) that may have null values. DaCHS will try to pick useful null values for those automatically when possible, but when streaming tables, this is impossible, and errors will be raised during VOTable rendering when NULLs are encountered in such a situation. So, just define null values whenever you define a non-required integral column, like this:: After you have imported a table, it is a good idea to run ``gavo info`` with the id of the freshly imported table, e.g.,:: gavo info myres/q#thistable This will output several properties (min, max, avg) of numeric columns that may help spot import errors; it will also say which columns contain NULLs. Use this to mark every column containing integers either ``required="True"`` (to tell other people that no NULLs are possible here) or add an explicit null literal. Everyone will be grateful. Parsing Input Data ------------------ After you have defined the table, you will want to fill it. You will usually have one or more input files with "raw" data. We recommend putting such input data files into a subdirectory of their own named "data". Let's assume we have one input file for the table above, called lmc_extinction_values.txt. Suppose it looks like this, where tabs in the input are shown as "\\t":: RA_min\\tRA_max\\tDEC_min\\tDEC_max\\tE(V-I)\\tA_V\\tA_I 78.910625\\t78.982146\\t-69.557417\\t-69.480639\\t0.04\\t0.092571\\t0.123429 78.910625\\t78.982146\\t-69.480639\\t-69.403861\\t0.05\\t0.115714\\t0.154286 78.910625\\t78.982146\\t-69.403861\\t-69.327083\\t0.05\\t0.115714\\t0.154286 The first step for ingestion is lexical analysis. In the DC software, this is performed by grammars. There are many grammars defined, e.g., for getting values from FITS files, VOTables, or using column-based formats; you can also write specialized grammars in python. All grammars read "something" and emit a mapping from names to (mostly) string values. reGrammars '''''''''' In this case the easiest grammar to use probably is the `reGrammar <./ref.html#element-regrammar>`_. 
The idea here is that you give two regular expressions to separate the
file into records and the records into fields, and that you simply
enumerate the names used in the mapping. For the file given above, the RE
grammar definition could look like this::

  <reGrammar topIgnoredLines="1">
    <names>raMin, raMax, decMin, decMax, EVI, AV, AI</names>
  </reGrammar>

The names given are values of the name attribute in the table definition.

If you checked the documentation on reGrammar, you will have noticed that
"names" is an "atomic child" of reGrammar. Atomic children are usually
written as attributes, since their values can always be represented as
strings. However, if strings become larger, it's more convenient to write
them in elements. The DC software allows you to do just that in general:
All attributes can be written as elements with tags named like the
attribute. So, ::

  <reGrammar topIgnoredLines="1"
    names="raMin, raMax, decMin, decMax, EVI, AV, AI"/>

would have worked just fine, as would::

  <reGrammar>
    <topIgnoredLines>1</topIgnoredLines>
    <names>raMin, raMax, decMin, decMax, EVI, AV, AI</names>
  </reGrammar>

Structured children, in contrast, cannot be written as plain strings and
thus can only be written in element notation.

Though grammars can be direct children of resource, they are usually
written as children of data elements (see below).


columnGrammars
''''''''''''''

Another grammar frequently useful when reading from text tables is the
`columnGrammar <./ref.html#element-columngrammar>`_. It allows a rather
direct translation of VizieR-like "byte-by-byte"-descriptions.

Column grammars define ``col`` elements. Each of these has a ``key``
attribute that gives a name. This could be the ``name`` of a target
column in the simplest case, or it can be an auxiliary identifier that
you process in a rowmaker::

  <columnGrammar>
    <col key="raMin">1-9</col>
    <col key="raMax">10-18</col>
    ...
  </columnGrammar>

The first column has the index 1, and -- contrary to python slices -- the
last index is included in the selection. No expansion of tabs or similar
is performed. As potential column names, the keys must be valid python
identifiers.


Mapping data
------------

A grammar produces a sequence of mappings from names to strings, the
rawdicts. The database, on the other hand, wants typed values, i.e.,
integers, time stamps, etc, internally represented as dictionaries
mapping column names to values called rowdicts. Also, data in input
tables is frequently given in inconvenient formats (e.g., sexagesimal
angles), units not suited to further processing, or distributed over
multiple columns (e.g., date and time of an observation when we want a
single timestamp).

It is the job of row makers to transform the rough data coming from a
grammar to whatever the table defines. Basically, a row maker consists of

* `var <./ref.html#element-var>`_ s -- assignments of expression values
  to names in the rawdict.
* procedure applications (see `apply <./ref.html#element-apply>`_) --
  procedural manipulations of both rawdicts and rowdicts.
* maps -- rowdict definition.

When building a rowdict for ingestion into the database, a rowmaker first
binds var names, then applies procedures and finally performs the
mappings. For simple cases, maps will suffice; you may actually even be
able to do without them.

Maps must specify a dest attribute giving the rowdict key that is
defined. To specify the value, they can do one of two things (both forms
are sketched after this list):

* either give a src attribute specifying a rawdict key that will then be
  converted to a typed value using "sane" defaults (e.g., integers will
  be converted by python's int constructor, where empty strings are
  mapped to None)
* or give a python expression in the character content, the value of
  which is then directly used as value for dest. No implicit conversions
  are performed.
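For illustration, the two forms, using names from this tutorial's
example, could look like this (these map elements would sit inside a
rowmaker, as shown further down)::

  <map dest="EVI" src="EVI"/>
  <map dest="centerAlpha">(float(@raMin)+float(@raMax))/2.</map>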
In python expressions, you can access the data handed over by the grammar as ``vars["key"]``; equivalently, you can use the abbreviation ``@key``. This notion is supported throughout rowmakers where applicable; e.g., you can use it in late bindings of procedure applications. In the case above, you could start by saying:: to copy over the rawdict (grammar) keys that directly map to table column names. Since this is a bit unwieldy, the DC provides a shortcut:: EVI:EVI,AV:AV,AI:AI which expands to exactly what is written above. The keys in each pair do not need to be identical; the first item of each pair is the table column name, the second the rawdict key. The case where the names of rawdict and rowdict keys are identical is so common (since the RD author controls both) that there is yet another shortcut for this:: EVI,AV,AI Idmaps sets up one map element each with both dest and src set to the value for every name in the comma separated list idmaps. You can abbreviate this further to:: idmaps values can contain shell patterns. They will be matched to the column names in the target table. For every column for which there is no explicit mapping, an identity mapping (with type conversion) will be set up. This leaves the bbox, centerAlpha, and centerDelta keys to be defined. No literals for those appear in the rawdicts since they are not part of the input data. We need to compute them. To facilitate computations, we first turn the bounds to floats; this can be done using vars:: float(@raMin) float(@raMax) float(@decMin) float(@decMax) No shortcut is available here, since this is a relatively rare thing. You could use procDef/apply to save on keystrokes if you find yourself having to do such simple conversions more frequently. Note the @-notation. As mentioned above, you could equivalently have written ``vars["raMin"]``. Both spellings evaluate to the value of the given name in the rawdict coming from the grammar. The remaining computations can be performed in mappings:: (@raMin+@raMax)/2. (@decMin+@decMax)/2. coords.Box((@raMin, @decMin), (@raMax, @decMax)) ``coords.Box`` is the internal type for SQL Box values; you will not usually see those. Still, you can access basically the whole DC code in this mapping. At some point we will define an API of "safe operations" that you can use without having to fear changes in the DC code. `Some functions useful for such mappings <./ref.html#functions-available-for-row-makers>`_ are listed in the reference manual. Of course, you can have values that do not even depend on grammar output:: datetime.datetime.now() Null values are always troublesome. Within DaCHS, the null value (almost) always is python's None. There is the rowmaker function ``parseWithNull`` to help you come up with those; say, some joker used 99.99 as a null value for a magnitude, you could say:: parseWithNull(@VmagSrc, float, "99.99") If you need to scale this (or if null values are chosen that they are invalid literals to begin with), a feature that lets you null out a value when an specific type of exception is raised comes in handy. This is map's ``nulExcs`` attribute, which is just a comma separated list of exceptions that should be caught and interpreted as "this is null". If, in the example above, the source would give the magnitude in millimags to save a comma, you could use:: parseWithNull(@VmagSrc, float, "99999")/1000. If parseWithNull here returns None, a TypeError will be raised and caught, and Vmag will be None. You can turn more than one exception into None. 
For example, if magicOffset has been parsed before and could be None,
while magicLit is yet to be parsed and has the empty string as a null
literal, you could write::

  <map dest="magic" nulExcs="TypeError, ValueError"
    >@magicOffset+float(@magicLit)</map>

If magicOffset is None, magic will be None via the TypeError, whereas
empty magicLits will result in Nones via a ValueError.


Some Words on Times
-------------------

Among the messier data types in astronomical databases are dates and
times – they come in lots of crazy input formats, they can be represented
in lots of different ways in the database, they are expected in lots of
crazy output formats, plus there's a host of exciting metadata on them,
including time scales and reference positions.

With DaCHS, we recommend one of the following ways of storing dates and
times (written as attributes of column):

* ``type="double precision" xtype="mjd" unit="d"`` – a modified julian date
* ``type="double precision" unit="d"`` – a julian date
* ``type="double precision" unit="s"`` – a unix timestamp
* ``type="double precision" unit="yr"`` – a Julian year with fractions
* ``type="timestamp"`` – a postgresql timestamp

All other things being equal, we recommend using mjds; most VO data
models and protocols employ them, and they are fairly easy to query
against. In HTML forms, they are easily displayed as human-readable
datetimes by using a ``displayHint="type=humanDate"`` (which you can do
for the others, too, of course).

The Julian years are a good choice, too, and they are immediately
human-readable to some extent. They are certainly the representation of
choice for epochs and equinoxes. Note that the storage of Bessel years is
strongly discouraged. Use the ``bYearToDateTime`` function to transform
them to datetime instances which you can then map to any recommended
representation.

While timestamps might sound like a good idea in that they are the proper
native type to manipulate dates and times with, they usually are a bad
choice. The main reason is that in ADQL there is basically no support of
timestamps at all, which makes any manipulation of them in ADQL queries
virtually impossible. If you're sure your table will never turn up on a
TAP service, that doesn't hurt much, but can you be sure?

All this didn't mention any UCDs or utypes that may apply. UCDs should
not, in general, depend on the time format chosen; all of the above could
be used for quantities like ``time.creation``, ``time.end``,
``time.epoch``, ``time.equinox``, ``time.processing``, ``time.release``,
``time.start``, and more. The SIAP version 1 protocol made a funky
exception there, defining a ``VOX:Image_MJDateObs`` UCD; please forget
that ever happened.

Finally, there is advanced metadata, in particular time zones, time
scales (i.e., how the time passes) and reference positions (i.e., where
the clock is positioned). Time zones are not supported at all in the VO.
All times are for the Greenwich meridian (i.e., they should be close to
UTC). The time scales are important on the level of seconds; they include
TAI (the time scale defined by a bunch of atomic clocks), UTC (TAI with
leap seconds, basically our everyday time), UT, UT0, UT1, UT2 (several
sorts of true times in Greenwich), and TT (Terrestrial Time, a time scale
linked to TAI and used quite a bit in astronomy). More on that can be
found at the fairly readable http://stjarnhimlen.se/comp/time.html.
The reference positions are currently relevant on a level of milliseconds or below; they need to be declared for high precision work since a clock in the barycenter of the solar system will (evaporate but before that) run slower than one on Pluto due to relativistic effects of various sorts. Common reference positions would be TOPOCENTER (the observatory), GEOCENTER (the center of the Earth), BARYCENTER (the barycenter of the solar system) and UNKNOWN (the default, which you should keep unless you are sure; it doesn't matter anyhow for most applications). To declare those, you must include a time phrase in your STC_ declaration in your table. Typically, this could look like this:: TimeInterval TT "timeStart" "timeEnd" Time "dateObs" ... (descriptions and everything else left out for clarity; in particular, for times using double precision almost always is a good idea). Data Elements ------------- We now have a table definition, a grammar, and a rowmaker. For purposes of importing, these three come together in a data element. These elements define what could be seen as the equivalent of a VOTable resource together with a recipe of how to build it. For onDisk tables, a side effect of building the data is that the tables are created in the database; in that sense, data elements also define operations, a notion that will become more pronounced as we discuss incremental processing. Let us assemble the pieces we have so far:: Extinction within the LMC 2009-06-02T08:42:00Z Extinction values in the area of the LMC... Free to use. Author, S. Large Magellanic Cloud Interstellar medium, nebulae Extinction
Extinction values within certain areas on the sky.
raMin, raMax, decMin, decMax, ev_i, a_v, a_i float(@raMin) float(@raMax) float(@decMin) float(@decMax) (@raMin+@raMax)/2. (@decMin+@decMax)/2. coords.Box((@raMin, @decMin), (@raMax, @decMax)) There are two new elements in data. For one, there's sources. Sources specify where the data will find its input files in its pattern attribute. This contains shell patterns that are interpreted relative to the resource directory. You can give multiple patterns if necessary like this:: inp2/*.txt inp1/*.txt There also is a recurse boolean attribute you can use when your sources are distributed over subdirectories of the path part of the pattern. The second new element is ``make``. It ties together a destination table an the rowmaker using id references. You may want to define the rowmaker as a direct child of make, which saves you some referencing. Though make looks quite inoccuous here, it is the element that drives the action. You can have multiple make elements in a single data element to build multiple tables (using different row makers) from the same grammar output. Makes can also carry scripts in SQL or python. For details, see `Scripting <./ref.html#scripting>`_. As you can see, we have put the grammar and the rowmaker into a data element. They could also be direct children of resource, which might be a good idea if they are used in more than one data; you would then give the rowmaker an id (make_table, say) and say something like ```_ element. It is a child of table. In general, index specifications can be rather involved, but simple cases remain simple. If you just wanted to define an index on EVI, you could say:: ... (the columns attribute would be "A_V,EVI" if you wanted an index on both columns). However, indices are not always that simple. For example, for a spatial index on centerAlpha, centerDelta, with the q3c scheme used by the DC software you would have to write something like:: q3c_ang2ipix(centerAlpha,centerDelta) The DC software has a mechanism that helps in this case: `Mixins <./ref.html#mixins>`_. A mixin conceptually is a guarantee of certain table properties, typically of the presence of certain columns; here, it is just the presence of an index. So, all you need to do to have a spatial index on the table is::
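
  <table id="extvals" onDisk="True" adql="True"
      mixin="//scs#q3cindex">
    <!-- a sketch: the mixin reference is the point here; the table id
    is the one assumed throughout this tutorial, and the column
    definitions stay exactly as before -->
    ...
  </table>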
... This is UCD magic at work -- q3cindex selects the columns with pos.eq.*;meta.main as index columns. If you are curious how it does this, check scs.rd in the system RD directory. Mixins actually do much more than just help with indexing. Their main purpose is the definition of interfaces that can be relied upon. For example, an image table must have a certain structure determined by the SIA protocol. The mixins ``//siap#pgs`` and ``//siap#bbox`` make sure that tables have this structure, and they make sure that the table containing information on all the files in the data center is updated when the table is filled. Starting the Ingestion ---------------------- At this point, you can run the ingestion:: gavo imp q By default, ``gavo imp`` creates all data defined in a resource. If this is not what you want, you can explicitely specify a data id to process:: gavo imp q content For larger data sets, it may be wise to first try a couple of rows:: gavo imp --stop-after=300 q Try ``gavo imp --help`` to see more options (most of which are probably irrelevant to you now. By the way, the ``gavo`` command has lots of subcommands. The subcommand here has the full name ``import``; you could have said ``gavo import`` or even ``gavo im``, since any unique prefix into the command list is ok. Try ``gavo --help`` to see the commands available. Note that gavo imp interprets the RD argument as a file first and then as an RD id. An RD id is the inputs-relative path of the RD with the extension stripped. Our example RD thus has the RD id lmcextinct/q, and you could have said:: gavo imp lmcextinct/q from anywhere in the file system. Debugging --------- If nothing else helps you can watch what the software actually sends to the database. To do that, set the GAVO_SQL_DEBUG environment variable to any value. This could look like this:: env GAVO_SQL_DEBUG=1 gavo imp q create The first couple of requests are for internal use (like checking that some meta tables are present). Publishing Data =============== Once a table is in the database, it needs to get out again. Within DaCHS, there are three parties involved in delivering data to the user: * The core; it actually does the computation * The renderer; it formats the result in some way requested by the user and delivers it. There are renderers for web forms, VO protocols, imges, etc. * The service; it holds together the core and the renderer, can reformat core results, controls the metadata, etc. You will usually use pre-specified renderers, so these are not defined in resource descriptors. What you have to define are cores and services. For core, you will usually use the `dbCore <./ref.html#element-dbcore>`_ in custom services, though `many other cores <./ref.html#cores-available>`_ are predefined -- e.g., to run ADQL queries, to upload files, or to do feedback queries --, and you can `define your own <./ref.html#writing-custom-cores>`_ when you need special functionality. The dbCore generates a (single-table) query from condition descriptors and returns a table that you describe through an output table. Cores are defined as direct children of the resource. For the lmcextinction table above, it could look like this:: Cores always need an id. dbCores need a queriedTable attribute, the value of which must be a table reference. This is the table the query will run against. CondDescs define input fields (for the form renderer, these are actually form items people can fill in). 
Most commonly, you will either define them using the ``original`` attribute or using ``buildFrom``. The first case is typically used in connection with protocols and on tables having mixins; such condDescs result in zero or more input fields, and they typically inspect the queried table. For example, the humanScs core in the example locates the "main" positions as identified by UCDs and generates queries against them using two input fields, one it tries to guess a position from, and another for the search radius. When you define your condDesc using buildFrom, the result is almost always a single input field that allows posing restrictions against the column referred to in the buildFrom attribute, which in turn usually is the name of a column in the table queried (though you could use any field using id-based referencing). The software tries to make some useful input definition from that column, which in particular means that the types are "up-valued". String columns can be queried against using Vizier-like string expressions, real and double precision columns using Vizier-like float expressions, and so on. You can suppress that behaviour using more verbose forms explained elsewhere. Renderers other than form will expose the input fields in some other way than form items. In all cases, however, the condDescs of the dbCore define what fields can be queried. The service now ties the core together with a renderer. It might look like this:: lmcext_web While a service can run without a ``shortName``, it can lead to trouble later, so you should make a habit of assigning short names. See `the data checklist <./data_checklist.html>`_ for more information on short names. A service must have an id as well, and its core attribute must contain the id of a core. With this minimal specification, the service exposes a web form-based interface. To try this, run a server:: gavo serve debug and point a browser to http://localhost:8080/lmcextinct/q/cone/form (the host part, of course, depends on your configuration. If you did not change anything there, you should find the data at the given URL). More on Tables ============== Notes ----- Frequently, you need to say more about a column than is appropriate in the few-phrase description. Historically, such situations have been handled using notes. Since notes can be reused for multiple columns, we chose to follow that precedent rather than attach longish information onto the columns themselves. The notes themselves are kept in meta elements belonging to tables. Since the notes tend to be markup-heavy, their default format is restructured text. When entering notes in RDs, there is an attribute ``tag`` on these meta items::
  ...
  The meaning of the flag is as follows:

  =====  ==========
  value  meaning
  =====  ==========
  1      value is 2
  2      value is 1
  =====  ==========
  ...
To associate a column with a note, use the column's note attribute::

  <column name="flag" type="smallint" note="1"/>

As tag, you may use basically any string, but it's a good idea to keep it
to numbers or at least characters not requiring URL encoding.

The notes will be exposed in HTML table heads, table and service
descriptions, etc. If you need to link to one, there is the built-in
tablenote renderer that takes the table and the note from its query path.
The most convenient way to use it is through the built-in vanity name
tablenote, where you would access the note above using a URL like
``http://your.server/tablenote/demoschema.demo/1``.


STC
---

As soon as you have coordinates, you will want to define coordinate
systems on them. In the introductory example, that was not necessary
because SCS mandates that the coordinates you export are in ICRS, so
either your coordinates are in ICRS or you are violating the SCS protocol
-- in either case, nothing to declare.

In the more general case, you will want to say what is what in your
tables. DaCHS uses a language called STC-S to declare systems, reference
points, etc. The STC-S description [TODO: Link to IVOA note] is a bit
terse, but the good news is that you will get by with a few features most
of the time.

STC is defined in children of table elements, with references to table
columns in quoted strings::

  <stc>
    Position ICRS "ra" "dec" Error "e_ra" "e_dec"
  </stc>
  <stc>
    Position FK4 J1950.0 "ra_orig" "dec_orig"
  </stc>

You do not need to change anything in the column definitions themselves,
since the machinery will resolve your column references. If you refer to
non-existing columns, RD parse errors will be thrown.


More on Grammars
================

Row Generators
--------------

TBD

Source Fields
-------------

Grammars can have a sourceFields element. It contains a standard
procedure definition (i.e., you could predefine those and bind
parameters), but usually you will just fill in the code. This code is
called once for each source processed, and receives the sourceToken as
argument. It must return a dictionary, the key/value pairs of which will
be added to all rows returned by the row iterator.

The purpose of sourceFields is to precompute values that depend on the
source ("file") and are constant for all rows within it. An example for
where you need this is when you want to create backlinks to the file a
piece of data came from::

  <sourceFields>
    <code>
      srcKey = utils.getRelativePath(sourceToken,
        base.getConfig("inputsDir"))
      return locals()
    </code>
  </sourceFields>

You can then retrieve the path to the source file via the srcKey key in
rawdicts (and then, using render functions and static renderers, turn
this into links).

In addition to the sourceToken, you also have access to the data that
will be fed from the grammar. This can be used to, e.g., retrieve the
resource directory (``data.dd.rd.resdir``) or data descriptor properties
(``data.dd.getProperty("whatever")``).

Sometimes you want to do database queries from within sourceFields. This
is tricky when you access the table being written or otherwise being
accessed. This is because the sourceFields code runs in the midst of a
transaction updating the table. So, something like::

  base.SimpleQuerier().query(...)

will wait for the transaction to finish. But the transaction is waiting
for data that will only come when the query finishes -- this is a
deadlock, and gavo imp will just sit there and wait (see also
`Deadlocks`_). To get around this, you need to query using the data's
connection. So, instead write::

  base.SimpleQuerier(connection=data.connection).query(...)
Embargoing Products ------------------- One of the sadder facts of life is that many researchers still think they profit from keeping their data proprietary (and that they are entitled to do so even though everything is paid for by the public). To cope with this, DaCHS' products subsystem has the notion of owners and embargo periods. To make a "product" (e.g., spectrum or image) proprietary, in the `products#define`_ application building the rowdict, set the ``owner`` and ``embargo`` keys. Owner is just any string convenient, embargo must eventually become a timestamp, so you'll in general come up with an ISO datetime string or a python ``datetime.datetime`` instance. Here's an example that says images become public a year after the observation:: parseTimestamp(row["DATE_OBS"])+datetime.timedelta( days=365) "danish" "danish.data" This is, in our view, an acceptable policy, but many observers want weird policies (try to talk them out of it, since such behaviour is not nice, and it leads to a bad user experience in the VO as a whole). You can get as fancy (or antisocial) as you like using custom rowfilters, as in the following example that sets a default embargo for the end of 2008, except for calibration frames and the observations of two objects made in 2003:: "maidanak" getEmbargo(row) "maidanak.rawframes" An embargoed product can only be retrieved by the "owner" until the embargo period is over. "Owner" is a concept implemented through the DaCHS user administration, which is fashioned a bit after the Unix user/group model. What you give as ``owner`` in the products mixin is a group name; if someone can authenticate as the member of a group, she can access the data. To create a user (and thus a group), use ``gavo admin adduser``, like this:: gavo admin adduser kroisos notsecret "Remove when xy is public" This command adds the user kroisos with the password notsecret and an optional comment reminding future operators what to do with the identity. Note that the password is stored in clear text in the database – which allows you to handle "I forgot my password" requests gracefully; as long as we only do HTTP Basic authentication, this doesn't matter much since with it, the passwords traverse the net in basically cleartext anyway. Again, this implements a mild deterrence rather than hard security. To add existing users to groups, use ``gavo admin addtogroup``, like this:: gavo admin addtogroup kroisos happy – this adds kroisos to the happy group, and whoever can authenticate as kroisos will be allowed access to any products with the ``owner`` happy. To discover further commands manipulating the user table, try:: gavo admin --help *Important*: When you use authentication, please set the ``[web]realm`` configuration item to some string reasonably characteristic for your site. Many systems will store credentials by realm, and if different sites use the same realm, their credentials will clobber each other. For details see the `customization info in the operators' guide`_ More on Services ================ Custom Templates ---------------- Within the data center, most pages are generated from templates [XXX TODO: write something about them generically]. This is true for the pages the form renderer on services displays as well. 
To effect special effects, you may want to override them (though in general, it is a much better idea to work within the standard template since that will give your service all kind of automatic updates and would make, e.g., changes much easier if your institution undergoes the yearly reorganization). The default response template can be found in resources/templates/defaultsresponse.html in the installed tree. To obtain the plainest output conceivable, try:: No title
Save this to a file within the resource directory, let's say "res/plain.html". Then, say:: in your service; this should do give you a minimally decorated page. Of course, this will display a severely degraded page. To get at least the standard style sheet and the standard javascript, say:: instead of the plain head. More on Cores ============= CondDescs --------- dbCores and cores derived from them take most of their power from condition descriptors or CondDescs. These combine inputKeys, which are basically column objects with some additional presentation-related information, with code generating SQL conditions. A condDesc can contain zero or more input keys (though having zero input keys makes no sense for user-defined condDescs since they would never "fire"). Having more than one input key is useful when input quantities can only be interpreted when present as a group. An example is the standard cone search, where you need both a position and a search radius. Automatic and manual control '''''''''''''''''''''''''''' However, most condDescs correspond to one input key, and the input key is mostly derived from a table column. This is the standard idiom, :: where somecol is a column in the table queried by the core. This construct will cause the an input key to be built from somecol. While doing this, the type will be mapped automatically. The primary rules are: * Numeric types will get mapped to numeric vizier-like expressions * Datetimes will get mapped to date vizier-like expressions * text and chars will get mapped to string vizier-like expressions * enumerated values (i.e., columns with value elements giving options) will not become vizier-like expressions but input keys that yield selection widgets. To have more control (e.g., if you do not want to allow vizier-like expressions, give the input key yourself):: (which would make a column required in the table optional in the query), or:: (which creates an input key matching everything literally), or even:: -- if the input key is required, queries not giving it will be rejected. The title attribute on option gives the label of an option in the HTML input widget; if it's missing, a string representation of the value will be used. In all those cases, the SQL generated from the condDesc is a conjunction of the input key's individual SQL expressions. Those, in turn, are simply comparisons for equality for plain types and more or less arbitrary expressions for vizier expression types. Incidentally, two properties on inputKeys are defined to only show inputs for certain renderers, viz., ``onlyForRenderer`` and ``notForRenderer``. Both have single strings as values. This is intended mainly for cases like SIAP and SCS where there are "human-oriented" versions of the input fields available. The built-in SCS and SIAP conditions already to that, so you can give both scs and humanSCS conditions in a core. Here is how you would define an input key that is only used for the form renderer:: form Phrase makers ''''''''''''' For complete control over what SQL is generated, condDescs may contain code called a phrase maker. This, again, is a procedure application, quite like with rowmaker procs, except that the signature of condDesc code is different. Phrase maker code has the following names available: * inputKeys -- the list of input keys for the parent CondDesc * inPars -- a dictionary mapping inputKey names to the values provided by the user * outPars -- a dictionary that is later used as the parameter dictionary to the query. 
The code should amend the outPars dictionary with the keys mentioned in the the conditions. The conditions themselves are yielded. So, a very simple condDesc with generated SQL could look like this:: outPars["xxyy"] = "x"*inPars.get("val", 20) yield "someColumn=%(xxyy)s" However, using fixed names in outPars is not recommended, if only because condDescs could be used multiple times. The recommended way uses the vizierexprs.getSQLKey function. It takes a name, a value, and the outPars dictionary. It will return a key unique to the query in question and enter the value into the outPars dictionary under that key. While that sounds complicated, it is actually rather harmless, as shown in the following real-world example that lets users input date, time and an interval in split-up form (e.g., when you cannot hope anyone will try to write the equivalent vizier-like expressions):: baseTS = datetime.datetime.combine(inPars["date"], inPars["time"]) dt = datetime.timedelta(minutes=inPars["within"]) yield "date BETWEEN %%(%s)s AND %%(%s)s"%( vizierexprs.getSQLKey("date", baseTS-dt, outPars), vizierexprs.getSQLKey("date", baseTS+dt, outPars)) More on Metadata ================ In general, most metadata for services and resources rather closely follows what's defined in `Resource Metadata for the Virtual Observatory`_; see also the `Reference Manual on RMI-style metadata`_. Authors ------- DaCHS tries to interpret the creator.name metadata as the authors, and it will split strings passed in at semicolons. So, the recommended way to write author lists in DaCHS is "Author1, S.; Author-Two, J.C.; et al" – the et al is ignored when trying to come up with individual names from such a string. For alternatives, see `RMI-Style Metadata in the reference`_. Coverage -------- One tricky spot is coverage, i.e., the parts of the STC space covered by what's in the resource. In general, you will define coverage more or less like this:: AllSky ICRS Optical The easy part is the waveband. Values here are from a fixed set of strings, viz., Radio, Millimeter, Infrared, Optical, UV, EUV, X-ray, Gamma-ray; capitalization is important, and you may give multiple elements (the software doesn't enforce this selection, but your registry documents will become invalid if you use anything else). The coverage.profile meta item has STC-S strings as values. See the `STC-S Note`_ as well as the `STC library documentation`_ for more information on the STC-S understood by DaCHS. In principle, you can get fancy here; for example, you could write:: TimeInterval TT BARYCENTER 1999-10-01T20:30:00 1999-10-02T20:30:10 unit s Error 10 Resolution 1 2 Circle FK5 J1980.0 GEOCENTER 0.13 0.45 0.03 unit rad PixSize 0.0001 0.0001 SpectralInterval HELIOCENTER 2000 6000 unit Angstrom Error 1 RedshiftInterval TOPOCENTER VELOCITY RELATIVISTIC -10 10 unit km/s However, the registries probably evaluate not very much of this information as yet, and you most certainly should try to give positions in ICRS. Copyright --------- Within the astronomical community, licensing issues have traditionally played a minor role – if you referenced properly, using data from other people was not only ok, it was encouraged. We should keep it that way, even in the days of easy reproducability. Still, formal statements about how your data may be used may be useful. These statements are called licenses. RMI has the copyright meta for this purpose. 
Right now, DaCHS doesn't do much with this information; it includes it in VOResource records, and the default response template shows it below the query form. We recommend either specifying something like "The data is in the public domain" or, if you want to use something that's more in line with scientific habits, the `Creative Commons Attribution`_ ("CC-BY"). To support this, DaCHS includes a macro that can be used in meta elements that are direct children of the resource element. Use it like this:: \RSTccby{Image metadata} Usage conditions for individual images could differ. See the COPYING FITS header. The advantage of using the macro is that you get a nice image, and in the future we may expand this to a formal, machine-readable declaration. .. _Creative Commons Attribution: http://creativecommons.org/licenses/by/3.0/ .. _Reference Manual on RMI-style metadata: ./ref.html#rmi-style-metadata .. _STC library documentation: ./stc.html .. _STC-S Note: http://www.ivoa.net/Documents/Notes/STC-S/ Active Tags =========== Active "tags" delemit elements within resource descriptor XML that do not directly contribute to result tree. Their typical use is to "record" event sequences and replay them later. Much of this is used internally. However, some applications of active tags are interesting for RD writers, too. Active tags always have names in all upper-case. LOOP ---- Loop lets you create multiple elements by rules. The simplest way to use it is by giving a space-separated list of "items":: The ``events`` child of the ``LOOP`` element creates a list of events (think "begin column element", "value for name attribute", "end column element"). These events are then replayed to the parser for each item in the LOOP's ``listItems`` attribute. Each occurrence of the ``\\item`` macro is replaced with the current item. So, in the resulting RD tree, the fragment above will have the same result as:: Sometimes the list items are used in multiple places in the same document. To avoid having to maintain multiple lists, you can define macros using RD's ``macDef`` element; this could look like this:: U B V R
parseFromString(MAG_\item) .... Note that macro names must be at least two characters long. Frequently, the loop variable should not just take on a single string. For such cases you can feed in tuples. The most convenient way to do this is ``csvItems``. The content of this element is a string literal containing comma separated values *with labels*, i.e., parsable with python's csv.DictReader. In your events, you can then refer to the labeled items using macros. For example:: band,source U 10-12 V 13-16
\source
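A complete LOOP using ``csvItems`` could look roughly like this; the
event content (a column built from the ``\band`` and ``\source`` macros)
is made up for illustration::

  <LOOP>
    <csvItems>
      band,source
      U,10-12
      V,13-16
    </csvItems>
    <events>
      <column name="mag\band" tablehead="m_\band"
        description="Magnitude in the \band band (input columns \source)"/>
    </events>
  </LOOP>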
TODO: EDIT actives? Publishing DAL Services ======================= DAL is VO-speak for "Data Access Layer", the standard protocols the VO uses to allow remote querying of data. To support such a protocol, you usually need to arrange things in three places: * The table queried needs a certain set of columns * The core must support certain input and output fields * The renderer must exhibit specified behaviour as regards, e.g., the formatting of error messages, and it may require protocol-specific metadata This section discusses the individual protocols in turn. SCS --- SCS, the simple cone search, is the simplest IVOA DAL protocol -- it is just HTTP with RA, DEC, and SR parameters plus a special way to encode errors (in a way somewhat different from what has been specified for later DAL protocols). Tables '''''' In principle, SCS can expose any table that has a exactly one column each with the UCDs ``pos.eq.ra;meta.main``, ``pos.eq.dec;meta.main``, and ``meta.id;meta.main``. The query is then ran against the position specified in this way. However, you almost always want to have a spatial index on these columns. To do that, use the ``//scs#q3cindex`` mixin on the tables, like this:: ... Finally, note that to have a valid SCS service, you must make sure the output table always contains the three required columns (as defined by the UCDs given above. To ensure that, these columns' ``verbLevel`` attribute must be 10 or less (we advise to have it at 1). Cores ''''' The SCS core simply is a dbCore. You must include the SCS condDesc, like this:: There is an alternative condDesc more suitable for humans. They can be used in parallel. The form renderer will then use the human-oriented one, the DAL renderer the protocol one. You'll get this by writing:: Although not required by SCS, we recommend to also include a MAXREC argument that lets people change the match limit in the SCS service (for the web service, the database widget already provides this functionality). A usable definition for it is given in the SCS RD in a STREAM with the id coreDescs, together with the two condDescs above. So, here's the recommended way to build a bare-bone SCS service:: SCS allows more query parameters; you can usually use condDesc's buildFrom attribute to directly make one from an input column. If you want to add a larger number of them, you would use an active tag:: Service ''''''' To expose that core through a service, just allow the scs.xml renderer on it. As the core is built, you can have a web-based form interface for free:: Nice Catalog Cone Search NC Cone 10 10 0.01 The meta information given is used when generating registration records. In particular, you should make sure that a query with the given ra, dec, and sr actually returns some data. SIAP ---- DaCHS' SIAP implemention right now assumes you are publishing FITS files with WCS headers. Other arrangements are of course possible, but you'd have to write your own computeXXX procDef. Tables '''''' SIAP-capable tables should mix in ``//siap#pgs`` (the older ``//siap#bbox`` is deprecated; you could still use it if for some reason you have no pgSphere). So, in the simplest case, a table that's going to be published through SIAP would look like this::
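
  <table id="images" onDisk="True" mixin="//siap#pgs">
    <!-- a minimal sketch; the table id is an assumption.  The
    //siap#pgs mixin contributes all the columns SIAP needs. -->
  </table>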
(of course, you can add more columns if you need them). Filling this table requires the use of a rowfilter and two procedure applications. Let's look at a data element for this table:: PI_COI "cars.images" vars["imageTitle"] "%s, %s"%(vars["OBSERVAT"], vars["TELESCOP"]) vars["dateObs"]+vars["startTime"]+( vars["endTime"]-vars["startTime"])/2 vars["FILTER"] This does, step by step: * When ingesting images, you will almost always read from FITS images, i.e., FITS primary headers. A ``fitsProdGrammar`` delivers the key-value-pairs from a header as a rawdict. * The ``qnd`` attribute of the grammar is recommended. It makes some (weak) assumptions to yield significant speedups with large images. * The ``fitsProdGrammar`` will map keys with hyphens to names with underscores, which is required to make them accessible in rowmakers. The ``map``` example above therefore is superfluous since it orders default behaviour. You may need other (non-automatic) name mappings, though, which would work analoguously. * The grammar further needs a rowfilter. These are procedure applications working on rawdicts. The `products#define`_ rowfilter lets you add keys on owners and embargo in case you want password protection for images, but most importantly it defines what table the data is destined for. This is crucial information, and if you ever get it wrong, you need to manually connect to the database and issue a command like ``DELETE FROM products WHERE sourcetable=''``. So, always bind table. Make sure to include the quotes, this is supposed to be a valid python expression yielding a string. * You then need to define a rowmaker that must apply two procs. For one, you need `computePGSSIAP `_ (if you mixed in ``pgsSIAP``). No bindings are required here. * The second proc application required is `setSIAPMeta `_ . Try to give all its keys somewhat sensible values, you will make your users' lives much easier. Warning: Do *not* use idmaps="*" with SIAP, since the auto-generated mappings will clobber the work of the xSIAP procs. Cores ''''' TBD. For the SIAP cutout core, the SIAP human condDesc must have ``required`` True, since the core will retrieve the default cutout size from the field size. The SIAP protocol condDesc is required anyway. Service ''''''' TBD. SSAP ---- Tables '''''' Currently, we only support "homogeneous" data collections, i.e., tables for which every data set was generated by the same instrument, code, or similar. Those mix in ``//ssap#hcd``. This mixin has lots of parameters that define the instrument; see `the SSAP HCD mixin in the ref doc <./ref.html#the-ssap-hcd-mixin>`_. For example, you could say:: //ssap#hcd
To fill such a table, it is recommended to use the `products#define`_ rowfilter and the `ssap#setMeta`_ rowmaker apply. This could look like this:: "\schema.data" @FILENAME "ivo://org.gavo.dc/ccd700/q#"+@FILENAME Caution: In the ssa table, we force the spectral axis to be a wavelength in meters. You must convert all values manually if necessary. For the spectra themselves you could use different units, but in our experience that's more confusing than helpful. In contrast to images where delivering FITS is likely all you need, there's a plethora of formats spectra are delivered in. To help a bit, you should make sure one of the formats you offer are VOTables conforming to the spectral data model (see `Making SDM Tables`_). If you want to deliver the "native" format as well, you'll have to have two rows for each spectrum. The standard way to achieve that is through a rowmaker in the grammar importing the spectra, like this:: baseAccref = os.path.splitext(row["prodtblPath"])[0] row["prodtblAccref"] = baseAccref+".txt" row["prodtblMime"] = "text/plain" # this is the file as delivered from upstream yield row row["prodtblAccref"] = baseAccref+".vot" row["prodtblPath"] = "dcc://\rdIdDotted/mksdm?"+baseAccref+".txt" row["prodtblMime"] = "application/x-votable+xml" # this is our processed SDM VOTable yield row SSAP's FORMAT parameter lets clients select what they want. The way the default FORMAT argument works, only application/x-votable+xml records are considered compliant. FITS files with spectra are a nasty chapter. Most of the FITS spectra out there currently are basically 1D images. Use an image/fits MIME type for those; application/fits is reserved for FITS binary tables conforming to the spectral data model; chances are you'll have to build those yourself. Cores ''''' Use the ssapCore for SSAP services. You must manually feed in the condition descriptors for the SSAP parameters. For homogeneous data collections, this is:: The ``hcd_condDescs`` includes condition descriptors for all mandatory and optional parameters meaningful in the case of homogeneous data collections (i.e., excluding those that match against constant values). Some of them may not be relevant to your service because your table never has values for them. For example, theoretical spectra will typically not give information on positions. The SSAP spec says that such a service should ignore POS rather than returning the empty set. If you think you must ignore certain conditions, you can use the PRUNE active tag. This looks like this:: Do not do this just because you don't have position information -- this would mean that you would dump your complete archive for (typical) queries with a position, and that is neither required by the spec (even if you might think so at first reading) nor desirable. Here is a table of parameter names and ids; you can always check them in ``$gavo_installed/resources/inputs/__system__/ssap.rd``. ============== =========== Parameter name condDesc id -------------- ----------- POS, SIZE coneCond BAND bandCond TIME timeCond ============== =========== For APERTURE, SNR, REDSHIFT, TARGETNAME, TARGETCLASS, PUBDID, CREATORDID, and MTIME, the condDesc id simply is ``_cond``, e.g., ``APERTURE_cond``. To have custom parameters, simply add condDesc elements as usual:: For SSAP cores, ``buildFrom`` will enable "PQL"-like query syntax such that users can post arguments like ``20000/30000,35000`` to ``t_eff``. Service ''''''' To expose SSAP services, use the `ssap.xml renderer`_. 
The metadata keys required for registration of these are documented in
the reference manual.  A complete declaration of a published SSAP
service would, in addition, carry metadata such as a short name (e.g.,
``mydata SSAP``), a data source (e.g., ``theory``), a creation type
(e.g., ``archival``), and a test query (e.g., ``MAXREC=1``).  Such a
service will expose all standard SSAP query parameters, and
additionally any condDescs built from columns in the source table, like
``t_eff`` and ``log_g`` (see above).

Incidentally, in web versions of such services, you may want to have
specview-based "quick-view" links based on the ``run`` system RD that
exposes the specview template; this is done with an ``outputTable``
that resides in the service element.  Some less code-heavy approach
would be welcome, but we'd need to collect some experience on what
people expect there.  Also note that specview is (or possibly was, when
you're reading this) very picky about what it accepts as VOTables; a
``dm=sed`` parameter is used to instruct DaCHS' SDM-making machinery to
come up with a table palatable to current specviews.

.. _ssap.xml renderer: ./ref.html#the-ssap-xml-renderer

Making SDM Tables
'''''''''''''''''

Compared to images, the formats situation with spectra is a mess.
Therefore, in all likelihood, you will need some sort of conversion
service to VOTables compliant to the spectral data model.  DaCHS has a
facility built in to support you with this.

First, you will have to define the "instance table", i.e., a table
definition that will contain a DC-internal representation of the
spectrum according to the data model.  There's a mixin for that,
``//ssap#sdm-instance``, used roughly as sketched below.
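A minimal sketch of such an instance table might read as follows; note
that the ``ssaTable`` parameter name is an assumption here, so check
the mixin's reference documentation for the actual parameters::

  <table id="instance">
    <mixin ssaTable="data">//ssap#sdm-instance</mixin>
    <!-- add further columns (flux errors and the like) as needed -->
  </table>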
In addition to adding lots and lots of params, the mixin also defines
two columns, ``spectral`` and ``flux``; these have units and UCDs as
taken from the SSA metadata.  You can add additional columns (e.g., a
flux error depending on the spectral coordinate) as required.

The actual spectral instances get built by sdmCores.  These cores,
while potentially useful with common services, are intended to be used
by the product renderer for dcc product table paths.  They contain a
data item that must yield a primary table that is basically SDM
compliant.  Most of this is done by the ``//ssap#feedSSAToSDM`` apply
proc, but obviously you need to yield the spectral/flux pairs (plus
potentially more stuff like errors if your spectrum table has more
columns).  This comes from the data item's grammar, which probably must
always be an embedded grammar, since its sourceToken is an SSA row in a
dictionary.  Here's an example::

  labels = ("spectral", "flux")
  relPath = self.sourceToken["accref"].split("?")[-1]
  with self.grammar.rd.openRes(relPath) as inF:
      for ln in inF:
          yield dict(zip(labels, ln.split()))

The sdmCores are always combined with the sdm renderer.  It passes an
accref into the core that gets turned into a row from the queried
table; this must be an "ssa" table (i.e., right now something that
mixes in ``//ssap#hcd``).  This row is the input to the embedded data
descriptor.  Hence, this has no sources element, and you must have
either a custom or embedded grammar to deal with this input.  The
actual data has to be located by the grammar; if the spectra are in
text files, you could have a grammar for parsing those somewhere in the
RD (TODO: example), or you could have the actual spectral data in the
database.  Either way, the grammar has to return spectral and flux
values.  Also make sure that what you return actually has the units
promised by the metadata.

To set the params from the SSA row, use the ``//ssap#feedSSAToSDM``
apply procDef in a ``parmaker``; this should mostly suffice in terms of
metadata definition.  When you have no additional columns, the default
rowmaker (with ``idmaps="*"``) will do in the ``make`` of the spectrum
table.

Supporting getData
''''''''''''''''''

DaCHS supports the preliminary getData specification by `Demleitner and
Skoda (2012)`_.  This means that you can emit spectra in lots of
different formats, do cutouts and simple normalization.

Of course, to do all this, DaCHS must again be taught to understand the
spectra.  This works as explained in `Making SDM Tables`_.  In the
example there, the embedded data element already has the id
``getdata``, ready to be referenced.  To enable getData on an SSA
service, just add a property called ``tablesource`` to it, pointing to
this data element, like so::

  ...
  <property name="tablesource">getdata</property>
  ...

Note, however, that to make that work, the spectral coordinate must be
a wavelength (but this is already true for the rest of the current
spectra handling system), and the wavelength must be in what was given
as spectralSI in the SSA mixin.

.. _Demleitner and Skoda (2012): http://docs.g-vo.org/ssaevolution.html

ObsTAP
------

ObsTAP is basically a single table, ivoa.ObsCore.  In DaCHS, this is a
view generated from input tables.  To include the products within a
table in this view, you must use one of the mixins from the
``//obscore`` RD and fill out some of the mixin's parameters.  There is
some documentation on what to put where in the mixin documentation,
but frankly, as a publisher, you should have at least passing knowledge
of the obscore data model as laid down in `Tody et al (2011)`_.
In the simplest case, a SIAP table, you can get by with simply adding::

  mixin="//obscore#publishSIAP"

to the table definition's start tag.  You do not have to re-import a
table to publish it to obscore after the fact --
``gavo imp -m <rd-id> && gavo imp //obscore create`` will include an
existing table in the obscore view.

Even for SIAP, you will usually want to add metadata not contained in
DaCHS' SIAP meta.  To do this, add a mixin element to the table
definition's body::

  <mixin>//obscore#publishSIAP</mixin>

On a table import, the obscore table will automatically be recreated to
include the data.  If you retrofit ObsCore support to large tables, you
can avoid having to re-import everything by adding the mixin clause and
then updating the metadata.  In that case, you must manually remake the
obscore table::

  gavo imp -m path/to/my/rd
  gavo imp //obscore create

For SSAP tables, there is an ``//obscore#publishSSAP`` mixin that works
like its SIAP cousin (see the reference documentation for details).  If
you have "custom" tables, have a look at what GAVO does for its
califa/q resource.

.. _Tody et al (2011): http://www.ivoa.net/Documents/ObsCore/index.html

Publishing DaCHS-managed tables via TAP
---------------------------------------

In the simplest form, all you need to do to publish a table through the
TAP endpoint is to add an ``adql="True"`` attribute to the table
definition and update the metadata (by saying ``gavo imp -m <rd-id>``).
You should, however, take particular care that there's a useful
description of the table, usually as a direct meta on the table.  Keep
in mind that people will stumble across the table in some sort of
registry and need to be able to figure out whether the table contains
useful data by that description and the column metadata alone.

The TAP endpoint only exposes rather limited metadata.  At least when
there is no published service on the table, you may want to just
publish the data to the registry, too.  This leads to a much richer set
of metadata, increasing people's chances of being able to locate the
data.  To publish a nonservice (usually a table definition, but you can
register data descriptors containing multiple tables, too), use the
`register Element <./ref.html#element-register>`_.  For a simple table,
just writing ``<register/>`` is enough, since the set name defaults to
``ivo_managed`` and ADQL-accessible tables are automatically related to
the TAP service.  When ``register`` is the child of a data item, you
need to manually declare that child tables are TAP-accessible, like
this::

  <register services="//tap#run"/>

Another thing you might want to do when publishing tables to TAP is add
sample queries for them.  As an extension to the usual tap_schema,
DaCHS has an example table giving a name, a query, and a description.
TAP clients may exploit these examples to help users figure out what to
usefully do with more arcane tables, and of course you can explain more
interesting features of your server or data here.

To add an example, create a file with a name ending in ``.sample`` in
``$GAVO_INPUTS/__system__/adqlexamples/``.  The grammar for these files
is defined in ``//tap#import_examples``.  You write the three keys,
viz., name, query and description, in this sequence, each followed by a
double colon and any material you want in the field.  The keys must
start at the beginning of a line.  You must add a double period at the
end of the file, and there is one file per example.
Here's what this should look like::

  name::katkat bibliography
  query::
    select * from katkat.katkat where gavo_hasword('variable', source)
    and minEpoch<1900
  description::To search for title (or other) words in the source field
    of :taptable:`katkat.katkat` (or in some other similar query), use
    the gavo_hasword locally defined function.  This basically works a
    bit like you'd expect from search engines: case-insensitive, and
    oblivious to any context.
  ..

The description can be reStructured text.  We have added an interpreted
text role ``taptable`` (see the description above for how it is used).
Its effect is to declare the table name that follows as "pertinent" to
the query.  Smart clients would then show that query together with the
table metadata in their metadata browsers.

After adding an example, run ``gavo imp //tap import_examples`` to
update the database table.  The result of this query is cached in the
server, so to see the result on TAP's example endpoint, you need to
reload the tap service (e.g., by going to /tap, logging in as
gavoadmin, clicking on "Admin me" and then on "Reload RD").

Publishing existing tables via TAP
----------------------------------

If you already have a database table and now want to use DaCHS to
publish it via TAP, just write an RD as described above, except that
the data element is trivial; a sketch of how that could look is given
below.
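In this sketch, the schema ``mydata``, the table id ``main``, and the
column ``objid`` are placeholders for your own names; check the
attributes against the reference documentation::

  <resource schema="mydata">
    <meta name="title">My great table</meta>
    <meta name="description">... (more metadata) ...</meta>

    <table id="main" onDisk="True" adql="True">
      <column name="objid" type="text"
        description="id of object covered here"/>
      <!-- further columns describing the existing table -->
    </table>

    <data id="tables">
      <make table="main"/>
    </data>
  </resource>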
Within the data element, you need one make for each table you want in
ADQL; while a make would normally cause the tables to be created on a
plain ``gavo imp``, in the present context it just says something like
"put the table metadata into DaCHS' internal catalogs".

After that, say ``gavo imp -m <rd-id>``; make sure you don't forget the
``-m``, because without it, ``gavo imp`` will drop the existing tables
if it can, i.e., if gavoadmin has write access to the schema in
question, and it should have that for reasons explained in the next
paragraph.  This adds the metadata you've given to all kinds of
administrative tables DaCHS keeps but does not touch the data.  It will
also try to fix the permissions of the table such that DaCHS's
untrusted user can read it.

To let DaCHS manage the permissions, in psql say (assuming standard
profiles)::

  GRANT ALL PRIVILEGES ON SCHEMA <schema> TO gavoadmin WITH GRANT OPTION;
  GRANT SELECT ON <schema>.<table> TO gavoadmin WITH GRANT OPTION;

If you have local users accessing the table, you should declare them in
either the allRoles or readRoles attributes of the table definition.
Maybe even adapting the profiles in GAVOROOT/etc to match your existing
infrastructure could make sense.

Also do not forget that people should have some way to locate your data
collection (i.e., the table(s) that you are exposing).  If you have
sufficient metadata defined -- basically as for services -- you can
register your data collection.  To do this, just add an empty
``<register/>`` element to either a table definition or, more
conveniently in multi-table setups, a data element for your data
collection.  The defaults for register are publication to the VO and,
for ADQL-exposed tables, being served by the TAP service, which is
about what you want in this situation.

Don't forget that you need to execute::

  gavo pub the/rdid

to make DaCHS actually publish the table.

The Registry Interface
======================

Conceptually, the VO's Registry is a set of resource records (i.e.,
descriptions of services, data, or other entities) that lets users
locate resources relevant to them (e.g., look for a service giving
surface temperatures for OB stars).  Whatever is described by a
resource record is called a *VO resource* in the following, to keep
these apart from what DaCHS resource descriptors describe; a DaCHS RD
may describe zero, one, or multiple VO resources.  We apologize for the
confused nomenclature.

Physically, there are several services that keep and update this set
and let people query them (a "full registry"), e.g., the
`VAO registry`_, the `ESAVO registry`_, or the Astrogrid registry.  All
these should harvest each other and thus have identical content (this
is currently not always true).

To be part of the VO, you have to register your services.  DaCHS makes
this fairly easy since it contains a publishing registry.  This is
again a service that exposes a standard interface defined by the Open
Archives Initiative.  There is a renderer for the OAI harvesting
protocol (`OAI-PMH`_) called ``pubreg.xml`` that goes together with
``registryCore``.  The service ``//services#registry`` with this
renderer has a vanity name of ``/oai.xml``, which is your data center's
publishing registry "endpoint".  Full registries obtain the resource
records present in your data center from there.

Each VO resource has a unique identifier of the form::

  ivo://<authority>/<resource key>

where the authority is defined by the DaCHS software (more precisely,
by your configuration; see below), and ``<authority>/<resource key>``
must be a globally unique string.
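For illustration, with an assumed authority of ``org.example.dc`` and a
resource key derived from the RD and service involved, such an
identifier might read::

  ivo://org.example.dc/mydata/q/ssa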
It is recommended that you use your DNS name (or some appropriate part
of it), which will provide some uniqueness.  The authority is declared
in your gavorc (see below).  Details on VO identifiers can be found in
`IVOA Identifiers`_.

To claim an authority, you have to define who you -- as an organization
-- are.  For this, DaCHS will create a resource record for your
organization, too, where "your organization" for DaCHS means whatever
you give as creator.name in defaultmeta (see below), which in general
should be something like "My Institute Data Center" rather than "My
Institute".  You can register "My Institute" as well, if you want, but,
the way things are written now, not as the entity managing the
authority.

To make the VO aware of the existence of your data center, you will
need to tell the `RofR`_ (Registry of Registries) about it.  Before you
can do this, you need to fill in quite a bit of information in your
gavorc and ``etc/defaultmeta.txt``.  The `registry section in the
operator's guide`_ has information on what to do.

.. _ESAVO registry: http://esavo.esa.int/registry/index.jsp
.. _VAO registry: http://nvo.stsci.edu/vor10/index.aspx
.. _IVOA identifiers: http://www.ivoa.net/Documents/REC/Identifiers/Identifiers-20070302.html
.. _RofR: http://rofr.ivoa.net/
.. _OAI-PMH: http://www.openarchives.org/OAI/openarchivesprotocol.html
.. _registry section in the operator's guide: opguide.html#registry-matters
.. _Resource Metadata for the Virtual Observatory: http://www.ivoa.net/Documents/latest/RM.html
.. _products#define: ref.html#products-define
.. _ssap#setMeta: ref.html#ssap-setmeta
.. _customization info in the operators' guide: opguide.html#adapting-dachs-for-your-site
.. _RMI-Style Metadata in the reference: ref.html#rmi-style-metadata