===================
GAVO DaCHS Tutorial
===================

.. contents::
  :depth: 2
  :backlinks: entry
  :class: toc


Ingesting Data
==============

Starting the RD
---------------

To ingest data, you will have to write a resource descriptor (RD).
We recommend keeping everything handled by a specific RD together in
one directory that is a direct child of your inputs directory (see
installation and configuration), though you could group resources in
deeper subdirectories.  So, go to your inputs directory and say::

  mkdir lmcextinct

The directory name will (normally) appear in URLs, so it's a good
idea to choose something descriptive and short.  This directory is
called the resource directory.

We recommend putting the RD in the root of this directory.  A good
default name for the RD is "q.rd"; the "q" will appear in the default
URLs as well and usually looks good in there::

  cd lmcextinct
  vi q.rd

(where you can substitute vi with your favourite editor, of course).

Writing resource descriptors is what most of the operation of a data
center is about.  Let's start slowly by giving some metadata::

  <?xml version="1.0" encoding="iso-8859-1"?>

  <resource schema="lmcextinct">
    <meta name="title">Extinction within the LMC</meta>
    <meta name="creationDate">2009-06-02T08:42:00Z</meta>
    <meta name="description">
      Extinction values in the area of the LMC...
    </meta>
    <meta name="copyright">Free to use.</meta>
    <meta name="creator.name">S. Author</meta>
    <meta name="subject">Large Magellanic Cloud</meta>
    <meta name="subject">Interstellar medium, nebulae</meta>
    <meta name="subject">Extinction</meta>
  </resource>

You need to adapt the encoding attribute in the prefix to match what
you are actually using if you plan on using non-ASCII characters.
You may want to use utf-8 instead of the iso-8859-1 given above,
depending on your computer's setup.

The schema attribute on resource gives the schema that tables for
this resource will turn up in.  You should, in general, use the
subdirectory name.  If you don't, you have to give the subdirectory
name in a resdir attribute; this must be the name of the resource
directory relative to the inputs directory specified in the
configuration.

Otherwise, there is only meta information so far.  This metadata is
crucial for the later registration of the service.  In HTML forms,
it is displayed in a sidebar.  See `RMI-style metadata
<./ref.html#rmi-style-metadata>`_ in the reference documentation.


Defining Target Tables
----------------------

Within the DC, data is represented in database tables, while
metadata is mostly kept within the resource descriptors.  A major
part of this metadata is the table structure.  It is defined in
table elements, which usually are direct children of the resource
element.  A resource element may contain multiple table definitions.
Such a table definition might look like this::

  <table id="main" onDisk="True" adql="True">
    <meta name="description">Extinction values within certain
      areas on the sky.</meta>
    ...
  </table>
In a table definition, you must give an id, which will double as the
table name within the database.  The onDisk attribute specifies that
the table is to reside on disk as opposed to in memory (in-memory
tables have applications in advanced operations).  The adql attribute
specifies that no access restrictions are to be placed on the table:
if you run an ADQL or TAP service, users can access this table.

Table elements may contain metadata.  You do not need to repeat
metadata given for the resource, because (in most cases) the DC
performs metadata inheritance: if a table is asked for a piece of
metadata it does not have, it forwards that request to the embedding
resource.

Defining Columns
''''''''''''''''

The main content of a table is a sequence of column elements, each
containing the description of a single table column.

The name attribute is central in that it will be the column name in
the database, the key for the column's value in the record
dictionaries the software uses internally, and the usual handle for
referencing the column from the outside.  In DaCHS, column names
must be legal identifiers for both python and SQL; SQL quoted
identifiers thus are not allowed.

The type attribute defaults to real if not given and can otherwise
take values in valid SQL datatypes.  In addition to real, the DC
software knows how to handle:

* text -- a string.  You can also use types like char(7) and the
  like, but since that does not help postgres (or much anything else
  within the DC), this is not recommended.
* double precision (or double) -- a floating point number.  You
  should use doubles if you need to keep more than about 7 digits of
  mantissa.
* integer (or int) -- typically a 32-bit integer
* bigint -- typically a 64-bit integer
* smallint -- typically a 16-bit integer
* timestamp -- a combination of date and time.  While postgres can
  process a very large range of dates, the DC stores timestamps in
  datetime.datetime objects, which means that for "astronomical"
  times (like 10000 B.C. or 10000 A.D.) you may need to use custom
  representations.  Also, the DC assumes all times to be without
  time zones.  Further time metadata (like distinguishing TT from
  UT) is given through STC specifications.
* date -- a date.  See timestamp.
* time -- a time.  See timestamp.
* box -- a rectangle.

Some more types (like raw and file) are available to tables in
service definitions, but they should, in general, not appear in
database tables.

Further metadata on columns includes:

* unit -- the unit the column values are in.  The syntax is that
  defined by VizieR, but that may change pending further
  standardization in the VO.  Unit is left out for unitless values.
* tablehead -- a very short string designating the content.  This
  string is typically used for display purposes, e.g., as a table
  heading or a label on input fields.
* description -- a longer string characterizing the content.  This
  may end up in bubble help or VOTable descriptions.
* ucd -- a Unified Content Descriptor as defined by the IVOA.  To
  figure out "good" UCDs, the UCD resolver at
  http://dc.zah.uni-heidelberg.de/ucds/ui/ui/form can help.
* required -- True if a value must be set in order for the record to
  be valid.  By default, NULL (which in python is None) is a valid
  value for any column.  For required columns, that is no longer the
  case.  This is particularly important in connection with foreign
  keys.
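Combining these, a full column definition for, say, the color excess
in our example table might read like this (the UCD and the wording of
the strings here are just an illustration)::

  <column name="EVI" type="real"
    unit="mag" ucd="phot.color.excess"
    tablehead="E(V-I)"
    description="Color excess E(V-I) averaged over the area"/>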
Parsing Input Data
------------------

After you have defined the table, you will want to fill it.  You
will usually have one or more input files with "raw" data.  We
recommend putting such input data files into a subdirectory of their
own, named "data".

Let's assume we have one input file for the table above, called
lmc_extinction_values.txt.  Suppose it looks like this, where tabs
in the input are shown as "\\t"::

  RA_min\tRA_max\tDEC_min\tDEC_max\tE(V-I)\tA_V\tA_I
  78.910625\t78.982146\t-69.557417\t-69.480639\t0.04\t0.092571\t0.123429
  78.910625\t78.982146\t-69.480639\t-69.403861\t0.05\t0.115714\t0.154286
  78.910625\t78.982146\t-69.403861\t-69.327083\t0.05\t0.115714\t0.154286

The first step of ingestion is lexical analysis.  In the DC
software, this is performed by grammars.  There are many grammars
available, e.g., for getting values from FITS files, VOTables, or
column-based formats; you can also write specialized grammars in
python.  All grammars read "something" and emit a mapping from names
to (mostly) string values.

reGrammars
''''''''''

In this case, the easiest grammar to use probably is the `reGrammar
<./ref.html#element-regrammar>`_.  The idea here is that you give
two regular expressions that split the file into records and the
records into fields, and that you simply enumerate the names used in
the mapping.  For the file given above, the reGrammar definition
could look like this::

  <reGrammar topIgnoredLines="1">
    <names>
      raMin, raMax, decMin, decMax, EVI, AV, AI
    </names>
  </reGrammar>

The names given are values of the name attribute in the table
definition.

If you check the documentation on reGrammar, you will notice that
"names" is an "atomic child" of reGrammar.  Atomic children are
usually written as attributes, since their values can always be
represented as strings.  However, when strings become longer, it is
more convenient to write them in elements.  The DC software allows
you to do just that in general: all attributes can be written as
elements with tags named like the attribute.  So, ::

  <reGrammar topIgnoredLines="1"
    names="raMin, raMax, decMin, decMax, EVI, AV, AI"/>

would have worked just fine, as would::

  <reGrammar>
    <topIgnoredLines>1</topIgnoredLines>
    <names>raMin, raMax, decMin, decMax, EVI, AV, AI</names>
  </reGrammar>

Structured children, in contrast, cannot be written as plain strings
and thus can only be given in element notation.

Though grammars can be direct children of resource, they are usually
written as children of data elements (see below).

columnGrammars
''''''''''''''

Another grammar frequently useful when reading from text tables is
the `columnGrammar <./ref.html#element-columngrammar>`_.  It allows
a rather direct translation of VizieR-like "byte-by-byte"
descriptions.  Column grammars contain ``col`` elements, the ``key``
attributes of which give the ``name``\ s of the target columns (or
auxiliary identifiers that you work with in your rowmaker), like
this::

  <columnGrammar topIgnoredLines="1">
    <col key="raMin">1-9</col>
    <col key="raMax">10-18</col>
    ...
  </columnGrammar>

The first column has the index 1, and -- contrary to python slices
-- the last index is included in the selection.  No expansion of
tabs or similar is performed.  As potential column names, the keys
must be valid python identifiers.

Mapping data
------------

A grammar produces a sequence of mappings from names to strings, the
rawdicts.  The database, on the other hand, wants typed values
(integers, time stamps, etc.), internally represented as
dictionaries mapping column names to values, called rowdicts.  Also,
data in input tables is frequently given in inconvenient formats
(e.g., sexagesimal angles), in units not suited to further
processing, or distributed over multiple columns (e.g., date and
time of an observation when we want a single timestamp).  It is the
job of row makers to transform the rough data coming from a grammar
into whatever the table defines.
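In python terms, the first data line of our input file would be
transformed roughly like this (the literals are purely illustrative;
bbox is left out since its value is an internal object)::

  # rawdict as emitted by the grammar: all values are strings
  rawdict = {
    "raMin": "78.910625", "raMax": "78.982146",
    "decMin": "-69.557417", "decMax": "-69.480639",
    "EVI": "0.04", "AV": "0.092571", "AI": "0.123429",
  }

  # rowdict ready for ingestion: typed values, including
  # the computed columns
  rowdict = {
    "raMin": 78.910625, "raMax": 78.982146,
    "decMin": -69.557417, "decMax": -69.480639,
    "EVI": 0.04, "AV": 0.092571, "AI": 0.123429,
    "centerAlpha": 78.9463855, "centerDelta": -69.519028,
  }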
Basically, a row maker consists of

* `var <./ref.html#element-var>`_ s -- assignments of expression
  values to names in the rawdict,
* procedure applications (see `apply <./ref.html#element-apply>`_)
  -- procedural manipulations of both rawdicts and rowdicts,
* maps -- definitions of the rowdict entries.

When building a rowdict for ingestion into the database, a rowmaker
first binds var names, then applies procedures, and finally runs the
mappings.

For simple cases, maps will suffice; you may actually even be able
to do without them.  Maps must specify a dest attribute giving the
rowdict key that is defined.  To specify the value, they can

* either give a src attribute specifying a rawdict key that will
  then be converted to a typed value using "sane" defaults (e.g.,
  integers will be converted by python's int constructor, where
  empty strings are mapped to None),
* or give a python expression in the character content, the value of
  which is then directly used as the value for dest.  No implicit
  conversions are performed in this case.

In the case above, you could start by saying::

  <rowmaker>
    <map dest="EVI" src="EVI"/>
    <map dest="AV" src="AV"/>
    <map dest="AI" src="AI"/>
  </rowmaker>

to copy over the rawdict (grammar) keys that directly map to table
column names.  Since this is a bit unwieldy, the DC provides a
shortcut::

  <rowmaker simplemaps="EVI:EVI,AV:AV,AI:AI"/>

which expands to exactly what is written above.  The keys in each
pair do not need to be identical; the first item of each pair is the
table column name, the second the rawdict key.

The case where the names of rawdict and rowdict keys are identical
is so common (since the RD author controls both) that there is yet
another shortcut for this::

  <rowmaker idmaps="EVI,AV,AI"/>

idmaps sets up one map element, with both dest and src set to the
name, for every name in its comma-separated value.  You can
abbreviate this further to::

  <rowmaker idmaps="*"/>

idmaps values can contain shell patterns.  They will be matched
against the column names in the target table.  For every column for
which there is no explicit mapping, an identity mapping (with type
conversion) will be set up.

This leaves the bbox, centerAlpha, and centerDelta keys to be
defined.  No literals for those appear in the rawdicts, since they
are not part of the input data; we need to compute them.  To
facilitate the computations, we first turn the bounds into floats.
This can be done using vars::

  <var name="raMin">float(raMin)</var>
  <var name="raMax">float(raMax)</var>
  <var name="decMin">float(decMin)</var>
  <var name="decMax">float(decMax)</var>

No shortcut is available here, since this is a relatively rare
thing.  You could use procDef/apply to save on keystrokes if you
find yourself having to do such simple conversions more frequently.

As you can see, var elements have a name attribute that gives the
name in the rawdict the value is to be bound to.  Their character
content is a python expression in which you can access the rawdict
values by their names.

The remaining computations can be performed in mappings::

  <map dest="centerAlpha">(raMin+raMax)/2.</map>
  <map dest="centerDelta">(decMin+decMax)/2.</map>
  <map dest="bbox">coords.Box((raMin, decMin),
    (raMax, decMax))</map>

As in vars, the rawdict values can be accessed by their keys in the
mapping expressions.  coords.Box is the internal type for SQL Box
values; you will not usually see those.  Still, you can access
basically the whole DC code in these mapping expressions.  At some
point we will define an API of "safe operations" that you can use
without having to fear changes in the DC code.
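Collected into one element, the complete rowmaker for our table thus
reads (this is exactly how it will reappear in the assembled RD
below)::

  <rowmaker idmaps="*">
    <var name="raMin">float(raMin)</var>
    <var name="raMax">float(raMax)</var>
    <var name="decMin">float(decMin)</var>
    <var name="decMax">float(decMax)</var>
    <map dest="centerAlpha">(raMin+raMax)/2.</map>
    <map dest="centerDelta">(decMin+decMax)/2.</map>
    <map dest="bbox">coords.Box((raMin, decMin),
      (raMax, decMax))</map>
  </rowmaker>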
Data elements
-------------

We now have a table definition, a grammar, and a rowmaker.  For the
purposes of importing, these three come together in a data element.
Data elements define what could be seen as the equivalent of a
VOTable resource, together with a recipe for how to build it.

For onDisk tables, a side effect of building the data is that the
tables are created in the database; in that sense, data elements
also define operations, a notion that will become more pronounced as
we discuss incremental processing.

Let us assemble the pieces we have so far::

  <resource schema="lmcextinct">
    <meta name="title">Extinction within the LMC</meta>
    <meta name="creationDate">2009-06-02T08:42:00Z</meta>
    <meta name="description">
      Extinction values in the area of the LMC...
    </meta>
    <meta name="copyright">Free to use.</meta>
    <meta name="creator.name">S. Author</meta>
    <meta name="subject">Large Magellanic Cloud</meta>
    <meta name="subject">Interstellar medium, nebulae</meta>
    <meta name="subject">Extinction</meta>

    <table id="main" onDisk="True" adql="True">
      <meta name="description">Extinction values within certain
        areas on the sky.</meta>
      ...
    </table>
    <data id="import">
      <sources pattern="data/lmc_extinction_values.txt"/>

      <reGrammar topIgnoredLines="1">
        <names>raMin, raMax, decMin, decMax, EVI, AV, AI</names>
      </reGrammar>

      <make table="main">
        <rowmaker idmaps="*">
          <var name="raMin">float(raMin)</var>
          <var name="raMax">float(raMax)</var>
          <var name="decMin">float(decMin)</var>
          <var name="decMax">float(decMax)</var>
          <map dest="centerAlpha">(raMin+raMax)/2.</map>
          <map dest="centerDelta">(decMin+decMax)/2.</map>
          <map dest="bbox">coords.Box((raMin, decMin),
            (raMax, decMax))</map>
        </rowmaker>
      </make>
    </data>
  </resource>
As you can see, we have put the grammar and the rowmaker into the
data element.  While this is not strictly necessary (they could be
direct children of resource as well, which might be a good idea if
they are used in more than one data element), it is good practice,
since they, in some sense, belong to that data element.

There are two new elements in data.  One is make, which connects the
table that is to be built with the rowmaker that fills it.  The
other is sources.  Sources specify, in their pattern attribute,
where the data element will find its input files.  The patterns are
shell patterns, interpreted relative to the resource directory.  You
can give multiple patterns if necessary, like this::

  <sources>
    <pattern>inp2/*.txt</pattern>
    <pattern>inp1/*.txt</pattern>
  </sources>

There also is a recurse boolean attribute you can use when your
sources are distributed over subdirectories of the path part of the
pattern, as in the sketch below.
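For instance, if the input files were sorted into subdirectories
below data (a hypothetical layout), you could write::

  <sources pattern="data/*.txt" recurse="True"/>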
Indices and Mixins
------------------

Now, let's assume the input table is large.  You will then want to
define indices on the table.  To do this, use the `index
<./ref.html#element-index>`_ element, a child of table.  In general,
index specifications can be rather involved, but simple cases remain
simple.  If you just wanted to define an index on EVI, you could
say::

  <table id="main" onDisk="True" adql="True">
    ...
    <index columns="EVI"/>
  </table>

(the columns attribute would be "AV,EVI" if you wanted an index on
both columns).

However, indices are not always that simple.  For example, for a
spatial index on centerAlpha and centerDelta with the q3c scheme
used by the DC software, you would have to write something like::

  <index columns="centerAlpha,centerDelta">
    q3c_ang2ipix(centerAlpha,centerDelta)
  </index>

The DC software has a mechanism that helps in this case: `mixins
<./ref.html#mixins>`_.  A mixin conceptually is a guarantee of
certain table properties, typically the presence of certain columns;
here, it is just the presence of an index.  So, all you need to do
to have a spatial index on the table is::

  <table id="main" onDisk="True" adql="True"
      mixin="//scs#q3cindex">
    ...
  </table>

This is UCD magic at work -- q3cindex selects the columns with UCDs
of pos.eq.*;meta.main as index columns.  If you are curious how it
does this, check scs.rd in the system RD directory.

Starting the Ingestion
----------------------

At this point, you can run the ingestion::

  gavoimp q

By default, gavoimp creates all data defined in a resource.  If this
is not what you want, you can explicitly specify a data id to
process::

  gavoimp q import

For larger data sets, it may be wise to first try a couple of rows::

  gavoimp --stop-after=300 q

Try ``gavoimp --help`` to see more options (most of which are
probably irrelevant to you for now).

Note that gavoimp interprets the RD argument as a file name first
and only then as an RD id.  An RD id is the inputs-relative path of
the RD, with the extension stripped.  Our example RD thus has the RD
id lmcextinct/q, and you could have said::

  gavoimp lmcextinct/q

from anywhere in the file system.

Debugging
---------

If nothing else helps, you can watch what the software actually
sends to the database.  To do that, set the GAVO_SQL_DEBUG
environment variable to any value.  This could look like this::

  env GAVO_SQL_DEBUG=1 gavoimp q import

The first couple of requests are for internal use (like checking
that some meta tables are present).

Publishing Data
===============

Once a table is in the database, it needs to get out again.  Within
DaCHS, there are three parties involved in delivering data to the
user:

* The core; it actually does the computation.
* The renderer; it formats the result in some way requested by the
  user and delivers it.  There are renderers for web forms, VO
  protocols, images, etc.
* The service; it holds together the core and the renderer, can
  reformat core results, controls the metadata, etc.

You will usually use pre-specified renderers, so these are not
defined in resource descriptors.  What you have to define are cores
and services.

For the core, you will usually use the `dbCore
<./ref.html#element-dbcore>`_ in custom services, though `many other
cores <./ref.html#cores-available>`_ are predefined and you can
`define your own <./ref.html#writing-custom-cores>`_.  The dbCore
generates a (single-table) query from condition descriptors and
returns a table that you describe through an output table.

Cores are defined as direct children of the resource element.  For
the lmcextinct table above, a dbCore could look like this::

  <dbCore id="qcore" queriedTable="main">
    <condDesc original="//scs#humanInput"/>
    <condDesc buildFrom="EVI"/>
  </dbCore>

Cores always need an id.  dbCores need a queriedTable attribute, the
value of which must be a table reference; this is the table the
query will run against.

CondDescs can be defined in all kinds of ways.  The most common
modes, however, are using predefined condDescs (which mostly come
from protocols; in this case, the humanInput condDesc from the SCS
protocol RD lets you do cone searches) and deriving condDescs from
table columns.  In the buildFrom attribute, you can refer to columns
from your table definition by name, and the software tries to make
some useful input definition from that column.  In web forms, these
input definitions become form items; other renderers will expose
them differently.  In all cases, however, the condDescs of the
dbCore define what fields can be queried.

The service now ties the core together with a renderer.  It might
look like this::

  <service id="cone" core="qcore">
    <meta name="shortName">lmcext_web</meta>
  </service>

While services can run without short names, that can lead to trouble
later, so you should make a habit of assigning short names.  See
`the data checklist <./data_checklist.html>`_ for more information
on short names.

A service must have an id as well, and its core attribute must
contain the id of a core.
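Schematically, the publishing part thus sits next to the table and
data definitions in the RD (this just repeats the pieces from
above)::

  <resource schema="lmcextinct">
    ...
    <dbCore id="qcore" queriedTable="main">
      <condDesc original="//scs#humanInput"/>
      <condDesc buildFrom="EVI"/>
    </dbCore>

    <service id="cone" core="qcore">
      <meta name="shortName">lmcext_web</meta>
    </service>
  </resource>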
With this minimal specification, the service exposes a web
form-based interface.  To try it, run a server::

  gavoserve debug

and point a browser to http://localhost:8080/lmcextinct/q/cone/form
(the host part, of course, depends on your configuration; if you did
not change anything there, you should find the data at the given
URL).

More on Tables
==============

Notes
-----

Frequently, you need to say more about a column than is appropriate
in the few-phrase description.  Historically, such situations have
been handled using notes.  Since notes can be reused for multiple
columns, we chose to follow that precedent rather than attach
longish information to the columns themselves.

The notes themselves are kept in meta elements belonging to tables.
Since the notes tend to be markup-heavy, their default format is
restructured text.  When entering notes in RDs, there is a ``tag``
attribute on these meta items::
  <table id="demo">
    ...
    <meta name="note" tag="1">
      The meaning of the flag is as follows:

      =====  ==========
      value  meaning
      =====  ==========
      1      value is 2
      2      value is 1
      =====  ==========
    </meta>
    ...
  </table>
To associate a column with a note, use the column's note attribute::

  <column name="flag" type="smallint" note="1"/>

As tag, you may use basically any string, but it's a good idea to
keep it to numbers, or at least to characters not requiring URL
encoding.  The notes will be exposed in HTML table heads, table and
service descriptions, etc.  If you need to link to one, there is the
built-in tablenote renderer that takes the table and the note tag
from its query path.  The most convenient way to use it is through
the built-in vanity name tablenote, where you would access the note
above using a URL like
``http://your.server/tablenote/demoschema.demo/1``.

STC
---

As soon as you have coordinates, you will want to define coordinate
systems for them.  In the introductory example, that was not
necessary because SCS mandates that the coordinates you export are
in ICRS; so, either your coordinates are in ICRS or you are
violating the SCS protocol -- in either case, there is nothing to
declare.

In the more general case, you will want to say what is what in your
tables.  DaCHS uses a language called STC-S to declare systems,
reference points, etc.  The STC-S description [TODO: Link to IVOA
note] is a bit terse, but the good news is that you will get by with
a few features most of the time.

STC is defined in children of table elements, with references to
table columns in quoted strings::

  <stc>
    Position ICRS "ra" "dec" Error "e_ra" "e_dec"
  </stc>
  <stc>
    Position FK4 J1950.0 "ra_orig" "dec_orig"
  </stc>

You do not need to change anything in the column definitions
themselves, since the machinery will resolve your column references.
If you refer to non-existing columns, RD parse errors will be
thrown.

More on Grammars
================

Row Generators
--------------

TBD

Source Fields
-------------

Grammars can have a sourceFields element.  It contains a standard
procedure definition (i.e., you could predefine those and bind
parameters), but usually you will just fill in the code.  This code
is called once for each source processed and receives the
sourceToken as its argument.  It must return a dictionary, the
key/value pairs of which are added to all rows returned by the row
iterator.

The purpose of sourceFields is to precompute values that depend on
the source ("file") and are constant for all rows within it.  An
example for where you need this is when you want to create backlinks
to the file a piece of data came from::

  <sourceFields>
    <code>
      srcKey = utils.getRelativePath(sourceToken,
        base.getConfig("inputsDir"))
      return locals()
    </code>
  </sourceFields>

You can then retrieve the path to the source file via the srcKey key
in rawdicts (and then, using render functions and static renderers,
turn this into links).

In addition to the sourceToken, you also have access to the data
that will be fed from the grammar.  This can be used to, e.g.,
retrieve the resource directory (``data.dd.rd.resdir``) or data
descriptor properties (``data.dd.getProperty("whatever")``).

Sometimes you want to do database queries from within sourceFields.
This is tricky when you access the table being written or otherwise
being accessed, because sourceFields code runs in the midst of a
transaction updating that table.  So, something like::

  base.SimpleQuerier().query(...)

will wait for the transaction to finish.  But the transaction is
waiting for data that will only come when the query finishes -- this
is a deadlock, and gavoimp will just sit there and wait (see also
the documentation on deadlocks).  To get around this, you need to
query using the data's connection.  So, instead write::

  base.SimpleQuerier(connection=data.connection).query(...)
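As a sketch of how this might look in context (the obslog table and
its columns are invented for this example, and we assume query
returns a standard database cursor), a sourceFields body doing such
a lookup could read::

  <sourceFields>
    <code>
      # query via the data's connection to avoid the deadlock
      # described above; the table queried here is hypothetical
      cursor = base.SimpleQuerier(connection=data.connection).query(
        "select obsId from demoschema.obslog where srcPath=%(p)s",
        {"p": sourceToken})
      return {"obsId": cursor.fetchall()[0][0]}
    </code>
  </sourceFields>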
More on Services
================

Custom Templates
----------------

Within the data center, most pages are generated from templates [XXX
TODO: write something about them generically].  This is true for the
pages the form renderer of services displays as well.  To achieve
special effects, you may want to override them (though in general,
it is a much better idea to work within the standard template, since
that will give your service all kinds of automatic updates and would
make, e.g., changes much easier if your institution undergoes the
yearly reorganization).

The default response template can be found in
resources/templates/defaultresponse.html in the installed tree.  To
obtain the plainest output conceivable, try something like::

  <html xmlns:n="http://nevow.com/ns/nevow/0.1">
    <head>
      <title>No title</title>
    </head>
    <body>
      <n:invisible n:render="form genForm"/>
    </body>
  </html>
Save this to a file within the resource directory, let's say
"res/plain.html".  Then, say::

  <template key="form">res/plain.html</template>

in your service; this should give you a minimally decorated page.

Of course, this will display a severely degraded page.  To get at
least the standard style sheet and the standard javascript, say::

  <head n:render="commonhead">
    <title>A title</title>
  </head>

instead of the plain head.

More on Cores
=============

CondDescs
---------

dbCores and cores derived from them take most of their power from
condition descriptors, or condDescs.  These combine inputKeys, which
are basically column objects with some additional
presentation-related information, with code generating SQL
conditions.

A condDesc can contain zero or more input keys (though having zero
input keys makes no sense for user-defined condDescs, since they
would never "fire").  Having more than one input key is useful when
input quantities can only be interpreted when present as a group.
An example is the standard cone search, where you need both a
position and a search radius.

Automatic and manual control
''''''''''''''''''''''''''''

Most condDescs, however, correspond to one input key, and the input
key is mostly derived from a table column.  This is the standard
idiom::

  <condDesc buildFrom="somecol"/>

where somecol is a column in the table queried by the core.  This
construct will cause an input key to be built from somecol.  While
doing this, the type will be mapped automatically.  The primary
rules are:

* numeric types will get mapped to numeric vizier-like expressions,
* datetimes will get mapped to date vizier-like expressions,
* text and chars will get mapped to string vizier-like expressions,
* enumerated values (i.e., columns with values elements giving
  options) will not become vizier-like expressions but input keys
  that yield selection widgets.

To have more control (e.g., if you do not want to allow vizier-like
expressions), give the input key yourself::

  <condDesc>
    <inputKey original="somecol" required="False"/>
  </condDesc>

(which would make a column required in the table optional in the
query), or::

  <condDesc>
    <inputKey original="somecol"/>
  </condDesc>

(which creates an input key matching values literally), or even::

  <condDesc>
    <inputKey original="somecol" required="True">
      <values>
        <option title="Some label">0</option>
        <option title="Other label">1</option>
      </values>
    </inputKey>
  </condDesc>

-- if the input key is required, queries not giving it will be
rejected.  The title attribute on option gives the label of an
option in the HTML input widget; if it is missing, a string
representation of the value will be used.

In all those cases, the SQL generated from the condDesc is a
conjunction of the input keys' individual SQL expressions.  Those,
in turn, are simply comparisons for equality for plain types and
more or less arbitrary expressions for vizier expression types.

Incidentally, two properties on inputKeys are defined to only show
inputs for certain renderers, viz., ``onlyForRenderer`` and
``notForRenderer``.  Both have single strings as values.  This is
intended mainly for cases like SIAP and SCS where "human-oriented"
versions of the input fields are available.  The built-in SCS and
SIAP conditions already do that, so you can give both the protocol
and the human-oriented conditions in one core.  Here is how you
would define an input key that is only used for the form renderer::

  <inputKey original="somecol">
    <property name="onlyForRenderer">form</property>
  </inputKey>

Phrase makers
'''''''''''''

For complete control over what SQL is generated, condDescs may
contain code called a phrase maker.  This, again, is a procedure
application, quite like with rowmaker procs, except that the
signature of condDesc code is different.  Phrase maker code has the
following names available:

* inputKeys -- the list of input keys for the parent condDesc
* inPars -- a dictionary mapping inputKey names to the values
  provided by the user
* outPars -- a dictionary that is later used as the parameter
  dictionary for the query
The code should amend the outPars dictionary with the keys mentioned
in the conditions; the conditions themselves are yielded.  So, a
very simple condDesc with generated SQL could look like this::

  <condDesc>
    <inputKey name="val" type="integer"/>
    <phraseMaker>
      <code>
        outPars["xxyy"] = "x"*inPars.get("val", 20)
        yield "someColumn=%(xxyy)s"
      </code>
    </phraseMaker>
  </condDesc>

However, using fixed names in outPars is not recommended, if only
because condDescs could be used multiple times.  The recommended way
uses the vizierexprs.getSQLKey function.  It takes a name, a value,
and the outPars dictionary.  It will return a key unique within the
query in question and enter the value into the outPars dictionary
under that key.

While that sounds complicated, it is actually rather harmless, as
shown in the following real-world example that lets users input
date, time, and an interval in split-up form (e.g., when you cannot
hope anyone will try to write the equivalent vizier-like
expressions)::

  baseTS = datetime.datetime.combine(inPars["date"], inPars["time"])
  dt = datetime.timedelta(minutes=inPars["within"])
  yield "date BETWEEN %%(%s)s AND %%(%s)s"%(
    vizierexprs.getSQLKey("date", baseTS-dt, outPars),
    vizierexprs.getSQLKey("date", baseTS+dt, outPars))
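Embedded in an RD, the whole condDesc might look like this (the
input key definitions are illustrative; only their names are fixed
by the code above)::

  <condDesc>
    <inputKey name="date" type="date"
      description="Date of the observation"/>
    <inputKey name="time" type="time"
      description="Time of the observation"/>
    <inputKey name="within" type="integer" unit="min"
      description="Accept observations within this many minutes"/>
    <phraseMaker>
      <code>
        baseTS = datetime.datetime.combine(inPars["date"],
          inPars["time"])
        dt = datetime.timedelta(minutes=inPars["within"])
        yield "date BETWEEN %%(%s)s AND %%(%s)s"%(
          vizierexprs.getSQLKey("date", baseTS-dt, outPars),
          vizierexprs.getSQLKey("date", baseTS+dt, outPars))
      </code>
    </phraseMaker>
  </condDesc>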