================================
Development notes for GAVO DaCHS
================================
This is basically a heap of stuff I intend to amend with informal docs
on what I'm doing to the software. While I hope that at some point
it'll grow into a useful introduction to further developing the stuff,
right now it's a random collection that contains quite a bit of
outdated information. Caveat emptor.
Package Layout
==============
To alleviate cross-import pains, to facilitate later package splitting
and also as a guideline of what goes where, apply the following rules:
* Each functionality block is in a subpackage, the __init__ for which
contains the main functions, classes, etc., of the sub-package
interface most clients will be concerned with. Clients needing
special tricks may still import individual modules.
* Within each subpackage, *no* module imports the sub-package, i.e.,
a module in base never says "from gavo import base"
* A subpackage may have a module common, containing objects that
multiple modules within that subpackage require. common may *not*
import any module from the subpackage, but may be imported from all
of them.
* There is a hierarchy of subpackages, where subpackages lower in the
hierarchy may not import anything from the higher or equal levels, but
only from lower levels. This hierarchy currently looks like this:
imp < utils < (adql, stc) < base < rscdef < grammars < rsc < svcs <
protocols < web < (helpers, user)
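To make the rule concrete, here is a small sketch (not part of DaCHS itself) that encodes the hierarchy as ranks; an import is legal only if it goes from a higher level to a strictly lower one:

```python
# Sketch only: encode the subpackage hierarchy as numeric levels.
# Parenthesized groups in the hierarchy share a level, so imports
# between them are also forbidden.
LEVELS = {
    "imp": 0, "utils": 1, "adql": 2, "stc": 2, "base": 3,
    "rscdef": 4, "grammars": 5, "rsc": 6, "svcs": 7,
    "protocols": 8, "web": 9, "helpers": 10, "user": 10,
}

def import_allowed(importer, imported):
    """Return True if a module in `importer` may import from `imported`."""
    return LEVELS[imported] < LEVELS[importer]
```

Something like this could be wired into a test that scans the actual import statements of the source tree.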
Error handling, logging
=======================
It is the goal that all errors that can be triggered from the web or
from within resource descriptors yield sensible error messages with,
as far as possible, information on the location of the error. Also, major
operations changing the content of the database should be loggable with
time and, probably, user information.
The core of error processing is utils.excs. All "sensible" exceptions
(i.e., MemoryErrors and software bugs excepted) should be instances of
gavo.excs.Error (or derived classes).
The base class takes a hint argument at construction that gives
additional information on how to fix a certain problem. Apart from
message (in the first argument), the exceptions must always be
constructed using keyword arguments.
When there is structured information (e.g., line numbers, keys, and the
like), always keep the information separate and use the __str__ method
of the exception to construct something humans want to see. All
built-in exceptions should accept a hint keyword.
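As an illustration of these conventions, a hedged sketch might look as follows; the real classes live in utils.excs and differ in detail, and LiteralParseError's attributes here are made up for the example:

```python
# Sketch of the exception conventions described above (not the real code).
class Error(Exception):
    # every "sensible" exception takes an optional hint keyword
    def __init__(self, msg, hint=None):
        Exception.__init__(self, msg)
        self.msg, self.hint = msg, hint

class LiteralParseError(Error):
    """Keeps structured information (attribute name, offending literal)
    separate from the message; __str__ builds the human-readable form."""
    def __init__(self, msg, attName=None, literal=None, hint=None):
        Error.__init__(self, msg, hint=hint)
        self.attName, self.literal = attName, literal

    def __str__(self):
        return "'%s' is not a valid value for %s" % (
            self.literal, self.attName)
```

Note that everything except the message is passed as a keyword argument, as the convention requires.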
Testing
=======
Many of the tests will require database connectivity. They should, in
general, not need resources apart from the availability of a test db
profile. I have this set up on my private machine like this:
::
[profiles]
test:test
in ~/.gavorc and
::
database=gavo
user=msdemlei
in ~/.gavo/etc -- of course, for this to work, you need a postgresql
engine running on your machine with a database gavo in which your role
is superuser (well, has sufficient rights...).
There are doctests, pyunit-based tests, and trial-based tests, plus the
rough-and-ready utils/roughtest.py.
Structures
==========
Resource description within the DC works via instances of
base.Structure. These parse themselves from XML strings, do validation,
etc.
A complete structure instance has the following callbacks:
* completeElement -- called when the element closing tag is encountered,
used to fill in computed defaults
* validate -- called after completeElement, used to raise errors if
some gross ("syntactic") mistakes are in the element
* onElementComplete -- called after validate, i.e., onElementComplete
  can rely on seeing a "valid" structure
In addition, structures can register onParentCompleted callbacks. These
are called after the onElementComplete of the parent element.
This processing is done automatically when parsing elements from XML.
When building elements manually, you should call the finishElement
method when done to arrange for these methods being called.
If you override these methods, make sure you call the methods of the
superclass. Since we might, at some point, want mixins to be able
to define validators etc., use super()-based superclass calling, through
_completeElementNext, _validateNext, and _onElementCompleteNext.
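A minimal, self-contained sketch of this callback protocol (the real machinery in base is considerably richer) could look like this:

```python
# Illustrative sketch of the structure callback protocol, not DaCHS code.
class Structure(object):
    def completeElement(self):     # fill in computed defaults
        pass
    def validate(self):            # raise on gross ("syntactic") mistakes
        pass
    def onElementComplete(self):   # may rely on seeing a valid structure
        pass

    # super()-based hooks so mixins can take part in the chain
    def _completeElementNext(self, cls):
        super(cls, self).completeElement()
    def _validateNext(self, cls):
        super(cls, self).validate()
    def _onElementCompleteNext(self, cls):
        super(cls, self).onElementComplete()

    def finishElement(self):
        # call this when building elements manually
        self.completeElement()
        self.validate()
        self.onElementComplete()

class TableDef(Structure):
    def completeElement(self):
        self._completeElementNext(TableDef)
        self.completed = True   # a "computed default"
```

When parsing from XML, the three callbacks are triggered automatically at the closing tag; finishElement is only needed for manually built elements.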
Metadata
========
Within the framework, there are two main sources of metadata. For one,
the fields (via datadef.DataField) of a table or a document record carry
metadata on their types, ucds, etc.
Metadata pertaining to other entities than fields is kept with these
entities, viz., ResourceDescriptor, DataDescriptor, DataSet, Table, and
RecordDef instances. All these mix in the parsing.meta.MetaMixin
providing getMeta and addMeta methods.
It is the metadata containers' responsibility to choose their parents
and children. For this purpose, they have to call a child's
setMetaParent method when they notice a Meta-carrying child is being
added.
Our Metadata implementation has to deal with
* Sequences of metadata -- there may be more than one item for a keyword,
e.g. for "subject".
* Compound metadata -- items may consist of various sub-items (e.g.,
coverage, creator)
* There may be sequences of compound objects
* The metadata should be sanely serializable into at least plain text,
html, and VOResource
* At least XML and key/value pairs should be supported as input
* Handover: Metadata containers are hierarchical -- a service might be
derived from a data set, which in turn sits within a resource descriptor.
If the service doesn't have a piece of metadata, it has to hand over
the question to its parent.
* To keep the complexity of the meta trees down, we want to keep certain
  common types of metadata simple; for example, we generally don't want
  the title to be metadata for a link but rather keep it in the link
  meta value itself.
This results in a rather messy implementation and an interface that's
not really optimal.
Describing Metadata
+++++++++++++++++++
Metadata is organized by mapping keys to values. Keys are
dot-separated "atoms" (i.e., sequences of letters); most of them are
defined in RMI. In addition, the system uses quite a number of
"internal" keys, designated by leading underscores. They include:
* _type -- on DataSets, this becomes the type attribute of the VOTable.
* _query_status -- on DataSets, this can be used to communicate the
value of an INFO element in the VOTable (see SIAP spec). These must
be meta.InfoItem instances.
* _legal -- human-readable unstructured information on the legal
status of the data.
* _infolink -- a URL pointing to further unstructured human-readable
  information on the data content.
Getting Metadata
++++++++++++++++
Metadata are accessed by name (or "key", if you will).
The getMeta method usually follows the enclosure hierarchy up, meaning
that if a meta item is not found in the current instance, it will ask
its parent for that item, and so on. If no parent is known, the meta
information contained in the configuration will be consulted. If all
fails, a default is returned (which is set via a keyword argument that
again defaults to None) or, if the raiseOnFail keyword argument
evaluates to true, a gavo.NoMetaKey exception is raised.
As an example for propagation, querying metadata on a Table
will ask DataSet (XXX shouldn't it ask the RecordDef? Right now, that
won't work because Tables don't get RecordDefs but fieldDefs XXX),
DataDescriptor, RecordDef and finally config.
If you require metadata exactly for the item you are querying, call
getMeta(key, propagate=False).
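The propagation rules can be modelled in a few lines. This is a toy stand-in for the real meta code (flat string keys, no dotted hierarchy), not its actual implementation:

```python
# Toy model of getMeta's lookup rules, for illustration only.
class NoMetaKey(KeyError):
    pass

class MetaMixin(object):
    def __init__(self, parent=None):
        self.metaParent, self.meta = parent, {}

    def addMeta(self, key, value):
        # repeated addMeta calls for a key build a sequence
        self.meta.setdefault(key, []).append(value)

    def getMeta(self, key, propagate=True, default=None, raiseOnFail=False):
        if key in self.meta:
            return self.meta[key]
        if propagate and self.metaParent is not None:
            # hand the question over to the enclosing container
            return self.metaParent.getMeta(key, default=default,
                raiseOnFail=raiseOnFail)
        if raiseOnFail:
            raise NoMetaKey(key)
        return default
```

The real code additionally consults the configuration when no parent is left, and NoMetaKey lives in the gavo namespace.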
For metadata that has structure, getMeta will raise a gavo.MetaCardError
when there is more than one matching meta item. For these, you will
usually use a builder, typically a subclass of meta.metaBuilder.
web.common.HtmlMetaBuilder is an example of what such a thing may look
like; for simple cases, you may get by using ModelBasedBuilder (see the
registry code for examples).
The builders are passed to a MetaMixin's buildRepr(metakey, builder)
method that returns whatever the builder's getResult method returns.
Setting Metadata
++++++++++++++++
You can programmatically set metadata on any metadata container by
calling container.addMeta(key, value), where both key and value are
(unicode-compatible) strings. You can build any hierarchy in this way,
provided you stick with typeless meta values or can do with the default
types. Those are set by key in meta._typesForKeys.
To build sequences, call addMeta repeatedly. To have a sequence of
containers, call addMeta with None or an empty string as value, like
this::
m.addMeta("p.q", "x")
m.addMeta("p.r", "y")
m.addMeta("p", None)
m.addMeta("p.q", "u")
m.addMeta("p.r", "v")
More complex structures require direct construction of MetaValues. Use
the makeMetaValue factory for this. This function takes a value (default
empty) and possibly key and/or type arguments. All additional
arguments depend on the meta type desired.
The type argument selects an entry in the meta._typesForKeys table
that specifies that, e.g., _related meta items always are links. You
can also give the type directly (which overrides any specification
through a key).
This can look like this::
m.addMeta("info", meta.makeMetaValue("content", type="info",
infoName="someInfo", infoValue="GIVEN"))
VOTables
========
table.Table instances are primarily meant to be serialized into VOTable
tables. Since most of the metadata of tables will be contained in the
parent DataSet's docRec, PARAM elements of tables will be taken from
there and will be located in the RESOURCE element that contains the
table(s).
There's currently no provision for having PARAMS to VOTABLE elements
(these would probably reside in the resource descriptor) or TABLE
elements (this would require a special attribute of table.Table, I
guess).
XXX TODO: there's also INFO and LINK. Have some convention as to
what goes where.
However, Tables are meta containers and can contain meta information.
With VOTables, correct formatting of values becomes a particular
problem. While presentation is largely a non-issue, it is paramount
that the literals actually match what the ucds, units and types give.
Therefore, displayHints (apart from "suppress", which by default is
honored) are ignored. Instead, votable.py defines MapperFactories.
These are just callables taking ColProperties (in a pinch, dicts having
"sufficient" keys will do too, where sufficient at least includes
``ucd``, ``unit``, ``datatype``, ``arraysize``, and ``dbtype``, possibly
more) and returning either None (meaning they won't handle values for
this column) or a callable returning a string.
These MapperFactories are organized in a Registry that can be queried
for a mapper. If you need to do some special mapping, get a copy of
the default mapper registry by calling ``votable.getMapperRegistry``,
write mappers (a couple of the keys available are listed above, but
votable calls the mappers with properly filled out votable.ColProperties
instances, so you can, e.g., look at min, max, and hasNulls), and
register them using the ``registerFactory`` method of the registry. The
mappers will be called in reverse order of registration, so you can
override default behaviour, and you should register the most special
mappers last.
Mapper factories may decide to alter the type they're returning (in
fact, for things like date they'll in all likelihood need to). To do
that, change the datatype and arraysize attributes. After the mappers
have run, nothing will look at the dbtype any more.
To add new default mappers, add one in votable.py and call
_registerDefaultMF on them.
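The registry lookup described above can be sketched as follows. This is a simplified stand-in for votable's machinery, with boxMapperFactory invented for illustration:

```python
# Simplified model of the mapper-factory registry (not the real votable code).
class MapperRegistry(object):
    def __init__(self):
        self.factories = []

    def registerFactory(self, factory):
        self.factories.append(factory)

    def getMapper(self, colProps):
        # mappers registered last are asked first, so later registrations
        # override the default behaviour
        for factory in reversed(self.factories):
            mapper = factory(colProps)
            if mapper is not None:
                return mapper
        return str  # fall back to plain stringification

def boxMapperFactory(colProps):
    # a factory returns None for columns it doesn't handle
    if colProps.get("dbtype") != "box":
        return None
    return lambda val: "%.2f %.2f" % val
```

In the real code, colProps is a votable.ColProperties instance with keys like ``ucd``, ``unit``, ``datatype``, ``arraysize``, and ``dbtype``.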
The New Web Structure
=====================
The new web structure is based on resource descriptors. These define
data, adapters and services. A service is a combination of a renderer,
a core, and optionally input and output adapters.
A core is a piece of machinery that takes a DataSet as input (mostly, only the
docRec will be used, which typically contains the query parameters). If
at all sensible, cores should return DataSets, since that is what most
renderers expect.
The renderer provides the interface to the client. This can be Form
for something that outputs an HTML form and returns an HTML table.
Other renderers include ones of DAL protocols (like SIAP), JPEG images
or plain text.
Adapters are basically datadef.DataTransformers, i.e., they generate
tables using grammars. For input adapters, these grammars will
typically be context grammars, for output adapters, table grammars.
The way a query ideally is processed is roughly like this:
::
  query interface (http, dal, ...) <-> service
                                      /   |   \
                                     v    v    v
                       (input adapter) -> core -> (output adapter)
Services
++++++++
Services primarily know the fields the user is expected to provide and
that the renderer should return. In addition, they carry most of the
metadata necessary for publication (although much of it may really be
in the resource descriptor by meta delegation).
You probably will not subclass Service.
A service provides the following methods:
* getInputFields() -> seq of InputKeys -- what the user must/may give
as the service input.
* getInputData(rawInput) -> DataSet -- receives the input to the
inputAdapter's Grammar (usually a dict) and returns the input data set
* run(inputData, queryMeta) -> DataSet -- returns a
web.common.SvcResult instance containing (hopefully) everything to
build an answer to the query.
* getCurOutputFields(queryMeta) -> seq of DataFields -- returns the columns
of the primary table of the result. This is given so (a) cores
capable of returning specific fields only (e.g., DbBasedCores) can
ask a service what data to query and (b) so the registry interface
or renderers wanting to know the structure of the data in advance
(SOAP!) can obtain this information. This depends on queryMeta since
the verbLevel can change what fields are returned.
The return values of getInputData and run may be twisted deferreds (for
getInputData, this is necessary because getting the input data may
involve database operations or web queries).
Cores
+++++
A core is something that receives a DataSet and produces one. In
between, they may issue database queries, run a program, whatever.
Cores must have the following method:
* run(inputData, queryMeta) -> thing
The thing returned is either
* a twisted.deferred firing a DataSet or
* a pair of (headers, file-like object) if whatever is to be delivered is
not really a table or
* a string.
InputData is a DataSet (unfortunately, there
currently is no way for a core to describe the structure it expects in
that DataSet), queryMeta is a common.QueryMeta instance.
Cores must have avInputKeys and avOutputKeys sets. These contain
strings naming input and output parameters the core supports.
If at all possible, they should also give getInputFields and
getOutputFields methods returning record.DataFieldLists. Several
renderers may depend on these.
In the typical case of a database based core, you'll usually want to use
the condition descriptor infrastructure (currently in standardcores,
CondDesc and friends, but that's probably going to move). A condition
descriptor is an object that controls zero or more data fields, can
describe them and can build SQL queries given values (usually coming
from the docRec of a service's inputTable).
CondDescs must support the following methods:
* getInputFields() -> seq of DataField instances -- the embedding core
  will return the union of all these sequences from its getInputFields
* getQueryFrags(inPars, sqlPars, queryMeta) -> frag -- receives the
input parameters and a QueryMeta instance and returns a fragment
suitable for where clauses in SQL statements. It will add the
parameters within this statement to sqlPars (which is passed so the
implementations can check for free keys).
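A hedged sketch of this protocol; RAConstraint and getSQLKey are invented names for illustration, and the real CondDescs return DataField instances rather than plain strings:

```python
# Illustrative sketch of the CondDesc protocol, not the real code.
def getSQLKey(name, value, sqlPars):
    """Put value into sqlPars under a key derived from name that is
    still free, and return that key."""
    key, i = name, 0
    while key in sqlPars:
        i += 1
        key = "%s%d" % (name, i)
    sqlPars[key] = value
    return key

class RAConstraint(object):
    def getInputFields(self):
        # the real code returns DataField instances
        return ["RA"]

    def getQueryFrags(self, inPars, sqlPars, queryMeta):
        if inPars.get("RA") is None:
            return ""
        key = getSQLKey("RA", inPars["RA"], sqlPars)
        return "alphaFloat=%%(%s)s" % key
```

The fragment uses psycopg-style named parameters, so the values end up in sqlPars rather than being pasted into the SQL string.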
Default Output Filter
+++++++++++++++++++++
With DbBasedCores, the typical case is that the DataSet returned gives
exactly the fields the service ordered. In that case, when constructing
the SvcResult, the core's DataSet is taken as the SvcResult's original.
If, however, the field keys are different, a transformation will
be set up based on the service's outputFields. If any outputField
declared as non-optional is not found in a coreResult's row, a
ValidationError will be raised.
Validation
++++++++++
XXX Standard validation: make a values child for field, yadda
If some code implementing a service can't go on, it should raise some
subclass of gavo.ValidationError with an explanation and the responsible
field name if at all possible. web.resourcebased.Form catches such
errors during the parsing of the input (i.e., interpretation of the form
data) and (hopefully) during the actual run of the service. Any other
errors right now lead to 500s.
In the later stages of processing, fields will usually be present that
are not present in the input form. If errors corresponding to these
fields are raised, formal won't be able to display the message at the
appropriate field. As a quick fix, Service provides a
translateFieldName method that takes a field name and tries to map it to
an originating field. Right now, it uses a table coded in the resource
descriptor (nameMap element) for that, but that's only a temporary hack.
We should come up with something residing in the adapters.
Macros and processors can, while processing, set an attribute errorField.
This should contain the name of the argument they are processing. If
some error occurs, it will be caught in RowFunction.__call__, which in
turn will change the error to a ValidationError with the appropriate
source field.
Templates
+++++++++
We support providing custom XHTML templates for both the query form and
the response table. To do that, say something like ```` in your service definition. For
type, there currently are two legal values, ``form`` for the template
used to format the form, and ``response`` for the resdir-relative file
name of the template used to format the query response.
In the code, this mechanism is implemented by making docFactory a
property that checks if there is a customTemplate attribute on the
instance. If it is, its content is used as an absolute path to a nevow
XML template (i.e., the instances need to do resolution of the relative
paths in the resource descriptors themselves), if not, the
defaultDocFactory attribute of the instance is used. There is a mixin
implementing this functionality in gavo.web.common.CustomTemplateMixin.
Dispatching
+++++++++++
Requests come in through dispatcher.ArchiveService. This parses the
query URI in its locateChild method. All resource based services have
URIs like ``//``. ArchiveService decides
based on the action what child to return, using the renderClasses dictionary. This
should map to rend.Page subclasses, the service pages.
For now, ArchiveService will create a new class using whatever is in
renderClasses. I guess we might save time if we'd reuse those classes
as appropriate, but let's not go that way as long as the thing is fast
enough.
On instantiation, the service page receives the current context and the
segments pertaining to its rd and service in one go. Use the
web.common.parseServicePath function to get the rd and the service name from
this.
Access control
++++++++++++++
Services can be "protected" by setting their requiredGroup field.
Service-based renderers should be constructed through
resourcebased.getServiceRend and then get (currently, sigh,
http-basic-based) protection automatically. Everything else will have
to use something in web.creds instead.
Publication
+++++++++++
Publications are handled via the publications list field of
service.Service plus a lot of metadata. Its values are dictionaries
with at least the keys ``render`` (the renderer used, i.e., the last
item in the service path) and ``type``. The value of type has to be an
element of a locally controlled vocabulary defined in gavo.web.registry.
Values include "web" (something that provides a web form) and "siap" (a
SIAP-compliant service).
These dictionaries end up in gavo.web.servicelist.makeRecord when
gavopublish runs and are turned into records for the tables described by
``__system__/services/services.vord``.
Interfaces
==========
One nice architectural feature would be to have interfaces actually
implement functionality -- a positional interface could tell how to run
cone searches on it, etc. I haven't yet done this, because in effect
we'd like to have another indirection here: various interfaces could
implement SIAP-like queries, etc. Also, we'd have to keep interfaces
around at runtime. While it's quite clear that interfaces would always
be tied to table definitions (I should rename RecordDef at some
point...), the actual location of the interfaces requires some
thought.
Meanwhile, all interfaces have the getNodes method, which is nifty for
testing -- it (almost) returns a list of datadef.DataField instances for
the fields defined in the interface. See tests/testsiap.py for an
example of testing against an interface if you're lazy and don't want an
actual table implementing the interface.
User management
===============
The rights model used here is simple: There are users and groups, where
for each user there's a group and vice versa. Access restrictions are
stated in terms of groups, and users can belong to groups. Access to a
protected resource is given to any user that belongs to that group.
Users and groups are stored in two database tables generated from
user.vord. Admittedly, it's overkill to use the gavo framework for
these tables, but there you are.
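The rights model is simple enough to fit in a toy sketch. This is illustrative only; the real code keeps users and groups in the database tables just mentioned:

```python
# Toy model of the rights scheme described above.
class Users(object):
    def __init__(self):
        self.groupsForUser = {}

    def addUser(self, name):
        # for each user there's a group of the same name
        self.groupsForUser[name] = {name}

    def addToGroup(self, user, group):
        self.groupsForUser[user].add(group)

    def hasCredentials(self, user, requiredGroup):
        # access is granted to any user belonging to the required group
        return requiredGroup in self.groupsForUser.get(user, set())
```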
Memoization
===========
The resourcecache module should be the central point for all kinds of
memoization/caching issues. To keep dependencies and risks of
recursive imports low, it is the providing modules' responsibility to
register caching functions. The idea is that, e.g., importparser wants
a cache of resource descriptors. It should then call
resourcecache.makeCache("getRd", getRd)
Clients would then call
resourcecache.getRd(id).
This mechanism for now is restricted to items that come with a unique
id (the argument). It would be easy to extend this to multiple-argument
functions, but I don't think that's a good idea -- the "identities" of
the cached objects should be kept simple.
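A stripped-down model of this registration scheme (the real resourcecache module differs in detail, e.g., in how the cached function is exposed):

```python
# Minimal model of resourcecache.makeCache / clearCaches.
class ResourceCache(object):
    def __init__(self):
        self._caches = {}

    def makeCache(self, name, computer):
        cache = {}
        def cached(id):
            # single-argument memoization: the id is the identity
            if id not in cache:
                cache[id] = computer(id)
            return cache[id]
        self._caches[name] = cache
        setattr(self, name, cached)

    def clearCaches(self):
        for cache in self._caches.values():
            cache.clear()
```

With this, a providing module calls makeCache("getRd", getRd) once, and clients simply call resourcecache.getRd(id).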
No provision is made to prevent accidental overwriting of function names.
This scheme has the central flaw that you need to make sure that you or
somebody else actually imports the module defining the resource you want
to access.
All caches can be cleared by calling clearCaches. This only affects
caches made via makeCache. There probably should be similar mechanisms
for other shared resources, since we want to be able to respond to
"reload"-like requests.
Coordinate Systems
==================
This is a very weak point right now. All those transforms aren't
rocket science, but still need to be worked out.
Cartesian Coordinates
+++++++++++++++++++++
For many geometrical operations, we use cartesian coordinates (a.k.a.
c_x, c_y, c_z). These correspond to the point at which the radius
vector to RA, Dec crosses the unit sphere. These coordinates have many
desirable properties (e.g., there are no "stitch" lines for them).
The cartesian coordinates are oriented such that the x axis is aligned
with alpha=0, the y axis with alpha=90 degrees or 6 hours, and the z
axis with delta=90 degrees. Most of this is implemented in coords.py.
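The conversion itself is the standard one; here is a sketch (the real code lives in coords.py, and the function name here is just for illustration):

```python
import math

def unitSphereCoords(ra, dec):
    """Return (c_x, c_y, c_z) for RA and Dec given in degrees.

    x is aligned with alpha=0, y with alpha=90 degrees, z with
    delta=90 degrees, as described above."""
    ra, dec = math.radians(ra), math.radians(dec)
    return (math.cos(ra) * math.cos(dec),
            math.sin(ra) * math.cos(dec),
            math.sin(dec))
```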
If you're doing more with this than just computing it and handing it over
to the DB engine, you may want to use the Vector3 class in coords. I wrote
it when I tried to do siap-style matches using cartesians. Since that
doesn't work, the class is, as it were, orphaned right now.
SIAP
====
Our plan for coping with the nasty stitching and the degeneracy at the
poles with polar coordinates and SIAP is as follows:
Every table supporting siap queries has two fields, primaryBox and
secondaryBox; in SQL both are of the type BOX, and I have a local class
Box in coords that these may be converted from/to.
For "harmless" areas, only primaryBox is non-NULL and contains a
bounding box for the area in RA and DEC. A field that covers the stitch
line at 360/0 has a secondaryBox; the primary box is left of the stitch
line (i.e., box.x1<360, box.x0=360), the secondary box is right of the
stitch line (i.e., box.x1=0, box.x0>0).
The ROI works analogously.
With this plan, the SIAP conditions become (in Postgres notation):
* COVERS: primaryBox ~ roiPrimary AND (secondaryBox IS NULL OR
secondaryBox ~ roiSecondary)
* ENCLOSED: roiPrimary ~ primaryBox AND (roiSecondary IS NULL OR
roiSecondary ~ secondaryBox)
* CENTER: roiCenter ~ primaryBox OR roiCenter ~ secondaryBox
* OVERLAPS: primaryBox && roiPrimary OR (
  secondaryBox IS NOT NULL AND secondaryBox && roiSecondary) OR (
  secondaryBox IS NOT NULL AND secondaryBox && roiPrimary) OR (
  roiSecondary IS NOT NULL AND roiSecondary && primaryBox)
There's code in siap.py that computes these boxes.
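For illustration, splitting an RA range into primary and secondary boxes along these conventions might look like this; the box representation here is a made-up (x0, y0, x1, y1) tuple, not the coords.Box class:

```python
# Illustrative only: split an RA/Dec rectangle that may cross the
# 360/0 stitch line into a primary and (possibly) secondary box.
def splitBoxes(raMin, raMax, decMin, decMax):
    if raMin <= raMax:
        # "harmless" area: one bounding box, no secondary
        return ((raMin, decMin, raMax, decMax), None)
    # crosses the stitch line: primary box is left of it,
    # secondary box is right of it
    return ((raMin, decMin, 360.0, decMax),
            (0.0, decMin, raMax, decMax))
```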
Databases
=========
The code started out using pypgsql, but after handling multimillion-row
datasets, I largely gave up on it, instead going for psycopg2. Most
things should still work with both interfaces. However, psycopg2
by default uses Python's native datetime rather than eGenix' mxDateTime.
psycopg2 needs to be compiled with mxDateTime support, or bad things
will happen.
Currently, the main thing that doesn't work for pypgsql is SIAP. I use
boxes in there and only adapt them for psycopg2. It probably wouldn't
be hard to retrofit it to pypgsql, but I can't see much sense in doing
this.
Test Plan
=========
(mostly ARI-specific)
Before any commit, run trunk/tests/runtests.sh (this will start a local
postgres used just for testing). It runs unit tests independent of any
DB content.
trunk/utils/roughtest.py requires the configured dc service to run and
depends on ARI-local data. Run it if you're developing at ARI, build
your own if you're anywhere else.
When you have changed anything in the registry, run
trunk/utils/registryrest.py. It also contains ARI-specific stuff.
Then mirror the stuff using ~/gavo/mirror.sh. Restart the DC on
alnilam, run roughtest.py on tucana. If all is well, do the commit.
Scrap
=====
To get a value from a rowdict and a recordDef, use the getValueIn method of
datadef.DataField. This way, you'll have consistent semantics of source and
value. You may pass in an @-expanding function, but if you're sure there are
no @s or you don't want them expanded, it doesn't hurt if you don't.
Warts
=====
The whole system has lots and lots of warts, and the main thing I can
say in the way of justification is that most of the time, I didn't
really know where the whole thing was to be going. At the few times I
really knew what I was going for, I, admittedly, went for q'n'd
"solutions" quite frequently.
Grep for XXX in the sources to find (sometimes minor) warts. Here, I'll
collect some higher level warts:
* Table names have a horribly defined semantics. They are used as table
names in SQL, as ids and, in the context of web services, as function
designators ("output"). At least the last function is incompatible
with the others (because more than one "output" will be present within
a schema). We should untangle this, probably introducing a "function"
attribute for RecordDef and allowing tables without names.
* ...and RecordDef is a lousy name for what really is a table
definition.
* The whole business of passing InternalDataSets around when processing
web queries and usually just using the first table in them is a pain.
I think DataSets should have a designated "interesting" table when
something insists on having a single table, and instances where things
insist on that should be dramatically reduced in the first place.
Lessons learnt
==============
There's an old saying "Get your data structures right, and the rest of
the program will write itself". This couldn't be more true here, and
unfortunately, I didn't get them right, mostly because I started out
doing something utterly useless.
Anyway, I think one should set out doing a "pure python" model of a
VOTable, i.e., a set of classes modelling *easily accessible* and
*easily constructable* (possibly along the lines of stan?) data tables
containing all the information necessary to build a VOTable. The
VOTable data model isn't all that bad. Also, when it comes to
generating VOTables according to DAL protocols, having abstracted away
LINK, INFO, and PARAM is a major pain.
These VOTables should be able to exist standalone, but having
descriptors containing standard metadata certainly is a good idea. I'm
still not sure how I'd do this. Resource descriptors like I have them
here are too fat, anyway.
And, think of the metadata from the very start. Having to glue them on
later is a pain. I guess that's less of a problem once there are
self-contained Tables that don't (necessarily) refer to external
objects.
Also, in your data trees, make sure you always have a parent reference.
The managed-attribute things (record.Record) probably were not a good
idea. I liked the idea because it *may* help when parsing from XML
(that is what they still are used for almost exclusively), but I never
actually used the stuff. Having accessor methods generated
automatically is nice, but if you absolutely want this, do it properly
and use metaclasses. Also, think about how you'll copy these records
from the beginning. And think about inheritance; the additionalFields
hack sucks bad.
It was a bad idea to stuff information on how a field should be parsed
(source, literalForm, nullLiteral) into the same field definition
describing the field itself. At the very least, the fields used for
parsing should be joined in a subelement of field.
The separation between core and renderer is ok, though its shape is not
quite right yet. The interface between the two could use some work. But:
*Both* cores and renderers should be able to define condDescs -- this
would do away with that silent nonsense. This might also be a way to
get rid of those stupid output filters. They were not a good idea.
Maybe some kind of general transformer for whatever a core hands to a
renderer might work at some point, but then one again needs to be
careful not to make the prediction what output fields are generated
impossible.
.. [#booster] DataSet._parseSources can take an
optional parseSwitcher that you can pass in when you construct a
DataSet directly. The parseSwitcher, if defined, can override the parse
method of the grammar, which is used for parser boosting (e.g., when
filling the table through copy statements).