================================
Development notes for GAVO DaCHS
================================

:Author: Markus Demleitner
:Email: gavo@ari.uni-heidelberg.de

Some of this is severely out of date.

Package Layout
==============

The following rules should be followed as regards subpackages of gavo
in order to keep the module dependency graph manageable (and to
facilitate factoring out libraries).

* Each functionality block is in a subpackage, the __init__ for which
  contains the main functions, classes, etc., of the sub-package
  interface that most clients will be concerned with.  Clients needing
  special tricks may still import individual modules.  What's in
  __init__ should be considered "public interface".

* Within each subpackage, *no* module imports the sub-package, i.e.,
  a module in base never says "from gavo import base".

* A subpackage may have a module common, containing objects that
  multiple modules within that subpackage require.  common may *not*
  import any module from the subpackage, but may be imported from all
  of them.  For other modules, no rules exist as regards importing
  modules from the same subpackage.  Just apply common sense here to
  avoid circular imports.

* There is a hierarchy of subpackages, where subpackages lower in the
  hierarchy may not import anything from higher or equal levels, but
  only from lower levels.  This hierarchy currently looks like this::

    imp [<] utils < stc < (votable, adql) < base < rscdef
      < grammars < formats < rsc < svcs < registry
      < protocols < web < rscdesc < (helpers, user)

  utils should never assume anything from imp is present, i.e., it may
  *attempt* to import from there, but it should not fail hard if the
  import doesn't work.  Of course, concrete functions (e.g., from
  utils.fitstools) won't work if the base libraries are not present.

Error handling, logging
=======================

Exception classes
+++++++++++++++++

The goal is that all errors that can be triggered from the web or from
within resource descriptors yield sensible error messages with, if
possible, information on the location of the error.  Also, major
operations changing the content of the database should be loggable
with time and, probably, user information.

The core of error processing is utils.excs.  All "sensible" exceptions
(i.e., excepting MemoryErrors and software bugs) should be instances
of gavo.excs.Error.  However, upwards from base you should always
raise exceptions from base; all ("public") exception types from
utils.excs are available there (i.e., raise base.NotFoundError(...)
rather than utils.excs.NotFoundError(...)).

The base class takes a hint argument at construction that should give
additional information on how to fix the problem that gave rise to the
exception.  All exception constructor arguments except the first one
must always be keyword arguments, as a simple hack to allow pickling
the exceptions.

When defining new exceptions, if there is structured information
(e.g., line numbers, keys, and the like), always keep the information
separate and use the __str__ method of the exception to construct
something humans want to see.  All built-in exceptions should accept a
hint keyword.

The events subsystem
++++++++++++++++++++

All proper DC code (i.e., above base) should do user interaction
through base.ui.notify<something>.  In base and below, you can use
utils.sendUIEvent, but this should be reserved for weird
circumstances; code so far down shouldn't normally need to do user
interaction or similar.

The <something> can be various things.
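For example, code above base might notify the user of some piece of
information like this (a sketch; ``notifyInfo`` and the
``sendUIEvent`` signature are as I remember them, see base.events for
what is actually defined)::

  from gavo import base

  base.ui.notifyInfo("Starting import")

whereas code in utils, which may not import base, would say::

  from gavo import utils

  utils.sendUIEvent("Info", "Starting import")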
base.events defines a class EventDispatcher (an instance of which then
becomes base.ui) that defines the notify<something> methods.  The
docstrings there explain what you're supposed to pass, and they
explain what observers get.

base.events itself does very little with the events, and in particular
it does not do any user interaction -- the idea is that I may yet want
to have Tkinter interfaces or whatever, and they should have a fair
chance to control the user interaction of a program.

The actual action on events is done by observers; these are usually
defined in ``user``, and some can be selected from the ``gavo``
command line.  For convenience, you should derive your Observer
classes from base.ObserverBase.  This lets you do things like::

  from gavo.base import ObserverBase, listensTo

  class PlainUI(ObserverBase):
    @listensTo("NewSource")
    def announceNewSource(self, srcString):
      print("Starting %s"%srcString)

However, you can also just handle single events by saying things
like::

  from gavo import base

  def handleNewSource(srcToken):
    pass

  base.ui.subscribeNewSource(handleNewSource)

Most logging is done in user.logui; if you want logging, say::

  from gavo.user import logui
  logui.LoggingUI(base.ui)

Catching exceptions
+++++++++++++++++++

In the DC software, it is frequently desirable to ignore the first
rule of exception handling, viz., leave them alone as much as
possible.  Instead, we often map exceptions to DC-internal exceptions
(this is very relevant for everything leading up to ValidationErrors,
since they are used in user interaction on the web interface).
However, to make the original exception information available for
debugging or problem fixing, whenever you "translate" an exception,
have ``base.ui.notifyExceptionMutation(newException)`` called.  This
should arrange for the original exception being written to the error
log (although of course that's up to the observer selected).

The convenient way to do this is to call ``ui.logOldExc(exc)``::

  raise base.ui.logOldExc(GavoError(...))

LoggingUI only logs the information on old exceptions when base.DEBUG
is true.  You can set this from your code, or by passing the
``--debug`` option to gavo.

Testing
=======

All unit tests must import gavo.helpers.testhelpers before importing
anything else from the gavo namespace.  This is because testhelpers
sets up a test environment in /var/tmp/gavo_test (set in
tests/data/test-gavorc).  To make this work reliably, it must
manipulate the normal way configuration files are read.

helpers.testhelpers needs a dachstest database for which the current
user is a superuser; so, before running tests, do::

  createdb --encoding=UTF-8 dachstest

You'll also need lines like::

  local dachstest ident
  local dachstest all md5

(the second line is for the DaCHS admin, trustedquery, and untrusted
profiles) in your pg_hba.conf.

There are doctests in modules (though fewer than I'd like), and
pyunit- and trial-based tests in ``/tests``.  ``tests/runAllTests.py``
takes care of locating and executing them all.

In addition to setting up the test environment, testhelpers provides
(check out the source) some useful helper functions (like
``getTestRD``) and the ``VerboseTest`` class, which adds test
resources and some assertions to the normal ``unittest.TestCase``.
Do *not* import it in production code.  Test-like functionality
interesting to production code should go to ``helpers.testtricks``.

``testhelpers.main`` is useful after an ``if __name__=='__main__'``
in test modules.
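A minimal test module might thus look like this (a sketch;
``SmokeTest`` and its method are made up, the helpers named are the
ones described above)::

  from gavo.helpers import testhelpers  # must be the first gavo import

  class SmokeTest(testhelpers.VerboseTest):
    def testTestRDLoads(self):
      self.assertTrue(testhelpers.getTestRD() is not None)

  if __name__=="__main__":
    testhelpers.main(SmokeTest)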
Pass a default test class, and you can call the module without
arguments (in which case it will run all tests), with a single
argument (that will be interpreted as a method prefix to locate tests
on the default TestCase), or with two arguments (a TestCase name and a
method prefix to find the methods to be run).  All pyunit-based tests
use this main.

``testhelpers.main`` evaluates the ``TEST_VERBOSITY`` environment
variable.  With ``TEST_VERBOSITY=2``, you'll see the test names as
they are executed.

Regression testing of data
++++++++++++++++++++++++++

For certain kinds of data, unit testing is useful, too.  Since it's
always possible that server code changes may break such tests, it
makes sense to run those unit tests on each commit.  Therefore,
``tests/runAllTests.py`` has a facility to pick up such tests from
directories named in ``__tests/__unitpaths`` within $GAVO_INPUTS (the
"real" one, not the fake test one).  It will pick up tests from there
just as it picks them up from tests.

Such data-based tests (typically) must run "out of tree", i.e., in the
actual server environment where the resources expected by the tested
service are present.  To keep testhelpers from fudging the
environment, set the environment variable ``GAVO_OOTTEST`` to anything
before importing testhelpers.  This is conveniently done in python,
like this::

  import os
  os.environ["GAVO_OOTTEST"] = "dontcare"

  from gavo.helpers import testhelpers

Structures
==========

Resource description within the DC works via instances of
base.Structure.  These parse themselves from XML strings, do
validation, etc.  All compound RD elements correspond to a structure
class (well, almost; meta is an exception).

A structure instance has the following callbacks:

* ``completeElement(ctx)`` -- called when the element's closing tag is
  encountered, used to fill in computed defaults.  ``ctx`` is a parse
  context that you can use to, e.g., resolve XML ids.
* ``validate()`` -- called after completeElement, used to raise errors
  if some gross ("syntactic") mistakes are in the element.
* ``onElementComplete()`` -- called after validate, i.e.,
  onElementComplete can rely on seeing a "valid" structure.

In addition, attributes can define ``onParentCompleted`` methods.
These are called after onElementComplete of the parent element has
run, when the attribute value is different from its default.  They
receive the new attribute value as their single argument.  Maybe this
is bad design.

This processing is done automatically when parsing elements from XML.
When building elements manually, you should call the structure's
finishElement method when done to arrange for these methods being
called.

If you override these methods, make sure you call the methods of the
superclass.  Since we might, at some point, want mixins to be able to
define validators etc., use super()-based superclass calling, through
``_completeElementNext(cls, ctx)``, ``_validateNext(cls)``, and
``_onElementCompleteNext(cls)``, as in the sketch below.
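Here is what that might look like (a sketch; the attribute definition
and the computed default are made up for illustration)::

  from gavo import base

  class Thing(base.Structure):
    name_ = "thing"
    _width = base.IntAttribute("width", default=None,
      description="Width of the thing")

    def completeElement(self, ctx):
      if self.width is None:
        self.width = 42  # a computed default would go here
      self._completeElementNext(Thing, ctx)

    def validate(self):
      self._validateNext(Thing)
      if self.width<0:
        raise base.StructureError("thing widths must not be negative")

    def onElementComplete(self):
      self._onElementCompleteNext(Thing)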
The ``user.docgen`` module makes documentation out of these
structures.  There are several catches.  One of the more striking is
that element names in the *entire* DaCHS code must be unique, since
docgen generates section headings from those names and actually checks
that these headings are unique; hence, only one (essentially randomly
selected) of two identically-named elements would be documented, and
parent links would both point there.

Since there are cases when that limitation is a real pain (e.g., the
publish element of services and data), there's a workaround: you can
set a ``docName_`` class attribute on a structure that contains the
name used for the documentation.  See ``rscdef.common.Registration``
for an example.

Metadata
========

"Open" metadata (as opposed to the attributes of columns and the like)
is kept in a ``meta_`` structure added by ``base.meta.MetaMixin``.
You should probably not access that attribute directly if at all
possible, since the current implementation is incredibly messy and
liable to change.

For this kind of metadata, a simple inheritance scheme exists.
MetaMixins have a ``setMetaParent`` method that declares another
structure as the current one's meta parent.  Any request for metadata
that cannot be satisfied from self will then be propagated up to this
parent (unless propagation is suppressed).  Usually, parents will call
their children's setMetaParent methods.

The metadata is organized in a tree with ``MetaItem`` instances as
nodes.  Each MetaItem contains one or more children that are instances
of ``MetaValue`` (or of more specialized classes).  A MetaValue in
turn can have more MetaItem children.

Getting Metadata
++++++++++++++++

Metadata are accessed by name (or "key", if you will).  The
``getMeta(key, ...) -> MetaItem`` method usually follows the
inheritance hierarchy up, meaning that if a meta item is not found in
the current instance, it will ask its parent for that item, and so on.
If no parent is known, the meta information contained in the
configuration will be consulted.  If all fails, a default is returned
(which is set via a keyword argument that again defaults to None) or,
if the raiseOnFail keyword argument evaluates to true, a
gavo.NoMetaKey exception is raised.

If you require metadata exactly for the item you are querying, call
getMeta(key, propagate=False).

getMeta will raise a gavo.MetaCardError when there is more than one
matching meta item.  For these, you will usually use a builder, which
will usually be a subclass of meta.metaBuilder.
web.common.HtmlMetaBuilder is an example of what such a thing may look
like; for simple cases you may get by with using ModelBasedBuilder
(see the registry code for examples).  This really is too messy and
needs to be replaced by something smarter.

The builders are passed to a MetaMixin's buildRepr(metakey, builder)
method, which returns whatever the builder's getResult method returns.

Setting Metadata
++++++++++++++++

You can programmatically set metadata on any metadata container by
calling its ``addMeta(key, value)`` method, where both key and value
are (unicode-compatible) strings.  You can build any hierarchy in this
way, provided you stick with typeless meta values or can do with the
default types.  Those are set by key in meta._typesForKeys.

To build sequences, call addMeta repeatedly.  To have a sequence of
containers, call addMeta with None or an empty string as value, like
this::

  m.addMeta("p.q", "x")
  m.addMeta("p.r", "y")
  m.addMeta("p", None)
  m.addMeta("p.q", "u")
  m.addMeta("p.r", "v")
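The five calls above build two ``p`` containers.  Reading them back
might look like this (a sketch; it assumes a MetaItem exposes its
MetaValue children through a ``children`` attribute, and that
MetaValues, being meta containers themselves, answer getMeta)::

  pItem = m.getMeta("p")
  for value in pItem.children:
    # first pass: q=x, r=y; second pass: q=u, r=v
    print(value.getMeta("q"), value.getMeta("r"))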
More complex structures require direct construction of MetaValues.
Use the makeMetaValue factory for this.  This function takes a value
(default empty), and possibly key and/or type arguments.  All
additional arguments depend on the meta type desired.  These are
documented in the `reference manual <./ref.html>`_.

The type argument selects an entry in the meta._typesForKeys table
that specifies that, e.g., _related meta items always are links.  You
can also give the type directly (which overrides any specification
through a key).  This can look like this::

  m.addMeta("info", meta.makeMetaValue("content", type="info",
    infoName="someInfo", infoValue="GIVEN"))

Memoization
===========

The base.caches module should be the central point for all kinds of
memoization/caching tasks; in particular, if you use base.caches, your
caches will automatically be cleared on ``gavo serve reload``.  To
keep dependencies and the risk of recursive imports low, it is the
providing modules' responsibility to register caching functions.

The idea is that, e.g., rscdesc wants a cache of resource descriptors.
Therefore, it says::

  base.caches.makeCache("getRD", getRD)

Clients then say::

  base.caches.getRD(id)

This mechanism for now is restricted to items that come with a unique
id (the argument).  It would be easy to extend this to
multiple-argument functions, but I don't think that's a good idea --
the "identities" of the cached objects should be kept simple.

No provision is made to prevent accidental overwriting of function
names.

Profiling
=========

If you want to profile server actions, try a script like this::

  """
  Make a profile of server responses.

  Call as

  trial --profile createProfile.py
  """

  import sys

  from gavo import api
  from gavo.web import dispatcher

  sys.path.append("/home/msdemlei/gavo/trunk/tests")

  import trialhelpers

  class ProfileThis(trialhelpers.RenderTest):
    renderer = dispatcher.ArchiveService()

    def testOneService(self):
      self.assertGETHasStrings("/ppmx/res/ppmx/scs/form",
        {"hscs_pos": "12 2", "hscs_sr": "20.0"},
        ["PPMX"])

After running, you can use pstats on the file profile.data.

To profile actually running DaCHS operations, use the --profile-to
option of the gavo program.  For the server, you must make sure it
exits cleanly in order to have meaningful stats.  Do this by accessing
/test/exit on a debug server.

Delimited SQL identifiers
=========================

Although it may look like it, we do not really support delimited
identifiers (DIs) as column names (and not at all as table names).  I
happen to regard them as an SQL misfeature and really only want to
keep them out of my software.

However, TAP forces me to deal with them at least superficially.  That
means that using them elsewhere will lead to lots of mysterious error
messages from inside of DaCHS's bowels.  Still, there should not be
any remote exploits possible when using them.

Here's the deal on them:

They are represented as ``utils.misctricks.QuotedName`` objects.
These QuotedNames have some methods to control the impact the partial
support for delimited identifiers has on the rest of the software.  In
particular, when you stringify them, the result is a string ready for
inclusion into SQL (i.e., hopefully properly escaped).  They hash like
their plain names, i.e., there are no implied quotes, and,
unfortunately, hash(di)!=hash(str(di)).  The DC software right now
assumes DIs are ASCII only (what do the standards people say?).

The one really painful thing is the representation of result rows with
DIs -- I did not want to have lots of these ugly QuotedNames in the
result rows, so they end up as SQL-escaped strings when used as keys.
This is extra sad since in this way, for a DI column foo,
rec[QName("foo")] raises a KeyError.  To work around this, fields have
a key attribute, and rec[f.key] should never bomb.
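In code, the behaviour just described would look somewhat like this
(a sketch following the description above)::

  from gavo.utils.misctricks import QuotedName

  di = QuotedName("tricky name")
  str(di)                        # SQL-ready, i.e., quoted and escaped
  hash(di)==hash("tricky name")  # True: no implied quotes in the hash
  hash(di)==hash(str(di))        # False, as lamented above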
Grammars
========

Grammars are DaCHS' means of turning some external data into rowdicts,
i.e., dictionaries that map grammar keys to values that are usually
strings.  They are fed to rowmakers to come up with rows suitable for
ingestion (or formatting).

A grammar consists of a Grammar object, which is a structure
inheriting from grammars.Grammar.  It contains all the "configuration"
(e.g., rules).  Grammars have a parse method receiving some kind of
source token (typically, a file name).  You will normally not need to
override it.

The real action happens in the row iterator, which is declared in the
rowIterator class attribute of the grammar.  Row iterators should
inherit from grammars.RowIterator.

TODO: yieldsTyped, rowfilters, sourceFields, targetData

Do not import modules from the grammars subpackage directly.  Instead,
use rscdef.getGrammar with the name of the grammar you want.  If you
define a new grammar, add a line in
rscdef.builtingrammars.grammarRegistry.  To inspect what grammars are
available, consult the keys of rscdef.grammarRegistry.

Procedures
==========

To embed actual (python) code into RDs, you should use the
infrastructure given in rscdef.procdef.  It basically leads up to
``ProcApp``, which is what's usually embedded in RDs.

``ProcApp`` inherits from ``ProcDef``, a procedure definition.  Such a
definition gives some (python) code that is executed when the
procedure is applied.  To set up the execution environment of this
code, there's the definition's setup child.

The setup contains code and parameters.  The code is executed to set
up the namespace that the procedure will run in; it is thus executed
once -- at construction -- per procedure.  The parameters allow
configuration of the procedure.  This is the place to do relatively
expensive operations like I/O or imports.

For example, ``//procs#resolveObject`` creates the resolver in its
setup code; this happens only once per creation of the embedding RD::

  <procDef type="apply" id="resolveObject">
    <setup>
      <par key="ignoreUnknowns">True</par>
      <par key="identifier" late="True"/>
      <code>
        from gavo.protocols import simbadinterface
        resolver = simbadinterface.Sesame(saveNew=True)
      </code>
    </setup>
    <code>
      ra, dec = None, None
      try:
        ra, dec = resolver.getPositionFor(identifier)
      except KeyError:
        if not ignoreUnknowns:
          raise base.Error("resolveObject could not resolve object"
            " %s."%identifier)
      vars["simbadAlpha"] = ra
      vars["simbadDelta"] = dec
    </code>
  </procDef>

The setup definition introduces two parameters.  One is
ignoreUnknowns, which is "immediate" and just lets the code see a name
ignoreUnknowns.  As with all ``par`` elements, the content of the
element is a python expression providing a default.

The other parameter, identifier, is a "late" parameter.  This means
that it is evaluated on each application of the procedure, much like a
function argument.  These are just translated into assignments at the
top of the function body, which means that everything available in the
procedure code is available; e.g., for rowmaker procedures (i.e.,
type="apply"), you can access ``vars`` here.

Taken together, late and immediate ``par`` allow for all kinds of
configuration of procedures.  This is particularly convenient together
with macros.

To actually execute the code, you need some kind of procedure
application.  These always inherit from procdef.ProcApp and add
bindings.  The ``bind`` element lets you give python expressions for
all names defined using ``par`` in the ``setup`` child of the
``ProcDef`` given in the ``procDef`` attribute.  You can also define
just a procedure application without a procDef by giving ``setup`` and
``code``.

Procedure applications have "types" -- these determine where they can
be used.  In particular, the type determines the signature of the
python callable that the procedure application is compiled into.
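In an RD, applying the procedure defined above might then look like
this (a sketch; I am assuming ``bind``, like ``par``, names its
parameter in a ``key`` attribute, and that the grammar provides an
``object`` key)::

  <apply procDef="//procs#resolveObject">
    <bind key="identifier">vars["object"]</bind>
  </apply>

ignoreUnknowns is not bound here and thus keeps its default of True.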
``procdef.ProcApp`` has no type, and thus is "abstract"; it should
never be a child factory of any ``StructAttribute``.  Instead, inherit
from it and give

* ``name_`` -- the element name, as always in structures.  This is
  "apply" for rowmaker applys, "rowfilter" for grammar rowfilters,
  etc.
* ``formalArgs`` -- a python argument list that gives the arguments of
  the callable a ProcApp of this type is compiled into.  Thus, this
  defines the signature.
* ``requiredType`` -- a type name that specifies what kind of ProcDef
  the application will accept.  This will in general be the same as
  ``name_``.  None would mean accept all, which probably is useless.

So, all you need to do to define a new sort of ProcApp is write
something like::

  class EmbeddedIterator(rscdef.ProcApp):
    name_ = "iterator"
    formalArgs = "self"

(of course, here, documentation as to what the code is supposed to do
is particularly important, so don't leave out the docstring when
actually doing anything).  Then, you could have::

  _iterator = base.StructAttribute("iterator", default=base.Undefined,
    childFactory=EmbeddedIterator,
    description="Code yielding row dictionaries", copyable=True)

in some structure.  To produce something you can execute, then say::

  theIterator = self.iterator.compile()
  for row in theIterator(self):
    print(row)

or somesuch.

Javascript
==========

While it's our goal to let people operate the web-based part of DaCHS
without javascript enabled, it's ok if fancier functionality depends
on javascript.

After some hesitation, we decided to use the jquery javascript library
(we used to have MochiKit but left that when we wanted nice in-browser
plotting; so, if you still see MochiKit somewhere, please disregard
it).  We also include some of jquery-ui.

The result is shipped in minimized form within DaCHS, which means that
we have a little GPL issue here.  I guess we should keep a copy of the
non-minimized source for the current DaCHS js somewhere.  On the other
hand, jquery and jquery-ui currently have fine DVCSes, so I think
we're fine with the spirit of the GPL...

Building jquery-gavo.js
+++++++++++++++++++++++

* Go to a temporary directory.
* Go to http://jqueryui.com/download and make yourself a jquery-ui
  archive.  Currently, we want all of core, plus draggable and
  resizable.  Do not select any theme.
* Unzip the file, then::

    cd js
    cat jquery-*.min.js jquery-ui-*.js > ../jquery-gavo.js

* Copy jquery-gavo.js to gavo/resources/web/js.

Whatever CSS is necessary for jquery should go into gavo_dc.css.

Implementing Protocols
======================

TBD

You should add the protocol mixin(s) to user.docgen.PUBLIC_MIXINS so
they get included in the documentation; likewise, if you need apply
and/or rowfilters, amend PUBLIC_APPLYS or PUBLIC_ROWFILTERS.

Random Stuff
============

Sometimes it's nice to see what gets imported when.  Futzing with PEP
302-style import hooks is a pain, and indeed a simple shell line
produces more useful output than naive hooks::

  strace gavo imp -h 2>&1 | grep 'open' | grep -v ENOENT | grep -v "pyc" | sed -e 's/.*"\(.*\)".*/\1/'