=======================================
GAVO DaCHS: DirectGrammars and Boosters
=======================================

:Author: Markus Demleitner
:Email: gavo@ari.uni-heidelberg.de


A DirectGrammar is a program that the ingestor calls and that writes
binary COPY material for the table to be filled to stdout.
Because of this, direct grammars can only operate on
single tables; data descriptors containing more than one ``make`` cannot
have direct grammars.  The record definition of the table to be filled
in this way must of course have ``onDisk="true"``.

This program can in principle be anything you like, but usually it is
written in C using the boosterskel infrastructure
(``resources/src/boosterskel.c``).  This infrastructure yields programs that
expect to be called with a single argument, the source file to operate
on.

To use a booster, say

::

  <directGrammar cBooster="res/boosterfunc.c"/>

in a ``data`` element.  The path in the ``cBooster`` attribute is interpreted
relative to the RD's resdir.


Input Formats
-------------

By default a booster expects to read from a text file.  Currently, the
maximum line length is set to 4000 (``INPUT_LINE_MAX`` in
``boosterskel.c``).  It is up to the parsing function to split and
digest this text line.

When you get binary data of fixed record length, set the ``recordSize``
attribute on the ``directGrammar`` element::

  <directGrammar cBooster="path" recordSize="300"/>

Note that a ``recordSize`` larger than ``INPUT_LINE_MAX`` will cause a buffer
overflow.  For such binary input, you will want to generate the booster
using the ``-b`` flag.
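
Why the record size matters can be illustrated with a small sketch.  It
is not boosterskel code: ``readRecord`` is a hypothetical helper, and the
actual booster main loop may be organised differently.  The point is only
that a fixed-size buffer has to be checked against the record size before
anything is read into it::

  #include <stdio.h>
  #include <stddef.h>

  /* Reads one fixed-length record from src into buf.
   * Returns 1 on success, 0 at end of input (or on a short record),
   * and -1 if the record would not fit into the buffer. */
  int readRecord(FILE *src, unsigned char *buf, size_t bufSize,
                 size_t recordSize)
  {
    if (recordSize == 0 || recordSize > bufSize) {
      return -1;  /* refuse rather than write past the end of buf */
    }
    return fread(buf, 1, recordSize, src) == recordSize ? 1 : 0;
  }

With a buffer of ``INPUT_LINE_MAX`` bytes, a ``recordSize`` of 300 passes
this check, while one of 5000 would be rejected.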

Another type of booster reads from text files, but the fields are not
defined by extent but by separator characters.  For the booster
environment, this doesn't make any difference, so you don't need to
declare it.  The booster itself should be generated using the -s flag to
``gavo mkboost``.


Writing a booster
-----------------

To write a booster, first define the table.  You can already specify the
``directGrammar`` as noted above.

Then call ``gavo mkboost`` with the id or the
path to the resource descriptor and the id of the table, e.g.::

  cd res
  gavo mkboost ../q.rd main > boosterfunc.c

This will write template code for a column-based source to stdout.
See above on what to do when your source has a different format.

Place the template code anywhere you like; by convention, it goes into
``res/boosterfunc.c`` unless you have a good reason to do otherwise.

The template code starts somewhat like this::

  #include <math.h>
  #include <string.h>
  #include "boosterskel.h"

  #define QUERY_N_PARS 33

  enum outputFields {
    fi_localid,              /* Identifier, text */
    fi_pmra,                 /* PM (alpha), real */
    fi_pmde,                 /* PM (delta), real */

You can add includes as you need them.  The definition of ``QUERY_N_PARS``
(which is the number of columns in the table) is essential and must not
be edited or removed, since the function building the booster greps it out of
the source code to communicate this value to the booster boilerplate.

The code continues with an enumeration of the field indices; the names are
simply ``fi_`` followed by the lowercased field destination.  If you only
use these names to access fields, the cut'n'paste effort should remain
manageable if the table changes later, and you can add your code directly
in the mkboost-generated source.

While you shouldn't need to change any of this, you have to change the
function that follows::

  Field *getTuple(char *inputLine)
  {
    static Field vals[QUERY_N_PARS];

    parseWhatever(inputLine, F(fi_localid), start, len);
    parseFloat(inputLine, F(fi_pmra), start, len);
    parseFloat(inputLine, F(fi_pmde), start, len);
    parseFloat(inputLine, F(fi_raerr), start, len);

Here, it's your job to assign the proper values to the ``Field``
elements of ``vals``.  The template adds parseXXX function calls that
will do in the simplest cases, except you need to fill in ``start`` (the
character index of the start of the value in C-convention, i.e.,
the first character has index 0) and ``len``, the length of the field
in characters.
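
What ``start`` and ``len`` mean can be shown with a stand-alone sketch.
``sliceToDouble`` is a hypothetical helper written for this illustration;
it is not one of the boosterskel parse functions, whose actual behaviour
(in particular on malformed input) may differ::

  #include <ctype.h>
  #include <stdlib.h>
  #include <string.h>

  /* Parses the len characters of src beginning at the zero-based index
   * start as a double.  Returns 1 and stores the value in *result on
   * success; returns 0 if the slice is blank, malformed, too long, or
   * extends past the end of src. */
  int sliceToDouble(const char *src, int start, int len, double *result)
  {
    char buf[64];
    char *end;
    double value;

    if (start < 0 || len <= 0 || len >= (int)sizeof(buf)) return 0;
    if (strlen(src) < (size_t)start + (size_t)len) return 0;

    /* Copy the slice so it can be NUL-terminated for strtod. */
    memcpy(buf, src + start, (size_t)len);
    buf[len] = '\0';

    value = strtod(buf, &end);
    if (end == buf) return 0;          /* nothing numeric in the slice */
    while (isspace((unsigned char)*end)) end++;
    if (*end != '\0') return 0;        /* trailing garbage */

    *result = value;
    return 1;
  }

For an input line ``ID0001  12.50  -3.25``, the second value would be
addressed with ``start`` 8 and ``len`` 5.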

``Field`` is defined as follows::


  typedef struct Field_s {
    valType type;
    int length; /* ignored for anything but VAL_TEXT */
    union {
      char *c_ptr;
      double c_double;
      float c_float;
      int32_t c_int32;
      int8_t c_int8;
    } val;
  } Field;

where ``type`` is one of::

  typedef enum valType_e {
    VAL_NULL,
    VAL_BOOL,
    VAL_CHAR,
    VAL_INT,
    VAL_FLOAT,
    VAL_DOUBLE,
    VAL_TEXT,
    VAL_JDATE,
  } valType;

``VAL_JDATE`` is a Julian day number to be dumped as a date (rather than a
datetime).  For other ways to represent dates and datetimes, see below.

You can, and frequently will, fill in the fields by hand.  There are,
however, a couple of functions that take care of some standard situations,
in particular when parsing column-structured text files:

* ``void linearTransform(Field *field, double offset, double factor)`` --
  changes field in place to ``offset+factor*oldValue``.  Handles NULL
  correctly and silently does nothing for anything non-numeric.
* ``void parseFloatWithMagicNULL(char *src, Field *field, int start, int
  len, char *magicVal)`` -- parses a float from ``src[start:start+len]`` into
  field, writing NULL when magicVal is found in the field.
* ``void parseDouble(char *src, Field *field, int start, int len)`` --
  parses a double from ``src[start:start+len]`` into field, writing NULL if
  it's empty.
* ``void parseInt(char *src, Field *field, int start, int len)`` -- parses a
  32-bit int into field.
* ``void parseShort(char *src, Field *field, int start, int len)`` -- parses a
  16-bit int into field.
* ``void parseBlankBoolean(char *src, Field *field, int srcInd)`` -- parses
  a boolean such that field becomes true when ``src[srcInd]`` is not blank.
* ``void parseString(char *src, Field *field, int start, int len, char
  *space)`` -- copies len bytes starting at start from src into space (you
  are responsible for allocating that; usually, a static buffer should
  do, since the postgres input is generated before the next input line
  is parsed) and stuffs the whole thing into field.
* ``void parseChar(char *src, Field *field, int srcInd)`` -- stores the
  character ``src[srcInd]`` in field.

Of course, you can also manually copy or delimit data and use fieldscanf
as documented in `Boosters reading from separated data`_.


Boosters reading from binary data
---------------------------------

Your booster can also read from binary data.  You are mainly on your own
in terms of segmentation, but for entering values, you can use the
following macros:

* ``MAKE_NULL(fi)`` -- makes fi NULL
* ``MAKE_DOUBLE(fi, value)`` -- makes fi a double with value
* ``MAKE_FLOAT(fi, value)``, ``MAKE_SHORT(fi, value)``,
  ``MAKE_CHAR(fi, value)``, ``MAKE_JDATE(fi, value)`` -- likewise for the
  respective types
* ``MAKE_TEXT(fi, value)`` -- note that you must manage the memory of value
  yourself.  In particular, it must not be automatic memory of ``getTuple``,
  since that will not be valid when the tuple actually is built.  Most
  commonly, you'll be using a static buffer here.
* ``MAKE_CHAR_NULL(fi, value, nullvalue)`` -- makes fi a char with value
  unless ``value==nullvalue``; in that case, fi becomes a NULL


For these in particular, use the portable type specifiers for
integral types, viz., ``int8_t``, ``int16_t``, ``int32_t``, and
``int64_t``, and these names with a ``u`` in front.


Boosters reading from separated data
------------------------------------

When the input data comes as xSV (e.g., values separated by vertical
bars or tabs), use ``gavo mkboost -s <splitter>``, e.g.::

  gavo mkboost -s '\t' q.rd sources

for Tab-separated values.  This creates a source like::

  char *curCont = strtok(inputLine, "\t");
  fieldscanf(curCont, fi_objid, VAL_INT_64);
  curCont = strtok(NULL, "\t");
  fieldscanf(curCont, fi_run, VAL_SHORT);

etc.  Thus, the input line is parsed using strtok, and each value is
parsed using the fieldscanf function.  This function takes the string
containing the literal in the first argument, the field index in the
second, and finally the type specifier.  If the data comes in the
sequence of the table columns, the generated source *might* just work.
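
One property of ``strtok`` is worth keeping in mind with such sources: it
treats a run of adjacent separators as a single separator, so empty fields
vanish and all later values shift by one column.  The following sketch
makes this visible by counting the tokens ``strtok`` returns;
``countTokens`` is a hypothetical helper and not part of boosterskel::

  #include <string.h>

  /* Counts the tokens strtok finds in line when splitting at any of the
   * characters in seps.  Note that strtok modifies line in place. */
  int countTokens(char *line, const char *seps)
  {
    int count = 0;
    char *tok;

    for (tok = strtok(line, seps); tok != NULL; tok = strtok(NULL, seps)) {
      count++;
    }
    return count;
  }

For a line with three tab-separated values this returns 3, but when the
middle value is empty (two tabs in a row) it returns 2.  If your data can
contain empty fields, the generated code will probably need adapting.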


Filling in data manually
------------------------

The ``F(index)`` macro lets you access the field info directly.  So, you
could enter a fixed-length piece of memory into ``fi_magic`` like this::

  static char bufForMagic[8];

  memcpy(bufForMagic, inputLine+20, 8);
  F(fi_magic)->type = VAL_TEXT;
  F(fi_magic)->val.c_ptr = bufForMagic;
  F(fi_magic)->length = 8;

Having static buffers in ``getTuple`` is usually ok since the COPY input
is generated before ``getTuple`` is called again.

It is quite common to have to handle null values.  In the example above,
this could look like this if a NULL for magic were signified by an F in
``inputLine[19]``::

  static char bufForMagic[8];

  if (inputLine[19]=='F') {
    F(fi_magic)->type = VAL_NULL;
  } else {
    memcpy(bufForMagic, inputLine+20, 8);
    ...
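
Put together, the NULL check and the copy might look like the following
sketch.  To keep it compilable on its own, it declares a cut-down
``SketchField`` that merely mirrors the ``Field`` structure shown earlier,
and it takes the target field as an argument instead of using ``F()``; in
a real booster you would use the definitions from ``boosterskel.h``.  All
names introduced here are hypothetical::

  #include <string.h>

  typedef enum { SK_NULL, SK_TEXT } sketchType;

  typedef struct {
    sketchType type;
    int length;
    char *c_ptr;
  } SketchField;

  /* Fills field from the eight characters at inputLine+20, or makes it
   * NULL if inputLine[19] is 'F'.  The caller must make sure inputLine
   * has at least 28 characters. */
  void fillMagic(const char *inputLine, SketchField *field)
  {
    /* static: the field only stores a pointer, so the buffer must
     * outlive this call */
    static char bufForMagic[8];

    if (inputLine[19] == 'F') {
      field->type = SK_NULL;
      field->length = 0;
      field->c_ptr = NULL;
    } else {
      memcpy(bufForMagic, inputLine + 20, sizeof(bufForMagic));
      field->type = SK_TEXT;
      field->c_ptr = bufForMagic;
      field->length = (int)sizeof(bufForMagic);
    }
  }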

Casts
'''''

Make sure you always properly cast what you read, e.g.,

::

  MAKE_DOUBLE(fi_dej2000, -90+*(int32_t*)(line+4)); /* SPD */

You can use the ``int<len>_t`` and ``uint<len>_t`` types defined by your
compiler's (or library's) headers.
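
A cast like the one above reads the value in host byte order and assumes
that the address is suitably aligned for ``int32_t``, which not every
platform guarantees.  If that is a concern, a ``memcpy``-based read is a
common alternative.  The sketch below is not part of boosterskel;
``readInt32`` is a hypothetical helper::

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Returns the int32_t stored (in host byte order) at rec+offset,
   * without requiring rec+offset to be aligned. */
  int32_t readInt32(const char *rec, size_t offset)
  {
    int32_t value;

    memcpy(&value, rec + offset, sizeof(value));
    return value;
  }

The example above could then be written as
``MAKE_DOUBLE(fi_dej2000, -90+readInt32(line, 4));``.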


Utility Functions
'''''''''''''''''

While you can, of course, manipulate ``F(fi_X)->val`` as you see fit,
it may be convenient to use boosterskel utility functions:

* ``linearTransform(fi, offset, factor)`` -- computes
  factor*value+offset for floats, doubles, and ints.
* ``mjdToJYear(mjd)`` -- returns a Julian year for mjd (for when you don't
  want actual timestamps in your tables, which you shouldn't).
* ``AS2DEG(field)`` -- turns a field value in arcsecs to degrees
* ``MAS2DEG(field)`` -- turns a field value in milli-arcsecs to degrees
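
The arithmetic behind these helpers is simple.  The following sketch
spells it out on plain doubles rather than on fields.  The ``sketch...``
functions are hypothetical stand-ins, and the Julian year formula is the
usual definition (J2000.0 at MJD 51544.5, 365.25 days per year), which is
assumed here rather than taken from the boosterskel source::

  /* offset + factor*value, as described for linearTransform */
  double sketchLinear(double value, double offset, double factor)
  {
    return offset + factor * value;
  }

  /* arcseconds to degrees: 3600 arcsec per degree */
  double sketchAs2Deg(double arcsec)
  {
    return arcsec / 3600.0;
  }

  /* milli-arcseconds to degrees: 3.6e6 mas per degree */
  double sketchMas2Deg(double mas)
  {
    return mas / 3600000.0;
  }

  /* Modified Julian Date to Julian year */
  double sketchMjdToJYear(double mjd)
  {
    return 2000.0 + (mjd - 51544.5) / 365.25;
  }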


Dates and times
'''''''''''''''

The boosters treat "normal" dates and datetimes as ``struct tm`` values.
If you need a larger range, use ``VAL_JDATE``, which lets you store Julian
dates in floats.  Julian dates are serialized to dates rather than
datetimes.

To parse ``VAL_DATE`` or ``VAL_DATETIME``, you will write something
like::

	fieldscanf(curCont, fi_date, VAL_DATE, "%Y-%m-%d");

if parsing from date strings.  If your input is something weird, figure
out a way to generate a ``struct tm`` as defined in ``time.h``.  Then
write::

  struct tm timeParts;
  timeParts.tm_sec = 12;
  ...
  timeParts.tm_year = 1920 - 1900;  /* tm_year counts years since 1900 */
  F(fi_dt)->val.time = timeParts;
  F(fi_dt)->type = VAL_DATETIME;

(or ``VAL_DATE``, as the case may be).
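
When filling a ``struct tm`` by hand, keep in mind that ``time.h`` defines
``tm_year`` as years since 1900 and ``tm_mon`` as months since January
(0 to 11).  A minimal sketch that builds such a structure from calendar
values follows; ``makeTm`` is a hypothetical helper and not a boosterskel
function::

  #include <string.h>
  #include <time.h>

  /* Builds a struct tm from calendar values (year like 1920, month
   * 1..12).  No range checking or normalisation is done. */
  struct tm makeTm(int year, int month, int day, int hour, int min, int sec)
  {
    struct tm parts;

    memset(&parts, 0, sizeof(parts));  /* clear fields not set below */
    parts.tm_year = year - 1900;
    parts.tm_mon = month - 1;
    parts.tm_mday = day;
    parts.tm_hour = hour;
    parts.tm_min = min;
    parts.tm_sec = sec;
    return parts;
  }

The result of, say, ``makeTm(1920, 3, 7, 10, 30, 12)`` could then be
assigned to the field's time value as shown above.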