GUMS (Gaia Universe Model Snapshot) is a simulation of part of the GAIA
result set.

A bit more info on what's in the files is on
http://www.rssd.esa.int/wikiSI/index.php?title=GAP:GUMS10&instance=Gaia

The data set was obtained via rsync 2012-02 directly from
gaia.esac.esa.int and provided by William O'Mullane.

The files came in the GAIA-internal "gbin" format, which is a
concatenation of Java serialized strings containing zip archives of Java
serialized objects.  Hell knows why they did it like this.  There's
Python code that can read the original gbin format (see `Parsing
gbin directly`_), but that's too slow for the Milky Way data set.
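
To get a first impression of such a file without any GAIA code, one can
look for the magic numbers of the two container formats involved.  The
following is only a sketch and not part of this package: the function
name ``sniff_gbin`` is made up, and it assumes that the Java
serialization stream header (``AC ED 00 05``) and zip local file headers
(``PK\x03\x04``) appear verbatim in the file, which has not been checked
against actual gbin files::

    # Signatures of the two container formats mentioned above.
    JAVA_STREAM_MAGIC = b"\xac\xed\x00\x05"  # Java serialization stream header
    ZIP_LOCAL_HEADER = b"PK\x03\x04"         # zip local file header


    def sniff_gbin(data):
        """returns a dict saying which container signatures occur in data.

        data is a bytes object, e.g., the first few kilobytes of a file.
        The counts are plain byte pattern matches, so chance hits within
        compressed data are possible.
        """
        return {
            "starts_with_java_stream": data.startswith(JAVA_STREAM_MAGIC),
            "java_streams": data.count(JAVA_STREAM_MAGIC),
            "zip_members": data.count(ZIP_LOCAL_HEADER),
        }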

After some experimentation with a C-based parser for Java serialized
objects (JSO), I decided that writing one powerful enough to robustly
parse the input was too tedious -- in particular, there's simply no way
around keeping all class information and most objects in memory all the
time.  That experiment is what bin/makebooster.py and everything in src
are about; none of it is used.

The data in the database is based on the data.txt.gz files present in
all data subdirectories.  These were generated locally; see `Converting
gbin files`_.

People one could ask about this mess include:

 * Wil O'Mullane
 * Xavi Luri (xluri@am.ub.es; he basically gave permission to publish)


Parsing gbin directly
=====================

For experiments and/or convenience, you can import directly from the
gbin files.  The corresponding data items in q.rd ("gbinimp_x") are
commented out right now.

Here's how to come up with the custom grammars (``res/*grammar.py``) for
these things: ``bin/getschema.makePythonExpression`` (edit the main
function to use it) figures out the field sequence by inspecting what's
in a given file (the schema is different for most subdirectories of
data/GUMS-10).  It then spits out Python source for mapping deserialized
objects to rowdicts, which is all that varies between the custom
grammars.
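
The idea of generating the mapping code can be sketched as follows.
This is not the code in getschema.py; the function
``make_row_mapper_source``, the generated function name ``make_row``,
and the example field names used below are all made up for illustration,
and the sketch assumes that deserialized objects expose their fields as
plain attributes::

    def make_row_mapper_source(field_names):
        """returns Python source for a function make_row(obj).

        The generated function maps an object having the attributes
        listed in field_names to a rowdict keyed by those names.
        """
        lines = ["def make_row(obj):", "    return {"]
        for name in field_names:
            # one dictionary entry per field, in the sequence given
            lines.append("        %r: obj.%s," % (name, name))
        lines.append("    }")
        return "\n".join(lines) + "\n"

Calling ``make_row_mapper_source(["alpha", "delta"])`` (hypothetical
field names) returns source that, once executed, defines a ``make_row``
function to apply to each deserialized object.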

Incidentally, getschema.py has also been used to infer the table
definitions in q.rd, but there are probably better ways to do this using
GAIA's Java mess.


Converting gbin files
=====================

The code in gbindec uses gaiatools and friends to read the files.
Unfortunately, the GAIA support code is a mess that's hard to cut
through.  Therefore, I've jarred together all class files in the
vicinity of gaiatools into gbindec/gaiaenv.jar (roughly 100 MB, not in
version control; let's hope it stays where it is until all this can
mercifully be forgotten).

The code actually doing the work is below gbindec/gaia.  Most
interesting is the stuff in gbindec/gaia/cu3/gbin2ascii/converter --
this is the code that's DM-specific.  It was generated using
bin/getschema.py, too; see the makeAllConverters.sh shell script within
that file.  To build the whole thing, say ``make`` in gbindec.

If you want to see the Java classes that are serialized into the gbin
files, unjar gaiaenv.jar and check gaia/cu1/mdb/cu2/um; interesting classes
reside in dm/ (Root, PhotoRoot, UMAstroRoot) and umtypes/dm (rest).

After that, call ``gbindec/bin2ascii`` with a directory name as its
argument to dump the contents of all the gbin files within that
directory to a text file.  ``bin/convertAllToASCII.sh`` uses that to
traverse the whole data tree and create the data.txt.gz files.
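
The traversal can be pictured like this.  It is only a sketch in Python,
not a transcript of ``bin/convertAllToASCII.sh``: the ``.gbin`` file
extension and the function names are assumptions, and the sketch merely
lists the calls to ``gbindec/bin2ascii`` rather than running them::

    import os


    def iter_gbin_dirs(root, extension=".gbin"):
        """yields every directory below root that directly contains at
        least one file with a name ending in extension.
        """
        for dir_path, dir_names, file_names in os.walk(root):
            dir_names.sort()  # make the traversal order deterministic
            if any(name.endswith(extension) for name in file_names):
                yield dir_path


    def build_commands(root, converter="gbindec/bin2ascii"):
        """returns one argument list per directory to convert."""
        return [[converter, dir_path] for dir_path in iter_gbin_dirs(root)]

Each returned argument list could then be handed to ``subprocess.run``;
how the output ends up in ``data.txt.gz`` is left out here because it
depends on what the real script does.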


The actual boosters were then built using::

	gavo mkboost -s '|' q sn > res/snbooster.c
	gavo mkboost -s '|' q quasars > res/quasarsbooster.c
	gavo mkboost -s '|' q galaxies > res/galaxiesbooster.c
	gavo mkboost -s '|' q mw > res/hostedbooster.c

-- plus a few manual fixes:

 * variabilitytype can be NULL,
 * hasphotocentermotion has the values true/false, which need to be
   converted to 1/0,
 * at the end, one needs a ``*strchr(curCont, ' ') = 0`` to terminate
   sourceextendedid.

Since the column sequence in the RD matches the sequence within the
``data.txt.gz`` files, the boosters should work as generated apart from
the manual fixes just mentioned.
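
For illustration, here is what the three manual fixes amount to, written
as a Python sketch rather than the C actually used in the boosters.  The
function name and the idea of passing in raw field strings are made up
here; the values true/false are as described above, while treating an
empty field as NULL is an assumption::

    def fix_fields(variabilitytype, hasphotocentermotion, sourceextendedid):
        """returns the three fields as a tuple after the manual fix-ups.

        All arguments are raw strings as split from a '|'-separated line.
        """
        # variabilitytype can be NULL; an empty field is assumed to mean NULL
        variabilitytype = variabilitytype or None
        # true/false must become 1/0
        flag = {"true": 1, "false": 0}[hasphotocentermotion.strip().lower()]
        # terminate sourceextendedid at the first blank, if there is one
        sourceextendedid = sourceextendedid.split(" ", 1)[0]
        return variabilitytype, flag, sourceextendedid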