Developer Blog: 2012

Monday 10 December 2012

"I noticed that the GBIF data portal has fewer records than it used to – what happened?"

If you are a regular user of the GBIF data portal at http://data.gbif.org, or keep an eye on the numbers given at http://www.gbif.org, you may have noticed that the number of indexed records took a dip, from well over 389m records to a little more than 383m. Why would that be?

The main reason for this is that software and processing upgrades have made it easier to spot duplicates and old, no longer published versions of records and datasets. Since the previous version of the data index, some major removal of such records has taken place:

- Several publishers migrated their datasets from other publishing tools to the Integrated Publishing Toolkit (IPT) and Darwin Core Archive, and in the process identified and removed duplicate records in the published source data. As an additional effect, the use of Darwin Core Archives in publishing allows the indexing process to automatically remove records from the index that are no longer contained in the source file: a data transfer is reliably all-or-nothing, so that any record that is not touched during indexing can automatically be deleted. This is less easy in the dialog-driven data transfer protocols (DiGIR, BioCASe and TAPIR), where data transfer might fail at any point in between for a number of reasons, requiring human supervision of deletions.

- The now dataset-aware registry and changed metadata updating workflow make it possible to much easier spot data resources that are no longer published at source, and therefore need to be removed from the data portal as well. Previously, such checks were manual and required regular screening. More often than not, datasets are not really withdrawn, but instead published under a new identifier, combined with other data, or moved to a new location, all with the old version still hanging in until spotted or pointed out. The new registry workflows will significantly speed up the process of detecting and handling such cases.

In summary, the current drop in numbers is the result of data cleaning and removal of duplicates, and reflects continuing efforts by publishers, nodes and the Secretariat to improve the quality of data accessible through the GBIF network. While they happen regularly, the effects of such cleaning activities often get masked by increased record numbers of existing resources and new datasets in the global index. This time, the reduction happens to be more prominent than the additions.

Monday 29 October 2012

The GBIF Registry is now dataset-aware!

This post continues the series of posts that highlight the latest updates on the GBIF Registry.

To recap, in April 2011 Jose Cuadra wrote The evolution of the GBIF Registry, a post that provided a background to the GBIF Network, explained how Network entities are now stored in a database instead of UDDI system, and how it has a new web application and API.

Then a month later, Jose wrote another post entitled 2011 GBIF Registry Refactoring that was more technical in nature and detailed a new set of technologies chosen to improve the underlying codebase.

Now even if you have been keeping an eye on the GBIF Registry, you probably missed the most important improvement that happened in September 2012: the Registry is now dataset-aware!

To be dataset-aware, means that the Registry is now aware of all the datasets that exist behind DiGIR and BioCASE endpoints. Just in case the reader isn't aware, DiGIR and BioCASE are wrapper tools used by organizations in the GBIF Network to publish their datasets. The datasets are exposed via an endpoint URL, and there can potentially be thousands of datasets behind a single endpoint.

Traditionally, the GBIF Registry knew about the endpoint but not about its datasets. It was then the job of GBIF's Harvesting and Indexing Toolkit (HIT) to discover what datasets existed behind the endpoint, harvest all their records, and index those records into the GBIF Data Portal.

Therefore if you ever visited the GBIF Data Portal and viewed the Portal page for the Academy of Natural Sciences, you would find that it has 3 datasets.

Clicking on each one, reveals that they are all exposed via the same DiGIR endpoint (see "Access point URL") - see below:

But, if you visited the GBIF Registry and did the same search for the Academy of Natural Sciences, prior to the Registry being dataset-aware, you would have seen it has a DiGIR endpoint, but not found it has any datasets!

Now that the GBIF Registry is dataset-aware, however, the Registry page for the Academy of Natural Sciences shows that the organization owns 3 datasets, and has a (DiGIR) Technical Installation.

So that's fantastic, now the GBIF Registry knows about 1000s of datasets that only the GBIF Data Portal knew about before. But how was dataset-awareness achieved?

First, the Registry now does the job of dataset discovery that the HIT used to do. A project called the registry-metadata-sync was created to do this.

Second, a special set of scripts was written to migrate all the datasets from the GBIF Data Portal index database, into the Registry database. For the first time, all datasets that existed in the GBIF Data Portal now exist in the GBIF Registry, and can be uniquely identified by their GBIF Registry UUID!

Third, the HIT was branched, creating a revised version of the tool that was able to understand the new dataset-aware Registry. The HIT also had to be modified to allow its operators to still trigger dataset discovery by technical installation. Life just got easier for the HIT though, since it could use each dataset's GBIF Registry UUID to uniquely identify each dataset during indexation.

Indeed, the dataset-aware Registry allocates a UUID to each dataset. This is fundamentally the biggest advantage that the dataset-aware Registry brings. Now that GBIF has succeeded in uniquely identifying each Dataset in its Registry, it is now working to assign each Dataset a Globally Unique Identifier (GUID) in the form of a Digital Object Identifier (DOI). The DOI for a dataset will be resolvable back to the GBIF Registry, and could be referenced when citing a Dataset, thereby enabling better tracking of Dataset usage in scientific publications.

GBIF is really excited about being able to provide publishers a DOI for each of their dataset. Keep an eye on our Registry in the coming months for their grand appearance.

Wednesday 17 October 2012

IPT v2.0.4 released

Today the GBIF Secretariat has announced the release of version 2.0.4 of the Integrated Publishing Toolkit (IPT). For those who can't wait to get their hands on the release, it's available for download on the project website here.

Collaboration on this version was more global than ever before, with volunteers in Latin America, Asia, and Europe contributing translations, and volunteers in Canada and the United States contributing some patches.

Add to that all the issue activity, things have been busy. In total 108 issues were addressed in this version; 38 Defects, 35 Enhancements, 7 Other, 5 Patches, 18 Won't fix, 4 Duplicates, and 1 that was considered as Invalid. These are detailed in the issue tracking system.

So what exactly has changed and why? Here's a quick rundown.

One thing that kept coming up again and again in version 2.0.3, was that users were unwittingly installing the IPT in test mode, thinking that they were running in production. After registering a resource, these users expected to see it show up in the GBIF Registry and ultimately be indexed by GBIF. Frustrated emails were then sent to the GBIF Helpdesk when nothing happened. Sadly the reply from the GBIF Helpdesk was always filled with the same disappointing news:

"Your resource is actually in the Test Registry therefore it will never be indexed by GBIF. Oh, and you will have to reinstall your IPT using production mode next time and do your resource configuration over again!"

So to tackle this problem, the setup pages have been improved to make it crystal clear what it means to choose one mode or the other.

The UI has also been branded when running in test mode to make it even more obvious what mode the IPT is running in.

Now whether or not test mode was chosen accidentally, it can be used to help train administrators how to configure an instance, and to help train users how to publish resources. What was always missing, was a way to transfer configured resources from an IPT in test mode, to one in production.

I'm happy to say that in 2.0.4, a resource can now be easily transferred between 2 IPTs including all its source files and mappings. Users will be happy to know that they never have to waste time reconfiguring the same resource from scratch. How is this done? Well in short, resource transfer is achieved by uploading an archived IPT resource folder during resource creation - see user manual for full instructions.

Moving on..

With so many publishers opting for the convenience of publishing via the IPT, the GBIF helpdesk has been receiving dozens of requests to replace an existing DiGIR, BioCASE, or TAPIR resource in the GBIF Registry with one coming from their IPT. To facilitate resource migration, another new feature was added in 2.0.4 that allows the IPT to update an existing resource in the GBIF Registry during registration. The change is welcomed most of all by the GBIF helpdesk who bore the brunt of carrying out resource migrations in the GBIF Registry. See User Manual for instructions.

Thanks to the Taiwan Biodiversity Information Facility (TaiBIF) the IPT interface is now available in Traditional Chinese. That makes the IPT available in a total of 4 languages now including French, Spanish and of course English.

What else?

Thanks to a patch from Peter Desmet, download metrics for the Archive, EML, and RTF files can now be tracked via Google Analytics. For IPT admins who aren't already tracking analytics, there are simple instructions in the User Manual. Here's a screenshot showing some metrics from http://ipt-rc.gbif.org For your reference, the "Event Label" is the resource short name in the IPT.

Last but not least, it should be highlighted that the IPT's RSS feed is now updated every time a resource is published. The version number is displayed right beside the resource name, so subscribers can stay on top of the latest changes. Here's a screenshot from my RSS reader pulling from http://ipt.gbif.org/rss.do

And that about wraps up the most important changes in this version.

As always, we’d like to give special thanks to the volunteer translators for their time and efforts:

Nicolas Noé (Belgian Biodiversity Platform, Belgium) - French
TaiBIF, Taiwan - Traditional Chinese
Laura Roldan Gomez, Dairo Escobar, and Daniel Amariles, (Colombian Biodiversity Information System (SiB)) - Spanish

Plus another couple of special mentions are owed to Peter Desmet and Laura Russell who provided an exceptional amount of feedback and suggestions.

On behalf of the GBIF development team, I hope you enjoy using latest version.

Friday 13 July 2012

Getting started with DataCube on HBase

This tutorial blog provides a quick introduction to using DataCube, a Java based OLAP cube library with a pluggable storage engine open sourced by Urban Airship. In this tutorial, we make use of the inbuilt HBase storage engine.

In a small database much of this would be trivial using aggregating functions (SUM(), COUNT() etc). As the volume grows, one often precalculates these metrics which brings it's own set of consistency challenges. As one outgrows a database, as GBIF are, we need to look for new mechanisms to manage these metrics. The features of DataCube that make this attractive to us are:

A managable process to modify the cube structure
A higher level API to develop against
Ability to rebuild the cube with a single pass over the source data

For this tutorial we will consider the source data as classical DarwinCore occurrence records, where each record represents the metadata associated with a species observation event, e.g.:

ID, Kingdom, ScientificName, Country, IsoCountryCode, BasisOfRecord, CellId, Year
1, Animalia, Puma concolor, Peru, PE, Observation, 13245, 1967
2, Plantae, Abies alba, Spain, ES, Observation, 3637, 2010
3, Plantae, Abies alba, Spain, ES, Observation, 3638, 2010

Suppose the following metrics are required, each of which is termed a rollup in OLAP:

Number of records per country
Number of records per kingdom
Number of records georeferenced / not georeferenced
Number of records per kingdom per country
Number of records georeferenced / not georeferenced per country
Number of records georeferenced / not georeferenced per kingdom
Number of records georeferenced / not georeferenced per kingdom per country

Given the requirements above, this can be translated into a cube definition with 3 dimensions, and 7 rollups as follows:

/**
 * The cube definition (package access only).
 * Dimensions are Country, Kingdom and Georeferenced with counts available for:
 * 
 * Country (e.g. number of record in DK)
 * Kingdom (e.g. number of animal records)
 * Georeferenced (e.g. number of records with coordinates)
 * Country and kingdom (e.g. number of plant records in the US)
 * Country and georeferenced (e.g. number of records with coordinates in the UK
 * Country and kingdom and georeferenced (e.g. number of bacteria records with coordinates in Spain)
 * 
 * TODO: write public utility exposing a simple API enabling validated read/write access to cube.
 */
class Cube {

  // no id substitution
  static final Dimension COUNTRY = new Dimension("dwc:country", new StringToBytesBucketer(), false, 2);
  // id substitution applies
  static final Dimension KINGDOM = new Dimension("dwc:kingdom", new StringToBytesBucketer(), true, 7);
  // no id substitution
  static final Dimension GEOREFERENCED = new Dimension("gbif:georeferenced", new BooleanBucketer(), false, 1);

  // Singleton instance if accessed through the instance() method
  static final DataCube INSTANCE = newInstance();

  // Not for instantiation
  private Cube() {
  }

  /**
   * Creates the cube.
   */
  private static DataCube newInstance() {
    // The dimensions of the cube
    List> dimensions = ImmutableList.>of(COUNTRY, KINGDOM, GEOREFERENCED);

    // The way the dimensions are "rolled up" for summary counting
    List rollups =
      ImmutableList.of(new Rollup(COUNTRY),
        new Rollup(KINGDOM),
        new Rollup(GEOREFERENCED),
        new Rollup(COUNTRY, KINGDOM),
        new Rollup(COUNTRY, GEOREFERENCED),
        new Rollup(KINGDOM, GEOREFERENCED),
        // more than 2 requires special syntax
        new Rollup(ImmutableSet.of(new DimensionAndBucketType(COUNTRY), new DimensionAndBucketType(KINGDOM),
          new DimensionAndBucketType(GEOREFERENCED))));

    return new DataCube(dimensions, rollups);
  }
}

In this code, we are making use of ID substitution for the kingdom. ID substitution is an inbuilt feature of DataCube whereby an auto-generated ID is used to substitute verbose coordinates (a value for a dimension). This is an important feature to help improve performance as coordinates are used to construct the cube lookup keys, which translate into the key used for the HBase table. The substitution is achieved by using a simple table holding a running counter and a mapping table holding the field-to-id mapping. When inserting data into the cube, the counter is incremented (with custom locking to support concurrency within the cluster), the mapping is stored, and the counter value used as the coordinate. When reading, the mapping table is used to construct the lookup key. With the cube defined, we are ready to populate it. One could simply iterate over the source data and populate the cube with the likes of the following:

DataCubeIo dataCubeIo = setup(Cube.INSTANCE); // omitted for brevity
dataCubeIo.writeSync(new LongOp(1), 
  new WriteBuilder(Cube.INSTANCE)
    .at(Cube.COUNTRY, "Spain") // for example
    .at(Cube.KINGDOM, "Animalia")
    .at(Cube.GEOREFERENCED, true)
);

However, one should consider what to do when you have the following inevitable scenarios:

A new dimension or rollup is to be added to the running cube
Changes to the source data have occurred without the cube being notified (e.g. through a batch load, or missing notifications due to messaging failures)
Some disaster recovery requiring a cube rebuild

To handle this when using HBase as the cube storage engine, we make use of the inbuilt backfill functionality. Backfilling is a multistage process:

A snapshot of the live cube is taken and stored in a snapshot table
An offline cube is calculated from the source data and stored in a backfill table
The snapshot and live cube are compared to determine changes that were accepted in the live cube during the rebuilding process (step 2). These changes are then applied to the backfill table
The backfill is hot swapped to become the live cube

This is all handled within DataCube with the exception of stage 2, where we are required to provide a BackfillCallback, the logic responsible for populating the new cube from the source data. The following example illustrates a BackfillCallback using a simple MapReduce job to scan an HBase table for the source data.

/**
 * The callback used from the backfill process to spawn the job to write the new data in the cube.
 */
public class BackfillCallback implements HBaseBackfillCallback {

  // Property keys passed in on the job conf to the Mapper
  static final String TARGET_TABLE_KEY = "gbif:cubewriter:targetTable";
  static final String TARGET_CF_KEY = "gbif:cubewriter:targetCF";
  // Controls the scanner caching size for the source data scan (100-5000 is reasonable)
  private static final int SCAN_CACHE = 200;
  // The source data table
  private static final String SOURCE_TABLE = "dc_occurrence";

  @Override
  public void backfillInto(Configuration conf, byte[] table, byte[] cf, long snapshotFinishMs) throws IOException {
    conf = HBaseConfiguration.create();
    conf.set(TARGET_TABLE_KEY, Bytes.toString(table));
    conf.set(TARGET_CF_KEY, Bytes.toString(cf));
    Job job = new Job(conf, "CubeWriterMapper");

    job.setJarByClass(CubeWriterMapper.class);
    Scan scan = new Scan();
    scan.setCaching(SCAN_CACHE);
    scan.setCacheBlocks(false);

    // we do not want to get bad counts in the cube!
    job.getConfiguration().set("mapred.map.tasks.speculative.execution", "false");
    job.getConfiguration().set("mapred.reduce.tasks.speculative.execution", "false");
    job.setNumReduceTasks(0);
    TableMapReduceUtil.initTableMapperJob(SOURCE_TABLE, scan, CubeWriterMapper.class, null, null, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    try {
      boolean b = job.waitForCompletion(true);
      if (!b) {
        throw new IOException("Unknown error with job.  Check the logs.");
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}

/**
 * The Mapper used to read the source data and write into the target cube.
 * Counters are written to simplify the spotting of issues, so look to the Job counters on completion.
 */
public class CubeWriterMapper extends TableMapper {

  // TODO: These should come from a common schema utility in the future
  // The source HBase table fields
  private static final byte[] CF = Bytes.toBytes("o");
  private static final byte[] COUNTRY = Bytes.toBytes("icc");
  private static final byte[] KINGDOM = Bytes.toBytes("ik");
  private static final byte[] CELL = Bytes.toBytes("icell");

  // Names for counters used in the Hadoop Job
  private static final String STATS = "Stats";
  private static final String STAT_COUNTRY = "Country present";
  private static final String STAT_KINGDOM = "Kingdom present";
  private static final String STAT_GEOREFENCED = "Georeferenced";
  private static final String STAT_SKIPPED = "Skipped record";
  private static final String KINGDOMS = "Kingdoms";

  // The batch size to use when writing the cube
  private static final int CUBE_WRITE_BATCH_SIZE = 1000;

  static final byte[] EMPTY_BYTE_ARRAY = new byte[0];

  private DataCubeIo dataCubeIo;

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    super.cleanup(context);
    // ensure we're all flushed since batch mode
    dataCubeIo.flush();
    dataCubeIo = null;
  }

  /**
   * Utility to read a named field from the row.
   */
  private Integer getValueAsInt(Result row, byte[] cf, byte[] col) {
    byte[] v = row.getValue(cf, col);
    if (v != null && v.length > 0) {
      return Bytes.toInt(v);
    }
    return null;
  }

  /**
   * Utility to read a named field from the row.
   */
  private String getValueAsString(Result row, byte[] cf, byte[] col) {
    byte[] v = row.getValue(cf, col);
    if (v != null && v.length > 0) {
      return Bytes.toString(v);
    }
    return null;
  }


  @Override
  protected void map(ImmutableBytesWritable key, Result row, Context context) throws IOException, InterruptedException {
    String country = getValueAsString(row, CF, COUNTRY);
    String kingdom = getValueAsString(row, CF, KINGDOM);
    Integer cell = getValueAsInt(row, CF, CELL);

    WriteBuilder b = new WriteBuilder(Cube.INSTANCE);
    if (country != null) {
      b.at(Cube.COUNTRY, country);
      context.getCounter(STATS, STAT_COUNTRY).increment(1);
    }
    if (kingdom != null) {
      b.at(Cube.KINGDOM, kingdom);
      context.getCounter(STATS, STAT_KINGDOM).increment(1);
      context.getCounter(KINGDOMS, kingdom).increment(1);
    }
    if (cell != null) {
      b.at(Cube.GEOREFERENCED, true);
      context.getCounter(STATS, STAT_GEOREFENCED).increment(1);
    }
    if (b.getBuckets() != null && !b.getBuckets().isEmpty()) {
      dataCubeIo.writeSync(new LongOp(1), b);
    } else {
      context.getCounter(STATS, STAT_SKIPPED).increment(1);
    }
  }

  // Sets up the DataCubeIO with IdService etc.
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    HTablePool pool = new HTablePool(conf, Integer.MAX_VALUE);

    IdService idService = new HBaseIdService(conf, Backfill.LOOKUP_TABLE, Backfill.COUNTER_TABLE, Backfill.CF, EMPTY_BYTE_ARRAY);

    byte[] table = Bytes.toBytes(conf.get(BackfillCallback.TARGET_TABLE_KEY));
    byte[] cf = Bytes.toBytes(conf.get(BackfillCallback.TARGET_CF_KEY));


    DbHarness hbaseDbHarness =
      new HBaseDbHarness(pool, EMPTY_BYTE_ARRAY, table, cf, LongOp.DESERIALIZER, idService, CommitType.INCREMENT);

    dataCubeIo = new DataCubeIo(Cube.INSTANCE, hbaseDbHarness, CUBE_WRITE_BATCH_SIZE, Long.MAX_VALUE, SyncLevel.BATCH_SYNC);

  }
}

With the callback written, all that is left to populate the cube is to run the backfill. Note that this process can also be used to bootstrap the live cube for the first time:

// The live cube table 
final byte[] CUBE_TABLE = "dc_cube".getBytes();
// Snapshot of the live table used during backfill
final byte[] SNAPSHOT_TABLE = "dc_snapshot".getBytes();
// Backfill table built from the source
final byte[] BACKFILL_TABLE = "dc_backfill".getBytes();
// Utility table to provide a running count for the identifier service
final byte[] COUNTER_TABLE = "dc_counter".getBytes();
// Utility table to provide a mapping from source values to assigned identifiers
final byte[] LOOKUP_TABLE = "dc_lookup".getBytes();
// All DataCube tables use a single column family
final byte[] CF = "c".getBytes();

HBaseBackfill backfill =
  new HBaseBackfill(
    conf, 
    new BackfillCallback(), // our implementation
    CUBE_TABLE, 
    SNAPSHOT_TABLE, 
    BACKFILL_TABLE, 
    CF, 
    LongOp.LongOpDeserializer.class);
backfill.runWithCheckedExceptions();

While HBase provides the storage for the cube, a backfill could be implemented against any source data, such as from a database over JDBC or from text files stored on a Hadoop filesystem. Finally we want to be able to read our cube:

DataCubeIo dataCubeIo = setup(Cube.INSTANCE); // omitted for brevity
Optional result = 
  cubeIo.get(
    new ReadBuilder(cube)
     .at(Cube.COUNTRY, "DK")
     .at(Cube.KINGDOM, "Animalia"));
// need to check if this coordinate combination hit anything in the cube
if (result.isPresent()) {
  LOG.info("Animal records in Denmark: " + result.get().getLong());
)

All the source code for the above is available in the GBIF labs svn.

Many thanks to Dave Revell at UrbanAirship for his guidance.

Wednesday 11 July 2012

Optimizing Writes in HBase

I've written a few times about our work to improve the scanning performance of our cluster (parts 1, 2, and 3) since our highest priority for HBase is being able to serve requests for downloads of occurrence records (which require a full table scan). But now that the scanning is working nicely we need to start writing new records into our occurrence table as well as cleaning raw data and interpreting it into something more useful for the users of our data portal. That processing is built as Hive queries that read from and write back to the same HBase table. And while it was working fine on small test datasets, it all blew up once I moved the process to the full dataset. Here's what happened and how we fixed it. Note that we're using CDH3u3, with the addition of Hive 0.9.0, which we patched to support HBase 0.90.4.

The problem

Our processing is Hive queries which run as Hadoop MapReduce jobs. When the mappers were running they would eventually fail (repeatedly, ultimately killing the job) with an error that looks like this:

java.io.IOException: org.apache.hadoop.hbase.client.ScannerTimeoutException: 63882ms passed since the last invocation, timeout is currently set to 60000

We did some digging and found that this exception happens when the scanner's next() method hasn't been called within the timeout limit. We simplified our test case by reproducing this same error when doing a simple CopyTable operation (the one that ships with HBase), and again using Hive to do select overwrite table A select * from table B. In both cases mappers are assigned a split to scan based on TableInputFormat (just like our Hive jobs), and as they scan they simply put the record out to the new table. Something is holding up the loop as it tries to put, preventing it from calling next(), and thereby triggering the timeout exception.

The logs

First stop - the logs. The regionservers are littered with lines like the following:

WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region occurrence,\x17\xF1o\x9C,1340981109494.ecb85155563c6614e5448c7d700b909e. has too many store files; delaying flush up to 90000ms

INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 7 on 60020' on region occurrence,\x17\xF1o\x9C,1340981109494.ecb85155563c6614e5448c7d700b909e.: memstore size 128.2m is >= than blocking 128.0m size

Well, blocking updates sure sounds like the kind of thing that would prevent a loop from writing more puts and dutifully calling next(). After a little more digging (and testing with a variety of hbase.client.scanner.caching values, including 1) we concluded that yes, this was the problem, but why was it happening?

Why it blocks

It blocks because the memstore has hit what I'll call the "memstore blocking limit" which is controlled by the setting hbase.hregion.memstore.flush.size * hbase.hregion.memstore.block.multiplier, which by default are 64MB and 2 respectively. Normally the memstore should flush when it reaches the flush.size, but in this case it reaches 128MB because it's not allowed to flush due to too many store files (the first log line). The definition of "too many storefiles" is in turn a setting, namely hbase.hstore.blockingStoreFiles (default 7). A new store file is created every time the memstore flushes, and their number is reduced by compacting them into fewer, bigger storefiles during minor and major compactions. By default, compactions will only start if there are at least hbase.hstore.compactionThreshold (default 3) store files, and won't compact more than hbase.hstore.compaction.max (default 7) in a single compaction. And regardless of what you set flush.size to, the memstore will always flush if all memstores in the regionserver combined are using too much heap. The acceptable levels of heap usage are defined by hbase.regionserver.global.memstore.lowerLimit (default 0.35) and hbase.regionserver.global.memstore.upperLimit (default 0.4). There is a thread dedicated to flushing that wakes up regularly and checks these limits:

if the flush thread wakes up and memstores are greater than the lower limit it will start flushing (starting with current biggest memstore) until it gets below the limit.
if flush thread wakes up and memstores are greater than the upper limit it will block updates and start flushing until it gets under lower limit, when it unblocks updates.

Fortunately we didn't see blocking because of the upper heap limit - only the "memstore blocking limit" described earlier. But at this point we only knew the different dials we could turn.

Stopping the blocking

Our goal is to stop the blocking so that our mapper doesn't timeout, while at the same time not running out of memory on the regionserver. The most obvious problem is that we have too many storefiles, which appears to be a combination of producing too many of them and not compacting them fast enough. Note that we have a 6GB heap dedicated to HBase, but can't afford to take any more away from the co-resident Hadoop mappers and reducers.

We started by upping the memstore flush size - this will produce fewer but bigger storefiles on each flush:

first we tried 128MB with block multiplier still at 2. This still produced too many storefiles and caused the same blocking (maybe a little better than at 64MB)
then tried 256MB with multiplier of 4. The logs and ganglia showed that the flushes were happening well before 256MB (still around 130MB) "due to global heap pressure" - a sign that total memstores were consuming too much heap. This meant we were still generating too many files and got the same blocking problem, but with the "memstore blocking limit" set to 1GB the memstore blocking happened much less often, and later in the process (still killed the mappers though)

We were now producing fewer storefiles, but they were still accumulating too quickly. From ganglia we also saw that the compaction queue and storefile counts were growing unbounded, which meant we'd hit the blocking limit again eventually. Next came trying to compact more files per compaction, hence raised compaction.max to 20, and this made little difference.

So how to reduce the number of storefiles? If we had fewer stores, we'd be creating fewer files and using up less heap for memstore, so next we increased the region size. This meant increasing the setting hbase.hregion.max.filesize from its default of 256MB to 1.5G, and then rebuilding our table with fewer pre-split regions. That resulted in about 75% fewer regions.

It was starting to look good - the number of "Blocking updates" log messages dropped to a handful per run, but it was still enough to affect one or two jobs to the point of them getting killed. We tried upping the memstore.lowerLimit and upperLimit to 0.45 and 0.5 respectively, but again no joy.

Now what?

Things looked kind of grim. After endless poring over ganglia charts, we kept coming back to one unexplained blip that seemed to coincide with the start of the storefile explosion that eventually killed the jobs.

Figure 1: Average memstore flush size over time

At about the halfway point of the jobs the size of memstore flushes would spike and then gradually increase until the job died. Keep in mind that the chart shows averages, and it only took a few of those flushes to wait for storefiles long enough to fill to 1GB and then start the blocking that was our undoing. Back to the logs.

From the start of Figure 1 we can see that things appear to be going smoothly - the memstores are flushing at or just above 256MB, which means they have enough heap and are doing their jobs. From the logs we see the flushes happening fine, but there are regular lines like the following:

INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Under global heap pressure: Region uat_occurrence,\x06\x0E\xAC\x0F,1341574060728.ab7fed6ea92842941f97cb9384ec3d4b. has too many store files, but is 625.1m vs best flushable region's 278.2m. Choosing the bigger.

This isn't quite as bad as the "delaying flush" line, but it shows that we're on the limit of what our heap can handle. Then starting from around 12:20 we see more and more like the following:

WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region uat_occurrence,"\x98=\x1C,1341567129447.a3a6557c609ad7fc38815fdcedca6c26. has too many store files; delaying flush up to 90000ms

and then to top it off:

INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=35, maxlogs=32; forcing flush of 1 regions(s): ab7fed6ea92842941f97cb9384ec3d4b

INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up with memory above low water.

So what we have is memstores initially being forced to flush because of minor heap pressure (adds storefiles faster than we can compact). Then we have memstores delaying flushes because of too many storefiles (memstores start getting bigger - our graph spike). Then the write ahead log (WAL) complains about too many of its logs, which forces a memstore flush (so that the WAL HLog can be safely discarded - this again adds storefiles). And for good measure the flushing thread now wakes up, finds its out of heap, and starts attempting flushes, which just aggravates the problem (adding more storefiles to the pile). The failure doesn't happen immediately but we're past the point of no return - by about 13:00 the memstores are getting to the "memstore blocking limit" and our mappers die.

What's left?

Knowing what's going on in the memstore flush size analysis is comforting, but just reinforces what we already knew - the problem is too many storefiles. So what's left? Raising the max number of storefiles, that's what! Every storefile in a store consumes resources (file handles, xceivers, heap for holding metadata), which is why it's limited to a maximum of a very modest 7 files by default. But for us this level of write load is rare - in normal operations we won't hit anything like this, and having the capacity to write a whole bunch of storefiles over short-ish, infrequent bursts is relatively safe, since we know our nightly major compaction will clean them up again (and thereby free up all those extra resources). Crank that maximum up to 200 and, finally, the process works!

Conclusion

Our problem was that our compactions couldn't keep up with all the storefiles we were creating. We tried turning all the different HBase dials to get the storefile/compaction process to work "like they're supposed to", but in the end the key for us was the hbase.hstore.blockingStoreFiles parameter which we set to 200, which is probably double what we actually needed but gives us buffer for our infrequent, larger write jobs. We additionally settled on larger (and therefore fewer) regions, and a somewhat bigger than default memstore. Here are the relevant pieces of our hbase-site.xml after all our testing:

  <!-- default is 256MB 268435456, this is 1.5GB -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1610612736</value>
  </property>
  
  <!-- default is 2 -->
  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>4</value>
  </property>
  
  <!-- default is 64MB 67108864 -->
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>134217728</value>
  </property>
  
  <!-- default is 7, should be at least 2x compactionThreshold -->
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>200</value>
  </property>

And finally, if our compactions were faster and/or more frequent, we might be able to keep up with our storefile creation. That doesn't look possible without multi-threaded compactions, but naturally those exist in newer versions of HBase (starting with 0.92 - HBASE-1476) so if you're having these problems, an upgrade might be in order. Indeed this is prompting us to consider an upgrade to CDH4.

Many thanks to Lars George, who helped us get through the "Now What?" phase by digging deep into the logs and the source to help us work out what was going on.

Monday 18 June 2012

Launch of the Canadensys explorer

At Canadensys we already adopted and customized the IPT as our data repository. With the data of our network being served by the IPT, we have now built a tool to aggregate and explore these data. For an overview of how we built our network, see this presentation. The post below originally appeared on the Canadensys blog.

We are very pleased to announce the beta version of the Canadensys explorer. The tool allows you to explore, filter, visualize and download all the specimen records published through the Canadensys network.

The explorer currently aggregates nine published collections, comprising over half a million specimen records, with many more to come in the near future. All individual datasets are available on the Canadensys repository and via the Global Biodiversity Information Facility (GBIF) as well. The main functionalities of the explorer are listed below, but we encourage you to discover them for yourself instead. We hope it is intuitive. For the best user experience, please use an up-to-date version of your browser.

Happy exploring: http://data.canadensys.net/explorer

Functionalities

The explorer is a one page tool, limiting unnecessary navigation.
The default view shows all the data, allowing users to get an overview and explore immediately.
Data can be queried by using and combining filters.
Filters use smart suggestions, informing the user how often their search term occurs even before they search.
The exact number of results is displayed, including the number of georeferenced records.
The map view displays all georeferenced records for the current query, has different base layer options and can be zoomed in to any level.
Points on the map can be clicked for more information.
The table view displays a sortable preview of the records in the current query.
Records in the table can be clicked for more information in the same way as on the map.
The number of columns in the table responds to the screen width and can be controlled by the user in the display panel.
Data for the current query can be downloaded as a Darwin Core archive. There is no limit on the number of records that can be downloaded.
Users can download the data by providing their email address. Once the download package is generated, the user receives an email with a link to the data, information regarding the usage norms and a suggested citation.
The interface and emails are available in French and English.

As this is a beta version, you may encounter issues. Please report them by clicking the feedback button on the right, which will open a report form.

Technical details

The Canadensys explorer was developed using the following open source tools:

Spring, a Java framework used for the backend.
ThreeTen, a date/time Java library, used for cleaning the eventDate field.
GBIF parsers, a Java library used for cleaning the country field.
ECAT common, a Java library used for parsing the scientificName field.
Darwin Core archive reader, a Java library used to import the data in the backend and to create the user generated downloads.
PostgreSQL, the database.
PostGIS, a geospatial extension to the database.
Mapnik, a tool to generate the maps and provide the map interactivity.
Windshaft, a tile server.
Freemarker, a Java template engine, used to structure to frontend.
Backbone.js, a MVC structure for web applications, used for handling the filters.

Tuesday 12 June 2012

Taxonomic Trees in PostgreSQL

Taken aside pro parte synonyms taxonomic data follows a classic hierarchical tree structure. In relational databases such a tree is commonly represented by 3 models known as the adjacency list, the materialized path and the nested set model. There are many comparisons out there listing pros and cons, for example ON stackoverflow, the slides by Lorenzo Alberton or Bill Karwin or a postgres specific performance comparison between the adjacency model and a nested set.

Checklist Bank

At GBIF we use PostgreSQL to store taxonomic trees, which we refer to as checklists, in Checklist Bank. At the core there is a single table name_usage which contains records each representing a single taxon in the tree [note: in this post I am using the term taxon broadly covering both accepted taxa and synonyms]. It primarily uses the adjacency model with a single foreign key parent_fk which is null for the root elements of the tree. The simplified diagram of the main tables looks like this (actually there are some 20 extra fix width columns left out from name_usage here for simplicity):

For certain searches though an additional index is required. In particular listing all descendants of a taxon, i.e. all members of a subtree, is a common operation that would otherwise involve a recursive function or Common Table Expression. So far we have been using nested sets, but experiencing some badly performing queries lately I decided to do a quick evaluation of different options for Postgres:

the ltree extension which implements a materialized path and provides many powerful operations.
the intarray extension to manually manage a materialized path as an array of non null integers - the primary keys of the parent taxa.
a simple varchar field holding the same materialized path
the current nested set index using a lft and rgt integer column which is unique within a single taxonomy and therefore has to be combined with a checklist_key.

Test Environment

For the tests I am using the current Checklist Bank which contains 14,907,828 records in total spread across 107 checklists of varying size between 7 and 4.25 million records. I am querying the GBIF Backbone Taxonomy which is the largest checklist containing 4,251,163 records with 9 root taxa and a maximum depth of 11 levels. For the queries I have picked specific 2 taxa with different position in the taxonomic tree:

44 Vertebrata the class covering all vertebrates with approximately 84.100 species in this nub.
2684876 Abies the fir genus covering 167 species in this nub and exactly 609 descendants.

For most of our real queries we provide paging for larger results. I will therefore use a limit of 100 and offset of 0 for all queries below.

All tests are executed ON a 2Ghz i7 MacBook Pro with 8GB of memory running Postgres 9.1.1. All queries have been executed 3 times before EXPLAIN ANALYZE was used to get a rough idea ON how differently the various indices behave.

Creating the indices

In order to use the extensions these need to be compiled at installation time and enabled in the database. In Postgres 9.1 you can do the later by executing in psql:

  CREATE EXTENSION ltree;
  CREATE EXTENSION intarray;

The individual data types require different indices. We set up the following indices:

 # general indices
 CREATE INDEX nu_chkl_idx ON name_usage (checklist_fk);
 CREATE INDEX nu_parent_idx ON name_usage (parent_fk);
 # tree specifics
 CREATE INDEX nu_path_idx ON name_usage USING GIST (path);
 CREATE INDEX nu_mpath_idx ON name_usage USING GIN (mpath gin__int_ops);
 CREATE INDEX nu_mspath_idx ON name_usage (mspath);
 CREATE INDEX nu_ckl_lft_rgt_idx ON name_usage (checklist_fk, lft, rgt);

The ltree GIST and intarray GIN indices are rather expensive to create/maintain, but they are selected for best read performance.

Populating the indices

The data being imported into Checklist Bank comes with a parent-child relation, so parent_fk is populated already. For all other indices we have to populate the indices first.

ltree, intarray, string path

For all materialized paths I simply run the following SQL until no new updates happened:

# update root paths once
UPDATE name_usage u SET 
 path = text2ltree(cast(u.id as text)), 
 mspath=u.id, 
 mpath = array[u.id] 
WHERE u.parent_fk is null;
# update until no more records are updated
UPDATE name_usage u SET 
 path = p.path || text2ltree(cast(u.id as text)), 
 mspath=p.mspath || '.' || u.id, 
 mpath = p.mpath || array[u.id] 
FROM name_usage p 
WHERE u.parent_fk=p.id AND p.mspath is not null AND u.mspath is null;

nested sets

I've created 2 functions and one sequence to populate the lft/rgt indices for every checklist:

-- NESTED SET UTILITY FUNCTIONS
CREATE SEQUENCE name_usage_nestidx START 1;

CREATE FUNCTION update_nested_set_usage(integer) RETURNS BOOLEAN as $$
  BEGIN
    UPDATE name_usage set lft = nextval('name_usage_nestidx')-1 WHERE id=$1;
    PERFORM update_nested_set_usage(id) FROM usage WHERE parent_fk=$1 ORDER BY rank_fk, name_fk;;
    UPDATE name_usage set rgt = nextval('name_usage_nestidx')-1 WHERE id=$1;
    RETURN true;
  END
$$ LANGUAGE plpgsql;

CREATE FUNCTION build_nested_set_indices(integer) RETURNS BOOLEAN AS $$
BEGIN
  PERFORM setval('name_usage_nestidx', 1);
  PERFORM update_nested_set_usage(id) FROM usage WHERE parent_fk is null and checklist_fk=$1 ORDER BY rank_fk, name_fk;;
  RETURN true;
  END
$$ LANGUAGE plpgsql;

Query for Descendants

The queries return the first page with 100 records descendants

adjacency

A recursive postgres CTE query does the trick:

 WITH RECURSIVE d AS (
   SELECT id
    FROM name_usage
    WHERE id = 44
  UNION ALL
   SELECT c.id
    FROM d JOIN name_usage c ON c.parent_fk = d.id
 )
 SELECT * FROM d
  ORDER BY id
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=74360.25..74360.50 rows=100 width=4) (actual time=7.306..7.337 rows=100 loops=1)
   CTE d
     ->  Recursive Union  (cost=0.00..72581.02 rows=30561 width=4) (actual time=0.041..6.017 rows=609 loops=1)
           ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=4) (actual time=0.038..0.040 rows=1 loops=1)
                 Index Cond: (id = 2684876)
           ->  Nested Loop  (cost=0.00..7195.91 rows=3056 width=4) (actual time=0.240..1.393 rows=152 loops=4)
                 ->  WorkTable Scan on d  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.041 rows=152 loops=4)
                 ->  Index Scan using usage_parent_idx on name_usage c  (cost=0.00..715.75 rows=306 width=8) (actual time=0.006..0.008 rows=1 loops=609)
                       Index Cond: (parent_fk = d.id)
   ->  Sort  (cost=1779.24..1855.64 rows=30561 width=4) (actual time=7.304..7.316 rows=100 loops=1)
         Sort Key: d.id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  CTE Scan on d  (cost=0.00..611.22 rows=30561 width=4) (actual time=0.046..6.678 rows=609 loops=1)
 Total runtime: 7.559 ms

Query Plan: Vertebrata

 Limit  (cost=74360.25..74360.50 rows=100 width=4) (actual time=2065.053..2065.080 rows=100 loops=1)
   CTE d
     ->  Recursive Union  (cost=0.00..72581.02 rows=30561 width=4) (actual time=0.105..1797.898 rows=264325 loops=1)
           ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=4) (actual time=0.101..0.103 rows=1 loops=1)
                 Index Cond: (id = 44)
           ->  Nested Loop  (cost=0.00..7195.91 rows=3056 width=4) (actual time=0.915..182.695 rows=29369 loops=9)
                 ->  WorkTable Scan on d  (cost=0.00..0.20 rows=10 width=4) (actual time=0.007..8.056 rows=29369 loops=9)
                 ->  Index Scan using usage_parent_idx on name_usage c  (cost=0.00..715.75 rows=306 width=8) (actual time=0.004..0.005 rows=1 loops=264325)
                       Index Cond: (parent_fk = d.id)
   ->  Sort  (cost=1779.24..1855.64 rows=30561 width=4) (actual time=2065.050..2065.062 rows=100 loops=1)
         Sort Key: d.id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  CTE Scan on d  (cost=0.00..611.22 rows=30561 width=4) (actual time=0.109..1986.606 rows=264325 loops=1)
 Total runtime: 2080.248 ms

As expected the recursive query does a pretty good job if the subtree is small. But for the large vertebrate subtree its rather slow because we first get all decendants and then apply a limit.

ltree

With ltree you have various options to query for a subtree. You can use a path lquery with ~, a full text ltxtquery via @ or the ltree <@ isDescendant operator.

Unanchored lqueries for any path containing the a node turned out to be very, very slow. This is expected somehow because it can't use the index properly. Surprisingly even the anchored queries and the full text query were far too slow from being useful at all. The fastest option definitely was the native descendants operator:

 WITH p AS (
   SELECT path FROM name_usage WHERE id=44
 )
 SELECT u.id 
  FROM name_usage u, p 
  WHERE u.path <@ p.path
  ORDER BY u.id
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=54526.09..54526.34 rows=100 width=4) (actual time=2.926..2.963 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.030..0.034 rows=1 loops=1)
           Index Cond: (id = 2684876)
   ->  Sort  (cost=54515.41..54551.25 rows=14336 width=4) (actual time=2.923..2.942 rows=100 loops=1)
         Sort Key: u.id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Nested Loop  (cost=2094.99..53967.50 rows=14336 width=4) (actual time=1.094..2.230 rows=609 loops=1)
               ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.035..0.040 rows=1 loops=1)
               ->  Bitmap Heap Scan on name_usage u  (cost=2094.99..53788.28 rows=14336 width=124) (actual time=1.051..1.952 rows=609 loops=1)
                     Recheck Cond: (path <@ p.path)
                     ->  Bitmap Index Scan on nu_path_idx  (cost=0.00..2091.40 rows=14336 width=0) (actual time=1.023..1.023 rows=609 loops=1)
                           Index Cond: (path <@ p.path)
 Total runtime: 3.068 ms

Query Plan: Vertebrata

 Limit  (cost=54526.09..54526.34 rows=100 width=4) (actual time=512.420..512.445 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.091..0.094 rows=1 loops=1)
           Index Cond: (id = 44)
   ->  Sort  (cost=54515.41..54551.25 rows=14336 width=4) (actual time=512.417..512.432 rows=100 loops=1)
         Sort Key: u.id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Nested Loop  (cost=2094.99..53967.50 rows=14336 width=4) (actual time=115.119..428.632 rows=264325 loops=1)
               ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.096..0.100 rows=1 loops=1)
               ->  Bitmap Heap Scan on name_usage u  (cost=2094.99..53788.28 rows=14336 width=124) (actual time=115.016..372.070 rows=264325 loops=1)
                     Recheck Cond: (path <@ p.path)
                     ->  Bitmap Index Scan on nu_path_idx  (cost=0.00..2091.40 rows=14336 width=0) (actual time=109.791..109.791 rows=264325 loops=1)
                           Index Cond: (path <@ p.path)
 Total runtime: 512.723 ms

intarray

As a node in the tree appears only once, we can query for all usages that have the node id in their array but are not that very record.

 SELECT u.id 
  FROM name_usage u
  WHERE u.mpath @@ '44' and u.id != 44
  ORDER BY u.id
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=52491.99..52492.24 rows=100 width=4) (actual time=1.925..1.966 rows=100 loops=1)
   ->  Sort  (cost=52491.99..52527.83 rows=14336 width=4) (actual time=1.923..1.942 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Bitmap Heap Scan on name_usage u  (cost=179.10..51944.08 rows=14336 width=4) (actual time=0.682..1.219 rows=608 loops=1)
               Recheck Cond: (mpath @@ '2684876'::query_int)
               Filter: (id <> 2684876)
               ->  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..175.52 rows=14336 width=0) (actual time=0.646..0.646 rows=609 loops=1)
                     Index Cond: (mpath @@ '2684876'::query_int)
 Total runtime: 2.052 ms

Query Plan: Vertebrata

 Limit  (cost=52491.99..52492.24 rows=100 width=4) (actual time=377.851..377.877 rows=100 loops=1)
   ->  Sort  (cost=52491.99..52527.83 rows=14336 width=4) (actual time=377.849..377.861 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Bitmap Heap Scan on name_usage u  (cost=179.10..51944.08 rows=14336 width=4) (actual time=115.634..298.193 rows=264324 loops=1)
               Recheck Cond: (mpath @@ '44'::query_int)
               Filter: (id <> 44)
               ->  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..175.52 rows=14336 width=0) (actual time=110.776..110.776 rows=264325 loops=1)
                     Index Cond: (mpath @@ '44'::query_int)
 Total runtime: 378.131 ms

string path

Just for completeness and to compare relative performances I am trying an anchored pattern match against a varchar based materialzed path. In order to let postgres use the mspath index we must also order by that value, no id:

 WITH p AS (
  SELECT mspath FROM name_usage where id=44
 )
 SELECT u.id 
  FROM name_usage u, p
  WHERE u.mspath LIKE p.mspath || '.%'
  ORDER BY u.mspath
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=10.68..80535.72 rows=100 width=36)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=32)
           Index Cond: (id = 2684876)
   ->  Nested Loop  (cost=0.00..60022560.36 rows=74539 width=36)
         Join Filter: (u.mspath ~~ (p.mspath || '.%'::text))
         ->  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..59649864.65 rows=14907828 width=36)
         ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32)

I don't know what is going on here, but I can't make postgres use the mspath btree index properly. The query therefore is very, very slow. If I use a hardcoded path instead of p.mspath the index is used. I'll show results here for the hardcoded path now - anyone having an idea on how to avoid the table scan in the above sql please let me know.

 SELECT u.id 
  FROM name_usage u
  WHERE u.mspath LIKE '6.101.194.640.3925.2684876.%'
  ORDER BY u.mspath
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=0.00..400.95 rows=100 width=36) (actual time=0.105..0.549 rows=100 loops=1)
   ->  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..298863.67 rows=74539 width=36) (actual time=0.102..0.516 rows=100 loops=1)
         Index Cond: ((mspath >= '6.101.194.640.3925.2684876.'::text) AND (mspath < '6.101.194.640.3925.2684876/'::text))
         Filter: (mspath ~~ '6.101.194.640.3925.2684876.%'::text)
 Total runtime: 0.637 ms

Query Plan: Vertebrata

 Limit  (cost=0.00..400.95 rows=100 width=36) (actual time=0.076..0.472 rows=100 loops=1)
   ->  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..298863.67 rows=74539 width=36) (actual time=0.074..0.435 rows=100 loops=1)
         Index Cond: ((mspath >= '1.44.'::text) AND (mspath < '1.44/'::text))
         Filter: (mspath ~~ '1.44.%'::text)
 Total runtime: 0.561 ms

The hardcoded results are extremely fast. Much faster than any of the specialised indices above. Retrying ltree for example with a hardcoded path doesnt speed up the ltree query. The vast speed gain here is because postgres can use the b-tree index for a sorted output and therefore really benefit from the limit. GiST and GIN indices on the other hand are not suitable for sorting.

nested set

A very different query model from the above, which all use some sort of a materialzed path. We need to also add the checklist_fk condition because the nested set indices are only unique within each taxonomy, not across all.

 SELECT u.id 
  FROM name_usage u JOIN name_usage p ON u.checklist_fk=p.checklist_fk  
  WHERE  p.id=44 and u.lft BETWEEN p.lft and p.rgt
  ORDER BY u.lft
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=81687.08..81687.33 rows=100 width=8) (actual time=2.030..2.076 rows=100 loops=1)
   ->  Sort  (cost=81687.08..82326.50 rows=255769 width=8) (actual time=2.029..2.052 rows=100 loops=1)
         Sort Key: u.lft
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Nested Loop  (cost=0.00..71911.77 rows=255769 width=8) (actual time=0.062..1.482 rows=609 loops=1)
               ->  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.028..0.029 rows=1 loops=1)
                     Index Cond: (id = 2684876)
               ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..71534.17 rows=20967 width=12) (actual time=0.029..1.182 rows=609 loops=1)
                     Index Cond: ((checklist_fk = p.checklist_fk) AND (lft >= p.lft) AND (lft <= p.rgt))
 Total runtime: 2.206 ms

Query Plan: Vertebrata

 Limit  (cost=81687.08..81687.33 rows=100 width=8) (actual time=475.811..475.829 rows=100 loops=1)
   ->  Sort  (cost=81687.08..82326.50 rows=255769 width=8) (actual time=475.809..475.817 rows=100 loops=1)
         Sort Key: u.lft
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Nested Loop  (cost=0.00..71911.77 rows=255769 width=8) (actual time=0.158..381.822 rows=264325 loops=1)
               ->  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.075..0.077 rows=1 loops=1)
                     Index Cond: (id = 44)
               ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..71534.17 rows=20967 width=12) (actual time=0.077..323.519 rows=264325 loops=1)
                     Index Cond: ((checklist_fk = p.checklist_fk) AND (lft >= p.lft) AND (lft <= p.rgt))
 Total runtime: 475.951 ms

Again the tricky part was making postgres use the lft/rgt index properly when using the order by. Removing the order by turns this query into a super fast one for even the vertebrate query (only o.4ms for any of the two). If we also use hardcoded values instead of a joined p table we can get even better performances than with the string path model. For example the vertebrate query would look like this:

SELECT u.id 
 FROM name_usage u 
 WHERE  u.checklist_fk=1 and u.lft BETWEEN 517646 and 1046295
 ORDER BY u.lft
 LIMIT 100 OFFSET 0;

Query Plan: Vertebrata

 Limit  (cost=0.00..337.88 rows=100 width=8) (actual time=0.080..0.395 rows=100 loops=1)
   ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..1566342.54 rows=463573 width=8) (actual time=0.078..0.356 rows=100 loops=1)
         Index Cond: ((checklist_fk = 1) AND (lft >= 517646) AND (lft <= 1046295))
 Total runtime: 0.480 ms

Query for Ancestors

The ancestor query does not use any limit as its only a few records and we always want all. All materialized paths already have the ancestors - they are the path.

adjacency

A recursive CTE query:

 WITH RECURSIVE a AS (
  SELECT id, parent_fk, rank_fk
   FROM name_usage
   WHERE id = 44
 UNION ALL
  SELECT p.id, p.parent_fk, p.rank_fk
   FROM a JOIN name_usage p ON a.parent_fk = p.id
 )
 SELECT * FROM a
 WHERE id!=44
 ORDER BY rank_fk;

Query Plan: Abies

 Sort  (cost=1089.52..1089.77 rows=100 width=12) (actual time=0.155..0.156 rows=5 loops=1)
   Sort Key: a.rank_fk
   Sort Method: quicksort  Memory: 25kB
   CTE a
     ->  Recursive Union  (cost=0.00..1083.93 rows=101 width=12) (actual time=0.030..0.117 rows=6 loops=1)
           ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=12) (actual time=0.027..0.029 rows=1 loops=1)
                 Index Cond: (id = 2684876)
           ->  Nested Loop  (cost=0.00..107.12 rows=10 width=12) (actual time=0.011..0.012 rows=1 loops=6)
                 ->  WorkTable Scan on a  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.001 rows=1 loops=6)
                 ->  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.007..0.008 rows=1 loops=6)
                       Index Cond: (id = a.parent_fk)
   ->  CTE Scan on a  (cost=0.00..2.27 rows=100 width=12) (actual time=0.068..0.141 rows=5 loops=1)
         Filter: (id <> 2684876)
 Total runtime: 0.265 ms

Query Plan: Vertebrata

 Sort  (cost=1089.52..1089.77 rows=100 width=12) (actual time=0.104..0.104 rows=1 loops=1)
   Sort Key: a.rank_fk
   Sort Method: quicksort  Memory: 25kB
   CTE a
     ->  Recursive Union  (cost=0.00..1083.93 rows=101 width=12) (actual time=0.040..0.077 rows=2 loops=1)
           ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=12) (actual time=0.037..0.040 rows=1 loops=1)
                 Index Cond: (id = 44)
           ->  Nested Loop  (cost=0.00..107.12 rows=10 width=12) (actual time=0.013..0.015 rows=0 loops=2)
                 ->  WorkTable Scan on a  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.001 rows=1 loops=2)
                 ->  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.008..0.009 rows=0 loops=2)
                       Index Cond: (id = a.parent_fk)
   ->  CTE Scan on a  (cost=0.00..2.27 rows=100 width=12) (actual time=0.079..0.092 rows=1 loops=1)
         Filter: (id <> 44)
 Total runtime: 0.232 ms

The trees are not very deep, so even the genus query only has to do 5 recursive calls to return 5 parents.

ltree

  select path FROM name_usage WHERE id=44;

The path contains all ancestor ids. Parsing it into separate ids within sql is not obvious at first glance though.

intarray

 SELECT mpath FROM name_usage WHERE id=44;

The path contains all ancestor ids. Iterating over each id entry is simple.

nested set

To find out all ancestors of a given node, we just select all nodes that contain its LFT boundary (which in a properly built hierarchy implies containing the RGT boundary too):

SELECT p.id, p.rank_fk, p.lft
 FROM  name_usage u JOIN name_usage p ON u.lft BETWEEN p.lft and p.rgt AND u.checklist_fk=p.checklist_fk 
 WHERE u.id=44 ORDER BY 3;

Query Plan: Abies

 Sort  (cost=100429.25..101068.67 rows=255769 width=12) (actual time=607.957..607.958 rows=6 loops=1)
   Sort Key: p.lft
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=0.00..73083.96 rows=255769 width=12) (actual time=415.862..607.938 rows=6 loops=1)
         ->  Index Scan using usage_pkey on name_usage u  (cost=0.00..10.68 rows=1 width=8) (actual time=0.046..0.047 rows=1 loops=1)
               Index Cond: (id = 2684876)
         ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage p  (cost=0.00..72706.36 rows=20967 width=20) (actual time=415.809..607.880 rows=6 loops=1)
               Index Cond: ((checklist_fk = u.checklist_fk) AND (u.lft >= lft) AND (u.lft <= rgt))
 Total runtime: 608.034 ms

Query Plan: Vertebrata

 Sort  (cost=100429.25..101068.67 rows=255769 width=12) (actual time=38.428..38.428 rows=2 loops=1)
   Sort Key: p.lft
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=0.00..73083.96 rows=255769 width=12) (actual time=0.111..38.410 rows=2 loops=1)
         ->  Index Scan using usage_pkey on name_usage u  (cost=0.00..10.68 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1)
               Index Cond: (id = 44)
         ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage p  (cost=0.00..72706.36 rows=20967 width=20) (actual time=0.084..38.378 rows=2 loops=1)
               Index Cond: ((checklist_fk = u.checklist_fk) AND (u.lft >= lft) AND (u.lft <= rgt))
 Total runtime: 38.504 ms

Not very efficient.

Query for Children

adjacency

A perfect fit:

 SELECT u.id 
  FROM name_usage  u 
  WHERE parent_fk=44 
  ORDER BY u.id
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=640.97..641.22 rows=100 width=4) (actual time=0.566..0.623 rows=100 loops=1)
   ->  Sort  (cost=640.97..641.67 rows=279 width=4) (actual time=0.564..0.588 rows=100 loops=1)
         Sort Key: id
         Sort Method: quicksort  Memory: 32kB
         ->  Index Scan using usage_parent_idx on name_usage u  (cost=0.00..630.31 rows=279 width=4) (actual time=0.046..0.312 rows=167 loops=1)
               Index Cond: (parent_fk = 2684876)
 Total runtime: 0.696 ms

Query Plan: Vertebrata

 Limit  (cost=16863.12..16863.37 rows=100 width=4) (actual time=8.783..8.811 rows=100 loops=1)
   ->  Sort  (cost=16863.12..16881.75 rows=7454 width=4) (actual time=8.780..8.793 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Index Scan using usage_parent_idx on name_usage u  (cost=0.00..16578.23 rows=7454 width=4) (actual time=0.060..5.169 rows=6083 loops=1)
               Index Cond: (parent_fk = 44)
 Total runtime: 8.867 ms

ltree

Search for all descendants that have one more node:

WITH p AS (
 SELECT path FROM name_usage WHERE id=2684876
)
SELECT u.id 
 FROM  name_usage u, p
 WHERE u.path <@ p.path and nlevel(u.path)=nlevel(p.path)+1
 ORDER BY u.path
 LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=56078.37..56078.56 rows=75 width=124) (actual time=4.118..4.161 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.029..0.031 rows=1 loops=1)
           Index Cond: (id = 2684876)
   ->  Sort  (cost=56067.69..56067.88 rows=75 width=124) (actual time=4.116..4.134 rows=100 loops=1)
         Sort Key: u.path
         Sort Method: quicksort  Memory: 48kB
         ->  Nested Loop  (cost=2099.42..56065.36 rows=75 width=124) (actual time=1.230..3.241 rows=167 loops=1)
               Join Filter: (nlevel(u.path) = (nlevel(p.path) + 1))
               ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.037..0.040 rows=1 loops=1)
               ->  Bitmap Heap Scan on name_usage u  (cost=2099.42..55729.91 rows=14908 width=124) (actual time=1.175..2.274 rows=609 loops=1)
                     Recheck Cond: (path <@ p.path)
                     ->  Bitmap Index Scan on nu_path_idx  (cost=0.00..2095.69 rows=14908 width=0) (actual time=1.146..1.146 rows=609 loops=1)
                           Index Cond: (path <@ p.path)
 Total runtime: 4.321 ms

Query Plan: Vertebrata

 Limit  (cost=56078.37..56078.56 rows=75 width=124) (actual time=600.793..600.816 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.048..0.051 rows=1 loops=1)
           Index Cond: (id = 44)
   ->  Sort  (cost=56067.69..56067.88 rows=75 width=124) (actual time=600.792..600.805 rows=100 loops=1)
         Sort Key: u.path
         Sort Method: top-N heapsort  Memory: 32kB
         ->  Nested Loop  (cost=2099.42..56065.36 rows=75 width=124) (actual time=112.999..595.738 rows=6083 loops=1)
               Join Filter: (nlevel(u.path) = (nlevel(p.path) + 1))
               ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.053..0.057 rows=1 loops=1)
               ->  Bitmap Heap Scan on name_usage u  (cost=2099.42..55729.91 rows=14908 width=124) (actual time=112.924..413.889 rows=264325 loops=1)
                     Recheck Cond: (path <@ p.path)
                     ->  Bitmap Index Scan on nu_path_idx  (cost=0.00..2095.69 rows=14908 width=0) (actual time=107.524..107.524 rows=264325 loops=1)
                           Index Cond: (path <@ p.path)
 Total runtime: 600.980 ms

intarray

Search for an array that contains the parent id, but has an array size one greater than its parent:

WITH p AS (
 SELECT mpath FROM name_usage WHERE id=44
)
SELECT u.id 
 FROM  name_usage u, p
 WHERE  u.mpath @@ '44' and #u.mpath= #p.mpath+1
 ORDER BY u.mpath
 LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=53976.90..53977.09 rows=75 width=56) (actual time=2.026..2.056 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=52) (actual time=0.018..0.019 rows=1 loops=1)
           Index Cond: (id = 2684876)
   ->  Sort  (cost=53966.22..53966.41 rows=75 width=56) (actual time=2.024..2.042 rows=100 loops=1)
         Sort Key: u.mpath
         Sort Method: quicksort  Memory: 48kB
         ->  Hash Join  (cost=183.57..53963.89 rows=75 width=56) (actual time=0.341..1.352 rows=167 loops=1)
               Hash Cond: ((# u.mpath) = (# (p.mpath + 1)))
               ->  Bitmap Heap Scan on name_usage u  (cost=183.54..53851.29 rows=14908 width=56) (actual time=0.292..0.874 rows=609 loops=1)
                     Recheck Cond: (mpath @@ '2684876'::query_int)
                     ->  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..179.81 rows=14908 width=0) (actual time=0.279..0.279 rows=609 loops=1)
                           Index Cond: (mpath @@ '2684876'::query_int)
               ->  Hash  (cost=0.02..0.02 rows=1 width=32) (actual time=0.038..0.038 rows=1 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 1kB
                     ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.022..0.024 rows=1 loops=1)
 Total runtime: 2.144 ms

Query Plan: Vertebrata

 Limit  (cost=53976.90..53977.09 rows=75 width=56) (actual time=501.304..501.327 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=52) (actual time=0.075..0.078 rows=1 loops=1)
           Index Cond: (id = 44)
   ->  Sort  (cost=53966.22..53966.41 rows=75 width=56) (actual time=501.303..501.316 rows=100 loops=1)
         Sort Key: u.mpath
         Sort Method: top-N heapsort  Memory: 32kB
         ->  Hash Join  (cost=183.57..53963.89 rows=75 width=56) (actual time=116.159..495.063 rows=6083 loops=1)
               Hash Cond: ((# u.mpath) = (# (p.mpath + 1)))
               ->  Bitmap Heap Scan on name_usage u  (cost=183.54..53851.29 rows=14908 width=56) (actual time=116.030..391.099 rows=264325 loops=1)
                     Recheck Cond: (mpath @@ '44'::query_int)
                     ->  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..179.81 rows=14908 width=0) (actual time=111.387..111.387 rows=264325 loops=1)
                           Index Cond: (mpath @@ '44'::query_int)
               ->  Hash  (cost=0.02..0.02 rows=1 width=32) (actual time=0.101..0.101 rows=1 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 1kB
                     ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.082..0.086 rows=1 loops=1)
 Total runtime: 501.517 ms

nested set

A difficult task. Will leave this as a challenge for some later time.

Summary

The best performing queries do not come from one model. The adjacency model is unbeatable for listing children and with recursive queries in not too deep taxonomic trees it performs also very well to get all ancestors.

For listing descendants the winner are queries that can use an ordered btree index. I could only get this to work with hardcoded paths, so it's not useful like this for dynamic, single statement queries. If you can issue 2 separate queries in code though nested sets or the string materlialized path are ideal candidates for retrieving descendants. Because of a required ordering for doing paging it can use the btree efficiently, while all others produce the full list of decendants first and then order. Intarray is the quickest in that field, but nested sets and ltree perform rather similar. It remains to see if an additional btree index could improve the ordering of ltree and intarray drastically.

Monday 10 December 2012

"I noticed that the GBIF data portal has fewer records than it used to – what happened?"

Monday 29 October 2012

The GBIF Registry is now dataset-aware!

Wednesday 17 October 2012

IPT v2.0.4 released

Friday 13 July 2012

Getting started with DataCube on HBase

Wednesday 11 July 2012

Optimizing Writes in HBase

The problem

The logs

Why it blocks

Stopping the blocking

Now what?

What's left?

Conclusion

Monday 18 June 2012

Launch of the Canadensys explorer

Functionalities

Technical details

Tuesday 12 June 2012

Taxonomic Trees in PostgreSQL

Checklist Bank

Test Environment

Creating the indices

Populating the indices

ltree, intarray, string path

nested sets

Query for Descendants

adjacency

Query Plan: Abies

Query Plan: Vertebrata

ltree

Query Plan: Abies

Query Plan: Vertebrata

intarray

Query Plan: Abies

Query Plan: Vertebrata

string path

Query Plan: Abies

Query Plan: Abies

Query Plan: Vertebrata

nested set

Query Plan: Abies

Query Plan: Vertebrata

Query Plan: Vertebrata

Query for Ancestors

adjacency

Query Plan: Abies

Query Plan: Vertebrata

ltree

intarray

nested set

Query Plan: Abies

Query Plan: Vertebrata

Query for Children

adjacency

Query Plan: Abies

Query Plan: Vertebrata

ltree

Query Plan: Abies

Query Plan: Vertebrata

intarray

Query Plan: Abies

Query Plan: Vertebrata

nested set

Summary

About this blog

Twitter

Contributors

Our favourite blogs

Blog Archive

Subscribe To

Tags

Followers