Wednesday 4 May 2011

Line terminating characters breaking Darwin Core Archive

Hi, I am Jan K. Legind, the new data administrator at the GBIF Secretariat. One of my responsibilities is to ensure that datasets from publishers get indexed so that the data can be made available through the GBIF Portal. I am a historian by training, and I worked with archival data collection and testing before joining GBIF.

Recently I have been bug hunting in a large dataset (a Darwin Core Archive, DwC-A) that at a casual glance looks fine on the publisher's side. Once it hit the parser, however, several records were rejected because the records themselves contained line terminating characters (hex value 0A). Each such character bisects the record into two lines. The parser sees that the first line has too few columns and drops it, leaving an empty line. The tail end of the record then appears to the parser as the start of a new record, which is not well formed either because it also has too few columns, so it is replaced by a second empty line.

Here is an example of a record that would be replaced by an empty line:

The line terminating characters seem to have been escaped, but without achieving the desired result. A secondary effect of this error is that the record count is miscalculated: the parser merely counts lines and therefore ends up with a larger number than the publisher expected (remember that each line terminating character breaks one record into two lines, both with an incorrect number of columns). Incidentally, this can explain why we sometimes harvest MORE than 100% of the target records.
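To make the effect concrete, here is a minimal sketch in Python of a purely line-based parse. It is not the GBIF parser; the function name `parse_by_line`, the tab delimiter, and the three sample records are all invented for illustration. It only assumes what is described above: lines are split on hex 0A, and any line with the wrong number of columns is dropped.

```python
def parse_by_line(text, expected_columns, delimiter="\t"):
    """Naive line-based parse: split on LF (hex 0A) and keep only
    lines that have the expected number of columns."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty string after the final terminator
    accepted, rejected = [], []
    for line in lines:
        fields = line.split(delimiter)
        if len(fields) == expected_columns:
            accepted.append(fields)
        else:
            rejected.append(line)
    return len(lines), accepted, rejected


# Three invented records of four columns each. The second record has a
# line feed inside its locality field.
data = (
    "1\tPuma concolor\tRocky ridge\t2010-05-01\n"
    "2\tLynx lynx\tForest edge,\nnorth slope\t2010-05-02\n"
    "3\tVulpes vulpes\tRiver bank\t2010-05-03\n"
)

line_count, accepted, rejected = parse_by_line(data, expected_columns=4)
print("lines seen by the parser:", line_count)
print("well-formed records:", len(accepted))
print("fragments dropped:", len(rejected))
```

The publisher sent three records, but the parser sees four lines: two well-formed records plus the two halves of the broken one, each with too few columns. A count based on lines is therefore higher than the publisher's own record count.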

By using the Integrated Publishing Toolkit (IPT), illegal characters can be avoided, and publishers will benefit from a faster transition to having their data live in the GBIF Portal. http://www.gbif.org/informatics/infrastructure/publishing/
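Independently of any particular tool, a publisher could also check field values for line breaking characters before exporting them. The following is only a sketch under stated assumptions: it is not part of the IPT, the helper names `find_line_breaks` and `strip_line_breaks` are hypothetical, and it assumes the source records are already available in memory as lists of string fields.

```python
import re

# Matches one or more carriage returns (hex 0D) or line feeds (hex 0A).
LINE_BREAKS = re.compile(r"[\r\n]+")


def find_line_breaks(rows):
    """Return (row_index, column_index) for every field that contains
    a carriage return or line feed."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if LINE_BREAKS.search(value)
    ]


def strip_line_breaks(rows, replacement=" "):
    """Return a copy of rows with each run of line breaks inside a
    field replaced by a single replacement string."""
    return [[LINE_BREAKS.sub(replacement, value) for value in row] for row in rows]


# Invented sample rows; the second row has a line break in its last field.
rows = [
    ["1", "Puma concolor", "Rocky ridge"],
    ["2", "Lynx lynx", "Forest edge,\r\nnorth slope"],
]

print(find_line_breaks(rows))
cleaned = strip_line_breaks(rows)
print(find_line_breaks(cleaned))
```

Reporting the offending positions before replacing anything lets the publisher decide whether a space is an acceptable substitute or whether the source data should be corrected instead.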

Fortunately, I am working together with the publisher’s team on ironing out the bumps in this resource so that we can get the data published fast and prevent future errors of this sort.

2 comments:

  1. Thanks for describing the problem with line breaking characters! It is the most common problem GBIF encounters when processing DwC-As.

    Just to add that, even when using the IPT, users have to ensure their source data is free of line breaking characters before importing it. In the worst case, the IPT will not include the broken lines in the output archive, so records are lost. The IPT can help users re-check their data for line breaking characters, though. For example, the logs written during import and publishing can give clues. Also, the number of lines calculated by the IPT can be an immediate clue if it is way off from the expected number.
