Background

Scientific literature is a central repository of scientific knowledge: every important scientific discovery has been published in it. As such, it has become a main target of data mining and, in particular, text mining. However, the unstructured, or covertly structured, nature of natural language text poses a major barrier to accessing the contents of the literature. Literature annotation technology has therefore played a central role in text mining. Although literature annotation still requires enormous effort despite years of accumulated experience, its productivity has recently improved significantly, and quite a few groups now produce annotations at large scale. Many groups have released their annotation data sets to the public, yet the way these widely valuable resources are shared remains primitive, e.g., relying on individual exchange of archived files.

Meanwhile, advances in web technology have enabled much more convenient ways of collaborating to produce and share data. For example, Web 2.0 technology has enabled crowdsourced content generation, and Web 3.0 has enabled a machine-understandable web of data. As an example of annotation in a general sense, Google Maps allows even lay users to produce geographic annotations easily and share them immediately, simply by sending the URL that represents the annotations. Before Google Maps, producing and sharing geographic annotations was never simple. What makes Google Maps so convenient may be attributed to (1) sharing of the same coordinate system (latitude and longitude), (2) provision of a dereferenceable URI for every position and every piece of annotation data, and (3) provision of open APIs and tools.

Note that geographic and literature annotations share similar characteristics. Both target unstructured data, e.g., map images vs. text. Entity annotations, e.g., restaurants and shops vs. drugs and diseases, identify whereabouts in the target data those entities are represented. Structural annotations, e.g., streets vs. dependency paths, reveal how the entities are connected to each other.

BLAH is organized to initiate a Google Maps-like system for sharing literature annotations. Through the event, we aim at (1) collecting various annotations to PubMed and PMC articles, (2) "linking" them through normalized texts of the literature, and (3) providing dereferenceable URIs for the resources. Expected immediate effects include (1) various annotations becoming comparable to each other, (2) annotations becoming searchable across multiple data sets, and, more importantly, (3) every piece of annotation becoming referenceable through the dereferenceable URIs. The dereferenceable URIs will also integrate the resources into the wider world of Web 3.0, or linked data, which will lead to a natural integration with other data mining efforts.
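
To make the idea of "linking" annotations through a normalized text more concrete, the following is a minimal sketch and not a description of any actual BLAH system. It assumes that each annotation is stored as a pair of character offsets into one shared, normalized text, playing the role that latitude and longitude play on a map. The base URI, the URI layout, the data set names, and the example sentence are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical base URI, used only for illustration.
BASE_URI = "http://example.org/docs"


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a labeled character span over a shared text."""
    dataset: str  # name of the annotation set (hypothetical)
    ann_id: str   # identifier within that set
    begin: int    # character offset, inclusive
    end: int      # character offset, exclusive
    label: str

    def uri(self, source_db: str, source_id: str) -> str:
        # Assumed URI layout: document first, then data set, then annotation.
        return f"{BASE_URI}/{source_db}/{source_id}/{self.dataset}/{self.ann_id}"


def overlapping(a: Annotation, b: Annotation) -> bool:
    """Two spans overlap if each one begins before the other ends."""
    return a.begin < b.end and b.begin < a.end


# One normalized text shared by every annotation set (made-up sentence).
text = "Aspirin reduces the risk of stroke."

# Annotations from three independent, hypothetical data sets.
drug = Annotation("drug-set", "T1", 0, 7, "Drug")
disease = Annotation("disease-set", "T1", 28, 34, "Disease")
phrase = Annotation("syntax-set", "T1", 0, 7, "NounPhrase")

# Because all offsets refer to the same text, spans from different sets
# can be compared directly.
print(text[drug.begin:drug.end], overlapping(drug, phrase))
print(drug.uri("PubMed", "12345"))
```

In this sketch, annotations from different sets become comparable only because they share one coordinate system, and each one can be referenced individually through its URI.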

We believe this exciting event has great potential to open a new era of text mining, enabling rich analysis of heterogeneous annotations, e.g., syntactic and semantic annotations, genomic and clinical annotations, and so on, which is not possible using single annotation sets individually.
