The RALI is a lab based at the Université de Montréal, specializing in natural language processing. Can you tell us more about your research activities and expertise?
One of our main areas of research is information retrieval, which aims to satisfy an information need (for example: “What are the side effects of a given medication?”) within a collection of documents. Since 2013, part of our research has focused on OIE (Open Information Extraction), which consists of acquiring structured knowledge from unstructured, free text.
We are also interested in e-recruitment. As part of the Butterfly Predictive Project, we are developing, in collaboration with our industrial partner LittleBigJob Networks Inc., a platform designed to improve the recruitment process for administrators and professionals by using big data, drawn notably from social media.
We also work on machine translation, automatic summarization, legal text processing, and big data processing and visualization, in collaboration with government organizations and industrial partners.
As part of the CO.SHS project, you are developing a tool, named Allium, that will increase the Érudit platform’s information discovery potential thanks to OIE. What does that entail, more specifically?
The purpose of OIE is the extraction of structured information from unstructured, free text (such as newspaper articles or film reviews). The extracted and properly structured knowledge, connections, concepts and facts can be used in a myriad of ways. They can be cross-referenced, compared, aggregated, etc., in order to improve information discovery and help an information base reach its full documentary potential.
You can therefore imagine an OIE program that “reads” all of the documents disseminated on Érudit and helps the reader of an article identify its key concepts and the connections between them, or relate these ideas to external documents. This will be, in a nutshell, our contribution to CO.SHS.
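To make the idea of cross-referencing extracted knowledge more concrete, here is a minimal sketch. It is not Allium's code: the `Triple` structure, the `index_by_entity` helper, the document identifiers and the toy triples are all assumptions made for illustration. It represents each extracted fact as an (argument, relation, argument) triple and indexes the triples by entity, so that a reader looking at one document can be pointed to others that mention the same entity.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted fact. The doc_id field is a hypothetical document key."""
    arg1: str
    relation: str
    arg2: str
    doc_id: str


def index_by_entity(triples):
    """Map each argument string to the set of documents that mention it."""
    index = defaultdict(set)
    for t in triples:
        index[t.arg1].add(t.doc_id)
        index[t.arg2].add(t.doc_id)
    return index


# Toy, hand-written triples; a real system would obtain them from an extractor.
triples = [
    Triple("Chilly Gonzales", "was born in", "1972", "doc-1"),
    Triple("Jason Beck", "is known as", "Chilly Gonzales", "doc-2"),
    Triple("Author A", "cites", "Author B", "doc-3"),
]

index = index_by_entity(triples)
related = sorted(index["Chilly Gonzales"])
print(related)  # ['doc-1', 'doc-2']
```

Matching entities by exact string, as done here, is the simplest possible choice; a real collection would likely need some normalization of names before aggregation.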
In your opinion, who would be the typical users for this tool? Could you give us examples of queries or information needs that Allium will be able to fulfill?
We believe that all types of users will benefit from the enrichment of Érudit through our prototype. As Allium is part of CO.SHS’s “Discovery” mission, it aims to provide readers with additional sources of information that will enrich the text of the disseminated articles. The goal in guiding the reader in this way is twofold: to clarify the topic of the article being viewed with relevant explanations, and to showcase the collection by suggesting articles or documents that the user would otherwise be unaware of. Ultimately, as Érudit hosts a fairly large collection, all of its users will benefit from the latter goal.
Are your developments specifically tailored for the scholarly and cultural journals disseminated on Érudit, or can they be adapted to other corpora?
For the moment, we are focusing our efforts on scholarly French-language texts, which we believe is the best way to contribute to the CO.SHS project. Adaptation to new domains is always a delicate endeavor in natural language processing, so we don’t claim that our prototype will be easily adaptable to other corpora. The rule is simple: the more closely a collection resembles Érudit’s, the easier it will be to adapt the prototype to it, and vice versa.
What difficulties have you faced so far in developing this project?
There are several challenges related to this project. The main one relates to its central scientific component, OIE.
Luckily (or not…), most researchers in this field are confronted with the same difficulties. How do you ensure an adequate level of precision while extracting data from entirely unstructured text? How do you ensure adequate algorithmic efficiency for such a large collection (200,000 documents)? And, even more importantly, how do you impartially evaluate your progress? We are indeed seeing some progress, but not as quickly as we had hoped.
What concepts and methods is your work based on? What are, in your opinion, the most innovative elements of your approach?
The principles of open information extraction are mostly based on the important work of Michele Banko et al. (2007), who named the field and gave it its credentials. Our specific approach is based on the ReVerb extractor developed by Anthony Fader et al. (2011) and on the Ollie extractor by Mausam et al. (2012). Not only do we use tools inspired by theirs, or modified versions of them, but we also exploit data that they generated. We hope to use this data as a high-quality starting point to develop more sophisticated tools. This approach is called distant supervision: we train a system on data that hasn’t been manually verified in its entirety.
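One common way to set up distant supervision can be sketched as follows. This is an illustrative assumption, not a description of RALI's pipeline: the `label_sentences` helper, the matching heuristic (a sentence that mentions both arguments of a known triple is labeled with that triple's relation) and the toy sentences are all invented for the example. The point is only that the labels come from an existing extractor's unverified output rather than from human annotators.

```python
def label_sentences(sentences, noisy_triples):
    """Build noisy training examples without manual annotation.

    Heuristic (an assumption of this sketch): if a sentence mentions both
    arguments of a triple produced by an existing extractor, treat the
    sentence as a positive example of that triple's relation.
    """
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        for arg1, relation, arg2 in noisy_triples:
            if arg1.lower() in lowered and arg2.lower() in lowered:
                examples.append((sentence, arg1, arg2, relation))
    return examples


# Stand-in for the unverified output of an existing extractor.
noisy_triples = [("Jason Beck", "is known as", "Chilly Gonzales")]

# Invented sentences, used only to exercise the function.
sentences = [
    "Jason Beck performs under the name Chilly Gonzales.",
    "Chilly Gonzales released a new album.",
]

training_examples = label_sentences(sentences, noisy_triples)
for example in training_examples:
    print(example)
```

Only the first sentence is labeled, because the second mentions just one of the two arguments. Labels produced this way are noisy by construction, which is why the quality of the starting data matters so much.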
Where do you stand today and what are the next milestones for your developments?
At the moment, we are adapting certain OIE techniques to the French language and are using them in the main engine running the prototype that we have presented to CO.SHS members. Given its central role, we would like to improve this engine in several ways, notably by increasing its coverage and discriminating power, so that it can extract, for example, connections that aren’t explicitly stated with verbs in the texts processed by the machine. For example, when we say “Chilly Gonzales (born in 1972 as Jason Beck)”, we should be able to extract (Jason Beck, is known as, Chilly Gonzales), even though this relation isn’t explicit in the text. In other words, our approach involves paraphrasing, that is, the rephrasing of expressions, in order to find equivalences in the text with explicit connections. The task is made even more difficult by the fact that corpora and evaluation metrics for such an endeavor are rare, if not non-existent. But we are persisting in that direction!
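The Chilly Gonzales example can be illustrated with a small hand-written pattern. This is a toy sketch under stated assumptions, not the paraphrase-based approach described above: the `BORN_AS` regular expression and the `extract_alias_triples` function are hypothetical, cover only the single English construction “X (born in YEAR as Y)”, and recognize names only as runs of words starting with an unaccented capital letter. It shows the kind of output expected when a relation is carried by a parenthetical rather than by a verb.

```python
import re

# Hypothetical pattern for one construction only: "X (born in YEAR as Y)".
BORN_AS = re.compile(
    r"\b(?P<alias>[A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*)*)"  # capitalized name
    r"\s*\(born in (?P<year>\d{4}) as (?P<birth_name>[^()]+)\)"
)


def extract_alias_triples(text):
    """Return (arg1, relation, arg2) triples implied by the parenthetical."""
    triples = []
    for match in BORN_AS.finditer(text):
        alias = match.group("alias")
        birth_name = match.group("birth_name").strip()
        triples.append((birth_name, "is known as", alias))
        triples.append((alias, "was born in", match.group("year")))
    return triples


text = "Chilly Gonzales (born in 1972 as Jason Beck) is a pianist."
result = extract_alias_triples(text)
print(result)
```

A hand-written rule like this one covers a single phrasing, which is exactly the coverage limitation that motivates looking for equivalences through paraphrasing instead of enumerating patterns.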