The Laboratory for Information Analysis and Knowledge Mobilization
Frontiers of Information Analysis
Opinion mining on healthcare topics
Researchers: Stan Matwin, University of Ottawa
Opinions relevant to health can be found almost everywhere on the Web, e.g. news feeds (Google, Reuters, Yahoo), social networks (Twitter, Facebook, LinkedIn), newspaper sites, blogs, etc. We are able to collect and analyze the opinions of others at a scale hitherto unachievable. However, as there are very large quantities of such reviews, it is costly, or even impossible, for users to read and analyze them on their own [KBA11]. There is therefore a need for automatic sentiment analysis (also known as opinion mining): a computational technique that seeks to understand and explain opinion and sentiment by analyzing large amounts of opinion data efficiently enough to assist in human decision making [KBA11][L10]. Note that little opinion mining has been done with health-related data, although there is a clear social need for such information about treatments, healthcare providers (e.g. Rate My Physician), drugs, etc.
One of the crucial aspects in mining opinions and reviews is the ability to synthesize automatically derived lexical information with structural information. On the one hand, evaluating opinions and their reliability depends on linguistic information, such as the style of language used, word sophistication, and linguistic correctness. On the other hand, the structural information in which opinion reliability is grounded concerns e.g. references from other reviews, the popularity/credibility of the opinion's author and his or her authority and influence on the community, the number of visits and readings of a particular opinion, etc. The next step in mining health-related reviews may be adapting the collected and processed information of these two kinds (lexical and structural) to the needs and preferences of a specific user.
Another requirement in analyzing reviews is ranking the opinions (with regard to their scores and the level of confidence a specific user places in the opinion giver). Furthermore, opinion ranks and importance may vary with particular users and their tastes (personalization). Recommendation systems are important tools that use (to some extent) consolidated data; they also try to exploit knowledge about the specific user and his or her taste, together with measures of similarity between users.
In general, every opinion-giving post (text) consists of objective facts about the target and holder of the opinion (usually also the time the opinion was expressed) as well as the subjective opinions/emotions expressed about the targeted objects, their components, and their attributes (features) [L10].
Sentiment mining used for monitoring purposes usually relies on superficial methods, and the results are correspondingly shallow, serving mainly statistics and basic business analysis (e.g. product reviews). Nonetheless, there is growing interest in exhaustive, deep mining of opinion texts, which gives insight into the nature of emotions and their context, or extended knowledge about the holders of opinions and their reasons for giving particular opinions (actionable insights) [CC09][SC08]. Accordingly, the data expressed in different kinds of opinions may be divided into subjective and objective information carriers, and opinion-giving phrases may carry different intensities of positive or negative tone [BES10][W10]. There is some work on automatic and semi-automatic generation and scoring of sentiment lexicons, e.g. [Liu09][BES10].
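A lexicon-based scorer of the kind surveyed in this line of work can be sketched in a few lines of Python: each opinion phrase carries a signed intensity, and a review's polarity is the sum over matched phrases. The lexicon entries and weights below are invented for illustration, not drawn from any published resource.

```python
# Illustrative sentiment lexicon: signed intensities per phrase.
# Real lexicons (e.g. those generated semi-automatically) are far larger.
SENTIMENT_LEXICON = {
    "excellent": 2.0, "helpful": 1.0, "caring": 1.0,
    "slow": -1.0, "rude": -1.5, "terrible": -2.0,
}

def score_review(text):
    """Return a signed sentiment score: sum of matched phrase intensities."""
    tokens = text.lower().split()
    return sum(SENTIMENT_LEXICON.get(t, 0.0) for t in tokens)

print(score_review("The nurse was caring but the wait was terrible"))  # -1.0
```

A negative total marks the review as predominantly negative; more refined scorers also handle negation and intensifiers, which this sketch omits.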
The deep knowledge can be captured with the use of semantic technologies, particularly with modeling of the domain with ontologies. Specifically for health-related data, some relevant ontologies (discussed elsewhere in this application) are available.
The following steps describe our approach and method:
I. OPINION MINING module
II. STRUCTURAL TRUST MEASUREMENT module: harvesting expertise and reliability measures using external resources, e.g. ranks of opinion givers
III. OPINION INFORMATION SYNTHESIS & PERSONALIZATION module: profiling users, i.e. measuring and modeling users and their preferences, e.g. by comparison with friends in discussion networks; weighting opinions with trust and scoring opinioned entities (works of art) with regard to user profiles
IV. OPINION DYNAMICS AND INFLUENCERS module: discovering the leading influencers for particular opinions and trends; measuring opinion givers' impact on particular opinions and scores (simply, who generates many opinions on particular subjects)
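As a sketch of how modules II and III might combine, the Python fragment below weights each opinion score by a structural trust measure for its author. The field names and trust values are hypothetical, chosen only to illustrate the aggregation.

```python
def aggregate(opinions):
    """Trust-weighted mean of opinion scores: structural trust (module II)
    weights each score during synthesis (module III)."""
    total = sum(o["trust"] * o["score"] for o in opinions)
    weight = sum(o["trust"] for o in opinions)
    return total / weight if weight else 0.0

# Hypothetical opinions on one healthcare provider
opinions = [
    {"score": 4.0, "trust": 0.9},  # highly credible opinion giver
    {"score": 1.0, "trust": 0.1},  # low-credibility opinion giver
]
```

Here the trusted reviewer dominates the aggregate, so the low-credibility outlier barely moves the final score; a personalized variant would derive the trust weights per user.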
[BGC10] A. Brew, D. Greene, P. Cunningham. Using Crowdsourcing and Active Learning to Track Sentiment in Online Media. In H. Coelho, R. Studer, and M. Wooldridge, editors, ECAI 2010, 19th European Conference on Artificial Intelligence, pages 145-150. IOS Press, 2010.
Computational Study of Natural Language
Researchers: Sheila Embleton, Distinguished Research Professor of Linguistics
In the computational approach to the study of human languages, we use information technology on the one hand for its power in processing data with sophistication and speed, and on the other for its logical consistency and repeatability in modelling language. In different sub-projects, we have digitized large sets of dialect data and applied sophisticated statistical methods to uncover new understanding of how dialects vary; we have created mathematical models of language change and of the semantics of language, and implemented them as computer programmes.
Online Dialect Atlas
RODA: The SSHRC-funded Romanian Online Dialect Atlas (RODA) project has digitized 3 of 5 volumes of a hardcopy atlas of dialects in North-West Romania, and developed a sophisticated tool for accessing the data, presenting user-selected subsets as dynamically created maps, and recording user interpretations of the data.
Currently, we plan to expand the RODA interface to permit even more sophisticated searches, and to digitize additional data. In addition to Romanian, we have worked on extensive collections of Finnish and English dialect data, and we have shared our technology with other groups working on Romanian, Finnish and other languages.
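One simple overall measure of dialect difference, of the kind our statistical analyses contrast with individual features, is the fraction of surveyed features on which two localities disagree. The sketch below is illustrative only; the feature names and values are invented, not RODA data.

```python
def dialect_distance(site_a, site_b):
    """Fraction of shared dialect features on which two survey sites differ
    (a crude 'overall measure' suitable for clustering or scaling)."""
    shared = set(site_a) & set(site_b)
    if not shared:
        return 0.0
    diffs = sum(1 for f in shared if site_a[f] != site_b[f])
    return diffs / len(shared)

# Hypothetical feature values for three survey localities
sites = {
    "A": {"vowel_1": "e", "plural": "-uri", "palatalized": True},
    "B": {"vowel_1": "e", "plural": "-e",   "palatalized": True},
    "C": {"vowel_1": "a", "plural": "-e",   "palatalized": False},
}
```

A full pairwise matrix of such distances is what methods like multidimensional scaling take as input when mapping dialect regions.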
This project is the subject of an SSHRC Insight Grant application for 2012-2015, and has identified milestones in that time frame for delivering new technology and data, and for communicating it broadly to both professional and general audiences.
Embleton, Sheila, Dorin Uritescu and Eric Wheeler. 2008. Identifying Dialect Regions: Specific Features vs. Overall Measures Using the Romanian Online Dialect Atlas and Multidimensional Scaling. Methods XIII Conference, Leeds, UK, August 2008. In Barry Heselwood and Clive Upton, eds. 2009. Proceedings of Methods XIII: Papers from the Thirteenth International Conference on Methods in Dialectology, 2008. Frankfurt am Main: Peter Lang. pp. 79-90.
FODA: Beyond the scope of the RODA project, we hope to extend the technology to an extensive set of Finnish dialect data (the Finnish Online Dialect Atlas, or FODA). While the data has been digitized, some final editing, adaptation to the RODA technology, and publication of both the data and the associated findings remain. This project has had SSHRC funding but is currently not funded.
Embleton, Sheila and Eric S. Wheeler. 2000. Computerized Dialect Atlas of Finnish: Dealing with Ambiguity. Journal of Quantitative Linguistics 7.3, pp. 227-231.
Wheeler has created a mathematical model of language change in a communications network, and produced some theoretical results. The model needs application to specific data situations. There are possibilities to test the model on existing data sets for Acadian French. The next step is to explore these possibilities. This project is currently not funded.
Wheeler, Eric S. 2007. Language Change in a Communication Network. In Peter Grzybek and Reinhard Köhler (eds.), Exact Methods in the Study of Language and Text (Quantitative Linguistics, 62), dedicated to Gabriel Altmann on the occasion of his 75th birthday. Berlin and New York: Mouton de Gruyter. pp. 689-698.
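A toy version of change in a communication network can be sketched as follows: in each round, a speaker may, with some probability, adopt the variant used by a randomly chosen neighbour. This is only an illustrative diffusion sketch, not Wheeler's actual model.

```python
import random

def step(network, variants, p_adopt=0.5, rng=None):
    """One round of change: each speaker may, with probability p_adopt,
    adopt the variant of a randomly chosen neighbour."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    new = dict(variants)
    for speaker, neighbours in network.items():
        if neighbours and rng.random() < p_adopt:
            new[speaker] = variants[rng.choice(neighbours)]
    return new

# Two mutually connected speakers and one isolate
network = {"a": ["b"], "b": ["a"], "c": []}
variants = {"a": 1, "b": 2, "c": 3}
```

Iterating `step` over a network derived from real contact data (e.g. the Acadian French data sets mentioned above) would let one compare predicted diffusion paths with attested change.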
Theatre of the Mind is a system to display an animated interpretation of a natural language text on a virtual stage: Given “Jack and Jill went up the hill...” the system shows animated characters going up a hill. To be general, the system needs a robust natural language processor that integrates a wide range of linguistic theories, from basic syntax to semantics, pragmatics and discourse analysis.
A prototype system is operational. The next milestone is to create a thousand-word lexicon and grammar with a suitable semantics. The focus here is on developing linguistic sophistication rather than animation techniques. The project is currently active, but not externally funded.
Wheeler, Eric S. 2009. Theatre of the Mind: A Project to Animate the Language of Thought and Communication. e-Learning 6.3, Special Edition, Sep 2009.
Understanding and Applying Open Data
Researchers: Sara Diamond, OCAD University and Barbara Crowe, York University
Partners: Mozilla Foundation, IBM, ReSRC, Google
In creating a major centre for data analysis, it will be necessary to undertake research on the open data movement, appropriate policy regarding open data, and the value, limits and applications of open data. Our research will approach this complex phenomenon from social science, humanities and computational perspectives. The movement towards "open data" has accelerated over the last two decades, amplified by the rise of social media, gaining momentum with governments, some members of the research community, some business advocates, companies such as Mozilla and Google, and citizens' movements. Advocates of open data propose that restrictions on access to data hold back discovery, while opponents suggest that competition is dampened when all developers have access to the same databases. The meaning of "open data" is loose, usually implying access to all manner of raw data sets for the benefit of public information, democratic engagement or discovery. The data sets in question range from municipal, provincial and federal data sets regarding the economy, urban conditions and districts, to non-textual data such as maps, genomes, formulae, and medical or biological data. In the context of politics, the Open Data movement focuses on publicly held databases. Opening this detail of everyday life is argued to enable transparent government, while at the same time encouraging individuals and companies to build applications, services and experiences using the data.
Governments around the world are increasingly providing access to their data. The federal government's Digital Economy Strategy, released May 10, 2010, included the statement, "Governments can help by making publicly-funded research data more readily accessible to Canadian researchers and businesses." A Canadian open data 12-month pilot launched on March 17, 2011. It made geospatial data from Natural Resources Canada available at no cost, as well as data collections from Environment Canada. The Ministry of Research and Innovation of Ontario funded the Regional Strategic Resource Centre Program (ReSRC), an open data portal at MaRS Innovation. The goal of ReSRC is to strengthen innovation in the region. As stated, "By sharing and integrating disparate sets of data – often collected in institutional silos – from government, academia as well as the private and non-profit sectors, we will better understand the unique strengths, opportunities and needs of our communities and can more effectively work together to build vibrant, productive regional innovation economies." LIAKM will collaborate with ReSRC to provide overall capacity in data extraction and analysis. We will link the LIAKM provincial focus with comprehensive research on open data.
Perhaps the governments most engaged in open data are municipalities, with many cities running Open Data Day projects and encouraging developers to use their data to create all manner of applications. Toronto, Edmonton, Ottawa and Vancouver have joined forces to develop an open data framework. Open data available through the City of Toronto includes ward profiles, GIS data, TTC data, employment districts, committee of adjustment decisions, water use and hydrant placement. Innovative partnerships have emerged to develop structures for the use of open data. For example, FutureEverything, an art and technology hub, has been funded to "lead the city of Manchester's transition to an Open Data framework, a major policy initiative which in most cities is led by the mayor's office". Their initiative has included significant investment in a framework, application development and a series of design and art projects that make use of Manchester's data in concert with the larger community, providing a model for research that LIAKM will undertake. Open data projects at times go hand in hand with crowdsourcing. Rachel Sterne, the City of New York's Chief Technology Officer, has created a bureau that uses all manner of social media (Facebook, Twitter, QR codes and mobile applications) to "crowdsource" information on the city's challenges and find solutions to them. European governments have asked designers, developers, journalists, researchers and the general public to develop ideas, applications and visualizations, as well as to contribute additional relevant data sets, through the European Open Data Challenge.
LIAKM researchers have a track record of investigating the possibilities of open data and developing prescient applications for such data. The Mobile Digital Commons Network was co-led by Michael Longford (York University) and Sara Diamond (OCAD University) and included researchers Barbara Crowe (York University) and Martha Ladly (OCAD University). It developed strategies to open mobile networks and data for public access and use, producing a wide-ranging series of demonstration applications. Recent research entitled Taking Ontario Mobile, led by Sara Diamond and Vera Roberts, has included an investigation of the future of mobility and its reliance on access to open data. Sara Diamond has been investigating the historical debates and emergence of the open data movement. LIAKM research will consider the following:
 How Can We Build a City that Thinks Like a Web? Sara Diamond, Cory Doctorow, Mark Surman, Dan Misener, Subtle Technologies, 2011. http://www.subtletechnologies.com/wp-content/uploads/2011/05/Full-Festival-Program-2011-part-2.pdf
 See Mobile Nation: Creating Methodologies for Mobile Platforms, Edited by Martha Ladly and Philip Beesley, Waterloo: Riverside Architectural Press, 2008 and The Wireless Spectrum: The Politics, Practices and Poetics of Mobile Communications, edited with Michael Longford and Kim Sawchuk, Toronto: University of Toronto Press, 2010.
 Taking Ontario Mobile, Sara Diamond and Vera Roberts, Toronto: OCAD University, (in press).
 Euphoria and Dystopia: The Banff New Media Institute Dialogues, 1995 – 2005. Edited by Sarah Cook and Sara Diamond. Banff Centre Press and Riverside Architectural Press, Banff/Toronto: 2011.
Advancements in Intelligent Proactive Systems
Researchers: Ken Ono, NexJ Systems; Nick Cercone, Zhenmei Gu, York University
Partners: NexJ Systems
Large repositories of unstructured textual data and structured relational data exist in many businesses, for example in integrated CRM (Customer Relationship Management) systems. Examples of such textual data include emails, chat history, internal chats and messages about a customer, call-center records, pathology reports and doctors' notes. External unstructured data is also available from business reports, medical reports, and social media feeds. Examples of structured data include activities (e.g. appointment types, task types, dates, times), financial holdings, medical test results (blood sugar levels, blood pressure, cholesterol, etc.), and fitness information (steps, distance, calories).
Free-form notes are frequently used because they can capture many types of information easily, without prior knowledge of the subject. Although these textual data likely carry much useful information, their use is limited as long as the data remain unstructured: such potentially informative text is difficult to exploit with traditional Business Intelligence (BI) techniques like data mining. Attention has therefore been drawn to the need for text mining to improve BI. Technology from Natural Language Processing (NLP), especially Information Extraction (IE), is helpful for this purpose: it provides ways to automatically process texts, extract specific information from them, make that information easily accessible (usually in a structured format), and integrate it into the existing BI system.
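As a minimal illustration of IE over free-form notes, the Python sketch below extracts two structured fields from a clinical note with regular expressions. The field names and patterns are our own assumptions for illustration, not an actual NexJ schema; production IE would use trained extractors rather than hand-written rules.

```python
import re

# Illustrative extraction rules mapping field names to patterns.
RULES = {
    "blood_pressure": re.compile(r"\bBP\s*(\d{2,3})/(\d{2,3})\b"),
    "followup_weeks": re.compile(r"\bfollow[- ]?up in (\d+) weeks?\b", re.I),
}

def extract(note):
    """Turn a free-form note into a structured record of matched fields."""
    record = {}
    for field, pattern in RULES.items():
        m = pattern.search(note)
        if m:
            record[field] = m.groups() if len(m.groups()) > 1 else m.group(1)
    return record

note = "Patient stable, BP 130/85. Follow-up in 6 weeks."
```

Here `extract(note)` yields a record with the blood-pressure reading and the follow-up interval, ready to join against the structured side of the CRM.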
There are three primary goals of the proposed research:
In current systems, users need to browse through many entries in order to find specific customer information buried in the texts and structured data. Even if all of a person's health records were available to a doctor, it would take too long to understand the key issues by reading each individual journal entry, test result and message. In the financial services realm, the abstract problem is the same, but the subject matter changes to emails, call notes and financial holdings.
With the amount of available data growing, the problem of information overload must be ameliorated.
A key focus of this proposal is to find better ways to extract and summarize information from the textual data. This capability will be leveraged to improve how such texts are conveyed to end users. In the simplest form, the extracted and summarized text could be displayed as text (i.e. displaying summaries instead of raw text), but we also seek graphical ways of conveying information.
It is also important to integrate textual and structured information. A goal of the system is to convey or act on the most important information regardless of its source. For example, in a financial context, if a customer's holdings dropped sharply in value, or an email arrived notifying the advisor of a divorce or death, then this information would need to be highlighted above other, less relevant information. A simple rules-based system may create too much clutter.
The integrated information from text and relational sources can also be used to create predictive models and detect relevant situations. These models and situation detectors can be used to compute next best actions. In a financial context, this could relate to product suggestions (e.g. we detect a young family situation and recommend a RESP over all other possibilities). In a medical scenario, it could suggest the next step in a prescribed care plan or warn of a diversion from known best practices.
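A minimal sketch of such a situation detector follows; the rules, thresholds and action strings are invented for illustration, and a real system would rank learned predictions rather than fire hand-written rules.

```python
def next_best_action(facts):
    """Map integrated text+relational facts to a suggested next action.
    Rules and products below are hypothetical examples."""
    # Text-derived life event plus structured age -> product suggestion
    if facts.get("life_event") == "new_child" and facts.get("age", 99) < 40:
        return "suggest RESP"
    # Structured holdings signal -> advisor alert
    if facts.get("holdings_change", 0.0) < -0.15:
        return "alert advisor: sharp drop in holdings"
    return "no action"
```

The same shape fits the medical scenario: facts extracted from notes and test results would trigger the next step in a care plan or a best-practice warning.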
Furthermore, quantified sentiment of the text could be conveyed to end users or used in predictive models.
In the shorter term, we propose a series of research investigations, the combination of which will support achievement of the primary goals. These shorter-term research investigations will guide future steps based on their estimated impact, market differentiators and estimated commercialization costs.
Primary care information systems with Chinese language support
Researchers: Nick Cercone, Zhenmei Gu, York University
Partners: Empress Systems, Inc.
The primary care information system will provide operational, clinical, and research capabilities for the physicians and staff who use it. Operationally, it will provide registration, appointment booking, billing, consultant referrals, and various reporting functions. Clinically, it will provide a full cumulative patient profile, including ongoing conditions, treatment regimen, history, allergies, consultant lists, personal and family data, pediatric prevention, adult prevention, lab results, and various disease-specific modules. In addition, it will provide prescription ordering, which can identify drug/drug interactions in real time based on the drug in question and the treatment regimen of the patient. To improve recordkeeping, the clinical record will also allow the physician to enter progress notes, either handwritten or dictated through voice input, directly into the record. On the research side, the system will allow for aggregate and specific analysis of all operational and clinical data to satisfy the needs of the physicians in question. Again, this will be done by means of Empress.
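Real-time interaction checking reduces to looking up each (new drug, current drug) pair in an interaction table. The sketch below uses a one-entry hypothetical table; a production system would consult a curated drug-interaction database rather than a hard-coded dictionary.

```python
# Hypothetical interaction table; keys are unordered drug pairs.
INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}): "increased bleeding risk",
}

def check_new_prescription(new_drug, regimen):
    """Flag known interactions between a new drug and the current regimen."""
    return [
        (current, INTERACTIONS[frozenset({new_drug, current})])
        for current in regimen
        if frozenset({new_drug, current}) in INTERACTIONS
    ]
```

Using `frozenset` keys makes the lookup order-independent, so the check fires whether the interacting drug is the new prescription or an existing one.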
The research component is especially important, since these are academic health science centers. Examples would be the tracking of treatments for diabetic as well as hypertensive patients.
Questi: a declarative, pattern-based query language
Researchers: Parke Godfrey, Jarek Gryz, Xiaohua Yu, York University
Partners: IBM CAS
We propose to develop a declarative, pattern-based query language, Questi (Italian for "these"), for exploration and extraction over XML collections. Our aim is that Questi queries and transformations be easy for lay users to compose and refine, yet be formally understood, with a clear semantics. A Questi evaluation would comb large, heterogeneous XML collections to transform them into increasingly structured, schema-uniform tables. In the limit, the transformations lead to lists of "answers", query exemplars.
Support technologies for XML are now mature, including formal query languages such as XQuery and XSLT, and bridge languages for relational databases, such as SQL/XML within SQL. For information retrieval, there has been a wide research effort on keyword search over XML collections, to identify and rank best-matching twigs. Little has been done, however, in the way of simpler, more flexible pattern-based query languages (along the lines of UnQL and Xcerpt) that could be used to explore iteratively and interactively over large collections. We feel the need for such tools is keen, and that XML technology is mature enough to pursue this with success.
We propose to develop a formal transformation language over SQL/XML, called Balance, as the core of Questi. Balance is to operate over relational tables with XML columns. At one extreme is a simple table of two columns: an ID column and an XML column which, in aggregate, stores a collection of XML documents. At the other extreme are tables with many columns, for which the XML column's values consist only of leaf nodes with no deeper structure. Balance queries will offer transformations over such tables, both to "schematize" the XML data (by extracting parts into columns) and to "de-serialize", folding columns back into the XML structure. We will develop an algebra for Balance that preserves lossless transformations.
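The "schematize" direction can be illustrated in Python with the standard-library XML parser: given (ID, XML) rows, one path is pulled out into a column of its own, leaving the original document alongside it. The function name and table shape here are our sketch, not the Balance design itself.

```python
import xml.etree.ElementTree as ET

def schematize(rows, path):
    """Extract the text at `path` from each row's XML document into a new
    column, yielding (id, extracted_value, original_xml) tuples."""
    out = []
    for rid, xml_text in rows:
        doc = ET.fromstring(xml_text)
        node = doc.find(path)
        out.append((rid, node.text if node is not None else None, xml_text))
    return out

# A two-column table: ID plus an XML document per row
rows = [(1, "<doc><title>Atlas</title><body>...</body></doc>")]
```

Repeated applications move the table toward the schema-uniform extreme; the inverse ("de-serialize") would fold such columns back into the XML, which is what the lossless-algebra requirement governs.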
Approaches that bridge preliminary work done in pattern-based query languages, traditional path-based XML query languages, and the information retrieval techniques for keyword search over XML could accomplish this. Our efforts would be case driven, by working with other projects in the LIAKM, to understand their challenges with data curating, transformation, and exploration.
Stages of the work are to be as follows.
I. The Questi Language
A. Design the language.
II. The Questi Cache
A. Cache resulting Questi evaluations (as Balance transformations) into a local relational-XML hybrid system, to facilitate efficient drill-down, refinement, and transformation.
B. Research how to index Balance tables for query optimization.
III. The Questi System
A. Design and implement a query-by-example engine using Questi.
We draw on the strengths we have at York University in core database research. Later stages of this work would collaborate with industry such as with the IBM Toronto Laboratory where the DB2 database system and WebSphere are developed. The researchers are in active research collaboration with IBM through various projects supported by IBM's Centre for Advanced Studies.
S. Amer-Yahia, R. Baeza-Yates, M.P. Consens, M. Lalmas. XML Retrieval: DB/IR in Theory, Web in Practice. (Tutorial) Proceedings of the International Conference on Very Large Data Bases, pp. 1437-1438, 2007.
Copyright © 2012 LIAKM, Toronto, Canada