The Laboratory for Information Analysis and Knowledge Mobilization
Nick Cercone, Jimmy Huang, York University
Free-form text is the most common form of valuable data in healthcare, ranging from doctors' notes and descriptions of patient histories to healthcare-related messages posted by patients on social media such as blogs, bulletin boards, and discussion forums. Such narrative text data contain the most valuable information for physicians to use in their practice and for public and government agencies to make their healthcare-related decisions. Recently, the New York Times reported on a study by MIT researchers showing that companies in the study that adopted data-driven decision-making achieved 5-6% higher productivity than those that did not. However, since data are continuously generated every day in large volumes, the sheer amount is too overwhelming for humans to read and analyze manually. Automatic text analysis tools are greatly needed to discover the hidden information trapped inside free-form texts. For example, a tool that identifies and analyzes healthcare-related posts in social media can detect public opinions, activities, and preferences on healthcare-related issues. Natural language analysis, data mining, and information retrieval are key techniques that can be used to build such text analysis tools.
With the general objective of discovering hidden and valuable information from healthcare-related text data, we propose to develop innovative text analysis techniques for analyzing healthcare-related data on social media. Our ultimate goal is to develop tools that provide policymakers and program managers with the information needed to plan and implement successful healthcare initiatives. We choose to focus on social media data because they are publicly available, they are sensors of the real world, and there is high demand from industry and government to discover and track public activities and opinions expressed in social media. Our main objectives are as follows:
To achieve these objectives, we propose the following lines of research:
1. Resource selection for seeking information about healthcare issues
There are numerous fora and networks on social media that people use to share their opinions and insights. Not all of them are relevant to the general or specific healthcare issues of concern. Appropriate selection of fora and/or relevant posts in social networks is fundamental to the successful discovery of useful hidden information on the relevant healthcare issues.
In the field of information retrieval, many algorithms [e.g., Callan'00, Chakravarthy'95, Conrad'02, Ipeirotis'02] have been proposed for resource selection in federated text search. Many of them can be viewed conceptually as treating each information source as a "big document" and using variations of traditional document ranking algorithms to rank the available sources with respect to a user query. However, recent research [Si'03] has demonstrated that the "big document" approach does not work well when the sizes of the available sources vary widely, because it ignores the individual documents within each source. The GlOSS algorithm [Gravano'99] departs from the "big document" approach by defining the goodness of a source for a query as its number of relevant documents, but no satisfactory method has been provided for estimating this goodness. The hierarchical database sampling resource selection algorithm [Ipeirotis'02] and the classification-aware selection algorithm [Ipeirotis'06] build on base resource selection algorithms such as CORI and GlOSS, and thus inherit the weaknesses of those base algorithms. More robust algorithms [e.g., Si'03, Liu'02] have been designed to explicitly estimate the goodness/usefulness of the individual documents each source contains. However, such measures were not designed for evaluating documents (i.e., posts) on social media. We will design goodness measures for evaluating a discussion forum as a whole and individual posts within a forum. Resource retrieval systems will then be developed to collect relevant sources of information based on these goodness measures.
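To make the contrast between the two families of approaches concrete, the sketch below ranks hypothetical forums first in the "big document" style (concatenating all of a forum's posts into one blob) and then in a document-aware style (judging a forum by its best individual posts). The forum data, the simple smoothed query-likelihood score, and the top-k aggregation are all illustrative simplifications, not the actual goodness measures we will design.

```python
import math
from collections import Counter

def score(query_terms, text_terms, mu=100):
    """Simplified smoothed query log-likelihood of a bag of tokens."""
    counts = Counter(text_terms)
    n = len(text_terms)
    return sum(math.log((counts[t] + 1.0) / (n + mu)) for t in query_terms)

def big_document_rank(query, forums):
    """'Big document' view: concatenate each forum's posts and rank the blobs."""
    return sorted(forums,
                  key=lambda f: score(query, [w for post in forums[f] for w in post]),
                  reverse=True)

def post_level_rank(query, forums, top_k=3):
    """Document-aware view: score individual posts, judge a forum by its best posts."""
    def goodness(posts):
        post_scores = sorted((score(query, p) for p in posts), reverse=True)
        return sum(post_scores[:top_k]) / min(top_k, len(post_scores))
    return sorted(forums, key=lambda f: goodness(forums[f]), reverse=True)
```

The document-aware ranking avoids the size-variation problem noted in [Si'03]: a small forum with a few highly relevant posts is not drowned out by a large forum whose concatenated text merely mentions the query terms in passing.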
2. Adaptive information extraction and topic/event detection and tracking
Free-form discussions of medical or healthcare issues on social media do not follow the standards for classifying and coding medical information such as the Systematized Nomenclature of Medicine (SNOMED). A useful first step will be to design a system that is able to extract key elements from narrative text that are relevant to the task(s) at hand.
We will pursue two directions for identifying the key information embedded in text: (1) adaptive information extraction, which automatically learns extraction knowledge from data in order to extract structured information such as entities, attributes, and relationships between entities from unstructured text; and (2) advanced topic and event detection, which considers the content as well as the temporal and social dimensions of the data.
For information extraction, we focus on improving the robustness of current IE systems that automatically learn extraction knowledge from data (e.g., WHISK [Soderland'99], RAPIER [Califf'99], SRV [Freitag'99], and LP2 [Ciravegna'01]), and on easing the difficulty of adapting an IE system to different extraction tasks. We will fully examine the naive Bayes IE model as a purely adaptive IE model, correcting the formulation problem present in previous naive Bayes IE systems. We will also investigate the effect of smoothing techniques in this context (essentially a general issue for any probabilistic model) and design our own smoothing strategy to obtain more stable probability estimates in statistical IE learning. Our initial experimental results show that a good smoothing method is critical to the robustness of naive Bayes IE systems. In most existing probabilistic systems, a natural evolution from naive Bayes models is toward more expressive Hidden Markov Models (HMMs) [McCallum'00]. Our work on HMM IE will address the extraction redundancy issue that arises when current HMM IE models operate on an entire document. To this end, we propose a segment-based HMM IE approach, in which a segment retrieval step first identifies extraction-related segments in the document. Note that our segment-based modeling is actually a general framework: beyond HMM IE, the same framework applies to any IE model in which extraction is performed by sequential state labeling. To improve adaptability when labeled texts are limited, we will extend our segment-based HMM IE modeling to semi-supervised learning using a modified version of the multi-view Co-EM learning strategy.
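A minimal sketch of why smoothing matters in a naive Bayes extractor, assuming a toy token-classification setting: the SYMPTOM/DRUG labels, the add-k constant, and the tiny training set are hypothetical illustrations, not our actual model or smoothing strategy. Without the add-k term, any word unseen under a label would receive zero probability and veto that label entirely.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTagger:
    """Tiny naive Bayes token classifier with add-k smoothing."""
    def __init__(self, k=0.5):
        self.k = k
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> total tokens
        self.vocab = set()

    def train(self, labeled_tokens):
        for word, label in labeled_tokens:
            self.word_counts[label][word] += 1
            self.label_counts[label] += 1
            self.vocab.add(word)

    def log_prob(self, word, label):
        # Add-k smoothing: unseen words get a small nonzero probability
        # instead of zeroing out the whole label.
        v = len(self.vocab) + 1  # +1 slot for unseen words
        return math.log((self.word_counts[label][word] + self.k)
                        / (self.label_counts[label] + self.k * v))

    def predict(self, word):
        total = sum(self.label_counts.values())
        return max(self.label_counts,
                   key=lambda l: math.log(self.label_counts[l] / total)
                                 + self.log_prob(word, l))
```

Varying `k` here is a stand-in for the broader question the paragraph raises: the stability of the probability estimates, and hence the extractor's robustness, depends directly on the smoothing choice.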
Motivated by the need to make term-weighting design choices for segment retrieval in our segment-based HMM IE, we will also investigate information-theoretic principles as tools for analyzing the term vector models employed in IE and information retrieval (IR). Thus far, a series of theoretical analyses [Gu'06] has shown that information-theoretic principles provide a good framework for making such design decisions in term vector models with sound theoretical justification. We advocate an integrated IE system in which different learners are selected not only according to the particular IE domain, but also based on the characteristics of different extraction tasks (i.e., different slots in an extraction template). Exploiting inter-slot relationships in the integrated IE framework will further improve extraction performance. Our concept of redundant extraction will adapt an existing IE system to perform extraction with redundancy. By introducing some redundancy, the IE system can identify more extraction-related information in documents, providing a solution that bridges the gap between the performance limits of current IE systems and the performance needs stipulated by applications.
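As one concrete instance of the general idea (shown only for illustration; it is not the specific analysis in [Gu'06]), a term can be weighted by how concentrated its occurrences are across documents: a term whose occurrences pile up in a few documents carries more information for retrieval than one spread evenly everywhere.

```python
import math

def entropy_weight(term, docs):
    """Weight = 1 - normalized entropy of the term's distribution over documents.
    Concentrated terms score near 1; evenly spread terms score near 0."""
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)          # entropy of the distribution
    max_h = math.log(len(docs)) if len(docs) > 1 else 1.0
    return 1.0 - h / max_h
```

A term occurring in only one document has entropy 0 and weight 1.0; a term occurring equally often in every document has maximal entropy and weight 0.0, mirroring the discriminative-power intuition behind IDF-style weights.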
For topic and event detection, most of the existing methods use clustering techniques or the generative Latent Dirichlet Allocation (LDA) probabilistic model [Blei’03] to group the documents according to the content of the text. A problem with most existing topic detection techniques is that the topics are represented by a set of words, which together may not be meaningful. We will investigate how to integrate computational linguistics techniques with probabilistic topic modeling to extract meaningful topics. We will also extend these techniques by considering both temporal and social factors so that topics or events are related to the context in which they are discussed or occur.
3. Development and testing of indicators of public interest in healthcare-related activities
With the goal of finding factors that influence and initiate consumer involvement in healthcare choices, we need to create indicators or indices that measure consumer participation. While formal surveys exist to measure patient participation, there are no universal indicators that track ongoing real-world consumer activities in, or opinions about, health care. To measure consumers' interest and participation in healthcare decision-making from social media, we will create a set of broad indicators of healthcare-related activities that operate independently of specific conditions, treatments, or providers. For example, an indicator can be related to choice-making: are consumers actively trying to make choices, or are they passively waiting for advice?
We will create and test indicators along the following dimensions:
We will create and test these indicators using both supervised and unsupervised machine learning techniques on selected social media data.
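A minimal sketch of how the choice-making indicator mentioned above might be computed; the keyword lexicon below is entirely hypothetical, standing in for what a supervised classifier would learn from labeled posts.

```python
# Hypothetical lexicon of active choice-making language.
CHOICE_MAKING = {"compare", "comparing", "decide", "deciding",
                 "options", "versus", "alternative"}

def choice_making_score(posts):
    """Fraction of posts showing active choice-making language.
    posts: list of token lists; returns a value in [0, 1]."""
    def is_active(tokens):
        return any(t.lower() in CHOICE_MAKING for t in tokens)
    return sum(is_active(p) for p in posts) / len(posts) if posts else 0.0
```

Tracked over time for a selected forum, such a score would distinguish communities of active choosers from communities passively awaiting advice.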
4. Sentiment Detection
The rise of social media has given the public many venues to express their opinions on a wide range of issues. Businesses have tremendous interest in sentiment analysis over social media data as a means to analyze markets and identify new opportunities. Most existing sentiment analysis techniques determine the overall sentiment orientation toward a single object as positive, negative, or neutral [Liu'10], but they do not specify exactly what reviewers like or dislike. Quite often, however, a person likes certain aspects of an object but dislikes others. Such mixed feelings cannot be extracted by sentiment analysis techniques that identify only the overall polarity (i.e., positive/negative/neutral) of the sentiment.
We will work on a novel approach to sentiment analysis that makes use of various computational linguistics techniques. We will employ four indicators at four linguistic levels (clause, phrase, word, and feature) to determine the polarity of sentiments. We will also identify the features of the object toward which sentiment is expressed, and propose a novel scheme of weighting features when determining the polarity of sentiment toward the object. With features, we will be able to determine on which aspects of an object a consumer is positive or negative, providing a refined sentiment analysis for situations where mixed sentiment is expressed in a single comment.
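A sketch of the feature-weighting idea under stated assumptions: the opinion lexicons, feature terms, feature weights, and context window below are all hypothetical placeholders for what the full four-level linguistic analysis would produce.

```python
# Hypothetical opinion and feature lexicons (e.g., for posts about a drug).
POSITIVE = {"effective", "helpful", "great", "improved"}
NEGATIVE = {"expensive", "nausea", "worse", "painful"}
FEATURE_WEIGHTS = {"efficacy": 0.5, "side_effects": 0.3, "cost": 0.2}
FEATURE_TERMS = {
    "efficacy": {"works", "effect", "effective", "improved"},
    "side_effects": {"nausea", "dizzy", "rash", "painful"},
    "cost": {"price", "expensive", "cheap", "cost"},
}

def feature_sentiments(tokens, window=3):
    """Assign each detected feature the polarity of nearby opinion words."""
    out = {}
    toks = [t.lower() for t in tokens]
    for feat, terms in FEATURE_TERMS.items():
        for i, t in enumerate(toks):
            if t in terms:
                nearby = toks[max(0, i - window): i + window + 1]
                pos = sum(w in POSITIVE for w in nearby)
                neg = sum(w in NEGATIVE for w in nearby)
                out[feat] = 1 if pos > neg else -1 if neg > pos else 0
    return out

def overall_polarity(tokens):
    """Combine per-feature polarities using the feature weights."""
    score = sum(FEATURE_WEIGHTS[f] * s
                for f, s in feature_sentiments(tokens).items())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

On a mixed comment such as "the drug improved my sleep but it is very expensive", this yields a positive efficacy aspect and a negative cost aspect, while the weighted combination still produces a single overall polarity, which is exactly the refinement over polarity-only techniques that the paragraph describes.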
5. Discovering and tracking patterns of information seeking, comparisons and sentiment about health care
After identifying indicators of public activities and opinions in health care, we would like to find patterns in which consumers seek information, make comparisons and express sentiments, and also track changes in these patterns over time and in response to significant public events or stories. To this end, we will conduct the following lines of research:
[Chakravarthy'95] A. Chakravarthy and K. Hasse. Netserf: Using semantic knowledge to find internet information archives. Proceedings of
Copyright ©2012 LIAKM, Toronto, Canada