The Laboratory for Information Analysis and Knowledge Mobilization
Information Management and Knowledge Mobilization
Trusted Data Repositories for Improved Health Care Discovery
Researchers: Gerald Penn, Mark Chignell, University of Toronto
Information technologies already play a vital role in diagnostic imaging, surgical procedures, drug discovery and improving provider accessibility to patients. What is less well appreciated is that many of our decision processes in health care date back to an earlier time when data were comparatively scarce, and our ability to analyze them scarcer still. Too often, the practice of medicine reduces to playing several games of chess in parallel: in each, the attending physician visits a patient, makes a move, and then waits for results before deciding what to do next. In reality, the patient's status may change continuously and unpredictably. Hospital-acquired infections, strokes and hypotensive events are examples of important events that physicians should be aware of as soon as they happen. In addition, often-underdiagnosed syndromes like delirium need to be monitored and prevented more effectively. Massive streams of raw data are a good start, but the marginal benefit of additional raw data tapers off quickly. The best clinical evidence seems instead to be derivative and high-level: information visualizations, best-practice guidelines, meaningful summaries of patient status, and indications of resource availability (e.g., which tests are available and how long it takes to obtain them).
Researchers and policy experts also suffer from a lack of high-level evidence, although for a different reason. Hospitals, clinics and pharmacies all have their own records about patients, but these are confidential. Government ministries have still other demographic data. Multinational corporations such as Google and IBM aspire to collect even larger amounts of health-related data, and yet these data cannot be completely shared, either because of privacy concerns, or because of fears of tipping their hands to their competitors. Everyone has a piece of the puzzle, but no single stakeholder can see enough of it to identify all of the apparent epidemiological trends, drug interactions, or high-risk groups – even anonymously – that would be in everyone's interests to know about.
In both cases, what is required is a trusted third-party source with (1) the capacity to warehouse very large and heterogeneous repositories of often unstructured data, (2) the intellectual capital to identify, collect and manage complementary data sources that will maximize the benefits to patients, and (3) the neutrality to identify and engage a wide range of corporations who see a commercial value in delivering those benefits --- all while respecting the security of the data and patient privacy.
We propose to conduct the basic research necessary to build a proof of concept of this vision and to begin engaging governmental and industrial partners who would subscribe to such a service. The project will include themes on (1) information visualization, (2) requirements analysis among healthcare professionals, (3) statistical pattern recognition, (4) cloud computing and (5) the economics of trusted data repositories. We envisage a cloud-based computing architecture for receiving many streams of medical and demographic data simultaneously, archiving and indexing them, and carrying out automated analytics and visualizations. Applications could also be developed on top of this system to facilitate the work of different healthcare professionals. Nurses, for instance, could keep better track of patients who are experiencing difficulties (potentially reducing the incidence of “failure to rescue”). Emergency physicians, who might have 15 or more patients in their charge, could monitor the status of their other patients while they work with one in particular. Hospital administrators could monitor the overall situation and use general patterns of patient status to provide better measures of future bed requirements and availability, as well as to identify resource bottlenecks and the like. The architecture needs to be secure, and must have a database management system that includes data models, queries, and reports for the different applications. We anticipate this phase will take three years to complete.
Seminal References in Health Care Informatics:
Fitzgerald M, Cameron P, Mackenzie C, Farrow N, Scicluna P, Gocentas R, Bystrzycki A, Lee G, O'Reilly G, Andrianopoulos N, Dziukas L, Cooper DJ, Silvers A, Mori A, Murray A, Smith S, Xiao Y, Stub D, McDermott FT, and Rosenfeld JV (2011). Trauma resuscitation errors and computer-assisted decision support. Arch Surg, 146(2), 218-25. PMID: 21339436.
Information Processing, Analysis and Visualization of Emerging Health Risk
Researchers: Jianhong Wu,
This project aims to develop both scientific expertise and technical capacity for real-time synergetic information processing, qualitative analysis, computer simulation and visualization of emerging health risks such as emerging infectious diseases.
The spread of an infectious disease involves characteristics of the agent, the host and the environment in which transmissions take place. Science-informed public health policy requires evaluation of the agent-host environment interface and simulation of different intervention efforts to alter the interface for optimal control and management of the disease [1,2].
One of the most challenging issues in emergency response, taking an emerging infectious disease outbreak as an example, is that public policy decisions must be made and implemented in real time, under great uncertainty, and in the presence of a huge amount of information and constantly evolving knowledge from non-traditional academic venues without careful peer review. These processes call for fast information processing in a truly interdisciplinary fashion, using a wide range of technologies and involving multiple stakeholders. Mathematical modelling, statistical analysis and computer simulation facilitate the transfer of the results of information processing, which are often noisy, into scenario and trend prediction for emergency intervention design. Visualization of data analysis is critical for influencing decision-making and knowledge mobilization.
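The template compartmental models the project refers to can be illustrated by a minimal discrete-time SIR (susceptible-infected-recovered) simulation; the parameter values below are illustrative only, not calibrated to any real outbreak.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Simulate an SIR epidemic with transmission rate beta and recovery rate gamma."""
    s = float(population - initial_infected)   # susceptible
    i = float(initial_infected)                # infected
    r = 0.0                                    # recovered
    trajectory = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory

# Illustrative run: R0 = beta/gamma = 2
traj = simulate_sir(population=1_000_000, initial_infected=10,
                    beta=0.4, gamma=0.2, days=120)
peak_day = max(range(len(traj)), key=lambda t: traj[t][1])
print(f"peak infections on day {peak_day}: {traj[peak_day][1]:.0f}")
```

Altering beta in such a template is one simple way to simulate an intervention (e.g., social distancing) and visualize the resulting scenario, as described above.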
This project will be led by the Mitacs/Mprime Centre for Disease Modeling (CDM), directed by Jianhong Wu, a Canada Research Chair in Industrial and Applied Mathematics. It will be part of the LIAKM’s strategic project “Health Informatics”, and will interact with multiple themes and topics within LIAKM, especially “Applications-Oriented Informatics Development”, “Data and Text Mining” and “Information Management and Mobilization”.
The ultimate goal is to realize a Canadian "Emergency Response Modeling and Visualization Theater" with the following modules:
This project will build on York University’s experience with SARS and pandemic influenza modeling and disease projection in real time, and is in full agreement with CDM’s mandate and mission to work with other Canadian institutions and international organizations to build the Canadian capacity for interdisciplinary research on disease modeling using cutting-edge mathematical and statistical techniques. The partnership with IBM will enhance the computing and data processing power of CDM, and the on-going collaboration with partner governmental agencies and labs such as the University Health Network, Public Health Ontario and the Public Health Agency of Canada will link our designs for data processing, modeling, simulation and visualization to the existing and emerging public health surveillance systems, with impact on policy decision-making processes. The project will be conducted in parallel streams, with rough milestones below:
a) A public health (especially infectious disease) oriented data collection and processing platform, and platform-specific high-dimensional data clustering algorithms, based on the Projective Adaptive Resonance Theory (PART) neural network architecture invented by Jianhong Wu and his collaborators [1,2,3], which has been tested by international researchers in a number of applications such as text mining, gene filtering and prognostic predictor construction (years 1-2);
b) Integrated statistical and intelligence tools, as well as a wide range of template mathematical models and simulation packages, based on the CDM’s current work on pandemic influenza (emerging human diseases with novel strains), West Nile virus (mosquito-borne diseases), Lyme disease (tick-borne diseases) and avian influenza (global spread) [4,5], and on marginalized populations (years 2-3). Model parameterization and formulation will also be guided by information gathered in part a); the key is therefore the robustness and adaptivity of the model templates, and their usefulness for generating visualizations of plausible scenarios;
c) Integrated modeling, simulation and visualization technology, with a specific aim to demonstrate model-based prediction to end-users, on line and in real time (year 3).
Other projects, in collaboration with the LIAKM group, can be developed. For example, the high-dimensional data clustering algorithm (PART) can be used to develop a pilot project with the Ontario Public Health Lab for influenza strain analysis, using our partner’s gene bank.
Every stage of the development and implementation of this project will involve a large team of trainees with complementary backgrounds. A junior researcher or postdoctoral fellow will be needed to help the faculty members coordinate the team effort, and three graduate students will work on the sub-projects: health data clustering, statistical and mathematical analysis, and computer simulation and visualization. Each sub-project can involve an undergraduate student for data input, model tuning and testing, as well as computer simulations.
[1] G. Gan, C. Ma and J. Wu, Data Clustering: Theory, Algorithms and Applications, SIAM, 2007.
Privacy-preserving Distributed Data Mining in Healthcare
Researchers: Stan Matwin, M. Sokolova (Medicine), University of Ottawa
Objectives: to allow multiple parties to mine distributed datasets while preserving the privacy of the data belonging to each party. In healthcare, it is common to discover knowledge in a setting where data are distributed among a large number of parties, and a high-performance distributed and parallel computing environment is becoming crucial. This is an important problem, as it occurs in all medical multi-site randomized control-group studies. Moreover, any scalable solution to this problem is applicable to the general setting of cloud computing. Existing privacy-preserving distributed DM methods include (1) altering the data before learning starts so that real values are obscured (e.g. [1, 2]), and (2) using specialized homomorphic cryptography approaches. The first method trades off accuracy of results for privacy, and the second method may release private information through the individual classifiers. The second approach is also still in its infancy and does not scale up to the size of the datasets involved.
Research: We will work on a new distributed DM technique that learns a global rule-based classifier by requesting only data statistics from individual data sites. Our technique will be based on a new compressed data structure that stores the necessary statistics of the data at each data site. The data structure stays at each individual site and can be called by the learning algorithm to provide data statistics during the “global” learning process. The data structure has a “guard” method to check whether the statistics required (so far) by another site would violate the privacy constraints. If so, an appropriate answer with error bounds is generated and passed to the requester; otherwise, accurate information is provided. The objective of the proposed technique is to produce results that are as accurate as possible without violating the privacy constraints.
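The guard idea above can be sketched as follows. All names, the k-anonymity-style threshold, and the bounded-noise answer are hypothetical illustrations of the mechanism, not the project's actual data structure.

```python
import random
from collections import Counter

class SiteStatistics:
    """Per-site statistics store with a privacy 'guard' on released counts (sketch)."""

    def __init__(self, records, k_threshold=5, noise=2):
        # records: list of (attribute_value, class_label) pairs held locally
        self.counts = Counter(records)
        self.k = k_threshold       # counts below this are considered sensitive
        self.noise = noise         # magnitude of perturbation for sensitive counts

    def query_count(self, attribute_value, class_label):
        """Return (count, error_bound): exact when safe, perturbed when sensitive."""
        true_count = self.counts[(attribute_value, class_label)]
        if true_count == 0 or true_count >= self.k:
            return true_count, 0   # treated as safe to release exactly
        perturbed = max(0, true_count + random.randint(-self.noise, self.noise))
        return perturbed, self.noise   # requester knows the answer is +/- noise

site = SiteStatistics([("fever", "flu")] * 20 + [("rash", "measles")] * 2)
print(site.query_count("fever", "flu"))   # large count released exactly: (20, 0)
```

A global rule learner would call `query_count` across sites and aggregate the returned statistics, propagating the error bounds into its rule-quality estimates.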
The second challenge we will address in distributed DM is how to deal with dynamic data sources. In healthcare settings, new data are constantly added into individual data sites. Patterns learned from previous data may not reflect the nature of new data. How to detect and deal with “concept drifts” at individual sites is a challenging issue since we only learn global classifiers. Existing methods for concept drift detection rely on monitoring the prediction accuracy of the classifier. We will work on a method that detects concept drifts at an individual site by monitoring the changes of the data statistics stored in a compressed data structure. When a concept drift is detected, the individual site will let the others know and appropriate adjustments to the global classifier will be made.
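One simple way to monitor stored statistics for drift, rather than prediction accuracy, is to compare the label distribution of a reference window against a recent window of local data. The total-variation distance and the threshold below are illustrative choices, not the project's method.

```python
from collections import Counter

def total_variation(dist_a, dist_b):
    """Total-variation distance between two discrete distributions (dicts)."""
    labels = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(l, 0.0) - dist_b.get(l, 0.0)) for l in labels)

def detect_drift(reference_labels, recent_labels, threshold=0.3):
    """Flag concept drift when the recent label distribution diverges from the reference."""
    def normalize(labels):
        counts = Counter(labels)
        total = sum(counts.values())
        return {l: c / total for l, c in counts.items()}
    return total_variation(normalize(reference_labels), normalize(recent_labels)) > threshold

stable = detect_drift(["a"] * 50 + ["b"] * 50, ["a"] * 26 + ["b"] * 24)
drifted = detect_drift(["a"] * 50 + ["b"] * 50, ["a"] * 5 + ["b"] * 45)
print(stable, drifted)   # -> False True
```

In the distributed setting described above, a site detecting drift this way would notify the others so the global classifier can be adjusted.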
In addition to learning classifiers from distributed data, we will work on discovering other types of interesting patterns from healthcare data. Recently we have developed techniques for discovering transitional and diverging patterns from large data sets. Transitional patterns are those patterns whose frequencies increase or decrease dramatically at some time points of a transaction database. Diverging patterns are a type of contrast pattern whose frequency changes in opposite directions in two data sets, i.e., it changes from a relatively low to a relatively high value in one dataset, but from high to low in the other. Both transitional and diverging patterns have been applied to “centralized” medical data sets to find significant differences and changes between two contrast groups of patients. For example, our techniques found that two groups of patients with different demographic backgrounds tolerate the side effects of a new drug differently: one group tolerates the side effects better than the other in the long run, although at the beginning of taking the drug the first group showed more severe side effects than the second. Such patterns are very useful in practice. In this research we will extend our algorithms for finding transitional and diverging patterns to a distributed environment. We will investigate how local transitional and diverging patterns can be integrated into global ones, and also how to discover the global patterns directly from the data statistics of individual local data sites.
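The diverging-pattern definition can be made concrete with a small sketch: a pattern diverges when its frequency rises dramatically in one contrast group while falling in the other. Treating a 2x frequency change as "dramatic" is an assumption for illustration, and the drug side-effect data below is fabricated to mirror the example in the text.

```python
def frequency(pattern, transactions):
    """Fraction of transactions (sets of items) containing the pattern."""
    return sum(pattern.issubset(t) for t in transactions) / len(transactions)

def rises(early, late, factor):
    return early > 0 and late >= factor * early

def is_diverging(pattern, group_a, group_b, factor=2.0):
    """Each group is an (earlier_transactions, later_transactions) pair."""
    a_early, a_late = (frequency(pattern, half) for half in group_a)
    b_early, b_late = (frequency(pattern, half) for half in group_b)
    # diverging: rises in one group while falling (i.e., "rising backwards") in the other
    return (rises(a_early, a_late, factor) and rises(b_late, b_early, factor)) or \
           (rises(b_early, b_late, factor) and rises(a_late, a_early, factor))

effect = {"side_effect"}
group_a = ([{"drug", "side_effect"}] * 6 + [{"drug"}] * 4,   # early: 60% report it
           [{"drug", "side_effect"}] * 1 + [{"drug"}] * 9)   # late: 10%
group_b = ([{"drug", "side_effect"}] * 1 + [{"drug"}] * 9,   # opposite trend
           [{"drug", "side_effect"}] * 6 + [{"drug"}] * 4)
print(is_diverging(effect, group_a, group_b))   # -> True
```

Extending this to the distributed setting means computing the per-half frequencies from the statistics held at each site rather than from raw transactions.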
Milestones and deliverables: Both topics are highly related to information analysis and knowledge mobilization, as meaningful data mining on distributed data – ubiquitous in any modern “big data” setting – cannot be undertaken without taking into account the privacy constraints of the parties involved. We will investigate the theoretical properties and tradeoffs of the proposed approach (e.g. bounds on data quality loss, with a particular focus on probabilistic results) in year one of the project. In year two, we will implement a prototype and perform extensive empirical evaluation. Year three will be devoted to deploying the solution in a large distributed environment (the Ontario HPVC/SharcNet environment).
[1] K. Muralidhar, R. Sarathy, and R. A. Parsa, A general additive perturbation method for database security, Management Science, 45(10), 1999, 1399-1415.
Data Analytics
Researchers: Henry Kim, Dorit Nevo, Marin Litoiu, York University
Data Analytics (DA), also referred to as business intelligence, refers to a collection of decision support technologies for the enterprise aimed at enabling knowledge workers such as executives, managers, and analysts to make better and faster decisions. Obviously then, DA falls intimately within the purview of the Laboratory for Information Analysis and Knowledge Mobilization (LIAKM).
The past two decades have seen explosive growth, both in the number of products and services offered and in the adoption of these technologies by industry. This growth has been fueled by the declining cost of acquiring and storing very large amounts of data arising from sources such as customer transactions in banking, retail and e-business, RFID tags for inventory tracking, email, query logs for Web sites, blogs, and product reviews. Enterprises today collect data at a finer granularity, and therefore in much larger volumes. Businesses are leveraging their data assets aggressively by deploying and experimenting with more sophisticated data analysis techniques to drive business decisions and deliver new functionality such as personalized offers and services to customers. Today, it is difficult to find a successful enterprise that has not leveraged BI technology for its business [i].
Also important is what is going on in the Information Technology industry. IBM, arguably one of the bellwethers for the industry, has ventured beyond its traditional domain of hardware and software for developing information systems (e.g. CASE tools, RDBMSs, and compilers). It has made acquisitions in general data analytics tools such as statistical and analytical processing software, and has even acquired analytical software for financial services and supply chain management. It is leading the industry further into data analytics. In fact, IBM estimates that it has 10,000 data analytics jobs that it would like to fill, but cannot, for lack of applicants trained in data analytics.
A cogent research program requires focus. So within the broad area of data analytics, we are interested, in particular, in predictive modelling. Predictive modelling may take the form of sentiment analysis to predict election results[ii], corporate use of prediction markets to prioritize funding for projects[iii], analysis of social media sites to predict the reception of products and marketing campaigns[iv], or prediction of anomalous behaviours in financial markets and instruments[v]. The sheer volume of proprietary data that a firm may collect and store, the capability to access large volumes of public-domain data using public APIs, and the development of novel data analysis techniques are all factors that highlight a paradigm shift in the evolution of predictive modelling.
The main deliverables for the DA project are delineated along the following applications of predictive modelling:
By undertaking a program in data analytics within the Laboratory for Information Analysis and Knowledge Mobilization (LIAKM), we will help meet the marketplace’s demand for data analytics research and training for graduates.
Chaudhuri, S., Dayal, U., Narasayya, V. (2011). An Overview of Business Intelligence Technology. Communications of the ACM, 54(8), 88-98.
Text Searching and Natural Language
Researchers: Nick Cercone, Sheila Embleton, York University
Recently, Empress has had a number of requests to provide a text searching facility. Traditional database text searching has focused on the contents of text documents. These requests, however, are for searching the text meta-data (author/performer, title, publisher, etc.) associated with large binary data, in the form of text-embedded documents (.doc, .pdf, .rtf, etc.), music, video, games, and so on. This kind of searching should be familiar to anyone who visits sites like Amazon.com, BestBuy.com, and YouTube.com. The frustrations experienced by visitors to these sites should also be familiar: excessive false positives and seemingly random ordering of results. Dedicated search sites, such as Google.com or Yahoo.com, often provide better experiences, but their technologies are not readily embeddable in other applications.
In response to these requests, Empress has added a text searching facility on top of the Empress engine. The facility, however, is not a complete solution. It provides only for the storage of record identifications (ids) with sets of uninterpreted text tags, and the retrieval of record ids from a query set of tags. The applications using this facility are expected to extract tags from meta-data for storage, extract tags from user query input for retrieval, and order the results. Empress would like to provide these capabilities in the future, but recognizes that while these capabilities are tractable for a specific set of applications, they are technically challenging in the general case, as shown by experiences with existing web sites. In order to achieve better text searching solutions, Empress faces several immediate challenges, including:
Challenge #1: Extraction of tags from meta-data. The extraction of tags is subject to a variety of problems. The character code set issue can presumably be alleviated by some form of Unicode: UTF-8 or UTF-16. The linguistic issues are much more problematic, with case sensitivity, equivalent characters, the delimitation of words, plurals, noise words and synonyms being only some of the main culprits. Term ambiguity adds to the problem. Cultural issues also arise, since many groups of people may speak the same language, e.g., English, but use it differently.
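A minimal sketch of how the character-set and case issues above might be handled: normalize to a canonical Unicode form, strip combining accents back to base characters, case-fold, and split on non-word boundaries. Treating accent-stripping as character equivalence is an assumption; real equivalence policies vary by language.

```python
import re
import unicodedata

def extract_tags(text):
    """Extract lowercase, accent-folded word tags from a meta-data string."""
    text = unicodedata.normalize("NFKD", text)               # canonical decomposition
    text = "".join(ch for ch in text
                   if not unicodedata.combining(ch))         # drop combining accents
    text = text.casefold()                                   # aggressive lowercasing
    return re.findall(r"\w+", text)                          # delimit words

print(extract_tags("Café du Monde: Live!"))   # -> ['cafe', 'du', 'monde', 'live']
```

This addresses only the mechanical issues; plurals, synonyms and ambiguity need the linguistic treatments discussed under "Meeting the Challenges" below.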
Challenge #2: Design of efficient storage for meta-data tags. Tags extracted from meta-data need to be stored in an efficient structure for retrieval, maintenance and the insertion of new tags. Constructing such storage may introduce additional challenges for bulk-loading tags and storing them in a limited amount of space.
Challenge #3: Extraction of tags from the user query. This type of extraction poses challenges similar to the extraction of tags from meta-data. Some form of natural language interpretation should prove invaluable, particularly for the user query input.
Challenge #4: Ordering of results. What constitutes a better ordering is often subjective, so some form of adaptive learning and user profiling may be necessary. To perform ordering, some of the information lost in tag extraction may need to be re-retrieved from the original meta-data, which would be expensive. Some form of ordering information may have to be extracted from the meta-data along with the tags and stored.
Challenge #5: How to determine the quality of the results? Determining to what extent the results address the requirements of the user query poses another challenge in researching a better text searching facility. In general, the goal is to increase the number of true positives while minimizing false positives and false negatives.
Meeting the Challenges
We consider the five challenges for a text searching facility based on our discussions with Empress and propose the following.
Challenge 1: In addition to case sensitivity and plurals, a word may have many other inflected and derived forms. Using and differentiating all these forms in indexing and searching is neither necessary nor efficient. We can use a stemmer to find a word's root form for indexing and searching, greatly improving the precision and time performance of Empress's searching facility. Also, a stop list is generally used to remove common words in information retrieval. A stop list contains common (noise) words that are not very helpful in differentiating database records. By comparing tags from meta-data against the list, we can remove such noise words from indexing and searching. This also improves both indexing and searching performance.
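A toy illustration of the stemming-plus-stop-list pipeline. A real system would use a full stemmer such as Porter's [Porter 1980]; the few suffix rules and the tiny stop list below are only a sketch.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is"}
SUFFIXES = ["ing", "edly", "ed", "es", "s"]   # a handful of rules, not Porter's full set

def stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(tags):
    """Drop stop words, then stem the remaining tags for indexing."""
    return [stem(t) for t in tags if t not in STOP_WORDS]

print(index_terms(["indexing", "of", "the", "databases"]))   # -> ['index', 'databas']
```

Note that stems like 'databas' need not be real words; they only need to be consistent between indexing and query processing.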
For equivalent characters and synonyms, query expansion is generally used, i.e., expanding a query with equivalent words and terms.
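Query expansion can be sketched as a lookup into equivalence classes of terms; the hand-built synonym table below is purely hypothetical.

```python
# Hypothetical equivalence classes; a production system would derive these
# from a thesaurus or spelling-variant rules.
SYNONYMS = {
    "color": {"color", "colour"},
    "colour": {"color", "colour"},
    "movie": {"movie", "film"},
    "film": {"movie", "film"},
}

def expand_query(terms):
    """Replace each query term by its equivalence class (or itself if unknown)."""
    expanded = set()
    for term in terms:
        expanded |= SYNONYMS.get(term, {term})
    return expanded

print(sorted(expand_query(["color", "movie", "jazz"])))
# -> ['color', 'colour', 'film', 'jazz', 'movie']
```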
We propose the tf-idf (term frequency-inverse document frequency) weight for weighting words. This measure is widely used in information retrieval to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. That is, a word is more important to a document if it appears in the document more often, but less important if it is more common across the collection.
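The weighting just described, in its common tf * log(N/df) form (tf-idf has several smoothing variants; this is one standard formulation):

```python
import math
from collections import Counter

def tfidf(term, document, corpus):
    """tf-idf of a term in one document relative to a corpus of token lists."""
    tf = Counter(document)[term] / len(document)     # term frequency in the document
    df = sum(term in doc for doc in corpus)          # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency
    return tf * idf

docs = [["heart", "rate", "monitor"],
        ["heart", "surgery"],
        ["rate", "of", "change"]]
# "monitor" appears in only one of three documents, so it weighs more in doc 0
# than "heart", which appears in two.
print(round(tfidf("monitor", docs[0], docs), 3))   # -> 0.366
```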
Challenge 2: NAND flash has become the most popular storage for embedded systems, owing to its shock resistance, density and read performance, but it exhibits specific hardware constraints: read/write operations are done at page granularity; writes are more time- and energy-consuming than reads; a page cannot be rewritten before erasing the complete block containing it; and a block wears out after about 10^5 write/erase cycles. On the other hand, embedded systems generally provide very limited RAM (typically 128MB). These characteristics require database storage, maintenance and retrieval techniques that take the specific constraints into account. Some database models have been proposed for flash storage [4, 5]. They decrease the write cost, which is considered the main problem in using flash for databases, by logging index updates and performing grouped updates. They also perform out-of-place updates to reduce flash usage, address translation and garbage collection overheads. A sequential indexing scheme has been proposed for flash-based embedded systems in which only sequential writes and page-level reads are used to update the index, since flash storage is best suited to sequential writes and page-level reads. That is, database updating is only performed at the end of the database. Retrieval is sped up by summarization and partitioning. We will investigate which available schemes, or which combination of features, are best for the Empress search facility.
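The sequential-write idea can be sketched as follows: index entries are buffered in RAM and appended to the log only in immutable page-sized batches, never rewritten in place, while a small per-page summary kept in RAM narrows which pages a lookup must read. The page size and the structures here are illustrative, not any published scheme.

```python
PAGE_SIZE = 4   # entries per flash page (illustrative)

class SequentialIndex:
    """Append-only index sketch: sequential page writes, summary-guided reads."""

    def __init__(self):
        self.buffer = []      # RAM write buffer (not yet on flash)
        self.pages = []       # "flash": list of immutable, sequentially written pages
        self.summaries = []   # per-page set of keys, kept in RAM to skip page reads

    def insert(self, key, record_id):
        self.buffer.append((key, record_id))
        if len(self.buffer) == PAGE_SIZE:                 # flush one full page, sequentially
            self.pages.append(tuple(self.buffer))
            self.summaries.append({k for k, _ in self.buffer})
            self.buffer = []

    def lookup(self, key):
        hits = [rid for k, rid in self.buffer if k == key]
        for page, summary in zip(self.pages, self.summaries):
            if key in summary:                            # summary rules out most pages
                hits.extend(rid for k, rid in page if k == key)
        return hits
```

Grouping updates into full pages avoids the costly in-place rewrites described above, at the price of scattering a key's postings across pages, which the summaries partially compensate for.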
Challenge 3: For the query input, in addition to query expansion, query understanding should be helpful. Query understanding may include syntactic and semantic analysis, whose results would be very helpful in weighting query terms properly.
Challenge 4: It would be too expensive to use non-indexed words for ranking. Index words should be chosen before indexing: any words that would greatly influence ranking should be selected for indexing. All ranking should be done based on the established index. We propose to search and rank results based on BM25, a widely used ranking function. The retrieved results would be ranked in order of their probability of relevance to the query. We may need to experimentally adjust some parameters in BM25 for best performance in different problem domains. To further improve ranking quality, user profiling would be helpful.
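BM25 in its standard form; `k1` and `b` are the tunable parameters the text mentions adjusting per domain, and the small corpus below is fabricated for illustration.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one document (a token list) for a list of query terms."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n        # average document length
    counts = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)        # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = counts[term]                          # a term with tf == 0 contributes nothing
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["jazz", "piano", "trio"],
        ["rock", "guitar"],
        ["jazz", "guitar", "solo", "album"]]
best = max(docs, key=lambda d: bm25_score(["jazz", "guitar"], d, docs))
print(best)   # -> ['jazz', 'guitar', 'solo', 'album']
```

The length-normalization term `b` is what keeps long meta-data records from dominating the ranking, which matters for the heterogeneous tag sets described above.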
Challenge 5: We might not be able to determine the quality of the results exactly, but we can estimate it from user questionnaires. The nearer the sampled users are to the potential target users, and the larger the sample used, the closer the estimate will be to the true quality.
M. F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3), 130-137 (also in the ACL repository for program resources: http://www.aclweb.org/).
Copyright ©2012 LIAKM, Toronto, Canada