In my dissertation, I developed a novel approach to learning from data that
formulates the learning task as the manipulation of answers to a sequence of
statistical queries posed against the data source. This formulation reduces the
problem of learning from distributed data to the problem of answering
statistical queries from distributed data. I identified the
statistical queries needed by a broad class of machine learning algorithms and
developed provably sound approaches to answering such queries from distributed
data. This approach offers a general strategy for transforming traditional
algorithms for learning from data into algorithms for learning from distributed
data.
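As a minimal illustration of how a learning step can be phrased purely in terms of answers to statistical queries, the sketch below computes the information gain used by decision tree induction from joint counts alone, without touching individual records. The function names and the toy counts are assumptions made for this example and are not taken from the dissertation.

```python
import math
from collections import Counter


def entropy(class_counts):
    """Entropy (in bits) of a class distribution given as label -> count."""
    total = sum(class_counts.values())
    return -sum(
        (c / total) * math.log2(c / total)
        for c in class_counts.values()
        if c > 0
    )


def information_gain(joint_counts):
    """Information gain of one attribute, computed only from counts.

    joint_counts maps (attribute_value, class_label) -> count, i.e. the
    answer to a count query against the data source.
    """
    class_counts = Counter()
    per_value = {}
    for (value, label), c in joint_counts.items():
        class_counts[label] += c
        per_value.setdefault(value, Counter())[label] += c
    total = sum(class_counts.values())
    # Expected entropy remaining after splitting on the attribute.
    remainder = sum(
        (sum(vc.values()) / total) * entropy(vc) for vc in per_value.values()
    )
    return entropy(class_counts) - remainder


# Toy count-query answers (hypothetical values).
perfect_split = {("sunny", "yes"): 2, ("rainy", "no"): 2}
useless_split = {
    ("a", "yes"): 1, ("a", "no"): 1,
    ("b", "yes"): 1, ("b", "no"): 1,
}
gain_perfect = information_gain(perfect_split)
gain_useless = information_gain(useless_split)
```

Because the learner consumes only the counts, it does not matter to it whether those counts were gathered from a single table or assembled from several sites.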
I developed approaches to decomposing a statistical query posed by the learner
into sub-queries directed at the individual data sources, finding the best
execution plan, executing that plan, and composing the answers returned by the
data sources into a final answer to the original query. My dissertation
illustrates the application of this strategy to devise
several algorithms (Naïve Bayes, Decision Trees, Perceptron, Support Vector
Machines and k-NN) for induction of classifiers from horizontally and vertically
fragmented distributed data. The resulting algorithms are provably exact in that
the classifiers produced by them are identical to those obtained by the
corresponding algorithms in the centralized setting (i.e., when all of the data
is available in a central location). This ensures that the entire body of
theoretical (e.g., sample complexity, error bounds) and empirical results
obtained in the centralized setting carries over to the distributed setting. The
proposed algorithms compare favorably to their centralized counterparts in terms
of time and communication complexity in many scenarios that are common in
practice.
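The decompose-and-compose idea can be sketched for the simplest case, a count query over horizontally fragmented data, where each site holds a disjoint subset of the rows. This is an illustrative sketch only: the function names, the in-memory lists standing in for remote sites, and the sample rows are assumptions, and query planning and vertical fragmentation are not covered.

```python
from collections import Counter


def local_counts(fragment, attribute, class_attribute):
    """Sub-query answered at one site: joint counts over its own rows."""
    return Counter((row[attribute], row[class_attribute]) for row in fragment)


def answer_count_query(fragments, attribute, class_attribute):
    """Compose per-site answers; for horizontal fragments, counts simply add."""
    total = Counter()
    for fragment in fragments:
        total.update(local_counts(fragment, attribute, class_attribute))
    return total


# Two hypothetical sites holding disjoint sets of rows.
site_a = [
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "rainy", "play": "no"},
]
site_b = [
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "sunny", "play": "no"},
]

distributed = answer_count_query([site_a, site_b], "outlook", "play")
centralized = local_counts(site_a + site_b, "outlook", "play")
```

In this toy setting the composed answer equals the answer computed over the pooled rows, so any learner driven by these counts (for example Naïve Bayes) would produce the same classifier in both settings. Only the counts, not the rows, need to leave each site.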
I extended this approach to handle semantically heterogeneous distributed data,
where semantic differences between the data sources and the user are
inevitable. To deal with semantic heterogeneity, I
introduced ontology extended data sources (with explicit representation of the
ontology associated with each data source). In my dissertation, I showed how
user-specified interoperation constraints between the user ontology and the
data-source-specific ontologies can be used to define the mappings and conversion
functions needed to answer statistical queries from semantically heterogeneous
data sources from the user's perspective, in the important special case where
ontologies take the form of attribute value hierarchies. The resulting approach
to learning classifiers from semantically heterogeneous data allows flexible
data integration on demand during the process of answering statistical queries
needed by a learning task. This approach is especially well suited for
scientific applications that call for exploratory analysis of semantically
heterogeneous data sources from different perspectives.
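A small sketch may help make the attribute value hierarchy case concrete. It assumes a hypothetical weather hierarchy, a hypothetical mapping from one source's vocabulary into that hierarchy, and a user who asks for counts at a coarser level of abstraction. It handles only source values that sit at or below the user's chosen level; values recorded at a coarser level than the user's are skipped here.

```python
from collections import Counter

# Hypothetical attribute value hierarchy, stored as child -> parent.
PARENT = {
    "rain": "precipitation",
    "snow": "precipitation",
    "sunny": "clear",
    "precipitation": "weather",
    "clear": "weather",
}

# Hypothetical mapping from one source's terms to terms in the hierarchy.
SOURCE_TO_USER = {"Rainy": "rain", "Snowy": "snow", "Sunny": "sunny"}


def lift(value, user_level, parent):
    """Walk up the hierarchy until a value at the user's level is reached."""
    while value is not None and value not in user_level:
        value = parent.get(value)
    return value  # None if no ancestor lies at the user's level


def count_from_user_perspective(source_values, mapping, user_level, parent):
    """Answer a count query in the user's vocabulary and level of abstraction."""
    counts = Counter()
    for v in source_values:
        user_value = lift(mapping.get(v, v), user_level, parent)
        if user_value is not None:
            counts[user_value] += 1
    return counts


observations = ["Rainy", "Snowy", "Sunny", "Rainy"]
user_counts = count_from_user_perspective(
    observations, SOURCE_TO_USER, {"precipitation", "clear"}, PARENT
)
```

The integration happens while the query is answered: the source data stay in their own vocabulary, and a different user could supply a different mapping or level of abstraction without the data being rewritten.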
During the next few years, I plan to explore:
1) Extensions of my approach to learning from semantically heterogeneous data to
settings where ontologies take more general forms than the attribute value
hierarchies treated in detail in my dissertation.
2) Extensions of my approach to learning classifiers from semantically
heterogeneous data to learning complex relationships from semantically
heterogeneous multi-relational data.
3) Provably approximate algorithms for learning classifiers from distributed
semantically heterogeneous data in settings where the computational requirements
of provably exact algorithms are prohibitive.
4) Sufficient-statistics-based approaches to learning from data streams where it
is infeasible to store the data.
5) Application of the algorithms for learning classifiers from semantically
heterogeneous, distributed data to knowledge acquisition tasks that arise in
computational biology (e.g., protein function/structure prediction,
protein-protein interaction site prediction).
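To indicate what item 4 might look like in its simplest form, the sketch below maintains the count statistics used by a Naïve Bayes learner incrementally, so each example can be discarded after it is seen. The class name, method names, and sample stream are hypothetical, and this is an assumption about one possible starting point rather than a description of a planned algorithm.

```python
from collections import Counter


class StreamingCounts:
    """Count statistics updated one example at a time; no example is stored."""

    def __init__(self):
        self.class_counts = Counter()
        self.joint_counts = Counter()

    def update(self, features, label):
        """Fold one (features, label) example into the running counts."""
        self.class_counts[label] += 1
        for attribute, value in features.items():
            self.joint_counts[(attribute, value, label)] += 1

    def conditional(self, attribute, value, label):
        """Relative-frequency estimate of P(attribute = value | label).

        Assumes at least one example with this label has been seen.
        """
        return self.joint_counts[(attribute, value, label)] / self.class_counts[label]


stats = StreamingCounts()
stream = [
    ({"outlook": "sunny"}, "yes"),
    ({"outlook": "rainy"}, "yes"),
    ({"outlook": "sunny"}, "no"),
]
for features, label in stream:
    stats.update(features, label)

p_sunny_given_yes = stats.conditional("outlook", "sunny", "yes")
```

Memory use in this sketch grows with the number of distinct (attribute, value, label) combinations rather than with the length of the stream.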
Research in the area of knowledge acquisition from semantically heterogeneous
data sources is still in its infancy. It presents research challenges that span
multiple areas of computer science including machine learning, knowledge
representation and inference, databases, information integration, and
statistical computing. My early contributions to this area put me in an
excellent position to tackle some of these challenges.