Challenges and Open Research Questions
Machine learning (ML) can generally be understood as the study and design of algorithms that build models from data to solve complex problems. Three problems in particular remain open:
First, one often encounters problems for which purely data-driven models fall short and data-independent information has to be injected into the learning process. This occurs, for example, when not enough data is available to train a model that generalizes well, or when the model needs to meet some problem-specific constraint. How to incorporate such prior knowledge into the ML pipeline in a principled way is one open problem in current ML research.
Second, the performance of ML models strongly depends on which representations of the data are used to build them. ML models should be able to uncover meaningful and causal factors of variation (representations) in the data, from which one can understand and reason about the data-generating mechanisms. How to drive models to learn meaningful representations automatically is another fundamental question in ML.
Third, many modern ML models and algorithms are themselves complex systems, whose theoretical analysis is fundamental for their interpretability and explainability. Devising techniques to formally characterize training and generalization in these systems is a third longstanding goal in ML.
Our Hybrid ML research is split into three areas, Informed ML, Representation Learning and Theoretical ML, each of which addresses one of these problems, in that order.
Current Research Highlights and Activities
Our Informed ML research addresses a variety of complex problems in fields ranging from natural language processing and computer vision to astrophysics.
- A Taxonomy and Survey of Informed ML approaches:
We have recently proposed a definition for informed ML as learning from hybrid sources that consist of data and additional prior knowledge. Our contribution introduces a taxonomy that classifies informed ML approaches according to the source of knowledge, its representation, and its integration into the ML pipeline.
- Using Generative Adversarial Networks to estimate response times:
An example of informed ML is our recently developed deep generative model for response times, suitable for the large datasets typical of modern service providers. Our approach is informed by queueing theory and uses Wasserstein Generative Adversarial Networks to learn response-time distributions conditioned on the continuous arrival process. This work constitutes a first bridge between queueing systems and deep generative modelling.
- Neural network modelling of environmental stresses in agriculture:
Another informed ML highlight is our method to combine expert knowledge with data-driven techniques for modelling in agriculture, which won first place at the 2019 Syngenta Crop Challenge/Data Science Competition.
- Information-extraction and knowledge base population algorithms from structured documents via probabilistic soft logic and neural networks
- Scene understanding from primitive shapes, object detection and pose estimation
- Knowledge-constrained learning from astroparticle simulations
- Neural-based knowledge graph reasoning
- Password-guessing algorithms via variational autoencoders
- Learning similarity functions for structured data
- Predicting strain-stretch behavior of non-woven fiber materials
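One common way to integrate prior knowledge, in the sense of the taxonomy described above, is through the loss function. The following is a minimal sketch of that idea and not the method of any of the projects listed here. It assumes a toy setting: three noisy data points, a linear model, and the hypothetical prior knowledge that the true relationship passes through the origin, f(0) = 0. The function name `fit_line`, the parameter `penalty_weight` and the data values are all illustrative assumptions.

```python
def fit_line(xs, ys, penalty_weight=0.0, lr=0.05, steps=5000):
    """Fit y ~ w*x + b by gradient descent on MSE + penalty_weight * b**2.

    The penalty term encodes the assumed prior knowledge f(0) = 0,
    since f(0) = b for a linear model.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Gradient of the knowledge-based penalty penalty_weight * b**2.
        grad_b += 2 * penalty_weight * b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical scarce, noisy data.
xs = [1.0, 2.0, 3.0]
ys = [2.6, 4.1, 6.5]

w_data, b_data = fit_line(xs, ys)                      # purely data-driven
w_inf, b_inf = fit_line(xs, ys, penalty_weight=10.0)   # data plus prior knowledge

print(f"data-driven: w={w_data:.3f}, b={b_data:.3f}")
print(f"informed:    w={w_inf:.3f}, b={b_inf:.3f}")
```

With the penalty switched on, the fitted intercept is pulled towards zero, so the model respects the assumed constraint far more closely than the purely data-driven fit does. A soft penalty is only one integration route; prior knowledge can also enter through the training data, the hypothesis set, or the final hypothesis.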
Our Representation Learning research focuses mainly on representation learning for natural language processing.
- Imposing tree structures onto vector representations:
We have successfully imposed tree-structured category information onto vector embeddings by promoting the latter into N-balls in a higher-dimensional space, so that (i) the vector embeddings are well preserved, and (ii) symbolic tree structures are encoded, without error, by inclusion relations among balls. This approach inherits good features from both symbolic and deep learning techniques.
- Learning representations for dynamic language models:
We build on top of neural network models for text processing to incorporate temporal information and model how review data changes with time. Specifically, we use the dynamic representations of recurrent point process models, which encode the history of how business or service reviews are received in time, to generate instantaneous language models with improved prediction capabilities.
- Ball embedding for knowledge representation and reasoning
- Representation learning for dynamic language models
- Deep generative models of semantic structure for natural language understanding
- Disentangled representations for language modelling and controlled text generation
- Neural-based variational inference for stochastic processes in continuous-time
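The ball-inclusion encoding mentioned in the N-ball highlight above can be illustrated with a small geometric check. This is a sketch under stated assumptions, not the published construction: each ball is represented as a hypothetical `(center, radius)` pair, the toy category tree and its coordinates are invented for illustration, and the requirement that sibling categories are disjoint is an assumption added here to make the tree unambiguous.

```python
import math


def contains(outer, inner):
    """True if ball `inner` lies entirely inside ball `outer`.

    A ball is a (center, radius) pair; inclusion holds exactly when the
    distance between centers plus the inner radius fits in the outer radius.
    """
    (c_out, r_out), (c_in, r_in) = outer, inner
    return math.dist(c_out, c_in) + r_in <= r_out


def disjoint(a, b):
    """True if the two balls do not overlap (touching counts as disjoint)."""
    (c_a, r_a), (c_b, r_b) = a, b
    return math.dist(c_a, c_b) >= r_a + r_b


# Hypothetical 2-D balls for a toy category tree: animal -> {dog, cat}.
balls = {
    "animal": ((0.0, 0.0), 10.0),
    "dog": ((3.0, 0.0), 2.0),
    "cat": ((-3.0, 0.0), 2.0),
    "plant": ((30.0, 0.0), 5.0),
}

print(contains(balls["animal"], balls["dog"]))    # child inside parent
print(contains(balls["animal"], balls["plant"]))  # unrelated category
print(disjoint(balls["dog"], balls["cat"]))       # siblings do not overlap
```

Reading the tree back from the geometry then reduces to inclusion tests: a category is an ancestor of another whenever its ball contains the other's ball.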
Our Theoretical ML research studies different aspects of the generalization and training of deep and wide neural networks.
- Heating up data spaces to understand decision problems:
We have recently studied classifiers’ decision boundaries via Brownian motion processes in ambient data space and associated probabilistic techniques. Intuitively, our ideas correspond to placing a heat source at the decision boundary and observing how effectively the sample points warm up. Our technique is not equivalent to well-known methods for addressing decision boundary geometry. In particular, it leads to refinements and improvements of previous state-of-the-art decision boundary analysis, as well as new curvature estimates and more detailed boundary density estimation.
- Identifying structures in large graphs:
We have adapted sampling techniques from mathematical combinatorics to the problem of probabilistic subtree mining in arbitrary databases of many small to medium-size graphs or a single large graph. Our method approximately counts or decides subtree isomorphism for arbitrary transaction graphs in sub-linear time with one-sided error. This allowed us to obtain approximate counts, in graphs with millions of vertices and edges, for patterns more than five times as large as those handled by the current state of the art.
- Stable rank and spectral norm effects on the geometry and expressiveness of deep neural networks
- Out-of-equilibrium dynamics of training in wide neural networks
- Training dynamics of deep generative models via gradient flows in probability space
- Maximal closed set and half-space separations in finite closure systems
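The heat-source intuition from the decision-boundary highlight above can be sketched with a simple Monte Carlo simulation. Starting Brownian paths at a sample point and counting the fraction that reach the decision boundary within a fixed time gives an estimate of how much that point "warms up" when the boundary is held at unit temperature. The sketch below is an illustration under assumptions, not the published method: the toy `classifier` (label 1 inside the unit disc), the function name `hit_probability` and all parameter values are hypothetical, and the paths are discretized random walks with Gaussian steps.

```python
import math
import random


def classifier(point):
    """Hypothetical toy classifier: label 1 inside the unit disc, else 0."""
    x, y = point
    return 1 if x * x + y * y < 1.0 else 0


def hit_probability(start, dt=0.005, n_steps=100, n_walks=1000, seed=0):
    """Estimate the probability that a Brownian path started at `start`
    crosses the decision boundary within time dt * n_steps.

    A crossing is detected when the predicted label of the walker differs
    from the label of the starting point.
    """
    rng = random.Random(seed)
    start_label = classifier(start)
    step_std = math.sqrt(dt)  # standard deviation of one Brownian increment
    hits = 0
    for _ in range(n_walks):
        x, y = start
        for _ in range(n_steps):
            x += rng.gauss(0.0, step_std)
            y += rng.gauss(0.0, step_std)
            if classifier((x, y)) != start_label:
                hits += 1
                break
    return hits / n_walks


near = hit_probability((0.9, 0.0))  # point close to the boundary
far = hit_probability((0.0, 0.0))   # point at the centre of the disc

print(f"near the boundary: {near:.2f}")
print(f"far from boundary: {far:.2f}")
```

Points close to the boundary warm up faster than points far from it, so the estimated probability acts as a soft, geometry-aware measure of distance to the decision boundary.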
Recommended Reading and Selected Publications
C. Ojeda, K. Cvejoski, B. Georgiev, C. Bauckhage, J. Schuecker, R. J. Sanchez: Learning Deep Generative Models for Queueing Systems. AAAI, 2021.
L. von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, J. Pfrommer, A. Pick, R. Ramamurthy, M. Walczak, J. Garcke, C. Bauckhage, J. Schuecker: Informed Machine Learning – A Taxonomy and Survey of Integrating Knowledge into Learning Systems. IEEE, 2020.
P. Welke, T. Horvath, S. Wrobel: Probabilistic and Exact Frequent Subtree Mining in Graphs Beyond Forests. Machine Learning 108(7), 2019.