Open Topics: MSc Theses and Project Topics

Comparison of Quantification methods (reserved)

Quantification is a lesser-known variant of the machine learning task of classification: rather than labeling individual instances, it aims to estimate the class proportions in partially labeled data. The goal of this thesis is to summarize, implement, compare, and possibly extend existing methods for this task. The techniques will also be applied to a real-world problem: estimating viewer motivation on Wikipedia articles.
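
To give a feel for the task, here is a minimal sketch of two simple baselines discussed in the introductory reading: "classify and count" and its "adjusted count" correction. The function names and the toy numbers are illustrative assumptions; the true and false positive rates would normally be estimated on labeled data, e.g., by cross-validation.

```python
def classify_and_count(predictions):
    """Fraction of items a classifier predicted as positive (labels are 0/1)."""
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the observed positive rate using the classifier's error rates.

    tpr and fpr are the true and false positive rates of the classifier,
    assumed here to be known from labeled data.
    """
    if tpr <= fpr:
        raise ValueError("adjusted count needs a classifier with tpr > fpr")
    observed = classify_and_count(predictions)
    adjusted = (observed - fpr) / (tpr - fpr)
    # The correction can leave [0, 1], so clip to a valid proportion.
    return min(1.0, max(0.0, adjusted))


# Toy example: 40 of 100 unlabeled items were predicted positive.
toy_predictions = [1] * 40 + [0] * 60
naive_estimate = classify_and_count(toy_predictions)
corrected_estimate = adjusted_count(toy_predictions, tpr=0.8, fpr=0.1)
```

The naive estimate is biased whenever the classifier is imperfect and the class distribution shifts between training and test data; the adjusted variant compensates for this under the assumption that the error rates stay the same.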

Introductory Reading: Forman, George. "Quantifying counts and costs via classification." Data Mining and Knowledge Discovery 17.2 (2008): 164-206.

See also: https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour

Contact: Dr. Florian Lemmerich

Concept embeddings for Wikipedia across language editions

Embeddings are an approach to capturing the meaning of a concept, e.g., a word, based on the contexts in which it appears. State-of-the-art methods (such as word2vec) are based on deep learning. In this thesis, we want to apply such techniques to compute embeddings of Wikipedia articles in multiple language editions, based on their text and their link structure. We then want to compare the embeddings across the language editions.
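
Embeddings trained separately per language edition live in different vector spaces, so their coordinates cannot be compared directly. One possible alignment-free comparison, sketched below, checks whether a concept has the same nearest neighbors in both spaces. This is only an illustration, not the method prescribed for the thesis: the function names, the toy vectors, and the assumption that articles are already matched across editions (e.g., via interlanguage links) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(embeddings, concept, k):
    """The k concepts most similar to `concept` within one embedding space."""
    others = [c for c in embeddings if c != concept]
    others.sort(key=lambda c: cosine(embeddings[concept], embeddings[c]),
                reverse=True)
    return set(others[:k])


def neighborhood_overlap(emb_a, emb_b, concept, k=1):
    """Jaccard overlap of a concept's neighborhoods in two embedding spaces."""
    neighbors_a = nearest_neighbors(emb_a, concept, k)
    neighbors_b = nearest_neighbors(emb_b, concept, k)
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0


# Toy spaces keyed by a shared concept id; dimensions need not match.
emb_en = {"cat": [1.0, 0.0], "dog": [0.9, 0.1],
          "car": [0.0, 1.0], "bus": [0.1, 0.9]}
emb_de = {"cat": [0.0, 1.0, 0.0], "dog": [0.1, 0.9, 0.0],
          "car": [1.0, 0.0, 0.0], "bus": [0.9, 0.1, 0.1]}
overlap_cat = neighborhood_overlap(emb_en, emb_de, "cat", k=1)
```

An overlap close to 1 suggests that a concept is embedded in a similar context in both editions, while a low overlap flags concepts that may be treated differently across languages.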

Introductory Reading:
Bengio, Yoshua, et al. "A neural probabilistic language model." Journal of machine learning research 3.Feb (2003): 1137-1155.
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. "Diachronic word embeddings reveal statistical laws of semantic change." arXiv preprint arXiv:1605.09096 (2016).
Sherkat, Ehsan, and Evangelos E. Milios. "Vector embedding of wikipedia concepts and entities." International Conference on Applications of Natural Language to Information Systems. Springer, Cham, 2017.

Contact: Dr. Florian Lemmerich

Toxic behavior in Online Gaming (reserved)

Multiplayer online gaming has become a multi-billion-dollar industry. One common problem that affects user experience is toxic behavior between players, i.e., players complaining about and insulting each other. In this thesis, we want to explore the consequences of such behavior based on a chat dataset from the game Dota 2 (and possibly other datasets). For example, we want to analyze whether toxicity is contagious (are players exposed to bad behavior more likely to behave badly themselves?), whether toxic behavior influences in-game performance, and whether players exposed to toxic behavior are more likely to quit.

Introductory Reading: Kwak, Haewoon, and Jeremy Blackburn. "Linguistic analysis of toxic behavior in an online video game." International Conference on Social Informatics. Springer, Cham, 2014.

See also: https://blog.opendota.com/2017/03/24/datadump2/

Contact: Dr. Florian Lemmerich

Redescription Exceptional Model Mining

Redescription Mining and Exceptional Model Mining are two data mining techniques in the area of pattern mining. Redescription Mining seeks pairs of descriptions that describe the same instances. Exceptional Model Mining aims to identify descriptions of data subsets for which the parameters of a model class deviate significantly. In this thesis, we want to combine these two approaches to develop a new mining method that finds pairs of descriptions with similar model parameters.
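
The two building blocks can be illustrated with a small sketch. It assumes, for illustration only, that descriptions are Boolean predicates over rows, that redescription quality is measured by the Jaccard similarity of the covered instances, and that the model class is the Pearson correlation between two target attributes. All names and the toy data are hypothetical.

```python
import math


def support(description, rows):
    """Indices of the rows covered by a description (a Boolean predicate)."""
    return {i for i, row in enumerate(rows) if description(row)}


def jaccard(desc_a, desc_b, rows):
    """Redescription quality: how closely two descriptions cover the same rows."""
    a, b = support(desc_a, rows), support(desc_b, rows)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_deviation(description, rows, x_key, y_key):
    """Exceptionality: correlation inside the subgroup vs. its complement."""
    inside = [r for r in rows if description(r)]
    outside = [r for r in rows if not description(r)]
    r_in = pearson([r[x_key] for r in inside], [r[y_key] for r in inside])
    r_out = pearson([r[x_key] for r in outside], [r[y_key] for r in outside])
    return abs(r_in - r_out)


# Toy data: two descriptive attributes and two target attributes.
rows = [
    {"age": 25, "student": True, "x": 1, "y": 1},
    {"age": 22, "student": True, "x": 2, "y": 2},
    {"age": 28, "student": True, "x": 3, "y": 3},
    {"age": 45, "student": False, "x": 1, "y": 3},
    {"age": 50, "student": False, "x": 2, "y": 2},
    {"age": 60, "student": False, "x": 3, "y": 1},
]
is_young = lambda r: r["age"] < 30
is_student = lambda r: r["student"]

same_instances = jaccard(is_young, is_student, rows)
deviation = correlation_deviation(is_young, rows, "x", "y")
```

In this toy data the two descriptions cover exactly the same rows, and the correlation inside the subgroup is the opposite of that outside it. The proposed combination would instead score a pair of descriptions by how similar their fitted model parameters are, which is the open design question of the thesis.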

Introductory Reading:
Zaki, Mohammed J., and Naren Ramakrishnan. "Reasoning about sets using redescription mining." Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005.
Leman, Dennis, Ad Feelders, and Arno Knobbe. "Exceptional model mining." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg, 2008.

Contact: Dr. Florian Lemmerich