The aim of this bachelor thesis is to empirically test whether classification can improve the topic models Latent Dirichlet Allocation (LDA) and Author Topic Modeling (ATM) in the context of the social media platform Twitter. For this purpose, a corpus was classified with the Dewey Decimal Classification (DDC) and then used to train the topic models; a second, unclassified corpus was used for comparison. The assumption that classification improves the topic models did not hold for LDA, where no sufficient improvement of the models could be achieved. The ATM model, on the other hand, could be improved by using the classification. Overall, the ATM model performed significantly better than the LDA model. In the context of the social media platform Twitter, the ATM model thus proves superior to the LDA model and can additionally be improved by classifying the data.
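As a minimal sketch of how such a comparison can be set up, the following uses gensim's LdaModel and AuthorTopicModel on a toy corpus; the three example documents and the author map merely stand in for the (un)classified tweet corpus and are illustrative assumptions, not the thesis' data or configuration.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, AuthorTopicModel

    # Toy documents standing in for preprocessed tweets.
    docs = [["climate", "policy", "emissions"],
            ["football", "league", "goal"],
            ["climate", "emissions", "carbon"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # Plain LDA: topics are inferred per document.
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

    # ATM: topics are additionally conditioned on authors (Twitter users).
    author2doc = {"user_a": [0, 2], "user_b": [1]}
    atm = AuthorTopicModel(corpus, num_topics=2, id2word=dictionary,
                           author2doc=author2doc, random_state=0)

    print(lda.print_topics())
    print(atm.get_author_topics("user_a"))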
Art-related non-fungible tokens (NFTs) took the digital art space by storm in 2021, generating massive trading volume and attracting a large number of users to a previously obscure part of blockchain technology. Still, very little is known about the attributes that influence the price of these digital assets. This paper attempts to evaluate the level of speculation associated with art NFTs, to understand the characteristics that confer value on them, and to design a profitable trading strategy based on our findings. We analyze 860,067 art NFTs that have been deployed on the Ethereum blockchain and have been involved in 317,950 sales, using machine learning methods to forecast the probability of sale, the trade frequency and the average price. We find that NFTs are highly speculative assets and that their price and recurrence of sale are heavily determined by the floor and last sale prices, independent of any fundamental value.
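A hedged sketch of the forecasting setup described above: a gradient-boosted classifier predicting the probability of sale from floor and last-sale prices. The synthetic features and the label rule are illustrative assumptions, not the paper's data or model.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)
    floor = rng.lognormal(0.0, 1.0, 5000)            # collection floor price (ETH)
    last = floor * rng.lognormal(0.0, 0.5, 5000)     # last sale price (ETH)
    X = np.column_stack([floor, last])
    sold = (last < 2 * floor).astype(int)            # synthetic "was resold" label

    clf = GradientBoostingClassifier().fit(X, sold)
    print("P(sale) for first three tokens:", clf.predict_proba(X[:3])[:, 1])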
Central bank intervention in the form of quantitative easing (QE) during times of low interest rates is a controversial topic. The author introduces a novel approach to study the effectiveness of such unconventional measures. Using U.S. data on six key financial and macroeconomic variables between 1990 and 2015, the economy is estimated by artificial neural networks. Historical counterfactual analyses show that real effects are less pronounced than yield effects.
Impulse response functions that disentangle the effects of the individual asset purchase programs provide evidence that QE becomes less effective the further the crisis is overcome. The peak effects of all QE interventions during the Financial Crisis amount to only 1.3 pp for GDP growth and 0.6 pp for inflation, respectively. Hence, both the timing and the volume of such interventions should be deliberated carefully.
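The counterfactual logic can be sketched as follows: fit a network on observed dynamics including a policy indicator, then re-simulate the path with the indicator switched off and read the effect from the gap. Everything below (the AR(1)-style data, the single QE dummy, the network size) is a toy assumption, not the author's six-variable specification.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    T = 120
    qe = np.where(np.arange(T) > 80, 1.0, 0.0)       # stylized QE indicator
    y = np.zeros(T)
    for t in range(1, T):                            # synthetic macro series
        y[t] = 0.8 * y[t - 1] + 0.5 * qe[t] + 0.1 * rng.standard_normal()

    X = np.column_stack([y[:-1], qe[1:]])            # lagged state + policy
    model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                         random_state=0).fit(X, y[1:])

    # Counterfactual: feed the model its own predictions with QE set to zero.
    y_cf = y.copy()
    for t in range(81, T):
        y_cf[t] = model.predict([[y_cf[t - 1], 0.0]])[0]
    print("peak QE effect (pp):", float((y - y_cf).max()))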
Clothing modeling is concerned with designing clothing for persons who can, for example, be depicted in rendered scenes, with the design drawing on information from an underlying data source. Rendering scenes in which persons appear is fundamentally an interplay of complex sub-aspects, and how convincing a modeled scene or modeled avatars appear in the eye of the beholder is determined to a large degree by suitably chosen clothing.
This thesis presents approaches and methods for clothing modeling based on text documents. Ways of extracting information from texts and using it for the modeling are discussed.
To address this task, a contextual model known from machine learning is first trained for multi-class classification and applied. Subsequently, a dedicated knowledge resource that deals with clothing on the textual level is constructed and populated with extensive information from existing resources. The new resource is designed as a graph database, in which relations between the individual elements are created with the help of static models as well as a contextual model, the BERT model. Finally, based on the developed graph database, a program written in the Python programming language is presented that processes input texts using the information and relations within the graph database and detects garments.
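A minimal sketch of the multi-class classification step using the Hugging Face transformers library; the German BERT checkpoint, the three garment labels and the example sentence are assumptions for illustration, not the thesis' actual setup.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Hypothetical setup: classify a sentence into one of three garment classes.
    # The freshly added classification head still needs fine-tuning on labeled data.
    tok = AutoTokenizer.from_pretrained("bert-base-german-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-german-cased", num_labels=3)      # e.g. coat / dress / shoe

    inputs = tok("Sie trug einen roten Mantel.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # shape: (1, 3)
    print("predicted class index:", int(logits.argmax(dim=-1)))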
After the theoretical treatment of the developed approaches, the resulting findings are discussed and remaining problems encountered while working on the task are addressed. Finally, the thesis is summarized and suggestions for future work on this topic are presented.
Industry concentration and markups in the US have been rising over the last three to four decades. However, the causes remain largely unknown. This paper uses machine learning on regulatory documents to construct a novel dataset on compliance costs and examines the effect of regulations on market power. The dataset is comprehensive and covers all significant regulations at the 6-digit NAICS level from 1970 to 2018. We find that regulatory costs have increased by $1 trillion during this period. We document that an increase in regulatory costs results in lower (higher) sales, employment, markups, and profitability for small (large) firms. The regulation-driven increase in concentration is associated with a lower elasticity of entry with respect to Tobin's Q and with lower productivity and investment after the late 1990s. We estimate that increased regulations can explain 31-37% of the rise in market power. Finally, we uncover the political economy of rulemaking: while large firms oppose regulations in general, they push for the passage of regulations that have an adverse impact on small firms.
Part-of-speech tagging is generally performed by Markov models based on bigram or trigram statistics. While Markov models concentrate strongly on the left context of a word, many languages require the inclusion of the right context for correct disambiguation. We show for German that the best results are reached by a combination of left and right context. If only left context is available, changing the direction of analysis and going from right to left improves the results. In a version of MBT (Daelemans et al., 1996) with default parameter settings, the inclusion of the right context improved POS tagging accuracy from 94.00% to 96.08%, corroborating our hypothesis. The version with optimized parameters reaches 96.73%.
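The effect of the direction of analysis can be illustrated with NLTK's n-gram taggers (a simple stand-in for MBT, which is memory-based): reversing each sentence turns the tagger's left-context history into right context. The treebank split and backoff chain below are illustrative choices, and the .accuracy() call assumes NLTK >= 3.6.

    import nltk
    from nltk.corpus import treebank

    nltk.download("treebank", quiet=True)
    tagged = treebank.tagged_sents()
    train, test = tagged[:3000], tagged[3000:]

    def reverse_all(sents):
        # In reversed sentences, the "previous" token is the right neighbour.
        return [list(reversed(s)) for s in sents]

    def build(train_sents):
        t0 = nltk.DefaultTagger("NN")
        t1 = nltk.UnigramTagger(train_sents, backoff=t0)
        return nltk.BigramTagger(train_sents, backoff=t1)

    ltr = build(train)                      # left-to-right analysis
    rtl = build(reverse_all(train))         # right-to-left analysis
    print("L-to-R accuracy:", ltr.accuracy(test))
    print("R-to-L accuracy:", rtl.accuracy(reverse_all(test)))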
High-impact events, political changes and new technologies are reflected in our language and lead to the constant evolution of terms, expressions and names. Not knowing the names used in the past to refer to a named entity can severely degrade the performance of many computational linguistics algorithms. We propose NEER, an unsupervised method for named entity evolution recognition that is independent of external knowledge sources. We find time periods with a high likelihood of evolution. By analyzing only these periods with a sliding-window co-occurrence method, we capture evolving terms in the same context. We thus avoid comparing terms from widely different periods in time and overcome a severe limitation of existing methods for named entity evolution, as shown by a high recall of 90% on the New York Times corpus. We compare several relatedness measures for filtering to improve precision. Furthermore, using machine learning with minimal supervision improves precision to 94%.
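A minimal sketch of the sliding-window co-occurrence idea; the window size and the toy sentence are hypothetical, and NEER itself additionally restricts the scan to detected high-change periods and filters candidates with relatedness measures.

    from collections import Counter

    def cooccurrences(tokens, window=5):
        # Count term pairs that appear within `window` tokens of each other.
        counts = Counter()
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                counts[tuple(sorted((tokens[i], tokens[j])))] += 1
        return counts

    # Inside a high-change period, co-occurring names become evolution
    # candidates, e.g. an old and a new name for the same entity sharing
    # one context.
    doc = ["Cardinal", "Ratzinger", "was", "elected", "Pope", "Benedict", "XVI"]
    pairs = cooccurrences(doc)
    print(pairs.most_common(3))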
This paper contributes a multivariate forecasting comparison between structural models and machine-learning-based tools. Specifically, a fully connected feed-forward nonlinear autoregressive neural network (ANN) is contrasted with a well-established dynamic stochastic general equilibrium (DSGE) model, a Bayesian vector autoregression (BVAR) using optimized priors, as well as Greenbook and SPF forecasts. Model estimation and forecasting are based on an expanding-window scheme using quarterly U.S. real-time data (1964Q2:2020Q3) for 8 macroeconomic time series (GDP, inflation, federal funds rate, spread, consumption, investment, wage, hours worked), allowing for up to 8-quarter-ahead forecasts. The results show that the BVAR improves forecasts compared to the DSGE model; however, there is evidence for an overall improvement of predictions when relying on ANNs or including them in a weighted average. In particular, ANN-based inflation forecasts improve on the other predictions by up to 50%. These results indicate that nonlinear, data-driven ANNs are a useful method when it comes to macroeconomic forecasting.
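A hedged sketch of the expanding-window evaluation for a nonlinear autoregressive network, using a synthetic series and a 4-lag design as stand-ins for the paper's real-time dataset and exact specification.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    y = rng.standard_normal(200).cumsum()            # placeholder quarterly series

    def lagged(series, p=4):
        # Design matrix of lags 1..p and the aligned targets.
        X = np.column_stack([series[p - k - 1 : len(series) - k - 1]
                             for k in range(p)])
        return X, series[p:]

    errors = []
    for t in range(150, len(y) - 1):                 # expanding estimation window
        X, target = lagged(y[: t + 1])
        ann = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=0).fit(X, target)
        x_next = y[t - 3 : t + 1][::-1].reshape(1, -1)  # lags 1..4 of y[t+1]
        errors.append(float(ann.predict(x_next)[0] - y[t + 1]))
    print("one-step RMSE:", np.sqrt(np.mean(np.square(errors))))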
For medicine to fulfill its promise of personalized treatments based on a better understanding of disease biology, computational and statistical tools must exist to analyze the increasing amount of patient data that is becoming available. A particular challenge is that several types of data are being measured in order to cope with the complexity of the underlying systems, enhance predictive modeling and enrich molecular understanding.
Here we review a number of recent approaches that specialize in the analysis of multimodal data in the context of predictive biomedicine. We focus on methods that combine different omics measurements with imaging or genome variation data. Our overview shows the diversity of methods that address these analysis challenges and reveals avenues for novel developments.
Machine learning entails a broad range of techniques that have been widely used in science and engineering for decades. High-energy physics has also profited from the power of these tools for the advanced analysis of collider data. Only recently has machine learning started to be applied successfully in the domain of accelerator physics, as testified by the intense efforts deployed in this domain by several laboratories worldwide. This is also the case at CERN, where focused efforts have recently been devoted to applying machine learning techniques to beam dynamics studies at the Large Hadron Collider (LHC). This implies a wide spectrum of applications, from beam measurements and machine performance optimisation to the analysis of numerical data from tracking simulations of non-linear beam dynamics. In this paper, the LHC-related applications that are currently pursued are presented and discussed in detail, paying attention also to future developments.
Goal-Conditioned Reinforcement Learning (GCRL) is a popular framework for training agents to solve multiple tasks in a single environment. It is crucial to train an agent on a diverse set of goals to ensure that it can learn to generalize to unseen downstream goals. Therefore, current algorithms try to learn to reach goals while simultaneously exploring the environment for new ones (Aubret et al., 2021; Mendonca et al., 2021). This creates a form of the prominent exploration-exploitation dilemma. To relieve the pressure of a single agent having to optimize for two competing objectives at once, this thesis proposes the novel algorithm family Goal-Conditioned Reinforcement Learning with Prior Intrinsic Exploration (GC-π), which separates exploration and goal learning into distinct phases. In the first exploration phase, an intrinsically motivated agent explores the environment and collects a rich dataset of states and actions. This dataset is then used to learn a representation space, which acts as the distance metric for the goal-conditioned reward signal. In the final phase, a goal-conditioned policy is trained with the help of the representation space, and its training goals are randomly sampled from the dataset collected during the exploration phase. Multiple variations of these three phases have been extensively evaluated in the classic AntMaze MuJoCo environment (Nachum et al., 2018). The final results show that the proposed algorithms are able to fully explore the environment and solve all downstream goals while using every dimension of the state space for the goal space. This makes the approach more flexible compared to previous GCRL work, which only ever uses a small subset of the dimensions for the goals (S. Li et al., 2021a; Pong et al., 2020).
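The three phases can be summarized in a runnable toy sketch; the random-walk "exploration", the identity representation and the sampled goal below are placeholder stand-ins for the intrinsically motivated agent, the learned representation space and the policy update of the thesis.

    import random
    import numpy as np

    # Phase 1: exploration fills a dataset of visited states (toy random
    # walk standing in for an intrinsically motivated agent).
    def explore(n_steps=200):
        s, states = np.zeros(2), []
        for _ in range(n_steps):
            s = s + np.random.uniform(-1.0, 1.0, size=2)
            states.append(s.copy())
        return states

    dataset = explore()

    # Phase 2: a representation phi learned from the dataset defines the
    # goal-conditioned reward as a distance in representation space
    # (identity mapping here).
    phi = lambda state: state
    def reward(state, goal):
        return -float(np.linalg.norm(phi(state) - phi(goal)))

    # Phase 3: training goals for the goal-conditioned policy are sampled
    # from the exploration dataset, covering every state-space dimension.
    goal = random.choice(dataset)
    print("sampled goal:", goal, "| reward at origin:", reward(np.zeros(2), goal))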
In the recent past, huge progress has been made in the field of Artificial Intelligence. Since the rise of neural networks, astonishing new frontiers are continuously being discovered, and the development is so fast that overall no major technical limits are in sight. Digitization has hence expanded from its base in academia and industry to such an extent that it is now prevalent in politics, the mass media and even the popular arts. The DFG-funded project Specialized Information Service for Biodiversity Research and the BMBF-funded project Linked Open Tafsir fit squarely into this overall development. Both projects aim to build an intelligent, up-to-date, modern research infrastructure on biodiversity and theological studies for scholars working in these fields of historical science. Starting from digitized German and Arabic historical literature containing so far unavailable, valuable knowledge on biodiversity and theological studies, this dissertation at its core aims to apply state-of-the-art machine learning methods to analyze natural language texts in low-resource languages and to enable foundational Natural Language Processing tasks on them, such as Sentence Boundary Detection, Named Entity Recognition, and Topic Modeling. This ultimately paves the way for new scientific discoveries in the historical disciplines of the natural sciences and the humanities. By enriching the landscape of historical low-resource languages with valuable annotation data, this work becomes part of the greater movement of digitizing society, allowing people to focus on the things that really matter in science and industry.
Industry classification groups firms into finer partitions to support investment and empirical analysis. To overcome the well-documented limitations of existing industry definitions, such as their stale nature and their coarse categories for firms with multiple operations, we employ a clustering approach on 69 firm characteristics and allocate companies to novel economic sectors that maximize the within-group explained variation. Such sectors are dynamic yet stable, and they represent a superior investment set compared to standard classification schemes, both for portfolio optimization and for trading strategies based on within-industry mean reversion, which give rise to a latent risk factor significantly priced in the cross-section. We provide a new metric to quantify feature importance for clustering methods, finding that size drives differences across classical industries, while book-to-market and financial liquidity variables matter for clustering-based sectors.
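A minimal sketch of the clustering step and of one plausible way to score per-feature importance by within-group explained variation; the synthetic data, k = 10 and the variance-ratio metric are assumptions, not the paper's exact procedure.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 69))     # placeholder: 500 firms x 69 characteristics
    Z = StandardScaler().fit_transform(X)

    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)

    # Per-feature share of variance explained by the cluster assignment
    # (between-group variance over total variance).
    grand_mean = Z.mean(axis=0)
    between = np.zeros(Z.shape[1])
    for k in np.unique(labels):
        members = Z[labels == k]
        between += len(members) * (members.mean(axis=0) - grand_mean) ** 2
    importance = (between / len(Z)) / Z.var(axis=0)
    print("five most separating features:", np.argsort(importance)[-5:])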
The authors identify U.S. monetary and fiscal dominance regimes using machine learning techniques. The algorithms are trained and verified on simulated data from Markov-switching DSGE models before they classify regimes from 1968-2017 using actual U.S. data. All machine learning methods outperform a standard logistic regression on the simulated data; among them, the boosted ensemble trees classifier yields the best results. The authors find clear evidence of fiscal dominance before Volcker. Monetary dominance is detected between 1984 and 1988, before a fiscally led regime emerges around the stock market crash and lasts until 1994. Until the beginning of the new century, monetary dominance is established, while the more recent evidence following the financial crisis is mixed, with a tendency towards fiscal dominance.
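The train-on-simulation, classify-actual-data design can be sketched as follows; the synthetic "simulated" observables and the regime rule are placeholders for the Markov-switching DSGE output, and plain gradient boosting stands in for the boosted ensemble trees classifier.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(1)
    X_sim = rng.standard_normal((2000, 6))   # stand-in for simulated observables
    regime = (X_sim[:, 0] + 0.5 * X_sim[:, 3] > 0).astype(int)  # 0 = monetary, 1 = fiscal

    clf = GradientBoostingClassifier().fit(X_sim, regime)       # train on simulations

    X_actual = rng.standard_normal((40, 6))  # stand-in for actual U.S. quarters
    print("classified regimes:", clf.predict(X_actual))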
Machine Learning (ML) is so pervasive in our lives today that we often do not even realise that, more often than expected, we are using systems based on it. It is also evolving faster than ever before. When deploying ML systems that make decisions on their own, we need to think about their ignorance of our uncertain world. Uncertainty might arise due to scarcity of data, bias in the data, or a mismatch between the real world and the ML model. Given all these uncertainties, we need to think about how to build systems that are not totally ignorant of them. Bayesian ML can deal with these problems to some extent. Specifying the model using probabilities provides a convenient way to quantify uncertainties, which can then be included in the decision-making process.
In this thesis, we introduce the Bayesian ansatz to modeling and apply Bayesian ML models in finance and economics. In particular, we dig deeper into Gaussian processes (GP) and the Gaussian process latent variable model (GPLVM). Applied to the returns of several assets, the GPLVM provides the covariance structure and also a latent space embedding of the assets. Several financial applications can be built upon the output of the GPLVM. To demonstrate this, we build an automated asset allocation system and a predictor for missing asset prices, and we identify further structure in financial data.
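A minimal GPLVM sketch, assuming the GPy library (the thesis' tooling is not specified here); the returns are synthetic, and assets are treated as data points so that the latent embedding is per asset and the kernel yields an asset-by-asset covariance.

    import numpy as np
    import GPy

    rng = np.random.default_rng(0)
    Y = rng.standard_normal((20, 100))         # 20 assets x 100 daily returns (synthetic)

    model = GPy.models.GPLVM(Y, input_dim=2)   # 2-D latent embedding per asset
    model.optimize(messages=False)

    latent = model.X                           # latent coordinates of the assets
    K = model.kern.K(latent)                   # implied 20 x 20 covariance structure
    print(latent.shape, K.shape)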
It turns out that the GPLVM exhibits a rotational symmetry in the latent space, which makes it harder to fit. Our second publication reports how to deal with that symmetry. We propose another parameterization of the model using Householder transformations, by which the symmetry is broken. Bayesian models are changed by a reparameterization if the prior is not changed accordingly. We provide the correct prior distribution of the new parameters, such that the model, i.e. the data density, is not changed under the reparameterization. After applying the reparameterization to Bayesian PCA, we show that the symmetry of nonlinear models can be broken in the same way.
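For context, the standard Householder transformation underlying such a parameterization (the textbook definition; the publication's full construction and the matching prior are not reproduced here):

    \[
      H(v) \;=\; I - 2\,\frac{v v^{\top}}{\lVert v \rVert^{2}},
      \qquad H(v)^{\top} H(v) = I .
    \]

Since products of such reflections can represent any orthogonal matrix, parameterizing with them provides a handle for breaking the rotational symmetry of the latent space.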
In our last project, we propose a new method for matching quantile observations that uses order statistics. Using order statistics as the likelihood, instead of a Gaussian likelihood, has several advantages. We compare these two models and highlight their advantages and disadvantages. To demonstrate our method, we fit quantile observations of salary data from several European countries. Given several candidate models for the fit, our method also provides a metric to choose the best option.
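For reference, the density of the k-th order statistic of an i.i.d. sample of size n with cdf F and density f, the standard building block for such a likelihood (the publication's exact construction may differ):

    \[
      f_{(k)}(x) \;=\; \frac{n!}{(k-1)!\,(n-k)!}\,
      F(x)^{k-1}\,\bigl(1 - F(x)\bigr)^{n-k}\, f(x).
    \]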
We hope that this thesis illustrates some benefits of Bayesian modeling, especially of Gaussian processes, in finance and economics, and its usefulness whenever uncertainties are to be quantified.
This bachelor thesis deals with the topic classification of unstructured text. Due to the steadily growing volume of text-based data, automated classification methods are needed and researched in many disciplines. Building on the text2ddc classifier, developed at the Text Technology Lab of Goethe University Frankfurt am Main, the thesis investigates the effects of enlarging the training corpus by different methods. text2ddc uses the Dewey Decimal Classification (DDC) as its target classification and is trained on Wikipedia articles. After an introduction covering the fundamentals, the classification model of text2ddc is presented, and the problems and resulting tasks are examined. The update of the existing data is then described, followed by a presentation of the various methods for extending the training corpus. Experiments are carried out with a total of eleven languages. Finally, the evaluation shows the improvements in classification quality achieved with text2ddc, discusses the problematic cases and offers suggestions for future work.