OPUS 4 | 004 Datenverarbeitung; Informatik

Threshold testing and semi-online prophet inequalities (2023)

We study threshold testing, an elementary probing model with the goal to choose a large value out of n i.i.d. random variables. An algorithm can test each variable X_i once for some threshold t_i, and the test returns binary feedback whether X_i ≥ t_i or not. Thresholds can be chosen adaptively or non-adaptively by the algorithm. Given the results for the tests of each variable, we then select the variable with highest conditional expectation. We compare the expected value obtained by the testing algorithm with expected maximum of the variables. Threshold testing is a semi-online variant of the gambler’s problem and prophet inequalities. Indeed, the optimal performance of non-adaptive algorithms for threshold testing is governed by the standard i.i.d. prophet inequality of approximately 0.745 + o(1) as n → ∞. We show how adaptive algorithms can significantly improve upon this ratio. Our adaptive testing strategy guarantees a competitive ratio of at least 0.869 - o(1). Moreover, we show that there are distributions that admit only a constant ratio c < 1, even when n → ∞. Finally, when each box can be tested multiple times (with n tests in total), we design an algorithm that achieves a ratio of 1 - o(1).

Improving data quality by rules: a numismatic example (2017)

Tolle, Karsten ; Wigg-Wolf, David

The archaeological data dealt with in our database solution Antike Fundmünzen in Europa (AFE), which records finds of ancient coins, is entered by humans. Based on the Linked Open Data (LOD) approach, we link our data to Nomisma.org concepts, as well as to other resources like Online Coins of the Roman Empire (OCRE). Since information such as denomination, material, etc. is recorded for each single coin, this information should be identical for coins of the same type. Unfortunately, this is not always the case, mostly due to human errors. Based on rules that we implemented, we were able to make use of this redundant information in order to detect possible errors within AFE, and were even able to correct errors in Nomimsa.org. However, the approach had the weakness that it was necessary to transform the data into an internal data model. In a second step, we therefore developed our rules within the Linked Open Data world. The rules can now be applied to datasets following the Nomisma. org modelling approach, as we demonstrated with data held by Corpus Nummorum Thracorum (CNT). We believe that the use of methods like this to increase the data quality of individual databases, as well as across different data sources and up to the higher levels of OCRE and Nomisma.org, is mandatory in order to increase trust in them.

Open Educational Resources and Digital Higher Education (2022)

Schulze, Uwe

When specialization helps: using pooled contextualized embeddings to detect chemical and biomedical entities in Spanish (2019)

Stoeckel, Manuel ; Hemati, Wahed ; Mehler, Alexander

The recognition of pharmacological substances, compounds and proteins is an essential preliminary work for the recognition of relations between chemicals and other biomedically relevant units. In this paper, we describe an approach to Task 1 of the PharmaCoNER Challenge, which involves the recognition of mentions of chemicals and drugs in Spanish medical texts. We train a state-of-the-art BiLSTM-CRF sequence tagger with stacked Pooled Contextualized Embeddings, word and sub-word embeddings using the open-source framework FLAIR. We present a new corpus composed of articles and papers from Spanish health science journals, termed the Spanish Health Corpus, and use it to train domain-specific embeddings which we incorporate in our model training. We achieve a result of 89.76% F1-score using pre-trained embeddings and are able to improve these results to 90.52% F1-score using specialized embeddings.

Voting for POS tagging of latin texts: using the flair of FLAIR to better ensemble classifiers by example of latin (2020)

Stoeckel, Manuel ; Henlein, Alexander ; Hemati, Wahed ; Mehler, Alexander

Despite the great importance of the Latin language in the past, there are relatively few resources available today to develop modern NLP tools for this language. Therefore, the EvaLatin Shared Task for Lemmatization and Part-of-Speech (POS) tagging was published in the LT4HALA workshop. In our work, we dealt with the second EvaLatin task, that is, POS tagging. Since most of the available Latin word embeddings were trained on either few or inaccurate data, we trained several embeddings on better data in the first step. Based on these embeddings, we trained several state-of-the-art taggers and used them as input for an ensemble classifier called LSTMVoter. We were able to achieve the best results for both the cross-genre and the cross-time task (90.64% and 87.00%) without using additional annotated data (closed modality). In the meantime, we further improved the system and achieved even better results (96.91% on classical, 90.87% on cross-genre and 87.35% on cross-time).

TextAnnotator: a UIMA based tool for the simultaneous and collaborative annotation of texts (2020)

Abrami, Giuseppe ; Stoeckel, Manuel ; Mehler, Alexander

The annotation of texts and other material in the field of digital humanities and Natural Language Processing (NLP) is a common task of research projects. At the same time, the annotation of corpora is certainly the most time- and cost-intensive component in research projects and often requires a high level of expertise according to the research interest. However, for the annotation of texts, a wide range of tools is available, both for automatic and manual annotation. Since the automatic pre-processing methods are not error-free and there is an increasing demand for the generation of training data, also with regard to machine learning, suitable annotation tools are required. This paper defines criteria of flexibility and efficiency of complex annotations for the assessment of existing annotation tools. To extend this list of tools, the paper describes TextAnnotator, a browser-based, multi-annotation system, which has been developed to perform platform-independent multimodal annotations and annotate complex textual structures. The paper illustrates the current state of development of TextAnnotator and demonstrates its ability to evaluate annotation quality (inter-annotator agreement) at runtime. In addition, it will be shown how annotations of different users can be performed simultaneously and collaboratively on the same document from different platforms using UIMA as the basis for annotation.

An experimental study of external memory algorithms for connected components (2021)

Brodal, Gerth Stølting ; Fagerberg, Rolf ; Hammer, David ; Meyer, Ulrich ; Penschuck, Manuel ; Tran, Hung

We empirically investigate algorithms for solving Connected Components in the external memory model. In particular, we study whether the randomized O(Sort(E)) algorithm by Karger, Klein, and Tarjan can be implemented to compete with practically promising and simpler algorithms having only slightly worse theoretical cost, namely Borůvka’s algorithm and the algorithm by Sibeyn and collaborators. For all algorithms, we develop and test a number of tuning options. Our experiments are executed on a large set of different graph classes including random graphs, grids, geometric graphs, and hyperbolic graphs. Among our findings are: The Sibeyn algorithm is a very strong contender due to its simplicity and due to an added degree of freedom in its internal workings when used in the Connected Components setting. With the right tunings, the Karger-Klein-Tarjan algorithm can be implemented to be competitive in many cases. Higher graph density seems to benefit Karger-Klein-Tarjan relative to Sibeyn. Borůvka’s algorithm is not competitive with the two others.

Privacy concerns and behavior of Pokémon go players in Germany (2018)

Harborth, David ; Pape, Sebastian

We investigate privacy concerns and the privacy behavior of users of the AR smartphone game Pokémon Go. Pokémon Go accesses several functionalities of the smartphone and, in turn, collects a plethora of data of its users. For assessing the privacy concerns, we conduct an online study in Germany with 683 users of the game. The results indicate that the majority of the active players are concerned about the privacy practices of companies. This result hints towards the existence of a cognitive dissonance, i.e. the privacy paradox. Since this result is common in the privacy literature, we complement the first study with a second one with 199 users, which aims to assess the behavior of users with regard to which measures they undertake for protecting their privacy. The results are highly mixed and dependent on the measure, i.e. relatively many participants use privacy-preserving measures when interacting with their smartphone. This implies that many users know about risks and might take actions to protect their privacy, but deliberately trade-off their information privacy for the utility generated by playing the game.

On gender specific perception of data sharing in Japan (2016)

Tschersich, Markus ; Kiyomoto, Shinsaku ; Pape, Sebastian ; Nakamura, Toru ; Bal, Gökhan ; Takasaki, Haruo ; Rannenberg, Kai

Privacy and its protection is an important part of the culture in the USA and Europe. Literature in this field lacks empirical data from Japan. Thus, it is difficult– especially for foreign researchers – to understand the situation in Japan. To get a deeper understanding we examined the perception of a topic that is closely related to privacy: the perceived benefits of sharing data and the willingness to share in respect to the benefits for oneself, others and companies. We found a significant impact of the gender to each of the six analysed constructs.

Assessing privacy policies of internet of things services (2018)

Paul, Niklas ; Tesfay, Welderufael Berhane ; Kipker, Dennis-Kenji ; Stelter, Mattea ; Pape, Sebastian

This paper provides an assessment framework for privacy policies of Internet of Things Services which is based on particular GDPR requirements. The objective of the framework is to serve as supportive tool for users to take privacy-related informed decisions. For example when buying a new fitness tracker, users could compare different models in respect to privacy friendliness or more particular aspects of the framework such as if data is given to a third party. The framework consists of 16 parameters with one to four yes-or-no-questions each and allows the users to bring in their own weights for the different parameters. We assessed 110 devices which had 94 different policies. Furthermore, we did a legal assessment for the parameters to deal with the case that there is no statement at all regarding a certain parameter. The results of this comparative study show that most of the examined privacy policies of IoT devices/services are insufficient to address particular GDPR requirements and beyond. We also found a correlation between the length of the policy and the privacy transparency score, respectively.

Open Access

004 Datenverarbeitung; Informatik

Refine

Author

Year of publication

Document Type

Language

Has Fulltext

Is part of the Bibliography

Keywords

Institute

48 search hits