OPUS 4 | 004 Datenverarbeitung; Informatik

Threshold testing and semi-online prophet inequalities (2023)

We study threshold testing, an elementary probing model with the goal to choose a large value out of n i.i.d. random variables. An algorithm can test each variable X_i once for some threshold t_i, and the test returns binary feedback whether X_i ≥ t_i or not. Thresholds can be chosen adaptively or non-adaptively by the algorithm. Given the results for the tests of each variable, we then select the variable with highest conditional expectation. We compare the expected value obtained by the testing algorithm with expected maximum of the variables. Threshold testing is a semi-online variant of the gambler’s problem and prophet inequalities. Indeed, the optimal performance of non-adaptive algorithms for threshold testing is governed by the standard i.i.d. prophet inequality of approximately 0.745 + o(1) as n → ∞. We show how adaptive algorithms can significantly improve upon this ratio. Our adaptive testing strategy guarantees a competitive ratio of at least 0.869 - o(1). Moreover, we show that there are distributions that admit only a constant ratio c < 1, even when n → ∞. Finally, when each box can be tested multiple times (with n tests in total), we design an algorithm that achieves a ratio of 1 - o(1).

Improving data quality by rules: a numismatic example (2017)

Tolle, Karsten ; Wigg-Wolf, David

The archaeological data dealt with in our database solution Antike Fundmünzen in Europa (AFE), which records finds of ancient coins, is entered by humans. Based on the Linked Open Data (LOD) approach, we link our data to Nomisma.org concepts, as well as to other resources like Online Coins of the Roman Empire (OCRE). Since information such as denomination, material, etc. is recorded for each single coin, this information should be identical for coins of the same type. Unfortunately, this is not always the case, mostly due to human errors. Based on rules that we implemented, we were able to make use of this redundant information in order to detect possible errors within AFE, and were even able to correct errors in Nomimsa.org. However, the approach had the weakness that it was necessary to transform the data into an internal data model. In a second step, we therefore developed our rules within the Linked Open Data world. The rules can now be applied to datasets following the Nomisma. org modelling approach, as we demonstrated with data held by Corpus Nummorum Thracorum (CNT). We believe that the use of methods like this to increase the data quality of individual databases, as well as across different data sources and up to the higher levels of OCRE and Nomisma.org, is mandatory in order to increase trust in them.

Open Educational Resources and Digital Higher Education (2022)

Schulze, Uwe

When specialization helps: using pooled contextualized embeddings to detect chemical and biomedical entities in Spanish (2019)

Stoeckel, Manuel ; Hemati, Wahed ; Mehler, Alexander

The recognition of pharmacological substances, compounds and proteins is an essential preliminary work for the recognition of relations between chemicals and other biomedically relevant units. In this paper, we describe an approach to Task 1 of the PharmaCoNER Challenge, which involves the recognition of mentions of chemicals and drugs in Spanish medical texts. We train a state-of-the-art BiLSTM-CRF sequence tagger with stacked Pooled Contextualized Embeddings, word and sub-word embeddings using the open-source framework FLAIR. We present a new corpus composed of articles and papers from Spanish health science journals, termed the Spanish Health Corpus, and use it to train domain-specific embeddings which we incorporate in our model training. We achieve a result of 89.76% F1-score using pre-trained embeddings and are able to improve these results to 90.52% F1-score using specialized embeddings.

Voting for POS tagging of latin texts: using the flair of FLAIR to better ensemble classifiers by example of latin (2020)

Stoeckel, Manuel ; Henlein, Alexander ; Hemati, Wahed ; Mehler, Alexander

Despite the great importance of the Latin language in the past, there are relatively few resources available today to develop modern NLP tools for this language. Therefore, the EvaLatin Shared Task for Lemmatization and Part-of-Speech (POS) tagging was published in the LT4HALA workshop. In our work, we dealt with the second EvaLatin task, that is, POS tagging. Since most of the available Latin word embeddings were trained on either few or inaccurate data, we trained several embeddings on better data in the first step. Based on these embeddings, we trained several state-of-the-art taggers and used them as input for an ensemble classifier called LSTMVoter. We were able to achieve the best results for both the cross-genre and the cross-time task (90.64% and 87.00%) without using additional annotated data (closed modality). In the meantime, we further improved the system and achieved even better results (96.91% on classical, 90.87% on cross-genre and 87.35% on cross-time).

TextAnnotator: a UIMA based tool for the simultaneous and collaborative annotation of texts (2020)

Abrami, Giuseppe ; Stoeckel, Manuel ; Mehler, Alexander

The annotation of texts and other material in the field of digital humanities and Natural Language Processing (NLP) is a common task of research projects. At the same time, the annotation of corpora is certainly the most time- and cost-intensive component in research projects and often requires a high level of expertise according to the research interest. However, for the annotation of texts, a wide range of tools is available, both for automatic and manual annotation. Since the automatic pre-processing methods are not error-free and there is an increasing demand for the generation of training data, also with regard to machine learning, suitable annotation tools are required. This paper defines criteria of flexibility and efficiency of complex annotations for the assessment of existing annotation tools. To extend this list of tools, the paper describes TextAnnotator, a browser-based, multi-annotation system, which has been developed to perform platform-independent multimodal annotations and annotate complex textual structures. The paper illustrates the current state of development of TextAnnotator and demonstrates its ability to evaluate annotation quality (inter-annotator agreement) at runtime. In addition, it will be shown how annotations of different users can be performed simultaneously and collaboratively on the same document from different platforms using UIMA as the basis for annotation.

An experimental study of external memory algorithms for connected components (2021)

Brodal, Gerth Stølting ; Fagerberg, Rolf ; Hammer, David ; Meyer, Ulrich ; Penschuck, Manuel ; Tran, Hung

We empirically investigate algorithms for solving Connected Components in the external memory model. In particular, we study whether the randomized O(Sort(E)) algorithm by Karger, Klein, and Tarjan can be implemented to compete with practically promising and simpler algorithms having only slightly worse theoretical cost, namely Borůvka’s algorithm and the algorithm by Sibeyn and collaborators. For all algorithms, we develop and test a number of tuning options. Our experiments are executed on a large set of different graph classes including random graphs, grids, geometric graphs, and hyperbolic graphs. Among our findings are: The Sibeyn algorithm is a very strong contender due to its simplicity and due to an added degree of freedom in its internal workings when used in the Connected Components setting. With the right tunings, the Karger-Klein-Tarjan algorithm can be implemented to be competitive in many cases. Higher graph density seems to benefit Karger-Klein-Tarjan relative to Sibeyn. Borůvka’s algorithm is not competitive with the two others.

Privacy concerns and behavior of Pokémon go players in Germany (2018)

Harborth, David ; Pape, Sebastian

We investigate privacy concerns and the privacy behavior of users of the AR smartphone game Pokémon Go. Pokémon Go accesses several functionalities of the smartphone and, in turn, collects a plethora of data of its users. For assessing the privacy concerns, we conduct an online study in Germany with 683 users of the game. The results indicate that the majority of the active players are concerned about the privacy practices of companies. This result hints towards the existence of a cognitive dissonance, i.e. the privacy paradox. Since this result is common in the privacy literature, we complement the first study with a second one with 199 users, which aims to assess the behavior of users with regard to which measures they undertake for protecting their privacy. The results are highly mixed and dependent on the measure, i.e. relatively many participants use privacy-preserving measures when interacting with their smartphone. This implies that many users know about risks and might take actions to protect their privacy, but deliberately trade-off their information privacy for the utility generated by playing the game.

On gender specific perception of data sharing in Japan (2016)

Tschersich, Markus ; Kiyomoto, Shinsaku ; Pape, Sebastian ; Nakamura, Toru ; Bal, Gökhan ; Takasaki, Haruo ; Rannenberg, Kai

Privacy and its protection is an important part of the culture in the USA and Europe. Literature in this field lacks empirical data from Japan. Thus, it is difficult– especially for foreign researchers – to understand the situation in Japan. To get a deeper understanding we examined the perception of a topic that is closely related to privacy: the perceived benefits of sharing data and the willingness to share in respect to the benefits for oneself, others and companies. We found a significant impact of the gender to each of the six analysed constructs.

Assessing privacy policies of internet of things services (2018)

Paul, Niklas ; Tesfay, Welderufael Berhane ; Kipker, Dennis-Kenji ; Stelter, Mattea ; Pape, Sebastian

This paper provides an assessment framework for privacy policies of Internet of Things Services which is based on particular GDPR requirements. The objective of the framework is to serve as supportive tool for users to take privacy-related informed decisions. For example when buying a new fitness tracker, users could compare different models in respect to privacy friendliness or more particular aspects of the framework such as if data is given to a third party. The framework consists of 16 parameters with one to four yes-or-no-questions each and allows the users to bring in their own weights for the different parameters. We assessed 110 devices which had 94 different policies. Furthermore, we did a legal assessment for the parameters to deal with the case that there is no statement at all regarding a certain parameter. The results of this comparative study show that most of the examined privacy policies of IoT devices/services are insufficient to address particular GDPR requirements and beyond. We also found a correlation between the length of the policy and the privacy transparency score, respectively.

A self-organized one-neuron controller for artificial life on wheels (2017)

Gros, Claudius ; Martin, Laura ; Sándor, Bulcsú

We study simulated animats in terms of wheeled robots with the most simple neural controller possible – a single neuron per actuator. The system is fully self-organized in the sense that the controlling neuron receives uniquely the actual angle of the wheel as an input. Non-trivial locomotion results in structured environments, with the robot determining autonomously the direction of movement (time-reversal symmetry is spontaneously broken). Our controller, which mimics the mechanism used to transmit power in steam locomotives, abstracts from the body plan of the animat, working without problems also in the presence of noise and for chains of individual two-wheeled cars. Being fully compliant our controller may be also used, in the spirit of morphological computation, as a basic unit for higher-level evolutionary algorithms.

An architecture blueprint for knowlege-based e-Science (2007)

Niederée, Claudia ; Risse, Thomas ; Paukert, Marco ; Stein, Adelheit

The scientific innovation process embraces the steps from problem definition through the development and evaluation of innovative solutions to their successful exploitation. The challenges imposed by this process can be answered by the creation of a powerful and flexible next-generation e-Science infrastructure, which exploits leading edge information and knowledge technologies and enables a comprehensive and intelligent means of supporting this process. This paper describes our vision of a Knowledge-based eScience infrastructure, which is based on the results of an in-depth study of the researchers requirements. Furthermore, it introduces the Fraunhofer e-Science Cockpit as a first implementation of our vision.

Terminology evolution in web archiving: Open issues (2008)

Tahmasebi, Nina ; Iofciu, Tereza ; Risse, Thomas ; Niederée, Claudia ; Siberski, Wolf

The correspondence between the terminology used for querying and the one used in content objects to be retrieved, is a crucial prerequisite for effective retrieval technology. However, as terminology is evolving over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure "semantic" accessibility of (Web) archive content on the long run. As a starting point for dealing with terminology evolution this paper formalizes the problem and discusses issues, first ideas and relevant technologies.

Optimizing space of parallel processes (2019)

Schmidt-Schauß, Manfred ; Dallmeyer, Nils

This paper is a contribution to exploring and analyzing space-improvements in concurrent programming languages, in particular in the functional process-calculus CHF. Space-improvements are defined as a generalization of the corresponding notion in deterministic pure functional languages. The main part of the paper is the O(n ·logn) algorithm SPOPTN for offline space optimization of several parallel independent processes. Applications of this algorithm are: (i) affirmation of space improving transformations for particular classes of program transformations; (ii) support of an interpreter-based method for refuting space-improvements; and (iii) as a stand-alone offline-optimizer for space (or similar resources) of parallel processes.

A two-pillar approach to analyze the privacy policies and resource access behaviors of mobile augmented reality applications (2018)

Harborth, David ; Hatamian, Majid ; Tesfay, Welderufael Berhane ; Rannenberg, Kai

Augmented reality (AR) gained much public attention since the success of Pok´emon Go in 2016. Technology companies like Apple or Google are currently focusing primarily on mobile AR (MAR) technologies, i.e. applications on mobile devices, like smartphones or tablets. Associated privacy issues have to be investigated early to foster market adoption. This is especially relevant since past research found several threats associated with the use of smartphone applications. Thus, we investigate two of the main privacy risks for MAR application users based on a sample of 19 of the most downloaded MAR applications for Android. First, we assess threats arising from bad privacy policies based on a machine-learning approach. Second, we investigate which smartphone data resources are accessed by the MAR applications. Third, we combine both approaches to evaluate whether privacy policies cover certain data accesses or not. We provide theoretical and practical implications and recommendations based on our results.

Modeling saltwater intrusion scenarios for a coastal aquifer at the German North Sea (2018)

Schneider, Anke ; Zhao, Hong ; Wolf, Jens ; Logashenko, Dmitry ; Reiter, Sebastian ; Howahr, M. ; Eley, Malte ; Gelleszun, Marlene ; Wiederhold, Helga

A 3d regional density-driven flow model of a heterogeneous aquifer system at the German North Sea Coast is set up within the joint project NAWAK (“Development of sustainable adaption strategies for the water supply and distribution infrastructure on condition of climatic and demographic change”). The development of the freshwater-saltwater interface is simulated for three climate and demographic scenarios. Groundwater flow simulations are performed with the finite volume code d3f++ (distributed density driven flow) that has been developed with a view to the modelling of large, complex, strongly density-influenced aquifer systems over long time periods.

Generating practical random hyperbolic graphs in near-linear time and with sub-linear memory (2017)

Penschuck, Manuel

Random graph models, originally conceived to study the structure of networks and the emergence of their properties, have become an indispensable tool for experimental algorithmics. Amongst them, hyperbolic random graphs form a well-accepted family, yielding realistic complex networks while being both mathematically and algorithmically tractable. We introduce two generators MemGen and HyperGen for the G_{alpha,C}(n) model, which distributes n random points within a hyperbolic plane and produces m=n*d/2 undirected edges for all point pairs close by; the expected average degree d and exponent 2*alpha+1 of the power-law degree distribution are controlled by alpha>1/2 and C. Both algorithms emit a stream of edges which they do not have to store. MemGen keeps O(n) items in internal memory and has a time complexity of O(n*log(log n) + m), which is optimal for networks with an average degree of d=Omega(log(log n)). For realistic values of d=o(n / log^{1/alpha}(n)), HyperGen reduces the memory footprint to O([n^{1-alpha}*d^alpha + log(n)]*log(n)). In an experimental evaluation, we compare HyperGen with four generators among which it is consistently the fastest. For small d=10 we measure a speed-up of 4.0 compared to the fastest publicly available generator increasing to 29.6 for d=1000. On commodity hardware, HyperGen produces 3.7e8 edges per second for graphs with 1e6 < m < 1e12 and alpha=1, utilising less than 600MB of RAM. We demonstrate nearly linear scalability on an Intel Xeon Phi.

Tailored data science education using gamification (2016)

Kim, Hee Eun ; Zicari, Roberto V. ; Tolle, Karsten ; Manieri, Andrea

Interest to become a data scientist or related professions in data science domain is rapidly growing. To meet such a demand, we propose a novel educational service that aims to provide tailored learning paths for data science. Our target user is one who aims to be an expert in data science. Our approach is to analyze the background of the practitioner and match the learning units. A critical feature is that we use gamification to reinforce the practitioner engagement. We believe that our work provides a practical guideline for those who want to learn data science.

SEDGE : symbolic example data generation for dataflow programs (2013)

Li, Kaituo ; Reichenbach, Christoph ; Smaragdakis, Yannis ; Diao, Yanlei ; Csallner, Christoph

Exhaustive, automatic testing of dataflow (esp. mapreduce) programs has emerged as an important challenge. Past work demonstrated effective ways to generate small example data sets that exercise operators in the Pig platform, used to generate Hadoop map-reduce programs. Although such prior techniques attempt to cover all cases of operator use, in practice they often fail. Our SEDGE system addresses these completeness problems: for every dataflow operator, we produce data aiming to cover all cases that arise in the dataflow program (e.g., both passing and failing a filter). SEDGE relies on transforming the program into symbolic constraints, and solving the constraints using a symbolic reasoning engine (a powerful SMT solver), while using input data as concrete aids in the solution process. The approach resembles dynamic-symbolic (a.k.a. "concolic") execution in a conventional programming language, adapted to the unique features of the dataflow domain. In third-party benchmarks, SEDGE achieves higher coverage than past techniques for 5 out of 20 PigMix benchmarks and 7 out of 11 SDSS benchmarks and (with equal coverage for the rest of the benchmarks). We also show that our targeting of the high-level dataflow language pays off: for complex programs, state-of-the-art dynamic-symbolic execution at the level of the generated map-reduce code (instead of the original dataflow program) requires many more test cases or achieves much lower coverage than our approach.

Semantic web technologies applied to numismatic collections (2012)

Gruber, Ethan ; Heath, Sebastian ; Meadows, Andrew ; Pett, Daniel ; Tolle, Karsten ; Wigg-Wolf, David

This paper describes the ongoing efforts of the authors to present ancient Greek and Roman numismatic data on the public internet, with an emphasis on efforts to integrate information from multiple sources using Linked Data and Semantic Web techniques. By way of very modern metaphor, it is useful to think of coins as intentionally created packages of 'named entities'. Each coin was struck by a particular authority, often at a known site, and coins often make reference to familiar concepts such as deities, historical events, or symbols that were widely recognized in the ancient world. The institutions represented among the authors have deployed search interfaces that allow users to take advantage of this aspect of numismatic databases. The American Numismatic Society's database provides faceted search to its collection of over 550,000 objects. The Portable Antiquities Scheme (PAS) in the UK presents individual finds (and hoards) recorded throughout the country. The Römisch-Germanische Kommission and the University of Frankfurt (DBIS) are developing a prototype metaportal (INTERFACE) that accesses national databases of coin finds held in in Frankfurt, Vienna and Utrecht. Each of these resources is beginning to explore Semantic Web/Linked data approaches so that the role of numismatic standards is immediately coming to the fore. DBIS and INTERFACE are developing a numismatic ontology. At the ANS and PAS, the public database already presents RDF serializations based on Dublin Core. Together, the authors have begun to explore standardization of conceptual names on the basis of the vocabulary presented at the site http://nomisma.org . Nomisma.org is a collaborative effort to provide stable digital representations of numismatic concepts and entities. It provides URIs for such basic concepts as 'coin', 'mint', 'axis'. All of these are defined within the scope of numismatics but are already being linked to other stable resources where available. This is particularly the case for mints. For example, the URI http://nomisma.org/id/corinth is intended to represent that ancient city in its role as a minter/issuer of coins. The URI is linked via the SKOS ontology to the Pleiades Gazetteer of ancient places. This allows Nomisma to be the basis for a common representation of the concept that an object is a coin minted at Corinth. The ANS has already deployed such relationships in its public database. The work of all these projects is very much in progress so that this paper hopes to generate discussion on how multiple large projects can move forward in their own work while encouraging sufficient commonality to support large scale research questions undertaken by diverse audiences.

Open Access

004 Datenverarbeitung; Informatik

Refine

Author

Year of publication

Document Type

Language

Has Fulltext

Is part of the Bibliography

Keywords

Institute

48 search hits