Virtual machines are for the most part not used inside high-energy physics (HEP) environments. Even though they provide a high degree of isolation, the performance overhead they introduce is too great. With the rising number of container technologies and their increasing separation capabilities, HEP environments are evaluating whether they can utilize this technology. Container images are small and self-contained, which allows them to be distributed easily throughout the global environment. They also offer near-native performance while providing an often acceptable level of isolation. Only the needed services and libraries are packed into an image and executed directly by the host kernel. This work compared the performance impact of the three container technologies Docker, rkt and Singularity. The host kernel was additionally hardened with grsecurity and PaX to strengthen its security and make exploitation from inside a container harder. The execution time of a physics simulation was used as a benchmark. The results show that the container technologies differ in their impact on performance. The performance loss on a stock kernel is small; in some cases the containerized runs were even faster than native execution. Docker showed overall the best performance on a stock kernel. The difference on a hardened kernel was bigger than on a stock kernel, but in favor of the container technologies. rkt performed better than the others in almost all cases.
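To illustrate the kind of measurement behind such a comparison, the following minimal sketch times the same workload natively and inside a Docker container; the inline benchmark command and the python:3.11-slim image are illustrative assumptions, not the physics simulation used in the thesis.

import subprocess
import time

# Hypothetical stand-in workload; the thesis used a physics simulation instead.
BENCHMARK = ["python3", "-c", "print(sum(i * i for i in range(10**7)))"]

def timed_run(cmd):
    """Run a command and return its wall-clock execution time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

native = timed_run(BENCHMARK)                                   # on the host
containerized = timed_run(["docker", "run", "--rm",             # same workload
                           "python:3.11-slim"] + BENCHMARK)     # inside Docker

print(f"native: {native:.2f} s, container: {containerized:.2f} s, "
      f"overhead: {100 * (containerized / native - 1):+.1f} %")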
Driven by rapid technological advancements, the amount of data that is created, captured, communicated, and stored worldwide has grown exponentially over the past decades. Along with this development, it has become critical for many disciplines of science and business to be able to gather and analyze large amounts of data. The sheer volume of the data often exceeds the capabilities of classical storage systems, with the result that current large-scale storage systems are highly distributed and composed of a large number of individual storage components. As with any other electronic device, the reliability of storage hardware is governed by certain probability distributions, which in turn are influenced by the physical processes utilized to store the information. The traditional way to deal with the inherent unreliability of combined storage systems is to replicate the data several times. Another popular approach to achieve failure tolerance is to calculate the block-wise parity in one or more dimensions. With better understanding of the different failure modes of storage components, it has become evident that sophisticated high-level error detection and correction techniques are indispensable for the ever-growing distributed systems. The utilization of powerful cyclic error-correcting codes, however, comes with a high computational penalty, since the required operations over finite fields do not map very well onto current commodity processors. This thesis introduces a versatile coding scheme with fully adjustable fault tolerance that is tailored specifically to modern processor architectures. To reduce stress on the memory subsystem, the conventional table-based algorithm for multiplication over finite fields has been replaced with a polynomial version. This arithmetically intense algorithm is better suited to the wide SIMD units of currently available general-purpose processors, but also displays significant benefits when used with modern many-core accelerator devices (for instance the popular general-purpose graphics processing units). A CPU implementation using SSE and a GPU version using CUDA are presented. The performance of the multiplication depends on the distribution of the polynomial coefficients in the finite field elements. This property has been used to construct suitable matrices that generate a linear systematic erasure-correcting code with significantly increased multiplication performance for the relevant matrix elements. Several approaches to obtain the optimized generator matrices are elaborated and their implications are discussed. A Monte-Carlo-based construction method makes it possible to influence the specific shape of the generator matrices and thus to adapt them to special storage and archiving workloads. Extensive benchmarks on CPU and GPU demonstrate the superior performance and the future application scenarios of this novel erasure-resilient coding scheme.
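As a purely illustrative sketch of the two multiplication strategies discussed above, the snippet below contrasts a table-based multiplication in GF(2^8) with a polynomial (shift-and-XOR) version; the choice of the field GF(2^8), the reduction polynomial 0x11B and the generator 3 are assumptions for this example and do not reproduce the SIMD or CUDA kernels developed in the thesis.

# Two ways to multiply in GF(2^8) with the reduction polynomial
# x^8 + x^4 + x^3 + x + 1 (0x11B); illustration only.

POLY = 0x11B

def gf_mul_poly(a, b):
    """Polynomial (shift-and-XOR) multiplication followed by reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:            # reduce when the degree reaches 8
            a ^= POLY
    return result

# Build log/antilog tables from a generator element (3 generates GF(256)*).
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x = gf_mul_poly(x, 3)
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul_table(a, b):
    """Classic table-based multiplication: two lookups plus an index addition."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

# Both strategies agree on the whole field.
assert all(gf_mul_poly(a, b) == gf_mul_table(a, b)
           for a in range(256) for b in range(256))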
Acceleration of Biomedical Image Processing and Reconstruction with FPGAs
Increasing chip sizes and better programming tools have made it possible to push the boundaries of application acceleration with reconfigurable computer chips. In this thesis the potential of acceleration with Field Programmable Gate Arrays (FPGAs) is examined for applications that perform biomedical image processing and reconstruction. The dataflow paradigm was used to port the analysis of image data for localization microscopy and for 3D electron tomography from an imperative description to the FPGA for the first time.
After the primitives of image processing on FPGAs are presented, a general workflow is given for analyzing imperative source code and converting it into a hardware pipeline in which every node processes image data in parallel. This theoretical foundation is then used to accelerate both example applications. For localization microscopy, a speed-up by a factor of 185 compared to an Intel i5 450 CPU was achieved, and electron tomography could be sped up by a factor of 5 over an Nvidia Tesla C1060 graphics card, while maintaining full accuracy in both cases.
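The following toy sketch conveys the dataflow idea in software: each stage consumes and produces a stream of pixels, analogous to pipeline nodes on the FPGA. The smoothing taps and the threshold are arbitrary assumptions and are unrelated to the actual localization-microscopy or tomography kernels.

import numpy as np

def pixel_stream(image):
    """Source stage: emit the image as a flat stream of pixel values."""
    for value in image.ravel():
        yield float(value)

def smooth(stream, taps=(0.25, 0.5, 0.25)):
    """Streaming 3-tap moving-average filter, one pixel in, one pixel out."""
    window = [0.0] * len(taps)
    for value in stream:
        window = window[1:] + [value]
        yield sum(t * w for t, w in zip(taps, window))

def threshold(stream, level=100.0):
    """Binarization stage, again operating on the stream one pixel at a time."""
    for value in stream:
        yield 1 if value > level else 0

image = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
pipeline = threshold(smooth(pixel_stream(image)))
result = np.fromiter(pipeline, dtype=np.uint8).reshape(image.shape)
print(result.sum(), "pixels above threshold")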
Detectors of modern high-energy physics experiments generate huge data rates during operation. The efficient read-out of this data from the front-end electronics is a sophisticated task, the main challenges, however, may vary from experiment to experiment. The Compressed Baryonic Matter (CBM) experiment that is currently under construction at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt/Germany foresees a novel approach for data acquisition.
Unlike previous comparable experiments that organize data read-out based on global, hierarchical trigger decisions, CBM is based on free-running and self-triggered front-end electronics. Data is pushed to the next stage of the read-out chain rather than pulled from the buffers of the previous stage. This new paradigm requires a completely new development of read-out electronics.
As one part of this thesis, a firmware for a read-out controller to interface such a free-running and self-triggered front-end ASIC, the GET4 chip, was implemented. The firmware in question was developed to run on a Field Programmable Gate Array (FPGA). An FPGA is an integrated circuit whose behavior can be reconfigured "in the field", which offers a lot of flexibility: bugs can be fixed and completely new features can be added even after the hardware has already been installed. Due to these general advantages, the usage of FPGAs is desired for the final experiment. However, there is also a drawback: the only affordable FPGAs today are based on either SRAM or Flash technology, and neither can easily be operated in a radiation environment.
SRAM-based devices suffer severely from Single Event Upsets (SEUs) and Flash-based FPGAs deteriorate too fast from Total Ionizing Dose (TID) effects.
Several radiation mitigation techniques exist for SRAM-based FPGAs, but careful evaluation for each use case is required. For CBM it is not clear whether the higher resource consumption of the added redundancy, which more or less directly translates into additional cost, outweighs the advantages of using FPGAs. In addition, it is not even clear whether radiation mitigation techniques (e.g. scrubbing) that have already been put into operation successfully in space applications also work as efficiently at the much higher particle rates expected at CBM.
In this thesis, existing radiation mitigation techniques have been analyzed and eligible techniques have been implemented for the above-mentioned read-out controller. To minimize additional costs, redundancy was only implemented for selected parts of the design.
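One widely used redundancy technique in this context is triple modular redundancy (TMR). The sketch below only models the bit-wise majority voter in software to show the principle; it is not taken from the read-out controller firmware, and which technique was applied to which module is detailed in the thesis itself.

def majority_vote(a: int, b: int, c: int) -> int:
    """Bit-wise 2-out-of-3 majority of three redundant register copies."""
    return (a & b) | (a & c) | (b & c)

# Example: an SEU flips one bit in the second copy; the voter masks it.
golden = 0b1011_0010
upset = golden ^ 0b0000_1000      # single bit flip in copy b
assert majority_vote(golden, upset, golden) == golden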
Finally, the radiation mitigated read-out controller was tested by mounting the device directly into a particle beam at Forschungszentrum Jülich. The tests show that the radiation mitigation effect of the implemented techniques remains sound, even at a very high particle flux and with only part of the design protected by costly redundancy.
The promising results of the in-beam tests suggest using FPGAs in the read-out chain of the CBM-ToF detector.
The anan project is a tool for debugging distributed high-performance computers. The novelty of the contribution is that well-known methods, which are already used successfully for debugging software and hardware, have been transferred to high-performance computing. As part of this work, a tool called anan was implemented that assists in debugging. It can also be used as a more dynamic form of monitoring. Both use cases have been tested.
The tool consists of two parts:
1. a part called anan, which is operated interactively by the user,
2. and a part called anand, which automatically collects the requested measurements and executes commands if necessary.
The anan part executes sensors (small, pattern-driven algorithms) whose results are merged by anan. To a first approximation, anan can be described as a monitoring system that (1) can be reconfigured quickly and (2) can measure more complex quantities that go beyond correlations of simple time series.
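For illustration only, a sensor in this spirit could look like the following sketch: a small, pattern-driven routine that extracts one metric from a command's output and flags a condition. The command, the regular expression and the threshold are hypothetical and are not taken from the actual anan/anand code.

import re
import subprocess

def load_sensor():
    """Read the 1-minute load average from `uptime` and flag overload."""
    output = subprocess.run(["uptime"], capture_output=True, text=True).stdout
    match = re.search(r"load average[s]?: ([\d.]+)", output)
    load = float(match.group(1)) if match else float("nan")
    return {"metric": "load1", "value": load, "alert": load > 8.0}

print(load_sensor())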
Time-critical applications process a continuous stream of input data and have to meet specific timing constraints. A common approach to ensure that such an application satisfies its constraints is over-provisioning: the application is deployed in a dedicated cluster environment with enough processing power to achieve the target performance for every specified data input rate. This approach comes with a drawback: at times of decreased data input rates, the cluster resources are not fully utilized. A typical use case is the HLT-Chain application that processes physics data at runtime of the ALICE experiment at CERN. From a perspective of cost and efficiency it is desirable to exploit temporarily unused cluster resources. Existing approaches aim for that goal by running additional applications. These approaches, however, a) lack the flexibility to dynamically grant the time-critical application the resources it needs, b) are insufficient for isolating the time-critical application from harmful side effects introduced by additional applications, or c) are not general because application-specific interfaces are used. In this thesis, a software framework is presented that makes it possible to exploit unused resources in a dedicated cluster without harming a time-critical application. Additional applications are hosted in Virtual Machines (VMs), and unused cluster resources are allocated to these VMs at runtime. In order to avoid resource bottlenecks, the resource usage of the VMs is dynamically modified according to the needs of the time-critical application. For this purpose, a combination of methods that had not been combined before is used. On a global level, appropriate VM manipulations such as hot migration, suspend/resume and start/stop are determined by an informed search heuristic and applied at runtime. Locally on the cluster nodes, a feedback-controlled adaptation of VM resource usage is carried out in a decentralized manner. Employing this framework makes it possible to increase a cluster's usage by running additional applications, while at the same time preventing a negative impact on the time-critical application. This capability of the framework is shown for the HLT-Chain application: in an empirical evaluation, the cluster CPU usage is increased from 49% to 79%, additional results are computed, and no negative effects on the HLT-Chain application are observed.
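A minimal sketch of the local, feedback-controlled adaptation idea is given below: a proportional controller that shrinks the CPU share granted to an opportunistic VM when the time-critical application falls behind its target rate and grows it again when there is headroom. The gain, the limits and the target rate are illustrative assumptions, not the parameters of the presented framework.

def adjust_vm_cpu_share(current_share, measured_rate, target_rate,
                        gain=0.5, min_share=0.05, max_share=0.9):
    """Return the new CPU share to grant the opportunistic VM."""
    error = (measured_rate - target_rate) / target_rate   # > 0 means headroom
    new_share = current_share + gain * error * current_share
    return max(min_share, min(max_share, new_share))

share = 0.5
for rate in (105.0, 98.0, 90.0, 101.0):       # measured events/s, target 100
    share = adjust_vm_cpu_share(share, rate, target_rate=100.0)
    print(f"measured {rate:6.1f} Hz -> VM share {share:.2f}")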
Deep learning and isolation based security for intrusion detection and prevention in grid computing
(2018)
The use of distributed computational resources for the solution of scientific problems that require highly intensive data processing is a fundamental mechanism for modern scientific collaborations. The Worldwide Large Hadron Collider Computing Grid (WLCG) is one of the most important examples of a distributed infrastructure for scientific projects and one of the pioneering examples of grid computing. The WLCG is the global grid that analyzes data from the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN), with 170 sites in 40 countries and more than 600,000 processing cores. The grid service providers grant users access to resources that they can utilize on demand for the execution of custom software applications used for the analysis of data. The code that the users can execute is completely flexible, and commonly there are no significant restrictions. This flexibility and the availability of immense computing power increase the security challenges of these environments. Attackers are a concern for grid administrators: they may request the execution of software containing malicious code that gives them the possibility of compromising the underlying institutions' infrastructure. Grid systems need security countermeasures that keep user code running without allowing access to critical components, while still retaining flexibility. The administrators of grid systems also need to continuously monitor the activities that the applications are carrying out. An analysis of these activities is necessary to detect possible security issues, to identify ongoing incidents and to perform autonomous responses. The size and complexity of grid systems make manual security monitoring and response expensive and complicated for human analysts. Legacy intrusion detection and prevention systems (IDPS) such as Snort and OSSEC are traditionally used for security incident monitoring in grids, clouds, clusters and standalone systems. However, such IDPS are limited by their use of hardcoded, fixed rules that need to be updated continuously to cope with different threats.
This thesis introduces an architecture for improving security in grid computing. The architecture integrates security by isolation, behavior monitoring and deep learning (DL) for the classification of real-time traces of the running user payloads, also known as grid jobs. The first component of the proposal, Linux containers (LCs), is used to provide isolation between grid jobs and to gather specific traceable information about the behavior of individual jobs. LCs offer a safe environment for the execution of arbitrary user scripts or binaries, protecting the sensitive components of the grid member organizations. Containers are a software sandboxing technique and form a lightweight alternative to other technologies such as virtual machines (VMs), which usually implement a full machine-level emulation and can therefore significantly affect performance. This performance loss is commonly unacceptable in high-throughput computing scenarios. Containers enable the collection of monitoring information from the processes running inside them. The data collected via LC monitoring is used to feed a DL-based IDPS.
DL methods can acquire knowledge from experience, which eliminates the need for operators to formally specify all the knowledge that a system requires. These methods can improve IDPS by building models that detect security incidents automatically and have the ability to generalize to new classes of issues. DL can produce lower false positive rates for intrusion detection and also provides a measure of false negatives, which can be improved with new training data. Convolutional neural networks (CNNs) are utilized to distinguish between regular and malicious job classes. A set of samples is collected from regular production grid jobs from the grid infrastructure of "A Large Ion Collider Experiment" (ALICE) and from malicious Linux binaries from a malware research website. The features extracted from these samples are utilized for the training and validation of the machine learning (ML) models. The utilization of a generative approach to enhance the required training data is also proposed. Recurrent neural networks (RNNs) are used as generative models for the simulation of training data that complements and improves the real collected dataset. This data augmentation strategy is useful to compensate for the lack of training data in ML processes.
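The following sketch shows, under stated assumptions, what such a classifier can look like: a small one-dimensional CNN that maps a fixed-length feature vector extracted from a job trace to a regular/malicious decision. The feature length of 256, the layer sizes and the use of PyTorch are illustrative choices and do not reproduce the models trained in the thesis.

import torch
import torch.nn as nn

class TraceClassifier(nn.Module):
    """Toy 1D CNN over trace feature vectors; two output classes."""

    def __init__(self, feature_len=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(32 * (feature_len // 4), 2)

    def forward(self, x):                 # x: (batch, 1, feature_len)
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TraceClassifier()
dummy_batch = torch.randn(8, 1, 256)      # 8 synthetic trace feature vectors
logits = model(dummy_batch)
print(logits.shape)                       # torch.Size([8, 2])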
...
A Large Ion Collider Experiment (ALICE) is a high-energy physics experiment designed to study heavy-ion collisions at the European Organization for Nuclear Research (CERN) Large Hadron Collider (LHC). ALICE is built to study the fundamental properties of matter as it existed shortly after the Big Bang. This requires reading out millions of sensors at high frequency, enabling high statistics for physics analysis and resulting in a considerable computing demand in terms of network throughput and processing power. With the ALICE Run 3 upgrade [14], the requirements for a High Throughput Computing (HTC) online processing cluster increased significantly, due to more than an order of magnitude more data than in Run 2, resulting in a processing input rate of up to 900 GB/s. Online (real-time) event reconstruction allows the data stream to be compressed to 130 GB/s, which is stored on disk for physics analysis.
This thesis presents the implementation of the ALICE Event Processing Node (EPN) compute farm to cope with the Run 3 online computing challenges: building a data centre tailored to the ALICE requirements for the Run 3 and Run 4 EPN farm, and providing the operational conditions for the dynamic compute environment of a High Performance Computing (HPC) cluster with significant load changes in a short time span when a data-taking run is started or stopped. The EPN servers provide the required computing resources for online reconstruction and data compression. The farm includes network connectivity towards the First Level Processors (FLPs), requiring a reliable throughput of 900 GB/s between FLPs and EPNs, as well as connectivity from the internal InfiniBand network to the CERN Exabyte Object Storage (EOS) Ethernet network with more than 100 GB/s.
The results of operating the EPN computing infrastructure during the first year of Run 3 LHC collisions are described in the context of the ALICE experiment. The EPN farm delivered the expected performance for ALICE data-taking. The data centre environmental conditions have remained stable for more than two years, in particular during the starting and stopping of runs, which involve significant changes in IT load. Several unforeseen external circumstances led to increasing demands on the Online-Offline System (O2). Higher data rates than anticipated required the network performance to exceed the initial design specifications for the throughput between FLPs and EPNs. In particular, the high throughput from the internal EPN InfiniBand network towards the storage Ethernet network was one of the challenges to overcome.
Programmable hardware in the form of FPGAs has found its place in various high-energy physics experiments over the past few decades. These devices provide highly parallel and fully configurable data transport, data formatting, and data processing capabilities with custom interfaces, even in rigid or constrained environments. Additionally, FPGA functionalities and the number of their logic resources have grown exponentially in the last few years, making FPGAs more and more suitable for complex data processing tasks. ALICE is one of the four main experiments at the LHC and is specialized in the study of heavy-ion collisions. The readout chain of the ALICE detectors makes use of FPGAs in various places. The Read-Out Receiver Cards (RORCs) are one example of FPGA-based readout hardware, forming the interface between the custom detector electronics and the commercial server nodes in the data processing clusters of the Data Acquisition (DAQ) system as well as the High Level Trigger (HLT). These boards are implemented as server plug-in cards with serial optical links towards the detectors. Experimental data is received via more than 500 optical links, already partly pre-processed in the FPGAs, and pushed towards the host machines. Computer clusters consisting of a few hundred nodes collect, aggregate, compress, reconstruct, and prepare the experimental data for permanent storage and later analysis. With the end of the first LHC run period in 2012 and the start of Run 2 in 2015, the DAQ and HLT systems were renewed and several detector components were upgraded for higher data and event rates. Increased detector link rates and obsolete host interfaces made it impossible to reuse the previous RORCs in Run 2.
This thesis describes the development, integration, and maintenance of the next generation of RORCs for ALICE in Run 2. A custom hardware platform, initially developed as a joint effort between the ALICE DAQ and HLT groups in the course of this work, found its place in the Run 2 readout systems of the ALICE and ATLAS experiments. The hardware fulfills all experiment requirements, matches its target performance, and has been running stably in the production systems since the start of Run 2. Firmware and software developments for the hardware evaluation, the design of the board, the mass production hardware tests, as well as the operation of the final board in the HLT, were carried out as part of this work. A total of 74 boards were integrated into the HLT hardware and software infrastructure, with various firmware and software developments, to provide the main experimental data input and output interface of the HLT for Run 2. The hardware cluster finder, an FPGA-based data pre-processing core from the previous generation of RORCs, was ported to the new hardware. It has been improved and extended to meet the experimental requirements throughout Run 2. The throughput of this firmware component could be doubled and the algorithm extended, providing an improved noise rejection and an increased overall mean data compression ratio compared to its previous implementation. The hardware cluster finder forms a crucial component in the HLT data reconstruction and compression scheme, with one board delivering a processing performance equivalent to around ten server nodes for comparable processing steps in software.
The work on the firmware development, especially on the hardware cluster finder, once more demonstrated that developing and maintaining data processing algorithms with the common low-level hardware description methods is tedious and time-consuming. Therefore, a high-level synthesis (HLS) hardware description method applying dataflow computing at an algorithmic level to FPGAs was evaluated in this context. The hardware cluster finder served as an example of a typical data processing algorithm in a high-energy physics readout application. The existing and highly optimized low-level implementation provided a reference for comparisons in terms of throughput and resource usage. The cluster finder algorithm could be implemented in the dataflow description with comparably little effort, providing fast development cycles, compact code and, at the same time, simplified extension and maintenance options. The performance results in terms of throughput and resource usage are comparable to the manual implementation. The dataflow environment proved to be highly valuable for design space explorations. An integration of the dataflow description into the HLT firmware and software infrastructure was demonstrated as a proof of concept. A high-level hardware description could ease the design space exploration, the initial development, the maintenance, and the extension of hardware algorithms for high-energy physics readout applications.
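To give a flavour of the algorithm, the following highly simplified, software-only sketch groups neighbouring pads whose charge exceeds a threshold and computes the charge-weighted centre of gravity of each group. The real hardware cluster finder operates on streamed TPC data in the FPGA and in more than one dimension; the threshold and the input charges here are made up.

def find_clusters(charges, threshold=4):
    """Group contiguous above-threshold pads and return position and charge."""
    clusters, current = [], []
    for pad, q in enumerate(charges):
        if q > threshold:
            current.append((pad, q))
        elif current:
            total = sum(q for _, q in current)
            cog = sum(pad * q for pad, q in current) / total
            clusters.append({"position": cog, "charge": total})
            current = []
    if current:                               # flush a cluster at the end
        total = sum(q for _, q in current)
        clusters.append({"position": sum(p * q for p, q in current) / total,
                         "charge": total})
    return clusters

print(find_clusters([0, 2, 8, 15, 9, 3, 0, 0, 5, 12, 6, 1]))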
A Large Ion Collider Experiment (ALICE) is one of the four large experiments at the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN). ALICE focuses on the physics of the strong interaction and in particular on the Quark-Gluon Plasma, a state of matter in which quarks are de-confined and which is believed to have existed in the earliest moments of the evolution of the universe. The ALICE detector studies the products of collisions between heavy nuclei, between protons, and between protons and heavy nuclei. The sub-detector closest to the interaction point is the Inner Tracking System (ITS), which is used to measure the momentum and trajectory of the particles generated in the collisions and allows primary and secondary interaction vertices to be reconstructed. The ITS needs a high spatial resolution together with a low material budget, to limit the effect of multiple scattering on low-energy particles and to precisely reconstruct their trajectories. During the Long Shutdown 2 (2019-2020) of the LHC, the current ITS will be replaced by a completely redesigned sub-detector, which will improve the readout rate and the particle tracking performance, especially at low momentum.
The ALice PIxel DEtector (ALPIDE) chip was designed to meet the requirements of the upgraded ITS in terms of resolution, material budget, radiation hardness, and readout rate. The ALPIDE chip is a Monolithic Active Pixel Sensor (MAPS) realised in Complementary Metal-Oxide-Semiconductor (CMOS) technology. The sensing element, the analogue front-end, and the digital readout are integrated into the same silicon die. The readout architecture of the new ITS foresees that data is transmitted via a high-speed serial link directly from the ALPIDE to the off-detector electronics. The data is transmitted off-chip by a so-called Data Transmission Unit (DTU), which needs to be tolerant to radiation-induced Single-Event Effects in order to guarantee reliable operation. The ALPIDE chip will operate in a radiation field with a High-Energy Hadron peak flux of 7.7·10^5 cm^-2 s^-1.
The data are sent by the ALPIDE over copper cables to the readout system, which aggregates them and re-transmits them via optical fibres to the counting room. The position where the readout electronics will be placed is constrained by the maximum transmission distance reasonably achievable by the ALPIDE Data Transmission Unit and by mechanical constraints of the ALICE experiment. The radiation field at that location is not negligible in its effects on electronics: the high-energy hadron flux can reach 10^3 cm^-2 s^-1. Static-RAM (SRAM)-based Field Programmable Gate Arrays (FPGAs) are favoured over Application Specific Integrated Circuits (ASICs) or Radiation Hard by Design (RHBD) commercial devices because of cost effectiveness. Moreover, SRAM-based FPGAs are re-configurable and provide the data throughput required by the ITS. The main issue with SRAM-based FPGAs for the intended application is the susceptibility of their Configuration RAM (CRAM) to Single-Event Upsets: the number of CRAM bits is much higher than that of the logic they configure. The Total Ionizing Dose (TID) at the designed readout position is still acceptable for Commercial Off-The-Shelf (COTS) components, provided that proper verification is carried out.
This dissertation focuses on two parts of the design of the readout system: the Data Transmission Unit of the ALPIDE chip and the design of fundamental modules for the SRAM-based FPGA of the readout electronics. In the first part, a module of the Data Transmission Unit is designed, optimising the trade-off between power consumption, radiation tolerance, and jitter performance. The design was tested and thoroughly characterised, including tests under irradiation with 30 MeV protons. Furthermore, the Data Transmission Unit performance was validated after integration into the first prototypes of ITS modules. In the second part, the problem of developing a radiation-tolerant SRAM-based FPGA design is investigated and a solution is provided. First, a general methodology for designing radiation-tolerant Finite State Machines in SRAM-based FPGAs is analysed, implemented, and verified. Then, the radiation-tolerant FPGA design for the ITS readout is described together with the radiation-effect mitigation techniques that were selectively applied to the different modules. The design was tested in multiple irradiation campaigns and the results are reported.
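A back-of-the-envelope sketch of why configuration upsets matter at this location is given below. The high-energy hadron flux is the value quoted above; the per-bit SEU cross-section and the number of configuration bits are rough assumptions for a mid-size SRAM-based FPGA and are not measured values from the thesis.

# Rough estimate of the configuration-upset rate that motivates the mitigation.
heh_flux = 1e3          # high-energy hadrons / (cm^2 s), value quoted in the text
sigma_bit = 7e-15       # assumed SEU cross-section per CRAM bit, cm^2/bit
cram_bits = 50e6        # assumed number of configuration bits

upsets_per_second = heh_flux * sigma_bit * cram_bits
print(f"~{upsets_per_second * 3600:.2f} upsets/hour, "
      f"one upset every ~{1 / upsets_per_second / 3600:.1f} hours")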