
Refine

Author

  • Adler, Alexander (1)
  • Alt, Torsten (1)
  • Bilanovic, Daniel (1)
  • Boettger, Stefan (1)
  • Engel, Heiko (1)
  • Gebelein, Jáno (1)
  • Grüll, Frederik (1)
  • Gómez Ramírez, Andrés (1)
  • Kalcher, Sebastian (1)
  • Lupi, Matteo (1)
+ more

Year of publication

  • 2012 (2)
  • 2013 (2)
  • 2017 (2)
  • 2014 (1)
  • 2015 (1)
  • 2016 (1)
  • 2018 (1)
  • 2019 (1)
  • 2020 (1)
  • 2021 (1)
+ more

Document Type

  • Doctoral Thesis (12)
  • Master's Thesis (1)

Language

  • English (13)

Has Fulltext

  • yes (13)

Is part of the Bibliography

  • no (13)

Keywords

  • ALICE (3)
  • FPGA (3)
  • Coding Scheme (1)
  • Dataflow Computing (1)
  • Detector Readout (1)
  • Erasure-Correcting Codes (1)
  • Error Mitigation (1)
  • Failure Erasure Code (1)
  • Fault Tolerance (1)
  • GPU (1)
+ more

Institute

  • Informatik (8)
  • Informatik und Mathematik (5)

13 search hits (1 to 10 shown)
Anan — a debugger for compute clusters (2021)
Adler, Alexander
The anan project is a tool for debugging distributed high-performance computers. The novelty of the contribution is that well-known methods, already used successfully for debugging software and hardware, have been transferred to high-performance computing. Within the scope of this work, a tool named anan was implemented that assists in locating faults. It can also be used as a more dynamic form of monitoring; both uses have been tested. The tool consists of two parts: (1) a part named anan, which is operated interactively by the user, and (2) a part named anand, which automatically collects the requested measurements and executes commands where necessary. The anan part runs sensors, small pattern-driven algorithms, whose results are merged by anan. To a first approximation, anan can be described as a monitoring system that (1) can be reconfigured quickly and (2) can measure more complex values that go beyond correlations of simple time series.
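The sensor idea can be pictured with a small sketch. The following Python fragment is not part of anan; it only illustrates, with hypothetical names, what a node-local sensor and a central merge step might look like: a small check collects a value on each node, and the interactive part combines the results into a derived quantity.

    # Illustrative sketch only (hypothetical API, not the actual anan/anand code):
    # a node-local sensor collects a measurement, a merge step combines results.
    import socket
    import time

    def load_sensor():
        """Read the 1-minute load average of the local node."""
        with open("/proc/loadavg") as f:
            load1 = float(f.read().split()[0])
        return {"host": socket.gethostname(), "time": time.time(), "load1": load1}

    def merge(results):
        """Derive a value that goes beyond a single time series,
        e.g. the spread between the most and least loaded nodes."""
        loads = [r["load1"] for r in results]
        return {"min": min(loads), "max": max(loads), "spread": max(loads) - min(loads)}

    if __name__ == "__main__":
        print(merge([load_sensor()]))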
Analysis of security isolation technologies for HEP-Computing (2017)
Bilanovic, Daniel
Virtual machines are, for the most part, not used in high-energy physics (HEP) environments. Even though they provide a high degree of isolation, the performance overhead they introduce is too great. With the rising number of container technologies and their increasing separation capabilities, HEP environments are evaluating whether they could utilize the technology. Container images are small and self-contained, which allows them to be distributed easily throughout the global environment. They also offer near-native performance while providing an often acceptable level of isolation. Only the needed services and libraries are packed into an image and executed directly by the host kernel. This work compared the performance impact of the three container technologies Docker, rkt and Singularity. The host kernel was additionally hardened with grsecurity and PaX to strengthen its security and make exploitation from inside a container harder. The execution time of a physics simulation was used as a benchmark. The results show that the container technologies differ in their impact on performance. The performance loss on a stock kernel is small; in some cases containers were even faster than running without one. Docker showed the best overall performance on a stock kernel. The difference on a hardened kernel was bigger than on a stock kernel, but in favor of the container technologies; rkt performed better than all the others in almost all cases.
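As a rough sketch of the kind of comparison described above (not the exact setup of the thesis), the following Python harness times the same workload under different runtimes; the image names and the simulation command are placeholders, not real artifacts from this work.

    # Hypothetical benchmark harness: run the same workload bare and inside
    # containers, compare mean wall-clock time. Commands and images are placeholders.
    import subprocess
    import time

    RUNTIMES = {
        "bare":        ["./simulation"],
        "docker":      ["docker", "run", "--rm", "sim-image", "./simulation"],
        "singularity": ["singularity", "exec", "sim.sif", "./simulation"],
    }

    def mean_runtime(cmd, repetitions=3):
        times = []
        for _ in range(repetitions):
            start = time.perf_counter()
            subprocess.run(cmd, check=True)
            times.append(time.perf_counter() - start)
        return sum(times) / len(times)

    for name, cmd in RUNTIMES.items():
        print(name, mean_runtime(cmd))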
On development, feasibility, and limits of highly efficient CPU and GPU programs in several fields (2013)
Rohr, David
With processor clock speeds having stagnated, parallel computing architectures have achieved a breakthrough in recent years. Emerging many-core processors like graphics cards run hundreds of threads in parallel, and vector instructions are experiencing a revival. Parallel processors with many independent but simple arithmetic logic units fail to execute serial tasks efficiently. However, their sheer parallel processing power makes them predestined for parallel applications, while the simple construction of their cores makes them unbeatably power efficient. Unfortunately, old programs cannot profit from simple recompilation. Adaptation often requires rethinking and modifying algorithms to make use of parallel execution. Many applications have some serial subroutines that are very hard to parallelize, hence contemporary compute clusters are often heterogeneous, offering fast processors for serial tasks and parallel processors for parallel tasks. In order not to waste the available compute power, highly efficient programs are mandatory. This thesis is about the development of fast algorithms and their implementations on modern CPUs and GPUs, about the maximum achievable efficiency with respect to peak performance and power consumption, and about the feasibility and limits of programs for CPUs, GPUs, and heterogeneous systems. Three very different applications from distinct fields, developed within the scope of this thesis, are presented. The ALICE experiment at the LHC particle collider at CERN studies heavy-ion collisions at high rates of several hundred Hz, with every collision producing thousands of particles whose trajectories must be reconstructed. For this purpose, ALICE track reconstruction and ALICE track merging have been adapted for GPUs and deployed on 64 GPU-enabled compute nodes at CERN. After a testing phase, the tracker ran in nonstop operation during 2012, providing full real-time track reconstruction. The tracker employs a multithreaded pipeline as well as asynchronous data transfer to ensure continuous GPU utilization and outperforms the fastest available CPUs by about a factor of three. The Linpack benchmark is the standard tool for ranking compute clusters. It solves a dense system of linear equations, primarily using matrix multiplication facilitated by a routine called DGEMM. A heterogeneous GPU-enabled version of DGEMM and Linpack has been developed, which can utilize the CAL, CUDA, and OpenCL APIs as backends. Employing this implementation, the LOEWE-CSC cluster was ranked 22nd in the November 2010 Top500 list of the fastest supercomputers, and the Sanam cluster achieved second place in the November 2012 Green500 list of the most power efficient supercomputers. An elaborate lookahead algorithm, a pipeline, and asynchronous data transfer hide the serial CPU-bound tasks of Linpack behind DGEMM execution on the GPU, reaching the highest efficiency on GPU-accelerated clusters. Failure erasure codes enable fault-tolerant storage of data and real-time failover, ensuring that servers and even complete data centers remain operational in case of a hardware defect. They are an absolute necessity for present-day computing infrastructure. The mathematical theory behind the codes involves matrix computations over finite fields, which are not natively supported by modern processors and are hence computationally very expensive.
This thesis presents a novel scheme for fast encoding-matrix generation and demonstrates a fast implementation of the encoding itself, which uses exclusively either integer or logical vector instructions. Depending on the scenario, the implementation always hits one of several hard hardware limits: the maximum attainable memory bandwidth, the peak instruction throughput, or the PCI Express bandwidth limit when GPUs or FPGAs are used. The thesis demonstrates that, in most cases, GPU implementations can be as efficient as their CPU counterparts with respect to the available peak performance. With respect to cost or power consumption, they are much more efficient. For this purpose, complex tasks must be split into serial and parallel parts, and the execution must be pipelined such that the CPU-bound tasks are hidden behind GPU execution. A few cases are identified where this is not possible due to PCI Express limitations, or not reasonable because practical GPU languages are missing.
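The pipelining argument, hiding serial CPU-bound work behind accelerator execution, can be sketched generically. The following Python fragment is not taken from the thesis; it merely mimics the structure with a background thread standing in for the GPU: while chunk i runs on the "GPU", the CPU already prepares chunk i+1.

    # Generic double-buffering sketch (illustration only): overlap serial CPU
    # preparation of the next chunk with the bulk work on the previous one.
    from concurrent.futures import ThreadPoolExecutor

    def cpu_prepare(i):
        return f"prepared-{i}"          # serial, CPU-bound step

    def gpu_compute(data):
        return f"result-of-{data}"      # bulk parallel step (stand-in for the GPU)

    def pipeline(n_chunks):
        results, pending = [], None
        with ThreadPoolExecutor(max_workers=1) as gpu:
            for i in range(n_chunks):
                prepared = cpu_prepare(i)                    # CPU works here ...
                if pending is not None:
                    results.append(pending.result())
                pending = gpu.submit(gpu_compute, prepared)  # ... while the GPU runs
            results.append(pending.result())
        return results

    print(pipeline(4))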
FPGA fault tolerance in radiation environments (2016)
Gebelein, Jáno
The constantly increasing memory density and performance of recent Field Programmable Gate Arrays (FPGAs) have boosted their usage in many technical applications such as particle accelerators, the automotive industry, defense, and space. Some of these fields are characterized by the presence of ionizing radiation caused by natural decay or artificial excitation processes. Unfortunately, this type of radiation affects various digital circuits, including the transistors forming the Static Random Access Memory (SRAM) storage cells that constitute the technology node for high-performance FPGAs. The consequences are various kinds of temporary or permanent digital misbehavior as well as the physical destruction of transistors. Therefore, the mitigation of such effects becomes an essential design rule when using SRAM FPGAs in ionizing radiation environments. Tolerance against soft errors can be addressed across various layers of modern FPGA design, starting with the basic silicon manufacturing process, continuing through configuration, firmware, and system design, and finally ending with application and software engineering. But only a highly optimized, joint concept of system-wide fault tolerance provides sufficient resilience against ionizing radiation effects without losing too many valuable device resources to the safety approach. This concept is introduced, analyzed, improved, and validated in the present work. It includes, but is not limited to, static configuration scrubbing, various firmware redundancy approaches, dynamic memory conservation, and state machine protection. Guidelines are given to improve manual design practices concerning fault tolerance, and tools are shown that reduce the necessary effort. Finally, the SysCore development platform has been maintained to support the recommended design methods and to act as the Device Under Test (DUT) for all particle irradiation experiments that prove the efficiency of the proposed concept of system-wide fault tolerance for SRAM FPGAs in ionizing radiation environments.
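One of the techniques named above, static configuration scrubbing, can be outlined in a few lines. The sketch below is schematic Python, not firmware; read_frame and write_frame are hypothetical stand-ins for the device-specific configuration interface.

    # Schematic configuration scrubbing loop: periodically read back the
    # configuration memory, compare against a golden image, rewrite on mismatch.
    import time

    def scrub(golden_frames, read_frame, write_frame, period_s=1.0):
        while True:
            for addr, golden in enumerate(golden_frames):
                if read_frame(addr) != golden:   # upset detected in this frame
                    write_frame(addr, golden)    # repair by rewriting the frame
            time.sleep(period_s)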
An FPGA-based preprocessor for the ALICE High-Level-Trigger (2017)
Alt, Torsten
The ALICE High-Level-Trigger (HLT) is a large-scale computing farm designed and constructed for the real-time reconstruction of particle interactions (events) inside the ALICE detector. The reconstruction of such events is based on the raw data produced in collisions inside ALICE at the Large Hadron Collider. The online reconstruction in the HLT allows triggering on certain event topologies and a significant data reduction by applying compression algorithms. Moreover, it enables real-time verification of the quality of the data. To receive the raw data from the various sub-detectors of ALICE, the HLT is equipped with 226 custom-built FPGA-based PCI-X cards, the H-RORCs. The H-RORC interfaces the detector readout electronics to the nodes of the HLT farm. In addition to transferring the raw data, 108 H-RORCs host 216 Fast-Cluster-Finder (FCF) processors for the Time-Projection-Chamber (TPC). The TPC is the main tracking detector of ALICE and contributes up to 16 GB/s, more than 90% of the overall data volume. The FCF processor implements the first of two steps in the data reconstruction of the TPC: it calculates space points and their properties from the charge clouds (clusters) created by charged particles traversing the TPC's gas volume. Those space points are not only the basis for the tracking algorithm, but also allow for a Huffman-based data compression, which reduces the data volume by a factor of 4 to 6. The FCF processor is designed to cope with any incoming data rate up to the maximum bandwidth of the incoming optical link (160 MB/s) without creating back-pressure towards the detector's readout electronics. A performance comparison with the software implementation of the algorithm shows a speedup factor of about 20 compared with one AMD Opteron 6172 core @ 2.1 GHz, the CPU type used in the HLT during the LHC Run1 campaign. A comparison with an Intel E5-2690 core @ 3.0 GHz, the CPU type used by the HLT for the LHC Run2 campaign, results in a speedup factor of 8.5. In total numbers, the 216 FCF processors provide the computing performance of 4255 AMD Opteron cores or 2203 Intel cores of the previously mentioned types. The performance of the reconstruction with respect to the physics analysis is equivalent to or better than that of the official ALICE Offline clusterizer. Therefore, ALICE data taking was switched in 2011 to recording only the compressed FCF clusters, discarding the raw data from the TPC. Due to the capability to compress the clusters, the recorded data volume could be increased by a factor of 4 to 6. For the LHC Run3 campaign, starting in 2020, the FCF forms the foundation of the ALICE data taking and processing strategy. The raw data volume (before processing) of the upgraded TPC will exceed 3 TB/s. As a consequence, online processing of the raw data and compression of the results before the data enters the online computing farms is an essential and crucial part of the computing model. Within the scope of this thesis, the H-RORC card and the FCF processor were developed and built from scratch. The thesis covers the conceptual design, optimisation, and implementation, as well as the verification, and is completed by performance benchmarks and experience from real data taking.
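What a cluster finder computes can be illustrated in simplified, one-dimensional form. The sketch below is conceptual Python, not the FCF firmware, which works on the two-dimensional pad/time plane of the TPC: adjacent pads above threshold are grouped and each group is reduced to a total charge and a charge-weighted centroid.

    # Conceptual 1D cluster finder: group adjacent pads above threshold and
    # compute total charge and charge-weighted centroid per cluster.
    def find_clusters(charges, threshold=2):
        clusters, current = [], []
        for pad, q in enumerate(charges):
            if q > threshold:
                current.append((pad, q))
            elif current:
                total = sum(c for _, c in current)
                centroid = sum(p * c for p, c in current) / total
                clusters.append({"charge": total, "position": centroid})
                current = []
        if current:
            total = sum(c for _, c in current)
            clusters.append({"charge": total,
                             "position": sum(p * c for p, c in current) / total})
        return clusters

    print(find_clusters([0, 1, 5, 9, 4, 0, 0, 3, 6, 2]))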
Conceptual design of an ALICE Tier-2 centre integrated into a multi-purpose computing facility (2012)
Zynovyev, Mykhaylo
This thesis discusses the issues and challenges associated with the design and operation of a data analysis facility for a high-energy physics experiment at a multi-purpose computing centre. In the spotlight is a Tier-2 centre of the distributed computing model of the ALICE experiment at the Large Hadron Collider at CERN in Geneva, Switzerland. The design steps examined in the thesis include analysis and optimization of the I/O access patterns of the user workload, integration of the storage resources, and development of techniques for effective system administration and operation of the facility in a shared computing environment. A number of I/O performance issues on multiple levels of the I/O subsystem, introduced by the use of hard disks for data storage, have been addressed by means of exhaustive benchmarking and thorough analysis of the I/O of the user applications in the ALICE software framework. Defining the set of requirements for the storage system, describing the potential performance bottlenecks and single points of failure, and examining possible ways to avoid them allows one to develop guidelines for selecting how to integrate the storage resources. A solution for preserving a specific software stack for the experiment in a shared environment is presented along with its effects on the performance of the user workload. A proposal for a flexible model to deploy and operate the ALICE Tier-2 infrastructure and applications in a virtual environment, through the adoption of cloud computing technology and the 'Infrastructure as Code' concept, completes the thesis. Scientific software applications can be computed efficiently in a virtual environment, and there is an urgent need to adapt the infrastructure for effective usage of cloud resources.
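As a rough illustration of the kind of I/O access-pattern measurement mentioned above (not the actual benchmarks of the thesis), the following Python snippet compares sequential and randomized block reads from a test file; the file name, block size, and block count are arbitrary example values, and the file must hold at least N blocks.

    # Hypothetical access-pattern micro-benchmark: sequential vs. random reads.
    import random
    import time

    BLOCK, N, PATH = 4096, 10000, "testfile.bin"

    def read_blocks(path, offsets, block_size=BLOCK):
        start = time.perf_counter()
        with open(path, "rb") as f:
            for off in offsets:
                f.seek(off)
                f.read(block_size)
        return time.perf_counter() - start

    sequential = [i * BLOCK for i in range(N)]
    scattered = sequential[:]
    random.shuffle(scattered)
    print("sequential:", read_blocks(PATH, sequential), "s")
    print("random:    ", read_blocks(PATH, scattered), "s")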
An erasure-resilient and compute-efficient coding scheme for storage applications (2013)
Kalcher, Sebastian
Driven by rapid technological advancements, the amount of data that is created, captured, communicated, and stored worldwide has grown exponentially over the past decades. Along with this development, it has become critical for many disciplines of science and business to be able to gather and analyze large amounts of data. The sheer volume of the data often exceeds the capabilities of classical storage systems, with the result that current large-scale storage systems are highly distributed and composed of a large number of individual storage components. As with any other electronic device, the reliability of storage hardware is governed by certain probability distributions, which in turn are influenced by the physical processes used to store the information. The traditional way to deal with the inherent unreliability of combined storage systems is to replicate the data several times. Another popular approach to achieve failure tolerance is to calculate the block-wise parity in one or more dimensions. With better understanding of the different failure modes of storage components, it has become evident that sophisticated high-level error detection and correction techniques are indispensable for the ever-growing distributed systems. The utilization of powerful cyclic error-correcting codes, however, comes with a high computational penalty, since the required operations over finite fields do not map very well onto current commodity processors. This thesis introduces a versatile coding scheme with fully adjustable fault tolerance that is tailored specifically to modern processor architectures. To reduce stress on the memory subsystem, the conventional table-based algorithm for multiplication over finite fields has been replaced with a polynomial version. This arithmetically intensive algorithm is better suited to the wide SIMD units of currently available general-purpose processors, but also displays significant benefits when used with modern many-core accelerator devices (for instance, the popular general-purpose graphics processing units). A CPU implementation using SSE and a GPU version using CUDA are presented. The performance of the multiplication depends on the distribution of the polynomial coefficients in the finite-field elements. This property has been used to construct suitable matrices that generate a linear systematic erasure-correcting code with significantly increased multiplication performance for the relevant matrix elements. Several approaches to obtaining the optimized generator matrices are elaborated and their implications are discussed. A Monte-Carlo-based construction method makes it possible to influence the specific shape of the generator matrices and thus to adapt them to special storage and archiving workloads. Extensive benchmarks on CPUs and GPUs demonstrate the superior performance of this novel erasure-resilient coding scheme and illustrate its future application scenarios.
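The finite-field arithmetic at the core of such a coding scheme can be shown in its scalar form. The sketch below is plain Python for illustration only; the thesis works with vectorized (SIMD and GPU) variants of this operation, and the coefficients used here are arbitrary example values. 0x11b is the common irreducible polynomial x^8 + x^4 + x^3 + x + 1 for GF(2^8).

    # Scalar polynomial ("carry-less") multiplication in GF(2^8), illustration only.
    def gf256_mul(a, b, poly=0x11b):
        result = 0
        while b:
            if b & 1:
                result ^= a        # addition in GF(2^8) is XOR
            a <<= 1
            if a & 0x100:
                a ^= poly          # reduce modulo the irreducible polynomial
            b >>= 1
        return result

    # A parity block of a systematic code is a GF(2^8) linear combination of the
    # data blocks, with coefficients taken from a generator-matrix row (example values).
    data   = [0x12, 0x34, 0x56]
    coeffs = [0x01, 0x02, 0x03]
    parity = 0
    for c, d in zip(coeffs, data):
        parity ^= gf256_mul(c, d)
    print(hex(parity))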
Acceleration of biomedical image processing and reconstruction with FPGAs (2014)
Grüll, Frederik
Increasing chip sizes and better programming tools have made it possible to push the boundaries of application acceleration with reconfigurable computer chips. In this thesis, the potential of acceleration with Field Programmable Gate Arrays (FPGAs) is examined for applications that perform biomedical image processing and reconstruction. The dataflow paradigm was used to port the analysis of image data for localization microscopy and for 3D electron tomography from an imperative description to the FPGA for the first time. After the primitives of image processing on FPGAs are presented, a general workflow is given for analyzing imperative source code and converting it into a hardware pipeline in which every node processes image data in parallel. The theoretical foundation is then used to accelerate both example applications. For localization microscopy, a speedup factor of 185 compared to an Intel i5 450 CPU was achieved, and electron tomography could be sped up by a factor of 5 over an Nvidia Tesla C1060 graphics card, while maintaining full accuracy in both cases.
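The dataflow idea, every node consuming and emitting a stream so that all stages work concurrently, can be mimicked in software. The sketch below uses Python generators purely to illustrate the structure, with made-up stage names; on the FPGA each stage is a hardware block and the stages genuinely run in parallel.

    # Structural illustration of a dataflow pipeline (concurrency only mimicked here).
    def source(frames):
        for frame in frames:
            yield frame

    def subtract_background(stream, background=1):
        for frame in stream:
            yield [max(p - background, 0) for p in frame]

    def threshold(stream, cutoff=3):
        for frame in stream:
            yield [p if p >= cutoff else 0 for p in frame]

    frames = [[1, 4, 2, 6], [0, 5, 7, 1]]
    for out in threshold(subtract_background(source(frames))):
        print(out)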
Radiation mitigation for SRAM-Based FPGAs in the CBM experiment (2015)
Manz, Sebastian Andreas
Detectors of modern high-energy physics experiments generate huge data rates during operation. The efficient read-out of this data from the front-end electronics is a sophisticated task; the main challenges, however, may vary from experiment to experiment. The Compressed Baryonic Matter (CBM) experiment, currently under construction at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt, Germany, foresees a novel approach to data acquisition. Unlike previous comparable experiments that organize data read-out based on global, hierarchical trigger decisions, CBM is based on free-running and self-triggered front-end electronics. Data is pushed to the next stage of the read-out chain rather than pulled from the buffers of the previous stage. This new paradigm requires a completely new development of read-out electronics. As one part of this thesis, a firmware for a read-out controller interfacing such a free-running and self-triggered front-end ASIC, the GET4 chip, was implemented. The firmware in question was developed to run on a Field Programmable Gate Array (FPGA). An FPGA is an integrated circuit whose behavior can be reconfigured "in the field", which offers a lot of flexibility: bugs can be fixed and completely new features can be added even after the hardware has been installed. Due to these general advantages, the usage of FPGAs is desired for the final experiment. However, there is also a drawback to the usage of FPGAs. The only affordable FPGAs today are based on either SRAM or Flash technology, and neither can easily be operated in a radiation environment. SRAM-based devices suffer severely from Single Event Upsets (SEUs), and Flash-based FPGAs deteriorate too fast from Total Ionizing Dose (TID) effects. Several radiation mitigation techniques exist for SRAM-based FPGAs, but careful evaluation is required for each use case. For CBM it is not clear whether the higher resource consumption of added redundancy, which more or less directly translates into additional cost, outweighs the advantages of using FPGAs. In addition, it is not even clear whether radiation mitigation techniques (e.g. scrubbing) that have already been put into operation successfully in space applications work as efficiently at the much higher particle rates expected at CBM. In this thesis, existing radiation mitigation techniques have been analyzed and suitable techniques have been implemented for the above-mentioned read-out controller. To minimize additional costs, redundancy was only implemented for selected parts of the design. Finally, the radiation-mitigated read-out controller was tested by mounting the device directly in a particle beam at Forschungszentrum Jülich. The tests show that the radiation mitigation effect of the implemented techniques remains sound, even at a very high particle flux and with only part of the design protected by costly redundancy. The promising results of the in-beam tests suggest using FPGAs in the read-out chain of the CBM-ToF detector.
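One standard mitigation technique for the SEUs mentioned above, triple modular redundancy (TMR), reduces to a majority vote over three copies of the same logic. The sketch below shows only the voting principle in Python; in firmware this is instantiated per register, not in software, and it is an illustration rather than the design of this thesis.

    # Majority voter over three redundant copies: a single upset copy is masked.
    def vote(a, b, c):
        return (a & b) | (a & c) | (b & c)     # bitwise majority

    copies = [0b1011, 0b1011, 0b0011]          # third copy hit by a single event upset
    print(bin(vote(*copies)))                  # -> 0b1011, the upset is masked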
Virtual machine scheduling in dedicated computing clusters (2012)
Boettger, Stefan
Time-critical applications process a continuous stream of input data and have to meet specific timing constraints. A common approach to ensure that such an application satisfies its constraints is over-provisioning: the application is deployed in a dedicated cluster environment with enough processing power to achieve the target performance for every specified data input rate. This approach comes with a drawback: at times of decreased data input rates, the cluster resources are not fully utilized. A typical use case is the HLT-Chain application that processes physics data at runtime of the ALICE experiment at CERN. From the perspective of cost and efficiency, it is desirable to exploit temporarily unused cluster resources. Existing approaches aim for that goal by running additional applications. These approaches, however, (a) lack the flexibility to dynamically grant the time-critical application the resources it needs, (b) are insufficient for isolating the time-critical application from harmful side effects introduced by additional applications, or (c) are not general because application-specific interfaces are used. In this thesis, a software framework is presented that makes it possible to exploit unused resources in a dedicated cluster without harming a time-critical application. Additional applications are hosted in Virtual Machines (VMs), and unused cluster resources are allocated to these VMs at runtime. In order to avoid resource bottlenecks, the resource usage of the VMs is dynamically modified according to the needs of the time-critical application. For this purpose, a number of methods that had not previously been combined are used. On a global level, appropriate VM manipulations like hot migration, suspend/resume, and start/stop are determined by an informed search heuristic and applied at runtime. Locally on the cluster nodes, a feedback-controlled adaptation of VM resource usage is carried out in a decentralized manner. Employing this framework makes it possible to increase a cluster's usage by running additional applications while at the same time preventing negative impact on a time-critical application. This capability of the framework is shown for the HLT-Chain application: in an empirical evaluation, the cluster CPU usage was increased from 49% to 79%, additional results were computed, and no negative effects on the HLT-Chain application were observed.
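The local, feedback-controlled adaptation can be pictured with a minimal control loop. The sketch below is hypothetical Python, not the framework itself: if the time-critical application falls behind its target processing rate, the CPU share granted to the guest VMs is reduced; otherwise it is cautiously increased. Function and parameter names are illustrative placeholders.

    # Minimal illustrative feedback step for the per-node VM CPU share (percent).
    def adjust_vm_share(current_share, measured_rate, target_rate,
                        step=5, low=0, high=100):
        if measured_rate < target_rate:              # time-critical app falls behind
            return max(low, current_share - step)    # throttle the guest VMs
        return min(high, current_share + step)       # otherwise grant more CPU

    share = 50
    for rate in [1000, 950, 900, 1010, 1020]:        # example processing rates
        share = adjust_vm_share(share, rate, target_rate=1000)
        print(share)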