Data Streaming in Hadoop: a Study of Real Time Data Pipeline Integration Between Hadoop Environments and External Systems

The field of distributed computing is growing and quickly becoming a natural part of the IT processes of large and small enterprises alike. This progress is driven by the cost effectiveness of distributed systems compared to centralized options, the physical limitations of single machines, and reliability concerns.

Several frameworks within the field aim to provide a standardized platform that facilitates the development and implementation of distributed services and applications. Apache Hadoop is one of those projects. Hadoop is a framework for distributed processing and data storage. It supports many modules for purposes such as distributed database management, security, data streaming and processing.

In addition to offering storage much cheaper than traditional centralized relational databases, Hadoop supports powerful methods of handling very large amounts of data as it streams through and is stored on the system. These methods are widely used for all kinds of big data processing in large IT companies that need low-latency, high-throughput processing of their data.

More and more companies are looking to adopt Hadoop in their IT processes. One of them is Unomaly, a company that offers agnostic, proactive anomaly detection. Its anomaly detection system analyses system logs to detect discrepancies and relies on large amounts of data to build an accurate picture of the target system. Integration with Hadoop would make it possible to consume very large amounts of data as it is streamed to Hadoop storage or to other parts of the system.

In this project, an integration layer application has been developed to allow Hadoop to integrate with Unomaly's system. Research has been conducted throughout the project to determine the best way of implementing the integration. The first part of the project's result is a proof-of-concept (PoC) application for real-time data pipelining between Hadoop clusters and the Unomaly system. The second part is a recommendation on how the integration should be designed, based on the studies conducted in the thesis work.
Source: KTH
Authors: Björk, Kim | Bodvill, Jonatan
