Overview of ETL Tool: StreamSets

Executive Summary: To present an overview of StreamSets and its development, its significance as a data integration platform, and the reasons to choose StreamSets over alternative products.

Timeline for Development History of StreamSets

2014  Established  StreamSets Inc
2015  Launched     Data Collector
2017  Launched     StreamSets Control Hub
2019  Released     Transformer Engine
2021  Launched     DataOps Platform

Contents

- Overview of StreamSets
- High Level Architecture
- Strengths / Weaknesses
- Competitive Position

Overview: StreamSets

- By 2014, the flood of big data had made Hadoop-based data lakes the storage system of choice for raw, unstructured data.
- IT firms were struggling to keep up with data sources that were no longer under their control.
- From Hadoop to Apache Kafka and then to Databricks, open source software transformed the hardware-centered data center into a decentralized system of increasingly specialized applications connected across owned and virtual systems.
- StreamSets was founded in 2014 to manage data integration by Arvind Prabhakar, a former Cloudera engineer, and Girish Pancha, a former Informatica product leader.
- It is a modern data integration platform for building smart data pipelines for DataOps across multi-cloud architectures.
- It automates as much as possible, abstracting away the “how” of the data pipeline to focus on the “what” of the data, so data teams spend less time fixing and more time doing.
- The founding premise was that the future of data infrastructure was not about schema and scale, but about managing change and automating as much as possible.
- In 2015, StreamSets launched Data Collector, an open source data execution engine for streaming data pipelines built with resiliency to data drift.
- Data Collector Engine tackled the problem of streaming data ingest, as Apache Kafka and Hadoop systems did.
- By simplifying batch, streaming, and CDC pipelines, Data Collector Engine became a tool of choice for thousands of organizations worldwide.
- In 2017, StreamSets Control Hub was introduced, providing a single software-as-a-service platform to design, deploy, monitor, and manage smart data pipelines at scale, on any cloud and on premises.
- In 2019, StreamSets Transformer Engine was released, adding ETL capabilities on Apache Spark.
- Humana, BT Group, Shell, and IBM have made StreamSets a core technology in their DataOps practice.
- In 2021, StreamSets brought all the functionality of Control Hub and the Data Collector and Transformer Engines into a fully managed service called StreamSets DataOps Platform.

StreamSets High Level Architecture / Data Pipeline Architectures

There are three basic data pipeline architectures, depending on the nature of the data being gathered and how it is used.

Batch Data Pipeline

Batch data pipelines move large sets of data at a particular time, in response to a behavior, or when a threshold is met. A batch data pipeline is often used for bulk ingestion or ETL processing. For example, a batch data pipeline might deliver data weekly or daily from a CRM system to a data warehouse, for use in a dashboard for reporting and business intelligence.

Streaming Data Pipeline

Streaming data pipelines flow data continuously from source to destination as it is created. They are used to populate data lakes, as part of data warehouse integration, or to publish to a messaging system or data stream. They are also used in event processing for real-time applications. For example, streaming data pipelines might provide real-time data to a fraud detection system or be used to monitor quality of service.

Change Data Capture Pipeline (CDC)

Change data capture pipelines are used to refresh data and keep multiple systems in sync. Instead of copying the entire database, only the changes made to the data since the last sync are shared. This can be particularly useful during a cloud migration project, when two systems are operating on the same data sets.

Data Engineering Platforms

There is a third way. A data engineering platform builds smart data pipelines according to DataOps principles. Smart data pipelines abstract away the “how” so you can focus on the what, who, and where of the data. This is the fundamental difference between data integration and data engineering. Instead of being perpetually under construction, out of order, or limited to a single platform, smart data pipelines allow you to move fast, with confidence that your data will continue to flow with little to no intervention.

Data engineering platforms allow you to:

- Design and deploy data pipelines in hours, not weeks or months
- Build in as much resiliency as possible to handle changes
- Adapt to new platforms by pointing to them, a task that takes minutes, not months

StreamSets: Strengths

1. User friendly, with no steep learning curve; non-technical personnel can also learn it quickly.
2. Runs in both on-premises and cloud environments.
3. Easy to use when connecting enterprise data stores such as OLTP databases or messaging systems such as Kafka. It enables users to create a data pipeline without coding knowledge, and without needing to know every database involved.
4. Built-in data drift resilience helps in ETL operations.
5. It helps resolve data sync issues coming from various sources and reduces the time needed to fix data drift breakages.
6. It offers many features for working with AWS, Azure, and Snowflake.
7. Saves licensing costs for some legacy software.
8. Reusable templates for certain use cases, such as moving data to the cloud. Job templates can be created once and reused by changing only their parameters.
9. Faster data transfer than in a Hadoop-based scenario.
10. Masking sensitive data such as PHI and PII is made easy.
11. Scheduling is easy in StreamSets.
12. Easy to connect to Hadoop using StreamSets.
13. It provides change data capture as soon as the source data has changed. Good technical customer support.
14. It is compatible with various source systems such as SQL Server, Oracle, and REST APIs.
15. Control Hub in the DataOps Platform manages load balancing.
16. Everything is in one place with StreamSets.
17. Good online documentation.

StreamSets: Weaknesses

1. Lack of folder structures for organizing pipelines and jobs in Control Hub.
2. The logging mechanism can be improved.
3. Visualization can be improved by adding a time dimension, so that changes can be seen over time.
4. Unable to read multiple tables from SAP HANA without querying them.
5. JDBC Lookup takes a long time to process.
6. Memory leak issues.

Competitive Positioning: StreamSets

Before deciding on StreamSets DataOps Platform as a data integration tool, users also consider the following alternatives:

- Informatica PowerCenter
- SQL Server Integration Services
- Fivetran
- Alteryx Designer
- AWS Glue
- Oracle GoldenGate
- Qlik Replicate
- Talend Data Fabric
- IBM DataStage
- Denodo Platform

Despite its short development history compared with these alternatives, the StreamSets team has developed the tool at a rapid pace. It has done a particularly good job of making StreamSets data pipelines resilient to data drift.

Summary

- Short development history: the company was founded in 2014 and launched its first product in 2015.
- Modern data integration platform for building smart data pipelines for DataOps across multi-cloud architectures.
- Automates as much as possible, shifting attention from the “how” of the data pipeline to the “what” of the data, so more time goes to doing than to fixing.
- StreamSets Data Collector (2015) is an open source data execution engine for streaming data pipelines, built with resilience to data drift.
- StreamSets Control Hub (2017) is a single software-as-a-service platform to design, deploy, monitor, and manage smart data pipelines.
- StreamSets Transformer Engine (2019) added ETL capabilities on Apache Spark.
- StreamSets DataOps Platform (2021) brought together all the functionality of Control Hub, Data Collector, and Transformer Engine.
- User-friendly UI, no steep learning curve, no coding required, built-in data drift resilience.

References

https://streamsets.com
https://www.peerspot.com/products/streamsets-reviews#review_2482491
https://www.gartner.com/reviews/market/data-integration-tools/vendor/streamsets/product/streamsets-dataops-platform/alternatives
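
Illustration: change data capture in miniature

The Change Data Capture Pipeline section says that only the changes made since the last sync are shared, rather than the entire database. The sketch below illustrates that idea in plain Python. It does not use StreamSets or any StreamSets API. The row layout, the `updated_at` field, and the helper names `changes_since` and `apply_changes` are assumptions made for illustration. Production CDC tools commonly read database transaction logs instead of comparing timestamps, and this sketch does not handle deleted rows.

```python
from datetime import datetime


def changes_since(rows, last_sync):
    """Return only the rows modified after the previous sync (hypothetical helper)."""
    return [row for row in rows if row["updated_at"] > last_sync]


def apply_changes(target, changed_rows):
    """Insert or overwrite the changed rows in the target, keyed by id."""
    for row in changed_rows:
        target[row["id"]] = dict(row)


# Source system: row 2 was edited and row 3 was created after the last sync.
source_rows = [
    {"id": 1, "name": "Alice", "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2023, 1, 5)},
    {"id": 3, "name": "Carol", "updated_at": datetime(2023, 1, 9)},
]

# Target system as it looked at the last sync on 3 January.
target = {
    1: {"id": 1, "name": "Alice", "updated_at": datetime(2023, 1, 1)},
    2: {"id": 2, "name": "Bobby", "updated_at": datetime(2023, 1, 2)},
}
last_sync = datetime(2023, 1, 3)

changed = changes_since(source_rows, last_sync)
apply_changes(target, changed)

print([row["id"] for row in changed])
print(target[2]["name"])
```

Only rows 2 and 3 are transferred. Row 1 is left untouched, which is what distinguishes this approach from a batch pipeline that copies the whole table on every run.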
