    Understanding Deduplication Methods: Ways to Preserve the Integrity of Your Data | by Rendy Dalimunthe | Dec, 2024

By Team_AIBS News · December 20, 2024


Growing data volumes and complexity have made data deduplication more relevant than ever

    Towards Data Science

Data duplication continues to be a problem for many organisations. Although data processing and storage systems have evolved rapidly alongside technological advances, the complexity of the data being produced is also increasing. Moreover, with the proliferation of Big Data and the use of cloud-based applications, today's organisations must increasingly deal with fragmented data sources.

Photo by Damir: https://www.pexels.com/photo/serene-lakeside-reflection-with-birch-trees-29167854/

Ignoring a large volume of duplicated data can have a negative impact on the organisation, such as:

• Disruption of the decision-making process. Unclean data can bias metrics so that they no longer reflect actual conditions. For example, if a single customer is represented as two or three customer records in the CRM, revenue projections will be distorted.
• Swelling storage costs, since every duplicated piece of data takes up storage space.
• Disruption of the customer experience. For example, if the system sends notifications or emails to customers, a customer whose data is duplicated will very likely receive more than one notification.
• A sub-optimal AI training process. When an organisation starts developing an AI solution, one of the requirements is training on clean data. If many duplicates remain, the data cannot be considered clean, and using it for AI training anyway will likely produce a biased model.

Given the critical impact when an organisation makes no effort to reduce or eliminate data duplication, the process of data deduplication becomes increasingly relevant. It is also essential for ensuring data quality. The growing sophistication and complexity of systems must be matched by the evolution of adequate deduplication techniques.

In this article, we will examine three modern deduplication methods that practitioners can use as a reference when planning a deduplication process.

Global Deduplication is the process of eliminating duplicate data across multiple storage locations. It is now common for organisations to store their data across multiple servers, data centres, or the cloud. Global deduplication ensures that only one copy of the data is stored.

This method works by creating a global index: a list of all existing data, in the form of a unique code (hash) generated by an algorithm such as SHA-256, that represents each piece of data. When a new file is uploaded to a server (for example, Server 1), the system stores a unique code for that file.

On another day, when a user uploads a file to Server 2, the system compares the new file's unique code against the global index. If the new file turns out to have the same hash as an entry in the global index, then instead of storing the same file in two places, the system replaces the duplicate file on Server 2 with a reference (pointer) to the copy that already exists on Server 1.
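The hash-then-reference flow can be sketched in a few lines of Python. This is only a minimal in-memory illustration: the class and method names are my own, and a real system would persist files and the index rather than hold them in a dictionary.

```python
import hashlib

class GlobalDedupIndex:
    """Minimal sketch of a global deduplication index.

    Maps a SHA-256 hash to the (server, path) where the single
    physical copy lives; later uploads of identical content are
    recorded as references instead of new copies.
    """

    def __init__(self):
        self.index = {}       # hash -> (server, path) of the original copy
        self.references = []  # ((server, path), original) pointer entries

    def store(self, server: str, path: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.index:
            # Duplicate content: keep only a pointer to the existing copy.
            self.references.append(((server, path), self.index[digest]))
            return "reference"
        # First occurrence: record it as the original physical copy.
        self.index[digest] = (server, path)
        return "stored"

index = GlobalDedupIndex()
print(index.store("server1", "/data/report.pdf", b"quarterly numbers"))   # stored
print(index.store("server2", "/backup/report.pdf", b"quarterly numbers")) # reference
```

Note that identical content yields an identical SHA-256 digest regardless of which server it arrives on, which is what makes a single cross-server index possible.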

This method clearly saves storage space. And if it is combined with a Data Virtualisation technique, then when the file is needed the system fetches it from its original location, while every user still perceives the data as residing on their own server.

The illustration below shows how Global Deduplication works: each server stores only one copy of the original data, and duplicates on other servers are replaced by references to the original file.

Source: Author

It should be noted that Global Deduplication does not work in real time, but post-process, meaning the method can only be applied after the file has entered storage.

Unlike Global Deduplication, Inline Deduplication works in real time, right as data is being written to the storage system. With this technique, duplicate data is immediately replaced with references, without ever going through physical storage.

The process begins when data is about to enter the system or a file is being uploaded: the system immediately divides the file into several small pieces, or chunks. Using an algorithm such as SHA-256, each chunk is then given a hash value as its unique code. Example:

    Chunk1 -> hashA

Chunk2 -> hashB

    Chunk3 -> hashC

The system then checks whether any of the chunks have hashes that are already in the storage index. If a chunk's unique code is already present, the system does not store that chunk's physical data again, but only stores a reference to the location of the original chunk saved earlier.

Every unique chunk, meanwhile, is stored physically.

Later, when a user wants to access the file, the system reassembles the data from the stored chunks based on the references, so that the complete file can be used.
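The whole chunk-hash-reference cycle can be sketched as follows. This is a toy, assuming fixed-size chunks and an in-memory chunk store; production systems typically use variable-size (content-defined) chunking with KB-to-MB chunk sizes.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for the example; real systems use far larger chunks

chunk_store = {}  # hash -> chunk bytes (each unique chunk stored once)

def write_file(data: bytes) -> list[str]:
    """Split data into fixed-size chunks, physically store each unique
    chunk once, and represent the file as a list of chunk-hash references."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:
            chunk_store[digest] = chunk  # physical write only for new chunks
        refs.append(digest)              # duplicates become references
    return refs

def read_file(refs: list[str]) -> bytes:
    """Reassemble the complete file from its chunk references."""
    return b"".join(chunk_store[d] for d in refs)

refs = write_file(b"ABCDABCDEFGH")          # chunks: ABCD, ABCD, EFGH
print(len(refs), "chunks referenced,", len(chunk_store), "stored physically")
```

Here the repeated `ABCD` chunk is written once but referenced twice, which is exactly where the storage saving comes from.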

Inline Deduplication is widely used by cloud service providers such as Amazon S3 or Google Drive. This method is very useful for optimising storage capacity.

The simple illustration below shows the Inline Deduplication process, from data chunking to how the data is accessed.

Source: Author

Machine-learning-powered deduplication uses AI to detect and remove duplicate data, even when it is not completely identical.

The process begins when incoming data, such as files, documents, or records, is sent to the deduplication system for analysis. For example, the system receives two scanned documents that at first glance look similar but actually have subtle differences in layout or text format.

The system then intelligently extracts important features, usually in the form of metadata or visual patterns. These features are analysed and compared for similarity. The similarity of a feature is represented as a score, and each system or organisation can define whether data counts as a duplicate based on that score. For example: only records with a similarity score above 90% might be treated as likely duplicates.

Based on the similarity score, the system decides whether the data is a duplicate. If it is recognised as one, the same step used by the other deduplication methods can be taken: for the duplicate data, only a reference is stored.
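A toy sketch of similarity-based deduplication is below. As a stand-in for a real ML feature extractor, it uses difflib's SequenceMatcher to score text similarity; the 0.90 threshold mirrors the example above, and the record strings are invented for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.90  # organisation-defined cut-off for "likely duplicate"

records = []      # originals kept in full
references = []   # (near_duplicate, matched_original) pairs

def similarity(a: str, b: str) -> float:
    """Score two records in [0, 1]; a real system would compare
    extracted features (metadata, visual patterns) instead of raw text."""
    return SequenceMatcher(None, a, b).ratio()

def ingest(record: str) -> str:
    """Store the record, or keep only a reference if it is
    sufficiently similar to an existing original."""
    for original in records:
        if similarity(record, original) >= THRESHOLD:
            references.append((record, original))
            return "reference"
    records.append(record)
    return "stored"

print(ingest("Invoice #1042, Acme Corp, total $1,250.00"))   # stored
print(ingest("Invoice #1042  Acme Corp  total $1,250.00"))   # reference
```

The second record differs only in punctuation and spacing, so its similarity score clears the threshold and it is kept as a reference to the first, just as the near-identical scanned documents in the example above would be.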

What is interesting about ML-enhanced deduplication is that it allows human involvement to validate the classifications the system has made, so the system can keep getting smarter from the inputs it has learned (a feedback loop).

However, it should be noted that unlike Inline Deduplication, ML-enhanced deduplication is not well suited to real-time use. This is due to latency: ML takes time to extract features and process the data. In addition, forcing it to run in real time would require far more computing resources.

Although not real-time, the benefits it brings are still substantial, especially its ability to handle unstructured or semi-structured data.

The following is an illustration of the steps of ML-enhanced deduplication, with examples.

Source: Author


