Skip to main content

Command Palette

Search for a command to run...

Overlapping intervals to remove Redundant User Highlights

Updated
6 min read
P

Learning from my experiences while joking about the stupid mistakes I did in my prior code!

How did I come across this?

  • Recently I started working on a project titled Kindle-Clippings: A tool that makes organizing, accessing and using kindle highlights efficient and easy.

  • While developing the tool, I came across the way a kindle device stores user highlights.

    • The good thing, it takes the highlight, it’s page number, loc(line of content) and timestamp information.

    • The bad thing, it stores all the highlights, and edits to the highlights.

    • This means if you highlight once, change the highlight, or edit it. All the different versions of highlights will be saved.

This would result in user highlights containing redundant, half-highlighted quotes and data. Bad user experience XC.

Problem Statement : Remove Redundant Highlights from user highlights file

  • Now, I had to figure out how to remove the highlights while saving backend resources as I was limited by compute power on the render instance I was using.

  • I came up with a couple of ideas

    • Checking each highlight with other highlights

    • Doing a plag-check and removing similar ones

    • checking overlapping intervals and adding a time constraint

Solution 1 - Check each Highlight with other highlights

  • This approach is very naive.

  • The second I thought this, I knew it wasn’t going to work.

  • Given I have on average let’s say 3000 highlights, I would be processing 9000000 highlights in total (n*n).

  • This time complexity was a big no.

  • But we got 1 improvement in this stage, instead of comparing each highlight with each other, what we will be doing is compare each highlight with other highlights for the same book!

    • This would drop the time to O(n * m²) where M was smaller

Solution 2 - Doing a Plagiarism Check and removing similar highlights

  • The roots of this idea go to a professor from our college who used to plag-check our assignments.

    • This works for 180-200 students, but checking for each highlight with the other even in the same book is time consuming as the previous approach mentioned.

    • To reduce time complexity even more, I set a location limit for the plag-check by

      • sorting the quotes by start location

      • plagiarism checking only within a fixed range of location around highlights own location. i.e. highlight.location ± 5 .

      • If the similarityScore was above a certain threshold value, the highlight would be considered duplicate and will be removed.

    • The problem with this approach were

      • Deciding the similarityScore

        • too less and we would have false positive getting valuable knowledge lost

        • too more and we would have false negatives, adding redundant highlights, leading to a bad user experience.

        • There was still considerable time being spent on processing all the quotes against each other.

Solution 3 - Overlapping Sub-intervals

  • Then it hit me, this seemed like a leetcode problem.

  • I had location values, which was a string, it could either be a string number, or a string range.

    • I parsed the value into start and end

    •   function parseLocation(location) {
            if (!location) return { start: 0, end: 0 };
            if (typeof location === 'number') return { start: location, end: location };
            const parts = location.split('-').map(x => parseInt(x.trim(), 10));
            if (parts.length === 2 && !isNaN(parts[0]) && !isNaN(parts[1])) {
                return { start: parts[0], end: parts[1] };
            }
            if (parts.length === 1 && !isNaN(parts[0])) {
                return { start: parts[0], end: parts[0] };
            }
            return { start: 0, end: 0 };
        }
      
    • This gave me the start and end line number for every quote.

  • Now, most; if not all the times, what users do is mess their highlights up by a word or two missing or additional one being highlighted.

    • When this happens the user highlights the content again till they finally get it right.

    • This gave me the first big assumption, temporal value i.e. timestamp field was valuable.

      • If I could get all the highlights done by the user around a single intended highlight. I could decide the winner here by choosing the latest timestamp value.
  • Now figuring out which highlights are same and which are different.

    • This is where the problem of overlapping sub intervals came in.

    •       // sort highlights by locStart here
            highlights.sort((a, b) => a.locStart - b.locStart);
      
            let current = highlights[0];
            for (let i = 1; i < highlights.length; i++) {
                const next = highlights[i];
                if (current.locEnd >= next.locStart) {
                    // Overlapping or touching intervals, keep the latest one out of the two
                    let currentTime = new Date(current.timestamp).getTime();
                    let nextTime = new Date(next.timestamp).getTime();
                    if (nextTime > currentTime) {
                        // If next highlight is more recent, update current and forget about it ;)
                        current = next
                    }
                    // current.locEnd = Math.max(current.locEnd, next.locEnd);
                } else {
                    // No overlap, push the current highlight and move to the next
                    solution.push(current);
                    current = next;
                }
            }
            return solution;
      
    • Here, we start by sorting the highlights for the current book based on their start locations. O(n*mlogm)

    • Now, the algorithm finds if at any point the current-highlight being compared is longer or equal to the next one.

      • Checking for overlap here. If at any point next→start is less than current→end; it means next is starting within current.

      • And just like that we found an overlapping interval.

        • This is where the leetcode algorithm differs from the requirement here.

        • In the leetcode algorithm you would go on to merge the interval

        • But here, we would instead update the current highlight with whichever one is the latest among the two being compared.

        • If the current and the next highlight differs and does not overlap, we save the current highlight in the result and move on to the next highlight.

Accepting the 3rd Solution

  • A major reason for accepting the 3rd solution is it’s time complexity.

    • That small difference of m*logm instead of in O(n*m*logm) can lead to a world of difference when the data grows.
  • But there are more reasons for sticking with the third solution such as:

    • Relying on a similarityScore can be dangerous as it would lead to false-positives and false-negatives.

    • The overlapping intervals is easy to read and understand than plagiarism checker giving higher maintainability and supporting better for the current use case.

Real-time results

  • For the first check, the location based approach dropped a total of 1710 highlights (27% of total highlights)
  • It dropped the size of the final archive file from 1.8MB to 1.5MB, a 17% drop.

Thoughts

  • The reason I am writing this would be to remind you, and myself in the future that leetcode problems have meaning when you can remember them out of context of the leetcode.

  • If you are developing anything similar this might help you, if it helps I am happy ;)

  • If you have read this article so far, do try out the website https://kindle-clippings.pages.dev/.

  • Want to read more, unpublished blogs of mine? books | computer Science

More from this blog

Pratyay Dhond's Blog

11 posts

These articles are my "EUREKA, good gracious, that's how it works" moments, or problems I encountered that I think I will face again in future when I blow up my current OS! Lovely to have you here! <3