Educative: Interactive Courses for Software Developers

Statement#

Given two strings, str1 and str2, find the shortest substring in str1 such that str2 is a subsequence of that substring.

A substring is defined as a contiguous sequence of characters within a string. A subsequence is a sequence that can be derived from another sequence by deleting zero or more elements without changing the order of the remaining elements.

Let’s say you have the following two strings:

str1 = “ $abbcb$ ”

str2 = “ $ac$ ”

In this example, “ $abbc$ ” is a substring of str1, from which we can derive str2 simply by deleting both the instances of the character $b$ . Therefore, str2 is a subsequence of this substring. Since this substring is the shortest among all the substrings in which str2 is present as a subsequence, the function should return this substring, that is, “ $abbc$ ”.

If there is no substring in str1 that covers all characters in str2, return an empty string.

If there are multiple minimum-length substrings that meet the subsequence requirement, return the one with the left-most starting index.

Constraints:

$1$ $\leq$ str1.length $\leq$ $2 \times 10^3$
$1$ $\leq$ str2.length $\leq$ $100$
str1 and str2 consist of uppercase and lowercase English letters.

Solution#

So far, you’ve probably brainstormed some approaches and have an idea of how to solve this problem. Let’s explore some of these approaches and figure out which one to follow based on considerations such as time complexity and any implementation constraints.

Naive approach#

The naive approach would be to generate all possible substrings of str1 and then check which substrings contain str2 as a subsequence. Out of all the substrings in str1 that contain str2 as a subsequence, we’ll choose the one with the shortest length. Now, let’s look at the cost of this solution. We need two nested loops to get all possible substrings and another loop to check whether each substring contains all the required characters. This brings the time complexity to $O(n^{3})$ . Since we’re not using any extra space, the space complexity is $O(1)$ .

Optimized approach using sliding window#

To eliminate the extra traversals of the substrings, we use the sliding window approach. With this approach, we only consider the substrings that we are sure contain all the characters of str2 in the same order. This problem can be conveniently solved using the sliding window pattern. The idea is to keep track of whether the subsequence has been found or not and to select the shortest subsequence from str1.

Note: In the following section, we will gradually build the solution. Alternatively, you can skip straight to just the code.

Step-by-step construction#

The first step of the solution is to initialize the variables. We begin by creating two variables, size_str1 and size_str2, to store the lengths of str1 and str2, respectively. We then initialize min_sub_len to infinity, which will be used to store the length of the minimum subsequence.

To help us traverse the two strings, we create two indexes, index_s1 and index_s2, which initially point to the first characters of str1 and str2, respectively. These indexes will be incremented as we scan through the strings to find the subsequence.

Finally, we initialize min_subsequence to an empty string. This variable will store the output, which is the smallest possible subsequence.

def min_window(str1, str2):
    # save the size of str1 and str2
    size_str1, size_str2 = len(str1), len(str2)
    # initialize min_sub_len to a very large number (infinity)
    min_sub_len = float('inf')
    # initialize indexes to zero and the min_subsequence to an empty string
    index_s1, index_s2 = 0, 0
    min_subsequence = ""
    print("\t Length of str1 is: ", size_str1)
    print("\t Length of str2 is: ", size_str2)
# driver code
def main():
    str1 = ["abcdebdde", "fgrqsqsnodwmxzkzxwqegkndaa", "zxcvnhss", "alpha", "beta"]
    str2 = ["bde", "kzed", "css", "la", "ab"]
    for i in range(len(str1)):
        print(i+1, ".\t Input strings: (" + str1[i]+", " + str2[i]+")", sep="")
        min_window(str1[i], str2[i])
        print("-"*100)
if __name__ == '__main__':
    main()

Enter to Rename, Shift+Enter to Preview

Initializing variables

Once we’ve initialized the variables, we start looping over str1 using index_s1. In each iteration, if the current character of str1, pointed at by index_s1, and the current character of str2, pointed at by index_s2, are the same, we increment both the indexes. Otherwise, we only increment index_s1. Using this logic, index_s2 will reach the end of str2 only if each character of str2 is found in str1. At this point, we have found a potential minimum window subsequence, i.e., a substring that contains str2 as a subsequence. So, we set min_sub_len to the length of this substring and min_subsequence to this substring.

Now, we need to check the rest of str1 for a shorter substring that meets our requirement. So, we resume our search in str1 from index_s1 $+ 1$ . Whenever we find a substring that meets our requirement, we compare its length with min_sub_len and if it is shorter, we update min_sub_len with the length of this substring and min_subsequence with this substring. Finally, when we have traversed the entire str1, we return the minimum window subsequence.

Let’s take a look at the code of this solution:

def min_window(str1, str2):
    # save the size of str1 and str2
    size_str1, size_str2 = len(str1), len(str2)
    # initialize min_sub_len to a very large number (infinity)
    min_sub_len = float('inf')
    # initialize pointers to zero and the min_subsequence to an empty string
    index_s1, index_s2 = 0, 0
    min_subsequence = ""
    # iterate over str1
    while index_s1 < size_str1:
        # check if the character pointed by index_s1 in str1
        # is the same as the character pointed by index_s2 in str2
        if str1[index_s1] == str2[index_s2]:
            # if this was the first character of str2, mark it as the start of the substring
            if index_s2 == 0:
                start = index_s1
            
            # if the pointed character is the same in both strings increment index_s2
            index_s2 += 1
            
            # check if a valid substring has been found
            if index_s2 == size_str2:
                end = index_s1
                # update min_sub_len and min_subsequence if current subsequence is shorter
                if end - start + 1 < min_sub_len:
                    min_sub_len = end - start + 1
                    min_subsequence = str1[start : end + 1]

Enter to Rename, Shift+Enter to Preview

First draft of the solution

By this stage, we’ve found a potential solution to the problem but if we uncomment the test case provided in lines 49 - 51 in the code widget above, we’ll notice that our solution does not return the correct minimum window subsequence. The issue with our solution is that it skips over any potential minimum window subsequences that start before the last character of the substring that we just found. Let’s understand this with the help of the example:

str1 = “abcdedeaqdweq”

str2 = “adeq”

Here, the minimum window subsequence is “aqdweq” that starts at index $7$ and ends at index $12$ . The first potential minimum window subsequence that our code identifies is “abcdedeaq”, ending at $8$ . If we resume the search from the end of this substring, i.e., from index $9$ in str1, we won’t be able to locate the shortest substring that satisfies our condition, since we skipped over index $7$ , which is the starting index of the actual minimum window subsequence.

To fix this, we simply change the index from which we resume our search and continue the iteration from the second character of the current substring, instead of index_s1 $+ 1$ .

Let’s look at the code of this solution after this correction, seen in line 35:

def min_window(str1, str2):
    # save the size of str1 and str2
    size_str1, size_str2 = len(str1), len(str2)
    # initialize min_sub_len to a very large number (infinity)
    min_sub_len = float('inf')
    # initialize pointers to zero and the min_subsequence to an empty string
    index_s1, index_s2 = 0, 0
    min_subsequence = ""
    # iterate over str1
    while index_s1 < size_str1:
        # check if the character pointed by index_s1 in str1
        # is the same as the character pointed by index_s2 in str2
        if str1[index_s1] == str2[index_s2]:
            # if this was the first character of str2, mark it as the start of the 
substring
            if index_s2 == 0:
                start = index_s1
            
            # if the pointed character is the same in both strings increment index_s2
            index_s2 += 1
            
            # check if index_s2 has reached the end of str2
            if index_s2 == size_str2:
                end = index_s1
                # update min_sub_len and min_subsequence if current subsequence is 
shorter
                if end - start + 1 < min_sub_len:

Enter to Rename, Shift+Enter to Preview

Corrected solution

At this point, our solution is correct and complete, since we can see that the example where str1 = “abcdedeaqdweq” and str2 = “adeq”, provided as the first test case in the coding widget above, give the correct output.

However, if we uncomment the test case given in lines 48 - 50, we’ll notice that our solution times out, which means that our solution is inefficient. Let’s first understand our test case. The first string, str1, consists of $10,000$ occurrences of the letter “f”, $9,000$ occurrences of the letter “s”, $500$ occurrences of the letter “e”, followed by one occurrence of the letter “a”. The second string, str2 is “fffessa”.

Our solution identifies all the potential minimum window subsequences starting from every instance of the character $f$ in str1. The upper bound on the number of such subsequences in this case is $10000 \times 9000 \times 500$ . This is obviously wasteful, since we can see from inspection that we can discard all but the last three occurrences of the letter “f”. Let’s see how we can improve our solution based on this insight.

After finding a potential subsequence, let’s try to shrink it by moving backward in str1, comparing it one character at a time with str2. We initialize two more pointer variables, start and end, to index_s1 and point index_s2 to the last character of str2. Then, we iterate backward over the potential minimum window subsequence using the start pointer. When we have discovered the entire str2 in the potential minimum window subsequence in the reverse order, we update the min_sub_len and min_subsequence variables accordingly.

At this point, we have found the shortest window subsequence that meets our requirement up to the index end in str1. So, now, we need to check the rest of the str1 string. We simply resume iterating over str1 in the forward direction, starting, as before, from start $+1$ .

Let’s visualize the solution:

def min_window(str1, str2):
    # save the size of str1 and str2
    size_str1, size_str2 = len(str1), len(str2)
    
    # initialize min_sub_len to a very large number (infinity)
    min_sub_len = float('inf')
    
    # initialize pointers to zero and the min_subsequence to an empty string
    index_s1, index_s2 = 0, 0
    min_subsequence = ""
    # iterate over str1
    while index_s1 < size_str1:
        # check if the character pointed by index_s1 in str1
        # is the same as the character pointed by index_s2 in str2
        if str1[index_s1] == str2[index_s2]:
            # if the pointed character is the same
            # in both strings increment index_s2
            index_s2 += 1
            
            # check if index_s2 has reached the end of str2
            if index_s2 == size_str2:
                # initialize start to the index where all characters of
                # str2 were present in str1
                start, end = index_s1, index_s1
                index_s2 -= 1
                
                # decrement pointer index_s2 and start a reverse loop
                while index_s2 >= 0:
                    # decrement pointer index_s2 until all characters of
                    #  str2 are found in str1

Enter to Rename, Shift+Enter to Preview

Optimized solution

def min_window(str1, str2):
    size_str1, size_str2 = len(str1), len(str2)
    min_sub_len = float('inf')
    index_s1, index_s2 = 0, 0
    min_subsequence = ""
    while index_s1 < size_str1:
        if str1[index_s1] == str2[index_s2]:
            index_s2 += 1
            if index_s2 == size_str2:
                start, end = index_s1, index_s1
                index_s2 -= 1
                while index_s2 >= 0:
                    if str1[start] == str2[index_s2]:
                        index_s2 -= 1
                    start -= 1
                start += 1
                if end - start < min_sub_len:
                    min_sub_len = end - start
                    min_subsequence = str1[start:end+1]
                index_s1 = start
                index_s2 = 0
        index_s1 += 1
    return min_subsequence
# driver code
def main():
    str1 = ["abcdedeaqdweq", "fgrqsqsnodwmxzkzxwqegkndaa", "zxcvnhss", "alpha", "beta"]
    str2 = ["adeq", "kzed", "css", "la", "ab"]
    for i in range(len(str1)):
        print(i+1, ". \tInput string: (" + str1[i]+", " + str2[i]+")", sep="")

Enter to Rename, Shift+Enter to Preview

Minimum Window Subsequence

Solution summary#

Initialize two indexes, index_s1 and index_s2, to zero for iterating both strings.
If the character pointed by index_s1 in str1 is the same as the character pointed by index_s2 in str2, increment both pointers. Otherwise, only increment index_s1.
Once index_s2 reaches the end of str2, initialize two new indexes (start and end). With these two indexes, we’ll slide the window backward.
Set start and end to index_s1.
If the characters pointed to by index_s2 and start are the same, decrement both of them. Otherwise, only decrement start.
Once, str2 has been discovered in str1 in the backward direction, calculate the length of the substring.
If this length is less than the current minimum length, update the min_sub_len variable and the min_subsequence string.
Resume the search in the forward direction from start $+1$ in str1.
Repeat until index_s1 reaches the end of str1.

Time complexity#

The outer loop iterates over the string str1, so the time complexity of this loop will be $O(n)$ , where $n$ is the length of string str1. Inside this loop, there is a while loop that is used to iterate back over the window once all the characters of str2 have been found in the current window. The time complexity of this loop will be $O(m)$ , where $m$ is the length of string str2. Therefore, the overall time complexity of this solution is $O(n \times m)$ . For example, when str1 = “aaaaa” and str2 = “aa”, it takes $O(n \times m)$ time.

Let’s consider another example where the inner loop travels for more than $O(m)$ iterations. For example, when str1 = “abcdefg” and str2 = “af”. When all characters of str2 have been found (using outer loop in $O(n)$ time), the inner loop iterates back from ‘f’ till ‘a’, taking $O(n)$ time. After that, the outer loop will start from ‘b’ and traverse all characters till the end of str1 without initiating the inner loop. This is because the outer loop will never find a subsequence containing str2. This will take $O(n)$ time. Therefore, the total time complexity in this case will be $O(n+n+n)$ , which is $O(n)$ .

Therefore, the worst-case time complexity of this algorithm is $O(n \times m)$ .

Space complexity#

Since we are not using any extra space apart from a few variables, the space complexity is $O(1)$ .

Solution: Minimum Window Subsequence

Statement#

Solution#

Naive approach#

Optimized approach using sliding window#

Step-by-step construction#

Just the code#

Solution summary#

Time complexity#

Space complexity#