Solution: Find Median from a Data Stream

Let's solve the Find Median from a Data Stream problem using the Two Heaps pattern.

Statement #

Create a data structure that can store a list of integers that can change in size over time and find the median from this dynamically growing list in constant time, O(1)O(1).

Implement a class, MedianOfStream, which should support the following operations:

  1. Constructor(): This initializes the object of this class, which in turn creates the max and the min heap.

  2. Insert Num(num): This adds an integer, num, to the data structure.

  3. Find Median(): This finds the median of all elements seen so far. If there are an even number of elements, return the average of the two middle values.

Constraints:

  • 105-10^5 \leq num 105\leq 10^5, where num is an integer received from the data stream.

  • There will be at least one element in the data structure before the median is computed.

  • At most, 500500 calls will be made to the function that calculates the median.

Solution#

So far, you’ve probably brainstormed some approaches and have an idea of how to solve this problem. Let’s explore some of these approaches and figure out which one to follow based on considerations such as time complexity and any implementation constraints.

Naive approach#

The naive solution is to first sort the data and then find the median. Insertion sort is an algorithm that can be used to sort the data as it appears. This way, every time a new number is added to the stream, the numbers before that new number are already sorted, allowing insertion at the correct index. This, however, also requires us to shift the elements greater than the inserted number one place forward.

The overall time complexity of the algorithm becomes O(n2)O(n^2), where nn is the number of elements in the data stream. The amortized time complexity of each insertion is therefore O(n2n)O(\frac{n^2}{n}), that is, O(n)O(n). The time complexity of the function that calculates the median would be O(1)O(1), assuming we are storing the data in an array. The space complexity is O(1)O(1).

Optimized approach using two heaps#

We’ll assume that x is the median of the integers in a list. Half of the numbers in the list will be smaller than (or equal to) x, and the other half will be greater than (or equal to) x. We can divide the list into two halves. One half stores the smaller numbers, and the other half stores the larger numbers.

The median of all numbers will either be the largest number in the small list or the smallest number in the large list. If the total number of elements is even, we know that the median will be the average of these two numbers.

The most efficient data structure for repeatedly finding the smallest or largest number in a changing list is a heap. We can store the first half of the numbers in a max-heap and the second half in a min-heap. That’s why the two heaps pattern is a perfect fit.

Here’s how we’ll implement this algorithm:

  1. First, we’ll store the first half of the numbers (smaller than x) in a max-heap. We use a max-heap because we want to know the largest number in the first half of the list.

  2. Then, we’ll store the second half of the numbers (larger than x) in a min-heap because we want to know the smallest number in the second half of the list.

  3. We can calculate the median of the current list of numbers using the top element of the two heaps.

Created with Fabric.js 3.6.6 input number 22 min-heap max-heap The first number is 22. Since both heaps are empty, we simply push the negative of the input number to the max-heap.

1 of 11

Created with Fabric.js 3.6.6 input number 35 min-heap max-heap -22 Since the input number is greater than the top element in the max-heap, we push it to the min-heap.

2 of 11

Created with Fabric.js 3.6.6 We'll use the top elements of both heaps tocalculate our median.Median = (22+35)/2 = 28.5 input number min-heap 35 max-heap -22 median = 28.5

3 of 11

Created with Fabric.js 3.6.6 The next input number is 30, which is greater thanthe top element in max-heap. So,we'll push it to the min-heap. input number 30 min-heap 35 max-heap -22

4 of 11

Created with Fabric.js 3.6.6 input number min-heap 30 35 max-heap -22 Since the length of max-heap is less than thelength of min-heap, we'll rebalance the heaps by transferring the top element from the min-heap to max-heap.

5 of 11

Created with Fabric.js 3.6.6 We have an odd number of elements and the length of the max-heap is greater than the length of the min-heap. The median will be the top element of our max-heap. input number min-heap 35 max-heap -30 -22 median = 30.0

6 of 11

Created with Fabric.js 3.6.6 input number 16 min-heap 35 max-heap -30 -22 The input number is less than the top element ofmax-heap. Hence, we'll push it to the max-heap.

7 of 11

Created with Fabric.js 3.6.6 input number min-heap 35 max-heap -30 -22 -16 Length of max-heap is greater than the length of min-heap + 1. Hence, we'll rebalance the two heaps by transferring the top element from the max-heap to the min-heap.

8 of 11

Created with Fabric.js 3.6.6 The median will be the average of the top elementsof the two heaps.Median = (30+22)/2 = 26.0 input number min-heap 30 35 max-heap -22 -16 median = 26.0

9 of 11

Created with Fabric.js 3.6.6 The input number is less than the top elementof max-heap. Hence, we'll push it to the max-heap. input number 18 min-heap 30 35 max-heap -22 -16

10 of 11

Created with Fabric.js 3.6.6 input number min-heap 30 35 max-heap -22 -16 -18 median = 22.0 We have an odd number of elements,and the length of max-heap is greater thanmin-heap. The median will be the top element ofour max-heap.

11 of 11

Let’s look at the code for this solution below:

Find Median from a Data Stream

Solution summary#

To recap, the solution to this problem can be divided into the following parts:

  • Use a min-heap to store the larger 50% of the numbers seen so far and a max-heap for the smaller 50% of the numbers.
  • Add the incoming elements to the appropriate heaps.
  • Calculate the median using the top elements of the two heaps.

Time complexity#

The time complexity of the Insert Num method will change depending on how many numbers have already been received from the stream. So, the time complexity is amortized over the number of insert operations.

Each insert operation will trigger a heapify process that runs in O(logn)O(\log n) times, where nn is the count of numbers received so far from the stream. Because of this, the cumulative complexity of a sequence of nn insert operations is described by the expression log1+log2+log3++logn\log 1 + \log 2 + \log 3 + … + \log n.

This expression simplifies to logn!\log n!, which, as per Stirling’s approximation, is O(nlogn)O(n \log n). As we have performed nn insert operations, the amortized time complexity of one insert operation is O(nlognn)O(\frac{n \log n}{n}), that is, O(logn)O(\log n).

The time complexity of the Find Median method will be O(1)O(1) since retrieving the top element of a heap is a constant-time operation, and we need to do at most two such retrievals.

Space complexity#

The space complexity will be O(1)O(1) since no additional space is used other than that which is required to store the numbers received from the data stream.

Find Median from a Data Stream

Sliding Window Median