Find Median from Data Stream - Python Solution

1. Introduction

"Find Median from Data Stream" is an advanced problem that involves designing a data structure to efficiently find the median of a growing set of numbers. This problem is significant in understanding data stream processing and efficient data structure design, particularly in balancing time complexities of insertion and median finding operations.

Problem

Implement the MedianFinder class:

- MedianFinder() initializes the MedianFinder object.

- void addNum(int num) adds the integer num from the data stream to the data structure.

- double findMedian() returns the median of all elements so far.

The median is defined as the middle value in an ordered integer list. For an even size of the list, the median is the mean of the two middle values.

2. Solution Steps

1. Use two heaps: a max heap for the lower half of the numbers and a min heap for the upper half.

2. Ensure that the max heap contains the smaller half of the numbers, and the min heap contains the larger half.

3. For each new number, add it to one of the heaps, then balance the heaps to maintain their size property.

4. To find the median, check the sizes of the heaps. If they are equal, return the average of the tops of the heaps. If not, return the top of the heap with more elements.

3. Code Program

import heapq

class MedianFinder:

    def __init__(self):
        self.max_heap = []  # Contains the smaller half of the numbers
        self.min_heap = []  # Contains the larger half of the numbers

    def addNum(self, num):
        heapq.heappush(self.max_heap, -num)
        heapq.heappush(self.min_heap, -heapq.heappop(self.max_heap))

        # Balance the heaps
        if len(self.min_heap) > len(self.max_heap):
            heapq.heappush(self.max_heap, -heapq.heappop(self.min_heap))

    def findMedian(self):
        if len(self.min_heap) == len(self.max_heap):
            return (-self.max_heap[0] + self.min_heap[0]) / 2.0
        return -self.max_heap[0]

# Example Usage
medianFinder = MedianFinder()
medianFinder.addNum(1)
medianFinder.addNum(2)
print(medianFinder.findMedian())  # Output: 1.5
medianFinder.addNum(3)
print(medianFinder.findMedian())  # Output: 2.0

Output:

1.5
2.0

Explanation:

1. Two Heaps Approach: The max_heap and min_heap are used to store the smaller and larger halves of the numbers, respectively.

2. Adding Numbers: When a new number is added, it is first added to max_heap and then the smallest element from max_heap is moved to min_heap to maintain order.

3. Heap Balancing: The heaps are balanced to ensure that max_heap always has equal or one more element than min_heap.

4. Median Finding: The median is calculated based on the sizes of the heaps - if equal, it's the average of the tops of both heaps; otherwise, it's the top of max_heap.

5. Efficient Stream Processing: This method allows for efficient addition of numbers and finding of the median in a dynamic data stream.


Comments