Statistics

Type of statistics
Max – Largest value in L – sorting L(0)
Min – Smallest value in L – L[n-1]
Midpoint – average of max & min – (L[0]+L(n-1))/2
Mean -average of values in L
Mode – most common value in L
Median – “middle” value in L (half bigger, half smaller) – L(n/2)

Statistics by sorting list, all running time 0(1) (constant time)
Statistics on Unsorted list,
1.Sort first, running time 0(n log n) under the comparison model, like sorter
2.Extract the statistic in effectively one scan through the data, running time 0(n), like max, min, median

L – List of centrality values, 0 1 2 3 … n-1, n is nodes

L =[2,3,2,3,2,4]
def mean(L):
	total =0
	for i in range(len(L)): total+=L[i]
	return (0.0+total)/len(L)

print mean(L) # running time is  0(n) 
print (0.0+sum(L))/len(L)

It’s just simple statement. It seems like it should be 0(1), but it’s not, it is 0(n). ‘sum’ command is shorthand for a loop like ‘for i in range(len(L)): total+=L[i]’. So it is running overall the elements in the list internally.
Difference between c and python.
C basically doesn’t allow you to hide running time in the individual statements. All the C statements run pretty much constant time.
Python write more complicated programs in a nice distinct way, but they do the shortcuts, so you should know what the running time.

print max(set(L), key=lambda x: L.count(x)) #Mode
import timeit
import random

def max_v1(L):
    max_so_far = L[0]
    for i in range(1, len(L)):
       if L[i] > max_so_far: max_so_far = L[i]
    return max_so_far

def max_v2(L):
    m = L[0]
    for n in L:  
        if n > m: m = n
    return m

def max_v3(L):
    m = L[0]
    for n in L[1:]:
        if n > m: m = n
    return m

testlist= [random.choice(range(10000)) for _ in range(10000)] # generate a test list
numtimes = 10000                                              # Do 10000 iterations
statement = "z =max_v1(testlist)"                             # this is what we will execute
setup = "from __main__ import  max_v1,testlist"               # this is setup code
t = timeit.Timer(statement, setup)
elapsed = t.timeit(number=numtimes)                           # do it!
sf = "Code  %-20s takes %0.3f micro-seconds/pass"
print( sf % (statement, 1000000*elapsed/numtimes))

statement = "z =max_v2(testlist)"                            # this is what we will execute
setup = "from __main__ import  max_v2,testlist"              # this is setup code
t = timeit.Timer(statement, setup)
elapsed = t.timeit(number=numtimes)                          # do it!
sf = "Code  %-20s takes %0.3f micro-seconds/pass"
print( sf % (statement, 1000000*elapsed/numtimes))

statement = "z =max_v3(testlist)"                            # this is what we will execute
setup = "from __main__ import  max_v3,testlist"              # this is setup code
t = timeit.Timer(statement, setup)
elapsed = t.timeit(number=numtimes)                          # do it!
sf = "Code  %-20s takes %0.3f micro-seconds/pass"
print( sf % (statement, 1000000*elapsed/numtimes))


##>>> 
##Code  z =max_v1(testlist)  takes 518.615 micro-seconds/pass
##Code  z =max_v2(testlist)  takes 466.716 micro-seconds/pass
##Code  z =max_v3(testlist)  takes 512.866 micro-seconds/pass
##>>>

##Slices and anything in [] are expensive in python

Top K Problem
topk
selection – Find the max value and pull it out of the list, so the list is now in n-1 long and then repeat this. Each time we compute the max, it will be actually the 1st, 2nd, 3rd, 4th largest from original list, but it’s always the max of the current list.

#Find 2nd highest female name - Top K Insertion 
yob = open("yob1995.txt","r")

maxres= [(None,-1),(None,-1)]
for d in yob:
    name, sex, count = d.strip().split(",")
    if (sex=='F'):
        count = int(count)
        if (count>maxres[0][1]):
            maxres[1] = maxres[0]
            maxres[0] = (name,count)
        elif (count>maxres[1][1]):
            maxres[1] = (name,count)
print maxres[1]
yob.close()

"""
Aadam,M,6
Aadil,M,14
Aailiyah,F,5 => [('Aailiyah', 5), (None, -1)]
Aailyah,F,6 =>  [('Aailyah', 6), ('Aailiyah', 5)]
Aaisha,F,9 =>   [('Aaisha', 9), ('Aailyah', 6)]
Aakash,M,33
Aaleyah,F,8 =>  [('Aaisha', 9), ('Aaleyah', 8)]
Aalia,F,5
Aaliah,F,12 =>  [('Aaliah', 12), ('Aaisha', 9)]
"""

Python append is super fast compare to +=

Randomized Top K – Algorithm is finds the Top K elements in no particular order in 0(n) independent of K.
For example, the list has n elements, it takes 0(n log n) time to sort it.
How could it take us less than that to find the Top K if k = n?, this is not equivalent to sorting, but tt can be actually much more efficient than sorting.

Top K (smallest) via partitioning
topkparti

import random
L =[31, 45, 91, 51, 66, 82, 28, 33, 11, 89, 84, 27, 36]

def partition(L,v):
    s=[]
    l=[]
    for val in L:
        if val < v: s.append(val)         
        elif val > v: l.append(val)
    return(s,[v],l)

def topk(L,K):
    v = random.choice(L)
    (left,middle, right) =  partition(L,v)
    if len(left)==K: return left
    elif len(left)+1==K: return left+[v]
    elif len(left)>=K: return topk(left,K)
    return left+[v]+topk(right,K-len(left)-1)

print topk(L,6)    
"""
example 1
91 [31, 45, 51, 66, 82, 28, 33, 11, 89, 84, 27, 36] [91] []
51 [31, 45, 28, 33, 11, 27, 36] [51] [66, 82, 89, 84]
31 [28, 11, 27] [31] [45, 33, 36]
45 [33, 36] [45] []
[28, 11, 27, 31, 33, 36]

example 2
91 [31, 45, 51, 66, 82, 28, 33, 11, 89, 84, 27, 36] [91] []
36 [31, 28, 33, 11, 27] [36] [45, 51, 66, 82, 89, 84]
[31, 28, 33, 11, 27, 36]

example 3
36 [31, 28, 33, 11, 27] [36] [45, 91, 51, 66, 82, 89, 84]
[31, 28, 33, 11, 27, 36]
"""
#solution for middle value contain multiple times.
import random
import math

#improved paritioing
#L = [random.randint(1,10) for i in range(1,20)]
L= [1, 7, 2, 3, 8, 6, 5, 1, 9, 9, 5, 8, 8, 4, 5, 5, 3, 2, 4]

def partition(L, n):
        r=[]
        l=[]
        m=[]
        for i in L:
                if i<n:l.append(i)                 
                elif i>n:r.append(i)
                elif i==n:m.append(i)
        return l,m,r

def topk(L, k):
     n = random.choice(L)
     (left, middle, right) = partition(L, n)
     ll = len(left)
     lm = len(middle)
     if ll==k: return left
     elif ll+lm==k: return left+middle
     elif llk: return left+middle[0:k-ll]
     elif ll>k: return topk(left, k)
     else: return left+middle+topk(right, k-ll-lm)

n = len(L)           
print topk(L, (n/2)+1)
# Given a list of numbers, L, find a number, x, that
# minimizes the sum of the absolute value of the difference
# between each element in L and x: SUM_{i=0}^{n-1} |L[i] - x|
# code should run in Theta(n) time
'''
40/10 = 4.0
|1-4| + |2-4| + |3-4| + |4-4| + 6(|5-4|)
  3   +   2   +   1   +   0   + 61 = 12
 the correct answer of 5.0 gives
|1-5| + |2-5| + |3-5| + |4-5| + 6(|5-5|)
4   +   3   +   2   +   1   + 60 = 10
1 => 0+1+2+3+4+4+4+4+4+4>10
2 => 1+0+1+2+3+3+3+3+3+3>10
3 => 2+1+0+1+2+2+2+2+2+2>10
'''
def minimize_absolute(L):
        n = len(L)
        tk= topk(L, (n/2)+1)
        return max(tk)
assert 1 == minimize_absolute([1,1,1,1,1,1000,100000])
assert 5 == minimize_absolute([1,2,3,4,5,5,5,5,5,5])

Analysis of Top k via Partioning
topkpartiana

Partition – which is better and the worst case than sorting and better than selection/insertion.

Heaps

heapPython, deleting the first element of L using “del L[0]”,constant time, 0(1). Using “L = L[1:]” appears to involving a copy and is 0(n).

An ordered list – from smallest to largest. Binary search to find the position, insert a new element, re arrange the position, because it has to keep an ordered list. This inserting into an ordered list is a linear time process.

An unordered list– Scan through to find where the minimum is and to remove it, then re arrange the position. That is actually a linear time operation.

A heap is a particular data structure. It is used to implement a “priority queue”, which is an abstract data type. It’s a data type that supports 3 different operations in a logarithmic time. Insert a new element, find and remove minimum from the list.

Heap Height

heapFor example, balanced binary tree with 20 nodes. If you read left to right and top to bottom (13, 24, 21, 30, 29, 27…), It’s not sorted list. But if you read down past (13, 24, 30, 58, 74…  or 13, 21, 27, 43…), it’s sorted list, because going down the tree the parent value is never bigger than the value of the child.

Properties of a heap
1)    Nodes are filled in left to right, top to bottom (root to leaf).
2)    Smallest value is at root.
3)    Store in an array (0,1,2,3,4,..)
What is the height from the root of the tree (the tippy top) down to the deepest leaf? How long is the longest pass in this heap?

The last level is filled in all the way to the end.  How many nodes we have at that level? We have half as many at the level above because each of the nodes shares a parent with one of the other nodes and again half and again half and again half. How many times can you half (n) before you get down to 1 which is the root?  That’s log n. So the height of the heap is 0(log n).

What’s the deepest level you can find the 3rd smallest? 3 (24).
What  node would be the right of 72?    146

Down Heap

import random
#L = [random.randrange(90)+10 for i in range(20)]
L =[50, 88,27,58,30, 21, 58, 13, 84, 24, 29, 43, 61, 44, 65, 74, 76, 30, 82 ,41, 30]
L = range(10)

# Implement remove_min
def remove_min(L):
    # your code here
    L[0] = L.pop()
    print L[0], L
    down_heapify(L,0)
    # your code here
    return L

# Implement down_heapify
# Heap shorcuts
def parent(i): 
    return (i-1)/2
def left_child(i): 
    return 2*i+1
def right_child(i): 
    return 2*i+2
def is_leaf(L,i):
    return (left_child(i) >= len(L)) and (right_child(i) >= len(L))
def one_child(L,i):
    return (left_child(i) = len(L))

# Call this routine if the heap rooted at i satisfies the heap property
# *except* perhaps i to its immediate children
def down_heapify(L, i):
    # If i is a leaf, heap property holds
    if is_leaf(L, i): return
    
    # If i has one child...
    if one_child(L, i):
        # check heap property
        if L[i] > L[left_child(i)]:
            # If it fails, swap, fixing i and its child (a leaf)
            (L[i], L[left_child(i)]) = (L[left_child(i)], L[i])
        return
    
    # If i has two children...  check heap property
    if min(L[left_child(i)], L[right_child(i)]) >= L[i]: return
    
    # If it fails, see which child is the smaller
    # and swap i's value into that child
    # Afterwards, recurse into that child, which might violate
    if L[left_child(i)] < L[right_child(i)]:
        # Swap into left child
        (L[i], L[left_child(i)]) = (L[left_child(i)], L[i])
        down_heapify(L, left_child(i))
        return
    else:
    	# Swap into right child
        (L[i], L[right_child(i)]) = (L[right_child(i)], L[i])
        down_heapify(L, right_child(i))
        return

# build_heap
def build_heap(L):
    for i in range(len(L)-1, -1, -1):
        down_heapify(L, i)
    return L

def testheap():
    print build_heap(L)
    #now, the new minimum should be 1
    assert L[0] == 1
testheap()

#heap sort
def heap_sort():
    build_heap(L)
    while len(L)>1:
        print L[0]
        remove_min(L)
    print L[0]	
heap_sort()

head_example

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
50, 88, 27, 58, 30, 21, 58, 13, 84, 24, 29, 43, 61, 44, 65, 74, 76, 30, 82, 41, 30
i  l  r  leaf		 onechild	min		swap

20 41 42 41>=21 & 42>=21
19 39 40 39>=21 & 40>=21
...
9  19 20 19>=21x	 19=21x	41,30 30>=24
8  17 18 17>=21x 	 17=21x	30,82 30>=84x	30=21 & 36>=21 (8 down 17 tree check 8D17)
7  15 16 15>=21x 	 15=21x	74,76 74>=13 (continue from 8 CF8)
6  13 14 13>=21x 	 13=21x	44,65 44>=58x	44=21 & 28>=21 (6D13)
5  11 12 11>=21x 	 11=21x	43,61 43>=21 (CF6)
50, 88, 27, 58, 24, 21, 44, 13, 30, 30, 29, 43, 61, 58, 65, 74, 76, 84, 82, 41, 30 (4,9)=(9,4) => (4D9 x, CF3)
50, 88, 27, 13, 24, 21, 44, 58, 30, 30, 29, 43, 61, 58, 65, 74, 76, 84, 82, 41, 30 (3,7)=(7,3)
50, 88, 21, 13, 24, 27, 44, 58, 30, 30, 29, 43, 61, 58, 65, 74, 76, 84, 82, 41, 30 (2,5)=(5,2)
01 03 04 03>=21x	03=21x	13,24 13>=88	
50, 13, 21, 88, 24, 27, 44, 58, 30, 30, 29, 43, 61, 58, 65, 74, 76, 84, 82, 41, 30 (1,3)=(3,1)
03 07 08 07>=21x	07=21x	58,30 30>=88 (1D3)	
50, 13, 21, 30, 24, 27, 44, 58, 88, 30, 29, 43, 61, 58, 65, 74, 76, 84, 82, 41, 30 (3,8)=(8,3)
08 17 18 17>=21x	17=21x	84,82 82>=88 (3D8)	
50, 13, 21, 30, 24, 27, 44, 58, 82, 30, 29, 43, 61, 58, 65, 74, 76, 84, 88, 41, 30 (8,18)=(18,8)
00 01 02 01>=21x 	 01=21x	13,21 13>=50 (CF1)
13, 50, 21, 30, 24, 27, 44, 58, 82, 30, 29, 43, 61, 58, 65, 74, 76, 84, 88, 41, 30 (0,1)=(1,0) => (0D1)
13, 24, 21, 30, 50, 27, 44, 58, 82, 30, 29, 43, 61, 58, 65, 74, 76, 84, 88, 41, 30 (1,4)=(4,1) => (1D4)
13, 24, 21, 30, 29, 27, 44, 58, 82, 30, 50, 43, 61, 58, 65, 74, 76, 84, 88, 41, 30 (4,10)=(10,4)

Remove L[0] => 13, replace L[0]=L[n-1] => 30  
30, 24, 21, 30, 29, 27, 44, 58, 82, 30, 50, 43, 61, 58, 65, 74, 76, 84, 88, 41
21, 24, 30, 30, 29, 27, 44, 58, 82, 30, 50, 43, 61, 58, 65, 74, 76, 84, 88, 41 (0,2)=(2,0) => (0D2)
21, 24, 27, 30, 29, 30, 44, 58, 82, 30, 50, 43, 61, 58, 65, 74, 76, 84, 88, 41 (2,5)=(5,2) => (2D5)

Up Heap

def up_heapify(L, i):
    if i==0: return
    p = parent(i)
    if L[i]= len(L)) and (right_child(i) >= len(L))

def testheap():
    L = [2, 4, 3, 5, 9, 7, 7]
    #L.append(1)
    #up_heapify(L, 7)
    L =[50, 88,27,58,30, 21, 58, 13, 84, 24, 29, 43, 61, 44, 65, 74, 76, 30, 82 ,41, 30]
    for i in range(len(L)-1, -1, -1):
        up_heapify(L, i)
    print L
testheap()

Top K summary
sort : 0(n log n)
selection/insertion : 0(nk)
partition : 0(n)
heap : 0(n log k)

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s