I was looking for a single dimension clustering algorithm back in the day and i didnt really find one implemented “out of the box” so I took the challenge and implemented one myself, I called the resulting algorithm “cluster1d” and you can see it in full here. Its runtime complexity is O(nlogn) and space complexity is O(n). lets jump into its details, first lets play with some data, why wont we try to find clusters over 1 year prices of Microsoft stock? yeah lets do this :
import yfinance as yf
msft = yf.Ticker("MSFT")
prices = list(msft.history(period="1y")["Close"])
Then let’s zip them with their indices into tuples, sort these tuples by prices and unzip it to a sorted prices list and their corresponding indices list :
p, i = zip(*sorted(zip(prices, range(len(prices)))))
p = np.array(p)
Great, now lets have distances of all consecutive (sorted) prices, zipped with the corresponding indices (of prices we took distance of), sort them and unzip to distances, left indices and right indices :
d, li, ri = map(np.array,zip(*sorted(zip(p[1:] - p[:-1], i[:-1], i[1:]))))
At this point, we can filter out the outliers – distances above a max threshold which was given to the algorithm as a parameter :
mask = d <= max_distance
d, li, ri = d[mask], li[mask], ri[mask]
We’ll connect all left indices and right indices as edges :
def to_adj(edges):
adj = defaultdict(set)
for u, v in edges:
adj[u].add(v)
adj[v].add(u)
return adj
adj = to_adj(zip(li, ri))
Now that we have the adjacency mapping, lets find all connected components :
def connected_components(vertices, adj):
V = set(vertices)
components = list()
def component(v):
visited = set([v])
toVisit = adj[v]
while toVisit:
v = toVisit.pop()
visited.add(v)
toVisit |= adj[v] - visited
return visited
while V:
v = V.pop()
c = component(v)
components.append(c)
V -= c
return components
components = connected_components(adj.keys(),adj)
The components we found, are the appropriate clusters we seek, let’s put them in a data structure :
class Cluster:
def __init__(self, items, cluster):
self.indices = list(map(int, cluster))
self.items = [items[i] for i in self.indices]
self.count = len(self.items)
self.range = (min(self.items), max(self.items))
self.ranged = self.range[1] - self.range[0]
def __str__(self):
return self.__dict__.__str__()
def __repr__(self):
return self.__dict__.__repr__()
clusters = [Cluster(prices, c) for c in components]
Finally here are the clusters ranges for MSFT when given max distance of 2$ :
[(177.94, 188.41), (192.15 244.42), (246.47, 261.97)]
In the next article, we will use this clustering algorithm to create heatmaps over stock prices, stay tuned.
Did my post help you? share it with me here and share it with your friends and colleagues everywhere