Streaming-Data-Algorithms
文件大小: unknow
源码售价: 5 个金币 积分规则     积分充值
资源说明:Implementations of the Count-Min (CM) Sketch, Insert-only CM sketch, and the SpaceSaving sketch as part of UC Berkeley's Scalable Machine Learning class
As part of an assignment for the Scalable Machine Learning course (CS 281B) at the University of California - Berkeley, I'm implementing and comparing three data streaming algorithms:
1) The Count-Min Sketch Algorithm - Cormode, Graham; S. Muthukrishnan (2004). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38.
2) An improved (i.e. tighter-bounded) Count-Min Sketch algorithm which only handles inserts (sacrificing removal capabilities).
3) The SpaceSaving sketch - Efficient Computation of Frequent and Top-k Elements in Data Streams by Ahmed Metwally, Divyakant Agrawal and Amr El Abbadi

While I don't anticipate that this implementation will be better than reference implementations by the original authors, it might come in handy for comparison purposes.

Input Data: This code was originally run on the english-language dump of Wikipedia, which was around 2GB compressed at the time. Now it should be significantly larger, and you can download it separately at: http://dumps.wikimedia.org/enwiki/


Mark Fuge
markfuge@gmail.com

本源码包内暂不包含可直接显示的源代码文件,请下载源码包。