= Dimensional Embedding
This implements dimensional embedding of the atomspace using the Harel-Koren algorithm, as described at
http://wiki.opencog.org/w/OpenCogPrime:WikiBook#Dimensional_Embedding and http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.5390
K-nearest-neighbour queries are answered quickly using this cover tree implementation: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
The clustering code, with documentation, can be found here: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm#source
== Using DimEmbedModule in Scheme
To embed the atomspace using a given link type and number of dimensions (e.g. SimilarityLink, 50 dimensions):
(embedSpace 'SimilarityLink 50)
To find the k nearest neighbours of a node:
(kNN node linkType k)
For example, to find the 10 nodes most similar to the "dog" concept node:
(kNN (cog-node 'ConceptNode "dog") 'SimilarityLink 10)
Note that the atomspace must already be embedded (via embedSpace) for the given link type before you can find the k nearest neighbours for any nodes.
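For example, a minimal end-to-end session (assuming the atomspace already contains SimilarityLinks among the relevant concept nodes, including "dog") would embed and then query with the same link type:

(embedSpace 'SimilarityLink 50)
(kNN (cog-node 'ConceptNode "dog") 'SimilarityLink 10)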
The entire embedding (the list of pivots and each node's embedding vector) can be written to the cogserver log using
(logEmbedding LinkType)
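For example, to dump the SimilarityLink embedding produced above (assuming the same quoted-symbol convention as embedSpace and kNN):

(logEmbedding 'SimilarityLink)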
== About Complexity
The embedding runs Dijkstra's algorithm over the whole atomspace once for each pivot, so it should take O(D*(|E| + |V|log|V|)) time, where |E| is the number of links, |V| is the number of nodes, and D is the number of dimensions (pivots). Experiments with random datasets seem to bear this out (specific times from my 2008 laptop; a simplified sketch of the per-pivot computation appears at the end of this section):
For embedding in 50 dimensions:
* 1,000 nodes, 4,000 links: 21 seconds
* 1,000 nodes, 8,000 links: 41 seconds
* 1,000 nodes, 12,000 links: 63 seconds
* 3,000 nodes, 9,000 links: 54 seconds
* 3,000 nodes, 18,000 links: 114 seconds
* 3,000 nodes, 24,000 links: 155 seconds
* 10,000 nodes, 30,000 links: 246 seconds
* 10,000 nodes, 60,000 links: 415 seconds
* 10,000 nodes, 90,000 links: 585 seconds
The time per pivot was consistent within any given embedding (i.e. the first 10 pivots took about as long to process as the last 10). The thesaurus dataset had 6,000 nodes and 500,000 links and took about 3,000 seconds (50 minutes, or roughly one minute per pivot), which is reasonably consistent with the random datasets and the predicted big-O complexity.
See the CiteSeerX paper linked above for more detail on the embedding algorithm.
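For a concrete (if simplified) picture of the per-pivot step, here is a small standalone sketch in Guile Scheme. It is not the module's actual implementation: the graph representation, helper names, and example data are invented for illustration, pivot selection is not shown, and the naive linear-scan Dijkstra below is O(|V|^2) per pivot rather than the heap-based O(|E| + |V|log|V|) assumed in the estimate above. The point is only the overall shape: one shortest-path pass per pivot, with each node's embedding vector being its list of distances to the pivots.

;; Illustrative only: embed a small weighted graph by picking pivots and
;; recording each node's shortest-path distance to every pivot.
;; Graph: association list of (node . ((neighbour . weight) ...)).
(define graph
  '((a . ((b . 1) (c . 4)))
    (b . ((a . 1) (c . 2) (d . 5)))
    (c . ((a . 4) (b . 2) (d . 1)))
    (d . ((b . 5) (c . 1)))))

(define (nodes g) (map car g))
(define (neighbours g v) (cdr (assq v g)))

;; Naive Dijkstra: pick the closest unvisited node by a linear scan,
;; then relax its neighbours. O(|V|^2) per source.
(define (dijkstra g source)
  (let loop ((dist (map (lambda (v) (cons v (if (eq? v source) 0 +inf.0)))
                        (nodes g)))
             (unvisited (nodes g)))
    (if (null? unvisited)
        dist
        (let* ((u (car (sort unvisited
                             (lambda (x y)
                               (< (cdr (assq x dist)) (cdr (assq y dist)))))))
               (du (cdr (assq u dist))))
          (loop (map (lambda (pair)
                       (let ((edge (assq (car pair) (neighbours g u))))
                         (if (and edge (< (+ du (cdr edge)) (cdr pair)))
                             (cons (car pair) (+ du (cdr edge)))
                             pair)))
                     dist)
                (filter (lambda (v) (not (eq? v u))) unvisited))))))

;; One Dijkstra run per pivot: a node's embedding vector is its list of
;; distances to the pivots, so D pivots means D shortest-path passes.
(define (embed g pivots)
  (let ((per-pivot (map (lambda (p) (dijkstra g p)) pivots)))
    (map (lambda (v)
           (cons v (map (lambda (d) (cdr (assq v d))) per-pivot)))
         (nodes g))))

(embed graph '(a d))   ; => ((a 0 4) (b 1 3) (c 3 1) (d 4 0))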