Skip to content
master
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 

README.md

= Dimensional Embedding

This implements a method of dimensional embedding of the atomspace using the Harel-Koren algorithm as described on

http://wiki.opencog.org/w/OpenCogPrime:WikiBook#Dimensional_Embedding and http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.5390

k-nearest neighbour queries are answered quickly using the following cover tree implementation: http://hunch.net/~jl/projects/cover_tree/cover_tree.html

The clustering code, with documentation, can be found here: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm#source

== Using DimEmbedModule in Scheme

To embed an atomspace using a certain link type and number of dimensions (eg similarity link, 50 dimensions)...

(embedSpace 'SimilarityLink 50)

To find the k nearest neighbors of a node...

(kNN node linkType k)

eg to find 10 similar things to the "dog" concept node...

(kNN (cog-node 'ConceptNode "dog") 'SimilarityLink 10)

Note that the atomspace must already be embedded (via embedSpace) for the given link type before you can find the k nearest neighbours for any nodes.

The entire embedding (the list of pivots and each node's embedding vector) can be written to the cogserver log using

(logEmbedding LinkType)

== About Complexity

It uses dijkstra's algorithm on the whole atomspace once for each pivot, so it should run in O(D*(|E| + |V|log|V|)) time, where E is the number of links, V is the number of nodes, and D is the number of dimensions (pivots). Based on experiments with random datasets, this seems accurate (specific times from my 2008 laptop):

For embedding in 50 dimensions:

1,000 nodes 4000 links - 21 seconds 1,000 nodes 8000 links - 41 seconds 1,000 nodes 12000 links - 63 seconds

3,000 nodes 9000 links - 54 seconds 3,000 nodes 18000 links - 114 seconds 3,000 nodes 24000 links - 155 seconds

10,000 nodes 30,000 links - 246 seconds 10,000 nodes 60,000 links - 415 seconds 10,000 nodes 90,000 links - 585 seconds

The amount of time for each pivot was consistent for any given embedding (ie the first 10 pivots took the same amount of time to pick as the last 10). The thesaurus data had 6k nodes and 500k links and that took about 3000 seconds (50 minutes, 1 minute per pivot), which seems pretty consistent with the random datasets/big-O-predicted complexity.

See the citeseer paper linked above for more detail on the embedding algorithm.

About

Dimensional Embedding Clustering

Resources

License

Releases

No releases published

Packages

No packages published
You can’t perform that action at this time.