= Dimensional Embedding
This implements dimensional embedding of the atomspace using the Harel-Koren algorithm, as described at
http://wiki.opencog.org/w/OpenCogPrime:WikiBook#Dimensional_Embedding and http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.5390
K-nearest-neighbour queries are answered quickly using this cover tree implementation: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
The clustering code, with documentation, can be found here: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm#source
== Using DimEmbedModule in Scheme
To embed the atomspace using a given link type and number of dimensions (e.g. SimilarityLink, 50 dimensions):
(embedSpace 'SimilarityLink 50)
To find the k nearest neighbours of a node:
(kNN node linkType k)
For example, to find the 10 nodes most similar to the "dog" concept node:
(kNN (cog-node 'ConceptNode "dog") 'SimilarityLink 10)
Note that the atomspace must already be embedded (via embedSpace) for the given link type before you can find the k nearest neighbours for any nodes.
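For example, a minimal end-to-end session (assuming the atomspace already contains SimilarityLinks among the relevant concept nodes, including "dog") would embed and then query with the same link type:

(embedSpace 'SimilarityLink 50)
(kNN (cog-node 'ConceptNode "dog") 'SimilarityLink 10)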
The entire embedding (the list of pivots and each node's embedding vector) can be written to the cogserver log using
(logEmbedding LinkType)
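For example, to dump the SimilarityLink embedding produced above (assuming the same quoted-symbol convention as embedSpace and kNN):

(logEmbedding 'SimilarityLink)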
== About Complexity
The embedding runs Dijkstra's algorithm over the whole atomspace once for each pivot, so it should take O(D*(|E| + |V|log|V|)) time, where |E| is the number of links, |V| is the number of nodes, and D is the number of dimensions (pivots). Experiments with random datasets seem to bear this out (specific times from my 2008 laptop; a simplified sketch of the per-pivot computation appears at the end of this section):
For embedding in 50 dimensions:
* 1,000 nodes, 4,000 links: 21 seconds
* 1,000 nodes, 8,000 links: 41 seconds
* 1,000 nodes, 12,000 links: 63 seconds
* 3,000 nodes, 9,000 links: 54 seconds
* 3,000 nodes, 18,000 links: 114 seconds
* 3,000 nodes, 24,000 links: 155 seconds
* 10,000 nodes, 30,000 links: 246 seconds
* 10,000 nodes, 60,000 links: 415 seconds
* 10,000 nodes, 90,000 links: 585 seconds
The time per pivot was consistent within any given embedding (i.e. the first 10 pivots took about as long to process as the last 10). The thesaurus dataset had 6,000 nodes and 500,000 links and took about 3,000 seconds (50 minutes, or roughly one minute per pivot), which is reasonably consistent with the random datasets and the predicted big-O complexity.
See the CiteSeerX paper linked above for more detail on the embedding algorithm.
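For a concrete (if simplified) picture of the per-pivot step, here is a small standalone sketch in Guile Scheme. It is not the module's actual implementation: the graph representation, helper names, and example data are invented for illustration, pivot selection is not shown, and the naive linear-scan Dijkstra below is O(|V|^2) per pivot rather than the heap-based O(|E| + |V|log|V|) assumed in the estimate above. The point is only the overall shape: one shortest-path pass per pivot, with each node's embedding vector being its list of distances to the pivots.

;; Illustrative only: embed a small weighted graph by picking pivots and
;; recording each node's shortest-path distance to every pivot.
;; Graph: association list of (node . ((neighbour . weight) ...)).
(define graph
  '((a . ((b . 1) (c . 4)))
    (b . ((a . 1) (c . 2) (d . 5)))
    (c . ((a . 4) (b . 2) (d . 1)))
    (d . ((b . 5) (c . 1)))))

(define (nodes g) (map car g))
(define (neighbours g v) (cdr (assq v g)))

;; Naive Dijkstra: pick the closest unvisited node by a linear scan,
;; then relax its neighbours. O(|V|^2) per source.
(define (dijkstra g source)
  (let loop ((dist (map (lambda (v) (cons v (if (eq? v source) 0 +inf.0)))
                        (nodes g)))
             (unvisited (nodes g)))
    (if (null? unvisited)
        dist
        (let* ((u (car (sort unvisited
                             (lambda (x y)
                               (< (cdr (assq x dist)) (cdr (assq y dist)))))))
               (du (cdr (assq u dist))))
          (loop (map (lambda (pair)
                       (let ((edge (assq (car pair) (neighbours g u))))
                         (if (and edge (< (+ du (cdr edge)) (cdr pair)))
                             (cons (car pair) (+ du (cdr edge)))
                             pair)))
                     dist)
                (filter (lambda (v) (not (eq? v u))) unvisited))))))

;; One Dijkstra run per pivot: a node's embedding vector is its list of
;; distances to the pivots, so D pivots means D shortest-path passes.
(define (embed g pivots)
  (let ((per-pivot (map (lambda (p) (dijkstra g p)) pivots)))
    (map (lambda (v)
           (cons v (map (lambda (d) (cdr (assq v d))) per-pivot)))
         (nodes g))))

(embed graph '(a d))   ; => ((a 0 4) (b 1 3) (c 3 1) (d 4 0))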