Signature Similarity Tool


Overview

The signature similarity tool computes the similarity (also called the match score or relevance) between two Semantic Signatures®. The similarity score is a value between 0 and 1. Except for floating-point error, a signature matches itself with a score of exactly 1. The higher the score, the more similar or relevant the two signatures (and the content they represent) are to one another.

Pseudocode

The following pseudocode illustrates how the similarity between two signatures is computed. For every dimension that appears in both signatures, it multiplies the two weights for that dimension, then adds the products together. Dimensions that appear in only one signature contribute nothing to the score.

float score = 0;
for (dimension in signature1) {
	if (dimension in signature2) {
		score += signature1[dimension] * signature2[dimension];
	}
}
return score;
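
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the tool's own implementation: it assumes a signature is held as a dictionary mapping dimension ID to weight, and the two sample signatures are hypothetical values scaled to unit length so that a signature scores approximately 1 against itself.

```python
def similarity(signature1, signature2):
    """Sum of weight products over the dimensions present in both signatures."""
    score = 0.0
    for dimension, weight in signature1.items():
        if dimension in signature2:
            score += weight * signature2[dimension]
    return score


# Hypothetical signatures (dimension ID -> weight), each scaled to unit length.
sig_a = {1: 0.6, 2: 0.8}
sig_b = {2: 0.6, 3: 0.8}

# Only dimension 2 is shared, so the score is the single product 0.8 * 0.6.
print(similarity(sig_a, sig_b))

# A signature compared with itself scores 1, up to floating-point error.
print(similarity(sig_a, sig_a))
```

Because only shared dimensions contribute, the result is the same whichever signature is passed first.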

Help Text for similarity

$ java -jar sh-tools.jar similarity
Usage: java com.semantichacker.api.tools.Similarity [OPTIONS] file1 file2
Compute the similarity score of two Semantic Signatures

Option Summary:
	file1        		The first signature (XML from API) to compute the
	             		similarity of                                      
	file2        		The second signature (XML from API) to compute the
	             		similarity of                                      
	-v, --verbose	Print each matching dimension and rank the score
	--labels FILE	The list of labels from the SemanticHacker
	             		Datafile, this allows labels without needing to get
	             		them from the API                                  
	--nolabels   	Do not show labels in dimension printout.
	-h, --help   	Display this help

Homepage: http://www.semantichacker.com

Examples

The following example gets two signatures from the API and computes their similarity score.

$ java -jar sh-tools.jar signature -t TOKEN -c java --xmlout --outfile java.xml
$ java -jar sh-tools.jar signature -t TOKEN -c jdk --xmlout --outfile jdk.xml
$ java -jar sh-tools.jar similarity -v java.xml jdk.xml

Dim ID	Sig1		Sig2		Weight		Label
 9442	0.301207	0.287189	0.086503	Computers/Programming/Languages/Java/Resources
 9465	0.291356	0.142665	0.041566	Computers/Programming/Languages/Java/News_and_Media/Books
 9443	0.234836	0.087548	0.020559	Computers/Programming/Languages/Java/Resources/Certification
 9422	0.233443	0.126003	0.029415	Computers/Programming/Languages/Java
 9467	0.209163	0.225452	0.047156	Computers/Programming/Languages/Java/Official_Documentation
 9427	0.201700	0.160814	0.032436	Computers/Programming/Languages/Java/Development_Tools/Performance_and_Testing
 9440	0.200207	0.398539	0.079791	Computers/Programming/Languages/Java/Implementations
 9446	0.187670	0.242634	0.045535	Computers/Programming/Languages/Java/FAQs,_Help,_and_Tutorials/Tutorials
 9445	0.185879	0.288379	0.053604	Computers/Programming/Languages/Java/FAQs,_Help,_and_Tutorials/FAQs
 9423	0.173341	0.316496	0.054862	Computers/Programming/Languages/Java/Development_Tools
 9441	0.172246	0.115069	0.019820	Computers/Programming/Languages/Java/Personal_Pages
 9474	0.162494	0.087845	0.014274	Computers/Programming/Languages/Java/Applications
 9456	0.156624	0.105102	0.016461	Computers/Programming/Languages/Java/Class_Libraries/Data_Formats
 9449	0.152345	0.103168	0.015717	Computers/Programming/Languages/Java/Mailing_Lists
 9453	0.151250	0.148467	0.022456	Computers/Programming/Languages/Java/Class_Libraries/Graphics
 9452	0.150852	0.148467	0.022397	Computers/Programming/Languages/Java/Class_Libraries
 9466	0.139011	0.081895	0.011384	Computers/Programming/Languages/Java/News_and_Media/Magazines_and_E-zines/Articles
 9755	0.138812	0.181120	0.025142	Computers/Programming/Threads/Java
 9433	0.133239	0.090449	0.012051	Computers/Programming/Languages/Java/Server-Side/JavaServer_Pages
0.6511292

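In the output above, the header row and the 19 rows that follow list the dimensions that are common to the two signatures. For each dimension, the Weight column is the product of the Sig1 and Sig2 weights. The last line is the similarity score of the signatures, which is the sum of the Weight column.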
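
The relationship between the verbose rows and the final score can be checked by hand. The sketch below copies the numeric columns from the output above into a list and confirms that each Weight is the product of Sig1 and Sig2, and that the weights add up to the reported score. The names `ROWS`, `REPORTED_SCORE`, and `check_rows` are illustrative only and are not part of the tool.

```python
# (dim_id, sig1_weight, sig2_weight, weight) copied from the verbose output above.
ROWS = [
    (9442, 0.301207, 0.287189, 0.086503),
    (9465, 0.291356, 0.142665, 0.041566),
    (9443, 0.234836, 0.087548, 0.020559),
    (9422, 0.233443, 0.126003, 0.029415),
    (9467, 0.209163, 0.225452, 0.047156),
    (9427, 0.201700, 0.160814, 0.032436),
    (9440, 0.200207, 0.398539, 0.079791),
    (9446, 0.187670, 0.242634, 0.045535),
    (9445, 0.185879, 0.288379, 0.053604),
    (9423, 0.173341, 0.316496, 0.054862),
    (9441, 0.172246, 0.115069, 0.019820),
    (9474, 0.162494, 0.087845, 0.014274),
    (9456, 0.156624, 0.105102, 0.016461),
    (9449, 0.152345, 0.103168, 0.015717),
    (9453, 0.151250, 0.148467, 0.022456),
    (9452, 0.150852, 0.148467, 0.022397),
    (9466, 0.139011, 0.081895, 0.011384),
    (9755, 0.138812, 0.181120, 0.025142),
    (9433, 0.133239, 0.090449, 0.012051),
]
REPORTED_SCORE = 0.6511292  # last line of the output above


def check_rows(rows):
    """Return (sum of the Weight column, largest |Sig1 * Sig2 - Weight|)."""
    total = sum(weight for _, _, _, weight in rows)
    max_err = max(abs(s1 * s2 - weight) for _, s1, s2, weight in rows)
    return total, max_err


total, max_err = check_rows(ROWS)
print(f"sum of weights:          {total:.6f}")
print(f"reported score:          {REPORTED_SCORE}")
print(f"largest row discrepancy: {max_err:.1e}")
```

Small discrepancies are expected because the printed columns are rounded to six decimal places.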

The following shows the output without the -v option.

$ java -jar sh-tools.jar similarity java.xml jdk.xml
0.6511292
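
For scripting, the call can be wrapped so that the printed score is read back as a number. The following is a rough sketch under stated assumptions: `java` is on the PATH, `sh-tools.jar` is in the working directory, and the score is the last line of standard output, as in the examples above. The function names are hypothetical and are not part of sh-tools.

```python
import subprocess


def build_similarity_command(file1, file2, jar="sh-tools.jar"):
    """Argument list for the similarity call shown in the examples above."""
    return ["java", "-jar", jar, "similarity", file1, file2]


def parse_score(stdout):
    """Read the score from the last non-empty line of the tool's output."""
    return float(stdout.strip().splitlines()[-1])


def similarity_score(file1, file2, jar="sh-tools.jar"):
    """Run the tool on two signature XML files and return the score as a float."""
    result = subprocess.run(
        build_similarity_command(file1, file2, jar),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return parse_score(result.stdout)
```

With the files from the example above, `similarity_score("java.xml", "jdk.xml")` would return the score as a float.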