Signature Similarity Tool
Overview
The signature similarity tool computes the similarity (also called the match score or relevance) between two Semantic Signatures®. The similarity score is a value between 0 and 1. Apart from floating-point rounding error, a signature compared with itself scores 1. The higher the score, the more similar or relevant the content behind the two signatures is.
Pseudocode
The following pseudocode illustrates how the similarity between two signatures is computed. The score is the sum, over the dimensions that appear in both signatures, of the products of the two weights on each dimension. Dimensions that appear in only one signature contribute nothing.
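    float score = 0;
    // Visit every dimension of the first signature.
    for (dimension in signature1) {
        // Only dimensions present in both signatures contribute.
        if (dimension in signature2) {
            score += signature1[dimension] * signature2[dimension];
        }
    }
    return score;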
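The pseudocode above can be sketched in Java roughly as follows. This is a minimal illustration, not the tool's actual implementation: it assumes a signature has already been loaded into a `Map<Integer, Double>` from dimension ID to weight, and the class name `SignatureSimilarity` and the sample weights are made up for the example. Since a signature scores 1 against itself, the weights appear to be normalized to unit length, so the sample signatures here are too.

```java
import java.util.HashMap;
import java.util.Map;

class SignatureSimilarity {

    // Sum of the products of the weights on dimensions present in both signatures.
    static double similarity(Map<Integer, Double> sig1, Map<Integer, Double> sig2) {
        // Iterate over the smaller map; the result is the same either way.
        Map<Integer, Double> small = sig1.size() <= sig2.size() ? sig1 : sig2;
        Map<Integer, Double> large = (small == sig1) ? sig2 : sig1;
        double score = 0.0;
        for (Map.Entry<Integer, Double> entry : small.entrySet()) {
            Double other = large.get(entry.getKey());
            if (other != null) {
                score += entry.getValue() * other;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical weights; each signature has unit length (0.6^2 + 0.8^2 = 1).
        Map<Integer, Double> a = new HashMap<>();
        a.put(1, 0.6);
        a.put(2, 0.8);
        Map<Integer, Double> b = new HashMap<>();
        b.put(2, 0.6);
        b.put(3, 0.8);

        // A signature compared with itself: approximately 1.
        System.out.println(similarity(a, a));
        // Only dimension 2 is shared, so the score is 0.8 * 0.6.
        System.out.println(similarity(a, b));
    }
}
```

Looking up each dimension in a hash map keeps the cost proportional to the size of the smaller signature, which matters when signatures are sparse.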
Help Text for similarity
    $ java -jar sh-tools.jar similarity
    Usage: java com.semantichacker.api.tools.Similarity [OPTIONS] file1 file2
    Compute the similarity score of two Semantic Signatures

    Option Summary:
      file1           The first signature (XML from API) to compute the similarity of
      file2           The second signature (XML from API) to compute the similarity of
      -v, --verbose   Print each matching dimension and rank the score
      --labels FILE   The list of labels from the SemanticHacker Datafile, this allows
                      labels without needing to get them from the API
      --nolabels      Do not show labels in dimension printout.
      -h, --help      Display this help

    Homepage: http://www.semantichacker.com
Examples
The following example retrieves two signatures from the API and computes their similarity score.
    $ java -jar sh-tools.jar signature -t TOKEN -c java --xmlout --outfile java.xml
    $ java -jar sh-tools.jar signature -t TOKEN -c jdk --xmlout --outfile jdk.xml
    $ java -jar sh-tools.jar similarity -v java.xml jdk.xml
    Dim ID  Sig1      Sig2      Weight    Label
    9442    0.301207  0.287189  0.086503  Computers/Programming/Languages/Java/Resources
    9465    0.291356  0.142665  0.041566  Computers/Programming/Languages/Java/News_and_Media/Books
    9443    0.234836  0.087548  0.020559  Computers/Programming/Languages/Java/Resources/Certification
    9422    0.233443  0.126003  0.029415  Computers/Programming/Languages/Java
    9467    0.209163  0.225452  0.047156  Computers/Programming/Languages/Java/Official_Documentation
    9427    0.201700  0.160814  0.032436  Computers/Programming/Languages/Java/Development_Tools/Performance_and_Testing
    9440    0.200207  0.398539  0.079791  Computers/Programming/Languages/Java/Implementations
    9446    0.187670  0.242634  0.045535  Computers/Programming/Languages/Java/FAQs,_Help,_and_Tutorials/Tutorials
    9445    0.185879  0.288379  0.053604  Computers/Programming/Languages/Java/FAQs,_Help,_and_Tutorials/FAQs
    9423    0.173341  0.316496  0.054862  Computers/Programming/Languages/Java/Development_Tools
    9441    0.172246  0.115069  0.019820  Computers/Programming/Languages/Java/Personal_Pages
    9474    0.162494  0.087845  0.014274  Computers/Programming/Languages/Java/Applications
    9456    0.156624  0.105102  0.016461  Computers/Programming/Languages/Java/Class_Libraries/Data_Formats
    9449    0.152345  0.103168  0.015717  Computers/Programming/Languages/Java/Mailing_Lists
    9453    0.151250  0.148467  0.022456  Computers/Programming/Languages/Java/Class_Libraries/Graphics
    9452    0.150852  0.148467  0.022397  Computers/Programming/Languages/Java/Class_Libraries
    9466    0.139011  0.081895  0.011384  Computers/Programming/Languages/Java/News_and_Media/Magazines_and_E-zines/Articles
    9755    0.138812  0.181120  0.025142  Computers/Programming/Threads/Java
    9433    0.133239  0.090449  0.012051  Computers/Programming/Languages/Java/Server-Side/JavaServer_Pages
    0.6511292
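The header row and the 19 rows that follow it list the dimensions that are common to the two signatures. Sig1 and Sig2 are the weights of the dimension in each signature, and Weight is their product, which is that dimension's contribution to the score. The last line is the similarity score of the signatures, which is the sum of the Weight column.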
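The example output can be checked against the pseudocode by multiplying the Sig1 and Sig2 values for each dimension and summing the results. The sketch below does this with the values copied from the verbose output above; the class name `ExampleScoreCheck` and its members are illustrative only. Because the printed weights are rounded to six decimal places, the sum agrees with the reported score of 0.6511292 only to about five decimal places.

```java
class ExampleScoreCheck {

    // {Sig1, Sig2} pairs copied from the verbose output, in the order printed.
    static final double[][] ROWS = {
        {0.301207, 0.287189}, {0.291356, 0.142665}, {0.234836, 0.087548},
        {0.233443, 0.126003}, {0.209163, 0.225452}, {0.201700, 0.160814},
        {0.200207, 0.398539}, {0.187670, 0.242634}, {0.185879, 0.288379},
        {0.173341, 0.316496}, {0.172246, 0.115069}, {0.162494, 0.087845},
        {0.156624, 0.105102}, {0.152345, 0.103168}, {0.151250, 0.148467},
        {0.150852, 0.148467}, {0.139011, 0.081895}, {0.138812, 0.181120},
        {0.133239, 0.090449}
    };

    // Sum of Sig1 * Sig2 over all rows, i.e. the sum of the Weight column.
    static double sumOfProducts(double[][] rows) {
        double score = 0.0;
        for (double[] row : rows) {
            score += row[0] * row[1];
        }
        return score;
    }

    public static void main(String[] args) {
        // Prints a value close to the reported score of 0.6511292.
        System.out.printf("%.6f%n", sumOfProducts(ROWS));
    }
}
```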
The following shows the output without the -v parameter.

    $ java -jar sh-tools.jar similarity java.xml jdk.xml
    0.6511292