The Jaccard index between two bit-vectors A and B is
||A intersection B|| / ||A union B||
= popcount(A & B) / popcount(A | B)
Let N be the number of bits in the bit-vector.
Consider A_1, A_2, A_3, ... A_i which are distinct bit-vectors in the 2^N-1 possibilities. (Clearly the all-zero bit vector is not a solution since it will have an index of 0.0 with all other bit-vectors.)
For each N, what is the largest i such that all pair-wise Jaccard index values are distinct? And what is an example?
This comes up in testing algorithms in my field of cheminformatics. The bit-vectors are "molecular fingerprints", and my field refers to the Jaccard index as "the Tanimoto score".
If there are ties in the scores then different implementations of the same algorithm break ties arbitrarily.
For cross testing, I want a test set which has no ties. All implementations should give the same results.
This is not urgent. We were able to come up with a random sampling method which found 27 bit vectors for N=64 bits. (I also tried a GA and a Z3 search, but those weren't as effective.)
The Jaccard index between two bit-vectors A and B is
Let N be the number of bits in the bit-vector.Consider A_1, A_2, A_3, ... A_i which are distinct bit-vectors in the 2^N-1 possibilities. (Clearly the all-zero bit vector is not a solution since it will have an index of 0.0 with all other bit-vectors.)
For each N, what is the largest i such that all pair-wise Jaccard index values are distinct? And what is an example?
This comes up in testing algorithms in my field of cheminformatics. The bit-vectors are "molecular fingerprints", and my field refers to the Jaccard index as "the Tanimoto score".
If there are ties in the scores then different implementations of the same algorithm break ties arbitrarily.
For cross testing, I want a test set which has no ties. All implementations should give the same results.
This is not urgent. We were able to come up with a random sampling method which found 27 bit vectors for N=64 bits. (I also tried a GA and a Z3 search, but those weren't as effective.)
Still, I would like to know the maximum value for i.I brute-forced it in C and found values up to N=8 or so. Starting from N=2: max i = 2, 3, 4, 5, 5, 6, 7. An example solution for N=8 is: