Ask HN: What research questions do you want answered?

  • The Jaccard index between two bit-vectors A and B is

        ||A intersection B|| / ||A union B||
        = popcount(A & B) / popcount(A | B)
    
    Let N be the number of bits in the bit-vector.

    Consider A_1, A_2, A_3, ... A_i which are distinct bit-vectors in the 2^N-1 possibilities. (Clearly the all-zero bit vector is not a solution since it will have an index of 0.0 with all other bit-vectors.)

    For each N, what is the largest i such that all pair-wise Jaccard index values are distinct? And what is an example?

    This comes up in testing algorithms in my field of cheminformatics. The bit-vectors are "molecular fingerprints", and my field refers to the Jaccard index as "the Tanimoto score".

    If there are ties in the scores then different implementations of the same algorithm break ties arbitrarily.

    For cross testing, I want a test set which has no ties. All implementations should give the same results.

    This is not urgent. We were able to come up with a random sampling method which found 27 bit vectors for N=64 bits. (I also tried a GA and a Z3 search, but those weren't as effective.)

      010ca5212bb73967        id1
      222257801a012513        id2
      0005683000080038        id3
      efffff7fffffffff        id4
      fbfbdfffedfbdeff        id5
      21602a5034b064a0        id6
      7d01cc7d85503009        id7
      e7ff2efbfa723f4d        id8
      dfe9bcffddffefff        id9
      11eaf6836d1a1b13        id10
      356c55eff7decef9        id11
      b6bfdfda73f7faff        id12
      0200000000002122        id13
      dfdd7fdf7f7dfeff        id14
      12449480c4c7cb1c        id15
      67efdeff75fbbe4a        id16
      d0ae48211cc31910        id17
      77e23fb4dfd848b1        id18
      53c4fffbe5fbf4fb        id19
      b51f9cbffede9f10        id20
      0100200100000281        id21
      f7f60d895cc39d5f        id22
      f997c59d6c75067a        id23
      eeffffffffffdf7f        id24
      0003d010581e0813        id25
      914d85e6e9f9566b        id26
      9d6f3f77b74a779c        id27
    
    Still, I would like to know the maximum value for i.

    I brute-forced it in C and found values up to N=8 or so. Starting from N=2: max i = 2, 3, 4, 5, 5, 6, 7. An example solution for N=8 is:

      01 id1
      0e id2
      17 id3
      37 id4
      77 id5
      7f id6
      ff id7