As someone who writes a lot of software for "old-school pharma", I'll point out a few things.
1) "Old-school pharma" includes AI - we were developing AI-based software products for drug discovery back in the 1990s - so what makes this anything new?
2) Clustering chemical compounds dates from the 1980s. Interactive 2D/3D clustering is of course newer, but still ... what makes this new?
3) "tons of molecular data" in this case appears to be https://www.drugbank.ca/ . That is, "the license of our data-provider" links to that page. Quoting from that page:
> The latest release of DrugBank (version 5.1.3, released 2019-04-02) contains 13,339 drug entries including 2,594 approved small molecule drugs, 1,289 approved biotech (protein/peptide) drugs, 130 nutraceuticals and over 6,304 experimental drugs.
That's tiny. ChEMBL contains 1.8 million entries, and most pharmas have much more than that.
So, what justifies the use of "tons of molecular data"?
4) "If a drug candidate is in the neighborhood of another molecule, there is an extremely high chance that they are related!"
How was that quantified? It's trivial to have a high true positive rate. For example, graph edit measures like in SmallWorld can make that guarantee.
But as the cluster boundary decreases, it's less and less true.
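To make the "how was that quantified?" question concrete, here is a minimal sketch of the kind of measurement I would expect to see: for a given similarity cutoff, count the neighbor pairs and report what fraction of them are actually related. Everything in it is an assumption for illustration. The fingerprints are made-up sets of feature ids, the labels stand in for whatever "related" means (same target, same class, ...), and Tanimoto similarity is my choice of measure, not necessarily what the product uses.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of feature ids."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def neighbor_precision(fingerprints, labels, threshold):
    """Among all pairs with similarity >= threshold, return
    (number of neighbor pairs, fraction of those pairs sharing a label).
    The fraction is None when no pair passes the threshold."""
    pairs = 0
    related = 0
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
            pairs += 1
            if labels[i] == labels[j]:
                related += 1
    return pairs, (related / pairs if pairs else None)


# Toy, made-up data: five hypothetical compounds in three hypothetical classes.
fingerprints = [
    frozenset({1, 2, 3, 4}),    # class A
    frozenset({1, 2, 3, 5}),    # class A
    frozenset({1, 2, 6, 7}),    # class B
    frozenset({1, 2, 6, 8}),    # class B
    frozenset({1, 9, 10, 11}),  # class C
]
labels = ["A", "A", "B", "B", "C"]

# Strict cutoff first, then progressively looser neighborhoods.
for threshold in (0.5, 0.3, 0.1):
    pairs, precision = neighbor_precision(fingerprints, labels, threshold)
    print(f"threshold={threshold}: {pairs} neighbor pairs, precision={precision}")
```

On this toy set the strict cutoff finds only 2 neighbor pairs, both related, while the loosest cutoff calls all 10 pairs neighbors and only 2 of them are related. A claim of "extremely high chance" needs to say which cutoff it was measured at and how many true relationships that cutoff misses.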