####
**TL;DR:**

In his

excellent blog post, Matt Swain described the implementation of compound similarity searches in

MongoDB. Unfortunately, Matt's approach had suboptimal (

polynomial) time complexity with respect to decreasing similarity thresholds, which renders unsuitable for production environments. In this article, we improve on the method by enhancing it with

Locality Sensitive Hashing algorithm, which significantly reduces query time and outperforms

RDKit PostgreSQL cartridge.

####
**myChEMBL 21 - NoSQL edition **

####
** **

Given that

NoSQL technologies applied to computational chemistry and cheminformatics are gaining traction and popularity, we decided to include a taster in future myChEMBL releases. Two especially appealing technologies are

Neo4j and

MongoDB. The former is a graph database and the latter is a

BSON document storage. We would like to provide

IPython notebook-based tutorials explaining how to use this software to deal with common cheminformatics problems.

Matt Swain's article "

Chemical similarity search in MongoDB" seemed to be an ideal candidate for converting it to IPython notebook and shipping with the next myChEMBL release, so we've started working on it. Matt mentions other researchers, who have worked on this topic already: Davy Suvee, whose series (

1,

2,

3) of blog posts defined the basic algorithm, without providing any statistics about performance and accuracy and

Rajarshi Guha, who attempted to verify time complexity using one million compound sample from ChEMBL 17 and was disappointed with the performance of a median query time of 6332 ms with the similarity threshold set to 0.9.

####
**General algorithm:**

The general algorithm developed by Suvee, verified by Guha and improved by Swain looks as follows:

1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in MongoDB collection (precomputing step).

2. For each fingerprint generated in step 1, store it along with the number of occurrences and its length in separate MongoDB collection (precomputing step)

3. For the query compound, compute the fingerprint and its length

Lq.

4. Apply pruning rules, originally described by Swamidass and Baldi in their

paper:

a) Consider only fingerprints with length

L in range:

` T * Lq <= L <= Lq / T `

where

T is a threshold

b) Consider:

Lq − T * Lq + 1
most differentiating fingerprint bits to check if at least one is present in searched compounds.
5. For all compounds which successfully passed pruning in step 4, compute the Tanimoto coefficient and return only those with a similarity score higher than the given threshold.
####
**Time complexity:**

After benchmarking his solution with 1000 random compounds from ChEMBL 18, Matt got following results:

While not identical with the results we obtained (we got higher absolute values because we had

SMILES on input, as opposed to RDKit objects in Matt's code, we measured the mean query time by recording the time needed to complete 1k random queries against ChEMBL 20 using a

docker container running on tiny Gigabyte BRIX machine with a Intel Core i7-3537U 2GHz CPU, 16GB RAM and 256GB SSD with Ubuntu 14.04 as host OS), the shape of the curve is quite similar and indicates polynomial time complexity with respect to decreasing threshold.

This is easy to understand as the function f(T) = [(Lq/T) - TLq](Lq-TLq + 1) can be reduced to a polynomial of the fourth degree. The lower is threshold the more compounds pass pruning and it takes more time to compute Tanimoto coefficient. Assuming optimistically that Matt's results can be easily achieved on any machine/computing environment, then searching 1k compounds with the threshold of 0.7 would take more than 16 minutes (and usually the query is provided as SMILES string, not a binary RDKit object, which requires expensive time consuming conversion), which can be too long for production environments. Moreover, there is a faster alternative, namely RDKit cartridge, which is (according to Matt's results) faster for thresholds lower than 0.85 and a bit slower for higher thresholds.

####
**Introducing LSH**

After we realised, that the time complexity is not optimal (mostly by sitting and staring at monitor for ages waiting for results), we decided to look for ways to improve the process. Some more time spent on researching this topic led us to the conclusion that using a technique called 'Locality Sensitive Hashing' (LSH) can help us solving the problem.

The idea behind LSH is pretty simple - while most hashes we use everyday (cryptographic hashes like SHA) generate completely different values for documents, even when they differ only slightly, LSH tries to generate similar hash values for similar documents in order to maximize the probability of hash collision.
We've managed to find only two research papers, directly describing the application of LSH to chemical similarity. The first one, by Dutta, Guha et al. is quite old (2006) and was published 3 whole years before the initial release of MongoDB. It looks at a bit different problems, such as outlier detection and clustering and uses LSH implementation written in C++ by Andoni and Indyk - authors of the authoritative paper in this field: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. The second paper by Nasr, Baldi et al. exactly describes our problem but is highly theoretical and doesn't provide any code or hints for practical implementation.

####
**Comparison with audio fingerprinting**

Knowing that our problem is quite analogous to finding songs similar to the one played currently in the radio, we searched for applications of LSH to audio fingerprinting and stumbled upon this article by Ciumac Sergiu. It describes an algorithm for detecting duplicate songs using MinHash (min-wise independent permutations LSH scheme). Based on this article, we've created an improved compound similarity search algorithm, which as the following steps:

1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in MongoDB collection (precomputing step).

2. Generate 100 random permutations of length equal to the length of a single fingerprint and save them as a separate MongoDB collection.

3. For each fingerprint:

For each permutation generated in step 2, reorder the fingerprint bits according to the permutation and store the position of the first '1' bit.

After this is completed each fingerprint will have a corresponding list of 100 positions of the first '1' according to 100 permutations, store this information along with the fingerprint.

4. For each list generated in step 3, divide this list into 25 chunks, 4 numbers in each chunk. Using

left shift (<<) binary operation convert four numbers in each chunk into single 64-bit number. After this is done each fingerprint will be reduced to the vector of 25 integers, store this vector along with the fingerprint.

5. Create 25 different MongoDB collections. For each compound, get the corresponding 64-bit integer vector. For each number in the vector check the respective collection for the document identified by this number. If the document exists, extend it with compound identifier. If it doesn't exist create a new document identified by the number and containing single compound identifier.

6. For the query compound use the method described in steps 3 and 4 to generate a vector of 25 64-bit integer numbers. For each number check the respective collection for the document identified by this number. It it exists take compounds assigned to this document. Get the distinct set of all compounds found by this process for all 25 numbers. On this set, apply pruning procedure from the original algorithm and compute Tanimoto coefficient to all compound that successfully passed pruning. Return only those with a similarity score higher than the given threshold.

####
**Comparing results**

After implementing this method in Python and calling it from IPython notebook we've benchmarked it in the same way as we did with original algorithm (mean time of 5 repetitions of searching 1k random sample against 1455713 compounds from ChEMBL 20 for each discrete threshold value). In the same way we measured RDKit postgreSQL cartridge performance and got these results:

As you can see, LSH-based method is always faster than PostgreSQL cartridge and apart from 0.95 point, faster than Swain's implementation. Again, you may wonder why all the numbers are higher than Matt's results. Well, the most important difference between our and Matt's implementation is that our software expects SMILES as an input whereas Matt's stuff expects RDKit mol. This means that our methods have to perform one more expensive step during each iteration. We've decided to implement our functions like this because this is what they should do in practice. Another (but less significant) factor is that we were running our calculations in completely different environment (hardware, OS, presence/absence of virtualisation, ChEMBL version, RDKit version).

####
**What about accuracy?**

**EDIT:** we've updated error plots due to the bug in code (explained in comments), the actual error values are lower then we said previously.

Shorter timings is one thing but are we getting exactly the same results? Since our hashing function is only an approximation, we should expect some discrepancies. To measure the number of discrepancies, we selected random sample of 1k compounds and compared results from Swain's and our implementation. For each pair of results, we recorded the size of symmetric set difference between results int the pair and divided them by the size of sum (union) of both result sets. We repeated this for different threshold values, results (in percent) are presented below:

As an alternative error measure, we counted number of compounds, for which we observed any discrepancy in results (regardless of its size) and divided by the number of compounds in sample (1k). Again, results in percent are shown below:

As can be noticed, for thresholds greater or equal to 0.8 the results are almost the same. For threshold 0.7, less than 6% of query compounds has different results. As we checked, the result of LSH similarity is always a subset of original method. This number alone is probably not very high but we decided to further investigate this. At first we thought that since without any pruning and filtering by Tanimoto coefficient, pure LSH procedure returns on average 3.5k 'similar' compounds for given query compound, then for low thresholds standard algorithm will yield more results and this is why we observe differences. This wouldn't be very bad because usually when asking for similar compounds we don't expect to deal with thousands of them. But it turned out that this situation happens very rarely, especially for Morgan fingerprints. To get better idea of what really happens let's take some example query compound that generates different results.

For input SMILES:

`Clc1ccc(Cn2c(nc3ccccc23)C(=O)N4CCC(CC4)C(=O)NCCc5ccncc5)cc1`

We've got two two results using standard algorithm and one using LSH. Those where

`CHEMBL3085935`

and

`CHEMBL3085944`

, their SMILES are:

`O=C(NCCc1ccncc1)C1CCN(C(=O)c2nc3ccccc3n2Cc2ccc(Cl)cc2)CC1`

`O=C(NCCc1ccncc1)C1CCN(C(=O)c2cc3ccccc3n2Cc2ccc(Cl)cc2)CC1`

The difference is marked in red. Let's look at their vector of 25 hashes:

`[644282988592, 77607305247, 610086746128, 412446916610, 266355085320, 137447358470,...]`

`[25807697968, 77607305263, 610086778896, 412514025474, 403794038792, 137447358509,...]`

Notice, that some of them have very similar values (thanks to LSH). This means that the algorithm can be further improved by increasing number of chunks (instead of dividing 100 integers into 25 groups of 4, we could split them into 50 groups of 2 or even skip diving them at all) or use a different data structure, that instead of exactly matching hash string would first give a neighbors within some radius and return compounds from this neighborhood. Of course there would be some performance penalty but we believe it's possible to make the discrepancy chart completely flat in the range of thresholds greater than 0.7 and still have better performance than original method.

####

####
**Further improvements/research**

There are number of things mentioned in this article that could be better explored. First of all it should be noted, that we are not really comparing MongoDB and PostgreSQL as competing database engines. This is because both engines are using different techniques to implement compound similarity search. RDKit Postgres cartridge is not using LSH but it should be relatively easy to add code implementing it to RDKit library or as a separate piece of software called from the cartridge. On the other hand MongoDB support any extensions, executed on the MongoDB server, so all computations have to be done on the client side. That said, there is a very interesting project called

ToroDB - MongoDB drop-in replacement, that runs on top of PostgreSQL. So a possible immediate next step would be to run the same compound searching benchmarks using ToroDB, which should give us some idea of how PostgreSQL and MongoDB perform executing the same NoSQL tasks.

Generating permutations for LSH is another interesting topic. In our code, we used the standard numpy random function generating uniformly distributed numbers. It would be interesting to check if using different probability distributions can give better results. The audio fingerprinting article, on which we based our code mentions a paper by Baluja et al. titled "

Permutation grouping: intelligent hash function design for audio and image retrieval".

Measuring accuracy and time over LSH vector/groups size would be important as well. That would allow to find the optimal number of hashes that are required to get desired accuracy/time ratio.

Last but not least, in our benchmarks we've only used Morgan 2048-bit fingerprints. Fixed length fingerprints are already hashed and folded to achieve constant length and the price for this is worse fidelity. Using LSH on unfolded fingerprints may lead to better results; in addition it would also be interesting to compare different types of fingerprints as well as different lengths.

####
**Code**

####
** **

**EDIT:** you can now download the code from

here.

To better illustrate our method, we provide this IPython notebook. As already explained, it's based on Matt Swain

blog post but enhanced with methods described by Ciumac Sergiu in his

article about audio fingerprinting. This notebook will be included (along with all required dependencies) on myChEMBL 21 but if you have ChEMBL 20, you can load it as well but then, it's important to remember about installing MongoDB first. This can be done easily by ssh-ing to the machine and running:

`sudo [apt-get|yum] install mongodb`

and then start the mongod daemon:

`sudo mongod --fork --logpath /var/log/mongod.log `

Using the 'fork' option is very important and not including it may lead to worse performance. When running ChEMBL 20 as a virtual machine, it may happen that the size of collections created by MongoDB will exceed remaining disk space. This is why we suggest using ChEMBL 20 docker container on Linux box when possible because then, the only limiting factor is the size of physical disk space of the host machine.

*Michał*