Improve the MongoDB Aggregation Framework

Posted 2014-2-19 11:55:14

MongoDB recently introduced its new Aggregation Framework. This framework offers a simpler way to calculate aggregate values than the powerful but more involved map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project the documents contained in a MongoDB collection. The remainder of this article describes how the map-reduce algorithm for computing molecular similarities was refactored to make optimal use of the Aggregation Framework. The complete source code can be found on the Datablend public GitHub repository.
1. MongoDB Aggregation Framework
The MongoDB Aggregation Framework draws on the well-known Linux pipeline concept, where the output of one command is piped or redirected to be used as the input of the next command. In the case of MongoDB, multiple operators are combined into a single pipeline that is responsible for processing a stream of documents. Some operators, such as $match, $limit and $skip, take a document as input and output the same document if a certain set of criteria is met. Other operators, such as $project and $unwind, take a single document as input and reshape that document or emit multiple documents based upon a certain projection. Finally, the $group operator takes multiple documents as input and groups them into a single document by aggregating the relevant values. Expressions can be used within some of these operators to calculate new values or execute string operations.
The pipeline is applied to a list of documents and is itself executed as a MongoDB command, resulting in a single MongoDB document that contains an array of all documents that came out at the end of the pipeline. The next section details the refactoring of the molecular similarities algorithm as a pipeline of operators. Make sure to (re)read the previous two articles to fully grasp the implementation logic.
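To make the pipe analogy concrete, here is a minimal pure-Python sketch of three stages chained together. It does not talk to MongoDB; the function names (`match`, `unwind`, `group_count`) and the sample documents are illustrative assumptions that only mimic the behaviour described for $match, $unwind and $group.

```python
from collections import defaultdict


def match(docs, predicate):
    """Pass through only the documents that satisfy the predicate (like $match)."""
    return (doc for doc in docs if predicate(doc))


def unwind(docs, field):
    """Emit one copy of each document per element of the array in `field` (like $unwind)."""
    for doc in docs:
        for element in doc[field]:
            yield {**doc, field: element}


def group_count(docs, key):
    """Collapse documents sharing the same `key` into one document with a count."""
    counts = defaultdict(int)
    for doc in docs:
        counts[doc[key]] += 1
    return [{"_id": k, "count": n} for k, n in counts.items()]


# Hypothetical sample data, not taken from the article's dataset.
docs = [
    {"name": "a", "tags": ["x", "y"]},
    {"name": "b", "tags": ["x"]},
    {"name": "c", "tags": []},
]

# Each stage consumes the output of the previous one, like a shell pipe.
result = group_count(unwind(match(docs, lambda d: d["name"] != "b"), "tags"), "name")
print(result)
```

Document "c" disappears after the unwind step because its array is empty, so no per-element documents are emitted for it.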
2. Molecular Similarity Pipeline

When a pipeline is applied to a collection, all documents in this collection are given as input to the first operator. It is considered best practice to filter this list as early as possible to limit the total number of documents that are passed through the pipeline. In our case, this means filtering out all documents that can never satisfy the target Tanimoto coefficient. Hence, as a first step, we match all documents for which the fingerprint count lies within a certain range. If we target a Tanimoto coefficient of 0.8 with a target compound containing 40 unique fingerprints, the $match operator looks as follows:
{ "$match" :
{ "fingerprint_count" : { "$gte" : 32 , "$lte" : 50}}
}
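The bounds 32 and 50 are consistent with a simple observation: the Tanimoto coefficient of two fingerprint sets can never exceed the ratio of the smaller set size to the larger one, so a compound can only reach a coefficient t against a target with n fingerprints if its own count lies between n·t and n/t. The sketch below derives the stage from that reading; the helper names are hypothetical and the rounding choice (ceil for the lower bound, floor for the upper) is an assumption.

```python
import math
from fractions import Fraction


def fingerprint_count_bounds(target_count, threshold):
    """Return the (lower, upper) fingerprint counts that can still reach the threshold."""
    # Fraction(str(...)) keeps 0.8 exact (4/5), so rounding is not affected by float error.
    t = Fraction(str(threshold))
    lower = math.ceil(target_count * t)
    upper = math.floor(target_count / t)
    return lower, upper


def count_match_stage(target_count, threshold):
    """Build the first $match stage as a plain dict."""
    lower, upper = fingerprint_count_bounds(target_count, threshold)
    return {"$match": {"fingerprint_count": {"$gte": lower, "$lte": upper}}}


print(count_match_stage(40, 0.8))
```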

Only compounds that have a fingerprint count between 32 and 50 are streamed to the next pipeline operator. To perform this filtering, the $match operator can use the index that we defined on the fingerprint_count property. To compute the Tanimoto coefficient, we need the number of fingerprints shared between an input compound and the target compound. To work at the fingerprint level, we use the $unwind operator. $unwind peels off the elements of an array one by one, returning a stream of documents in which the specified array is replaced by one of its elements. In our case, we apply $unwind to the fingerprints property. Hence, each compound document results in n compound documents, where n is the number of unique fingerprints contained in the compound.
{ "$unwind" : "$fingerprints"}
To calculate the number of shared fingerprints, we start by filtering out all documents whose fingerprint is not in the list of fingerprints of the target compound. To do so, we again apply the $match operator, this time on the fingerprints property, so that only documents whose fingerprint is in the list of target fingerprints are retained.
{ "$match" :
{ "fingerprints" :
{ "$in" : [ 1960 , 15111 , 5186 , 5371 , 756 , 1015 , 1018 , 338 , 325 , 776 , 3900 , ..., 2473] }
}
}
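The effect of the $unwind step followed by the $in match can be imitated in a few lines of plain Python. This is only a sketch of the semantics: the compound document, its fingerprint values and the shortened target list are made up for illustration, and the helper names are hypothetical.

```python
def unwind_fingerprints(compound):
    """Yield one document per fingerprint, mimicking {"$unwind": "$fingerprints"}."""
    for fingerprint in compound["fingerprints"]:
        yield {**compound, "fingerprints": fingerprint}


def keep_shared(docs, target_fingerprints):
    """Keep only unwound documents whose fingerprint is in the target list ($match with $in)."""
    target = set(target_fingerprints)
    return [doc for doc in docs if doc["fingerprints"] in target]


# Hypothetical compound and target list; the values are invented for this example.
compound = {"compound_cid": 1, "fingerprint_count": 4, "fingerprints": [1960, 15111, 42, 756]}
target_fingerprints = [1960, 756, 1015]

shared = keep_shared(unwind_fingerprints(compound), target_fingerprints)
print(len(shared))
```

Each surviving document represents exactly one fingerprint that the compound shares with the target, which is why counting them later yields the size of the intersection.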
As we only match fingerprints that are in the list of target fingerprints, the output can be used to count the total number of shared fingerprints. For this, we apply the $group operator on the compound_cid, through which we create a new type of document containing the number of matching fingerprints (by summing the number of occurrences), the total number of fingerprints of the input compound and the SMILES representation.
{ "$group" :
{ "_id" : "$compound_cid" , "fingerprintmatches" : { "$sum" : 1} ,
"totalcount" : { "$first" : "$fingerprint_count"} ,"smiles" : { "$first" : "$smiles"}
}
}
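As a rough illustration of what this $group stage computes, the following pure-Python sketch groups already-unwound documents by compound_cid. The function name and the sample documents (including the SMILES strings) are assumptions made for the example, not data from the article.

```python
def group_by_compound(unwound_docs):
    """Count matching fingerprints per compound, keeping the first count and smiles seen."""
    groups = {}
    for doc in unwound_docs:
        cid = doc["compound_cid"]
        if cid not in groups:
            # First document seen for this compound: mirrors the $first accumulators.
            groups[cid] = {
                "_id": cid,
                "fingerprintmatches": 0,
                "totalcount": doc["fingerprint_count"],
                "smiles": doc["smiles"],
            }
        groups[cid]["fingerprintmatches"] += 1  # mirrors {"$sum": 1}
    return list(groups.values())


# Hypothetical unwound documents that survived the $in match.
unwound = [
    {"compound_cid": 1, "fingerprint_count": 35, "smiles": "CCO", "fingerprints": 1960},
    {"compound_cid": 1, "fingerprint_count": 35, "smiles": "CCO", "fingerprints": 756},
    {"compound_cid": 2, "fingerprint_count": 48, "smiles": "CCN", "fingerprints": 1960},
]

grouped = group_by_compound(unwound)
print(grouped)
```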

We now have all parameters in place to calculate the Tanimoto coefficient. For this we use the $project operator which, next to copying the compound id and smiles property, also adds a new, computed property named tanimoto. It divides the number of matching fingerprints by the size of the union of both fingerprint sets: the 40 fingerprints of the target plus the total count of the input compound, minus the matches.
{ "$project" : { "_id" : 1 , "tanimoto" : { "$divide" : [ "$fingerprintmatches" , { "$subtract" : [ { "$add" : [ 40 , "$totalcount"] } , "$fingerprintmatches"] } ] } , "smiles" : 1}}
As we are only interested in compounds that reach the target Tanimoto coefficient of 0.8, we apply an additional $match operator to filter out all the ones that do not reach this coefficient.
{ "$match" :
{ "tanimoto" : { "$gte" : 0.8} }
}
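The arithmetic behind the tanimoto property and the final filter can be checked by hand with a small Python sketch. The function names and the two grouped documents are hypothetical; only the formula (matches divided by target count plus total count minus matches) comes from the $project stage in the text.

```python
def tanimoto(shared, target_count, total_count):
    """Shared fingerprints divided by the size of the union of both fingerprint sets."""
    return shared / (target_count + total_count - shared)


def project_and_filter(grouped_docs, target_count, threshold):
    """Compute the coefficient per compound and keep those at or above the threshold."""
    results = []
    for doc in grouped_docs:
        score = tanimoto(doc["fingerprintmatches"], target_count, doc["totalcount"])
        if score >= threshold:
            results.append({"_id": doc["_id"], "tanimoto": score, "smiles": doc["smiles"]})
    return results


# Hypothetical grouped documents; the target has 40 fingerprints as in the text.
grouped = [
    {"_id": 1, "fingerprintmatches": 36, "totalcount": 40, "smiles": "CCO"},
    {"_id": 2, "fingerprintmatches": 30, "totalcount": 40, "smiles": "CCN"},
]

similar = project_and_filter(grouped, target_count=40, threshold=0.8)
print(similar)
```

In this made-up data the first compound scores 36 / 44 and is kept, while the second scores 30 / 50 and is dropped.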
The full pipeline command can be found below.
{ "aggregate" : "compounds" ,
  "pipeline" : [
    { "$match" :
      { "fingerprint_count" : { "$gte" : 32 , "$lte" : 50} }
    },
    { "$unwind" : "$fingerprints"},
    { "$match" :
      { "fingerprints" :
        { "$in" : [ 1960 , 15111 , 5186 , 5371 , 756 , 1015 , 1018 , 338 , 325 , 776 , 3900, ... , 2473] }
      }
    },
    { "$group" :
      { "_id" : "$compound_cid" ,
        "fingerprintmatches" : { "$sum" : 1} ,
        "totalcount" : { "$first" : "$fingerprint_count"} ,
        "smiles" : { "$first" : "$smiles"}
      }
    },
    { "$project" :
      { "_id" : 1 ,
        "tanimoto" : { "$divide" : [ "$fingerprintmatches" , { "$subtract": [ { "$add" : [ 40 , "$totalcount"]} , "$fingerprintmatches"] } ] } ,
        "smiles" : 1
      }
    },
    { "$match" :
      { "tanimoto" : { "$gte" : 0.8} }
    } ]
}
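Since the constants in this command all depend on the target compound and the desired coefficient, it can be convenient to generate the pipeline from those two inputs. The following is a sketch under that assumption: `build_similarity_pipeline` is a hypothetical helper, the 40 placeholder fingerprint ids are not real data, and the count bounds use ceil and floor of n·t and n/t. With a driver such as PyMongo, a list like this would typically be passed to the collection's `aggregate` method.

```python
import math
from fractions import Fraction


def build_similarity_pipeline(target_fingerprints, threshold):
    """Assemble the six pipeline stages as plain dicts for a given target and threshold."""
    fingerprints = list(target_fingerprints)
    n = len(fingerprints)
    t = Fraction(str(threshold))  # exact arithmetic for the count bounds
    return [
        {"$match": {"fingerprint_count": {"$gte": math.ceil(n * t), "$lte": math.floor(n / t)}}},
        {"$unwind": "$fingerprints"},
        {"$match": {"fingerprints": {"$in": fingerprints}}},
        {"$group": {
            "_id": "$compound_cid",
            "fingerprintmatches": {"$sum": 1},
            "totalcount": {"$first": "$fingerprint_count"},
            "smiles": {"$first": "$smiles"},
        }},
        {"$project": {
            "_id": 1,
            "tanimoto": {"$divide": [
                "$fingerprintmatches",
                {"$subtract": [{"$add": [n, "$totalcount"]}, "$fingerprintmatches"]},
            ]},
            "smiles": 1,
        }},
        {"$match": {"tanimoto": {"$gte": threshold}}},
    ]


# Hypothetical target: 40 placeholder fingerprint ids.
pipeline = build_similarity_pipeline(range(1, 41), 0.8)
command = {"aggregate": "compounds", "pipeline": pipeline}
print(command["pipeline"][0])
```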
The output of this pipeline contains a list of compounds that have a Tanimoto coefficient of 0.8 or higher with respect to a particular target compound. A visual representation of this pipeline can be found below:

[Figure: pipeline.jpg, uploaded 2012-3-8 11:32]

3. Conclusion
The new MongoDB Aggregation Framework provides a set of easy-to-use operators that allow users to express map-reduce-style algorithms in a more concise fashion. The pipeline concept beneath it offers an intuitive way of processing data. It is no surprise that this pipeline paradigm has been adopted by various NoSQL approaches, including TinkerPop's Gremlin framework and Neo4j's Cypher implementation.
Performance-wise, the pipeline solution is a major improvement over the map-reduce implementation. The operators it employs are natively supported by the MongoDB platform, which makes them much faster than interpreted JavaScript. As the Aggregation Framework is also able to work in a sharded environment, it easily beats the performance of my initial implementation, especially when the number of input compounds is high and the target Tanimoto coefficient is low. Great work from the MongoDB team!

千问 · 2014-2-19 11:55:14

This makes my head spin; could someone post a Chinese version?