Skip to content

How to find duplicate documents in a MongoDB collection?

In this post we will show you how to find duplicate documents in your existing database using MongoDB’s aggregation pipeline.

Advertisements

Let’s take a sample employees collection for this post. I’ve inserted the following documents into the employees collection.

> db.employees.insertOne({employeeId: 1, employeeName: "Anand"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfbcf0694cb7ba344f46b")
}
> db.employees.insertOne({employeeId: 2, employeeName: "Wayne"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfbd80694cb7ba344f46c")
}
> db.employees.insertOne({employeeId: 3, employeeName: "Clark"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfbf10694cb7ba344f46d")
}
> db.employees.insertOne({employeeId: 2, employeeName: "Juan"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfc050694cb7ba344f46e")
}
> db.employees.insertOne({employeeId: 3, employeeName: "Naval"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfc3b0694cb7ba344f46f")
}
are
> db.employees.find({})
{ "_id" : ObjectId("5f2cfbcf0694cb7ba344f46b"), "employeeId" : 1, "employeeName" : "Anand" }
{ "_id" : ObjectId("5f2cfbd80694cb7ba344f46c"), "employeeId" : 2, "employeeName" : "Wayne" }
{ "_id" : ObjectId("5f2cfbf10694cb7ba344f46d"), "employeeId" : 3, "employeeName" : "Clark" }
{ "_id" : ObjectId("5f2cfc050694cb7ba344f46e"), "employeeId" : 2, "employeeName" : "Juan" }
{ "_id" : ObjectId("5f2cfc3b0694cb7ba344f46f"), "employeeId" : 3, "employeeName" : "Naval" }
Advertisements

As you can see there are two documents with same employeeId 2 and 3. To find these documents with duplicate employee ids, we are going to use MongoDB’s aggregation pipeline. It makes finding duplicate documents easier by allowing you to customize how documents are grouped together and filtered. In other words, MongoDB lets you select fields and group together documents based on your selection in order to find duplicate documents in a collection.

We can achieve this in two steps, first the documents can be grouped based on employeeId and if employeeId matches, we sum it up and put it in count variable.

{
    $group: {
        _id: "$field_to_search",
        "dups": {
            "$push": "$_id"
        },
        count: {
            $sum: 1
        }
    }
}

In the next step we check if the count is greater than 1, which gives you the duplicate documents.

{
  count: {
    $gt: 1
  }
}

Combining these two steps, you can easily find duplicate documents in a MongoDB collection.

> db.employees.aggregate([{
    $group: {
        _id: "$employeeId",
        "dups": {
            "$push": "$_id"
        },
        count: {
            $sum: 1
        }
    }
}, {
    $match: {
        count: {
            "$gt": 1
        }
    }
}, {
    $sort: {
        count: -1
    }
}]);
Advertisements
{ "_id" : { "employeeId" : 3 }, "uniqueIds" : [ ObjectId("5f2cfc3b0694cb7ba344f46f"), ObjectId("5f2cfbf10694cb7ba344f46d") ], "count" : 2 }
{ "_id" : { "employeeId" : 2 }, "uniqueIds" : [ ObjectId("5f2cfc050694cb7ba344f46e"), ObjectId("5f2cfbd80694cb7ba344f46c") ], "count" : 2 }

Python

[
    {
        '$group': {
            '_id': '$featureName', 
            'dups': {
                '$push': '$_id'
            }, 
            'count': {
                '$sum': 1
            }
        }
    }, {
        '$match': {
            'count': {
                '$gt': 1
            }
        }
    }, {
        '$sort': {
            'count': -1
        }
    }
]

JavaScript

[
  {
    '$group': {
      '_id': '$featureName', 
      'dups': {
        '$push': '$_id'
      }, 
      'count': {
        '$sum': 1
      }
    }
  }, {
    '$match': {
      'count': {
        '$gt': 1
      }
    }
  }, {
    '$sort': {
      'count': -1
    }
  }
]
Advertisements

Java

Arrays.asList(group("$featureName", push("dups", "$_id"), sum("count", 1L)), match(gt("count", 1L)), sort(descending("count")))
Advertisements

C#

new BsonArray
{
    new BsonDocument("$group", 
    new BsonDocument
        {
            { "_id", "$featureName" }, 
            { "dups", 
    new BsonDocument("$push", "$_id") }, 
            { "count", 
    new BsonDocument("$sum", 1) }
        }),
    new BsonDocument("$match", 
    new BsonDocument("count", 
    new BsonDocument("$gt", 1))),
    new BsonDocument("$sort", 
    new BsonDocument("count", -1))
}

If you’re looking for a solution in JavaScript, you can take a look at this solution developed by Tej Pratap.

See also  Convert MongoDB ObjectId to timestamp and vice versa

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.