How to find duplicate documents in a MongoDB collection?

In this post we will show you how to find duplicate documents in your existing database using MongoDB’s aggregation pipeline.

Let’s take a sample employees collection for this post. I’ve inserted the following documents into the employees collection.

> db.employees.insertOne({employeeId: 1, employeeName: "Anand"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfbcf0694cb7ba344f46b")
}
> db.employees.insertOne({employeeId: 2, employeeName: "Wayne"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfbd80694cb7ba344f46c")
}
> db.employees.insertOne({employeeId: 3, employeeName: "Clark"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfbf10694cb7ba344f46d")
}
> db.employees.insertOne({employeeId: 2, employeeName: "Juan"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfc050694cb7ba344f46e")
}
> db.employees.insertOne({employeeId: 3, employeeName: "Naval"})
{
	"acknowledged" : true,
	"insertedId" : ObjectId("5f2cfc3b0694cb7ba344f46f")
}
are
> db.employees.find({})
{ "_id" : ObjectId("5f2cfbcf0694cb7ba344f46b"), "employeeId" : 1, "employeeName" : "Anand" }
{ "_id" : ObjectId("5f2cfbd80694cb7ba344f46c"), "employeeId" : 2, "employeeName" : "Wayne" }
{ "_id" : ObjectId("5f2cfbf10694cb7ba344f46d"), "employeeId" : 3, "employeeName" : "Clark" }
{ "_id" : ObjectId("5f2cfc050694cb7ba344f46e"), "employeeId" : 2, "employeeName" : "Juan" }
{ "_id" : ObjectId("5f2cfc3b0694cb7ba344f46f"), "employeeId" : 3, "employeeName" : "Naval" }

As you can see there are two documents with same employeeId 2 and 3. To find these documents with duplicate employee ids, we are going to use MongoDB’s aggregation pipeline. It makes finding duplicate documents easier by allowing you to customize how documents are grouped together and filtered. In other words, MongoDB lets you select fields and group together documents based on your selection in order to find duplicate documents in a collection.

We can achieve this in two steps, first the documents can be grouped based on employeeId and if employeeId matches, we sum it up and put it in count variable.

{
    $group: {
        _id: "$field_to_search",
        "dups": {
            "$push": "$_id"
        },
        count: {
            $sum: 1
        }
    }
}

In the next step we check if the count is greater than 1, which gives you the duplicate documents.

{
  count: {
    $gt: 1
  }
}

Combining these two steps, you can easily find duplicate documents in a MongoDB collection.

> db.employees.aggregate([{
    $group: {
        _id: "$employeeId",
        "dups": {
            "$push": "$_id"
        },
        count: {
            $sum: 1
        }
    }
}, {
    $match: {
        count: {
            "$gt": 1
        }
    }
}, {
    $sort: {
        count: -1
    }
}]);
{ "_id" : { "employeeId" : 3 }, "uniqueIds" : [ ObjectId("5f2cfc3b0694cb7ba344f46f"), ObjectId("5f2cfbf10694cb7ba344f46d") ], "count" : 2 }
{ "_id" : { "employeeId" : 2 }, "uniqueIds" : [ ObjectId("5f2cfc050694cb7ba344f46e"), ObjectId("5f2cfbd80694cb7ba344f46c") ], "count" : 2 }

Python

[
    {
        '$group': {
            '_id': '$featureName', 
            'dups': {
                '$push': '$_id'
            }, 
            'count': {
                '$sum': 1
            }
        }
    }, {
        '$match': {
            'count': {
                '$gt': 1
            }
        }
    }, {
        '$sort': {
            'count': -1
        }
    }
]

JavaScript

[
  {
    '$group': {
      '_id': '$featureName', 
      'dups': {
        '$push': '$_id'
      }, 
      'count': {
        '$sum': 1
      }
    }
  }, {
    '$match': {
      'count': {
        '$gt': 1
      }
    }
  }, {
    '$sort': {
      'count': -1
    }
  }
]

Java

Arrays.asList(group("$featureName", push("dups", "$_id"), sum("count", 1L)), match(gt("count", 1L)), sort(descending("count")))

C#

new BsonArray
{
    new BsonDocument("$group", 
    new BsonDocument
        {
            { "_id", "$featureName" }, 
            { "dups", 
    new BsonDocument("$push", "$_id") }, 
            { "count", 
    new BsonDocument("$sum", 1) }
        }),
    new BsonDocument("$match", 
    new BsonDocument("count", 
    new BsonDocument("$gt", 1))),
    new BsonDocument("$sort", 
    new BsonDocument("count", -1))
}

If you’re looking for a solution in JavaScript, you can take a look at this solution developed by Tej Pratap.