For one of my projects, I needed a PowerShell script to find duplicate files in a shared network folder on a file server. There are many third-party tools to find and remove duplicate files in Windows, but most of them are commercial or are not suitable for automated scenarios.
Because files may have different names but identical content, you should not compare files by name only. It is better to calculate the hashes of all files and look for matching hash values.
The following PowerShell one-liner recursively scans a folder (including its subfolders) and finds duplicate files. In this example, two identical files with the same hash were found:
Get-ChildItem -Path C:\Share\ -Recurse -File | Get-FileHash | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } | ForEach-Object { $_.Group | Select-Object Path, Hash }
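The output looks something like this (the paths and hash values below are made up for illustration, and the hashes are truncated):
Path                                    Hash
----                                    ----
C:\Share\Docs\report_2023.docx          F8E2B1A94C7D...
C:\Share\Backup\report_2023_copy.docx   F8E2B1A94C7D...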
This PowerShell one-liner is easy to use for finding duplicates, but it has rather poor performance. If there are many files in the folder, calculating their hashes takes a long time. It is faster to first compare the files by size (an existing file attribute that does not need to be calculated), and only then calculate hashes for the files that have the same size:
$file_duplicates = Get-ChildItem -Path C:\Share\ -Recurse -File |
    Group-Object -Property Length | Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group | Get-FileHash |
    Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Select-Object Path, Hash }
You can compare the performance of both commands on a test folder using the Measure-Command cmdlet:
Measure-Command {your_powershell_command}
For a folder with 2,000 files, the second command is much faster than the first (3 seconds vs. 10 minutes).
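For example, to time the first (hash-only) command on the same sample folder C:\Share\ used above and display the elapsed time in seconds:
Measure-Command { Get-ChildItem -Path C:\Share\ -Recurse -File | Get-FileHash | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } } | Select-Object TotalSeconds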
You can prompt a user to select duplicate files to delete. To do this, pipe a list of duplicate files into the Out-GridView cmdlet:
$file_duplicates | Out-GridView -Title "Select files to delete" -PassThru | Remove-Item -Verbose -WhatIf
The user can select the files to be deleted in the table (to select multiple files, press and hold CTRL) and click OK. Note that the -WhatIf parameter only shows which files would be deleted; remove it to actually delete the selected files.
Instead of deleting the files, you can move the selected files to another directory using the Move-Item cmdlet, as shown below.
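A minimal sketch, assuming you want to move the selected duplicates to a folder such as D:\DuplicateFiles (the destination path is hypothetical; files with identical names will conflict in the target folder):
$file_duplicates | Out-GridView -Title "Select files to move" -PassThru |
    Move-Item -Destination D:\DuplicateFiles -Verbose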
In addition, you can replace duplicate files with hard links. This approach keeps the files in their original locations and saves a significant amount of disk space. The script below finds duplicate files in two folders and replaces them with hard links:
param(
    [Parameter(Mandatory=$True)][ValidateScript({Test-Path -Path $_ -PathType Container})][string]$dir1,
    [Parameter(Mandatory=$True)][ValidateScript({(Test-Path -Path $_ -PathType Container) -and $_ -ne $dir1})][string]$dir2
)
# Group files by size first, then hash only the same-sized files, and keep the groups with duplicates
Get-ChildItem -Recurse -File $dir1, $dir2 |
    Group-Object Length | Where-Object { $_.Count -ge 2 } |
    Select-Object -ExpandProperty Group | Get-FileHash |
    Group-Object Hash | Where-Object { $_.Count -ge 2 } |
    ForEach-Object {
        # Remove the first copy and replace it with a hard link to the second one
        # (hard links require both files to be on the same NTFS volume)
        $f1 = $_.Group[0].Path
        Remove-Item $f1
        New-Item -ItemType HardLink -Path $f1 -Target $_.Group[1].Path | Out-Null
        # Alternative: fsutil hardlink create $f1 $_.Group[1].Path
    }
Save the script as hardlinks.ps1 and run it using the following command format:
.\hardlinks.ps1 -dir1 d:\folder1 -dir2 d:\folder2
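To verify that a duplicate has been replaced, you can list all hard links that point to the same file data (the file name below is just an example):
fsutil hardlink list d:\folder1\report.docx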
The script can be used to find and replace duplicates of static files (ones that do not change!) with hard links.
On Windows Server, you can use the built-in Data Deduplication feature of the File Server role to solve the problem of duplicate files. Note, however, that combining deduplication with incremental backup may cause issues when restoring from a backup.
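A minimal sketch of enabling Data Deduplication with PowerShell, assuming a data volume D: (adjust the volume letter for your server and check feature availability for your Windows Server version):
# Install the Data Deduplication feature
Install-WindowsFeature -Name FS-Data-Deduplication
# Enable deduplication on the D: volume for general purpose file server data
Enable-DedupVolume -Volume "D:" -UsageType Default
# Start an optimization job right away instead of waiting for the schedule
Start-DedupJob -Volume "D:" -Type Optimization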
You can also use the console dupemerge tool to replace duplicate files with hard links.