

---
phase: 03
plan: 04
title: "SearchService and DuplicatesService — KQL Pagination and Duplicate Grouping"
status: pending
wave: 2
depends_on: [03-01]
files_modified:
  - SharepointToolbox/Services/SearchService.cs
  - SharepointToolbox/Services/DuplicatesService.cs
autonomous: true
requirements: [SRCH-01, SRCH-02, DUPL-01, DUPL-02]
must_haves:
  truths:
    - SearchService implements ISearchService and builds KQL from all SearchOptions fields (extension, dates, creator, editor, library)
    - SearchService paginates StartRow += 500 and stops when StartRow > 50,000 (platform cap) or MaxResults is reached
    - SearchService filters out _vti_history/ paths from results
    - SearchService applies a client-side Regex filter when SearchOptions.Regex is non-empty
    - DuplicatesService implements IDuplicatesService for both Mode=Files (Search API) and Mode=Folders (CAML FSObjType=1)
    - DuplicatesService groups items by MakeKey composite key and returns only groups with count >= 2
    - All CSOM round-trips use ExecuteQueryRetryHelper.ExecuteQueryRetryAsync
    - Folder enumeration uses SharePointPaginationHelper.GetAllItemsAsync with FSObjType=1 CAML
  artifacts:
    - path: SharepointToolbox/Services/SearchService.cs
      provides: KQL search engine with pagination (SRCH-01/02)
      exports: [SearchService]
    - path: SharepointToolbox/Services/DuplicatesService.cs
      provides: Duplicate detection for files and folders (DUPL-01/02)
      exports: [DuplicatesService]
  key_links:
    - from: SearchService.cs
      to: KeywordQuery + SearchExecutor
      via: Microsoft.SharePoint.Client.Search.Query
      pattern: KeywordQuery
    - from: DuplicatesService.cs
      to: SharePointPaginationHelper.GetAllItemsAsync
      via: folder enumeration
      pattern: SharePointPaginationHelper.GetAllItemsAsync
    - from: DuplicatesService.cs
      to: MakeKey
      via: composite key grouping
      pattern: MakeKey
---

# Plan 03-04: SearchService and DuplicatesService — KQL Pagination and Duplicate Grouping

## Goal

Implement SearchService (KQL-based file search with 500-row pagination and 50,000 hard cap) and DuplicatesService (file duplicates via Search API + folder duplicates via CAML FSObjType=1). Both services are wave 2 — they depend only on the models and interfaces from Plan 03-01, not on StorageService.

## Context

Microsoft.SharePoint.Client.Search.dll is available as a transitive dependency of PnP.Framework 1.18.0. The namespace is Microsoft.SharePoint.Client.Search.Query. The search pattern requires calling executor.ExecuteQuery(kq) to register the query, then ExecuteQueryRetryHelper.ExecuteQueryRetryAsync to execute it — calling ctx.ExecuteQuery() directly afterward is incorrect and must be avoided.

DuplicatesService for folders uses SharePointPaginationHelper.GetAllItemsAsync with FSObjType=1 CAML. The CAML field name is FSObjType (not FileSystemObjectType) — using the wrong name returns zero results silently.
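The FSObjType naming pitfall can be guarded against without a live site. This stand-alone sketch parses the Where clause used by the folder scan and confirms the FieldRef carries the CAML internal name FSObjType, not the CSOM property name FileSystemObjectType (which SharePoint silently matches nothing against):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// The Where clause below mirrors the folder query in CollectFolderItemsAsync.
string whereClause = """
    <Where>
      <Eq>
        <FieldRef Name='FSObjType' />
        <Value Type='Integer'>1</Value>
      </Eq>
    </Where>
    """;

var xml = XElement.Parse(whereClause);
string fieldName = xml.Descendants("FieldRef").Single().Attribute("Name")!.Value;
string value     = xml.Descendants("Value").Single().Value;

Console.WriteLine($"{fieldName} = {value}"); // FSObjType = 1
```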

The MakeKey composite key logic tested in Plan 03-01 DuplicatesServiceTests must match exactly what DuplicatesService implements.
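The grouping contract can be exercised without CSOM. The sketch below uses a simplified stand-in for MakeKey (lower-cased name plus optional size, joined with '|'; the real key also folds in dates and folder/file counts per DuplicateScanOptions) and keeps only groups of two or more:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for DuplicatesService.MakeKey.
string MakeKey(string name, long? size, bool matchSize)
{
    var parts = new List<string> { name.ToLowerInvariant() };
    if (matchSize && size.HasValue) parts.Add(size.Value.ToString());
    return string.Join("|", parts);
}

var items = new (string Name, long? Size)[]
{
    ("Report.docx", 100), ("report.DOCX", 100), ("Report.docx", 999), ("unique.txt", 5)
};

// Same shape as the service: group by composite key, keep groups with count >= 2.
var groups = items
    .GroupBy(i => MakeKey(i.Name, i.Size, matchSize: true))
    .Where(g => g.Count() >= 2)
    .ToList();

Console.WriteLine(groups.Count);  // 1
Console.WriteLine(groups[0].Key); // report.docx|100
```

Note how the size mismatch (999) keeps the third item out of the group even though its name matches, which is exactly the behavior the MatchSize option buys.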

## Tasks

### Task 1: Implement SearchService

**File:** SharepointToolbox/Services/SearchService.cs

**Action:** Create

**Why:** SRCH-01 (multi-criteria search) and SRCH-02 (configurable max results up to 50,000).
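As a quick illustration of the query shape BuildKql produces, here is a simplified stand-in covering just the base filter and the extension clause (the full builder below also handles dates, author, editor, and library scoping):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Each extension becomes FileExtension:<ext> (dot stripped, lower-cased),
// OR-ed inside parentheses, then AND-ed with the base ContentType filter.
string BuildKql(string[] extensions)
{
    var parts = new List<string> { "ContentType:Document" };
    if (extensions.Length > 0)
    {
        var ext = extensions.Select(e => $"FileExtension:{e.TrimStart('.').ToLowerInvariant()}");
        parts.Add($"({string.Join(" OR ", ext)})");
    }
    return string.Join(" AND ", parts);
}

string kql = BuildKql(new[] { ".DOCX", "pdf" });
Console.WriteLine(kql);
// ContentType:Document AND (FileExtension:docx OR FileExtension:pdf)
```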

```csharp
using Microsoft.SharePoint.Client;
using Microsoft.SharePoint.Client.Search.Query;
using SharepointToolbox.Core.Helpers;
using SharepointToolbox.Core.Models;
using System.Text.RegularExpressions;

namespace SharepointToolbox.Services;

/// <summary>
/// File search using SharePoint KQL Search API.
/// Port of PS Search-SPOFiles pattern (PS lines 4747-4987).
/// Pagination: 500 rows per batch, hard cap StartRow=50,000 (SharePoint Search boundary).
/// </summary>
public class SearchService : ISearchService
{
    private const int BatchSize = 500;
    private const int MaxStartRow = 50_000;

    public async Task<IReadOnlyList<SearchResult>> SearchFilesAsync(
        ClientContext ctx,
        SearchOptions options,
        IProgress<OperationProgress> progress,
        CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();

        string kql = BuildKql(options);
        ValidateKqlLength(kql);

        Regex? regexFilter = null;
        if (!string.IsNullOrWhiteSpace(options.Regex))
        {
            regexFilter = new Regex(options.Regex,
                RegexOptions.IgnoreCase | RegexOptions.Compiled,
                TimeSpan.FromSeconds(2));
        }

        var allResults = new List<SearchResult>();
        int startRow = 0;
        int maxResults = Math.Min(options.MaxResults, MaxStartRow);

        do
        {
            ct.ThrowIfCancellationRequested();

            var kq = new KeywordQuery(ctx)
            {
                QueryText      = kql,
                StartRow       = startRow,
                RowLimit       = BatchSize,
                TrimDuplicates = false
            };
            // CSOM's StringCollection has no AddRange; add properties one by one.
            foreach (string prop in new[]
            {
                "Title", "Path", "Author", "LastModifiedTime",
                "FileExtension", "Created", "ModifiedBy", "Size"
            })
            {
                kq.SelectProperties.Add(prop);
            }

            var executor = new SearchExecutor(ctx);
            ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
            await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

            var table = clientResult.Value
                .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults);
            if (table == null || table.RowCount == 0) break;

            // CSOM exposes each result row as IDictionary<string, object>
            foreach (IDictionary<string, object> row in table.ResultRows)
            {
                // Skip SharePoint version history paths
                string path = Str(row, "Path");
                if (path.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase))
                    continue;

                var result = ParseRow(row);

                // Client-side Regex filter on file name
                if (regexFilter != null)
                {
                    string fileName = System.IO.Path.GetFileName(result.Path);
                    if (!regexFilter.IsMatch(fileName) && !regexFilter.IsMatch(result.Title))
                        continue;
                }

                allResults.Add(result);
                if (allResults.Count >= maxResults) goto done;
            }

            progress.Report(new OperationProgress(allResults.Count, maxResults,
                $"Retrieved {allResults.Count:N0} results…"));

            startRow += BatchSize;
        }
        while (startRow <= MaxStartRow && allResults.Count < maxResults);

        done:
        return allResults;
    }

    // ── Extension point: bypassing the 50,000-item cap ───────────────────────
    //
    // The StartRow approach has a hard ceiling at 50,000 (SharePoint Search boundary).
    // To go beyond it, replace the StartRow loop with a DocId cursor:
    //
    //   1. Add "DocId" to SelectProperties.
    //   2. Add query.SortList.Add("DocId", SortDirection.Ascending).
    //   3. First page KQL: unchanged.
    //      Subsequent pages: append "AND DocId>{lastDocId}" to the KQL (StartRow stays 0).
    //   4. Track lastDocId = Convert.ToInt64(lastRow["DocId"]) after each batch.
    //   5. Stop when batch.RowCount < BatchSize.
    //
    // Caveats:
    //   - DocId is per-site-collection; for multi-site searches, maintain a separate
    //     cursor per ClientContext (site URL).
    //   - The search index can shift between batches (new items indexed mid-scan);
    //     the DocId cursor is safer than StartRow but cannot guarantee zero drift.
    //   - DocId is not returned by default — it must be in SelectProperties.
    //
    // This is deliberately not implemented here because SRCH-02 caps results at 50,000,
    // which the StartRow approach already covers exactly (100 pages × 500 rows).
    // Implement the DocId cursor if the cap needs to be lifted in a future version.

    // ── KQL builder ───────────────────────────────────────────────────────────

    internal static string BuildKql(SearchOptions opts)
    {
        var parts = new List<string> { "ContentType:Document" };

        if (opts.Extensions.Length > 0)
        {
            var extParts = opts.Extensions
                .Select(e => $"FileExtension:{e.TrimStart('.').ToLowerInvariant()}");
            parts.Add($"({string.Join(" OR ", extParts)})");
        }
        if (opts.CreatedAfter.HasValue)
            parts.Add($"Created>={opts.CreatedAfter.Value:yyyy-MM-dd}");
        if (opts.CreatedBefore.HasValue)
            parts.Add($"Created<={opts.CreatedBefore.Value:yyyy-MM-dd}");
        if (opts.ModifiedAfter.HasValue)
            parts.Add($"Write>={opts.ModifiedAfter.Value:yyyy-MM-dd}");
        if (opts.ModifiedBefore.HasValue)
            parts.Add($"Write<={opts.ModifiedBefore.Value:yyyy-MM-dd}");
        if (!string.IsNullOrEmpty(opts.CreatedBy))
            parts.Add($"Author:\"{opts.CreatedBy}\"");
        if (!string.IsNullOrEmpty(opts.ModifiedBy))
            parts.Add($"ModifiedBy:\"{opts.ModifiedBy}\"");
        if (!string.IsNullOrEmpty(opts.Library) && !string.IsNullOrEmpty(opts.SiteUrl))
            parts.Add($"Path:\"{opts.SiteUrl.TrimEnd('/')}/{opts.Library.TrimStart('/')}*\"");

        return string.Join(" AND ", parts);
    }

    private static void ValidateKqlLength(string kql)
    {
        // SharePoint Search KQL text hard cap is 4,096 characters
        if (kql.Length > 4096)
            throw new InvalidOperationException(
                $"KQL query exceeds 4096-character SharePoint Search limit ({kql.Length} chars). " +
                "Reduce the number of extension filters.");
    }

    // ── Row parser ────────────────────────────────────────────────────────────

    private static SearchResult ParseRow(IDictionary<string, object> row)
    {
        static string Str(IDictionary<string, object> r, string key) =>
            r.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;

        static DateTime? Date(IDictionary<string, object> r, string key)
        {
            var s = Str(r, key);
            return DateTime.TryParse(s, out var dt) ? dt : (DateTime?)null;
        }

        static long ParseSize(IDictionary<string, object> r, string key)
        {
            var raw = Str(r, key);
            var digits = Regex.Replace(raw, "[^0-9]", "");
            return long.TryParse(digits, out var v) ? v : 0L;
        }

        return new SearchResult
        {
            Title         = Str(row, "Title"),
            Path          = Str(row, "Path"),
            FileExtension = Str(row, "FileExtension"),
            Created       = Date(row, "Created"),
            LastModified  = Date(row, "LastModifiedTime"),
            Author        = Str(row, "Author"),
            ModifiedBy    = Str(row, "ModifiedBy"),
            SizeBytes     = ParseSize(row, "Size")
        };
    }

    private static string Str(IDictionary<string, object> r, string key) =>
        r.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;
}
```

**Verification:**

```shell
dotnet build C:/Users/dev/Documents/projets/Sharepoint/SharepointToolbox.slnx
dotnet test C:/Users/dev/Documents/projets/Sharepoint/SharepointToolbox.Tests/SharepointToolbox.Tests.csproj --filter "FullyQualifiedName~SearchServiceTests" -x
```

**Expected:** 0 build errors; CSOM tests skip; no compile errors
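The StartRow loop bounds can also be sanity-checked in isolation. With 500-row batches and the condition `startRow <= 50_000`, a full index walk issues at most 101 queries, the last one starting exactly at row 50,000; the MaxResults clamp then truncates that final batch so the result list never exceeds 50,000 items:

```csharp
using System;

// Pure simulation of the pagination loop bounds, no CSOM involved.
const int BatchSize = 500;
const int MaxStartRow = 50_000;

int startRow = 0, queries = 0, lastStartRow = 0;
do
{
    lastStartRow = startRow;   // StartRow sent on this (simulated) query
    queries++;
    startRow += BatchSize;
}
while (startRow <= MaxStartRow);

Console.WriteLine(queries);      // 101
Console.WriteLine(lastStartRow); // 50000
```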

### Task 2: Implement DuplicatesService

**File:** SharepointToolbox/Services/DuplicatesService.cs

**Action:** Create

**Why:** DUPL-01 (file duplicates via Search API) and DUPL-02 (folder duplicates via CAML pagination).

```csharp
using Microsoft.SharePoint.Client;
using Microsoft.SharePoint.Client.Search.Query;
using SharepointToolbox.Core.Helpers;
using SharepointToolbox.Core.Models;

namespace SharepointToolbox.Services;

/// <summary>
/// Duplicate file and folder detection.
/// Files: Search API (same KQL engine as SearchService) + client-side composite key grouping.
/// Folders: CSOM CAML FSObjType=1 via SharePointPaginationHelper + composite key grouping.
/// Port of PS Find-DuplicateFiles / Find-DuplicateFolders (PS lines 4942-5036).
/// </summary>
public class DuplicatesService : IDuplicatesService
{
    private const int BatchSize = 500;
    private const int MaxStartRow = 50_000;

    public async Task<IReadOnlyList<DuplicateGroup>> ScanDuplicatesAsync(
        ClientContext ctx,
        DuplicateScanOptions options,
        IProgress<OperationProgress> progress,
        CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();

        List<DuplicateItem> allItems;

        if (options.Mode == "Folders")
            allItems = await CollectFolderItemsAsync(ctx, options, progress, ct);
        else
            allItems = await CollectFileItemsAsync(ctx, options, progress, ct);

        progress.Report(OperationProgress.Indeterminate($"Grouping {allItems.Count:N0} items by duplicate key…"));

        var groups = allItems
            .GroupBy(item => MakeKey(item, options))
            .Where(g => g.Count() >= 2)
            .Select(g => new DuplicateGroup
            {
                GroupKey = g.Key,
                Name     = g.First().Name,
                Items    = g.ToList()
            })
            .OrderByDescending(g => g.Items.Count)
            .ThenBy(g => g.Name)
            .ToList();

        return groups;
    }

    // ── File collection via Search API ────────────────────────────────────────

    private static async Task<List<DuplicateItem>> CollectFileItemsAsync(
        ClientContext ctx,
        DuplicateScanOptions options,
        IProgress<OperationProgress> progress,
        CancellationToken ct)
    {
        // KQL: all documents, optionally scoped to a library
        var kqlParts = new List<string> { "ContentType:Document" };
        if (!string.IsNullOrEmpty(options.Library))
            kqlParts.Add($"Path:\"{ctx.Url.TrimEnd('/')}/{options.Library.TrimStart('/')}*\"");
        string kql = string.Join(" AND ", kqlParts);

        var allItems = new List<DuplicateItem>();
        int startRow = 0;

        do
        {
            ct.ThrowIfCancellationRequested();

            var kq = new KeywordQuery(ctx)
            {
                QueryText      = kql,
                StartRow       = startRow,
                RowLimit       = BatchSize,
                TrimDuplicates = false
            };
            // CSOM's StringCollection has no AddRange; add properties one by one.
            foreach (string prop in new[]
            {
                "Title", "Path", "FileExtension", "Created",
                "LastModifiedTime", "Size"
            })
            {
                kq.SelectProperties.Add(prop);
            }

            var executor = new SearchExecutor(ctx);
            ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
            await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

            var table = clientResult.Value
                .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults);
            if (table == null || table.RowCount == 0) break;

            // CSOM exposes each result row as IDictionary<string, object>
            foreach (IDictionary<string, object> row in table.ResultRows)
            {
                string path = GetStr(row, "Path");
                if (path.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase))
                    continue;

                string name = System.IO.Path.GetFileName(path);
                if (string.IsNullOrEmpty(name))
                    name = GetStr(row, "Title");

                string raw = GetStr(row, "Size");
                string digits = System.Text.RegularExpressions.Regex.Replace(raw, "[^0-9]", "");
                long size = long.TryParse(digits, out var sv) ? sv : 0L;

                DateTime? created  = ParseDate(GetStr(row, "Created"));
                DateTime? modified = ParseDate(GetStr(row, "LastModifiedTime"));

                // Derive library from the first path segment under the site URL
                string library = ExtractLibraryFromPath(path, ctx.Url);

                allItems.Add(new DuplicateItem
                {
                    Name      = name,
                    Path      = path,
                    Library   = library,
                    SizeBytes = size,
                    Created   = created,
                    Modified  = modified
                });
            }

            progress.Report(new OperationProgress(allItems.Count, MaxStartRow,
                $"Collected {allItems.Count:N0} files…"));

            startRow += BatchSize;
        }
        while (startRow <= MaxStartRow);

        return allItems;
    }

    // ── Folder collection via CAML ────────────────────────────────────────────

    private static async Task<List<DuplicateItem>> CollectFolderItemsAsync(
        ClientContext ctx,
        DuplicateScanOptions options,
        IProgress<OperationProgress> progress,
        CancellationToken ct)
    {
        // Load all document libraries on the site
        ctx.Load(ctx.Web,
            w => w.Lists.Include(
                l => l.Title, l => l.Hidden, l => l.BaseType));
        await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

        var libs = ctx.Web.Lists
            .Where(l => !l.Hidden && l.BaseType == BaseType.DocumentLibrary)
            .ToList();

        // Filter to specific library if requested
        if (!string.IsNullOrEmpty(options.Library))
        {
            libs = libs
                .Where(l => l.Title.Equals(options.Library, StringComparison.OrdinalIgnoreCase))
                .ToList();
        }

        // CAML: folders only (FSObjType=1). ViewFields must name every field read
        // below, or the values may come back null.
        const string folderViewXml = """
            <View Scope='RecursiveAll'>
              <Query>
                <Where>
                  <Eq>
                    <FieldRef Name='FSObjType' />
                    <Value Type='Integer'>1</Value>
                  </Eq>
                </Where>
              </Query>
              <ViewFields>
                <FieldRef Name='FileLeafRef' />
                <FieldRef Name='FileRef' />
                <FieldRef Name='FolderChildCount' />
                <FieldRef Name='ItemChildCount' />
                <FieldRef Name='Created' />
                <FieldRef Name='Modified' />
              </ViewFields>
              <RowLimit>2000</RowLimit>
            </View>
            """;

        var allItems = new List<DuplicateItem>();

        foreach (var lib in libs)
        {
            ct.ThrowIfCancellationRequested();
            progress.Report(OperationProgress.Indeterminate($"Scanning folders in {lib.Title}…"));

            // Fresh CamlQuery per library so the pagination cursor
            // (ListItemCollectionPosition) never leaks between lists.
            var camlQuery = new CamlQuery { ViewXml = folderViewXml };

            await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, lib, camlQuery, ct))
            {
                ct.ThrowIfCancellationRequested();

                var fv = item.FieldValues;
                string name      = fv["FileLeafRef"]?.ToString() ?? string.Empty;
                string fileRef   = fv["FileRef"]?.ToString() ?? string.Empty;
                int subCount     = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
                int childCount   = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
                int fileCount    = Math.Max(0, childCount - subCount);
                DateTime? created  = fv["Created"] is DateTime cr ? cr : (DateTime?)null;
                DateTime? modified = fv["Modified"] is DateTime md ? md : (DateTime?)null;

                allItems.Add(new DuplicateItem
                {
                    Name        = name,
                    Path        = fileRef,
                    Library     = lib.Title,
                    FolderCount = subCount,
                    FileCount   = fileCount,
                    Created     = created,
                    Modified    = modified
                });
            }
        }

        return allItems;
    }

    // ── Composite key builder (matches test scaffold in DuplicatesServiceTests) ──

    internal static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
    {
        var parts = new List<string> { item.Name.ToLowerInvariant() };
        if (opts.MatchSize     && item.SizeBytes.HasValue)    parts.Add(item.SizeBytes.Value.ToString());
        if (opts.MatchCreated  && item.Created.HasValue)      parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
        if (opts.MatchModified && item.Modified.HasValue)     parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
        if (opts.MatchSubfolderCount && item.FolderCount.HasValue) parts.Add(item.FolderCount.Value.ToString());
        if (opts.MatchFileCount && item.FileCount.HasValue)   parts.Add(item.FileCount.Value.ToString());
        return string.Join("|", parts);
    }

    // ── Private utilities ─────────────────────────────────────────────────────

    private static string GetStr(IDictionary<string, object> r, string key) =>
        r.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;

    private static DateTime? ParseDate(string s) =>
        DateTime.TryParse(s, out var dt) ? dt : (DateTime?)null;

    private static string ExtractLibraryFromPath(string path, string siteUrl)
    {
        // Extract first path segment after the site URL as library name
        // e.g. https://tenant.sharepoint.com/sites/MySite/Shared Documents/file.docx -> "Shared Documents"
        if (string.IsNullOrEmpty(path) || string.IsNullOrEmpty(siteUrl))
            return string.Empty;

        string relative = path.StartsWith(siteUrl.TrimEnd('/'), StringComparison.OrdinalIgnoreCase)
            ? path.Substring(siteUrl.TrimEnd('/').Length).TrimStart('/')
            : path;

        int slash = relative.IndexOf('/');
        return slash > 0 ? relative.Substring(0, slash) : relative;
    }
}
```

**Verification:**

```shell
dotnet test C:/Users/dev/Documents/projets/Sharepoint/SharepointToolbox.Tests/SharepointToolbox.Tests.csproj --filter "FullyQualifiedName~DuplicatesServiceTests" -x
```

**Expected:** 5 pure-logic tests pass (MakeKey); 2 CSOM stubs skip
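The library derivation in ExtractLibraryFromPath is pure string logic and can be checked without a ClientContext. The sketch below restates it as a free-standing function with the example from the code comment:

```csharp
using System;

// Restatement of DuplicatesService.ExtractLibraryFromPath: first path segment
// after the site URL is taken as the library name.
string ExtractLibrary(string path, string siteUrl)
{
    if (string.IsNullOrEmpty(path) || string.IsNullOrEmpty(siteUrl))
        return string.Empty;

    string site = siteUrl.TrimEnd('/');
    string relative = path.StartsWith(site, StringComparison.OrdinalIgnoreCase)
        ? path.Substring(site.Length).TrimStart('/')
        : path;

    int slash = relative.IndexOf('/');
    return slash > 0 ? relative.Substring(0, slash) : relative;
}

string lib = ExtractLibrary(
    "https://tenant.sharepoint.com/sites/MySite/Shared Documents/file.docx",
    "https://tenant.sharepoint.com/sites/MySite/");
Console.WriteLine(lib); // Shared Documents
```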

## Verification

```shell
dotnet build C:/Users/dev/Documents/projets/Sharepoint/SharepointToolbox.slnx
dotnet test C:/Users/dev/Documents/projets/Sharepoint/SharepointToolbox.Tests/SharepointToolbox.Tests.csproj --filter "FullyQualifiedName~SearchServiceTests|FullyQualifiedName~DuplicatesServiceTests" -x
```

**Expected:** 0 build errors; 5 MakeKey tests pass; CSOM stub tests skip; no compile errors

## Commit Message

```text
feat(03-04): implement SearchService KQL pagination and DuplicatesService composite key grouping
```

## Output

After completion, create `.planning/phases/03-storage/03-04-SUMMARY.md`.