Sharepoint-Toolbox/.planning/phases/03-storage/03-RESEARCH.md
2026-04-02 14:41:39 +02:00

Phase 3: Storage and File Operations - Research

Researched: 2026-04-02
Domain: CSOM StorageMetrics, SharePoint KQL Search, WPF DataGrid, duplicate detection
Confidence: HIGH


<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
| --- | --- | --- |
| STOR-01 | User can view storage consumption per library on a site | CSOM Folder.StorageMetrics (one Load call per folder) + flat DataGrid with indent column |
| STOR-02 | User can view storage consumption per site with configurable folder depth | Recursive Collect-FolderStorage pattern translated to async CSOM; depth guard via split-count |
| STOR-03 | Storage metrics include total size, version size, item count, and last modified date | StorageMetrics.TotalSize, TotalFileStreamSize, TotalFileCount, StorageMetrics.LastModified; version size = TotalSize - TotalFileStreamSize |
| STOR-04 | User can export storage metrics to CSV | New StorageCsvExportService — same UTF-8 BOM pattern as Phase 2 |
| STOR-05 | User can export storage metrics to interactive HTML with collapsible tree view | New StorageHtmlExportService — port PS lines 1621-1780; toggle() JS + nested table rows |
| SRCH-01 | User can search files across sites using multiple criteria | KeywordQuery + SearchExecutor (CSOM search); KQL built from filter params; client-side Regex post-filter |
| SRCH-02 | User can configure maximum search results (up to 50,000) | SharePoint Search StartRow hard cap is 50,000 (boundary); 500 rows/batch × 100 pages = 50,000 max |
| SRCH-03 | User can export search results to CSV | New SearchCsvExportService |
| SRCH-04 | User can export search results to interactive HTML (sortable, filterable) | New SearchHtmlExportService — port PS lines 2112-2233; sortable columns via data attributes |
| DUPL-01 | User can scan for duplicate files by name, size, creation date, modification date | Search API (same as SRCH) + client-side GroupBy composite key; no content hashing needed |
| DUPL-02 | User can scan for duplicate folders by name, subfolder count, file count | SharePointPaginationHelper.GetAllItemsAsync with CAML FSObjType=1; read FolderChildCount, ItemChildCount from field values |
| DUPL-03 | User can export duplicate report to HTML with grouped display and visual indicators | New DuplicatesHtmlExportService — port PS lines 2235-2406; collapsible group cards, ok/diff badges |
</phase_requirements>

Summary

Phase 3 introduces three feature areas (Storage Metrics, File Search, Duplicate Detection), each requiring a dedicated ViewModel, View, Service, and export services. All three areas can be implemented without adding new NuGet packages — Microsoft.SharePoint.Client.Search.dll is already in the output folder as a transitive dependency of PnP.Framework 1.18.0.

Storage uses CSOM Folder.StorageMetrics (loaded via ctx.Load(folder, f => f.StorageMetrics)). One CSOM round-trip per folder. Version size is derived as TotalSize - TotalFileStreamSize. The data model is a recursive tree (site → library → folder → subfolder), flattened to a DataGrid with an indent-level column for WPF display. The HTML export ports the PS Export-StorageToHTML function (PS lines 1621-1780) with its toggle(i) JS pattern.

File Search uses Microsoft.SharePoint.Client.Search.Query.KeywordQuery + SearchExecutor. KQL is assembled from UI filter fields (extension, date range, creator, editor, library path). Pagination is StartRow += 500 per batch; the hard ceiling is StartRow = 50,000 (SharePoint Search boundary), which means the 50,000 max-results requirement (SRCH-02) is exactly the platform limit. Client-side Regex is applied after retrieval. The HTML export ports PS lines 2112-2233.

Duplicate Detection uses the same Search API for file duplicates (an all-documents KQL query) and SharePointPaginationHelper.GetAllItemsAsync with an FSObjType CAML filter for folder duplicates. Items are grouped client-side by a composite key (name + optional size/dates/counts). No content hashing is needed — the DUPL-01/02/03 requirements specify name+size+dates, which exactly matches the PS reference implementation.

Primary recommendation: Three ViewModels (StorageViewModel, SearchViewModel, DuplicatesViewModel), three service interfaces, six export services (storage CSV/HTML, search CSV/HTML, duplicates HTML — duplicates CSV is bonus), all extending existing Phase 2 patterns.


User Constraints

No CONTEXT.md exists for Phase 3 (no /gsd:discuss-phase was run). All decisions below are from the locked technology stack in the prompt.

Locked Decisions

  • .NET 10 LTS + WPF + MVVM (CommunityToolkit.Mvvm 8.4.2)
  • PnP.Framework 1.18.0 (CSOM-based SharePoint access)
  • No new major packages preferred — only add if truly necessary
  • Microsoft.Extensions.Hosting DI
  • Serilog logging
  • xUnit 2.9.3 tests

Deferred / Out of Scope

  • Content hashing for duplicate detection (v2)
  • Storage charts/graphs (v2 requirement VIZZ-01/02/03)
  • Cross-tenant file search

Standard Stack

Core (no new packages needed)

| Library | Version | Purpose | Why Standard |
| --- | --- | --- | --- |
| PnP.Framework | 1.18.0 | CSOM access, ClientContext | Already in project |
| Microsoft.SharePoint.Client.Search.dll | (via PnP.Framework) | KeywordQuery, SearchExecutor | Transitive dep — confirmed present in bin/Debug/net10.0-windows/ |
| CommunityToolkit.Mvvm | 8.4.2 | [ObservableProperty], AsyncRelayCommand | Already in project |
| Microsoft.Extensions.Hosting | 10.x | DI container | Already in project |
| Serilog | 4.3.1 | Structured logging | Already in project |
| xUnit | 2.9.3 | Tests | Already in project |
| Moq | 4.20.72 | Mock interfaces in tests | Already in project |

No new NuGet packages required. Microsoft.SharePoint.Client.Search.dll ships as a transitive dependency of PnP.Framework — confirmed present at SharepointToolbox/bin/Debug/net10.0-windows/Microsoft.SharePoint.Client.Search.dll.

New Models Needed

| Model | Location | Fields |
| --- | --- | --- |
| StorageNode | Core/Models/StorageNode.cs | string Name, string Url, string SiteTitle, string Library, long TotalSizeBytes, long FileStreamSizeBytes, long TotalFileCount, DateTime? LastModified, int IndentLevel, List<StorageNode> Children |
| SearchResult | Core/Models/SearchResult.cs | string Title, string Path, string FileExtension, DateTime? Created, DateTime? LastModified, string Author, string ModifiedBy, long SizeBytes |
| DuplicateGroup | Core/Models/DuplicateGroup.cs | string GroupKey, string Name, List<DuplicateItem> Items |
| DuplicateItem | Core/Models/DuplicateItem.cs | string Name, string Path, string Library, long? SizeBytes, DateTime? Created, DateTime? Modified, int? FolderCount, int? FileCount |
| StorageScanOptions | Core/Models/StorageScanOptions.cs | bool PerLibrary, bool IncludeSubsites, int FolderDepth |
| SearchOptions | Core/Models/SearchOptions.cs | string SiteUrl, string[] Extensions, string? Regex, DateTime? CreatedAfter, DateTime? CreatedBefore, DateTime? ModifiedAfter, DateTime? ModifiedBefore, string? CreatedBy, string? ModifiedBy, string? Library, int MaxResults |
| DuplicateScanOptions | Core/Models/DuplicateScanOptions.cs | string Mode ("Files"/"Folders"), bool MatchSize, bool MatchCreated, bool MatchModified, bool MatchSubfolderCount, bool MatchFileCount, bool IncludeSubsites, string? Library |

Architecture Patterns

SharepointToolbox/
├── Core/Models/
│   ├── StorageNode.cs           # new
│   ├── SearchResult.cs          # new
│   ├── DuplicateGroup.cs        # new
│   ├── DuplicateItem.cs         # new
│   ├── StorageScanOptions.cs    # new
│   ├── SearchOptions.cs         # new
│   └── DuplicateScanOptions.cs  # new
├── Services/
│   ├── IStorageService.cs       # new
│   ├── StorageService.cs        # new
│   ├── ISearchService.cs        # new
│   ├── SearchService.cs         # new
│   ├── IDuplicatesService.cs    # new
│   ├── DuplicatesService.cs     # new
│   └── Export/
│       ├── StorageCsvExportService.cs   # new
│       ├── StorageHtmlExportService.cs  # new
│       ├── SearchCsvExportService.cs    # new
│       ├── SearchHtmlExportService.cs   # new
│       └── DuplicatesHtmlExportService.cs # new
├── ViewModels/Tabs/
│   ├── StorageViewModel.cs      # new
│   ├── SearchViewModel.cs       # new
│   └── DuplicatesViewModel.cs   # new
└── Views/Tabs/
    ├── StorageView.xaml          # new
    ├── StorageView.xaml.cs       # new
    ├── SearchView.xaml           # new
    ├── SearchView.xaml.cs        # new
    ├── DuplicatesView.xaml       # new
    └── DuplicatesView.xaml.cs    # new
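
All new services plug into the existing Microsoft.Extensions.Hosting container. A minimal registration sketch — the lifetimes shown (AddSingleton for services, AddTransient for ViewModels) are assumptions that should be matched to whatever convention Phases 1-2 established:

```csharp
// Sketch only — goes in the existing HostBuilder ConfigureServices block.
// Lifetimes are assumed; match the Phase 1/2 registrations.
services.AddSingleton<IStorageService, StorageService>();
services.AddSingleton<ISearchService, SearchService>();
services.AddSingleton<IDuplicatesService, DuplicatesService>();

services.AddSingleton<StorageCsvExportService>();
services.AddSingleton<StorageHtmlExportService>();
services.AddSingleton<SearchCsvExportService>();
services.AddSingleton<SearchHtmlExportService>();
services.AddSingleton<DuplicatesHtmlExportService>();

services.AddTransient<StorageViewModel>();
services.AddTransient<SearchViewModel>();
services.AddTransient<DuplicatesViewModel>();
```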

Pattern 1: CSOM StorageMetrics Load

What: Load Folder.StorageMetrics with a single round-trip per folder. StorageMetrics is a child object — you must include it in the Load expression or it will not be fetched.

When to use: Whenever reading storage data for a folder or library root.

Example:

// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics
// + https://longnlp.github.io/load-storage-metric-from-SPO

// Get folder by server-relative URL (library root or subfolder)
Folder folder = ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl);
ctx.Load(folder,
    f => f.StorageMetrics,               // pulls TotalSize, TotalFileStreamSize, TotalFileCount, LastModified
    f => f.TimeLastModified,             // alternative timestamp if StorageMetrics.LastModified is null
    f => f.ServerRelativeUrl,
    f => f.Name);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

long totalBytes    = folder.StorageMetrics.TotalSize;
long streamBytes   = folder.StorageMetrics.TotalFileStreamSize;  // current-version files only
long versionBytes  = Math.Max(0L, totalBytes - streamBytes);      // version overhead
long fileCount     = folder.StorageMetrics.TotalFileCount;
DateTime? lastMod  = folder.StorageMetrics.IsPropertyAvailable("LastModified")
    ? folder.StorageMetrics.LastModified
    : folder.TimeLastModified;

Unit: TotalSize and TotalFileStreamSize are in bytes (Int64). TotalFileStreamSize is the aggregate stream size for current-version file content only — it excludes version history, metadata, and attachments (confirmed by [MS-CSOMSPT]). Version storage = TotalSize - TotalFileStreamSize.
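
Since all metrics arrive as raw byte counts, the DataGrid columns and exports will need a human-readable formatter. A self-contained sketch (FormatSize is an illustrative helper name, not from the PS reference):

```csharp
using System.Globalization;

// Hypothetical helper — converts a raw byte count into a human-readable
// string for DataGrid columns and HTML/CSV exports.
static string FormatSize(long bytes)
{
    string[] units = { "B", "KB", "MB", "GB", "TB" };
    double value = bytes;
    int unit = 0;
    while (value >= 1024 && unit < units.Length - 1)
    {
        value /= 1024;
        unit++;
    }
    // InvariantCulture keeps the decimal separator stable across locales
    return value.ToString("0.##", CultureInfo.InvariantCulture) + " " + units[unit];
}
```

The same helper can render the derived version overhead (`TotalSize - TotalFileStreamSize`) in the Versions column.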

Pattern 2: KQL Search with Pagination

What: Use KeywordQuery + SearchExecutor (in Microsoft.SharePoint.Client.Search.Query) to execute a KQL query, paginating 500 rows at a time via StartRow.

When to use: SRCH-01/02/03/04 (file search) and DUPL-01 (file duplicate detection).

Example:

// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor
// + https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/

using Microsoft.SharePoint.Client.Search.Query;

// namespace: Microsoft.SharePoint.Client.Search.Query
// assembly:  Microsoft.SharePoint.Client.Search.dll (via PnP.Framework transitive dep)

var allResults = new List<IDictionary<string, object>>();
int startRow = 0;
const int batchSize = 500;

do
{
    ct.ThrowIfCancellationRequested();

    var kq = new KeywordQuery(ctx)
    {
        QueryText      = kql,          // e.g. "ContentType:Document AND FileExtension:pdf"
        StartRow       = startRow,
        RowLimit       = batchSize,
        TrimDuplicates = false
    };
    // Explicit managed properties to retrieve.
    // SelectProperties is a StringCollection — it exposes Add(string), not AddRange.
    foreach (var prop in new[]
    {
        "Title", "Path", "Author", "LastModifiedTime",
        "FileExtension", "Created", "ModifiedBy", "Size"
    })
    {
        kq.SelectProperties.Add(prop);
    }

    var executor = new SearchExecutor(ctx);
    ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
    // Note: ctx.ExecuteQuery() is called inside ExecuteQueryRetryAsync — do NOT call again

    var table = clientResult.Value
        .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults);
    if (table == null) break;

    // ResultRows is already IEnumerable<IDictionary<string, object>> (see Pitfall 2)
    int retrieved = table.RowCount;
    foreach (IDictionary<string, object> row in table.ResultRows)
        allResults.Add(row);

    progress.Report(new OperationProgress(allResults.Count, maxResults, $"Retrieved {allResults.Count} results…"));
    if (retrieved < batchSize) break;   // short page — no further results available
    startRow += batchSize;
}
while (allResults.Count < maxResults
       && startRow < 50_000);          // platform hard cap on StartRow

Critical detail: ExecuteQueryRetryHelper.ExecuteQueryRetryAsync wraps ctx.ExecuteQuery(). Call it AFTER executor.ExecuteQuery(kq) — do NOT call ctx.ExecuteQuery() directly afterward.

StartRow limit: SharePoint Search imposes a hard boundary of 50,000 for StartRow. With batch size 500, max pages = 100, max results = 50,000. This exactly satisfies SRCH-02.

KQL field mappings (from PS reference lines 4747-4763):

  • Extension: FileExtension:pdf OR FileExtension:docx
  • Created after/before: Created>=2024-01-01 / Created<=2024-12-31
  • Modified after/before: Write>=2024-01-01 / Write<=2024-12-31
  • Created by: Author:"First Last"
  • Modified by: ModifiedBy:"First Last"
  • Library path: Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"
  • Documents only: ContentType:Document

Pattern 3: Folder Enumeration for Duplicate Folders

What: Use SharePointPaginationHelper.GetAllItemsAsync with a CAML filter on FSObjType = 1 (folders). Read FolderChildCount and ItemChildCount from FieldValues.

When to use: DUPL-02 (folder duplicate scan).

Example:

// Source: PS reference lines 5010-5036; Phase 2 SharePointPaginationHelper pattern

var camlQuery = new CamlQuery
{
    ViewXml = @"<View Scope='RecursiveAll'>
                    <Query>
                        <Where>
                            <Eq>
                                <FieldRef Name='FSObjType' />
                                <Value Type='Integer'>1</Value>
                            </Eq>
                        </Where>
                    </Query>
                    <RowLimit>2000</RowLimit>
                </View>"
};

await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, list, camlQuery, ct))
{
    var fv = item.FieldValues;
    var name       = fv["FileLeafRef"]?.ToString() ?? string.Empty;
    var fileRef    = fv["FileRef"]?.ToString() ?? string.Empty;
    var subCount   = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
    var childCount = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
    var fileCount  = Math.Max(0, childCount - subCount);
    var created    = fv["Created"] is DateTime cr ? cr : (DateTime?)null;
    var modified   = fv["Modified"] is DateTime md ? md : (DateTime?)null;
    // ...build DuplicateItem
}

Pattern 4: Duplicate Composite Key (name+size+date grouping)

What: Build a string composite key from the fields the user selected, then GroupBy(key).Where(g => g.Count() >= 2).

When to use: DUPL-01 (files) and DUPL-02 (folders).

Example:

// Source: PS reference lines 4942-4949 (MakeKey function)

private static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
{
    var parts = new List<string> { item.Name.ToLowerInvariant() };
    if (opts.MatchSize    && item.SizeBytes.HasValue)    parts.Add(item.SizeBytes.Value.ToString());
    if (opts.MatchCreated && item.Created.HasValue)      parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchModified && item.Modified.HasValue)    parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchSubfolderCount && item.FolderCount.HasValue) parts.Add(item.FolderCount.Value.ToString());
    if (opts.MatchFileCount && item.FileCount.HasValue)  parts.Add(item.FileCount.Value.ToString());
    return string.Join("|", parts);
}

var groups = allItems
    .GroupBy(i => MakeKey(i, opts))
    .Where(g => g.Count() >= 2)
    .Select(g => new DuplicateGroup
    {
        GroupKey = g.Key,
        Name     = g.First().Name,
        Items    = g.ToList()
    })
    .OrderByDescending(g => g.Items.Count)
    .ToList();

Pattern 5: Storage Recursive Tree → Flat Row List for DataGrid

What: Flatten the recursive tree (site → library → folder → subfolder) into a flat List<StorageNode> where each node carries an IndentLevel. The WPF DataGrid renders a Margin on the name cell based on IndentLevel.

When to use: STOR-01/02 WPF display.

Rationale for DataGrid over TreeView: WPF TreeView requires a HierarchicalDataTemplate and loses UI virtualization with deep nesting. A flat DataGrid with VirtualizingPanel.IsVirtualizing="True" stays performant for thousands of rows and is trivially sortable.

Example:

// Flatten tree to observable list for DataGrid binding
private static void FlattenTree(StorageNode node, int level, List<StorageNode> result)
{
    node.IndentLevel = level;
    result.Add(node);
    foreach (var child in node.Children)
        FlattenTree(child, level + 1, result);
}
<!-- WPF DataGrid cell template for name column with indent -->
<DataGridTemplateColumn Header="Library / Folder" Width="*">
    <DataGridTemplateColumn.CellTemplate>
        <DataTemplate>
            <TextBlock Text="{Binding Name}"
                       Margin="{Binding IndentLevel, Converter={StaticResource IndentConverter}}" />
        </DataTemplate>
    </DataGridTemplateColumn.CellTemplate>
</DataGridTemplateColumn>

Use an IValueConverter mapping IndentLevel → new Thickness(IndentLevel * 16, 0, 0, 0).
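
The IndentConverter resource referenced in the XAML above is not shown elsewhere; a minimal one-way sketch implementing the 16-px-per-level mapping:

```csharp
using System;
using System.Globalization;
using System.Windows;
using System.Windows.Data;

// Maps IndentLevel (int) to a left Margin of 16 px per nesting level.
public sealed class IndentConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
        => new Thickness((value is int level ? level : 0) * 16, 0, 0, 0);

    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
        => throw new NotSupportedException(); // one-way binding only
}
```

Register it once as a StaticResource in the view's Resources (e.g. `<local:IndentConverter x:Key="IndentConverter" />`).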

Pattern 6: Storage HTML Collapsible Tree

What: The HTML export uses inline nested tables with display:none rows toggled by toggle(i) JS. Each library/folder that has children gets a unique numeric index.

When to use: STOR-05 export.

Key design (from PS lines 1621-1780):

  • A global _togIdx counter assigns unique IDs to collapsible rows: <tr id='sf-{i}' style='display:none'>.
  • A <button onclick='toggle({i})'> triggers row.style.display = visible ? 'none' : 'table-row'.
  • Library rows embed a nested <table class='sf-tbl'> inside the collapsible row (colspan spanning all columns).
  • This is a pure inline pattern — no external JS or CSS dependencies.
  • In C# the counter is a field on StorageHtmlExportService reset at the start of each BuildHtml() call.
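
The design above can be sketched as follows — method and field names here are illustrative, not ported from the PS source:

```csharp
using System.Text;

// Illustrative sketch of the collapsible-row emit inside StorageHtmlExportService.
// _togIdx is the per-export counter, reset at the start of each BuildHtml() call.
private int _togIdx;

private void AppendLibraryRow(StringBuilder sb, StorageNode lib, int columnCount)
{
    int i = _togIdx++;
    sb.Append($"<tr><td><button onclick='toggle({i})'>+</button> {lib.Name}</td>");
    // ...size / version / file-count / last-modified cells...
    sb.Append("</tr>");
    // Hidden row holding the nested subfolder table, spanning all columns
    sb.Append($"<tr id='sf-{i}' style='display:none'><td colspan='{columnCount}'>");
    sb.Append("<table class='sf-tbl'>...</table></td></tr>");
}

// Emitted once in <head> — the only script the report needs:
private const string ToggleScript =
    "<script>function toggle(i){var r=document.getElementById('sf-'+i);" +
    "r.style.display=(r.style.display==='none')?'table-row':'none';}</script>";
```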

Anti-Patterns to Avoid

  • Loading StorageMetrics without including it in ctx.Load: folder.StorageMetrics.TotalSize throws PropertyOrFieldNotInitializedException if StorageMetrics is not included in the Load expression. Always use ctx.Load(folder, f => f.StorageMetrics, ...).
  • Calling ctx.ExecuteQuery() after executor.ExecuteQuery(kq): The search executor pattern requires calling ctx.ExecuteQuery() ONCE (inside ExecuteQueryRetryAsync). Calling it twice is a no-op at best, throws at worst.
  • StartRow at or above 50,000: SharePoint Search hard boundary — queries beyond it return zero results or an error. With 500-row batches the last valid StartRow is 49,500, so exit the pagination loop while startRow < 50_000.
  • Modifying ObservableCollection from Task.Run: Same rule as Phase 2 — accumulate in List<T> on background thread, then Dispatcher.InvokeAsync(() => StorageResults = new ObservableCollection<T>(list)).
  • Recursive CSOM calls without depth guard: Without a depth guard, Collect-FolderStorage on a deep site can make thousands of CSOM round-trips. Always pass MaxDepth and check currentDepth >= maxDepth before recursing.
  • Building a TreeView for storage display: WPF TreeView loses UI virtualization with more than ~1000 visible items. Use DataGrid with IndentLevel.
  • Version size from index: The Search API's Size property is the current-version file size, not total including versions. Only StorageMetrics.TotalFileStreamSize vs TotalSize gives accurate version overhead.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
| --- | --- | --- | --- |
| CSOM throttle retry | Custom retry loop | ExecuteQueryRetryHelper.ExecuteQueryRetryAsync (Phase 1) | Already handles 429/503 with exponential backoff |
| List pagination | Raw ExecuteQuery loop | SharePointPaginationHelper.GetAllItemsAsync (Phase 1) | Handles 5000-item threshold, CAML position continuation |
| Search pagination | Manual do/while per search | Same KeywordQuery+SearchExecutor pattern (internal to SearchService) | Wrap in a helper method inside SearchService to avoid duplication across SRCH and DUPL features |
| HTML header/footer boilerplate | New template each export service | Copy from existing HtmlExportService pattern (Phase 2) | Consistent `<!DOCTYPE>`, viewport meta, Segoe UI font stack |
| CSV field escaping | Custom escaping | RFC 4180 Csv() helper pattern from Phase 2 CsvExportService | Already handles quotes, empty values, UTF-8 BOM |
| OperationProgress reporting | New progress model | OperationProgress.Indeterminate(msg) + new OperationProgress(current, total, msg) (Phase 1) | Already wired to UI via FeatureViewModelBase |
| Tenant context management | Directly create ClientContext | ISessionManager.GetOrCreateContextAsync (Phase 1) | Handles MSAL cache, per-tenant context pooling |

Common Pitfalls

Pitfall 1: StorageMetrics PropertyOrFieldNotInitializedException

What goes wrong: folder.StorageMetrics.TotalSize throws PropertyOrFieldNotInitializedException at runtime. Why it happens: CSOM lazy-loading — if StorageMetrics is not in the Load expression, the proxy object exists but has no data. How to avoid: Always include f => f.StorageMetrics in the ctx.Load(folder, ...) lambda. Warning signs: Exception message contains "The property or field 'StorageMetrics' has not been initialized".

Pitfall 2: Search ResultRows Type Is IDictionary-like But Not Strongly Typed

What goes wrong: Accessing row["Size"] returns object — Size comes back as a string "12345" not a long. Why it happens: ResultTable.ResultRows is IEnumerable<IDictionary<string, object>>. All values are strings from the search index. How to avoid: Always parse with long.TryParse(row["Size"]?.ToString() ?? "0", out var sizeBytes). Strip non-numeric characters as PS does: Regex.Replace(sizeStr, "[^0-9]", ""). Warning signs: InvalidCastException when binding Size to a numeric column.

Pitfall 3: Search API Returns Duplicates for Versioned Files

What goes wrong: Files with many versions appear multiple times in results via /_vti_history/ paths. Why it happens: SharePoint indexes each version as a separate item in some cases. How to avoid: Exclude items whose Path contains "/_vti_history/" (use StringComparison.OrdinalIgnoreCase) — port of PS line 4973. Warning signs: Duplicate file paths in results with a _vti_history segment.
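
The exclusion is a one-line post-filter over the retrieved paths; a self-contained sketch (the sample URLs are made up for illustration):

```csharp
using System;
using System.Linq;

// Sample result paths — the second is a version-history entry to discard
string[] paths =
{
    "https://tenant.sharepoint.com/sites/x/Shared Documents/report.docx",
    "https://tenant.sharepoint.com/sites/x/_vti_history/512/report.docx"
};

// Drop version-history entries that the search index sometimes returns
var filtered = paths
    .Where(p => !p.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase))
    .ToArray();
```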

Pitfall 4: StorageMetrics.LastModified May Be DateTime.MinValue

What goes wrong: LastModified shows as 01/01/0001 for empty folders. Why it happens: SharePoint returns a default DateTime for folders with no modifications. How to avoid: Check lastModified > DateTime.MinValue before formatting. Fall back to folder.TimeLastModified if StorageMetrics.LastModified is unset. Warning signs: "01/01/0001" in the LastModified column.

Pitfall 5: KQL Query Text Exceeds 4096 Characters

What goes wrong: Search query silently fails or returns error for very long KQL strings. Why it happens: SharePoint Search has a 4096-character KQL text boundary. How to avoid: For extension filters with many extensions, use (FileExtension:a OR FileExtension:b OR ...) and validate total length before calling. Warn user if limit approached. Warning signs: Zero results returned when many extensions entered; no CSOM exception.
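
A simple pre-flight guard sketch — the 4096 figure is the limit cited above, while KqlMaxLength and the surrounding BuildKql/_logger members are illustrative names from this plan, not an existing API:

```csharp
// Validate the assembled KQL before executing the search; warn near the cap.
const int KqlMaxLength = 4096;

string kql = BuildKql(opts);   // assembled as in the Code Examples section
if (kql.Length > KqlMaxLength)
    throw new InvalidOperationException(
        $"KQL query is {kql.Length} characters — exceeds the {KqlMaxLength}-character " +
        "SharePoint Search limit. Reduce the number of extension filters.");
if (kql.Length > KqlMaxLength * 0.9)
    _logger.Warning("KQL length {Length} is approaching the {Max}-character limit",
        kql.Length, KqlMaxLength);
```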

Pitfall 6: CAML FSObjType Field Name

What goes wrong: CAML query for folders returns no results. Why it happens: The internal CAML field name is FSObjType, not FileSystemObjectType. Using the wrong name returns no matches silently. How to avoid: Use <FieldRef Name='FSObjType' /> (integer) with <Value Type='Integer'>1</Value>. Confirmed by PS reference line 5011 which uses CSOM FileSystemObjectType.Folder comparison. Warning signs: Zero items returned from folder CAML query on a library known to have folders.

Pitfall 7: StorageService Needs Web.ServerRelativeUrl to Compute Site-Relative Path

What goes wrong: CSOM returns 404/FileNotFoundException when requesting StorageMetrics because the folder path does not resolve. Why it happens: The PS reference's Get-PnPFolderStorageMetric -FolderSiteRelativeUrl takes a path relative to the web root (e.g., Shared Documents), while CSOM's ctx.Web.GetFolderByServerRelativeUrl expects a server-relative path (e.g., /sites/MySite/Shared Documents); mixing the two conventions produces an unresolvable path. How to avoid: Load ctx.Web.ServerRelativeUrl first, pass full server-relative paths to GetFolderByServerRelativeUrl, and derive a site-relative path only where needed: siteRelUrl = rootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/'). Warning signs: 404/FileNotFoundException from CSOM when calling StorageMetrics.


Code Examples

Loading StorageMetrics (STOR-01/02/03)

// Source: MS Learn — StorageMetrics Class; [MS-CSOMSPT] TotalFileStreamSize definition

ctx.Load(ctx.Web, w => w.ServerRelativeUrl, w => w.Url, w => w.Title);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

string webSrl = ctx.Web.ServerRelativeUrl.TrimEnd('/');

// Per-library: iterate document libraries
ctx.Load(ctx.Web.Lists, lists => lists.Include(
    l => l.Title, l => l.BaseType, l => l.Hidden, l => l.RootFolder.ServerRelativeUrl));
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

foreach (var list in ctx.Web.Lists)
{
    if (list.Hidden || list.BaseType != BaseType.DocumentLibrary) continue;

    string siteRelUrl = list.RootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/');
    Folder rootFolder = ctx.Web.GetFolderByServerRelativeUrl(list.RootFolder.ServerRelativeUrl);
    ctx.Load(rootFolder,
        f => f.StorageMetrics,
        f => f.TimeLastModified,
        f => f.ServerRelativeUrl);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

    var node = new StorageNode
    {
        Name              = list.Title,
        Url               = $"{ctx.Web.Url.TrimEnd('/')}/{siteRelUrl}",
        SiteTitle         = ctx.Web.Title,
        Library           = list.Title,
        TotalSizeBytes    = rootFolder.StorageMetrics.TotalSize,
        FileStreamSizeBytes = rootFolder.StorageMetrics.TotalFileStreamSize,
        TotalFileCount    = rootFolder.StorageMetrics.TotalFileCount,
        LastModified      = rootFolder.StorageMetrics.LastModified > DateTime.MinValue
                            ? rootFolder.StorageMetrics.LastModified
                            : rootFolder.TimeLastModified,
        IndentLevel       = 0,
        Children          = new List<StorageNode>()
    };

    // Recursive subfolder collection up to maxDepth
    if (maxDepth > 0)
        await CollectSubfoldersAsync(ctx, list.RootFolder.ServerRelativeUrl, node, 1, maxDepth, progress, ct);
}

KQL Build from SearchOptions

// Source: PS reference lines 4747-4763

private static string BuildKql(SearchOptions opts)
{
    var parts = new List<string> { "ContentType:Document" };

    if (opts.Extensions.Length > 0)
    {
        var extParts = opts.Extensions.Select(e => $"FileExtension:{e.TrimStart('.').ToLowerInvariant()}");
        parts.Add($"({string.Join(" OR ", extParts)})");
    }
    if (opts.CreatedAfter.HasValue)
        parts.Add($"Created>={opts.CreatedAfter.Value:yyyy-MM-dd}");
    if (opts.CreatedBefore.HasValue)
        parts.Add($"Created<={opts.CreatedBefore.Value:yyyy-MM-dd}");
    if (opts.ModifiedAfter.HasValue)
        parts.Add($"Write>={opts.ModifiedAfter.Value:yyyy-MM-dd}");
    if (opts.ModifiedBefore.HasValue)
        parts.Add($"Write<={opts.ModifiedBefore.Value:yyyy-MM-dd}");
    if (!string.IsNullOrEmpty(opts.CreatedBy))
        parts.Add($"Author:\"{opts.CreatedBy}\"");
    if (!string.IsNullOrEmpty(opts.ModifiedBy))
        parts.Add($"ModifiedBy:\"{opts.ModifiedBy}\"");
    if (!string.IsNullOrEmpty(opts.Library))
        parts.Add($"Path:\"{opts.SiteUrl.TrimEnd('/')}/{opts.Library.TrimStart('/')}*\"");

    return string.Join(" AND ", parts);
}

Parsing Search ResultRows

// Source: PS reference lines 4971-4987

private static SearchResult ParseRow(IDictionary<string, object> row)
{
    static string Str(IDictionary<string, object> r, string key) =>
        r.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;

    static DateTime? Date(IDictionary<string, object> r, string key)
    {
        var s = Str(r, key);
        return DateTime.TryParse(s, out var dt) ? dt : null;
    }

    static long ParseSize(IDictionary<string, object> r, string key)
    {
        var raw = Str(r, key);
        var digits = System.Text.RegularExpressions.Regex.Replace(raw, "[^0-9]", "");
        return long.TryParse(digits, out var v) ? v : 0L;
    }

    return new SearchResult
    {
        Title         = Str(row, "Title"),
        Path          = Str(row, "Path"),
        FileExtension = Str(row, "FileExtension"),
        Created       = Date(row, "Created"),
        LastModified  = Date(row, "LastModifiedTime"),
        Author        = Str(row, "Author"),
        ModifiedBy    = Str(row, "ModifiedBy"),
        SizeBytes     = ParseSize(row, "Size")
    };
}

Localization Keys Needed

The following keys are needed for Phase 3 Views. Keys from the PS reference (lines 2747-2813) are remapped to the C# Strings.resx naming convention. Existing keys already in Strings.resx are marked with (existing).

Storage Tab

| Key | EN Value | Notes |
| --- | --- | --- |
| tab.storage | Storage | (existing — already in Strings.resx line 77) |
| chk.per.lib | Per-Library Breakdown | new |
| chk.subsites | Include Subsites | new |
| lbl.folder.depth | Folder depth: | (existing — shared with permissions) |
| chk.max.depth | Maximum (all levels) | (existing — shared with permissions) |
| stor.note | Note: deeper folder scans on large sites may take several minutes. | new |
| btn.gen.storage | Generate Metrics | new |
| btn.open.storage | Open Report | new |
| stor.col.library | Library | new |
| stor.col.site | Site | new |
| stor.col.files | Files | new |
| stor.col.size | Size | new |
| stor.col.versions | Versions | new |
| stor.col.lastmod | Last Modified | new |
| stor.col.share | Share of Total | new |

File Search Tab

| Key | EN Value | Notes |
| --- | --- | --- |
| tab.search | File Search | (existing — already in Strings.resx line 79) |
| grp.search.filters | Search Filters | new |
| lbl.extensions | Extension(s): | new |
| ph.extensions | docx pdf xlsx | new (placeholder) |
| lbl.regex | Name / Regex: | new |
| ph.regex | Ex: report.* or \.bak$ | new (placeholder) |
| chk.created.after | Created after: | new |
| chk.created.before | Created before: | new |
| chk.modified.after | Modified after: | new |
| chk.modified.before | Modified before: | new |
| lbl.created.by | Created by: | new |
| ph.created.by | First Last or email | new (placeholder) |
| lbl.modified.by | Modified by: | new |
| ph.modified.by | First Last or email | new (placeholder) |
| lbl.library | Library: | new |
| ph.library | Optional relative path e.g. Shared Documents | new (placeholder) |
| lbl.max.results | Max results: | new |
| btn.run.search | Run Search | new |
| btn.open.search | Open Results | new |
| srch.col.name | File Name | new |
| srch.col.ext | Extension | new |
| srch.col.created | Created | new |
| srch.col.modified | Modified | new |
| srch.col.author | Created By | new |
| srch.col.modby | Modified By | new |
| srch.col.size | Size | new |

Duplicates Tab

| Key | EN Value | Notes |
| --- | --- | --- |
| tab.duplicates | Duplicates | (existing — already in Strings.resx line 83) |
| grp.dup.type | Duplicate Type | new |
| rad.dup.files | Duplicate files | new |
| rad.dup.folders | Duplicate folders | new |
| grp.dup.criteria | Comparison Criteria | new |
| lbl.dup.note | Name is always the primary criterion. Check additional criteria: | new |
| chk.dup.size | Same size | new |
| chk.dup.created | Same creation date | new |
| chk.dup.modified | Same modification date | new |
| chk.dup.subfolders | Same subfolder count | new |
| chk.dup.filecount | Same file count | new |
| chk.include.subsites | Include subsites | new |
| ph.dup.lib | All (leave empty) | new (placeholder) |
| btn.run.scan | Run Scan | new |
| btn.open.results | Open Results | new |

Duplicate Detection Scale — Known Concern Resolution

The STATE.md concern ("Duplicate detection at scale (100k+ files) — Graph API hash enumeration limits") is resolved: the PS reference does NOT use file hashes. It uses name+size+date grouping, which is exactly what DUPL-01/02/03 specify. The requirements do not mention hash-based deduplication.

Scale analysis:

  • File duplicates use the Search API. SharePoint Search caps at 50,000 results (StartRow=50,000 max). A site with 100k+ files will be capped at 50,000 returned results. This is the same cap as SRCH-02, and is a known/accepted limitation.
  • Folder duplicates use CAML pagination. SharePointPaginationHelper.GetAllItemsAsync handles arbitrary folder counts with RowLimit=2000 pagination — no effective upper bound.
  • Client-side GroupBy on 50,000 items completes in negligible time (Dictionary-backed, O(n) over the result set).
  • No Graph API or SHA256 content hashing is needed. The concern was about a potential v2 enhancement not required by DUPL-01/02/03.
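The composite-key grouping described above can be sketched as follows. This is a minimal illustration, not the final DuplicatesService API: the FileEntry record shape and the FindDuplicates name are assumptions, and the real service reads these values from Search result rows (files) or CAML list items (folders). MakeKey is the function named in the test map.

```csharp
// Sketch only: FileEntry is an assumed record shape for illustration.
using System;
using System.Collections.Generic;
using System.Linq;

public record FileEntry(string Name, long Size, DateTime Created, DateTime Modified);

public static class DuplicateSketch
{
    // Name is always the primary criterion; the other parts join only when checked.
    public static string MakeKey(FileEntry f, bool bySize, bool byCreated, bool byModified) =>
        string.Join("|",
            f.Name.ToLowerInvariant(),
            bySize     ? f.Size.ToString()        : "",
            byCreated  ? f.Created.ToString("s")  : "",
            byModified ? f.Modified.ToString("s") : "");

    // Dictionary-backed GroupBy: O(n) over the (capped) 50,000 search results.
    public static List<IGrouping<string, FileEntry>> FindDuplicates(
        IEnumerable<FileEntry> files, bool bySize, bool byCreated, bool byModified) =>
        files.GroupBy(f => MakeKey(f, bySize, byCreated, byModified))
             .Where(g => g.Count() > 1)
             .ToList();
}
```

Lowercasing the name treats SharePoint's case-insensitive file names as equal; the folder variant would swap the size/date parts for subfolder and file counts per DUPL-02.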

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Get-PnPFolderStorageMetric (PS cmdlet) | CSOM Folder.StorageMetrics | Phase 3 migration | One CSOM round-trip per folder; no PnP PS module required |
| Submit-PnPSearchQuery (PS cmdlet) | CSOM KeywordQuery + SearchExecutor | Phase 3 migration | Same pagination model; TrimDuplicates=false explicit |
| Get-PnPListItem for folders (PS) | SharePointPaginationHelper.GetAllItemsAsync with CAML | Phase 3 migration | Reuses Phase 1 helper; handles 5000-item threshold |
| Storage TreeView control | Flat DataGrid with IndentLevel + IValueConverter | Phase 3 design decision | Better UI virtualization for large sites |
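The Folder.StorageMetrics approach maps to roughly the following CSOM call. This is a sketch, not the final StorageService shape: the StorageNode class below is a minimal stand-in using the field names from the test map, and the method signature is an assumption.

```csharp
// Sketch, assuming a minimal StorageNode model; not the final StorageService API.
using System;
using System.Threading.Tasks;
using Microsoft.SharePoint.Client;

public sealed class StorageNode
{
    public string Url { get; set; }
    public long TotalSizeBytes { get; set; }
    public long FileStreamSizeBytes { get; set; }
    public long VersionSizeBytes { get; set; }
    public long ItemCount { get; set; }
    public DateTime LastModified { get; set; }
}

public static class StorageSketch
{
    public static async Task<StorageNode> LoadFolderMetricsAsync(ClientContext ctx, Folder folder)
    {
        // One round-trip per folder: nested StorageMetrics properties load in the same call.
        ctx.Load(folder,
            f => f.ServerRelativeUrl,
            f => f.TimeLastModified,                  // fallback timestamp (see Open Questions)
            f => f.StorageMetrics.TotalSize,
            f => f.StorageMetrics.TotalFileStreamSize,
            f => f.StorageMetrics.TotalFileCount,
            f => f.StorageMetrics.LastModified);
        await ctx.ExecuteQueryAsync();                // or ExecuteQueryRetryAsync via PnP.Framework

        return new StorageNode
        {
            Url                 = folder.ServerRelativeUrl,
            TotalSizeBytes      = folder.StorageMetrics.TotalSize,
            FileStreamSizeBytes = folder.StorageMetrics.TotalFileStreamSize,
            // STOR-03: version overhead = total size minus current file streams.
            VersionSizeBytes    = folder.StorageMetrics.TotalSize
                                  - folder.StorageMetrics.TotalFileStreamSize,
            ItemCount           = folder.StorageMetrics.TotalFileCount,
            LastModified        = folder.StorageMetrics.LastModified
        };
    }
}
```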

Validation Architecture

Test Framework

| Property | Value |
|---|---|
| Framework | xUnit 2.9.3 |
| Config file | none (SDK auto-discovery) |
| Quick run command | dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj --filter "Category!=Integration" -x |
| Full suite command | dotnet test SharepointToolbox.slnx |

Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| STOR-01/02 | StorageService.CollectStorageAsync returns StorageNode list | unit (mock ISessionManager) | dotnet test --filter "StorageServiceTests" | Wave 0 |
| STOR-03 | VersionSizeBytes = TotalSizeBytes - FileStreamSizeBytes | unit | dotnet test --filter "StorageNodeTests" | Wave 0 |
| STOR-04 | StorageCsvExportService.BuildCsv produces correct header and rows | unit | dotnet test --filter "StorageCsvExportServiceTests" | Wave 0 |
| STOR-05 | StorageHtmlExportService.BuildHtml contains toggle JS and nested tables | unit | dotnet test --filter "StorageHtmlExportServiceTests" | Wave 0 |
| SRCH-01 | SearchService builds correct KQL from SearchOptions | unit | dotnet test --filter "SearchServiceTests" | Wave 0 |
| SRCH-02 | Search loop exits when startRow > 50_000 | unit | dotnet test --filter "SearchServiceTests" | Wave 0 |
| SRCH-03 | SearchCsvExportService.BuildCsv produces correct header | unit | dotnet test --filter "SearchCsvExportServiceTests" | Wave 0 |
| SRCH-04 | SearchHtmlExportService.BuildHtml contains sort JS and filter input | unit | dotnet test --filter "SearchHtmlExportServiceTests" | Wave 0 |
| DUPL-01 | MakeKey function groups identical name+size+date items | unit | dotnet test --filter "DuplicatesServiceTests" | Wave 0 |
| DUPL-02 | CAML query targets FSObjType=1; FileCount = ItemChildCount - FolderChildCount | unit (logic only) | dotnet test --filter "DuplicatesServiceTests" | Wave 0 |
| DUPL-03 | DuplicatesHtmlExportService.BuildHtml contains group cards with ok/diff badges | unit | dotnet test --filter "DuplicatesHtmlExportServiceTests" | Wave 0 |
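The SRCH-02 pagination-cap behavior can be sketched as follows. The RunAsync name, the kql/maxResults parameters, and the example KQL string are illustrative only; the cap and batch constants come from the requirement (500 rows/batch × 100 pages = 50,000).

```csharp
// Sketch of the KeywordQuery pagination loop with the 50,000 StartRow boundary.
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.SharePoint.Client;
using Microsoft.SharePoint.Client.Search.Query;

public static class SearchSketch
{
    private const int PageSize = 500;        // rows per batch
    private const int StartRowCap = 50_000;  // SharePoint Search boundary: StartRow may not go past this

    public static async Task<List<IDictionary<string, object>>> RunAsync(
        ClientContext ctx, string kql, int maxResults)
    {
        var rows = new List<IDictionary<string, object>>();
        int startRow = 0;

        while (startRow < StartRowCap && rows.Count < maxResults)
        {
            var query = new KeywordQuery(ctx)
            {
                QueryText = kql,             // e.g. IsDocument:true Path:"https://tenant/sites/x"
                RowLimit = PageSize,
                StartRow = startRow,
                TrimDuplicates = false       // explicit: duplicates must survive for DUPL-01
            };
            var executor = new SearchExecutor(ctx);
            ClientResult<ResultTableCollection> response = executor.ExecuteQuery(query);
            await ctx.ExecuteQueryAsync();

            var table = response.Value.FirstOrDefault(t => t.TableType == "RelevantResults");
            if (table == null || table.RowCount == 0) break;

            rows.AddRange(table.ResultRows);
            if (table.RowCount < PageSize) break;  // short page => last page
            startRow += PageSize;
        }
        return rows;
    }
}
```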

Note: StorageService, SearchService, and DuplicatesService depend on live CSOM — service-level tests use Skip like PermissionsServiceTests. ViewModel tests use Moq for IStorageService, ISearchService, IDuplicatesService following PermissionsViewModelTests pattern. Export service tests are fully unit-testable (no CSOM).

Sampling Rate

  • Per task commit: dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj -x
  • Per wave merge: dotnet test SharepointToolbox.slnx
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • SharepointToolbox.Tests/Services/StorageServiceTests.cs — covers STOR-01/02 (stub + Skip like PermissionsServiceTests)
  • SharepointToolbox.Tests/Services/Export/StorageCsvExportServiceTests.cs — covers STOR-04
  • SharepointToolbox.Tests/Services/Export/StorageHtmlExportServiceTests.cs — covers STOR-05
  • SharepointToolbox.Tests/Services/SearchServiceTests.cs — covers SRCH-01/02 (KQL build + pagination cap logic)
  • SharepointToolbox.Tests/Services/Export/SearchCsvExportServiceTests.cs — covers SRCH-03
  • SharepointToolbox.Tests/Services/Export/SearchHtmlExportServiceTests.cs — covers SRCH-04
  • SharepointToolbox.Tests/Services/DuplicatesServiceTests.cs — covers DUPL-01/02 composite key logic
  • SharepointToolbox.Tests/Services/Export/DuplicatesHtmlExportServiceTests.cs — covers DUPL-03
  • SharepointToolbox.Tests/ViewModels/StorageViewModelTests.cs — covers STOR-01 ViewModel (Moq IStorageService)
  • SharepointToolbox.Tests/ViewModels/SearchViewModelTests.cs — covers SRCH-01/02 ViewModel
  • SharepointToolbox.Tests/ViewModels/DuplicatesViewModelTests.cs — covers DUPL-01/02 ViewModel

Open Questions

  1. StorageMetrics.LastModified vs TimeLastModified

    • What we know: StorageMetrics.LastModified exists per the API docs. Folder.TimeLastModified is a separate CSOM property.
    • What's unclear: Whether StorageMetrics.LastModified can return DateTime.MinValue for recently created empty folders in all SharePoint Online tenants.
    • Recommendation: Load both (f => f.StorageMetrics, f => f.TimeLastModified) and prefer StorageMetrics.LastModified when it is > DateTime.MinValue, falling back to TimeLastModified.
  2. Search index freshness for duplicate detection

    • What we know: SharePoint Search is eventually consistent — newly created files may not appear for up to 15 minutes.
    • What's unclear: Whether users expect real-time accuracy or accept eventual consistency.
    • Recommendation: Document in UI that search-based results (files) reflect the search index, not the current state. Add a note in the log output.
  3. Multiple-site file search scope

    • What we know: The PS reference scopes search to $siteUrl context only (one site per search). SRCH-01 says "across sites" in the goal description but the requirements only specify search criteria, not multi-site.
    • What's unclear: Whether SRCH-01 requires multi-site search in one operation or per-site.
    • Recommendation: Implement per-site search (matching PS reference). Multi-site search would require separate ClientContext per site plus result merging — treat as a future enhancement.
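The recommendation from question 1 reduces to a small sketch, assuming both properties were loaded in the same round-trip:

```csharp
// Sketch of the LastModified fallback from Open Question 1.
using System;
using System.Threading.Tasks;
using Microsoft.SharePoint.Client;

public static class LastModifiedSketch
{
    public static async Task<DateTime> ResolveAsync(ClientContext ctx, Folder folder)
    {
        ctx.Load(folder, f => f.TimeLastModified, f => f.StorageMetrics.LastModified);
        await ctx.ExecuteQueryAsync();

        // Prefer StorageMetrics.LastModified; fall back for empty/new folders
        // where it may come back as DateTime.MinValue.
        return folder.StorageMetrics.LastModified > DateTime.MinValue
            ? folder.StorageMetrics.LastModified
            : folder.TimeLastModified;
    }
}
```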

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence — implementation detail, verify when coding)


Metadata

Confidence breakdown:

  • Standard Stack: HIGH — no new packages needed; Search.dll confirmed present; all APIs verified against MS docs
  • Architecture Patterns: HIGH — direct port of working PS reference; CSOM API shapes confirmed
  • Pitfalls: HIGH for StorageMetrics loading, search result typing, vti_history filter (all from PS reference or official docs); MEDIUM for KQL length limit (documented but not commonly hit)
  • Localization keys: HIGH — directly extracted from PS reference lines 2747-2813

Research date: 2026-04-02 Valid until: 2026-07-01 (CSOM APIs stable; SharePoint search limits stable; re-verify if PnP.Framework upgrades past 1.18)