Phase 3: Storage and File Operations - Research
Researched: 2026-04-02 | Domain: CSOM StorageMetrics, SharePoint KQL Search, WPF DataGrid, duplicate detection | Confidence: HIGH
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| STOR-01 | User can view storage consumption per library on a site | CSOM Folder.StorageMetrics (one Load call per folder) + flat DataGrid with indent column |
| STOR-02 | User can view storage consumption per site with configurable folder depth | Recursive Collect-FolderStorage pattern translated to async CSOM; depth guard via split-count |
| STOR-03 | Storage metrics include total size, version size, item count, and last modified date | StorageMetrics.TotalSize, TotalFileStreamSize, TotalFileCount, StorageMetrics.LastModified; version size = TotalSize - TotalFileStreamSize |
| STOR-04 | User can export storage metrics to CSV | New StorageCsvExportService — same UTF-8 BOM pattern as Phase 2 |
| STOR-05 | User can export storage metrics to interactive HTML with collapsible tree view | New StorageHtmlExportService — port PS lines 1621-1780; toggle() JS + nested table rows |
| SRCH-01 | User can search files across sites using multiple criteria | KeywordQuery + SearchExecutor (CSOM search); KQL built from filter params; client-side Regex post-filter |
| SRCH-02 | User can configure maximum search results (up to 50,000) | SharePoint Search StartRow hard cap is 50,000 (boundary); 500 rows/batch × 100 pages = 50,000 max |
| SRCH-03 | User can export search results to CSV | New SearchCsvExportService |
| SRCH-04 | User can export search results to interactive HTML (sortable, filterable) | New SearchHtmlExportService — port PS lines 2112-2233; sortable columns via data attributes |
| DUPL-01 | User can scan for duplicate files by name, size, creation date, modification date | Search API (same as SRCH) + client-side GroupBy composite key; no content hashing needed |
| DUPL-02 | User can scan for duplicate folders by name, subfolder count, file count | SharePointPaginationHelper.GetAllItemsAsync with CAML FSObjType=1; read FolderChildCount, ItemChildCount from field values |
| DUPL-03 | User can export duplicate report to HTML with grouped display and visual indicators | New DuplicatesHtmlExportService — port PS lines 2235-2406; collapsible group cards, ok/diff badges |
</phase_requirements>
Summary
Phase 3 introduces three feature areas (Storage Metrics, File Search, Duplicate Detection), each requiring a dedicated ViewModel, View, Service, and export services. All three areas can be implemented without adding new NuGet packages — Microsoft.SharePoint.Client.Search.dll is already in the output folder as a transitive dependency of PnP.Framework 1.18.0.
Storage uses CSOM Folder.StorageMetrics (loaded via ctx.Load(folder, f => f.StorageMetrics)). One CSOM round-trip per folder. Version size is derived as TotalSize - TotalFileStreamSize. The data model is a recursive tree (site → library → folder → subfolder), flattened to a DataGrid with an indent-level column for WPF display. The HTML export ports the PS Export-StorageToHTML function (PS lines 1621-1780) with its toggle(i) JS pattern.
File Search uses Microsoft.SharePoint.Client.Search.Query.KeywordQuery + SearchExecutor. KQL is assembled from UI filter fields (extension, date range, creator, editor, library path). Pagination is StartRow += 500 per batch; the hard ceiling is StartRow = 50,000 (SharePoint Search boundary), which means the 50,000 max-results requirement (SRCH-02) is exactly the platform limit. Client-side Regex is applied after retrieval. The HTML export ports PS lines 2112-2233.
Duplicate Detection uses the same Search API for file duplicates (with all documents query) and SharePointPaginationHelper.GetAllItemsAsync with FSObjType CAML filter for folder duplicates. Items are grouped client-side by a composite key (name + optional size/dates/counts). No content hashing is needed — the DUPL-01/02/03 requirements specify name+size+dates, which exactly matches the PS reference implementation.
Primary recommendation: Three ViewModels (StorageViewModel, SearchViewModel, DuplicatesViewModel), three service interfaces, six export services (storage CSV/HTML, search CSV/HTML, duplicates HTML — duplicates CSV is bonus), all extending existing Phase 2 patterns.
User Constraints
No CONTEXT.md exists for Phase 3 (no /gsd:discuss-phase was run). All decisions below are from the locked technology stack in the prompt.
Locked Decisions
- .NET 10 LTS + WPF + MVVM (CommunityToolkit.Mvvm 8.4.2)
- PnP.Framework 1.18.0 (CSOM-based SharePoint access)
- No new major packages preferred — only add if truly necessary
- Microsoft.Extensions.Hosting DI
- Serilog logging
- xUnit 2.9.3 tests
Deferred / Out of Scope
- Content hashing for duplicate detection (v2)
- Storage charts/graphs (v2 requirement VIZZ-01/02/03)
- Cross-tenant file search
Standard Stack
Core (no new packages needed)
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| PnP.Framework | 1.18.0 | CSOM access, ClientContext | Already in project |
| Microsoft.SharePoint.Client.Search.dll | (via PnP.Framework) | KeywordQuery, SearchExecutor | Transitive dep — confirmed present in bin/Debug/net10.0-windows/ |
| CommunityToolkit.Mvvm | 8.4.2 | [ObservableProperty], AsyncRelayCommand | Already in project |
| Microsoft.Extensions.Hosting | 10.x | DI container | Already in project |
| Serilog | 4.3.1 | Structured logging | Already in project |
| xUnit | 2.9.3 | Tests | Already in project |
| Moq | 4.20.72 | Mock interfaces in tests | Already in project |
No new NuGet packages required. Microsoft.SharePoint.Client.Search.dll ships as a transitive dependency of PnP.Framework — confirmed present at SharepointToolbox/bin/Debug/net10.0-windows/Microsoft.SharePoint.Client.Search.dll.
New Models Needed
| Model | Location | Fields |
|---|---|---|
| StorageNode | Core/Models/StorageNode.cs | string Name, string Url, string SiteTitle, string Library, long TotalSizeBytes, long FileStreamSizeBytes, long TotalFileCount, DateTime? LastModified, int IndentLevel, List<StorageNode> Children |
| SearchResult | Core/Models/SearchResult.cs | string Title, string Path, string FileExtension, DateTime? Created, DateTime? LastModified, string Author, string ModifiedBy, long SizeBytes |
| DuplicateGroup | Core/Models/DuplicateGroup.cs | string GroupKey, string Name, List<DuplicateItem> Items |
| DuplicateItem | Core/Models/DuplicateItem.cs | string Name, string Path, string Library, long? SizeBytes, DateTime? Created, DateTime? Modified, int? FolderCount, int? FileCount |
| StorageScanOptions | Core/Models/StorageScanOptions.cs | bool PerLibrary, bool IncludeSubsites, int FolderDepth |
| SearchOptions | Core/Models/SearchOptions.cs | string SiteUrl, string[] Extensions, string? Regex, DateTime? CreatedAfter, DateTime? CreatedBefore, DateTime? ModifiedAfter, DateTime? ModifiedBefore, string? CreatedBy, string? ModifiedBy, string? Library, int MaxResults |
| DuplicateScanOptions | Core/Models/DuplicateScanOptions.cs | string Mode ("Files"/"Folders"), bool MatchSize, bool MatchCreated, bool MatchModified, bool MatchSubfolderCount, bool MatchFileCount, bool IncludeSubsites, string? Library |
Architecture Patterns
Recommended Project Structure (additions only)
SharepointToolbox/
├── Core/Models/
│ ├── StorageNode.cs # new
│ ├── SearchResult.cs # new
│ ├── DuplicateGroup.cs # new
│ ├── DuplicateItem.cs # new
│ ├── StorageScanOptions.cs # new
│ ├── SearchOptions.cs # new
│ └── DuplicateScanOptions.cs # new
├── Services/
│ ├── IStorageService.cs # new
│ ├── StorageService.cs # new
│ ├── ISearchService.cs # new
│ ├── SearchService.cs # new
│ ├── IDuplicatesService.cs # new
│ ├── DuplicatesService.cs # new
│ └── Export/
│ ├── StorageCsvExportService.cs # new
│ ├── StorageHtmlExportService.cs # new
│ ├── SearchCsvExportService.cs # new
│ ├── SearchHtmlExportService.cs # new
│ └── DuplicatesHtmlExportService.cs # new
├── ViewModels/Tabs/
│ ├── StorageViewModel.cs # new
│ ├── SearchViewModel.cs # new
│ └── DuplicatesViewModel.cs # new
└── Views/Tabs/
├── StorageView.xaml # new
├── StorageView.xaml.cs # new
├── SearchView.xaml # new
├── SearchView.xaml.cs # new
├── DuplicatesView.xaml # new
└── DuplicatesView.xaml.cs # new
Pattern 1: CSOM StorageMetrics Load
What: Load Folder.StorageMetrics with a single round-trip per folder. StorageMetrics is a child object — you must include it in the Load expression or it will not be fetched.
When to use: Whenever reading storage data for a folder or library root.
Example:
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics
// + https://longnlp.github.io/load-storage-metric-from-SPO
// Get folder by server-relative URL (library root or subfolder)
Folder folder = ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl);
ctx.Load(folder,
    f => f.StorageMetrics,    // pulls TotalSize, TotalFileStreamSize, TotalFileCount, LastModified
    f => f.TimeLastModified,  // alternative timestamp if StorageMetrics.LastModified is null
    f => f.ServerRelativeUrl,
    f => f.Name);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

long totalBytes   = folder.StorageMetrics.TotalSize;
long streamBytes  = folder.StorageMetrics.TotalFileStreamSize; // current-version files only
long versionBytes = Math.Max(0L, totalBytes - streamBytes);    // version overhead
long fileCount    = folder.StorageMetrics.TotalFileCount;
DateTime? lastMod = folder.StorageMetrics.IsPropertyAvailable("LastModified")
    ? folder.StorageMetrics.LastModified
    : folder.TimeLastModified;
Unit: TotalSize and TotalFileStreamSize are in bytes (Int64). TotalFileStreamSize is the aggregate stream size for current-version file content only — it excludes version history, metadata, and attachments (confirmed by [MS-CSOMSPT]). Version storage = TotalSize - TotalFileStreamSize.
Pattern 2: KQL Search with Pagination
What: Use KeywordQuery + SearchExecutor (in Microsoft.SharePoint.Client.Search.Query) to execute a KQL query, paginating 500 rows at a time via StartRow.
When to use: SRCH-01/02/03/04 (file search) and DUPL-01 (file duplicate detection).
Example:
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor
// + https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/
using Microsoft.SharePoint.Client.Search.Query;
// namespace: Microsoft.SharePoint.Client.Search.Query
// assembly: Microsoft.SharePoint.Client.Search.dll (via PnP.Framework transitive dep)
var allResults = new List<IDictionary<string, object>>();
int startRow = 0;
const int batchSize = 500;
do
{
    ct.ThrowIfCancellationRequested();
    var kq = new KeywordQuery(ctx)
    {
        QueryText = kql,        // e.g. "ContentType:Document AND FileExtension:pdf"
        StartRow = startRow,
        RowLimit = batchSize,
        TrimDuplicates = false
    };
    // Explicit managed properties to retrieve
    kq.SelectProperties.AddRange(new[]
    {
        "Title", "Path", "Author", "LastModifiedTime",
        "FileExtension", "Created", "ModifiedBy", "Size"
    });
    var executor = new SearchExecutor(ctx);
    ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
    // Note: ctx.ExecuteQuery() is called inside ExecuteQueryRetryAsync — do NOT call again

    // TableType is a string in the client object model — compare against the enum name
    var table = clientResult.Value
        .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults.ToString());
    if (table == null) break;

    // ResultRows is IEnumerable<IDictionary<string, object>> (see Pitfall 2)
    foreach (IDictionary<string, object> row in table.ResultRows)
        allResults.Add(new Dictionary<string, object>(row));

    progress.Report(new OperationProgress(allResults.Count, maxResults, $"Retrieved {allResults.Count} results…"));
    startRow += batchSize;
    if (table.RowCount < batchSize) break; // last page: fewer rows than requested
}
while (startRow < maxResults
    && startRow <= 50_000  // platform hard cap
    && allResults.Count < maxResults);
Critical detail: ExecuteQueryRetryHelper.ExecuteQueryRetryAsync wraps ctx.ExecuteQuery(). Call it AFTER executor.ExecuteQuery(kq) — do NOT call ctx.ExecuteQuery() directly afterward.
StartRow limit: SharePoint Search imposes a hard boundary of 50,000 for StartRow. With batch size 500, max pages = 100, max results = 50,000. This exactly satisfies SRCH-02.
KQL field mappings (from PS reference lines 4747-4763):
- Extension: `FileExtension:pdf OR FileExtension:docx`
- Created after/before: `Created>=2024-01-01` / `Created<=2024-12-31`
- Modified after/before: `Write>=2024-01-01` / `Write<=2024-12-31`
- Created by: `Author:"First Last"`
- Modified by: `ModifiedBy:"First Last"`
- Library path: `Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"`
- Documents only: `ContentType:Document`
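Joined with " AND " (as in the BuildKql example below under Code Examples), the mappings above compose into a single query string. For example, a search for PDFs modified during 2024 inside one library might produce (all values illustrative):

```
ContentType:Document AND (FileExtension:pdf) AND Write>=2024-01-01 AND Write<=2024-12-31 AND Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"
```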
Pattern 3: Folder Enumeration for Duplicate Folders
What: Use SharePointPaginationHelper.GetAllItemsAsync with a CAML filter on FSObjType = 1 (folders). Read FolderChildCount and ItemChildCount from FieldValues.
When to use: DUPL-02 (folder duplicate scan).
Example:
// Source: PS reference lines 5010-5036; Phase 2 SharePointPaginationHelper pattern
var camlQuery = new CamlQuery
{
    ViewXml = @"<View Scope='RecursiveAll'>
        <Query>
            <Where>
                <Eq>
                    <FieldRef Name='FSObjType' />
                    <Value Type='Integer'>1</Value>
                </Eq>
            </Where>
        </Query>
        <RowLimit>2000</RowLimit>
    </View>"
};
await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, list, camlQuery, ct))
{
    var fv = item.FieldValues;
    var name       = fv["FileLeafRef"]?.ToString() ?? string.Empty;
    var fileRef    = fv["FileRef"]?.ToString() ?? string.Empty;
    var subCount   = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
    var childCount = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
    var fileCount  = Math.Max(0, childCount - subCount);
    var created    = fv["Created"] is DateTime cr ? cr : (DateTime?)null;
    var modified   = fv["Modified"] is DateTime md ? md : (DateTime?)null;
    // ...build DuplicateItem
}
Pattern 4: Duplicate Composite Key (name+size+date grouping)
What: Build a string composite key from the fields the user selected, then GroupBy(key).Where(g => g.Count() >= 2).
When to use: DUPL-01 (files) and DUPL-02 (folders).
Example:
// Source: PS reference lines 4942-4949 (MakeKey function)
private static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
{
    var parts = new List<string> { item.Name.ToLowerInvariant() };
    if (opts.MatchSize && item.SizeBytes.HasValue) parts.Add(item.SizeBytes.Value.ToString());
    if (opts.MatchCreated && item.Created.HasValue) parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchModified && item.Modified.HasValue) parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchSubfolderCount && item.FolderCount.HasValue) parts.Add(item.FolderCount.Value.ToString());
    if (opts.MatchFileCount && item.FileCount.HasValue) parts.Add(item.FileCount.Value.ToString());
    return string.Join("|", parts);
}

var groups = allItems
    .GroupBy(i => MakeKey(i, opts))
    .Where(g => g.Count() >= 2)
    .Select(g => new DuplicateGroup
    {
        GroupKey = g.Key,
        Name = g.First().Name,
        Items = g.ToList()
    })
    .OrderByDescending(g => g.Items.Count)
    .ToList();
Pattern 5: Storage Recursive Tree → Flat Row List for DataGrid
What: Flatten the recursive tree (site → library → folder → subfolder) into a flat List<StorageNode> where each node carries an IndentLevel. The WPF DataGrid renders a Margin on the name cell based on IndentLevel.
When to use: STOR-01/02 WPF display.
Rationale for DataGrid over TreeView: WPF TreeView requires a HierarchicalDataTemplate and loses virtualization with deep nesting. A flat DataGrid with VirtualizingPanel.IsVirtualizing="True" stays performant for thousands of rows and is trivially sortable.
Example:
// Flatten tree to observable list for DataGrid binding
private static void FlattenTree(StorageNode node, int level, List<StorageNode> result)
{
    node.IndentLevel = level;
    result.Add(node);
    foreach (var child in node.Children)
        FlattenTree(child, level + 1, result);
}

<!-- WPF DataGrid cell template for name column with indent -->
<DataGridTemplateColumn Header="Library / Folder" Width="*">
    <DataGridTemplateColumn.CellTemplate>
        <DataTemplate>
            <TextBlock Text="{Binding Name}"
                       Margin="{Binding IndentLevel, Converter={StaticResource IndentConverter}}" />
        </DataTemplate>
    </DataGridTemplateColumn.CellTemplate>
</DataGridTemplateColumn>
Use IValueConverter mapping IndentLevel → new Thickness(IndentLevel * 16, 0, 0, 0).
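A minimal sketch of that converter, assuming the class name IndentLevelToMarginConverter and the 16-px step (register it as a StaticResource in the view's resources under the key IndentConverter):

```csharp
using System;
using System.Globalization;
using System.Windows;
using System.Windows.Data;

// Maps an int IndentLevel to a left Margin of 16 px per level (one-way).
public sealed class IndentLevelToMarginConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
        => new Thickness((value is int level ? level : 0) * 16, 0, 0, 0);

    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
        => throw new NotSupportedException(); // display-only binding
}
```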
Pattern 6: Storage HTML Collapsible Tree
What: The HTML export uses inline nested tables with display:none rows toggled by toggle(i) JS. Each library/folder that has children gets a unique numeric index.
When to use: STOR-05 export.
Key design (from PS lines 1621-1780):
- A global `_togIdx` counter assigns unique IDs to collapsible rows: `<tr id='sf-{i}' style='display:none'>`.
- A `<button onclick='toggle({i})'>` triggers `row.style.display = visible ? 'none' : 'table-row'`.
- Library rows embed a nested `<table class='sf-tbl'>` inside the collapsible row (colspan spanning all columns).
- This is a pure inline pattern — no external JS or CSS dependencies.
- In C#, the counter is a field on `StorageHtmlExportService`, reset at the start of each `BuildHtml()` call.
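In C#, the emitter could be sketched roughly as follows. The method name AppendCollapsibleRow and its parameters are hypothetical; only the `_togIdx` counter and the `sf-{i}` / `toggle(i)` markup shape come from the PS reference described above.

```csharp
private int _togIdx; // reset to 0 at the start of each BuildHtml() call

// Hypothetical helper: emits a toggle button plus a hidden row that
// hosts the nested folder table spanning all columns.
private void AppendCollapsibleRow(StringBuilder sb, string nestedTableHtml, int columnCount)
{
    int i = _togIdx++;
    sb.Append($"<button onclick='toggle({i})'>+</button>");
    sb.Append($"<tr id='sf-{i}' style='display:none'>")
      .Append($"<td colspan='{columnCount}'><table class='sf-tbl'>{nestedTableHtml}</table></td>")
      .Append("</tr>");
}
```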
Anti-Patterns to Avoid
- Loading StorageMetrics without including it in ctx.Load: `folder.StorageMetrics.TotalSize` throws `PropertyOrFieldNotInitializedException` if `StorageMetrics` is not included in the Load expression. Always use `ctx.Load(folder, f => f.StorageMetrics, ...)`.
- Calling ctx.ExecuteQuery() after executor.ExecuteQuery(kq): The search executor pattern requires calling `ctx.ExecuteQuery()` ONCE (inside `ExecuteQueryRetryAsync`). Calling it twice is a no-op at best, throws at worst.
- StartRow > 50,000: SharePoint Search hard boundary — will return zero results or error. Cap loop exit at `startRow <= 50_000`.
- Modifying ObservableCollection from Task.Run: Same rule as Phase 2 — accumulate in `List<T>` on the background thread, then `Dispatcher.InvokeAsync(() => StorageResults = new ObservableCollection<T>(list))`.
- Recursive CSOM calls without depth guard: Without a depth guard, `Collect-FolderStorage` on a deep site can make thousands of CSOM round-trips. Always pass `MaxDepth` and check `currentDepth >= maxDepth` before recursing.
- Building a TreeView for storage display: WPF TreeView loses UI virtualization with more than ~1000 visible items. Use DataGrid with IndentLevel.
- Version size from search index: The Search API's `Size` property is the current-version file size, not the total including versions. Only `StorageMetrics.TotalFileStreamSize` vs `TotalSize` gives accurate version overhead.
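The ObservableCollection rule above can be sketched like this (a hypothetical StorageViewModel command body; CollectStorageAsync's exact signature is an assumption):

```csharp
// Accumulate on a background thread...
List<StorageNode> list = await Task.Run(
    () => _storageService.CollectStorageAsync(options, progress, ct), ct);

// ...then publish to the bound collection on the UI thread.
await Application.Current.Dispatcher.InvokeAsync(() =>
    StorageResults = new ObservableCollection<StorageNode>(list));
```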
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| CSOM throttle retry | Custom retry loop | ExecuteQueryRetryHelper.ExecuteQueryRetryAsync (Phase 1) | Already handles 429/503 with exponential backoff |
| List pagination | Raw ExecuteQuery loop | SharePointPaginationHelper.GetAllItemsAsync (Phase 1) | Handles 5000-item threshold, CAML position continuation |
| Search pagination | Manual do/while per search | Same KeywordQuery+SearchExecutor pattern (internal to SearchService) | Wrap in a helper method inside SearchService to avoid duplication across SRCH and DUPL features |
| HTML header/footer boilerplate | New template each export service | Copy from existing HtmlExportService pattern (Phase 2) | Consistent <!DOCTYPE>, viewport meta, Segoe UI font stack |
| CSV field escaping | Custom escaping | RFC 4180 Csv() helper pattern from Phase 2 CsvExportService | Already handles quotes, empty values, UTF-8 BOM |
| OperationProgress reporting | New progress model | OperationProgress.Indeterminate(msg) + new OperationProgress(current, total, msg) (Phase 1) | Already wired to UI via FeatureViewModelBase |
| Tenant context management | Directly create ClientContext | ISessionManager.GetOrCreateContextAsync (Phase 1) | Handles MSAL cache, per-tenant context pooling |
Common Pitfalls
Pitfall 1: StorageMetrics PropertyOrFieldNotInitializedException
What goes wrong: folder.StorageMetrics.TotalSize throws PropertyOrFieldNotInitializedException at runtime.
Why it happens: CSOM lazy-loading — if StorageMetrics is not in the Load expression, the proxy object exists but has no data.
How to avoid: Always include f => f.StorageMetrics in the ctx.Load(folder, ...) lambda.
Warning signs: Exception message contains "The property or field 'StorageMetrics' has not been initialized".
Pitfall 2: Search ResultRows Type Is IDictionary-like But Not Strongly Typed
What goes wrong: Accessing row["Size"] returns object — Size comes back as a string "12345" not a long.
Why it happens: ResultTable.ResultRows is IEnumerable<IDictionary<string, object>>. All values are strings from the search index.
How to avoid: Always parse with long.TryParse(row["Size"]?.ToString() ?? "0", out var sizeBytes). Strip non-numeric characters as PS does: Regex.Replace(sizeStr, "[^0-9]", "").
Warning signs: InvalidCastException when binding Size to a numeric column.
Pitfall 3: Search API Returns Duplicates for Versioned Files
What goes wrong: Files with many versions appear multiple times in results via /_vti_history/ paths.
Why it happens: SharePoint indexes each version as a separate item in some cases.
How to avoid: Exclude items whose Path.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase) — port of PS line 4973.
Warning signs: Duplicate file paths in results with _vti_history segment.
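As a sketch, assuming results is the List<SearchResult> built by the ParseRow helper shown under Code Examples:

```csharp
// Drop version-history entries surfaced by the search index.
results.RemoveAll(r =>
    r.Path.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase));
```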
Pitfall 4: StorageMetrics.LastModified May Be DateTime.MinValue
What goes wrong: LastModified shows as 01/01/0001 for empty folders.
Why it happens: SharePoint returns a default DateTime for folders with no modifications.
How to avoid: Check lastModified > DateTime.MinValue before formatting. Fall back to folder.TimeLastModified if StorageMetrics.LastModified is unset.
Warning signs: "01/01/0001" in the LastModified column.
Pitfall 5: KQL Query Text Exceeds 4096 Characters
What goes wrong: Search query silently fails or returns error for very long KQL strings.
Why it happens: SharePoint Search has a 4096-character KQL text boundary.
How to avoid: For extension filters with many extensions, use (FileExtension:a OR FileExtension:b OR ...) and validate total length before calling. Warn user if limit approached.
Warning signs: Zero results returned when many extensions entered; no CSOM exception.
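A possible guard inside SearchService (the constant name and the exception choice are assumptions):

```csharp
const int MaxKqlLength = 4096; // SharePoint Search query-text boundary

string kql = BuildKql(opts);
if (kql.Length > MaxKqlLength)
    throw new InvalidOperationException(
        $"KQL query is {kql.Length} characters; the service limit is {MaxKqlLength}. " +
        "Reduce the number of filters or extensions.");
```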
Pitfall 6: CAML FSObjType Field Name
What goes wrong: CAML query for folders returns no results.
Why it happens: The internal CAML field name is FSObjType, not FileSystemObjectType. Using the wrong name returns no matches silently.
How to avoid: Use <FieldRef Name='FSObjType' /> (integer) with <Value Type='Integer'>1</Value>. Confirmed by PS reference line 5011 which uses CSOM FileSystemObjectType.Folder comparison.
Warning signs: Zero items returned from folder CAML query on a library known to have folders.
Pitfall 7: StorageService Needs Web.ServerRelativeUrl to Compute Site-Relative Path
What goes wrong: Get-PnPFolderStorageMetric -FolderSiteRelativeUrl requires a path relative to the web root (e.g., Shared Documents), not the server root (e.g., /sites/MySite/Shared Documents).
Why it happens: CSOM Folder.StorageMetrics uses server-relative URLs, so you need to strip the web's ServerRelativeUrl prefix.
How to avoid: Load ctx.Web.ServerRelativeUrl first, then compute: siteRelUrl = rootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/'). Use ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl), which accepts full server-relative paths.
Warning signs: 404/FileNotFoundException from CSOM when calling StorageMetrics.
Code Examples
Loading StorageMetrics (STOR-01/02/03)
// Source: MS Learn — StorageMetrics Class; [MS-CSOMSPT] TotalFileStreamSize definition
ctx.Load(ctx.Web, w => w.ServerRelativeUrl, w => w.Url, w => w.Title);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
string webSrl = ctx.Web.ServerRelativeUrl.TrimEnd('/');

// Per-library: iterate document libraries
ctx.Load(ctx.Web.Lists, lists => lists.Include(
    l => l.Title, l => l.BaseType, l => l.Hidden, l => l.RootFolder.ServerRelativeUrl));
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

foreach (var list in ctx.Web.Lists)
{
    if (list.Hidden || list.BaseType != BaseType.DocumentLibrary) continue;
    string siteRelUrl = list.RootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/');
    Folder rootFolder = ctx.Web.GetFolderByServerRelativeUrl(list.RootFolder.ServerRelativeUrl);
    ctx.Load(rootFolder,
        f => f.StorageMetrics,
        f => f.TimeLastModified,
        f => f.ServerRelativeUrl);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
    var node = new StorageNode
    {
        Name = list.Title,
        Url = $"{ctx.Web.Url.TrimEnd('/')}/{siteRelUrl}",
        SiteTitle = ctx.Web.Title,
        Library = list.Title,
        TotalSizeBytes = rootFolder.StorageMetrics.TotalSize,
        FileStreamSizeBytes = rootFolder.StorageMetrics.TotalFileStreamSize,
        TotalFileCount = rootFolder.StorageMetrics.TotalFileCount,
        LastModified = rootFolder.StorageMetrics.LastModified > DateTime.MinValue
            ? rootFolder.StorageMetrics.LastModified
            : rootFolder.TimeLastModified,
        IndentLevel = 0,
        Children = new List<StorageNode>()
    };
    // Recursive subfolder collection up to maxDepth
    if (maxDepth > 0)
        await CollectSubfoldersAsync(ctx, list.RootFolder.ServerRelativeUrl, node, 1, maxDepth, progress, ct);
}
KQL Build from SearchOptions
// Source: PS reference lines 4747-4763
private static string BuildKql(SearchOptions opts)
{
    var parts = new List<string> { "ContentType:Document" };
    if (opts.Extensions.Length > 0)
    {
        var extParts = opts.Extensions.Select(e => $"FileExtension:{e.TrimStart('.').ToLowerInvariant()}");
        parts.Add($"({string.Join(" OR ", extParts)})");
    }
    if (opts.CreatedAfter.HasValue)
        parts.Add($"Created>={opts.CreatedAfter.Value:yyyy-MM-dd}");
    if (opts.CreatedBefore.HasValue)
        parts.Add($"Created<={opts.CreatedBefore.Value:yyyy-MM-dd}");
    if (opts.ModifiedAfter.HasValue)
        parts.Add($"Write>={opts.ModifiedAfter.Value:yyyy-MM-dd}");
    if (opts.ModifiedBefore.HasValue)
        parts.Add($"Write<={opts.ModifiedBefore.Value:yyyy-MM-dd}");
    if (!string.IsNullOrEmpty(opts.CreatedBy))
        parts.Add($"Author:\"{opts.CreatedBy}\"");
    if (!string.IsNullOrEmpty(opts.ModifiedBy))
        parts.Add($"ModifiedBy:\"{opts.ModifiedBy}\"");
    if (!string.IsNullOrEmpty(opts.Library))
        parts.Add($"Path:\"{opts.SiteUrl.TrimEnd('/')}/{opts.Library.TrimStart('/')}*\"");
    return string.Join(" AND ", parts);
}
Parsing Search ResultRows
// Source: PS reference lines 4971-4987
private static SearchResult ParseRow(IDictionary<string, object> row)
{
    static string Str(IDictionary<string, object> r, string key) =>
        r.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;

    static DateTime? Date(IDictionary<string, object> r, string key)
    {
        var s = Str(r, key);
        return DateTime.TryParse(s, out var dt) ? dt : null;
    }

    static long ParseSize(IDictionary<string, object> r, string key)
    {
        var raw = Str(r, key);
        var digits = System.Text.RegularExpressions.Regex.Replace(raw, "[^0-9]", "");
        return long.TryParse(digits, out var v) ? v : 0L;
    }

    return new SearchResult
    {
        Title = Str(row, "Title"),
        Path = Str(row, "Path"),
        FileExtension = Str(row, "FileExtension"),
        Created = Date(row, "Created"),
        LastModified = Date(row, "LastModifiedTime"),
        Author = Str(row, "Author"),
        ModifiedBy = Str(row, "ModifiedBy"),
        SizeBytes = ParseSize(row, "Size")
    };
}
Localization Keys Needed
The following keys are needed for Phase 3 Views. Keys from the PS reference (lines 2747-2813) are remapped to the C# Strings.resx naming convention. Existing keys already in Strings.resx are marked with (existing).
Storage Tab
| Key | EN Value | Notes |
|---|---|---|
| tab.storage | Storage | (existing — already in Strings.resx line 77) |
| chk.per.lib | Per-Library Breakdown | new |
| chk.subsites | Include Subsites | new |
| lbl.folder.depth | Folder depth: | (existing — shared with permissions) |
| chk.max.depth | Maximum (all levels) | (existing — shared with permissions) |
| stor.note | Note: deeper folder scans on large sites may take several minutes. | new |
| btn.gen.storage | Generate Metrics | new |
| btn.open.storage | Open Report | new |
| stor.col.library | Library | new |
| stor.col.site | Site | new |
| stor.col.files | Files | new |
| stor.col.size | Size | new |
| stor.col.versions | Versions | new |
| stor.col.lastmod | Last Modified | new |
| stor.col.share | Share of Total | new |
File Search Tab
| Key | EN Value | Notes |
|---|---|---|
| tab.search | File Search | (existing — already in Strings.resx line 79) |
| grp.search.filters | Search Filters | new |
| lbl.extensions | Extension(s): | new |
| ph.extensions | docx pdf xlsx | new (placeholder) |
| lbl.regex | Name / Regex: | new |
| ph.regex | Ex: report.* or \.bak$ | new (placeholder) |
| chk.created.after | Created after: | new |
| chk.created.before | Created before: | new |
| chk.modified.after | Modified after: | new |
| chk.modified.before | Modified before: | new |
| lbl.created.by | Created by: | new |
| ph.created.by | First Last or email | new (placeholder) |
| lbl.modified.by | Modified by: | new |
| ph.modified.by | First Last or email | new (placeholder) |
| lbl.library | Library: | new |
| ph.library | Optional relative path e.g. Shared Documents | new (placeholder) |
| lbl.max.results | Max results: | new |
| btn.run.search | Run Search | new |
| btn.open.search | Open Results | new |
| srch.col.name | File Name | new |
| srch.col.ext | Extension | new |
| srch.col.created | Created | new |
| srch.col.modified | Modified | new |
| srch.col.author | Created By | new |
| srch.col.modby | Modified By | new |
| srch.col.size | Size | new |
Duplicates Tab
| Key | EN Value | Notes |
|---|---|---|
| tab.duplicates | Duplicates | (existing — already in Strings.resx line 83) |
| grp.dup.type | Duplicate Type | new |
| rad.dup.files | Duplicate files | new |
| rad.dup.folders | Duplicate folders | new |
| grp.dup.criteria | Comparison Criteria | new |
| lbl.dup.note | Name is always the primary criterion. Check additional criteria: | new |
| chk.dup.size | Same size | new |
| chk.dup.created | Same creation date | new |
| chk.dup.modified | Same modification date | new |
| chk.dup.subfolders | Same subfolder count | new |
| chk.dup.filecount | Same file count | new |
| chk.include.subsites | Include subsites | new |
| ph.dup.lib | All (leave empty) | new (placeholder) |
| btn.run.scan | Run Scan | new |
| btn.open.results | Open Results | new |
Duplicate Detection Scale — Known Concern Resolution
The STATE.md concern ("Duplicate detection at scale (100k+ files) — Graph API hash enumeration limits") is resolved: the PS reference does NOT use file hashes. It uses name+size+date grouping, which is exactly what DUPL-01/02/03 specify. The requirements do not mention hash-based deduplication.
Scale analysis:
- File duplicates use the Search API. SharePoint Search caps at 50,000 results (StartRow=50,000 max). A site with 100k+ files will be capped at 50,000 returned results. This is the same cap as SRCH-02, and is a known/accepted limitation.
- Folder duplicates use CAML pagination. `SharePointPaginationHelper.GetAllItemsAsync` handles arbitrary folder counts with RowLimit=2000 pagination — no effective upper bound.
- Client-side GroupBy on 50,000 items is effectively instantaneous (a Dictionary-based O(n) operation).
- No Graph API or SHA256 content hashing is needed. The concern was about a potential v2 enhancement not required by DUPL-01/02/03.
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Get-PnPFolderStorageMetric (PS cmdlet) | CSOM Folder.StorageMetrics | Phase 3 migration | One CSOM round-trip per folder; no PnP PS module required |
| Submit-PnPSearchQuery (PS cmdlet) | CSOM KeywordQuery + SearchExecutor | Phase 3 migration | Same pagination model; TrimDuplicates=false explicit |
| Get-PnPListItem for folders (PS) | SharePointPaginationHelper.GetAllItemsAsync with CAML | Phase 3 migration | Reuses Phase 1 helper; handles 5000-item threshold |
| Storage TreeView control | Flat DataGrid with IndentLevel + IValueConverter | Phase 3 design decision | Better UI virtualization for large sites |
Validation Architecture
Test Framework
| Property | Value |
|---|---|
| Framework | xUnit 2.9.3 |
| Config file | none (SDK auto-discovery) |
| Quick run command | dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj --filter "Category!=Integration" -x |
| Full suite command | dotnet test SharepointToolbox.slnx |
Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| STOR-01/02 | StorageService.CollectStorageAsync returns StorageNode list | unit (mock ISessionManager) | dotnet test --filter "StorageServiceTests" | ❌ Wave 0 |
| STOR-03 | VersionSizeBytes = TotalSizeBytes - FileStreamSizeBytes | unit | dotnet test --filter "StorageNodeTests" | ❌ Wave 0 |
| STOR-04 | StorageCsvExportService.BuildCsv produces correct header and rows | unit | dotnet test --filter "StorageCsvExportServiceTests" | ❌ Wave 0 |
| STOR-05 | StorageHtmlExportService.BuildHtml contains toggle JS and nested tables | unit | dotnet test --filter "StorageHtmlExportServiceTests" | ❌ Wave 0 |
| SRCH-01 | SearchService builds correct KQL from SearchOptions | unit | dotnet test --filter "SearchServiceTests" | ❌ Wave 0 |
| SRCH-02 | Search loop exits when startRow > 50_000 | unit | dotnet test --filter "SearchServiceTests" | ❌ Wave 0 |
| SRCH-03 | SearchCsvExportService.BuildCsv produces correct header | unit | dotnet test --filter "SearchCsvExportServiceTests" | ❌ Wave 0 |
| SRCH-04 | SearchHtmlExportService.BuildHtml contains sort JS and filter input | unit | dotnet test --filter "SearchHtmlExportServiceTests" | ❌ Wave 0 |
| DUPL-01 | MakeKey function groups identical name+size+date items | unit | dotnet test --filter "DuplicatesServiceTests" | ❌ Wave 0 |
| DUPL-02 | CAML query targets FSObjType=1; FileCount = ItemChildCount - FolderChildCount | unit (logic only) | dotnet test --filter "DuplicatesServiceTests" | ❌ Wave 0 |
| DUPL-03 | DuplicatesHtmlExportService.BuildHtml contains group cards with ok/diff badges | unit | dotnet test --filter "DuplicatesHtmlExportServiceTests" | ❌ Wave 0 |
Note: StorageService, SearchService, and DuplicatesService depend on live CSOM — service-level tests use Skip like PermissionsServiceTests. ViewModel tests use Moq for IStorageService, ISearchService, IDuplicatesService following PermissionsViewModelTests pattern. Export service tests are fully unit-testable (no CSOM).
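The SRCH-02 cap behavior asserted in the table above can be sketched against the CSOM search API (`SearchPager` and the row handling are illustrative; the KeywordQuery/SearchExecutor calls are the documented Microsoft.SharePoint.Client.Search.Query API):

```csharp
// Sketch: KeywordQuery + SearchExecutor pagination honoring the 50,000
// StartRow hard cap (SRCH-02). Assumes an authenticated ClientContext;
// result-row collection is elided.
using System.Linq;
using System.Threading.Tasks;
using Microsoft.SharePoint.Client;
using Microsoft.SharePoint.Client.Search.Query;

static class SearchPager
{
    private const int PageSize = 500;       // RowLimit boundary per query
    private const int StartRowCap = 50_000; // SharePoint Search StartRow hard cap

    public static async Task RunAsync(ClientContext ctx, string kql)
    {
        var executor = new SearchExecutor(ctx);
        for (int startRow = 0; startRow < StartRowCap; startRow += PageSize)
        {
            var query = new KeywordQuery(ctx)
            {
                QueryText = kql,
                RowLimit = PageSize,
                StartRow = startRow,
                TrimDuplicates = false // match the PS reference behavior
            };
            var results = executor.ExecuteQuery(query);
            await ctx.ExecuteQueryAsync();

            var table = results.Value
                .FirstOrDefault(t => t.TableType == "RelevantResults");
            if (table is null) break;
            // ... collect table.ResultRows here ...
            if (table.ResultRows.Count() < PageSize)
                break; // last page reached before the cap
        }
    }
}
```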
Sampling Rate
- Per task commit: `dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj -x`
- Per wave merge: `dotnet test SharepointToolbox.slnx`
- Phase gate: full suite green before `/gsd:verify-work`
Wave 0 Gaps
- `SharepointToolbox.Tests/Services/StorageServiceTests.cs` — covers STOR-01/02 (stub + Skip like PermissionsServiceTests)
- `SharepointToolbox.Tests/Services/Export/StorageCsvExportServiceTests.cs` — covers STOR-04
- `SharepointToolbox.Tests/Services/Export/StorageHtmlExportServiceTests.cs` — covers STOR-05
- `SharepointToolbox.Tests/Services/SearchServiceTests.cs` — covers SRCH-01/02 (KQL build + pagination cap logic)
- `SharepointToolbox.Tests/Services/Export/SearchCsvExportServiceTests.cs` — covers SRCH-03
- `SharepointToolbox.Tests/Services/Export/SearchHtmlExportServiceTests.cs` — covers SRCH-04
- `SharepointToolbox.Tests/Services/DuplicatesServiceTests.cs` — covers DUPL-01/02 composite key logic
- `SharepointToolbox.Tests/Services/Export/DuplicatesHtmlExportServiceTests.cs` — covers DUPL-03
- `SharepointToolbox.Tests/ViewModels/StorageViewModelTests.cs` — covers STOR-01 ViewModel (Moq IStorageService)
- `SharepointToolbox.Tests/ViewModels/SearchViewModelTests.cs` — covers SRCH-01/02 ViewModel
- `SharepointToolbox.Tests/ViewModels/DuplicatesViewModelTests.cs` — covers DUPL-01/02 ViewModel
Open Questions
- `StorageMetrics.LastModified` vs `TimeLastModified`
  - What we know: `StorageMetrics.LastModified` exists per the API docs. `Folder.TimeLastModified` is a separate CSOM property.
  - What's unclear: Whether `StorageMetrics.LastModified` can return `DateTime.MinValue` for recently created empty folders in all SharePoint Online tenants.
  - Recommendation: Load both (`f => f.StorageMetrics`, `f => f.TimeLastModified`) and prefer `StorageMetrics.LastModified` when it is `> DateTime.MinValue`, falling back to `TimeLastModified`.
- Search index freshness for duplicate detection
  - What we know: SharePoint Search is eventually consistent — newly created files may not appear for up to 15 minutes.
  - What's unclear: Whether users expect real-time accuracy or accept eventual consistency.
  - Recommendation: Document in the UI that search-based results (files) reflect the search index, not the current state. Add a note in the log output.
- Multiple-site file search scope
  - What we know: The PS reference scopes search to the `$siteUrl` context only (one site per search). SRCH-01 says "across sites" in the goal description, but the requirements only specify search criteria, not multi-site behavior.
  - What's unclear: Whether SRCH-01 requires multi-site search in one operation or per-site.
  - Recommendation: Implement per-site search (matching the PS reference). Multi-site search would require a separate `ClientContext` per site plus result merging — treat as a future enhancement.
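The recommendation in the first open question (dual-load with fallback) can be sketched as follows; `LastModifiedResolver` is an illustrative name and the `DateTime.MinValue` behavior is the unverified assumption under investigation:

```csharp
// Sketch: load both StorageMetrics.LastModified and Folder.TimeLastModified,
// then prefer the former when it carries a real value (open question 1).
using System;
using System.Threading.Tasks;
using Microsoft.SharePoint.Client;

static class LastModifiedResolver
{
    public static async Task<DateTime> ResolveAsync(ClientContext ctx, Folder folder)
    {
        ctx.Load(folder, f => f.StorageMetrics, f => f.TimeLastModified);
        await ctx.ExecuteQueryAsync();

        DateTime metricsDate = folder.StorageMetrics.LastModified;
        // Empty or freshly created folders may report DateTime.MinValue here.
        return metricsDate > DateTime.MinValue ? metricsDate : folder.TimeLastModified;
    }
}
```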
Sources
Primary (HIGH confidence)
- StorageMetrics Class — MS Learn CSOM reference — properties TotalSize, TotalFileStreamSize, TotalFileCount, LastModified confirmed
- StorageMetrics.TotalSize — MS Learn — confirmed as Int64, ReadOnly
- [MS-CSOMSPT] TotalFileStreamSize — confirmed definition: "Aggregate stream size in bytes for all files... Excludes version, metadata, list item attachment, and non-customized document sizes"
- SearchExecutor Class — MS Learn CSOM reference — namespace `Microsoft.SharePoint.Client.Search.Query`, assembly `Microsoft.SharePoint.Client.Search.Portable.dll`
- Search limits for SharePoint — MS Learn — StartRow max 50,000 (boundary), RowLimit max 500 (boundary) confirmed
- [SharepointToolbox/bin/Debug output] — `Microsoft.SharePoint.Client.Search.dll` confirmed present as transitive dependency
Secondary (MEDIUM confidence)
- Load storage metric from SPO — longnlp.github.io — CSOM Load pattern `ctx.Load(folder, f => f.StorageMetrics)` verified
- Fetch all results from SharePoint Search using CSOM — usefulscripts.wordpress.com — KeywordQuery + SearchExecutor pagination pattern with StartRow; confirmed against official docs
- PowerShell reference `Sharepoint_ToolBox.ps1` lines 1621-1780 (Export-StorageToHTML), 2112-2233 (Export-SearchResultsToHTML), 2235-2406 (Export-DuplicatesToHTML), 4432-4534 (storage scan), 4747-4808 (file search), 4937-5059 (duplicate scan) — authoritative reference implementation
Tertiary (LOW confidence — implementation detail, verify when coding)
- SharePoint CSOM Q&A — Getting size of subsite — general pattern confirmed; specific edge cases not verified
- Pagination for large result sets — MS Learn — DocId-based pagination beyond 50k exists but is not needed for Phase 3
Metadata
Confidence breakdown:
- Standard Stack: HIGH — no new packages needed; Search.dll confirmed present; all APIs verified against MS docs
- Architecture Patterns: HIGH — direct port of working PS reference; CSOM API shapes confirmed
- Pitfalls: HIGH for StorageMetrics loading, search result typing, vti_history filter (all from PS reference or official docs); MEDIUM for KQL length limit (documented but not commonly hit)
- Localization keys: HIGH — directly extracted from PS reference lines 2747-2813
Research date: 2026-04-02 Valid until: 2026-07-01 (CSOM APIs stable; SharePoint search limits stable; re-verify if PnP.Framework upgrades past 1.18)