# Phase 3: Storage and File Operations - Research

**Researched:** 2026-04-02
**Domain:** CSOM StorageMetrics, SharePoint KQL Search, WPF DataGrid, duplicate detection
**Confidence:** HIGH

---

## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| STOR-01 | User can view storage consumption per library on a site | CSOM `Folder.StorageMetrics` (one Load call per folder) + flat DataGrid with indent column |
| STOR-02 | User can view storage consumption per site with configurable folder depth | Recursive `Collect-FolderStorage` pattern translated to async CSOM; depth guard via split-count |
| STOR-03 | Storage metrics include total size, version size, item count, and last modified date | `StorageMetrics.TotalSize`, `TotalFileStreamSize`, `TotalFileCount`, `StorageMetrics.LastModified`; version size = TotalSize - TotalFileStreamSize |
| STOR-04 | User can export storage metrics to CSV | New `StorageCsvExportService` — same UTF-8 BOM pattern as Phase 2 |
| STOR-05 | User can export storage metrics to interactive HTML with collapsible tree view | New `StorageHtmlExportService` — port PS lines 1621-1780; toggle() JS + nested table rows |
| SRCH-01 | User can search files across sites using multiple criteria | `KeywordQuery` + `SearchExecutor` (CSOM search); KQL built from filter params; client-side Regex post-filter |
| SRCH-02 | User can configure maximum search results (up to 50,000) | SharePoint Search `StartRow` hard cap is 50,000 (boundary); 500 rows/batch × 100 pages = 50,000 max |
| SRCH-03 | User can export search results to CSV | New `SearchCsvExportService` |
| SRCH-04 | User can export search results to interactive HTML (sortable, filterable) | New `SearchHtmlExportService` — port PS lines 2112-2233; sortable columns via data attributes |
| DUPL-01 | User can scan for duplicate files by name, size, creation date, modification date | Search API (same as SRCH) + client-side GroupBy composite key; no content hashing needed |
| DUPL-02 | User can scan for duplicate folders by name, subfolder count, file count | `SharePointPaginationHelper.GetAllItemsAsync` with CAML `FSObjType=1`; read `FolderChildCount`, `ItemChildCount` from field values |
| DUPL-03 | User can export duplicate report to HTML with grouped display and visual indicators | New `DuplicatesHtmlExportService` — port PS lines 2235-2406; collapsible group cards, ok/diff badges |

---

## Summary

Phase 3 introduces three feature areas (Storage Metrics, File Search, Duplicate Detection), each requiring a dedicated ViewModel, View, Service, and export services. All three areas can be implemented without adding new NuGet packages — `Microsoft.SharePoint.Client.Search.dll` is already in the output folder as a transitive dependency of PnP.Framework 1.18.0.

**Storage** uses CSOM `Folder.StorageMetrics` (loaded via `ctx.Load(folder, f => f.StorageMetrics)`). One CSOM round-trip per folder. Version size is derived as `TotalSize - TotalFileStreamSize`. The data model is a recursive tree (site → library → folder → subfolder), flattened to a `DataGrid` with an indent-level column for WPF display. The HTML export ports the PS `Export-StorageToHTML` function (PS lines 1621-1780) with its toggle(i) JS pattern.

**File Search** uses `Microsoft.SharePoint.Client.Search.Query.KeywordQuery` + `SearchExecutor`. KQL is assembled from UI filter fields (extension, date range, creator, editor, library path). Pagination is `StartRow += 500` per batch; the hard ceiling is `StartRow = 50,000` (SharePoint Search boundary), which means the 50,000 max-results requirement (SRCH-02) is exactly the platform limit. Client-side Regex is applied after retrieval. The HTML export ports PS lines 2112-2233.

**Duplicate Detection** uses the same Search API for file duplicates (with an all-documents query) and `SharePointPaginationHelper.GetAllItemsAsync` with an FSObjType CAML filter for folder duplicates.
Items are grouped client-side by a composite key (name + optional size/dates/counts). No content hashing is needed — the DUPL-01/02/03 requirements specify name+size+dates, which exactly matches the PS reference implementation.

**Primary recommendation:** Three ViewModels (StorageViewModel, SearchViewModel, DuplicatesViewModel), three service interfaces, six export services (storage CSV/HTML, search CSV/HTML, duplicates HTML — duplicates CSV is bonus), all extending existing Phase 2 patterns.

---

## User Constraints

No CONTEXT.md exists for Phase 3 (no /gsd:discuss-phase was run). All decisions below are from the locked technology stack in the prompt.

### Locked Decisions

- .NET 10 LTS + WPF + MVVM (CommunityToolkit.Mvvm 8.4.2)
- PnP.Framework 1.18.0 (CSOM-based SharePoint access)
- No new major packages preferred — only add if truly necessary
- Microsoft.Extensions.Hosting DI
- Serilog logging
- xUnit 2.9.3 tests

### Deferred / Out of Scope

- Content hashing for duplicate detection (v2)
- Storage charts/graphs (v2 requirement VIZZ-01/02/03)
- Cross-tenant file search

---

## Standard Stack

### Core (no new packages needed)

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| PnP.Framework | 1.18.0 | CSOM access, `ClientContext` | Already in project |
| Microsoft.SharePoint.Client.Search.dll | (via PnP.Framework) | `KeywordQuery`, `SearchExecutor` | Transitive dep — confirmed present in `bin/Debug/net10.0-windows/` |
| CommunityToolkit.Mvvm | 8.4.2 | `[ObservableProperty]`, `AsyncRelayCommand` | Already in project |
| Microsoft.Extensions.Hosting | 10.x | DI container | Already in project |
| Serilog | 4.3.1 | Structured logging | Already in project |
| xUnit | 2.9.3 | Tests | Already in project |
| Moq | 4.20.72 | Mock interfaces in tests | Already in project |

**No new NuGet packages required.** `Microsoft.SharePoint.Client.Search.dll` ships as a transitive dependency of PnP.Framework — confirmed present at `SharepointToolbox/bin/Debug/net10.0-windows/Microsoft.SharePoint.Client.Search.dll`.

### New Models Needed

| Model | Location | Fields |
|-------|----------|--------|
| `StorageNode` | `Core/Models/StorageNode.cs` | `string Name`, `string Url`, `string SiteTitle`, `string Library`, `long TotalSizeBytes`, `long FileStreamSizeBytes`, `long TotalFileCount`, `DateTime? LastModified`, `int IndentLevel`, `List<StorageNode> Children` |
| `SearchResult` | `Core/Models/SearchResult.cs` | `string Title`, `string Path`, `string FileExtension`, `DateTime? Created`, `DateTime? LastModified`, `string Author`, `string ModifiedBy`, `long SizeBytes` |
| `DuplicateGroup` | `Core/Models/DuplicateGroup.cs` | `string GroupKey`, `string Name`, `List<DuplicateItem> Items` |
| `DuplicateItem` | `Core/Models/DuplicateItem.cs` | `string Name`, `string Path`, `string Library`, `long? SizeBytes`, `DateTime? Created`, `DateTime? Modified`, `int? FolderCount`, `int? FileCount` |
| `StorageScanOptions` | `Core/Models/StorageScanOptions.cs` | `bool PerLibrary`, `bool IncludeSubsites`, `int FolderDepth` |
| `SearchOptions` | `Core/Models/SearchOptions.cs` | `string[] Extensions`, `string? Regex`, `DateTime? CreatedAfter`, `DateTime? CreatedBefore`, `DateTime? ModifiedAfter`, `DateTime? ModifiedBefore`, `string? CreatedBy`, `string? ModifiedBy`, `string? Library`, `int MaxResults` |
| `DuplicateScanOptions` | `Core/Models/DuplicateScanOptions.cs` | `string Mode` ("Files"/"Folders"), `bool MatchSize`, `bool MatchCreated`, `bool MatchModified`, `bool MatchSubfolderCount`, `bool MatchFileCount`, `bool IncludeSubsites`, `string? Library` |

---

## Architecture Patterns

### Recommended Project Structure (additions only)

```
SharepointToolbox/
├── Core/Models/
│   ├── StorageNode.cs              # new
│   ├── SearchResult.cs             # new
│   ├── DuplicateGroup.cs           # new
│   ├── DuplicateItem.cs            # new
│   ├── StorageScanOptions.cs       # new
│   ├── SearchOptions.cs            # new
│   └── DuplicateScanOptions.cs     # new
├── Services/
│   ├── IStorageService.cs          # new
│   ├── StorageService.cs           # new
│   ├── ISearchService.cs           # new
│   ├── SearchService.cs            # new
│   ├── IDuplicatesService.cs       # new
│   ├── DuplicatesService.cs        # new
│   └── Export/
│       ├── StorageCsvExportService.cs      # new
│       ├── StorageHtmlExportService.cs     # new
│       ├── SearchCsvExportService.cs       # new
│       ├── SearchHtmlExportService.cs      # new
│       └── DuplicatesHtmlExportService.cs  # new
├── ViewModels/Tabs/
│   ├── StorageViewModel.cs         # new
│   ├── SearchViewModel.cs          # new
│   └── DuplicatesViewModel.cs      # new
└── Views/Tabs/
    ├── StorageView.xaml            # new
    ├── StorageView.xaml.cs         # new
    ├── SearchView.xaml             # new
    ├── SearchView.xaml.cs          # new
    ├── DuplicatesView.xaml         # new
    └── DuplicatesView.xaml.cs      # new
```

### Pattern 1: CSOM StorageMetrics Load

**What:** Load `Folder.StorageMetrics` with a single round-trip per folder. StorageMetrics is a child object — you must include it in the Load expression or it will not be fetched.

**When to use:** Whenever reading storage data for a folder or library root.
**Example:**

```csharp
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics
// + https://longnlp.github.io/load-storage-metric-from-SPO

// Get folder by server-relative URL (library root or subfolder)
Folder folder = ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl);
ctx.Load(folder,
    f => f.StorageMetrics,    // pulls TotalSize, TotalFileStreamSize, TotalFileCount, LastModified
    f => f.TimeLastModified,  // alternative timestamp if StorageMetrics.LastModified is null
    f => f.ServerRelativeUrl,
    f => f.Name);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

long totalBytes   = folder.StorageMetrics.TotalSize;
long streamBytes  = folder.StorageMetrics.TotalFileStreamSize;  // current-version files only
long versionBytes = Math.Max(0L, totalBytes - streamBytes);     // version overhead
long fileCount    = folder.StorageMetrics.TotalFileCount;
DateTime? lastMod = folder.StorageMetrics.IsPropertyAvailable("LastModified")
    ? folder.StorageMetrics.LastModified
    : folder.TimeLastModified;
```

**Unit:** `TotalSize` and `TotalFileStreamSize` are in **bytes** (Int64). `TotalFileStreamSize` is the aggregate stream size for current-version file content only — it excludes version history, metadata, and attachments (confirmed by [MS-CSOMSPT]). Version storage = `TotalSize - TotalFileStreamSize`.

### Pattern 2: KQL Search with Pagination

**What:** Use `KeywordQuery` + `SearchExecutor` (in `Microsoft.SharePoint.Client.Search.Query`) to execute a KQL query, paginating 500 rows at a time via `StartRow`.

**When to use:** SRCH-01/02/03/04 (file search) and DUPL-01 (file duplicate detection).
**Example:**

```csharp
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor
// + https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/

using Microsoft.SharePoint.Client.Search.Query;
// namespace: Microsoft.SharePoint.Client.Search.Query
// assembly:  Microsoft.SharePoint.Client.Search.dll (via PnP.Framework transitive dep)

var allResults = new List<Dictionary<string, object>>();
int startRow = 0;
int rowsInBatch = 0;
const int batchSize = 500;

do
{
    ct.ThrowIfCancellationRequested();

    var kq = new KeywordQuery(ctx)
    {
        QueryText = kql,       // e.g. "ContentType:Document AND FileExtension:pdf"
        StartRow = startRow,
        RowLimit = batchSize,
        TrimDuplicates = false
    };

    // Explicit managed properties to retrieve (SelectProperties exposes Add, not AddRange)
    foreach (var prop in new[] { "Title", "Path", "Author", "LastModifiedTime",
                                 "FileExtension", "Created", "ModifiedBy", "Size" })
    {
        kq.SelectProperties.Add(prop);
    }

    var executor = new SearchExecutor(ctx);
    ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
    // Note: ctx.ExecuteQuery() is called inside ExecuteQueryRetryAsync — do NOT call it again

    var table = clientResult.Value
        .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults);
    if (table == null) break;

    rowsInBatch = table.RowCount;  // a partial batch means the result set is exhausted
    foreach (System.Collections.Hashtable row in table.ResultRows)
    {
        allResults.Add(row.Cast<System.Collections.DictionaryEntry>()
            .ToDictionary(e => e.Key.ToString()!, e => e.Value ?? (object)string.Empty));
    }

    progress.Report(new OperationProgress(allResults.Count, maxResults,
        $"Retrieved {allResults.Count} results…"));
    startRow += batchSize;
} while (rowsInBatch == batchSize
         && startRow < maxResults
         && startRow <= 50_000           // platform hard cap
         && allResults.Count < maxResults);
```

**Critical detail:** `ExecuteQueryRetryHelper.ExecuteQueryRetryAsync` wraps `ctx.ExecuteQuery()`. Call it AFTER `executor.ExecuteQuery(kq)` — do NOT call `ctx.ExecuteQuery()` directly afterward.
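Each retrieved page lands in `allResults` as loosely typed rows keyed by managed property names. A small mapper can translate those rows into the `SearchResult` model from the table above. The sketch below is illustrative: `SearchResultMapper` and its helper methods are assumed names (not from the PS reference), and the inline `SearchResult` record simply mirrors the planned `Core/Models/SearchResult.cs` fields.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Mirrors the Core/Models/SearchResult.cs field list from the model table above.
public sealed record SearchResult(
    string Title, string Path, string FileExtension,
    DateTime? Created, DateTime? LastModified,
    string Author, string ModifiedBy, long SizeBytes);

// Hypothetical helper: maps one raw search row (managed property name -> value)
// onto the Phase 3 SearchResult model.
public static class SearchResultMapper
{
    public static SearchResult Map(IReadOnlyDictionary<string, object> row) =>
        new(
            Title:         Text(row, "Title"),
            Path:          Text(row, "Path"),
            FileExtension: Text(row, "FileExtension"),
            Created:       Date(row, "Created"),
            LastModified:  Date(row, "LastModifiedTime"), // managed property name differs from the model field
            Author:        Text(row, "Author"),
            ModifiedBy:    Text(row, "ModifiedBy"),
            SizeBytes:     long.TryParse(Text(row, "Size"), out var s) ? s : 0L);

    // Missing or null properties become empty strings / null dates rather than throwing.
    private static string Text(IReadOnlyDictionary<string, object> row, string key) =>
        row.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;

    private static DateTime? Date(IReadOnlyDictionary<string, object> row, string key) =>
        DateTime.TryParse(Text(row, key), CultureInfo.InvariantCulture,
                          DateTimeStyles.AdjustToUniversal, out var dt) ? dt : null;
}
```

Keeping the tolerant `Text`/`Date` helpers means a row that omits a managed property (which SharePoint search does for empty values) still maps cleanly instead of throwing.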
**StartRow limit:** SharePoint Search imposes a hard boundary of 50,000 for `StartRow`. With batch size 500, max pages = 100, max results = 50,000. This exactly satisfies SRCH-02.

**KQL field mappings (from PS reference lines 4747-4763):**

- Extension: `FileExtension:pdf OR FileExtension:docx`
- Created after/before: `Created>=2024-01-01` / `Created<=2024-12-31`
- Modified after/before: `Write>=2024-01-01` / `Write<=2024-12-31`
- Created by: `Author:"First Last"`
- Modified by: `ModifiedBy:"First Last"`
- Library path: `Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"`
- Documents only: `ContentType:Document`

### Pattern 3: Folder Enumeration for Duplicate Folders

**What:** Use `SharePointPaginationHelper.GetAllItemsAsync` with a CAML filter on `FSObjType = 1` (folders). Read `FolderChildCount` and `ItemChildCount` from `FieldValues`.

**When to use:** DUPL-02 (folder duplicate scan).

**Example:**

```csharp
// Source: PS reference lines 5010-5036; Phase 2 SharePointPaginationHelper pattern
var camlQuery = new CamlQuery
{
    // Folders only (FSObjType = 1), walked recursively, paged 2000 rows at a time
    ViewXml = @"<View Scope='RecursiveAll'>
        <Query>
            <Where>
                <Eq>
                    <FieldRef Name='FSObjType' />
                    <Value Type='Integer'>1</Value>
                </Eq>
            </Where>
        </Query>
        <RowLimit Paged='TRUE'>2000</RowLimit>
    </View>"
};

await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, list, camlQuery, ct))
{
    var fv = item.FieldValues;
    var name       = fv["FileLeafRef"]?.ToString() ?? string.Empty;
    var fileRef    = fv["FileRef"]?.ToString() ?? string.Empty;
    var subCount   = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
    var childCount = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
    var fileCount  = Math.Max(0, childCount - subCount);
    var created    = fv["Created"]  is DateTime cr ? cr : (DateTime?)null;
    var modified   = fv["Modified"] is DateTime md ? md : (DateTime?)null;
    // ...build DuplicateItem
}
```

### Pattern 4: Duplicate Composite Key (name+size+date grouping)

**What:** Build a string composite key from the fields the user selected, then `GroupBy(key).Where(g => g.Count() >= 2)`.

**When to use:** DUPL-01 (files) and DUPL-02 (folders).
**Example:**

```csharp
// Source: PS reference lines 4942-4949 (MakeKey function)
private static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
{
    var parts = new List<string> { item.Name.ToLowerInvariant() };
    if (opts.MatchSize && item.SizeBytes.HasValue)
        parts.Add(item.SizeBytes.Value.ToString());
    if (opts.MatchCreated && item.Created.HasValue)
        parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchModified && item.Modified.HasValue)
        parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchSubfolderCount && item.FolderCount.HasValue)
        parts.Add(item.FolderCount.Value.ToString());
    if (opts.MatchFileCount && item.FileCount.HasValue)
        parts.Add(item.FileCount.Value.ToString());
    return string.Join("|", parts);
}

var groups = allItems
    .GroupBy(i => MakeKey(i, opts))
    .Where(g => g.Count() >= 2)
    .Select(g => new DuplicateGroup { GroupKey = g.Key, Name = g.First().Name, Items = g.ToList() })
    .OrderByDescending(g => g.Items.Count)
    .ToList();
```

### Pattern 5: Storage Recursive Tree → Flat Row List for DataGrid

**What:** Flatten the recursive tree (site → library → folder → subfolder) into a flat `List<StorageNode>` where each node carries an `IndentLevel`. The WPF `DataGrid` renders a `Margin` on the name cell based on `IndentLevel`.

**When to use:** STOR-01/02 WPF display.

**Rationale for DataGrid over TreeView:** WPF `TreeView` requires a hierarchical `HierarchicalDataTemplate` and loses virtualization with deep nesting. A flat `DataGrid` with `VirtualizingPanel.IsVirtualizing="True"` stays performant for thousands of rows and is trivially sortable.

**Example:**

```csharp
// Flatten tree to observable list for DataGrid binding
private static void FlattenTree(StorageNode node, int level, List<StorageNode> result)
{
    node.IndentLevel = level;
    result.Add(node);
    foreach (var child in node.Children)
        FlattenTree(child, level + 1, result);
}
```

In the XAML, bind the name cell's `Margin` through an `IValueConverter` that maps `IndentLevel` → `new Thickness(IndentLevel * 16, 0, 0, 0)`.
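The indent converter can be sketched as follows; the class name `IndentToMarginConverter` and the 16-pixel step are assumptions, and any equivalent `IValueConverter` works.

```csharp
using System;
using System.Globalization;
using System.Windows;
using System.Windows.Data;

// Illustrative one-way converter: turns an int IndentLevel into a left Margin
// so child rows in the flat DataGrid appear visually nested.
public sealed class IndentToMarginConverter : IValueConverter
{
    public double PixelsPerLevel { get; set; } = 16;

    public object Convert(object value, Type targetType, object parameter, CultureInfo culture) =>
        new Thickness(value is int level && level > 0 ? level * PixelsPerLevel : 0, 0, 0, 0);

    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture) =>
        throw new NotSupportedException(); // display-only binding, never written back
}
```

Declared once in the view's resources (e.g. `<local:IndentToMarginConverter x:Key="IndentToMargin" />`), it would be bound on the name cell's `TextBlock` as `Margin="{Binding IndentLevel, Converter={StaticResource IndentToMargin}}"`.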
### Pattern 6: Storage HTML Collapsible Tree

**What:** The HTML export uses inline nested tables with `display:none` rows toggled by `toggle(i)` JS. Each library/folder that has children gets a unique numeric index.

**When to use:** STOR-05 export.

**Key design (from PS lines 1621-1780):**

- A global `_togIdx` counter assigns unique IDs to collapsible rows: ``.
- A `