diff --git a/.planning/phases/03-storage/03-RESEARCH.md b/.planning/phases/03-storage/03-RESEARCH.md new file mode 100644 index 0000000..54da041 --- /dev/null +++ b/.planning/phases/03-storage/03-RESEARCH.md @@ -0,0 +1,756 @@ +# Phase 3: Storage and File Operations - Research + +**Researched:** 2026-04-02 +**Domain:** CSOM StorageMetrics, SharePoint KQL Search, WPF DataGrid, duplicate detection +**Confidence:** HIGH + +--- + + +## Phase Requirements + +| ID | Description | Research Support | +|----|-------------|-----------------| +| STOR-01 | User can view storage consumption per library on a site | CSOM `Folder.StorageMetrics` (one Load call per folder) + flat DataGrid with indent column | +| STOR-02 | User can view storage consumption per site with configurable folder depth | Recursive `Collect-FolderStorage` pattern translated to async CSOM; depth guard via split-count | +| STOR-03 | Storage metrics include total size, version size, item count, and last modified date | `StorageMetrics.TotalSize`, `TotalFileStreamSize`, `TotalFileCount`, `StorageMetrics.LastModified`; version size = TotalSize - TotalFileStreamSize | +| STOR-04 | User can export storage metrics to CSV | New `StorageCsvExportService` — same UTF-8 BOM pattern as Phase 2 | +| STOR-05 | User can export storage metrics to interactive HTML with collapsible tree view | New `StorageHtmlExportService` — port PS lines 1621-1780; toggle() JS + nested table rows | +| SRCH-01 | User can search files across sites using multiple criteria | `KeywordQuery` + `SearchExecutor` (CSOM search); KQL built from filter params; client-side Regex post-filter | +| SRCH-02 | User can configure maximum search results (up to 50,000) | SharePoint Search `StartRow` hard cap is 50,000 (boundary); 500 rows/batch × 100 pages = 50,000 max | +| SRCH-03 | User can export search results to CSV | New `SearchCsvExportService` | +| SRCH-04 | User can export search results to interactive HTML (sortable, filterable) | New 
`SearchHtmlExportService` — port PS lines 2112-2233; sortable columns via data attributes | +| DUPL-01 | User can scan for duplicate files by name, size, creation date, modification date | Search API (same as SRCH) + client-side GroupBy composite key; no content hashing needed | +| DUPL-02 | User can scan for duplicate folders by name, subfolder count, file count | `SharePointPaginationHelper.GetAllItemsAsync` with CAML `FSObjType=1`; read `FolderChildCount`, `ItemChildCount` from field values | +| DUPL-03 | User can export duplicate report to HTML with grouped display and visual indicators | New `DuplicatesHtmlExportService` — port PS lines 2235-2406; collapsible group cards, ok/diff badges | + + +--- + +## Summary + +Phase 3 introduces three feature areas (Storage Metrics, File Search, Duplicate Detection), each requiring a dedicated ViewModel, View, Service, and export services. All three areas can be implemented without adding new NuGet packages — `Microsoft.SharePoint.Client.Search.dll` is already in the output folder as a transitive dependency of PnP.Framework 1.18.0. + +**Storage** uses CSOM `Folder.StorageMetrics` (loaded via `ctx.Load(folder, f => f.StorageMetrics)`). One CSOM round-trip per folder. Version size is derived as `TotalSize - TotalFileStreamSize`. The data model is a recursive tree (site → library → folder → subfolder), flattened to a `DataGrid` with an indent-level column for WPF display. The HTML export ports the PS `Export-StorageToHTML` function (PS lines 1621-1780) with its toggle(i) JS pattern. + +**File Search** uses `Microsoft.SharePoint.Client.Search.Query.KeywordQuery` + `SearchExecutor`. KQL is assembled from UI filter fields (extension, date range, creator, editor, library path). Pagination is `StartRow += 500` per batch; the hard ceiling is `StartRow = 50,000` (SharePoint Search boundary), which means the 50,000 max-results requirement (SRCH-02) is exactly the platform limit. Client-side Regex is applied after retrieval. 
The HTML export ports PS lines 2112-2233. + +**Duplicate Detection** uses the same Search API for file duplicates (with all documents query) and `SharePointPaginationHelper.GetAllItemsAsync` with FSObjType CAML filter for folder duplicates. Items are grouped client-side by a composite key (name + optional size/dates/counts). No content hashing is needed — the DUPL-01/02/03 requirements specify name+size+dates, which exactly matches the PS reference implementation. + +**Primary recommendation:** Three ViewModels (StorageViewModel, SearchViewModel, DuplicatesViewModel), three service interfaces, six export services (storage CSV/HTML, search CSV/HTML, duplicates HTML — duplicates CSV is bonus), all extending existing Phase 2 patterns. + +--- + +## User Constraints + +No CONTEXT.md exists for Phase 3 (no /gsd:discuss-phase was run). All decisions below are from the locked technology stack in the prompt. + +### Locked Decisions +- .NET 10 LTS + WPF + MVVM (CommunityToolkit.Mvvm 8.4.2) +- PnP.Framework 1.18.0 (CSOM-based SharePoint access) +- No new major packages preferred — only add if truly necessary +- Microsoft.Extensions.Hosting DI +- Serilog logging +- xUnit 2.9.3 tests + +### Deferred / Out of Scope +- Content hashing for duplicate detection (v2) +- Storage charts/graphs (v2 requirement VIZZ-01/02/03) +- Cross-tenant file search + +--- + +## Standard Stack + +### Core (no new packages needed) + +| Library | Version | Purpose | Why Standard | +|---------|---------|---------|--------------| +| PnP.Framework | 1.18.0 | CSOM access, `ClientContext` | Already in project | +| Microsoft.SharePoint.Client.Search.dll | (via PnP.Framework) | `KeywordQuery`, `SearchExecutor` | Transitive dep — confirmed present in `bin/Debug/net10.0-windows/` | +| CommunityToolkit.Mvvm | 8.4.2 | `[ObservableProperty]`, `AsyncRelayCommand` | Already in project | +| Microsoft.Extensions.Hosting | 10.x | DI container | Already in project | +| Serilog | 4.3.1 | Structured logging | Already in 
project |
+| xUnit | 2.9.3 | Tests | Already in project |
+| Moq | 4.20.72 | Mock interfaces in tests | Already in project |
+
+**No new NuGet packages required.** `Microsoft.SharePoint.Client.Search.dll` ships as a transitive dependency of PnP.Framework — confirmed present at `SharepointToolbox/bin/Debug/net10.0-windows/Microsoft.SharePoint.Client.Search.dll`.
+
+### New Models Needed
+
+| Model | Location | Fields |
+|-------|----------|--------|
+| `StorageNode` | `Core/Models/StorageNode.cs` | `string Name`, `string Url`, `string SiteTitle`, `string Library`, `long TotalSizeBytes`, `long FileStreamSizeBytes`, `long TotalFileCount`, `DateTime? LastModified`, `int IndentLevel`, `List<StorageNode> Children` |
+| `SearchResult` | `Core/Models/SearchResult.cs` | `string Title`, `string Path`, `string FileExtension`, `DateTime? Created`, `DateTime? LastModified`, `string Author`, `string ModifiedBy`, `long SizeBytes` |
+| `DuplicateGroup` | `Core/Models/DuplicateGroup.cs` | `string GroupKey`, `string Name`, `List<DuplicateItem> Items` |
+| `DuplicateItem` | `Core/Models/DuplicateItem.cs` | `string Name`, `string Path`, `string Library`, `long? SizeBytes`, `DateTime? Created`, `DateTime? Modified`, `int? FolderCount`, `int? FileCount` |
+| `StorageScanOptions` | `Core/Models/StorageScanOptions.cs` | `bool PerLibrary`, `bool IncludeSubsites`, `int FolderDepth` |
+| `SearchOptions` | `Core/Models/SearchOptions.cs` | `string[] Extensions`, `string? Regex`, `DateTime? CreatedAfter`, `DateTime? CreatedBefore`, `DateTime? ModifiedAfter`, `DateTime? ModifiedBefore`, `string? CreatedBy`, `string? ModifiedBy`, `string? Library`, `int MaxResults` |
+| `DuplicateScanOptions` | `Core/Models/DuplicateScanOptions.cs` | `string Mode` ("Files"/"Folders"), `bool MatchSize`, `bool MatchCreated`, `bool MatchModified`, `bool MatchSubfolderCount`, `bool MatchFileCount`, `bool IncludeSubsites`, `string? 
Library` | + +--- + +## Architecture Patterns + +### Recommended Project Structure (additions only) + +``` +SharepointToolbox/ +├── Core/Models/ +│ ├── StorageNode.cs # new +│ ├── SearchResult.cs # new +│ ├── DuplicateGroup.cs # new +│ ├── DuplicateItem.cs # new +│ ├── StorageScanOptions.cs # new +│ ├── SearchOptions.cs # new +│ └── DuplicateScanOptions.cs # new +├── Services/ +│ ├── IStorageService.cs # new +│ ├── StorageService.cs # new +│ ├── ISearchService.cs # new +│ ├── SearchService.cs # new +│ ├── IDuplicatesService.cs # new +│ ├── DuplicatesService.cs # new +│ └── Export/ +│ ├── StorageCsvExportService.cs # new +│ ├── StorageHtmlExportService.cs # new +│ ├── SearchCsvExportService.cs # new +│ ├── SearchHtmlExportService.cs # new +│ └── DuplicatesHtmlExportService.cs # new +├── ViewModels/Tabs/ +│ ├── StorageViewModel.cs # new +│ ├── SearchViewModel.cs # new +│ └── DuplicatesViewModel.cs # new +└── Views/Tabs/ + ├── StorageView.xaml # new + ├── StorageView.xaml.cs # new + ├── SearchView.xaml # new + ├── SearchView.xaml.cs # new + ├── DuplicatesView.xaml # new + └── DuplicatesView.xaml.cs # new +``` + +### Pattern 1: CSOM StorageMetrics Load + +**What:** Load `Folder.StorageMetrics` with a single round-trip per folder. StorageMetrics is a child object — you must include it in the Load expression or it will not be fetched. + +**When to use:** Whenever reading storage data for a folder or library root. 
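Before issuing the per-folder loads, the recursive walk needs the STOR-02 depth guard. A minimal sketch of the split-count pattern mentioned in the requirements table (the `DepthGuard` class and parameter names are illustrative, not part of the PS reference):

```csharp
using System;

public static class DepthGuard
{
    // Depth of a folder relative to the library root, derived purely from the
    // number of extra server-relative URL segments (split-count pattern, STOR-02).
    public static int RelativeDepth(string libraryRootUrl, string folderUrl)
    {
        static int Segments(string u) =>
            u.Trim('/').Split('/', StringSplitOptions.RemoveEmptyEntries).Length;
        return Segments(folderUrl) - Segments(libraryRootUrl);
    }

    // Recurse only while the child folder is still above the configured depth limit.
    public static bool ShouldRecurse(string libraryRootUrl, string folderUrl, int maxDepth)
        => RelativeDepth(libraryRootUrl, folderUrl) < maxDepth;
}
```

Because the check uses only URLs already loaded for each folder, it costs no extra CSOM round-trips.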
+ +**Example:** +```csharp +// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics +// + https://longnlp.github.io/load-storage-metric-from-SPO + +// Get folder by server-relative URL (library root or subfolder) +Folder folder = ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl); +ctx.Load(folder, + f => f.StorageMetrics, // pulls TotalSize, TotalFileStreamSize, TotalFileCount, LastModified + f => f.TimeLastModified, // alternative timestamp if StorageMetrics.LastModified is null + f => f.ServerRelativeUrl, + f => f.Name); +await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct); + +long totalBytes = folder.StorageMetrics.TotalSize; +long streamBytes = folder.StorageMetrics.TotalFileStreamSize; // current-version files only +long versionBytes = Math.Max(0L, totalBytes - streamBytes); // version overhead +long fileCount = folder.StorageMetrics.TotalFileCount; +DateTime? lastMod = folder.StorageMetrics.IsPropertyAvailable("LastModified") + ? folder.StorageMetrics.LastModified + : folder.TimeLastModified; +``` + +**Unit:** `TotalSize` and `TotalFileStreamSize` are in **bytes** (Int64). `TotalFileStreamSize` is the aggregate stream size for current-version file content only — it excludes version history, metadata, and attachments (confirmed by [MS-CSOMSPT]). Version storage = `TotalSize - TotalFileStreamSize`. + +### Pattern 2: KQL Search with Pagination + +**What:** Use `KeywordQuery` + `SearchExecutor` (in `Microsoft.SharePoint.Client.Search.Query`) to execute a KQL query, paginating 500 rows at a time via `StartRow`. + +**When to use:** SRCH-01/02/03/04 (file search) and DUPL-01 (file duplicate detection). 
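The `kql` string consumed by the pagination loop below has to be assembled from the user's filter fields first. A minimal sketch (the `KqlBuilder` helper is hypothetical; managed-property names follow the KQL field mappings documented under this pattern, and the options class is a trimmed stand-in for the `SearchOptions` model above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed stand-in for the SearchOptions model (illustrative subset).
public sealed class SearchOptions
{
    public string[]? Extensions { get; init; }
    public DateTime? CreatedAfter { get; init; }
    public DateTime? CreatedBefore { get; init; }
    public DateTime? ModifiedAfter { get; init; }
    public DateTime? ModifiedBefore { get; init; }
    public string? CreatedBy { get; init; }
    public string? ModifiedBy { get; init; }
    public string? Library { get; init; }
}

public static class KqlBuilder
{
    // Builds the KQL query text from the filter fields; every clause maps to a
    // documented managed property (FileExtension, Created, Write, Author,
    // ModifiedBy, Path). Clauses are AND-ed; extensions are OR-ed in a group.
    public static string Build(SearchOptions o)
    {
        var clauses = new List<string> { "ContentType:Document" };

        if (o.Extensions is { Length: > 0 })
            clauses.Add("(" + string.Join(" OR ",
                o.Extensions.Select(e => "FileExtension:" + e.TrimStart('.'))) + ")");

        if (o.CreatedAfter is { } ca)   clauses.Add($"Created>={ca:yyyy-MM-dd}");
        if (o.CreatedBefore is { } cb)  clauses.Add($"Created<={cb:yyyy-MM-dd}");
        if (o.ModifiedAfter is { } ma)  clauses.Add($"Write>={ma:yyyy-MM-dd}");
        if (o.ModifiedBefore is { } mb) clauses.Add($"Write<={mb:yyyy-MM-dd}");

        if (!string.IsNullOrWhiteSpace(o.CreatedBy))  clauses.Add($"Author:\"{o.CreatedBy}\"");
        if (!string.IsNullOrWhiteSpace(o.ModifiedBy)) clauses.Add($"ModifiedBy:\"{o.ModifiedBy}\"");
        if (!string.IsNullOrWhiteSpace(o.Library))    clauses.Add($"Path:\"{o.Library}*\"");

        return string.Join(" AND ", clauses);
    }
}
```

Keeping the builder free of CSOM types makes it trivially unit-testable under xUnit without a SharePoint connection.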
**Example:**
+```csharp
+// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor
+// + https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/
+
+using System.Linq;
+using Microsoft.SharePoint.Client.Search.Query;
+
+// namespace: Microsoft.SharePoint.Client.Search.Query
+// assembly: Microsoft.SharePoint.Client.Search.dll (via PnP.Framework transitive dep)
+
+var allResults = new List<Dictionary<string, object>>();
+int startRow = 0;
+const int batchSize = 500;
+
+do
+{
+    ct.ThrowIfCancellationRequested();
+
+    var kq = new KeywordQuery(ctx)
+    {
+        QueryText = kql,            // e.g. "ContentType:Document AND FileExtension:pdf"
+        StartRow = startRow,
+        RowLimit = batchSize,
+        TrimDuplicates = false
+    };
+    // Explicit managed properties to retrieve (SelectProperties exposes Add, not AddRange)
+    foreach (var prop in new[]
+    {
+        "Title", "Path", "Author", "LastModifiedTime",
+        "FileExtension", "Created", "ModifiedBy", "Size"
+    })
+    {
+        kq.SelectProperties.Add(prop);
+    }
+
+    var executor = new SearchExecutor(ctx);
+    ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
+    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
+    // Note: ctx.ExecuteQuery() is called inside ExecuteQueryRetryAsync — do NOT call again
+
+    var table = clientResult.Value
+        .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults);
+    if (table == null) break;
+
+    int retrieved = table.RowCount;
+    foreach (IDictionary<string, object> row in table.ResultRows)
+    {
+        allResults.Add(new Dictionary<string, object>(row));
+    }
+
+    progress.Report(new OperationProgress(allResults.Count, maxResults, $"Retrieved {allResults.Count} results…"));
+    if (retrieved < batchSize) break; // short page — result set exhausted, stop early
+    startRow += batchSize;
+}
+while (startRow < maxResults && startRow <= 50_000 // platform hard cap
+       && allResults.Count < maxResults);
+```
+
+**Critical detail:** `ExecuteQueryRetryHelper.ExecuteQueryRetryAsync` wraps `ctx.ExecuteQuery()`. 
Call it AFTER `executor.ExecuteQuery(kq)` — do NOT call `ctx.ExecuteQuery()` directly afterward.
+
+**StartRow limit:** SharePoint Search imposes a hard boundary of 50,000 for `StartRow`. With batch size 500, max pages = 100, max results = 50,000. This exactly satisfies SRCH-02.
+
+**KQL field mappings (from PS reference lines 4747-4763):**
+- Extension: `FileExtension:pdf OR FileExtension:docx`
+- Created after/before: `Created>=2024-01-01` / `Created<=2024-12-31`
+- Modified after/before: `Write>=2024-01-01` / `Write<=2024-12-31`
+- Created by: `Author:"First Last"`
+- Modified by: `ModifiedBy:"First Last"`
+- Library path: `Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"`
+- Documents only: `ContentType:Document`
+
+### Pattern 3: Folder Enumeration for Duplicate Folders
+
+**What:** Use `SharePointPaginationHelper.GetAllItemsAsync` with a CAML filter on `FSObjType = 1` (folders). Read `FolderChildCount` and `ItemChildCount` from `FieldValues`.
+
+**When to use:** DUPL-02 (folder duplicate scan).
+
+**Example:**
+```csharp
+// Source: PS reference lines 5010-5036; Phase 2 SharePointPaginationHelper pattern
+
+var camlQuery = new CamlQuery
+{
+    ViewXml = @"<View Scope='RecursiveAll'>
+                  <Query>
+                    <Where>
+                      <Eq>
+                        <FieldRef Name='FSObjType' />
+                        <Value Type='Integer'>1</Value>
+                      </Eq>
+                    </Where>
+                  </Query>
+                  <RowLimit Paged='TRUE'>2000</RowLimit>
+                </View>"
+};
+
+await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, list, camlQuery, ct))
+{
+    var fv = item.FieldValues;
+    var name = fv["FileLeafRef"]?.ToString() ?? string.Empty;
+    var fileRef = fv["FileRef"]?.ToString() ?? string.Empty;
+    var subCount = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
+    var childCount = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
+    var fileCount = Math.Max(0, childCount - subCount);
+    var created = fv["Created"] is DateTime cr ? cr : (DateTime?)null;
+    var modified = fv["Modified"] is DateTime md ? 
md : (DateTime?)null;
+    // ...build DuplicateItem
+}
+```
+
+### Pattern 4: Duplicate Composite Key (name+size+date grouping)
+
+**What:** Build a string composite key from the fields the user selected, then `GroupBy(key).Where(g => g.Count() >= 2)`.
+
+**When to use:** DUPL-01 (files) and DUPL-02 (folders).
+
+**Example:**
+```csharp
+// Source: PS reference lines 4942-4949 (MakeKey function)
+
+private static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
+{
+    var parts = new List<string> { item.Name.ToLowerInvariant() };
+    if (opts.MatchSize && item.SizeBytes.HasValue) parts.Add(item.SizeBytes.Value.ToString());
+    if (opts.MatchCreated && item.Created.HasValue) parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
+    if (opts.MatchModified && item.Modified.HasValue) parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
+    if (opts.MatchSubfolderCount && item.FolderCount.HasValue) parts.Add(item.FolderCount.Value.ToString());
+    if (opts.MatchFileCount && item.FileCount.HasValue) parts.Add(item.FileCount.Value.ToString());
+    return string.Join("|", parts);
+}
+
+var groups = allItems
+    .GroupBy(i => MakeKey(i, opts))
+    .Where(g => g.Count() >= 2)
+    .Select(g => new DuplicateGroup
+    {
+        GroupKey = g.Key,
+        Name = g.First().Name,
+        Items = g.ToList()
+    })
+    .OrderByDescending(g => g.Items.Count)
+    .ToList();
+```
+
+### Pattern 5: Storage Recursive Tree → Flat Row List for DataGrid
+
+**What:** Flatten the recursive tree (site → library → folder → subfolder) into a flat `List<StorageNode>` where each node carries an `IndentLevel`. The WPF `DataGrid` renders a `Margin` on the name cell based on `IndentLevel`.
+
+**When to use:** STOR-01/02 WPF display.
+
+**Rationale for DataGrid over TreeView:** WPF `TreeView` requires a `HierarchicalDataTemplate` and loses virtualization with deep nesting. A flat `DataGrid` with `VirtualizingPanel.IsVirtualizing="True"` stays performant for thousands of rows and is trivially sortable. 
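The indent itself is produced by a value converter on the name cell's `Margin`. A minimal sketch (class and resource key names are illustrative; 16 px per level matches the `Thickness` note under this pattern):

```csharp
using System;
using System.Globalization;
using System.Windows;
using System.Windows.Data;

// Hypothetical converter: maps an int IndentLevel to a left Margin of 16 px per level.
public sealed class IndentToMarginConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
        => new Thickness((value is int level ? level : 0) * 16, 0, 0, 0);

    // One-way display binding only — converting a margin back to a level is not needed.
    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
        => throw new NotSupportedException();
}
```

Register it once in the view's resources (e.g. `<local:IndentToMarginConverter x:Key="IndentToMargin" />`) and reference it from the name column's `Margin` binding.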
**Example:**
+```csharp
+// Flatten tree to observable list for DataGrid binding
+private static void FlattenTree(StorageNode node, int level, List<StorageNode> result)
+{
+    node.IndentLevel = level;
+    result.Add(node);
+    foreach (var child in node.Children)
+        FlattenTree(child, level + 1, result);
+}
+```
+
+```xml
+<!-- Column set is illustrative; StorageRows / IndentToMargin are placeholder names -->
+<DataGrid ItemsSource="{Binding StorageRows}"
+          AutoGenerateColumns="False"
+          VirtualizingPanel.IsVirtualizing="True">
+  <DataGrid.Columns>
+    <DataGridTemplateColumn Header="Name" Width="*">
+      <DataGridTemplateColumn.CellTemplate>
+        <DataTemplate>
+          <TextBlock Text="{Binding Name}"
+                     Margin="{Binding IndentLevel, Converter={StaticResource IndentToMargin}}" />
+        </DataTemplate>
+      </DataGridTemplateColumn.CellTemplate>
+    </DataGridTemplateColumn>
+    <DataGridTextColumn Header="Total Size" Binding="{Binding TotalSizeBytes}" />
+    <DataGridTextColumn Header="Files" Binding="{Binding TotalFileCount}" />
+  </DataGrid.Columns>
+</DataGrid>
+```
+
+Use `IValueConverter` mapping `IndentLevel` → `new Thickness(IndentLevel * 16, 0, 0, 0)`.
+
+### Pattern 6: Storage HTML Collapsible Tree
+
+**What:** The HTML export uses inline nested tables with `display:none` rows toggled by `toggle(i)` JS. Each library/folder that has children gets a unique numeric index.
+
+**When to use:** STOR-05 export.
+
+**Key design (from PS lines 1621-1780):**
+- A global `_togIdx` counter assigns each collapsible row a unique numeric id that the `toggle(i)` JS shows and hides.
+- A `