chore: complete v1.0 milestone
All checks were successful
Release zip package / release (push) Successful in 10s

Archive 5 phases (36 plans) to milestones/v1.0-phases/.
Archive roadmap, requirements, and audit to milestones/.
Evolve PROJECT.md with shipped state and validated requirements.
Collapse ROADMAP.md to one-line milestone summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Dev
2026-04-07 09:15:14 +02:00
parent b815c323d7
commit 655bb79a99
95 changed files with 610 additions and 332 deletions

View File

@@ -0,0 +1,756 @@
# Phase 3: Storage and File Operations - Research
**Researched:** 2026-04-02
**Domain:** CSOM StorageMetrics, SharePoint KQL Search, WPF DataGrid, duplicate detection
**Confidence:** HIGH
---
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|-----------------|
| STOR-01 | User can view storage consumption per library on a site | CSOM `Folder.StorageMetrics` (one Load call per folder) + flat DataGrid with indent column |
| STOR-02 | User can view storage consumption per site with configurable folder depth | Recursive `Collect-FolderStorage` pattern translated to async CSOM; depth guard via split-count |
| STOR-03 | Storage metrics include total size, version size, item count, and last modified date | `StorageMetrics.TotalSize`, `TotalFileStreamSize`, `TotalFileCount`, `StorageMetrics.LastModified`; version size = TotalSize - TotalFileStreamSize |
| STOR-04 | User can export storage metrics to CSV | New `StorageCsvExportService` — same UTF-8 BOM pattern as Phase 2 |
| STOR-05 | User can export storage metrics to interactive HTML with collapsible tree view | New `StorageHtmlExportService` — port PS lines 1621-1780; toggle() JS + nested table rows |
| SRCH-01 | User can search files across sites using multiple criteria | `KeywordQuery` + `SearchExecutor` (CSOM search); KQL built from filter params; client-side Regex post-filter |
| SRCH-02 | User can configure maximum search results (up to 50,000) | SharePoint Search `StartRow` hard cap is 50,000 (boundary); 500 rows/batch × 100 pages = 50,000 max |
| SRCH-03 | User can export search results to CSV | New `SearchCsvExportService` |
| SRCH-04 | User can export search results to interactive HTML (sortable, filterable) | New `SearchHtmlExportService` — port PS lines 2112-2233; sortable columns via data attributes |
| DUPL-01 | User can scan for duplicate files by name, size, creation date, modification date | Search API (same as SRCH) + client-side GroupBy composite key; no content hashing needed |
| DUPL-02 | User can scan for duplicate folders by name, subfolder count, file count | `SharePointPaginationHelper.GetAllItemsAsync` with CAML `FSObjType=1`; read `FolderChildCount`, `ItemChildCount` from field values |
| DUPL-03 | User can export duplicate report to HTML with grouped display and visual indicators | New `DuplicatesHtmlExportService` — port PS lines 2235-2406; collapsible group cards, ok/diff badges |
</phase_requirements>
---
## Summary
Phase 3 introduces three feature areas (Storage Metrics, File Search, Duplicate Detection), each requiring a dedicated ViewModel, View, Service, and export services. All three areas can be implemented without adding new NuGet packages — `Microsoft.SharePoint.Client.Search.dll` is already in the output folder as a transitive dependency of PnP.Framework 1.18.0.
**Storage** uses CSOM `Folder.StorageMetrics` (loaded via `ctx.Load(folder, f => f.StorageMetrics)`). One CSOM round-trip per folder. Version size is derived as `TotalSize - TotalFileStreamSize`. The data model is a recursive tree (site → library → folder → subfolder), flattened to a `DataGrid` with an indent-level column for WPF display. The HTML export ports the PS `Export-StorageToHTML` function (PS lines 1621-1780) with its toggle(i) JS pattern.
**File Search** uses `Microsoft.SharePoint.Client.Search.Query.KeywordQuery` + `SearchExecutor`. KQL is assembled from UI filter fields (extension, date range, creator, editor, library path). Pagination is `StartRow += 500` per batch; the hard ceiling is `StartRow = 50,000` (SharePoint Search boundary), which means the 50,000 max-results requirement (SRCH-02) is exactly the platform limit. Client-side Regex is applied after retrieval. The HTML export ports PS lines 2112-2233.
**Duplicate Detection** uses the same Search API for file duplicates (with all documents query) and `SharePointPaginationHelper.GetAllItemsAsync` with FSObjType CAML filter for folder duplicates. Items are grouped client-side by a composite key (name + optional size/dates/counts). No content hashing is needed — the DUPL-01/02/03 requirements specify name+size+dates, which exactly matches the PS reference implementation.
**Primary recommendation:** Three ViewModels (StorageViewModel, SearchViewModel, DuplicatesViewModel), three service interfaces, six export services (storage CSV/HTML, search CSV/HTML, duplicates HTML — duplicates CSV is bonus), all extending existing Phase 2 patterns.
---
## User Constraints
No CONTEXT.md exists for Phase 3 (no /gsd:discuss-phase was run). All decisions below are from the locked technology stack in the prompt.
### Locked Decisions
- .NET 10 LTS + WPF + MVVM (CommunityToolkit.Mvvm 8.4.2)
- PnP.Framework 1.18.0 (CSOM-based SharePoint access)
- No new major packages preferred — only add if truly necessary
- Microsoft.Extensions.Hosting DI
- Serilog logging
- xUnit 2.9.3 tests
### Deferred / Out of Scope
- Content hashing for duplicate detection (v2)
- Storage charts/graphs (v2 requirement VIZZ-01/02/03)
- Cross-tenant file search
---
## Standard Stack
### Core (no new packages needed)
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| PnP.Framework | 1.18.0 | CSOM access, `ClientContext` | Already in project |
| Microsoft.SharePoint.Client.Search.dll | (via PnP.Framework) | `KeywordQuery`, `SearchExecutor` | Transitive dep — confirmed present in `bin/Debug/net10.0-windows/` |
| CommunityToolkit.Mvvm | 8.4.2 | `[ObservableProperty]`, `AsyncRelayCommand` | Already in project |
| Microsoft.Extensions.Hosting | 10.x | DI container | Already in project |
| Serilog | 4.3.1 | Structured logging | Already in project |
| xUnit | 2.9.3 | Tests | Already in project |
| Moq | 4.20.72 | Mock interfaces in tests | Already in project |
**No new NuGet packages required.** `Microsoft.SharePoint.Client.Search.dll` ships as a transitive dependency of PnP.Framework — confirmed present at `SharepointToolbox/bin/Debug/net10.0-windows/Microsoft.SharePoint.Client.Search.dll`.
### New Models Needed
| Model | Location | Fields |
|-------|----------|--------|
| `StorageNode` | `Core/Models/StorageNode.cs` | `string Name`, `string Url`, `string SiteTitle`, `string Library`, `long TotalSizeBytes`, `long FileStreamSizeBytes`, `long TotalFileCount`, `DateTime? LastModified`, `int IndentLevel`, `List<StorageNode> Children` |
| `SearchResult` | `Core/Models/SearchResult.cs` | `string Title`, `string Path`, `string FileExtension`, `DateTime? Created`, `DateTime? LastModified`, `string Author`, `string ModifiedBy`, `long SizeBytes` |
| `DuplicateGroup` | `Core/Models/DuplicateGroup.cs` | `string GroupKey`, `string Name`, `List<DuplicateItem> Items` |
| `DuplicateItem` | `Core/Models/DuplicateItem.cs` | `string Name`, `string Path`, `string Library`, `long? SizeBytes`, `DateTime? Created`, `DateTime? Modified`, `int? FolderCount`, `int? FileCount` |
| `StorageScanOptions` | `Core/Models/StorageScanOptions.cs` | `bool PerLibrary`, `bool IncludeSubsites`, `int FolderDepth` |
| `SearchOptions` | `Core/Models/SearchOptions.cs` | `string[] Extensions`, `string? Regex`, `DateTime? CreatedAfter`, `DateTime? CreatedBefore`, `DateTime? ModifiedAfter`, `DateTime? ModifiedBefore`, `string? CreatedBy`, `string? ModifiedBy`, `string? Library`, `int MaxResults` |
| `DuplicateScanOptions` | `Core/Models/DuplicateScanOptions.cs` | `string Mode` ("Files"/"Folders"), `bool MatchSize`, `bool MatchCreated`, `bool MatchModified`, `bool MatchSubfolderCount`, `bool MatchFileCount`, `bool IncludeSubsites`, `string? Library` |
---
## Architecture Patterns
### Recommended Project Structure (additions only)
```
SharepointToolbox/
├── Core/Models/
│ ├── StorageNode.cs # new
│ ├── SearchResult.cs # new
│ ├── DuplicateGroup.cs # new
│ ├── DuplicateItem.cs # new
│ ├── StorageScanOptions.cs # new
│ ├── SearchOptions.cs # new
│ └── DuplicateScanOptions.cs # new
├── Services/
│ ├── IStorageService.cs # new
│ ├── StorageService.cs # new
│ ├── ISearchService.cs # new
│ ├── SearchService.cs # new
│ ├── IDuplicatesService.cs # new
│ ├── DuplicatesService.cs # new
│ └── Export/
│ ├── StorageCsvExportService.cs # new
│ ├── StorageHtmlExportService.cs # new
│ ├── SearchCsvExportService.cs # new
│ ├── SearchHtmlExportService.cs # new
│ └── DuplicatesHtmlExportService.cs # new
├── ViewModels/Tabs/
│ ├── StorageViewModel.cs # new
│ ├── SearchViewModel.cs # new
│ └── DuplicatesViewModel.cs # new
└── Views/Tabs/
├── StorageView.xaml # new
├── StorageView.xaml.cs # new
├── SearchView.xaml # new
├── SearchView.xaml.cs # new
├── DuplicatesView.xaml # new
└── DuplicatesView.xaml.cs # new
```
### Pattern 1: CSOM StorageMetrics Load
**What:** Load `Folder.StorageMetrics` with a single round-trip per folder. StorageMetrics is a child object — you must include it in the Load expression or it will not be fetched.
**When to use:** Whenever reading storage data for a folder or library root.
**Example:**
```csharp
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics
// + https://longnlp.github.io/load-storage-metric-from-SPO
// Get folder by server-relative URL (library root or subfolder)
Folder folder = ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl);
ctx.Load(folder,
f => f.StorageMetrics, // pulls TotalSize, TotalFileStreamSize, TotalFileCount, LastModified
f => f.TimeLastModified, // alternative timestamp if StorageMetrics.LastModified is null
f => f.ServerRelativeUrl,
f => f.Name);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
long totalBytes = folder.StorageMetrics.TotalSize;
long streamBytes = folder.StorageMetrics.TotalFileStreamSize; // current-version files only
long versionBytes = Math.Max(0L, totalBytes - streamBytes); // version overhead
long fileCount = folder.StorageMetrics.TotalFileCount;
DateTime? lastMod = folder.StorageMetrics.IsPropertyAvailable("LastModified")
? folder.StorageMetrics.LastModified
: folder.TimeLastModified;
```
**Unit:** `TotalSize` and `TotalFileStreamSize` are in **bytes** (Int64). `TotalFileStreamSize` is the aggregate stream size for current-version file content only — it excludes version history, metadata, and attachments (confirmed by [MS-CSOMSPT]). Version storage = `TotalSize - TotalFileStreamSize`.
### Pattern 2: KQL Search with Pagination
**What:** Use `KeywordQuery` + `SearchExecutor` (in `Microsoft.SharePoint.Client.Search.Query`) to execute a KQL query, paginating 500 rows at a time via `StartRow`.
**When to use:** SRCH-01/02/03/04 (file search) and DUPL-01 (file duplicate detection).
**Example:**
```csharp
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor
// + https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/
using Microsoft.SharePoint.Client.Search.Query;
// namespace: Microsoft.SharePoint.Client.Search.Query
// assembly: Microsoft.SharePoint.Client.Search.dll (via PnP.Framework transitive dep)
var allResults = new List<IDictionary<string, object>>();
int startRow = 0;
const int batchSize = 500;
do
{
ct.ThrowIfCancellationRequested();
var kq = new KeywordQuery(ctx)
{
QueryText = kql, // e.g. "ContentType:Document AND FileExtension:pdf"
StartRow = startRow,
RowLimit = batchSize,
TrimDuplicates = false
};
// Explicit managed properties to retrieve
kq.SelectProperties.AddRange(new[]
{
"Title", "Path", "Author", "LastModifiedTime",
"FileExtension", "Created", "ModifiedBy", "Size"
});
var executor = new SearchExecutor(ctx);
ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
// Note: ctx.ExecuteQuery() is called inside ExecuteQueryRetryAsync — do NOT call again
var table = clientResult.Value
.FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults);
if (table == null) break;
int retrieved = table.RowCount;
foreach (System.Collections.Hashtable row in table.ResultRows)
{
allResults.Add(row.Cast<System.Collections.DictionaryEntry>()
.ToDictionary(e => e.Key.ToString()!, e => e.Value ?? string.Empty));
}
progress.Report(new OperationProgress(allResults.Count, maxResults, $"Retrieved {allResults.Count} results…"));
startRow += batchSize;
}
while (startRow < maxResults && startRow <= 50_000 // platform hard cap
&& allResults.Count < maxResults);
```
**Critical detail:** `ExecuteQueryRetryHelper.ExecuteQueryRetryAsync` wraps `ctx.ExecuteQuery()`. Call it AFTER `executor.ExecuteQuery(kq)` — do NOT call `ctx.ExecuteQuery()` directly afterward.
**StartRow limit:** SharePoint Search imposes a hard boundary of 50,000 for `StartRow`. With batch size 500, max pages = 100, max results = 50,000. This exactly satisfies SRCH-02.
**KQL field mappings (from PS reference lines 4747-4763):**
- Extension: `FileExtension:pdf OR FileExtension:docx`
- Created after/before: `Created>=2024-01-01` / `Created<=2024-12-31`
- Modified after/before: `Write>=2024-01-01` / `Write<=2024-12-31`
- Created by: `Author:"First Last"`
- Modified by: `ModifiedBy:"First Last"`
- Library path: `Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"`
- Documents only: `ContentType:Document`
### Pattern 3: Folder Enumeration for Duplicate Folders
**What:** Use `SharePointPaginationHelper.GetAllItemsAsync` with a CAML filter on `FSObjType = 1` (folders). Read `FolderChildCount` and `ItemChildCount` from `FieldValues`.
**When to use:** DUPL-02 (folder duplicate scan).
**Example:**
```csharp
// Source: PS reference lines 5010-5036; Phase 2 SharePointPaginationHelper pattern
var camlQuery = new CamlQuery
{
ViewXml = @"<View Scope='RecursiveAll'>
<Query>
<Where>
<Eq>
<FieldRef Name='FSObjType' />
<Value Type='Integer'>1</Value>
</Eq>
</Where>
</Query>
<RowLimit>2000</RowLimit>
</View>"
};
await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, list, camlQuery, ct))
{
var fv = item.FieldValues;
var name = fv["FileLeafRef"]?.ToString() ?? string.Empty;
var fileRef = fv["FileRef"]?.ToString() ?? string.Empty;
var subCount = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
var childCount = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
var fileCount = Math.Max(0, childCount - subCount);
var created = fv["Created"] is DateTime cr ? cr : (DateTime?)null;
var modified = fv["Modified"] is DateTime md ? md : (DateTime?)null;
// ...build DuplicateItem
}
```
### Pattern 4: Duplicate Composite Key (name+size+date grouping)
**What:** Build a string composite key from the fields the user selected, then `GroupBy(key).Where(g => g.Count() >= 2)`.
**When to use:** DUPL-01 (files) and DUPL-02 (folders).
**Example:**
```csharp
// Source: PS reference lines 4942-4949 (MakeKey function)
private static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
{
var parts = new List<string> { item.Name.ToLowerInvariant() };
if (opts.MatchSize && item.SizeBytes.HasValue) parts.Add(item.SizeBytes.Value.ToString());
if (opts.MatchCreated && item.Created.HasValue) parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
if (opts.MatchModified && item.Modified.HasValue) parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
if (opts.MatchSubfolderCount && item.FolderCount.HasValue) parts.Add(item.FolderCount.Value.ToString());
if (opts.MatchFileCount && item.FileCount.HasValue) parts.Add(item.FileCount.Value.ToString());
return string.Join("|", parts);
}
var groups = allItems
.GroupBy(i => MakeKey(i, opts))
.Where(g => g.Count() >= 2)
.Select(g => new DuplicateGroup
{
GroupKey = g.Key,
Name = g.First().Name,
Items = g.ToList()
})
.OrderByDescending(g => g.Items.Count)
.ToList();
```
### Pattern 5: Storage Recursive Tree → Flat Row List for DataGrid
**What:** Flatten the recursive tree (site → library → folder → subfolder) into a flat `List<StorageNode>` where each node carries an `IndentLevel`. The WPF `DataGrid` renders a `Margin` on the name cell based on `IndentLevel`.
**When to use:** STOR-01/02 WPF display.
**Rationale for DataGrid over TreeView:** WPF `TreeView` requires hierarchical `HierarchicalDataTemplate` and loses virtualization with deep nesting. A flat `DataGrid` with `VirtualizingPanel.IsVirtualizing="True"` stays performant for thousands of rows and is trivially sortable.
**Example:**
```csharp
// Flatten tree to observable list for DataGrid binding
private static void FlattenTree(StorageNode node, int level, List<StorageNode> result)
{
node.IndentLevel = level;
result.Add(node);
foreach (var child in node.Children)
FlattenTree(child, level + 1, result);
}
```
```xml
<!-- WPF DataGrid cell template for name column with indent -->
<DataGridTemplateColumn Header="Library / Folder" Width="*">
<DataGridTemplateColumn.CellTemplate>
<DataTemplate>
<TextBlock Text="{Binding Name}"
Margin="{Binding IndentLevel, Converter={StaticResource IndentConverter}}" />
</DataTemplate>
</DataGridTemplateColumn.CellTemplate>
</DataGridTemplateColumn>
```
Use `IValueConverter` mapping `IndentLevel``new Thickness(IndentLevel * 16, 0, 0, 0)`.
### Pattern 6: Storage HTML Collapsible Tree
**What:** The HTML export uses inline nested tables with `display:none` rows toggled by `toggle(i)` JS. Each library/folder that has children gets a unique numeric index.
**When to use:** STOR-05 export.
**Key design (from PS lines 1621-1780):**
- A global `_togIdx` counter assigns unique IDs to collapsible rows: `<tr id='sf-{i}' style='display:none'>`.
- A `<button onclick='toggle({i})'>` triggers `row.style.display = visible ? 'none' : 'table-row'`.
- Library rows embed a nested `<table class='sf-tbl'>` inside the collapsible row (colspan spanning all columns).
- This is a pure inline pattern — no external JS or CSS dependencies.
- In C# the counter is a field on `StorageHtmlExportService` reset at the start of each `BuildHtml()` call.
### Anti-Patterns to Avoid
- **Loading StorageMetrics without including it in ctx.Load:** `folder.StorageMetrics.TotalSize` throws `PropertyOrFieldNotInitializedException` if `StorageMetrics` is not included in the Load expression. Always use `ctx.Load(folder, f => f.StorageMetrics, ...)`.
- **Calling ctx.ExecuteQuery() after executor.ExecuteQuery(kq):** The search executor pattern requires calling `ctx.ExecuteQuery()` ONCE (inside `ExecuteQueryRetryAsync`). Calling it twice is a no-op at best, throws at worst.
- **StartRow > 50,000:** SharePoint Search hard boundary — will return zero results or error. Cap loop exit at `startRow <= 50_000`.
- **Modifying ObservableCollection from Task.Run:** Same rule as Phase 2 — accumulate in `List<T>` on background thread, then `Dispatcher.InvokeAsync(() => StorageResults = new ObservableCollection<T>(list))`.
- **Recursive CSOM calls without depth guard:** Without a depth guard, `Collect-FolderStorage` on a deep site can make thousands of CSOM round-trips. Always pass `MaxDepth` and check `currentDepth >= maxDepth` before recursing.
- **Building a TreeView for storage display:** WPF TreeView loses UI virtualization with more than ~1000 visible items. Use DataGrid with IndentLevel.
- **Version size from index:** The Search API's `Size` property is the current-version file size, not total including versions. Only `StorageMetrics.TotalFileStreamSize` vs `TotalSize` gives accurate version overhead.
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| CSOM throttle retry | Custom retry loop | `ExecuteQueryRetryHelper.ExecuteQueryRetryAsync` (Phase 1) | Already handles 429/503 with exponential backoff |
| List pagination | Raw `ExecuteQuery` loop | `SharePointPaginationHelper.GetAllItemsAsync` (Phase 1) | Handles 5000-item threshold, CAML position continuation |
| Search pagination | Manual `do/while` per search | Same `KeywordQuery`+`SearchExecutor` pattern (internal to SearchService) | Wrap in a helper method inside `SearchService` to avoid duplication across SRCH and DUPL features |
| HTML header/footer boilerplate | New template each export service | Copy from existing `HtmlExportService` pattern (Phase 2) | Consistent `<!DOCTYPE>`, viewport meta, `Segoe UI` font stack |
| CSV field escaping | Custom escaping | RFC 4180 `Csv()` helper pattern from Phase 2 `CsvExportService` | Already handles quotes, empty values, UTF-8 BOM |
| OperationProgress reporting | New progress model | `OperationProgress.Indeterminate(msg)` + `new OperationProgress(current, total, msg)` (Phase 1) | Already wired to UI via `FeatureViewModelBase` |
| Tenant context management | Directly create `ClientContext` | `ISessionManager.GetOrCreateContextAsync` (Phase 1) | Handles MSAL cache, per-tenant context pooling |
---
## Common Pitfalls
### Pitfall 1: StorageMetrics PropertyOrFieldNotInitializedException
**What goes wrong:** `folder.StorageMetrics.TotalSize` throws `PropertyOrFieldNotInitializedException` at runtime.
**Why it happens:** CSOM lazy-loading — if `StorageMetrics` is not in the Load expression, the proxy object exists but has no data.
**How to avoid:** Always include `f => f.StorageMetrics` in the `ctx.Load(folder, ...)` lambda.
**Warning signs:** Exception message contains "The property or field 'StorageMetrics' has not been initialized".
### Pitfall 2: Search ResultRows Type Is IDictionary-like But Not Strongly Typed
**What goes wrong:** Accessing `row["Size"]` returns object — Size comes back as a string `"12345"` not a long.
**Why it happens:** `ResultTable.ResultRows` is `IEnumerable<IDictionary<string, object>>`. All values are strings from the search index.
**How to avoid:** Always parse with `long.TryParse(row["Size"]?.ToString() ?? "0", out var sizeBytes)`. Strip non-numeric characters as PS does: `Regex.Replace(sizeStr, "[^0-9]", "")`.
**Warning signs:** `InvalidCastException` when binding Size to a numeric column.
### Pitfall 3: Search API Returns Duplicates for Versioned Files
**What goes wrong:** Files with many versions appear multiple times in results via `/_vti_history/` paths.
**Why it happens:** SharePoint indexes each version as a separate item in some cases.
**How to avoid:** Filter items where `Path.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase)` — port of PS line 4973.
**Warning signs:** Duplicate file paths in results with `_vti_history` segment.
### Pitfall 4: StorageMetrics.LastModified May Be DateTime.MinValue
**What goes wrong:** `LastModified` shows as 01/01/0001 for empty folders.
**Why it happens:** SharePoint returns a default DateTime for folders with no modifications.
**How to avoid:** Check `lastModified > DateTime.MinValue` before formatting. Fall back to `folder.TimeLastModified` if `StorageMetrics.LastModified` is unset.
**Warning signs:** "01/01/0001" in the LastModified column.
### Pitfall 5: KQL Query Text Exceeds 4096 Characters
**What goes wrong:** Search query silently fails or returns error for very long KQL strings.
**Why it happens:** SharePoint Search has a 4096-character KQL text boundary.
**How to avoid:** For extension filters with many extensions, use `(FileExtension:a OR FileExtension:b OR ...)` and validate total length before calling. Warn user if limit approached.
**Warning signs:** Zero results returned when many extensions entered; no CSOM exception.
### Pitfall 6: CAML FSObjType Field Name
**What goes wrong:** CAML query for folders returns no results.
**Why it happens:** The internal CAML field name is `FSObjType`, not `FileSystemObjectType`. Using the wrong name returns no matches silently.
**How to avoid:** Use `<FieldRef Name='FSObjType' />` (integer) with `<Value Type='Integer'>1</Value>`. Confirmed by PS reference line 5011 which uses CSOM `FileSystemObjectType.Folder` comparison.
**Warning signs:** Zero items returned from folder CAML query on a library known to have folders.
### Pitfall 7: StorageService Needs Web.ServerRelativeUrl to Compute Site-Relative Path
**What goes wrong:** `Get-PnPFolderStorageMetric -FolderSiteRelativeUrl` requires a path relative to the web root (e.g., `Shared Documents`), not the server root (e.g., `/sites/MySite/Shared Documents`).
**Why it happens:** CSOM `Folder.StorageMetrics` uses server-relative URLs, so you need to strip the web's ServerRelativeUrl prefix.
**How to avoid:** Load `ctx.Web.ServerRelativeUrl` first, then compute: `siteRelUrl = rootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/')`. Use `ctx.Web.GetFolderByServerRelativeUrl(siteAbsoluteUrl)` which accepts full server-relative paths.
**Warning signs:** 404/FileNotFoundException from CSOM when calling StorageMetrics.
---
## Code Examples
### Loading StorageMetrics (STOR-01/02/03)
```csharp
// Source: MS Learn — StorageMetrics Class; [MS-CSOMSPT] TotalFileStreamSize definition
ctx.Load(ctx.Web, w => w.ServerRelativeUrl, w => w.Url, w => w.Title);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
string webSrl = ctx.Web.ServerRelativeUrl.TrimEnd('/');
// Per-library: iterate document libraries
ctx.Load(ctx.Web.Lists, lists => lists.Include(
l => l.Title, l => l.BaseType, l => l.Hidden, l => l.RootFolder.ServerRelativeUrl));
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
foreach (var list in ctx.Web.Lists)
{
if (list.Hidden || list.BaseType != BaseType.DocumentLibrary) continue;
string siteRelUrl = list.RootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/');
Folder rootFolder = ctx.Web.GetFolderByServerRelativeUrl(list.RootFolder.ServerRelativeUrl);
ctx.Load(rootFolder,
f => f.StorageMetrics,
f => f.TimeLastModified,
f => f.ServerRelativeUrl);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
var node = new StorageNode
{
Name = list.Title,
Url = $"{ctx.Web.Url.TrimEnd('/')}/{siteRelUrl}",
SiteTitle = ctx.Web.Title,
Library = list.Title,
TotalSizeBytes = rootFolder.StorageMetrics.TotalSize,
FileStreamSizeBytes = rootFolder.StorageMetrics.TotalFileStreamSize,
TotalFileCount = rootFolder.StorageMetrics.TotalFileCount,
LastModified = rootFolder.StorageMetrics.LastModified > DateTime.MinValue
? rootFolder.StorageMetrics.LastModified
: rootFolder.TimeLastModified,
IndentLevel = 0,
Children = new List<StorageNode>()
};
// Recursive subfolder collection up to maxDepth
if (maxDepth > 0)
await CollectSubfoldersAsync(ctx, list.RootFolder.ServerRelativeUrl, node, 1, maxDepth, progress, ct);
}
```
### KQL Build from SearchOptions
```csharp
// Source: PS reference lines 4747-4763
private static string BuildKql(SearchOptions opts)
{
var parts = new List<string> { "ContentType:Document" };
if (opts.Extensions.Length > 0)
{
var extParts = opts.Extensions.Select(e => $"FileExtension:{e.TrimStart('.').ToLowerInvariant()}");
parts.Add($"({string.Join(" OR ", extParts)})");
}
if (opts.CreatedAfter.HasValue)
parts.Add($"Created>={opts.CreatedAfter.Value:yyyy-MM-dd}");
if (opts.CreatedBefore.HasValue)
parts.Add($"Created<={opts.CreatedBefore.Value:yyyy-MM-dd}");
if (opts.ModifiedAfter.HasValue)
parts.Add($"Write>={opts.ModifiedAfter.Value:yyyy-MM-dd}");
if (opts.ModifiedBefore.HasValue)
parts.Add($"Write<={opts.ModifiedBefore.Value:yyyy-MM-dd}");
if (!string.IsNullOrEmpty(opts.CreatedBy))
parts.Add($"Author:\"{opts.CreatedBy}\"");
if (!string.IsNullOrEmpty(opts.ModifiedBy))
parts.Add($"ModifiedBy:\"{opts.ModifiedBy}\"");
if (!string.IsNullOrEmpty(opts.Library))
parts.Add($"Path:\"{opts.SiteUrl.TrimEnd('/')}/{opts.Library.TrimStart('/')}*\"");
return string.Join(" AND ", parts);
}
```
### Parsing Search ResultRows
```csharp
// Source: PS reference lines 4971-4987
private static SearchResult ParseRow(IDictionary<string, object> row)
{
static string Str(IDictionary<string, object> r, string key) =>
r.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;
static DateTime? Date(IDictionary<string, object> r, string key)
{
var s = Str(r, key);
return DateTime.TryParse(s, out var dt) ? dt : null;
}
static long ParseSize(IDictionary<string, object> r, string key)
{
var raw = Str(r, key);
var digits = System.Text.RegularExpressions.Regex.Replace(raw, "[^0-9]", "");
return long.TryParse(digits, out var v) ? v : 0L;
}
return new SearchResult
{
Title = Str(row, "Title"),
Path = Str(row, "Path"),
FileExtension = Str(row, "FileExtension"),
Created = Date(row, "Created"),
LastModified = Date(row, "LastModifiedTime"),
Author = Str(row, "Author"),
ModifiedBy = Str(row, "ModifiedBy"),
SizeBytes = ParseSize(row, "Size")
};
}
```
---
## Localization Keys Needed
The following keys are needed for Phase 3 Views. Keys from the PS reference (lines 2747-2813) are remapped to the C# `Strings.resx` naming convention. Existing keys already in `Strings.resx` are marked with (existing).
### Storage Tab
| Key | EN Value | Notes |
|-----|----------|-------|
| `tab.storage` | `Storage` | (existing — already in Strings.resx line 77) |
| `chk.per.lib` | `Per-Library Breakdown` | new |
| `chk.subsites` | `Include Subsites` | new |
| `lbl.folder.depth` | `Folder depth:` | (existing — shared with permissions) |
| `chk.max.depth` | `Maximum (all levels)` | (existing — shared with permissions) |
| `stor.note` | `Note: deeper folder scans on large sites may take several minutes.` | new |
| `btn.gen.storage` | `Generate Metrics` | new |
| `btn.open.storage` | `Open Report` | new |
| `stor.col.library` | `Library` | new |
| `stor.col.site` | `Site` | new |
| `stor.col.files` | `Files` | new |
| `stor.col.size` | `Size` | new |
| `stor.col.versions` | `Versions` | new |
| `stor.col.lastmod` | `Last Modified` | new |
| `stor.col.share` | `Share of Total` | new |
### File Search Tab
| Key | EN Value | Notes |
|-----|----------|-------|
| `tab.search` | `File Search` | (existing — already in Strings.resx line 79) |
| `grp.search.filters` | `Search Filters` | new |
| `lbl.extensions` | `Extension(s):` | new |
| `ph.extensions` | `docx pdf xlsx` | new (placeholder) |
| `lbl.regex` | `Name / Regex:` | new |
| `ph.regex` | `Ex: report.* or \.bak$` | new (placeholder) |
| `chk.created.after` | `Created after:` | new |
| `chk.created.before` | `Created before:` | new |
| `chk.modified.after` | `Modified after:` | new |
| `chk.modified.before` | `Modified before:` | new |
| `lbl.created.by` | `Created by:` | new |
| `ph.created.by` | `First Last or email` | new (placeholder) |
| `lbl.modified.by` | `Modified by:` | new |
| `ph.modified.by` | `First Last or email` | new (placeholder) |
| `lbl.library` | `Library:` | new |
| `ph.library` | `Optional relative path e.g. Shared Documents` | new (placeholder) |
| `lbl.max.results` | `Max results:` | new |
| `btn.run.search` | `Run Search` | new |
| `btn.open.search` | `Open Results` | new |
| `srch.col.name` | `File Name` | new |
| `srch.col.ext` | `Extension` | new |
| `srch.col.created` | `Created` | new |
| `srch.col.modified` | `Modified` | new |
| `srch.col.author` | `Created By` | new |
| `srch.col.modby` | `Modified By` | new |
| `srch.col.size` | `Size` | new |
### Duplicates Tab
| Key | EN Value | Notes |
|-----|----------|-------|
| `tab.duplicates` | `Duplicates` | (existing — already in Strings.resx line 83) |
| `grp.dup.type` | `Duplicate Type` | new |
| `rad.dup.files` | `Duplicate files` | new |
| `rad.dup.folders` | `Duplicate folders` | new |
| `grp.dup.criteria` | `Comparison Criteria` | new |
| `lbl.dup.note` | `Name is always the primary criterion. Check additional criteria:` | new |
| `chk.dup.size` | `Same size` | new |
| `chk.dup.created` | `Same creation date` | new |
| `chk.dup.modified` | `Same modification date` | new |
| `chk.dup.subfolders` | `Same subfolder count` | new |
| `chk.dup.filecount` | `Same file count` | new |
| `chk.include.subsites` | `Include subsites` | new |
| `ph.dup.lib` | `All (leave empty)` | new (placeholder) |
| `btn.run.scan` | `Run Scan` | new |
| `btn.open.results` | `Open Results` | new |
---
## Duplicate Detection Scale — Known Concern Resolution
The STATE.md concern ("Duplicate detection at scale (100k+ files) — Graph API hash enumeration limits") is resolved: the PS reference does NOT use file hashes. It uses name+size+date grouping, which is exactly what DUPL-01/02/03 specify. The requirements do not mention hash-based deduplication.
**Scale analysis:**
- File duplicates use the Search API. SharePoint Search caps at 50,000 results (StartRow=50,000 max). A site with 100k+ files will be capped at 50,000 returned results. This is the same cap as SRCH-02, and is a known/accepted limitation.
- Folder duplicates use CAML pagination. `SharePointPaginationHelper.GetAllItemsAsync` handles arbitrary folder counts with RowLimit=2000 pagination — no effective upper bound.
- Client-side GroupBy on 50,000 items is instantaneous (Dictionary-based O(n) operation).
- **No Graph API or SHA256 content hashing is needed.** The concern was about a potential v2 enhancement not required by DUPL-01/02/03.
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `Get-PnPFolderStorageMetric` (PS cmdlet) | CSOM `Folder.StorageMetrics` | Phase 3 migration | One CSOM round-trip per folder; no PnP PS module required |
| `Submit-PnPSearchQuery` (PS cmdlet) | CSOM `KeywordQuery` + `SearchExecutor` | Phase 3 migration | Same pagination model; TrimDuplicates=false explicit |
| `Get-PnPListItem` for folders (PS) | `SharePointPaginationHelper.GetAllItemsAsync` with CAML | Phase 3 migration | Reuses Phase 1 helper; handles 5000-item threshold |
| Storage TreeView control | Flat DataGrid with IndentLevel + IValueConverter | Phase 3 design decision | Better UI virtualization for large sites |
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | xUnit 2.9.3 |
| Config file | none (SDK auto-discovery) |
| Quick run command | `dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj --filter "Category!=Integration" -x` |
| Full suite command | `dotnet test SharepointToolbox.slnx` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| STOR-01/02 | `StorageService.CollectStorageAsync` returns `StorageNode` list | unit (mock ISessionManager) | `dotnet test --filter "StorageServiceTests"` | ❌ Wave 0 |
| STOR-03 | VersionSizeBytes = TotalSizeBytes - FileStreamSizeBytes | unit | `dotnet test --filter "StorageNodeTests"` | ❌ Wave 0 |
| STOR-04 | `StorageCsvExportService.BuildCsv` produces correct header and rows | unit | `dotnet test --filter "StorageCsvExportServiceTests"` | ❌ Wave 0 |
| STOR-05 | `StorageHtmlExportService.BuildHtml` contains toggle JS and nested tables | unit | `dotnet test --filter "StorageHtmlExportServiceTests"` | ❌ Wave 0 |
| SRCH-01 | `SearchService` builds correct KQL from `SearchOptions` | unit | `dotnet test --filter "SearchServiceTests"` | ❌ Wave 0 |
| SRCH-02 | Search loop exits when `startRow > 50_000` | unit | `dotnet test --filter "SearchServiceTests"` | ❌ Wave 0 |
| SRCH-03 | `SearchCsvExportService.BuildCsv` produces correct header | unit | `dotnet test --filter "SearchCsvExportServiceTests"` | ❌ Wave 0 |
| SRCH-04 | `SearchHtmlExportService.BuildHtml` contains sort JS and filter input | unit | `dotnet test --filter "SearchHtmlExportServiceTests"` | ❌ Wave 0 |
| DUPL-01 | `MakeKey` function groups identical name+size+date items | unit | `dotnet test --filter "DuplicatesServiceTests"` | ❌ Wave 0 |
| DUPL-02 | CAML query targets `FSObjType=1`; `FileCount = ItemChildCount - FolderChildCount` | unit (logic only) | `dotnet test --filter "DuplicatesServiceTests"` | ❌ Wave 0 |
| DUPL-03 | `DuplicatesHtmlExportService.BuildHtml` contains group cards with ok/diff badges | unit | `dotnet test --filter "DuplicatesHtmlExportServiceTests"` | ❌ Wave 0 |
**Note:** `StorageService`, `SearchService`, and `DuplicatesService` depend on live CSOM — service-level tests use Skip like `PermissionsServiceTests`. ViewModel tests use Moq for `IStorageService`, `ISearchService`, `IDuplicatesService` following `PermissionsViewModelTests` pattern. Export service tests are fully unit-testable (no CSOM).
### Sampling Rate
- **Per task commit:** `dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj -x`
- **Per wave merge:** `dotnet test SharepointToolbox.slnx`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `SharepointToolbox.Tests/Services/StorageServiceTests.cs` — covers STOR-01/02 (stub + Skip like PermissionsServiceTests)
- [ ] `SharepointToolbox.Tests/Services/Export/StorageCsvExportServiceTests.cs` — covers STOR-04
- [ ] `SharepointToolbox.Tests/Services/Export/StorageHtmlExportServiceTests.cs` — covers STOR-05
- [ ] `SharepointToolbox.Tests/Services/SearchServiceTests.cs` — covers SRCH-01/02 (KQL build + pagination cap logic)
- [ ] `SharepointToolbox.Tests/Services/Export/SearchCsvExportServiceTests.cs` — covers SRCH-03
- [ ] `SharepointToolbox.Tests/Services/Export/SearchHtmlExportServiceTests.cs` — covers SRCH-04
- [ ] `SharepointToolbox.Tests/Services/DuplicatesServiceTests.cs` — covers DUPL-01/02 composite key logic
- [ ] `SharepointToolbox.Tests/Services/Export/DuplicatesHtmlExportServiceTests.cs` — covers DUPL-03
- [ ] `SharepointToolbox.Tests/ViewModels/StorageViewModelTests.cs` — covers STOR-01 ViewModel (Moq IStorageService)
- [ ] `SharepointToolbox.Tests/ViewModels/SearchViewModelTests.cs` — covers SRCH-01/02 ViewModel
- [ ] `SharepointToolbox.Tests/ViewModels/DuplicatesViewModelTests.cs` — covers DUPL-01/02 ViewModel
---
## Open Questions
1. **StorageMetrics.LastModified vs TimeLastModified**
- What we know: `StorageMetrics.LastModified` exists per the API docs. `Folder.TimeLastModified` is a separate CSOM property.
- What's unclear: Whether `StorageMetrics.LastModified` can return `DateTime.MinValue` for recently created empty folders in all SharePoint Online tenants.
- Recommendation: Load both (`f => f.StorageMetrics, f => f.TimeLastModified`) and prefer `StorageMetrics.LastModified` when it is `> DateTime.MinValue`, falling back to `TimeLastModified`.
2. **Search index freshness for duplicate detection**
- What we know: SharePoint Search is eventually consistent — newly created files may not appear for up to 15 minutes.
- What's unclear: Whether users expect real-time accuracy or accept eventual consistency.
- Recommendation: Document in UI that search-based results (files) reflect the search index, not the current state. Add a note in the log output.
3. **Multiple-site file search scope**
- What we know: The PS reference scopes search to `$siteUrl` context only (one site per search). SRCH-01 says "across sites" in the goal description but the requirements only specify search criteria, not multi-site.
- What's unclear: Whether SRCH-01 requires multi-site search in one operation or per-site.
- Recommendation: Implement per-site search (matching PS reference). Multi-site search would require separate `ClientContext` per site plus result merging — treat as a future enhancement.
---
## Sources
### Primary (HIGH confidence)
- [StorageMetrics Class — MS Learn CSOM reference](https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics?view=sharepoint-csom) — properties TotalSize, TotalFileStreamSize, TotalFileCount, LastModified confirmed
- [StorageMetrics.TotalSize — MS Learn](https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics.totalsize?view=sharepoint-csom) — confirmed as Int64, ReadOnly
- [[MS-CSOMSPT] TotalFileStreamSize](https://learn.microsoft.com/en-us/openspecs/sharepoint_protocols/ms-csomspt/635464fc-8505-43fa-97d7-02229acdb3c5) — confirmed definition: "Aggregate stream size in bytes for all files... Excludes version, metadata, list item attachment, and non-customized document sizes"
- [SearchExecutor Class — MS Learn CSOM reference](https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor?view=sharepoint-csom) — namespace `Microsoft.SharePoint.Client.Search.Query`, assembly `Microsoft.SharePoint.Client.Search.Portable.dll`
- [Search limits for SharePoint — MS Learn](https://learn.microsoft.com/en-us/sharepoint/search-limits) — StartRow max 50,000 (boundary), RowLimit max 500 (boundary) confirmed
- [SharepointToolbox/bin/Debug output] — `Microsoft.SharePoint.Client.Search.dll` confirmed present as transitive dep
### Secondary (MEDIUM confidence)
- [Load storage metric from SPO — longnlp.github.io](https://longnlp.github.io/load-storage-metric-from-SPO) — CSOM Load pattern: `ctx.Load(folder, f => f.StorageMetrics)` verified
- [Fetch all results from SharePoint Search using CSOM — usefulscripts.wordpress.com](https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/) — KeywordQuery + SearchExecutor pagination pattern with StartRow; confirmed against official docs
- PowerShell reference `Sharepoint_ToolBox.ps1` lines 1621-1780 (Export-StorageToHTML), 2112-2233 (Export-SearchResultsToHTML), 2235-2406 (Export-DuplicatesToHTML), 4432-4534 (storage scan), 4747-4808 (file search), 4937-5059 (duplicate scan) — authoritative reference implementation
### Tertiary (LOW confidence — implementation detail, verify when coding)
- [SharePoint CSOM Q&A — Getting size of subsite](https://learn.microsoft.com/en-us/answers/questions/1518977/getting-size-of-a-subsite-using-csom) — general pattern confirmed; specific edge cases not verified
- [Pagination for large result sets — MS Learn](https://learn.microsoft.com/en-us/sharepoint/dev/general-development/pagination-for-large-result-sets) — DocId-based pagination beyond 50k exists but is not needed for Phase 3
---
## Metadata
**Confidence breakdown:**
- Standard Stack: HIGH — no new packages needed; Search.dll confirmed present; all APIs verified against MS docs
- Architecture Patterns: HIGH — direct port of working PS reference; CSOM API shapes confirmed
- Pitfalls: HIGH for StorageMetrics loading, search result typing, vti_history filter (all from PS reference or official docs); MEDIUM for KQL length limit (documented but not commonly hit)
- Localization keys: HIGH — directly extracted from PS reference lines 2747-2813
**Research date:** 2026-04-02
**Valid until:** 2026-07-01 (CSOM APIs stable; SharePoint search limits stable; re-verify if PnP.Framework upgrades past 1.18)