# Phase 3: Storage and File Operations - Research
**Researched:** 2026-04-02
**Domain:** CSOM StorageMetrics, SharePoint KQL Search, WPF DataGrid, duplicate detection
**Confidence:** HIGH

---
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|-----------------|
| STOR-01 | User can view storage consumption per library on a site | CSOM `Folder.StorageMetrics` (one Load call per folder) + flat DataGrid with indent column |
| STOR-02 | User can view storage consumption per site with configurable folder depth | Recursive `Collect-FolderStorage` pattern translated to async CSOM; depth guard via split-count |
| STOR-03 | Storage metrics include total size, version size, item count, and last modified date | `StorageMetrics.TotalSize`, `TotalFileStreamSize`, `TotalFileCount`, `StorageMetrics.LastModified`; version size = TotalSize - TotalFileStreamSize |
| STOR-04 | User can export storage metrics to CSV | New `StorageCsvExportService` — same UTF-8 BOM pattern as Phase 2 |
| STOR-05 | User can export storage metrics to interactive HTML with collapsible tree view | New `StorageHtmlExportService` — port PS lines 1621-1780; toggle() JS + nested table rows |
| SRCH-01 | User can search files across sites using multiple criteria | `KeywordQuery` + `SearchExecutor` (CSOM search); KQL built from filter params; client-side Regex post-filter |
| SRCH-02 | User can configure maximum search results (up to 50,000) | SharePoint Search `StartRow` hard cap is 50,000 (boundary); 500 rows/batch × 100 pages = 50,000 max |
| SRCH-03 | User can export search results to CSV | New `SearchCsvExportService` |
| SRCH-04 | User can export search results to interactive HTML (sortable, filterable) | New `SearchHtmlExportService` — port PS lines 2112-2233; sortable columns via data attributes |
| DUPL-01 | User can scan for duplicate files by name, size, creation date, modification date | Search API (same as SRCH) + client-side GroupBy composite key; no content hashing needed |
| DUPL-02 | User can scan for duplicate folders by name, subfolder count, file count | `SharePointPaginationHelper.GetAllItemsAsync` with CAML `FSObjType=1`; read `FolderChildCount`, `ItemChildCount` from field values |
| DUPL-03 | User can export duplicate report to HTML with grouped display and visual indicators | New `DuplicatesHtmlExportService` — port PS lines 2235-2406; collapsible group cards, ok/diff badges |
</phase_requirements>
---
## Summary
Phase 3 introduces three feature areas (Storage Metrics, File Search, Duplicate Detection), each requiring a dedicated ViewModel, View, Service, and export services. All three areas can be implemented without adding new NuGet packages — `Microsoft.SharePoint.Client.Search.dll` is already in the output folder as a transitive dependency of PnP.Framework 1.18.0.
**Storage** uses CSOM `Folder.StorageMetrics` (loaded via `ctx.Load(folder, f => f.StorageMetrics)`). One CSOM round-trip per folder. Version size is derived as `TotalSize - TotalFileStreamSize`. The data model is a recursive tree (site → library → folder → subfolder), flattened to a `DataGrid` with an indent-level column for WPF display. The HTML export ports the PS `Export-StorageToHTML` function (PS lines 1621-1780) with its toggle(i) JS pattern.
**File Search** uses `Microsoft.SharePoint.Client.Search.Query.KeywordQuery` + `SearchExecutor`. KQL is assembled from UI filter fields (extension, date range, creator, editor, library path). Pagination is `StartRow += 500` per batch; the hard ceiling is `StartRow = 50,000` (SharePoint Search boundary), which means the 50,000 max-results requirement (SRCH-02) is exactly the platform limit. Client-side Regex is applied after retrieval. The HTML export ports PS lines 2112-2233.
**Duplicate Detection** uses the same Search API for file duplicates (with an all-documents query) and `SharePointPaginationHelper.GetAllItemsAsync` with an `FSObjType` CAML filter for folder duplicates. Items are grouped client-side by a composite key (name + optional size/dates/counts). No content hashing is needed: the DUPL-01/02/03 requirements specify name+size+dates, which exactly matches the PS reference implementation.
**Primary recommendation:** Three ViewModels (StorageViewModel, SearchViewModel, DuplicatesViewModel), three service interfaces, and five export services (storage CSV/HTML, search CSV/HTML, duplicates HTML; a duplicates CSV export is an optional bonus), all extending existing Phase 2 patterns.

---
## User Constraints
No CONTEXT.md exists for Phase 3 (no /gsd:discuss-phase was run). All decisions below are from the locked technology stack in the prompt.
### Locked Decisions
- .NET 10 LTS + WPF + MVVM (CommunityToolkit.Mvvm 8.4.2)
- PnP.Framework 1.18.0 (CSOM-based SharePoint access)
- No new major packages preferred — only add if truly necessary
- Microsoft.Extensions.Hosting DI
- Serilog logging
- xUnit 2.9.3 tests
### Deferred / Out of Scope
- Content hashing for duplicate detection (v2)
- Storage charts/graphs (v2 requirement VIZZ-01/02/03)
- Cross-tenant file search
---
## Standard Stack
### Core (no new packages needed)
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| PnP.Framework | 1.18.0 | CSOM access, `ClientContext` | Already in project |
| Microsoft.SharePoint.Client.Search.dll | (via PnP.Framework) | `KeywordQuery`, `SearchExecutor` | Transitive dep — confirmed present in `bin/Debug/net10.0-windows/` |
| CommunityToolkit.Mvvm | 8.4.2 | `[ObservableProperty]`, `AsyncRelayCommand` | Already in project |
| Microsoft.Extensions.Hosting | 10.x | DI container | Already in project |
| Serilog | 4.3.1 | Structured logging | Already in project |
| xUnit | 2.9.3 | Tests | Already in project |
| Moq | 4.20.72 | Mock interfaces in tests | Already in project |
**No new NuGet packages required.** `Microsoft.SharePoint.Client.Search.dll` ships as a transitive dependency of PnP.Framework — confirmed present at `SharepointToolbox/bin/Debug/net10.0-windows/Microsoft.SharePoint.Client.Search.dll`.
### New Models Needed
| Model | Location | Fields |
|-------|----------|--------|
| `StorageNode` | `Core/Models/StorageNode.cs` | `string Name`, `string Url`, `string SiteTitle`, `string Library`, `long TotalSizeBytes`, `long FileStreamSizeBytes`, `long TotalFileCount`, `DateTime? LastModified`, `int IndentLevel`, `List<StorageNode> Children` |
| `SearchResult` | `Core/Models/SearchResult.cs` | `string Title`, `string Path`, `string FileExtension`, `DateTime? Created`, `DateTime? LastModified`, `string Author`, `string ModifiedBy`, `long SizeBytes` |
| `DuplicateGroup` | `Core/Models/DuplicateGroup.cs` | `string GroupKey`, `string Name`, `List<DuplicateItem> Items` |
| `DuplicateItem` | `Core/Models/DuplicateItem.cs` | `string Name`, `string Path`, `string Library`, `long? SizeBytes`, `DateTime? Created`, `DateTime? Modified`, `int? FolderCount`, `int? FileCount` |
| `StorageScanOptions` | `Core/Models/StorageScanOptions.cs` | `bool PerLibrary`, `bool IncludeSubsites`, `int FolderDepth` |
| `SearchOptions` | `Core/Models/SearchOptions.cs` | `string SiteUrl` (used by `BuildKql` for the `Path:` filter), `string[] Extensions`, `string? Regex`, `DateTime? CreatedAfter`, `DateTime? CreatedBefore`, `DateTime? ModifiedAfter`, `DateTime? ModifiedBefore`, `string? CreatedBy`, `string? ModifiedBy`, `string? Library`, `int MaxResults` |
| `DuplicateScanOptions` | `Core/Models/DuplicateScanOptions.cs` | `string Mode` ("Files"/"Folders"), `bool MatchSize`, `bool MatchCreated`, `bool MatchModified`, `bool MatchSubfolderCount`, `bool MatchFileCount`, `bool IncludeSubsites`, `string? Library` |
---
## Architecture Patterns
### Recommended Project Structure (additions only)
```
SharepointToolbox/
├── Core/Models/
│   ├── StorageNode.cs              # new
│   ├── SearchResult.cs             # new
│   ├── DuplicateGroup.cs           # new
│   ├── DuplicateItem.cs            # new
│   ├── StorageScanOptions.cs       # new
│   ├── SearchOptions.cs            # new
│   └── DuplicateScanOptions.cs     # new
├── Services/
│   ├── IStorageService.cs          # new
│   ├── StorageService.cs           # new
│   ├── ISearchService.cs           # new
│   ├── SearchService.cs            # new
│   ├── IDuplicatesService.cs       # new
│   ├── DuplicatesService.cs        # new
│   └── Export/
│       ├── StorageCsvExportService.cs      # new
│       ├── StorageHtmlExportService.cs     # new
│       ├── SearchCsvExportService.cs       # new
│       ├── SearchHtmlExportService.cs      # new
│       └── DuplicatesHtmlExportService.cs  # new
├── ViewModels/Tabs/
│   ├── StorageViewModel.cs         # new
│   ├── SearchViewModel.cs          # new
│   └── DuplicatesViewModel.cs      # new
└── Views/Tabs/
    ├── StorageView.xaml            # new
    ├── StorageView.xaml.cs         # new
    ├── SearchView.xaml             # new
    ├── SearchView.xaml.cs          # new
    ├── DuplicatesView.xaml         # new
    └── DuplicatesView.xaml.cs      # new
```
### Pattern 1: CSOM StorageMetrics Load
**What:** Load `Folder.StorageMetrics` with a single round-trip per folder. StorageMetrics is a child object — you must include it in the Load expression or it will not be fetched.
**When to use:** Whenever reading storage data for a folder or library root.
**Example:**
```csharp
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics
// + https://longnlp.github.io/load-storage-metric-from-SPO
// Get folder by server-relative URL (library root or subfolder)
Folder folder = ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl);
ctx.Load(folder,
    f => f.StorageMetrics,    // pulls TotalSize, TotalFileStreamSize, TotalFileCount, LastModified
    f => f.TimeLastModified,  // fallback timestamp if StorageMetrics.LastModified is unset (DateTime.MinValue)
    f => f.ServerRelativeUrl,
    f => f.Name);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

long totalBytes   = folder.StorageMetrics.TotalSize;
long streamBytes  = folder.StorageMetrics.TotalFileStreamSize;  // current-version files only
long versionBytes = Math.Max(0L, totalBytes - streamBytes);     // version overhead
long fileCount    = folder.StorageMetrics.TotalFileCount;
// LastModified is a non-nullable DateTime; an unset value surfaces as DateTime.MinValue (see Pitfall 4)
DateTime? lastMod = folder.StorageMetrics.LastModified > DateTime.MinValue
    ? folder.StorageMetrics.LastModified
    : folder.TimeLastModified;
```
**Unit:** `TotalSize` and `TotalFileStreamSize` are in **bytes** (Int64). `TotalFileStreamSize` is the aggregate stream size for current-version file content only — it excludes version history, metadata, and attachments (confirmed by [MS-CSOMSPT]). Version storage = `TotalSize - TotalFileStreamSize`.
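As a small illustration of the derivation above, the following sketch computes version overhead from the two byte counters and formats it for display. The helper names `VersionBytes` and `FormatBytes` are hypothetical and not part of the project, and the binary units (1 KB = 1024 bytes) are an assumption about the desired display format.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: version overhead is TotalSize minus TotalFileStreamSize, clamped at zero.
static long VersionBytes(long totalSize, long totalFileStreamSize) =>
    Math.Max(0L, totalSize - totalFileStreamSize);

// Hypothetical helper: human-readable size using binary units (assumed convention).
static string FormatBytes(long bytes)
{
    string[] units = { "B", "KB", "MB", "GB", "TB" };
    double value = bytes;
    int unit = 0;
    while (value >= 1024 && unit < units.Length - 1)
    {
        value /= 1024;
        unit++;
    }
    return string.Format(CultureInfo.InvariantCulture, "{0:0.##} {1}", value, units[unit]);
}

long total = 10_485_760;   // sample TotalSize: 10 MB
long stream = 7_340_032;   // sample TotalFileStreamSize: 7 MB
Console.WriteLine(FormatBytes(VersionBytes(total, stream)));  // prints 3 MB
```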
### Pattern 2: KQL Search with Pagination
**What:** Use `KeywordQuery` + `SearchExecutor` (in `Microsoft.SharePoint.Client.Search.Query`) to execute a KQL query, paginating 500 rows at a time via `StartRow`.
**When to use:** SRCH-01/02/03/04 (file search) and DUPL-01 (file duplicate detection).
**Example:**
```csharp
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor
// + https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/
using Microsoft.SharePoint.Client.Search.Query;
// namespace: Microsoft.SharePoint.Client.Search.Query
// assembly: Microsoft.SharePoint.Client.Search.dll (via PnP.Framework transitive dep)
var allResults = new List<IDictionary<string, object>>();
int startRow = 0;
const int batchSize = 500;

do
{
    ct.ThrowIfCancellationRequested();

    var kq = new KeywordQuery(ctx)
    {
        QueryText = kql,  // e.g. "ContentType:Document AND FileExtension:pdf"
        StartRow = startRow,
        RowLimit = batchSize,
        TrimDuplicates = false
    };
    // Explicit managed properties to retrieve (added one by one via Add)
    foreach (var prop in new[]
    {
        "Title", "Path", "Author", "LastModifiedTime",
        "FileExtension", "Created", "ModifiedBy", "Size"
    })
    {
        kq.SelectProperties.Add(prop);
    }

    var executor = new SearchExecutor(ctx);
    ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
    // ExecuteQueryRetryAsync calls ctx.ExecuteQuery() internally; do NOT call it again
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

    var table = clientResult.Value
        .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults);
    if (table == null) break;

    int retrieved = table.RowCount;
    if (retrieved == 0) break;  // no further rows; stop paging instead of issuing empty queries

    // ResultRows is IEnumerable<IDictionary<string, object>>; copy each row and replace null values
    foreach (IDictionary<string, object> row in table.ResultRows)
    {
        allResults.Add(row.ToDictionary(e => e.Key, e => e.Value ?? (object)string.Empty));
    }

    progress.Report(new OperationProgress(allResults.Count, maxResults, $"Retrieved {allResults.Count} results…"));
    startRow += batchSize;
}
while (startRow < maxResults && startRow <= 50_000  // platform hard cap
       && allResults.Count < maxResults);
```
**Critical detail:** `ExecuteQueryRetryHelper.ExecuteQueryRetryAsync` wraps `ctx.ExecuteQuery()`. Call it AFTER `executor.ExecuteQuery(kq)` — do NOT call `ctx.ExecuteQuery()` directly afterward.
**StartRow limit:** SharePoint Search imposes a hard boundary of 50,000 for `StartRow`. With batch size 500, max pages = 100, max results = 50,000. This exactly satisfies SRCH-02.
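The arithmetic behind that limit can be sketched as follows. This is only an illustration: `PlanStartRows` is a hypothetical helper (not part of the project) that lists the `StartRow` value of each batch for a requested result cap, assuming the 500-row batch size and the 50,000 boundary stated above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: lists the StartRow value of every batch for a requested result cap.
static List<int> PlanStartRows(int maxResults, int batchSize = 500, int platformCap = 50_000)
{
    int effectiveMax = Math.Min(maxResults, platformCap);  // never plan past the platform boundary
    var rows = new List<int>();
    for (int startRow = 0; startRow < effectiveMax; startRow += batchSize)
        rows.Add(startRow);
    return rows;
}

var plan = PlanStartRows(50_000);
Console.WriteLine($"{plan.Count} batches, last StartRow = {plan.Last()}");
// prints: 100 batches, last StartRow = 49500
```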
**KQL field mappings (from PS reference lines 4747-4763):**
- Extension: `FileExtension:pdf OR FileExtension:docx`
- Created after/before: `Created>=2024-01-01` / `Created<=2024-12-31`
- Modified after/before: `Write>=2024-01-01` / `Write<=2024-12-31`
- Created by: `Author:"First Last"`
- Modified by: `ModifiedBy:"First Last"`
- Library path: `Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"`
- Documents only: `ContentType:Document`
### Pattern 3: Folder Enumeration for Duplicate Folders
**What:** Use `SharePointPaginationHelper.GetAllItemsAsync` with a CAML filter on `FSObjType = 1` (folders). Read `FolderChildCount` and `ItemChildCount` from `FieldValues`.
**When to use:** DUPL-02 (folder duplicate scan).
**Example:**
```csharp
// Source: PS reference lines 5010-5036; Phase 2 SharePointPaginationHelper pattern
var camlQuery = new CamlQuery
{
    ViewXml = @"<View Scope='RecursiveAll'>
                  <Query>
                    <Where>
                      <Eq>
                        <FieldRef Name='FSObjType' />
                        <Value Type='Integer'>1</Value>
                      </Eq>
                    </Where>
                  </Query>
                  <RowLimit>2000</RowLimit>
                </View>"
};

await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, list, camlQuery, ct))
{
    var fv = item.FieldValues;
    var name       = fv["FileLeafRef"]?.ToString() ?? string.Empty;
    var fileRef    = fv["FileRef"]?.ToString() ?? string.Empty;
    var subCount   = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
    var childCount = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
    var fileCount  = Math.Max(0, childCount - subCount);
    var created    = fv["Created"] is DateTime cr ? cr : (DateTime?)null;
    var modified   = fv["Modified"] is DateTime md ? md : (DateTime?)null;
    // ...build DuplicateItem
}
```
### Pattern 4: Duplicate Composite Key (name+size+date grouping)
**What:** Build a string composite key from the fields the user selected, then `GroupBy(key).Where(g => g.Count() >= 2)`.
**When to use:** DUPL-01 (files) and DUPL-02 (folders).
**Example:**
```csharp
// Source: PS reference lines 4942-4949 (MakeKey function)
private static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
{
    var parts = new List<string> { item.Name.ToLowerInvariant() };
    if (opts.MatchSize && item.SizeBytes.HasValue) parts.Add(item.SizeBytes.Value.ToString());
    if (opts.MatchCreated && item.Created.HasValue) parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchModified && item.Modified.HasValue) parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchSubfolderCount && item.FolderCount.HasValue) parts.Add(item.FolderCount.Value.ToString());
    if (opts.MatchFileCount && item.FileCount.HasValue) parts.Add(item.FileCount.Value.ToString());
    return string.Join("|", parts);
}

var groups = allItems
    .GroupBy(i => MakeKey(i, opts))
    .Where(g => g.Count() >= 2)
    .Select(g => new DuplicateGroup
    {
        GroupKey = g.Key,
        Name = g.First().Name,
        Items = g.ToList()
    })
    .OrderByDescending(g => g.Items.Count)
    .ToList();
```
### Pattern 5: Storage Recursive Tree → Flat Row List for DataGrid
**What:** Flatten the recursive tree (site → library → folder → subfolder) into a flat `List<StorageNode>` where each node carries an `IndentLevel`. The WPF `DataGrid` renders a `Margin` on the name cell based on `IndentLevel`.
**When to use:** STOR-01/02 WPF display.
**Rationale for DataGrid over TreeView:** WPF `TreeView` requires hierarchical `HierarchicalDataTemplate` and loses virtualization with deep nesting. A flat `DataGrid` with `VirtualizingPanel.IsVirtualizing="True"` stays performant for thousands of rows and is trivially sortable.
**Example:**
```csharp
// Flatten tree to observable list for DataGrid binding
private static void FlattenTree(StorageNode node, int level, List<StorageNode> result)
{
    node.IndentLevel = level;
    result.Add(node);
    foreach (var child in node.Children)
        FlattenTree(child, level + 1, result);
}
```
```xml
<!-- WPF DataGrid cell template for name column with indent -->
<DataGridTemplateColumn Header="Library / Folder" Width="*">
    <DataGridTemplateColumn.CellTemplate>
        <DataTemplate>
            <TextBlock Text="{Binding Name}"
                       Margin="{Binding IndentLevel, Converter={StaticResource IndentConverter}}" />
        </DataTemplate>
    </DataGridTemplateColumn.CellTemplate>
</DataGridTemplateColumn>
```
Use an `IValueConverter` mapping `IndentLevel` → `new Thickness(IndentLevel * 16, 0, 0, 0)`.
### Pattern 6: Storage HTML Collapsible Tree
**What:** The HTML export uses inline nested tables with `display:none` rows toggled by `toggle(i)` JS. Each library/folder that has children gets a unique numeric index.
**When to use:** STOR-05 export.
**Key design (from PS lines 1621-1780):**
- A global `_togIdx` counter assigns unique IDs to collapsible rows: `<tr id='sf-{i}' style='display:none'>`.
- A `<button onclick='toggle({i})'>` triggers `row.style.display = visible ? 'none' : 'table-row'`.
- Library rows embed a nested `<table class='sf-tbl'>` inside the collapsible row (colspan spanning all columns).
- This is a pure inline pattern — no external JS or CSS dependencies.
- In C# the counter is a field on `StorageHtmlExportService` reset at the start of each `BuildHtml()` call.
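The key design points above can be sketched roughly as follows. This is a minimal illustration, not the ported PS output: the local function `BuildLibraryRows`, the `+` button label, and the exact markup are assumptions; only the `sf-{i}` row IDs, the `toggle(i)` call, the `sf-tbl` class, and the counter idea come from the notes above.

```csharp
using System;
using System.Net;
using System.Text;

// Counter mirroring _togIdx; in the export service it would be reset at the start of each BuildHtml() call.
int togIdx = 0;

// Hypothetical helper: emits a visible header row plus a hidden row holding the nested table.
string BuildLibraryRows(string libraryName, string nestedTableHtml, int columnCount)
{
    int i = togIdx++;  // unique index shared by the button and the collapsible row
    var sb = new StringBuilder();
    sb.Append($"<tr><td colspan='{columnCount}'>")
      .Append($"<button onclick='toggle({i})'>+</button> ")
      .Append(WebUtility.HtmlEncode(libraryName))
      .Append("</td></tr>\n");
    sb.Append($"<tr id='sf-{i}' style='display:none'><td colspan='{columnCount}'>")
      .Append(nestedTableHtml)
      .Append("</td></tr>\n");
    return sb.ToString();
}

// Inline script, so the exported file has no external JS dependency.
const string ToggleScript = @"<script>
function toggle(i) {
  var row = document.getElementById('sf-' + i);
  row.style.display = (row.style.display === 'none') ? 'table-row' : 'none';
}
</script>";

Console.WriteLine(ToggleScript);
Console.WriteLine(BuildLibraryRows("Shared Documents", "<table class='sf-tbl'></table>", 6));
```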
### Anti-Patterns to Avoid
- **Loading StorageMetrics without including it in ctx.Load:** `folder.StorageMetrics.TotalSize` throws `PropertyOrFieldNotInitializedException` if `StorageMetrics` is not included in the Load expression. Always use `ctx.Load(folder, f => f.StorageMetrics, ...)`.
- **Calling ctx.ExecuteQuery() after executor.ExecuteQuery(kq):** The search executor pattern requires calling `ctx.ExecuteQuery()` ONCE (inside `ExecuteQueryRetryAsync`). Calling it twice is a no-op at best, throws at worst.
- **StartRow > 50,000:** SharePoint Search hard boundary — will return zero results or error. Cap loop exit at `startRow <= 50_000`.
- **Modifying ObservableCollection from Task.Run:** Same rule as Phase 2 — accumulate in `List<T>` on background thread, then `Dispatcher.InvokeAsync(() => StorageResults = new ObservableCollection<T>(list))`.
- **Recursive CSOM calls without depth guard:** Without a depth guard, `Collect-FolderStorage` on a deep site can make thousands of CSOM round-trips. Always pass `MaxDepth` and check `currentDepth >= maxDepth` before recursing.
- **Building a TreeView for storage display:** WPF TreeView loses UI virtualization with more than ~1000 visible items. Use DataGrid with IndentLevel.
- **Version size from index:** The Search API's `Size` property is the current-version file size, not total including versions. Only `StorageMetrics.TotalFileStreamSize` vs `TotalSize` gives accurate version overhead.
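The depth guard mentioned in the list above can be sketched with a split-count on the folder path. Both `FolderDepth` and `ShouldRecurse` are hypothetical helpers for illustration; the real service may track depth as a recursion parameter instead.

```csharp
using System;

// Hypothetical helper: depth of a folder below its library root, counted by path segments.
static int FolderDepth(string libraryRootUrl, string folderUrl)
{
    string root = libraryRootUrl.TrimEnd('/');
    bool isRoot = folderUrl.TrimEnd('/').Equals(root, StringComparison.OrdinalIgnoreCase);
    bool isBelow = folderUrl.StartsWith(root + "/", StringComparison.OrdinalIgnoreCase);
    if (!isRoot && !isBelow)
        throw new ArgumentException("Folder is not under the library root.", nameof(folderUrl));
    if (isRoot) return 0;
    string relative = folderUrl.Substring(root.Length).Trim('/');
    return relative.Split('/', StringSplitOptions.RemoveEmptyEntries).Length;
}

// Hypothetical guard: stop recursing once currentDepth >= maxDepth.
static bool ShouldRecurse(int currentDepth, int maxDepth) => currentDepth < maxDepth;

Console.WriteLine(FolderDepth("/sites/x/Docs", "/sites/x/Docs/A/B"));  // prints 2
Console.WriteLine(ShouldRecurse(currentDepth: 2, maxDepth: 2));        // prints False
```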
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| CSOM throttle retry | Custom retry loop | `ExecuteQueryRetryHelper.ExecuteQueryRetryAsync` (Phase 1) | Already handles 429/503 with exponential backoff |
| List pagination | Raw `ExecuteQuery` loop | `SharePointPaginationHelper.GetAllItemsAsync` (Phase 1) | Handles 5000-item threshold, CAML position continuation |
| Search pagination | Manual `do/while` per search | Same `KeywordQuery`+`SearchExecutor` pattern (internal to SearchService) | Wrap in a helper method inside `SearchService` to avoid duplication across SRCH and DUPL features |
| HTML header/footer boilerplate | New template each export service | Copy from existing `HtmlExportService` pattern (Phase 2) | Consistent `<!DOCTYPE>`, viewport meta, `Segoe UI` font stack |
| CSV field escaping | Custom escaping | RFC 4180 `Csv()` helper pattern from Phase 2 `CsvExportService` | Already handles quotes, empty values, UTF-8 BOM |
| OperationProgress reporting | New progress model | `OperationProgress.Indeterminate(msg)` + `new OperationProgress(current, total, msg)` (Phase 1) | Already wired to UI via `FeatureViewModelBase` |
| Tenant context management | Directly create `ClientContext` | `ISessionManager.GetOrCreateContextAsync` (Phase 1) | Handles MSAL cache, per-tenant context pooling |
---
## Common Pitfalls
### Pitfall 1: StorageMetrics PropertyOrFieldNotInitializedException
**What goes wrong:** `folder.StorageMetrics.TotalSize` throws `PropertyOrFieldNotInitializedException` at runtime.
**Why it happens:** CSOM lazy-loading — if `StorageMetrics` is not in the Load expression, the proxy object exists but has no data.
**How to avoid:** Always include `f => f.StorageMetrics` in the `ctx.Load(folder, ...)` lambda.
**Warning signs:** Exception message contains "The property or field 'StorageMetrics' has not been initialized".
### Pitfall 2: Search ResultRows Type Is IDictionary-like But Not Strongly Typed
**What goes wrong:** Accessing `row["Size"]` returns object — Size comes back as a string `"12345"` not a long.
**Why it happens:** `ResultTable.ResultRows` is `IEnumerable<IDictionary<string, object>>`. All values are strings from the search index.
**How to avoid:** Always parse with `long.TryParse(row["Size"]?.ToString() ?? "0", out var sizeBytes)`. Strip non-numeric characters as PS does: `Regex.Replace(sizeStr, "[^0-9]", "")`.
**Warning signs:** `InvalidCastException` when binding Size to a numeric column.
### Pitfall 3: Search API Returns Duplicates for Versioned Files
**What goes wrong:** Files with many versions appear multiple times in results via `/_vti_history/` paths.
**Why it happens:** SharePoint indexes each version as a separate item in some cases.
**How to avoid:** Exclude any item whose `Path` contains `/_vti_history/`, i.e. drop rows where `Path.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase)` is true. This is a port of PS line 4973.
**Warning signs:** Duplicate file paths in results with `_vti_history` segment.
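A minimal sketch of that exclusion, working on plain path strings for brevity. `ExcludeVersionHistory` is a hypothetical helper, and the sample URLs are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: drop search hits that point into version history.
static List<string> ExcludeVersionHistory(IEnumerable<string> paths) =>
    paths.Where(p => !p.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase))
         .ToList();

var paths = new[]
{
    "https://tenant.sharepoint.com/sites/x/Shared Documents/report.docx",
    "https://tenant.sharepoint.com/sites/x/_vti_history/512/Shared Documents/report.docx",
};
Console.WriteLine(ExcludeVersionHistory(paths).Count);  // prints 1
```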
### Pitfall 4: StorageMetrics.LastModified May Be DateTime.MinValue
**What goes wrong:** `LastModified` shows as 01/01/0001 for empty folders.
**Why it happens:** SharePoint returns a default DateTime for folders with no modifications.
**How to avoid:** Check `lastModified > DateTime.MinValue` before formatting. Fall back to `folder.TimeLastModified` if `StorageMetrics.LastModified` is unset.
**Warning signs:** "01/01/0001" in the LastModified column.
### Pitfall 5: KQL Query Text Exceeds 4096 Characters
**What goes wrong:** Search query silently fails or returns error for very long KQL strings.
**Why it happens:** SharePoint Search has a 4096-character KQL text boundary.
**How to avoid:** For extension filters with many extensions, use `(FileExtension:a OR FileExtension:b OR ...)` and validate total length before calling. Warn user if limit approached.
**Warning signs:** Zero results returned when many extensions entered; no CSOM exception.
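A length check before the query is issued could look like the sketch below. `TryValidateKql` is a hypothetical helper and the message wording is illustrative; the 4096-character figure is the boundary cited in this pitfall.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns false (with a message) when the query text exceeds the boundary.
static bool TryValidateKql(string kql, out string error)
{
    const int maxKqlLength = 4096;  // boundary cited in this pitfall
    if (kql.Length > maxKqlLength)
    {
        error = $"KQL is {kql.Length} characters; the limit is {maxKqlLength}. Reduce the number of extensions.";
        return false;
    }
    error = string.Empty;
    return true;
}

// 300 extension clauses push the query text well past the boundary.
string kql = "ContentType:Document AND (" +
    string.Join(" OR ", Enumerable.Range(0, 300).Select(i => $"FileExtension:ext{i}")) + ")";
Console.WriteLine(TryValidateKql(kql, out var message) ? "ok" : message);
```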
### Pitfall 6: CAML FSObjType Field Name
**What goes wrong:** CAML query for folders returns no results.
**Why it happens:** The internal CAML field name is `FSObjType`, not `FileSystemObjectType`. Using the wrong name returns no matches silently.
**How to avoid:** Use `<FieldRef Name='FSObjType' />` (integer) with `<Value Type='Integer'>1</Value>`. Confirmed by PS reference line 5011 which uses CSOM `FileSystemObjectType.Folder` comparison.
**Warning signs:** Zero items returned from folder CAML query on a library known to have folders.
### Pitfall 7: StorageService Needs Web.ServerRelativeUrl to Compute Site-Relative Path
**What goes wrong:** The PS cmdlet `Get-PnPFolderStorageMetric -FolderSiteRelativeUrl` takes a path relative to the web root (e.g., `Shared Documents`), while CSOM works with server-relative paths (e.g., `/sites/MySite/Shared Documents`). Passing one form where the other is expected fails to resolve the folder.
**Why it happens:** CSOM exposes folder locations as server-relative URLs (`Folder.ServerRelativeUrl`), so a site-relative path has to be derived by stripping the web's `ServerRelativeUrl` prefix.
**How to avoid:** Load `ctx.Web.ServerRelativeUrl` first, then compute: `siteRelUrl = rootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/')`. When fetching the folder, pass the full server-relative path to `ctx.Web.GetFolderByServerRelativeUrl(...)`; use the site-relative form only where one is actually needed (for example, when building the display URL).
**Warning signs:** 404/FileNotFoundException from CSOM when calling StorageMetrics.
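The prefix stripping described in this pitfall can be sketched as a pure string helper. `ToSiteRelative` is a hypothetical name; the root-web case (`"/"`) is handled by trimming the trailing slash first.

```csharp
using System;

// Hypothetical helper: strips the web's server-relative prefix to obtain a site-relative path.
static string ToSiteRelative(string webServerRelativeUrl, string folderServerRelativeUrl)
{
    string webSrl = webServerRelativeUrl.TrimEnd('/');  // "/" (root web) becomes ""
    if (!folderServerRelativeUrl.StartsWith(webSrl, StringComparison.OrdinalIgnoreCase))
        throw new ArgumentException("Folder is not inside this web.", nameof(folderServerRelativeUrl));
    return folderServerRelativeUrl.Substring(webSrl.Length).TrimStart('/');
}

Console.WriteLine(ToSiteRelative("/sites/MySite", "/sites/MySite/Shared Documents"));  // prints Shared Documents
Console.WriteLine(ToSiteRelative("/", "/Shared Documents"));                            // prints Shared Documents
```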
---
## Code Examples
### Loading StorageMetrics (STOR-01/02/03)
```csharp
// Source: MS Learn — StorageMetrics Class; [MS-CSOMSPT] TotalFileStreamSize definition
ctx.Load(ctx.Web, w => w.ServerRelativeUrl, w => w.Url, w => w.Title);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
string webSrl = ctx.Web.ServerRelativeUrl.TrimEnd('/');

// Per-library: iterate document libraries
ctx.Load(ctx.Web.Lists, lists => lists.Include(
    l => l.Title, l => l.BaseType, l => l.Hidden, l => l.RootFolder.ServerRelativeUrl));
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

foreach (var list in ctx.Web.Lists)
{
    if (list.Hidden || list.BaseType != BaseType.DocumentLibrary) continue;

    string siteRelUrl = list.RootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/');
    Folder rootFolder = ctx.Web.GetFolderByServerRelativeUrl(list.RootFolder.ServerRelativeUrl);
    ctx.Load(rootFolder,
        f => f.StorageMetrics,
        f => f.TimeLastModified,
        f => f.ServerRelativeUrl);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

    var node = new StorageNode
    {
        Name = list.Title,
        Url = $"{ctx.Web.Url.TrimEnd('/')}/{siteRelUrl}",
        SiteTitle = ctx.Web.Title,
        Library = list.Title,
        TotalSizeBytes = rootFolder.StorageMetrics.TotalSize,
        FileStreamSizeBytes = rootFolder.StorageMetrics.TotalFileStreamSize,
        TotalFileCount = rootFolder.StorageMetrics.TotalFileCount,
        LastModified = rootFolder.StorageMetrics.LastModified > DateTime.MinValue
            ? rootFolder.StorageMetrics.LastModified
            : rootFolder.TimeLastModified,
        IndentLevel = 0,
        Children = new List<StorageNode>()
    };

    // Recursive subfolder collection up to maxDepth
    if (maxDepth > 0)
        await CollectSubfoldersAsync(ctx, list.RootFolder.ServerRelativeUrl, node, 1, maxDepth, progress, ct);
}
```
### KQL Build from SearchOptions
```csharp
// Source: PS reference lines 4747-4763
private static string BuildKql(SearchOptions opts)
{
    var parts = new List<string> { "ContentType:Document" };

    if (opts.Extensions.Length > 0)
    {
        var extParts = opts.Extensions.Select(e => $"FileExtension:{e.TrimStart('.').ToLowerInvariant()}");
        parts.Add($"({string.Join(" OR ", extParts)})");
    }
    if (opts.CreatedAfter.HasValue)
        parts.Add($"Created>={opts.CreatedAfter.Value:yyyy-MM-dd}");
    if (opts.CreatedBefore.HasValue)
        parts.Add($"Created<={opts.CreatedBefore.Value:yyyy-MM-dd}");
    if (opts.ModifiedAfter.HasValue)
        parts.Add($"Write>={opts.ModifiedAfter.Value:yyyy-MM-dd}");
    if (opts.ModifiedBefore.HasValue)
        parts.Add($"Write<={opts.ModifiedBefore.Value:yyyy-MM-dd}");
    if (!string.IsNullOrEmpty(opts.CreatedBy))
        parts.Add($"Author:\"{opts.CreatedBy}\"");
    if (!string.IsNullOrEmpty(opts.ModifiedBy))
        parts.Add($"ModifiedBy:\"{opts.ModifiedBy}\"");
    if (!string.IsNullOrEmpty(opts.Library))
        parts.Add($"Path:\"{opts.SiteUrl.TrimEnd('/')}/{opts.Library.TrimStart('/')}*\"");

    return string.Join(" AND ", parts);
}
```
### Parsing Search ResultRows
```csharp
// Source: PS reference lines 4971-4987
private static SearchResult ParseRow(IDictionary<string, object> row)
{
    static string Str(IDictionary<string, object> r, string key) =>
        r.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;

    static DateTime? Date(IDictionary<string, object> r, string key)
    {
        var s = Str(r, key);
        return DateTime.TryParse(s, out var dt) ? dt : null;
    }

    static long ParseSize(IDictionary<string, object> r, string key)
    {
        var raw = Str(r, key);
        var digits = System.Text.RegularExpressions.Regex.Replace(raw, "[^0-9]", "");
        return long.TryParse(digits, out var v) ? v : 0L;
    }

    return new SearchResult
    {
        Title = Str(row, "Title"),
        Path = Str(row, "Path"),
        FileExtension = Str(row, "FileExtension"),
        Created = Date(row, "Created"),
        LastModified = Date(row, "LastModifiedTime"),
        Author = Str(row, "Author"),
        ModifiedBy = Str(row, "ModifiedBy"),
        SizeBytes = ParseSize(row, "Size")
    };
}
```
---
## Localization Keys Needed
The following keys are needed for Phase 3 Views. Keys from the PS reference (lines 2747-2813) are remapped to the C# `Strings.resx` naming convention. Existing keys already in `Strings.resx` are marked with (existing).
### Storage Tab
| Key | EN Value | Notes |
|-----|----------|-------|
| `tab.storage` | `Storage` | (existing — already in Strings.resx line 77) |
| `chk.per.lib` | `Per-Library Breakdown` | new |
| `chk.subsites` | `Include Subsites` | new |
| `lbl.folder.depth` | `Folder depth:` | (existing — shared with permissions) |
| `chk.max.depth` | `Maximum (all levels)` | (existing — shared with permissions) |
| `stor.note` | `Note: deeper folder scans on large sites may take several minutes.` | new |
| `btn.gen.storage` | `Generate Metrics` | new |
| `btn.open.storage` | `Open Report` | new |
| `stor.col.library` | `Library` | new |
| `stor.col.site` | `Site` | new |
| `stor.col.files` | `Files` | new |
| `stor.col.size` | `Size` | new |
| `stor.col.versions` | `Versions` | new |
| `stor.col.lastmod` | `Last Modified` | new |
| `stor.col.share` | `Share of Total` | new |
### File Search Tab
| Key | EN Value | Notes |
|-----|----------|-------|
| `tab.search` | `File Search` | (existing — already in Strings.resx line 79) |
| `grp.search.filters` | `Search Filters` | new |
| `lbl.extensions` | `Extension(s):` | new |
| `ph.extensions` | `docx pdf xlsx` | new (placeholder) |
| `lbl.regex` | `Name / Regex:` | new |
| `ph.regex` | `Ex: report.* or \.bak$` | new (placeholder) |
| `chk.created.after` | `Created after:` | new |
| `chk.created.before` | `Created before:` | new |
| `chk.modified.after` | `Modified after:` | new |
| `chk.modified.before` | `Modified before:` | new |
| `lbl.created.by` | `Created by:` | new |
| `ph.created.by` | `First Last or email` | new (placeholder) |
| `lbl.modified.by` | `Modified by:` | new |
| `ph.modified.by` | `First Last or email` | new (placeholder) |
| `lbl.library` | `Library:` | new |
| `ph.library` | `Optional relative path e.g. Shared Documents` | new (placeholder) |
| `lbl.max.results` | `Max results:` | new |
| `btn.run.search` | `Run Search` | new |
| `btn.open.search` | `Open Results` | new |
| `srch.col.name` | `File Name` | new |
| `srch.col.ext` | `Extension` | new |
| `srch.col.created` | `Created` | new |
| `srch.col.modified` | `Modified` | new |
| `srch.col.author` | `Created By` | new |
| `srch.col.modby` | `Modified By` | new |
| `srch.col.size` | `Size` | new |
### Duplicates Tab
| Key | EN Value | Notes |
|-----|----------|-------|
| `tab.duplicates` | `Duplicates` | (existing — already in Strings.resx line 83) |
| `grp.dup.type` | `Duplicate Type` | new |
| `rad.dup.files` | `Duplicate files` | new |
| `rad.dup.folders` | `Duplicate folders` | new |
| `grp.dup.criteria` | `Comparison Criteria` | new |
| `lbl.dup.note` | `Name is always the primary criterion. Check additional criteria:` | new |
| `chk.dup.size` | `Same size` | new |
| `chk.dup.created` | `Same creation date` | new |
| `chk.dup.modified` | `Same modification date` | new |
| `chk.dup.subfolders` | `Same subfolder count` | new |
| `chk.dup.filecount` | `Same file count` | new |
| `chk.include.subsites` | `Include subsites` | new |
| `ph.dup.lib` | `All (leave empty)` | new (placeholder) |
| `btn.run.scan` | `Run Scan` | new |
| `btn.open.results` | `Open Results` | new |
---
## Duplicate Detection Scale — Known Concern Resolution
The STATE.md concern ("Duplicate detection at scale (100k+ files) — Graph API hash enumeration limits") is resolved: the PS reference does NOT use file hashes. It uses name+size+date grouping, which is exactly what DUPL-01/02/03 specify. The requirements do not mention hash-based deduplication.
**Scale analysis:**
- File duplicates use the Search API, which stops paging at StartRow=50,000. A site with 100k+ files is therefore truncated at roughly 50,000 returned results. This is the same cap as SRCH-02 and is a known, accepted limitation.
- Folder duplicates use CAML pagination. `SharePointPaginationHelper.GetAllItemsAsync` handles arbitrary folder counts with RowLimit=2000 pagination — no effective upper bound.
- Client-side GroupBy on 50,000 items is a single Dictionary-based O(n) pass, so its cost is negligible next to the search round-trips.
- **No Graph API or SHA256 content hashing is needed.** The concern was about a potential v2 enhancement not required by DUPL-01/02/03.
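The name+size+date grouping described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `FileEntry`, `DuplicateGrouping`, and `FindDuplicates` are hypothetical names, and the case-insensitive name match and day-precision date comparison are assumptions to verify against the PS reference.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical shape of one search hit; the real model may carry more fields.
public sealed record FileEntry(string Name, long SizeBytes, DateTime? Created, DateTime? Modified, string Path);

public static class DuplicateGrouping
{
    // Name is always part of the key; the other criteria are opt-in.
    // Assumption: names compare case-insensitively and dates at day precision.
    public static string MakeKey(FileEntry f, bool useSize, bool useCreated, bool useModified)
    {
        var parts = new List<string> { f.Name.ToLowerInvariant() };
        if (useSize) parts.Add(f.SizeBytes.ToString(CultureInfo.InvariantCulture));
        if (useCreated) parts.Add(f.Created?.Date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) ?? "");
        if (useModified) parts.Add(f.Modified?.Date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) ?? "");
        return string.Join("|", parts);
    }

    // Single O(n) pass: bucket every item by key, then keep buckets with 2+ members.
    public static List<List<FileEntry>> FindDuplicates(
        IEnumerable<FileEntry> files, bool useSize, bool useCreated, bool useModified)
    {
        var groups = new Dictionary<string, List<FileEntry>>();
        foreach (var f in files)
        {
            var key = MakeKey(f, useSize, useCreated, useModified);
            if (!groups.TryGetValue(key, out var bucket))
                groups[key] = bucket = new List<FileEntry>();
            bucket.Add(f);
        }
        return groups.Values.Where(g => g.Count > 1).ToList();
    }
}
```

Because the key is a plain string, the same function can serve folder duplicates by appending subfolder and file counts, assuming those values are available on the folder model.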
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `Get-PnPFolderStorageMetric` (PS cmdlet) | CSOM `Folder.StorageMetrics` | Phase 3 migration | One CSOM round-trip per folder; no PnP PS module required |
| `Submit-PnPSearchQuery` (PS cmdlet) | CSOM `KeywordQuery` + `SearchExecutor` | Phase 3 migration | Same pagination model; TrimDuplicates=false explicit |
| `Get-PnPListItem` for folders (PS) | `SharePointPaginationHelper.GetAllItemsAsync` with CAML | Phase 3 migration | Reuses Phase 1 helper; handles 5000-item threshold |
| Storage TreeView control | Flat DataGrid with IndentLevel + IValueConverter | Phase 3 design decision | Better UI virtualization for large sites |
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | xUnit 2.9.3 |
| Config file | none (SDK auto-discovery) |
| Quick run command | `dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj --filter "Category!=Integration"` |
| Full suite command | `dotnet test SharepointToolbox.slnx` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| STOR-01/02 | `StorageService.CollectStorageAsync` returns `StorageNode` list | unit (mock ISessionManager) | `dotnet test --filter "StorageServiceTests"` | ❌ Wave 0 |
| STOR-03 | VersionSizeBytes = TotalSizeBytes - FileStreamSizeBytes | unit | `dotnet test --filter "StorageNodeTests"` | ❌ Wave 0 |
| STOR-04 | `StorageCsvExportService.BuildCsv` produces correct header and rows | unit | `dotnet test --filter "StorageCsvExportServiceTests"` | ❌ Wave 0 |
| STOR-05 | `StorageHtmlExportService.BuildHtml` contains toggle JS and nested tables | unit | `dotnet test --filter "StorageHtmlExportServiceTests"` | ❌ Wave 0 |
| SRCH-01 | `SearchService` builds correct KQL from `SearchOptions` | unit | `dotnet test --filter "SearchServiceTests"` | ❌ Wave 0 |
| SRCH-02 | Search loop exits when `startRow > 50_000` | unit | `dotnet test --filter "SearchServiceTests"` | ❌ Wave 0 |
| SRCH-03 | `SearchCsvExportService.BuildCsv` produces correct header | unit | `dotnet test --filter "SearchCsvExportServiceTests"` | ❌ Wave 0 |
| SRCH-04 | `SearchHtmlExportService.BuildHtml` contains sort JS and filter input | unit | `dotnet test --filter "SearchHtmlExportServiceTests"` | ❌ Wave 0 |
| DUPL-01 | `MakeKey` function groups identical name+size+date items | unit | `dotnet test --filter "DuplicatesServiceTests"` | ❌ Wave 0 |
| DUPL-02 | CAML query targets `FSObjType=1`; `FileCount = ItemChildCount - FolderChildCount` | unit (logic only) | `dotnet test --filter "DuplicatesServiceTests"` | ❌ Wave 0 |
| DUPL-03 | `DuplicatesHtmlExportService.BuildHtml` contains group cards with ok/diff badges | unit | `dotnet test --filter "DuplicatesHtmlExportServiceTests"` | ❌ Wave 0 |
**Note:** `StorageService`, `SearchService`, and `DuplicatesService` depend on live CSOM, so their service-level tests are stubs marked with `Skip`, as in `PermissionsServiceTests`. ViewModel tests use Moq for `IStorageService`, `ISearchService`, and `IDuplicatesService`, following the `PermissionsViewModelTests` pattern. Export service tests are fully unit-testable (no CSOM).
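One way to make the SRCH-02 pagination cap testable without live CSOM is to isolate the loop behind a delegate. The sketch below is an assumption about how that could look: `SearchPaging`, `FetchAll`, and the `fetchPage` delegate are hypothetical names, and the real service would be async and wrap `SearchExecutor` calls. The constants mirror the documented StartRow (50,000) and RowLimit (500) boundaries.

```csharp
using System;
using System.Collections.Generic;

public static class SearchPaging
{
    public const int MaxStartRow = 50_000; // documented StartRow boundary
    public const int BatchSize = 500;      // documented RowLimit boundary

    // fetchPage(startRow, rowLimit) stands in for one search round-trip.
    public static List<T> FetchAll<T>(Func<int, int, IReadOnlyList<T>> fetchPage, int maxResults)
    {
        var results = new List<T>();
        var startRow = 0;

        // Exit when startRow passes the cap, the caller's limit is reached,
        // or the source returns an empty page.
        while (startRow <= MaxStartRow && results.Count < maxResults)
        {
            var page = fetchPage(startRow, BatchSize);
            if (page.Count == 0) break;
            results.AddRange(page);
            startRow += BatchSize;
        }

        // Trim any overshoot from the final page.
        if (results.Count > maxResults)
            results.RemoveRange(maxResults, results.Count - maxResults);
        return results;
    }
}
```

A unit test can then pass a fake `fetchPage` that records each `startRow` and assert that no request goes beyond 50,000.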
### Sampling Rate
- **Per task commit:** `dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj`
- **Per wave merge:** `dotnet test SharepointToolbox.slnx`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `SharepointToolbox.Tests/Services/StorageServiceTests.cs` — covers STOR-01/02 (stub + Skip like PermissionsServiceTests)
- [ ] `SharepointToolbox.Tests/Services/Export/StorageCsvExportServiceTests.cs` — covers STOR-04
- [ ] `SharepointToolbox.Tests/Services/Export/StorageHtmlExportServiceTests.cs` — covers STOR-05
- [ ] `SharepointToolbox.Tests/Services/SearchServiceTests.cs` — covers SRCH-01/02 (KQL build + pagination cap logic)
- [ ] `SharepointToolbox.Tests/Services/Export/SearchCsvExportServiceTests.cs` — covers SRCH-03
- [ ] `SharepointToolbox.Tests/Services/Export/SearchHtmlExportServiceTests.cs` — covers SRCH-04
- [ ] `SharepointToolbox.Tests/Services/DuplicatesServiceTests.cs` — covers DUPL-01/02 composite key logic
- [ ] `SharepointToolbox.Tests/Services/Export/DuplicatesHtmlExportServiceTests.cs` — covers DUPL-03
- [ ] `SharepointToolbox.Tests/ViewModels/StorageViewModelTests.cs` — covers STOR-01 ViewModel (Moq IStorageService)
- [ ] `SharepointToolbox.Tests/ViewModels/SearchViewModelTests.cs` — covers SRCH-01/02 ViewModel
- [ ] `SharepointToolbox.Tests/ViewModels/DuplicatesViewModelTests.cs` — covers DUPL-01/02 ViewModel
---
## Open Questions
1. **StorageMetrics.LastModified vs TimeLastModified**
- What we know: `StorageMetrics.LastModified` exists per the API docs. `Folder.TimeLastModified` is a separate CSOM property.
- What's unclear: Whether `StorageMetrics.LastModified` can return `DateTime.MinValue` for recently created empty folders in all SharePoint Online tenants.
- Recommendation: Load both (`f => f.StorageMetrics, f => f.TimeLastModified`) and prefer `StorageMetrics.LastModified` when it is `> DateTime.MinValue`, falling back to `TimeLastModified`.
2. **Search index freshness for duplicate detection**
- What we know: SharePoint Search is eventually consistent — newly created files may not appear for up to 15 minutes.
- What's unclear: Whether users expect real-time accuracy or accept eventual consistency.
- Recommendation: Document in the UI that search-based results (files) reflect the search index, not the current state. Add a note to the log output.
3. **Multiple-site file search scope**
- What we know: The PS reference scopes search to the `$siteUrl` context only (one site per search). The SRCH-01 goal description says "across sites", but the requirement itself specifies only search criteria, not multi-site scope.
- What's unclear: Whether SRCH-01 requires multi-site search in one operation or per-site.
- Recommendation: Implement per-site search (matching PS reference). Multi-site search would require separate `ClientContext` per site plus result merging — treat as a future enhancement.
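The fallback recommended in question 1 can be kept as a small pure function so it is unit-testable without CSOM. This is only a sketch: `LastModifiedResolver` and `Resolve` are hypothetical names, and both values are assumed to have been loaded already from `StorageMetrics.LastModified` and `Folder.TimeLastModified`.

```csharp
using System;

public static class LastModifiedResolver
{
    // Prefer the StorageMetrics value; fall back to the folder's own timestamp
    // when StorageMetrics reports DateTime.MinValue.
    public static DateTime Resolve(DateTime storageMetricsLastModified, DateTime timeLastModified)
        => storageMetricsLastModified > DateTime.MinValue
            ? storageMetricsLastModified
            : timeLastModified;
}
```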
---
## Sources
### Primary (HIGH confidence)
- [StorageMetrics Class — MS Learn CSOM reference](https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics?view=sharepoint-csom) — properties TotalSize, TotalFileStreamSize, TotalFileCount, LastModified confirmed
- [StorageMetrics.TotalSize — MS Learn](https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics.totalsize?view=sharepoint-csom) — confirmed as Int64, ReadOnly
- [[MS-CSOMSPT] TotalFileStreamSize](https://learn.microsoft.com/en-us/openspecs/sharepoint_protocols/ms-csomspt/635464fc-8505-43fa-97d7-02229acdb3c5) — confirmed definition: "Aggregate stream size in bytes for all files... Excludes version, metadata, list item attachment, and non-customized document sizes"
- [SearchExecutor Class — MS Learn CSOM reference](https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor?view=sharepoint-csom) — namespace `Microsoft.SharePoint.Client.Search.Query`, assembly `Microsoft.SharePoint.Client.Search.Portable.dll`
- [Search limits for SharePoint — MS Learn](https://learn.microsoft.com/en-us/sharepoint/search-limits) — StartRow max 50,000 (boundary), RowLimit max 500 (boundary) confirmed
- [SharepointToolbox/bin/Debug output] — `Microsoft.SharePoint.Client.Search.dll` confirmed present as transitive dep
### Secondary (MEDIUM confidence)
- [Load storage metric from SPO — longnlp.github.io](https://longnlp.github.io/load-storage-metric-from-SPO) — CSOM Load pattern: `ctx.Load(folder, f => f.StorageMetrics)` verified
- [Fetch all results from SharePoint Search using CSOM — usefulscripts.wordpress.com](https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/) — KeywordQuery + SearchExecutor pagination pattern with StartRow; confirmed against official docs
- PowerShell reference `Sharepoint_ToolBox.ps1` lines 1621-1780 (Export-StorageToHTML), 2112-2233 (Export-SearchResultsToHTML), 2235-2406 (Export-DuplicatesToHTML), 4432-4534 (storage scan), 4747-4808 (file search), 4937-5059 (duplicate scan) — authoritative reference implementation
### Tertiary (LOW confidence — implementation detail, verify when coding)
- [SharePoint CSOM Q&A — Getting size of subsite](https://learn.microsoft.com/en-us/answers/questions/1518977/getting-size-of-a-subsite-using-csom) — general pattern confirmed; specific edge cases not verified
- [Pagination for large result sets — MS Learn](https://learn.microsoft.com/en-us/sharepoint/dev/general-development/pagination-for-large-result-sets) — DocId-based pagination beyond 50k exists but is not needed for Phase 3
---
## Metadata
**Confidence breakdown:**
- Standard Stack: HIGH — no new packages needed; Search.dll confirmed present; all APIs verified against MS docs
- Architecture Patterns: HIGH — direct port of working PS reference; CSOM API shapes confirmed
- Pitfalls: HIGH for StorageMetrics loading, search result typing, vti_history filter (all from PS reference or official docs); MEDIUM for KQL length limit (documented but not commonly hit)
- Localization keys: HIGH — directly extracted from PS reference lines 2747-2813
**Research date:** 2026-04-02
**Valid until:** 2026-07-01 (CSOM APIs stable; SharePoint search limits stable; re-verify if PnP.Framework upgrades past 1.18)