# Phase 3: Storage and File Operations - Research

**Researched:** 2026-04-02
**Domain:** CSOM StorageMetrics, SharePoint KQL Search, WPF DataGrid, duplicate detection
**Confidence:** HIGH

---

## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| STOR-01 | User can view storage consumption per library on a site | CSOM `Folder.StorageMetrics` (one Load call per folder) + flat DataGrid with indent column |
| STOR-02 | User can view storage consumption per site with configurable folder depth | Recursive `Collect-FolderStorage` pattern translated to async CSOM; depth guard via split-count |
| STOR-03 | Storage metrics include total size, version size, item count, and last modified date | `StorageMetrics.TotalSize`, `TotalFileStreamSize`, `TotalFileCount`, `StorageMetrics.LastModified`; version size = TotalSize - TotalFileStreamSize |
| STOR-04 | User can export storage metrics to CSV | New `StorageCsvExportService` — same UTF-8 BOM pattern as Phase 2 |
| STOR-05 | User can export storage metrics to interactive HTML with collapsible tree view | New `StorageHtmlExportService` — port PS lines 1621-1780; toggle() JS + nested table rows |
| SRCH-01 | User can search files across sites using multiple criteria | `KeywordQuery` + `SearchExecutor` (CSOM search); KQL built from filter params; client-side Regex post-filter |
| SRCH-02 | User can configure maximum search results (up to 50,000) | SharePoint Search `StartRow` hard cap is 50,000 (boundary); 500 rows/batch × 100 pages = 50,000 max |
| SRCH-03 | User can export search results to CSV | New `SearchCsvExportService` |
| SRCH-04 | User can export search results to interactive HTML (sortable, filterable) | New `SearchHtmlExportService` — port PS lines 2112-2233; sortable columns via data attributes |
| DUPL-01 | User can scan for duplicate files by name, size, creation date, modification date | Search API (same as SRCH) + client-side GroupBy composite key; no content hashing needed |
| DUPL-02 | User can scan for duplicate folders by name, subfolder count, file count | `SharePointPaginationHelper.GetAllItemsAsync` with CAML `FSObjType=1`; read `FolderChildCount`, `ItemChildCount` from field values |
| DUPL-03 | User can export duplicate report to HTML with grouped display and visual indicators | New `DuplicatesHtmlExportService` — port PS lines 2235-2406; collapsible group cards, ok/diff badges |

---

## Summary

Phase 3 introduces three feature areas (Storage Metrics, File Search, Duplicate Detection), each requiring a dedicated ViewModel, View, Service, and export services. All three areas can be implemented without adding new NuGet packages — `Microsoft.SharePoint.Client.Search.dll` is already in the output folder as a transitive dependency of PnP.Framework 1.18.0.

**Storage** uses CSOM `Folder.StorageMetrics` (loaded via `ctx.Load(folder, f => f.StorageMetrics)`). One CSOM round-trip per folder. Version size is derived as `TotalSize - TotalFileStreamSize`. The data model is a recursive tree (site → library → folder → subfolder), flattened to a `DataGrid` with an indent-level column for WPF display. The HTML export ports the PS `Export-StorageToHTML` function (PS lines 1621-1780) with its toggle(i) JS pattern.

**File Search** uses `Microsoft.SharePoint.Client.Search.Query.KeywordQuery` + `SearchExecutor`. KQL is assembled from UI filter fields (extension, date range, creator, editor, library path). Pagination is `StartRow += 500` per batch; the hard ceiling is `StartRow = 50,000` (SharePoint Search boundary), which means the 50,000 max-results requirement (SRCH-02) is exactly the platform limit. Client-side Regex is applied after retrieval. The HTML export ports PS lines 2112-2233.

**Duplicate Detection** uses the same Search API for file duplicates (with an all-documents query) and `SharePointPaginationHelper.GetAllItemsAsync` with an FSObjType CAML filter for folder duplicates.
Items are grouped client-side by a composite key (name + optional size/dates/counts). No content hashing is needed — the DUPL-01/02/03 requirements specify name+size+dates, which exactly matches the PS reference implementation.

**Primary recommendation:** Three ViewModels (StorageViewModel, SearchViewModel, DuplicatesViewModel), three service interfaces, six export services (storage CSV/HTML, search CSV/HTML, duplicates HTML — duplicates CSV is bonus), all extending existing Phase 2 patterns.

---

## User Constraints

No CONTEXT.md exists for Phase 3 (no /gsd:discuss-phase was run). All decisions below are from the locked technology stack in the prompt.

### Locked Decisions

- .NET 10 LTS + WPF + MVVM (CommunityToolkit.Mvvm 8.4.2)
- PnP.Framework 1.18.0 (CSOM-based SharePoint access)
- No new major packages preferred — only add if truly necessary
- Microsoft.Extensions.Hosting DI
- Serilog logging
- xUnit 2.9.3 tests

### Deferred / Out of Scope

- Content hashing for duplicate detection (v2)
- Storage charts/graphs (v2 requirement VIZZ-01/02/03)
- Cross-tenant file search

---

## Standard Stack

### Core (no new packages needed)

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| PnP.Framework | 1.18.0 | CSOM access, `ClientContext` | Already in project |
| Microsoft.SharePoint.Client.Search.dll | (via PnP.Framework) | `KeywordQuery`, `SearchExecutor` | Transitive dep — confirmed present in `bin/Debug/net10.0-windows/` |
| CommunityToolkit.Mvvm | 8.4.2 | `[ObservableProperty]`, `AsyncRelayCommand` | Already in project |
| Microsoft.Extensions.Hosting | 10.x | DI container | Already in project |
| Serilog | 4.3.1 | Structured logging | Already in project |
| xUnit | 2.9.3 | Tests | Already in project |
| Moq | 4.20.72 | Mock interfaces in tests | Already in project |

**No new NuGet packages required.** `Microsoft.SharePoint.Client.Search.dll` ships as a transitive dependency of PnP.Framework — confirmed present at `SharepointToolbox/bin/Debug/net10.0-windows/Microsoft.SharePoint.Client.Search.dll`.

### New Models Needed

| Model | Location | Fields |
|-------|----------|--------|
| `StorageNode` | `Core/Models/StorageNode.cs` | `string Name`, `string Url`, `string SiteTitle`, `string Library`, `long TotalSizeBytes`, `long FileStreamSizeBytes`, `long TotalFileCount`, `DateTime? LastModified`, `int IndentLevel`, `List<StorageNode> Children` |
| `SearchResult` | `Core/Models/SearchResult.cs` | `string Title`, `string Path`, `string FileExtension`, `DateTime? Created`, `DateTime? LastModified`, `string Author`, `string ModifiedBy`, `long SizeBytes` |
| `DuplicateGroup` | `Core/Models/DuplicateGroup.cs` | `string GroupKey`, `string Name`, `List<DuplicateItem> Items` |
| `DuplicateItem` | `Core/Models/DuplicateItem.cs` | `string Name`, `string Path`, `string Library`, `long? SizeBytes`, `DateTime? Created`, `DateTime? Modified`, `int? FolderCount`, `int? FileCount` |
| `StorageScanOptions` | `Core/Models/StorageScanOptions.cs` | `bool PerLibrary`, `bool IncludeSubsites`, `int FolderDepth` |
| `SearchOptions` | `Core/Models/SearchOptions.cs` | `string[] Extensions`, `string? Regex`, `DateTime? CreatedAfter`, `DateTime? CreatedBefore`, `DateTime? ModifiedAfter`, `DateTime? ModifiedBefore`, `string? CreatedBy`, `string? ModifiedBy`, `string? Library`, `int MaxResults` |
| `DuplicateScanOptions` | `Core/Models/DuplicateScanOptions.cs` | `string Mode` ("Files"/"Folders"), `bool MatchSize`, `bool MatchCreated`, `bool MatchModified`, `bool MatchSubfolderCount`, `bool MatchFileCount`, `bool IncludeSubsites`, `string? Library` |

---

## Architecture Patterns

### Recommended Project Structure (additions only)

```
SharepointToolbox/
├── Core/Models/
│   ├── StorageNode.cs              # new
│   ├── SearchResult.cs             # new
│   ├── DuplicateGroup.cs           # new
│   ├── DuplicateItem.cs            # new
│   ├── StorageScanOptions.cs       # new
│   ├── SearchOptions.cs            # new
│   └── DuplicateScanOptions.cs     # new
├── Services/
│   ├── IStorageService.cs          # new
│   ├── StorageService.cs           # new
│   ├── ISearchService.cs           # new
│   ├── SearchService.cs            # new
│   ├── IDuplicatesService.cs       # new
│   ├── DuplicatesService.cs        # new
│   └── Export/
│       ├── StorageCsvExportService.cs      # new
│       ├── StorageHtmlExportService.cs     # new
│       ├── SearchCsvExportService.cs       # new
│       ├── SearchHtmlExportService.cs      # new
│       └── DuplicatesHtmlExportService.cs  # new
├── ViewModels/Tabs/
│   ├── StorageViewModel.cs         # new
│   ├── SearchViewModel.cs          # new
│   └── DuplicatesViewModel.cs      # new
└── Views/Tabs/
    ├── StorageView.xaml            # new
    ├── StorageView.xaml.cs         # new
    ├── SearchView.xaml             # new
    ├── SearchView.xaml.cs          # new
    ├── DuplicatesView.xaml         # new
    └── DuplicatesView.xaml.cs      # new
```

### Pattern 1: CSOM StorageMetrics Load

**What:** Load `Folder.StorageMetrics` with a single round-trip per folder. StorageMetrics is a child object — you must include it in the Load expression or it will not be fetched.

**When to use:** Whenever reading storage data for a folder or library root.
**Example:**

```csharp
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics
// + https://longnlp.github.io/load-storage-metric-from-SPO

// Get folder by server-relative URL (library root or subfolder)
Folder folder = ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl);
ctx.Load(folder,
    f => f.StorageMetrics,    // pulls TotalSize, TotalFileStreamSize, TotalFileCount, LastModified
    f => f.TimeLastModified,  // alternative timestamp if StorageMetrics.LastModified is null
    f => f.ServerRelativeUrl,
    f => f.Name);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

long totalBytes   = folder.StorageMetrics.TotalSize;
long streamBytes  = folder.StorageMetrics.TotalFileStreamSize;  // current-version files only
long versionBytes = Math.Max(0L, totalBytes - streamBytes);     // version overhead
long fileCount    = folder.StorageMetrics.TotalFileCount;
DateTime? lastMod = folder.StorageMetrics.IsPropertyAvailable("LastModified")
    ? folder.StorageMetrics.LastModified
    : folder.TimeLastModified;
```

**Unit:** `TotalSize` and `TotalFileStreamSize` are in **bytes** (Int64). `TotalFileStreamSize` is the aggregate stream size for current-version file content only — it excludes version history, metadata, and attachments (confirmed by [MS-CSOMSPT]). Version storage = `TotalSize - TotalFileStreamSize`.

### Pattern 2: KQL Search with Pagination

**What:** Use `KeywordQuery` + `SearchExecutor` (in `Microsoft.SharePoint.Client.Search.Query`) to execute a KQL query, paginating 500 rows at a time via `StartRow`.

**When to use:** SRCH-01/02/03/04 (file search) and DUPL-01 (file duplicate detection).
**Example:**

```csharp
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor
// + https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/

using Microsoft.SharePoint.Client.Search.Query;
// namespace: Microsoft.SharePoint.Client.Search.Query
// assembly:  Microsoft.SharePoint.Client.Search.dll (via PnP.Framework transitive dep)

var allResults = new List<Dictionary<string, object>>();
int startRow = 0;
int rowsInBatch = 0;
const int batchSize = 500;

do
{
    ct.ThrowIfCancellationRequested();

    var kq = new KeywordQuery(ctx)
    {
        QueryText = kql,       // e.g. "ContentType:Document AND FileExtension:pdf"
        StartRow = startRow,
        RowLimit = batchSize,
        TrimDuplicates = false
    };

    // Explicit managed properties to retrieve (SelectProperties exposes Add, not AddRange)
    foreach (var prop in new[] { "Title", "Path", "Author", "LastModifiedTime",
                                 "FileExtension", "Created", "ModifiedBy", "Size" })
    {
        kq.SelectProperties.Add(prop);
    }

    var executor = new SearchExecutor(ctx);
    ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
    // Note: ctx.ExecuteQuery() is called inside ExecuteQueryRetryAsync — do NOT call it again

    var table = clientResult.Value
        .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults);
    if (table == null) break;

    rowsInBatch = table.RowCount;  // a partial batch means the result set is exhausted
    foreach (System.Collections.Hashtable row in table.ResultRows)
    {
        allResults.Add(row.Cast<System.Collections.DictionaryEntry>()
            .ToDictionary(e => e.Key.ToString()!, e => e.Value ?? (object)string.Empty));
    }

    progress.Report(new OperationProgress(allResults.Count, maxResults,
        $"Retrieved {allResults.Count} results…"));
    startRow += batchSize;
} while (rowsInBatch == batchSize
         && startRow < maxResults
         && startRow <= 50_000           // platform hard cap
         && allResults.Count < maxResults);
```

**Critical detail:** `ExecuteQueryRetryHelper.ExecuteQueryRetryAsync` wraps `ctx.ExecuteQuery()`. Call it AFTER `executor.ExecuteQuery(kq)` — do NOT call `ctx.ExecuteQuery()` directly afterward.
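Each retrieved page lands in `allResults` as loosely typed rows keyed by managed property names. A small mapper can translate those rows into the `SearchResult` model from the table above. The sketch below is illustrative: `SearchResultMapper` and its helper methods are assumed names (not from the PS reference), and the inline `SearchResult` record simply mirrors the planned `Core/Models/SearchResult.cs` fields.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Mirrors the Core/Models/SearchResult.cs field list from the model table above.
public sealed record SearchResult(
    string Title, string Path, string FileExtension,
    DateTime? Created, DateTime? LastModified,
    string Author, string ModifiedBy, long SizeBytes);

// Hypothetical helper: maps one raw search row (managed property name -> value)
// onto the Phase 3 SearchResult model.
public static class SearchResultMapper
{
    public static SearchResult Map(IReadOnlyDictionary<string, object> row) =>
        new(
            Title:         Text(row, "Title"),
            Path:          Text(row, "Path"),
            FileExtension: Text(row, "FileExtension"),
            Created:       Date(row, "Created"),
            LastModified:  Date(row, "LastModifiedTime"), // managed property name differs from the model field
            Author:        Text(row, "Author"),
            ModifiedBy:    Text(row, "ModifiedBy"),
            SizeBytes:     long.TryParse(Text(row, "Size"), out var s) ? s : 0L);

    // Missing or null properties become empty strings / null dates rather than throwing.
    private static string Text(IReadOnlyDictionary<string, object> row, string key) =>
        row.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;

    private static DateTime? Date(IReadOnlyDictionary<string, object> row, string key) =>
        DateTime.TryParse(Text(row, key), CultureInfo.InvariantCulture,
                          DateTimeStyles.AdjustToUniversal, out var dt) ? dt : null;
}
```

Keeping the tolerant `Text`/`Date` helpers means a row that omits a managed property (which SharePoint search does for empty values) still maps cleanly instead of throwing.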
**StartRow limit:** SharePoint Search imposes a hard boundary of 50,000 for `StartRow`. With batch size 500, max pages = 100, max results = 50,000. This exactly satisfies SRCH-02.

**KQL field mappings (from PS reference lines 4747-4763):**

- Extension: `FileExtension:pdf OR FileExtension:docx`
- Created after/before: `Created>=2024-01-01` / `Created<=2024-12-31`
- Modified after/before: `Write>=2024-01-01` / `Write<=2024-12-31`
- Created by: `Author:"First Last"`
- Modified by: `ModifiedBy:"First Last"`
- Library path: `Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"`
- Documents only: `ContentType:Document`

### Pattern 3: Folder Enumeration for Duplicate Folders

**What:** Use `SharePointPaginationHelper.GetAllItemsAsync` with a CAML filter on `FSObjType = 1` (folders). Read `FolderChildCount` and `ItemChildCount` from `FieldValues`.

**When to use:** DUPL-02 (folder duplicate scan).

**Example:**

```csharp
// Source: PS reference lines 5010-5036; Phase 2 SharePointPaginationHelper pattern
var camlQuery = new CamlQuery
{
    // Folders only (FSObjType = 1), walked recursively, paged 2000 rows at a time
    ViewXml = @"<View Scope='RecursiveAll'>
        <Query>
            <Where>
                <Eq>
                    <FieldRef Name='FSObjType' />
                    <Value Type='Integer'>1</Value>
                </Eq>
            </Where>
        </Query>
        <RowLimit Paged='TRUE'>2000</RowLimit>
    </View>"
};

await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, list, camlQuery, ct))
{
    var fv = item.FieldValues;
    var name       = fv["FileLeafRef"]?.ToString() ?? string.Empty;
    var fileRef    = fv["FileRef"]?.ToString() ?? string.Empty;
    var subCount   = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
    var childCount = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
    var fileCount  = Math.Max(0, childCount - subCount);
    var created    = fv["Created"]  is DateTime cr ? cr : (DateTime?)null;
    var modified   = fv["Modified"] is DateTime md ? md : (DateTime?)null;
    // ...build DuplicateItem
}
```

### Pattern 4: Duplicate Composite Key (name+size+date grouping)

**What:** Build a string composite key from the fields the user selected, then `GroupBy(key).Where(g => g.Count() >= 2)`.

**When to use:** DUPL-01 (files) and DUPL-02 (folders).
**Example:**

```csharp
// Source: PS reference lines 4942-4949 (MakeKey function)
private static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
{
    var parts = new List<string> { item.Name.ToLowerInvariant() };
    if (opts.MatchSize && item.SizeBytes.HasValue)
        parts.Add(item.SizeBytes.Value.ToString());
    if (opts.MatchCreated && item.Created.HasValue)
        parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchModified && item.Modified.HasValue)
        parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchSubfolderCount && item.FolderCount.HasValue)
        parts.Add(item.FolderCount.Value.ToString());
    if (opts.MatchFileCount && item.FileCount.HasValue)
        parts.Add(item.FileCount.Value.ToString());
    return string.Join("|", parts);
}

var groups = allItems
    .GroupBy(i => MakeKey(i, opts))
    .Where(g => g.Count() >= 2)
    .Select(g => new DuplicateGroup { GroupKey = g.Key, Name = g.First().Name, Items = g.ToList() })
    .OrderByDescending(g => g.Items.Count)
    .ToList();
```

### Pattern 5: Storage Recursive Tree → Flat Row List for DataGrid

**What:** Flatten the recursive tree (site → library → folder → subfolder) into a flat `List<StorageNode>` where each node carries an `IndentLevel`. The WPF `DataGrid` renders a `Margin` on the name cell based on `IndentLevel`.

**When to use:** STOR-01/02 WPF display.

**Rationale for DataGrid over TreeView:** WPF `TreeView` requires a hierarchical `HierarchicalDataTemplate` and loses virtualization with deep nesting. A flat `DataGrid` with `VirtualizingPanel.IsVirtualizing="True"` stays performant for thousands of rows and is trivially sortable.

**Example:**

```csharp
// Flatten tree to observable list for DataGrid binding
private static void FlattenTree(StorageNode node, int level, List<StorageNode> result)
{
    node.IndentLevel = level;
    result.Add(node);
    foreach (var child in node.Children)
        FlattenTree(child, level + 1, result);
}
```

In the XAML, bind the name cell's `Margin` through an `IValueConverter` that maps `IndentLevel` → `new Thickness(IndentLevel * 16, 0, 0, 0)`.
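The indent converter can be sketched as follows; the class name `IndentToMarginConverter` and the 16-pixel step are assumptions, and any equivalent `IValueConverter` works.

```csharp
using System;
using System.Globalization;
using System.Windows;
using System.Windows.Data;

// Illustrative one-way converter: turns an int IndentLevel into a left Margin
// so child rows in the flat DataGrid appear visually nested.
public sealed class IndentToMarginConverter : IValueConverter
{
    public double PixelsPerLevel { get; set; } = 16;

    public object Convert(object value, Type targetType, object parameter, CultureInfo culture) =>
        new Thickness(value is int level && level > 0 ? level * PixelsPerLevel : 0, 0, 0, 0);

    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture) =>
        throw new NotSupportedException(); // display-only binding, never written back
}
```

Declared once in the view's resources (e.g. `<local:IndentToMarginConverter x:Key="IndentToMargin" />`), it would be bound on the name cell's `TextBlock` as `Margin="{Binding IndentLevel, Converter={StaticResource IndentToMargin}}"`.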
### Pattern 6: Storage HTML Collapsible Tree

**What:** The HTML export uses inline nested tables with `display:none` rows toggled by `toggle(i)` JS. Each library/folder that has children gets a unique numeric index.

**When to use:** STOR-05 export.

**Key design (from PS lines 1621-1780):**

- A global `_togIdx` counter assigns unique IDs to collapsible rows: ``.
- A `