Phase 3: Storage and File Operations - Research
Researched: 2026-04-02 | Domain: CSOM StorageMetrics, SharePoint KQL Search, WPF DataGrid, duplicate detection | Confidence: HIGH
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| STOR-01 | User can view storage consumption per library on a site | CSOM Folder.StorageMetrics (one Load call per folder) + flat DataGrid with indent column |
| STOR-02 | User can view storage consumption per site with configurable folder depth | Recursive Collect-FolderStorage pattern translated to async CSOM; depth guard via split-count |
| STOR-03 | Storage metrics include total size, version size, item count, and last modified date | StorageMetrics.TotalSize, TotalFileStreamSize, TotalFileCount, StorageMetrics.LastModified; version size = TotalSize - TotalFileStreamSize |
| STOR-04 | User can export storage metrics to CSV | New StorageCsvExportService — same UTF-8 BOM pattern as Phase 2 |
| STOR-05 | User can export storage metrics to interactive HTML with collapsible tree view | New StorageHtmlExportService — port PS lines 1621-1780; toggle() JS + nested table rows |
| SRCH-01 | User can search files across sites using multiple criteria | KeywordQuery + SearchExecutor (CSOM search); KQL built from filter params; client-side Regex post-filter |
| SRCH-02 | User can configure maximum search results (up to 50,000) | SharePoint Search StartRow hard cap is 50,000 (boundary); 500 rows/batch × 100 pages = 50,000 max |
| SRCH-03 | User can export search results to CSV | New SearchCsvExportService |
| SRCH-04 | User can export search results to interactive HTML (sortable, filterable) | New SearchHtmlExportService — port PS lines 2112-2233; sortable columns via data attributes |
| DUPL-01 | User can scan for duplicate files by name, size, creation date, modification date | Search API (same as SRCH) + client-side GroupBy composite key; no content hashing needed |
| DUPL-02 | User can scan for duplicate folders by name, subfolder count, file count | SharePointPaginationHelper.GetAllItemsAsync with CAML FSObjType=1; read FolderChildCount, ItemChildCount from field values |
| DUPL-03 | User can export duplicate report to HTML with grouped display and visual indicators | New DuplicatesHtmlExportService — port PS lines 2235-2406; collapsible group cards, ok/diff badges |
</phase_requirements>
Summary
Phase 3 introduces three feature areas (Storage Metrics, File Search, Duplicate Detection), each requiring a dedicated ViewModel, View, Service, and export services. All three areas can be implemented without adding new NuGet packages — Microsoft.SharePoint.Client.Search.dll is already in the output folder as a transitive dependency of PnP.Framework 1.18.0.
Storage uses CSOM Folder.StorageMetrics (loaded via ctx.Load(folder, f => f.StorageMetrics)). One CSOM round-trip per folder. Version size is derived as TotalSize - TotalFileStreamSize. The data model is a recursive tree (site → library → folder → subfolder), flattened to a DataGrid with an indent-level column for WPF display. The HTML export ports the PS Export-StorageToHTML function (PS lines 1621-1780) with its toggle(i) JS pattern.
File Search uses Microsoft.SharePoint.Client.Search.Query.KeywordQuery + SearchExecutor. KQL is assembled from UI filter fields (extension, date range, creator, editor, library path). Pagination is StartRow += 500 per batch; the hard ceiling is StartRow = 50,000 (SharePoint Search boundary), which means the 50,000 max-results requirement (SRCH-02) is exactly the platform limit. Client-side Regex is applied after retrieval. The HTML export ports PS lines 2112-2233.
Duplicate Detection uses the same Search API for file duplicates (with all documents query) and SharePointPaginationHelper.GetAllItemsAsync with FSObjType CAML filter for folder duplicates. Items are grouped client-side by a composite key (name + optional size/dates/counts). No content hashing is needed — the DUPL-01/02/03 requirements specify name+size+dates, which exactly matches the PS reference implementation.
Primary recommendation: Three ViewModels (StorageViewModel, SearchViewModel, DuplicatesViewModel), three service interfaces, six export services (storage CSV/HTML, search CSV/HTML, duplicates HTML — duplicates CSV is bonus), all extending existing Phase 2 patterns.
User Constraints
No CONTEXT.md exists for Phase 3 (no /gsd:discuss-phase was run). All decisions below are from the locked technology stack in the prompt.
Locked Decisions
- .NET 10 LTS + WPF + MVVM (CommunityToolkit.Mvvm 8.4.2)
- PnP.Framework 1.18.0 (CSOM-based SharePoint access)
- No new major packages preferred — only add if truly necessary
- Microsoft.Extensions.Hosting DI
- Serilog logging
- xUnit 2.9.3 tests
Deferred / Out of Scope
- Content hashing for duplicate detection (v2)
- Storage charts/graphs (v2 requirement VIZZ-01/02/03)
- Cross-tenant file search
Standard Stack
Core (no new packages needed)
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| PnP.Framework | 1.18.0 | CSOM access, ClientContext | Already in project |
| Microsoft.SharePoint.Client.Search.dll | (via PnP.Framework) | KeywordQuery, SearchExecutor | Transitive dep — confirmed present in bin/Debug/net10.0-windows/ |
| CommunityToolkit.Mvvm | 8.4.2 | [ObservableProperty], AsyncRelayCommand | Already in project |
| Microsoft.Extensions.Hosting | 10.x | DI container | Already in project |
| Serilog | 4.3.1 | Structured logging | Already in project |
| xUnit | 2.9.3 | Tests | Already in project |
| Moq | 4.20.72 | Mock interfaces in tests | Already in project |
No new NuGet packages required. Microsoft.SharePoint.Client.Search.dll ships as a transitive dependency of PnP.Framework — confirmed present at SharepointToolbox/bin/Debug/net10.0-windows/Microsoft.SharePoint.Client.Search.dll.
New Models Needed
| Model | Location | Fields |
|---|---|---|
| StorageNode | Core/Models/StorageNode.cs | string Name, string Url, string SiteTitle, string Library, long TotalSizeBytes, long FileStreamSizeBytes, long TotalFileCount, DateTime? LastModified, int IndentLevel, List<StorageNode> Children |
| SearchResult | Core/Models/SearchResult.cs | string Title, string Path, string FileExtension, DateTime? Created, DateTime? LastModified, string Author, string ModifiedBy, long SizeBytes |
| DuplicateGroup | Core/Models/DuplicateGroup.cs | string GroupKey, string Name, List<DuplicateItem> Items |
| DuplicateItem | Core/Models/DuplicateItem.cs | string Name, string Path, string Library, long? SizeBytes, DateTime? Created, DateTime? Modified, int? FolderCount, int? FileCount |
| StorageScanOptions | Core/Models/StorageScanOptions.cs | bool PerLibrary, bool IncludeSubsites, int FolderDepth |
| SearchOptions | Core/Models/SearchOptions.cs | string SiteUrl, string[] Extensions, string? Regex, DateTime? CreatedAfter, DateTime? CreatedBefore, DateTime? ModifiedAfter, DateTime? ModifiedBefore, string? CreatedBy, string? ModifiedBy, string? Library, int MaxResults |
| DuplicateScanOptions | Core/Models/DuplicateScanOptions.cs | string Mode ("Files"/"Folders"), bool MatchSize, bool MatchCreated, bool MatchModified, bool MatchSubfolderCount, bool MatchFileCount, bool IncludeSubsites, string? Library |
Architecture Patterns
Recommended Project Structure (additions only)
SharepointToolbox/
├── Core/Models/
│ ├── StorageNode.cs # new
│ ├── SearchResult.cs # new
│ ├── DuplicateGroup.cs # new
│ ├── DuplicateItem.cs # new
│ ├── StorageScanOptions.cs # new
│ ├── SearchOptions.cs # new
│ └── DuplicateScanOptions.cs # new
├── Services/
│ ├── IStorageService.cs # new
│ ├── StorageService.cs # new
│ ├── ISearchService.cs # new
│ ├── SearchService.cs # new
│ ├── IDuplicatesService.cs # new
│ ├── DuplicatesService.cs # new
│ └── Export/
│ ├── StorageCsvExportService.cs # new
│ ├── StorageHtmlExportService.cs # new
│ ├── SearchCsvExportService.cs # new
│ ├── SearchHtmlExportService.cs # new
│ └── DuplicatesHtmlExportService.cs # new
├── ViewModels/Tabs/
│ ├── StorageViewModel.cs # new
│ ├── SearchViewModel.cs # new
│ └── DuplicatesViewModel.cs # new
└── Views/Tabs/
├── StorageView.xaml # new
├── StorageView.xaml.cs # new
├── SearchView.xaml # new
├── SearchView.xaml.cs # new
├── DuplicatesView.xaml # new
└── DuplicatesView.xaml.cs # new
Pattern 1: CSOM StorageMetrics Load
What: Load Folder.StorageMetrics with a single round-trip per folder. StorageMetrics is a child object — you must include it in the Load expression or it will not be fetched.
When to use: Whenever reading storage data for a folder or library root.
Example:
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.storagemetrics
// + https://longnlp.github.io/load-storage-metric-from-SPO
// Get folder by server-relative URL (library root or subfolder)
Folder folder = ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl);
ctx.Load(folder,
    f => f.StorageMetrics,    // pulls TotalSize, TotalFileStreamSize, TotalFileCount, LastModified
    f => f.TimeLastModified,  // alternative timestamp if StorageMetrics.LastModified is null
    f => f.ServerRelativeUrl,
    f => f.Name);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

long totalBytes   = folder.StorageMetrics.TotalSize;
long streamBytes  = folder.StorageMetrics.TotalFileStreamSize; // current-version files only
long versionBytes = Math.Max(0L, totalBytes - streamBytes);    // version overhead
long fileCount    = folder.StorageMetrics.TotalFileCount;
DateTime? lastMod = folder.StorageMetrics.IsPropertyAvailable("LastModified")
    ? folder.StorageMetrics.LastModified
    : folder.TimeLastModified;
Unit: TotalSize and TotalFileStreamSize are in bytes (Int64). TotalFileStreamSize is the aggregate stream size for current-version file content only — it excludes version history, metadata, and attachments (confirmed by [MS-CSOMSPT]). Version storage = TotalSize - TotalFileStreamSize.
Pattern 2: KQL Search with Pagination
What: Use KeywordQuery + SearchExecutor (in Microsoft.SharePoint.Client.Search.Query) to execute a KQL query, paginating 500 rows at a time via StartRow.
When to use: SRCH-01/02/03/04 (file search) and DUPL-01 (file duplicate detection).
Example:
// Source: https://learn.microsoft.com/en-us/dotnet/api/microsoft.sharepoint.client.search.query.searchexecutor
// + https://usefulscripts.wordpress.com/2015/09/11/how-to-fetch-all-results-from-sharepoint-search-using-dot-net-managed-csom/
using Microsoft.SharePoint.Client.Search.Query;
// namespace: Microsoft.SharePoint.Client.Search.Query
// assembly: Microsoft.SharePoint.Client.Search.dll (via PnP.Framework transitive dep)
var allResults = new List<IDictionary<string, object>>();
int startRow = 0;
const int batchSize = 500;
do
{
    ct.ThrowIfCancellationRequested();
    var kq = new KeywordQuery(ctx)
    {
        QueryText = kql,        // e.g. "ContentType:Document AND FileExtension:pdf"
        StartRow = startRow,
        RowLimit = batchSize,
        TrimDuplicates = false
    };
    // Explicit managed properties to retrieve
    kq.SelectProperties.AddRange(new[]
    {
        "Title", "Path", "Author", "LastModifiedTime",
        "FileExtension", "Created", "ModifiedBy", "Size"
    });
    var executor = new SearchExecutor(ctx);
    ClientResult<ResultTableCollection> clientResult = executor.ExecuteQuery(kq);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
    // Note: ctx.ExecuteQuery() is called inside ExecuteQueryRetryAsync — do NOT call again

    // TableType is a string in the client object model — compare against the enum name
    var table = clientResult.Value
        .FirstOrDefault(t => t.TableType == KnownTableTypes.RelevantResults.ToString());
    if (table == null) break;

    // ResultRows is IEnumerable<IDictionary<string, object>> (see Pitfall 2)
    foreach (IDictionary<string, object> row in table.ResultRows)
        allResults.Add(new Dictionary<string, object>(row));

    progress.Report(new OperationProgress(allResults.Count, maxResults, $"Retrieved {allResults.Count} results…"));
    startRow += batchSize;
    if (table.RowCount < batchSize) break; // last page: fewer rows than requested
}
while (startRow < maxResults
    && startRow <= 50_000  // platform hard cap
    && allResults.Count < maxResults);
Critical detail: ExecuteQueryRetryHelper.ExecuteQueryRetryAsync wraps ctx.ExecuteQuery(). Call it AFTER executor.ExecuteQuery(kq) — do NOT call ctx.ExecuteQuery() directly afterward.
StartRow limit: SharePoint Search imposes a hard boundary of 50,000 for StartRow. With batch size 500, max pages = 100, max results = 50,000. This exactly satisfies SRCH-02.
KQL field mappings (from PS reference lines 4747-4763):
- Extension: `FileExtension:pdf OR FileExtension:docx`
- Created after/before: `Created>=2024-01-01` / `Created<=2024-12-31`
- Modified after/before: `Write>=2024-01-01` / `Write<=2024-12-31`
- Created by: `Author:"First Last"`
- Modified by: `ModifiedBy:"First Last"`
- Library path: `Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"`
- Documents only: `ContentType:Document`
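Joined with " AND " (as in the BuildKql example below under Code Examples), the mappings above compose into a single query string. For example, a search for PDFs modified during 2024 inside one library might produce (all values illustrative):

```
ContentType:Document AND (FileExtension:pdf) AND Write>=2024-01-01 AND Write<=2024-12-31 AND Path:"https://tenant.sharepoint.com/sites/x/Shared Documents*"
```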
Pattern 3: Folder Enumeration for Duplicate Folders
What: Use SharePointPaginationHelper.GetAllItemsAsync with a CAML filter on FSObjType = 1 (folders). Read FolderChildCount and ItemChildCount from FieldValues.
When to use: DUPL-02 (folder duplicate scan).
Example:
// Source: PS reference lines 5010-5036; Phase 2 SharePointPaginationHelper pattern
var camlQuery = new CamlQuery
{
    ViewXml = @"<View Scope='RecursiveAll'>
        <Query>
            <Where>
                <Eq>
                    <FieldRef Name='FSObjType' />
                    <Value Type='Integer'>1</Value>
                </Eq>
            </Where>
        </Query>
        <RowLimit>2000</RowLimit>
    </View>"
};
await foreach (var item in SharePointPaginationHelper.GetAllItemsAsync(ctx, list, camlQuery, ct))
{
    var fv = item.FieldValues;
    var name       = fv["FileLeafRef"]?.ToString() ?? string.Empty;
    var fileRef    = fv["FileRef"]?.ToString() ?? string.Empty;
    var subCount   = Convert.ToInt32(fv["FolderChildCount"] ?? 0);
    var childCount = Convert.ToInt32(fv["ItemChildCount"] ?? 0);
    var fileCount  = Math.Max(0, childCount - subCount);
    var created    = fv["Created"] is DateTime cr ? cr : (DateTime?)null;
    var modified   = fv["Modified"] is DateTime md ? md : (DateTime?)null;
    // ...build DuplicateItem
}
Pattern 4: Duplicate Composite Key (name+size+date grouping)
What: Build a string composite key from the fields the user selected, then GroupBy(key).Where(g => g.Count() >= 2).
When to use: DUPL-01 (files) and DUPL-02 (folders).
Example:
// Source: PS reference lines 4942-4949 (MakeKey function)
private static string MakeKey(DuplicateItem item, DuplicateScanOptions opts)
{
    var parts = new List<string> { item.Name.ToLowerInvariant() };
    if (opts.MatchSize && item.SizeBytes.HasValue) parts.Add(item.SizeBytes.Value.ToString());
    if (opts.MatchCreated && item.Created.HasValue) parts.Add(item.Created.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchModified && item.Modified.HasValue) parts.Add(item.Modified.Value.Date.ToString("yyyy-MM-dd"));
    if (opts.MatchSubfolderCount && item.FolderCount.HasValue) parts.Add(item.FolderCount.Value.ToString());
    if (opts.MatchFileCount && item.FileCount.HasValue) parts.Add(item.FileCount.Value.ToString());
    return string.Join("|", parts);
}

var groups = allItems
    .GroupBy(i => MakeKey(i, opts))
    .Where(g => g.Count() >= 2)
    .Select(g => new DuplicateGroup
    {
        GroupKey = g.Key,
        Name = g.First().Name,
        Items = g.ToList()
    })
    .OrderByDescending(g => g.Items.Count)
    .ToList();
Pattern 5: Storage Recursive Tree → Flat Row List for DataGrid
What: Flatten the recursive tree (site → library → folder → subfolder) into a flat List<StorageNode> where each node carries an IndentLevel. The WPF DataGrid renders a Margin on the name cell based on IndentLevel.
When to use: STOR-01/02 WPF display.
Rationale for DataGrid over TreeView: WPF TreeView requires a HierarchicalDataTemplate and loses virtualization with deep nesting. A flat DataGrid with VirtualizingPanel.IsVirtualizing="True" stays performant for thousands of rows and is trivially sortable.
Example:
// Flatten tree to observable list for DataGrid binding
private static void FlattenTree(StorageNode node, int level, List<StorageNode> result)
{
    node.IndentLevel = level;
    result.Add(node);
    foreach (var child in node.Children)
        FlattenTree(child, level + 1, result);
}

<!-- WPF DataGrid cell template for name column with indent -->
<DataGridTemplateColumn Header="Library / Folder" Width="*">
    <DataGridTemplateColumn.CellTemplate>
        <DataTemplate>
            <TextBlock Text="{Binding Name}"
                       Margin="{Binding IndentLevel, Converter={StaticResource IndentConverter}}" />
        </DataTemplate>
    </DataGridTemplateColumn.CellTemplate>
</DataGridTemplateColumn>
Use IValueConverter mapping IndentLevel → new Thickness(IndentLevel * 16, 0, 0, 0).
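A minimal sketch of that converter, assuming the class name IndentLevelToMarginConverter and the 16-px step (register it as a StaticResource in the view's resources under the key IndentConverter):

```csharp
using System;
using System.Globalization;
using System.Windows;
using System.Windows.Data;

// Maps an int IndentLevel to a left Margin of 16 px per level (one-way).
public sealed class IndentLevelToMarginConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
        => new Thickness((value is int level ? level : 0) * 16, 0, 0, 0);

    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
        => throw new NotSupportedException(); // display-only binding
}
```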
Pattern 6: Storage HTML Collapsible Tree
What: The HTML export uses inline nested tables with display:none rows toggled by toggle(i) JS. Each library/folder that has children gets a unique numeric index.
When to use: STOR-05 export.
Key design (from PS lines 1621-1780):
- A global `_togIdx` counter assigns unique IDs to collapsible rows: `<tr id='sf-{i}' style='display:none'>`.
- A `<button onclick='toggle({i})'>` triggers `row.style.display = visible ? 'none' : 'table-row'`.
- Library rows embed a nested `<table class='sf-tbl'>` inside the collapsible row (colspan spanning all columns).
- This is a pure inline pattern — no external JS or CSS dependencies.
- In C#, the counter is a field on `StorageHtmlExportService`, reset at the start of each `BuildHtml()` call.
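In C#, the emitter could be sketched roughly as follows. The method name AppendCollapsibleRow and its parameters are hypothetical; only the `_togIdx` counter and the `sf-{i}` / `toggle(i)` markup shape come from the PS reference described above.

```csharp
private int _togIdx; // reset to 0 at the start of each BuildHtml() call

// Hypothetical helper: emits a toggle button plus a hidden row that
// hosts the nested folder table spanning all columns.
private void AppendCollapsibleRow(StringBuilder sb, string nestedTableHtml, int columnCount)
{
    int i = _togIdx++;
    sb.Append($"<button onclick='toggle({i})'>+</button>");
    sb.Append($"<tr id='sf-{i}' style='display:none'>")
      .Append($"<td colspan='{columnCount}'><table class='sf-tbl'>{nestedTableHtml}</table></td>")
      .Append("</tr>");
}
```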
Anti-Patterns to Avoid
- Loading StorageMetrics without including it in ctx.Load: `folder.StorageMetrics.TotalSize` throws `PropertyOrFieldNotInitializedException` if `StorageMetrics` is not included in the Load expression. Always use `ctx.Load(folder, f => f.StorageMetrics, ...)`.
- Calling ctx.ExecuteQuery() after executor.ExecuteQuery(kq): The search executor pattern requires calling `ctx.ExecuteQuery()` ONCE (inside `ExecuteQueryRetryAsync`). Calling it twice is a no-op at best, throws at worst.
- StartRow > 50,000: SharePoint Search hard boundary — will return zero results or error. Cap loop exit at `startRow <= 50_000`.
- Modifying ObservableCollection from Task.Run: Same rule as Phase 2 — accumulate in `List<T>` on the background thread, then `Dispatcher.InvokeAsync(() => StorageResults = new ObservableCollection<T>(list))`.
- Recursive CSOM calls without depth guard: Without a depth guard, `Collect-FolderStorage` on a deep site can make thousands of CSOM round-trips. Always pass `MaxDepth` and check `currentDepth >= maxDepth` before recursing.
- Building a TreeView for storage display: WPF TreeView loses UI virtualization with more than ~1000 visible items. Use DataGrid with IndentLevel.
- Version size from search index: The Search API's `Size` property is the current-version file size, not the total including versions. Only `StorageMetrics.TotalFileStreamSize` vs `TotalSize` gives accurate version overhead.
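The ObservableCollection rule above can be sketched like this (a hypothetical StorageViewModel command body; CollectStorageAsync's exact signature is an assumption):

```csharp
// Accumulate on a background thread...
List<StorageNode> list = await Task.Run(
    () => _storageService.CollectStorageAsync(options, progress, ct), ct);

// ...then publish to the bound collection on the UI thread.
await Application.Current.Dispatcher.InvokeAsync(() =>
    StorageResults = new ObservableCollection<StorageNode>(list));
```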
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| CSOM throttle retry | Custom retry loop | ExecuteQueryRetryHelper.ExecuteQueryRetryAsync (Phase 1) | Already handles 429/503 with exponential backoff |
| List pagination | Raw ExecuteQuery loop | SharePointPaginationHelper.GetAllItemsAsync (Phase 1) | Handles 5000-item threshold, CAML position continuation |
| Search pagination | Manual do/while per search | Same KeywordQuery+SearchExecutor pattern (internal to SearchService) | Wrap in a helper method inside SearchService to avoid duplication across SRCH and DUPL features |
| HTML header/footer boilerplate | New template each export service | Copy from existing HtmlExportService pattern (Phase 2) | Consistent <!DOCTYPE>, viewport meta, Segoe UI font stack |
| CSV field escaping | Custom escaping | RFC 4180 Csv() helper pattern from Phase 2 CsvExportService | Already handles quotes, empty values, UTF-8 BOM |
| OperationProgress reporting | New progress model | OperationProgress.Indeterminate(msg) + new OperationProgress(current, total, msg) (Phase 1) | Already wired to UI via FeatureViewModelBase |
| Tenant context management | Directly create ClientContext | ISessionManager.GetOrCreateContextAsync (Phase 1) | Handles MSAL cache, per-tenant context pooling |
Common Pitfalls
Pitfall 1: StorageMetrics PropertyOrFieldNotInitializedException
What goes wrong: folder.StorageMetrics.TotalSize throws PropertyOrFieldNotInitializedException at runtime.
Why it happens: CSOM lazy-loading — if StorageMetrics is not in the Load expression, the proxy object exists but has no data.
How to avoid: Always include f => f.StorageMetrics in the ctx.Load(folder, ...) lambda.
Warning signs: Exception message contains "The property or field 'StorageMetrics' has not been initialized".
Pitfall 2: Search ResultRows Type Is IDictionary-like But Not Strongly Typed
What goes wrong: Accessing row["Size"] returns object — Size comes back as a string "12345" not a long.
Why it happens: ResultTable.ResultRows is IEnumerable<IDictionary<string, object>>. All values are strings from the search index.
How to avoid: Always parse with long.TryParse(row["Size"]?.ToString() ?? "0", out var sizeBytes). Strip non-numeric characters as PS does: Regex.Replace(sizeStr, "[^0-9]", "").
Warning signs: InvalidCastException when binding Size to a numeric column.
Pitfall 3: Search API Returns Duplicates for Versioned Files
What goes wrong: Files with many versions appear multiple times in results via /_vti_history/ paths.
Why it happens: SharePoint indexes each version as a separate item in some cases.
How to avoid: Exclude items whose Path.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase) — port of PS line 4973.
Warning signs: Duplicate file paths in results with _vti_history segment.
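As a sketch, assuming results is the List<SearchResult> built by the ParseRow helper shown under Code Examples:

```csharp
// Drop version-history entries surfaced by the search index.
results.RemoveAll(r =>
    r.Path.Contains("/_vti_history/", StringComparison.OrdinalIgnoreCase));
```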
Pitfall 4: StorageMetrics.LastModified May Be DateTime.MinValue
What goes wrong: LastModified shows as 01/01/0001 for empty folders.
Why it happens: SharePoint returns a default DateTime for folders with no modifications.
How to avoid: Check lastModified > DateTime.MinValue before formatting. Fall back to folder.TimeLastModified if StorageMetrics.LastModified is unset.
Warning signs: "01/01/0001" in the LastModified column.
Pitfall 5: KQL Query Text Exceeds 4096 Characters
What goes wrong: Search query silently fails or returns error for very long KQL strings.
Why it happens: SharePoint Search has a 4096-character KQL text boundary.
How to avoid: For extension filters with many extensions, use (FileExtension:a OR FileExtension:b OR ...) and validate total length before calling. Warn user if limit approached.
Warning signs: Zero results returned when many extensions entered; no CSOM exception.
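A possible guard inside SearchService (the constant name and the exception choice are assumptions):

```csharp
const int MaxKqlLength = 4096; // SharePoint Search query-text boundary

string kql = BuildKql(opts);
if (kql.Length > MaxKqlLength)
    throw new InvalidOperationException(
        $"KQL query is {kql.Length} characters; the service limit is {MaxKqlLength}. " +
        "Reduce the number of filters or extensions.");
```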
Pitfall 6: CAML FSObjType Field Name
What goes wrong: CAML query for folders returns no results.
Why it happens: The internal CAML field name is FSObjType, not FileSystemObjectType. Using the wrong name returns no matches silently.
How to avoid: Use <FieldRef Name='FSObjType' /> (integer) with <Value Type='Integer'>1</Value>. Confirmed by PS reference line 5011 which uses CSOM FileSystemObjectType.Folder comparison.
Warning signs: Zero items returned from folder CAML query on a library known to have folders.
Pitfall 7: StorageService Needs Web.ServerRelativeUrl to Compute Site-Relative Path
What goes wrong: Get-PnPFolderStorageMetric -FolderSiteRelativeUrl requires a path relative to the web root (e.g., Shared Documents), not the server root (e.g., /sites/MySite/Shared Documents).
Why it happens: CSOM Folder.StorageMetrics uses server-relative URLs, so you need to strip the web's ServerRelativeUrl prefix.
How to avoid: Load ctx.Web.ServerRelativeUrl first, then compute: siteRelUrl = rootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/'). Use ctx.Web.GetFolderByServerRelativeUrl(serverRelativeUrl), which accepts full server-relative paths.
Warning signs: 404/FileNotFoundException from CSOM when calling StorageMetrics.
Code Examples
Loading StorageMetrics (STOR-01/02/03)
// Source: MS Learn — StorageMetrics Class; [MS-CSOMSPT] TotalFileStreamSize definition
ctx.Load(ctx.Web, w => w.ServerRelativeUrl, w => w.Url, w => w.Title);
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
string webSrl = ctx.Web.ServerRelativeUrl.TrimEnd('/');

// Per-library: iterate document libraries
ctx.Load(ctx.Web.Lists, lists => lists.Include(
    l => l.Title, l => l.BaseType, l => l.Hidden, l => l.RootFolder.ServerRelativeUrl));
await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);

foreach (var list in ctx.Web.Lists)
{
    if (list.Hidden || list.BaseType != BaseType.DocumentLibrary) continue;
    string siteRelUrl = list.RootFolder.ServerRelativeUrl.Substring(webSrl.Length).TrimStart('/');
    Folder rootFolder = ctx.Web.GetFolderByServerRelativeUrl(list.RootFolder.ServerRelativeUrl);
    ctx.Load(rootFolder,
        f => f.StorageMetrics,
        f => f.TimeLastModified,
        f => f.ServerRelativeUrl);
    await ExecuteQueryRetryHelper.ExecuteQueryRetryAsync(ctx, progress, ct);
    var node = new StorageNode
    {
        Name = list.Title,
        Url = $"{ctx.Web.Url.TrimEnd('/')}/{siteRelUrl}",
        SiteTitle = ctx.Web.Title,
        Library = list.Title,
        TotalSizeBytes = rootFolder.StorageMetrics.TotalSize,
        FileStreamSizeBytes = rootFolder.StorageMetrics.TotalFileStreamSize,
        TotalFileCount = rootFolder.StorageMetrics.TotalFileCount,
        LastModified = rootFolder.StorageMetrics.LastModified > DateTime.MinValue
            ? rootFolder.StorageMetrics.LastModified
            : rootFolder.TimeLastModified,
        IndentLevel = 0,
        Children = new List<StorageNode>()
    };
    // Recursive subfolder collection up to maxDepth
    if (maxDepth > 0)
        await CollectSubfoldersAsync(ctx, list.RootFolder.ServerRelativeUrl, node, 1, maxDepth, progress, ct);
}
KQL Build from SearchOptions
// Source: PS reference lines 4747-4763
private static string BuildKql(SearchOptions opts)
{
    var parts = new List<string> { "ContentType:Document" };
    if (opts.Extensions.Length > 0)
    {
        var extParts = opts.Extensions.Select(e => $"FileExtension:{e.TrimStart('.').ToLowerInvariant()}");
        parts.Add($"({string.Join(" OR ", extParts)})");
    }
    if (opts.CreatedAfter.HasValue)
        parts.Add($"Created>={opts.CreatedAfter.Value:yyyy-MM-dd}");
    if (opts.CreatedBefore.HasValue)
        parts.Add($"Created<={opts.CreatedBefore.Value:yyyy-MM-dd}");
    if (opts.ModifiedAfter.HasValue)
        parts.Add($"Write>={opts.ModifiedAfter.Value:yyyy-MM-dd}");
    if (opts.ModifiedBefore.HasValue)
        parts.Add($"Write<={opts.ModifiedBefore.Value:yyyy-MM-dd}");
    if (!string.IsNullOrEmpty(opts.CreatedBy))
        parts.Add($"Author:\"{opts.CreatedBy}\"");
    if (!string.IsNullOrEmpty(opts.ModifiedBy))
        parts.Add($"ModifiedBy:\"{opts.ModifiedBy}\"");
    if (!string.IsNullOrEmpty(opts.Library))
        parts.Add($"Path:\"{opts.SiteUrl.TrimEnd('/')}/{opts.Library.TrimStart('/')}*\"");
    return string.Join(" AND ", parts);
}
Parsing Search ResultRows
// Source: PS reference lines 4971-4987
private static SearchResult ParseRow(IDictionary<string, object> row)
{
    static string Str(IDictionary<string, object> r, string key) =>
        r.TryGetValue(key, out var v) ? v?.ToString() ?? string.Empty : string.Empty;

    static DateTime? Date(IDictionary<string, object> r, string key)
    {
        var s = Str(r, key);
        return DateTime.TryParse(s, out var dt) ? dt : null;
    }

    static long ParseSize(IDictionary<string, object> r, string key)
    {
        var raw = Str(r, key);
        var digits = System.Text.RegularExpressions.Regex.Replace(raw, "[^0-9]", "");
        return long.TryParse(digits, out var v) ? v : 0L;
    }

    return new SearchResult
    {
        Title = Str(row, "Title"),
        Path = Str(row, "Path"),
        FileExtension = Str(row, "FileExtension"),
        Created = Date(row, "Created"),
        LastModified = Date(row, "LastModifiedTime"),
        Author = Str(row, "Author"),
        ModifiedBy = Str(row, "ModifiedBy"),
        SizeBytes = ParseSize(row, "Size")
    };
}
Localization Keys Needed
The following keys are needed for Phase 3 Views. Keys from the PS reference (lines 2747-2813) are remapped to the C# Strings.resx naming convention. Existing keys already in Strings.resx are marked with (existing).
Storage Tab
| Key | EN Value | Notes |
|---|---|---|
| tab.storage | Storage | (existing — already in Strings.resx line 77) |
| chk.per.lib | Per-Library Breakdown | new |
| chk.subsites | Include Subsites | new |
| lbl.folder.depth | Folder depth: | (existing — shared with permissions) |
| chk.max.depth | Maximum (all levels) | (existing — shared with permissions) |
| stor.note | Note: deeper folder scans on large sites may take several minutes. | new |
| btn.gen.storage | Generate Metrics | new |
| btn.open.storage | Open Report | new |
| stor.col.library | Library | new |
| stor.col.site | Site | new |
| stor.col.files | Files | new |
| stor.col.size | Size | new |
| stor.col.versions | Versions | new |
| stor.col.lastmod | Last Modified | new |
| stor.col.share | Share of Total | new |
File Search Tab
| Key | EN Value | Notes |
|---|---|---|
| tab.search | File Search | (existing — already in Strings.resx line 79) |
| grp.search.filters | Search Filters | new |
| lbl.extensions | Extension(s): | new |
| ph.extensions | docx pdf xlsx | new (placeholder) |
| lbl.regex | Name / Regex: | new |
| ph.regex | Ex: report.* or \.bak$ | new (placeholder) |
| chk.created.after | Created after: | new |
| chk.created.before | Created before: | new |
| chk.modified.after | Modified after: | new |
| chk.modified.before | Modified before: | new |
| lbl.created.by | Created by: | new |
| ph.created.by | First Last or email | new (placeholder) |
| lbl.modified.by | Modified by: | new |
| ph.modified.by | First Last or email | new (placeholder) |
| lbl.library | Library: | new |
| ph.library | Optional relative path e.g. Shared Documents | new (placeholder) |
| lbl.max.results | Max results: | new |
| btn.run.search | Run Search | new |
| btn.open.search | Open Results | new |
| srch.col.name | File Name | new |
| srch.col.ext | Extension | new |
| srch.col.created | Created | new |
| srch.col.modified | Modified | new |
| srch.col.author | Created By | new |
| srch.col.modby | Modified By | new |
| srch.col.size | Size | new |
Duplicates Tab
| Key | EN Value | Notes |
|---|---|---|
| tab.duplicates | Duplicates | (existing — already in Strings.resx line 83) |
| grp.dup.type | Duplicate Type | new |
| rad.dup.files | Duplicate files | new |
| rad.dup.folders | Duplicate folders | new |
| grp.dup.criteria | Comparison Criteria | new |
| lbl.dup.note | Name is always the primary criterion. Check additional criteria: | new |
| chk.dup.size | Same size | new |
| chk.dup.created | Same creation date | new |
| chk.dup.modified | Same modification date | new |
| chk.dup.subfolders | Same subfolder count | new |
| chk.dup.filecount | Same file count | new |
| chk.include.subsites | Include subsites | new |
| ph.dup.lib | All (leave empty) | new (placeholder) |
| btn.run.scan | Run Scan | new |
| btn.open.results | Open Results | new |
Duplicate Detection Scale — Known Concern Resolution
The STATE.md concern ("Duplicate detection at scale (100k+ files) — Graph API hash enumeration limits") is resolved: the PS reference does NOT use file hashes. It uses name+size+date grouping, which is exactly what DUPL-01/02/03 specify. The requirements do not mention hash-based deduplication.
Scale analysis:
- File duplicates use the Search API. SharePoint Search caps at 50,000 results (StartRow=50,000 max). A site with 100k+ files will be capped at 50,000 returned results. This is the same cap as SRCH-02, and is a known/accepted limitation.
- Folder duplicates use CAML pagination. `SharePointPaginationHelper.GetAllItemsAsync` handles arbitrary folder counts with RowLimit=2000 pagination — no effective upper bound.
- Client-side GroupBy on 50,000 items is effectively instantaneous (a Dictionary-based O(n) operation).
- No Graph API or SHA256 content hashing is needed. The concern was about a potential v2 enhancement not required by DUPL-01/02/03.
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Get-PnPFolderStorageMetric (PS cmdlet) | CSOM Folder.StorageMetrics | Phase 3 migration | One CSOM round-trip per folder; no PnP PS module required |
| Submit-PnPSearchQuery (PS cmdlet) | CSOM KeywordQuery + SearchExecutor | Phase 3 migration | Same pagination model; TrimDuplicates=false explicit |
| Get-PnPListItem for folders (PS) | SharePointPaginationHelper.GetAllItemsAsync with CAML | Phase 3 migration | Reuses Phase 1 helper; handles 5000-item threshold |
| Storage TreeView control | Flat DataGrid with IndentLevel + IValueConverter | Phase 3 design decision | Better UI virtualization for large sites |
Validation Architecture
Test Framework
| Property | Value |
|---|---|
| Framework | xUnit 2.9.3 |
| Config file | none (SDK auto-discovery) |
| Quick run command | dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj --filter "Category!=Integration" -x |
| Full suite command | dotnet test SharepointToolbox.slnx |
Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| STOR-01/02 | StorageService.CollectStorageAsync returns StorageNode list | unit (mock ISessionManager) | dotnet test --filter "StorageServiceTests" | ❌ Wave 0 |
| STOR-03 | VersionSizeBytes = TotalSizeBytes - FileStreamSizeBytes | unit | dotnet test --filter "StorageNodeTests" | ❌ Wave 0 |
| STOR-04 | StorageCsvExportService.BuildCsv produces correct header and rows | unit | dotnet test --filter "StorageCsvExportServiceTests" | ❌ Wave 0 |
| STOR-05 | StorageHtmlExportService.BuildHtml contains toggle JS and nested tables | unit | dotnet test --filter "StorageHtmlExportServiceTests" | ❌ Wave 0 |
| SRCH-01 | SearchService builds correct KQL from SearchOptions | unit | dotnet test --filter "SearchServiceTests" | ❌ Wave 0 |
| SRCH-02 | Search loop exits when startRow > 50_000 | unit | dotnet test --filter "SearchServiceTests" | ❌ Wave 0 |
| SRCH-03 | SearchCsvExportService.BuildCsv produces correct header | unit | dotnet test --filter "SearchCsvExportServiceTests" | ❌ Wave 0 |
| SRCH-04 | SearchHtmlExportService.BuildHtml contains sort JS and filter input | unit | dotnet test --filter "SearchHtmlExportServiceTests" | ❌ Wave 0 |
| DUPL-01 | MakeKey function groups identical name+size+date items | unit | dotnet test --filter "DuplicatesServiceTests" | ❌ Wave 0 |
| DUPL-02 | CAML query targets FSObjType=1; FileCount = ItemChildCount - FolderChildCount | unit (logic only) | dotnet test --filter "DuplicatesServiceTests" | ❌ Wave 0 |
| DUPL-03 | DuplicatesHtmlExportService.BuildHtml contains group cards with ok/diff badges | unit | dotnet test --filter "DuplicatesHtmlExportServiceTests" | ❌ Wave 0 |
Note: StorageService, SearchService, and DuplicatesService depend on live CSOM — service-level tests use Skip like PermissionsServiceTests. ViewModel tests use Moq for IStorageService, ISearchService, IDuplicatesService following PermissionsViewModelTests pattern. Export service tests are fully unit-testable (no CSOM).
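The SRCH-02 cap behavior asserted in the table above can be sketched against the CSOM search API (`SearchPager` and the row handling are illustrative; the KeywordQuery/SearchExecutor calls are the documented Microsoft.SharePoint.Client.Search.Query API):

```csharp
// Sketch: KeywordQuery + SearchExecutor pagination honoring the 50,000
// StartRow hard cap (SRCH-02). Assumes an authenticated ClientContext;
// result-row collection is elided.
using System.Linq;
using System.Threading.Tasks;
using Microsoft.SharePoint.Client;
using Microsoft.SharePoint.Client.Search.Query;

static class SearchPager
{
    private const int PageSize = 500;       // RowLimit boundary per query
    private const int StartRowCap = 50_000; // SharePoint Search StartRow hard cap

    public static async Task RunAsync(ClientContext ctx, string kql)
    {
        var executor = new SearchExecutor(ctx);
        for (int startRow = 0; startRow < StartRowCap; startRow += PageSize)
        {
            var query = new KeywordQuery(ctx)
            {
                QueryText = kql,
                RowLimit = PageSize,
                StartRow = startRow,
                TrimDuplicates = false // match the PS reference behavior
            };
            var results = executor.ExecuteQuery(query);
            await ctx.ExecuteQueryAsync();

            var table = results.Value
                .FirstOrDefault(t => t.TableType == "RelevantResults");
            if (table is null) break;
            // ... collect table.ResultRows here ...
            if (table.ResultRows.Count() < PageSize)
                break; // last page reached before the cap
        }
    }
}
```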
Sampling Rate
- Per task commit: `dotnet test SharepointToolbox.Tests/SharepointToolbox.Tests.csproj -x`
- Per wave merge: `dotnet test SharepointToolbox.slnx`
- Phase gate: full suite green before `/gsd:verify-work`
Wave 0 Gaps
- `SharepointToolbox.Tests/Services/StorageServiceTests.cs` — covers STOR-01/02 (stub + Skip like PermissionsServiceTests)
- `SharepointToolbox.Tests/Services/Export/StorageCsvExportServiceTests.cs` — covers STOR-04
- `SharepointToolbox.Tests/Services/Export/StorageHtmlExportServiceTests.cs` — covers STOR-05
- `SharepointToolbox.Tests/Services/SearchServiceTests.cs` — covers SRCH-01/02 (KQL build + pagination cap logic)
- `SharepointToolbox.Tests/Services/Export/SearchCsvExportServiceTests.cs` — covers SRCH-03
- `SharepointToolbox.Tests/Services/Export/SearchHtmlExportServiceTests.cs` — covers SRCH-04
- `SharepointToolbox.Tests/Services/DuplicatesServiceTests.cs` — covers DUPL-01/02 composite key logic
- `SharepointToolbox.Tests/Services/Export/DuplicatesHtmlExportServiceTests.cs` — covers DUPL-03
- `SharepointToolbox.Tests/ViewModels/StorageViewModelTests.cs` — covers STOR-01 ViewModel (Moq IStorageService)
- `SharepointToolbox.Tests/ViewModels/SearchViewModelTests.cs` — covers SRCH-01/02 ViewModel
- `SharepointToolbox.Tests/ViewModels/DuplicatesViewModelTests.cs` — covers DUPL-01/02 ViewModel
Open Questions
- `StorageMetrics.LastModified` vs `TimeLastModified`
  - What we know: `StorageMetrics.LastModified` exists per the API docs. `Folder.TimeLastModified` is a separate CSOM property.
  - What's unclear: Whether `StorageMetrics.LastModified` can return `DateTime.MinValue` for recently created empty folders in all SharePoint Online tenants.
  - Recommendation: Load both (`f => f.StorageMetrics`, `f => f.TimeLastModified`) and prefer `StorageMetrics.LastModified` when it is `> DateTime.MinValue`, falling back to `TimeLastModified`.
- Search index freshness for duplicate detection
  - What we know: SharePoint Search is eventually consistent — newly created files may not appear for up to 15 minutes.
  - What's unclear: Whether users expect real-time accuracy or accept eventual consistency.
  - Recommendation: Document in the UI that search-based results (files) reflect the search index, not the current state. Add a note in the log output.
- Multiple-site file search scope
  - What we know: The PS reference scopes search to the `$siteUrl` context only (one site per search). SRCH-01 says "across sites" in the goal description, but the requirements only specify search criteria, not multi-site behavior.
  - What's unclear: Whether SRCH-01 requires multi-site search in one operation or per-site.
  - Recommendation: Implement per-site search (matching the PS reference). Multi-site search would require a separate `ClientContext` per site plus result merging — treat as a future enhancement.
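The recommendation in the first open question (dual-load with fallback) can be sketched as follows; `LastModifiedResolver` is an illustrative name and the `DateTime.MinValue` behavior is the unverified assumption under investigation:

```csharp
// Sketch: load both StorageMetrics.LastModified and Folder.TimeLastModified,
// then prefer the former when it carries a real value (open question 1).
using System;
using System.Threading.Tasks;
using Microsoft.SharePoint.Client;

static class LastModifiedResolver
{
    public static async Task<DateTime> ResolveAsync(ClientContext ctx, Folder folder)
    {
        ctx.Load(folder, f => f.StorageMetrics, f => f.TimeLastModified);
        await ctx.ExecuteQueryAsync();

        DateTime metricsDate = folder.StorageMetrics.LastModified;
        // Empty or freshly created folders may report DateTime.MinValue here.
        return metricsDate > DateTime.MinValue ? metricsDate : folder.TimeLastModified;
    }
}
```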
Sources
Primary (HIGH confidence)
- StorageMetrics Class — MS Learn CSOM reference — properties TotalSize, TotalFileStreamSize, TotalFileCount, LastModified confirmed
- StorageMetrics.TotalSize — MS Learn — confirmed as Int64, ReadOnly
- [MS-CSOMSPT] TotalFileStreamSize — confirmed definition: "Aggregate stream size in bytes for all files... Excludes version, metadata, list item attachment, and non-customized document sizes"
- SearchExecutor Class — MS Learn CSOM reference — namespace `Microsoft.SharePoint.Client.Search.Query`, assembly `Microsoft.SharePoint.Client.Search.Portable.dll`
- Search limits for SharePoint — MS Learn — StartRow max 50,000 (boundary), RowLimit max 500 (boundary) confirmed
- [SharepointToolbox/bin/Debug output] — `Microsoft.SharePoint.Client.Search.dll` confirmed present as transitive dependency
Secondary (MEDIUM confidence)
- Load storage metric from SPO — longnlp.github.io — CSOM Load pattern `ctx.Load(folder, f => f.StorageMetrics)` verified
- Fetch all results from SharePoint Search using CSOM — usefulscripts.wordpress.com — KeywordQuery + SearchExecutor pagination pattern with StartRow; confirmed against official docs
- PowerShell reference `Sharepoint_ToolBox.ps1` lines 1621-1780 (Export-StorageToHTML), 2112-2233 (Export-SearchResultsToHTML), 2235-2406 (Export-DuplicatesToHTML), 4432-4534 (storage scan), 4747-4808 (file search), 4937-5059 (duplicate scan) — authoritative reference implementation
Tertiary (LOW confidence — implementation detail, verify when coding)
- SharePoint CSOM Q&A — Getting size of subsite — general pattern confirmed; specific edge cases not verified
- Pagination for large result sets — MS Learn — DocId-based pagination beyond 50k exists but is not needed for Phase 3
Metadata
Confidence breakdown:
- Standard Stack: HIGH — no new packages needed; Search.dll confirmed present; all APIs verified against MS docs
- Architecture Patterns: HIGH — direct port of working PS reference; CSOM API shapes confirmed
- Pitfalls: HIGH for StorageMetrics loading, search result typing, vti_history filter (all from PS reference or official docs); MEDIUM for KQL length limit (documented but not commonly hit)
- Localization keys: HIGH — directly extracted from PS reference lines 2747-2813
Research date: 2026-04-02 Valid until: 2026-07-01 (CSOM APIs stable; SharePoint search limits stable; re-verify if PnP.Framework upgrades past 1.18)