Sharepoint-Toolbox/.planning/research/PITFALLS.md
2026-04-08 10:57:27 +02:00

# Pitfalls Research
**Domain:** C#/WPF SharePoint Online administration desktop tool (PowerShell-to-C# rewrite)
**Researched:** 2026-04-02
**Confidence:** HIGH (critical pitfalls verified via official docs, PnP GitHub issues, and known existing codebase problems)
---
## Critical Pitfalls
### Pitfall 1: Calling PnP/CSOM Methods Synchronously on the UI Thread
**What goes wrong:**
`AuthenticationManager.GetContext()`, `ExecuteQuery()`, and similar PnP Framework / CSOM calls are blocking network operations. If called directly on the WPF UI thread — even inside a button click handler — the entire window freezes until the call completes. This is precisely what causes the UI freezes in the current PowerShell app, and the problem migrates verbatim into C# if async patterns are not used from day one.
A subtler variant: using `.Result` or `.Wait()` on a `Task` from the UI thread. The UI thread holds a `SynchronizationContext`; the async continuation needs that same context to resume; deadlock ensues. The application hangs with no exception and no feedback.
**Why it happens:**
Developers migrating from PowerShell think in sequential terms and instinctively port one-liner calls directly to event handler bodies. The WPF framework does not prevent synchronous blocking — it just stops processing messages, which looks like a freeze.
**How to avoid:**
- Every SharePoint/PnP call must be wrapped in `await Task.Run(...)` or use the async overloads directly (`ExecuteQueryRetryAsync`, `GetContextAsync`).
- Never use `.Result`, `.Wait()`, or `Task.GetAwaiter().GetResult()` on the UI thread.
- Establish a project-wide convention: all ViewModels execute SharePoint operations through `async Task` methods with `CancellationToken` parameters. Codify this in architecture docs from Phase 1.
- Use `ConfigureAwait(false)` in all service/repository layer code (below ViewModel level) so continuations do not need to return to the UI thread unnecessarily.
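
The layering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `SiteTitleService`, `SiteViewModel`, and `LoadTitleBlocking` are hypothetical names, and the blocking call is simulated with `Thread.Sleep` so the example runs without a tenant.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical service layer: stands in for code that wraps blocking PnP/CSOM calls.
public sealed class SiteTitleService
{
    // Simulates a blocking network call such as a synchronous ExecuteQuery().
    private static string LoadTitleBlocking(string siteUrl)
    {
        Thread.Sleep(200);
        return $"Title of {siteUrl}";
    }

    public async Task<string> GetTitleAsync(string siteUrl, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        // Task.Run moves the blocking call onto a thread-pool thread.
        // ConfigureAwait(false): service code does not need to resume on the UI thread.
        return await Task.Run(() => LoadTitleBlocking(siteUrl), ct).ConfigureAwait(false);
    }
}

// Hypothetical ViewModel: exposes async Task methods only, never async void.
public sealed class SiteViewModel
{
    private readonly SiteTitleService _service = new();

    public string Title { get; private set; } = "";

    public async Task LoadAsync(string siteUrl, CancellationToken ct)
    {
        // No ConfigureAwait(false) here: the continuation updates UI-bound state,
        // so it should resume on the captured (UI) context.
        Title = await _service.GetTitleAsync(siteUrl, ct);
    }
}
```

The caller awaits `LoadAsync` from an async command; blocking on it with `.Result` or `.Wait()` from the UI thread would reintroduce the deadlock described above.
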
**Warning signs:**
- Any `void` method containing a PnP call.
- Any `Task.Result` or `.Wait()` in ViewModel or code-behind.
- Button click handlers that are not `async`.
- Application hangs for seconds at a time when switching tenants or starting operations.
**Phase to address:** Foundation/infrastructure phase (first phase). This pattern must be established before any feature work begins. Retrofitting async throughout a codebase is one of the most expensive rewrites possible.
---
### Pitfall 2: Replicating Silent Error Suppression from the PowerShell Original
**What goes wrong:**
The existing codebase has 38 empty `catch` blocks and 27 instances of `-ErrorAction SilentlyContinue`. During a rewrite, developers under time pressure port the "working" behavior, which means they replicate the silent failures. The C# version appears to work in demos but hides the same class of bugs: group member additions that silently did nothing, storage scans that silently skipped folders, JSON loads that silently returned empty defaults from corrupted files.
**Why it happens:**
Port-from-working-code instinct. The original returned a result (even if wrong), so the C# version is written to also return a result without questioning whether an error was swallowed. Also, `try { ... } catch (Exception) { }` in C# is syntactically shorter and less ceremonial than PowerShell's equivalent, making it easy to write reflexively.
**How to avoid:**
- Treat every `catch` block as code that requires a positive decision: log and recover, log and rethrow, or log and surface to the user. A `catch` that does none of these three things is a bug.
- Adopt a structured logging pattern (e.g., `ILogger<T>` with `Microsoft.Extensions.Logging`) from Phase 1 so logging is never optional.
- Create a custom `SharePointOperationException` hierarchy that preserves original exceptions and adds context (which site, which operation, which user) before rethrowing. This prevents exception swallowing during the port.
- In PR reviews, flag any empty or log-only catch blocks that do not surface the error to the user as a defect.
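
One possible shape for the exception type mentioned above is sketched here. The property names and the `GroupMembership.AddMember` helper are assumptions made for illustration; the `addMember` delegate stands in for the real CSOM/PnP call.

```csharp
using System;

// Sketch of a context-preserving exception; property names are illustrative.
public class SharePointOperationException : Exception
{
    public string SiteUrl { get; }
    public string Operation { get; }
    public string User { get; }

    public SharePointOperationException(string siteUrl, string operation, string user, Exception inner)
        : base($"{operation} failed on {siteUrl} for {user}: {inner.Message}", inner)
    {
        SiteUrl = siteUrl;
        Operation = operation;
        User = user;
    }
}

public static class GroupMembership
{
    // addMember is a placeholder for the actual SharePoint call.
    public static void AddMember(string siteUrl, string user, Action<string> addMember)
    {
        try
        {
            addMember(user);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Positive decision: add context and rethrow. The original exception
            // is kept as InnerException so no diagnostic detail is lost.
            throw new SharePointOperationException(siteUrl, "AddGroupMember", user, ex);
        }
    }
}
```

Cancellation is deliberately excluded from the wrapper so that `OperationCanceledException` still propagates unchanged to whatever handles the Cancel button.
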
**Warning signs:**
- Any `catch (Exception ex) { }` with no body.
- Any `catch` block that only calls `_logger.LogWarning` but returns a success result to the caller.
- Operations that complete in < 1 second when they should take 5 to 10 seconds (silent skip).
- Users reporting "the button did nothing" with no error shown.
**Phase to address:** Foundation/infrastructure phase. Define the error handling strategy and base exception types before porting any features.
---
### Pitfall 3: SharePoint List View Threshold (5 000 Items) Causing Unhandled Exceptions
**What goes wrong:**
Any CSOM or PnP Framework call that queries a SharePoint list without explicit pagination throws a `Microsoft.SharePoint.Client.ServerException` with message "The attempted operation is prohibited because it exceeds the list view threshold" when the list contains more than 5 000 items. In the current PowerShell code this is partially masked by `-ErrorAction SilentlyContinue`. In C# it becomes an unhandled exception that crashes the operation unless explicitly caught and handled.
Real tenant libraries with 5 000+ files are common. Permissions reports, storage scans, and file search are all affected.
**Why it happens:**
Developers test against small tenant sites during development. The threshold is not hit, tests pass, the feature ships. First production use against a real client library fails.
**How to avoid:**
- All `GetItems`, `GetListItems`, and folder-enumeration calls must use `CamlQuery` with `RowLimit` set to a page size (500 to 2 000), iterating with `ListItemCollectionPosition` until exhausted.
- For Graph SDK paths, use the `PageIterator` pattern; never call `.GetAsync()` on a collection without a `$top` parameter.
- The storage recursion function (`Collect-FolderStorage` equivalent) must default to depth 3 to 4, not 999, and show estimated time before starting.
- Write an integration test against a seeded list of 6 000 items before shipping each feature that enumerates list items.
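
A shared pagination helper along the lines described above might look like the following sketch. It assumes the PnP.Framework package is referenced (for `ExecuteQueryRetryAsync`); the `ListPager` class and its method name are hypothetical, and the `RecursiveAll` scope is only one possible choice.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.SharePoint.Client; // CSOM types; ExecuteQueryRetryAsync is a PnP.Framework extension

public static class ListPager
{
    // Streams items one page at a time so that large libraries are never
    // requested in a single query and never held in memory all at once.
    public static async IAsyncEnumerable<ListItem> EnumerateItemsAsync(
        ClientContext ctx,
        List list,
        int pageSize = 2000,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var query = new CamlQuery
        {
            ViewXml = $"<View Scope='RecursiveAll'><RowLimit Paged='TRUE'>{pageSize}</RowLimit></View>"
        };

        do
        {
            ct.ThrowIfCancellationRequested();

            ListItemCollection page = list.GetItems(query);
            ctx.Load(page);
            await ctx.ExecuteQueryRetryAsync();

            foreach (ListItem item in page)
            {
                yield return item;
            }

            // A null position means the last page has been read.
            query.ListItemCollectionPosition = page.ListItemCollectionPosition;
        }
        while (query.ListItemCollectionPosition != null);
    }
}
```

Callers consume it with `await foreach (var item in ListPager.EnumerateItemsAsync(ctx, list, 2000, ct))`. Note that the query above has no `Where` or `OrderBy` clause; adding a filter on a non-indexed column can still trigger the threshold error even when paging.
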
**Warning signs:**
- Any `GetItems` call without a `CamlQuery` with explicit `RowLimit`.
- Any Graph SDK call to list items without `.Top(n)`.
- `ServerException` appearing in logs from client sites but not in dev testing.
**Phase to address:** Each feature phase that touches list enumeration (permissions, storage, file search). The pagination helper should be a shared utility written in the foundation phase and reused everywhere.
---
### Pitfall 4: Multi-Tenant Token Cache Race Conditions and Stale Tokens
**What goes wrong:**
The design requires cached authentication sessions so users can switch between client tenants without re-authenticating. MSAL.NET token caches are not thread-safe by default. If two background operations run concurrently against different tenants, cache read/write races produce corrupted cache state, silent auth failures, or one tenant's token being used for another tenant's request.
A secondary problem: when an Azure AD app registration's permissions change (e.g., a new Graph scope is granted), MSAL returns the cached token for the old scope. The operation fails with a 403 but looks like a permissions error, not a stale cache error, sending the developer on a false debugging path.
**Why it happens:**
Multi-tenant caching is not covered in most MSAL.NET tutorials, which show single-tenant flows. The token cache API (`TokenCacheCallback`, `BeforeAccessNotification`, `AfterAccessNotification`) is low-level and easy to implement incorrectly.
**How to avoid:**
- Use `Microsoft.Identity.Client.Extensions.Msal` (`MsalCacheHelper`) for file-based, cross-process-safe token persistence. This is the Microsoft-recommended approach for desktop public client apps.
- The `AuthenticationManager` instance in PnP Framework accepts a `tokenCacheCallback`; wire it to `MsalCacheHelper` so cache is persisted safely per-tenant.
- Scope the `IPublicClientApplication` instance per-ClientId (app registration), not per-tenant URL. Different tenants share the same client app but have different account entries in the cache.
- Implement an explicit "clear cache for tenant" action in the UI so users can force re-authentication when permissions change.
- Never share a single `AuthenticationManager` instance across concurrent operations on different tenants without locking.
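
Wiring `MsalCacheHelper` to a long-lived public client could look roughly like this. It is a sketch under assumptions: the `MsalClientFactory` name, the cache file name, the cache directory, and the `http://localhost` redirect URI are all illustrative choices, not values taken from the project.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Identity.Client;
using Microsoft.Identity.Client.Extensions.Msal;

public static class MsalClientFactory
{
    // Creates one IPublicClientApplication per ClientId and attaches a
    // file-backed token cache that is safe across processes.
    public static async Task<IPublicClientApplication> CreateAsync(string clientId)
    {
        IPublicClientApplication app = PublicClientApplicationBuilder
            .Create(clientId)
            .WithAuthority(AzureCloudInstance.AzurePublic, AadAuthorityAudience.AzureAdMultipleOrgs)
            .WithRedirectUri("http://localhost") // assumed redirect URI for interactive login
            .Build();

        // Assumed cache location: one cache file per app registration.
        string cacheDir = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "SharepointToolbox");

        StorageCreationProperties storage =
            new StorageCreationPropertiesBuilder($"msal_{clientId}.cache", cacheDir).Build();

        MsalCacheHelper cacheHelper = await MsalCacheHelper.CreateAsync(storage).ConfigureAwait(false);
        cacheHelper.RegisterCache(app.UserTokenCache);

        return app;
    }
}
```

The returned instance would be kept for the lifetime of the session and reused for every tenant that signs in through the same app registration; accounts for different tenants then appear as separate entries from `app.GetAccountsAsync()`.
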
**Warning signs:**
- Intermittent 401 or 403 errors that resolve after restarting the app.
- User reports "wrong tenant data shown" (cross-tenant token bleed).
- `MsalUiRequiredException` thrown only on the second or third operation of a session.
**Phase to address:** Authentication/multi-tenant infrastructure phase (early, before any feature uses the auth layer).
---
### Pitfall 5: WPF ObservableCollection Updates from Background Threads
**What goes wrong:**
Populating a `DataGrid` or `ListView` bound to an `ObservableCollection<T>` from a background `Task` or `Task.Run` throws a `NotSupportedException`: "This type of CollectionView does not support changes to its SourceCollection from a thread different from the Dispatcher thread." The exception crashes the background operation. If it is swallowed (see Pitfall 2), the UI simply does not update.
This maps directly to the current app's runspace-to-UI communication via synchronized hashtables polled by a timer. The C# version must use the Dispatcher or the MVVM toolkit equivalently.
**Why it happens:**
The body of a `Task.Run` lambda runs on a thread pool thread, not the UI thread. Developers add items to the collection inside that lambda. It can appear to work in small-scale testing, where timing hides the problem, but fails under load.
**How to avoid:**
- Never add items to an `ObservableCollection<T>` from a non-UI thread.
- Preferred pattern: collect results into a plain `List<T>` on the background thread, then `await Application.Current.Dispatcher.InvokeAsync(() => { Items = new ObservableCollection<T>(list); })` in one atomic swap.
- For streaming progress (show items as they arrive), use `BindingOperations.EnableCollectionSynchronization` with a lock object at initialization, then add items with the lock held.
- Use `IProgress<T>` with `Progress<T>` (captures the UI `SynchronizationContext` at construction) to report incremental results safely.
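
The streaming variant described above can be sketched as follows, assuming a WPF project. `ScanResultsViewModel` and its members are hypothetical names, and the `source` sequence stands in for results arriving from SharePoint.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Data; // BindingOperations (WPF)

public sealed class ScanResultsViewModel
{
    private readonly object _itemsLock = new();

    public ObservableCollection<string> Items { get; } = new();

    public ScanResultsViewModel()
    {
        // Construct the ViewModel on the UI thread so the synchronization
        // is registered with the UI thread's dispatcher.
        BindingOperations.EnableCollectionSynchronization(Items, _itemsLock);
    }

    public Task ScanAsync(IEnumerable<string> source, IProgress<int> progress, CancellationToken ct)
    {
        return Task.Run(() =>
        {
            int count = 0;
            foreach (string item in source)
            {
                ct.ThrowIfCancellationRequested();

                // Every access from the background thread holds the same lock
                // that was registered with EnableCollectionSynchronization.
                lock (_itemsLock)
                {
                    Items.Add(item);
                }

                progress.Report(++count);
            }
        }, ct);
    }
}
```

The `progress` argument would typically be a `Progress<int>` created on the UI thread, for example `new Progress<int>(n => StatusText = $"{n} items found")` with `StatusText` being a hypothetical bound property, so that the callback runs on the UI thread.
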
**Warning signs:**
- `InvalidOperationException` or `NotSupportedException` in logs referencing `CollectionView`.
- UI lists that do not update despite background operation completing.
- Items appearing out of order or partially in lists.
**Phase to address:** Foundation/infrastructure phase. Define the progress-reporting and collection-update patterns before porting any feature that returns lists of results.
---
### Pitfall 6: WPF Trimming Breaks Self-Contained EXE
**What goes wrong:**
Publishing a WPF app as a self-contained single EXE with `PublishTrimmed=true` silently removes types that WPF and XAML use via reflection at runtime. The app compiles and publishes successfully but crashes at startup or throws `TypeInitializationException` when opening a window whose XAML references a type that was trimmed. PnP Framework and MSAL also use reflection heavily; trimming removes their internal types.
**Why it happens:**
The .NET trimmer performs static analysis and removes code it cannot prove is referenced. XAML data binding, converters, `DataTemplateSelector`, `IValueConverter`, and `DynamicResource` are resolved at runtime via reflection — the trimmer cannot see these references.
**How to avoid:**
- Do not use `PublishTrimmed=true` for WPF + PnP Framework + MSAL projects. The EXE will be larger (~150 MB self-contained is expected and acceptable per PROJECT.md).
- Use `PublishSingleFile=true` with `SelfContained=true` and `IncludeAllContentForSelfExtract=true`, but without trimming. This bundles the runtime into the EXE correctly.
- Verify the single-file output in CI by running the EXE on a clean machine (no .NET installed) before each release.
- Set `<PublishReadyToRun>true</PublishReadyToRun>` for startup performance improvement instead of trimming.
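
As a rough illustration, the flags listed above could be combined in the project file or a publish profile like this. The `win-x64` runtime identifier is an assumption; the rest mirrors the settings named in the list.

```xml
<PropertyGroup>
  <!-- Assumed target; adjust to the platforms actually supported -->
  <RuntimeIdentifier>win-x64</RuntimeIdentifier>
  <SelfContained>true</SelfContained>
  <PublishSingleFile>true</PublishSingleFile>
  <IncludeAllContentForSelfExtract>true</IncludeAllContentForSelfExtract>
  <PublishReadyToRun>true</PublishReadyToRun>
  <!-- Trimming stays off: WPF, PnP Framework and MSAL rely on reflection -->
  <PublishTrimmed>false</PublishTrimmed>
</PropertyGroup>
```
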
**Warning signs:**
- Publish profile has `<PublishTrimmed>true</PublishTrimmed>`.
- "Works on dev machine, crashes on client machine" with `TypeInitializationException` or `MissingMethodException`.
- EXE is suspiciously small (< 50 MB for a self-contained WPF app).
**Phase to address:** Distribution/packaging phase. Establish the publish profile with correct flags before any release packaging work.
---
### Pitfall 7: Async Void in Command Handlers Swallows Exceptions
**What goes wrong:**
In WPF, button `Click` event handlers are `void`-returning delegates. Developers writing `async void` handlers (e.g., `private async void OnRunButtonClick(...)`) create methods where exceptions thrown after an `await` are raised on the `SynchronizationContext` rather than returned as a faulted `Task`. These exceptions cannot be caught by a caller and will crash the process (or be silently eaten by `Application.DispatcherUnhandledException` without the stack context needed to debug them).
**Why it happens:**
MVVM `ICommand` requires a `void Execute(object parameter)` signature. New C# developers write `async void Execute(...)` without understanding the consequence. The `CommunityToolkit.Mvvm` provides `AsyncRelayCommand` to solve this correctly, but it is not the obvious choice.
**How to avoid:**
- Never write `async void` anywhere in the codebase except the required WPF event handler entry points in code-behind, and only when those entry points immediately delegate to an `async Task` ViewModel method.
- Use `AsyncRelayCommand` from `CommunityToolkit.Mvvm` for all commands that invoke async operations. It wraps the `Task`, exposes `ExecutionTask`, `IsRunning`, and `IsCancellationRequested`, and handles exceptions via `AsyncRelayCommandOptions.FlowExceptionsToTaskScheduler`.
- Wire a global `Application.DispatcherUnhandledException` handler and `TaskScheduler.UnobservedTaskException` handler that log full stack traces and show a user-facing error dialog. This is the last line of defense.
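
A command built on `AsyncRelayCommand` might be declared as in this sketch, assuming the CommunityToolkit.Mvvm package is referenced. `ScanViewModel` and `RunScanAsync` are hypothetical names, and the scan body is replaced by a delay.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using CommunityToolkit.Mvvm.ComponentModel;
using CommunityToolkit.Mvvm.Input;

public sealed class ScanViewModel : ObservableObject
{
    // Bind a button to RunScanCommand; IsRunning can drive a progress indicator.
    public IAsyncRelayCommand RunScanCommand { get; }

    public ScanViewModel()
    {
        // The Func<CancellationToken, Task> overload lets the command supply
        // a token that is signalled when Cancel() is called on the command.
        RunScanCommand = new AsyncRelayCommand(RunScanAsync);
    }

    // async Task, not async void: the command observes the returned task.
    private async Task RunScanAsync(CancellationToken ct)
    {
        // Placeholder for the real SharePoint operation.
        await Task.Delay(TimeSpan.FromSeconds(1), ct);
    }
}
```

A Cancel button can then call `RunScanCommand.Cancel()`, and the ViewModel never needs an `async void` method of its own.
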
**Warning signs:**
- Any `async void` method outside of a `MainWindow.xaml.cs` entry point.
- Commands implemented as `async void Execute(...)` in ViewModels.
- Exceptions that appear in logs with no originating ViewModel context.
**Phase to address:** Foundation/infrastructure phase (MVVM base classes and command patterns established before any feature code).
---
### Pitfall 8: SharePoint API Throttling Not Handled (429/503)
**What goes wrong:**
SharePoint Online and Microsoft Graph enforce per-app, per-tenant throttling. Bulk operations (permissions scan across 50+ sites, storage scan on 10 000+ folders, bulk member additions) generate enough API calls to trigger HTTP 429 or 503 responses. Without explicit retry-after handling, the operation fails partway through with an unhandled `HttpRequestException` and leaves the user with partial results and no indication of how to resume.
**Why it happens:**
PnP.PowerShell handled this invisibly for the PowerShell app. PnP Framework in C# does have built-in retry via `ExecuteQueryRetryAsync`, but developers unfamiliar with C#-side PnP may use the raw CSOM `ExecuteQuery()` or direct `HttpClient` calls that lack this protection.
**How to avoid:**
- Always use `ExecuteQueryRetryAsync` (never `ExecuteQuery`) for all CSOM batch calls.
- When using Graph SDK, use the `GraphServiceClient` with the default retry handler enabled — it handles 429 with `Retry-After` header respect automatically.
- For multi-site bulk operations, add a short delay (100 to 300 ms) between site connections to avoid burst throttling. Implement a configurable concurrency limit (default: sequential or max 3 parallel).
- Surface throttling events in the progress log: "Rate limited, retrying in 15s…" so the user knows the operation is paused, not hung.
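
The concurrency limit and inter-site delay described above could be sketched like this. `BulkRunner` is a hypothetical helper, and `processSite` stands in for whatever per-site work is performed; the default values are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BulkRunner
{
    // Runs processSite for every URL with at most maxParallel sites in flight,
    // pausing briefly before each site to avoid bursts of requests.
    public static async Task<TResult[]> RunAsync<TResult>(
        IEnumerable<string> siteUrls,
        Func<string, CancellationToken, Task<TResult>> processSite,
        int maxParallel = 3,
        int delayMs = 200,
        CancellationToken ct = default)
    {
        using var gate = new SemaphoreSlim(maxParallel, maxParallel);

        List<Task<TResult>> tasks = siteUrls.Select(async url =>
        {
            await gate.WaitAsync(ct).ConfigureAwait(false);
            try
            {
                await Task.Delay(delayMs, ct).ConfigureAwait(false);
                return await processSite(url, ct).ConfigureAwait(false);
            }
            finally
            {
                // Always release the slot, even on failure or cancellation.
                gate.Release();
            }
        }).ToList();

        // Results come back in the same order as the input URLs.
        return await Task.WhenAll(tasks).ConfigureAwait(false);
    }
}
```

Setting `maxParallel` to 1 gives the strictly sequential behaviour, which keeps a single code path for both modes.
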
**Warning signs:**
- Raw `ExecuteQuery()` calls anywhere in the codebase.
- `HttpRequestException` with 429 status in logs.
- Operations that fail consistently at the same approximate item count across multiple runs.
**Phase to address:** Foundation/infrastructure phase for the retry handler; each feature phase must use the established pattern.
---
### Pitfall 9: Resource Disposal Gaps in Long-Running Operations
**What goes wrong:**
`ClientContext` objects returned by `AuthenticationManager.GetContext()` are `IDisposable`. If a background `Task` is cancelled or throws an exception mid-operation, a `ClientContext` created in the try block is not disposed if the `finally` block is missing. Over a long session (MSP workflow: dozens of tenant switches, multiple scans), leaked `ClientContext` objects accumulate unmanaged resources and eventually cause connection refusals or memory degradation. This is the C# equivalent of the runspace disposal gaps in the current codebase.
**Why it happens:**
`using` statements are the idiomatic C# solution, but they do not compose well with async cancellation. Developers use `try/catch` without `finally`, or structure the code so the `using` scope is exited before the `Task` completes.
**How to avoid:**
- Always obtain `ClientContext` inside a `using` statement or `using` declaration, for example `using var ctx = await authManager.GetContextAsync(url, token)`. `ClientContext` implements `IDisposable`, not `IAsyncDisposable`, so `await using` does not apply to it.
- Wrap the entire operation body in `try/finally` with disposal in the `finally` block when a single `using` scope cannot cover the whole operation.
- When a `CancellationToken` is triggered, let the `OperationCanceledException` propagate naturally; the `using` / `finally` will still execute.
- Add a unit test for the "cancelled mid-operation" path that verifies `ClientContext.Dispose()` is called.
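
The "cancelled mid-operation" check can be illustrated without a tenant by substituting a fake for the context. In this sketch `FakeContext` is a stand-in for `ClientContext` and `ScanOperation` is a hypothetical operation; both are assumptions for illustration only.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for ClientContext: records whether Dispose was called.
public sealed class FakeContext : IDisposable
{
    public bool Disposed { get; private set; }

    public void Dispose() => Disposed = true;
}

public static class ScanOperation
{
    // acquire is a placeholder for the real context acquisition call.
    public static async Task RunAsync(Func<FakeContext> acquire, CancellationToken ct)
    {
        // The using declaration disposes ctx when the method exits,
        // including when an OperationCanceledException propagates.
        using FakeContext ctx = acquire();

        // Simulates a long-running scan that only ends through cancellation.
        await Task.Delay(Timeout.Infinite, ct).ConfigureAwait(false);
    }
}
```

A test then cancels the token shortly after starting `RunAsync`, expects an `OperationCanceledException`, and asserts that `Disposed` is true on the fake.
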
**Warning signs:**
- `GetContext` calls without `using`.
- `catch (Exception) { return; }` that bypasses a `ClientContext` created earlier in the method.
- Memory growth over a multi-hour MSP session visible in Task Manager.
**Phase to address:** Foundation/infrastructure phase (define the context acquisition pattern) and validated in each feature phase.
---
### Pitfall 10: JSON Settings Corruption on Concurrent Writes
**What goes wrong:**
The app writes profiles, settings, and templates to JSON files on disk. If the user triggers two rapid operations (e.g., saves a profile while a background scan completes and updates settings), both code paths may attempt to write the same file simultaneously. The second write overwrites a partially-written first write, producing a truncated or syntactically invalid JSON file. On next startup, the file fails to parse and silently returns empty defaults — erasing all user profiles.
This is a known bug in the current app (CONCERNS.md: "Profile JSON file: no transaction semantics").
**Why it happens:**
File I/O is not inherently thread-safe. `System.Text.Json`'s `JsonSerializer.SerializeAsync` writes to a stream but does not protect the file from concurrent access by another code path.
**How to avoid:**
- Serialize all writes to each JSON file through a single `SemaphoreSlim(1)` per file. Acquire before reading or writing, release in `finally`.
- Use write-then-replace: write to `filename.tmp`, validate the JSON by deserializing it, then `File.Move(tmp, original, overwrite: true)`. An interrupted write leaves the original intact.
- On startup, if the primary file is invalid, check for a `.tmp` or `.bak` version before falling back to defaults — and log which fallback was used.
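
The per-file lock and write-then-replace steps above can be sketched as a small store class. `JsonFileStore<T>` is a hypothetical name, the `.tmp` suffix follows the convention mentioned above, and backup handling on startup is left out to keep the example short.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// One instance per file: all reads and writes go through the same gate.
public sealed class JsonFileStore<T>
{
    private readonly string _path;
    private readonly SemaphoreSlim _gate = new(1, 1);

    public JsonFileStore(string path) => _path = path;

    public async Task SaveAsync(T value, CancellationToken ct = default)
    {
        await _gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            string tmp = _path + ".tmp";
            await File.WriteAllTextAsync(tmp, JsonSerializer.Serialize(value), ct).ConfigureAwait(false);

            // Validate the temp file by parsing it; throws JsonException if it is invalid.
            string written = await File.ReadAllTextAsync(tmp, ct).ConfigureAwait(false);
            _ = JsonSerializer.Deserialize<T>(written);

            // Replace the original only after the temp file is known to be good.
            File.Move(tmp, _path, overwrite: true);
        }
        finally
        {
            _gate.Release();
        }
    }

    public async Task<T?> LoadAsync(CancellationToken ct = default)
    {
        await _gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            if (!File.Exists(_path)) return default;
            string json = await File.ReadAllTextAsync(_path, ct).ConfigureAwait(false);
            return JsonSerializer.Deserialize<T>(json);
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

The lock only protects callers that share the same `JsonFileStore<T>` instance, so each file should have exactly one store registered for the lifetime of the app.
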
**Warning signs:**
- Profile file occasionally empty after normal use.
- `JsonException` on startup that the user cannot reproduce on demand.
- App loaded with correct profiles yesterday, empty profiles today.
**Phase to address:** Foundation/infrastructure phase (data access layer). Must be solved before any feature persists data.
---
## Technical Debt Patterns
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Copy PowerShell logic verbatim into a `Task.Run` | Fast initial port, works locally | Inherits all silent failures, no cancellation, no progress reporting | Never — always re-examine the logic |
| `async void` command handlers | Compiles and runs | Exceptions crash app silently; no cancellation propagation | Only for WPF event entry points that immediately call `async Task` |
| Direct `ExecuteQuery()` without retry | Simpler call site | Crashes on throttling for real client tenants | Never — use `ExecuteQueryRetryAsync` |
| Single shared `AuthenticationManager` instance | Simple instantiation | Token cache race conditions under concurrent operations | Only if all operations are strictly sequential (initial MVP, clearly documented) |
| Load entire list into memory before display | Simple binding | `OutOfMemoryException` on libraries with 50k+ items | Only for lists known to be small and bounded (e.g., profiles list) |
| No `CancellationToken` propagation | Simpler method signatures | Operations cannot be cancelled; UI stuck waiting | Never for operations > 2 seconds |
| Hard-code English fallback strings in code | Quick to write | Breaks FR locale; strings diverge from key system | Never — always use resource keys |
---
## Integration Gotchas
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| PnP Framework `GetContext` | Calling on UI thread synchronously | Always `await Task.Run(() => authManager.GetContext(...))` or use `GetContextAsync` |
| MSAL token cache (multi-tenant) | One `IPublicClientApplication` per call | One `IPublicClientApplication` per ClientId, long-lived, with `MsalCacheHelper` wired |
| SharePoint list enumeration | No `RowLimit` in `CamlQuery` | Always paginate with `RowLimit` ≤ 2 000 and `ListItemCollectionPosition` |
| Graph SDK paging | Calling `.GetAsync()` on collections without `$top` | Use `PageIterator` or explicit `.Top(n)` on every collection request |
| PnP `ExecuteQueryRetryAsync` | Forgetting to `await`; using synchronous `ExecuteQuery` | Always `await ctx.ExecuteQueryRetryAsync()` |
| WPF `ObservableCollection` | Modifying from `Task.Run` lambda | Collect into `List<T>`, then assign via `Dispatcher.InvokeAsync` |
| PnP Management Shell client ID | Using the shared PnP app ID in a multi-tenant production tool | Register a dedicated Azure AD app per deployment; don't rely on PnP's shared registration |
| SharePoint Search API (KQL) | No result limit, assuming all results returned | Always set `RowLimit`; results capped at 500 per page, max 50 000 total |
---
## Performance Traps
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Loading all `ObservableCollection` items before displaying any | UI freezes until entire operation completes | Use `IProgress<T>` to stream items as they arrive; enable UI virtualization | Any list > ~500 items |
| WPF virtualization disabled by `ScrollViewer.CanContentScroll=False` or grouping | DataGrid scroll is sluggish with 200+ rows | Never disable `CanContentScroll`; set `VirtualizingPanel.IsVirtualizingWhenGrouping=True` | > 200 rows in a DataGrid |
| Adding items to `ObservableCollection` one-by-one from background | Thousands of UI binding notifications; UI jank | Batch-load: assign `new ObservableCollection<T>(list)` once | > 50 items added in a loop |
| Permissions scan without depth limit | Scan takes hours on deep folder structures | Default depth 3 to 4; show estimated time; require explicit user override for deeper | Sites with > 5 folder levels |
| HTML report built entirely in memory | `OutOfMemoryException` or report generation takes minutes | Stream HTML to file; write rows as they are produced, not after full scan | > 10 000 rows in report |
| Sequential site processing for multi-site reports | Report for 20 sites takes 20× single-site time | Process up to 3 sites concurrently with `SemaphoreSlim`; show per-site progress | > 5 sites selected |
| Duplicate `Connect-PnPOnline` calls per operation | Redundant browser popups or token refreshes | Cache authenticated `ClientContext` per (tenant, clientId) for session lifetime | Any operation that reconnects unnecessarily |
---
## Security Mistakes
| Mistake | Risk | Prevention |
|---------|------|------------|
| Storing Client ID in plaintext JSON profile | Low on its own (Client ID is not a secret), but combined with tenant URL it eases targeted phishing | Document that Client ID is not a secret; optionally encrypt the profile file with DPAPI `ProtectedData.Protect` for defence-in-depth |
| Writing temp files with tenant credentials to `%TEMP%` | File readable by other processes on the same user account; not cleaned up on crash | Use `SecureString` in-memory for transient auth data; delete temp files in `finally` blocks; prefer named pipes or in-memory channels |
| No validation of tenant URL format before connecting | Typo sends auth token to wrong endpoint; user confused by misleading auth error | Validate against regex `^https://[a-zA-Z0-9-]+\.sharepoint\.com` before any connection attempt |
| Logging full exception messages that include HTTP request URLs | Tenant URLs and item paths exposed in log files readable on shared machines | Strip or redact SharePoint URLs in log output at `Debug` level; keep them out of `Information`-level user-visible logs |
| Bundling PnP Management Shell client ID (shared multi-tenant app) | App uses a shared identity not owned by the deploying organisation; harder to audit and revoke | Require each deployment to use a dedicated app registration; document the registration steps clearly |
---
## UX Pitfalls
| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| No cancellation for operations > 5 seconds | User closes app via Task Manager; loses in-progress results; must restart | Every operation exposed in UI must accept a `CancellationToken`; show a "Cancel" button that is always enabled during operation |
| Progress bar with no ETA or item count | User cannot judge whether to wait or cancel | Show "Scanned X of Y sites" or "X items found"; update at least every 0.5 s |
| Error messages showing raw exception text | Non-technical admin users see stack traces and `ServerException: CSOM call failed` | Translate known error types to plain-language messages; offer a "Copy technical details" link for support escalation |
| Silent success on bulk operations with partial failures | User thinks all 50 members were added; 12 failed silently | Show a per-item result summary: "38 added successfully, 12 failed — see details" |
| Language switches require app restart | FR-speaking users see flickering English then French on startup | Load correct language before any UI is shown; apply language from settings before `InitializeComponent` |
| Permissions report jargon ("Full Control", "Contribute", "Limited Access") shown raw | Non-technical stakeholders do not understand the report | Map SharePoint permission levels to plain-language equivalents in the report output; keep raw names in a "technical details" expandable section |
---
## "Looks Done But Isn't" Checklist
- [ ] **Multi-tenant session switching:** Verify that switching from Tenant A to Tenant B does not return Tenant A's data. Test with two real tenants, not two sites in the same tenant.
- [ ] **Operation cancellation:** Verify that pressing Cancel stops the operation within 2 seconds and leaves no zombie threads or unreleased `ClientContext` objects.
- [ ] **5 000+ item libraries:** Verify permissions report and storage scan complete without `ServerException` on a real library with > 5 000 items (not a test tenant with 50 items).
- [ ] **Self-contained EXE on clean machine:** Install the EXE on a machine with no .NET runtime installed; verify startup and a complete workflow before every release.
- [ ] **JSON file corruption recovery:** Corrupt a profile JSON file manually; verify the app starts, logs the corruption, does not silently return empty profiles, and preserves the backup.
- [ ] **Concurrent writes:** Simultaneously trigger "Save profile" and "Export settings" from two rapid button clicks; verify neither file is truncated.
- [ ] **Large HTML reports:** Generate a permissions report for a site with > 5 000 items; verify the HTML file opens in a browser in < 10 seconds and the DataGrid is scrollable.
- [ ] **FR locale completeness:** Switch to French; verify no UI string shows an untranslated key or hardcoded English text.
- [ ] **Throttling recovery:** Simulate a 429 response; verify the operation pauses, logs "Retrying in Xs", and completes successfully after the retry interval.
---
## Recovery Strategies
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| Async/sync deadlocks introduced in foundation | HIGH — requires refactoring all affected call chains | Identify all `.Result`/`.Wait()` calls with a codebase grep; convert bottom-up (services first, then ViewModels) |
| Silent failures ported from PowerShell | MEDIUM — requires audit of every catch block | Search all `catch` blocks; classify each as log-and-recover, log-and-rethrow, or log-and-surface; fix one feature at a time |
| Token cache corruption | LOW — clear the cache file and re-authenticate | Expose a "Clear cached sessions" action in the UI; document in troubleshooting guide |
| JSON profile file corruption | LOW if backup exists, HIGH if no backup | Implement write-then-replace before first release; add backup-on-corrupt logic to deserializer |
| WPF trimming breaks EXE | MEDIUM — need to republish with trimming disabled | Update publish profile, re-run publish, retest EXE on clean machine |
| Missing pagination on large lists | MEDIUM — need to refactor per-feature enumeration | Create shared pagination helper; replace calls feature by feature; test each against 6 000-item library |
---
## Pitfall-to-Phase Mapping
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| Sync/async deadlocks on UI thread | Phase 1: Foundation — establish async-first patterns | Code review checklist: no `.Result`/`.Wait()` in any ViewModel or event handler |
| Silent error suppression replication | Phase 1: Foundation — define error handling strategy and base types | Automated lint rule (Roslyn analyser or SonarQube) flagging empty catch blocks |
| SharePoint 5 000-item threshold | Phase 1: Foundation — write shared paginator; reused in all features | Integration test against 6 000-item library for every feature that enumerates lists |
| Multi-tenant token cache race | Phase 1: Foundation — auth layer with `MsalCacheHelper` | Test: two concurrent operations on different tenants return correct data |
| ObservableCollection cross-thread updates | Phase 1: Foundation — define progress-reporting pattern | Automated test: populate collection from background thread; verify no exception |
| WPF trimming breaks EXE | Final distribution phase | CI step: run published EXE on a clean Windows VM, assert startup and one workflow completes |
| Async void command handlers | Phase 1: Foundation — establish MVVM base with `AsyncRelayCommand` | Code review: no `async void` in ViewModel files |
| API throttling unhandled | Phase 1: Foundation — retry handler; applied by every feature | Load test: run storage scan against a tenant with rate-limiting; verify retry log entry |
| Resource disposal gaps | Phase 1: Foundation — context acquisition pattern | Unit test: cancel a long operation mid-run; verify `ClientContext.Dispose` called |
| JSON concurrent write corruption | Phase 1: Foundation — write-then-replace + `SemaphoreSlim` | Stress test: 100 concurrent save calls; verify file always parseable after all complete |
---
## Sources
- PnP Framework GitHub issue #961: `AuthenticationManager.GetContext` freeze in C# desktop app — https://github.com/pnp/pnpframework/issues/961
- PnP Framework GitHub issue #447: `AuthenticationManager.GetContext` hanging in ASP.NET — https://github.com/pnp/pnpframework/issues/447
- Microsoft Learn: Token cache serialization (MSAL.NET) — https://learn.microsoft.com/en-us/entra/msal/dotnet/how-to/token-cache-serialization
- Microsoft Learn: SharePoint Online list view threshold — https://learn.microsoft.com/en-us/troubleshoot/sharepoint/lists-and-libraries/items-exceeds-list-view-threshold
- Microsoft Learn: Single-file publishing overview — https://learn.microsoft.com/en-us/dotnet/core/deploying/single-file/overview
- dotnet/wpf GitHub issue #4216: `PublishTrimmed` causes `Unhandled Exception` in self-contained WPF app — https://github.com/dotnet/wpf/issues/4216
- dotnet/wpf GitHub issue #6096: Trimming for WPF — https://github.com/dotnet/wpf/issues/6096
- Microsoft .NET Blog: Await, and UI, and deadlocks — https://devblogs.microsoft.com/dotnet/await-and-ui-and-deadlocks-oh-my/
- Microsoft Learn: AsyncRelayCommand (CommunityToolkit.Mvvm) — https://learn.microsoft.com/en-us/dotnet/communitytoolkit/mvvm/asyncrelaycommand
- Microsoft Learn: Graph SDK paging — https://learn.microsoft.com/en-us/graph/sdks/paging
- Microsoft Learn: Graph throttling guidance — https://learn.microsoft.com/en-us/graph/throttling
- Rick Strahl's Web Log: Async and Async Void Event Handling in WPF — https://weblog.west-wind.com/posts/2022/Apr/22/Async-and-Async-Void-Event-Handling-in-WPF
- Existing codebase CONCERNS.md audit (2026-04-02) — `.planning/codebase/CONCERNS.md`
---
*Pitfalls research for: C#/WPF SharePoint Online administration desktop tool (PowerShell-to-C# rewrite)*
*Researched: 2026-04-02*
---
# v2.2 Pitfalls: Report Branding & User Directory
**Milestone:** v2.2 — HTML report branding (MSP/client logos) + user directory browse mode
**Researched:** 2026-04-08
**Confidence:** HIGH for logo handling and Graph pagination (multiple authoritative sources); MEDIUM for print CSS specifics (verified via MDN/W3C but browser rendering varies)
These pitfalls are specific to adding logo branding to the existing HTML export services and replacing the people-picker search with a full directory browse mode. They complement the v1.0 foundation pitfalls above.
---
## Critical Pitfalls (v2.2)
### Pitfall v2.2-1: Base64 Logo Encoding Bloats Every Report File
**What goes wrong:**
The five existing HTML export services (`HtmlExportService`, `UserAccessHtmlExportService`, `StorageHtmlExportService`, `SearchHtmlExportService`, `DuplicatesHtmlExportService`) are self-contained by design — no external dependencies. The natural instinct is to embed logos as inline `data:image/...;base64,...` strings in the `<style>` or `<img src>` tag of every report. This works, but base64 encoding inflates image size by ~33%. A 200 KB PNG logo becomes 267 KB of base64 text, inlined into every single exported HTML file. An MSP generating 10 reports per client per month accumulates significant bloat per file, and the logo data is re-read, re-encoded, and re-concatenated into the `StringBuilder` on every export call.
The secondary problem is that `StringBuilder.AppendLine` with a very long base64 string (a 500 KB logo becomes ~667 KB of text) causes a single string allocation of that size per report, wasted immediately after the file is written.
**Why it happens:**
The "self-contained HTML" design goal (no external files) is correct for portability. Developers apply it literally and embed every image inline. They test with a small 20 KB PNG and never notice. Production logos from clients are often 300 to 600 KB originals.
**Consequences:**
- Report files are 300 to 700 KB larger than necessary: not catastrophic, but noticeable when opening in a browser.
- Logo bytes are re-allocated in memory on every export call — fine for occasional use, wasteful in batch scenarios.
- If the same logo is stored in `AppSettings` or `TenantProfile` as a raw file path, it is read from disk and re-encoded on every export. File I/O error at export time if the path is invalid.
**Prevention:**
1. Enforce a file size limit at import time: reject logos > 512 KB. Display a warning in the settings UI. This keeps base64 strings under ~700 KB worst case.
2. Cache the base64 string. Store it in the `AppSettings`/`TenantProfile` model as the pre-encoded base64 string (not the original file path), so it is computed once on import and reused on every export. `TenantProfile` and `AppSettings` already serialize to JSON — base64 strings serialize cleanly.
3. Enforce image dimensions in the import UI: warn if the image is wider than 800 px and suggest the user downscale. A 200×60 px logo at 72 dpi is sufficient for an HTML report header.
4. When reading from the JSON-persisted base64 string, do not re-decode and re-encode. Inject it directly into the `<img src="data:image/png;base64,{cachedBase64}">` tag.
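The import-time steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `LogoEncoder`, `TryEncode`, and `BuildImgTag` are hypothetical names, and the 512 KB limit is the one proposed in the list above.

```csharp
using System;

/// <summary>Hypothetical helper: validates logo size and pre-encodes it once at import time.</summary>
public static class LogoEncoder
{
    // 512 KB import limit, taken from the prevention list above.
    public const int MaxLogoBytes = 512 * 1024;

    // Base64 emits 4 characters per 3 input bytes, rounded up to a multiple of 4.
    public static int EstimateBase64Length(int byteCount) => (byteCount + 2) / 3 * 4;

    public static bool TryEncode(byte[] logoBytes, out string base64, out string error)
    {
        base64 = string.Empty;
        error = string.Empty;
        if (logoBytes is null || logoBytes.Length == 0)
        {
            error = "The logo file is empty.";
            return false;
        }
        if (logoBytes.Length > MaxLogoBytes)
        {
            error = $"Logo is {logoBytes.Length / 1024} KB; the limit is {MaxLogoBytes / 1024} KB.";
            return false;
        }
        // Encode once; the result is what gets persisted to JSON and reused on every export.
        base64 = Convert.ToBase64String(logoBytes);
        return true;
    }

    // Injects the cached string as-is; mimeType should match the imported format.
    public static string BuildImgTag(string cachedBase64, string mimeType = "image/png") =>
        $"<img class=\"report-logo\" src=\"data:{mimeType};base64,{cachedBase64}\" alt=\"Logo\">";
}
```

With this shape, the export services never touch the original file: they receive the stored string and call `BuildImgTag`, so a missing or moved source file cannot fail an export.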
**Detection:**
- Export a report and check the generated HTML file size. If it is > 100 KB before any data rows are added, the logo is too large.
- Profile `BuildHtml` with a 500 KB logo attached — memory allocation spike is visible in the .NET diagnostic tools.
**Phase to address:** Logo import/settings phase. The size validation and pre-encoding strategy must be established before any export service is modified to accept logo parameters. If the export services are modified first with raw file-path injection, every caller must be updated again later.
---
### Pitfall v2.2-2: Graph API Full Directory Listing Requires Explicit Pagination — 999-User Hard Cap Per Page
**What goes wrong:**
The existing `GraphUserSearchService` uses `$filter` with `startsWith` and `$top=10`: a narrow search, not a full listing. The new user directory browse mode needs to fetch all users in a tenant. Graph API `GET /users` returns a maximum of 999 users per page (not 1000; the valid range for `$top` is 1 to 999). Without explicit pagination using `@odata.nextLink`, the call silently returns at most one page (999 users with `$top=999`) regardless of tenant size. A 5 000-user tenant appears to have 999 users in the directory with no error or indication of truncation.
**Why it happens:**
Developers see `$top=999` and assume a single call returns everything for "normal" tenants. The Graph SDK's `.GetAsync()` call returns a `UserCollectionResponse` with a `Value` list and an `OdataNextLink` property. If `OdataNextLink` is not checked, pagination stops after the first page. The existing `SearchUsersAsync` intentionally returns only 10 results — the pagination concern was never encountered there.
**Consequences:**
- The directory browse mode silently shows fewer users than the tenant contains.
- An MSP auditing a 3 000-user client tenant sees only 999 users with no warning.
- Guest/service accounts in the first 999 may appear; those after page 1 are invisible.
**Prevention:**
Use the Graph SDK's `PageIterator<User, UserCollectionResponse>` for all full directory fetches. This is the Graph SDK's built-in mechanism for transparent pagination:
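```csharp
using Microsoft.Graph;
using Microsoft.Graph.Models;

var users = new List<User>();
// First page: select only the fields the directory view needs.
var response = await graphClient.Users.GetAsync(config =>
{
    config.QueryParameters.Select = new[] { "displayName", "userPrincipalName", "mail", "userType" };
    config.QueryParameters.Top = 999; // Maximum page size for /users
    config.QueryParameters.Orderby = new[] { "displayName" };
}, ct);
// PageIterator follows @odata.nextLink until no pages remain.
var pageIterator = PageIterator<User, UserCollectionResponse>.CreatePageIterator(
    graphClient,
    response,
    user => { users.Add(user); return true; }, // Return false to stop iterating early
    // Re-adds the header on each next-page request; drop it if $count/$search are not used (see Pitfall v2.2-8).
    request => { request.Headers.Add("ConsistencyLevel", "eventual"); return request; });
await pageIterator.IterateAsync(ct);
```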
Always pass `CancellationToken` through the iterator. For tenants with 10 000+ users, this will make multiple sequential API calls — surface progress to the user ("Loading directory... X users loaded").
**Detection:**
- Request `$count=true` with `ConsistencyLevel: eventual` on the first page call. Compare the returned `@odata.count` to the number of items received after full iteration. If they differ, pagination was incomplete.
- Test against a tenant with > 1 000 users before shipping the directory browse feature.
**Phase to address:** User directory browse implementation phase. The interface `IGraphUserSearchService` will need a new method `GetAllUsersAsync` alongside the existing `SearchUsersAsync` — do not collapse them.
---
### Pitfall v2.2-3: Graph API Directory Listing Returns Guest, Service, and Disabled Accounts Without Filtering
**What goes wrong:**
`GET /users` returns all user objects in the tenant: active members, disabled accounts, B2B guest users (`userType eq 'Guest'`), on-premises sync accounts, and service/bot accounts. In an MSP context, a client's SharePoint tenant may have dozens of guest users from external collaborators and several service accounts (e.g., `sharepoint@clientdomain.com`, `MicrosoftTeams@clientdomain.com`). If the directory browse mode shows all 3 000 raw entries, admins spend time scrolling past noise to find real staff.
Filtering on `userType` helps for guests, and `accountEnabled eq true` covers disabled accounts, but there is no clean Graph filter for "service accounts": that is a naming convention, not a Graph property. Negated or advanced forms of these filters (for example `ne`, `not`, or `endsWith`) additionally require `ConsistencyLevel: eventual` together with `$count=true`.
**Why it happens:**
The people-picker search in v1.1 is text-driven — the user types a name, noise is naturally excluded. A browse mode showing all users removes that implicit filter and exposes the raw directory.
**Consequences:**
- Directory appears larger and noisier than expected for MSP clients.
- Admin selects the wrong account (service account instead of user) and runs an audit that returns no meaningful results.
- Guest accounts from previous collaborations appear as valid targets.
**Prevention:**
Apply a default filter in the directory listing that excludes obvious non-staff entries, while allowing the user to toggle the filter off:
- Default: `$filter=accountEnabled eq true and userType eq 'Member'` — this excludes guests and disabled accounts. Requires no `ConsistencyLevel` header (supported in standard filter mode).
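- Provide a checkbox in the directory browse UI: "Include guest accounts" that widens the `userType` clause to `(userType eq 'Member' or userType eq 'Guest')`. Keep the parentheses: appending a bare `or userType eq 'Guest'` to the default filter would also return disabled guests, because `and` binds tighter than `or`.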
- For service account noise: apply a client-side secondary filter that hides entries where `displayName` contains common service patterns (`SharePoint`, `Teams`, `No Reply`, `Admin`) — this is a heuristic and should be opt-in, not default.
Note: filtering `accountEnabled eq true` in the `$filter` parameter without `ConsistencyLevel: eventual` works on the v1.0 `/users` endpoint. Verify before release.
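A minimal sketch of how the filter strategy above might be kept inside the service layer. `DirectoryFilter`, `Build`, and `LooksLikeServiceAccount` are hypothetical names, and the pattern list is simply the heuristic from the bullet above, not a verified rule.

```csharp
using System;

/// <summary>Hypothetical helper: builds the OData $filter for the directory listing.</summary>
public static class DirectoryFilter
{
    // Heuristic display-name fragments from the prevention list; opt-in only.
    private static readonly string[] ServicePatterns = { "SharePoint", "Teams", "No Reply", "Admin" };

    public static string Build(bool includeGuests)
    {
        // Parentheses keep the accountEnabled check applied to guests as well,
        // since OData "and" has higher precedence than "or".
        string userTypeClause = includeGuests
            ? "(userType eq 'Member' or userType eq 'Guest')"
            : "userType eq 'Member'";
        return $"accountEnabled eq true and {userTypeClause}";
    }

    // Client-side secondary filter; expect false positives (e.g. a person named "Admin").
    public static bool LooksLikeServiceAccount(string? displayName)
    {
        if (string.IsNullOrWhiteSpace(displayName)) return false;
        foreach (string pattern in ServicePatterns)
        {
            if (displayName.Contains(pattern, StringComparison.OrdinalIgnoreCase)) return true;
        }
        return false;
    }
}
```

The ViewModel would pass only the two booleans (guest toggle, heuristic toggle); the filter string itself stays out of UI code.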
**Detection:**
- Count the raw user total vs. the filtered total for a test tenant. If they differ by more than 20%, the default filter is catching real users — review the filter logic.
**Phase to address:** User directory browse implementation phase, before the UI is built. The filter strategy must be baked into the service interface so the ViewModel does not need to know about it.
---
### Pitfall v2.2-4: Full Directory Load Hangs the UI Without Progress Feedback
**What goes wrong:**
Fetching 3 000 users with page iteration takes 3 to 8 seconds depending on tenant size and Graph latency. The existing people-picker search is a debounced 500 ms call that returns quickly. The directory browse "Load All" operation is fundamentally different in character. Without progress feedback, the user sees a frozen list and either waits or clicks the button again (triggering a second concurrent load).
The existing `IsBusy` / `IsRunning` pattern on `AsyncRelayCommand` will disable the button, but there is no count feedback in the existing ViewModel pattern for this case.
**Why it happens:**
Developers implement the API call first, wire it to a button, and test with a 50-user dev tenant where it returns in < 500 ms. The latency problem is only discovered when testing against a real client.
**Consequences:**
- On first use with a large tenant, the admin thinks the feature is broken and restarts the app.
- If the command is not properly guarded, double-clicks trigger two concurrent Graph requests populating the same `ObservableCollection`.
**Prevention:**
- Add a `DirectoryLoadStatus` observable property: `"Loading... X users"` updated via `IProgress<int>` inside the `PageIterator` callback.
- Use `BindingOperations.EnableCollectionSynchronization` on the users `ObservableCollection` so items can be streamed in as each page arrives rather than waiting for full iteration.
- The `AsyncRelayCommand` `CanExecute` must return `false` while loading is in progress (the toolkit does this automatically when `IsRunning` is true — verify it is wired).
- Add a cancellation button that is enabled during the load, using the same `CancellationToken` passed to `PageIterator.IterateAsync`.
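The progress and cancellation flow above can be illustrated without the Graph SDK. The sketch below uses a hypothetical `Page<T>` record and a `fetchPage` delegate standing in for the real page request; in the actual service the same `progress.Report` call would sit inside the `PageIterator` callback.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

/// <summary>Hypothetical page shape: items plus the link to the next page (null on the last page).</summary>
public sealed record Page<T>(IReadOnlyList<T> Items, string? NextLink);

public static class PagedLoader
{
    public static async Task<List<T>> LoadAllAsync<T>(
        Func<string?, CancellationToken, Task<Page<T>>> fetchPage,
        IProgress<int>? progress,
        CancellationToken ct)
    {
        var all = new List<T>();
        string? next = null; // null requests the first page
        do
        {
            // Honour the Cancel button between pages.
            ct.ThrowIfCancellationRequested();
            Page<T> page = await fetchPage(next, ct).ConfigureAwait(false);
            all.AddRange(page.Items);
            // Report the running total so the UI can show "Loading... X users".
            progress?.Report(all.Count);
            next = page.NextLink;
        } while (next is not null);
        return all;
    }
}
```

In a ViewModel, a `Progress<int>` created on the UI thread would marshal each report back to that thread, so the callback can set `DirectoryLoadStatus` directly.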
**Detection:**
- Test with a mock that simulates 10 pages of 999 users each, adding a 200 ms delay between pages. The UI should show incrementing count feedback throughout.
**Phase to address:** User directory browse ViewModel phase.
---
### Pitfall v2.2-5: Logo File Format Validation Is Skipped, Causing Broken Images in Reports
**What goes wrong:**
The `OpenFileDialog` filter (`*.png;*.jpg;*.jpeg`) prevents selecting a `.exe` file, but it does not validate that the selected file is actually a valid image. A user may select a file that was renamed with a `.png` extension but is actually a PDF, a corrupted download, or an SVG (which is XML text, not a binary image format). When the file is read and base64-encoded, the string is valid base64, but the browser renders a broken image icon in the HTML report.
WPF's `BitmapImage` will throw an exception on corrupt or unsupported binary files. SVG files loaded as a `BitmapImage` throw because SVG is not a WPF-native raster format.
A second failure mode: `BitmapImage` throws `NotSupportedException` or `FileFormatException` for EXIF-corrupt JPEGs. This is a known .NET issue where WPF's BitmapImage is strict about EXIF metadata validity.
**Why it happens:**
The file picker filter is treated as sufficient validation. EXIF corruption is not anticipated because it is invisible to casual inspection.
**Consequences:**
- Report is generated successfully from the app's perspective, but every page has a broken image icon where the logo should appear.
- The user does not see the error until they open the HTML file.
- EXIF-corrupt JPEG from a phone camera or scanner is a realistic scenario in an MSP workflow.
**Prevention:**
After file selection and before storing the path or encoding:
1. Load the file as a `BitmapImage` in a `try/catch`. If it throws, reject the file and show a user-friendly error: "The selected file could not be read as an image. Please select a valid PNG or JPEG file."
2. Check `BitmapImage.PixelWidth` and `PixelHeight` after load — a 0×0 image is invalid.
3. For EXIF-corrupt JPEGs: `BitmapCreateOptions.IgnoreColorProfile` and `BitmapCacheOption.OnLoad` reduce (but do not eliminate) EXIF-related exceptions. Wrap the load in a retry with these options if the initial load fails.
4. Do not accept SVG files. The file filter should explicitly include only `*.png;*.jpg;*.jpeg;*.bmp;*.gif`. SVG requires a third-party library (e.g., SharpVectors) to rasterize — out of scope for this milestone.
5. After successful load, verify the resulting base64 string decodes back to a valid image (round-trip check) before persisting to JSON.
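As a supplementary pre-check (an addition, not something the steps above require), the file header can be compared against well-known image signatures before attempting the `BitmapImage` load. The sketch below uses a hypothetical `ImageSniffer` class; a matching header does not prove the file decodes, so it complements rather than replaces step 1. It also yields a MIME type that could be stored alongside the base64 string.

```csharp
using System;

/// <summary>Hypothetical helper: identifies raster formats by their leading signature bytes.</summary>
public static class ImageSniffer
{
    // Returns a MIME type for PNG, JPEG, GIF, or BMP headers; null for anything else
    // (including SVG and plain text renamed to .png).
    public static string? DetectMimeType(byte[] header)
    {
        if (header is null) return null;
        if (StartsWith(header, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return "image/png";
        if (StartsWith(header, 0xFF, 0xD8, 0xFF)) return "image/jpeg";
        if (StartsWith(header, 0x47, 0x49, 0x46, 0x38)) return "image/gif"; // "GIF8"
        if (StartsWith(header, 0x42, 0x4D)) return "image/bmp";              // "BM"
        return null;
    }

    private static bool StartsWith(byte[] data, params byte[] signature)
    {
        if (data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

Only the first few bytes of the file are needed, so this check is cheap enough to run before the full decode.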
**Detection:**
- Unit test: attempt to load a `.txt` file renamed to `.png` and a known EXIF-corrupt JPEG. Verify both are rejected with a user-visible error, not a silent crash.
**Phase to address:** Logo import/settings phase. Validation must be in place before the logo path or base64 is persisted.
---
### Pitfall v2.2-6: Logo Path Stored in JSON Settings Becomes Stale After EXE Redistribution
**What goes wrong:**
The simplest implementation of logo storage is to persist the file path (`C:\Users\admin\logos\msp-logo.png`) in `AppSettings` JSON. This works on the machine where the logo was imported. When the tool is redistributed to another MSP technician (or when the admin reinstalls Windows), the path no longer exists. The export service reads the path, the file is missing, and the logo is silently omitted from new reports — or worse, throws an unhandled `FileNotFoundException`.
**Why it happens:**
Path storage is the simplest approach. Base64 storage feels "heavy." The problem is only discovered when a colleague opens the tool on their own machine.
**Consequences:**
- Client-branded reports stop including the logo without any warning.
- The user does not know the logo is missing until a client complains about the unbranded report.
- The `AppSettings.DataFolder` pattern is already established in the codebase — the team may assume all assets follow the same pattern, but logos are user-supplied files, not app-generated data.
**Prevention:**
Store logos as base64 strings directly in `AppSettings` and `TenantProfile` JSON, not as file paths. The import action reads the file once, encodes it, stores the string, and the original file path is discarded after import. This makes the settings file fully portable across machines.
The concern about JSON file size is valid but manageable: a 512 KB PNG becomes ~700 KB of base64, which increases the settings JSON file by that amount. For a tool that already ships as a 200 MB EXE, a 1 MB settings file is acceptable. Document this design decision explicitly.
Alternative if file-path storage is preferred: copy the logo file into a `logos/` subdirectory of `AppSettings.DataFolder` at import time (use a stable filename like `msp-logo.png`), store only the relative path in JSON, and resolve it relative to `DataFolder` at export time. This is portable as long as the DataFolder travels with the settings.
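If the file-copy alternative is chosen, it might look roughly like this. `LogoFileStore` and its method names are hypothetical; the `logos/` folder and stable filename come from the paragraph above, and `dataFolder` stands in for `AppSettings.DataFolder`.

```csharp
using System;
using System.IO;

/// <summary>Hypothetical helper: keeps logo files inside the data folder and persists only relative paths.</summary>
public static class LogoFileStore
{
    private const string LogosFolder = "logos";

    // Copies the picked file to <dataFolder>/logos/<stableName> and returns
    // the relative path that should be written to the settings JSON.
    public static string Import(string sourcePath, string dataFolder, string stableName = "msp-logo.png")
    {
        string targetDir = Path.Combine(dataFolder, LogosFolder);
        Directory.CreateDirectory(targetDir);
        File.Copy(sourcePath, Path.Combine(targetDir, stableName), overwrite: true);
        return Path.Combine(LogosFolder, stableName);
    }

    // Resolves at export time; returns null when the file is missing so the
    // caller can warn the user instead of silently omitting the logo.
    public static string? Resolve(string relativePath, string dataFolder)
    {
        string fullPath = Path.Combine(dataFolder, relativePath);
        return File.Exists(fullPath) ? fullPath : null;
    }
}
```

Returning `null` from `Resolve` rather than throwing leaves the decision (warn, block export, or continue unbranded) to the caller.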
**Detection:**
- After importing a logo, open `AppSettings.json` and verify the logo data is stored as a base64 string (or a `DataFolder`-relative path), not as an absolute file path.
- Move the settings JSON to a different machine and verify a report is generated with the logo intact.
**Phase to address:** Logo import/settings phase. The storage strategy must be decided and implemented before any export service accepts logo data.
---
## Moderate Pitfalls (v2.2)
### Pitfall v2.2-7: Logo Breaks HTML Report Print Layout
**What goes wrong:**
The existing HTML export services produce print-friendly reports (flat tables, no JavaScript required for static reading). Adding a logo `<img>` tag to the report header introduces two print layout risks:
1. **Logo too large:** An `<img>` without explicit CSS constraints stretches to its natural pixel size. A 1200×400 px banner image pushes the stats cards and table off the first page, breaking the expected report layout.
2. **Image not printed:** Some users open HTML reports and use "Print to PDF", which applies the browser's print rendering and any `@media print` rules. By default, most browsers omit CSS background images when printing but do print inline `<img>` elements, so an inline logo is usually safe. However, logos inside `<div>` containers with `overflow:hidden` or certain CSS transforms may be clipped or omitted in print rendering.
**Why it happens:**
Logo sizing is set by the designer in the settings UI but the reports are opened in diverse browsers (Chrome, Edge, Firefox) with varying print margin defaults. The logo is tested visually on-screen but not in a print preview.
**Prevention:**
- Constrain all logo `<img>` elements with explicit CSS: `max-height: 60px; max-width: 200px; object-fit: contain;`. This prevents the image from overflowing its container regardless of the original image dimensions.
- Add a `@media print` block in the report's inline CSS that keeps the logo visible and appropriately sized: `@media print { .report-logo { max-height: 48px; max-width: 160px; } }`.
- Use `break-inside: avoid` on the header `<div>` containing both logos and the report title so a page break never splits the header from the first stat card.
- Test "Print to PDF" in Edge (Chromium) before shipping — it is the most common browser for MSP tools on Windows.
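Since the export services assemble their HTML with `StringBuilder`, the CSS rules above could live in one shared constant. A minimal sketch, assuming a hypothetical `ReportLogoStyles` class and a hypothetical `.report-header` class name for the header `<div>`:

```csharp
using System.Text;

/// <summary>Hypothetical shared CSS for logo sizing on screen and in print.</summary>
public static class ReportLogoStyles
{
    // Values are the ones listed in the prevention bullets above.
    public const string Css = @"
.report-header { break-inside: avoid; }
.report-logo { max-height: 60px; max-width: 200px; object-fit: contain; }
@media print {
  .report-logo { max-height: 48px; max-width: 160px; }
}";

    // Appends the rules as their own <style> element inside the report <head>.
    public static void AppendStyleBlock(StringBuilder sb)
    {
        sb.AppendLine("<style>");
        sb.AppendLine(Css);
        sb.AppendLine("</style>");
    }
}
```

Keeping the constraints in CSS rather than resizing the image at import means an oversized original still renders within bounds in every browser.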
**Detection:**
- Open a generated report in Edge, use Ctrl+P, check print preview. Verify the logo appears on page 1 and the table is not pushed to page 2 by an oversized image.
**Phase to address:** HTML report template phase when logo injection is added to `BuildHtml`.
---
### Pitfall v2.2-8: ConsistencyLevel Header Amplifies Graph Throttling for Directory Listing
**What goes wrong:**
The existing `GraphUserSearchService` already uses `ConsistencyLevel: eventual` with `$count=true` for its `startsWith` filter query. This is required for the advanced filter syntax. However, applying `ConsistencyLevel: eventual` to a full directory listing with `$top=999` and `$orderby=displayName` forces Graph to route requests through a consistency-checked path rather than a lightweight read cache. Microsoft documentation confirms this increases the cost of each request against throttling limits.
For a tenant with 10 000 users (11 pages of 999), firing 11 consecutive requests with `ConsistencyLevel: eventual` is significantly more expensive than 11 standard read requests. Under sustained MSP use (multiple tenants audited back-to-back), this can trigger per-app throttling (HTTP 429) after 2 to 3 directory loads in quick succession.
**Why it happens:**
`ConsistencyLevel: eventual` is already in the existing service and developers copy it to the new `GetAllUsersAsync` method because it was needed for `$count` support.
**Prevention:**
For `GetAllUsersAsync`, evaluate whether `ConsistencyLevel: eventual` is actually needed:
- `$orderby=displayName` on `/users` does **not** require `ConsistencyLevel: eventual` — standard `$orderby` on `displayName` is supported without it.
- `$count=true` does require `ConsistencyLevel: eventual`. If user count is needed for progress feedback, request it only on the first page, then use the returned `@odata.count` value without adding the header to subsequent page requests. The `PageIterator` does not automatically carry the header to next-link requests — verify this behaviour.
- If `ConsistencyLevel: eventual` is not needed for the primary listing, omit it from `GetAllUsersAsync`. Use it only when `$search` or `$count` are required.
**Detection:**
- Load the full directory for two different tenants back-to-back. Check for HTTP 429 responses in the Serilog output. If throttling occurs within the first two loads, `ConsistencyLevel` overhead is the likely cause.
**Phase to address:** User directory browse service implementation phase.
---
### Pitfall v2.2-9: WPF ListView with 5 000+ Users Freezes Without UI Virtualization
**What goes wrong:**
A WPF `ListView` or `DataGrid` bound to an `ObservableCollection<DirectoryUser>` with 5 000 items renders all 5 000 item containers on first bind if UI virtualization is disabled or inadvertently defeated. This causes a 5 to 10 second freeze when the directory loads and ~200 MB of additional memory for the rendered rows, even though only ~20 rows are visible in the viewport.
Virtualization is defeated by any of these common mistakes:
- The `ListView` is inside a `ScrollViewer` that wraps both the list and other content (`ScrollViewer.CanContentScroll=False` is the kill switch).
- The `ItemsPanel` is overridden with a non-virtualizing panel (`StackPanel` instead of `VirtualizingStackPanel`).
- Items are added one-by-one to the `ObservableCollection` (each addition fires a `CollectionChanged` notification, causing incremental layout passes — 5 000 separate layout passes are expensive).
**Why it happens:**
The existing people-picker `SearchResults` collection has at most 10 items — virtualization was never needed and its absence was never noticed. The directory browse `ObservableCollection` is a different scale.
**Prevention:**
- Use a `ListView` with its default `VirtualizingStackPanel` (do not override `ItemsPanel`).
- Set `VirtualizingPanel.IsVirtualizing="True"`, `VirtualizingPanel.VirtualizationMode="Recycling"`, and `ScrollViewer.CanContentScroll="True"` explicitly — do not rely on defaults being correct after a XAML edit.
- Never add items to the collection one-by-one from the background thread. Use `BindingOperations.EnableCollectionSynchronization` and assign `new ObservableCollection<T>(loadedList)` in one operation after all pages have been fetched, or batch-swap when each page arrives.
- For 5 000+ items, add a search-filter input above the directory list that filters the bound `ICollectionView` — this reduces the rendered item count to a navigable size without requiring the user to scroll 5 000 rows.
**Detection:**
- Load a 3 000-user directory into the ListView. Open Windows Task Manager. The WPF process should not spike above 300 MB during list rendering. Scroll should be smooth (60 fps) with recycling enabled.
**Phase to address:** User directory browse View/XAML phase.
---
### Pitfall v2.2-10: Dual Logo Injection Requires Coordinated Changes Across All Five HTML Export Services
**What goes wrong:**
There are five independent `HtmlExportService`-style classes, each with its own `BuildHtml` method that builds the full HTML document from scratch using `StringBuilder`. Adding logo support means changing all five methods. If logos are added to only two or three services (the ones the developer remembers), the other reports ship without branding. The inconsistency is subtle — the tool "works," but branded exports alternate with unbranded exports depending on which tab generated the report.
**Why it happens:**
Each export service was written independently and shares no base class. There is no shared "HTML report header" component that all services delegate to. Each service owns its complete `<!DOCTYPE html>` block.
**Consequences:**
- Permissions report is branded; duplicates report is not.
- Client notices inconsistency and questions the tool's reliability.
- Future changes to the report header (adding a timestamp, changing the color scheme) must be applied to all five files separately.
**Prevention:**
Before adding logo injection to any service, extract a shared `HtmlReportHeader` helper method (or a small `HtmlReportBuilder` base class/utility) that generates the `<head>`, `<style>`, and branded header `<div>` consistently. All five services call this shared method with a `BrandingOptions` parameter (MSP logo base64, client logo base64, report title). This is a refactoring prerequisite — not optional if branding consistency is required.
The refactoring is low-risk: the CSS blocks in all five services are nearly identical (confirmed by reading the code), so consolidation is straightforward.
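One possible shape for the shared header, sketched under assumptions: `BrandingOptions` and `HtmlReportHeader` are the names suggested above, but their members here are illustrative, and a single MIME type is assumed for both logos to keep the example short.

```csharp
using System.Net;
using System.Text;

/// <summary>Hypothetical branding payload passed to every export service.</summary>
public sealed record BrandingOptions(
    string ReportTitle,
    string? MspLogoBase64 = null,
    string? ClientLogoBase64 = null,
    string LogoMimeType = "image/png");

public static class HtmlReportHeader
{
    // Appends the branded header <div>; all five services would call this
    // instead of writing their own header markup.
    public static void Append(StringBuilder sb, BrandingOptions branding)
    {
        sb.AppendLine("<div class=\"report-header\">");
        AppendLogo(sb, branding.MspLogoBase64, branding.LogoMimeType, "MSP logo");
        // Encode the title so tenant or site names cannot inject markup.
        sb.AppendLine($"<h1>{WebUtility.HtmlEncode(branding.ReportTitle)}</h1>");
        AppendLogo(sb, branding.ClientLogoBase64, branding.LogoMimeType, "Client logo");
        sb.AppendLine("</div>");
    }

    private static void AppendLogo(StringBuilder sb, string? base64, string mimeType, string altText)
    {
        if (string.IsNullOrEmpty(base64)) return; // No logo configured: header stays unbranded.
        sb.AppendLine($"<img class=\"report-logo\" src=\"data:{mimeType};base64,{base64}\" alt=\"{altText}\">");
    }
}
```

Because missing logos are skipped rather than treated as errors, the same call works for unbranded exports, which keeps the refactoring step independent of the logo import feature.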
**Detection:**
- After branding is implemented, export one report from each of the five export services. Open all five in a browser side by side and verify logos appear in all five.
**Phase to address:** HTML report template refactoring phase — this must be done before logo injection, not after.
---
## Minor Pitfalls (v2.2)
### Pitfall v2.2-11: `User.Read.All` Permission Scope May Not Be Granted for Full Directory Listing
**What goes wrong:**
The existing `SearchUsersAsync` uses `startsWith` filter queries that work with `User.ReadBasic.All` (the least-privileged scope for user listing). Full directory browse with all user properties may require `User.Read.All`, depending on which properties are selected. If the Azure AD app registration used by MSP clients only has `User.ReadBasic.All` consented (which is sufficient for the v1.1 people-picker), the `GetAllUsersAsync` call may silently return partial data or throw a 403.
`User.ReadBasic.All` returns only: `displayName`, `givenName`, `id`, `mail`, `photo`, `securityIdentifier`, `surname`, `userPrincipalName`. Requesting `accountEnabled` or `userType` (needed for filtering out guests/disabled accounts per Pitfall v2.2-3) requires `User.Read.All`.
**Prevention:**
- Define the exact `$select` fields needed for the directory browse feature and verify each field is accessible under `User.ReadBasic.All` before assuming `User.Read.All` is required.
- If `User.Read.All` is required, update the app registration documentation and display a clear message in the tool if the required permission is missing (catch the 403 and surface it as "Insufficient permissions — User.Read.All is required for directory browse mode").
- Add `User.Read.All` to the requested scopes in `MsalClientFactory` alongside existing scopes.
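The first bullet (checking each `$select` field against `User.ReadBasic.All`) can be made mechanical. In this sketch, `DirectoryScopeCheck` is a hypothetical helper and the property set is copied from the paragraph above; it should be re-verified against the current Graph permissions reference before being relied on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

/// <summary>Hypothetical helper: flags $select fields that User.ReadBasic.All does not cover.</summary>
public static class DirectoryScopeCheck
{
    // Property list as stated in this document; treat as an assumption to re-verify.
    private static readonly HashSet<string> ReadBasicProperties = new(StringComparer.OrdinalIgnoreCase)
    {
        "displayName", "givenName", "id", "mail", "photo",
        "securityIdentifier", "surname", "userPrincipalName"
    };

    // Returns the fields that would need User.Read.All; empty means ReadBasic is enough.
    public static IReadOnlyList<string> FieldsNeedingReadAll(IEnumerable<string> selectFields) =>
        selectFields.Where(field => !ReadBasicProperties.Contains(field)).ToList();
}
```

Running this over the planned `$select` list at design time shows immediately that adding `accountEnabled` or `userType` moves the feature to `User.Read.All`.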
**Detection:**
- Test the directory browse against a tenant where the app registration has only `User.ReadBasic.All` consented. Verify the error message is user-readable, not a raw `ServiceException`.
**Phase to address:** User directory browse service interface phase.
---
### Pitfall v2.2-12: Logo Preview in Settings UI Holds a File Lock
**What goes wrong:**
When showing a logo preview in the WPF settings UI using `BitmapImage` with a file URI (`new BitmapImage(new Uri(filePath))`), WPF may hold a read lock on the file until the `BitmapImage` is garbage collected. If the user then tries to re-import a different logo (which involves overwriting the same file), the file write fails with a sharing violation. This is a known WPF `BitmapImage` quirk.
**Prevention:**
Load logo previews inside a `BeginInit()`/`EndInit()` pair with `CacheOption` set to `BitmapCacheOption.OnLoad`:
```csharp
using System;
using System.Windows.Media.Imaging;

var bitmap = new BitmapImage();
bitmap.BeginInit();
bitmap.UriSource = new Uri(filePath);
bitmap.CacheOption = BitmapCacheOption.OnLoad; // Decode fully at EndInit() so the file can be closed
bitmap.EndInit();
bitmap.Freeze(); // Makes the bitmap immutable and safe to share across threads
```
`BitmapCacheOption.OnLoad` is the critical setting: it forces the image to be fully decoded into memory at `EndInit()`, after which the file handle is released. `Freeze()` does not release the handle by itself; it makes the bitmap immutable so it can be passed between threads.
**Detection:**
- Import a logo, then immediately try to overwrite the source file using Windows Explorer. Without `BitmapCacheOption.OnLoad`, the file is locked. With it, the overwrite succeeds.
**Phase to address:** Settings UI / logo import phase.
---
## Phase-Specific Warnings (v2.2)
| Phase Topic | Likely Pitfall | Mitigation |
|-------------|---------------|------------|
| Logo import + settings persistence | Base64 bloat (v2.2-1) + path staleness (v2.2-6) | Store pre-encoded base64 in JSON; enforce 512 KB import limit |
| Logo import + settings persistence | Invalid/corrupt image file (v2.2-5) | Validate via `BitmapImage` load before persisting; use `BitmapCacheOption.OnLoad` to release the file handle (v2.2-12) |
| HTML report template refactoring | Inconsistent branding across 5 services (v2.2-10) | Extract shared header builder before touching any service |
| HTML report template | Print layout broken by oversized logo (v2.2-7) | Add `max-height/max-width` CSS and `@media print` block |
| Graph directory service | Silent truncation at 999 users (v2.2-2) | Use `PageIterator`; request `$count` on first page for progress |
| Graph directory service | Guest/service account noise (v2.2-3) | Default filter `accountEnabled eq true and userType eq 'Member'`; UI toggle for guests |
| Graph directory service | Throttling from ConsistencyLevel header (v2.2-8) | Omit `ConsistencyLevel: eventual` from standard listing; use only when `$search` or `$count` required |
| Graph directory service | Missing permission scope (v2.2-11) | Verify `User.Read.All` vs. `User.ReadBasic.All` against required fields; update app registration docs |
| Directory browse ViewModel | UI freeze during load (v2.2-4) | Stream pages via `IProgress<int>`; cancellable `AsyncRelayCommand` |
| Directory browse View (XAML) | ListView freeze with 5 000+ items (v2.2-9) | Explicit virtualization settings; batch `ObservableCollection` assignment; filter input |
---
## v2.2 Integration Gotchas
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Logo base64 in `AppSettings` JSON | Store file path; re-encode on every export | Store pre-encoded base64 string at import time; inject directly into `<img src>` |
| `BitmapImage` logo preview | Default `BitmapImage` constructor holds file lock | Use `BeginInit/EndInit` with `BitmapCacheOption.OnLoad` and call `Freeze()` |
| Graph `GetAllUsersAsync` | Single `GetAsync` call; no pagination | Always use `PageIterator<User, UserCollectionResponse>` |
| Graph `$top` parameter | `$top=1000`, which exceeds the documented maximum page size | Maximum valid value is `999` |
| Graph directory filter | No filter — returns all account types | Default: `accountEnabled eq true and userType eq 'Member'` |
| `ConsistencyLevel: eventual` | Applied to all Graph requests by habit | Required only for `$search`, `$count`, and advanced `$filter` queries (for example `ne`, `not`, `endsWith`) |
| HTML export services | Logo injected in only the modified services | Extract shared header builder; all five services use it |
| WPF ListView with large user list | No virtualization settings, items added one-by-one | Explicit `VirtualizingPanel` settings; assign `new ObservableCollection<T>(list)` once |
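
The Graph rows above (paginate with `PageIterator<User, UserCollectionResponse>`, `$top` of at most `999`, default member filter, no `ConsistencyLevel` header for plain listing) might look roughly like the sketch below. It assumes the Microsoft.Graph v5 .NET SDK; `DirectoryLoader`, `GetMembersAsync`, the selected fields, and the progress interval of 100 users are hypothetical choices made for illustration. It needs a signed-in `GraphServiceClient` and has not been run against the existing codebase.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Graph;
using Microsoft.Graph.Models;

// Hypothetical service sketch; assumes Microsoft.Graph v5.
public static class DirectoryLoader
{
    public static async Task<List<User>> GetMembersAsync(
        GraphServiceClient graph, IProgress<int>? progress, CancellationToken ct)
    {
        var users = new List<User>();

        // First page only. No ConsistencyLevel header is set for this plain listing.
        var firstPage = await graph.Users.GetAsync(rc =>
        {
            rc.QueryParameters.Top = 999; // documented maximum page size
            rc.QueryParameters.Filter = "accountEnabled eq true and userType eq 'Member'";
            // Assumed field list; adjust to what the UI actually displays.
            rc.QueryParameters.Select = new[] { "id", "displayName", "userPrincipalName" };
        }, ct);

        if (firstPage is null) return users;

        // PageIterator follows @odata.nextLink until the callback returns false
        // or there are no more pages.
        var iterator = PageIterator<User, UserCollectionResponse>.CreatePageIterator(
            graph,
            firstPage,
            user =>
            {
                users.Add(user);
                if (users.Count % 100 == 0) progress?.Report(users.Count);
                return !ct.IsCancellationRequested; // false stops the iteration
            });

        await iterator.IterateAsync(ct);

        // Surface cancellation to the caller instead of returning a partial list silently.
        ct.ThrowIfCancellationRequested();
        progress?.Report(users.Count);
        return users;
    }
}
```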
---
## v2.2 "Looks Done But Isn't" Checklist
- [ ] **Logo size limit enforced:** Import a 600 KB PNG. Verify the UI rejects it with a clear message and does not silently accept it.
- [ ] **Corrupt image rejected:** Rename a `.txt` file to `.png` and attempt to import. Verify rejection with user-friendly error.
- [ ] **Logo portability:** Import a logo on machine A, copy the settings JSON to machine B (without the original file), generate a report. Verify the logo appears.
- [ ] **All five report types branded:** Export one report from each of the five HTML export services. Open all five in a browser and verify logos appear in all.
- [ ] **Print layout intact:** Open each branded report type in Edge, Ctrl+P, print preview. Verify logo appears on page 1 and table is not displaced.
- [ ] **Directory listing complete (large tenant):** Connect to a tenant with > 1 000 users. Load the full directory. Verify user count matches the user count shown for the tenant in the Azure portal (Microsoft Entra ID, formerly Azure AD).
- [ ] **Directory load cancellation:** Start a directory load and click Cancel before it completes. Verify the list either shows partial results or is cleared, the app does not crash, and the load button is re-enabled.
- [ ] **Guest account filter:** Verify guests are excluded by default. Verify the "Include guests" toggle adds them back.
- [ ] **ListView performance:** Load 3 000 users into the directory list. Verify scroll is smooth and memory use is reasonable (< 400 MB total).
- [ ] **FR locale for new UI strings:** All logo import labels, error messages, and directory browse UI strings must have FR translations. Verify no untranslated keys appear when FR is active.
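
For the "Corrupt image rejected" item, one possible shape for the validation step is sketched below. It relies on WPF (`System.Windows.Media.Imaging`) and therefore only runs on Windows. `LogoValidator` and `TryLoad` are hypothetical names, and the list of caught exception types is an assumption: WPF decoders commonly surface `NotSupportedException` or `FileFormatException` (a `FormatException`) for undecodable input, but other types may occur.

```csharp
#nullable enable
using System;
using System.IO;
using System.Windows.Media.Imaging;

// Hypothetical validator; requires a WPF-enabled project (Windows only).
public static class LogoValidator
{
    // Returns a frozen bitmap suitable for preview, or null when the file cannot be decoded.
    public static BitmapImage? TryLoad(string path)
    {
        try
        {
            using var stream = File.OpenRead(path);

            var bitmap = new BitmapImage();
            bitmap.BeginInit();
            // OnLoad decodes the whole image during EndInit, so the stream
            // (and the file handle) can be closed as soon as this method returns.
            bitmap.CacheOption = BitmapCacheOption.OnLoad;
            bitmap.StreamSource = stream;
            bitmap.EndInit();

            // Freeze makes the bitmap immutable and usable from any thread.
            bitmap.Freeze();
            return bitmap;
        }
        catch (Exception ex) when (ex is NotSupportedException
                                   or FormatException
                                   or IOException
                                   or UnauthorizedAccessException)
        {
            // Assumption: these cover the usual "not an image / unreadable file" cases.
            return null;
        }
    }
}
```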
---
## v2.2 Sources
- Microsoft Learn: List users (Graph v1.0) — https://learn.microsoft.com/en-us/graph/api/user-list?view=graph-rest-1.0
- Microsoft Learn: Graph API throttling guidance — https://learn.microsoft.com/en-us/graph/throttling
- Microsoft Learn: Graph API service-specific throttling limits — https://learn.microsoft.com/en-us/graph/throttling-limits
- Microsoft Learn: Graph SDK paging / PageIterator — https://learn.microsoft.com/en-us/graph/sdks/paging
- Microsoft Learn: Graph permissions — User.ReadBasic.All vs User.Read.All — https://learn.microsoft.com/en-us/graph/permissions-reference
- Rick Strahl's Web Log: Working around the WPF ImageSource Blues (2024) — https://weblog.west-wind.com/posts/2024/Jan/03/Working-around-the-WPF-ImageSource-Blues
- Rick Strahl's Web Log: HTML to PDF Generation using the WebView2 Control (2024) — https://weblog.west-wind.com/posts/2024/Mar/26/Html-to-PDF-Generation-using-the-WebView2-Control
- MDN Web Docs: CSS Printing — https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_media_queries/Printing
- Microsoft Learn: BitmapImage / BitmapCacheOption — https://learn.microsoft.com/en-us/dotnet/api/system.windows.media.imaging.bitmapcacheoption
- Microsoft Learn: Optimize WPF control performance (virtualization) — https://learn.microsoft.com/en-us/dotnet/desktop/wpf/advanced/optimizing-performance-controls
- Microsoft Q&A: WPF BitmapImage complains about EXIF corrupt metadata — https://learn.microsoft.com/en-us/answers/questions/1457132/wpf-bitmapimage-complains-about-exif-corrupt-metad
- Microsoft Q&A: What is the suggested way for filtering non-human accounts from /users — https://learn.microsoft.com/en-us/answers/questions/280526/what-is-the-suggested-34way34-or-best-options-for.html
- DebugBear: Page Speed — Avoid Large Base64 data URLs — https://www.debugbear.com/blog/base64-data-urls-html-css
- Graph API — how to avoid throttling (Tech Community) — https://techcommunity.microsoft.com/blog/fasttrackforazureblog/graph-api-integration-for-saas-developers/4038603
- Existing codebase: `UserAccessHtmlExportService.cs`, `HtmlExportService.cs`, `GraphUserSearchService.cs` (reviewed 2026-04-08)
---
*v2.2 pitfalls appended: 2026-04-08*