1 Symbols, Abbreviations and Acronyms

| Symbol | Description |
|---|---|
| AC | Anticipated Change |
| DOM | Document Object Model |
| FR | Functional Requirement |
| GPU | Graphics Processing Unit |
| M | Module |
| MG | Module Guide |
| OS | Operating System |
| PDF | Portable Document Format |
| R | Requirement |
| SRS | Software Requirements Specification |
| morph | the text editor that helps you become a better writer |
| UC | Unlikely Change |
| VnV | Verification and Validation |
| TTFT | Time-to-First-Token |
| CPU | Central Processing Unit |
| NSFW | Not Safe For Work |
| UI/UX | User Interface / User Experience |
| DNS | Domain Name System |
| DNSSEC | Domain Name System Security Extensions |
| HNSW | Hierarchical Navigable Small World |
| LTR | Learning To Rank |

7.2 Data Dictionary

| Name | Content | Type |
|---|---|---|
| morph | morph + ReverseProxy + asteraceae | package |
| morph | Web interface for morph | package |
| ReverseProxy | A middleware for reverse proxy with load balancer | module |
| asteraceae | Search + ActivationCache + SAEs + logit_processor + Exo | package |
| logit_processor | A function to convert incoming requests to logits | function |
| exo | Inference engine to run a given LLM | package |
| LLM | Open-weights models to be run for inference | package |
| BlockManager | Helps manage the KV-cache during inference | class |
| SAEs | A set of sparse autoencoders trained against a given LLM to steer activation generation | package |
| Search | Additional search tool to improve correctness | module |
| ActivationCache | Stores generation activations for performance purposes | class |
| streaming JSON | Outputs from asteraceae are streamed back to morph | text |
| tensor matrix | A matrix of shape (n tokens × m embedding size), where each row is a float-vector embedding for a token; represents inputs processed by logit_processor | text |

See also: logits and sparse autoencoders.
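The tensor matrix and logit_processor entries above can be sketched as follows. This is a hypothetical illustration of the interface only: the projection weights, shapes, and function body are stand-ins, not asteraceae's actual implementation.

```python
def logit_processor(embeddings, unembed):
    """Map each token embedding (row) to vocabulary logits via a plain
    matrix product: logits[i][v] = sum_k embeddings[i][k] * unembed[k][v]."""
    n_vocab = len(unembed[0])
    return [[sum(row[k] * unembed[k][v] for k in range(len(row)))
             for v in range(n_vocab)]
            for row in embeddings]

# 2 tokens, embedding size 3, a toy vocabulary of 2 entries.
emb = [[1.0, 0.0, 2.0],
       [0.0, 1.0, 0.0]]
W = [[0.5, -0.5],
     [1.0, 0.0],
     [0.0, 1.0]]
print(logit_processor(emb, W))  # one row of logits per input token
```

In the real system the unembedding weights belong to the loaded LLM; the point here is only the (n tokens × m embedding) in, (n tokens × vocabulary) out contract.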


2 List of Tables

| Table | Description |
|---|---|
| 0 | List of Figures |
| 1 | Evaluation of Test-1: Large Text Block Input |
| 2 | Evaluation of Test-2: Unintelligible Prompt |
| 3 | Evaluation of Test-3: Steering Validation (Didion Tone) |
| 4 | Evaluation of Test-4: Personalized Style Adaptation |
| 5 | Evaluation of Test-5: LTR Feedback Panel Validation |
| 6 | Evaluation of Test-6: Document Export Functionality Validation |
| 7 | Evaluation of Test-7: Theme Customization Validation |
| 8 | Evaluation of Test-8: Predefined UI/UX Checklist |
| 9 | Evaluation of Test-8: User Testing & Survey Results |
| 10 | Evaluation of Test-9: Predefined UI Audit Checklist |
| 11 | Evaluation of Test-9: Validation with WCAG Contrast Checker |
| 12 | Evaluation of Test-10: Predefined Usability Checklist |
| 13 | Evaluation of Test-10: User Feedback from Surveys and Interviews |
| 14 | Evaluation of Test-11: Onboarding Time Results |
| 15 | Evaluation of Test-11: Predefined Usability Checklist |
| 16 | Evaluation of Test-11: User Feedback from Surveys |
| 17 | Evaluation of Test-12: Keyboard Navigation Test Results |
| 18 | Evaluation of Test-12: Predefined Accessibility Checklist |
| 19 | Evaluation of Test-14: Performance Checklist |
| 20 | Evaluation of Test-15: Safety Compliance Checklist |
| 21 | Evaluation of Test-16: Steering Alignment Checklist |
| 22 | Evaluation of Test-17: Performance Checklist |
| 23 | Evaluation of Test-18: Asynchronous Processing Checklist |
| 24 | Evaluation of Test-19: Input Responsiveness Checklist |
| 25 | Evaluation of Test-20: Security Checklist |
| 26 | Evaluation of Test-21: Security Checklist |
| 27 | Evaluation of Test-22: Security Checklist |
| 28 | Evaluation of Test-23: Security Checklist |
| 29 | Evaluation of Test-25: Canadian Copyright Law Compliance Checklist |
| 30 | Traceability of Testing to Functional Requirements |
| 31 | Traceability of Testing to Non-Functional Requirements |
| 32 | Traceability of Testing to Modules |
| 33 | Code Coverage Data |

3 List of Figures

| Number | Figure |
|---|---|
| Figure 1 | Performance Graph |
| Figure 2 | deployment_strategy |
| Figure 3 | cloudflare |
| Figure 4 | audit_output |
| Figure 5 | generated_text_suggestions |
| Figure 6 | plagarism_check |
| Figure 7 | Coverage Output Terminal |

This document provides an overview of the testing performed throughout the development of the morph project, including the results obtained and the relevant discussion. The tests were conducted under the guidance of the VnV Plan.

4 Functional Requirements Evaluation

4.1 Planning and Suggestion Features

4.1.1 Evaluation of Test-1: Large Text Block Input

Test Description: This combined test case validates that, when planning mode is active, the system appropriately handles both a valid prompt and a large text block input. For a valid prompt (e.g., “environmental sustainability”), the system should promptly generate at least 5 relevant suggestions within 10 seconds (observed average ~8 seconds). For a large text block (500+ words) on climate change, the system should process the input and either provide condensed suggestions or display a length warning, all within 10 seconds, to ensure robust handling of varied input lengths.

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Output Time < 10s | Suggestions appear within an average of ~8 seconds | ✅ Pass |
| 1 or more suggestions generated | 5 suggestions are generated for the valid input | ✅ Pass |
| Input Handling | The system effectively processes large inputs without performance degradation | ✅ Pass |
| Suggestions related to input | The output is clear and relevant to the large input | ✅ Pass |

Table: Test-1

4.1.2 Evaluation of Test-2: Unintelligible Prompt

Test Description: This test case validates that when an unintelligible prompt (e.g., “asdh123!@#”) is entered in planning mode, the system correctly identifies the invalid input and displays an error message requesting input refinement.

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Input Recognition | The system accurately identifies the unintelligible prompt | ✅ Pass |
| Error Message Display | An error message is displayed to prompt for input refinement | ✅ Pass |
| Response Timing | The error message is shown within 10 seconds | ❌ Fail (takes longer than 10 seconds) |

Table: Test-2

4.2 Text Generation Features

4.2.1 Evaluation of Test-3: Steering Validation (Didion Tone)

Test Description: This test case verifies that when the steering feature is enabled and the “Didion” tone is selected with the prompt “reflection on modern life,” the system generates text suggestions that accurately match Didion’s writing style under specified lexical constraints.

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Steering Activation | The steering feature is successfully enabled in the editor | ✅ Pass |
| Tone Selection | The “Didion” tone is correctly selected | ✅ Pass |
| Style Matching | Generated suggestions align with Didion’s writing style | ✅ Pass |

Table: Test-3

4.2.2 Evaluation of Test-4: Personalized Style Adaptation

Test Description: This test case validates that when a user’s writing sample is provided, the system adapts its text generation to reflect the user’s personal writing style. Customized suggestions should be generated within 30 seconds.

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Input Sample Recognition | The system accepts and processes the user’s writing sample | ✅ Pass |
| Style Adaptation | Generated output reflects the user’s personal writing style | ✅ Pass |
| Response Timing | Customized suggestions are produced within 30 seconds | ✅ Pass |
| Output Relevance | The suggestions are contextually appropriate and tailored to the sample | ✅ Pass |

Table: Test-4

4.3 Feedback Panel

4.3.1 Evaluation of Test-5: LTR Feedback Panel Validation

Test Description: This test case confirms that when user feedback (e.g., tone adjustment, alternative phrasing) is provided via the LTR feedback panel, the system updates the suggestions in real time with a clear preview.

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Feedback Integration | The system seamlessly accepts and integrates user feedback | ✅ Pass |
| Real-time Update | Suggestions update immediately upon receiving feedback | ✅ Pass |
| Output Clarity | Updated suggestions provide a clear and accurate preview of changes | ✅ Pass |
| Test Execution | Manual test performed with LTR feedback panel active and initial suggestions | ✅ Pass |

Table: Test-5

4.4 Document Export Features

4.4.1 Evaluation of Test-6: Document Export Functionality Validation

Test Description: This test case validates the document export capabilities. It verifies that a completed document can be exported in both PDF (formatted, with content preserved) and markdown (raw text without rich formatting) formats, ensuring that content is accurately maintained.

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| PDF Export | Selecting PDF export generates a PDF with preserved content | ✅ Pass |
| Markdown Export | Selecting markdown export produces a raw .md file with preserved content | ✅ Pass |
| Output Accuracy | Exported documents accurately preserve the intended content’s format | ❌ Fail (formatting issues with the PDF format) |

Table: Test-6

4.5 Interface Customization Features

4.5.1 Evaluation of Test-7: Theme Customization Validation

Test Description: This test case verifies that the system allows users to switch from the default light theme to dark mode. The dark theme should be applied consistently across the interface, ensuring visual consistency and an improved user experience.

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Theme Switching | The system successfully enables switching from light to dark mode | ✅ Pass |
| Visual Consistency | The dark theme is applied consistently across all interface elements | ✅ Pass |

Table: Test-7

5 Nonfunctional Requirements Evaluation

5.1 Look and Feel

5.1.1 Evaluation of Test-8

5.1.1.1 Predefined UI/UX Checklist:

10 engineers and UI/UX experts reviewed and followed the criteria below, ensuring a thorough evaluation of the morph interface:

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Visual Consistency | Typography, spacing, and layout remain uniform across screens. | ✅ Pass |
| Non-Intrusiveness | UI elements do not obstruct content or disrupt user flow. | ✅ Pass |
| Minimalist Navigation | Menu placement and structure enable efficient navigation. | ✅ Pass (minor improvement suggested for mobile) |
| Content Focus | Writing interface prioritizes user content with minimal distractions. | ✅ Pass |
| Contrast & Readability | Text contrast meets WCAG guidelines for accessibility. | ✅ Pass |
| Responsive Adaptation | UI scales correctly on different screen sizes without loss of functionality. | ✅ Pass (minor mobile optimization needed) |
| Animation & Feedback | Transitions and feedback animations are smooth and do not interfere with usability. | ✅ Pass |

Table: Test-8

5.1.1.2 User Testing & Survey Results:

Participants then answered the usability survey questions (Section 6.1) and rated their experience on the following topics on a 1–5 scale (1 = Poor, 5 = Excellent):

| Evaluation Metric | Average Rating (1–5) |
|---|---|
| Clarity of interface | 4.8 |
| Ease of navigation | 4.6 (some issues with mobile menus) |
| Non-intrusiveness | 4.9 |
| Responsiveness across devices | 4.7 (minor UI scaling issues noted) |
| Visual consistency | 4.9 |

Table: Test-8 User Testing & Survey Results

5.1.1.3 Key Observations & Findings:
  • High UI clarity and readability: Users appreciated the clean layout and distraction-free experience, aligning with the project’s goals.
  • Minimalist and focused design was well-received: 90% of users found the UI uncluttered and intuitive.
  • Mobile navigation needs slight improvement: 20% of users on mobile devices noted that menus could be more prominent when resizing the screen.
  • Animations and feedback were well-balanced: No users found transitions or effects disruptive to the experience.

5.1.2 Evaluation of Test-9

5.1.2.1 Predefined UI Audit Checklist:

The team manually reviewed the UI components using the design system documentation and WCAG Contrast Checker, ensuring alignment with the project’s visual consistency goals.

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Typography Consistency | Font families, sizes, and weights match design system guidelines. | ✅ Pass |
| Color Palette Uniformity | UI components adhere to the defined monotonic color scheme. | ✅ Pass |
| Contrast Compliance | Text and interactive elements meet WCAG 2.1 AA contrast ratios. | ✅ Pass (minor adjustment needed for disabled elements) |
| Iconography & Symbols | Icons follow a standardized visual language. | ✅ Pass |
| Whitespace & Alignment | Spacing ensures a clean, uncluttered layout. | ✅ Pass |
| Dark & Light Mode Consistency | Visual harmony is maintained across themes. | ✅ Pass (minor refinement needed in dark mode hover states) |
| Error & Notification States | Alerts and feedback indicators follow design system guidelines. | ✅ Pass |

Table: Test-9

5.1.2.2 Validation with WCAG Contrast Checker:

The team ran manual contrast checks using a WCAG compliance tool to ensure accessibility standards were met.

| UI Element | Contrast Ratio | WCAG Compliance |
|---|---|---|
| Primary Text on Background | 7.1:1 | ✅ AA & AAA |
| Button Labels | 4.8:1 | ✅ AA |
| Links & Interactive Elements | 5.3:1 | ✅ AA |
| Disabled Elements | 3.0:1 | ⚠️ Below AA (Requires Adjustment) |
| Dark Mode Text on Background | 6.5:1 | ✅ AA |

Table: Test-9 Validation with WCAG Contrast Checker
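Ratios like these follow the WCAG 2.1 definition: each color's relative luminance is computed from linearized sRGB channels, then the lighter and darker luminances are combined with a 0.05 flare term. A minimal sketch of the computation, usable to spot-check values such as those reported:

```python
def _linearize(c8):
    """Linearize one 8-bit sRGB channel per the WCAG 2.1 formula."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color):
    r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)

def contrast_ratio(fg, bg):
    """(L_lighter + 0.05) / (L_darker + 0.05); AA needs >= 4.5:1, AAA >= 7:1."""
    hi, lo = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

print(round(contrast_ratio("#000000", "#ffffff"), 1))  # 21.0, the maximum possible
```

Any dedicated WCAG checker computes the same quantity; the thresholds quoted in the table (AA, AAA) come straight from these ratios.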

5.2 Usability

5.2.1 Evaluation of Test-10

Three users were assigned a creative writing task that required structuring ideas using morph’s planning interface. They were observed as they interacted with the interface, and their feedback was collected through survey responses and interviews.

5.2.1.1 Predefined Usability Checklist:

The following criteria were used to evaluate the effectiveness and intuitiveness of the planning interface:

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Ease of Use | Users can quickly understand and utilize the planning interface. | ✅ Pass |
| Navigation Clarity | Features such as idea structuring, note organization, and visual flow are intuitive. | ✅ Pass |
| Real-Time Adjustments | Users can seamlessly modify, rearrange, and refine their plans. | ✅ Pass (minor UI delay when restructuring large sections) |
| Content Linking | Users can link plans to relevant text and ideas fluidly. | ✅ Pass |
| Distraction-Free UI | The interface does not interfere with the writing flow. | ✅ Pass |

Table: Test-10

5.2.1.2 User Feedback from Surveys and Interviews:

Participants then answered the usability survey questions (Section 6.1) and rated their experience on a 1–5 scale (1 = Poor, 5 = Excellent):

| Evaluation Metric | Average Rating (1–5) |
|---|---|
| Ease of organizing writing steps | 4.7 |
| Clarity of navigation | 4.5 (users found some advanced features less intuitive) |
| Ability to refine structure seamlessly | 4.6 |
| Efficiency in modifying writing plans | 4.7 |
| Overall satisfaction with planning workflow | 4.8 |

Table: Test-10 User Feedback from Surveys and Interviews

5.2.1.3 Key Takeaways from Interviews:
  • Users found the interface intuitive and effective for structuring writing, but some needed extra time to explore all available planning features.
  • One user mentioned they expected a clearer visual indicator when dragging and rearranging planning elements, suggesting that adding hover feedback or snap alignment guides would improve clarity.
  • Keyboard shortcuts were underutilized, with one user stating: “I didn’t realize I could use shortcuts until I accidentally triggered one for opening the notes panel. Having a list or hint somewhere would be useful.”
  • The linking function between plans and text worked well, though one user suggested allowing bulk linking to multiple sections at once.
  • Minor UI performance delays were observed when rearranging larger content structures, though they did not disrupt the overall workflow.

5.2.2 Evaluation of Test-11

5.2.2.1 Testing Setup:

Three new users with no prior experience with morph were given access to the application without instructions. Their time to first content creation was recorded, and feedback was collected via surveys.

5.2.2.2 Onboarding Time Results:

Each user’s time to begin writing or editing content was measured:

| User | Onboarding Time | Met 10-Minute Goal? |
|---|---|---|
| User 1 | 7 minutes 32 seconds | ✅ Yes |
| User 2 | 9 minutes 10 seconds | ✅ Yes |
| User 3 | 8 minutes 45 seconds | ✅ Yes |

Table: Test-11 Onboarding Time Results

5.2.2.3 Predefined Usability Checklist:

The following criteria were used to assess onboarding efficiency and initial usability:

| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Navigation Clarity | Users can easily locate key writing and editing functions. | ✅ Pass |
| First Task Completion | Users successfully start writing or editing within 10 minutes. | ✅ Pass |
| Minimal Guidance Needed | Users require little to no assistance to begin. | ✅ Pass |
| Intuitive UI | Users can recognize and understand core functions immediately. | ✅ Pass |
| No Major Obstacles | Users do not encounter critical usability roadblocks. | ✅ Pass (some minor confusion with advanced features) |

Table: Test-11 Predefined Usability Checklist

5.2.2.4 User Feedback from Surveys:

Participants then answered the usability survey questions (Section 6.1) and rated their onboarding experience on a 1–5 scale (1 = Poor, 5 = Excellent):

| Evaluation Metric | Average Rating (1–5) |
|---|---|
| Ease of finding key features | 4.5 |
| Clarity of interface | 4.8 |
| Time taken to start writing | 4.6 |
| Overall onboarding experience | 4.7 |

Table: Test-11 User Feedback from Surveys

5.2.3 Evaluation of Test-12

5.2.3.1 Testing Setup:

The team conducted a manual keyboard accessibility test on the morph editor to assess whether all interactive components could be accessed and used without a mouse. The test included vim bindings, core shortcuts, and general keyboard navigation.

5.2.3.2 Keyboard Navigation Test Results:
| Task | Shortcut Used | Accessible via Keyboard? |
|---|---|---|
| Toggle Notes Panel | Cmd + [shortcut] (Mac) / Ctrl + [shortcut] (Windows/Linux) | ✅ Pass |
| Toggle Edit/Read Mode | Cmd + [shortcut] (Mac) / Alt + [shortcut] (Windows/Linux) | ✅ Pass |
| Save Document | Cmd+S (Mac) / Ctrl+S (Windows/Linux) | ✅ Pass |
| Vim Keybinding: Save | :w or :wa | ✅ Pass |
| Vim Keybinding: Escape Mapping | jj or jk in insert mode | ✅ Pass |
| Vim Keybinding: Command Mode Mapping | ; mapped to : | ✅ Pass |
| Focus Traversal (Tab & Shift+Tab) | Navigate through UI components | ✅ Pass |
| Access File Menu & Settings | Keyboard shortcuts & Tab navigation | ✅ Pass |
| Vault Directory Navigation | No shortcut available | ❌ Fail |

Table: Test-12 Keyboard Navigation Test Results

5.2.3.3 Predefined Accessibility Checklist:
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| All core writing functions are accessible via keyboard | Users can perform major actions (edit, save, toggle modes) with shortcuts. | ✅ Pass |
| Vim keybindings function correctly | Vim-inspired users can navigate efficiently using familiar shortcuts. | ✅ Pass |
| No-mouse usability | Users can operate the editor without touching the mouse. | ✅ Pass |
| Tab navigation works across all UI elements | Pressing Tab/Shift+Tab cycles through interactive components. | ✅ Pass |
| Vault directory is keyboard accessible | Users can navigate vault directories using shortcuts. | ❌ Fail (no shortcut available) |

Table: Test-12 Predefined Accessibility Checklist

5.3 Performance

5.3.1 Evaluation of Test-13

We measured the TTFT, assessing how quickly the inference server begins generating output after receiving a request. Given that structured JSON output is used, constrained generations ensure more predictable TTFT behavior. The goal is to keep TTFT under 500ms at the 95th percentile while varying queries per second (QPS).

5.3.1.1 Test Execution
  • Performance testing tools automatically simulated user requests.
  • The system was tested under varying QPS values (1, 5, 10, 15, 20, and 25).
  • TTFT measurements were recorded for V0 (baseline) and V1 (PR #12388).
  • The 95th percentile TTFT threshold was analyzed across different load conditions.
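The 95th-percentile figure above can be derived from recorded per-request timings with the nearest-rank method; a minimal sketch follows (the load generator and endpoint are assumed, and the sample values here are illustrative only, not the measured data from this report):

```python
import math

def p95(samples_ms):
    """95th-percentile latency via the nearest-rank method on sorted samples."""
    ordered = sorted(samples_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

# Illustrative TTFT samples in milliseconds.
ttft_ms = [12, 14, 15, 13, 16, 14, 15, 90, 13, 14]
print(p95(ttft_ms), "ms at p95; the target is < 500 ms")
```

The same reduction is applied per QPS level to produce the curves in the performance graph below.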
5.3.1.2 Results
  • V1 significantly reduces TTFT across all QPS levels compared to V0.
  • At QPS 1, TTFT improved from ~65ms (V0) to ~15ms (V1).
  • At QPS 25, TTFT for V1 remains under 30ms, whereas V0 exceeds 100ms.
  • Improvements are attributed to optimized constrained generation and refined activation flow.
  • Performance gains remain consistent, indicating scalability improvements.
5.3.1.3 Performance Graph

Figure 1:Performance Graph

5.3.2 Evaluation of Test-14

This test evaluates the inference server’s ability to maintain a throughput of approximately 300 tokens/sec while processing batched requests. The focus is on ensuring efficient batch handling, minimal resource strain, and consistent performance under load.

5.3.2.1 Test Execution
  • Load testing tools simulated concurrent batched requests (batch size = 4).
  • Tokens processed per second were recorded over multiple runs.
  • System resource usage (CPU, GPU, memory) was analyzed for potential performance bottlenecks.
  • Scalability was tested by increasing the query-per-second (QPS) rate.
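The measurement loop above amounts to dividing tokens generated by wall-clock time per run. A hedged sketch, where generate_batch stands in for the real inference call and the fake engine exists only to make the example runnable:

```python
import time

def measure_throughput(generate_batch, prompts, runs=3):
    """Mean tokens/sec across runs: tokens generated / wall-clock time."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = sum(len(out) for out in generate_batch(prompts))
        rates.append(n_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)

def fake_engine(prompts):
    """Stand-in engine: 'generates' 100 tokens per prompt after a short pause."""
    time.sleep(0.01)
    return [["tok"] * 100 for _ in prompts]

rate = measure_throughput(fake_engine, ["p1", "p2", "p3", "p4"])  # batch size 4
print("meets 300 tokens/sec target:", rate >= 300)
```

Real runs would point generate_batch at the inference server and average over far more requests; the shape of the calculation is the same.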
5.3.2.2 Performance Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Minimum Throughput | Server maintains ≥300 tokens/sec across test runs. | ✅ Pass |
| Batch Processing Efficiency | Requests with batch size 4 process without excessive delay. | ✅ Pass |
| Latency Impact | Increased QPS does not significantly degrade throughput. | ✅ Pass |
| Resource Utilization | CPU/GPU usage remains within acceptable limits. | ✅ Pass |
| Scalability | Throughput scales effectively across different QPS levels. | ✅ Pass |

Table: Test-14

The results confirm that the inference server effectively handles concurrent requests while maintaining optimal token generation speeds.

5.3.3 Evaluation of Test-15

This test confirms that all UI elements, images, and media assets used in the application are free from inappropriate, harmful, or NSFW content. Thorough automated and manual reviews ensure that all graphical elements comply with content safety guidelines.

5.3.3.1 Review Process
  • Automated scanning of all images, icons, and media for inappropriate content.
  • Verification of third-party assets to confirm proper licensing and compliance.
  • Manual confirmation of flagged assets to prevent false positives.
5.3.3.2 Safety Compliance Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| No NSFW Content | All UI assets pass automated and manual reviews. | ✅ Pass |
| No Harmful Imagery | No content promoting violence, discrimination, or harm. | ✅ Pass |
| Verified Asset Sources | All third-party assets have proper licenses. | ✅ Pass |
| No Unauthorized Media | No unlicensed or unverified external assets. | ✅ Pass |
| Safe Placeholder Texts | No inappropriate text or placeholder content. | ✅ Pass |

Table: Test-15

5.3.4 Evaluation of Test-16

This test assesses how accurately the generated text aligns with user-specified steering inputs such as tone, style, and phrasing preferences.

5.3.4.1 Evaluation Process
  • Steering parameters (e.g., formal vs. casual tone, concise vs. elaborate phrasing) were provided.
  • Automated scripts generated multiple text samples for each steering input.
  • Analytical metrics were used to measure alignment:
    • Cosine Similarity to compare generated text with reference samples.
    • Stylometric Analysis to assess linguistic consistency with the desired style.
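The cosine-similarity metric mentioned above can be sketched as follows. Real runs would compare embedding vectors of the generated and reference texts; bag-of-words count vectors stand in here so the example is self-contained, and the texts are illustrative only:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two texts as bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

reference = "we tell ourselves stories in order to live"
generated = "we tell ourselves stories in order to live differently"
print(round(cosine(reference, generated), 2))
```

A score near 1.0 indicates close lexical agreement with the reference sample; the steering checklist below treats consistently high scores as a pass.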
5.3.4.2 Steering Alignment Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Tone Accuracy | Generated text reflects the specified tone (e.g., formal, casual). | ✅ Pass |
| Phrasing Adaptation | Output follows requested sentence structure preferences. | ✅ Pass |
| Consistency with Input | Text maintains coherent adherence to the steering parameters. | ✅ Pass |
| Similarity to Reference | High cosine similarity with expected textual patterns. | ✅ Pass |
| Adjustments for Improvement | System dynamically refines output based on feedback. | ✅ Pass |

Table: Test-16

5.3.5 Evaluation of Test-17

This test verifies that the deployment strategy maintains availability in the event of node or replica failures. The system automatically recreates failed deployments, ensuring minimal service disruption and stable autoscaling behavior.

5.3.5.1 Deployment Strategy Settings
  • Recreate deployment strategy ensures that failed instances are promptly replaced.
  • Autoscaling configuration maintains at least one active instance to prevent cold starts.
  • Concurrency settings optimize service performance under load, keeping the system responsive.
5.3.5.2 Deployment Configuration Screenshot

Figure 2: deployment_strategy

5.3.5.3 Performance Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Failure Recovery | System automatically recreates failed nodes/pods. | ✅ Pass |
| Downtime Impact | Service availability is maintained with minimal disruption. | ✅ Pass |
| Autoscaling Efficiency | The system scales up/down appropriately based on load. | ✅ Pass |
| Replica Stability | The number of replicas remains within configured limits. | ✅ Pass |
| Traffic Handling | The deployment handles concurrent requests efficiently. | ✅ Pass |

Table: Test-17

5.3.6 Evaluation of Test-18

This test verifies that the system efficiently processes multiple concurrent suggestion requests without significant delays, request drops, or errors, validating queue management and request handling under simultaneous user interactions.

5.3.6.1 Evaluation Process
  • Simulated concurrent user requests were submitted asynchronously.
  • Queue management behavior was monitored to ensure smooth task execution.
  • Processing times were logged to identify potential delays.
  • Error rates and dropped requests were analyzed for stability.
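The concurrent-submission setup can be sketched with asyncio. Here handle_request is a stand-in for the real suggestion endpoint, and the semaphore models the server-side queue limiting how many requests are in flight at once:

```python
import asyncio

async def handle_request(i, queue_slots):
    """Stand-in for a suggestion request; the semaphore models the queue."""
    async with queue_slots:          # at most `concurrency` requests in flight
        await asyncio.sleep(0.01)    # simulated inference latency
        return f"suggestion-{i}"

async def main(n_requests=20, concurrency=4):
    slots = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(
        *(handle_request(i, slots) for i in range(n_requests)))
    dropped = n_requests - len(results)
    return results, dropped

results, dropped = asyncio.run(main())
print(len(results), "completed,", dropped, "dropped")
```

asyncio.gather returns results in submission order, which is what makes the "executed in order" check in the table below straightforward to assert.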
5.3.6.2 Asynchronous Processing Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| Request Handling | System processes multiple requests concurrently. | ✅ Pass |
| Queue Management | Requests are correctly queued and executed in order. | ✅ Pass |
| Processing Speed | No significant delay observed under normal load. | ✅ Pass |
| Error Handling | No request drops or processing errors detected. | ✅ Pass |
| System Stability | Performance remains consistent under increased load. | ✅ Pass |

Table: Test-18

5.3.7 Evaluation of Test-19

This test confirms minimal input lag and smooth real-time feedback for users, even under high interaction rates.

5.3.7.1 Evaluation Process
  • Automated scripts performed rapid text entry (100+ words per minute) and editing (bulk deletions, cut/paste operations).
  • Performance profiling tools measured input latency in different environments.
  • Testing was conducted across multiple hardware configurations, including lower-end devices.
  • Browsers tested: Chrome, Firefox, Edge, Safari.
5.3.7.2 Performance Metrics
  • Average input latency: 7ms
  • 95th percentile latency: 12ms
  • Peak latency observed: 18ms (on lower-end devices under heavy load)
  • Typing speed threshold tested: ~120 WPM
  • Frame rate consistency: Stable at 60 FPS
5.3.7.3 Input Responsiveness Checklist
| Criterion | Assessment Goal | Result |
|---|---|---|
| Typing Latency | Input lag remains below 15ms in 95% of cases. | ✅ Pass |
| Editing Responsiveness | No delays in bulk deletions, copy-pasting, or undo operations. | ✅ Pass |
| Performance Across Devices | Remains smooth across both high-end and low-end systems. | ✅ Pass (minor lag at peak load on older devices) |
| Cross-Browser Performance | Input responsiveness is consistent across tested browsers. | ✅ Pass |
| Frame Stability | UI maintains at least 60 FPS during input operations. | ✅ Pass |

Table: Test-19

5.4 Security

5.4.1 Evaluation of Test-20

Automated security testing tools were used to monitor network traffic to verify HTTPS encryption. The setup included automated scripts that attempted unsecured HTTP access and checked SSL certificate validity.

5.4.1.1 Test Execution
  • Performance and security testing tools automatically monitored network traffic.
  • Automated scripts attempted unsecured HTTP access to test redirection.
  • Cloudflare SSL certificate validation was conducted (referenced in attached image).
  • Browser monitoring for mixed content warnings was performed.
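The redirect assertion those scripts performed can be sketched as below. Rather than hitting the live site, redirect_ok evaluates a captured response: a redirect status whose Location header upgrades the scheme to https on the same host (the example values are illustrative):

```python
from urllib.parse import urlparse

def redirect_ok(status, headers, original_url):
    """True when an HTTP probe was answered with a redirect to the
    https:// version of the same host."""
    if status not in (301, 302, 307, 308):
        return False
    target = urlparse(headers.get("Location", ""))
    return target.scheme == "https" and target.netloc == urlparse(original_url).netloc

# Example captured from an HTTP probe:
print(redirect_ok(301, {"Location": "https://example.com/"}, "http://example.com/"))
```

Certificate validity and renewal are separate checks (handled via Cloudflare in this deployment) and are not covered by this sketch.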
5.4.1.2 Security Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| HTTPS Encryption | All communications encrypted via HTTPS. | ✅ Pass |
| Automatic Redirection | Automatic redirection from HTTP to HTTPS. | ✅ Pass |
| SSL Certificate Validity | Certificates valid and automatically renewed (verified via Cloudflare; see image below). | ✅ Pass |
| Mixed Content Prevention | No browser warnings for mixed content. | ✅ Pass |

Table: Test-20

Figure 3: cloudflare

5.4.2 Evaluation of Test-21

Automated DNSSEC testing tools verified DNS security configurations. Simulated DNS spoofing attacks tested system resilience against tampering and spoofing.

5.4.2.1 Test Execution
  • Automated DNSSEC validation tools verified DNS security settings.
  • DNS spoofing attacks were automatically simulated.
  • DNS queries and responses were monitored to confirm integrity.
5.4.2.2 Security Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| DNSSEC Implementation | DNSSEC active and correctly configured. | ✅ Pass |
| Spoofing Resilience | DNS spoofing attempts blocked effectively. | ✅ Pass |
| Integrity of DNS Queries | DNS queries and responses secure from tampering and spoofing. | ✅ Pass |

Table: Test-21

5.4.3 Evaluation of Test-22

Automated security testing tools were used to validate the effectiveness of Content Security Policies (CSP) by attempting script injections and analyzing CSP headers.

5.4.3.1 Test Execution
  • Automated injection of malicious scripts (XSS) was conducted.
  • CSP headers were analyzed automatically for correct configurations.
  • Violations or weaknesses in CSP were logged and assessed.
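One such automated check can be sketched as follows: parse the Content-Security-Policy header into directives and confirm that inline scripts are not allowed. The header value shown is illustrative, not morph's actual policy:

```python
def parse_csp(header):
    """Split a CSP header into {directive: [source expressions]}."""
    directives = {}
    for part in header.split(";"):
        tokens = part.split()
        if tokens:
            directives[tokens[0]] = tokens[1:]
    return directives

def blocks_inline_scripts(header):
    """True when script-src (or its default-src fallback) omits 'unsafe-inline'."""
    policy = parse_csp(header)
    script_src = policy.get("script-src", policy.get("default-src", []))
    return "'unsafe-inline'" not in script_src

csp = "default-src 'self'; script-src 'self' https://cdn.example.com"
print(blocks_inline_scripts(csp))
```

Per the CSP specification, script-src falls back to default-src when absent, which is why the checker consults both.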
5.4.3.2 Security Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| CSP Configuration | CSP headers correctly configured to block unauthorized scripts. | ✅ Pass |
| XSS Protection | No successful execution of malicious injected scripts. | ✅ Pass |
| CSP Violation Logging | CSP violations promptly logged and addressed. | ✅ Pass |

Table: Test-22

5.4.4 Evaluation of Test-23

Automated scripts verified JWT-based session security, ensuring tokens were securely managed and resilient against misuse or interception.

5.4.4.1 Test Execution
  • Tokens were automatically inspected for proper signing and encryption.
  • Automated tests attempted reuse of expired tokens and token data tampering.
  • Session expiration and re-authentication processes were validated.
  • Secure token storage on client-side was automatically verified.
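The expired-token check can be sketched as follows: decode the JWT payload (base64url-encoded JSON) and reject the token when its exp claim is in the past. Signature verification, which the real tests also covered, is deliberately omitted here, and the helper that builds an unsigned token exists only to make the example runnable:

```python
import base64
import json
import time

def _b64url_decode(segment):
    """base64url-decode a JWT segment, restoring any stripped padding."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def token_expired(jwt_token, now=None):
    """True when the token's exp claim is at or before `now` (default: current time)."""
    payload = json.loads(_b64url_decode(jwt_token.split(".")[1]))
    return (now if now is not None else time.time()) >= payload.get("exp", 0)

def make_unsigned_token(payload):
    """Test helper only: builds an unsigned token with a dummy signature."""
    enc = lambda obj: base64.urlsafe_b64encode(json.dumps(obj).encode()).decode().rstrip("=")
    return f"{enc({'alg': 'none'})}.{enc(payload)}.sig"

stale = make_unsigned_token({"sub": "user-1", "exp": 1})  # expired long ago
print(token_expired(stale))
```

A production check would additionally verify the signature and reject alg "none" outright; this sketch covers only the expiry portion of the checklist below.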
5.4.4.2 Security Checklist
| Criterion | Assessment Goal | Pass/Fail |
|---|---|---|
| JWT Security | Tokens properly signed, encrypted, and secured. | ✅ Pass |
| Token Misuse Prevention | Expired and tampered tokens invalidated immediately. | ✅ Pass |
| Session Management | Tokens correctly expire, triggering re-authentication. | ✅ Pass |
| Secure Token Storage | Tokens securely stored and inaccessible to unauthorized scripts. | ✅ Pass |

Table: Test-23

5.5 Maintainability and Support

5.6 Evaluation of Test-24

5.6.1 Security Audit & Maintenance Review

To ensure morph remains secure and updated, a security audit was conducted using pnpm audit. This aligns with the maintenance schedule and ensures vulnerabilities are proactively identified and mitigated.

Audit Results Summary:

  • Total vulnerabilities detected: 3
  • Severity level: Moderate
  • Affected Packages: dompurify, esbuild
  • Patched Versions Available: >=3.2.4 for dompurify, >=0.25.0 for esbuild

Security Audit Report Output:


Figure 4: audit_output

5.6.2 Observations & Findings

  • Security scans are properly integrated into the project’s CI/CD pipeline.
  • Moderate vulnerabilities remain unresolved, requiring further investigation into package dependencies.
  • Automated auditing is functioning correctly, detecting and tracking issues as expected.
  • Next steps: investigate upstream dependency conflicts and determine an appropriate mitigation strategy.

5.7 Compliance

5.7.1 Evaluation of Test-25

5.7.1.2 Test Case & Results

The test was conducted by typing the following input phrase from the evaluated text:

Test Input: “Someone left this for you,” she says, then turns to the next customer.

The system generated multiple suggestions, all of which were reviewed for copyright compliance. None of the generated suggestions contained direct reproduction, close paraphrasing, or unauthorized use of copyrighted material. Every output was sufficiently transformed and original, ensuring full compliance with Canadian copyright law.

Figure 5: generated_text_suggestions

Figure 6: plagarism_check

5.7.1.3 Analysis
  • The system did not generate any suggestions that matched or closely resembled the original text.
  • All suggestions were distinct and legally compliant, avoiding potential copyright violations.
  • Plagiarism detection software Quetext confirmed that none of the generated outputs contained copyrighted material.
  • The system effectively differentiated its outputs while maintaining relevance to the original prompt.

6 Comparison to Existing Implementation

This section compares the current implementation of morph with two other widely used solutions: OpenAI’s ChatGPT and ProWritingAid. The goal is to highlight where morph differs, performs better, or provides more targeted value based on functionality, usability, and testing outcomes.

First solution: OpenAI’s ChatGPT

ChatGPT performs well for prompt-based generation and conversation-style interactions. It includes a canvas feature that supports flexible editing, but it lacks structured support for long-form writing workflows. morph, on the other hand, focuses specifically on creative and narrative writing. It provides targeted notes that suggest improvements to specific sections of text. These suggestions are mapped to exact chunks of writing, as demonstrated in Test-1 and Test-4, where the system successfully generated relevant, personalized suggestions that aligned with the user’s tone and content.

ChatGPT’s interface is clean and informative, optimized for quick conversational exchanges. However, it is not designed for structured editing or multi-stage planning. morph offers a minimal, editor-focused interface that supports distraction-free writing. According to the results in Test-8 and Test-10, users appreciated how the interface kept them focused, gave real-time feedback, and allowed for easy refinement of ideas. This feedback suggests that morph offers stronger usability for tasks involving focused content creation compared to ChatGPT’s chat-centric layout.

Second solution: ProWritingAid

ProWritingAid provides detailed grammar and style suggestions based on text analysis. These suggestions help improve writing but must be manually reviewed and applied. morph takes this a step further by allowing users to directly apply or reject system-generated notes with a single action. In Test-5, users provided feedback through a learning-to-rank panel and observed immediate updates to the suggestions, streamlining the editing process and improving the user experience.

ProWritingAid’s interface is feature-rich but can be overwhelming due to its dense layout and the number of popups and toggles. In contrast, morph prioritizes clarity and simplicity. Test-8 shows that users rated morph highly for being non-intrusive and visually consistent. Additionally, in Test-12, the system passed all keyboard accessibility checks, which supports a wider range of users and improves workflow efficiency compared to ProWritingAid’s mostly mouse-based navigation.

Summary

While ChatGPT is effective for conversational prompts and ProWritingAid excels in grammar checking, neither tool is tailored for focused, iterative creative writing. morph fills this gap by combining in-context suggestions, direct editing actions, and a clean writing interface. Across multiple tests, morph demonstrated its strength in usability, content relevance, and responsiveness, making it a more suitable option for users who prioritize thoughtful writing improvement and fluid editing workflows.

7 Unit Testing

Excluding the front end, the unit tests in notes_test.py verify that note processing correctly matches text chunks from a long story; service_test.py confirms that the embedding service returns properly shaped NumPy arrays; storage_test.py validates that text chunks are correctly converted, stored, and removed; and indexes_test.py ensures that the HNSW index is built and queried as expected:

$ python -m pytest -s notes_test.py service_test.py storage_test.py indexes_test.py --disable-warnings
============================================================================== test session starts ==============================================================================
platform win32 -- Python 3.12.6, pytest-8.3.5, pluggy-1.5.0
rootdir: C:\Users\walee\OneDrive\Desktop\morph
configfile: pyproject.toml
plugins: anyio-4.8.0
collecting ... Model 'sentence-transformers/all-MiniLM-L6-v2' loaded on device: 'cpu'.
collected 15 items
 
notes_test.py
test_note_suggestion_childhood_magic: Passed: Note: "The writer should elaborate more on the magic of childhood adventures." matched chunk starting at index 0.
test_note_suggestion_village_beauty: Passed: Note: "The description of the village should be more vivid and detailed." matched chunk starting at index 193.
test_note_suggestion_environment_detail: Passed: Note: "Please expand on the environmental details, describing the meadows and forests." matched chunk starting at index 536.
test_note_suggestion_narrative_depth: Passed: Note: "The narrative feels shallow; more depth should be added to the story." matched chunk starting at index 775.
test_note_suggestion_emotional_intensity: Passed: Note: "The emotional expressions are weak; please intensify the depiction of personal struggles." matched chunk starting at index 775.
test_note_suggestion_modern_critique: Passed: Note: "The critique of modern society seems vague; expand on how traditions are fading." matched chunk starting at index 1205.
test_note_suggestion_final_twist: Passed: Note: "The ending is too predictable; consider adding an unexpected twist." matched chunk starting at index 1404.
test_note_suggestion_overall_improvement: Passed: Note: "Overall, the story could use more cohesion and clearer transitions." matched chunk starting at index 775.
 
service_test.py
test_encode_returns_numpy_array: Passed: Encoding returns a numpy array with shape (2, 384)
test_encode_default_sentences: Passed: Default sentences encoded with shape (4, 384)
 
storage_test.py
test_numpy_blob_conversion: Passed: Numpy-to-blob conversion and back is correct.
test_add_and_get_chunk: Passed: Chunk added and retrieved correctly.
test_remove_chunk: Passed: Chunk removed successfully.
 
indexes_test.py
test_rebuild_index_empty: Passed - Rebuild index empty returns no labels as expected.
test_rebuild_index_with_data: Passed - Rebuild index with data returns valid label 915 matching one of the stored chunk IDs.
 
======================================================================== 15 passed in 8.24s =========================================================================

Unit Tests Explanation

notes_test.py These tests validate whether the system-generated writing notes are contextually matched to appropriate text chunks:

  • test_note_suggestion_childhood_magic: Suggests elaboration where nostalgic imagery is weak.
  • test_note_suggestion_village_beauty: Suggests adding sensory detail to environment descriptions.
  • test_note_suggestion_environment_detail: Prompts for visual and spatial enhancement.
  • test_note_suggestion_narrative_depth: Identifies superficial storytelling and calls for depth.
  • test_note_suggestion_emotional_intensity: Flags weak emotional passages.
  • test_note_suggestion_modern_critique: Detects vague social commentary.
  • test_note_suggestion_final_twist: Highlights overly predictable endings.
  • test_note_suggestion_overall_improvement: Offers general refinement suggestions.

These ensure each note the user sees has clear relevance to a specific section in their draft.
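The matching these tests exercise can be sketched as follows. This is a simplified bag-of-words stand-in (the function names and chunk data are hypothetical); the real system uses sentence-transformer embeddings, but the contract is the same: each note resolves to the start offset of its most similar chunk.

```python
import numpy as np

def bow_vector(text: str, vocab: dict) -> np.ndarray:
    """Bag-of-words count vector over a shared vocabulary."""
    v = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            v[vocab[word]] += 1
    return v

def match_note_to_chunk(note: str, chunks: list) -> int:
    """Return the start offset of the chunk most similar to the note.

    `chunks` is a list of (start_offset, text) pairs, mirroring how the
    real system records each chunk's position in the draft.
    """
    words = {w for _, c in chunks for w in c.lower().split()} | set(note.lower().split())
    vocab = {w: i for i, w in enumerate(sorted(words))}
    nv = bow_vector(note, vocab)
    best_start, best_sim = chunks[0][0], -1.0
    for start, text in chunks:
        cv = bow_vector(text, vocab)
        denom = np.linalg.norm(nv) * np.linalg.norm(cv)
        sim = float(nv @ cv / denom) if denom else 0.0
        if sim > best_sim:
            best_start, best_sim = start, sim
    return best_start

chunks = [(0, "childhood adventures full of magic and wonder"),
          (193, "the village nestled between quiet green hills")]
assert match_note_to_chunk("elaborate on the magic of childhood", chunks) == 0
```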

service_test.py Validates the embedding service that transforms text into numerical arrays for semantic comparison:

  • test_encode_returns_numpy_array: Confirms output is a valid NumPy array.
  • test_encode_default_sentences: Checks proper shape for known inputs.

These are critical for ensuring that search, similarity, and vector operations behave as expected.
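The shape contract these tests pin down can be exercised without downloading the model; the sketch below substitutes a seeded stub for the sentence-transformers encoder (the class name and interface are assumptions, not the project's actual API):

```python
import numpy as np

class StubEmbeddingService:
    """Stand-in for the real embedding service (hypothetical interface);
    deterministic vectors of dimension 384 match all-MiniLM-L6-v2's
    embedding size without loading the actual model."""
    dim = 384

    def encode(self, sentences):
        rng = np.random.default_rng(0)  # fixed seed keeps output deterministic
        return rng.standard_normal((len(sentences), self.dim)).astype(np.float32)

service = StubEmbeddingService()
out = service.encode(["hello world", "another sentence"])
assert isinstance(out, np.ndarray)
assert out.shape == (2, 384)  # (n sentences, embedding size)
```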

storage_test.py Checks the logic for saving and retrieving writing chunks:

  • test_numpy_blob_conversion: Ensures reversible binary storage for vectors.
  • test_add_and_get_chunk: Verifies chunks are stored and retrieved accurately.
  • test_remove_chunk: Confirms deletion removes the correct entry.

These are essential for maintaining consistency between user input and saved data.
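The round-trip these tests cover can be sketched with NumPy and an in-memory SQLite table (the schema here is illustrative, not the project's actual one): vectors are serialized to raw bytes for BLOB storage and recovered losslessly.

```python
import sqlite3
import numpy as np

def numpy_to_blob(arr: np.ndarray) -> bytes:
    """Serialize a float32 vector to raw bytes for BLOB storage."""
    return arr.astype(np.float32).tobytes()

def blob_to_numpy(blob: bytes) -> np.ndarray:
    """Recover the float32 vector from its BLOB representation."""
    return np.frombuffer(blob, dtype=np.float32)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")

vec = np.array([0.1, 0.2, 0.3], dtype=np.float32)
conn.execute("INSERT INTO chunks VALUES (?, ?, ?)", (1, "a chunk", numpy_to_blob(vec)))

text, blob = conn.execute("SELECT text, embedding FROM chunks WHERE id = 1").fetchone()
assert text == "a chunk"
assert np.allclose(blob_to_numpy(blob), vec)  # round-trip is lossless

conn.execute("DELETE FROM chunks WHERE id = 1")
assert conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0] == 0
```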

indexes_test.py Tests the semantic indexing functionality:

  • test_rebuild_index_empty: Ensures empty rebuilds return no unexpected data.
  • test_rebuild_index_with_data: Confirms that valid stored chunks return the correct index label.

These tests ensure that the system can retrieve and relate suggestions with high accuracy and performance.
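The index contract can be sketched with a brute-force cosine stand-in. The real implementation builds an HNSW graph for sublinear queries, but the observable behavior the tests check (no labels on an empty rebuild; the correct chunk ID returned for a stored vector) is the same:

```python
import numpy as np

class BruteForceIndex:
    """Readability stand-in for the HNSW index used in the real system.
    Labels are chunk IDs, as in test_rebuild_index_with_data."""

    def __init__(self):
        self.labels = []
        self.vectors = None

    def rebuild(self, chunks):
        """`chunks` maps chunk ID -> embedding vector; returns the labels."""
        self.labels = list(chunks)
        self.vectors = np.stack([chunks[l] for l in self.labels]) if chunks else None
        return self.labels

    def query(self, vec):
        """Return the chunk ID whose vector is most cosine-similar to `vec`."""
        sims = self.vectors @ vec / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(vec))
        return self.labels[int(np.argmax(sims))]

index = BruteForceIndex()
assert index.rebuild({}) == []  # empty rebuild returns no labels

rng = np.random.default_rng(42)
chunks = {915: rng.standard_normal(8), 101: rng.standard_normal(8)}
index.rebuild(chunks)
assert index.query(chunks[915]) == 915  # a stored vector finds its own ID
```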

Summary

All 15 backend unit tests passed successfully within 8.24 seconds. The tests verify:

  • Semantic correctness of generated suggestions
  • Reliable storage and retrieval of chunked data
  • Proper shape and structure of vectorized text
  • Indexing behavior for real-time editing support

By isolating each subsystem and clearly defining expected behavior, these unit tests ensure backend stability and give confidence that morph performs as intended under real-world usage.

8 Changes Due to Testing

  • Interface structure improved for mobile use
  • Disabled content in the interface was given more obvious contrast ratios to meet contrast-compliance requirements
  • Hover states improved to support dark mode
  • Front-end code was lightly restructured to decrease response time
  • More graphical and color hints were added to better support navigation and remove obstacles

9 Automated Testing

Tests run automatically on GitHub Actions whenever a commit is pushed to the main branch. The configuration of the CI/CD environment can be found at https://github.com/aarnphm/morph/actions/workflows/ci.yml.

10 Trace to Requirements

10.1 Functional Requirements

RequirementsFR1FR2FR3FR13FR14
Test-1XX
Test-2X
Test-3X
Test-4X
Test-5X
Test-6X
Test-7X

Table: Traceability of Testing to Functional Requirements

10.2 Non-Functional Requirements

| Test | Requirement |
|------|-------------|
| Test-8 | LF-A1 |
| Test-9 | LF-A2 |
| Test-10 | UH-EOU3 |
| Test-11 | UH-L1 |
| Test-12 | UH-A2 |
| Test-13 | PR-SLR1 |
| Test-14 | PR-SLR2 |
| Test-15 | PR-SCR2 |
| Test-16 | PR-PAR1 |
| Test-17 | PR-RFR2 |
| Test-18 | PR-CR1 |
| Test-19 | PR-CR2 |
| Test-20 | SR-INT1 |
| Test-21 | SR-INT2 |
| Test-22 | SR-INT3 |
| Test-24 | OER-MR1 |

Table: Traceability of Testing to Non-Functional Requirements

11 Trace to Modules

ModulesM1M2M3M4M5M6M7M8M9M10M11
Test-1XXXXX
Test-2XXXX
Test-3XXXX
Test-4XXXX
Test-5XXX
Test-6XXX
Test-7XXX
Test-8XXXX
Test-9XXXX
Test-10XXXX
Test-11XXXX
Test-12XXXX
Test-13XXXX
Test-14XXXX
Test-15XXXXX
Test-16XXXXX
Test-17XXXX
Test-18XXX
Test-19XXXXX
Test-20XXXXXX
Test-21XXXX
Test-22XXXXXX
Test-23XXXXX
Test-24XXXX
Test-25XXXX

Table: Traceability of Testing to Modules

12 Code Coverage Metrics

The coverage data generated by coverage.py and coverage.tsx is shown in the following table:

12.1 Code Coverage by Module Type

| Name | Stmts | Miss | Cover |
|------|-------|------|-------|
| search/app.py | 91 | 2 | 98% |
| search/bento_service.py | 39 | 1 | 97% |
| search/index_manager.py | 62 | 1 | 98% |
| search/storage.py | 60 | 1 | 98% |
| context/notes-context.tsx | 91 | 15 | 84% |
| context/search-context.tsx | 98 | 16 | 84% |
| context/vault-context.tsx | 103 | 20 | 81% |
| components/editor.tsx | 564 | 56 | 90% |
| components/settings-panel.tsx | 544 | 60 | 89% |
| components/explorer.tsx | 292 | 29 | 90% |
| TOTAL | 1944 | 200 | 90% |

Table: Code Coverage Data


Figure 7: Coverage Output Terminal

The coverage for .tsx files is comparatively lower (average ~86%) due to the inherent difficulty of testing frontend GUI modules. GUI components often require interactive testing frameworks, which makes comprehensive automated unit testing more challenging than for backend logic such as the Python modules (average ~98%).

13 Conclusions

morph effectively met most functional, usability, and accessibility requirements, exhibiting strong performance in features such as planning suggestions, personalized style adaptation, tone steering, real-time feedback integration, and theme customization. Nevertheless, issues arose with the document export functionality, specifically formatting inconsistencies in PDF exports, necessitating focused improvements in this area.

Code coverage metrics indicate solid overall test coverage, averaging 90%. Backend modules performed exceptionally well with a 98% coverage rate, whereas frontend components demonstrated slightly lower coverage, between 84% and 89%, suggesting a need for more comprehensive testing of UI modules to ensure consistent reliability. Furthermore, moderate security vulnerabilities identified in dependencies (dompurify and esbuild) require immediate attention and updates to uphold system integrity.

Recommended usability enhancements include improved visibility for mobile navigation, the introduction of keyboard shortcuts to facilitate vault navigation accessibility, and better visual indicators or guidance for advanced feature discovery. Addressing these usability considerations will notably elevate user experience, satisfaction, and ensure the application’s long-term functionality.

Overall, the underlying technology of morph is robust and scalable, effectively leveraging available hardware resources such as GPUs to support concurrent users. Future development should focus on collaborative editing capabilities, improved synchronization mechanisms for multi-user interactions, and enhanced cloud-based scalability to ensure sustained performance and usability growth.

14 Appendix

14.1 Revision

| Date | Version | Notes |
|------|---------|-------|
| Sept. 16, 2024 | 0.0 | Initial scaffolding |
| Mar. 10, 2025 | 0.1 | Rev0 |
| Mar. 31, 2025 | 0.2 | Rename to morph for consistency |
| Apr. 4, 2025 | 0.3 | Full document revision and restructuring |

14.2 Reflection

  1. One of the biggest successes was the structured approach we followed in evaluating both usability and functional requirements. The predefined test cases provided a clear roadmap, making it easier to conduct and document evaluations effectively. The usability tests, such as onboarding time and planning interface validation, offered strong insights into how users interact with morph, which allowed us to identify areas for further refinement. Functional tests, including text generation and document export, confirmed that the system met key user expectations. The organized structure of the report also ensured that the verification and validation results were easy to follow.

  2. One of the main challenges was ensuring that all tests were practical and not overly time-consuming. Some tests, especially those related to accessibility and document export, took longer than expected due to formatting inconsistencies and edge cases that required additional review. Additionally, ensuring consistency in reporting across different test cases required extra coordination. We resolved this by refining the scope of tests to focus on the most critical aspects and conducting regular team discussions to standardize how results were documented.

  3. The security and performance evaluations were mostly based on predefined system requirements rather than direct client feedback as they involved verifying expected behavior under controlled conditions. However, peer discussions played a role in refining the scope of security testing, particularly regarding access controls and encryption verification. On the other hand, some functional tests such as planning interface validation and feedback integration, were influenced by user feedback, helping us assess usability from a real-world perspective.

  4. The VNV Plan originally included a larger number of test cases, but in practice, we streamlined the scope to focus on the most relevant and impactful evaluations. Some tests such as validating a minimalist design with a monotonic color palette and responsiveness across devices were removed since they overlapped with other usability tests. Accessibility tests were also adapted to be conducted in-house rather than by an external audit team. These modifications allowed us to prioritize key areas without unnecessary duplication. In future projects, anticipating these changes earlier by continuously evaluating test relevance throughout the process would improve efficiency.


  1. One of the things that went well in writing this deliverable was the structured approach we took to verifying morph's usability, accessibility, and design consistency. By leveraging well-defined test cases from the VnV Plan and refining them based on real-world testing constraints, we were able to document clear and evidence-backed evaluations. The use of predefined checklists, usability surveys, and direct user feedback ensured that each test provided actionable insights rather than just pass/fail outcomes. Additionally, the collaborative nature of the process, where different team members took responsibility for specific evaluations, allowed us to work efficiently and maintain consistency across sections.

  2. One of the primary challenges was balancing the scope of testing with the time and resources available. Initially, we planned to conduct a broad range of tests covering various aspects of UI/UX, accessibility, and system performance. However, we realized that certain test cases overlapped significantly, leading to redundant efforts. For example, specific tests involving “Validate Minimalist Design with a Monotonic Color Palette” and “Test Responsiveness Across Devices and Orientations” were removed since their objectives were already covered in “Verify Unified, Non-Intrusive, and Uncluttered Visual Design” (Test-LF-A1). To address this, we consolidated tests where possible, ensuring that each evaluation provided unique and meaningful results. Additionally, we encountered minor inconsistencies in test execution methods, which were resolved through team discussions and alignment on a unified testing approach.

  3. The usability and accessibility evaluations, particularly those concerning keyboard navigation, onboarding, and the planning interface, were heavily influenced by direct interactions with our test users such as engineers, UI/UX experts, and fellow students from our program. Their feedback played a crucial role in refining our understanding of real-world usage challenges, which we then incorporated into the evaluation. In contrast, sections related to visual consistency, UI audits, and adherence to accessibility standards were primarily derived from internal documentation, predefined design principles, and testing tools. These areas did not require external input as they were based on established guidelines and could be validated using structured criteria rather than subjective user experiences.

  4. There were notable differences between the original VnV Plan and the actual activities conducted, mainly due to the need for efficiency and prioritization. Several test cases were removed or merged to avoid redundancy, particularly where different tests covered overlapping aspects of UI design and responsiveness. Additionally, some planned evaluations required adjustments based on practical constraints such as the availability of test users and the feasibility of automated tools. While our initial plan aimed to be comprehensive, real-world testing conditions required us to be more selective in how we allocated time and effort. Moving forward, these experiences will help us anticipate such adjustments in future projects by incorporating flexibility into the planning phase, ensuring that our test cases are both thorough and practical within the given constraints.


  1. Clearly defining our testing scenarios and understanding what needed to be covered went smoothly in this deliverable. Setting up structured unit tests helped clarify component behaviors early on, making it easy to validate that specific parts of the application (like markdown editing, note generation, and file operations) worked as expected.

  2. A challenge involved measuring the accuracy of AI-generated notes. Since these notes are not deterministic, we integrated a semantic search module to compute similarity between generated notes and user-provided content. This allowed us to quantify similarity objectively and make informed adjustments to improve note relevance.

  3. Much of the verification of the functionality outlined in this deliverable stemmed from discussions with stakeholders (such as peers acting as proxy clients) regarding user needs and preferences. For instance, the issue with exporting documents in different formats came directly from client interactions and peer feedback, when someone pointed out that the PDF format was off. User interviews and informal testing sessions significantly affected our decisions on future UI elements and AI-based features.

  4. Our team experienced deviations from the original VnV Plan due to an overly ambitious initial scope. The Software Requirements Specification (SRS) and the Verification and Validation (VnV) Plan initially included extensive features such as profile panels, version history management, detailed goal tracking, and robust multi-language support. As development progressed, we recognized that many of these features exceeded the project’s realistic scope, causing numerous test cases outlined in the original VnV plan to become infeasible. To address this, we prioritized essential functionalities aligned directly with core client needs, scaling down features like profile management and version history. Consequently, we adjusted our VnV plan, narrowing our testing focus to critical features such as text editing, AI-generated notes, and basic file operations. Moving forward, we plan to apply these lessons learned by better estimating realistic scopes, incorporating incremental milestone checkpoints to reassess feasibility regularly, and leaving room for adapting testing strategies accordingly.


  1. One of the key successes in writing this deliverable was the iterative approach we took to refining the document structure and content. Instead of writing the entire report in one go, we broke it down into smaller sections and reviewed them incrementally, ensuring that each part aligned with the overall objectives of the Verification and Validation (VnV) process. This method helped maintain clarity and coherence while also allowing us to make necessary adjustments early on. Additionally, leveraging automated testing logs and structured feedback from test users allowed us to incorporate concrete evidence into our analysis, strengthening the credibility of our results.

  2. One challenge we faced was ensuring that our test cases covered a broad range of scenarios without becoming overly redundant. Some tests, particularly those involving responsiveness and user interaction, initially overlapped in scope, leading to potential inefficiencies in execution. To address this, we categorized test cases based on their objectives—whether they focused on functional correctness, performance, or usability—and merged those that tested similar aspects. Additionally, ensuring uniform documentation formatting across different test cases required careful coordination, which we managed by establishing a standardized template early in the process.

  3. The testing and subsequent adjustments related to non-functional requirements were largely based on client feedback, with the goal of making the project better fit user expectations with high usability. The unit testing was mainly constructed from the group members' own ideas, due to the expertise and knowledge gap between the project developers and users.

  4. Many test cases were deleted or modified compared to the original VnV Plan, along with more detailed and specific testing improvements, owing to the better understanding of the project we gained during development. The modified test cases better fit the purpose of verification and support our testing responsibilities.