Artifacts and Repository Scanner
Overview
The scanner defined in api/repos_scanner.py extracts structured “traceability”
information from a code/documentation repository based on a YAML configuration.
It can:
Clone a repository in a per-user temp folder.
Extract “snippets” (text sections) from an API reference document.
Derive work items from the snippets:
- Documents and nested documents
- Software requirements (and nested)
- Test specifications
- Test cases
- Justifications
Generate or re-use records in the database for all of the above (traceability generation).
Command Line Interface
The module can be executed directly to run a scan using a user’s configuration file:
python api/repos_scanner.py --userid 2 [--logfile 20251115_120000.log]
--userid: required. Used to locate the per-user config under api/user-files/<USERID>.config/config.yaml and to generate DB entities under that user.
--logfile: optional. When omitted, a timestamped log file is created in the same directory.
Logs are written to: api/user-files/<USERID>.config/<logfile>.
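As a rough illustration of the interface described above, the following sketch parses the two options and derives the log path. The helper names (parse_args, log_path) are hypothetical, and the timestamp format is only inferred from the example file name; this is not the module's actual code.

```python
import argparse
from datetime import datetime
from pathlib import PurePosixPath


def parse_args(argv=None):
    """Parse the two documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="repos_scanner.py")
    parser.add_argument("--userid", required=True,
                        help="ID of the user whose config is scanned")
    parser.add_argument("--logfile", default=None,
                        help="log file name; timestamped if omitted")
    return parser.parse_args(argv)


def log_path(userid, logfile=None, now=None):
    """Return api/user-files/<USERID>.config/<logfile>."""
    if logfile is None:
        # Assumed format, inferred from the example name 20251115_120000.log
        now = now or datetime.now()
        logfile = now.strftime("%Y%m%d_%H%M%S") + ".log"
    return PurePosixPath("api/user-files") / f"{userid}.config" / logfile


args = parse_args(["--userid", "2"])
print(log_path(args.userid, args.logfile))
```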
User Files and Layout
On a scan, the scanner mirrors the API reference document into the user’s files:
Repository is cloned into a per-user temp directory (sanitized path).
The matching API reference document is copied into api/user-files/<USERID>/<api>_<branch>.<ext> and used for section extraction.
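The naming scheme for the mirrored document can be sketched as below. The sanitize policy (replacing anything outside letters, digits, dot, underscore and hyphen) is an assumption made for illustration; the scanner's real sanitization rules are not described here.

```python
import re
from pathlib import PurePosixPath


def sanitize(part):
    """Replace anything outside [A-Za-z0-9._-] with '_' (assumed policy)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", part)


def mirrored_doc_path(userid, api, branch, source_file):
    """Build api/user-files/<USERID>/<api>_<branch>.<ext> for a source document."""
    ext = PurePosixPath(source_file).suffix  # includes the leading dot, e.g. ".md"
    name = f"{sanitize(api)}_{sanitize(branch)}{ext}"
    return PurePosixPath("api/user-files") / str(userid) / name


print(mirrored_doc_path(2, "myApi", "main", "docs/reference.md"))
```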
Configuration Primer
The main configuration is YAML-based. At the top level:
api:
  - name: ["myApi"]              # one or more API names (expanded as separate runs)
    library: "myLib"
    library_version: "__ref__"   # supports magic variables (branch/ref/version)
    repository:
      url: "https://github.com/org/repo.git"
      branch: "main"
      filename_pattern: "*.md"
      folder_pattern: "docs"
      hidden: false
      file_contains: []          # OR filter: include files containing any
      file_not_contains: []      # OR filter: exclude files containing any
    snippets:
      rules:
        - name: "API Sections"
          section:
            start:
              line_equal: "START"
              strip: true
            end:
              line_equal: "END"
              strip: true
            # Optional: split/filter/transform (explained below)
          documents: { rules: [...] }
          software_requirements: { rules: [...] }
          test_specifications: { rules: [...] }
          test_cases: { rules: [...] }
          justifications: { rules: [...] }
You can define nested rules (e.g., documents under documents) by supplying a sub-tree with rules.
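The comment on name in the configuration above says that a list of API names is expanded into separate runs. A minimal sketch of such an expansion follows; expand_runs is a hypothetical helper and the entry is modelled as a plain dict, which may differ from how the scanner represents its configuration.

```python
def expand_runs(entry):
    """Return one copy of the config entry per API name (hypothetical helper)."""
    names = entry["name"]
    if isinstance(names, str):
        names = [names]  # tolerate a single name given as a string (assumption)
    return [{**entry, "name": name} for name in names]


entry = {"name": ["apiA", "apiB"], "library": "myLib", "library_version": "__ref__"}
for run in expand_runs(entry):
    print(run["name"], run["library"])
```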
Section Extraction (start/end)
Sections are identified in two steps:
start selects the starting point(s)
end trims each selected section to its end
Both blocks accept a rich set of matchers (use exactly one per block, or combine with closest):
line_starting_with / line_not_starting_with
line_ending_with / line_not_ending_with
line_contains / line_not_contains
line_equal / line_not_equal
line_regex (regular expression)
line (exact line index) and at (exact character index)
Common options:
strip (bool|string|list-of-chars): default False. When True, trims whitespace; or pass characters.
lstrip / rstrip: like strip but applied only to left/right.
case_sensitive: default False.
skip_top_items / skip_bottom_items: skip matched sections at the beginning/end of the result list.
first_only: if True, stop after first match in each scanned range.
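To make the two-step start/end idea concrete, here is a small sketch that handles only line_equal together with the strip and case_sensitive options. The function names are hypothetical, and excluding the marker lines from the result is an assumption; the scanner may treat boundaries differently.

```python
def line_matches(line, expected, strip=False, case_sensitive=False):
    """Compare one line with an expected value, honouring strip and case_sensitive."""
    if strip is True:
        line = line.strip()
    elif strip:  # a string or list of characters to trim
        line = line.strip("".join(strip))
    if not case_sensitive:
        line, expected = line.lower(), expected.lower()
    return line == expected


def extract_sections(text, start, end, **options):
    """Return the text between each start line and the next end line."""
    sections, current = [], None
    for line in text.splitlines():
        if current is None:
            if line_matches(line, start, **options):
                current = []  # a section begins after the start marker
        elif line_matches(line, end, **options):
            sections.append("\n".join(current))
            current = None
        else:
            current.append(line)
    return sections


doc = "intro\n  START  \nalpha\nbeta\nend\nSTART\ngamma\nEND\n"
print(extract_sections(doc, "START", "END", strip=True))
```

With strip enabled the padded "  START  " line is recognised; without it only the second section is found, which is why strip: true is useful for exact comparisons.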
Closest and Extend
closest can re-anchor a start range to the closest line (e.g., to the nearest header above).
extend allows adding lines up/down by a fixed count after a start has been identified.
Example:
section:
  start:
    line_contains: "Title:"
    closest:
      direction: "up"
      line_starting_with: "# "
  end:
    line_starting_with: "## "   # next header
    first_only: true
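The re-anchoring done by closest in the example above can be approximated as follows. Both helpers are hypothetical, and starting the search at the matched line itself is an assumption.

```python
def find_first(lines, needle):
    """Index of the first line containing needle, or None (line_contains analogue)."""
    for i, line in enumerate(lines):
        if needle in line:
            return i
    return None


def closest_line(lines, index, prefix, direction="up"):
    """Index of the nearest line starting with prefix; index itself if none is found."""
    step = -1 if direction == "up" else 1
    i = index
    while 0 <= i < len(lines):
        if lines[i].startswith(prefix):
            return i
        i += step
    return index


lines = ["# Widget API", "", "Title: create widget", "Creates a widget.", "## Errors"]
hit = find_first(lines, "Title:")
start = closest_line(lines, hit, "# ", direction="up")
print(hit, start)  # the start moves from the "Title:" line to the header above it
```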
Filtering (after extraction; before transform)
Once start/end have produced sections, you can filter them:
section:
  start: { line_equal: "START", strip: true }
  end: { line_equal: "END", strip: true }
  filter:
    contains: ["keep this"]      # OR of substrings
    not_contains: ["exclude"]    # OR of substrings
    regex: ["\\bID-\\d+\\b"]     # OR of regexes
    case_sensitive: false        # optional, default false
Filtering runs before transform. Elements are kept if:
they match any of contains (if provided), and
they do not match any of not_contains (if provided), and
they match any of the regex (if provided).
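The three conditions above combine with AND, while the values inside each list combine with OR. A minimal sketch of that logic, using a hypothetical keep function:

```python
import re


def keep(element, contains=None, not_contains=None, regex=None, case_sensitive=False):
    """Return True if the element passes all provided filters."""
    norm = (lambda s: s) if case_sensitive else str.lower
    text = norm(element)
    if contains and not any(norm(s) in text for s in contains):
        return False  # none of the required substrings is present
    if not_contains and any(norm(s) in text for s in not_contains):
        return False  # at least one excluded substring is present
    flags = 0 if case_sensitive else re.IGNORECASE
    if regex and not any(re.search(p, element, flags) for p in regex):
        return False  # none of the patterns matches
    return True


elements = [
    "Keep this ID-12",
    "keep this but exclude me ID-7",
    "keep this, no identifier",
    "other ID-3",
]
rules = {"contains": ["keep this"], "not_contains": ["exclude"], "regex": [r"\bID-\d+\b"]}
print([e for e in elements if keep(e, **rules)])
```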
Split (optional)
After end-trimming (and before filter/transform), sections can be split into multiple elements:
section:
  start: { line_equal: "START" }
  end: { line_equal: "END" }
  split:
    delimiter: "\n---\n"
    strip: true
    keep_empty: false
Each split part becomes a new element with a recalculated index relative to the original text.
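One way to picture the recalculated index is the sketch below, which assumes the index is a character offset into the original text (matching the offset field of snippets) and that both options default to false. split_section is a hypothetical name.

```python
def split_section(text, offset, delimiter, strip=False, keep_empty=False):
    """Split text on delimiter and return (part, offset) pairs.

    Offsets are character positions relative to the original document,
    where `offset` is the position of `text` itself.
    """
    parts, cursor = [], 0
    for raw in text.split(delimiter):
        part, part_offset = raw, offset + cursor
        cursor += len(raw) + len(delimiter)
        if strip:
            # Move the offset past any leading whitespace that strip removes.
            part_offset += len(raw) - len(raw.lstrip())
            part = raw.strip()
        if part or keep_empty:
            parts.append((part, part_offset))
    return parts


section = "one\n---\n two \n---\n\n---\nthree"
print(split_section(section, 100, "\n---\n", strip=True))
```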
Transform (final text changes)
Transforms are applied last (after filtering). Supported operations:
uppercase / lowercase / camelcase / strip
suffix (requires value) / prefix (requires value)
replace (requires what and with)
regex_sub (what/pattern and with; flags via flags: "ims" or booleans: ignorecase, multiline, dotall)
Example:
section:
  start: { line_equal: "START", strip: true }
  end: { line_equal: "END", strip: true }
  filter:
    contains: ["keep"]
  transform:
    - { how: "regex_sub", what: "\\s+", with: " " }
    - { how: "uppercase" }
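The transform list in the example is applied in order, each step receiving the output of the previous one. The sketch below illustrates that pipeline for a subset of the operations; apply_transforms is a hypothetical name, and camelcase and the regex flags are left out because their exact behaviour is not specified here.

```python
import re


def apply_transforms(text, transforms):
    """Apply each transform step in order and return the final text."""
    for step in transforms:
        how = step["how"]
        if how == "uppercase":
            text = text.upper()
        elif how == "lowercase":
            text = text.lower()
        elif how == "strip":
            text = text.strip()
        elif how == "prefix":
            text = step["value"] + text
        elif how == "suffix":
            text = text + step["value"]
        elif how == "replace":
            text = text.replace(step["what"], step["with"])
        elif how == "regex_sub":
            text = re.sub(step["what"], step["with"], text)
        else:
            raise ValueError(f"unsupported transform in this sketch: {how}")
    return text


steps = [
    {"how": "regex_sub", "what": r"\s+", "with": " "},  # collapse whitespace runs
    {"how": "uppercase"},
]
print(apply_transforms("keep   this\n line", steps))
```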
Work Item Rules
Each work item type has a field schema. The scanner will extract values from the reference document (or use constants) and then combine them to produce items. Highlights:
Snippets:
- SNIPPET_FIELDS: section (str), offset (int)
Documents:
- DOCUMENT_FIELDS: title, description, document_type, spdx_relation, url, coverage
- Supports nested documents (documents.rules)
Software Requirements:
- SOFTWARE_REQUIREMENT_FIELDS: title, description, coverage
- Supports nested software requirements, test specifications, test cases
Test Specifications:
- TEST_SPECIFICATION_FIELDS: title, preconditions, test_description, expected_behavior, coverage
Test Cases:
- TEST_CASE_FIELDS: title, description, repository (optional), relative_path (optional), coverage
For each rule, you can use constants (value) or extraction (start/end) per field. Lists must align in
length; otherwise the rule is skipped.
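The alignment requirement can be illustrated with the sketch below. It assumes that extracted fields arrive as lists, that constants are repeated for every item, and that a skipped rule is signalled by returning None; combine_fields is a hypothetical name and the real scanner may behave differently.

```python
def combine_fields(fields):
    """Zip per-field value lists into items; return None if list lengths differ."""
    lists = {name: v for name, v in fields.items() if isinstance(v, list)}
    lengths = {len(v) for v in lists.values()}
    if len(lengths) > 1:
        return None  # misaligned lists: the rule is skipped
    count = lengths.pop() if lengths else 1
    return [
        {name: (v[i] if isinstance(v, list) else v) for name, v in fields.items()}
        for i in range(count)
    ]


extracted = {
    "title": ["Open device", "Close device"],         # extracted via start/end
    "description": ["Opens it.", "Closes it."],       # extracted via start/end
    "coverage": 100,                                  # constant (value)
}
print(combine_fields(extracted))
print(combine_fields({"title": ["A", "B"], "description": ["only one"]}))
```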
Traceability Generation
TraceabilityGenerator walks the extracted traceability structure and:
Creates or reuses DB entities (APIs, Documents, Software Requirements, Test Specifications, Test Cases, Justifications).
Persists mappings and nested relations.
Commits per step for robustness.
It uses the logged-in user (--userid) as creator for all new entities.
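The "create or reuse" behaviour with a commit per step can be pictured with a small get-or-create helper. The sketch uses an in-memory SQLite database and a made-up documents table purely for illustration; the project's actual schema and database layer are not shown here.

```python
import sqlite3


def get_or_create_document(conn, title, created_by):
    """Return (id, created) for a document, inserting it only if missing."""
    row = conn.execute(
        "SELECT id FROM documents WHERE title = ?", (title,)
    ).fetchone()
    if row:
        return row[0], False  # reuse the existing record
    cur = conn.execute(
        "INSERT INTO documents (title, created_by) VALUES (?, ?)",
        (title, created_by),
    )
    conn.commit()  # commit after each step so earlier work survives a later failure
    return cur.lastrowid, True


conn = sqlite3.connect(":memory:")
# Hypothetical table, not the project's real schema.
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT UNIQUE, created_by INTEGER)"
)
first = get_or_create_document(conn, "API reference", 2)
second = get_or_create_document(conn, "API reference", 2)
print(first, second)
```

Running the helper twice with the same title yields the same id, with the second call reporting that nothing new was created.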
Notes and Recommendations
For exact string comparisons on lines, prefer strip: true to normalize whitespace.
Use first_only to prevent multiple end matches within the same element.
skip_top_items / skip_bottom_items can be set independently for start and end.
Filtering is case-insensitive by default; set case_sensitive: true if needed.
For nested rules, remember each nested block is itself a full ruleset with its own repository and field configs.