Artifacts and Repository Scanner¶

Overview¶

The scanner defined in api/repos_scanner.py extracts structured “traceability” information from a code/documentation repository based on a YAML configuration. It can:

Clone a repository in a per-user temp folder.
Extract “snippets” (text sections) from an API reference document.
Derive work items from the snippets: - Documents and nested documents - Software requirements (and nested) - Test specifications - Test cases - Justifications
Generate or re-use records in the database for all of the above (traceability generation).

Command Line Interface¶

The module can be executed directly to run a scan using a user’s configuration file:

python api/repos_scanner.py --userid 2 [--logfile 20251115_120000.log]

--userid: required. Used to locate the per-user config under api/user-files/<USERID>.config/config.yaml and to generate DB entities under that user.
--logfile: optional. When omitted, a timestamped log file is created in the same directory.

Logs are written to: api/user-files/<USERID>.config/<logfile>.

User Files and Layout¶

On a scan, the scanner mirrors the API reference document into the user’s files:

Repository is cloned into a per-user temp directory (sanitized path).
The matching API reference document is copied into api/user-files/<USERID>/<api>_<branch>.<ext> and used for section extraction.

Configuration Primer¶

The main configuration is YAML-based. At the top level:

api:
  - name: ["myApi"]              # one or more API names (expanded as separate runs)
    library: "myLib"
    library_version: "__ref__"   # supports magic variables (branch/ref/version)
    repository:
      url: "https://github.com/org/repo.git"
      branch: "main"
      filename_pattern: "*.md"
      folder_pattern: "docs"
      hidden: false
      file_contains: []          # OR filter: include files containing any
      file_not_contains: []      # OR filter: exclude files containing any
    snippets:
      rules:
        - name: "API Sections"
          section:
            start:
              line_equal: "START"
              strip: true
            end:
              line_equal: "END"
              strip: true
          # Optional: split/filter/transform (explained below)
          documents: { rules: [...] }
          software_requirements: { rules: [...] }
          test_specifications: { rules: [...] }
          test_cases: { rules: [...] }
          justifications: { rules: [...] }

You can define nested rules (e.g., documents under documents) by supplying a sub-tree with rules.

Section Extraction (start/end)¶

Sections are identified in two steps:

start selects the starting point(s)
end trims each selected section to its end

Both blocks accept a rich set of matchers (use exactly one per block, or combine with closest):

line_starting_with / line_not_starting_with
line_ending_with / line_not_ending_with
line_contains / line_not_contains
line_equal / line_not_equal
line_regex (regular expression)
line (exact line index) and at (exact character index)

Common options:

strip (bool|string|list-of-chars): default False. When True, trims whitespace; or pass characters.
lstrip / rstrip: like strip but applied only to left/right.
case_sensitive: default False.
skip_top_items / skip_bottom_items: skip matched sections at the beginning/end of the result list.
first_only: if True, stop after first match in each scanned range.

Closest and Extend¶

closest can re-anchor a start range to the closest line (e.g., to the nearest header above).
extend allows adding lines up/down by a fixed count after a start has been identified.

Example:

section:
  start:
    line_contains: "Title:"
    closest:
      direction: "up"
      line_starting_with: "# "
  end:
    line_starting_with: "## "  # next header
    first_only: true

Filtering (after extraction; before transform)¶

Once start/end have produced sections, you can filter them:

section:
  start: { line_equal: "START", strip: true }
  end:   { line_equal: "END",   strip: true }
  filter:
    contains: ["keep this"]       # OR of substrings
    not_contains: ["exclude"]     # OR of substrings
    regex: ["\\bID-\\d+\\b"]      # OR of regexes
    case_sensitive: false         # optional, default false

Filtering runs before transform. Elements are kept if:

they match any of contains (if provided), and
they do not match any of not_contains (if provided), and
they match any of the regex (if provided).

Split (optional)¶

After end-trimming (and before filter/transform), sections can be split into multiple elements:

section:
  start: { line_equal: "START" }
  end:   { line_equal: "END" }
  split:
    delimiter: "\n---\n"
    strip: true
    keep_empty: false

Each split part becomes a new element with a recalculated index relative to the original text.

Transform (final text changes)¶

Transforms are applied last (after filtering). Supported operations:

uppercase / lowercase / camelcase / strip
suffix (requires value) / prefix (requires value)
replace (requires what and with)
regex_sub (what/pattern and with; flags via flags: "ims" or booleans: ignorecase, multiline, dotall)

Example:

section:
  start: { line_equal: "START", strip: true }
  end:   { line_equal: "END",   strip: true }
  filter:
    contains: ["keep"]
  transform:
    - { how: "regex_sub", what: "\\s+", with: " " }
    - { how: "uppercase" }

Work Item Rules¶

Each work item type has a field schema. The scanner will extract values from the reference document (or use constants) and then combine them to produce items. Highlights:

Snippets:
- SNIPPET_FIELDS: section (str), offset (int)
Documents:
- DOCUMENT_FIELDS: title, description, document_type, spdx_relation, url, coverage
- Supports nested documents (documents.rules)
Software Requirements:
- SOFTWARE_REQUIREMENT_FIELDS: title, description, coverage
- Supports nested software requirements, test specifications, test cases
Test Specifications:
- TEST_SPECIFICATION_FIELDS: title, preconditions, test_description, expected_behavior, coverage
Test Cases:
- TEST_CASE_FIELDS: title, description, repository (optional), relative_path (optional), coverage

For each rule, you can use constants (value) or extraction (start/end) per field. Lists must align in length; otherwise the rule is skipped.

Traceability Generation¶

TraceabilityGenerator walks the extracted traceability structure and:

Creates or reuses DB entities (APIs, Documents, Software Requirements, Test Specifications, Test Cases, Justifications).
Persists mappings and nested relations.
Commits per step for robustness.

It uses the logged-in user (--userid) as creator for all new entities.

Notes and Recommendations¶

For exact string comparisons on lines, prefer strip: true to normalize whitespace.
Use first_only to prevent multiple end matches within the same element.
skip_top_items/skip_bottom_items can be set independently for start and end.
Filtering is case-insensitive by default; set case_sensitive: true if needed.
For nested rules, remember each nested block is itself a full ruleset with its own repository and field configs.