sauron development: new file indexing maintaining system

I want to make this record as a post to ensure I can remember these nice decisions in the future. Let's start with a brief history of the project. A few years ago, I think is 2020 July, I found Racket has no nice IDE for many reasons. Mainly, we only have a small community, though everyone is so productive(except me of course). Secondly, the semantics of your editing code is not easy to sure, because

Racket is about creating new programming languages quickly.

Thus, a Racket IDE actually is a Languages IDE, almost impossible. Luckily, this is not totally impossible for some reasons, which I would tell you later. In beginning, I start from a standalone executable and quickly found I can let it be a DrRacket(default editor for Racket) plugin, then people can just install sauron easier and start to enjoy it. Till here, I have already made a project manager, file tree-view, an improved REPL, and Git integration. These helped DrRacket become a more complete IDE.

Now, I already have an IDE, but relying on the DrRacket default jump to definition makes me think. This is the original reason and will push us far in the following story, the next section starts to explain what was the problem.

1. Jump to definition

The default behavior can only jump to the import statement, let's say the following:

(require xxx)

(from-xxx ...)

I can only go to the position of xxx. This is not a wanted behavior, I expect it would go to the definition place in another file rather than go to the import place in the current file in the case. But before we go deep, it's time to explain why ensuring semantic might not be a problem.

The most used language is racket/base, racket, racket/gui, typed/racket, they are just well supported by DrRacket's tools.
Reusing the module system and getting semantic support is possible, in fact, the collection turnstile provide these nicely, you can also expand your language to reuse built-in forms in racket. For example, if your language says var x = 1, expanding it to (define x 1) will get support from DrRacket.
If your language really cannot base on these, I have a future proposal to solve it, that's your language can talk to an API about how to jump to definition or anything people need. I will back to this point at guessing class and API.

Back to the problem, I found racket-mode already solve it, so I check their code(below).

(define/override (syncheck:add-arrow/name-dup/pxpy
                  _def-src def-beg def-end _def-px _def-py
                  _use-src use-beg use-end _use-px _use-py
                  _actual? _level require-arrow? _name-dup?)
  (when (and (valid-beg/end? def-beg def-end)
             (valid-beg/end? use-beg use-end))
    ;; Consolidate the add-arrow/name-dup items into a hash table
    ;; with one item per definition. The key is the definition position.
    ;; The value is the set of its uses' positions.
    (hash-update! ht-defs/uses
                  (list (substring code-str def-beg def-end)
                        (match require-arrow?
                          ['module-lang 'module-lang]
                          [#t           'import]
                          [#f           'local])
                        (add1 def-beg)
                        (add1 def-end))
                  (λ (v) (set-add v (list (add1 use-beg)
                                          (add1 use-end))))
                  (set))
    (unless require-arrow?
      (send this syncheck:add-mouse-over-status "" use-beg use-end "defined locally"))))

(define/override (syncheck:add-jump-to-definition _src beg end id-sym path submods)
  ;; - drracket/check-syntax only reports the file, not the
  ;;   position within. We can find that using our def-in-file.
  ;;
  ;; - drracket/check-syntax uses identifier-binding which can't
  ;;   follow some contracting and renaming provides. As a result,
  ;;   the value of the id here can be wrong. For example it will
  ;;   report "provide/contract-id-make-traversal.1" for
  ;;   "make-traversal". When the reported id differs from that in
  ;;   the source text, we report both to try with def-in-file.
  ;;
  ;; However, calling def-in-file here/now for all jumps would be
  ;; quite slow. Futhermore, a user might not actually use the
  ;; jumps -- maybe not any, probably not most, certainly not all.
  ;;
  ;; Sound like a job for a thunk, e.g. racket/promise
  ;; delay/force? We can't marshal a promise between Racket back
  ;; end and Emacs front end. We can do the moral equivalent:
  ;; Simply return the info that the front end should give to the
  ;; "def/drr" command if/as/when needed.
  (when (and (valid-beg/end? beg end) (file-exists? path))
    (define src-str (substring code-str beg end))
    (define drracket-id-str (symbol->string id-sym))
    (interval-map-set! im-jumps beg end
                       (list (path->string path)
                             submods
                             (if (equal? drracket-id-str src-str)
                                 (list drracket-id-str)
                                 (list drracket-id-str src-str))))))

After researching, I found two points

syncheck-annotations<%> can build some helpful information as IDE, this is how DrRacket points out errors and provides many hints.
based on it building more is possible, racket-mode shows this.

So I create another one in sauron:

(define/override (syncheck:add-arrow/name-dup start-src-obj start-left start-right
                                              end-src-obj end-left end-right
                                              actual?
                                              level
                                              require-arrow?
                                              name-dup?)
  (define id (string->symbol (send text get-text end-left end-right)))
  (unless require-arrow?
    (interval-map-set! bindings
                       end-left
                       (add1 end-right)
                       (binding id start-left start-right #f))))

(define/override (syncheck:add-jump-to-definition source-obj start end id filename submods)
  (interval-map-set! bindings start (add1 end) (binding id #f #f filename)))

It's easy to understand the query flow, for a file, I build an annotation result(also called indexing in many IDE, or record in sauron) just like above. When the editor is asking the definition position of a range or a position, it would get something from the annotation result, due to the code it could be:

(binding id start-left start-right #f) when this is a local variable, collected via syncheck:add-arrow/name-dup
(binding id #f #f filename) when the identifier is an external one, collected via syncheck:add-jump-to-definition

The local one is boring, just move the cursor to the target range(what DrRacket already does), the external one is the point. Then I have to collect top-level definitions for each file via the below code.

(define/override (syncheck:add-definition-target source-obj start end id mods)
  (log:debug "syncheck:add-definition-target ~a:~a" source-obj id)
  (hash-set! defs id (binding id start end src)))

When an editor just finds it gets (binding id #f #f filename), it will stop and prepare to open another tab for the file, and then ask another record to get the definition of =id=(what above just collected)! The whole flow completes correctly just like racket-mode. Later, I also port this to racket-langserver, this makes all mainstream development tools share the same ability!

DrRacket: sauron
VSCode, Vim: racket-langserver
Emacs: racket-mode

Is time to push our boundaries.

2. Stuck

Quickly, I ambitious started to do the reverse, jump to usages from a definition. Unfortunately, this is inefficient in the end because of my implementation. I know I must create something differently from fundamental, the record management needs a new approach. Since per file per record, I quickly decide to use the model per file per thread. The idea was made last year, however, I cannot push it forward till recently.

3. Detect problems

I dig into the implementation, this section briefly shows what I know from this process.

I have a big hashmap for mapping from a file to its record. Thus, although each file gets a thread as an updater, the updater will get blocked on writing the hashmap since the hashmap is thread-safe. Hence, the read-write lock would also block the editor randomly even if the editor just wants to query something.
When any file changes, I create updaters for every program file in the project, this is inefficient.

As corresponding, the following are solutions.

Rather than store the data in hashmap, store them in the thread! Then the hashmap became from file to the thread, the thread called maintainer now! Now, all querying at least not blocking by any lock, only scheduling problem.
With the record maintainer, any changes became sending messages to the corresponding maintainer of a file via thread-send. Simplify the abstraction a lot.

Maintainer HashMap

        │
        │
        │ create
        │                         ┌────────────────┐
        │                         │     Editor     │
  ┌─────▼──────┐                  │                │
  │ maintainer ◄──────────────────┤  current-file  │
  └────────────┘                  │                │
                  send message    └────────────────┘

These all happened in commit 0b1714e45, and amend in commit 930db7f92.

4. Improvement result

The result is impressive.

opening a project only takes at most 2 seconds.
UI gets no blocking
The memory usage suppresses to less than 0.5 times at peak. Notice this is not an accurate measurement due to I didn't measure in the same version, I remember the peak is over 1000MB in Racket 8.2, but now is around 500MB in Racket 8.5.
Collect records and get prepared to respond is faster now

5. Further

Finally, I have already reported the current situation of sauron, and am ready to show some future.

5.1. Jump to definition(Again)

With the maintainer system, I can reduce the blocking time now! Below is the current code for jump to definition.

(define (jump-to-definition editor from-pos)
  (define filepath (send editor get-filename))
  (match (jump-to-def filepath from-pos)
    [(binding id #f #f path)
     (define frame (send (send editor get-tab) get-frame))
     (prepare-editor-for frame path)
     (match (get-def path id)
       [(struct* binding ([start start] [end end]))
        (send (send frame get-editor) set-position start end)])]
    [(struct* binding ([start start] [end end]))
     (send editor set-position start end)]
    [_ (log:info "cannot jump to definition from ~a:~a" filepath from-pos)]))

(define (prepare-editor-for frame path)
  (define tab-of-path-<?> (send frame find-matching-tab path))
  (if tab-of-path-<?>
      ; when we already have a tab for the path, switch to it
      (send frame change-to-tab tab-of-path-<?>)
      ; when we don't have a tab for the path, open one
      (send frame open-in-new-tab path)))

The prepare-editor-for needs time to complete, thus, for non-blocking version maybe send a message like

(define (jump-to-definition editor from-pos)
  (thread-send (get-maintainer (send editor get-filename))
               (list 'jump-to-definition (current-thread) editor from-pos)))

;;; maintainer side
(match (thread-receive)
  [(list 'jump-to-definition from editor from-pos)
   (match-define (struct* record ([bindings bindings])) cached-record)
   (define binding (interval-map-ref bindings from-pos #f))
   ; same code ...
   ])

5.1.1. Another solution(blocking version)

If blocking is desired behavior, the maintainer system still can help to simplify the code logic.

(define (jump-to-definition editor from-pos)
  (define filepath (send editor get-filename))
  (thread-send (get-maintainer filepath)
               (list 'jump-to-definition (current-thread) editor from-pos)))
  (define frame (send (send editor get-tab) get-frame))
  (match (thread-receive)
    [(list 'open path) (prepare-editor-for frame path)]
    [else (void)])
  (match (thread-receive)
    [bind (send (send frame get-editor) set-position (binding-start bind) (binding-end bind))]
    [else (log:info "cannot jump to definition from ~a:~a" filepath from-pos)]))

;;; maintainer side
(match (thread-receive)
  [(list 'jump-to-definition from editor from-pos)
   (match-define (struct* record ([bindings bindings])) cached-record)
   (define bind (interval-map-ref bindings from-pos #f))
   (if (binding-external? bind)
       (begin (thread-send from 'nothing-to-do)
              (thread-send from bind))
       (begin (thread-send from (list 'open (binding-external? bind))
              ; get definition from another maintainer then return that position
              )))])

The point is, whatever we want, the model already gets much easier.

5.2. Jump to usage

This is the next big thing and the original purpose. With maintainers, I can add one more local state for each maintainer called user list, this is initial from every time the maintainer just updated their record.

                      maintainer1                             maintainer2
                           │                                       │
                           │                                       │
  update record from path  │                                       │
                           │                                       │
                           │                                       │
       find import usages  │      send message: I'm using you      │
                           ├──────────────────────────────────────►│
                           │                                       │ add maintainer1 into the user list
                           │                                       │
                           │                                       │                                    ┌──────┐
                           │                                       │◄───────────────────────────────────┤Editor│
                           │                                       │        who are using xxx?          └──────┘
                           │                                       │                                          ▲
                           │                                       │ send to all maintainers in the user list │
                           │◄──────────────────────────────────────┤ say: where are the usages of xxx?        │
check if still are using?  │                                       │                                          │
                           │                                       │                                          │
                      Yes  ├──────────────────────────────────────►│                                          │
                           │   send message: (list use1 use2 ...)  │ Good, wait all maintainers and collect   │
                           │                                       ├──────────────────────────────────────────┘
                           │                                       │   Send collected result back to editor
                           │                                       │
                           │                                       │
                           │                                       │
                       No  ├──────────────────────────────────────►│ Oops, remove the maintainer from user list
                           │                                       │
                           │                                       │
                           ▼                                       ▼

Can demonstrate the flow like above is so nice, and this is a good example attest that how the maintainer system abridged our brains work.

5.3. Guessing class and API

Racket class didn't get the same support as normal define things, and this is because of some first-class and macro problems.

The first possible solution is using what macrology pointed out many syntax-property that is used by syncheck-annotations<%>, at below.

disappeared-use
disappeared-binding
sub-range-binders
origin
original-for-check-syntax

The API I'm talking about is exactly these, the point is to imagine if we improve the class macro to help references point to the correct location of the method definition. Another approach is if we track via finding (send xxx) or (class xxx) and maintain them in the maintainer system. I would try both in the future.

5.4. tree sitter

This is a possible solution for the class problem in Racket, but this didn't get very much interest.

5.5. GUI

I would try to dig into MacOS or Linux underlying for GUI improvement due to the response experience is not good enough right now.

6. Conslusion

I show you how sauron came to today, and what's the future. I want to invite you to explore more problems in your daily development and tell in the community, or even contribute your code. Can't wait to see them! Have a nice day~~~