encode scheme object in 64 bits

Encode scheme objects in a word(64bits) is fun and important for dynamic language! The idea is using 3 bits for tagging, 61 bits for data. For small types like integer, boolean, char, how to use that 61 bits is clear , but how to handle complex type like pair, closure? Let's distinguish them to two parts, small type and complex type.

0.1. Tag

But first, you need some tag utility

(define fixnum-shift 2)
(define fixnum-mask 3)

(define ptr-mask 7) ; mask for pointer type tag
(define ptr-mask-inv #xfffffff8) ; mask for pointer value

(define pair-tag 1)
(define vec-tag 2)
(define str-tag 3)
(define struct-tag 4)
(define void-tag 5)
(define closure-tag 6)

(define char-mask 255) ; character type mask
(define char-shift 8)
(define char-tag 7)

(define bool-mask 255)
(define bool-shift 8)
(define bool-tag 15)

0.2. Small type

Generate immediate representation for small type is not quite complex. Basically, it's (v << type-shift) | type-tag or type-tag.

(define (immediate-rep x)
  (cond
    [(integer? x) (bitwise-and (arithmetic-shift x fixnum-shift) #xffffffff)]
    [(char? x) (bitwise-ior (arithmetic-shift (char->integer x) char-shift) char-tag)]
    [(boolean? x)
     (if x
         (bitwise-ior (arithmetic-shift 1 bool-shift) bool-tag)
         bool-tag)]
    [(null? x) pair-tag]
    [(void? x) void-tag]))

0.3. Complex type

0.3.1. Pair

If the object is a pair(cons in Lisp), the idea is put it into heap with two words size.

car	cdr
64bits	64bits

Then encode it's pointer as pair pointer. Above desciprtion is the following function.

export fn _scheme_cons(car: i64, cdr: i64) callconv(.C) i64 {
    const p: [*]i64 = @intToPtr([*]i64, @ptrToInt(c.GC_malloc(2 * rep.WORDSIZE)));
    p[0] = car;
    p[1] = cdr;
    return @intCast(i64, @ptrToInt(p)) | rep.PAIR_TAG;
}

0.3.2. Vector

Vector is important structure for speed up access elements heavily and don't change size frequently usage. The idea is allocating enough size for elements(a chunk of memory).

size	data
64bits	64bits * size

Then encode it's pointer as vector pointer. Above desciprtion is the following function.

export fn _scheme_make_vector(length: i64, filled_by: i64) callconv(.C) i64 {
    const len: i64 = length >> rep.FIXNUM_SHIFT;
    const p: [*]i64 = @intToPtr([*]i64, @ptrToInt(c.GC_malloc(rep.WORDSIZE * (1 + @intCast(usize, len)))));
    p[0] = len;
    const p2: [*]i64 = p + 1;
    var i: usize = 0;
    while (i <= len) : (i += 1) {
        p2[i] = filled_by;
    }
    return @intCast(i64, @ptrToInt(p)) | rep.VEC_TAG;
}

0.3.3. String

String is simple, because it's a vector! But to optimize memory usage, you don't use encoded char objects, but char itself. Since char takes 8bits(u8) only, data takes 8 * size bits.

size	data
64bits	8bits * size

Thus, the following function shows.

export fn _scheme_make_string(length: i64, filled_by: i64) callconv(.C) i64 {
    const len: i64 = length >> rep.FIXNUM_SHIFT;
    const p: [*]i64 = @intToPtr([*]i64, @ptrToInt(c.GC_malloc(rep.WORDSIZE + @intCast(usize, len))));
    p[0] = len;
    var p2: [*]u8 = @ptrCast([*]u8, p + 1);
    var i: usize = 0;
    while (i <= len) : (i += 1) {
        p2[i] = @intCast(u8, filled_by >> rep.CHAR_SHIFT);
    }
    return @intCast(i64, @ptrToInt(p)) | rep.STR_TAG;
}

0.3.4. Closure

Closure is the most important runtime object for any dynamic funcitonal langauges, the idea is same as pair encoding, allocates two words for two objects then encodes pointer as closure pointer. And here encode environment as a vector, which is done from compiler side.

function pointer	environment pointer
64bits	64bits

Then you get the following function.

export fn _scheme_make_closure(fn_pointer: i64, env: i64) callconv(.C) i64 {
    const p: [*]i64 = @intToPtr([*]i64, @ptrToInt(c.GC_malloc(2 * rep.WORDSIZE)));
    p[0] = fn_pointer;
    p[1] = env;
    return @intCast(i64, @ptrToInt(p)) | rep.CLOSURE_TAG;
}

Well, if you look back to pair section, definitions are same internally! In fact, use pair for closure is ok! You only have to track closure use points to do proper conversion for them in compiler.

0.3.5. Struct

Struct is a chunk a memory with meaningful reference way, since every object in this implementation way is one word, a definition like the following takes 3 words.

(struct date (year month day))

It's simple to convert (date 1582 10 15) to the following pesudo program.

var date_instance = @intToPtr([*]i64, _scheme_make_struct(3));
date_instance[0] = encode_integer(1582);
date_instance[1] = encode_integer(10);
date_instance[2] = encode_integer(15);

Then you can figure out the definition of _scheme_make_struct.

export fn _scheme_make_struct(len: usize) callconv(.C) i64 {
    const p: [*]i64 = @intToPtr([*]i64, @ptrToInt(c.GC_malloc(rep.WORDSIZE * @intCast(usize, len))));
    return @intCast(i64, @ptrToInt(p)) | rep.STRUCT_TAG;
}

Conversion for all struct constructor calls should be done by compiler, since all access to struct fields is known, you have the following safe conversion.

(date-year (date 1582 10 15))

To pesudo program.

var date_instance = @intToPtr([*]i64, _scheme_make_struct(3));
// ...
_ = date_instance[0];

1. Conclusion

Hope this article shows a lot to help you have idea about how dynamic language handles runtime objects, there are still something missing in there, let me discuss them here.

61 bits for integer is not enough, so in scheme this concept called fixnum, any number exceed the size will be handled by a big number structure. Floating number will join this fun, some implementation also supports complex number.
closure needs closure conversion step to make them, that's also in my posting plan to explain it.
related project shows some known bugs about vector, welcome to solve them for me XD.

Now, take a break and have a nice tea time for yourself! See you in next article!