A new grammar for C# part 5

The term “static”

C# has a keyword “static” as does Java, C++ and C. The term seems to have originated in the PL/I language and was later adopted by BCPL and various derivatives. Back at the time these languages were developed the term referred exclusively to permanently allocated memory as opposed to heap or stack allocated memory.

Memory – variables and so on, that was static existed throughout the entire lifetime of an application, the address of a static item remained the same throughout the application’s lifetime. This use of the term to refer exclusively to a kind of memory is how the new grammar will use it.

In C# the term (like on some other related languages) also refers to non-instance methods and in this case refer to uniqueness, static classes for example refers more to the uniqueness of the class instance then the kind of memory associated with it and static methods likewise refer to methods not associated with an object instance.

A method is not really memory in fact all executable code resides in read-only code pages in the processes address space if you think about it, so the term when used on methods refers primarily to the “uniqeuness” of the method with respect to instance methods (which have state associated with them through a “this” pointer).

This new grammar has no “static class” instead we have a dedicated keyword “singlet” to refer to what C# calls a “static class”, thus “singlet” means “exists only once” rather than exists permanently, two related but distinct concepts.

For these reasons methods and functions that are not instance members of some class are declared “singlet” and the term “static” is reserved for variables (fields) and things that “occupy” memory (writeable memory unlike code pages).

This might seem to some like an exercise in pedantry but it is really aimed at distinguishing between the two concepts of “always exists” and “exists only once”.

A new grammar for C# part 4

Lets write some code

Since my last post I have stepped back from writing about these grammar ideas and decided to put some code together, by writing a basic lexical analyzer and then a basic parser I can experiment more and evaluate different ideas.

All of this is written in C# and is available in a public GitHub repository novus (I decided to give this exercise a name).

Writing the lexical analyzer wasn’t too much effort, I’ve written these before and in languages less powerful than C#, some design decisions are questionable but this is really a technical exercise so I rely on periodic refactoring to improve the design rather that striving for design excellence when I start.

The lexical analyzer can be supplied with either raw text or a file pathname, either way it does the same thing with the data.

For overall simplicity we have a SourceFile class that represents the source code, that simply consumes the text and creates a list of Character objects which contains the actual character plus some auxilliary details like line number and so on.

The analyzer supports some helpful features too:

  • It’s mapping table (used by the internal FSM) is populated from a text file making it easy to add token definitions.
  • Peek token – lets us see a token without consuming it.
  • Peek next ‘n’ tokens.
  • Skip forward to the next token of a specified type.
  • Get (consume) the next token.
  • Push a token back so that it will be the next one seen next time we do a Peek or a Get.
  • Push a set of tokens back

Each of these features is helpful under various circumstances as we parse the token stream, for example we may be parsing some construct and encounter a { symbol, we might expect to see that too but to avoid handling the token there and then we can “push” it back and then call a dedicated parse method that will see the { and more naturally parse the block delimited by the { and then (presumed) }.

Initially the set of tokens defined the definition file is a subset of what we see in C#, this grammar after all is based largely on C# and deviates from C# only in areas that I think are improvements, much of C#’s syntax can remain unchanged, for example the syntax of generic argument lists in C# is very good as it is.

The token definition file is actually embedded as a resource in the parser project which means we can freely edit it as a file yet there is no separate file present at runtime which reduces risks as we experiment.

The states and token names used in the file actually correspond to some enums in the source code which makes it easy to import the text.

The token definition file

Each line in the file contains the following information:

  • The current state that the scanner is in
  • the character (or character class) that was just read from the source.
  • An action to perform.
  • A new state for the scanner to enter.
  • A name for the token at that point in the scan.

The order of lines in the file is completely irrelevant. From the file you can see that in the INITIAL state (this is the state the scanner is in when it is started and after any token has been recognized) if it sees a / it appends that (to a text buffer) and goes to the new state SLASH. Then in that state if it sees a * it appends that too and enters the SLASH_STAR state. In that state if it reads any character – except a * – it remains in the SLASH_STAR state and keeps appending the characters to the buffer.

The only way out of this state is to see a * and then a / at which point it appends that / returns the token (a block comment) and sets the state once again to INITIAL ready to begin recognizing another token.

The actions are few in number and are represented by an enum these are explained by comments in the enum. As you can see the scanner is to all intents and purposes a finite state machine.

A Self Modifying Scanner

This scanner has an unusual property, the ability to modify its state machine as it runs. I’ve never seen this in a lexical analyzer before (that doesn’t mean it has not been done before of course) and it arose purely from an experimental idea about a problem with literal strings in a language, namely how to represent literal strings in source when we need to embed string delimiter characters inside that literal string.

One way to solve this problem is to be able to dynamically change the delimiters used for a string during the lexical analysis of the source file, so for this there is a preprocessor directive #add_delimiter

When the scanner has recognized a preprocessor directive it examines it and if it is add_delimiter or remove_delimitter its does not return that to the caller (the parser) but instead updates its token table and adds the entries neccesary to allow the scanner to recognize string literals with the new delimiter.

As you can see the first add_delimiter enables the source code to use ^^^ as the literal string delimiter, from the instant the scanner has consumed the add_delimter directive, that character sequence becomes a valid string literal delimiter. Once the remove directive is processed the characters ^^^ are no longer recognized as delimiters and can be embedded in other strings as is shown the second example above.

The above code is part of a unit test too.

The standard delimiter ” is always recognized and remains in effect at all times, there are other aspects to this too that I’ll discuss in a future post.

A new grammar for C# part 3

Real world examples

Lets now take a look at what are real examples of the kind of things we want to do, adding new keywords. In the context of this blog, a “statement keyword” or simply “keyword” is a word that appears at the start of a statement, that’s why it’s so named, it is a “key” that allows the parser to “understand” the statement, the tokens that follow the keyword.

There are other words that are like keywords but they do not appear at the start of a statement, in C# for example “static” is such a term, in this blog such terms are called “option keywords”.

Let’s summarize this

  • Statement Keyword – or simply “keyword” a term that appears as the first token in a statement.
  • Option Keywords – terms with a special meaning within a statement and express desired options.
  • Reserved Word – names, that is text that meets the criteria of an identifier but cannot be so used.

In C# the keywords “while” or “switch” or “goto” meet this definition of keywords and the words “case” or “static” or “abstract” or “do” are option keywords (not necessarily optional keywords though).

Adding a new (statement) keyword is a challenge, consider adding the new keyword “until” to the C# language.

Because method invocation statements begin with a user-defined identifier, we have a problem, it is legal to create a method in C# named “until” and that could accept a boolean expression. Adding a new “until” keyword immediately raises this problem, grammatically (that is using token sequences alone) these are not distinguishable:

Method Call

until (a > b); // undesirable to some yet perfectly legal.

Until Loop

until (a > b); // not much of a loop, but perfectly legal.

There are several ways we can deal with this, but they all involve some baggage. For example we could insist in the grammar that an “until” is always followed by a block statement, it cannot have an empty statement:

until (a > b)
{

}

That might work but a “while” loop has no such restriction, so we would introduce more asymmetry to the language.

As the number of asymmetries within a language increase the overall flexibility of the language – its ability to be extended – diminishes, each time we add such a feature we add some additional special case and over time that could lead to a situation where changes are very hard, adding one thing might require changing something else and that change might lead to a backward compatibility problem.

We want to devise a grammar in which it is almost trivial to add a new keyword like “until” if we had such a grammar our language could grow and grow yet never encounter backward compatibility problems.

This is why we require every statement – even method invocations – to begin with a keyword unless the statement is an assignment where there’s no need to use a keyword.

A new grammar for C# part 2

Taking Stock

We’ve covered quite a few deatils about the things we want to see changed as we move from a grammar rooted in C. It’s time to draft out some basics of how such a grammar might look. In this initial exploration we are ignoring generics and interfaces.

We’ll use BNF as the means of documenting this.

<compilation-unit> ::= <source-file> { <source-file> }

<source-file> ::= <using> { <using> } <type-def> { <typedef> }

<type-def> ::=

type <identifier> <type-kind> [ <type-options> ] “{” <type-members> “}”

<type-kind> ::= class | struct | record | singlet

<type-options> ::= [ <access-type> ] [ sealed ] [abstract] [etc]

This shows how all classes, struct etc. are all declared with the type keyword followed by the name of the type and then various optional pieces that are not too dissimilar to C#.

<type-members> ::= <type-member> [ <type-member> ]

<type-member> ::= <field-def> | <prop-def> | <method-def> | <function-def>

<field-def> ::= def <identifier> <typename> [<field-options>]

<prop-def> ::=

<method-def> ::= def <identifier> <paramlist> [singlet] <method-body>

<paramlist> ::= ( <paramtype-spec> [, <paramtype-spec> ] )

<paramtype-spec> ::= <identifier> <typename> [ ref | out ] | [ ptr | pointer]

<function-def> ::= def <identifier> [ <paramlist> ] <return-type> <method-body>

<return-type> ::= ( <typename> | [ ptr | pointer] )

<method-body> ::= “{” { <statement> } “}}

<function-body> ::= “{” { <statement> } “}”

<statement> ::= <assignment> | <keyword-statement>

<assignment> ::= <target> = <expression> “;”

<keyword-statement> ::= <call-statement> | <return-statement> | <goto-statement> | <if-statement> /* plus others */

<call-statement> ::= call <identifier> [ <call-arglist> ] “;”

<call-arglist> ::= “(” <argument> [ “,” <argument> ] “)”

This is a decent first stab at outlining the grammar, we can see here that keywords need not be reserved so long as we can always distinguish between an assignment and a keyword statement.

Note also how method invocations differs from what has become common with C derivatives:

call MyNoArgsMethod;

Methods that take no arguments do not require empty parentheses, they are redundant. We can always tell a method call because it always begins with call. Of course if we support optional arguments then perhaps we would require the parentheses here when no args are supplied as a visual aid that the method is not truly a zero argument method.

A new grammar for C# part 1

Background

This article arose from discussions about the limitations of C#’s grammar, most notably how extending the language and adding new keywords can lead to backward compatibility considerations that ultimately introduce a non-uniformity that can become a burden for developers.

C# has both ‘keywords’ (which in C# also means reserved) and ‘contextual keywords’ (which in C# means a keyword that’s not reserved). There are mixed feelings about whether and to what degree contextual keywords complicate the use of the language too.

The grammar used for C# is derived from Java, C++ and C and so carries many of the grammatical traits found in these languages and some of these though heavily entrenched can lead to challenges when implementing new language features.

Of course, we cannot have a “new grammar” for C# because it would cease to be C# but we can explore alternative grammars that deliver the same features as C# and that’s what this article explores.

Many years ago I developed a compiler for a now somewhat antiquated programming language, it wasn’t a bad language but it wasn’t perfect either like all languages it had its strengths and weaknesses. When I began development of the compiler I’d already been using the language professionally for some ten years or more so knew it very well already, this made it a bit easier to understand the ANSI standard document that defined its grammar.

The language had one feature more than any other, that made a deep impression on me and that was it had no reserved words.

Yes you could if you really wanted to, create functions named “if” and have labels named “goto” or variables called “return” all of this was possible. I didn’t really think about this much when I was a trainee programmer (as we were then called) I kind of took it for granted that “serious” languages had these kinds of capabilities and never really cared much about it.

Goals

These are the goals of this exercise:

  1. A complete absence of reserved words.
  2. A minimal deviation from current appearance.
  3. Remove inconsistences by establishing new ways of representing things.
  4. Explore options for building in the ability to add new keywords (terms that appear at the start of a statement)
  5. Explore options for building in the ability to add new language “functions” like nameof, sizeof etc
  6. Explore options for building in the ability to add new symbolic operators akin to +, -, *, -> and so on.
  7. Explore options for building in the ability to add new named operators akin to “is”, “and”, “await” and so on.

The idea here is not to change the grammar, the syntax just because we want a different one, but to eliminate reserved words and inconsistencies thus allow an unlimited potential to add new keywords as the language is enhanced over time. The grammar would thus completely eliminate the backward compatibility problem.

The terms ‘reserved word’ and ‘keyword’ are used here based on the traditional definitions found in the literature and summarized here.

Motivation

The motivation is an intellectual one, a desire to experiment and see what can emerge and if anything can be learned. I’ve designed and implemented compilers for languages that have no reserved words and I enjoy writing in C# a great deal too so developing a translator as a proof of concept could be a fascinating undertaking.

Current Shortcomings

It’s helpful to itemize the various shortcomings of the C# grammar as it stands today. Some of these are unique to C# and others are attributable to C or C++.

Reserved Words

Most programming languages but not all, have a concept of restricting the names we can use for things so as not to clash with keywords in the language like “if” and “goto” and “return”. C has this and it remains as a characteristic of all of its derivatives. Although reserving words and relying on keyword matching during parsing offers an obvious simplicity it is not actually necessary.

By designing a grammar that eliminates reserved words we can potentially make it easier to add new keywords to the language over time, no matter what new keyword we introduce there’s no possibility of a collision with a user defined name, backward compatibility is guaranteed.

This is legal:

public class record { }

yet this is not:

public record class { }

this is asymmetric and again is not because this was a desired language feature but because of the pragmatics of implementing changes to the language.

Void

The keyword “void” makes a function declaration “look like” any other function declaration, it serves to explicitly indicate “nothing” as if that were a type. It serves no useful purpose and blurs the distinction between methods and functions. Methods simply do not return results, they have no “value” not even “void” and the differences between functions and methods should ideally not be blurred, they are distinct concepts and because of their semantic difference rightfully warrant a syntactic difference.

Method Invocation

The blurring between methods and functions is evident also when it comes to how they are invoked. A method (that returns no result) cannot be used within an expression yet a function (that does return a result) can be explicitly invoked without regard to the returned result very much looking as if it were a method.

Static Classes

A “static class” is not actually a class at all, it has none of the defining characteristics of an OO class. It is a different kind of thing altogether and has more in common with a namespace than a class. Representing this as a distinct “type” or concept with its own type name would be an improvement. We can’t define a static struct or static record and the association with “class” is misleading.

Symbolic Operators

C# (like most other derivatives from C) has a rather large set of operators and some of these are in fact a mix of several underlying operators. For example postfix ++ represents a compound operation (addition and assignment) that allows it to be used inside or outside of expressions. Operators that behave as values and at the same time as assignments, blurring the distinction, are undesirable.

We can pass an assignment as an argument too, this is legal, well defined, yet potentially confusing.

Keyword Operators

C# includes a number of things described as “operators” but are named and have a function-like invocation syntax, but not in every case. The keywords “true” and “false” are also curiously described as being operators, this is to enable them to be overloaded yet they are also just values like 5 or “telescope” and the value of overloading them is questionable but if desired there might be a better way to define the concept.

The “default” operator too is odd it can be invoked with and without an argument (well, lexically it appears that way) whereas a user written (non void) function cannot, the parentheses are always required.

Also some operators names are reserved and others not, for example its perfectly legal to define a function named “nameof” that returns a string but we cannot define a function named “sizeof”, the reasons for this asymmetry are likely due to the pragmatics of implementation and not by design.

Possible Solutions

In no particular order these are some of the ideas that we could use to address the concerns listed above.

Eliminate reserved words

We can design the language and its grammar so that no words are reserved. This leaves us free to introduce new keywords in the future and never have a backward compatibility issue.

Methods and Functions

Increase the syntactic distinction between methods and functions by eliminating the term “void” and insisting that the only way to invoke a method is to use a “call” keyword and the only ways to invoke a function is as part of expression evaluation and possibly also allow a function to also be invoked with “call” if we want to ignore the returned value.

Singlet

The idea with singlet is to introduce another term for “static class” so that we avoid the use of “class” altogether, there will thus be four kinds of types: class, struct, record and singlet.

Remove some symbolic operators

Operators that behave like assignments and expressions will be removed. The only way code can then alter memory will be through the use of the assignment statement.

Introduce builtin functions

These are functions that are an inherent part of the language, like “nameof” and “sizeof” and “default”. We will stop regarding these as operators and will restrict operator names to symbols like > and << and so on. To address the problem of name collisions with such function we will define a new syntax for their invocation.

To invoke a builtin function like “nameof” we would code that as

result = nameof[result];

This will be the only permissible use of square brackets. Indexing arrays and indexers and pointers can all use the standard ( ) parentheses without any problem if we define the grammar appropriately.

Builtin functions can be defined as being part of a system namespace. A system namespace is a namespace who’s name is illegal as a user defined identifier. For example builtin functions could be defined like this (this is pseudo code because we may alter or even abandon the C# syntax for defining namespaces and the way that access modifiers are managed):

namespace @language.@builtin
{
   public singlet functions
   {

      nameof[arg object] object
      {

      }

   }
}

Any method or function in the @language namespace can only be invoked by its unqualified name, attempting this will fail:

result = @language.@builtin.nameof[something];

This idea of special namespaces is draft and may not be needed, the use of [ ] is sufficient to indicate that we are calling a builtin function. Thus builtin functions can be added with ease because a collision with a user defined function is impossible.

Naturally some kind of info message could be emitted if user code does contain a function definition with the same name as a builtin function just in case the user did not intend it and possibly emit messages too if there are unqualified calls to user functions that have the same name as a builtin:

// Info message generated

result = nameof(something); // did user really mean [ ] ?

result = support.nameof(something); // NP this is qualified so unlikely to be a problem.

Another option here is to simply not even bother with reserved namespace names and just disallow user code to define function that use [ ] in their arg definition.

This syntax enables a parser to easily recognize a builtin function, it is simply an identifier followed by [<stuff>] where <stuff> is just an expression commalist.

If we do this it is almost trivial to add further builtin functions over time and never need to reserve their names.

Equality and Assignment

We can dispense with the “==” operator too, there is no compelling reason to have these two tokens when the meaning of a single “=” can easily be determined from the context. It must be either part of an assignment statement or must be part of an expression.

Multiple assignments too could be removed, this again clouds the distinction between assignment and expression evaluation.

Pointers

In C and its derivatives, the symbol “&” denotes the operation “get the address of…” whereas the symbol “*” denotes the operation “the contents of…”. We can replace these with new builtin functions that have names and free up these characters once more. In the notation we’re considering for builtins we’d write this kind of thing:

var ptr = addr[data];
var data = cont[ptr];

This simplifies our use of symbols and symbolic operators because we no longer have to be cognizant of the fact that “*” can mean multiply or pointer dereference or pointer type declaration.

To declare a pointer we have several options but there is a whole area to be explored here when it comes to declarations of types and values.

But for declaring a pointer these spring to mind:

def date_time_ptr DateTime pointer;
def date_time_ptr pointer (DateTime);
def date_time_ptr pointer <DateTime>;

But this is a bigger area, how we declare names and their attributes is another area altogether.

Abbreviations

Numerous terms could support abbreviated names, for example:

def define
ptr pointer
addr address
cont content
pub public
int internal // don't worry about this one, we're eliminating reserved words!

Abbreviations can be helpful because they can be used for more terse code that requires just some familiarity with the language or more verbose where the clarity and readability of some code is particularly important perhaps in code that performs a critical role in particular industries.

Keywords

A statement that is not an assignment statement will be regarded as a keyword statement. The keyword has its own syntactic form and might be simple or complex. A keyword might be followed by several additional terms called “options” and although these two are words they are not keywords as they cannot be used as the start token of a statement.

Some of the keywords we can expect to see in the final grammar are:

  • call keyword
  • if
  • switch
  • throw
  • catch
  • continue
  • break
  • return

Keyword Options

  • public, private etc – options of type and method declaration
  • enum, class, struct etc – options of type declarations
  • virtual – option for method declarations

The idea here is to differentiate the keyword that drives how the statement is parsed from additional terms that are part of the statement but not keywords themselves, that is to say all statements that are not assignment statements begin with a keyword, the keyword is simply that first term.

The ‘static’ keyword.

C# like Java, C++ and C use the convention that ‘static’ means both a kind of storage and a kind of method. Historically ‘static’ was one of three kinds of memory – stack, heap and storage that was in a fixed location, not created or released during code execution.

Using the same term for non-instance methods in an OO language is unattractive because it implies some relationship to its use as a “kind of ” memory.

Although its true that static methods cannot refer to instance fields/properties (unless these are in some way passed in) that is static methods can only refer to static members, that does not mean that its helpful to refer to the method itself as static. All code is in fact static because it is situated in memory managed by the system, actually in virtual memory pages that are not writable, executable code is never present on a stack or heap.

So in this new grammar “static” methods are labelled as “single” methods, there is conceptually only one of them unlike instance methods.

Thus we have a type called “singlet” can exist only once and methods/functions of a type called “single” that share that trait of being one-off, not related to instance data.

Namespaces

There may be value in allowing types have more than one qualified name. Consider a reflection based design where we use namespaces to define classes used in some version of some product:

namespace Datasets.v6_0
{
   public class StudentInfo
   {
      // v 6 specific details
   }
}

Here we have a class that for v6 is named: Dataset.v6_0.StudentInfo. Code, at runtime can get the “version 6” version of StudentInfo easily because we’ve defined a convention using namespace names, so for version 7 we’d have – likely in the same source file:

namespace Datasets.v6_0
{
   public class StudentInfo
   {
      // v 6 specific details
   }
}

namespace Datasets.v7_0
{
   public class StudentInfo
   {
      // v 7 specific details
   }
}

So, at runtime the system can find the right class type for some version. But what if this particular class was not changed in version 7? We’d need to duplicate the class definition. It might be far better to be able to do the equivalent of this:

namespace Datasets.v6_0
namespace Datasets.v7_0
{
   public class StudentInfo
   {
      // v 6 specific details
   }
}

If this were possible the reflection code could look for version 7 types and find them, but could also find the version 6 types too and these are the same type.

This is a way of saying that the version 7 of some type is the same as the version 6 type and therefore the type is defined just once.

We could also consider allowing namespaces components to be numeric, for example:

namespace Releases.1.2
{

}

namespace Releases.2.4
{


}

Not clear yet on the implications of allowing this, it basically amounts to allowing identifiers to begin with a digit…

The new C# range feature

The new C# range feature is a helpful addition to the language, however it appears to be a good example of where the committee approach to managing language design and improvements may have stifled common sense.

In this post I’m going to explain why I have formed this opinion.

Here’s a snippet from the current Microsoft documentation:

“A range specifies the start and end of a range. Ranges are exclusive, meaning the end is not included in the range.”

This already begins to reveal the potential for misunderstanding, the paragraph says that ranges are exclusive but then has to qualify that (because it’s not true) with the proviso that actually only the end is the part that exclusive.

So right away we have to regard the start value of a range as being fundamentally different from the end value of a range, they are different concepts. One identifies an element offset from the start of the array and the other identifies an element offset – 1 from the start of the array. (By offset I mean offset from address).

That is to say that ^0 (using the array in the examples below) is the offset of a [non-existent element] – 1 (* size of element) from the array’s base address.

Imagine we are discussing something tangible like a range of houses in a street that has 10 houses, we might say things like “I want 4 houses starting at the 5th house” this is a natural way to express the concept, but in C# using ranges we’d express this as “I want the range 4..8”

The new range feature offers no support for the number of elements, only the index of the first and last (+ 1 don’t forget) elements from the start of the array. There is also support for specifying an index from the end (again, + 1) of an array, this is indicated by using the operator

^

So, since we know we have 10 houses we could express our range as 4…^3, but there is no support for expressing how many contiguous elements we want which would have been a very helpful feature to include.

In my experience small misunderstandings about offsets, positions, counts, start elements and end elements are frequently a source of subtle and frustrating bugs, so surely making it more intuitive to express these ideas is what a language should strive to do?

Examples

Here are some examples.

            var houses = new string[]
             {
                "House 1", // 0
                "House 2", // 1
                "House 3", // 2
                "House 4", // 3
                "House 5", // 4
                "House 6", // 5
                "House 7", // 6
                "House 8", // 7
                "House 9", // 8
                "House 10" // 9
             };

            // (remember the second value in a .. expression is always EXCLUSIVE)
            var set1 = houses[0..10]; // all elements
            var set2 = houses[0..^0]; // all elements 

            var last1 = houses[9];  // House 10
            var last2 = houses[^1]; // House 10 

            

 

Notice how the representation of “the last element” differs in the case of a range and in the case of a single subscript? In the case of a single subscript it is 9 or ^1 but in a range it is represented by 10 or ^0.

This is because the second value in a range is ALWAYS decremented (or is that incremented?) at runtime in order to derive the actual index of the element.

Missing the obvious

We can see from the above examples that the following basic concepts are present in these problems:

  1. Inclusivity and Exclusivity
  2. Directionality
  3. Contiguous element count

Inclusivity

The inclusivity/exclusivity is “hard wired” into the language now, it is fixed that the first value in a range is always regarded as inclusive and the last value in a range is always regarded as exclusive. A developer has no means of explicitly expressing this.

Directionality

The “direction” of a value in a range is expressed by either including or omitting the ^ operator as a prefix. The absence of the operator means “from the start” and the presence of the operator means “from the end”, this isn’t too bad but could have been better – by providing two tokens that mean “from the start” and “from the end” we’d end up with more symmetric looking range expression and these are easier to reason about.

Element Count

Given that many examples discussed on the internet about the new range feature often contrast it with LINQ it is striking that this basic concept has been left out of range expressions.

How it could have been implemented

So after all my complaining is there really any significantly better way this could have been designed? I think so, by focusing on the three concepts above and emphasizing symmetry (or at least providing the ability to express things symmetrically) a much more intuitive and natural syntax could have been devised.

Inclusivity and Exclusivity

Rather than forcing this and making it differ in the first and last part of a range, a small set of tokens could have been introduced.

        static void Main(string[] args)
        {
            var houses = new string[]
             {
                "House 1", // 0 - 9
                "House 2", // 1 - 8
                "House 3", // 2 - 7
                "House 4", // 3 - 6
                "House 5", // 4 - 5
                "House 6", // 5 - 4
                "House 7", // 6 - 3
                "House 8", // 7 - 2
                "House 9", // 8 - 1
                "House 10" // 9 - 0
             };

            // House 4, House 5 and House 6 (implicitly inclusive)
            var set1 = houses[3 .. 5];      

            // House 4, House 5 and House 6 explicitly inclusive
            var set2 = houses[3 >..< 5];

            // House 5, exclude 3rd element from start and exclude 5th element from start
            var set3 = houses[3 <..> 5];

            // House 5, House 6, exclude 3rd element from start but include 5th element from start
            var set4 = houses[3 <..< 5];

            // House 4, House 5, include 3rd element from start but exclude 5th element from start
            var set5 = houses[3 >..> 5];  

        }
        

As you can see this requires just four simple tokens

  • <..>
  • <..<
  • >..>
  • >..<

And allows us to completely and symmetrically express inclusivity and exclusivity for both the start and the end elements.

Directionality

This can be expressed adequately with the ^ operator as it now stands, but again having explicit tokens for each direction would have been an improvement:

        static void Main(string[] args)
        {
            var houses = new string[]
             {
                "House 1", // 0 - 9
                "House 2", // 1 - 8
                "House 3", // 2 - 7
                "House 4", // 3 - 6
                "House 5", // 4 - 5
                "House 6", // 5 - 4
                "House 7", // 6 - 3
                "House 8", // 7 - 2
                "House 9", // 8 - 1
                "House 10" // 9 - 0
             };

            // House 4, House 5 and House 6 (implicitly inclusive)
            var set1a = houses[ 3 ..  5];
            var set1b = houses[^6 .. ^4];

            // House 4, House 5 and House 6 explicitly inclusive
            var set2a = houses[ 3 >..<  5];
            var set2b = houses[^6 >..< ^4];

            // House 5, exclude 3rd element from start and exclude 5th element from start
            var set3a = houses[ 3 <..>  5];
            var set3b = houses[^6 <..> ^4];

            // House 5, House 6, exclude 3rd element from start but include 5th element from start
            var set4a = houses[ 3 <..<  5];
            var set4b = houses[^6 <..< ^4];

            // House 4, House 5, include 3rd element from start but exclude 5th element from start
            var set5a = houses[ 3 >..>  5];
            var set5b = houses[^6 >..> ^4];

        }
        

This approach had it been taken would have made it natural to regard the first element (lowest address element) of an array as element zero without the hat operator and the last element (highest address element) of an array as zero with the hat operator, this symmetry would in turn make it easier to reason about code.

Element count

It is possible to express element count by specifying the same value for the start and end of a range and adding the count to that end range value but you must be cautious when using the hat operator, here are two ranges expressed with and without the hat operator – they yield 2 and 3 elements respectively: (This is legal C# hence the element number comments are different to my examples above)

        static void Main(string[] args)
        {
            var houses = new string[]
             {
                "House 1", // 0 - 10
                "House 2", // 1 - 9
                "House 3", // 2 - 8
                "House 4", // 3 - 7
                "House 5", // 4 - 6
                "House 6", // 5 - 5
                "House 7", // 6 - 4
                "House 8", // 7 - 3
                "House 9", // 8 - 2
                "House 10" // 9 - 1
             };

            // House 4, House 5 and House 6 (implicitly inclusive)

            int TWO_ELEMENTS = 2;
            int THREE_ELEMENTS = 3;

            var set1a = houses[3..(3 + TWO_ELEMENTS)];
            var set1b = houses[3..(3 + THREE_ELEMENTS)];

            var set2a = houses[^7..^(7 - TWO_ELEMENTS)];
            var set2b = houses[^7..^(7 - THREE_ELEMENTS)];

        }
        

I have created some extension methods that simplify this kind of thing, as it stands it can get quite bewildering. For example consider the statement “get me two elements starting at element 2 from the end”.

Well what does this mean? well element 2 (counting from the end) is “House 9” and we want two elements – but if we’re doing this “from” the end should we expect an array that has two elements, element 0 being “House 9” and element 1 being “House 8”? after all this is “from the end” where counting moves to elements at lower addresses?

 

 

Net Revision Tool

to developers

I work with a team of around ten or so other developers and we’re gradually moving our applications to Git. At this stage we still have what I’d called “legacy practices” and we mostly build and deploy our applications, services and sites manually.

We had a need to recently deploy quite a few services and applications as part of a roll out of a large project, as I began this work I was thinking how nice it would be to somehow tie in the deployed code to the Git commit or tag from which the code was built, enabling us to troubleshoot any bugs or problems more rapidly and accurately by helping a developer to build from the very same source a deployed troublesome app was built from.

Clearly what I needed was a way to associated the commit ID of the latest commit with one or more generated assemblies, ideally by setting some assembly attribute to the SHA-1 of the commit and possibly more, if only…

I assumed that this is probably something I’d need to think about and probably design and build myself, but it wouldn’t be an hours work – clearly.

So while sipping yet another coffee I did a web search for some of the terms that naturally come up for this question and after a short time I found a post on StackOverflow all about this subject and noticed mention of a utility named NetRevisionTool – I read the post with mounting interest and I forked and cloned the tool’s repo in order to explore it further.

It didn’t take me long to recognize that this a very simple yet powerful little utility, and after a short time I understood that making a simple and straightforward change to any Visual Studio project would give me what I sought and more!

Basically all one has to do is add a pre and post build event to the project and add a single attribute to the projects’s AssemblyInfo.cs file. The build events are merely invocations of the tool’s .exe with some basic command line arguments.

The prebuild event analyzes a mask string in the AssemblyInformationVersion attribute, this mask describes the content and format of the string to be generated by the tool, it then generates the required string and writes it into AssemblyInfo.cs (it backs up the file first).

Then the build itself runs and the generated string gets embedded into the asembly.

Finally the postbuild event runs which simply restores the backed up copy of AsssemblyInfo.cs ready for the next build whenever that might be. (For this reason its important to designate the postbuild event to run “Always”).

Options

The tool itself has command line options and the mask string also has a host of predefined values you can embed, here’s the string I settled on for the time being (I’ve colored the brcaes for easier readability):

{b:uymd-} {b:uhms:} -> {mname} -> {cname} -> {branch} -> {CHASH:8} {!:(Warning – There were uncommited changes present)}

This will cause the tool to generate a string that contains the build date and build time, the name of the computer on which the build executed, the name of the commiter, the name of the branch the commit was on at the time of the build, the first 8 chars of the SHA-1 and finally a warning message that will be appended if the repo at the time of the build had any uncommitted changes.

(I added support for {mname} myself and made this a pull request to the tool developer’s original repo, I also added a command line option /echo which will cause the generated string to be written to the Visual Studio output window when the build runs).

Naturally it’s preferable to always deploy code that doesn’t have the warning message, also we use Github and so after a pull-request merge the most recent commit is always a merge commit with author name “Github” so this too is ideally what one would see in the final generated message because we can be certain that the commit has indeed been merged and cannot ever disappear due to the developer rebasing as could be the case if the commit were only local or on their fork.

The question of where exactly to put the executable to that all developers can invoke it when their builds run obviously comes up and after a short discussion we decided to put the tool onto an existing shared network drive – not a robust solution perhaps but easily sufficient for our current day to day working practices.

Once I’d convinced myself this did what I needed and I settled on the options and so on, it takes me just a couple of minutes to add this to any Visual Studio project. For small teams or even lone developers, I cannot overstate how valuable this little tool is and I’m surprised it isn’t more well known to developers!

A Coroutine Library in C#

In this post I’m going to introduce an implementation of coroutines written in C#, the code I’m divulging is the result of my initial foraging into this unfamiliar (to me) concept so please bear this in mind if it appears a little preliminary.

The coroutine library provides a mechanism for two or more methods to transfer control to one another in such a way that when a method resumes execution, it does so at the position from which it previously passed control to another method.

The C# language includes support for writing iterator methods which use the yield return operation to return control to the caller in such a way that when re-invoked at a later time, execution resumes at the statement following the yield return. This mechanism is the basis for my implementation of coroutines.

The approach I’ve used is to define an abstract base class which provides a means to call user written iterator methods (henceforth referred to as coroutines) so that they can return the necessary information to enable another iterator method to be invoked. The base class thus does the invocation of the coroutines on our behalf and  masks some involved “plumbing” code which is necessary to make the coroutine mechanism easy to use.

Because coroutines are a set of mutually cooperating methods it is natural to design the coroutine library in such a way that we define the coroutines as methods which are all members of  a class. This class derives from an underlying base class that does the housekeeping necessary for tracking each coroutine’s state. The abstract base class is named Cooperative and all the user need do is derive a class from this and in that derived class implement a set of coroutines.

Here is an example of simple class that leverage Cooperative and defines two coroutines, this will help you see how coroutines actually look before we explore how the underlying implementation is coded:

public class KeyboardCoroutines : Cooperative
{
    private Queue<ConsoleKeyInfo> key_queue = new Queue<ConsoleKeyInfo>();

    public override void BeginProcessing(object Arg)
    {
        StartByActivating(ProduceFromKeyboard,Arg);
    }

    private IEnumerator<Activation> ProduceFromKeyboard(object Arg)
    {
        ConsoleKeyInfo info = Console.ReadKey(true);

        while (info.Key != ConsoleKey.Escape)
        {
            while (key_queue.Count < 10 && info.Key != ConsoleKey.Escape)
            {
                key_queue.Enqueue(info);
                info = Console.ReadKey(true);
            }
            
            if (info.Key == ConsoleKey.Escape)
                yield return Activate(ConsumeFromQueue,1);
            else
            {
                yield return Activate(ConsumeFromQueue, 2);
                key_queue.Enqueue(info);
            }

            Debug.WriteLine("ProduceFromKeyboard sees a result of: " + Result.ToString());
        }
    }

    private IEnumerator<Activation> ConsumeFromQueue(object Arg)
    {
        ConsoleKeyInfo key = key_queue.Dequeue();

        while (key.Key != ConsoleKey.Escape)
        {
            while (key_queue.Count > 0 && key.Key != ConsoleKey.Escape)
            {
                Console.Write(key.KeyChar);
                key = key_queue.Dequeue();
            }

            if (key.Key == ConsoleKey.Escape)
                yield return Activate(ProduceFromKeyboard,3);
            else
            {
                Console.Write(key.KeyChar);
                yield return Activate(ProduceFromKeyboard,4);
            }
            Debug.WriteLine("ConsumeFromQueue sees a result of: " + Result.ToString());
        }
    }
}

You’ll notice right away that a coroutine has a return type of IEnumerable<Activation> and a coroutine passes control to some other coroutine by executing a yield return expression which is a function call to Activate in the base class. The base class therefore enumerates a coroutine and uses the value returned by each iteration to select and call the enumerator associated with the next coroutine to execute. Each coroutine’s execution is temporarily suspended at the point it yields and resumes at the next statement when some other coroutine passes control back to it.

In my next post on this subject I’ll show you the base class implementation and explore some alternative ways to expose this coroutine mechanism.

 

 

 

 

Nested IEnumerables

I’ve been exploring the subject of coroutines recently and I’ll be writing more about this in a separate post in the near future. Designing an implementation of coroutines for C# requires taking full advantage of C#’s support for yield return and IEnumerable<T>.

A spinoff from this exploratory work has been a practical mechanism for supporting yield return for both sequence values and complete sequences – a capability sadly absent from the C# language.

Conceptually here’s some pseudo-code that conveys this idea:

        public IEnumerable<string> FirstSequence()
        {
            yield return "1";
            yield return "2";
            yield sequence SecondSequence();
            yield return "7";
            yield return "8";
        }

        public IEnumerable<string> SecondSequence()
        {
            yield return "3";
            yield return "4";
            yield return "5";
        }

The yield sequence keywords are of course fictitious but convey the requirement nicely – namely that when enumerating values from FirstSequence() the value present within SecondSequence() are automatically enumerated and returned – as if they’d been yielded directly from within FirstSequence().

The current C# language (Version 5) does not permit such constructs and one must code the following in order to get the desired effect:

        public IEnumerable<string> FirstSequence()
        {
            yield return "1";
            yield return "2";
            foreach (string V in SecondSequence())
               yield return V;
            yield return "7";
            yield return "8";
        }

        public IEnumerable<string> SecondSequence()
        {
            yield return "3";
            yield return "4";
            yield return "5";
        }

It seems – to me at least – that the implementation of yield return is unduly limited, mainly because there seems to be no significant reason why the C# compiler cannot transform: yield sequence S; into: foreach (var V in S) yield return V; since the latter is fully supported and provides the desired semantics and the transformation seems straightforward.

We can overcome this limitation and approach the elegance and simplicity of our imagined yield sequence by adopting a similar design to that adopted for implementing coroutines which I’ll discuss in a future post. Namely we create an object which manages the enumeration for us – a sort of enumeration proxy – this iterator object can then provide the processing required to make everything work. We can’t transform the code into another form (as the C# compiler does when it encounters the yield keyword) but we can invisibly enumerate embedded sequences by creating a stack of enumerators enabling us to suspend enumeration of one sequence and begin enumeration of the embedded sequence.

Once all elements have been enumerated from the embedded sequence we can pop the stack and resume enumeration using the previous enumerator thus continuing with the original sequence, this technique will be explored along with some real code in a future post.

 

Lexcial Analysis With F# – Part 5

I’ve begun to establish a reasonably sound design pattern for the lexical analyzer. Of course this isn’t intended to be an ideal solution to the general case of writing a tokenizer for any language, it does not support any kind of short hand for describing token structure for example. But it isn’t overly complex and at this stage supports some of the common tokens seen in C, C++ or C#.

Continue reading