NUTS

Creating a text editor in C using rope-like data structure.

NUTS

Published: Mon, 17 Feb 2025 00:51:07-4:00
C | Unity | GCC

What’s NUTS?

NUTS, Naokotani’s Unusable Text Software (I am also playing with the idea of PEANUTS, Portable Executable Application: Naokotani’s Unusable Text Software), is my new hobby coding project to create a rudimentary TUI text editor similar to Vi or Nano in C. The plan is to create a text editor that stores text in a rope-like data structure based on logical text divisions; buffers, lines and words (“words” are really text snippets, they can be punctuation, white space, etc.). A linked list of buffers (IE, vim, Emacs buffers etc.) points to linked lists of lines, which point to linked lists of words, which contain the strings that make up the text buffer. The intention is not to create the latest greatest vscode competitor, but to do a deep dive into strategies to store text in computers at a relatively low level.

Why This?

I was recently doing some work implementing a simple linked list in C along with associated functions for working with the data structure. The basic implementation was not very interesting, so I decided to see what other interesting directions I could take the code in. Initially the linked list stored integers, but I started to play with storing dynamic character pointers in the linked list. Next I implemented the functions as Clojure inspired recursive functions (functions like first, rest). Once I had this working with small basic hard coded strings, I created a few simple functions to parse text and place it into my linked list. I fed it some books like the King James Bible and Moby Dick (no rhyme or reason here, its just what came up on Project Gutenberg), and to my utter shock and horror, it instantly crashed with a segmentation fault. The issue was that the recursive functions were causing stack overflows when I was feeding them these longer texts (I believe Moby Dick parsed out to somewhere around 280 000 words and the program got to around 90 000 and got very sad). After some investigation I found that GCC actually supports tail call optimizations as long as you pass the -O2 flag and the return of the recursive function is the last piece of logic in the function. In order to make the tail call optimization work, I had to jump through some hoops and the code got a bit messy, but it worked, and performed pretty well. I didn’t bench mark it, but it was able to parse Moby Dick, put it in a linked list, a recursively copy the list to pull out a random quote in a split second.

This project, while interesting, didn’t feel like it was going anywhere. I would like to spruce up the UI a little bit for creating quotes, but programs that do nothing put pull random quotes out of .txt files are not exactly in hot demand at the moment. Furthermore, the tail call optimization was fun to explore, but I had to bend over backwards to conform the functions to it’s requirements. I do think that there are niche situations where it might be helpful, so its nice to know how it works and some of the pitfalls of recursion in C, but there are costs associated with it as well.

I realized at some point that what I had was already an extremely rudimentary form of text editor. It could only operate on words and couldn’t manipulate white space, but I was able to make arbitrary changes to long passages of text fairly efficiently. By extending the idea up from the word to the line and buffer and down from the word to the character I realized I could create a rudimentary text editor with the knowledge I had gained. More surprisingly, this happened to be at least approximately how text editors often actually work. Sadly, the recursive functions wouldn’t make it to the next step. There is already infinite complexity to explore within the paradigm of a text editor without needing to arbitrarily complicate it.

Where is NUTS in the Nowness of time?

So far I have created the basic project structure using a template provided by Unity testing framework and created the basic outline of the types I will need. I have started working primarily in Word.c where I am building out the functions needed to manipulate strings. The ability to add and delete characters and to add and remove regions of text. The following snippet gives a good overview of my approach and is pulled largely from my addString() function in Word.c.

// Save the start index.
int index = startIndex;

// Get length of the sub string to copy over from the
// difference of the indicies
int length = endIndex - startIndex;

// Decrement the startIndex until it reaches zero. Increment
// and copy the derefenced pointer to the new string.
while (startIndex-- && (*newStrPtr++ = *oldStrPtr++))
  ;

// Copy the newStr from the new string passed as an argument to the fuction.
// Because the pointer was already incremented and hasn't been reset, its
// already in the correct postion to start copying.
while ((*newStrPtr++ = *inputString++))
  ;

// Increment the index by the length of the copied String.
// This can be used to determine how much if any of the old
// string remains to be copied.
index += length;

// I need to decrement here because the pointer is after the '\0'
// character because of the final increment after the while fails in
// the preceeding loop.
incNew--;

// If more of the string remains, append it to the to new string
// I could also use a while loop, but strcpy internally uses the
// same approach, so it works well to tack on the last bit.
if (index < (int)word->size) {
  oldStrPtr += length;
  strcpy(incNew, incOld);
}

I took this logic for copying strings from “The C Programming Language” by K & R (especially pages 86-88), which describes the logic in detail and is an interesting read. I could make more liberal use of memset or strcpy, but especially in the case of strcpy, the implementation used in the GCC source code uses a nearly identical approach, so I don’t think there is huge advantage to using them over simple while loops. This approach makes it easier to save the position of the pointer for further operations. The function to delete a region of a string uses the same approach. It copies up the start index, increments the pointer by end index minus start index and copies what, if anything, remains after the deleted region.

What about Performance!

After doing a bit reading about ropes, I realized that they tend to be immutable and favor statically sized arrays, while my implementation uses malloc to assign memory to the character pointers. This prompted me to do a little bit of investigating with some very rudimentary benchmarking programs.

#define ITERATIONS 1000000

int main(void) {
  char testTemplate[] = "Hello, World!";
  float fixedStartTime = (float)clock() / CLOCKS_PER_SEC;

  for (int i = 0; i < ITERATIONS; i++) {
    char testChar[14];
    strcpy(testChar, testTemplate);
  }

  float fixedEndTime = (float)clock() / CLOCKS_PER_SEC;
  float fixedTimeElapsed = fixedEndTime - fixedStartTime;
  float mallocStartTime = (float)clock() / CLOCKS_PER_SEC;

  for (int i = 0; i < ITERATIONS; i++) {
    char *testChar = malloc(14 * sizeof(char));
    strcpy(testChar, testTemplate);
  }

  float mallocEndTime = (float)clock() / CLOCKS_PER_SEC;
  float mallocTimeElapsed = mallocEndTime - mallocStartTime;

  printf("Fixed time elapsed: %f\n", fixedTimeElapsed);
  printf("Malloc time elapsed: %f\n", mallocTimeElapsed);
  return 0;
}

This created an output on my laptop of approximately 0.002 seconds for copying the fixed size array and 0.02 seconds for the malloc version without any compiler optimizations (I also tried it with calloc, which was slightly slower, but the difference was quite negligible). This is a huge difference in performance between the two, but its not really a realistic scenario, and there are a lot of other considerations.

For example, either the fixed arrays would need to be quite large, yet often would only need to hold a small number of characters, or they would be small and then it would be a common scenario where they would need to be split into multiple nodes in the rope structure to hold single blocks of text. Operations that print a single word might require multiple string copies distributed between multiple nodes. That being said, I do think it is likely that you could find a sweet spot where the array was not too large, but in which the vast majority of words could fit. Additionally, making strings dynamically sized greatly simplifies higher level navigation functions such as next word, delete word etc. If strings were distributed across multiple nodes, even very infrequently, a next word function would need to be able to navigate multiple nodes until it determined it was at the node that held the end of the string for the word.

I attempt to mitigate this problem somewhat by over sizing the memory allocated to the strings with malloc. The addChar() and delChar() functions, theoretically the ones that will be used for normal one-character-at-a-time (IE. regular typing vs. a put-from-kill-ring style function for example), will only need to resize the buffer once every n number of inputs where n is the BUFFER_SIZE defined. If adding a new character would make the word size equal to a multiple of BUFFER_SIZE, reallocarray is used to resize the buffer appropriately by increasing its size by BUFFER_SIZE. Later on it will be interesting to play with that value and see how it would perform in different real world scenarios, but for now I’m left only speculating. Out of curiosity, I made a similar program and, without compiler optimizations, it was able to put a single character in a static and dynamic char array using array indexing 100 million times, and while the static array was faster, the difference was trivial. Let’s just say I don’t think you will notice a big difference unless you spend WAY too much time on monkeytype.com.

All this speculating is interesting, but ultimately largely academic because performance should be well within the range of the acceptable even for very large text files, and certainly more than acceptable for an unusable text software interface. Right now my biggest concern is creating functioning logic so I can start to work on the myriad problems of user input, saving, loading, managing buffers and so on.

System Crafters Web Ring

Messing around with computers and coding since I was 8. Now getting paid to do what I love.