Parrot Time Magazine

The Thinking of Speaking
Issue #30 November / December 2017
Extras
The Technical Secrets of LingoHut, Maybe

The Technical Secrets of LingoHut, Maybe

by Erik Zidowecki
November / December 2017 | 

You have probably seen LingoHut by now, or at least got an idea of what it is like. It is slick. It is powerful. And I can guarantee you, the work that has gone into is phenomenal.

Most will be amazed by the volume of vocabulary there is for each language, along with recordings, and how much of that is spread across a dozen languages. That work alone is a cumulation of years and would correspond to several printed books.


As a programmer, I also look at websites and see how things were put together, think about what they had to overcome, and marvel at how smoothly it all works. Always remember that behind every learning platform is a devoted programmer.

Most of the time we hear from Kendal, who is the promoter and "face" of LingoHut, but she will always be sure to include her husband and partner, Philipp, in the credits, as he is the programmer who made it all possible.

Rather than ask Philipp all his secrets, I thought instead to talk about three of the larger issues which have to be considered and built in something like this. Just like how simple a wall switch appears when you flood the room with light, the ease of using the site hides the complexity of the system.

Words

The biggest part of LingoHut is its collection of vocabulary. They have compiled hundreds of words and phrases together to make up the lessons, and they need to have those for multiple languages. So how are they storing those?

Let us look at German as a basic example. The German course gives the person a word in both English and German, with an audio of the German pronunciation. This means that somehow, the English and the German words have to be stored as pairs, or at least in a way in which they can easily be paired together.

Undoubtedly the easiest way to do this is to have all the words stored in a file, with each line holding an English / German combination, separated by a character. For example, dog/Hund. Then the file could be loaded in, stored as an array (which is the computer equivalent to a list), and any pair could be accessed effortlessly.


This method is easy to make, simple to maintain, and straightforward to read. The only real drawback with it in terms of speed is that it takes a few extra microseconds to parse (split) the words and place them into the array. If you are willing to lose some of the file space and readability, you can store the data as a PHP array, already parsed. This means that whenever you need to load it, you just tell the compiler to load it. Boom. The job is done.

My main concern for storing it all like this is it is rather inflexible. From this list, I can only ever produce materials for English / German. If I wanted something like Italian / German, I would need a completely new file with all those words. Not bad, but what if you wanted several mixes, like French / German, Turkish / German, Swahili / German, etc. And what happens when you find you misspelled or mistranslated a German word? You would need to change numerous files.

If this is a concern to you as well, then we would have to consider a way to link words together across languages. It is not actually that hard. We simply assign each word a unique key of some kind (numbers, alphanumeric, etc.) and use that key for the same word in each language. For example, we could assign the key "1" to "dog". So in the German file, rather than looking for "dog", we look for "1" and find with it "Hund". In the Italian, we would find "cane". This is the way basic databases work, with keys linking data together.

With this method, you can now mix any language with any other language without creating a new file for each pairing.

There is another file solution I will mention only as advisement on what not to do. When creating data storage methods, there is always the question of balance between space and speed. The first way I listed, with just the words in a paired listing, was simple to read and relatively small, needing only space for the words. The second method, with storing the file as a premade array, made the file harder to read and increased the size, but made it faster to load into the program.

Several years ago, a storage method was introduced using files and "tags". A tag is a specific word that defines the data which comes next. HTML is an example of a widespread existing markup language (the "ML" in "HTML"). You would use a tag like "title", enclosed in brackets <> to tell what the title is. A slash would be included to end the data. How to Build a Fish

This new system was called "XML" (note the "ML" again) and it took the internet by storm. Everyone was putting all their information into it, in the hope it would make sharing the data easier. Because it was tag driven, everyone could within the code define their own tags, so it was conceivably a universal data structure.

However, there is a reason they teach courses on data structures, with the pros and cons for each laid out. That is because there is no "best" or "universal" way.

An associate of mine discovered this when he attempted to take all the word data we were using for a site and implement it this way. It was pretty. It was ordered. It was also bloated and slow.


See, to make data retrieval fast, you want to do as little work as possible. When you know that each line has two pieces of code and that the first one is an English word and the second is a German one, then you can instantly grab those pieces of data and store them in the right place. You do not need to figure out what they are.

However, in the tag system, before you can find the data, you need to find the tags defining where it is. Essentially, you need to find a tag, like and the other tag , then do the math to figure out which characters in between are to be extracted as the English word. You do the same for the German. But you also need to make sure they are for the same pair, so you have to make sure they both fall within the tags . Programming wise, this requires lots of comparison tests and some basic math for each item.

So right away, your speed is gone. What about size? Look again at our dog/Hund example. That is 8 characters long ( dog = 3, / = 1, Hund = 4). Putting the same data into an XML file might look like dogHund. That is 55 characters, which is an increase in size by almost 7!

When we attempted to use the data stored in this structure, what used to take a page a few seconds to load now took 5 minutes! So we lost both speed and size with this data structure.

These solutions all depend on using files for word lists. Some people do not like those, as they can easily be garbled if an edit goes wrong, or completely deleted with the touch of a wrong key.

So that is when we put them all directly into a database. A database can be made to act like a list because you are essentially still pulling in all the relative data and storing it again in an array. And you can store it as dual language words per entry, as described in the first method, or as single language entries with keys. Actually, when using a database, you will likely have at least one unique key for everything.

The main strength perhaps of a database is the flexible access. You can change any word without affecting the others, while with files, you are opening a file with all words, making a change, and saving it again, hopefully without affecting anything else.

Databases are also good for when you have many people making changes to the data. Having people making changes to files can be tricky and hazardous.

The downside is the overhead, since every entry needs extra data to define it, and you need to do a load on all the data and putting it into an array again. But in truth, the differences between all three methods (flat file, array file, database), is probably so small so as to not be noticed on most systems.


12All pages
1
The Technical Secrets of LingoHut, Maybe
Writer: Erik Zidowecki
Images:
Petey: Data tunnel (splash title); Pencil and pad; Lockers; Database; Microphone; Cafeteria wall

All images are Copyright - CC BY-SA (Creative Commons Share Alike) by their respective owners, except for Petey, which is Public Domain (PD) or unless otherwise noted.

Comments

comments powered by Disqus
In this issue:



Archives
Missed something?
Find previous issues in the archives.





Google