Parrot Time Magazine

The Thinking of Speaking
Issue #30 November / December 2017
Extras
The Technical Secrets of LingoHut, Maybe

by Erik Zidowecki
November / December 2017

You have probably seen LingoHut by now, or at least got an idea of what it is like. It is slick. It is powerful. And I can guarantee you, the work that has gone into it is phenomenal.

Most will be amazed by the volume of vocabulary there is for each language, along with recordings, and how much of that is spread across a dozen languages. That work alone is the culmination of years and would correspond to several printed books.


As a programmer, I also look at websites and see how things were put together, think about what they had to overcome, and marvel at how smoothly it all works. Always remember that behind every learning platform is a devoted programmer.

Most of the time we hear from Kendal, who is the promoter and "face" of LingoHut, but she will always be sure to include her husband and partner, Philipp, in the credits, as he is the programmer who made it all possible.

Rather than ask Philipp all his secrets, I thought instead to talk about three of the larger issues which have to be considered and built into something like this. Just as a wall switch appears simple when you flood the room with light, the ease of using the site hides the complexity of the system.

Words

The biggest part of LingoHut is its collection of vocabulary. They have compiled hundreds of words and phrases together to make up the lessons, and they need to have those for multiple languages. So how are they storing those?

Let us look at German as a basic example. The German course gives the person a word in both English and German, with audio of the German pronunciation. This means that somehow, the English and the German words have to be stored as pairs, or at least in a way in which they can easily be paired together.

Undoubtedly the easiest way to do this is to have all the words stored in a file, with each line holding an English / German combination, separated by a character. For example, dog/Hund. Then the file could be loaded in, stored as an array (which is the computer equivalent to a list), and any pair could be accessed effortlessly.
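
As a rough illustration, here is a minimal Python sketch of that flat-file approach. The sample words and the name load_pairs are made up for this example, and an in-memory string stands in for the real file; none of it is taken from LingoHut.

```python
import io

# Stand-in for a word file: one English/German pair per line, separated by "/".
SAMPLE_FILE = "dog/Hund\ncat/Katze\nhouse/Haus\n"

def load_pairs(handle, separator="/"):
    """Read 'english/german' lines into a list of (english, german) tuples."""
    pairs = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        english, german = line.split(separator, 1)
        pairs.append((english, german))
    return pairs

pairs = load_pairs(io.StringIO(SAMPLE_FILE))
print(pairs[0])  # ('dog', 'Hund')
```

Once the list is built, any pair is a simple index lookup away, which is what makes this layout so convenient.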


This method is easy to make, simple to maintain, and straightforward to read. The only real drawback with it in terms of speed is that it takes a few extra microseconds to parse (split) the words and place them into the array. If you are willing to lose some of the file space and readability, you can store the data as a PHP array, already parsed. This means that whenever you need it, you just tell the PHP interpreter to include the file. Boom. The job is done.

My main concern with storing it all like this is that it is rather inflexible. From this list, I can only ever produce materials for English / German. If I wanted something like Italian / German, I would need a completely new file with all those words. Not bad, but what if you wanted several mixes, like French / German, Turkish / German, Swahili / German, etc.? And what happens when you find you misspelled or mistranslated a German word? You would need to change numerous files.

If this is a concern to you as well, then we would have to consider a way to link words together across languages. It is not actually that hard. We simply assign each word a unique key of some kind (numbers, alphanumeric, etc.) and use that key for the same word in each language. For example, we could assign the key "1" to "dog". So in the German file, rather than looking for "dog", we look for "1" and find with it "Hund". In the Italian, we would find "cane". This is the way basic databases work, with keys linking data together.

With this method, you can now mix any language with any other language without creating a new file for each pairing.
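
A small Python sketch of the keyed idea, under the assumption that each language has its own list sharing the same numeric keys. The WORDS table, the language codes, and the function name pair_up are hypothetical, chosen only to show the linking.

```python
# Hypothetical per-language word lists; each key names one concept.
WORDS = {
    "eng": {1: "dog", 2: "cat"},
    "deu": {1: "Hund", 2: "Katze"},
    "ita": {1: "cane", 2: "gatto"},
}

def pair_up(source, target):
    """Build (source word, target word) pairs for every key both languages share."""
    shared = sorted(WORDS[source].keys() & WORDS[target].keys())
    return [(WORDS[source][key], WORDS[target][key]) for key in shared]

print(pair_up("ita", "deu"))  # [('cane', 'Hund'), ('gatto', 'Katze')]
```

Correcting a German word now means changing one entry in the German list, and every pairing that uses it picks up the fix.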

There is another file solution I will mention only as advice on what not to do. When creating data storage methods, there is always the question of balance between space and speed. The first way I listed, with just the words in a paired listing, was simple to read and relatively small, needing only space for the words. The second method, storing the file as a premade array, made the file harder to read and increased the size, but made it faster to load into the program.

Several years ago, a storage method was introduced using files and "tags". A tag is a specific word that defines the data which comes next. HTML is an example of a widespread existing markup language (the "ML" in "HTML"). You would use a tag like "title", enclosed in brackets <>, to tell what the title is. A slash would be included in the closing tag to end the data: <title>How to Build a Fish</title>

This new system was called "XML" (note the "ML" again) and it took the internet by storm. Everyone was putting all their information into it, in the hope it would make sharing the data easier. Because it was tag driven, everyone could define their own tags within the code, so it was conceivably a universal data structure.

However, there is a reason they teach courses on data structures, with the pros and cons for each laid out. That is because there is no "best" or "universal" way.

An associate of mine discovered this when he attempted to take all the word data we were using for a site and implement it this way. It was pretty. It was ordered. It was also bloated and slow.


See, to make data retrieval fast, you want to do as little work as possible. When you know that each line has two pieces of data and that the first one is an English word and the second is a German one, then you can instantly grab those pieces of data and store them in the right place. You do not need to figure out what they are.

However, in the tag system, before you can find the data, you need to find the tags defining where it is. Essentially, you need to find the opening tag for the English word and its matching closing tag, then do the math to figure out which characters in between are to be extracted as the English word. You do the same for the German. But you also need to make sure they are for the same pair, so you have to make sure they both fall within the same enclosing pair of tags. Programming-wise, this requires lots of comparison tests and some basic math for each item.

So right away, your speed is gone. What about size? Look again at our dog/Hund example. That is 8 characters long (dog = 3, / = 1, Hund = 4). Putting the same data into an XML file means wrapping dog and Hund each in their own opening and closing tags, then wrapping both in tags that mark the pair. That came to 55 characters, which is almost 7 times the size!
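
To make the comparison concrete, here is a small Python sketch. The tag names pair, english, and german are my own placeholders, not the ones from the original project, so the exact character count differs slightly from the figure above; the point is only the order of magnitude and the extra lookup work.

```python
import xml.etree.ElementTree as ET

flat = "dog/Hund"
# Hypothetical tag names, used only for illustration.
tagged = "<pair><english>dog</english><german>Hund</german></pair>"

# Flat line: one split, and position alone tells us which word is which.
english, german = flat.split("/")

# Tagged record: a parser has to locate each element by name first.
record = ET.fromstring(tagged)
xml_pair = (record.findtext("english"), record.findtext("german"))

print(len(flat), len(tagged))          # character counts of each form
print(xml_pair == (english, german))   # same data either way
```

Both forms carry exactly the same two words; the tagged one simply costs more characters to store and more steps to read.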

When we attempted to use the data stored in this structure, what used to take a page a few seconds to load now took 5 minutes! So we lost both speed and size with this data structure.

These solutions all depend on using files for word lists. Some people do not like those, as they can easily be garbled if an edit goes wrong, or completely deleted with the touch of a wrong key.

So that is when we put them all directly into a database. A database can be made to act like a list because you are essentially still pulling in all the relevant data and storing it again in an array. And you can store it as dual language words per entry, as described in the first method, or as single language entries with keys. Actually, when using a database, you will likely have at least one unique key for everything.

The main strength perhaps of a database is the flexible access. You can change any word without affecting the others, while with files, you are opening a file with all words, making a change, and saving it again, hopefully without affecting anything else.
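
Here is a minimal sketch of the keyed layout in a database, using Python's built-in sqlite3 module with a throwaway in-memory database. The table name, column names, and the deliberate misspelling are assumptions for the example, not a description of how LingoHut stores its data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE words (concept_id INTEGER, lang TEXT, word TEXT, "
    "PRIMARY KEY (concept_id, lang))"
)
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    # "Hunt" is a deliberate typo that gets corrected below.
    [(1, "eng", "dog"), (1, "deu", "Hunt"), (1, "ita", "cane")],
)

# Fix the one misspelled German word; no other row is touched.
conn.execute(
    "UPDATE words SET word = ? WHERE concept_id = ? AND lang = ?",
    ("Hund", 1, "deu"),
)

# Pair any two languages by joining on the shared key.
rows = conn.execute(
    "SELECT a.word, b.word FROM words a JOIN words b "
    "ON a.concept_id = b.concept_id WHERE a.lang = ? AND b.lang = ?",
    ("ita", "deu"),
).fetchall()
print(rows)  # [('cane', 'Hund')]
```

The update touches a single entry, which is the flexibility described above, and the result still ends up in an ordinary list for the program to use.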

Databases are also good for when you have many people making changes to the data. Having people making changes to files can be tricky and hazardous.

The downside is the overhead, since every entry needs extra data to define it, and you still need to load all the data and put it into an array again. But in truth, the difference between the three methods (flat file, array file, database) is probably too small to be noticed on most systems.

Audio


LingoHut also incorporates pronunciation into its learning system, making it necessary to store, essentially, a recording of every word for every language.

Probably the easiest way to do this is to keep each audio word in a separate file. What format you store it in will depend on a few things, which we will discuss shortly.

Each file name will then have to be tied to the appropriate language word which is selected. The most straightforward way, going back to our original very simple example of a word list, would be to have a third entry on each line, which is the file name. For example: dog/Hund/deu_hund.wav.

This works well for storing word pairs or individual words with keys, as well as for a database. What might sound tempting, if you are using keys, is to store in an array or file a list of all sound file names along with the key. That is a mistake because while the key links the same concept, like a dog, the words will be different in each language. You would have to expand the key with a prefix like the language code (in this case, deu for German), so the audio file for "Hund" might be stored as deu_1 and for Italian "cane" it would be ita_1.
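
The prefixed key can be built in one line. This Python sketch assumes a .wav extension and a function name, audio_filename, that are purely illustrative.

```python
def audio_filename(lang_code, key, extension="wav"):
    """Build a per-language audio file name such as 'deu_1.wav'."""
    return f"{lang_code}_{key}.{extension}"

print(audio_filename("deu", 1))  # deu_1.wav
print(audio_filename("ita", 1))  # ita_1.wav
```

The concept key stays the same across languages, while the language code keeps the recordings from colliding.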

Whichever way you do it, it is vital to ensure that any change to a word also means a change in some way to the audio. For example, if someone prefers to use der "Rüde", you can easily change the text, but the audio will now also need to be updated.

Besides saving each audio word as a single file, there are two other storage methods I have seen. The first is putting the sound data directly into the database along with the word. That requires an encoding of the raw sound data into something which can be stored safely as normal text. The plus side is you do not have hundreds or thousands of audio files stored on your system (something shared hosting services hate). However, depending on how long the audio is, this could also blow up the size of your database.
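
One common way to turn raw bytes into safe text is Base64, which I am assuming here only as an example of such an encoding. The sketch below uses dummy bytes in place of a real recording and shows the size cost: every 3 bytes of audio become 4 characters of text.

```python
import base64

raw_audio = bytes(range(256)) * 12  # 3072 dummy bytes standing in for a recording

# Encode to plain ASCII text that a text column can hold safely.
encoded = base64.b64encode(raw_audio).decode("ascii")

# Decoding gives back the exact original bytes for playback.
restored = base64.b64decode(encoded)

print(len(raw_audio), len(encoded))  # 3072 4096
```

That one-third growth is part of why long recordings can inflate the database so quickly.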

The other method is a bit trickier and can really only be done when you have all sound files completed and they will not be changed again. If this is the case, you can join the recordings together into large chunks, keeping track for each one of which audio chunk it is in, along with how many seconds in it starts and its duration.
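
The bookkeeping for that might look like the following Python sketch. The file names, keys, and timings are invented, and times are kept in whole milliseconds to avoid rounding surprises.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clip:
    chunk_file: str    # which combined audio file holds the word
    start_ms: int      # where the word begins inside that file
    duration_ms: int   # how long the word lasts

# Hypothetical index; names and timings are made up for illustration.
CLIP_INDEX = {
    "deu_1": Clip("deu_chunk_01.wav", 0, 800),
    "deu_2": Clip("deu_chunk_01.wav", 800, 1100),
}

def playback_window(clip_key):
    """Return (file, start ms, end ms) for one word inside its chunk."""
    clip = CLIP_INDEX[clip_key]
    return clip.chunk_file, clip.start_ms, clip.start_ms + clip.duration_ms

print(playback_window("deu_2"))  # ('deu_chunk_01.wav', 800, 1900)
```

A player would then seek to the start time and stop at the end time, so two words can share one file without ever being heard together.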

While it may seem silly to do this, consider teaching someone a sentence. If you have the single sentence recorded, you can also store the information for each word, thus playing any fragment of the sentence. Pimsleur uses a method of asking the learner to repeat each word separately, then as a whole sentence. If you store the audio information this way, you can have one longer file rather than several smaller ones.

Now what kind of files you use really depends on a few different factors. The main issue is what you can get the browser to play back properly. In the past, coding playback has been tricky, dependent on creating sound objects and linking in players which may or may not work with all browsers. With the introduction of more sophisticated sound libraries and HTML5, playback has become more flexible and reliable.

Then it becomes a question of which works best. Most people look for files to be stored as MP3 files (.mp3), but the common data format of choice has been Wave files (.wav). MP3s are normally much more compact, at the cost of a little audio quality, but you may not have the ability to play them. I have found playing sounds on the internet to be one of the most constantly changing and at times frustrating things to work with, beaten out only by playing streaming audio or recorded video.

Site Translation


In the age of globalization, you really need to have your website in more than one language if you wish to reach the international community. Naturally, LingoHut has the ability to wrap itself in many languages. This is what gives it the full power it has in the ESL (English as a Second Language) section.

But while it might seem like an obvious thing to do, the method of doing it may not. Most web pages are stored as static files which do not change. For those people, having a multilingual site means having a separate page for each language. This is a real problem when you need to change the way your site looks, as you now have dozens of copies to change.

Some sites which do more interactive stuff will have the pages built using something like Python, Perl, or PHP, so they are filled in differently as needed before being delivered to your browser. This means the proper translations can be added to a page with only one source, so changes are easy.

If you want to keep your pages static and not use a program to build them, you can use JavaScript and Ajax to load in the proper translations after the page has loaded. What this looks like is a formatted page with no text, just little animated wheels showing something is loading. This can be a disaster if the person has disabled JavaScript or you have another slow element loading, like a large picture, because the text loads last.

Once you figure out how to load it in, if that is what you plan to do, you will need to decide how to store it, which, perhaps unsurprisingly, goes back to the first issue of storing the words. Both are really translations between two (or more) languages. So again, you might have a file with an English instruction followed by a German instruction on each line. You load that in and then place whichever version you need in the right place, using a preloader like PHP or a postloader like JavaScript / Ajax.
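
A server-side version of this can be sketched in a few lines of Python. The UI_TEXT table, the labels, and the tiny page template are all hypothetical; a real site would load the strings from whichever storage method it chose.

```python
from string import Template

# Hypothetical interface strings, keyed by language code and then by label.
UI_TEXT = {
    "eng": {"greeting": "Welcome", "start": "Start the lesson"},
    "deu": {"greeting": "Willkommen", "start": "Lektion starten"},
}

# One page source shared by every language.
PAGE = Template("<h1>$greeting</h1><button>$start</button>")

def render(lang, fallback="eng"):
    """Fill the single page template, using English for any missing label."""
    strings = {**UI_TEXT[fallback], **UI_TEXT.get(lang, {})}
    return PAGE.substitute(strings)

print(render("deu"))  # <h1>Willkommen</h1><button>Lektion starten</button>
```

Because there is only one template, a change to the layout is made once and every language version follows.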

Bonded Pair

I have focused just on a few issues that I have had to tackle myself and I know went into LingoHut. I do not know how Philipp handled them, whether it was in one of the ways I described or in a totally different way I do not even know about.

Kendal started her article in this issue talking about "two dreamers" who came together to build something for everyone. To me, they represent how we need both the dreamers and the engineers to make things happen.

This pair of dreamers is in a dataset all their own.

The Technical Secrets of LingoHut, Maybe
Writer: Erik Zidowecki
Images:
Petey: Data tunnel (splash title); Pencil and pad; Lockers; Database; Microphone; Cafeteria wall

All images are Copyright - CC BY-SA (Creative Commons Share Alike) by their respective owners, except for Petey, which is Public Domain (PD) or unless otherwise noted.
