I am wondering how one could store a trie data structure in a database. Once you've built a trie, if you want to store it both in memory and on disk, how would you do it? SO mentions materialized path as one way of doing it relationally. How about NoSQL like Cassandra, or memcache-style stores like Redis?
Interesting question. Never really thought about it. I'm mostly a noob in this topic, but a trie seems like primarily an in-memory structure. AFAIK, commercial text-search solutions use a simpler, more easily serializable (prefix, postings list) KV format. Serializing a trie to a file (or files) should be possible, though somewhat complicated.
If it's used for text validation, you could fit every word in every language in memory with plenty to spare. Human dictionaries are not large, and tries store them quite efficiently.
Key-value store. The root subtree has a known id. Store subtrees as flat files (JSON blobs?), limit each subtree by file size, have nodes contain references to the next subtree, etc. Use an LFU cache to keep hot subtrees around in memory.
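A minimal sketch of the subtree-in-blobs idea, assuming a plain dict stands in for the real key-value backend (Redis, flat files, ...); the ids, field names, and node layout are all invented for illustration, and the LFU cache layer is omitted.

```python
import json

# Stand-in for a real KV store; values are JSON blobs, one per subtree.
store = {}

def save_subtree(subtree_id, nodes):
    # nodes: {prefix: {"terminal": bool, "next_subtree": id or None}}
    store[subtree_id] = json.dumps(nodes)

def load_subtree(subtree_id):
    return json.loads(store[subtree_id])

# The root subtree has a well-known id; deeper prefixes live in "sub-c".
save_subtree("root", {"c": {"terminal": False, "next_subtree": "sub-c"}})
save_subtree("sub-c", {"cat": {"terminal": True, "next_subtree": None},
                       "car": {"terminal": True, "next_subtree": None}})

# Walking "cat": follow the root entry for "c" into its child subtree.
child_id = load_subtree("root")["c"]["next_subtree"]
found = load_subtree(child_id)["cat"]["terminal"]
```

A real version would cap each blob's size and split a subtree into children once it grows past the limit.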
In the codebase I've seen, the trie was just a regular key-value in-memory database. In-memory means only in memory, with occasional disk backups. So the key 12345 would be on one server, the key 1234 on another server, and the key 123 somewhere else. The client would generate N keys and make N requests to different servers. It didn't need to support a "get all children" operation, only "does this node's child exist," so a key-value design was good enough. Of course, this isn't the ideal system design interview answer because it doesn't solve hot-spotting (for example, a key of length 1 will get more requests). But in the real system the number of requests was low enough not to worry about hot spots.
What are the N keys here?
For example, say it's a search system. You analyzed the search data and built a trie offline with the structure "prefix - top 10 results". So for the word "house" you need to write 5 keys: h, ho, hou, hous, house. You write them to a key-value (NoSQL) store where the server is selected by the key's hash, so it's possible that the 5 keys above land on 5 different servers. But when you read, you only need to read 1 key: whatever the user entered in the search bar. Everything depends on the use case.
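A toy version of that scheme: every prefix of an indexed query is written as its own key, with the shard chosen by key hash, and a lookup is a single read. The server list and hashing here are stand-ins; a real system would use Redis Cluster, Cassandra, etc.

```python
# Five dicts stand in for five KV servers.
servers = [{} for _ in range(5)]

def server_for(key):
    return servers[hash(key) % len(servers)]  # pick a shard by key hash

def index_query(query, top_results):
    for i in range(1, len(query) + 1):        # "house" -> 5 writes
        prefix = query[:i]
        server_for(prefix)[prefix] = top_results

def suggest(prefix):
    return server_for(prefix).get(prefix, []) # a single read, one server

index_query("house", ["house prices", "house of cards"])
hits = suggest("hou")
```

Writes fan out across shards, but the read path only ever touches the one server that owns the typed prefix.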
Easiest way is to serialize it into files, like JSON.
That's for backup, right? Because reading from an actual file and doing a prefix search on every request might be too slow?
How about preorder serialization, like in some LeetCode questions?
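That works fine for a trie. A sketch of a preorder serialization in that style, assuming a simple dict-based node layout: each node is emitted as (is_word, sorted child characters), and deserialization consumes the list in the same preorder.

```python
def insert(root, word):
    # Build a plain dict trie: {"word": bool, "children": {char: node}}.
    node = root
    for ch in word:
        node = node["children"].setdefault(ch, {"word": False, "children": {}})
    node["word"] = True

def serialize(node, out):
    # Emit (is_word, child chars) for this node, then recurse in preorder.
    out.append((node["word"], sorted(node["children"])))
    for ch in sorted(node["children"]):
        serialize(node["children"][ch], out)
    return out

def deserialize(items, pos=0):
    # Consume items in the same preorder; return (node, next position).
    word, chars = items[pos]
    node = {"word": word, "children": {}}
    pos += 1
    for ch in chars:
        node["children"][ch], pos = deserialize(items, pos)
    return node, pos

root = {"word": False, "children": {}}
for w in ("car", "cat", "do"):
    insert(root, w)
rebuilt, _ = deserialize(serialize(root, []))
```

The flat list of tuples can then be dumped to JSON or any binary format for the on-disk copy.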
A trie is an indexing structure that's mainly used for reads, so I would much prefer it to be stored exclusively in memory (backed up on disk just for stable storage in case of failures). If the trie is too large to fit in memory, you probably want to partition it across multiple machines (first character A-C on node 1, D-F on node 2, and so on). This also gives better read query performance since it federates read queries (maybe also replicate it for faster reads plus resilience). The answer depends a lot on what you are using the trie for and whether you actually want to store it on disk, in memory, or both.
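The range scheme amounts to a small routing table consulted before every query; a toy version, with made-up node names and ranges:

```python
# Route a prefix to a partition by its first character.
ranges = [("a", "c", "node1"), ("d", "f", "node2"), ("g", "z", "node3")]

def node_for(prefix):
    first = prefix[0].lower()
    for lo, hi, node in ranges:
        if lo <= first <= hi:
            return node
    return None  # not covered by any range

assignment = node_for("dog")
```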
Partitions won't necessarily be uniform with this approach.
I would try the following. For NoSQL, write a JSON serialization of the trie, so a single entry containing just the trie data. For a relational DB, create a table where each record consists of a unique id, the value, and foreign key references to its children (one child reference per record). This is my simplistic answer.
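A sketch of the relational layout under a slightly different assumption: one row per trie node with a foreign key to its parent (the usual adjacency-list form), which expresses the same links as one-row-per-child-reference. Table and column names are invented; SQLite is used just as a convenient embedded engine.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE trie_node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES trie_node(id),
    edge      TEXT,     -- character on the edge from the parent
    is_word   INTEGER)""")
db.execute("INSERT INTO trie_node VALUES (1, NULL, NULL, 0)")  # root row

def insert(word):
    # Walk/create one row per character, then mark the last row as a word.
    node_id = 1
    for ch in word:
        row = db.execute(
            "SELECT id FROM trie_node WHERE parent_id = ? AND edge = ?",
            (node_id, ch)).fetchone()
        if row:
            node_id = row[0]
        else:
            node_id = db.execute(
                "INSERT INTO trie_node (parent_id, edge, is_word) VALUES (?, ?, 0)",
                (node_id, ch)).lastrowid
    db.execute("UPDATE trie_node SET is_word = 1 WHERE id = ?", (node_id,))

def contains(word):
    node_id = 1
    for ch in word:
        row = db.execute(
            "SELECT id FROM trie_node WHERE parent_id = ? AND edge = ?",
            (node_id, ch)).fetchone()
        if row is None:
            return False
        node_id = row[0]
    return bool(db.execute("SELECT is_word FROM trie_node WHERE id = ?",
                           (node_id,)).fetchone()[0])

insert("cat")
insert("car")
```

Each character lookup is one indexed query on (parent_id, edge), so a prefix walk costs one query per character.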
If you are building a search autocomplete suggestion system, it won't scale to have one massive trie serialized or stored somewhere. You need an offline batch processing job (e.g. Hadoop, or whatever is cool these days) that generates results (based on the queries), which can then be used to build suggestion tries locally on each server. These local tries can be updated periodically once you have enough new samples/queries to process. Then you need a sharding scheme to distribute the load. Range-based partitioning (e.g. host 1 stores [A-C]) is a good idea, but it has the complication of dynamically managing the ranges to avoid hotspots. These ranges cannot remain static as your historical data grows.
IMO, partition rebalancing won't be a big issue. Just pre-emptively create n partitions, where n is much larger than the maximum number of nodes you see yourself operating at 5x peak load. Then a new node can easily steal some of the extra partitions from each existing node. Dynamic splitting works too. I think the bigger issue with range partitioning is that partitions with common prefixes will always be hot, as someone else mentioned above.
@linkedin even though you have n partitions, when you add a new node, data needs to shuffle across all partitions. Consistent hashing will help minimize that shuffling.
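A minimal consistent-hash ring illustrating that point: a newly added server only takes over the keys between it and its predecessor on the ring, rather than forcing every key to move. Node names are placeholders, and real deployments add virtual nodes for smoother balance.

```python
import bisect
import hashlib

def ring_hash(s):
    # Stable hash (unlike Python's salted hash()) so placement is repeatable.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(servers):
    return sorted((ring_hash(s), s) for s in servers)

def server_for(ring, key):
    # A key belongs to the first server at or after it on the ring.
    i = bisect.bisect(ring, (ring_hash(key), "")) % len(ring)
    return ring[i][1]

keys = ["h", "ho", "hou", "hous", "house", "c", "ca", "cat"]
old = build_ring(["node1", "node2", "node3"])
new = build_ring(["node1", "node2", "node3", "node4"])
moved = [k for k in keys if server_for(old, k) != server_for(new, k)]
```

With hash-modulo-N placement, changing N would remap almost every key; here, only the keys that land on the new node move.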
Check out MARISA trie. The only problem with it is that it only lets you store keys. You'll need another DB to store the values. It's still a very useful design pattern for use cases like autocomplete or a URL database: you compile the info you have into a single large trie file, serialize it, and ship it to the various nodes. Use another serializable DB alongside it to store values (SQLite/LevelDB etc.), and you can usually address any lookup-heavy use case.
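A pure-Python stand-in for that pattern: a keys-only structure maps each key to a dense integer id (here just a sorted list; MARISA uses a compressed trie for this), and values live in a separate store keyed by that id. The names and data are invented.

```python
import bisect
import sqlite3

# Keys-only side: sorted keys give each key a stable dense id.
keys = sorted(["example.com", "example.net", "example.org"])

def key_id(key):
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None

# Value side: a companion SQLite table keyed by the id.
values = sqlite3.connect(":memory:")
values.execute("CREATE TABLE v (id INTEGER PRIMARY KEY, payload TEXT)")
values.executemany("INSERT INTO v VALUES (?, ?)",
                   [(i, "metadata for " + k) for i, k in enumerate(keys)])

row = values.execute("SELECT payload FROM v WHERE id = ?",
                     (key_id("example.org"),)).fetchone()
```

The key structure and the value store can then be shipped to nodes as two files and memory-mapped read-only.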
Not challenged enough at Salesforce, that you have time to ponder such things?
Prepping for interviews, most likely!
Salesforce isn’t a bad spot to be during Covid