Best Practices Working with Billion-row Tables in Databases
This post is about a video by Hussein Nasser that I watched on YouTube. It is a summarized version of the video along with my key takeaways.
Learning:
The discussion covers:
- How to handle data in a billion-row table
- What kinds of approaches can be taken
- How to redesign the table to handle 2 billion rows
Summary:
The discussion starts with how a Twitter follower table might be designed. A simple approach: create a table that records which person follows whom, with each row holding just 2-3 columns. For a service like Twitter, though, that table becomes enormous over time.
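As a rough sketch of that naive design (the table and column names are my own, not taken from the video), it could look something like this:

```python
import sqlite3

# A minimal sketch of the naive follower design: each row simply records
# "follower_id follows followee_id". Column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE follows (
        follower_id INTEGER NOT NULL,   -- the user who follows
        followee_id INTEGER NOT NULL,   -- the user being followed
        followed_at TEXT                -- optional timestamp
    )
""")
conn.execute(
    "INSERT INTO follows (follower_id, followee_id, followed_at) VALUES (?, ?, ?)",
    (1, 2, "2024-01-01"),
)
conn.commit()
```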
So what can be done?

One option is to run a normal query without any indexing, just brute-forcing through the data. Use multi-threading and multi-processing, and scan the table with lots of machines to find the data (a map-reduce style approach).
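A toy, single-machine stand-in for that idea, using Python's multiprocessing module to split the scan across worker processes (the data, chunk size, and target value are made up for illustration):

```python
from multiprocessing import Pool

# Fake data: one million (follower_id, followee_id) rows, no index anywhere.
ROWS = [(follower, follower % 1000) for follower in range(1_000_000)]

def scan_chunk(chunk, target_followee=42):
    # "Map" step: each worker linearly scans its own chunk of rows.
    return [row for row in chunk if row[1] == target_followee]

if __name__ == "__main__":
    chunk_size = 250_000
    chunks = [ROWS[i:i + chunk_size] for i in range(0, len(ROWS), chunk_size)]
    with Pool(processes=4) as pool:
        partial_results = pool.map(scan_chunk, chunks)
    # "Reduce" step: merge the partial results from every worker.
    followers_of_42 = [row for part in partial_results for row in part]
    print(len(followers_of_42))
```

In a real setup the chunks would live on many machines rather than in one process's memory, but the map/scan and reduce/merge shape is the same.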
The next option is to use indexing on the table and find the data through the index instead of scanning everything.
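A minimal sketch of that step, again with assumed table and index names; the query plan shows the lookup going through the index rather than a full table scan:

```python
import sqlite3

# Add an index so lookups on followee_id no longer scan the whole table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER)")
conn.execute("CREATE INDEX idx_follows_followee ON follows (followee_id)")

# The planner can now answer this with an index search instead of a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT follower_id FROM follows WHERE followee_id = ?",
    (42,),
).fetchall()
print(plan)  # typically reports: SEARCH follows USING INDEX idx_follows_followee
```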
But now you have billions of rows, so the index itself is huge. To search within that, use database partitioning on the same disk, e.g. pair partitioning.
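One concrete way to express partitioning on a single host is PostgreSQL's declarative hash partitioning. The sketch below assumes a locally running PostgreSQL and the psycopg2 driver, and the partitioning key and scheme are my own choices rather than the exact ones from the video:

```python
import psycopg2  # assumes PostgreSQL is running locally and psycopg2 is installed

# Hash-partition the follower table inside one database on one disk, so a
# lookup only has to search the partition its key hashes to.
conn = psycopg2.connect("dbname=social user=postgres")  # placeholder connection details
cur = conn.cursor()
cur.execute("""
    CREATE TABLE follows (
        follower_id BIGINT NOT NULL,
        followee_id BIGINT NOT NULL
    ) PARTITION BY HASH (follower_id);
""")
# Four physical partitions on the same host; PostgreSQL routes rows automatically.
for remainder in range(4):
    cur.execute(
        f"CREATE TABLE follows_p{remainder} PARTITION OF follows "
        f"FOR VALUES WITH (MODULUS 4, REMAINDER {remainder});"
    )
conn.commit()
```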
Now, to optimize further, use sharding in the system (multiple hosts). But this adds more complexity to the system: the client needs to be aware of the shard info before querying, then needs to find the proper partition for the query, and only then can it make the actual query. That introduces another layer of logic on top of the business logic.
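A small sketch of that extra client-side layer: before the application can run the real query, it has to map the key to a shard host. The host names and hashing scheme below are hypothetical:

```python
import hashlib

# Hypothetical shard hosts; in practice this mapping might live in config
# or a metadata service.
SHARDS = ["db-shard-0.internal", "db-shard-1.internal", "db-shard-2.internal"]

def shard_for(follower_id: int) -> str:
    # Deterministically map a key to one of the shard hosts.
    digest = hashlib.sha256(str(follower_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def get_followees(follower_id: int) -> str:
    host = shard_for(follower_id)  # 1. find the right shard host
    # 2. then find the proper partition on that shard (if it is also partitioned)
    # 3. only then run the actual query against that host, e.g.
    #    connect(host).execute("SELECT followee_id FROM follows WHERE follower_id = %s", ...)
    return host

print(get_followees(42))  # e.g. "db-shard-1.internal"
```

All of this routing is logic the application has to carry in addition to its business logic.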
Another way is to redesign the system: in the profile table, add 2 more columns to hold the follower count and the followers themselves (as JSON). That way a profile row holds all the information about a profile's followers. The problem that may arise is how to write/edit this data, but that is a different system design concern (querying patterns, CQRS, event-based solutions). It solves the current issue.
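A sketch of the redesigned profile table, with a simple read-modify-write helper to show where the write-path difficulty comes from. The column names and the update strategy are assumptions, not taken from the video:

```python
import json
import sqlite3

# The profile row itself carries the follower count and the follower list as
# JSON, so reading a profile needs no join against a billion-row follows table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE profiles (
        user_id        INTEGER PRIMARY KEY,
        name           TEXT,
        follower_count INTEGER DEFAULT 0,
        followers      TEXT DEFAULT '[]'   -- JSON array of follower ids
    )
""")
conn.execute("INSERT INTO profiles (user_id, name) VALUES (1, 'alice')")

def add_follower(user_id: int, follower_id: int) -> None:
    # The write path is now the hard part (this is where CQRS / event-based
    # designs come in); here it is just a read-modify-write for illustration.
    row = conn.execute(
        "SELECT followers FROM profiles WHERE user_id = ?", (user_id,)
    ).fetchone()
    followers = json.loads(row[0])
    if follower_id not in followers:
        followers.append(follower_id)
        conn.execute(
            "UPDATE profiles SET followers = ?, follower_count = ? WHERE user_id = ?",
            (json.dumps(followers), len(followers), user_id),
        )

add_follower(1, 42)
print(conn.execute(
    "SELECT follower_count, followers FROM profiles WHERE user_id = 1"
).fetchone())  # -> (1, '[42]')
```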
💡 Initially written here on dev.to