- TinyURL
- Data analysis, such as click events and user sources
- Short URL length to fit social media content limits (e.g., 140 characters)
- Avoid having the destination website blacklisted by its domain name
- Shortening: Take a url and return a much shorter url.
- Ex: http://www.interviewbit.com/courses/programming/topics/time-complexity/ => http://goo.gl/GUKA8w/
- Gotcha: What if two people try to shorten the same URL?
- Redirection: Take a short url and redirect to the original url.
- Custom url: Allow the users to pick custom shortened url.
- Analytics: Usage statistics for site owner.
- Ex: How many people clicked the shortened url in the last day?
- Each url can have multiple tiny urls?
- Yes
- Tiny url encoded length?
- 6
- QPS
- 500M new records per month
- 10:1 read write ratio
- URL is not guessable?
- Yes
- Needs original url validation
- No
- Automatic link expiration
- Manual link removal
- UI vs API
- Latency
- Our system is similar to DNS resolution; higher latency on the URL shortener is as bad as a failure to resolve.
- Consistency vs Availability
- Both are extremely important. However, the CAP theorem dictates that we choose one. Do we want a system that always answers correctly but is sometimes unavailable? Or do we want a system that is always available but can sometimes say that a URL does not exist even when it does? This tradeoff is a product decision around what we are trying to optimize. Let's say we go with consistency here.
- URL as short as possible
- A URL shortener by definition needs to be as short as possible. The shorter the shortened URL, the better it compares with the competition.
- QPS: 500M new records per month
- 500M / (30 days × 24 hours × 3,600 seconds) ≈ 200 writes per second
- With a 10:1 read:write ratio, reads ≈ 2,000 per second
- Performance:
- Query with index should be around 1ms ~ 2ms
- One write should be around 5ms for SSD disk
- Capacity:
- 1 CPU core can handle roughly 200 operations per second
- A typical database server has 56-60 CPU cores or more
- 5-10 CPU cores should be enough without cache
- One database should be good enough to handle the load
- Hash based on MAC address + Current datetime
- 36 characters: 32 hexadecimal characters and 4 hyphens
- The granularity of the current datetime is microseconds, so concurrent requests within the same microsecond collide: lots of collisions
- Pros
- Pure random
- Not guessable or reversible
- Cons
- Too long
- Key duplication if only the first 6 characters are used
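- A minimal sketch of this approach, assuming the "MAC address + current datetime" key is a standard UUID. Java's built-in java.util.UUID only produces random (version 4) UUIDs rather than time/MAC-based (version 1) ones, but the length and the truncation tradeoff are the same:

// Produces the 36-character form noted above: 32 hex characters plus 4 hyphens
String key = java.util.UUID.randomUUID().toString(); // e.g. "f47ac10b-58cc-4372-a567-0e02b2c3d479"
// Keeping only a 6-character prefix is what creates the key-duplication risk
String shortKey = key.substring(0, 6);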
- Traditional cryptographic hash functions: MD5 and SHA-1 (secure but slow)
- Fast hash functions: Murmur and Jenkins (better performance; 32-, 64-, and 128-bit variants available)
- Pros
- Not guessable or reversible
- Cons
- Too long
- Key duplication if only the first 6 characters are used
- Not random -> can be mitigated by hashing (current time or UUID) + url
- Pros
- No need to write an additional hash function; easy to implement
- Randomly distributed
- Supports URL cleanup
Problem | Possible solution |
---|---|
Not short enough (at least 4 bytes) | Only retrieve the first 4-5 digits of the hash result |
Collision cannot be avoided | Use long_url + timestamp as the hash argument; if a conflict happens, try again (the timestamp changes). This requires multiple attempts, and conflicts become highly likely as the data grows |
Slow | |
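- A hedged sketch of the hash-then-truncate idea using the JDK's MessageDigest; the helper name and prefix length are assumptions:

import java.math.BigInteger;
import java.security.MessageDigest;

static String md5Prefix(String longUrl) throws Exception
{
    byte[] digest = MessageDigest.getInstance("MD5").digest(longUrl.getBytes("UTF-8"));
    // Render the 128-bit digest as 32 hex characters, then keep only a short
    // prefix; the truncation is exactly why collisions cannot be ruled out
    String hex = String.format("%032x", new BigInteger(1, digest));
    return hex.substring(0, 6);
}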
- Retention period varies with key length. For example, assume 500M new records per month:
- If length == 8, 62^8 ≈ 218 trillion ≈ 36,000 years
- If length == 7, 62^7 ≈ 3.5 trillion ≈ 580 years
- If length == 6, 62^6 ≈ 57 billion ≈ 9.5 years
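- The retention figures above are simple arithmetic; a quick check (500M new records per month, as assumed above):

// Back-of-envelope verification of the retention estimates
double perYear = 500e6 * 12;                      // 6 billion new records per year
System.out.println(Math.pow(62, 6) / perYear);    // ~9.5 years for length 6
System.out.println(Math.pow(62, 7) / perYear);    // ~587 years for length 7
System.out.println(Math.pow(62, 8) / perYear);    // ~36,000 years for length 8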
Encoding | Base10 | Base62 | Base64 |
---|---|---|---|
IDs per year (100,000 per day × 365) | 36,500,000 | 36,500,000 | |
Usable characters | [0-9] | [0-9a-zA-Z] | [0-9a-zA-Z+/=] |
Encoding length | 8 | 5 | |
- Pros:
- Shorter URL compared with hash
- No collision
- Simple computation
- Easy to generate URLs without duplication (incrementing ID)
- Advantages of Base64 vs Base62
- Cons:
- Too long; reversible if base62 is applied directly to the URL
- For example, Base64 =
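- A small sketch of this drawback using the JDK's java.util.Base64 (the sample URL is illustrative): encoding the URL directly yields a string longer than the input, and anyone can decode it back.

import java.util.Base64;

String url = "http://www.interviewbit.com/courses/programming/";
String encoded = Base64.getUrlEncoder().withoutPadding()
        .encodeToString(url.getBytes());              // longer than the original URL
String decoded = new String(Base64.getUrlDecoder()
        .decode(encoded));                            // trivially reversible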
- ShortUrl => API Gateway => TinyUrl Service => Database => TinyUrlService => API Gateway => 301 redirect => Original Url
- API Gateway: Can be REST API or GraphQL
- 2XX (e.g., 200 OK) -> successful
- 302: temporary redirect
- 301: permanent redirect
- 400: bad request
- 401 / 403: unauthorized or forbidden
- 413: payload too large
- 5XX: server error or internal error
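- A minimal redirect-handler sketch using the JDK's built-in HttpServer; lookup() is a hypothetical stand-in for the TinyUrl service call:

import java.net.InetSocketAddress;
import com.sun.net.httpserver.HttpServer;

HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
server.createContext("/", exchange -> {
    String shortKey = exchange.getRequestURI().getPath().substring(1);
    String originalUrl = lookup(shortKey); // hypothetical: short key -> original URL
    if (originalUrl != null)
    {
        exchange.getResponseHeaders().set("Location", originalUrl);
        exchange.sendResponseHeaders(301, -1); // permanent redirect, no body
    }
    else
    {
        exchange.sendResponseHeaders(404, -1); // unknown short URL
    }
    exchange.close();
});
server.start();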
public void generateTinyUrl(TinyUrlRequest request) throws ServiceException;
class TinyUrlRequest
{
String originalUrl;
Date expiredDate;
}
public String getOriginalUrl(String tinyUrl) throws ServiceException;
- If we need to provide service (API) to other vendors, we need api key and rate limiting
public void generateTinyUrl(String apiKey, TinyUrlRequest request) throws ServiceException;
public String getOriginalUrl(String apiKey, String tinyUrl) throws ServiceException;
- POST /tinyurl
- GET /tinyurl?url=xxxxx
- Put the data into the POST body
data:
{
original_url: xxxx,
expired_date: xxxx,
ip: xxxx
}
create table tiny_url (
    id bigserial primary key,
    short_url text not null,
    original_url text not null,
    expired_datetime timestamp without time zone
);
- For querying 400 million records:
- With an index, around 0.3ms
- Without an index, about 1 minute
- We should create an index on the columns we query (e.g., short_url)
- Before inserting a random short URL into the database, check whether it already exists; if not, insert it.
- Cons:
- Easy to time out
- Heavy load on the database when the table is big
public String longToShort(String url)
{
    while (true)
    {
        String newShortUrl = randomShortUrl();
        // Retry until a short URL that is not already taken is found
        if (!database.filter(shortUrl=newShortUrl).exists())
        {
            database.create(shortUrl=newShortUrl, longUrl=url);
            return newShortUrl;
        }
    }
}
- An offline job (daemon process) generates keys and stores them in the database.
- The offline job can tolerate longer query times.
- Keys database schema
- create table keys(key text primary key, status integer)
- status
- 0: available
- 1: occupied
- Keys database query
- Select key from keys where status = 0 limit 1
- Update keys set status = 1 where key = a_key
- Cons: Race condition
- Multiple users can get same keys if qps is high
- Select / Update is not an atomic operation
- Database is not suitable for queuing
- Solution:
- Database lock: Select key from keys where status = 0 limit 1 for update skip locked
- Distributed lock: bad performance.
- Redis -> LPUSH and LPOP: the Redis list data structure is effectively a queue; it is thread-safe and production ready
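- A sketch of the Redis queue approach, assuming the Jedis client and a list named "available_keys" that the offline job fills:

import redis.clients.jedis.Jedis;

try (Jedis jedis = new Jedis("localhost", 6379))
{
    // Offline-job side: push freshly generated keys onto the list
    jedis.lpush("available_keys", "a1B2c3");
    // Request side: pop atomically, so no two callers receive the same key
    String key = jedis.rpop("available_keys");
}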
- Problem: multiple users can get the same keys if QPS is high
- Use a relational database auto-increment ID to avoid duplication
- Cons: relies on a single node to generate the IDs; single point of failure, and performance is low
class TinyURL
{
    Map<String, String> longToShortMap = new HashMap<>();
    Map<String, String> shortToLongMap = new HashMap<>();

    String insert( String longURL )
    {
        if ( !longToShortMap.containsKey( longURL ) )
        {
            String shortURL = generateShortURL(); // assumed key-generation helper
            longToShortMap.put( longURL, shortURL );
            shortToLongMap.put( shortURL, longURL );
        }
        return longToShortMap.get( longURL );
    }

    String lookup( String shortURL )
    {
        return shortToLongMap.get( shortURL );
    }
}
// url2id, id2url, and GLOBAL_ID are assumed class fields:
// Map<String, Integer> url2id; Map<Integer, String> id2url; int GLOBAL_ID;
public String longToShort( String url )
{
    if ( url2id.containsKey( url ) )
    {
        return "http://tiny.url/" + idToShortKey( url2id.get( url ) );
    }
    GLOBAL_ID++;
    url2id.put( url, GLOBAL_ID );
    id2url.put( GLOBAL_ID, url );
    return "http://tiny.url/" + idToShortKey( GLOBAL_ID );
}
private String idToShortKey( int id )
{
String chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
String short_url = "";
while ( id > 0 )
{
short_url = chars.charAt( id % 62 ) + short_url;
id = id / 62;
}
while ( short_url.length() < 6 )
{
short_url = "0" + short_url;
}
return short_url;
}
public String shortToLong( String url )
{
String short_key = getShortKey( url );
int id = shortKeytoID( short_key );
return id2url.get( id );
}
private String getShortKey( String url )
{
return url.substring( "http://tiny.url/".length() );
}
private int shortKeytoID( String short_key )
{
    int id = 0;
    for ( int i = 0; i < short_key.length(); ++i )
    {
        id = id * 62 + toBase62( short_key.charAt( i ) );
    }
    return id;
}

// Maps a base-62 character back to its numeric value (inverse of idToShortKey)
private int toBase62( char c )
{
    return "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".indexOf( c );
}
- Two maps
- longURL -> shortURL
- shortURL -> longURL
- In order to store less data ( Given shortURL, its corresponding sequential ID can be calculated )
- longURL -> Sequential ID
- Sequential ID -> longURL
- Create an index on the longURL column; then only one table needs to be stored
- Sequential ID -> longURL
- Memcached or Redis usually run as a cluster, and one operation usually takes 0.1ms or less.
- 1 CPU core => 5,000-10,000 requests
- 20-40 CPU cores => 200K requests (one machine)
String originalUrl = getUrlFromMemcache(tinyUrl);
if (originalUrl == null)
{
    originalUrl = readFromDatabase(tinyUrl);
    putUrlIntoCache(tinyUrl, originalUrl);
}
return originalUrl;
- Writes go to the master and are stream-replicated to slaves, usually in less than one second. Reads go to the slaves.
- Pros
- Increase availability to handle single point of failure
- Reduce read pressure on a single node
- Cons
- Replication lag
- Solution1: We can write to memcache when creating a new tiny URL. The service can then read the tiny URL from memcache instead of the database.
- Solution2: We can implement the Raft protocol on top of a relational database, or use an open-source or commercial database that already supports it.
- Web server
- Different web servers deployed in different geographical locations
- Use DNS to resolve users to the web servers in their geographical location
- Database
- Centralized MySQL + Distributed memcached server
- Cache server deployed in different geographical locations
- Too many write operations
- The limit of 62^6 is approaching
- Problem: for "31bJF4", how do we find which physical machine the code is located on?
- Our query is
- Select original_url from tiny_url where short_url = XXXXX;
- The simple approach
- Query all nodes, but that means n queries
- This will be pretty expensive, even if we fan out the database calls
- The overall QPS to the database becomes n times the application QPS
- Solution
- Add a virtual node number as a prefix: "31bJF4" => "131bJF4"
- When querying, we can find the virtual node from the prefix (the tinyUrl length is fixed at 6). Then we can directly query node1.
// Assign virtual nodes to physical nodes
// Store within database
{
0: "db0.tiny_url.com",
1: "db1.tiny_url.com",
2: "db2.tiny_url.com",
3: "db3.tiny_url.com"
}
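- A sketch of the node lookup, assuming the mapping above and a fixed short-key length of 6, so everything before the last 6 characters is the virtual node number:

import java.util.Map;

static final Map<Integer, String> VIRTUAL_TO_PHYSICAL = Map.of(
    0, "db0.tiny_url.com",
    1, "db1.tiny_url.com",
    2, "db2.tiny_url.com",
    3, "db3.tiny_url.com");

static String resolveNode(String prefixedShortKey)
{
    // "131bJF4" -> virtual node 1, short key "31bJF4"
    int vnode = Integer.parseInt(prefixedShortKey.substring(0, prefixedShortKey.length() - 6));
    return VIRTUAL_TO_PHYSICAL.get(vnode);
}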
- Vertical sharding
- Only one table
- Even with Custom URL, two tables in total
- Horizontal sharding: Choose sharding key?
- Use Long Url as sharding key
- Short to long operation will require lots of cross-shard joins
- Use ID as sharding key
- Short to long URL: first convert the short URL to an ID; find the database according to the ID; find the long URL in the corresponding database
- Long to short URL: broadcast to N databases to see whether the link already exists. If not, get the next ID and insert it into the database.
- Combine short Url and long Url together
- Hash(longUrl) % 62 + shortKey
- Given a shortURL, we can get the sharding machine from the first character of the shortened URL.
- Given a longURL, get the sharding machine according to Hash(longURL) % 62; that value becomes the first character (see the sketch after this list).
- Sharding according to the geographical info.
- First know which websites are more popular in which region. Put all websites popular in US in US DB.
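- A sketch of the combined scheme (the hash function and names are assumptions; the point is that both directions can compute the shard):

static final String BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Long -> short: prepend one base-62 character derived from the long URL's hash
static String shardedShortKey(String longUrl, String shortKey)
{
    int shard = Math.floorMod(longUrl.hashCode(), 62); // which machine holds this record
    return BASE62.charAt(shard) + shortKey;            // e.g. 'f' + "31bJF4"
}

// Short -> long: recover the shard from the first character
static int shardOf(String shardedKey)
{
    return BASE62.indexOf(shardedKey.charAt(0));
}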
- Zookeeper
- Use a specialized database for managing IDs
- Solution1: Check the data at the service level; if expired, return null
- Pros: Simple and keeps historical records
- Cons: Wastes disk space
- Solution2: Remove expired data in the database and cache using daemon job
- Pros: Reduce storage and save cost, improve query performance
- Cons:
- Loses historical records
- Complicated structure
- Easy to fail
while (true)
{
    List<TinyUrlRecord> expiredRecords = getExpiredRecords();
    for (TinyUrlRecord r : expiredRecords)
    {
        deleteFromDb(r);
        removeFromCache(r);
    }
    sleep(CLEANUP_INTERVAL); // pause between scans to avoid a busy loop (interval is an assumption)
}
- Question: How would we store statistics such as
- How many times a short URL has been used
- What were user locations
- If the Kafka producer produces a lot of events, the broker accumulates a large backlog
- Solution1: Add Kafka brokers and partition
- Solution2: Add memory buffer in the Kafka producer and batch send to Kafka broker
- If the count is part of a DB row that gets updated on each view, what will happen when a popular URL is slammed with a large number of concurrent requests?
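- One common mitigation (an assumption, not from the source): buffer per-URL view counts in memory and flush deltas on a schedule, so the hot row is updated a few times per second instead of once per view:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

ConcurrentHashMap<String, LongAdder> viewBuffer = new ConcurrentHashMap<>();

void recordView(String shortUrl)
{
    viewBuffer.computeIfAbsent(shortUrl, k -> new LongAdder()).increment();
}

void flush() // run periodically, e.g. every few seconds
{
    viewBuffer.forEach((url, counter) -> {
        long delta = counter.sumThenReset();
        if (delta > 0)
        {
            incrementDbCounter(url, delta); // hypothetical batched DB update
        }
    });
}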