This blog post introduces a team project at CMU, done together with Fan and Eugene.
Background: we are given a large dataset of Twitter data.
Query 1
This first query sets up the heartbeat endpoint and the web service.
Requirements:
- Set up two frontend servers on AWS using m3.medium instances (nothing better than m4.large is allowed)
- Write a program to decrypt the message with the provided key
- No ETL process for this query
Budget restrictions:
- m3.medium costs $0.067/hour and the Elastic Load Balancer costs $0.0225/hour
REST server options (connections are kept alive by the client side):
- Undertow: pretty lightweight (this is the one we focused on)
- Jetty
Reasons to select Undertow:
- Very easy to use
- It is Java-based.
- Built-in handlers: most importantly the path handler, which delegates to a handler based on the request path (see the sketch below).
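A minimal sketch of an Undertow server with a path handler (the port and the /q1 path below are placeholders for illustration, not our actual endpoints):

import io.undertow.Handlers;
import io.undertow.Undertow;
import io.undertow.util.Headers;

public class FrontendServer {
    public static void main(String[] args) {
        Undertow server = Undertow.builder()
                .addHttpListener(8080, "0.0.0.0") // placeholder port/host
                // The path handler delegates to a handler based on the request path.
                .setHandler(Handlers.path()
                        .addExactPath("/q1", exchange -> {
                            exchange.getResponseHeaders().put(Headers.CONTENT_TYPE, "text/plain");
                            exchange.getResponseSender().send("heartbeat ok");
                        }))
                .build();
        server.start();
    }
}

Each path gets its own handler, so the heartbeat and the query endpoints stay cleanly separated.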
Load balancer selections:
- HAProxy: 9,000 rps
- Nginx: 6,000 rps
- ELB: 26,000 rps
For unit testing: curl. Tools to monitor: htop and iftop, to check CPU and network utilization.
Outcome:
- Around 30,000 requests per second (rps)
Query 2
Query request:
- A hashtag
- A number N
- Keywords
Response:
- Top user IDs by frequency
- Frequencies of the provided keywords
ETL requirements (Hadoop Streaming):
- Total data size is 1 TB: 1,000 JSON files, each 1 GB uncompressed (about 150 MB compressed)
- [Mapper] Remove malformed tweets and tweets not in the preset languages (see the sketch after this list)
- [Reducer] Remove duplicates and output formatted data to be loaded into MySQL
- The output dataset size is around 8 GB
- Note that HBase has a special requirement: use hashtag#user_id as the row key
- [Loading] For MySQL, script the SQL in a file and run:
time mysql -u root --password=tartans@1 --local-infile=1 < create_twitter_db.sql
- [Loading] For HBase, we use the ImportTsv tool that ships with HBase:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
to write HFiles onto HDFS, and then
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
to bulk-load them into HBase
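Hadoop Streaming just pipes lines through stdin/stdout, so the mapper step can be sketched like this (the org.json parser and the use of the tweet's lang field are assumptions for illustration, not our exact code):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import org.json.JSONException;
import org.json.JSONObject;

public class TweetFilterMapper {
    private static final Set<String> LANGS = Set.of("en"); // placeholder preset languages

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            try {
                JSONObject tweet = new JSONObject(line);
                // Drop malformed tweets (missing id or text) and unwanted languages.
                if (!tweet.has("id_str") || !tweet.has("text")) continue;
                if (!LANGS.contains(tweet.optString("lang"))) continue;
                System.out.println(line); // valid tweets pass through to the reducer
            } catch (JSONException e) {
                // unparsable JSON counts as malformed: skip it
            }
        }
    }
}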
We worked on 7 m3.medium AWS instances.
Some optimizations:
- Pre-counting the frequencies
- Preprocessing of sensitive words
- [Hot Spot Problem] for HBase! We noticed this using iftop. Hashing the row key (MD5) distributes the load more evenly (see the sketch after this list).
- Guard against impractical input queries: N = 1 million users…
- Shorten the column family and qualifier names
- Pre-split regions at table creation
create 'tweets','f',{NUMREGIONS => 4, SPLITALGO => 'HexStringSplit', DURABILITY => 'ASYNC_WAL'}
- LOAD DATA LOCAL INFILE for MySQL can be troublesome with Unicode; mysqlimport may be used instead
- Choose the engine to be MyISAM (read-heavy workload)
- Change the default MySQL charset to utf8mb4 (4-byte UTF-8 encoding), otherwise emoji crash the load…
- LRU caching :) Java can extend LinkedHashMap, which saves writing a lot of methods (see the sketch below) :)
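Roughly what the row-key hashing looks like (the helper below is ours for illustration, not the project's exact code):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeyHasher {
    // Turn the natural hashtag#user_id key into a uniformly distributed hex key.
    public static String hash(String hashtag, String userId) throws Exception {
        String natural = hashtag + "#" + userId;
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(natural.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest));
    }
}

Depending on the lookup pattern, you may want to keep the natural key as a suffix of the hashed key so it stays recoverable from a scan.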
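And the LinkedHashMap trick for the LRU cache looks roughly like this (the capacity is whatever fits the instance's memory):

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once we exceed capacity.
        return size() > capacity;
    }
}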
Main SQL used:
SELECT hashtag_num, userid, text, hashtag_text FROM tweets WHERE hashtag_text = BINARY 'banaSevmeyiAnlat';
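For reference, running this query from Java via JDBC looks roughly like the following; the database name and URL parameters are placeholders to adapt (the utf8mb4 note above still applies on the server side):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HashtagQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; match your own MySQL deployment.
        String url = "jdbc:mysql://localhost:3306/twitter_db?useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "root", "tartans@1");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT hashtag_num, userid, text, hashtag_text FROM tweets "
                     + "WHERE hashtag_text = BINARY ?")) {
            ps.setString(1, "banaSevmeyiAnlat");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("userid") + "\t" + rs.getString("text"));
                }
            }
        }
    }
}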
Query 5
Requirement: a graph relationship of user_id -> tid -> another user_id -> tid -> user_id.
We need to figure out the shortest path between two users.
Tricks to optimize:
- Redesign the schema: one TID maps to many UIDs
- Enable the MySQL query cache by increasing query_cache_size
- BFS
- Two-end (bidirectional) BFS (see the sketch at the end of this post)
- LRU cache again
- Multi-threading, though it may not be super efficient given the context-switching overhead
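A compact sketch of the two-end BFS, assuming an adjacency map (user -> directly connected users) already built from the TID-to-UID tables; all names below are ours for illustration:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TwoEndBfs {
    // Returns the number of hops on the shortest path between src and dst, or -1.
    public static int shortestPath(Map<String, List<String>> graph, String src, String dst) {
        if (src.equals(dst)) return 0;
        Map<String, Integer> distSrc = new HashMap<>(Map.of(src, 0));
        Map<String, Integer> distDst = new HashMap<>(Map.of(dst, 0));
        Set<String> frontierSrc = new HashSet<>(Set.of(src));
        Set<String> frontierDst = new HashSet<>(Set.of(dst));
        while (!frontierSrc.isEmpty() && !frontierDst.isEmpty()) {
            // Always expand the smaller frontier: fewer edges to scan per level.
            Integer meet = frontierSrc.size() <= frontierDst.size()
                    ? expand(graph, frontierSrc, distSrc, distDst)
                    : expand(graph, frontierDst, distDst, distSrc);
            if (meet != null) return meet;
        }
        return -1; // the two searches never met
    }

    // Advances one full BFS level; returns the total path length if it touches the other side.
    private static Integer expand(Map<String, List<String>> graph, Set<String> frontier,
                                  Map<String, Integer> dist, Map<String, Integer> otherDist) {
        Set<String> next = new HashSet<>();
        Integer best = null;
        for (String u : frontier) {
            for (String v : graph.getOrDefault(u, List.of())) {
                if (otherDist.containsKey(v)) { // the two searches met through edge u-v
                    int total = dist.get(u) + 1 + otherDist.get(v);
                    if (best == null || total < best) best = total;
                }
                if (!dist.containsKey(v)) {
                    dist.put(v, dist.get(u) + 1);
                    next.add(v);
                }
            }
        }
        frontier.clear();
        frontier.addAll(next);
        return best;
    }
}

Always expanding the smaller frontier keeps the explored ball small on both ends, which is what makes the two-end version much cheaper than plain BFS on a high-fan-out social graph.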