This blog post introduces a team project at CMU, done together with Fan and Eugene.
Background: we are given a large dataset of Twitter data.
Query 1
This first query sets up the heartbeat endpoint and the web service.
Requirements:
- Set up two frontend servers on AWS using m3.medium instances (nothing better than m4.large is allowed)
- Write a program to decrypt the message with the provided key
- No ETL process for this query
Budget restrictions:
- m3.medium costs $0.067/hour and the Elastic Load Balancer costs $0.0225/hour
REST server options (connections are kept alive by the client side):
- Undertow: pretty lightweight (this is the one we focused on)
- Jetty
Reasons to select Undertow:
- Very easy to use
- It is Java-based.
- Built-in handlers: most importantly the path handler, which delegates to a handler based on the request path (see the sketch below).
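A minimal sketch of an Undertow server with a path handler (the port and the /q1 path below are placeholders for illustration, not our actual endpoints):

import io.undertow.Handlers;
import io.undertow.Undertow;
import io.undertow.util.Headers;

public class FrontendServer {
    public static void main(String[] args) {
        Undertow server = Undertow.builder()
                .addHttpListener(8080, "0.0.0.0") // placeholder port/host
                // The path handler delegates to a handler based on the request path.
                .setHandler(Handlers.path()
                        .addExactPath("/q1", exchange -> {
                            exchange.getResponseHeaders().put(Headers.CONTENT_TYPE, "text/plain");
                            exchange.getResponseSender().send("heartbeat ok");
                        }))
                .build();
        server.start();
    }
}

Each path gets its own handler, so the heartbeat and the query endpoints stay cleanly separated.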
Load balancer selections:
- HAProxy: 9,000 rps
- Nginx: 6,000 rps
- ELB: 26,000 rps
For unit testing: curl. Tools to monitor: htop and iftop, to check CPU and network utilization.
Outcome:
- Around 30,000 requests per second (rps)
Query 2
Query request:
- A hashtag
- A number N
- Keywords
Response:
- Top user IDs by frequency
- Frequencies of the provided keywords
ETL requirements (Hadoop Streaming):
- Total data size is 1 TB: 1,000 JSON files, each 1 GB uncompressed (about 150 MB compressed)
- [Mapper] Remove malformed tweets and tweets not in the preset languages (see the sketch after this list)
- [Reducer] Remove duplicates and output formatted data to be loaded into MySQL
- The output dataset size is around 8 GB
- Note that HBase has a special requirement: use hashtag#user_id as the row key
- [Loading] For MySQL, script the SQL in a file and run:
time mysql -u root --password=tartans@1 --local-infile=1 < create_twitter_db.sql
- [Loading] For HBase, we use the ImportTsv tool that ships with HBase:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
to write HFiles onto HDFS, and then
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
to bulk-load them into HBase
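Hadoop Streaming just pipes lines through stdin/stdout, so the mapper step can be sketched like this (the org.json parser and the use of the tweet's lang field are assumptions for illustration, not our exact code):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import org.json.JSONException;
import org.json.JSONObject;

public class TweetFilterMapper {
    private static final Set<String> LANGS = Set.of("en"); // placeholder preset languages

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            try {
                JSONObject tweet = new JSONObject(line);
                // Drop malformed tweets (missing id or text) and unwanted languages.
                if (!tweet.has("id_str") || !tweet.has("text")) continue;
                if (!LANGS.contains(tweet.optString("lang"))) continue;
                System.out.println(line); // valid tweets pass through to the reducer
            } catch (JSONException e) {
                // unparsable JSON counts as malformed: skip it
            }
        }
    }
}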
We worked on 7 m3.medium AWS instances.
Some optimizations:
- Pre-counting the frequencies
- Preprocessing of sensitive words
- [Hot Spot Problem] for HBase! We noticed this using iftop. Hashing the row key (MD5) distributes the load more evenly (see the sketch after this list).
- Guard against impractical input queries: N = 1 million users…
- Shorten the column family and qualifier names
- Pre-split regions at table creation
create 'tweets','f',{NUMREGIONS => 4, SPLITALGO => 'HexStringSplit', DURABILITY => 'ASYNC_WAL'}
- LOAD DATA LOCAL INFILE for MySQL can be troublesome with Unicode; mysqlimport may be used instead
- Choose the engine to be MyISAM (read-heavy workload)
- Change the default MySQL charset to utf8mb4 (4-byte UTF-8 encoding), otherwise emoji crash the load…
- LRU caching :) Java can extend LinkedHashMap, which saves writing a lot of methods (see the sketch below) :)
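Roughly what the row-key hashing looks like (the helper below is ours for illustration, not the project's exact code):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeyHasher {
    // Turn the natural hashtag#user_id key into a uniformly distributed hex key.
    public static String hash(String hashtag, String userId) throws Exception {
        String natural = hashtag + "#" + userId;
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(natural.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest));
    }
}

Depending on the lookup pattern, you may want to keep the natural key as a suffix of the hashed key so it stays recoverable from a scan.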
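And the LinkedHashMap trick for the LRU cache looks roughly like this (the capacity is whatever fits the instance's memory):

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once we exceed capacity.
        return size() > capacity;
    }
}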
Main SQL used:
SELECT hashtag_num, userid, text, hashtag_text FROM tweets WHERE hashtag_text = BINARY 'banaSevmeyiAnlat';
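For reference, running this query from Java via JDBC looks roughly like the following; the database name and URL parameters are placeholders to adapt (the utf8mb4 note above still applies on the server side):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HashtagQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; match your own MySQL deployment.
        String url = "jdbc:mysql://localhost:3306/twitter_db?useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "root", "tartans@1");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT hashtag_num, userid, text, hashtag_text FROM tweets "
                     + "WHERE hashtag_text = BINARY ?")) {
            ps.setString(1, "banaSevmeyiAnlat");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("userid") + "\t" + rs.getString("text"));
                }
            }
        }
    }
}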
Query 5
Requirement: a graph relationship of user_id -> tid -> another user_id -> tid -> user_id.
We need to figure out the shortest path between two users.
Tricks to optimize:
- Redesign the schema: one TID maps to many UIDs
- Enable the MySQL query cache by increasing query_cache_size
- BFS
- Two-end (bidirectional) BFS (see the sketch at the end of this post)
- LRU cache again
- Multi-threading, though it may not be super efficient given the context-switching overhead
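A compact sketch of the two-end BFS, assuming an adjacency map (user -> directly connected users) already built from the TID-to-UID tables; all names below are ours for illustration:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TwoEndBfs {
    // Returns the number of hops on the shortest path between src and dst, or -1.
    public static int shortestPath(Map<String, List<String>> graph, String src, String dst) {
        if (src.equals(dst)) return 0;
        Map<String, Integer> distSrc = new HashMap<>(Map.of(src, 0));
        Map<String, Integer> distDst = new HashMap<>(Map.of(dst, 0));
        Set<String> frontierSrc = new HashSet<>(Set.of(src));
        Set<String> frontierDst = new HashSet<>(Set.of(dst));
        while (!frontierSrc.isEmpty() && !frontierDst.isEmpty()) {
            // Always expand the smaller frontier: fewer edges to scan per level.
            Integer meet = frontierSrc.size() <= frontierDst.size()
                    ? expand(graph, frontierSrc, distSrc, distDst)
                    : expand(graph, frontierDst, distDst, distSrc);
            if (meet != null) return meet;
        }
        return -1; // the two searches never met
    }

    // Advances one full BFS level; returns the total path length if it touches the other side.
    private static Integer expand(Map<String, List<String>> graph, Set<String> frontier,
                                  Map<String, Integer> dist, Map<String, Integer> otherDist) {
        Set<String> next = new HashSet<>();
        Integer best = null;
        for (String u : frontier) {
            for (String v : graph.getOrDefault(u, List.of())) {
                if (otherDist.containsKey(v)) { // the two searches met through edge u-v
                    int total = dist.get(u) + 1 + otherDist.get(v);
                    if (best == null || total < best) best = total;
                }
                if (!dist.containsKey(v)) {
                    dist.put(v, dist.get(u) + 1);
                    next.add(v);
                }
            }
        }
        frontier.clear();
        frontier.addAll(next);
        return best;
    }
}

Always expanding the smaller frontier keeps the explored ball small on both ends, which is what makes the two-end version much cheaper than plain BFS on a high-fan-out social graph.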