Ingest, Analyze and Visualize
When we talk about the future and all it entails, the elephant in the room is data. Luckily, it is not hard to find these days: we are all core contributors to the big warehouse of data. At ParallelScore, we love DATA. We love to gather it, share it and, more importantly, analyze it. It’s at the center of our user experience explorations. This month, as part of our ongoing POC (Proof of Concept) exercise and client collaboration, I was tasked with building a complete pipeline for the data journey, from ingestion through several analyses to visualization. My focus is to analyze a stream of tweets from Twitter and extract their sentiment using some machine learning.
Some of the tools listed below can be installed in different ways: via manual download, via Homebrew for macOS users, or via apt-get for Linux users. I will focus on macOS installations.
Step 1: Setup Docker, Java, Python and Scala
- Make sure you have Java 8 installed (download it from the Oracle website). To check, run java -version in the terminal.
- Download and install Python 3, or run brew install python3
- Download and install Scala and sbt: brew install scala && brew install sbt
- Download and install Docker CE from Docker Store
Step 2: Setup Apache Zookeeper
- Install zookeeper from Apache Zookeeper or brew install zookeeper
- Then set ZooKeeper up to run as a service. With Homebrew on macOS, brew services start both launches a service and registers it to start at login
- brew services start zookeeper to start ZooKeeper in the background
Step 3: Setup Apache Kafka
- brew install kafka to install Kafka
- You can also configure Kafka to run as a service
- brew services start kafka to start Kafka as a service; the broker listens on port 9092 by default
Step 4: Setup Apache NIFI
- Download and setup NIFI from NIFI Download Pages or via brew install nifi
- To run NIFI in foreground go to installation directory from terminal and type bin/nifi.sh run
- To have it running in the background, from installation directory, type bin/nifi.sh start
- To install NIFI as a service, from installation directory, type bin/nifi.sh install
- Apache NiFi starts on port 8080. To change the port, edit nifi.properties in the conf directory and change nifi.web.http.port=8080 to a port of your choice
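If you would rather script the port change than edit the file by hand, here is a minimal sketch in Python. It only works on the text of the properties file; the function name set_nifi_http_port is my own, and you would still need to read and write the nifi.properties file yourself and restart NiFi afterwards.

```python
import re


def set_nifi_http_port(properties_text: str, port: int) -> str:
    """Return properties_text with nifi.web.http.port set to the given port."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    pattern = re.compile(r"^nifi\.web\.http\.port=.*$", flags=re.MULTILINE)
    new_line = f"nifi.web.http.port={port}"
    if pattern.search(properties_text):
        # Replace the existing setting in place, leaving other lines untouched.
        return pattern.sub(new_line, properties_text)
    # Property not present: append it on its own line.
    separator = "" if properties_text.endswith("\n") or not properties_text else "\n"
    return f"{properties_text}{separator}{new_line}\n"


# Example with a two-line excerpt standing in for the real file.
sample = "nifi.web.http.host=\nnifi.web.http.port=8080\n"
print(set_nifi_http_port(sample, 8550))
```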
Take a coffee break; after all, you can’t work and not play. Treat yourself, you’re almost there.
Step 5: Setup Apache Spark
Installing Apache Spark takes a few configuration steps. I will run Apache Spark in standalone mode. Please note that a standalone cluster with a single master has a single point of failure, but for a POC we’re going to keep it simple.
- Download and install Apache Spark from the Apache Spark page, or use brew install apache-spark
- To confirm the installation, type pyspark in the terminal; you should see the Spark welcome banner followed by a Python prompt
- The environment variables should have been set during installation, but you still need to add the Spark home directory (SPARK_HOME) to your environment
- The next phase is to set up Spark in standalone mode
Step 6: Setup Spark
To set up Spark in standalone cluster mode, please visit the Spark Stand-Alone Cluster page.
Step 7: Setup TiDB
- In the terminal, run brew install mysql-client, then run docker pull pingcap/tidb:latest to pull the latest TiDB image from Docker Hub
- Then start tidb-server in a background container: docker run --name tidb-server -d -p 4000:4000 pingcap/tidb:latest. This starts tidb-server in detached mode and publishes it on port 4000
- Connections to TiDB use a MySQL connection URL such as jdbc:mysql://127.0.0.1:4000/DatabaseName
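With this many services installed, it can help to confirm that each one is actually listening before wiring them together. Below is a minimal sketch that tries a TCP connection to each port. The Kafka, NiFi and TiDB ports are the ones named in the steps above; 2181 is ZooKeeper's stock default and is my assumption here, as are the function names. A successful connection only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports from the steps above; 2181 is ZooKeeper's stock default (an assumption,
# since the walkthrough does not name it). Change 8080 if you moved NiFi.
SERVICES = {
    "zookeeper": 2181,
    "kafka": 9092,
    "nifi": 8080,
    "tidb": 4000,
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(host: str = "127.0.0.1", services: dict = SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:10s} {'up' if up else 'DOWN'}")
```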
Another break, you deserve it. It’s all worth it in the end.
Step 8: Architecture
Now for the interesting part. With all the required tools installed, take a look at the pipeline architecture diagram below. It is quite self-explanatory.
Step 9: Running the pipeline
Let’s go back to Step 4, where you specified the port for Apache NiFi. My NiFi web port is 8550, so I navigate to localhost:8550 in the browser; use your own port instead. You will see the NiFi page where you can add processors. Create and connect the NiFi processors as shown in the architecture diagram.
- The first flow streams data with the GetTwitter processor (you have to supply your Twitter developer API keys), evaluates the success response with the EvaluateJsonPath processor, rebuilds the JSON attributes with the AttributesToJSON processor, and forwards the result to Kafka under a topic, e.g. ‘TweetStream’
- The second flow reads the data that Spark has published to Kafka under the topic ‘ToTiDB’ through a GetKafka processor, evaluates the JSON, rebuilds the attributes, then converts them to a SQL statement for insertion into the database, which in this case is TiDB
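To make the two flows concrete, here is a plain-Python illustration of what happens to a single record. This is not NiFi configuration; NiFi does all of this with the processors named above. The extracted fields (id_str, text, lang, user.screen_name), the attribute names and the table name tweets are assumptions chosen for the example, not the ones used in the actual flow.

```python
import json

# Hypothetical mapping: attribute name -> path into the nested tweet JSON.
FIELD_PATHS = {
    "tweet_id": ("id_str",),
    "text": ("text",),
    "lang": ("lang",),
    "screen_name": ("user", "screen_name"),
}


def evaluate_paths(tweet: dict, paths: dict = FIELD_PATHS) -> dict:
    """Pull selected fields out of a nested tweet (EvaluateJsonPath analogue)."""
    attributes = {}
    for name, path in paths.items():
        value = tweet
        for key in path:
            # A missing key anywhere along the path yields None.
            value = value.get(key) if isinstance(value, dict) else None
        attributes[name] = value
    return attributes


def attributes_to_json(attributes: dict) -> str:
    """Rebuild a flat JSON document (AttributesToJSON analogue)."""
    return json.dumps(attributes, sort_keys=True)


def json_to_insert(flat_json: str, table: str) -> tuple:
    """Turn a flat JSON document into a parameterized INSERT and its values."""
    record = json.loads(flat_json)
    columns = sorted(record)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, [record[c] for c in columns]


# First flow: nested tweet in, flat JSON out (this is what goes to Kafka).
raw = {"id_str": "1", "text": "I love data", "lang": "en",
       "user": {"screen_name": "ps"}}
flat = attributes_to_json(evaluate_paths(raw))

# Second flow: flat JSON in, SQL statement plus values out.
sql, values = json_to_insert(flat, "tweets")
print(flat)
print(sql, values)
```

The insert is parameterized rather than built by string concatenation so that tweet text containing quotes cannot break the statement.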
Once everything is connected properly, you should have something similar to the view below
- The Python script below is submitted to the Spark cluster as a job that runs analytics and sentiment analysis on the received tweets.
Receive the ‘TweetStream’ topic from Kafka
# get the Kafka bootstrap server and topic name from command-line args, e.g. (localhost:9092, 'TweetStream')
bootstrapServer, topic = getMainArgs()[1:]
df = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', bootstrapServer) \
    .option('subscribe', topic) \
    .load()
- Do some Spark analytics and then apply some machine learning to get the sentiment analysis. I used the pre-trained VADER model through NLTK’s SentimentIntensityAnalyzer (I will be adding more machine learning algorithms in the future). You may have to install some Python modules.
Here are some of the required modules:
pip install BeautifulSoup4 for text cleanup,
pip install nltk for natural language processing and sentiment analysis
Then import them in the Python script and let NLTK download the resources it needs
import nltk
from bs4 import BeautifulSoup
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # lexicon used by the VADER analyzer
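To give an idea of what the text cleanup step involves, here is a minimal sketch that uses only the standard library instead of BeautifulSoup, so it is a stand-in and not the cleanup code used in this POC. The rules (drop tags, URLs, the RT prefix and @mentions, keep hashtag words) and the name clean_tweet are my own assumptions about what a reasonable cleanup looks like.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")        # leftover HTML tags
URL_RE = re.compile(r"https?://\S+")   # links
RETWEET_RE = re.compile(r"^RT\s+")     # retweet prefix
MENTION_RE = re.compile(r"@\w+:?")     # @user, optionally followed by a colon


def clean_tweet(text: str) -> str:
    """Strip markup, links and mentions so only the message text remains."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)          # e.g. &amp; becomes &
    text = URL_RE.sub(" ", text)
    text = RETWEET_RE.sub("", text.strip())
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")        # keep the hashtag word, drop the symbol
    return " ".join(text.split())       # collapse runs of whitespace


print(clean_tweet("RT @bob: Loving &amp; sharing <b>#data</b> https://t.co/abc"))
```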
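Sentiment analysis helper class (only getPolarityScores is named in the original listing; the class name, the other method name and the body of getPolarityScores are placeholders)
class TweetSentiment:  # placeholder name
    '''NLTK VADER Sentiment Analysis Kernel'''
    tweet = ''
    classifier = SentimentIntensityAnalyzer()

    def __init__(self, tweet_text):
        self.tweet = tweet_text

    def getPolarityScores(self):
        # assumed body: VADER returns a dict with neg, neu, pos and compound scores
        return self.classifier.polarity_scores(self.tweet)

    def getSentiment(self):  # placeholder name
        scores = self.getPolarityScores()
        return max(scores, key=lambda k: scores[k])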
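One thing to watch with the max-over-scores approach: VADER's polarity_scores returns four keys (neg, neu, pos and compound), and compound lives on a different scale (-1 to 1) from the other three, which are proportions. Taking the max over all four can therefore return 'compound' as the label. Below is a sketch of two alternatives, assuming a dict shaped like VADER's output. The function names are my own, the sample scores are hand-written and not real analyzer output, and the 0.05 cut-off is the commonly cited VADER convention.

```python
def label_from_compound(scores: dict, threshold: float = 0.05) -> str:
    """Classify using only the compound score and a symmetric threshold."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def dominant_component(scores: dict) -> str:
    """Largest of neg/neu/pos, ignoring compound because of its scale."""
    components = {k: v for k, v in scores.items() if k != "compound"}
    return max(components, key=components.get)


# Hand-written scores in the shape VADER returns.
sample = {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.8}
print(max(sample, key=lambda k: sample[k]))  # max over every key
print(dominant_component(sample))
print(label_from_compound(sample))
```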
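- Send the analyzed tweet plus its sentiment, with the aid of Spark Structured Streaming, to the Kafka topic ‘ToTiDB’ for insertion into the cluster DB; this feeds the second flow in NiFi. Kafka topic names are case-sensitive, so the name must match the one the GetKafka processor reads. The Kafka sink also needs a checkpoint location, and the path used here is only a placeholder
tweettextDF.selectExpr("to_json(struct(*)) AS value").writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrapServer) \
    .option("topic", "ToTiDB") \
    .option("checkpointLocation", "/tmp/tweet-sentiment-checkpoint") \
    .start()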
- To submit this Python script to Spark as a job, run the command below with your Spark master URL specified
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 --master spark://10.0.0.4:7077 StreamAnalytics.py localhost:9092 TweetStream
Spark downloads the packages you specified and then sends the job, along with the Python script, to the master URL. The Python script receives two arguments: the Kafka bootstrap server and the topic name. The command above starts the Spark Structured Streaming job, which waits for data ingested from NiFi under the topic we specified.
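The script reads those two arguments through a small helper, getMainArgs, which is not shown in this post. Here is a hypothetical stand-in, assuming all it needs to do is check the argument count and hand back the argument list; the name get_main_args and the usage message are mine.

```python
import sys


def get_main_args(argv=None):
    """Hypothetical stand-in for getMainArgs: validate and return the arguments.

    Expects the script name followed by the bootstrap server and topic name.
    """
    argv = list(sys.argv if argv is None else argv)
    if len(argv) != 3:
        script = argv[0] if argv else "script"
        raise SystemExit(f"usage: {script} <bootstrap-server> <topic>")
    return argv


# The argument list spark-submit hands to the script in the command above.
example_argv = ["StreamAnalytics.py", "localhost:9092", "TweetStream"]
bootstrap_server, topic = get_main_args(example_argv)[1:]
print(bootstrap_server, topic)
```

Failing fast on a wrong argument count gives a clear usage message instead of an unpacking error deep inside the Spark job.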
You also need to start all the NiFi processors to run the complete pipeline. To do this, go to the NiFi page in your browser, right-click on the grid and click Start to begin ingesting data from Twitter.
Whew, you made it. Last step.
Step 10: Data visualization
I used Spring Boot and React for my data visualization. Here is a sample of how it looks when the page is pulling near-real-time sentiments from the database.
There were lots of challenges in fusing these technologies and setting up a server on Azure to host them. The entire process was an eye-opener: it showed me a range of ways in which big data tools can be implemented and used to solve real-world problems and aid decision making. Don’t worry, there’s more to come. I will be spending more time with other big data and machine learning tools. The challenges are demanding, but the joy is in getting them done and seeing the result.
Want to learn more about ParallelScore and how we use data? Feel free to send us a message at email@example.com, subject line: We Love Data.