Check out this answer by Ifueko Igbinedion on TwoCents

Paul-David Maijeh

Hi Ifueko! I really enjoyed your session and the exhaustive answers, thanks a lot! I'm an up-and-coming financial engineer and trader, and we use a lot of ML, DL and AI to draw insights and create actions. While I consider myself a beginner at programming, my econ and stats background gives me some good theoretical knowledge, and I'm currently researching how to apply sentiment analysis to ascertain correlations and causation between people's sentiments and trade volumes and values, using engagement on Twitter, Reddit and Facebook. I've read a lot about the Naive Bayes algorithm and its popularity in analysing sentiment; however, I'm trying to figure out whether NLP is much more effective, given its wider reach in analysing textual information and its better qualitative approach. Which do you think would be more effective, or should I use both and test for the highest efficacy?

Ifueko Igbinedion Doctoral Student @ MIT

Boston, United States

I think this is a bit dangerous. Attempting to ascertain sentiment correlations and apply them to huge financial decisions may work in certain contexts, and you could definitely train a model with 99% training accuracy on this task, but future situations that depend on complex human action can never be adequately represented by a numerical parameterization and a finite state machine. If the model is not large enough, it will not learn all the possible combinations of interactions. If it is too large, it only learns the context of its training dataset.

That being said, you could do both and get good results during training. Personally, I do not have extensive NLP or Bayesian experience in production, but their fundamentals suggest that they would learn this type of model well, independently or in conjunction. Naïve Bayes is good for state estimation-based decision making, and NLP can be used to model language and extract sentiment. However, these models depend completely on the input dataset one uses and on the chosen labels (if using a supervised method), which are often subjective. Using data from the internet is also dangerous because it is next to impossible to have humans annotate every piece of training data without spending a large amount of money, and learning from problematic input data can lead to problematic situations.
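The gap between training accuracy and accuracy on unseen text can be sketched in a few lines. This is only a minimal illustration, assuming a from-scratch multinomial Naive Bayes with add-one smoothing and whitespace tokenization; the function names (`train_naive_bayes`, `predict`, `accuracy`) and the handful of posts are hypothetical and made up for the example, not a real dataset or a production approach.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "label_counts": label_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    # Words never seen in training are ignored entirely.
    tokens = [t for t in text.lower().split() if t in model["vocab"]]
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label in sorted(model["label_counts"]):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(model["label_counts"][label] / total_docs)
        for t in tokens:
            score += math.log((counts[t] + alpha) / (total_words + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the given label."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


# Made-up training posts with subjective, hand-assigned labels.
train_posts = [
    ("stock is going to the moon", "pos"),
    ("great earnings buying more", "pos"),
    ("love this company strong growth", "pos"),
    ("terrible earnings selling everything", "neg"),
    ("this stock is a disaster", "neg"),
    ("weak growth avoid this company", "neg"),
]

# Made-up unseen posts, including negation and sarcasm.
held_out_posts = [
    ("buying more great growth", "pos"),
    ("selling this disaster", "neg"),
    ("not great at all", "neg"),
    ("yeah sure this is going to the moon", "neg"),
]

model = train_naive_bayes(train_posts)
train_acc = accuracy(model, train_posts)
held_out_acc = accuracy(model, held_out_posts)
print("training accuracy:", train_acc)
print("held-out accuracy:", held_out_acc)
```

The model fits its own training posts well, but a bag-of-words view has no notion of negation or sarcasm, so posts like "not great at all" are scored only on the words it already knows. Real social media data is far messier than this toy set, which is the concern raised above.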

To make this less vague, take the 2016 example where Tay, a chatbot made by Microsoft and trained on Twitter data, became extremely racist in less than a day of online training (https://twitter.com/geraldmellor/status/712880710328139776). Attempting to determine causation in a data-driven sense is a slippery slope. Until AI solves the data-driven generalization problem (which I believe may be never), I wouldn't put a system like this into production unless I could guarantee significant human supervision and had looked at the ethical implications for those who do not financially benefit from the proposed system.
