NLP and data science news and resources.
by Ric Szopa
Thanks to open-source libraries and the "open culture of the AI community," machine learning skills are becoming less valuable as time goes by, not more. Freely available libraries and shared knowledge mean that a developer's background knowledge doesn't need to be as deep as it had to be even a couple of years ago.
Taking the best-performing architecture currently described in the literature and retraining it on your own data is a battle-tested strategy if your goal is to solve a problem (as opposed to making an original contribution to science). If there’s nothing really good available right now, it’s often a matter of waiting a quarter or two until someone comes up with a solution. Especially since you can do things like host a Kaggle competition to incentivize researchers to look into your particular problem.
Szopa goes on to explain where he feels the competitive advantage in the field will eventually lie:
So, how would you go about building a maintainable competitive advantage for an AI product? Some time ago I had the pleasure of talking to Antonio Criminisi from Microsoft Research. His idea is that the project’s secret sauce shouldn’t consist only of AI. For example, his InnerEye project uses AI and classical (not ML based) computer vision to analyze radiological images. To some extent, this may be at odds with why you are doing an AI startup in the first place. The ability to just throw data at a model and see it work is incredibly attractive. However, a traditional software component, the sort of which requires programmers to think about algorithms and utilize some hard to gain domain knowledge, is much more difficult to reproduce.
In other words, from a long-term perspective, this new approach to machine learning—just "throwing data at a problem" and getting good results—may be just a passing fad. To excel at what you're doing, you'll need domain expertise. Which is pretty much where we were before the deep learning craze.
by Paul Nelson
Maybe a better title would have been "Automatically Generating Technical Documentation."
Nelson discusses representing technical documentation (e.g., software docs) as knowledge graphs, similar to DBpedia and ConceptNet, that can be generated automatically. Here I've abridged one section of his article.
There are many good reasons why technical documentation may be the first to completely break away from paper-based documentation:
Technical documents are narrow in scope
Lots of technical documentation is already generated automatically — this is (mostly) the case for Javadoc and similar sorts of library documentation. It’s a small step from this to creating machine-readable knowledge
People reading technical documentation don’t want long narratives — they want short paragraphs and lots of lists and examples
The last item is what particularly grabbed my attention. I've come to believe that one of the benefits of unit tests is that they can provide clients (that is, other programmers using the method or class) with real-world examples of how to use the code. A unit test could easily be transformed into different kinds of lists: input-output pairs (for this input, you get this output), or a set of instructions (you need to take these steps in order to properly call this method). With @-marked metadata at the start of or within a unit test, doing this automatically is not such a far-fetched idea.
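To make the idea concrete, here is a minimal sketch of what such a tool might look like. The `@doc-example` marker, the `slugify` function, and the extraction convention (every `assertEqual(f(input), output)` call in a marked test becomes an input-output pair) are all hypothetical conventions invented for illustration, not part of any existing tool:

```python
import ast

# Hypothetical convention: a unit test whose docstring contains "@doc-example"
# is opted in to documentation generation, and each assertEqual(actual, expected)
# call inside it is extracted as an input-output pair.

SOURCE = '''
class TestSlugify(unittest.TestCase):
    def test_basic(self):
        """@doc-example: slugify"""
        self.assertEqual(slugify("Hello World"), "hello-world")
        self.assertEqual(slugify("  A  B "), "a-b")
'''

def extract_io_pairs(source):
    """Collect (input expression, expected output) pairs from marked tests."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        if "@doc-example" not in (ast.get_docstring(node) or ""):
            continue
        for call in ast.walk(node):
            if (isinstance(call, ast.Call)
                    and isinstance(call.func, ast.Attribute)
                    and call.func.attr == "assertEqual"
                    and len(call.args) >= 2):
                actual, expected = call.args[:2]
                pairs.append((ast.unparse(actual), ast.unparse(expected)))
    return pairs

for inp, out in extract_io_pairs(SOURCE):
    print(f"for input {inp}, you get {out}")
```

Since the tests are parsed rather than executed, the same walk could just as easily emit a Markdown table or a "steps to call this method" list instead of printed lines. (`ast.unparse` requires Python 3.9+.)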
Data Science/ML Links
- Glossary of common Machine Learning, Statistics and Data Science terms
- The 10 Statistical Techniques Data Scientists Need to Master
- Stephan Raaijmakers: Deep Learning for Natural Language Processing
- Keras Examples
- Is There a Smarter Path to Artificial Intelligence?: Don't let the hype over Deep Learning raise our expectations and lead to another round of AI disillusionment.
- Canada's AI Ecosystem: Montreal
- From big data to fast data
- NFL Salaries for AI Talent
- NY Times: The Great AI Awakening
- Google Translate and Deep Learning, Pt. 1
- Google Translate and Deep Learning, Pt. 2
- "The 5 Basic Statistics Concepts Data Scientists Need to Know"
- "Best Deals in Deep Learning Cloud Providers"
- "The Most in Demand Skills for Data Scientists": As of Oct. 23, 2018
- "The Big Bad NLP Database: Access Nearly 300 Datasets"
- Top 10 Books on NLP and Text Analysis
- Faster NLP in Python: Combining Spacy and Cython for faster NLP.
- A Framework for Approaching Textual Data Science Tasks: Includes a discussion of the differences between text mining and NLP.
- Chatbots: Theory and Practice
- "Natural Language Processing (NLP) Techniques for Extracting Information" by Paul Nelson
- Text Analytics: A Primer
- "What NLP & Text Analytics Companies Got Funded or Acquired in 2016?" by Seth Grimes
- Siri and autism: a mother's notes
- Types of Bots
Parsing, Events, Etc.
- Penn Treebank Tag Set. As far as I can tell, the University of Pennsylvania doesn't have a permanent web page dedicated to their own tag set. Try this, though I can't guarantee the link will remain valid: Open American National Corpus
- Universal Dependencies
- EVITA - Events In Text Analyzer
- TimeML: Markup Language for Temporal and Event Expressions