I have a Bachelor’s Degree in Software Engineering from the best technical university in Denmark. I have a Master’s Degree in Business.
I see myself as quite a technical person. My job is to make sure that the developers in my company write great code, and that we make the right architectural decisions. We do a good job, and our customers seem to love us (well, they keep buying stuff!).
But I am a business guy. From a business perspective, I am totally hyped about machine learning and artificial intelligence (which for some reason are the same thing in my mind, or, let us say, were). I believe it’s the future, and I totally want to build the next “machine-learning”-applied product that will conquer the world.
But I have no experience with machine learning, artificial intelligence, deep learning or any of the other fancy words I’ve read.
When Amazon launched their Amazon Machine Learning, I was hyped. Now I could finally become an AI and machine learning hacker, ready to hack the world.
I saw their introduction video (around 30 minutes long), and it looked really cool. When the presenter did it, it also seemed really simple. With that in mind, I decided that this was something I had to try out.
My review and experience with Amazon Machine Learning
One of my customers is a very big webshop in Denmark. I took the last 5.5 years of revenue data, which was approximately 2,000 rows.
That meant I had a file with:
- Date created
Now my (naive?) hypothesis was: let us import this into the tool, wait 5 minutes, and have a near-perfect prediction of the future. I mean, max 5-10% error rate.
I got into the tool. Created a new data source. Ready to rock the world. Failed.
I reviewed the data source. Oh. I could not have white space in my columns. Fair enough. Try again.
Now it didn’t like my date format. Apparently a Danish time format is not popular and gives errors.
I decided to dust off my Python skills, import the data and play around.
I finally created a new file with data, that had some columns, an ID, correct dates and most importantly: nicely formatted revenues.
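A minimal sketch of that kind of cleanup, using only the Python standard library. The actual script is not shown in this post, so the column names, the semicolon delimiter, the dd-mm-yyyy date format and the "1.234,56" number format are all assumptions for illustration, and clean_revenue_csv is a hypothetical helper name.

```python
import csv
import io
from datetime import datetime


def clean_revenue_csv(raw_text):
    """Return cleaned rows as a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    cleaned = []
    for row_id, row in enumerate(reader, start=1):
        # Column names may not contain white space, so normalise them.
        row = {key.strip().replace(" ", "_"): value.strip()
               for key, value in row.items()}
        # Assumed Danish date format dd-mm-yyyy, rewritten as ISO yyyy-mm-dd.
        day = datetime.strptime(row["Date_created"], "%d-%m-%Y").date()
        # Assumed Danish number format: '.' for thousands, ',' for decimals.
        revenue = float(row["Revenue"].replace(".", "").replace(",", "."))
        cleaned.append({"id": row_id,
                        "date_created": day.isoformat(),
                        "revenue": revenue})
    return cleaned


# Made-up sample rows, purely for illustration.
sample = "Date created;Revenue\n06-05-2015;12.345,50\n07-05-2015;9.870,00\n"
for row in clean_revenue_csv(sample):
    print(row)
```

Each output row gets an ID, an unambiguous date and a plain decimal number, which are the three things the data source complained about.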
I created a data source. Success.
Now I was ready. I had finally beaten the system, and could get it to work.
I selected my revenue as target data and date as categorical. Click. Click. Predict shit now, please.
Now it said pending.
After 10 minutes (and I assumed the 10 minutes meant the data would be so awesome I could tell my customers to fire the whole business intelligence team and use my machine learning skills instead), it said the data was READY!
The output was a file. It had two columns. The “score” had numbers such as “0.631” and “0.521”.
I tried to play around in the interface, and I simply couldn’t find a “predict the future value” button.
I kept clicking around. There was nothing to do. There were weird graphs, weird numbers and weird files in my Amazon storage.
I could have given up…
I might be naive and stupid (you can be the judge), but at least I know my limits.
I went to Upwork (formerly oDesk), and called for help. After a couple of useless applicants, I found my savior.
Mario Filho from Brazil made a (20 USD/hr) offer with this profile:
“I have experience using a wide range of machine learning algorithms, both for supervised (classification, regression) and unsupervised (clustering) learning, to solve real-world problems with data.
If you hire me, these are the steps I will take to make your data project a success:
– Understand your goals and expectations.
– Clean and prepare the data.
– Develop models respecting technical standards.
– Report the performance in a test environment.
– Deploy the model to production.”
This sounded like the guy I needed. Maybe he had never worked with Amazon Machine Learning, but I was sure that results like “0.631” and “0.521” meant more to him than to me.
(Just a recommendation here: if you need any help with machine learning, contact him on LinkedIn: https://www.linkedin.com/in/mariofilho – extremely helpful, friendly and knows his stuff! I don’t think he will keep his current hourly price for long – but even if it was much higher it’s worth it.)
I sent him my order revenue, and told him that when he had something to show, we could do a screencast.
After ~2 hours Mario was ready.
Holy cow: Amazon Machine Learning is complex
The first thing I saw on the screencast was a “small” 49-line-long Python script he wrote, to make the data nice. I was shocked: aren’t we supposed to click the “upload button” on any type of file?
The truth is: no. The next 30 minutes were amazing.
Mario made multiple data sources (apparently you need separate data sources for your training data, for testing and for evaluation). He made batch predictions. He made everything you needed to get actual results.
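The reason for separate data sources is that a model has to be judged on data it did not see during training. A minimal sketch of one way to split a revenue history, assuming rows with an ISO date field; the 70/30 ratio and the helper name chronological_split are assumptions, not necessarily what Mario did.

```python
def chronological_split(rows, train_fraction=0.7):
    """Split date-ordered rows into a training part and an evaluation part."""
    # Sort by ISO date string so the evaluation rows are the most recent ones.
    ordered = sorted(rows, key=lambda row: row["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Made-up rows, purely for illustration.
rows = [{"date": "2015-05-0%d" % day, "revenue": 100.0 * day}
        for day in range(1, 7)]
train, evaluation = chronological_split(rows)
print(len(train), "training rows,", len(evaluation), "evaluation rows")
```

Splitting by time rather than at random keeps the evaluation honest for forecasting: the model is only ever tested on days that come after the ones it learned from.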
I was shocked.
I thought I had a chance after watching the initial introduction video.
Boy, I was wrong. There is absolutely no chance I would have done this right myself.
In the end, we tried to upload a file as a data source, which we could use to generate a prediction for May 7th 2015 (we had revenue data until May 6th 2015).
And we got a number.
I compared it to real-life data, and saw we had a 14% error margin.
I was just about to tell Mario how awesome it was, but then Mario wrote this small Python line (well, because, Brazil?) and ran it:
print('MAPE', np.abs((y_true - y_pred) / y_true).mean())
It turned out we had a 34% error margin.
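The gap between 14% and 34% comes from what is being measured: one lucky day versus the average over every day in the evaluation set. MAPE (mean absolute percentage error) is that average. A plain-Python sketch of the same calculation as the one-liner above; the numbers are made up for illustration and are not the webshop's data.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.34 means 34%)."""
    # Note: this divides by the actual value, so it breaks on days with 0 revenue.
    errors = [abs((actual - predicted) / actual)
              for actual, predicted in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Made-up daily revenues and predictions, purely for illustration.
y_true = [100.0, 200.0, 50.0, 80.0]
y_pred = [114.0, 150.0, 75.0, 60.0]

print("single day:", abs((y_true[0] - y_pred[0]) / y_true[0]))
print("MAPE:", mape(y_true, y_pred))
```

In this toy example the first day is off by 14%, but the other days are worse, so the average error ends up roughly twice as high.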
34%? That’s … useless!
I was shocked! 34%?
I couldn’t tell my customers to fire their business intelligence team with 34% error margin!
Mario calmed me down. He told me there were a couple of things to consider:
- You have to clean the dataset for outliers (data points that stick out)
- We could benefit from normal machine learning tricks such as normalizing
- Amazon uses a very simple linear regression technique. He mentioned something about a Gradient Boosted Decision-something that could help
- Creating models for specific subsets of the data (like a model for each SKU, or region)
- Feature engineering: creating relevant features, and exploring interactions between them (I still have absolutely no clue what that means… which probably means it is important)
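To make the last point a bit more concrete: a raw date treated as a category tells a model very little, because every day is its own category. A common trick is to derive features from it instead. A minimal sketch; the particular features and the function name date_features are assumptions for illustration, not what Mario proposed.

```python
from datetime import date


def date_features(iso_date):
    """Turn 'yyyy-mm-dd' into a few numeric features (hypothetical choice)."""
    day = date.fromisoformat(iso_date)
    return {
        "day_of_week": day.weekday(),          # 0 = Monday ... 6 = Sunday
        "day_of_month": day.day,
        "month": day.month,
        "is_weekend": int(day.weekday() >= 5),
    }


print(date_features("2015-05-07"))
```

With features like these, a model can at least pick up patterns such as "weekends sell differently" or "December is busy", which a unique date label can never express.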
He told me that machine learning is hard work. You have to work a lot with your data, and only then can you expect better results.
Apparently it turns out machine learning is hard.
30 minutes after talking to Mario, he wrote to me on Skype:
“One last thing (I promise), I ran a Gradient Boosted Decision Trees model just to see what happens, and the MAPE was down to 25%. This is a good signal that a more complex model, with tuned parameters, can give you a better error.”
Now we’re getting closer! With some extra work, who knows, maybe we could get it down to 15-20%?
No matter what, I was quite disappointed with the result. Let’s just say we put in a lot of work and got the error rate down to 15%. That’s still not very good for predicting the future.
But I made a very big mistake: I did not understand what machine learning is.
Amazon Machine Learning is boring mathematics, not evil AI that is about to conquer the world.
Amazon Machine Learning is quite simple.
What it simply does is a lot of mathematics. It’s linear regression. It’s simply a lot of matrix calculations, and then finding the optimal values for the model.
Machine learning does not understand your data. Machine learning is simply a combination of statistics and computer science.
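To show how little magic is involved, here is textbook least squares for a single input in plain Python. This is only a sketch of the general idea of linear regression; Amazon's actual implementation is not shown in this post and works with many features at once, and the function name fit_line and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); this minimises squared error.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The "learning" is nothing more than picking the slope and intercept that make the squared errors as small as possible. The model has no idea whether the numbers are revenue, temperatures or shoe sizes.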
Looking back, I was extremely naive.
Of course, how could I expect much better than a 25% error rate? With a bit of common sense, it makes sense: this is close to impossible to predict.
Maybe if it actually understood the context behind the numbers. And it had a brain. And more data. Then maybe, it could be better.
But Amazon Machine Learning is simply mathematics.
Amazon Machine Learning is not for your average developer – yet
Agreed. I should have read more. I should probably have read articles and watched videos, and even have read the article Mario sent me (ahem).
But from a user perspective, it’s far from being ready.
It’s NOT a tool your average developer can sit down and use.
It does require you to understand some very basic foundations of machine learning. It was extremely interesting to see that a person such as Mario, who had never worked with the tool, was able to understand it in no time.
But I, who never worked with it, had no chance.
I am sure it’s a great tool. With the right help, I am sure it can be very useful. But it does not solve world hunger or predict how much money is in my pocket tomorrow; it is, however, a fun experience.