Anima Mundi

R or Python? Nobody Cares!

This story tackles a very common, very shitty question in the data science space. I am not advertising any tool or package here; all views are my own and based on my personal experience, so please feel free to judge me.

The first and most common question when you start doing data science in practice is which tools and programming language to use.

I. Break Free!

Setting aside commercial tools like SAS and MATLAB, open source projects have strong, very rich communities behind them and are extremely popular among data scientists. They offer far more functionality, give complete freedom, and evolve much faster than tools like SAS thanks to frequent new releases and state-of-the-art packages.

They are arguably the best choice for data science. Plus they're free, which is also nice. You don't have to worry about your choice, because it costs you nothing: you can switch tools whenever you want, or use a combination of them.

Please, this is very important, keep this comic in mind:

II. Who's best?

So what is the "best" toolset? R? Python? Julia? Supposing you have complete freedom of choice, the answer is pretty straightforward: IT DOES NOT MATTER AT ALL. Tools are nothing more than tools, and they should be chosen with a purpose in mind. R and Python are the most common choices, and they have pretty much the same functionality and features. In short, having used both of them in various contexts, my opinion is:

So just make your choice, depending on what lets you get stuff done quickly and easily. Be lazy, and keep in mind that you can always go back and learn another language. Also, with all the resources and awesome communities on the interwebs, learning and using multiple languages at the same time has never been so easy.

I, for example, used R for years before switching to Python because of libraries like PySpark. I still use R (to maintain this blog, for example).

NB: thanks to the guys at RStudio, R has a real, fully functional interface to Python called reticulate. That means you can now load your dataset with pandas, do all your data management with dplyr and the tidyverse, visualize the data with ggplot2, and then fit a scikit-learn model or call TensorFlow. All of that in the same R script. It's not going to end the eternal debate, but check it out! Using it, I noticed a bit of annoying overhead from serialization and deserialization when data moves between R and Python. Also, the conda environment management adds complexity and disk footprint. It's a really cool feature, but I prefer using pure Python or pure R for the sake of keeping things simple.

NB2: I've been hearing some crazy things about Julia. The programming language seems to be über powerful and fast, and there is a growing number of data science and ML libraries for it. I will definitely check it out.

I hope you enjoyed the read! 😊👋

~ Anas EL KHALOUI

#AI #Tech #english