As a data scientist, I feel that Nim has tremendous potential for data science, machine learning and deep learning.
For the past 3 months I've been working on Arraymancer, a fast and ergonomic tensor library that currently provides a subset of Numpy functionality. It features:
- Creating tensors from nested sequences and arrays (even 10 levels of nesting)
- Pretty printing of up to 4D tensors (would need help to generalize)
- Slicing with Nim syntax
- Slices can be mutated
- Reshaping, broadcasting, concatenating tensors. Also permuting their dimensions.
- Universal functions
- Accelerated matrix and vector operations using BLAS
- Iterators (on values, coordinates, axis)
- Aggregates and statistics (sum, mean, and a generic higher-order aggregate function)
Next steps (in no particular order) include:
- Adding CUDA support using andrea's nimcuda package
- Adding Neural Network / Deep Learning functions
- Improving the documentation and adding the library to Nimble
The library: https://github.com/mratsim/Arraymancer
I welcome your feedback and expected use cases. I would especially love to know the pain points people have with deep learning and with putting deep learning models in production.
I've been following this for a while on GitHub and I think it is a very impressive project. Nim would be a great language for scientific computing, but it needs to have the numerical libraries and this is an excellent first step in creating them.
A couple of questions. First, are you planning to add neural network functionality directly to Arraymancer? Surely that would be something better suited for a separate, specialised library? A second, more general, question I have is whether you'd consider making the get_data_ptr proc public. It would be nice to be able to integrate your tensors with wrappers for existing numerical software written in C and we'd need access to the raw data for that.
get_data_ptr is now public.
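To make the use case concrete, here is a minimal sketch of why raw pointer access matters when wrapping C code. It is written in Python with only the standard library and is not Arraymancer code: `storage` stands in for a tensor's contiguous buffer, and `scale_in_place` is a hypothetical stand-in for a wrapped C routine that takes a `double*` and a length.

```python
import array
import ctypes


def scale_in_place(ptr, n, factor):
    """Hypothetical stand-in for a wrapped C routine taking (double*, n)."""
    for i in range(n):
        ptr[i] *= factor


# A flat, contiguous float64 buffer, standing in for a tensor's storage.
storage = array.array("d", [1.0, 2.0, 3.0, 4.0])

# Expose the same memory to ctypes without copying it.
view = (ctypes.c_double * len(storage)).from_buffer(storage)

# Reinterpret the view as a plain C pointer, which is what a C API expects.
raw = ctypes.cast(view, ctypes.POINTER(ctypes.c_double))

# The "C side" mutates the data through the pointer...
scale_in_place(raw, len(storage), 10.0)

# ...and the change is visible through the original container.
result = storage.tolist()
```

Because the pointer aliases the original buffer, no copy is needed in either direction, which is the property a C wrapper relies on.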
For now, I will add the neural network functionality directly in Arraymancer.
The directory structure will probably be:
- src/arraymancer ==> core Tensor stuff
- src/autograd ==> automatic gradient computation (i.e. Nim-rmad ported to tensors)
- src/neuralnet ==> neural net layers
This mirrors PyTorch's tree.
I made this choice for the following reasons:
- It's easier for me to keep track of one repo, refactor code, document and test.
- I'm focusing on deep learning
- It's much easier to communicate about one single package (and it attracts new people to Nim).
- Data scientists are used to having deep learning in a single package (tensor + neural net interface): TensorFlow, Torch/PyTorch, Nervana Neon, MXNet ...
- Nim's DeadCodeElim will ensure that unused code will not be compiled.
If the tensor part (without the NN) gets even 0.1% of Numpy's popularity and people start using it in several packages, that means:
- It's a rich man's problem!
- We get new devs and input for scientific/numerical Nim.
- We can reconsider splitting as we will know actual expectations.
- We can even build a "scinim" community which drives all key scientific Nim packages.
In the meantime I think it's best if I do what is easier for me and worry about how to scale later.
A late reply because I was hoping to dive into this a bit deeper before replying. But due to lack of time, some high-level feedback must suffice: This looks awesome!
I completely agree with your observation that there is a gap between developing prototypes e.g. in Python and bringing them into production -- not only in deep learning, but data science in general. And I also think that Nim's feature set would be perfect to fill this gap.
A quick question on using statically-typed tensors: I assume that this implies that the topology of a network cannot be dynamic at all? I'm wondering if there are good work-arounds for situations where dynamic network topologies are required, for instance when a model wants to choose its number of hidden layer nodes iteratively, picking the best model variant. Are dynamically typed tensors an option or would that defeat the design / performance?
The only static parts of the Tensor types are the Backend (Cpu, CUDA, ...) and the internal type (int32, float32, object ...).
The network topology will be dynamic, using dynamic graphs more akin to PyTorch/Chainer/DyNet than Theano/Tensorflow/Keras.
My next step is to build an autograd so people only need to implement the forward pass; backpropagation will be automatic. For this part I'm waiting for VTable.
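As a rough illustration of the idea (in Python, on scalars rather than tensors, and not the planned Arraymancer design), a define-by-run autograd can be sketched as below. The names `Value`, `relu`, `backward` and `forward` are hypothetical. Each operation records its inputs and local derivatives while the forward pass runs, so the graph is rebuilt on every call and the hidden width is just the length of a list chosen at runtime.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical name)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents    # Values this one was computed from
        self.grad_fns = grad_fns  # maps upstream grad to each parent's grad

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def relu(self):
        return Value(max(self.data, 0.0), (self,),
                     (lambda g: g if self.data > 0 else 0.0,))

    def backward(self):
        # Order nodes so that every node comes after its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, fn in zip(node.parents, node.grad_fns):
                parent.grad += fn(node.grad)


def forward(x, w_in, w_out):
    """One hidden layer whose width is simply len(w_in) at call time."""
    hidden = [(w * x).relu() for w in w_in]
    out = Value(0.0)
    for h, w in zip(hidden, w_out):
        out = out + h * w
    return out


x = Value(2.0)
w_in = [Value(0.5), Value(-1.0), Value(1.5)]
w_out = [Value(1.0), Value(2.0), Value(3.0)]

y = forward(x, w_in, w_out)  # only the forward pass is written by hand
y.backward()                 # gradients come from the recorded graph
```

Calling `forward` again with lists of a different length builds a differently shaped graph with no other change, which is the dynamic-topology behaviour asked about above.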
PS: I think NimData is great too, Pandas seems like a much harder beast!