A deep learning framework built from scratch in C++ on CUDA and cuDNN. No TensorFlow, no PyTorch, no Caffe under the hood — just a node-based computational graph engine, hand-written CUDA kernels, and working generative models that produce real face images.
128×128 faces generated by a DCGAN trained entirely on the DeepFlow engine (CelebA dataset).
I wanted to understand deep learning at the level below the frameworks — how convolutions are dispatched on a GPU, how gradients propagate through a computational graph, how optimizers update parameters across scoped subgraphs. DeepFlow is the result: a complete, working DL framework that I used to train GANs, VAEs, and classifiers from scratch.
Computational graph engine. Models are built by chaining nodes in a directed graph. Each node manages its own forward and backward passes on the GPU. The graph supports multiple execution phases, scoped parameter blocks (df.with("generator") / df.with("discriminator")), and per-variable solver assignment.
50+ node types with CUDA kernels covering:
| Category | Operations |
|---|---|
| Convolution | conv2d, transposed_conv2d (with stride, padding, dilation) |
| Activations | relu, leaky_relu, sigmoid, tanh, elu, clipped_relu, prelu, dprelu |
| Normalization | batch_normalization, lrn (local response normalization) |
| Pooling & spatial | pooling, spatial_transformer, resize, lifting, patching, patch_sampling |
| Reduction | argmax, argmin, reduce_max, reduce_min, reduce_mean, reduce_sum, reduce_norm1, reduce_norm2 |
| Math | add, subtract, dot, matmul, dense, square, abs, exp, log, square_error, nand, max |
| Selectors | multiplexer, random_selector, switch (for staged/progressive training) |
| Generators | mnist_reader, imbatch (image batch loader), data_generator, text_image_generator |
| I/O | display (live OpenCV window), imwrite, print, logger, psnr |
| Other | dropout, bias_add, softmax, loss, gaussian, gabor_kernel, gaussian_blur, concate, reshape, restructure, replay_memory, batch_stddev, pass_through |
4 optimizers implemented from scratch: Adam, SGD, RMSProp, AdaDelta — each with hand-written CUDA update kernels.
9 weight initializers: random uniform, random normal, truncated normal, constant, fill, zeros/ones, step, three-state, gradient fill.
Model serialization via Protocol Buffers, with Caffe model import support (load_from_caffe_model).
C++ code generation from the graph — session->to_cpp() emits compilable C++ that reconstructs the model.
Least-square GAN with a multi-scale generator (skip connections across resolutions) and a discriminator using learned downsampling via lifting nodes. Separate Adam solvers for G and D with independent learning rates.
Progressive training from 8×8 up to 128×128 with multi-resolution multiplexers controlling staged discriminator/generator growth. The training loop manually advances through 5 resolution stages, toggling multiplexer inputs and switch nodes at each stage — all orchestrated through the DeepFlow graph API.
Variational autoencoder with a Gaussian sampling node (df.gaussian(mean, sigma)) for the reparameterization trick.
Deep autoencoder on 128×128 color faces.
Least-square GAN on MNIST digits.
Classic LeNet classifier on MNIST.
Spatial transformer network on MNIST with learned affine transformations.
Building a DCGAN generator in DeepFlow reads like a blueprint of the architecture:
```cpp
DeepFlow df;
df.with("generator");
auto solver = df.adam_solver(AdamSolverOp("g_adam").lr(0.0002f).beta1(0.5f).beta2(0.98f));
auto node = df.place_holder({batch, 100, 1, 1}, PlaceholderOp("g_input"));
node = df.dense(node, {100, fn * 4, 4, 4}, solver, DenseOp("gfc").no_bias());
node = df.batch_normalization(node, fn * 4, solver, BatchNormalizationOp("gfc_bn"));
node = df.relu(node);
node = df.transposed_conv2d(node, fn * 4, fn * 4, solver, ConvolutionOp("g8").kernel(3).stride(2));
node = df.batch_normalization(node, fn * 4, solver, BatchNormalizationOp("g8_bn"));
node = df.relu(node);
// ... layers up to 128×128 ...
node = df.transposed_conv2d(node, fn * 4, 3, solver, ConvolutionOp("g128").kernel(3).stride(1));
```

Separate solver scopes, selective gradient updates, and multiplexer-based training phases are first-class:
```cpp
session->apply_solvers("discriminator"); // update only D weights
session->reset_gradients("discriminator");
session->apply_solvers("generator");     // then update only G weights
```

- NVIDIA GPU with CUDA support
- CUDA 9.0
- cuDNN v7.1
- OpenCV 3.0
- Protocol Buffers
- glog / gflags
- Visual Studio 2015
This was a personal research project (2017–2018). It is not under active development, but represents a complete, working implementation — the examples above were trained end-to-end on this framework and the results are real.
BSD 3-Clause