
DeepFlow

A deep learning framework built from scratch in C++ on CUDA and cuDNN. No TensorFlow, no PyTorch, no Caffe under the hood — just a node-based computational graph engine, hand-written CUDA kernels, and working generative models that produce real face images.

[Figure: 128×128 faces generated by a DCGAN trained entirely on the DeepFlow engine (CelebA dataset).]

Why this exists

I wanted to understand deep learning at the level below the frameworks — how convolutions are dispatched on a GPU, how gradients propagate through a computational graph, how optimizers update parameters across scoped subgraphs. DeepFlow is the result: a complete, working DL framework that I used to train GANs, VAEs, and classifiers from scratch.

Architecture

Computational graph engine. Models are built by chaining nodes in a directed graph. Each node manages its own forward and backward passes on the GPU. The graph supports multiple execution phases, scoped parameter blocks (df.with("generator") / df.with("discriminator")), and per-variable solver assignment.

50+ node types with CUDA kernels covering:

Convolution: conv2d, transposed_conv2d (with stride, padding, dilation)
Activations: relu, leaky_relu, sigmoid, tanh, elu, clipped_relu, prelu, dprelu
Normalization: batch_normalization, lrn (local response normalization)
Pooling & spatial: pooling, spatial_transformer, resize, lifting, patching, patch_sampling
Reduction: argmax, argmin, reduce_max, reduce_min, reduce_mean, reduce_sum, reduce_norm1, reduce_norm2
Math: add, subtract, dot, matmul, dense, square, abs, exp, log, square_error, nand, max
Selectors: multiplexer, random_selector, switch (for staged/progressive training)
Generators: mnist_reader, imbatch (image batch loader), data_generator, text_image_generator
I/O: display (live OpenCV window), imwrite, print, logger, psnr
Other: dropout, bias_add, softmax, loss, gaussian, gabor_kernel, gaussian_blur, concate, reshape, restructure, replay_memory, batch_stddev, pass_through

4 optimizers implemented from scratch: Adam, SGD, RMSProp, AdaDelta — each with hand-written CUDA update kernels.
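
To give a sense of what such an update kernel involves, here is a minimal Adam step in CUDA. This is a sketch of the standard algorithm, not DeepFlow's actual kernel; the function name, parameter layout, and the host-precomputed bias-correction terms are all assumptions.

// Illustrative Adam update: one thread per parameter.
// bias1 = 1 - beta1^t and bias2 = 1 - beta2^t are precomputed on the host.
__global__ void adam_update(int n, float *w, const float *grad,
                            float *m, float *v,
                            float lr, float beta1, float beta2, float eps,
                            float bias1, float bias2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float g = grad[i];
    m[i] = beta1 * m[i] + (1.0f - beta1) * g;      // first-moment estimate
    v[i] = beta2 * v[i] + (1.0f - beta2) * g * g;  // second-moment estimate

    float m_hat = m[i] / bias1;                    // bias-corrected moments
    float v_hat = v[i] / bias2;
    w[i] -= lr * m_hat / (sqrtf(v_hat) + eps);     // parameter step
}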

9 weight initializers: random uniform, random normal, truncated normal, constant, fill, zeros/ones, step, three-state, gradient fill.

Model serialization via Protocol Buffers, with Caffe model import support (load_from_caffe_model).
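
A hypothetical call sketch; only the function name load_from_caffe_model comes from the project, while the signature and file name here are illustrative:

// Hypothetical usage; exact signature is not documented in this README.
df.load_from_caffe_model("lenet.caffemodel");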

C++ code generation from the graph — session->to_cpp() emits compilable C++ that reconstructs the model.
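
A minimal usage sketch, assuming a session pointer as in the snippets below; the file handling around to_cpp() is illustrative:

#include <fstream>

std::ofstream out("generated_model.cpp");
out << session->to_cpp();  // emits C++ that rebuilds the current graph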

Working examples

Face DCGAN — CelebA 128×128

Least-square GAN with a multi-scale generator (skip connections across resolutions) and a discriminator using learned downsampling via lifting nodes. Separate Adam solvers for G and D with independent learning rates.
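
A sketch of the two-solver setup, using the same API as the code sample further down; the discriminator's learning rate here is illustrative:

df.with("generator");
auto g_solver = df.adam_solver(AdamSolverOp("g_adam").lr(0.0002f).beta1(0.5f));

df.with("discriminator");
auto d_solver = df.adam_solver(AdamSolverOp("d_adam").lr(0.0001f).beta1(0.5f));  // illustrative rate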

Face Progressive GAN — CelebA 128×128

Progressive training from 8×8 up to 128×128 with multi-resolution multiplexers controlling staged discriminator/generator growth. The training loop manually advances through 5 resolution stages, toggling multiplexer inputs and switch nodes at each stage — all orchestrated through the DeepFlow graph API.
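
In outline, the stage loop looks something like the sketch below. The apply_solvers/reset_gradients calls are the documented API; the execution call and selector-toggling helper are assumptions standing in for the real phase machinery:

for (int stage = 0; stage < 5; ++stage) {                // 8×8 up to 128×128
    select_resolution(session, stage);                   // hypothetical helper: toggles multiplexer inputs and switch nodes
    for (int iter = 0; iter < iters_per_stage; ++iter) { // iters_per_stage defined elsewhere
        session->forward();                              // assumed execution call
        session->apply_solvers("discriminator");         // update D only
        session->reset_gradients("discriminator");
        session->apply_solvers("generator");             // then update G only
        session->reset_gradients("generator");
    }
}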

Face VAE — CelebA 128×128

Variational autoencoder with a Gaussian sampling node (df.gaussian(mean, sigma)) for the reparameterization trick.
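
A sketch of how the reparameterization wires up in the graph; df.gaussian is the node named above, while the encoder heads, their names, and the dimensions are hypothetical:

// enc: encoder output node; enc_dim, z_dim: hypothetical dimensions.
auto mean  = df.dense(enc, {enc_dim, z_dim, 1, 1}, solver, DenseOp("enc_mean"));
auto sigma = df.dense(enc, {enc_dim, z_dim, 1, 1}, solver, DenseOp("enc_sigma"));

// Sampling node: z = mean + sigma * eps with eps ~ N(0, I), so the
// gradient flows through mean and sigma while the noise stays stochastic.
auto z = df.gaussian(mean, sigma);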

Face autoencoder — CelebA 128×128

Deep autoencoder on 128×128 color faces.

MNIST DCGAN

Least-square GAN on MNIST digits.

MNIST autoencoder

Deep autoencoder on MNIST digits.

MNIST LeNet

Classic LeNet classifier on MNIST.

MNIST spatial transformer

Spatial transformer network on MNIST with learned affine transformations.

Code sample

Building a DCGAN generator in DeepFlow reads like a blueprint of the architecture:

DeepFlow df;
df.with("generator");  // scope every variable created below under "generator"

auto solver = df.adam_solver(AdamSolverOp("g_adam").lr(0.0002f).beta1(0.5f).beta2(0.98f));

// batch = batch size, fn = base feature-map count; both defined elsewhere.
auto node = df.place_holder({batch, 100, 1, 1}, PlaceholderOp("g_input"));  // latent vector z

node = df.dense(node, {100, fn * 4, 4, 4}, solver, DenseOp("gfc").no_bias());  // project z to 4×4 feature maps
node = df.batch_normalization(node, fn * 4, solver, BatchNormalizationOp("gfc_bn"));
node = df.relu(node);

node = df.transposed_conv2d(node, fn * 4, fn * 4, solver, ConvolutionOp("g8").kernel(3).stride(2));  // upsample to 8×8
node = df.batch_normalization(node, fn * 4, solver, BatchNormalizationOp("g8_bn"));
node = df.relu(node);

// ... layers up to 128×128 ...

node = df.transposed_conv2d(node, fn * 4, 3, solver, ConvolutionOp("g128").kernel(3).stride(1));  // final 3-channel RGB output

Separate solver scopes, selective gradient updates, and multiplexer-based training phases are first-class:

session->apply_solvers("discriminator");  // update only D weights
session->reset_gradients("discriminator");

session->apply_solvers("generator");      // then update only G weights

Dependencies

CUDA and cuDNN (all kernels and network primitives)
OpenCV (the display node's live window and image I/O)
Protocol Buffers (model serialization)

Status

This was a personal research project (2017–2018). It is not under active development, but represents a complete, working implementation — the examples above were trained end-to-end on this framework and the results are real.

License

BSD 3-Clause
