A deep learning framework built from scratch in C++ on CUDA and cuDNN. No TensorFlow, no PyTorch, no Caffe under the hood — just a node-based computational graph engine, hand-written CUDA kernels, and working generative models that produce real face images.
128×128 faces generated by a DCGAN trained entirely on the DeepFlow engine (CelebA dataset).
I wanted to understand deep learning at the level below the frameworks — how convolutions are dispatched on a GPU, how gradients propagate through a computational graph, how optimizers update parameters across scoped subgraphs. DeepFlow is the result: a complete, working DL framework that I used to train GANs, VAEs, and classifiers from scratch.
Computational graph engine. Models are built by chaining nodes in a directed graph. Each node manages its own forward and backward passes on the GPU. The graph supports multiple execution phases, scoped parameter blocks (df.with("generator") / df.with("discriminator")), and per-variable solver assignment.
50+ node types with CUDA kernels covering:
| Category | Operations |
|---|---|
| Convolution | conv2d, transposed_conv2d (with stride, padding, dilation) |
| Activations | relu, leaky_relu, sigmoid, tanh, elu, clipped_relu, prelu, dprelu |
| Normalization | batch_normalization, lrn (local response normalization) |
| Pooling & spatial | pooling, spatial_transformer, resize, lifting, patching, patch_sampling |
| Reduction | argmax, argmin, reduce_max, reduce_min, reduce_mean, reduce_sum, reduce_norm1, reduce_norm2 |
| Math | add, subtract, dot, matmul, dense, square, abs, exp, log, square_error, nand, max |
| Selectors | multiplexer, random_selector, switch (for staged/progressive training) |
| Generators | mnist_reader, imbatch (image batch loader), data_generator, text_image_generator |
| I/O | display (live OpenCV window), imwrite, print, logger, psnr |
| Other | dropout, bias_add, softmax, loss, gaussian, gabor_kernel, gaussian_blur, concate, reshape, restructure, replay_memory, batch_stddev, pass_through |
4 optimizers implemented from scratch: Adam, SGD, RMSProp, AdaDelta — each with hand-written CUDA update kernels.
9 weight initializers: random uniform, random normal, truncated normal, constant, fill, zeros/ones, step, three-state, gradient fill.
Model serialization via Protocol Buffers, with Caffe model import support (load_from_caffe_model).
C++ code generation from the graph — session->to_cpp() emits compilable C++ that reconstructs the model.
Least-square GAN with a multi-scale generator (skip connections across resolutions) and a discriminator using learned downsampling via lifting nodes. Separate Adam solvers for G and D with independent learning rates.
Progressive training from 8×8 up to 128×128 with multi-resolution multiplexers controlling staged discriminator/generator growth. The training loop manually advances through 5 resolution stages, toggling multiplexer inputs and switch nodes at each stage — all orchestrated through the DeepFlow graph API.
Variational autoencoder with a Gaussian sampling node (df.gaussian(mean, sigma)) for the reparameterization trick.
Deep autoencoder on 128×128 color faces.
Least-square GAN on MNIST digits.
Classic LeNet classifier on MNIST.
Spatial transformer network on MNIST with learned affine transformations.
Building a DCGAN generator in DeepFlow reads like a blueprint of the architecture:
```cpp
DeepFlow df;
df.with("generator");
auto solver = df.adam_solver(AdamSolverOp("g_adam").lr(0.0002f).beta1(0.5f).beta2(0.98f));
auto node = df.place_holder({batch, 100, 1, 1}, PlaceholderOp("g_input"));
node = df.dense(node, {100, fn * 4, 4, 4}, solver, DenseOp("gfc").no_bias());
node = df.batch_normalization(node, fn * 4, solver, BatchNormalizationOp("gfc_bn"));
node = df.relu(node);
node = df.transposed_conv2d(node, fn * 4, fn * 4, solver, ConvolutionOp("g8").kernel(3).stride(2));
node = df.batch_normalization(node, fn * 4, solver, BatchNormalizationOp("g8_bn"));
node = df.relu(node);
// ... layers up to 128×128 ...
node = df.transposed_conv2d(node, fn * 4, 3, solver, ConvolutionOp("g128").kernel(3).stride(1));
```

Separate solver scopes, selective gradient updates, and multiplexer-based training phases are first-class:
```cpp
session->apply_solvers("discriminator"); // update only D weights
session->reset_gradients("discriminator");
session->apply_solvers("generator");     // then update only G weights
```

- NVIDIA GPU with CUDA support
- CUDA 9.0
- cuDNN v7.1
- OpenCV 3.0
- Protocol Buffers
- glog / gflags
- Visual Studio 2015
This was a personal research project (2017–2018). It is not under active development, but represents a complete, working implementation — the examples above were trained end-to-end on this framework and the results are real.
BSD 3-Clause