Castle Windsor Interceptor Performance

It’s been some time since dynamic proxies were considered new and trendy. Still, the AOP techniques they enable are not used as often as they deserve.

One of the reasons may be the general expectation among developers that all magic costs cycles. Given that the behaviors provided by proxies look rather magical to programmers accustomed to the rigid nature of Java/C#-style OOP, they are often found guilty of slowness by association.

In this short text, I will present speed measurements of a particular implementation of this pattern: Castle Windsor interceptors. They should be just a thin wrapper around DynamicProxy, which is deemed to be quite fast, but let’s not think too much about what should hold for now. We need to be sure that the final product won’t threaten our latency.
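For readers who haven’t used Windsor interceptors, a minimal sketch of the moving parts may help. The type names (`IService`, `Service`, `PassThroughInterceptor`) are illustrative, not the actual types from the measured code, and this requires the Castle.Windsor NuGet package:

```csharp
using Castle.DynamicProxy;
using Castle.MicroKernel.Registration;
using Castle.Windsor;

public interface IService { int Work(); }
public class Service : IService { public int Work() => 42; }

// An interceptor is invoked for every call on the proxied service;
// this one does nothing but forward the call.
public class PassThroughInterceptor : IInterceptor
{
    public void Intercept(IInvocation invocation) => invocation.Proceed();
}

public static class CompositionRoot
{
    public static IWindsorContainer Build()
    {
        var container = new WindsorContainer();
        container.Register(
            // The interceptor itself must be registered as a component.
            Component.For<PassThroughInterceptor>().LifestyleTransient(),
            // Windsor wraps IService in a DynamicProxy that routes calls
            // through the interceptor.
            Component.For<IService>().ImplementedBy<Service>()
                     .Interceptors<PassThroughInterceptor>()
                     .LifestyleTransient());
        return container;
    }
}
```

Resolving `IService` from this container yields a dynamic proxy rather than a bare `Service` instance, which is exactly the overhead being measured below.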

The measurement

BenchmarkDotNet is a fine tool that will handle most of the grunt work for us – it will warm up the code, repeat the test many times, and even count GC events and measure the memory consumption of the system under test.

I configured BenchmarkDotNet to test using both the legacy and the new (RyuJIT) 64-bit .NET JIT compilers.
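A configuration along these lines will run every benchmark under both JITs; the attribute names follow current BenchmarkDotNet, and the exact API may have differed in the version used for the original measurement:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Each [Benchmark] method is measured once per job, i.e. once per JIT.
[LegacyJitX64Job]
[RyuJitX64Job]
[MemoryDiagnoser] // produces the "Gen 0" and "Bytes Allocated/Op" columns
public class InterceptorBenchmarks
{
    [Benchmark]
    public void CallThroughProxy() { /* system under test */ }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<InterceptorBenchmarks>();
}
```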

The benchmark repeatedly applies a single interceptor to a simple service, with varying degrees of what “repeatedly” means. Besides changing the multiplicity, we ask the computer for three things:

  • Call a method behind a dynamic proxy that was already resolved (outside the test case).
  • Resolve the service with a transient lifestyle, then call its method once. (The interceptor itself is transient.)
  • Resolve the service with a singleton lifestyle, then call its method once. (The interceptor itself is transient.)

This should tell us how expensive it is to live with classes behind a proxy (the first case), and how costly resolution is going to become at the composition root (the second and third cases). In a web application setting, Resolve is usually called once per web request.
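The three scenarios might be shaped roughly like this — a hypothetical reconstruction, not the original benchmark code, with `N` standing for the multiplicity (1, 10, 100, 1000) and the container setup elided:

```csharp
using BenchmarkDotNet.Attributes;
using Castle.Windsor;

public interface IService { void Work(); }

public class ResolveBenchmarks
{
    private IWindsorContainer _transientContainer;
    private IWindsorContainer _singletonContainer;
    private IService _resolvedOnce; // proxy resolved outside the test case

    [Params(1, 10, 100, 1000)]
    public int N;

    [GlobalSetup]
    public void Setup()
    {
        // Register IService behind a transient interceptor in both
        // containers, transient lifestyle in one, singleton in the other,
        // and resolve _resolvedOnce up front.
    }

    [Benchmark] // "without Resolve": only the proxied call is measured
    public void CallOnly()
    {
        for (int i = 0; i < N; i++) _resolvedOnce.Work();
    }

    [Benchmark] // "with transient Resolve": a new proxy per Resolve
    public void TransientResolveAndCall()
    {
        for (int i = 0; i < N; i++) _transientContainer.Resolve<IService>().Work();
    }

    [Benchmark] // "with singleton Resolve": Resolve returns the cached instance
    public void SingletonResolveAndCall()
    {
        for (int i = 0; i < N; i++) _singletonContainer.Resolve<IService>().Work();
    }
}
```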

I’ll run it on an Intel Core i7-3770 3.4 GHz processor with enough memory.


This is the result table generated by BenchmarkDotNet for legacy JIT (I removed “Gen 1” and “Gen 2” columns as they are empty):

Method                       | Median            | StdDev          | Gen 0 | Bytes Allocated/Op
1000 without Resolve         | 10,447.7374 ns    | 120.2564 ns     | 0.00  | 54.13
100 without Resolve          | 982.4656 ns       | 6.4550 ns       | 0.00  | 51.11
10 without Resolve           | 140.1337 ns       | 0.3116 ns       | 0.00  | 53.07
1 without Resolve            | 46.5154 ns        | 0.3040 ns       | 0.00  | 51.92
1000 with transient Resolve  | 2,366,683.9889 ns | 354,723.1988 ns | 9.89  | 259,306.51
100 with transient Resolve   | 245,274.9823 ns   | 10,027.2331 ns  | 1.16  | 29,493.19
10 with transient Resolve    | 36,152.1590 ns    | 784.5675 ns     | 0.15  | 3,985.78
1 with transient Resolve     | 14,478.4256 ns    | 146.0388 ns     | 0.06  | 1,432.40
1000 with singleton Resolve  | 10,945.2784 ns    | 103.2060 ns     | 0.01  | 196.44
100 with singleton Resolve   | 1,416.3591 ns     | 36.6199 ns      | 0.01  | 202.55
10 with singleton Resolve    | 579.2193 ns       | 4.8657 ns       | 0.01  | 205.69
1 with singleton Resolve     | 465.4228 ns       | 29.7862 ns      | 0.01  | 185.14
No interceptor (method call) | 1.3443 ns         | 0.0242 ns       | -     | 0.00

These are the results for RyuJIT:

Method                       | Median            | StdDev          | Gen 0 | Bytes Allocated/Op
1000 without Resolve         | 17,714.9282 ns    | 52.5098 ns      | -     | 57.76
100 without Resolve          | 1,675.0514 ns     | 6.3128 ns       | 0.00  | 53.21
10 without Resolve           | 175.4645 ns       | 1.7936 ns       | 0.00  | 51.94
1 without Resolve            | 45.3703 ns        | 0.8649 ns       | 0.00  | 54.65
1000 with transient Resolve  | 1,982,570.9796 ns | 104,373.3124 ns | 10.97 | 287,303.35
100 with transient Resolve   | 211,425.9895 ns   | 9,469.7305 ns   | 1.24  | 31,469.77
10 with transient Resolve    | 30,744.9857 ns    | 830.1701 ns     | 0.14  | 3,725.85
1 with transient Resolve     | 12,995.5876 ns    | 339.3414 ns     | 0.06  | 1,464.94
1000 with singleton Resolve  | 19,442.3397 ns    | 512.6325 ns     | 0.00  | 204.07
100 with singleton Resolve   | 2,168.5428 ns     | 9.7609 ns       | 0.01  | 204.64
10 with singleton Resolve    | 544.7616 ns       | 3.4027 ns       | 0.01  | 185.14
1 with singleton Resolve     | 400.2460 ns       | 2.4580 ns       | 0.01  | 186.78
No interceptor (method call) | 1.0747 ns         | 0.0273 ns       | -     | 0.00

Let’s process the data in Excel to get some slopes using linear regression:

Variant                       | Slope      | Intercept (pun not intended)
Legacy JIT, without Resolve   | 10.44 ns   | 5.70 ns
Legacy JIT, transient Resolve | 2354.96 ns | 11557.28 ns
Legacy JIT, singleton Resolve | 10.51 ns   | 433.77 ns
RyuJIT, without Resolve       | 17.73 ns   | -21.55 ns
RyuJIT, transient Resolve     | 1970.72 ns | 12066.33 ns
RyuJIT, singleton Resolve     | 19.10 ns   | 333.51 ns
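The Excel step is just an ordinary least-squares fit, so it can be reproduced in a few lines of C#. Using the “Legacy JIT, without Resolve” medians from the first table as input recovers the same slope and intercept:

```csharp
using System;

public static class Regression
{
    // Ordinary least-squares fit of y = slope * x + intercept.
    public static (double Slope, double Intercept) Fit(double[] x, double[] y)
    {
        int n = x.Length;
        double sx = 0, sy = 0, sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++)
        {
            sx  += x[i];
            sy  += y[i];
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return (slope, intercept);
    }

    public static void Main()
    {
        // x: call count; y: legacy-JIT "without Resolve" medians in ns.
        double[] calls    = { 1, 10, 100, 1000 };
        double[] medianNs = { 46.5154, 140.1337, 982.4656, 10447.7374 };
        var (slope, intercept) = Fit(calls, medianNs);
        Console.WriteLine($"slope = {slope:F2} ns, intercept = {intercept:F2} ns");
        // Prints: slope = 10.44 ns, intercept = 5.70 ns (matching the table).
    }
}
```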


  • A pure method call costs about 1 ns; the per-interceptor overhead of a method call is about 10 ns. This is negligible under almost all circumstances.
  • Windsor behaves nicely: there is almost no overhead when calling Resolve against a singleton (already constructed) instance.
  • Dynamic proxy construction costs about 2 μs per interceptor. This is something you are probably willing to pay on a per-network-call basis, but you should be careful if you are trying to create any kind of loop around Resolve. Open question: is this cost also included in typed factory facility calls?
  • RyuJIT constructs proxies faster and calls plain methods faster, but (surprisingly) is up to two times slower when simply calling a method behind a proxy.