Message compression formats in a web application

I have seen quite a few web applications that pass heaps of data between client and server. Pretty much all of them use JSON for that purpose. And whilst that solves the business problem, some of those applications aim to provide a nearly real-time user experience, and that's where the issues arise.

With the rise of web sockets, RTC and HTTP/2 tech, the data transfer speed issues might not be so noticeable, but the more data an app tries to transfer, the more pressing the matter gets. The issue also becomes more apparent on the server side - the server has to process a lot more data in a request thread.

At The Trade Desk I have observed an approach where we transfer a list of strings (later converted to integers to widen the bandwidth) between highly loaded services (think 15 million requests per second with a hard time bound on request processing of 100 milliseconds).

I thought this was a cool concept worth exploring deeper, so I searched the Internet for similar approaches.

One of the readings was an article on a Russian resource about speeding up a web application by switching from HTTP/1 to HTTP/2 and utilizing various data compression formats: for static assets - WEBP, AVIF and progressive JPEG; for general data - various stream compression algorithms (GZIP, Brotli and ZSTD); along with different data serialization formats such as MessagePack.

I decided to expand my search in the direction of serialization formats. In this article I compare a few of them, focusing on how they serialize different kinds of data (lists, nested objects, integers and strings, large and small datasets), their serialization & deserialization performance in the browser, and their compression rates.

The formats covered are:

  • Avro
  • BSON
  • CBOR
  • MessagePack
  • Protobuf (both plain and with compiled serializers)
  • FlatBuffers
  • Thrift (binary and compact protocols)

One of the optimizations we achieved at The Trade Desk was a 7x reduction in data size by switching from string IDs to integer values. Think of it this way: the string ID "something-else", which we used to operate on everywhere, was assigned an integer ID of 4092. That int is 4 bytes long, whereas the string value is 14 bytes long. That alone is more than a 3x reduction in size, but you still have to store the mapping somewhere, which we already did (in our case).
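To illustrate, here is a rough sketch with made-up sample IDs (actual savings depend on the format and the ID lengths):

// made-up sample IDs, just to compare byte counts
const stringIds = ["something-else", "another-thing", "one-more-thing"];
const intIds = [4092, 4093, 4094];

// as UTF-8 text (e.g. within JSON), each string ID costs a byte per character,
// plus the quotes and commas around it
const stringBytes = new TextEncoder().encode(JSON.stringify(stringIds)).length;

// a binary format can pack each integer ID into a fixed 4 bytes
const intBytes = new Uint32Array(intIds).byteLength; // 3 * 4 = 12 bytes

console.log({ stringBytes, intBytes });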

But the overall idea is worth researching too - how much compression each format provides when working with long lists of strings and long lists of ints.

To make the thing more interesting, I came up with a few different data types to mess around with (sketched in TypeScript right after the list):

  • a simple object Pet { name: string, kind: enum Kind { CAT, DOG } }
  • an object with an array of string identifiers StringIdsResource { ids: string[] }
  • an object with an array of integer identifiers IntIdsResource { ids: int[] }
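Here is roughly how these shapes look in TypeScript (just a sketch of the test data, not any specific format's schema):

enum Kind { CAT, DOG }

interface Pet {
    name: string;
    kind: Kind;
}

interface StringIdsResource {
    ids: string[];
}

interface IntIdsResource {
    ids: number[]; // JS has no dedicated int type; the binary formats do
}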

As mentioned before, the metrics I’m going to be focusing on are:

  • encoding & decoding performance in browser environment
  • encoded message length (raw, as UTF-8 string and base-64 encoded UTF-8 string) - see the measuring sketch after this list
  • amount of runtime (bundled code) required to work with the format
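For reference, here is a sketch of how the size metrics can be measured, assuming encoded is whatever Uint8Array a given serializer produces:

const measure = (encoded: Uint8Array) => {
    // byte-by-byte conversion to a binary string, suitable for btoa()
    let binary = '';

    for (const byte of encoded) {
        binary += String.fromCharCode(byte);
    }

    return {
        raw: encoded.byteLength,     // encoded data size (byte array)
        base64: btoa(binary).length, // size after base-64 encoding
    };
};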

To make it quick and easy, here’s the summary (time measured in browser, on a nested object with a list of 1000 children of various data types):

Serializer           Encoding time   Decoding time   Encoded data size (byte array)   Compression   Encoded data size (base-64)   Bundle size
Avro                 12ms            4ms             30003                            77.16%        40004                         111.7kb
BSON                 10ms            11ms            98912                            24.71%        131884                        98.0kb
CBOR                 3ms             4ms             89017                            32.24%        118692                        30.1kb
MessagePack          3ms             3ms             89017                            32.24%        118692                        27.7kb
Protobuf             13ms            3ms             38000                            71.07%        50668                         76.6kb
Protobuf, compiled   6ms             1ms             38000                            71.07%        50668                         30.0kb
Flatbuffers          9ms             3ms             32052                            75.60%        42736                         3.1kb
Thrift (binary)      42ms            6ms             45009                            65.74%        60012                         109.7kb
Thrift (compact)     33ms            11ms            36005                            72.59%        48008                         109.7kb

A few learnings:

  • Thrift is quite slow and not that straightforward to use (after all, it was designed to provide an entire communication layer for an application), but it provides a decent compression rate
  • Protobuf and Avro provide by far the most compact output (because of the schema provided)
  • The Protobuf library (protobufjs), unlike Avro, can't handle enumerations (protobufjs requires raw integer values to be used, whereas Avro supports semantic, string values)
  • Cap'n'Proto seems outdated and its JS plugin has not received any support for a few years now; I have yet to check the TS version
  • FlatBuffers is quite low-level and tricky to use (much more effort than the other tools)

My take on these results is that:

  • Protobuf, when using a compiled serializer/deserializer for specific message(-s):
    • has a comfortable API
    • fast serialization / deserialization in browser
    • decent compression rate
    • relatively small bundle size increase
  • Flatbuffers:
    • great compression rate
    • tiny bundle size impact
    • great performance in browser
    • super-cumbersome API (well, that’s low-level trade-offs for ya)

The source code for the tests can be found in my GitHub repo.

Under the cut you will find a stream of consciousness - my notes whilst implementing the tests for these technologies.

Read more

Jargon-free functional programming. Part 2: functional wrappers

Disclaimer

A big disclaimer before diving too deep: I am going to introduce all of the concepts without relying on any frameworks, libraries, specific programming languages and whatnot - just sticking to hand-written TypeScript. This might seem like a lot of boilerplate and overhead for little benefit, but (with a note of philosophy) the biggest benefit is in the cost of detecting and fixing errors in the code:

  • IDE highlighting an error (and maybe even suggesting a fix) - mere seconds of developer’s time
  • Local build (compiling the code locally, before pushing the code to the repository) - minutes, maybe tens of minutes
  • CI server build, running all the tests possible - around an hour
  • Pre-production environment (manual testing on a dedicated QA / staging environment or even testing on production) - around a few hours, may involve other people
  • Production - measured in days or months and risking the reputation with the customers

Hence if we could detect the errors while writing the code the first time - we could potentially save ourselves a fortune measured in both time and money.

Fancy-less introduction to functional programming

The ideas of functional programming are quite simple. In functional programming, the assumption is that every function only operates on the arguments it has been passed and nothing else. It can not change the "outer world" - it can have values temporarily assigned to internal constants, nothing more - take it as if there were no variables. A function should always return the same result for the same arguments, so functions are always predictable, no matter how many times you call them.
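A tiny illustration of the difference (my own sketch, not part of the original problem):

// impure: the result depends on (and changes) something outside the function
let counter = 0;
const impureNext = () => ++counter;

// pure: same argument - same result, no access to the outer world
const pureNext = (current: number) => current + 1;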

That sounds good and nice, but how does that solve the issues of the above problem, you might ask. For the most part the functions we have extracted already comply with the ideas of functional programming - do they not?

Well, not really. For one, fetching the data from the API is a big question mark in terms of how it fits into the picture of functional programming. Leaving the fetching aside, we have a random call in the middle. We also log some output to the console (and thus change the "outer world" outside the printGame function).

For a lot of things like those, functional programming tries to separate the “pure functional operations” and “impure operations”.

See, in functional programming you operate on these "pure functions", which only use constants and inputs to produce their outputs. So a program would be nothing but a chain of function calls. With "simple" functions, returning values which the next function in the chain can take as an input argument, this is quite easy.

Take the code above as an example:

fetchAPIResponse()
    .then(response => getResponseXML(response))
    .then(doc => extractGames(doc))
    .then(games => getRandomTop10Game(games))
    .then(game => printGame(game));

This could have been written as

fetchAPIResponse()
    .then(response =>
        printGame(
            getRandomTop10Game(
                extractGames(
                    getResponseXML(response)
                )
            )
        )
    );

Since each next function in the chain accepts exactly the type the previous function has returned, they all combine quite well.

JFYI: in other languages and some libraries there are operators to combine functions into one big function:

fetchAPIResponse()
    .then(response =>
        _.flow([ getResponseXML, extractGames, getRandomTop10Game, printGame ])(response)
    );

or

fetchAPIResponse()
    .then(response ->
        getResponseXML.andThen(extractGames).andThen(getRandomTop10Game).andThen(printGame).apply(response)
    );

You will understand why this matters in a minute.

There are also functions which need to interact with the outer world. In that case, functional programming suggests that we wrap them in specific constructions and do not run them immediately. Instead, we weave them into the program, describing what would happen to the result of the wrapped function call once we get one. This makes programs "pure functional" and "safe" again (as in not operating outside of the boundaries of the program itself - everything is contained in the function call chains). Then, once we run the program, we enter the world of "unsafe": we execute all those wrapped functions, and run the rest of the code once we get the results in place.

Sounds a bit hard to comprehend.

Let me rephrase this with a few bits of code.

For the problem above, we are trying to get the response of an API somewhere in the outer world. This is said to be an “unsafe” operation, since the data lives outside of the program, so we need to wrap this operation in a “safe” manner. Essentially, we will create an object which describes an intention to run the fetch call and then write our program around this object to describe how this data will be processed down the line, when we actually run the program (and the fetch request together with it).

Let’s go through the thought process all together: we first need a class to wrap an unsafe function without executing it:

class IO {
    constructor(private intentionFunc: Function) {
    }
}

We then need a way to explicitly execute this function when we are ready to do so:

class IO {
    constructor(private intentionFunc: Function) {
    }

    unsafeRun() {
        this.intentionFunc();
    }
}

The last piece: we want to be able to chain this function with some other function in a safe manner (i.e., again, wrapping the chained function):

class IO {
    constructor(private intentionFunc: Function) {
    }

    andThen(func: Function) {
        return new IO(() => func(this.intentionFunc()));
    }

    unsafeRun() {
        this.intentionFunc();
    }
}

Essentially, we save the function we intend to run in the intentionFunc member of the IO class. When we want to describe what would happen to the result of the data, we return a new IO object with a new function - one which wraps the saved function in a call to the chained function. It is important to understand why we return a new object: so that we do not mutate the original object.
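A quick usage sketch (the values are arbitrary) - nothing happens until unsafeRun is called explicitly:

const program = new IO(() => 21)
    .andThen((x: number) => x * 2)
    .andThen((x: number) => console.log(x));

// nothing has been printed so far - program is just a description

program.unsafeRun(); // only now "42" is printed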

You might notice this new IO thing is very similar to the Promise already available in the JS runtime. The similarities are obvious: we also have chaining with the then method, and the call to then also returns a new object.

But the main issue with Promises is that they start running the code you passed in the constructor immediately. And that is exactly the issue we are trying to resolve.
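This is easy to check with a minimal demo:

const promise = new Promise<number>((resolve) => {
    console.log('started!'); // logged immediately, at construction time

    resolve(42);
});

// "started!" has already been printed, even though no .then() was attached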

Now, let us see how we would use this new IO class in the original problem:

new IO(() => fetch(`https://boardgamegeek.com/xmlapi2/hot?type=boardgame`))

That would not work, however, since the fetch call returns a Promise. So we need to somehow work with Promise instances instead. Let me postpone this discussion for a short while.

Spoiler: we could have tried implementing an unpromisify helper which would make the fetch call synchronous, something like this:

const unpromisify = (promiseFn: Function) => {
    const state = { isReady: false, result: undefined, error: undefined };

    promiseFn()
        .then((result) => {
            state.result = result;
            state.isReady = true;
        })
        .catch((error) => {
            state.error = error;
            state.isReady = true;
        });

    while (!state.isReady);

    if (state.error) {
        throw state.error;
    }

    return state.result;
};

But in the JS world, promises start executing not immediately, but once you leave the context of the currently running function. So that endless while loop, waiting for the promise to get resolved, has zero effect: it would run until the end of days, because unless you exit the function beforehand, the promise won't start executing - JS is single-threaded, and the execution queue / event loop prevents the promise from running immediately.
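A minimal demonstration of that very problem:

let done = false;

Promise.resolve().then(() => { done = true; });

// spins forever: the .then callback is queued as a microtask, which can only
// run after the current call stack unwinds - and it never does
while (!done);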

End of spoiler

Read more

Jargon-free functional programming. Part 1: problem statement

Basics

Let me introduce you to functional programming with as little jargon and as few buzzwords as possible.

Shall we start with a simple problem to solve: get a random board game from the top-10 games on the BoardGameGeek website and print out its rank and title.

BoardGameGeek website has an API: a request GET https://boardgamegeek.com/xmlapi2/hot?type=boardgame will return an XML document like this:

<?xml version="1.0" encoding="utf-8"?>

<items termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
    <item id="361545" rank="1">
        <thumbnail value="https://cf.geekdo-images.com/lD8s_SQPObXTPevz-aAElA__thumb/img/YZG-deJK2vFm4NMOaniqZwwlaAE=/fit-in/200x150/filters:strip_icc()/pic6892102.png" />
        <name value="Twilight Inscription" />
        <yearpublished value="2022" />
    </item>

    <item id="276182" rank="2">
        <thumbnail value="https://cf.geekdo-images.com/4q_5Ox7oYtK3Ma73iRtfAg__thumb/img/TU4UOoot_zqqUwCEmE_wFnLRRCY=/fit-in/200x150/filters:strip_icc()/pic4650725.jpg" />
        <name value="Dead Reckoning" />
        <yearpublished value="2022" />
    </item>
</items>

In JavaScript a solution to this problem might look something like this:

fetch(`https://boardgamegeek.com/xmlapi2/hot?type=boardgame`)
    .then(response => response.text())
    .then(response => new DOMParser().parseFromString(response, "text/xml"))
    .then(doc => {
        const items = Array.from(doc.querySelectorAll('items item'));

        return items.map(item => {
            const rank = item.getAttribute('rank');
            const name = item.querySelector('name').getAttribute('value');

            return { rank, name };
        });
    })
    .then(games => {
        const randomRank = Math.floor((Math.random() * 100) % 10);

        return games[randomRank];
    })
    .then(randomTop10Game => {
        const log = `#${randomTop10Game.rank}: ${randomTop10Game.name}`;

        console.log(log);
    });

Quick and simple, quite easy to understand - seems good enough.

How about we write some tests for it? Oh, now it becomes a little bit clunky - we need to mock the fetch call (Fetch API) and Math.random. Oh, and the DOMParser with its querySelector and querySelectorAll calls too. Probably even the console.log method as well. Okay, we will probably need to modify the original code to make testing easier (if even possible). How about we split the program into separate blocks of code?

const fetchAPIResponse = () =>
    fetch(`https://boardgamegeek.com/xmlapi2/hot?type=boardgame`)
        .then(response => response.text());

const getResponseXML = (response) =>
    new DOMParser().parseFromString(response, "text/xml");

const extractGames = (doc) => {
    const items = Array.from(doc.querySelectorAll('items item'));

    return items.map(item => {
        const rank = item.getAttribute('rank');
        const name = item.querySelector('name').getAttribute('value');

        return { rank, name };
    });
};

const getRandomTop10Game = (games) => {
    const randomRank = Math.floor((Math.random() * 100) % 10);

    return games[randomRank];
};

const printGame = (game) => {
    const log = `#${game.rank}: ${game.name}`;

    console.log(log);
};

fetchAPIResponse()
    .then(response => getResponseXML(response))
    .then(doc => extractGames(doc))
    .then(games => getRandomTop10Game(games))
    .then(game => printGame(game));

Okay, now we can test some bits of the program without too much of a hassle - we could test that every call of getRandomTop10Game returns a value from within the given list of values (though not necessarily a different one every time). We could test the extractGames function on a mock XML document and verify it extracts all the <item> nodes and their <name> children. Testing the fetchAPIResponse, getResponseXML and printGame functions, though, would be a bit tricky without either mocking fetch, console.log and DOMParser or actually calling those functions.
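For instance, a test for extractGames could now look something like this (a sketch; the XML fixture and the game name are made up):

const doc = new DOMParser().parseFromString(
    `<items>
        <item id="1" rank="1"><name value="Some Game" /></item>
    </items>`,
    "text/xml"
);

const games = extractGames(doc);

console.assert(games.length === 1);
console.assert(games[0].rank === "1");
console.assert(games[0].name === "Some Game");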

Read more

Jargon-free functional programming. TL;DR

This is a boiled-down version of a much longer read, Jargon-free functional programming, giving a brief and visual introduction to the concepts of real-world functional programming. This blog is aimed at people who already know something about programming and want to learn what the heck functional programming is, how it is different from "normal" programming and what it looks like in the real world.

In a non-functional world, the code we write can depend on anything - a function, aside from its arguments, is free to use environment variables, global variables, outer scope, dependency injection - pretty much anything.

Moreover, it can modify all of the above (including outer scope, global and environment variables, etc.).

In a functional programming world we restrict a function to only rely on its arguments (or nothing at all).

But what about things like databases, user input, network communication, exceptions?

A typical application involving all of the above could be described algorithmically as an endless loop, waiting for some input to appear before doing something (waiting for a database query to complete, waiting for a user to provide input, waiting for a network request to complete).

And every step of the program is described as a sequence of actions (potentially involving some rather trivial decision making). This approach is known as “imperative programming” and is very commonly used.

In reality, however, every step of this algorithm can go wrong in many different ways - each step is free to modify some global state (think OS and filesystem), it can fail terribly with an exception. Moreover, anything from the outside world (think OS or dependency injection) can break into the program and change any value or state of the program.

In a functional programming world, functions (and programs) are not described as sequences of commands - instead, they are more like recipes for calculations that will happen once all the requirements are provided.

The way to handle all the nifty things such as exceptions, networking, databases, etc. is to wrap a function which works with a result of the “unsafe” operation in a safe container. This container won’t execute the function - just hold it for a while. However, this container would have two special properties: an ability to run the underlying function when it is deemed safe and an ability to be connected to other containers of the same type.

Each container will perform its very specific role - handling exceptions to return a value instead of breaking, running some input-output operations (incl. networking and databases), etc. We assume containers already do these operations in a safe manner - meaning they do not change anything in the program outside of themselves (think global variables, outer scope, etc.) and they always return a value. They only execute the function they wrap once requested explicitly.

By making it so that safe containers of different types can not be chained, we eliminate the chance of unexpected program failure. And we make sure at any point in time we can say what a program is doing exactly by just looking at its types.

By connecting such containers in a chain, we make programs.

But these chains do not do anything until they are explicitly executed.

A program is a chain of functions, wrapped in "safe" constructs, which is executed "at the end / edge of the world" - meaning the program is thought to be executed only once. If all the blocks of this chain of containers succeed - the entire program succeeds.

If any of the blocks fails - the program does not exit or terminate; the failed block simply returns a different value.

All the logic is hidden in those “safe” constructs - it is isolated from the rest of the world.

Those containers are only allowed access to their direct arguments. They are guaranteed to never break and to always return a value (which might be wrapped in another "safe" construct).

A program made of these safe recipes on how to calculate the result is just another recipe itself - essentially a series of recipes.

This safe set of recipes is then thrown together with a bunch of inputs into a grinder called “real world”, where nothing is safe and everything can happen (theoretically).

In the grinder, the dish is cooked from the inputs, following the recipes thrown into the grinder. The result of this cooking might be another program itself, which can then be recycled by being thrown back into the grinder - that would happen if a program enters the (infinite) loop, waiting for some inputs - it essentially becomes a new program, which also needs to be executed when all its requirements are met.

In order to build one of those containers, one starts by creating a simple class.

The class must hold a function without running it.

There should be a way to link (chain) this container with some other function, creating a new safe container.

And finally there should be a way to execute the function wrapped by this safe container.

The details of each container's implementation are what make them different. For a few examples, the container which makes an arbitrary function safe (in this case we assume it does some input-output stuff) could look like this:

class IO <A> {
    constructor(private f: () => A) {
    }

    andThen<B>(g: (_: A) => B) {
        return new IO(() => g(this.f()));
    }

    unsafeRun() {
        return this.f();
    }
}

A container which wraps a function returning a Promise might look similar (except all the dancing around Promise API):

class PromiseIO <A> {
    constructor(private readonly f: () => Promise<A>) {}

    andThen<B>(g: (_: A) => B) {
        return new PromiseIO<B>(() => this.unsafeRun().then(g));
    }

    unsafeRun() {
        return this.f();
    }
}

You can see the pattern - these classes all have a very similar interface. Hence you can extract it:

interface Container <A> {
    andThen<B>(g: (_: A) => B): Container<B>;
}

class IO <A> implements Container <A> { ... }

class PromiseIO <A> implements Container <A> { ... }

Then you can create a container which wraps a function which might throw an exception (together with its error handler):

class Try <A> implements Container <A> {
    constructor(private readonly f: () => Container<A>, private readonly errorHandler: (_: unknown) => Container<A>) {}

    andThen<B>(g: (_: A) => B) {
        return new Try<B>(
            () => this.f().andThen(g),
            (e) => this.errorHandler(e).andThen(g)
        );
    }

    unsafeRun() {
        try {
            return this.f();
        } catch (e) {
            return this.errorHandler(e);
        }
    }
}
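A quick usage sketch of Try (the malformed JSON string is just an illustration):

const safeParse = (raw: string) =>
    new Try(
        () => {
            const value = JSON.parse(raw); // may throw right here

            return new IO(() => console.log('parsed:', value));
        },
        (e) => new IO(() => console.error('could not parse:', e))
    );

// the exception is caught and turned into the fallback IO, which we then run
(safeParse('{ oops').unsafeRun() as IO<void>).unsafeRun();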

Then you can write programs using these containers:

const fetchSomeResponse = () => new PromiseIO(() => fetch('/').then(r => r.text()));

const processResponse = (response: string) =>
    new Try(
        () => new IO(() => console.log('OK', response)),
        (e) => new IO(() => console.error('ERR', e))
    );

const program = fetchSomeResponse()
    .andThen(processResponse)
    .andThen(t => t.unsafeRun())
    .andThen(io => (io as IO<void>).unsafeRun())
    .unsafeRun();

The next article contains a few examples and explains the above in bloody detail, using TypeScript, a (semi-)real-world problem and a step-by-step approach to arriving at the above concepts. It also introduces a few more of those containers, so you can actually go out and understand some (if not most) of the real-world applications made with the functional paradigm. Or even build your own!

.gitignore is not ignoring

This is going to be a very short blog. I have been struggling with this issue for a few days now - having a seemingly valid .gitignore file which does not make Git ignore any of the files.

The .gitignore file contents:

node_modules

**/*.bundle.js
*.log

*compiled-proto*
*compiled-bundle*

Running git status:

$ git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        node_modules/
        test1/flatbuffers-compiled-proto/
        test5/index.bundle.js
        test5/test5-avro.bundle.js
        test5/test5-bson.bundle.js
        test5/test5-cbor.bundle.js
        test5/test5-flatbuffers-compiled-proto/
        test5/test5-flatbuffers-compiled.bundle.js
        test5/test5-messagepack.bundle.js
        test5/test5-protobuf-compiled-proto.js
        test5/test5-protobuf-compiled.bundle.js
        test5/test5-protobuf.bundle.js

nothing added to commit but untracked files present (use "git add" to track)

Weird, isn’t it?

There is a command which checks whether a given path (whatever you pass as a parameter to the command, so technically just a string) would be ignored by any of the rules in the .gitignore file:

git check-ignore --verbose <path>

Let’s run it on my repo:

$ git check-ignore --verbose node_modules

No output means the path (in this case - node_modules) will not be ignored. Which should not be the case - there's a rule on the very first line of the .gitignore file, right?!

The issue turns out to be somewhat hidden - the file was saved in the UTF-16 LE encoding, since I had used PowerShell to initialize the file with echo 'node_modules' >> .gitignore.
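JFYI: in Windows PowerShell, output redirection writes UTF-16 LE by default - that is how the file ended up in this encoding in the first place. Something like Set-Content .gitignore (Get-Content .gitignore) -Encoding UTF8 would rewrite it as UTF-8 right from the shell.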

It seems the expected encoding for a .gitignore file is UTF-8. Let's use VSCode itself to save it in UTF-8 instead and try again:

$ git check-ignore --verbose node_modules
.gitignore:1:node_modules       node_modules

That output tells us that the rule on line 1, namely node_modules (no asterisks or slashes), ignores the path node_modules. Issue solved!

Distributed Erlang example

As promised in my previous blog about Erlang, I continue on a journey to more practical Erlang examples.

This might sound super ambitious, but let us build a distributed database.

For the sake of simplicity, let's make it a reduced version of Redis - a key-value distributed in-memory storage. Essentially, an over-engineered hashmap.

To draw some boundaries around this, let’s focus on these key features:

  • simple CRUD actions - get, set and delete a value; this should operate on REST-like API:
    • get(<key>) - get key value
    • set(<key>) - use request body for value and associate the value with the key
    • delete(<key>) - remove the key from the storage
  • running on multiple nodes (machines)
  • synchronizing the data between the nodes; this should support:
    • ability to rotate the nodes in the cluster (switch off nodes and add new ones randomly) without the loss of data
    • distributing the operations across the nodes to keep the data in sync

Sounds unbelievable, but with Erlang this is actually pretty simple. Erlang is a great choice for an application like this, since we do not have to worry about setting up the cluster or dealing with communication protocols (as in how to pass data over the network between the nodes).

We will need a “main” process, which will take the requests from the users and pass them to the nodes. Each node will store its own copy of the data in memory. On startup, each new node will receive the copy of the data from the first available node. If no nodes are available - that means the cluster is fresh and we can safely assume the data is empty (or the cluster has died altogether and that state is unrecoverable).

Step 1: draw a circle

Start simple: let us have a simple application which starts a background process and responds to some commands. It could be as simple as a ping-pong application - the background process waits for a ping message and responds with pong upon receiving one.

-module(db_server).

-export([start/0, stop/0, ping/0]).

start() ->
    Pid = spawn_link(fun () -> loop() end),

    register(server, Pid),

    ok.

stop() ->
    server ! stop,

    ok.

ping() ->
    server ! { ping, self() },

    receive
        pong -> pong
    end.

loop() ->
    receive
        { ping, Pid } ->
            Pid ! pong,
            loop();

        stop -> ok
    end.

This is a relatively simple application, but it has a few quirks:

  • spawn_link/1 (spawn_link(fun () -> loop() end)) - this function spawns a new background process and links it to the current one, so that once the current process stops, the linked process also stops
  • register/2 (register(server, Pid)) - registers a global (in the current module scope) symbol (server in this case) and associates it with the process PID passed as the second parameter (Pid in this case, the PID of the spawned process running the loop function)
  • the ping/0 function sends a message ({ ping, self() } tuple) containing the atom ping and current process’ PID to the process registered as server; it then starts waiting for any process to send a pong message to the current process, effectively blocking the current process until that message is received
  • the loop function responds to two messages:
    • { ping, Pid } - by sending pong message to the process with PID set in the Pid constant and running itself recursively (turning the entire loop/0 function into a while loop)
    • stop - by simply returning the ok atom and exiting (thus terminating the loop)

To run this application, start the Erlang shell with erl and type in a few commands:

1> c(db_server).
{ok,db_server}
2> db_server:start().
ok
3> db_server:ping().
pong

The benefit of this architecture is that once we have an Erlang shell open and run the start/0 function, it will start the background process and return immediately. But the background process will keep running, and we will have the other functions available to interact with that process, without losing access to the Erlang shell.

This protocol also highlights the concurrent aspect of the application - we send the current process’ PID to the callee and start listening for the messages from anywhere. Once we receive the message - we pass it to the other process. So the entire application is predominantly non-blocking and asynchronous in nature.

Step 2: draw a second circle

The application is surely awesome, but let's make it actually useful to some extent. We are designing a key-value storage, so why not have one? Instead of just reacting to the ping message, we can have a map and a simple API to set, get and delete a value in that map, while keeping the map on the child process.

-module(db_server).

-export([ start/0, stop/0 ]).
-export([ get/1, set/2, delete/1 ]).

start() ->
    Pid = spawn_link(fun () -> loop(maps:new()) end),

    register(server, Pid),

    ok.

stop() ->
    server ! stop,

    ok.

set(Key, Value) ->
    server ! { set, Key, Value },

    ok.

get(Key) ->
    server ! { get, Key, self() },

    receive
        { get, Value } -> Value
    end.

delete(Key) ->
    server ! { delete, Key },

    ok.

loop(Data) ->
    receive
        { set, Key, Value } ->
            NewData = maps:put(Key, Value, Data),

            loop(NewData);

        { get, Key, Pid } ->
            Pid ! { get, maps:get(Key, Data, none) },

            loop(Data);

        { delete, Key } ->
            NewData = maps:remove(Key, Data),

            loop(NewData);

        stop -> ok
    end.

As you can see, there’s nothing special going on here - all is quite simple. We start the background process with some initial data (maps:new()) and in the loop/1 function whenever we enter recursion, we pass the new version of the data to the recursive call. The only function which actually is blocking its execution until it receives a response from a callee is get/1 - we wait for the background process to send the value to the current process.

The interaction with this application could look like this:

1> c(db_server).
{ok,db_server}
2> db_server:start().
ok
3> db_server:set(moo, -3.14).
ok
4> db_server:get(moo).
-3.14
5> db_server:delete(moo).
ok
6> db_server:get(moo).
none

Step 3: draw the rest of the owl

Now for the juicy bits: let us have a worker node, which will actually be a replica of the database. It is as simple as extracting the logic from the loop/1 function into a separate module and having a new process started on a new node once we decide to add it to the cluster:

-module(worker).

-export([ loop/1 ]).

loop(Data) ->
    receive
        { set, Key, Value } ->
            NewData = maps:put(Key, Value, Data),

            loop(NewData);

        { get, Key, Pid } ->
            Pid ! { get, maps:get(Key, Data, none) },

            loop(Data);

        { delete, Key } ->
            NewData = maps:remove(Key, Data),

            loop(NewData);

        stop -> ok
    end.

The db_server module needs a few changes:

-module(db_server).

-export([ start/0, stop/0 ]).
-export([ get/1, set/2, delete/1 ]).
-export([ add_node/1 ]).

start() ->
    Pid = spawn_link(fun () -> loop([]) end),

    register(server, Pid),

    ok.

stop() ->
    server ! stop,

    ok.

set(Key, Value) ->
    server ! { set, Key, Value },

    ok.

get(Key) ->
    server ! { get, Key, self() },

    receive
        { get, Value } -> Value
    end.

delete(Key) ->
    server ! { delete, Key },

    ok.

add_node(NodeAddr) ->
    server ! { add_node, NodeAddr },

    ok.

loop(Nodes) ->
    receive
        { set, Key, Value } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { set, Key, Value } end, Nodes),

            loop(Nodes);

        { get, Key, Pid } ->
            [{ _, FirstNodePid }|_] = Nodes,

            FirstNodePid ! { get, Key, self() },

            receive
                { get, Value } -> Pid ! { get, Value }
            end,

            loop(Nodes);

        { delete, Key } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { delete, Key } end, Nodes),

            loop(Nodes);

        { add_node, NodeAddr } ->
            Data = maps:new(),

            NodePid = spawn(NodeAddr, worker, loop, [ Data ]),

            monitor(process, NodePid),

            NewNodes = lists:append([ { NodeAddr, NodePid } ], Nodes),

            loop(NewNodes);

        stop -> ok
    end.

The public API of the db_server module does not really change - there’s a new function add_node/1 but that’s about it - these exported (aka “public”) functions only send messages to the server process.

The loop/1 function, on the other hand, has most of the changes - instead of sending messages to a single process, it now distributes them across the processes stored in the Nodes constant. This constant is a list of tuples { NodeAddr, NodePid }, where NodeAddr (the first element of the tuple) is just to keep track of the nodes added to the list (predominantly for debug purposes) and NodePid (the second element of the tuple) is what is actually used to communicate with the worker process.

All the control messages (get, set and delete) are passed to all nodes from the Nodes list. The exception is the { get, Key, Pid } message - it is only sent to the first node in the list. And since all the worker nodes are (supposed to be) just replicas of each other, that is sufficient.

So far this should work. Start the db_server node with

$ erl -sname supervisor

followed by the call to

(supervisor@MACHINENAME)> c(db_server).
{ok,db_server}
(supervisor@MACHINENAME)> db_server:start().
ok

Then, add a few nodes to the cluster. For that, each node must be running and must use the same naming convention as the supervisor (a short name, set with the -sname param to erl, or a long name, set with the -name param to erl) - otherwise they won't see each other.

$ erl -sname subnode1

followed by

(subnode1@MACHINENAME)> c(worker).
{ok,worker}

Note how you don’t need to do anything on the worker node except compile the module. The supervisor node will start processes on that node itself.

Lastly, on the supervisor node, call

(supervisor@MACHINENAME)> db_server:add_node(subnode1@MACHINENAME).
ok

The node name is printed out to the Erlang shell once you start erl with the -sname or -name param provided. But you can also add io:format("Started node at ~s~n", [ node() ]) as the first statement in the loop/1 function in the worker module - it will print out the node name to the Erlang shell once the function is started. You can then use it as a raw atom parameter to db_server:add_node/1.

Now, the last piece of the system is keeping all the worker nodes in sync by copying the data from the first alive node to a new node once it is added to the supervisor.

To do so, we will need to utilize the net_adm module, which conveniently has the net_adm:ping/1 function, taking a node address and returning the pong atom if the node is alive and visible from the current node, or pang otherwise. This way we can replace all the occurrences of the Nodes list in the db_server:loop/1 function with the list of alive nodes:

% supervisor
loop(Nodes) ->
    AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

    receive
        { set, Key, Value } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { set, Key, Value } end, AliveNodes),

            loop(AliveNodes);

        { get, Key, Pid } ->
            [ { _, FirstNodePid } | _ ] = AliveNodes,

            FirstNodePid ! { get, Key, self() },

            receive
                { get, Value } -> Pid ! { get, Value }
            end,

            loop(AliveNodes);

        { delete, Key } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { delete, Key } end, AliveNodes),

            loop(AliveNodes);

        { add_node, NodeAddr } ->
            Data = maps:new(),

            NodePid = spawn(NodeAddr, worker, loop, [ Data ]),

            monitor(process, NodePid),

            NewNodes = lists:append([ { NodeAddr, NodePid } ], AliveNodes),

            loop(NewNodes);

        stop -> ok
    end.

Now to get the data from the node, we will need an extra message to be supported by the worker module:

% worker
loop(Data) ->
    receive
        { set, Key, Value } ->
            NewData = maps:put(Key, Value, Data),

            loop(NewData);

        { get, Key, Pid } ->
            Pid ! { get, maps:get(Key, Data, none) },

            loop(Data);

        { delete, Key } ->
            NewData = maps:remove(Key, Data),

            loop(NewData);

        { get_all, Pid } ->
            Pid ! { get_all, Data },

            loop(Data);

        stop -> ok
    end.

This way, when we want to add a new node to the cluster, we first get the data from the first available node and then send it to the newly added node in the db_server module. If there are no nodes alive, we use maps:new() as the new data:

% supervisor
loop(Nodes) ->
    AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

    receive
        { set, Key, Value } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { set, Key, Value } end, AliveNodes),

            loop(AliveNodes);

        { get, Key, Pid } ->
            [ { _, FirstNodePid } | _ ] = AliveNodes,

            FirstNodePid ! { get, Key, self() },

            Value = receive
                { get, ReceivedValue } -> ReceivedValue
            end,

            Pid ! { get, Value },

            loop(AliveNodes);

        { delete, Key } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { delete, Key } end, AliveNodes),

            loop(AliveNodes);

        { add_node, NodeAddr } ->
            Data = case AliveNodes of
                [ { _, FirstNodePid } | _ ] ->
                    FirstNodePid ! { get_all, self() },

                    receive
                        { get_all, NewData } -> NewData;

                        _ -> maps:new()
                    end;

                _ -> maps:new()
            end,

            NodePid = spawn(NodeAddr, worker, loop, [ Data ]),

            monitor(process, NodePid),

            NewNodes = lists:append([ { NodeAddr, NodePid } ], AliveNodes),

            loop(NewNodes);

        stop -> ok
    end.

You may practice putting nodes on and off the running cluster, getting and setting the values at the same time. But occasionally you might stumble upon the db_server:get/1 function hanging. This is what happens when there are no alive nodes in the cluster. In order to fix this, we can simply add a check for that instead of blindly taking the first node from the AliveNodes list:

% supervisor
loop(Nodes) ->
    AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

    receive
        { set, Key, Value } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { set, Key, Value } end, AliveNodes),

            loop(AliveNodes);

        { get, Key, Pid } ->
            Value = case AliveNodes of
                [ { _, FirstNodePid } | _ ] ->
                    FirstNodePid ! { get, Key, self() },

                    receive
                        { get, ReceivedValue } -> ReceivedValue
                    end;

                _ -> no_nodes
            end,

            Pid ! { get, Value },

            loop(AliveNodes);

        { delete, Key } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { delete, Key } end, AliveNodes),

            loop(AliveNodes);

        { add_node, NodeAddr } ->
            Data = case AliveNodes of
                [ { _, FirstNodePid } | _ ] ->
                    FirstNodePid ! { get_all, self() },

                    receive
                        { get_all, NewData } -> NewData;

                        _ -> maps:new()
                    end;

                _ -> maps:new()
            end,

            NodePid = spawn(NodeAddr, worker, loop, [ Data ]),

            monitor(process, NodePid),

            NewNodes = lists:append([ { NodeAddr, NodePid } ], AliveNodes),

            loop(NewNodes);

        stop -> ok
    end.

This will still not work, however. The issue is that we only assign the value to AliveNodes once we enter the db_server:loop/1 function, and that only happens after it has processed a message. But the moment the value is actually needed is right before processing a message. See the difference? There might be a delay between receiving a message and figuring out the list of actually alive nodes, because the process sits blocked waiting for new messages.

One way to fix this is to move the AliveNodes assignment to each branch of the receive block:

% supervisor
loop(Nodes) ->
    receive
        { set, Key, Value } ->
            AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { set, Key, Value } end, AliveNodes),

            loop(AliveNodes);

        { get, Key, Pid } ->
            AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

            Value = case AliveNodes of
                [ { _, FirstNodePid } | _ ] ->
                    FirstNodePid ! { get, Key, self() },

                    receive
                        { get, ReceivedValue } -> ReceivedValue
                    end;

                _ -> no_nodes
            end,

            Pid ! { get, Value },

            loop(AliveNodes);

        { delete, Key } ->
            AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { delete, Key } end, AliveNodes),

            loop(AliveNodes);

        { add_node, NodeAddr } ->
            AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

            Data = case AliveNodes of
                [ { _, FirstNodePid } | _ ] ->
                    FirstNodePid ! { get_all, self() },

                    receive
                        { get_all, NewData } -> NewData;

                        _ -> maps:new()
                    end;

                _ -> maps:new()
            end,

            NodePid = spawn(NodeAddr, worker, loop, [ Data ]),

            monitor(process, NodePid),

            NewNodes = lists:append([ { NodeAddr, NodePid } ], AliveNodes),

            loop(NewNodes);

        stop -> ok
    end.

Alternatively, we can utilize the after keyword which effectively implements a timeout mechanism:

% supervisor
loop(Nodes) ->
    receive
        { set, Key, Value } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { set, Key, Value } end, Nodes),

            loop(Nodes);

        { get, Key, Pid } ->
            Value = case Nodes of
                [ { _, FirstNodePid } | _ ] ->
                    FirstNodePid ! { get, Key, self() },

                    receive
                        { get, ReceivedValue } -> ReceivedValue;

                        _ -> none
                    end;

                _ -> no_nodes
            end,

            Pid ! { get, Value },

            loop(Nodes);

        { delete, Key } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { delete, Key } end, Nodes),

            loop(Nodes);

        { add_node, NodeAddr } ->
            Data = case Nodes of
                [ { _, FirstNodePid } | _ ] ->
                    FirstNodePid ! { get_all, self() },

                    receive
                        { get_all, NewData } -> NewData;

                        _ -> maps:new()
                    end;

                _ -> maps:new()
            end,

            NodePid = spawn(NodeAddr, worker, loop, [ Data ]),

            monitor(process, NodePid),

            NewNodes = lists:append([ { NodeAddr, NodePid } ], Nodes),

            loop(NewNodes);

        stop -> ok
    after
        0 ->
            AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

            loop(AliveNodes)
    end.

This makes the receive expression of the db_server:loop/1 function check if there are any messages in the message inbox (aka message queue) of the current process. If there are no messages after 0 (as in this case) units of time since the check started, the code in the corresponding after clause is executed. In our scenario, we figure out the new list of alive nodes and restart the loop function. This way, the time the process sits blocked waiting for messages is kept to a bare minimum.

On the other hand, if you look carefully at all those receive statements within the loop function, you might just notice… that you don't actually need them - the worker nodes can just communicate with each other! So instead of sending the supervisor's self() PID to each worker node (e.g. for the get or get_all message), you can just send the requester's PID (which is a part of the message) to the worker node. This makes the code highly asynchronous:

% supervisor
loop(Nodes) ->
    AliveNodes = lists:filter(fun ({ NodeAddr, _ }) -> net_adm:ping(NodeAddr) == pong end, Nodes),

    receive
        { set, Key, Value } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { set, Key, Value } end, AliveNodes),

            loop(AliveNodes);

        { get, Key, Pid } ->
            case AliveNodes of
                [ { _, FirstNodePid } | _ ] -> FirstNodePid ! { get, Key, Pid };

                _ -> Pid ! { get, no_nodes }
            end,

            loop(AliveNodes);

        { delete, Key } ->
            lists:foreach(fun ({ _, NodePid }) -> NodePid ! { delete, Key } end, AliveNodes),

            loop(AliveNodes);

        { add_node, NodeAddr } ->
            NodePid = spawn(NodeAddr, worker, loop, [ maps:new() ]),

            monitor(process, NodePid),

            NewNodes = lists:append([ { NodeAddr, NodePid } ], AliveNodes),

            case AliveNodes of
                [ { _, FirstNodePid } | _ ] -> FirstNodePid ! { get_all, NodePid };

                _ -> NodePid ! { got_all, maps:new() }
            end,

            loop(NewNodes);

        stop -> ok
    end.

This needs a tiny adjustment to the worker module (to support the { got_all, Data } message and to reply to get_all with got_all), but the benefits are more than worth it!

% worker
loop(Data) ->
    receive
        { set, Key, Value } ->
            NewData = maps:put(Key, Value, Data),

            loop(NewData);

        { get, Key, Pid } ->
            Pid ! { get, maps:get(Key, Data, none) },

            loop(Data);

        { delete, Key } ->
            NewData = maps:remove(Key, Data),

            loop(NewData);

        { get_all, Pid } ->
            Pid ! { got_all, Data },

            loop(Data);

        { got_all, NewData } ->
            loop(NewData);

        stop -> ok
    end.

Wrap-up

The cherry on top of this cake would be to create an actual REST API (or any non-Erlang API for that matter) so that this database could be used from the outside world without starting an Erlang shell. But I’d rather save it for a later blog.

For now you can actually plug and unplug nodes from the cluster, and while there is at least one node alive, the data integrity is guaranteed.

The highly asynchronous architecture of the solution allows for non-blocking calls all over the place.

For me, this is a great example of how to actually use Erlang!

The code, built up slightly further, is hosted on GitHub.

What is a Monad?

What is a monad?

There is a talk by Gilad Bracha, “Deconstructing functional programming”. This is a very good introduction to functional programming, monads included.

As per that talk, think about a monad as a class like this:

abstract class FlatMappable <A> {
  private final A a;

  FlatMappable(A a) {
    this.a = a;
  }

  <B> FlatMappable<B> flatMap(Function<A, FlatMappable<B>> fn) {
    return fn.apply(a);
  }
}

Just rename FlatMappable to Monad and there you go.

Now, if you want more Haskell-like naming (gonna implement it in C++ for a more explicit syntax):

#include <functional>

template <typename A>
class Monad {
public:
  // `return` is a reserved word in C++, hence the trailing underscore
  static Monad<A> return_(A a) {
    return Monad(a);
  }

  template <typename B>
  Monad<B> operator>>=(std::function<Monad<B>(A)> fn) {
    return fn(m_a);
  }

private:
  Monad(A a) : m_a(a) {}

  const A m_a;
};

Essentially, renaming the constructor to return (return_ in the code above, since return is a reserved word in C++) and flatMap to operator>>=.

Also, in terms of Mappable vs FlatMappable:

class Mappable <A> {
  protected final A a;

  Mappable(A a) { this.a = a; }

  <B> Mappable<B> map(Function<A, B> fn) {
    return new Mappable<>(fn.apply(a));
  }
}

class Flattable <A> extends Mappable <A> {
  Flattable(A a) { super(a); }

  Flattable<A> flatten() {
    if (a instanceof Flattable) {
      return ((Flattable<A>) a).flatten();
    }

    return this;
  }
}

class FlatMappable <A> extends Flattable <A> {
  FlatMappable(A a) { super(a); }

  <B> FlatMappable<B> flatMap(Function<A, FlatMappable<B>> fn) {
    return (FlatMappable<B>) map(fn).flatten();
  }
}

Why do we need monads?

In order to preserve purity of programs in functional programming languages, you can not have side-effects in your program.

But if you need interaction with outside systems (IO, database, system clock, random number generator, etc.), you will need to somehow make these operations without side-effects.

So you write your entire program as a description of a workflow (e.g. how data will be processed). Basically your program becomes a chain of functions calling other functions. No waiting for events, nothing like that.

Then, when you run your program, “at the execution edge” (aka “event horizon”), just before your program finishes, the runtime will do all the “unsafe” (or rather “impure”) operations (interaction with IO, system clock, etc.) and will feed all the input to your functions, take the output and modify “the state of the world outside”.

Where do monads fit in this? Monads are basically just wrappers around other values, telling the runtime what to do. So your code stays pure, and all the impure side-effects are eventually handled by the runtime.

For example, reading from STDIN would be something like IO<String> - a monad, whose value will be provided eventually. So your program will be defined as a function something like

def main(IO[String] input): IO[String] =
  input.map { name -> printLn("Hello, ${name}") }

As you can see, main is a function which simply returns another IO[String] by mapping a monad parameter.

When you run it, the program will effectively stop until the IO[String] input parameter is filled with a value. When the value is provided, the runtime will execute the rest of the code (the map { ... } part).

irrPaint3d

Recently I was reviving some of my old projects. And to my surprise, there were some people actually interested in those! That was a good enough reason for me to revise the old pet projects of mine and exercise my skills (mostly in C++ though).

One really interesting project I had back in the day was irrPaint3d. It all started as an idea for my B.Sc. thesis, and the ultimate goal was to enable users to paint 3D objects and immediately see the results in 3D, in real time.

This is not a unique feature nowadays, but back in 2013, to my knowledge, the only ways to texture 3D models were to either unwrap them into UV maps and then paint the textures in a graphics editor (such as Gimp or Paint.NET), or to use proprietary software, 3D Coat.

Unwrapping 3D model texture in Blender in 2013

And the mere idea of implementing such a tool myself was quite exciting. Darn, if I had managed to implement it back in 2013, my thesis would have shone among the best theses in my uni!

Long story short, I failed with that. And now, after spending just a few days on this, I am happy to say I have achieved something with yet another glorious pet project revival:

Revised 3D model painting

And it actually allows painting the model in 3D, displaying the results in real time!

A bit of history on how I got there, and the maths behind the solution, under the cut.

Read more

Irrlicht application template

Often when I start revising my old applications built with the Irrlicht engine, I do a few things very similarly. Especially when the application contains a GUI and uses Irrlicht tools for it.

This is mostly due to the extremely outdated nature of Irrlicht itself. There are all sorts of things in it, like the lack of 4K monitor support, the use of a very obscure font, sample applications being simple yet messy single CPP files, etc.

The common things I do include:

  • using new C++ standard features such as:
    • shared pointers
    • automatic type inference
    • standard containers
    • C++-style string operations
  • setting a new font for the GUI, adapted to higher screen resolutions
  • using CMake to build the project and vcpkg to manage dependencies
  • utilizing object-oriented approach
  • moving the classes to separate header and CPP files

In this blog I describe in bloody detail what I do and why.

Read more

Strongly-typed front-end

Serial experiments Lain

In this little research project I describe my journey through a series of experiments trying out a number of technologies and clashing them against each other.