BLOOM 176B — how to run a real LARGE language model in your own cloud?

How to set up BLOOM on your own cloud.

BLOOM is the "BigScience Large Open-science Open-access Multilingual Language Model":

  • free transformer-based language model created by 1000 researchers
  • trained on about 1.6 TB of pre-processed multilingual text
  • the biggest BLOOM model has 176B parameters, roughly GPT-3 scale
  • smaller models available: 7b, 3b, 1b7
  • needs 360 GB of RAM, but "Microsoft has provided a downsampled variant with INT8 weights (from original FLOAT16 weights) that runs on the DeepSpeed Inference engine and uses tensor parallelism.... tensors are split into 8 shards. So ... absolute model size is reduced and ... split and parallelized and can thus be distributed over 8 GPUs."
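
The memory figures in the list above can be sanity-checked with a back-of-the-envelope sketch. It counts weights only (no activations or runtime overhead), assumes 2 bytes per FLOAT16 parameter and 1 byte per INT8 parameter, and uses decimal gigabytes. The function name is made up for illustration.

```python
def weight_footprint_gb(n_params: float, bytes_per_param: int) -> float:
    """Size of the raw weights in decimal GB (weights only, no runtime overhead)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 176e9   # BLOOM 176B
N_SHARDS = 8       # the INT8 variant is split into 8 tensor-parallel shards

fp16_gb = weight_footprint_gb(N_PARAMS, 2)   # original FLOAT16 weights
int8_gb = weight_footprint_gb(N_PARAMS, 1)   # INT8 variant
per_gpu_gb = int8_gb / N_SHARDS              # weights held by each of the 8 GPUs

print(f"FP16: {fp16_gb:.0f} GB, INT8: {int8_gb:.0f} GB, per GPU: {per_gpu_gb:.0f} GB")
```

Under these assumptions the FLOAT16 weights come to about 352 GB, in line with the "360 GB" figure, and the INT8 weights to about 176 GB, or roughly 22 GB per GPU once split into 8 shards.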

The author then provides instructions for hosting on AWS as "it provides a SageMaker setup for a Deep Learning container capable of initializing the model... [but] get the instances through support, you can’t do it by self configuring... hosted model can be loaded from the Microsoft repository on Huggingface into an S3 ... [it's] 180 GB... costs about $32 per hour when running... you can start it in about 18 min, shutting down and freeing the resources takes seconds... We put a custom API gateway and lambda function in the interface on top of the Sagemaker endpoint that allows users to connect externally with an API key".
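
The article's actual Lambda code is not quoted here, so the following is only a rough sketch of what such a function between API Gateway and the SageMaker endpoint could look like. It assumes an API Gateway proxy integration (request body arrives as a JSON string in event["body"]) and that the API key check is done by API Gateway itself, not in the function. The endpoint name, the "prompt" field and the {"inputs": ...} payload shape are hypothetical and depend on how the container was set up.

```python
import json
import os

# Hypothetical endpoint name; the real one is whatever was chosen at deployment.
ENDPOINT_NAME = os.environ.get("BLOOM_ENDPOINT_NAME", "bloom-176b-int8")


def lambda_handler(event, context, runtime=None):
    """Forward a prompt from API Gateway to the SageMaker endpoint."""
    if runtime is None:
        # Imported lazily so the sketch can be tried locally with a stub client.
        import boto3
        runtime = boto3.client("sagemaker-runtime")

    # With a proxy integration the request body is a JSON string (or None).
    payload = json.loads(event.get("body") or "{}")
    prompt = payload.get("prompt", "")
    if not prompt:
        return {"statusCode": 400, "body": json.dumps({"error": "missing 'prompt'"})}

    # The {"inputs": ...} shape is an assumption about the serving container.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),
    )
    generated = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": generated}
```

The optional runtime argument is only there so the handler can be exercised without AWS credentials; in Lambda it is left unset and the boto3 client is created on the first call.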

