Weekly paper roundup: Design2Code, automating front-end engineering (3/4/2024)

Spotlight

Design2Code: How Far Are We From Automating Front-End Engineering?

Authors: Stanford, Georgia Tech, Microsoft, Google DeepMind

Project page: Design2Code: How Far Are We From Automating Front-End Engineering

Summary

How well do multimodal LLMs such as GPT-4V work at converting designs to code? This paper introduces a benchmark of 484 real-world webpages as test cases, along with both automatic and human-evaluation metrics. Three models are then compared on it: GPT-4V, Gemini Pro Vision (GPV), and the open-source Design2Code-18B, with GPT-4V emerging as the clear winner.

Details

Generative AI for software is an area that we at the AI2 Incubator track closely (see Insights #10 and Insights #13). Turning a design (e.g., hand-drawn or in Figma) into front-end code (HTML/CSS/JavaScript) is one example use case of GenAI for software. Products such as Vercel’s V0, Locofy, Anima, and Builder.io have been tackling this problem, leveraging models such as GPT-4V. Yet we know little about how well such products work in practice, given the lack of benchmarks and studies. This paper contributes the first benchmark (data and evaluation metrics) and uses it to compare three models: GPT-4V, Gemini Pro Vision, and Design2Code-18B. It’s interesting to note that the author list includes researchers from Microsoft, a close partner of GPT-4V’s creator OpenAI, and from Google, the creator of Gemini Pro Vision.
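
To make the task concrete, here is a minimal sketch of what a single design-to-code call might look like using OpenAI’s Python client. The prompt wording, the `design_to_html` helper, and the model name are our own illustrative choices, not the paper’s actual setup.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def design_to_html(screenshot_path: str) -> str:
    # Encode the design screenshot so it can be sent inline with the prompt.
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative; any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Generate a single self-contained HTML file (inline "
                         "CSS, no JavaScript) that reproduces this webpage "
                         "screenshot as faithfully as possible."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=4096,
    )
    return response.choices[0].message.content
```

Below are some insights that we highlight.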

  • The benchmark focuses on visual fidelity and eschews the challenging problem of measuring the accuracy of dynamic interactions, which are typically enabled by JavaScript (a rough sketch of one automatic visual-fidelity metric appears after this list).
  • GPT-4V emerges as the clear winner. The gap between it and the other models is even larger in human evaluation.
  • With text-augmented prompting, i.e. providing the design’s text as part of the prompt, both GPT-4V and GPV improve their scores by a few percentage points.
  • With self-revision, i.e. giving the model a chance to look at its own output and make another generation pass, GPT-4V improves slightly, whereas GPV shows no improvement. The authors note that prior work has shown that LLMs cannot yet reliably self-correct. Sketches of both prompting techniques appear after this list.
  • Fine-tuning makes a huge difference for open-source models. Design2Code-18B was fine-tuned from CogAgent-18B and reaches the performance of GPV.
  • Webpages generated by GPT-4V are preferred to the reference versions in 64% of the cases. The authors hypothesize that the model may have internalized modern and popular webpage design principles, allowing it to automatically improve on the original design. This also opens up many new opportunities for future work on website design improvement tools.
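
On the visual-fidelity evaluation mentioned above: one natural way to score high-level visual similarity is to compare CLIP embeddings of the reference and generated page screenshots. The sketch below is our own rough illustration of that idea using an off-the-shelf checkpoint; the benchmark’s exact metrics (which also include low-level checks) may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visual_similarity(reference_png: str, generated_png: str) -> float:
    # Cosine similarity between CLIP embeddings of the two page screenshots.
    # A rough proxy for high-level visual fidelity; low-level properties
    # (text content, element position, color) would be scored separately.
    images = [Image.open(reference_png), Image.open(generated_png)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[0] @ emb[1]).item()
```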
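
The two prompting techniques are also easy to picture as prompt templates. The function names and wording below are hypothetical illustrations of the idea, not the paper’s actual prompts.

```python
def text_augmented_prompt(design_text: str) -> str:
    # Supply the design's text verbatim so the model does not have to
    # read it off the screenshot.
    return ("Generate HTML/CSS reproducing the attached screenshot. "
            "Use exactly these text elements:\n" + design_text)

def self_revision_prompt(first_attempt_html: str) -> str:
    # Second pass: show the model its own output and ask it to compare
    # against the original design and revise.
    return ("Here is your previous HTML for the attached design:\n"
            + first_attempt_html +
            "\nCompare your result against the screenshot and produce a "
            "revised version that matches the design more closely.")
```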

The paper is well-written and full of insights and interesting observations; we highly recommend reading it in full.

Noteworthy papers

  • Retrieval-Augmented Generation for AI-Generated Content: A Survey (arXiv:2402.19473v1). Authors: Peking University.
    • The existence of this survey points to the meteoric rise of RAG in just over a year since the ChatGPT big bang. For practitioners, it’s worth a quick skim.
  • SaulLM-7B: A pioneering Large Language Model for Law. Authors: Equall AI.
    • This is the latest example of the category of domain-specific foundation models (DSFM, see our take in Insights #10), targeting the domain of law. It is built on top of Mistral-7B. While the model is released with an MIT license, the training data is not. The two key takeaways are: 1) pre-training and instruction tuning with domain-specific data improve performance, and 2) Mistral-7B is a strong base model to start from, compared to other models such as Llama and Zephyr.
  • GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. Authors: Caltech and Meta.
    • An interesting research direction is training LLMs with limited compute. This paper proposes applying low-rank projection to the gradients rather than to the weights. When combined with other techniques such as 8-bit optimizers, the authors demonstrate that it’s possible to pre-train a 7B LLM on a consumer GPU with 24GB of memory, such as the NVIDIA RTX 4090. It’s worth noting that Jeremy Howard and the Answer.ai team also blogged about their ongoing effort toward training a 70B LLM with two such cards. Both of these works are exciting. We feel that the GaLore authors could have provided more details about what it would actually take to train a 7B LLM on a single RTX 4090 (e.g., how long it would take). In the Experiments section, they report using a cluster of 64 A100 GPUs to train a 7B LLaMA on 19.4B tokens. A simplified sketch of the low-rank projection idea appears below.
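
To illustrate the core idea, here is a simplified sketch of an Adam update with GaLore-style low-rank gradient projection. It is our own reconstruction under stated assumptions (a single 2D weight, no bias correction, moments reset whenever the projector is refreshed), not the authors’ implementation.

```python
import torch

def galore_adam_step(weight, grad, state, rank=8, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8, update_proj_every=200):
    # One Adam step on a 2D weight (m x n), with optimizer moments kept in a
    # low-rank (r x n) space. Bias correction is omitted for brevity.
    step = state.setdefault("step", 0)
    if "P" not in state or step % update_proj_every == 0:
        # Periodically refresh the projector from the gradient's top-r left
        # singular vectors; this simplified sketch resets the moments too.
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                        # m x r projector
        state["m"] = torch.zeros(rank, grad.shape[1])   # r x n first moment
        state["v"] = torch.zeros(rank, grad.shape[1])   # r x n second moment
    P = state["P"]
    g_lo = P.T @ grad                                   # project: r x n
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g_lo
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g_lo ** 2
    update_lo = state["m"] / (state["v"].sqrt() + eps)  # Adam step, low-rank
    weight -= lr * (P @ update_lo)                      # project back: m x n
    state["step"] = step + 1

# Toy usage: a single layer's weight and synthetic gradients.
w = torch.randn(256, 128)
opt_state = {}
for _ in range(3):
    g = torch.randn_like(w)
    galore_adam_step(w, g, opt_state)
```

The memory saving comes from the moment tensors: they are r x n instead of m x n, so for r much smaller than m the optimizer state shrinks dramatically, which is what makes single-GPU pre-training plausible.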