Text-to-audio generation is here. One of the next big AI disruptions could be in the music industry
The past few years have seen an explosion in applications of artificial intelligence to creative fields. A new generation of image and text generators is delivering impressive results. Now AI is finding applications in music, too.
Last week, a group of researchers at Google released MusicLM – an AI-based music generator that can convert text prompts into audio segments. It’s another example of the rapid pace of innovation in an incredible few years for creative AI.
With the music industry still adjusting to disruptions caused by the internet and streaming services, there’s a lot of interest in how AI might change the way we create and experience music.
Read more: Neil Young’s ultimatum to Spotify shows streaming platforms are now a battleground where artists can leverage power
Automating music creation
A number of AI tools now allow users to automatically generate musical sequences or audio segments. Many are free and open source, such as Google’s Magenta toolkit.
Two of the most familiar approaches in AI music generation are:
continuation, where the AI continues a sequence of notes or waveform data; and harmonisation or accompaniment, where the AI generates something to complement the input, such as chords to go with a melody.
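To make the distinction concrete, here is a purely illustrative sketch in Python. It is not any particular library's API: a melody is just a list of (pitch, duration) pairs, and the toy rules stand in for what a trained model (such as one of Magenta's) would learn from data.

```python
# Purely illustrative sketch, not any library's real API. A melody is a list
# of (MIDI pitch, duration-in-beats) pairs; the toy rules below stand in for
# what a trained model would learn from data.
from typing import List, Tuple

Note = Tuple[int, float]       # (MIDI pitch, duration in beats)
Chord = Tuple[int, int, int]   # a simple triad as three MIDI pitches

def continue_melody(melody: List[Note], n_new: int = 4) -> List[Note]:
    """Continuation: generate notes that follow on from the input sequence."""
    pitch, duration = melody[-1]
    new_notes = []
    for _ in range(n_new):
        pitch += 2             # toy rule: step up a whole tone each time
        new_notes.append((pitch, duration))
    # A real model would instead sample from a learned distribution over
    # possible next notes, conditioned on everything generated so far.
    return melody + new_notes

def harmonise(melody: List[Note]) -> List[Chord]:
    """Harmonisation/accompaniment: generate material that complements the
    input -- here, a major triad an octave below each melody note."""
    return [(p - 12, p - 8, p - 5) for p, _ in melody]

tune = [(60, 1.0), (62, 1.0), (64, 2.0)]   # C, D, E
print(continue_melody(tune))
print(harmonise(tune))
```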
Similar to text- and image-generating AI, music AI systems can be trained on a number of different data sets. You could, for example, extend a melody by Chopin using a system trained in the style of Bon Jovi – as beautifully demonstrated in OpenAI’s MuseNet.
Such tools can be great inspiration for artists with “blank page syndrome”, even if the artist provides the final push themselves. Creative stimulation is one of the immediate applications of creative AI tools today.
But where these tools may one day be even more useful is in extending musical expertise. Many people can write a tune, but fewer know how to adeptly manipulate chords to evoke emotions, or how to write music in a range of styles.
Although music AI tools have a way to go to reliably do the work of talented musicians, a handful of companies are developing AI platforms for music generation.
Boomy takes the minimalist path: users with no musical experience can create a song with a few clicks and then rearrange it. Aiva has a similar approach, but allows finer control; artists can edit the generated music note-by-note in a custom editor.
There is a catch, however. Machine learning techniques are famously hard to control, and generating music using AI is a bit of a lucky dip for now; you might occasionally strike gold while using these tools, but you may not know why.
An ongoing challenge for people creating these AI tools is to allow more precise and deliberate control over what the generative algorithms produce.
New ways to manipulate style and sound
Music AI tools also allow users to transform a musical sequence or audio segment. Google Magenta’s Differentiable Digital Signal Processing (DDSP) library, for example, performs timbre transfer.
Timbre is the technical term for the texture of the sound – the difference between a car engine and a whistle. Using timbre transfer, the timbre of a segment of audio can be changed.
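As a rough sketch of what happens under the hood in DDSP-style timbre transfer: the source audio is analysed for pitch and loudness, and those features drive a synthesis model trained on the target instrument. The snippet below shows only the analysis step, using librosa and an assumed input file "whistle.wav"; the synthesis model itself is omitted.

```python
# Sketch of the analysis step behind DDSP-style timbre transfer. Assumes a
# local file "whistle.wav"; the synthesis model (trained on, say, violin)
# that would consume these features is omitted.
import librosa
import numpy as np

y, sr = librosa.load("whistle.wav", sr=16000, mono=True)

# Frame-by-frame fundamental frequency (pitch) estimate.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Loudness proxy: per-frame RMS energy in decibels.
rms = librosa.feature.rms(y=y)[0]
loudness_db = librosa.amplitude_to_db(rms, ref=np.max)

# A timbre-transfer decoder takes (pitch, loudness) per frame and outputs
# audio in the timbre of whatever instrument it was trained on.
print(f0.shape, loudness_db.shape)
```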
Such tools are a great example of how AI can help musicians compose rich orchestrations and achieve completely new sounds. In the first AI Song Contest, held in 2020, Sydney-based music studio Uncanny Valley (with whom I collaborate) used timbre transfer to bring singing koalas into the mix.
Timbre transfer has joined a long history of synthesis techniques that have become instruments in themselves.
Taking music apart
Music generation and transformation are just part of the equation. A longstanding problem in audio work is “source separation”: breaking an audio recording of a track into its separate instruments.
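For a sense of how accessible this has become, here is a minimal sketch using the open-source Spleeter library (Demucs is another option); it assumes a local file called "track.mp3".

```python
# Minimal sketch using the open-source Spleeter library; assumes "track.mp3"
# exists locally. The pre-trained 4-stem model splits a mix into vocals,
# drums, bass and "other"; stems are written under the output directory.
from spleeter.separator import Separator

separator = Separator("spleeter:4stems")
separator.separate_to_file("track.mp3", "stems/")
```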
Although it’s not perfect, AI-powered source separation has come a long way. Its use is likely to be a big deal for artists, some of whom won’t like that others can “pick the lock” on their compositions.
Meanwhile, DJs and mashup artists will gain unprecedented control over how they mix and remix tracks. Source separation start-up Audioshake claims this will provide new revenue streams for artists who allow their music to be adapted more easily, such as for TV and film.
Artists may have to accept that this Pandora’s box has been opened, as it was when synthesizers and drum machines first arrived and, in some contexts, replaced the need for live musicians.
But watch this space, because copyright laws do offer artists protection from the unauthorised manipulation of their work. This is likely to become another grey area in the music industry, and regulation may struggle to keep up.
New musical experiences
Playlist popularity has revealed how much we like to listen to music that has some “functional” utility, such as to focus, relax, fall asleep, or work out to.
The start-up Endel has made AI-powered functional music its business model, creating infinite streams to help maximise certain cognitive states.
Endel’s music can be hooked up to physiological data such as a listener’s heart rate. Its manifesto draws heavily on practices of mindfulness and makes the bold proposal that we can use “new technology to help our bodies and brains adapt to the new world”, with its hectic and anxiety-inducing pace.
Other start-ups are also exploring functional music. Aimi is examining how individual electronic music producers can turn their music into infinite and interactive streams.
Aimi’s listener app invites fans to manipulate the system’s generative parameters, such as “intensity” or “texture”, or to decide when a drop happens. The listener engages with the music rather than listening passively.
It’s hard to say how much heavy lifting AI is doing in these applications – potentially little. Even so, such advances are guiding companies’ visions of how musical experience might evolve in the future.
The future of music
The initiatives mentioned above are in conflict with several long-established conventions, laws and cultural values regarding how we create and share music.
Will copyright laws be tightened to ensure companies training AI systems on artists’ works compensate those artists? And what would that compensation be for? Will new rules apply to source separation? Will musicians using AI spend less time making music, or make more music than ever before?
If there’s one thing that’s certain, it’s change. As a new generation of musicians grows up immersed in AI’s creative possibilities, they’ll find new ways of working with these tools.
Such turbulence is nothing new in the history of music technology, and neither powerful technologies nor standing conventions should dictate our creative future.
Read more: No, the Lensa AI app technically isn’t stealing artists' work – but it will majorly shake up the art world
Google’s new AI turns text into music
Google researchers have made an AI that can generate minutes-long musical pieces from text prompts, and can even transform a whistled or hummed melody into other instruments, similar to how systems like DALL-E generate images from written prompts (via TechCrunch). The model is called MusicLM, and while you can’t play around with it for yourself, the company has uploaded a bunch of samples that it produced using the model.
The examples are impressive. There are 30-second snippets of what sound like actual songs created from paragraph-long descriptions that prescribe a genre, vibe, and even specific instruments, as well as five-minute-long pieces generated from one or two words like “melodic techno.” Perhaps my favorite is a demo of “story mode,” where the model is basically given a script to morph between prompts. For example, this prompt:
electronic song played in a videogame (0:00-0:15)
meditation song played next to a river (0:15-0:30)
fire (0:30-0:45)
fireworks (0:45-0:60)
Resulted in the audio you can listen to here.
It may not be for everyone, but I could totally see this being composed by a human (I also listened to it on loop dozens of times while writing this article). Also featured on the demo site are examples of what the model produces when asked to generate 10-second clips of instruments like the cello or maracas (the latter example is one where the system does a relatively poor job), eight-second clips of a certain genre, music that would fit a prison escape, and even what a beginner piano player would sound like versus an advanced one. It also includes interpretations of phrases like “futuristic club” and “accordion death metal.”
MusicLM can even simulate human vocals, and while it seems to get the tone and overall sound of voices right, there’s a quality to them that’s definitely off. The best way I can describe it is that they sound grainy or staticky. That quality isn’t as clear in the example above, but I think this one illustrates it pretty well.
That, by the way, is the result of asking it to make music that would play at a gym. You may also have noticed that the lyrics are nonsense, but in a way that you may not necessarily catch if you’re not paying attention — kind of like if you were listening to someone singing in Simlish or that one song that’s meant to sound like English but isn’t.
I won’t pretend to know how Google achieved these results, but it’s released a research paper explaining it in detail if you’re the type of person who would understand this figure:
A figure explaining the “hierarchical sequence-to-sequence modeling task” that the researchers use along with AudioLM, another Google project. Chart: Google
AI-generated music has a long history dating back decades; there are systems that have been credited with composing pop songs, copying Bach better than a human could in the 90s, and accompanying live performances. One recent version uses the AI image-generation engine Stable Diffusion to turn text prompts into spectrograms that are then turned into music. The paper says that MusicLM can outperform other systems in terms of its “quality and adherence to the caption,” as well as the fact that it can take in audio and copy the melody.
That last part is perhaps one of the coolest demos the researchers put out. The site lets you play the input audio, where someone hums or whistles a tune, then lets you hear how the model reproduces it as an electronic synth lead, string quartet, guitar solo, etc. From the examples I listened to, it manages the task very well.
As with other forays into this type of AI, Google is being significantly more cautious with MusicLM than some of its peers may be with similar tech. “We have no plans to release models at this point,” concludes the paper, citing risks of “potential misappropriation of creative content” (read: plagiarism) and potential cultural appropriation or misrepresentation.
AI Generator Can Turn Any Subject Into a Drake-Like Song
Can’t wait for the next Drake track to drop? This website may scratch that itch.
Drayk.it is an aptly named music generator that can turn any subject into a Drizzy-inspired record. Users simply go to the site, type in a song idea, and wait about a minute for GPT-3 to create a track. If you’re having trouble coming up with a prompt, just click on the dice in the lower right corner, and the AI technology will select a topic at random.
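Drayk.it hasn’t published its pipeline, and the voice and beat synthesis are separate problems entirely, but the lyric-writing step could in principle look something like the sketch below. It is only a guess, using the completion-style OpenAI Python client that was current at the time, with invented prompt wording.

```python
# Hypothetical sketch of a lyric-writing step only; this is not Drayk.it's
# actual code, and it ignores vocal/beat synthesis entirely. Uses the
# completion-style OpenAI Python client current in early 2023.
import openai

openai.api_key = "sk-..."  # your API key

topic = "losing my wallet on the subway"
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Write a short, moody rap verse about: {topic}\n\nVerse:\n",
    max_tokens=200,
    temperature=0.9,
)
print(response.choices[0].text.strip())
```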
A number of users have shared their Drayk.it creations via Twitter. The tracks focus on everything from lost wallets to credit card solicitation calls to ship-raiding pirates. You can check out some of the examples below.
Drayk.it is presented by Mayk.it, a virtual music studio co-founded by Stefán Heinrich Henriquez and Akiva Bamberger. The executives spoke about the company in a 2022 Forbes interview, explaining the inspiration behind Mayk.it and their overall goals.
“The premise is that everyone should be able to make songs and work as an artist,” Henriquez said. “We’re unleashing music creativity for everyone. When we looked at other music-making apps, we found they were just too complex for us. They were almost all developed by professional musicians who didn’t have much empathy for beginners.”
He continued: “We can’t make music for people, but we can help them to express themselves musically. In a world of automation, creativity is how we will create new value, but people need the right tools to help them exploit their creativity.”
MusicLM: Google AI generates music in various genres at 24 kHz
On Thursday, researchers from Google announced a new generative AI model called MusicLM that can create 24 kHz musical audio from text descriptions, such as “a calming violin melody backed by a distorted guitar riff.” It can also transform a hummed melody into a different musical style and output music for several minutes.
MusicLM uses an AI model trained on what Google calls “a large dataset of unlabeled music,” along with captions from MusicCaps, a new dataset composed of 5,521 music-text pairs. MusicCaps gets its text descriptions from human experts and its matching audio clips from Google’s AudioSet, a collection of over 2 million labeled 10-second sound clips pulled from YouTube videos.
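If you want to poke at the captions yourself, something like the sketch below should work, assuming the public Hugging Face Hub mirror under the id "google/MusicCaps" and its column names (both are assumptions on my part); the audio itself isn’t bundled and has to be fetched from the referenced YouTube clips.

```python
# Sketch of browsing MusicCaps captions. Both the Hub id "google/MusicCaps"
# and the column names ("ytid", "caption") are assumptions about the public
# mirror; the official release is a CSV of YouTube ids, clip offsets and
# captions, with audio fetched separately.
from datasets import load_dataset

musiccaps = load_dataset("google/MusicCaps", split="train")
print(len(musiccaps))                    # roughly 5.5k text-audio pairs
print(musiccaps[0]["ytid"], musiccaps[0]["caption"])
```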
Generally speaking, MusicLM works in two main parts: first, it takes a sequence of audio tokens (pieces of sound) and maps them to semantic tokens (words that represent meaning) in captions for training. The second part receives user captions and/or input audio and generates acoustic tokens (pieces of sound that make up the resulting song output). The system relies on an earlier AI model called AudioLM (introduced by Google in September) along with other components such as SoundStream and MuLan.
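A schematic (and heavily simplified) way to picture that flow is sketched below. None of this is Google’s code; the real component models are replaced with trivial stubs purely to show how the pieces hand off to one another.

```python
# Schematic sketch (not Google's code) of the two-stage flow described above.
# The real components (a semantic tokenizer, the SoundStream codec, the MuLan
# text/audio embedder, and the sequence models tying them together) are
# replaced by trivial stubs so the control flow runs.
import numpy as np

def mulan_text_embed(caption: str) -> np.ndarray:
    """Stub for MuLan: map a caption into a joint text/audio embedding."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.standard_normal(128)

def sample_semantic_tokens(conditioning: np.ndarray, n: int = 50) -> np.ndarray:
    """Stub for the first sequence model: conditioning -> coarse semantic tokens."""
    return np.arange(n)  # placeholder token ids

def sample_acoustic_tokens(conditioning: np.ndarray,
                           semantic_tokens: np.ndarray) -> np.ndarray:
    """Stub for the second sequence model: -> fine-grained acoustic tokens."""
    return np.tile(semantic_tokens, 4)

def soundstream_decode(acoustic_tokens: np.ndarray, sr: int = 24000) -> np.ndarray:
    """Stub for the SoundStream decoder: acoustic tokens -> 24 kHz waveform."""
    return np.zeros(sr)  # one second of silence as a placeholder

def generate(caption: str) -> np.ndarray:
    cond = mulan_text_embed(caption)
    semantic = sample_semantic_tokens(cond)
    acoustic = sample_acoustic_tokens(cond, semantic)
    return soundstream_decode(acoustic)

waveform = generate("a calming violin melody backed by a distorted guitar riff")
print(waveform.shape)  # (24000,) -- one second of placeholder audio
```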
Google claims that MusicLM outperforms previous AI music generators in audio quality and adherence to text descriptions. On the MusicLM demonstration page, Google provides numerous examples of the AI model in action, creating audio from “rich captions” that describe the feel of the music, and even vocals (which so far are gibberish). Here is an example of a rich caption that they provide:
Slow tempo, bass-and-drums-led reggae song. Sustained electric guitar. High-pitched bongos with ringing tones. Vocals are relaxed with a laid-back feel, very expressive.
Google also shows off MusicLM’s “long generation” (creating five-minute music clips from a simple prompt), “story mode” (which takes a sequence of text prompts and turns it into a morphing series of musical tunes), “text and melody conditioning” (which takes a human humming or whistling audio input and changes it to match the style laid out in a prompt), and generating music that matches the mood of image captions.
Further down the example page, Google dives into MusicLM’s ability to re-create particular instruments (e.g., flute, cello, guitar), different musical genres, various musician experience levels, places (escaping prison, gym), time periods (a club in the 1950s), and more.
AI-generated music isn’t a new idea by any stretch, but AI music-generation methods of previous decades often created musical notation that was later played by hand or through a synthesizer, whereas MusicLM generates the raw audio frequencies of the music. Also, in December, we covered Riffusion, a hobby AI project that can similarly create music from text descriptions, but not at high fidelity. Google references Riffusion in its MusicLM academic paper, saying that MusicLM surpasses it in quality.
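The spectrogram trick is worth a quick illustration: once you have a magnitude spectrogram (however it was generated), a standard algorithm like Griffin-Lim can estimate the missing phase and turn it back into audio. The sketch below uses random noise as a stand-in for a generated spectrogram image.

```python
# Sketch of the spectrogram-to-audio step used by Riffusion-style systems:
# treat an image as a magnitude spectrogram and invert it to a waveform with
# Griffin-Lim. Random noise stands in for a generated spectrogram here.
import numpy as np
import librosa
import soundfile as sf

sr = 22050
n_fft, hop = 2048, 512

# Placeholder "generated image": (frequency bins, time frames), interpreted
# as linear magnitudes.
magnitude = np.abs(np.random.randn(n_fft // 2 + 1, 256)).astype(np.float32)

# Griffin-Lim iteratively estimates the missing phase and inverts the STFT.
waveform = librosa.griffinlim(magnitude, n_iter=32, hop_length=hop, n_fft=n_fft)

sf.write("reconstructed.wav", waveform, sr)
```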
In the MusicLM paper, its creators outline potential impacts of MusicLM, including “potential misappropriation of creative content” (i.e., copyright issues), potential biases for cultures underrepresented in the training data, and potential cultural appropriation issues. As a result, Google emphasizes the need for more work on tackling these risks and is holding back the code: “We have no plans to release models at this point.”
Google’s researchers are already looking ahead toward future improvements: “Future work may focus on lyrics generation, along with improvement of text conditioning and vocal quality. Another aspect is the modeling of high-level song structure like introduction, verse, and chorus. Modeling the music at a higher sample rate is an additional goal.”
It’s probably not too much of a stretch to suggest that AI researchers will continue improving music-generation technology until anyone can create studio-quality music in any style just by describing it—although no one can yet predict exactly when that goal will be attained or how exactly it will impact the music industry. Stay tuned for further developments.